CN113257221A - Voice model training method based on front-end design and voice synthesis method - Google Patents
- Publication number: CN113257221A (application CN202110762178.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G: PHYSICS
- G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
  - G10L13/02: Methods for producing synthetic speech; speech synthesisers
  - G10L13/027: Concept-to-speech synthesisers; generation of natural phrases from machine-based concepts
  - G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
  - G10L13/10: Prosody rules derived from text; stress or intonation
  - G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G06N: Computing arrangements based on specific computational models
  - G06N3/02: Neural networks
  - G06N3/08: Learning methods
Abstract
A speech model training method and a speech synthesis method based on front-end design comprise the following steps: S1: generate a prosody-annotated text; S2: obtain a first linguistic feature code of the text content; S3: obtain the pronunciation duration of each phoneme; S4: train a pronunciation duration model; S5: output a front-end feature coding vector with fixed dimensionality; S6: perform iterative training to obtain an autoregressive model. The invention effectively reduces the probability of pronunciation and speed errors of individual characters within a whole sentence. Meanwhile, the pronunciation duration of particular phonemes, sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and duration features.
Description
Technical Field
The invention belongs to the technical field of artificial-intelligence speech synthesis, and in particular relates to a speech model training method and a speech synthesis method based on front-end design.
Background
Speech synthesis, also known as Text-To-Speech (TTS), is a technique that converts text into corresponding audio. With the development of artificial intelligence and growing social demand, speech synthesis technology with accurate, clear, natural, and pleasant pronunciation has attracted attention. Traditional speech synthesis techniques include concatenative and parametric methods; because of their poor naturalness and listening quality, both are gradually being replaced by end-to-end speech synthesis schemes.
An end-to-end speech synthesis scheme generates acoustic features directly from text through a model of relatively high complexity, and a vocoder then generates audio from the acoustic features. However, the end-to-end network structure is highly integrated and not easy to adjust flexibly when a synthesis problem is encountered. The pronunciation and speed of certain individual characters often go wrong; such problems are difficult to avoid by adjusting parameters and instead require re-screening data and retraining the model with additional data. The model optimization iteration period is long, and synthesis problems are not easy to resolve. In addition, the speech rate, pronunciation, prosody, and so on may all change across application scenarios, and a highly integrated end-to-end network is difficult to adjust flexibly for such changes.
Summary of the Invention
In order to overcome the technical defects in the prior art, the invention discloses a speech model training method and a speech synthesis method based on front-end design.
The invention discloses a speech model training method based on front-end design, which begins with sample acquisition:
collecting high-quality audio data of a single speaker and the text corresponding to the audio data as original training data, and extracting the Mel features of the audio data;
the subsequent steps are as follows:
S1: predicting and labeling the prosody of the text with a prosody prediction model to generate a prosody-annotated text;
S2: extracting linguistic features from the prosody-annotated text generated in step S1 through front-end rules, the linguistic features comprising position information codes and zero-one codes, and combining the two to obtain the first linguistic feature code of the text content;
S3: forcibly aligning the text phonemes in the sample with the corresponding audio data using a forced alignment algorithm to obtain the pronunciation duration of each phoneme;
S4: building a neural network, taking the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and training a pronunciation duration model;
this step is not restricted in temporal order with respect to steps S5 and S6;
S5: combining the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme obtained in step S3 to obtain a second linguistic feature code, performing mean-variance normalization on the second code, inputting the normalized feature vector into a shallow neural network, and outputting a front-end feature coding vector with fixed dimensionality;
S6: building an end-to-end network with a sequence-to-sequence attention mechanism, and merging the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector;
and connecting the merged prediction vector to an autoregressive (LSTM) network to predict Mel features, and performing iterative training with the Mel features of the audio data in the sample as the target to obtain an autoregressive model.
Preferably, the step of S1 includes:
s1.1, training a prosody prediction model by using a text prosody labeling data set, and labeling text prosody by using a special mark;
s1.2, carrying out prosody prediction on the text of the audio data by using the trained prosody prediction model to obtain prosody labeling texts of all the texts.
Preferably, the step of S2 includes:
s2.1, converting the prosody labeling text of the text from the text to pinyin, and converting the pinyin into phonemes to obtain a phoneme sequence of the text, wherein the prosody labeling is represented as one phoneme by using a special symbol;
s2.2, performing word segmentation and part-of-speech prediction on the text to obtain a word segmentation result and a part-of-speech prediction result of the text.
S2.3, computing the contextual features of the text, taking each phoneme of each text as the minimum unit, to obtain the contextual position information as the position information code;
designing a problem set, and generating a zero-one code according to the problem set;
and S2.4, combining the position information code calculated in the S2.3 with the zero-one code to obtain a first language characteristic code of the text content.
Preferably, the mean-variance normalization of the second linguistic feature code in step S5 is performed as follows:
S5.1, calculating the mean and variance of each bit position across the second linguistic feature codes of all samples;
S5.2, subtracting from each bit of a code the mean of that bit position and dividing by its variance,
according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized.
A speech synthesis method based on front-end design comprises the following steps:
S1A, processing the text to be synthesized according to the methods of steps S1-S2 in the training method to obtain its first linguistic feature code;
S2A, inputting the first linguistic feature code of the text to be synthesized into the pronunciation duration model to predict the pronunciation duration of each phoneme;
S3A, combining the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme obtained in step S2A to obtain second linguistic feature codes, performing mean-variance normalization on them, inputting the normalized feature vectors into the shallow neural network, and outputting fixed-dimension front-end feature coding vectors;
S4A: using the sequence-to-sequence network obtained by training in step S6 of the training method, and merging the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into an autoregressive model to obtain Mel characteristics;
S6A: the Mel features are input into the vocoder to obtain the synthesized audio.
The invention addresses the defects of poor stability and controllability in end-to-end speech synthesis schemes by providing a model training and speech synthesis method based on front-end design, which improves the stability and controllability of the end-to-end network and reduces the training difficulty of the neural network. The invention effectively reduces the probability of pronunciation and speed errors of individual characters within a whole sentence. Meanwhile, in some special cases, the pronunciation of particular phonemes, phoneme durations, sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and duration features.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a speech model training method according to the present invention;
fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The disclosed speech model training method based on front-end design begins with sample acquisition: high-quality audio data of a single speaker and the text corresponding to the audio data are collected as original training data, and the Mel features of the audio data are extracted. The subsequent steps are shown in fig. 1.
S1: predict and label the prosody of the text with a prosody prediction model to generate a prosody-annotated text.
The prosody prediction model mainly predicts short pauses and long pauses in the text and marks them accordingly.
The method specifically comprises the following steps:
S1.1, train a prosody prediction model with a text prosody annotation data set, and annotate text prosody with special marks. For example, the prosodic words, prosodic phrases, short pauses and long pauses in the text correspond to the marks #1, #2, #3 and #4 respectively.
S1.2, perform prosody prediction on the texts of the audio data with the trained prosody prediction model and annotate them as above to obtain the prosody-annotated version of every text.
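As an illustrative sketch only (not the patented prosody model itself), the annotation step of S1.1-S1.2 can be pictured as interleaving predicted break marks with the text; the function name and the break-level input format are hypothetical:

```python
# Hypothetical rendering of prosody annotation: break levels predicted by
# some model are inserted into the text as #1..#4 marks (prosodic word,
# prosodic phrase, short pause, long pause), per the example above.
BREAK_MARKS = {1: "#1", 2: "#2", 3: "#3", 4: "#4"}

def annotate_prosody(words, breaks):
    """Interleave words with their predicted break marks.

    words  : list of word strings
    breaks : list of break levels (0 = no break) after each word
    """
    parts = []
    for word, level in zip(words, breaks):
        parts.append(word)
        if level in BREAK_MARKS:
            parts.append(BREAK_MARKS[level])
    return "".join(parts)

print(annotate_prosody(["今天", "天气", "很好"], [1, 2, 4]))
# 今天#1天气#2很好#4
```

In the real pipeline the break levels come from the trained prosody prediction model rather than being supplied by hand.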
S2: extract linguistic features from the prosody-annotated text generated in step S1 through front-end rules, including phonemes, word segmentation, parts of speech, and contextual features such as the position information of phonemes, characters, and words within phrases. Match the linguistic features of the text against a question set, taking a single phoneme as the minimum unit, to obtain one-hot codes; each phoneme corresponds to one linguistic feature one-hot code.
The method specifically comprises the following steps:
s2.1, converting the prosody labeling text of the text from the text to pinyin, and converting the pinyin into phonemes to obtain a phoneme sequence of the text, wherein the prosody labeling is represented as a phoneme by using a special symbol.
S2.2, performing word segmentation and part-of-speech prediction on the text to obtain a word segmentation result and a part-of-speech prediction result of the text.
S2.3, compute the contextual features of the text, taking each phoneme of each text as the minimum unit, and according to the setup of the question set obtain the contextual position information of each phoneme as its position information code.
The position information code is formed in a fixed format with the phoneme as the minimum unit and may include the following information: the position of the current phoneme in the whole phoneme sequence, its position within the current pinyin syllable, its position within the current phrase, and its position within the current prosodic phrase; these position values are combined into the position information code.
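A minimal sketch of such a position information code, under the assumption (for illustration only) that the text arrives as nested lists of prosodic phrases, words, pinyin syllables, and phonemes:

```python
# Illustrative only: per phoneme, record its 1-based position in the whole
# phoneme sequence, in its pinyin syllable, in its word, and in its
# prosodic phrase. The nested-list input format is an assumption.
def position_codes(prosodic_phrases):
    """prosodic_phrases: phrases -> words -> syllables -> phonemes.
    Returns one tuple (phoneme, seq_pos, syl_pos, word_pos, phrase_pos)
    per phoneme, in sequence order."""
    codes = []
    seq_pos = 0
    for phrase in prosodic_phrases:
        phrase_pos = 0
        for word in phrase:
            word_pos = 0
            for syllable in word:
                for syl_pos, ph in enumerate(syllable, start=1):
                    seq_pos += 1
                    word_pos += 1
                    phrase_pos += 1
                    codes.append((ph, seq_pos, syl_pos, word_pos, phrase_pos))
    return codes

# one prosodic phrase containing two one-syllable words: "ni" and "hao"
seq = [[[["n", "i"]], [["h", "ao"]]]]
print(position_codes(seq)[3])  # ('ao', 4, 2, 2, 4)
```

The tuple of counters corresponds to the "position values combined into the position information code" described above.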
Design a question set, which mainly consists of matches on the basic features of the current phoneme. It mainly comprises the following parts, which can be adjusted according to the actual situation.
The questions about the current phoneme cover the following information:
specifically, the phoneme itself, the phoneme type, the tone of the phoneme, and the part of speech of the word in which the phoneme is located;
and the corresponding features of the phonemes one and two positions before and one and two positions after the current phoneme.
Match each item of information against the question set: set the corresponding bit to 1 if the question is satisfied and to 0 otherwise, generating a zero-one (one-hot) code from the question set.
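The question-set matching can be sketched as a list of predicates over a phoneme's context, each contributing one bit; the questions below are hypothetical examples, not the patent's actual set:

```python
# Illustrative question set: each entry is (name, predicate). A satisfied
# question yields a 1 bit, otherwise 0. Real systems use far larger sets.
QUESTIONS = [
    ("cur_is_a",      lambda ctx: ctx["cur"] == "a"),
    ("cur_is_vowel",  lambda ctx: ctx["cur"] in {"a", "o", "e", "i", "u"}),
    ("cur_tone_is_4", lambda ctx: ctx["tone"] == 4),
    ("prev_is_vowel", lambda ctx: ctx["prev"] in {"a", "o", "e", "i", "u"}),
    ("word_is_noun",  lambda ctx: ctx["pos"] == "n"),
]

def zero_one_code(ctx):
    """Answer every question about a phoneme context -> binary vector."""
    return [1 if predicate(ctx) else 0 for _, predicate in QUESTIONS]

ctx = {"cur": "a", "prev": "h", "tone": 4, "pos": "n"}
print(zero_one_code(ctx))  # [1, 1, 1, 0, 1]
```

Concatenating this vector with the position information code of S2.3 would give the first linguistic feature code of S2.4.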
And S2.4, combining the position information code calculated in the S2.3 with the zero-one code to obtain a first language characteristic code of the text content.
S3: forcibly align the text phonemes in the sample with the corresponding audio data using a forced-alignment (MFA) algorithm to obtain the pronunciation duration of each phoneme.
Obtain the time points in the audio file corresponding to each phoneme and generate a time list for the text phonemes, where each segment of audio corresponds to one phoneme pronunciation time sequence file. The method specifically comprises the following steps:
map each phoneme to time nodes in the audio file with the forced alignment algorithm to obtain the start and end time of each phoneme in the audio file;
derive from the start and end times the pronunciation duration of each phoneme in each text, finally obtaining the pronunciation duration corresponding to each phoneme in the first linguistic feature code obtained in step S2.
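Given alignment intervals, the duration step reduces to a subtraction per phoneme. A minimal sketch, with a hypothetical interval list standing in for the aligner's output:

```python
# Assumed interval format: (phoneme, start_sec, end_sec) per phoneme, as a
# forced aligner such as the Montreal Forced Aligner would produce from a
# TextGrid. Duration of each phoneme is simply end - start.
alignment = [
    ("n",  0.00, 0.08),
    ("i",  0.08, 0.21),
    ("h",  0.21, 0.27),
    ("ao", 0.27, 0.45),
]

durations = [(ph, round(end - start, 3)) for ph, start, end in alignment]
print(durations)  # [('n', 0.08), ('i', 0.13), ('h', 0.06), ('ao', 0.18)]
```

These per-phoneme durations are the prediction targets of step S4 and the appended feature of step S5.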
S4: build a neural network, take the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and train a pronunciation duration model.
A pronunciation duration model trained in this way predicts the pronunciation duration of phonemes within a sentence with higher accuracy.
The pronunciation duration model obtains the pronunciation duration of a phoneme from its first linguistic feature code; this step is not restricted in temporal order with respect to steps S5 and S6.
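The patent does not fix an architecture for the duration model, so the following is only a shape sketch: a small feed-forward network mapping a first linguistic feature code to a scalar duration, with illustrative layer sizes and untrained random weights:

```python
# Hedged sketch of the duration model's input/output shape. Layer sizes,
# weight initialization, and the 40-dim code length are assumptions.
import numpy as np

rng = np.random.default_rng(0)

class ShallowDurationModel:
    def __init__(self, in_dim, hidden=64):
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def predict(self, x):
        h = np.maximum(0.0, x @ self.w1 + self.b1)  # ReLU hidden layer
        return (h @ self.w2 + self.b2)[..., 0]      # one duration per phoneme

model = ShallowDurationModel(in_dim=40)
codes = rng.integers(0, 2, (5, 40)).astype(float)   # 5 phoneme codes
print(model.predict(codes).shape)  # (5,)
```

In training, the weights would be fitted against the forced-alignment durations of step S3.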
S5: combine the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme obtained in step S3 to obtain a second linguistic feature code, perform mean-variance normalization on the second code, input the normalized feature vector into a shallow neural network, and output a front-end feature coding vector with fixed dimensionality.
Specifically: the combination appends the pronunciation duration of each phoneme to the last bit of its first linguistic feature code to obtain the second linguistic feature code.
For example, if the first linguistic feature code of a phoneme is AXCB54 and its pronunciation duration is 7, the combined code is AXCB547.
Perform mean-variance normalization on the combined second linguistic feature codes, input them into a shallow DNN, and after network training output high-dimensional front-end feature coding vectors.
The front-end feature coding vector is an abstract representation of the front-end linguistic features, obtained through training: second linguistic feature codes that are close in the actual text yield, through the shallow neural network, front-end feature coding vectors that are close in the high-dimensional space, i.e. with higher cosine similarity.
Because front-end feature codes generated from similar front-end features have high cosine similarity, the network can form a better abstract representation of the codes, in other words encode something of the meaning of the sentence, which makes the coding more accurate.
The normalization adopts mean-variance normalization, specifically:
S5.1, calculate the mean and variance of each bit position across the second linguistic feature codes of all samples;
S5.2, subtract from each bit of a code the mean of that bit position and divide by its variance,
according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized. The purpose of the normalization is to facilitate convergence of the subsequent networks.
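The per-bit normalization above can be sketched in a few lines. Note the text divides by the variance s_k rather than the standard deviation (a conventional z-score would use the latter); the sketch follows the document's formula, and the epsilon guard is an added assumption for constant bit positions:

```python
# Per-bit mean-variance normalization, y_k = (x_k - m_k) / s_k, with s_k
# the variance as stated in the text. eps is an assumption to avoid
# division by zero for bit positions that never vary.
import numpy as np

def mean_variance_normalize(codes, eps=1e-8):
    """codes: (num_phonemes, code_len) second linguistic feature codes."""
    m = codes.mean(axis=0)          # mean of each bit position
    s = codes.var(axis=0)           # variance of each bit position
    return (codes - m) / (s + eps)

codes = np.array([[1.0, 4.0], [3.0, 8.0]])
y = mean_variance_normalize(codes)
print(np.allclose(y.mean(axis=0), 0.0))  # True: column means become 0
```

The normalized vectors are what the shallow DNN of step S5 consumes.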
S6: build an end-to-end network with a sequence-to-sequence attention mechanism.
The end-to-end network consists of a sequence-to-sequence network and an LSTM autoregressive network, which together form the end-to-end network.
Merge the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector.
The merge is a vector concatenation: for example, a 512-dimensional embedding vector and a 256-dimensional front-end coding vector are concatenated into a 256 + 512 = 768-dimensional vector.
The embedding vector is generated by the sequence-to-sequence network and offers little controllability, whereas the front-end coding vector is generated in the preceding steps from sample features according to self-determined rules. The purpose of merging the embedding vector with the front-end feature coding vector of step S5 is to retain the high naturalness obtained when Mel features for synthesized audio are predicted from the embedding vector alone, while allowing synthesis errors caused by relying solely on the model to be avoided by manually correcting the front-end coding.
Connect the merged prediction vector to an autoregressive (LSTM) network to predict the Mel features of each frame, and perform iterative training with the Mel features of the audio data in the original training data as the target to obtain the autoregressive model. During training, the sequence-to-sequence network and the autoregressive network are connected as a whole, so the model parameters of the sequence-to-sequence network are obtained at the same time as the autoregressive model; that is, the sequence-to-sequence network is trained as well.
Through steps S1-S6, a pronunciation duration model and an autoregressive model are finally trained from the sample.
The speech synthesis method using the trained models includes the following steps, as shown in fig. 2:
S1A, process the text to be synthesized according to the methods of steps S1-S2 to obtain its first linguistic feature code;
S2A, input the first linguistic feature code of the text to be synthesized into the pronunciation duration model to predict the pronunciation duration of each phoneme;
S3A, combine the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme obtained in step S2A to obtain second linguistic feature codes, perform mean-variance normalization on them, input the normalized feature vectors into the shallow neural network, and output fixed-dimension front-end feature coding vectors;
S4A: use the sequence-to-sequence network obtained by training in step S6, and merge the embedding vector it outputs with the front-end feature coding vector of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into an autoregressive model to obtain Mel characteristics;
S6A: the Mel features are input into the vocoder to obtain the synthesized audio.
The invention addresses the defects of poor stability and controllability in end-to-end speech synthesis schemes by providing a model training and speech synthesis method based on front-end design, which improves the stability and controllability of the end-to-end network and reduces the training difficulty of the neural network. The invention effectively reduces the probability of pronunciation and speed errors of individual characters within a whole sentence.
Meanwhile, some special cases benefit directly: for example, when the same character appears many times in a sentence, the attention mechanism of the end-to-end network is easily disturbed, causing deviations in that character's pronunciation and pronunciation duration; sentence-break errors occur easily when a sentence is very long and has no punctuation; and characters that never appeared in the training set are prone to pronunciation errors. In these cases, the pronunciation of characters, the pronunciation of phonemes, phoneme durations, sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and duration features, making the subsequent synthesis more accurate.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory, the preferred embodiments may be combined and superimposed in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventor's verification process and are not intended to limit the patent protection scope of the invention, which is still defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise included within the protection scope of the present invention.
Claims (5)
1. A speech model training method based on front-end design is characterized by comprising the following steps of sample acquisition:
the method comprises the steps that a sample is collected, wherein the sample is used for collecting high-quality audio data of a single speaker and a text corresponding to the audio data as original training data, and Mel characteristics of the audio data are extracted;
the subsequent steps are as follows:
s1: predicting and marking the prosody of the text by a prosody prediction model to generate a prosody marking text with prosody marking;
s2: extracting the prosody labeling text generated in the step S1 into linguistic features in the text through a front-end rule, wherein the linguistic features comprise position information codes and zero-one codes, and combining the position information codes and the zero-one codes to obtain first linguistic feature codes of text contents;
s3: forcibly aligning the text phonemes in the sample with the corresponding audio data by using a forced alignment algorithm to obtain the pronunciation duration of each phoneme;
S4: building a neural network, taking the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and training a pronunciation duration model;
this step is not restricted in temporal order with respect to steps S5 and S6;
S5: combining the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme obtained in step S3 to obtain a second linguistic feature code, performing mean-variance normalization on the second code, inputting the normalized feature vector into a shallow neural network, and outputting a front-end feature coding vector with fixed dimensionality;
S6: building an end-to-end network with a sequence-to-sequence attention mechanism, and merging the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector;
and connecting the merged prediction vector to an autoregressive (LSTM) network to predict Mel features, and performing iterative training with the Mel features of the audio data in the sample as the target to obtain an autoregressive model.
2. The speech model training method of claim 1, wherein the step of S1 comprises:
s1.1, training a prosody prediction model by using a text prosody labeling data set, and labeling text prosody by using a special mark;
s1.2, carrying out prosody prediction on the text of the audio data by using the trained prosody prediction model to obtain prosody labeling texts of all the texts.
3. The speech model training method of claim 1, wherein step S2 comprises:
S2.1, converting the prosody-labeled text from text to pinyin, and converting the pinyin into phonemes, to obtain the phoneme sequence of the text, wherein each prosody label is represented as one phoneme by a special symbol;
S2.2, performing word segmentation and part-of-speech prediction on the text to obtain the word segmentation result and part-of-speech prediction result of the text;
S2.3, computing the text features: taking each phoneme of each text as the minimum unit, computing the position information of the text as the position information code;
designing a question set, and generating a zero-one code according to the question set;
and S2.4, combining the position information code computed in S2.3 with the zero-one code to obtain the first linguistic feature code of the text content.
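Steps S2.3-S2.4 can be illustrated with a minimal sketch. The position encoding, the questions, and the function names below are all assumptions for illustration; the patent does not disclose its actual question set or position features:

```python
def encode_phoneme(phoneme, position, total, question_set):
    # Hypothetical sketch of claim 3, steps S2.3-S2.4.
    # Position information code: forward and backward index of the
    # phoneme within the text, the phoneme being the minimum unit.
    position_code = [position, total - 1 - position]
    # Zero-one code: one bit per yes/no question about the phoneme.
    zero_one_code = [1 if q(phoneme) else 0 for q in question_set]
    # First linguistic feature code = position code + zero-one code.
    return position_code + zero_one_code

# Illustrative questions (assumed, not from the patent):
questions = [
    lambda p: p in {"a", "o", "e", "i", "u", "v"},  # is it a vowel?
    lambda p: p == "#",                             # is it a prosody mark?
]
code = encode_phoneme("a", position=0, total=3, question_set=questions)
# -> [0, 2, 1, 0]
```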
4. The speech model training method of claim 1, wherein the mean-variance normalization of the second linguistic feature codes in step S5 is performed by:
S5.1, calculating the mean and variance of each bit code over the second codes of all the linguistic features;
S5.2, subtracting from each bit code the mean of that bit code and dividing by the variance of that bit code,
according to the formula: y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit code, x_k is the value of the k-th bit code before normalization, m_k is the mean of the k-th bit code over all codes to be normalized, and s_k is the variance of the k-th bit code over all codes to be normalized.
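A minimal stdlib-only sketch of this normalization follows. Note the claim's formula divides by the variance s_k; dividing by the standard deviation is the more common variant, but the code follows the claim as written:

```python
from statistics import fmean, pvariance

def mean_variance_normalize(codes):
    # codes: one second linguistic feature code (a list) per phoneme.
    # Per claim 4: y_k = (x_k - m_k) / s_k, with m_k and s_k the mean
    # and variance of the k-th bit code over all codes to be normalized.
    columns = list(zip(*codes))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) or 1.0 for col in columns]  # guard zero variance
    return [[(x - m) / s for x, m, s in zip(row, means, variances)]
            for row in codes]

normalized = mean_variance_normalize([[1.0, 2.0], [3.0, 6.0]])
# means = [2, 4], variances = [1, 4]  ->  [[-1.0, -0.5], [1.0, 0.5]]
```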
5. A speech synthesis method based on front-end design, characterized by comprising the following steps:
S1A, processing a text to be synthesized according to steps S1-S2 of the training method of claim 1 to obtain its first linguistic feature code;
S2A, inputting the first linguistic feature code of the text to be synthesized into the trained pronunciation duration model to predict the pronunciation duration of each phoneme;
S3A, combining the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme generated in step S2A to obtain the second linguistic feature codes, performing mean-variance normalization on the second codes, inputting the normalized feature vectors into the shallow neural network, and outputting front-end feature coding vectors of fixed dimension;
S4A: using the sequence-to-sequence network trained in step S6 of the training method of claim 1, merging the embedded vector output by the sequence-to-sequence network with the front-end feature coding vector of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into the autoregressive model to obtain Mel features;
S6A: inputting the Mel features into a vocoder to obtain the synthesized audio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110762178.XA CN113257221B (en) | 2021-07-06 | 2021-07-06 | Voice model training method based on front-end design and voice synthesis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113257221A true CN113257221A (en) | 2021-08-13 |
CN113257221B CN113257221B (en) | 2021-09-17 |
Family
ID=77190767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110762178.XA Active CN113257221B (en) | 2021-07-06 | 2021-07-06 | Voice model training method based on front-end design and voice synthesis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113257221B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113948062A (en) * | 2021-12-20 | 2022-01-18 | 阿里巴巴达摩院(杭州)科技有限公司 | Data conversion method and computer storage medium |
CN118116363A (en) * | 2024-04-26 | 2024-05-31 | 厦门蝉羽网络科技有限公司 | Speech synthesis method based on time perception position coding and model training method thereof |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5832425A (en) * | 1994-10-04 | 1998-11-03 | Hughes Electronics Corporation | Phoneme recognition and difference signal for speech coding/decoding |
EP3021318A1 (en) * | 2014-11-17 | 2016-05-18 | Samsung Electronics Co., Ltd. | Speech synthesis apparatus and control method thereof |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN111640418A (en) * | 2020-05-29 | 2020-09-08 | 数据堂(北京)智能科技有限公司 | Prosodic phrase identification method and device and electronic equipment |
CN111754976A (en) * | 2020-07-21 | 2020-10-09 | 中国科学院声学研究所 | Rhythm control voice synthesis method, system and electronic device |
CN112002304A (en) * | 2020-08-27 | 2020-11-27 | 上海添力网络科技有限公司 | Speech synthesis method and device |
CN112002305A (en) * | 2020-07-29 | 2020-11-27 | 北京大米科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112133278A (en) * | 2020-11-20 | 2020-12-25 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112802450A (en) * | 2021-01-05 | 2021-05-14 | 杭州一知智能科技有限公司 | Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof |
Non-Patent Citations (3)
Title |
---|
C. Yuan et al.: "Personalized End-to-End Mandarin Speech Synthesis using Small-sized Corpus", 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference * |
Zhang Yuqiang: "End-to-End Based Speech Synthesis", China Master's Theses Full-text Database (Information Science and Technology) * |
Han Min: "An Improved Speech Synthesis Method", China Master's Theses Full-text Database (Information Science and Technology) * |
Also Published As
Publication number | Publication date |
---|---|
CN113257221B (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7464621B2 (en) | Speech synthesis method, device, and computer-readable storage medium | |
CN112420016B (en) | Method and device for aligning synthesized voice and text and computer storage medium | |
CN105654939A (en) | Voice synthesis method based on voice vector textual characteristics | |
CN110767213A (en) | Rhythm prediction method and device | |
CN113257221B (en) | Voice model training method based on front-end design and voice synthesis method | |
Liu et al. | Mongolian text-to-speech system based on deep neural network | |
CN113205792A (en) | Mongolian speech synthesis method based on Transformer and WaveNet | |
CN113539268A (en) | End-to-end voice-to-text rare word optimization method | |
CN113327574A (en) | Speech synthesis method, device, computer equipment and storage medium | |
Maia et al. | Towards the development of a brazilian portuguese text-to-speech system based on HMM. | |
CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
Kayte et al. | A Marathi Hidden-Markov Model Based Speech Synthesis System | |
US11817079B1 (en) | GAN-based speech synthesis model and training method | |
Chiang et al. | The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0 | |
CN114708848A (en) | Method and device for acquiring size of audio and video file | |
Bonafonte et al. | The UPC TTS system description for the 2008 blizzard challenge | |
Janyoi et al. | An Isarn dialect HMM-based text-to-speech system | |
Lin et al. | Improving mandarin prosody boundary detection by using phonetic information and deep LSTM model | |
Saychum et al. | A great reduction of wer by syllable toneme prediction for thai grapheme to phoneme conversion | |
CN117524193B (en) | Training method, device, equipment and medium for Chinese-English mixed speech recognition system | |
Nair et al. | Indian text to speech systems: A short survey | |
Zhang et al. | Chinese speech synthesis system based on end to end | |
CN116229994B (en) | Construction method and device of label prediction model of Arabic language | |
Janyoi et al. | Isarn Dialect Speech Synthesis using HMM with syllable-context features | |
Gong et al. | A Review of End-to-End Chinese–Mandarin Speech Synthesis Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||