CN113257221A - Voice model training method based on front-end design and voice synthesis method - Google Patents

Voice model training method based on front-end design and voice synthesis method

Info

Publication number
CN113257221A
CN113257221A (application CN202110762178.XA)
Authority
CN
China
Prior art keywords
text
code
phoneme
codes
prosody
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110762178.XA
Other languages
Chinese (zh)
Other versions
CN113257221B (en)
Inventor
陈佩云
曹艳艳
高君效
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chipintelli Technology Co Ltd
Original Assignee
Chipintelli Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chipintelli Technology Co Ltd filed Critical Chipintelli Technology Co Ltd
Priority to CN202110762178.XA priority Critical patent/CN113257221B/en
Publication of CN113257221A publication Critical patent/CN113257221A/en
Application granted granted Critical
Publication of CN113257221B publication Critical patent/CN113257221B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A speech model training method and a speech synthesis method based on front-end design comprise the following steps: S1: generating a prosody-annotated text carrying prosody labels; S2: obtaining the first linguistic feature code of the text content; S3: obtaining the pronunciation duration of each phoneme; S4: training a pronunciation duration model for each phoneme; S5: outputting a fixed-dimension front-end feature coding vector; S6: performing iterative training to obtain an autoregressive model. The invention can effectively reduce the probability of pronunciation errors and speaking-rate errors for individual words within a sentence. In addition, the pronunciation duration of particular phonemes, the sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and the duration features.

Description

Voice model training method based on front-end design and voice synthesis method
Technical Field
The invention belongs to the technical field of artificial intelligent speech synthesis, and particularly relates to a speech model training method and a speech synthesis method based on front-end design.
Background
Speech synthesis is a technique for converting text into corresponding audio, also known as Text To Speech (TTS). With the development of artificial intelligence and growing societal demand, speech synthesis technology that produces accurate, clear, natural, and pleasant pronunciation has attracted increasing attention. Traditional speech synthesis techniques include concatenative synthesis and parametric synthesis; both are gradually being replaced by end-to-end speech synthesis schemes because of their poor naturalness and listening quality.
In an end-to-end speech synthesis scheme, acoustic features are generated directly from the text content by a model of relatively high complexity, and a vocoder then generates audio from the acoustic features. However, the end-to-end network structure is highly integrated and is not easy to adjust flexibly when a synthesis problem arises. The pronunciation and speaking rate of certain individual characters often go wrong; such problems are difficult to avoid by adjusting parameters and instead require re-screening the data and retraining the model with additional data. The model optimization iteration period is long, and synthesis problems are not easily resolved. In addition, the speaking rate, pronunciation, prosody, and so on may all change across different application scenarios, and a highly integrated end-to-end network is difficult to adjust flexibly to these changes.
Disclosure of Invention
In order to overcome the technical defects in the prior art, the invention discloses a speech model training method and a speech synthesis method based on front-end design.
The invention discloses a speech model training method based on front-end design, which comprises a sample collection step:
sample collection, in which high-quality audio data of a single speaker and the text corresponding to the audio data are collected as the original training data, and the Mel features of the audio data are extracted;
the subsequent steps are as follows:
S1: predicting and labeling the prosody of the text with a prosody prediction model to generate a prosody-annotated text carrying prosody labels;
S2: extracting the linguistic features of the text from the prosody-annotated text generated in step S1 through front-end rules, wherein the linguistic features comprise a position information code and a zero-one code, and combining the position information code and the zero-one code to obtain the first linguistic feature code of the text content;
S3: forcibly aligning the text phonemes in the sample with the corresponding audio data using a forced alignment algorithm to obtain the pronunciation duration of each phoneme;
S4: building a neural network, taking the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and training a pronunciation duration model for each phoneme;
this step is not constrained in temporal order relative to steps S5 and S6;
S5: combining the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme generated in step S3 to obtain a second linguistic feature code, performing mean-variance normalization on the second linguistic feature code, inputting the normalized feature vector into a shallow neural network, and outputting a fixed-dimension front-end feature coding vector;
S6: building a sequence-to-sequence end-to-end network with an attention mechanism, and merging the embedded vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector;
and feeding the merged prediction vector into an autoregressive (LSTM) network to predict Mel features, and performing iterative training with the Mel features of the audio data in the sample as the target to obtain an autoregressive model.
Preferably, step S1 includes:
S1.1: training a prosody prediction model with a text prosody labeling data set, and labeling the text prosody with special markers;
S1.2: performing prosody prediction on the texts of the audio data with the trained prosody prediction model to obtain the prosody-annotated texts of all the texts.
Preferably, step S2 includes:
S2.1: converting the prosody-annotated text from characters to pinyin and the pinyin to phonemes to obtain the phoneme sequence of the text, wherein each prosody label is represented as a phoneme using a special symbol;
S2.2: performing word segmentation and part-of-speech prediction on the text to obtain the word segmentation result and part-of-speech prediction result of the text;
S2.3: computing the contextual text features, taking each phoneme of each text as the minimum unit, to obtain the positional information of the text as the position information code;
designing a question set and generating a zero-one code according to the question set;
S2.4: combining the position information code computed in S2.3 with the zero-one code to obtain the first linguistic feature code of the text content.
Preferably, the mean-variance normalization of the second linguistic feature codes in step S5 is performed as follows:
S5.1: computing the mean and variance of each bit over all second linguistic feature codes;
S5.2: subtracting from each bit of a code the mean of that bit and dividing by the variance of that bit, according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized.
A speech synthesis method based on front-end design comprises the following steps:
S1A: processing the text to be synthesized according to steps S1-S2 of the training method to obtain its first linguistic feature codes;
S2A: inputting the first linguistic feature codes of the text to be synthesized into the trained pronunciation duration model to obtain the pronunciation duration of each phoneme;
S3A: combining the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme generated in step S2A to obtain second linguistic feature codes, performing mean-variance normalization on the second linguistic feature codes, inputting the normalized feature vectors into the shallow neural network, and outputting fixed-dimension front-end feature coding vectors;
S4A: using the sequence-to-sequence network trained in step S6 of the training method, and merging the embedded vector output by the sequence-to-sequence network with the front-end feature coding vectors of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into the autoregressive model to obtain Mel features;
S6A: inputting the Mel features into a vocoder to obtain the synthesized audio.
The invention addresses the poor stability and controllability of end-to-end speech synthesis schemes by providing a model training and speech synthesis method based on front-end design, which improves the stability and controllability of the end-to-end network and reduces the training difficulty of the neural network. The invention can effectively reduce the probability of pronunciation errors and speaking-rate errors for individual words within a sentence. Moreover, in certain special cases the pronunciation of particular phonemes, their pronunciation durations, the sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and the duration features.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a speech model training method according to the present invention;
FIG. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The invention discloses a speech model training method based on front-end design, which begins with sample collection:
High-quality audio data of a single speaker and the text corresponding to the audio data are collected as the original training data, and the Mel features of the audio data are extracted. The subsequent steps are shown in FIG. 1.
S1: The prosody of the text is predicted and labeled by a prosody prediction model to generate a prosody-annotated text carrying prosody labels.
The prosody prediction model mainly predicts short pauses and long pauses in the text and marks them accordingly.
The method specifically comprises the following steps:
S1.1: A prosody prediction model is trained with a text prosody labeling data set, and the text prosody is labeled with special markers. For example, prosodic words, prosodic phrases, short pauses, and long pauses in the text correspond to the markers #1, #2, #3, and #4, respectively.
S1.2: The trained prosody prediction model is used to perform prosody prediction on the texts of the audio data, and the texts are labeled as described above to obtain the prosody-annotated texts of all the texts.
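As an illustration of the marker convention above, the following minimal Python sketch splits a prosody-annotated string into segments and boundary levels. The example sentence and the helper function are hypothetical and are not taken from the patent; they only demonstrate the #1-#4 format.

```python
# Hypothetical example of the #1-#4 prosody markers; the sentence and helper
# are illustrative only, not taken from the patent.
import re

labeled = "今天#1天气#2真不错#3我们出去走走吧#4"   # hypothetical prosody-annotated text

def split_prosody(text: str):
    """Return (segment, boundary level) pairs from a prosody-annotated string."""
    return [(seg, int(level)) for seg, level in re.findall(r"([^#]+)#([1-4])", text)]

print(split_prosody(labeled))
# [('今天', 1), ('天气', 2), ('真不错', 3), ('我们出去走走吧', 4)]
```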
S2: Front-end rules are applied to the prosody-annotated text generated in step S1 to extract the linguistic features of the text, including phonemes, word segmentation, parts of speech, and contextual features such as the positions of phonemes, characters, and words within phrases. Taking a single phoneme as the minimum unit, the linguistic features of the text are matched against a question set to obtain one-hot codes, with each phoneme corresponding to one linguistic feature one-hot code.
The method specifically comprises the following steps:
S2.1: The prosody-annotated text is converted from characters to pinyin and the pinyin to phonemes to obtain the phoneme sequence of the text, wherein each prosody label is represented as a phoneme using a special symbol.
S2.2, performing word segmentation and part-of-speech prediction on the text to obtain a word segmentation result and a part-of-speech prediction result of the text.
S2.3: The contextual text features are computed, taking each phoneme of each text as the minimum unit and following the question set configuration, to obtain the positional information of the text surrounding each phoneme as the position information code.
The position information code is formed in a uniform format with the phoneme as the minimum unit and may include the following items: the position of the current phoneme within the whole phoneme sequence, within the current pinyin syllable, within the current phrase, and within the current prosodic phrase; these position values are combined into the position information code.
A question set is designed, mainly covering matches on the basic features of the current phoneme. It mainly includes the following parts, which can be adjusted according to the actual situation:
the questions about the current phoneme cover the following information:
specifically, the phoneme itself, the phoneme type, the tone of the phoneme, and the part of speech of the word containing the phoneme;
and the corresponding features of the preceding phoneme, the phoneme two positions before, the following phoneme, and the phoneme two positions after the current phoneme.
The question set is matched against this information: a bit is set to 1 if the corresponding question is satisfied and to 0 otherwise, generating the zero-one (one-hot) code according to the question set.
S2.4: The position information code computed in S2.3 is combined with the zero-one code to obtain the first linguistic feature code of the text content.
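The following sketch illustrates how such a first linguistic feature code could be assembled for a single phoneme. The question set entries, the context fields, and the example values are all assumptions chosen for illustration; the patent only fixes the overall structure, namely a zero-one code of question-set answers concatenated with position information.

```python
# Assembling the first linguistic feature code of one phoneme (hypothetical
# question set and context fields).
from dataclasses import dataclass

@dataclass
class PhonemeContext:
    phoneme: str
    tone: int                   # tone of the syllable containing the phoneme
    pos_tag: str                # part of speech of the containing word
    idx_in_sentence: int        # position in the whole phoneme sequence
    idx_in_syllable: int        # position within the current pinyin syllable
    idx_in_word: int            # position within the current word/phrase
    idx_in_prosodic_phrase: int # position within the current prosodic phrase

QUESTION_SET = [                                   # hypothetical question set
    lambda c: c.phoneme == "a",
    lambda c: c.phoneme in {"b", "p", "m", "f"},   # labial initial?
    lambda c: c.tone == 4,
    lambda c: c.pos_tag == "n",
]

def first_code(ctx: PhonemeContext) -> list:
    one_hot = [1 if q(ctx) else 0 for q in QUESTION_SET]      # zero-one (one-hot) code
    position = [ctx.idx_in_sentence, ctx.idx_in_syllable,
                ctx.idx_in_word, ctx.idx_in_prosodic_phrase]  # position information code
    return one_hot + position                                 # first linguistic feature code

print(first_code(PhonemeContext("a", 4, "n", 12, 1, 2, 5)))   # [1, 0, 1, 1, 12, 1, 2, 5]
```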
S3: The text phonemes of the sample are forcibly aligned with the corresponding audio data using a forced alignment algorithm (e.g., MFA) to obtain the pronunciation duration of each phoneme.
The time points in the audio file corresponding to each phoneme are obtained, and a time list corresponding to the text phonemes is generated; each segment of audio corresponds to one phoneme timing file. Specifically:
Each phoneme is mapped to time nodes in the audio file using the forced alignment algorithm, giving the start and end times of each phoneme in the audio file.
The pronunciation duration of each phoneme in each text is then obtained from these start and end times, and finally the pronunciation duration corresponding to each first linguistic feature code obtained in step S2 is obtained.
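A minimal sketch of this duration extraction is shown below, assuming the forced aligner (e.g., MFA) has already produced per-phoneme start and end times; the alignment values and the 12.5 ms frame shift are illustrative assumptions.

```python
# Per-phoneme durations from a forced-alignment result. The alignment tuples
# and the frame shift are hypothetical; any aligner that yields phoneme
# start/end times can be used.
alignment = [            # hypothetical output for one utterance: (phoneme, start_s, end_s)
    ("n", 0.00, 0.08),
    ("i", 0.08, 0.21),
    ("h", 0.21, 0.27),
    ("ao", 0.27, 0.45),
]

FRAME_SHIFT_S = 0.0125   # should match the hop size used for Mel feature extraction

durations_s = [end - start for _, start, end in alignment]
durations_frames = [round(d / FRAME_SHIFT_S) for d in durations_s]
print(durations_frames)  # [6, 10, 5, 14]
```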
S4: A neural network is built, the first linguistic feature codes obtained in step S2 are taken as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and a pronunciation duration model is trained for each phoneme.
The pronunciation duration model trained in this way predicts the pronunciation durations of phonemes within a sentence with higher accuracy.
The pronunciation duration model obtains the pronunciation duration of a phoneme from its first linguistic feature code. This step is not constrained in temporal order relative to steps S5 and S6.
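A possible realization of the pronunciation duration model is sketched below as a small feed-forward regressor in PyTorch. The layer sizes, loss, and optimizer are assumptions; the patent only states that a neural network is built with the first linguistic feature codes as input and the phoneme durations as the prediction target.

```python
# Sketch of a pronunciation duration model: a small feed-forward regressor
# from a phoneme's first linguistic feature code to its duration.
import torch
import torch.nn as nn

FEAT_DIM = 300   # assumed width of the first linguistic feature code

duration_model = nn.Sequential(
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),           # predicted pronunciation duration (e.g., in frames)
)

optimizer = torch.optim.Adam(duration_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(features: torch.Tensor, durations: torch.Tensor) -> float:
    """features: (batch, FEAT_DIM); durations: (batch, 1) target durations."""
    optimizer.zero_grad()
    loss = loss_fn(duration_model(features), durations)
    loss.backward()
    optimizer.step()
    return loss.item()
```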
S5: The first linguistic feature code of each phoneme generated in step S2 is combined with the pronunciation duration of each phoneme generated in step S3 to obtain the second linguistic feature code; mean-variance normalization is applied to the second linguistic feature code, the normalized feature vector is input into a shallow neural network, and a fixed-dimension front-end feature coding vector is output.
Specifically: the first linguistic feature code generated in step S2 and the pronunciation duration of each phoneme generated in step S3 are combined by appending the pronunciation duration of each phoneme to the end of its code, giving the second linguistic feature code.
For example, if the first linguistic feature code of a certain phoneme is AXCB54 and its pronunciation duration is 7, the combined code is AXCB547.
Mean-variance normalization is applied to the combined second linguistic feature codes, which are then input into a shallow DNN; after network training, high-dimensional front-end feature coding vectors are output.
The front-end feature coding vector is an abstract representation of the front-end linguistic features and is obtained through training. Second linguistic feature codes that are close in the actual text produce, via the shallow neural network, front-end feature coding vectors that are close to each other in the high-dimensional space, i.e., their cosine similarity is higher.
Because similar front-end features generate front-end feature codes with high cosine similarity, the network can form a better abstract representation of the codes, in other words it can form codes that capture the meaning of the sentence, which in turn makes the codes more accurate.
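The sketch below illustrates step S5 under stated assumptions: the phoneme duration is appended to the first linguistic feature code to form the second code, which a shallow network maps to a fixed-dimension front-end feature coding vector. The dimensions and layer sizes are illustrative, not specified by the patent.

```python
# Sketch of step S5 (illustrative dimensions): second code = first code with
# the duration appended; a shallow network maps it to a fixed-dimension
# front-end feature coding vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

FIRST_CODE_DIM = 300   # assumed width of the first linguistic feature code
FRONT_END_DIM = 256    # assumed fixed output dimension

front_end_encoder = nn.Sequential(
    nn.Linear(FIRST_CODE_DIM + 1, 256), nn.Tanh(),
    nn.Linear(256, FRONT_END_DIM),
)

first_code = torch.randn(1, FIRST_CODE_DIM)
duration = torch.tensor([[7.0]])                          # duration appended as the last bit
second_code = torch.cat([first_code, duration], dim=-1)
# mean-variance normalization of second_code (steps S5.1/S5.2) is assumed here

front_end_vec = front_end_encoder(second_code)            # shape: (1, FRONT_END_DIM)

# After training, second codes from similar contexts should yield front-end
# vectors with high cosine similarity.
other_vec = front_end_encoder(torch.randn(1, FIRST_CODE_DIM + 1))
similarity = F.cosine_similarity(front_end_vec, other_vec)
```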
The normalization adopted is mean-variance normalization, performed as follows:
S5.1: compute the mean and variance of each bit over all second linguistic feature codes.
S5.2: subtract from each bit of a code the mean of that bit and divide by the variance of that bit, according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized. The purpose of the normalization is to facilitate convergence of the subsequent networks.
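A minimal sketch of this bit-wise mean-variance normalization over a set of second linguistic feature codes follows; the array sizes are hypothetical, and the variance is used as the divisor to follow the formula as written above.

```python
# Bit-wise mean-variance normalization y_k = (x_k - m_k) / s_k over all second
# linguistic feature codes (hypothetical sizes).
import numpy as np

codes = np.random.rand(1000, 301)   # hypothetical: 1000 codes, 301 bits each

m = codes.mean(axis=0)              # m_k: mean of the k-th bit over all codes
s = codes.var(axis=0)               # s_k: variance of the k-th bit over all codes
normalized = (codes - m) / s        # y_k = (x_k - m_k) / s_k
```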
S6: A sequence-to-sequence end-to-end network with an attention mechanism is built.
The end-to-end network consists of a sequence-to-sequence network and an LSTM autoregressive network, which together form the end-to-end network.
The embedded vector output by the sequence-to-sequence network is merged with the front-end feature coding vector of step S5 to obtain the prediction vector.
The merge is a vector concatenation; for example, if the embedded vector is 512-dimensional and the front-end coding vector is 256-dimensional, the merged vector is 512+256 = 768-dimensional.
The embedded vector is generated by the sequence-to-sequence network and offers little controllability, whereas the front-end coding vector is generated in the preceding steps from the sample features and explicitly defined rules. The purpose of merging the embedded vector with the front-end feature coding vector of step S5 is to keep the high naturalness obtained when predicting Mel features for audio synthesis from the embedded vector alone, while allowing synthesis errors caused by relying solely on the model to be avoided by manually correcting the front-end codes.
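The merge itself is a plain concatenation, as the following sketch shows using the 512- and 256-dimensional sizes from the example above; the sequence length is hypothetical.

```python
# Concatenating the embedded vector (512-dim) from the sequence-to-sequence
# network with the front-end feature coding vector (256-dim) gives the
# 768-dim prediction vector.
import torch

T = 20                              # hypothetical number of phonemes in the sequence
embedded = torch.randn(T, 512)      # output of the sequence-to-sequence network
front_end = torch.randn(T, 256)     # front-end feature coding vectors from step S5

prediction_vec = torch.cat([embedded, front_end], dim=-1)
assert prediction_vec.shape == (T, 768)
```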
The merged prediction vector is fed into an autoregressive (LSTM) network to predict the Mel features of each frame, and iterative training is performed with the Mel features of the audio data in the original training data as the target, yielding the autoregressive model. Specifically, the sequence-to-sequence network and the autoregressive network are connected and trained as a whole, so the parameters of the sequence-to-sequence network are obtained at the same time as the autoregressive model; in other words, the sequence-to-sequence network is trained as well.
Through steps S1-S6, the sample is finally used to train a pronunciation duration model and an autoregressive model.
The speech synthesis method using the trained models comprises the following steps, as shown in FIG. 2:
S1A: the text to be synthesized is processed according to steps S1-S2 to obtain its first linguistic feature codes;
S2A: the first linguistic feature codes of the text to be synthesized are input into the trained pronunciation duration model to obtain the pronunciation duration of each phoneme;
S3A: the first linguistic feature code of each phoneme generated in step S1A is combined with the pronunciation duration of each phoneme generated in step S2A to obtain second linguistic feature codes; mean-variance normalization is applied to the second linguistic feature codes, the normalized feature vectors are input into the shallow neural network, and fixed-dimension front-end feature coding vectors are output;
S4A: the sequence-to-sequence network trained in step S6 is used, and the embedded vector output by the sequence-to-sequence network is merged with the front-end feature coding vectors of step S3A to obtain the prediction vector;
S5A: the prediction vector is input into the autoregressive model to obtain Mel features;
S6A: the Mel features are input into the vocoder to obtain the synthesized audio.
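The overall synthesis flow S1A-S6A can be summarized as the pipeline sketch below. Every component is passed in as a callable obtained from the training stage; the dummy stand-ins at the bottom exist only so the sketch runs and are not APIs defined by the patent.

```python
# Pipeline sketch of S1A-S6A with hypothetical component callables.
from typing import Callable

def synthesize(
    text: str,
    front_end: Callable,        # S1A: prosody labeling + front-end rules -> first codes
    duration_model: Callable,   # S2A: trained pronunciation duration model
    encoder: Callable,          # S3A: second codes -> front-end feature coding vectors
    seq2seq_merge: Callable,    # S4A: seq2seq embedding merged with front-end vectors
    autoregressive: Callable,   # S5A: prediction vectors -> Mel features
    vocoder: Callable,          # S6A: Mel features -> waveform
) -> bytes:
    first_codes = front_end(text)
    durations = duration_model(first_codes)
    front_vecs = encoder(first_codes, durations)
    pred_vecs = seq2seq_merge(text, front_vecs)
    mel = autoregressive(pred_vecs)
    return vocoder(mel)

# Dummy stand-ins so the sketch runs end to end; real components come from training.
audio = synthesize(
    "你好",
    front_end=lambda t: [[0.0]] * len(t),
    duration_model=lambda codes: [5] * len(codes),
    encoder=lambda codes, durs: [c + [d] for c, d in zip(codes, durs)],
    seq2seq_merge=lambda t, vecs: vecs,
    autoregressive=lambda vecs: [[0.0] * 80 for _ in vecs],
    vocoder=lambda mel: b"\x00" * len(mel),
)
```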
The invention addresses the poor stability and controllability of end-to-end speech synthesis schemes by providing a model training and speech synthesis method based on front-end design, which improves the stability and controllability of the end-to-end network and reduces the training difficulty of the neural network. The invention can effectively reduce the probability of pronunciation errors and speaking-rate errors for individual words within a sentence. Moreover, in certain special cases the pronunciation of particular phonemes, their pronunciation durations, the sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and the duration features.
Meanwhile, certain special cases arise in practice. For example, the same character may appear too many times in a sentence, which easily interferes with the attention mechanism of the end-to-end network and causes deviations in that character's pronunciation and pronunciation duration; a sentence may be too long and lack punctuation, making sentence-break errors likely; or some characters may not appear in the training set and are prone to pronunciation errors. In these cases, the pronunciation of characters and phonemes, the pronunciation duration of phonemes, the sentence prosody, and so on can be controlled by fine-tuning the front-end linguistic features and the duration features, making subsequent synthesis more accurate.
The foregoing describes preferred embodiments of the present invention. Provided they are not mutually contradictory, the preferred embodiments may be combined in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventors' verification process and are not intended to limit the scope of the invention, which is defined by the claims; equivalent structural changes made based on the description and drawings of the present invention are likewise included within the scope of the invention.

Claims (5)

1. A speech model training method based on front-end design, characterized by comprising a sample collection step:
sample collection, in which high-quality audio data of a single speaker and the text corresponding to the audio data are collected as the original training data, and the Mel features of the audio data are extracted;
and the following subsequent steps:
S1: predicting and labeling the prosody of the text with a prosody prediction model to generate a prosody-annotated text carrying prosody labels;
S2: extracting the linguistic features of the text from the prosody-annotated text generated in step S1 through front-end rules, wherein the linguistic features comprise a position information code and a zero-one code, and combining the position information code and the zero-one code to obtain the first linguistic feature code of the text content;
S3: forcibly aligning the text phonemes in the sample with the corresponding audio data using a forced alignment algorithm to obtain the pronunciation duration of each phoneme;
S4: building a neural network, taking the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and training a pronunciation duration model for each phoneme;
this step is not constrained in temporal order relative to steps S5 and S6;
S5: combining the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme generated in step S3 to obtain a second linguistic feature code, performing mean-variance normalization on the second linguistic feature code, inputting the normalized feature vector into a shallow neural network, and outputting a fixed-dimension front-end feature coding vector;
S6: building a sequence-to-sequence end-to-end network with an attention mechanism, and merging the embedded vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector;
and feeding the merged prediction vector into an autoregressive (LSTM) network to predict Mel features, and performing iterative training with the Mel features of the audio data in the sample as the target to obtain an autoregressive model.
2. The speech model training method of claim 1, wherein step S1 comprises:
S1.1: training a prosody prediction model with a text prosody labeling data set, and labeling the text prosody with special markers;
S1.2: performing prosody prediction on the texts of the audio data with the trained prosody prediction model to obtain the prosody-annotated texts of all the texts.
3. The speech model training method of claim 1, wherein step S2 comprises:
S2.1: converting the prosody-annotated text from characters to pinyin and the pinyin to phonemes to obtain the phoneme sequence of the text, wherein each prosody label is represented as a phoneme using a special symbol;
S2.2: performing word segmentation and part-of-speech prediction on the text to obtain the word segmentation result and part-of-speech prediction result of the text;
S2.3: computing the contextual text features, taking each phoneme of each text as the minimum unit, to obtain the positional information of the text as the position information code;
designing a question set and generating a zero-one code according to the question set;
S2.4: combining the position information code computed in S2.3 with the zero-one code to obtain the first linguistic feature code of the text content.
4. The speech model training method of claim 1, wherein the mean-variance normalization of the second linguistic feature codes in step S5 is performed as follows:
S5.1: computing the mean and variance of each bit over all second linguistic feature codes;
S5.2: subtracting from each bit of a code the mean of that bit and dividing by the variance of that bit, according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized.
5. A speech synthesis method based on front-end design, characterized by comprising the following steps:
S1A: processing the text to be synthesized according to steps S1-S2 of the training method of claim 1 to obtain its first linguistic feature codes;
S2A: inputting the first linguistic feature codes of the text to be synthesized into the trained pronunciation duration model to obtain the pronunciation duration of each phoneme;
S3A: combining the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme generated in step S2A to obtain second linguistic feature codes, performing mean-variance normalization on the second linguistic feature codes, inputting the normalized feature vectors into the shallow neural network, and outputting fixed-dimension front-end feature coding vectors;
S4A: using the sequence-to-sequence network trained in step S6 of the training method of claim 1, and merging the embedded vector output by the sequence-to-sequence network with the front-end feature coding vectors of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into the autoregressive model to obtain Mel features;
S6A: inputting the Mel features into a vocoder to obtain the synthesized audio.
CN202110762178.XA 2021-07-06 2021-07-06 Voice model training method based on front-end design and voice synthesis method Active CN113257221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110762178.XA CN113257221B (en) 2021-07-06 2021-07-06 Voice model training method based on front-end design and voice synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110762178.XA CN113257221B (en) 2021-07-06 2021-07-06 Voice model training method based on front-end design and voice synthesis method

Publications (2)

Publication Number Publication Date
CN113257221A true CN113257221A (en) 2021-08-13
CN113257221B CN113257221B (en) 2021-09-17

Family

ID=77190767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110762178.XA Active CN113257221B (en) 2021-07-06 2021-07-06 Voice model training method based on front-end design and voice synthesis method

Country Status (1)

Country Link
CN (1) CN113257221B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113948062A (en) * 2021-12-20 2022-01-18 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium
CN118116363A (en) * 2024-04-26 2024-05-31 厦门蝉羽网络科技有限公司 Speech synthesis method based on time perception position coding and model training method thereof

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5832425A (en) * 1994-10-04 1998-11-03 Hughes Electronics Corporation Phoneme recognition and difference signal for speech coding/decoding
EP3021318A1 (en) * 2014-11-17 2016-05-18 Samsung Electronics Co., Ltd. Speech synthesis apparatus and control method thereof
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN111640418A (en) * 2020-05-29 2020-09-08 数据堂(北京)智能科技有限公司 Prosodic phrase identification method and device and electronic equipment
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112002304A (en) * 2020-08-27 2020-11-27 上海添力网络科技有限公司 Speech synthesis method and device
CN112002305A (en) * 2020-07-29 2020-11-27 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112133278A (en) * 2020-11-20 2020-12-25 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5832425A (en) * 1994-10-04 1998-11-03 Hughes Electronics Corporation Phoneme recognition and difference signal for speech coding/decoding
EP3021318A1 (en) * 2014-11-17 2016-05-18 Samsung Electronics Co., Ltd. Speech synthesis apparatus and control method thereof
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN111640418A (en) * 2020-05-29 2020-09-08 数据堂(北京)智能科技有限公司 Prosodic phrase identification method and device and electronic equipment
CN111754976A (en) * 2020-07-21 2020-10-09 中国科学院声学研究所 Rhythm control voice synthesis method, system and electronic device
CN112002305A (en) * 2020-07-29 2020-11-27 北京大米科技有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112002304A (en) * 2020-08-27 2020-11-27 上海添力网络科技有限公司 Speech synthesis method and device
CN112133278A (en) * 2020-11-20 2020-12-25 成都启英泰伦科技有限公司 Network training and personalized speech synthesis method for personalized speech synthesis model
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
C. Yuan et al.: "Personalized End-to-End Mandarin Speech Synthesis using Small-sized Corpus", 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference *
Zhang Yuqiang: "End-to-End Based Speech Synthesis", China Masters' Theses Full-text Database (Information Science and Technology) *
Han Min: "An Improved Speech Synthesis Method", China Masters' Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113948062A (en) * 2021-12-20 2022-01-18 阿里巴巴达摩院(杭州)科技有限公司 Data conversion method and computer storage medium
CN118116363A (en) * 2024-04-26 2024-05-31 厦门蝉羽网络科技有限公司 Speech synthesis method based on time perception position coding and model training method thereof

Also Published As

Publication number Publication date
CN113257221B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
CN112420016B (en) Method and device for aligning synthesized voice and text and computer storage medium
CN105654939A (en) Voice synthesis method based on voice vector textual characteristics
CN110767213A (en) Rhythm prediction method and device
CN113257221B (en) Voice model training method based on front-end design and voice synthesis method
Liu et al. Mongolian text-to-speech system based on deep neural network
CN113205792A (en) Mongolian speech synthesis method based on Transformer and WaveNet
CN113539268A (en) End-to-end voice-to-text rare word optimization method
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
Maia et al. Towards the development of a brazilian portuguese text-to-speech system based on HMM.
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
Kayte et al. A Marathi Hidden-Markov Model Based Speech Synthesis System
US11817079B1 (en) GAN-based speech synthesis model and training method
Chiang et al. The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0
CN114708848A (en) Method and device for acquiring size of audio and video file
Bonafonte et al. The UPC TTS system description for the 2008 blizzard challenge
Janyoi et al. An Isarn dialect HMM-based text-to-speech system
Lin et al. Improving mandarin prosody boundary detection by using phonetic information and deep LSTM model
Saychum et al. A great reduction of wer by syllable toneme prediction for thai grapheme to phoneme conversion
CN117524193B (en) Training method, device, equipment and medium for Chinese-English mixed speech recognition system
Nair et al. Indian text to speech systems: A short survey
Zhang et al. Chinese speech synthesis system based on end to end
CN116229994B (en) Construction method and device of label prediction model of Arabic language
Janyoi et al. Isarn Dialect Speech Synthesis using HMM with syllable-context features
Gong et al. A Review of End-to-End Chinese–Mandarin Speech Synthesis Techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant