CN113257221A - Voice model training method based on front-end design and voice synthesis method - Google Patents
- Publication number: CN113257221A (application CN202110762178.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G: PHYSICS
- G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
  - G10L13/02: Methods for producing synthetic speech; speech synthesisers
  - G10L13/027: Concept-to-speech synthesisers; generation of natural phrases from machine-based concepts
  - G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination
  - G10L13/10: Prosody rules derived from text; stress or intonation
  - G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G06N: Computing arrangements based on specific computational models
  - G06N3/02: Neural networks
  - G06N3/08: Learning methods
Abstract
A speech model training method and a speech synthesis method based on front-end design comprise the following steps: S1: generate a prosody-annotated text; S2: obtain a first linguistic feature code of the text content; S3: obtain the pronunciation duration of each phoneme; S4: train a pronunciation duration model; S5: output a front-end feature coding vector with fixed dimensionality; S6: perform iterative training to obtain an autoregressive model. The invention effectively reduces the probability of pronunciation and speed errors of individual characters within a whole sentence. Meanwhile, the pronunciation duration of particular phonemes, sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and duration features.
Description
Technical Field
The invention belongs to the technical field of artificial-intelligence speech synthesis, and in particular relates to a speech model training method and a speech synthesis method based on front-end design.
Background
Speech synthesis, also known as Text-To-Speech (TTS), is a technique that converts text into corresponding audio. With the development of artificial intelligence and growing social demand, speech synthesis technology with accurate, clear, natural, and pleasant pronunciation has attracted attention. Traditional speech synthesis techniques include concatenative and parametric methods; because of their poor naturalness and listening quality, both are gradually being replaced by end-to-end speech synthesis schemes.
An end-to-end speech synthesis scheme generates acoustic features directly from text through a model of relatively high complexity, and a vocoder then generates audio from the acoustic features. However, the end-to-end network structure is highly integrated and not easy to adjust flexibly when a synthesis problem is encountered. The pronunciation and speed of certain individual characters often go wrong; such problems are difficult to avoid by adjusting parameters and instead require re-screening data and retraining the model with additional data. The model optimization iteration period is long, and synthesis problems are not easy to resolve. In addition, the speech rate, pronunciation, prosody, and so on may all change across application scenarios, and a highly integrated end-to-end network is difficult to adjust flexibly for such changes.
Summary of the Invention
In order to overcome the technical defects in the prior art, the invention discloses a speech model training method and a speech synthesis method based on front-end design.
The invention discloses a speech model training method based on front-end design, which begins with sample acquisition:
collecting high-quality audio data of a single speaker and the text corresponding to the audio data as original training data, and extracting the Mel features of the audio data;
the subsequent steps are as follows:
S1: predicting and labeling the prosody of the text with a prosody prediction model to generate a prosody-annotated text;
S2: extracting linguistic features from the prosody-annotated text generated in step S1 through front-end rules, the linguistic features comprising position information codes and zero-one codes, and combining the two to obtain the first linguistic feature code of the text content;
S3: forcibly aligning the text phonemes in the sample with the corresponding audio data using a forced alignment algorithm to obtain the pronunciation duration of each phoneme;
S4: building a neural network, taking the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and training a pronunciation duration model;
this step is not restricted in temporal order with respect to steps S5 and S6;
S5: combining the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme obtained in step S3 to obtain a second linguistic feature code, performing mean-variance normalization on the second code, inputting the normalized feature vector into a shallow neural network, and outputting a front-end feature coding vector with fixed dimensionality;
S6: building an end-to-end network with a sequence-to-sequence attention mechanism, and merging the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector;
and connecting the merged prediction vector to an autoregressive (LSTM) network to predict Mel features, and performing iterative training with the Mel features of the audio data in the sample as the target to obtain an autoregressive model.
Preferably, the step of S1 includes:
s1.1, training a prosody prediction model by using a text prosody labeling data set, and labeling text prosody by using a special mark;
s1.2, carrying out prosody prediction on the text of the audio data by using the trained prosody prediction model to obtain prosody labeling texts of all the texts.
Preferably, the step of S2 includes:
s2.1, converting the prosody labeling text of the text from the text to pinyin, and converting the pinyin into phonemes to obtain a phoneme sequence of the text, wherein the prosody labeling is represented as one phoneme by using a special symbol;
s2.2, performing word segmentation and part-of-speech prediction on the text to obtain a word segmentation result and a part-of-speech prediction result of the text.
S2.3, computing the contextual features of the text, taking each phoneme of each text as the minimum unit, to obtain the contextual position information as the position information code;
designing a problem set, and generating a zero-one code according to the problem set;
and S2.4, combining the position information code calculated in the S2.3 with the zero-one code to obtain a first language characteristic code of the text content.
Preferably, the mean-variance normalization of the second linguistic feature code in step S5 is performed as follows:
S5.1, calculating the mean and variance of each bit position across the second linguistic feature codes of all samples;
S5.2, subtracting from each bit of a code the mean of that bit position and dividing by its variance,
according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized.
A speech synthesis method based on front-end design comprises the following steps:
S1A, processing the text to be synthesized according to the methods of steps S1-S2 in the training method to obtain its first linguistic feature code;
S2A, inputting the first linguistic feature code of the text to be synthesized into the pronunciation duration model to predict the pronunciation duration of each phoneme;
S3A, combining the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme obtained in step S2A to obtain second linguistic feature codes, performing mean-variance normalization on them, inputting the normalized feature vectors into the shallow neural network, and outputting fixed-dimension front-end feature coding vectors;
S4A: using the sequence-to-sequence network obtained by training in step S6 of the training method, and merging the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into an autoregressive model to obtain Mel characteristics;
S6A: the Mel features are input into the vocoder to obtain the synthesized audio.
The invention addresses the defects of poor stability and controllability in end-to-end speech synthesis schemes by providing a model training and speech synthesis method based on front-end design, which improves the stability and controllability of the end-to-end network and reduces the training difficulty of the neural network. The invention effectively reduces the probability of pronunciation and speed errors of individual characters within a whole sentence. Meanwhile, in some special cases, the pronunciation of particular phonemes, phoneme durations, sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and duration features.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a speech model training method according to the present invention;
fig. 2 is a flow chart of a speech synthesis method according to an embodiment of the present invention.
Detailed Description
The following provides a more detailed description of the present invention.
The disclosed speech model training method based on front-end design begins with sample acquisition: high-quality audio data of a single speaker and the text corresponding to the audio data are collected as original training data, and the Mel features of the audio data are extracted. The subsequent steps are shown in fig. 1.
S1: predict and label the prosody of the text with a prosody prediction model to generate a prosody-annotated text.
The prosody prediction model mainly predicts short pauses and long pauses in the text and marks them accordingly.
The method specifically comprises the following steps:
S1.1, train a prosody prediction model with a text prosody annotation data set, and annotate text prosody with special marks. For example, the prosodic words, prosodic phrases, short pauses and long pauses in the text correspond to the marks #1, #2, #3 and #4 respectively.
S1.2, perform prosody prediction on the texts of the audio data with the trained prosody prediction model and annotate them as above to obtain the prosody-annotated version of every text.
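As an illustrative sketch only (not the patented prosody model itself), the annotation step of S1.1-S1.2 can be pictured as interleaving predicted break marks with the text; the function name and the break-level input format are hypothetical:

```python
# Hypothetical rendering of prosody annotation: break levels predicted by
# some model are inserted into the text as #1..#4 marks (prosodic word,
# prosodic phrase, short pause, long pause), per the example above.
BREAK_MARKS = {1: "#1", 2: "#2", 3: "#3", 4: "#4"}

def annotate_prosody(words, breaks):
    """Interleave words with their predicted break marks.

    words  : list of word strings
    breaks : list of break levels (0 = no break) after each word
    """
    parts = []
    for word, level in zip(words, breaks):
        parts.append(word)
        if level in BREAK_MARKS:
            parts.append(BREAK_MARKS[level])
    return "".join(parts)

print(annotate_prosody(["今天", "天气", "很好"], [1, 2, 4]))
# 今天#1天气#2很好#4
```

In the real pipeline the break levels come from the trained prosody prediction model rather than being supplied by hand.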
S2: extract linguistic features from the prosody-annotated text generated in step S1 through front-end rules, including phonemes, word segmentation, parts of speech, and contextual features such as the position information of phonemes, characters, and words within phrases. Match the linguistic features of the text against a question set, taking a single phoneme as the minimum unit, to obtain one-hot codes; each phoneme corresponds to one linguistic feature one-hot code.
The method specifically comprises the following steps:
s2.1, converting the prosody labeling text of the text from the text to pinyin, and converting the pinyin into phonemes to obtain a phoneme sequence of the text, wherein the prosody labeling is represented as a phoneme by using a special symbol.
S2.2, performing word segmentation and part-of-speech prediction on the text to obtain a word segmentation result and a part-of-speech prediction result of the text.
S2.3, compute the contextual features of the text, taking each phoneme of each text as the minimum unit, and according to the setup of the question set obtain the contextual position information of each phoneme as its position information code.
The position information code is formed in a fixed format with the phoneme as the minimum unit and may include the following information: the position of the current phoneme in the whole phoneme sequence, its position within the current pinyin syllable, its position within the current phrase, and its position within the current prosodic phrase; these position values are combined into the position information code.
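A minimal sketch of such a position information code, under the assumption (for illustration only) that the text arrives as nested lists of prosodic phrases, words, pinyin syllables, and phonemes:

```python
# Illustrative only: per phoneme, record its 1-based position in the whole
# phoneme sequence, in its pinyin syllable, in its word, and in its
# prosodic phrase. The nested-list input format is an assumption.
def position_codes(prosodic_phrases):
    """prosodic_phrases: phrases -> words -> syllables -> phonemes.
    Returns one tuple (phoneme, seq_pos, syl_pos, word_pos, phrase_pos)
    per phoneme, in sequence order."""
    codes = []
    seq_pos = 0
    for phrase in prosodic_phrases:
        phrase_pos = 0
        for word in phrase:
            word_pos = 0
            for syllable in word:
                for syl_pos, ph in enumerate(syllable, start=1):
                    seq_pos += 1
                    word_pos += 1
                    phrase_pos += 1
                    codes.append((ph, seq_pos, syl_pos, word_pos, phrase_pos))
    return codes

# one prosodic phrase containing two one-syllable words: "ni" and "hao"
seq = [[[["n", "i"]], [["h", "ao"]]]]
print(position_codes(seq)[3])  # ('ao', 4, 2, 2, 4)
```

The tuple of counters corresponds to the "position values combined into the position information code" described above.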
Design a question set, which mainly consists of matches on the basic features of the current phoneme. It mainly comprises the following parts, which can be adjusted according to the actual situation.
The questions about the current phoneme cover the following information:
specifically, the phoneme itself, the phoneme type, the tone of the phoneme, and the part of speech of the word in which the phoneme is located;
and the corresponding features of the phonemes one and two positions before and one and two positions after the current phoneme.
Match each item of information against the question set: set the corresponding bit to 1 if the question is satisfied and to 0 otherwise, generating a zero-one (one-hot) code from the question set.
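The question-set matching can be sketched as a list of predicates over a phoneme's context, each contributing one bit; the questions below are hypothetical examples, not the patent's actual set:

```python
# Illustrative question set: each entry is (name, predicate). A satisfied
# question yields a 1 bit, otherwise 0. Real systems use far larger sets.
QUESTIONS = [
    ("cur_is_a",      lambda ctx: ctx["cur"] == "a"),
    ("cur_is_vowel",  lambda ctx: ctx["cur"] in {"a", "o", "e", "i", "u"}),
    ("cur_tone_is_4", lambda ctx: ctx["tone"] == 4),
    ("prev_is_vowel", lambda ctx: ctx["prev"] in {"a", "o", "e", "i", "u"}),
    ("word_is_noun",  lambda ctx: ctx["pos"] == "n"),
]

def zero_one_code(ctx):
    """Answer every question about a phoneme context -> binary vector."""
    return [1 if predicate(ctx) else 0 for _, predicate in QUESTIONS]

ctx = {"cur": "a", "prev": "h", "tone": 4, "pos": "n"}
print(zero_one_code(ctx))  # [1, 1, 1, 0, 1]
```

Concatenating this vector with the position information code of S2.3 would give the first linguistic feature code of S2.4.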
And S2.4, combining the position information code calculated in the S2.3 with the zero-one code to obtain a first language characteristic code of the text content.
S3: forcibly align the text phonemes in the sample with the corresponding audio data using a forced-alignment (MFA) algorithm to obtain the pronunciation duration of each phoneme.
Obtain the time points in the audio file corresponding to each phoneme and generate a time list for the text phonemes, where each segment of audio corresponds to one phoneme pronunciation time sequence file. The method specifically comprises the following steps:
map each phoneme to time nodes in the audio file with the forced alignment algorithm to obtain the start and end time of each phoneme in the audio file;
derive from the start and end times the pronunciation duration of each phoneme in each text, finally obtaining the pronunciation duration corresponding to each phoneme in the first linguistic feature code obtained in step S2.
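Given alignment intervals, the duration step reduces to a subtraction per phoneme. A minimal sketch, with a hypothetical interval list standing in for the aligner's output:

```python
# Assumed interval format: (phoneme, start_sec, end_sec) per phoneme, as a
# forced aligner such as the Montreal Forced Aligner would produce from a
# TextGrid. Duration of each phoneme is simply end - start.
alignment = [
    ("n",  0.00, 0.08),
    ("i",  0.08, 0.21),
    ("h",  0.21, 0.27),
    ("ao", 0.27, 0.45),
]

durations = [(ph, round(end - start, 3)) for ph, start, end in alignment]
print(durations)  # [('n', 0.08), ('i', 0.13), ('h', 0.06), ('ao', 0.18)]
```

These per-phoneme durations are the prediction targets of step S4 and the appended feature of step S5.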
S4: build a neural network, take the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and train a pronunciation duration model.
A pronunciation duration model trained in this way predicts the pronunciation duration of phonemes within a sentence with higher accuracy.
The pronunciation duration model obtains the pronunciation duration of a phoneme from its first linguistic feature code; this step is not restricted in temporal order with respect to steps S5 and S6.
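The patent does not fix an architecture for the duration model, so the following is only a shape sketch: a small feed-forward network mapping a first linguistic feature code to a scalar duration, with illustrative layer sizes and untrained random weights:

```python
# Hedged sketch of the duration model's input/output shape. Layer sizes,
# weight initialization, and the 40-dim code length are assumptions.
import numpy as np

rng = np.random.default_rng(0)

class ShallowDurationModel:
    def __init__(self, in_dim, hidden=64):
        self.w1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def predict(self, x):
        h = np.maximum(0.0, x @ self.w1 + self.b1)  # ReLU hidden layer
        return (h @ self.w2 + self.b2)[..., 0]      # one duration per phoneme

model = ShallowDurationModel(in_dim=40)
codes = rng.integers(0, 2, (5, 40)).astype(float)   # 5 phoneme codes
print(model.predict(codes).shape)  # (5,)
```

In training, the weights would be fitted against the forced-alignment durations of step S3.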
S5: combine the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme obtained in step S3 to obtain a second linguistic feature code, perform mean-variance normalization on the second code, input the normalized feature vector into a shallow neural network, and output a front-end feature coding vector with fixed dimensionality.
Specifically: the combination appends the pronunciation duration of each phoneme to the last bit of its first linguistic feature code to obtain the second linguistic feature code.
For example, if the first linguistic feature code of a phoneme is AXCB54 and its pronunciation duration is 7, the combined code is AXCB547.
Perform mean-variance normalization on the combined second linguistic feature codes, input them into a shallow DNN, and after network training output high-dimensional front-end feature coding vectors.
The front-end feature coding vector is an abstract representation of the front-end linguistic features, obtained through training: second linguistic feature codes that are close in the actual text yield, through the shallow neural network, front-end feature coding vectors that are close in the high-dimensional space, i.e. with higher cosine similarity.
Because front-end feature codes generated from similar front-end features have high cosine similarity, the network can form a better abstract representation of the codes, in other words encode something of the meaning of the sentence, which makes the coding more accurate.
The normalization adopts mean-variance normalization, specifically:
S5.1, calculate the mean and variance of each bit position across the second linguistic feature codes of all samples;
S5.2, subtract from each bit of a code the mean of that bit position and divide by its variance,
according to the formula y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit, x_k is the value of the k-th bit before normalization, m_k is the mean of the k-th bit over all codes to be normalized, and s_k is the variance of the k-th bit over all codes to be normalized. The purpose of the normalization is to facilitate convergence of the subsequent networks.
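The per-bit normalization above can be sketched in a few lines. Note the text divides by the variance s_k rather than the standard deviation (a conventional z-score would use the latter); the sketch follows the document's formula, and the epsilon guard is an added assumption for constant bit positions:

```python
# Per-bit mean-variance normalization, y_k = (x_k - m_k) / s_k, with s_k
# the variance as stated in the text. eps is an assumption to avoid
# division by zero for bit positions that never vary.
import numpy as np

def mean_variance_normalize(codes, eps=1e-8):
    """codes: (num_phonemes, code_len) second linguistic feature codes."""
    m = codes.mean(axis=0)          # mean of each bit position
    s = codes.var(axis=0)           # variance of each bit position
    return (codes - m) / (s + eps)

codes = np.array([[1.0, 4.0], [3.0, 8.0]])
y = mean_variance_normalize(codes)
print(np.allclose(y.mean(axis=0), 0.0))  # True: column means become 0
```

The normalized vectors are what the shallow DNN of step S5 consumes.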
S6: build an end-to-end network with a sequence-to-sequence attention mechanism.
The end-to-end network consists of a sequence-to-sequence network and an LSTM autoregressive network, which together form the end-to-end network.
Merge the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector.
The merge is a vector concatenation: for example, a 512-dimensional embedding vector and a 256-dimensional front-end coding vector are concatenated into a 256 + 512 = 768-dimensional vector.
The embedding vector is generated by the sequence-to-sequence network and offers little controllability, whereas the front-end coding vector is generated in the preceding steps from sample features according to self-determined rules. The purpose of merging the embedding vector with the front-end feature coding vector of step S5 is to retain the high naturalness obtained when Mel features for synthesized audio are predicted from the embedding vector alone, while allowing synthesis errors caused by relying solely on the model to be avoided by manually correcting the front-end coding.
Connect the merged prediction vector to an autoregressive (LSTM) network to predict the Mel features of each frame, and perform iterative training with the Mel features of the audio data in the original training data as the target to obtain the autoregressive model. During training, the sequence-to-sequence network and the autoregressive network are connected as a whole, so the model parameters of the sequence-to-sequence network are obtained at the same time as the autoregressive model; that is, the sequence-to-sequence network is trained as well.
Through steps S1-S6, a pronunciation duration model and an autoregressive model are finally trained from the sample.
The speech synthesis method using the trained models includes the following steps, as shown in fig. 2:
S1A, process the text to be synthesized according to the methods of steps S1-S2 to obtain its first linguistic feature code;
S2A, input the first linguistic feature code of the text to be synthesized into the pronunciation duration model to predict the pronunciation duration of each phoneme;
S3A, combine the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme obtained in step S2A to obtain second linguistic feature codes, perform mean-variance normalization on them, input the normalized feature vectors into the shallow neural network, and output fixed-dimension front-end feature coding vectors;
S4A: use the sequence-to-sequence network obtained by training in step S6, and merge the embedding vector it outputs with the front-end feature coding vector of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into an autoregressive model to obtain Mel characteristics;
S6A: the Mel features are input into the vocoder to obtain the synthesized audio.
The invention addresses the defects of poor stability and controllability in end-to-end speech synthesis schemes by providing a model training and speech synthesis method based on front-end design, which improves the stability and controllability of the end-to-end network and reduces the training difficulty of the neural network. The invention effectively reduces the probability of pronunciation and speed errors of individual characters within a whole sentence.
Meanwhile, some special cases benefit directly: for example, when the same character appears many times in a sentence, the attention mechanism of the end-to-end network is easily disturbed, causing deviations in that character's pronunciation and pronunciation duration; sentence-break errors occur easily when a sentence is very long and has no punctuation; and characters that never appeared in the training set are prone to pronunciation errors. In these cases, the pronunciation of characters, the pronunciation of phonemes, phoneme durations, sentence prosody, and the like can be controlled by fine-tuning the front-end linguistic features and duration features, making the subsequent synthesis more accurate.
The foregoing describes preferred embodiments of the present invention. Provided they are not obviously contradictory, the preferred embodiments may be combined and superimposed in any manner. The specific parameters in the embodiments and examples serve only to clearly illustrate the inventor's verification process and are not intended to limit the patent protection scope of the invention, which is still defined by the claims; equivalent structural changes made using the description and drawings of the present invention are likewise included within the protection scope of the present invention.
Claims (5)
1. A speech model training method based on front-end design is characterized by comprising the following steps of sample acquisition:
the method comprises the steps that a sample is collected, wherein the sample is used for collecting high-quality audio data of a single speaker and a text corresponding to the audio data as original training data, and Mel characteristics of the audio data are extracted;
the subsequent steps are as follows:
s1: predicting and marking the prosody of the text by a prosody prediction model to generate a prosody marking text with prosody marking;
s2: extracting the prosody labeling text generated in the step S1 into linguistic features in the text through a front-end rule, wherein the linguistic features comprise position information codes and zero-one codes, and combining the position information codes and the zero-one codes to obtain first linguistic feature codes of text contents;
s3: forcibly aligning the text phonemes in the sample with the corresponding audio data by using a forced alignment algorithm to obtain the pronunciation duration of each phoneme;
S4: building a neural network, taking the first linguistic feature codes obtained in step S2 as input and the pronunciation duration of each phoneme obtained in step S3 as the prediction target, and training a pronunciation duration model;
this step is not restricted in temporal order with respect to steps S5 and S6;
S5: combining the first linguistic feature code of each phoneme generated in step S2 with the pronunciation duration of each phoneme obtained in step S3 to obtain a second linguistic feature code, performing mean-variance normalization on the second code, inputting the normalized feature vector into a shallow neural network, and outputting a front-end feature coding vector with fixed dimensionality;
S6: building an end-to-end network with a sequence-to-sequence attention mechanism, and merging the embedding vector output by the sequence-to-sequence network with the front-end feature coding vector of step S5 to obtain a prediction vector;
and connecting the merged prediction vector to an autoregressive (LSTM) network to predict Mel features, and performing iterative training with the Mel features of the audio data in the sample as the target to obtain an autoregressive model.
2. The speech model training method of claim 1, wherein the step of S1 comprises:
s1.1, training a prosody prediction model by using a text prosody labeling data set, and labeling text prosody by using a special mark;
s1.2, carrying out prosody prediction on the text of the audio data by using the trained prosody prediction model to obtain prosody labeling texts of all the texts.
3. The speech model training method of claim 1, wherein step S2 comprises:
S2.1, converting the prosody-labeled text from text to pinyin, and converting the pinyin into phonemes, to obtain the phoneme sequence of the text, wherein each prosody label is represented as one phoneme by a special symbol;
S2.2, performing word segmentation and part-of-speech prediction on the text to obtain the word segmentation result and part-of-speech prediction result of the text;
S2.3, computing the text features: taking each phoneme of each text as the minimum unit, computing the position information of the text as the position information code;
designing a question set, and generating a zero-one code according to the question set;
and S2.4, combining the position information code computed in S2.3 with the zero-one code to obtain the first linguistic feature code of the text content.
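Steps S2.3-S2.4 can be illustrated with a minimal sketch. The position encoding, the questions, and the function names below are all assumptions for illustration; the patent does not disclose its actual question set or position features:

```python
def encode_phoneme(phoneme, position, total, question_set):
    # Hypothetical sketch of claim 3, steps S2.3-S2.4.
    # Position information code: forward and backward index of the
    # phoneme within the text, the phoneme being the minimum unit.
    position_code = [position, total - 1 - position]
    # Zero-one code: one bit per yes/no question about the phoneme.
    zero_one_code = [1 if q(phoneme) else 0 for q in question_set]
    # First linguistic feature code = position code + zero-one code.
    return position_code + zero_one_code

# Illustrative questions (assumed, not from the patent):
questions = [
    lambda p: p in {"a", "o", "e", "i", "u", "v"},  # is it a vowel?
    lambda p: p == "#",                             # is it a prosody mark?
]
code = encode_phoneme("a", position=0, total=3, question_set=questions)
# -> [0, 2, 1, 0]
```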
4. The speech model training method of claim 1, wherein the mean-variance normalization of the second linguistic feature codes in step S5 is performed by:
S5.1, calculating the mean and variance of each bit code over the second codes of all the linguistic features;
S5.2, subtracting from each bit code the mean of that bit code and dividing by the variance of that bit code,
according to the formula: y_k = (x_k - m_k) / s_k, where y_k is the normalized result of the k-th bit code, x_k is the value of the k-th bit code before normalization, m_k is the mean of the k-th bit code over all codes to be normalized, and s_k is the variance of the k-th bit code over all codes to be normalized.
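A minimal stdlib-only sketch of this normalization follows. Note the claim's formula divides by the variance s_k; dividing by the standard deviation is the more common variant, but the code follows the claim as written:

```python
from statistics import fmean, pvariance

def mean_variance_normalize(codes):
    # codes: one second linguistic feature code (a list) per phoneme.
    # Per claim 4: y_k = (x_k - m_k) / s_k, with m_k and s_k the mean
    # and variance of the k-th bit code over all codes to be normalized.
    columns = list(zip(*codes))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) or 1.0 for col in columns]  # guard zero variance
    return [[(x - m) / s for x, m, s in zip(row, means, variances)]
            for row in codes]

normalized = mean_variance_normalize([[1.0, 2.0], [3.0, 6.0]])
# means = [2, 4], variances = [1, 4]  ->  [[-1.0, -0.5], [1.0, 0.5]]
```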
5. A speech synthesis method based on front-end design, characterized by comprising the following steps:
S1A, processing a text to be synthesized according to steps S1-S2 of the training method of claim 1 to obtain its first linguistic feature code;
S2A, inputting the first linguistic feature code of the text to be synthesized into the trained pronunciation duration model to predict the pronunciation duration of each phoneme;
S3A, combining the first linguistic feature code of each phoneme generated in step S1A with the pronunciation duration of each phoneme generated in step S2A to obtain the second linguistic feature codes, performing mean-variance normalization on the second codes, inputting the normalized feature vectors into the shallow neural network, and outputting front-end feature coding vectors of fixed dimension;
S4A: using the sequence-to-sequence network trained in step S6 of the training method of claim 1, merging the embedded vector output by the sequence-to-sequence network with the front-end feature coding vector of step S3A to obtain a prediction vector;
S5A: inputting the prediction vector into the autoregressive model to obtain Mel features;
S6A: inputting the Mel features into a vocoder to obtain the synthesized audio.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110762178.XA CN113257221B (en) | 2021-07-06 | 2021-07-06 | Voice model training method based on front-end design and voice synthesis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113257221A true CN113257221A (en) | 2021-08-13 |
CN113257221B CN113257221B (en) | 2021-09-17 |
Family
ID=77190767
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110762178.XA Active CN113257221B (en) | 2021-07-06 | 2021-07-06 | Voice model training method based on front-end design and voice synthesis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113257221B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113948062A (en) * | 2021-12-20 | 2022-01-18 | 阿里巴巴达摩院(杭州)科技有限公司 | Data conversion method and computer storage medium |
CN118116363A (en) * | 2024-04-26 | 2024-05-31 | 厦门蝉羽网络科技有限公司 | Speech synthesis method based on time perception position coding and model training method thereof |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5832425A (en) * | 1994-10-04 | 1998-11-03 | Hughes Electronics Corporation | Phoneme recognition and difference signal for speech coding/decoding |
EP3021318A1 (en) * | 2014-11-17 | 2016-05-18 | Samsung Electronics Co., Ltd. | Speech synthesis apparatus and control method thereof |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN111640418A (en) * | 2020-05-29 | 2020-09-08 | 数据堂(北京)智能科技有限公司 | Prosodic phrase identification method and device and electronic equipment |
CN111754976A (en) * | 2020-07-21 | 2020-10-09 | 中国科学院声学研究所 | Rhythm control voice synthesis method, system and electronic device |
CN112002304A (en) * | 2020-08-27 | 2020-11-27 | 上海添力网络科技有限公司 | Speech synthesis method and device |
CN112002305A (en) * | 2020-07-29 | 2020-11-27 | 北京大米科技有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112133278A (en) * | 2020-11-20 | 2020-12-25 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112802450A (en) * | 2021-01-05 | 2021-05-14 | 杭州一知智能科技有限公司 | Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof |
Non-Patent Citations (3)
Title |
---|
C. Yuan et al.: "Personalized End-to-End Mandarin Speech Synthesis using Small-sized Corpus", 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference * |
Zhang Yuqiang: "End-to-End Based Speech Synthesis", China Master's Theses Full-text Database (Information Science and Technology) * |
Han Min: "An Improved Speech Synthesis Method", China Master's Theses Full-text Database (Information Science and Technology) * |
Also Published As
Publication number | Publication date |
---|---|
CN113257221B (en) | 2021-09-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7464621B2 (en) | Speech synthesis method, device, and computer-readable storage medium | |
CN112420016B (en) | Method and device for aligning synthesized voice and text and computer storage medium | |
CN105654939A (en) | Voice synthesis method based on voice vector textual characteristics | |
CN110767213A (en) | Rhythm prediction method and device | |
CN113257221B (en) | Voice model training method based on front-end design and voice synthesis method | |
Liu et al. | Mongolian text-to-speech system based on deep neural network | |
CN113205792A (en) | Mongolian speech synthesis method based on Transformer and WaveNet | |
CN113539268A (en) | End-to-end voice-to-text rare word optimization method | |
CN113327574A (en) | Speech synthesis method, device, computer equipment and storage medium | |
Maia et al. | Towards the development of a brazilian portuguese text-to-speech system based on HMM. | |
CN114974218A (en) | Voice conversion model training method and device and voice conversion method and device | |
Kayte et al. | A Marathi Hidden-Markov Model Based Speech Synthesis System | |
US11817079B1 (en) | GAN-based speech synthesis model and training method | |
Chiang et al. | The Speech Labeling and Modeling Toolkit (SLMTK) Version 1.0 | |
CN114708848A (en) | Method and device for acquiring size of audio and video file | |
Bonafonte et al. | The UPC TTS system description for the 2008 blizzard challenge | |
Janyoi et al. | An Isarn dialect HMM-based text-to-speech system | |
Lin et al. | Improving mandarin prosody boundary detection by using phonetic information and deep LSTM model | |
Saychum et al. | A great reduction of wer by syllable toneme prediction for thai grapheme to phoneme conversion | |
CN117524193B (en) | Training method, device, equipment and medium for Chinese-English mixed speech recognition system | |
Nair et al. | Indian text to speech systems: A short survey | |
Zhang et al. | Chinese speech synthesis system based on end to end | |
CN116229994B (en) | Construction method and device of label prediction model of Arabic language | |
Janyoi et al. | Isarn Dialect Speech Synthesis using HMM with syllable-context features | |
Gong et al. | A Review of End-to-End Chinese–Mandarin Speech Synthesis Techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||