Background
At present, although traditional speech synthesis can largely eliminate the mechanical quality of synthesized sound, so that listeners can hardly distinguish a real person from a synthesized voice, the same text is rendered with only a single emotion. To synthesize speech with different emotions, separate audio corpora are usually used to train separate models, each producing audio for its own emotion. Not only does this require a large amount of audio corpus, but switching between different emotions also produces a mechanical impression that is difficult to accept in highly anthropomorphic voice interactions such as digital-human interaction. Although multi-emotion speech synthesis has no hard metric such as an error rate, it encounters problems similar to those of speech tasks such as speech recognition: a text to be synthesized can be turned into corresponding audio, but its emotional richness is not as high as that of a human being. This is especially important in digital-human voice interaction.
Common deep learning models, such as Tacotron, realize speech synthesis on the basis of statistical models; typically one model can only be trained for one emotion, and switching between different emotion models is required to achieve multi-emotion expression. This not only requires training with a large corpus, but switching back and forth between different speech models also creates a severe mechanical sensation. Furthermore, sounds synthesized by different emotion speech models cannot model emotion continuously, i.e., the expressed emotions are discrete and there is no intermediate, softened emotional expression. If audio corpora of different emotions are simply given emotion labels and put into one model for training, the synthesized voice does not match the prosodic characteristics of how people normally express emotion, i.e., the words that should be stressed are not captured, and the voice sounds very strange.
Adding prosodic tags to the corpus can alleviate these problems, but tagging the corpus is a tedious process, and good results can only be obtained with a large amount of manual review. Moreover, in actual voice interaction, the text to be synthesized also needs prosody labeling, which makes the system more complicated, and incorrect labels are likely to affect the final synthesis effect.
Therefore, how to provide a multi-emotion voice synthesis method based on digital humans is a problem that urgently needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the invention provides a multi-emotion voice synthesis method based on digital humans, which avoids the manual addition of prosody labels and improves model training efficiency; by learning the time-domain and frequency-domain characteristics of the audio, a more vivid audio synthesis effect is realized.
In order to achieve the purpose, the invention adopts the following technical scheme:
A multi-emotion voice synthesis method based on digital humans comprises the following steps:
acquiring audio corpora under various emotions;
extracting text information and phoneme timestamp labels in the audio corpus to construct a training data set;
carrying out supervised training on a pre-constructed timestamp prediction model through the training data set;
predicting text information to be synthesized through the trained timestamp prediction model and a preset emotion vector to obtain a phoneme timestamp;
and inputting the phoneme timestamp into a trained acoustic model to obtain a synthetic audio.
Further, in the audio corpora under the plurality of emotions, the emotions include one or more of neutral, happy, sad, angry, surprised, and fearful.
Further, extracting the text information and phoneme timestamp labels in the audio corpus and constructing the training data set includes:
clipping the audio corpus into a plurality of audios,
performing voice recognition on the audio through an ASR voice recognition model to obtain text information; converting the text information into text pinyin, and decomposing the text pinyin into a plurality of phonemes according to a phoneme dictionary to form a phoneme sequence;
performing phoneme alignment on the phoneme sequence to obtain a phoneme timestamp label of each audio;
and inserting a pause frame length into the phoneme sequence and splicing phoneme time stamp labels corresponding to the audio to generate the training data set.
Further, the phoneme timestamp prediction model is a BiLSTM model, and the loss function of the BiLSTM model adopts a mean square error loss function.
Further, the acoustic model comprises a synthesizer and a vocoder;
generating a model embedding vector according to the phoneme timestamp;
fitting the model embedding vector through the synthesizer to obtain mel spectrogram features;
and inputting the mel spectrogram features into the vocoder to obtain the synthesized audio.
Further, generating a model embedding vector according to the phoneme timestamp, comprising the steps of:
and respectively calculating the pronunciation frame length difference between each phoneme and its adjacent pinyin units according to the phoneme timestamp, and splicing the pronunciation frame length differences onto the phoneme timestamp label to generate the model embedding vector.
Further, the vocoder is a pre-trained HiFi-GAN model.
Further, the number of mel spectrogram channels is 80, the hop_size is 256, and the win_size is 1024.
Further, after the text information is extracted from the audio, special characters that need to be converted are transformed into Chinese through text normalization to obtain a normalized standard text.
Compared with the prior art, the invention has the following beneficial effects:
According to the above technical scheme, the disclosed multi-emotion voice synthesis method based on digital humans combines emotion vectors with timestamp information and uses deep learning to learn the emotional characteristics of speech. This reduces the errors a prosody labeling step may introduce, makes the model pronounce more naturally and express emotion more vividly, and, by omitting the prosody prediction step, increases speed and improves real-time performance.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the embodiment of the invention discloses a multi-emotion voice synthesis method based on digital humans, comprising the following steps:
S1: acquiring audio corpora under various emotions;
In one embodiment, in the audio corpora under a plurality of emotions, the emotion may include one or more of neutral, happy, sad, angry, surprised, and fearful. Moreover, the volume and tone of the audio corpora should be kept consistent.
The method comprises the following specific steps:
A professional broadcaster records the multi-emotion audio corpus in a professional recording studio. The broadcaster records six kinds of emotion audio corpora: neutral, happy, sad, angry, surprised, and fearful, with about ten thousand audio clips per emotion. When recording the various emotion audios, the broadcaster should keep the volume as consistent as possible, and the speaking speed and tone within the same emotion should also be kept as consistent as possible.
To keep recording efficient, a recording session is not paused partway through; pauses and non-speech sounds made by the speaker during recording are edited out afterwards. After the audio recorded in different batches has been edited, the volume of each audio clip is normalized to ensure the quality of the training audio.
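By way of illustration only, a minimal peak-normalization sketch is shown below; the file layout and target level are assumptions, since the text does not specify the exact normalization procedure.

```python
# Hedged sketch: peak-normalize recorded clips to a common level.
# The target peak and the "corpus/*.wav" layout are assumptions, not from the text.
import glob
import numpy as np
import soundfile as sf

TARGET_PEAK = 0.9  # assumed target amplitude

for path in glob.glob("corpus/*.wav"):
    audio, sr = sf.read(path)
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio * (TARGET_PEAK / peak)
    sf.write(path, audio, sr)
```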
S2: extracting text information and phoneme timestamps in the audio corpus to construct a training data set;
in one embodiment, the text information is converted into text pinyin, and the text pinyin is decomposed into a plurality of phonemes according to a phoneme dictionary to form a phoneme sequence;
performing phoneme alignment on the phoneme sequence to obtain a phoneme timestamp label of each audio;
and inserting pause frame lengths into the phoneme sequence and splicing the phoneme timestamp labels corresponding to the audios to generate the training data set. The pause frame length may be derived from the phoneme timestamp labels; a phoneme timestamp contains the phonemes and their corresponding durations, one sequence for the phonemes and one for the time. Here, splicing refers to copying each phoneme a number of times equal to its frame length. For example, for the phoneme sequence abc with frame lengths 1, 2, and 3 respectively, copying each phoneme by its frame length yields abbccc.
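A minimal sketch of this expansion step, assuming the phonemes and their frame lengths are already available as parallel lists, might look as follows.

```python
# Hedged sketch: expand each phoneme by its frame length (the "splicing"
# described above). Names are illustrative only.
def expand_phonemes(phonemes, frame_lengths):
    """Repeat each phoneme according to its duration in frames."""
    expanded = []
    for ph, n_frames in zip(phonemes, frame_lengths):
        expanded.extend([ph] * n_frames)
    return expanded

# Example from the text: phonemes a, b, c with frame lengths 1, 2, 3
print(expand_phonemes(["a", "b", "c"], [1, 2, 3]))  # ['a', 'b', 'b', 'c', 'c', 'c']
```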
The method comprises the following specific steps:
As shown in fig. 2, the broadcaster records the audio, and the audio is clipped and processed to obtain individual training audio corpora. To catch errors such as missing or extra words made by the broadcaster during recording, speech recognition is first performed on each audio corpus to obtain the text content of each audio. After manual verification, the correct audio and its corresponding text information are obtained. The text content is then converted into pinyin, and the pinyin is decomposed into phonemes by means of a phoneme dictionary. The audio files and corresponding pinyin files are organized, and phoneme alignment is performed with an acoustic model through the phoneme dictionary, for example with the MFA tool. After phoneme alignment, the phoneme timestamp information of each audio is obtained. The text, phoneme timestamps, audio file and other information of each audio corpus are then organized as the training data for the next step of model training.
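A minimal sketch of the pinyin-to-phoneme conversion is given below; the dictionary contents and the placement of the sp pause token are illustrative assumptions, not taken from the text.

```python
# Hedged sketch: convert text pinyin into a phoneme sequence via a lexicon.
PHONEME_DICT = {            # hypothetical pinyin -> phoneme mapping
    "ni3":  ["n", "i3"],
    "hao3": ["h", "ao3"],
}

def pinyin_to_phonemes(pinyin_seq, insert_sp=True):
    phonemes = []
    for syllable in pinyin_seq:
        phonemes.extend(PHONEME_DICT[syllable])
        if insert_sp:
            phonemes.append("sp")   # pause token after each syllable (assumed)
    return phonemes

print(pinyin_to_phonemes(["ni3", "hao3"]))
# ['n', 'i3', 'sp', 'h', 'ao3', 'sp']
```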
S3: carrying out supervised training on the pre-constructed phoneme timestamp prediction model through the training data set;
As shown in fig. 3, the phoneme timestamp prediction model is trained based on the text information and the phoneme timestamp information in the training data set.
The purpose of timestamp prediction is that, at synthesis time, the model predicts from the input text the phoneme timestamp information of the corresponding audio, so that during audio synthesis it is known which phoneme should be uttered at which time. The timestamp unit is the frame, and uttering one phoneme may take several frames. The input of the phoneme timestamp prediction model contains only text information, but actual audio pronunciation contains pauses. To represent these pauses in the prediction output, an sp symbol is added after the pinyin of each character to represent the pause frame length; if there is no pause, the corresponding sp label is 0. The prediction model adopts a BiLSTM structure, and the loss function adopts mean square error. After the text pinyin is input, the phoneme dictionary is queried to convert it into a phoneme sequence containing sp; after embedding, the sequence is concatenated with the corresponding emotion label vector and input into the BiLSTM network, with the frame length of each phoneme's pronunciation as the label. Training continues until the model converges.
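A minimal PyTorch sketch of such a BiLSTM duration predictor is given below; the dimensions, vocabulary size and the way the emotion vector is broadcast are assumptions for illustration, not details from the text.

```python
# Hedged sketch of the BiLSTM phoneme-timestamp (frame-length) predictor.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, n_phonemes=100, emb_dim=128, emotion_dim=16, hidden=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, emb_dim)
        self.bilstm = nn.LSTM(emb_dim + emotion_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)   # one frame-length value per phoneme

    def forward(self, phoneme_ids, emotion_vec):
        # phoneme_ids: (B, T); emotion_vec: (B, emotion_dim)
        x = self.phoneme_emb(phoneme_ids)                        # (B, T, emb_dim)
        emo = emotion_vec.unsqueeze(1).expand(-1, x.size(1), -1)
        x = torch.cat([x, emo], dim=-1)                          # splice emotion vector
        out, _ = self.bilstm(x)
        return self.proj(out).squeeze(-1)                        # (B, T) frame lengths

# Training uses mean squared error against the aligned frame-length labels:
# loss = nn.MSELoss()(model(phoneme_ids, emotion_vec), target_frames)
```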
S4: predicting text information to be synthesized through the trained timestamp prediction model and a preset emotion vector to obtain a phoneme timestamp;
S5: inputting the phoneme timestamp into a trained acoustic model to obtain a synthetic audio.
In one embodiment, after the phoneme timestamp prediction model has been trained, the pronunciation content of each frame of the audio can be obtained by inputting text.
The audio content is then generated by an acoustic model.
The input is a phoneme sequence in which each phoneme unit is repeated according to its frame length; after embedding, the phoneme sequence is concatenated with the emotion vector.
In order to capture the acoustic features of multi-emotion audio, the acoustic model needs to locate the phonemes on which the stress falls.
The emotional characteristics of audio are embodied in both the time domain and the frequency domain; since the phoneme timestamp information has been obtained, the model can learn the time-domain information of stressed phonemes from the frame-length information of the phonemes.
The current frame length of each character, its frame-length difference relative to the previous character, and its frame-length difference relative to the next character are concatenated onto the embedding vector of the phoneme sequence and then input into the acoustic model, so that pronunciation audio with different emotions can be learned.
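A minimal sketch of this feature splicing, assuming the phoneme embedding and frame lengths are available as tensors, might look as follows; the sign convention of the differences is an assumption.

```python
# Hedged sketch: append each unit's frame length and the differences to its
# neighbours onto the phoneme embedding, as described above.
import torch

def add_duration_features(phoneme_emb, frame_lengths):
    # phoneme_emb: (T, D) embedding of the phoneme sequence
    # frame_lengths: (T,) frame length of each unit
    d = frame_lengths.float()
    prev_diff = torch.zeros_like(d)
    next_diff = torch.zeros_like(d)
    prev_diff[1:] = d[1:] - d[:-1]        # difference to the previous unit
    next_diff[:-1] = d[:-1] - d[1:]       # difference to the next unit
    extra = torch.stack([d, prev_diff, next_diff], dim=-1)   # (T, 3)
    return torch.cat([phoneme_emb, extra], dim=-1)           # (T, D + 3)
```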
As shown in fig. 4, the acoustic model consists essentially of two parts, a synthesizer and a vocoder. The mel spectrogram features extracted in advance from the audio are used as the output labels of the synthesizer; the loss function is computed between the mel spectrogram output by the synthesizer and the output labels, and model training proceeds until the synthesizer converges. The number of mel channels is 80, the hop_size is 256, and the win_size is 1024.
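The exact feature-extraction recipe is not given in the text; a typical librosa-based sketch using the stated parameters (80 channels, hop_size 256, win_size 1024) might look like this, where n_fft and the log compression are common assumptions in TTS pipelines.

```python
# Hedged sketch: extract the 80-channel mel spectrogram used as the synthesizer target.
import librosa
import numpy as np

def extract_mel(wav_path, sr=22050):
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=256,
        win_length=1024, n_mels=80)
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))   # log-mel features
```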
The vocoder adopts a HiFi-GAN model, and a pre-trained model is used, i.e., one trained in advance on large-scale data.
The synthesizer model is trained to fit the mel spectrogram features of the corresponding audio from the input phoneme-sequence embedding vectors; the generated mel spectrogram features are then fed into the previously pre-trained HiFi-GAN model to generate audio and test the model's effect.
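As a sketch only, the vocoder inference step could look like the following; the `Generator` object and checkpoint handling follow the open-source HiFi-GAN implementation in spirit, and the names here are assumptions rather than details from the text.

```python
# Hedged sketch: run a pre-trained HiFi-GAN-style generator on the synthesizer's
# mel output to obtain a waveform.
import torch

def mel_to_audio(generator, mel):
    # mel: array of shape (80, T) -> waveform samples
    generator.eval()
    with torch.no_grad():
        mel = torch.as_tensor(mel).float().unsqueeze(0)   # (1, 80, T)
        audio = generator(mel).squeeze()                   # (samples,)
    return audio.cpu().numpy()
```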
In this embodiment, when the trained model is actually applied, only the text and the emotion to be expressed need to be provided; no prosody labels need to be supplied for the text.
After the text and emotion label are input, special characters that need to be converted into Chinese, such as Arabic numerals, telephone numbers, unit names, and operation symbols, are first converted through text normalization.
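A minimal regex-based sketch of this normalization step is shown below; the rules are illustrative only and far from a complete normalizer.

```python
# Hedged sketch: read Arabic digits out as Chinese characters before phoneme
# conversion (suitable for phone numbers); quantities, units and operators
# would need additional rules.
import re

DIGITS = "零一二三四五六七八九"

def normalize_text(text):
    return re.sub(r"\d", lambda m: DIGITS[int(m.group())], text)

print(normalize_text("电话110"))   # -> 电话一一零
```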
The processed sentence is converted into the corresponding phoneme labels by means of the phoneme dictionary, sp symbols are inserted into the phoneme sequence, the emotion vector is concatenated, and the result is input into the timestamp prediction model. After the timestamp of each phoneme is obtained, the timestamp information is computed and combined with the previously input emotion-phoneme vector to obtain an expanded emotion-phoneme vector carrying phoneme timestamp information, which is input into the synthesizer model to obtain the output mel spectrogram features. These are then input into the HiFi-GAN model to generate the audio file corresponding to the text, completing the multi-emotion speech synthesis.
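Putting the pieces together, a hypothetical inference flow might read as follows; every function and model name here is a placeholder standing in for the components sketched earlier, not an interface defined by the text.

```python
# Hedged sketch of the end-to-end inference flow described above.
def synthesize(text, emotion_vec, duration_model, synthesizer, vocoder):
    text = normalize_text(text)                           # text normalization
    pinyin = text_to_pinyin(text)                         # hypothetical G2P step
    phonemes = pinyin_to_phonemes(pinyin)                 # phoneme sequence with sp
    frames = duration_model(phonemes, emotion_vec)        # predicted phoneme timestamps
    emb = build_embedding(phonemes, frames, emotion_vec)  # expanded emotion-phoneme vector
    mel = synthesizer(emb)                                # 80-channel mel spectrogram
    return vocoder(mel)                                   # waveform
```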
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.