CN113327575B - Speech synthesis method, device, computer equipment and storage medium - Google Patents

Speech synthesis method, device, computer equipment and storage medium

Info

Publication number
CN113327575B
CN113327575B (Application CN202110602414.1A)
Authority
CN
China
Prior art keywords
target
sample
attention
tone
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110602414.1A
Other languages
Chinese (zh)
Other versions
CN113327575A (en)
Inventor
户建坤
康世胤
吴志勇
陈学源
刘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen International Graduate School of Tsinghua University
Guangzhou Huya Technology Co Ltd
Original Assignee
Shenzhen International Graduate School of Tsinghua University
Guangzhou Huya Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen International Graduate School of Tsinghua University, Guangzhou Huya Technology Co Ltd filed Critical Shenzhen International Graduate School of Tsinghua University
Priority to CN202110602414.1A priority Critical patent/CN113327575B/en
Publication of CN113327575A publication Critical patent/CN113327575A/en
Application granted granted Critical
Publication of CN113327575B publication Critical patent/CN113327575B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a speech synthesis method, a device, computer equipment and a storage medium. The method comprises the following steps: receiving a reference speech signal belonging to a non-target language and target text information belonging to a target language; identifying the feature characterizing the timbre in the reference speech signal as the target timbre; determining a speech synthesizer trained for the target language, the speech synthesizer comprising an acoustic model and a vocoder; in the acoustic model, converting the target text information into spectral features that belong to the target language and conform to the target timbre, as target spectral features; and in the vocoder, converting the target spectral features into a target speech signal belonging to the target language. Because the timbre of the reference speech signal in the non-target language is not used to train the speech synthesizer for the target language, timbre cloning of an unseen speaker can be achieved in a cross-language speech synthesis scenario.

Description

Speech synthesis method, device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of voice processing, in particular to a voice synthesis method, a voice synthesis device, computer equipment and a storage medium.
Background
TTS (Text To Speech) aims to convert text into speech; it is a part of human-machine conversation and enables a machine to speak. In recent years, with the rapid development of acoustic models and vocoder technologies, TTS has played an important role in many fields such as voice assistants, audio books, and spoken dialogue systems.
Given a large amount of high-quality speech from a speaker, TTS can generate natural speech that is almost indistinguishable from real speech. At present, however, TTS is limited by its training set: the timbre of a speaker outside the training set is difficult to obtain, especially in a cross-language TTS scenario. Moreover, because the timbres of different speakers differ, collecting the timbres of many speakers greatly increases the data volume of the training set and greatly increases the difficulty of training.
Disclosure of Invention
The embodiment of the invention provides a speech synthesis method, a device, computer equipment and a storage medium, so as to solve the problem of how to clone an unseen timbre for speech synthesis.
In a first aspect, an embodiment of the present invention provides a method for synthesizing speech, including:
Receiving a reference voice signal belonging to a non-target language and target text information belonging to a target language;
identifying the characteristic representing the tone color in the reference voice signal as a target tone color;
determining a speech synthesizer trained for the target language, the speech synthesizer comprising an acoustic model, a vocoder;
converting the target text information into a spectral feature belonging to the target language and conforming to the target tone in the acoustic model as a target spectral feature;
in the vocoder, the target spectral features are converted into target speech signals belonging to the target language.
In a second aspect, an embodiment of the present invention further provides a speech synthesis apparatus, including:
the synthetic information receiving module is used for receiving a reference voice signal belonging to a non-target language and target text information belonging to a target language;
the target tone extraction module is used for identifying the characteristic representing the tone in the reference voice signal and taking the characteristic representing the tone as a target tone;
a speech synthesizer determining module for determining a speech synthesizer trained for the target language, the speech synthesizer comprising an acoustic model, a vocoder;
a target spectral feature generation module, configured to convert, in the acoustic model, the target text information into a spectral feature belonging to the target language and conforming to the target timbre, as a target spectral feature;
And the target voice signal generation module is used for converting the target frequency spectrum characteristic into a target voice signal belonging to the target language in the vocoder.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech synthesis method as described in the first aspect.
In a fourth aspect, embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech synthesis method according to the first aspect.
In this embodiment, a reference speech signal belonging to a non-target language and target text information belonging to a target language are received; the feature characterizing the timbre in the reference speech signal is identified and taken as the target timbre; a speech synthesizer trained for the target language is determined, the speech synthesizer comprising an acoustic model and a vocoder; in the acoustic model, the target text information is converted into spectral features that belong to the target language and conform to the target timbre, which serve as the target spectral features; and in the vocoder, the target spectral features are converted into a target speech signal belonging to the target language. Because the timbre of the reference speech signal in the non-target language is not used to train the speech synthesizer for the target language, timbre cloning of an unseen speaker can be realized in a cross-language speech synthesis scenario, so that the speech synthesizer is not limited by its training set, the data volume of the training set can be kept appropriate, and the difficulty of training can be reduced.
Drawings
Fig. 1 is a flowchart of a speech synthesis method according to a first embodiment of the present invention;
FIG. 2 is a schematic illustration of a first embodiment of the present invention;
fig. 3 is a flowchart of a speech synthesis method according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to a third embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Example 1
Fig. 1 is a flowchart of a speech synthesis method according to a first embodiment of the present invention, where the method may be applied to the case of training the acoustic model in a speech synthesizer for unseen timbres. The method may be performed by a speech synthesis apparatus, which may be implemented by software and/or hardware and configured in a computer device, such as a server, a workstation, or a personal computer, and specifically includes the following steps:
Step 101, acquiring a sample voice signal, sample text information expressing the content of the sample voice signal, and sample spectrum characteristics converted by the sample voice signal.
The traditional cross-language synthesized TTS model generally uses fewer speakers (such as several or more than ten speakers), the architecture robustness of the speech synthesizer of the embodiment is stronger, and the speech synthesizer can support training by using large multi-speaker corpus (such as hundreds of thousands of speakers), so that the accuracy of the accent of the cross-language speech signal, high pronunciation intelligibility and strong expressive force are ensured during synthesis.
To facilitate collection of a sufficient number of data sets, audio signals recorded by a speaker when speaking in a specified style, text information representing the content of the audio signals, i.e., audio signals recorded by the speaker speaking "text information", may be collected in some large open source databases and/or open source projects or other general channels, and for ease of distinction, the audio signals are denoted as sample audio signals, and the text information is denoted as sample text information.
In addition, the sample audio signal may be converted into spectral features, such as a Mel spectrum, by a Fourier Transform (FT), a Fast Fourier Transform (FFT), or the like; these are recorded as sample spectral features.
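As an illustration of this conversion step, a minimal sketch is given below, assuming the librosa library is available; the sampling rate, FFT size, hop length, and number of Mel bands are illustrative assumptions rather than values specified by this embodiment.

```python
import librosa
import numpy as np

def extract_mel_spectrogram(wav_path: str,
                            sr: int = 22050,
                            n_fft: int = 1024,
                            hop_length: int = 256,
                            n_mels: int = 80) -> np.ndarray:
    """Convert a sample audio signal into a (frames, n_mels) Mel spectrogram."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    # Log compression is commonly applied before feeding an acoustic model.
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
    return log_mel.T  # time-major: one row per frame
```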
Of course, to improve the performance of TTS in a service scenario, audio signals recorded by speakers may also be collected through the channels of the service scenario (such as short video, games, news, novels, etc.); the content of these sample audio signals is converted into text information by manual labeling, speech recognition, or the like, and converted into spectral features by Fourier transform, fast Fourier transform, or the like, to be used as sample spectral features.
A conventional TTS model is usually trained with corpora of different languages, including a corpus of the target language to be synthesized and corpora of non-target languages, so accents of the non-target languages are easily introduced. For example, if English is mixed in when speaking Chinese, the English will carry a Chinese accent, so when synthesizing a speech signal of the target language, especially a cross-language synthesized speech signal, a wrong accent is easily produced.
In this embodiment, the language targeted by TTS is set and recorded as the target language, and both the sample speech signal and the sample text information belong to the target language. That is, only corpora belonging to the target language are used to train the speech synthesizer, and corpora not belonging to the target language are not used, which ensures that an accurate target language (accent, pronunciation intelligibility) is learned from a corpus purely in the target language.
Further, the sample text information may be conventionally represented using phonemes of the target language, a prosodic structure, or the like; for example, if the target language is English, the sample text information is represented using English phonemes.
Step 102, identifying the characteristic representing the tone color in the sample voice signal as the sample tone color.
In this embodiment, the feature characterizing the timbre can be extracted from the sample speech signal in real time and recorded as the sample timbre.
Typically, the timbres of different speakers are different, so one speaker may represent one timbre, and unique identification information (e.g., a Speaker ID) may be configured for the speaker; that is, the identification information (e.g., the Speaker ID) may be used to represent the timbre.
In one extraction approach, an ASV (Automatic Speaker Verification) task may be reused to identify timbres: acoustic features, e.g., spectral parameters and fundamental frequency parameters, are extracted from the sample speech signal as sample acoustic features, and features for classifying the speaker are extracted from the sample acoustic features as the sample timbre.
Classifying a speaker means determining the identity of the speaker, i.e., mapping to the speaker's identification information (e.g., Speaker ID) through a Softmax function or the like.
If one-hot encoding is used to represent timbres, the situation of an unseen speaker may not be handled. In this method, the timbre of the sample speech signal is extracted using the speaker recognition task, so the speech synthesizer can cope with unseen speakers, which provides a basis for cross-language synthesis of speech signals.
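For illustration only, the following sketch shows one possible speaker-verification-style timbre encoder; the layout (a three-layer LSTM with a linear projection and a Softmax classification head) and all layer sizes are assumptions, since this embodiment only states that features for classifying the speaker are extracted from the sample acoustic features.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, hidden: int = 256,
                 emb_dim: int = 256, n_speakers: int = 1000):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)
        # Softmax classification head, used only during training.
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, mels: torch.Tensor):
        # mels: (batch, frames, n_mels) acoustic features of one utterance
        _, (h, _) = self.lstm(mels)
        emb = nn.functional.normalize(self.proj(h[-1]), dim=-1)
        logits = self.classifier(emb)   # speaker-ID logits (training only)
        return emb, logits              # emb is the sample/target timbre

encoder = SpeakerEncoder()
timbre, _ = encoder(torch.randn(1, 120, 80))   # 120 frames of Mel features
```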
And 103, training an acoustic model by taking sample text information and sample tone as samples and taking sample spectrum characteristics as labels.
In this embodiment, an unseen-timbre cloning model (TTS based on Speaker Verification, SV-TTS) may be applied to train the speech synthesizer. Timbres (expressed as speakers) used during the training phase are referred to as "seen speakers", and timbres (expressed as speakers) that do not appear during the training phase but appear during the synthesis phase are referred to as "unseen speakers". The timbre-transfer synthesis capability of the unseen-timbre cloning model provides support for synthesizing, in the synthesis phase, speech in the timbre of an unseen speaker of the target language, so that the speech synthesizer can synthesize target text information belonging to the target language into a target speech signal that belongs to the target language and conforms to the target timbre.
By unseen timbre, it is meant that the target timbre of the synthesis phase does not appear in the training phase, i.e., the sample timbres of the training phase are not the same as the target timbre of the synthesis phase; the target timbre is a timbre other than the sample timbres of the training phase.
The speech synthesizer comprises two parts, namely an acoustic model and a vocoder, wherein the acoustic model is used for converting text information into frequency spectrum characteristics which belong to a specified language and accord with a specified tone.
For a given speaker, the acoustic model is trained by supervised learning, with the sample text information and the sample timbre as training samples and the sample spectral features as labels.
Further, the sample text information and the sample timbre are input into the acoustic model, and the acoustic model processes them so as to convert the sample text information into spectral features (such as a Mel spectrogram) that belong to the target language and conform to the sample timbre; these are recorded as predicted spectral features.
In one embodiment of the present invention, consider that when synthesizing a cross-language speech signal for an unseen speaker, the input during synthesis differs from the input during training, and both the timbre and the language fluctuate more for an unseen speaker. Tacotron-2 (an end-to-end neural-network speech synthesis model) is prone to poor pronunciation intelligibility in this situation, so if a traditional acoustic model were used, more pronunciation-intelligibility errors would occur in the synthesis phase.
To cope with unseen speakers, as shown in fig. 2, the acoustic model in this embodiment may include a CBHG module as the Encoder, a stepwise monotonic attention mechanism (Stepwise Monotonic Attention, SMA), three recurrent neural networks as the Decoder, and a Post-Net network. Compared with Tacotron-2, this strengthens the encoding capability of the Encoder, strengthens the decoding capability of the Decoder, increases the robustness of the attention mechanism, and balances these structures, thereby increasing the robustness of the acoustic model in the speech synthesizer; experimental results show that the acoustic model has a significant suppressing effect on the generation of bad cases.
Then in this embodiment step 103 comprises the steps of:
in step 1031, in the encoder, the CBHG module is invoked to encode the sample text information into sample text features.
In this embodiment, a CBHG module is used as the Encoder. The CBHG module includes structures such as a 1-D convolution bank (a set of one-dimensional convolution filters), a highway network, and a bidirectional GRU (bidirectional gated recurrent unit); its function is to extract valuable features from the input, and it can encode the sample text information into sample text features, which helps improve the generalization capability of the model. The structure of the CBHG module is more complex than the CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network) of the Tacotron-2 model, and it has stronger encoding capability.
Further, as shown in fig. 2, the Encoder is used to extract a robust sequence representation of text information and includes a PreNet network (preprocessing network) and a CBHG module. In this embodiment, the vector of each character in the sample text information (Character) may be looked up in a Look-up Table or the like as a first sample vector sequence.
The PreNet network comprises two layers, each consisting of an FC (fully connected) layer, a ReLU activation function, and Dropout (temporarily dropping neural network units from the network with a certain probability during training); in the PreNet network, the first sample vector sequence is non-linearly transformed to obtain a second sample vector sequence.
In the CBHG module, features are extracted from the second sample vector sequence as sample text features (Content features).
The processing procedure of the CBHG module is as follows (see the sketch after this list):
S1, the second sample vector sequence is input into K one-dimensional convolution layers, where the k-th convolution kernel (filter) has width k; these convolution kernels effectively model the current and contextual information;
S2, the outputs of the one-dimensional convolution layers are stacked together and max-pooled (maxpooling) along the time axis to increase local invariance, with a stride of 1 to maintain the time resolution;
S3, the sequence is passed through several one-dimensional convolution layers with fixed widths, the output is added to the initial input sequence (a ResNet-style residual connection), and Batch Normalization (BN) is applied to all convolutions;
S4, the result is fed into a multi-layer highway network to extract higher-level features;
S5, finally, a bidirectional GRU is added at the top to extract the contextual features of the sequence as the sample text features.
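A simplified PyTorch sketch of a CBHG-style encoder corresponding to steps S1 to S5 is shown below. The hidden sizes, the number of highway layers, and the restriction to odd kernel widths (so that sequence lengths stay aligned without extra trimming) are assumptions made for brevity, not values fixed by this embodiment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, x):
        gate = torch.sigmoid(self.t(x))
        return F.relu(self.h(x)) * gate + x * (1.0 - gate)

class CBHG(nn.Module):
    def __init__(self, dim=128, bank_kernels=(1, 3, 5, 7, 9, 11, 13, 15)):
        super().__init__()
        # S1: bank of 1-D convolutions modelling current + context information
        self.bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in bank_kernels])
        self.bank_bn = nn.BatchNorm1d(dim * len(bank_kernels))
        # S3: fixed-width projection convolutions with a residual connection
        self.proj1 = nn.Conv1d(dim * len(bank_kernels), dim, 3, padding=1)
        self.proj2 = nn.Conv1d(dim, dim, 3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm1d(dim), nn.BatchNorm1d(dim)
        # S4: multi-layer highway network
        self.highways = nn.ModuleList([Highway(dim) for _ in range(4)])
        # S5: bidirectional GRU extracting contextual sequence features
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x):                       # x: (batch, time, dim)
        y = x.transpose(1, 2)                   # (batch, dim, time)
        # S1/S2: convolution bank, stacking, max-pooling along time (stride 1)
        y = torch.cat([F.relu(conv(y)) for conv in self.bank], dim=1)
        y = self.bank_bn(y)
        y = F.max_pool1d(y, kernel_size=2, stride=1, padding=1)[:, :, :-1]
        # S3: projection convolutions + residual (ResNet-style) connection
        y = F.relu(self.bn1(self.proj1(y)))
        y = self.bn2(self.proj2(y)) + x.transpose(1, 2)
        y = y.transpose(1, 2)                   # back to (batch, time, dim)
        # S4: highway layers
        for hw in self.highways:
            y = hw(y)
        # S5: bidirectional GRU -> sample/target text features
        out, _ = self.gru(y)
        return out                              # (batch, time, 2 * dim)

features = CBHG()(torch.randn(2, 50, 128))      # e.g. 50 PreNet-processed steps
```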
Step 1032, splicing the sample text features and the sample timbre to obtain sample combination features.
As shown in fig. 2, a sample text Feature (Content Feature) and a sample tone color (Speaker Embedding) are spliced and denoted as a sample combination Feature.
Step 1033, executing the stepwise monotonic attention mechanism to apply, to the sample combination features, the attention used when converting into spectral features, thereby generating sample attention features.
As shown in fig. 2, the stepwise monotonic attention mechanism SMA is used as the attention mechanism (Attention); the attention used when converting into spectral features is applied to the sample combination features, thereby generating the sample attention features (Context).
In a specific implementation, the stepwise monotonic attention mechanism may be executed to compute, for the current frame, the attention paid to each frame of the sample combination features when converting into spectral features; these attentions are linearly fused (e.g., after weights are configured) to obtain the sample attention feature. In the stepwise monotonic attention mechanism, the order between the sample combination features and the sample attention features is kept monotonic, and the sample combination features serving as input are not allowed to be skipped, which enhances the robustness of speech synthesis.
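The following sketch shows one common soft formulation of the stepwise monotonic attention recursion described above; the additive energy function and the sigmoid "stay" probability are assumptions, since this embodiment does not specify the exact energy computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StepwiseMonotonicAttention(nn.Module):
    def __init__(self, query_dim: int, memory_dim: int, attn_dim: int = 128):
        super().__init__()
        self.query_layer = nn.Linear(query_dim, attn_dim)
        self.memory_layer = nn.Linear(memory_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, query, memory, prev_alignment):
        # query:          (batch, query_dim)      decoder query for this frame
        # memory:         (batch, T, memory_dim)  combination features
        # prev_alignment: (batch, T)              previous frame's alignment
        #                 (one-hot at position 0 for the very first frame)
        energy = self.v(torch.tanh(
            self.query_layer(query).unsqueeze(1) + self.memory_layer(memory)
        )).squeeze(-1)                                   # (batch, T)
        p_stay = torch.sigmoid(energy)                   # prob. of not advancing
        # Each step either stays at position j or advances from j-1 to j, so
        # input positions are never skipped and the order stays monotonic.
        moved = F.pad(prev_alignment * (1.0 - p_stay), (1, 0))[:, :-1]
        alignment = prev_alignment * p_stay + moved
        context = torch.bmm(alignment.unsqueeze(1), memory).squeeze(1)
        return context, alignment
```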
In step 1034, in the decoder, three cyclic neural networks are sequentially invoked to decode the sample attention features into multi-frame prediction spectrum features.
The sample attention features are input into the Decoder, and the Decoder decodes them in an autoregressive manner to obtain multi-frame predicted spectral features, which serve as the intermediate output of TTS.
In one example of this embodiment, as shown in fig. 2, the Decoder includes a PreNet network (preprocessing network) in addition to the three recurrent neural networks. The three recurrent neural networks are a first long short-term memory network LSTM, a second long short-term memory network LSTM, and a gated recurrent unit GRU; in forward-propagation order they are arranged as the PreNet network, the gated recurrent unit GRU, the first long short-term memory network LSTM, and the second long short-term memory network LSTM.
In this example, step 1034 may include the steps of:
step 10341, in the pre net network, performing nonlinear conversion on the predicted spectrum characteristic of the previous frame.
As shown in fig. 2, the predicted spectral feature of the previous frame is input into the PreNet network, which non-linearly transforms it and feeds the output into the gated recurrent unit GRU.
For the first iteration, the previous-frame predicted spectral feature is empty (i.e., an all-zero frame), denoted the <GO> frame; for non-first iterations, the previous-frame predicted spectral feature is not empty and is denoted the Last frame.
Step 10342, in the gated recurrent unit, processing the predicted spectral feature of the previous frame to obtain a predicted attention context.
The previous-frame predicted spectral feature is input into the gated recurrent unit GRU; the GRU uses its gating mechanism to control input, memory, and other information to make a prediction at the current time step, and outputs a vector related to attention, recorded as the predicted attention context.
Further, the gated recurrent unit GRU processes the previous-frame spectral feature to obtain a query vector for the attention mechanism; the query vector represents the spectral information and is correlated, within the attention mechanism, with the vectors representing the text features to obtain the attention-weighted feature (Context).
As shown in fig. 2, in the first iteration, the predicted attention context may assist the SMA in computing attention over the sample combination features.
Further, the gated recurrent unit GRU has two gates, namely a Reset Gate and an Update Gate. The reset gate determines how to combine the new input information with the previous memory, the update gate defines how much of the previous memory is carried to the current time step, and together they determine which information is finally output by the GRU. This gating mechanism can preserve information across long sequences and does not clear it over time or discard it merely because it is irrelevant to the current prediction, thereby alleviating the vanishing-gradient problem of standard RNNs.
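For reference, the standard update equations of a gated recurrent unit, which realize the reset-gate and update-gate behaviour described above, can be written as follows (this is the standard GRU formulation, not a formulation specific to this embodiment):

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) && \text{(update gate)}\\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) && \text{(reset gate)}\\
\tilde{h}_t &= \tanh\bigl(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\bigr)\\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```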
Step 10343, in the first long-short-term memory network, decoding the predicted attention context to obtain candidate spectrum features.
Step 10344, in the second long-short-term memory network, decoding the sample attention feature or the candidate spectrum feature to obtain the current frame prediction spectrum feature and the next frame prediction spectrum feature.
As shown in fig. 2, the predicted attention context is input into a first long short memory network LSTM, which decodes the predicted attention context and outputs candidate spectral features to a second long short memory network LSTM.
In the first iteration, as the predicted spectrum characteristic is empty, the second long-short-term memory network LSTM decodes the sample attention characteristic and outputs the predicted spectrum characteristic of the current frame and the predicted spectrum characteristic of the next frame;
and in non-first iteration, the second long-short-term memory network LSTM decodes the candidate spectrum characteristics and outputs the current frame prediction spectrum characteristics and the next frame prediction spectrum characteristics.
The first LSTM and the second LSTM are both Long Short-Term Memory (LSTM) networks; the LSTM is specially designed to solve the long-term dependency problem of an ordinary RNN.
The LSTM has three gates, namely a Forget Gate, an Input Gate, and an Output Gate. The forget gate determines how much of the cell state c_{t-1} at the previous time step is retained in the current cell state c_t; the input gate determines how much of the network's current input x_t is saved into the cell state c_t; and the output gate controls how much of the cell state c_t is output to the current output value h_t of the LSTM.
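For reference, the standard LSTM gate equations corresponding to the description above are (standard formulation, not specific to this embodiment):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```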
Step 10345, judging whether the decoding operation is completed; if yes, go to step 10346, if not, go to step 10347.
Because each iteration generates two frames of predicted spectral features, the generation of the predicted spectral features is complete after a number of iterations equal to half the total number of frames of the sample spectral features. Therefore, half of the total number of frames may be set as an iteration threshold; if the current number of iterations reaches the iteration threshold, the decoding operation is determined to be complete, and if it is smaller than the iteration threshold, the decoding operation is determined to be incomplete.
Step 10346, outputting all predicted spectral features.
If the decoding operation has been completed, the predicted spectral features generated per iteration may be output, thus constituting complete predicted spectral features.
Step 10347, extracting the predicted spectrum characteristic of the next frame, and returning to step 10341.
If the decoding operation is not completed, the predicted spectrum characteristic of the next frame can be extracted as the predicted spectrum characteristic of the previous frame of the next iteration, and the next iteration is entered.
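The following sketch illustrates the two-frames-per-iteration autoregressive decoding loop of steps 10341 to 10347 (PyTorch). All module sizes are assumptions, and the attention context is treated as a fixed vector here for brevity, whereas in the full model it is recomputed by the stepwise monotonic attention at every iteration.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    def __init__(self, mel_dim=80, context_dim=256, hidden=512):
        super().__init__()
        self.prenet = nn.Sequential(                      # nonlinear conversion
            nn.Linear(mel_dim, 256), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(256, 128), nn.ReLU(), nn.Dropout(0.5))
        self.gru = nn.GRUCell(128 + context_dim, 256)     # attention query
        self.lstm1 = nn.LSTMCell(256, hidden)
        self.lstm2 = nn.LSTMCell(hidden, hidden)
        self.proj = nn.Linear(hidden, mel_dim * 2)        # two frames per step
        self.mel_dim = mel_dim

    def forward(self, context, total_frames):
        # context: (batch, context_dim) attention feature (simplified as fixed)
        b = context.size(0)
        gru_h = context.new_zeros(b, self.gru.hidden_size)
        h1 = c1 = context.new_zeros(b, self.lstm1.hidden_size)
        h2 = c2 = context.new_zeros(b, self.lstm2.hidden_size)
        prev = context.new_zeros(b, self.mel_dim)         # <GO> all-zero frame
        outputs = []
        for _ in range(total_frames // 2):                # iteration threshold
            x = self.prenet(prev)                                    # step 10341
            gru_h = self.gru(torch.cat([x, context], -1), gru_h)     # step 10342
            h1, c1 = self.lstm1(gru_h, (h1, c1))                     # step 10343
            h2, c2 = self.lstm2(h1, (h2, c2))                        # step 10344
            frames = self.proj(h2).view(b, 2, self.mel_dim)
            outputs.append(frames)
            prev = frames[:, -1]             # step 10347: next frame becomes input
        return torch.cat(outputs, dim=1)     # (batch, frames, mel_dim)

mels = Decoder()(torch.randn(2, 256), total_frames=100)
```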
In step 1035, in the Post-Net network, the multi-frame prediction spectrum characteristic is modified in the time sequence dimension.
As shown in fig. 2, the Post-Net network is used to convert the predicted spectral features serving as intermediate output into the final output, i.e., predicted spectral features matched to the vocoder. In general, the recurrent neural networks process the predicted spectral features in time order, so the i-th frame predicted spectral feature affects the (i+1)-th frame; through the correction of the Post-Net network, the (i+1)-th frame predicted spectral feature can in turn influence the i-th frame, improving the accuracy of the predicted spectral features.
Step 1036, calculating a difference between the predicted spectral feature and the sample spectral feature as a loss value.
The predicted spectral features and the sample spectral features are input into a preset loss function, and the loss value LOSS is calculated.
Illustratively, the loss value L includes two parts; for example, using mean-squared error over T frames, it may be written as
L = (1/T) Σ_t ||y_t - y'_t||² + (1/T) Σ_t ||y_t - (y'_t + r_t)||²
where T is the number of frames of the spectral features, y is the sample spectral feature, y' is the predicted spectral feature, r is the residual information obtained by feeding y' into the Post-Net network, and (y' + r) can be understood as the predicted spectral feature with the residual information added.
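A minimal sketch of this two-part loss, assuming mean-squared error as the distance measure, is given below.

```python
import torch.nn.functional as F

def spectral_loss(y_true, y_pred, residual):
    """y_true, y_pred, residual: (batch, T, n_mels) tensors."""
    before_postnet = F.mse_loss(y_pred, y_true)              # first term
    after_postnet = F.mse_loss(y_pred + residual, y_true)    # second term
    return before_postnet + after_postnet
```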
Step 1037, judging whether the loss value is converged; if yes, go to step 1038, if no, go to step 1039.
Step 1038, determining that the acoustic model is trained.
Step 1039, updating the acoustic model, and returning to step 1031.
In this embodiment, a condition indicating convergence may be set in advance for the loss value, for example: the loss value is smaller than a first spectrum threshold; the difference between adjacent loss values, recorded as the variation amplitude, is smaller than a second spectrum threshold for several consecutive iterations; the number of iterations exceeds a third spectrum threshold; and so on.
In each iteration, it may be determined whether the current loss value satisfies this condition.
If this condition is satisfied, the loss value is considered to be converged, and at this time, the completion of the acoustic model training is confirmed, and the structure of the spectrum prediction network and its parameters are stored.
If the condition is not satisfied, back-propagation is performed on the spectrum prediction network, and its parameters are updated by an optimization method with a manually set learning rate, represented by stochastic gradient descent, or with an adaptively set learning rate, represented by adaptive moment estimation; during back-propagation of the spectrum prediction network, the time prediction network is not updated, and the next iteration is entered.
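The following sketch illustrates the iterate, judge, and update logic of steps 1036 to 1039, using Adam as the adaptive-learning-rate optimizer; the convergence thresholds, the data-loader interface, and the model's (prediction, residual) return signature are assumptions, and loss_fn can be, for example, the spectral_loss sketch above.

```python
import torch

def train_acoustic_model(model, data_loader, loss_fn,
                         max_steps=200_000, loss_threshold=0.05):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step, (text, timbre, target_mel) in enumerate(data_loader):
        y_pred, residual = model(text, timbre)
        loss = loss_fn(target_mel, y_pred, residual)
        if loss.item() < loss_threshold or step >= max_steps:
            break                      # loss considered converged: training done
        optimizer.zero_grad()
        loss.backward()                # back-propagate through the network
        optimizer.step()               # update the acoustic model parameters
    # store the structure of the network and its parameters
    torch.save(model.state_dict(), "acoustic_model.pt")
```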
Step 104, training the vocoder by taking the predicted spectral features as samples and taking the sample audio signals as labels.
In particular implementations, a vocoder in a speech synthesizer is used to convert spectral features (e.g., mel-frequency spectrograms) into a speech signal.
Training a vocoder in a speech synthesizer faces two main challenges: the speech data set is noisy, and the number of samples is limited. In this embodiment, HiFi-GAN (Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis), which offers high synthesis quality and high speed, can be selected as the vocoder in the speech synthesizer.
HiFi-GAN comprises a generator and two discriminators, namely a multi-scale discriminator and a multi-period discriminator, each of which is composed of sub-discriminators (for example, the sub-discriminators of the multi-period discriminator each handle a fixed period of the audio signal).
The generator is a convolutional neural network: its input is a spectral feature (e.g., a Mel spectrogram), which is up-sampled until the length of the output sequence matches the specified duration.
A speech signal is composed of sinusoidal components with many different periods; HiFi-GAN improves audio quality by modeling the periodic patterns of audio, and it generates speech signals at high speed.
Of course, networks other than HiFi-GAN may also be used as the vocoder, such as WaveNet, Parallel WaveNet, WaveRNN, LPCNet, Multiband WaveRNN, etc., which is not limited in this embodiment.
For a given speaker, the vocoder is trained by supervised learning, with the predicted spectral features as training samples and the real audio signal recorded when speaking (i.e., the sample audio signal) as the label.
Once training of the HiFi-GAN network is completed, it can be set as the vocoder in the speech synthesizer, and the structure of the HiFi-GAN network and its parameters are stored.
In general, the vocoder can be trained with the same data set as the acoustic model, which ensures the performance of the acoustic model and the vocoder when applied to TTS; of course, to improve training efficiency, the vocoder may also be trained with other data sets, so a vocoder already trained in other projects can be reused directly, which is not limited in this embodiment.
Further, before training the vocoder, the data set may be pre-processed for higher-quality synthesis by at least one of the following (a minimal sketch follows the list):
1. Longer segments representing silence (i.e., silent segments) in the sample audio signal are removed by means of energy-based VAD (Voice Activity Detection) or the like.
2. Some noise signal is added to the sample audio signal for data enhancement to stabilize the training process and improve TTS performance.
3. The sample audio signal is subjected to nonlinear transformation by a mu-law mode and the like, so that TTS has higher resolution near zero.
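A minimal sketch of these three optional pre-processing steps is given below; librosa's energy-based trimming stands in for the VAD, and the noise level and mu value are illustrative assumptions.

```python
import numpy as np
import librosa

def preprocess(wav_path, sr=22050, noise_std=0.003, mu=255):
    audio, _ = librosa.load(wav_path, sr=sr)
    # 1. Remove long silent segments (energy/VAD-style trimming).
    audio, _ = librosa.effects.trim(audio, top_db=30)
    # 2. Add a small amount of noise for data augmentation.
    audio = audio + np.random.normal(0.0, noise_std, size=audio.shape)
    # 3. mu-law style nonlinear transform for higher resolution near zero.
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return compressed.astype(np.float32)
```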
Example two
Fig. 3 is a flowchart of a speech synthesis method according to a second embodiment of the present invention, where the method is applicable to the case of performing cross-language speech synthesis with a speech synthesizer. The method may be performed by a speech synthesis apparatus, which may be implemented by software and/or hardware and configured in a computer device, for example a server, a workstation, a personal computer, or a mobile terminal (such as a mobile phone, a tablet computer, or a smart wearable device), and specifically includes the following steps:
step 301, receiving a reference voice signal belonging to a non-target language and target text information belonging to a target language.
In this embodiment, the operating system of the computer device may be Windows, Android, iOS, or the like, and clients that perform speech synthesis can run on it, for example novel-reading applications, news applications, live-streaming applications, short-video applications, instant messaging tools, conference applications, and the like.
The user provides, in the client, the speech signal whose timbre is to be cloned, by recording, uploading a file, or the like; this is recorded as the reference speech signal. The user also selects, in the client, the text information to be synthesized into speech, recorded as the target text information, such as content in a novel, content in news, content in a web page, etc.
In a scenario where speech synthesis is performed across languages, the reference speech signal belongs to a non-target language and the target text information belongs to a target language, e.g., the reference speech signal belongs to english and the target text information belongs to chinese.
Further, the target text information may be conventionally represented using phonemes of the target language, a prosodic structure, or the like; for example, if the target language is English, the target text information is represented using English phonemes.
Of course, in a scenario where speech synthesis is performed in a non-cross-language, the reference speech signal may belong to a target language, for example, the reference speech signal belongs to chinese, and the target text information belongs to chinese, which the present embodiment is not limited to.
Step 302, identifying the characteristic representing the tone color in the reference voice signal as the target tone color.
In a specific implementation, acoustic features may be extracted from the reference speech signal as target acoustic features, and features for classifying the speaker may be extracted from the target acoustic features as target timbre.
The voice color of the target voice signal is extracted by using the speaker recognition task, so that the voice synthesizer can cope with the situation that the speaker is not seen when the voice signal is synthesized, and the cross-language synthesis of the voice signal is realized.
In this embodiment, since the manner of extracting the target tone color when synthesizing the speech signal is substantially similar to the manner of extracting the target tone color when training the speech synthesizer, the description is relatively simple, and the relevant points are only required to be referred to in the part of the description of the manner of extracting the target tone color when training the speech synthesizer, and this embodiment is not described in detail herein.
Step 303, determining a speech synthesizer trained for the target language.
In this embodiment, the speech synthesizer may be trained in advance for a plurality of languages, that is, one speech synthesizer is trained for each language, and the mapping relationship between the language (identified by information such as ID and name) and the speech synthesizer (identified by information such as ID) is recorded.
If the user determines the target language, the speech synthesizer mapped by the target language can be queried in the mapping relation, and the target language and parameters thereof are loaded into the memory to operate.
Further, the speech synthesizer is trained based on the unseen timbre cloning model, and therefore, in general, the target timbre to be cloned is not used to train the speech synthesizer.
Step 304, in the acoustic model, converting the target text information into a spectral feature belonging to the target language and conforming to the target tone as a target spectral feature.
In this embodiment, the speech synthesizer includes an acoustic model, and the target text information and the target tone are input into the acoustic model, and the acoustic model converts the target text information into a spectral feature belonging to the target language and conforming to the target tone, and records the spectral feature as the target spectral feature.
In one embodiment of the present invention, to cope with the situation that no speaker is seen, the acoustic model in this embodiment may include a CBHG module as an encoder, a stepwise monotonic attention mechanism (Stepwise Monotonic Attention, SMA), three recurrent neural networks as a decoder, a Post-Net network, which enhances the encoding capability of the encoder, enhances the decoding capability of the decoder, increases the robustness of the attention mechanism, and balances between the several structures, thereby increasing the robustness of the acoustic model in the speech synthesizer, and according to experimental results, it may be shown that the acoustic model has a significant suppressing effect on Bad Case (abnormal scene) generation.
In this embodiment, step 304 may include the steps of:
In step 3041, in the encoder, the CBHG module is invoked to encode the target text information into target text features.
Furthermore, the encoder includes a PreNet network in addition to the CBHG module. In practical application, the vector of each character in the target text information may be looked up as a first target vector sequence; in the PreNet network, the first target vector sequence is non-linearly transformed to obtain a second target vector sequence; and in the CBHG module, features are extracted from the second target vector sequence as the target text features.
And step 3042, splicing the target text characteristics and the target tone to obtain target combination characteristics.
Step 3043, executing a gradual monotone attention mechanism, adding attention when converting the target combination feature into a spectrum feature, and generating a target attention feature.
In a specific implementation, the stepwise monotonic attention mechanism is executed to compute the attention paid to each frame of the target combination features when converting the current frame into spectral features; these attentions are linearly fused to obtain the target attention feature, wherein, in the stepwise monotonic attention mechanism, the order between the target combination features and the target attention features is kept monotonic.
Step 3044, in the decoder, three cyclic neural networks are sequentially called to decode the target attention feature into a multi-frame target spectrum feature.
Further, the decoder includes a PreNet network in addition to the three recurrent neural networks, where the three recurrent neural networks include a first long short-term memory network, a second long short-term memory network, and a gated recurrent unit. In practical application: the target spectral feature of the previous frame is non-linearly transformed in the PreNet network; in the gated recurrent unit, the previous-frame target spectral feature is processed to obtain a target attention context; in the first long short-term memory network, the target attention context is decoded to obtain candidate spectral features; in the second long short-term memory network, the candidate spectral features are decoded to obtain the current-frame target spectral feature and the next-frame target spectral feature; whether the decoding operation is complete is judged; if so, all target spectral features are output; if not, the next-frame target spectral feature is extracted as the previous-frame target spectral feature, and the process returns to the nonlinear conversion in the PreNet network.
In step 3045, in the Post-Net network, the multi-frame target spectrum characteristic is corrected in the time sequence dimension.
In this embodiment, since the operation mode of the acoustic model at the time of synthesizing the speech signal is substantially similar to the operation mode of the acoustic model at the time of training the speech synthesizer, the description is relatively simple, and the relevant points are only required to be referred to in the partial explanation of the operation mode of the acoustic model at the time of training the speech synthesizer, and the detailed description of this embodiment is omitted here.
Step 305, in the vocoder, the target spectral feature is converted into a target speech signal belonging to the target language.
In this embodiment, the speech synthesizer includes a vocoder, and the target spectral features output by the acoustic model are input into the vocoder, and the vocoder processes the target spectral features and converts them into target speech signals belonging to the target language.
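Putting steps 301 to 305 together, a minimal inference sketch might look as follows; speaker_encoder, acoustic_model, vocoder, and text_frontend are assumed interfaces rather than components defined by this embodiment, and extract_mel_spectrogram refers to the helper sketched in the first embodiment.

```python
import torch

def synthesize(reference_wav, target_text, speaker_encoder, acoustic_model,
               vocoder, text_frontend):
    """Cross-language synthesis: clone the timbre of reference_wav (non-target
    language) onto target_text (target language)."""
    with torch.no_grad():
        # Steps 301-302: extract the target timbre from the reference signal.
        ref_mel = torch.as_tensor(extract_mel_spectrogram(reference_wav))[None]
        target_timbre, _ = speaker_encoder(ref_mel)
        # Step 304: the acoustic model maps target-language text plus the
        # target timbre to target spectral features.
        phonemes = text_frontend(target_text)      # target-language phonemes
        target_mel = acoustic_model(phonemes, target_timbre)
        # Step 305: the vocoder converts the spectral features into the
        # target speech signal belonging to the target language.
        target_wave = vocoder(target_mel)
    return target_wave
```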
In this embodiment, a reference speech signal belonging to a non-target language and target text information belonging to a target language are received; the feature characterizing the timbre in the reference speech signal is identified and taken as the target timbre; a speech synthesizer trained for the target language is determined, the speech synthesizer comprising an acoustic model and a vocoder; in the acoustic model, the target text information is converted into spectral features that belong to the target language and conform to the target timbre, which serve as the target spectral features; and in the vocoder, the target spectral features are converted into a target speech signal belonging to the target language. Because the timbre of the reference speech signal in the non-target language is not used to train the speech synthesizer for the target language, timbre cloning of an unseen speaker can be realized in a cross-language speech synthesis scenario, so that the speech synthesizer is not limited by its training set, the data volume of the training set can be kept appropriate, and the difficulty of training can be reduced.
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Example III
Fig. 4 is a block diagram of a speech synthesis apparatus according to a third embodiment of the present invention, which may specifically include the following modules:
a synthetic information receiving module 401, configured to receive a reference speech signal belonging to a non-target language and target text information belonging to a target language;
a target tone extraction module 402, configured to identify a characteristic of a tone in the reference speech signal as a target tone;
a speech synthesizer determination module 403, configured to determine a speech synthesizer trained for the target language, where the speech synthesizer includes an acoustic model and a vocoder;
a target spectral feature generation module 404, configured to convert, in the acoustic model, the target text information into a spectral feature belonging to the target language and conforming to the target timbre, as a target spectral feature;
A target speech signal generating module 405, configured to convert, in the vocoder, the target spectral feature into a target speech signal belonging to the target language.
In one embodiment of the present invention, the target timbre extraction module 402 includes:
a target acoustic feature extraction module for extracting acoustic features from the reference speech signal as target acoustic features;
and the target classification characteristic extraction module is used for extracting the characteristics for classifying the speaker from the target acoustic characteristics as target timbre.
In one embodiment of the invention, the acoustic model comprises a CBHG module as an encoder, a gradual monotonic attention mechanism, three recurrent neural networks as a decoder, a Post-Net network;
the target spectral feature generation module 404 includes:
a target encoder calling module, configured to call the CBHG module in the encoder to encode the target text information into a target text feature;
the target feature splicing module is used for splicing the target text features and the target tone to obtain target combination features;
the target attention mechanism executing module is used for executing the gradual monotone attention mechanism, adding attention when converting the target combination characteristic into the frequency spectrum characteristic, and generating a target attention characteristic;
A target decoder calling module, configured to call three cyclic neural networks in sequence in the decoder to decode the target attention feature into a multi-frame target spectrum feature;
and the target spectrum correction module is used for correcting the target spectrum characteristics of a plurality of frames in the Post-Net network under the time sequence dimension.
In one embodiment of the invention, the encoder further comprises a PreNet network;
the target encoder call module includes:
the first target vector sequence query module is used for querying the vector of each word in the target text information and taking the vector as a first target vector sequence;
the second target vector sequence conversion module is used for carrying out nonlinear conversion on the first target vector sequence in the PreNet network to obtain a second target vector sequence;
and the target text feature extraction module is used for extracting features from the second target vector sequence in the CBHG module to serve as target text features.
In one embodiment of the present invention, the target attention mechanism execution module includes:
the target attention conversion module is used for executing the gradual monotone attention mechanism and calculating the attention of the target combination feature of each frame when the target combination feature of the current frame is converted into the frequency spectrum feature;
And the target attention feature fusion module is used for linearly fusing the attention to obtain target attention features, wherein in the progressive monotonous attention mechanism, the sequence between the target combination features and the target attention features is maintained monotonous.
In one embodiment of the present invention, the decoder further includes a PreNet network, and the three recurrent neural networks include a first long short-term memory network, a second long short-term memory network, and a gated recurrent unit;
the target decoder invoking module includes:
the target nonlinear conversion module is used for carrying out nonlinear conversion on the target spectrum characteristics of the previous frame in the PreNet network;
the target attention context calculation module is used for processing the target spectrum characteristic of the last frame in the gating circulation unit to obtain a target attention context;
the candidate spectrum characteristic decoding module is used for decoding the target attention context in the first long-term and short-term memory network to obtain candidate spectrum characteristics;
a target spectrum feature decoding module, configured to decode, in the second long-short-term memory network, the target attention feature or the candidate spectrum feature to obtain the target spectrum feature of the current frame and the target spectrum feature of the next frame;
The decoding operation judging module is used for judging whether the decoding operation is finished or not; if yes, a target spectrum characteristic output module is called, and if not, a target spectrum characteristic extraction module is called;
the target frequency spectrum characteristic output module is used for outputting all the target frequency spectrum characteristics;
and the target frequency spectrum feature extraction module is used for extracting the target frequency spectrum feature of the next frame and calling the target nonlinear conversion module.
In one embodiment of the present invention, the speech synthesizer determination module 403 includes:
the data set acquisition module is used for acquiring a sample voice signal, sample text information expressing the content of the sample voice signal and sample spectrum characteristics converted by the sample voice signal, wherein the sample voice signal and the sample text information all belong to a target language;
the sample tone recognition module is used for recognizing the characteristic representing the tone in the sample voice signal and taking the characteristic representing the tone as a sample tone;
the acoustic model training module is used for training an acoustic model by taking the sample text information and the sample tone as samples and taking the sample spectrum characteristics as labels;
and the vocoder training module is used for taking the predicted spectrum characteristics as samples and taking the sample audio signals as labels to train the vocoder.
In one embodiment of the invention, the acoustic model comprises a CBHG module as an encoder, a gradual monotonic attention mechanism, three recurrent neural networks as a decoder, a Post-Net network;
the acoustic model training module includes:
a sample encoder calling module for calling the CBHG module to encode the sample text information into sample text features in the encoder;
the sample characteristic splicing module is used for splicing the sample text characteristics and the sample tone to obtain sample combination characteristics;
the sample attention mechanism executing module is used for executing the progressive monotone attention mechanism, adding attention when the sample combination characteristic is converted into the frequency spectrum characteristic, and generating a sample attention characteristic;
a sample decoder calling module, configured to call three cyclic neural networks in turn in the decoder to decode the sample attention feature into a multi-frame prediction spectrum feature;
the sample spectrum correction module is used for correcting the multi-frame predicted spectrum characteristics in the Post-Net network under the time sequence dimension;
a loss value calculation module for calculating a difference between the predicted spectral feature and the sample spectral feature as a loss value;
The loss value judging module is used for judging whether the loss value converges or not; if yes, executing a completion determination module, otherwise, calling an updating module;
the completion determination module is used for determining that the acoustic model is trained;
and the updating module is used for updating the acoustic model and calling the sample encoder calling module.
In one embodiment of the invention, the encoder further comprises a PreNet network;
the sample encoder call module includes:
the first sample vector sequence query module is used for querying the vector of each word in the sample text information and is used as a first sample vector sequence;
a second sample vector sequence conversion module, configured to perform nonlinear conversion on the first sample vector sequence in the pre net network, to obtain a second sample vector sequence;
and the sample text feature extraction module is used for extracting features from the second sample vector sequence in the CBHG module as sample text features.
In one embodiment of the present invention, the sample attention mechanism execution module includes:
the sample attention conversion module is used for executing the progressive monotonic attention mechanism and calculating, for each frame, the attention over the sample combination features when the sample combination features of the current frame are converted into spectral features;
and the sample attention feature fusion module is used for linearly fusing the attention to obtain sample attention features, wherein in the progressive monotonic attention mechanism the order between the sample combination features and the sample attention features is kept monotonic (an illustrative attention sketch follows).
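The Python sketch below illustrates only the monotonicity constraint, under assumptions of our own: a dot-product score decides whether the alignment stays on the current combined feature or advances to the next one, and the attended output is a linear fusion of those two frames. The actual scoring and fusion details of the mechanism are not specified by this sketch.

import numpy as np

def monotonic_attention(combined_feats, queries):
    # combined_feats: [T_enc, D] spliced text+tone features; queries: [T_dec, D] decoder states.
    pos = 0                                        # current alignment position (never decreases)
    contexts = []
    for q in queries:
        nxt = min(pos + 1, len(combined_feats) - 1)
        scores = np.array([combined_feats[pos] @ q, combined_feats[nxt] @ q])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # attention weights for this output frame
        contexts.append(weights[0] * combined_feats[pos] + weights[1] * combined_feats[nxt])
        if weights[1] > weights[0]:
            pos = nxt                              # advance at most one step: order stays monotonic
    return np.stack(contexts)                      # linearly fused attention features

enc = np.random.randn(20, 16)                      # sample combination features
dec_queries = np.random.randn(30, 16)
attn_feats = monotonic_attention(enc, dec_queries)
print(attn_feats.shape)                            # (30, 16)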
In one embodiment of the present invention, the decoder further includes a PreNet network, and the three recurrent neural networks include a first long short-term memory network, a second long short-term memory network, and a gated recurrent unit;
the sample decoder calling module includes:
the prediction nonlinear conversion module is used for performing nonlinear conversion on the predicted spectral features of the previous frame in the PreNet network;
the predicted attention context calculation module is used for processing the predicted spectral features of the previous frame in the gated recurrent unit to obtain a predicted attention context;
the candidate spectral feature decoding module is used for decoding the predicted attention context in the first long short-term memory network to obtain candidate spectral features;
the predicted spectral feature decoding module is used for decoding the sample attention features or the candidate spectral features in the second long short-term memory network to obtain the predicted spectral features of the current frame and the predicted spectral features of the next frame;
the decoding operation judging module is used for judging whether the decoding operation is finished; if so, the predicted spectral feature output module is called, otherwise the predicted spectral feature extraction module is called;
the predicted spectral feature output module is used for outputting all of the predicted spectral features;
and the predicted spectral feature extraction module is used for extracting the predicted spectral features of the next frame and calling the prediction nonlinear conversion module again (an illustrative decoding-loop sketch follows).
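The Python sketch below mirrors this frame-by-frame loop with assumed dimensions and a fixed number of decoding steps (a real decoder would use a learned stop condition): a PreNet-style nonlinearity processes the previous predicted frame, a GRU cell produces the attention context, a first LSTM cell yields a candidate feature, and a second LSTM cell emits the spectra of the current and next frames, the latter being fed back into the next iteration.

import torch
import torch.nn as nn

mel_dim, hidden = 80, 128
prenet = nn.Sequential(nn.Linear(mel_dim, hidden), nn.ReLU())  # nonlinear conversion stub
gru_cell = nn.GRUCell(hidden, hidden)             # produces the predicted attention context
lstm1 = nn.LSTMCell(hidden, hidden)               # first LSTM: candidate spectral features
lstm2 = nn.LSTMCell(hidden, hidden)               # second LSTM: current + next frame
to_mel = nn.Linear(hidden, 2 * mel_dim)           # split the output into two spectral frames

h_gru = torch.zeros(1, hidden)
h1, c1 = torch.zeros(1, hidden), torch.zeros(1, hidden)
h2, c2 = torch.zeros(1, hidden), torch.zeros(1, hidden)
prev_frame = torch.zeros(1, mel_dim)              # all-zero "go" frame starts the loop

outputs = []
for _ in range(50):                               # 50 decoding steps -> 100 frames
    x = prenet(prev_frame)                        # nonlinear conversion of the previous frame
    h_gru = gru_cell(x, h_gru)                    # predicted attention context
    h1, c1 = lstm1(h_gru, (h1, c1))               # candidate spectral features
    h2, c2 = lstm2(h1, (h2, c2))
    cur_frame, next_frame = to_mel(h2).split(mel_dim, dim=-1)
    outputs += [cur_frame, next_frame]
    prev_frame = next_frame                       # extract the next frame and loop again
mel = torch.cat(outputs, dim=0)                   # [100, 80] predicted spectral features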
The voice synthesis device provided by the embodiment of the invention can execute the voice synthesis method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 5 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. Fig. 5 illustrates a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present invention. The computer device 12 shown in fig. 5 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in FIG. 5, the computer device 12 is in the form of a general purpose computing device. Components of computer device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, a bus 18 that connects the various system components, including the system memory 28 and the processing units 16.
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 34 may be used to read from or write to non-removable, nonvolatile magnetic media (not shown in FIG. 5, commonly referred to as a "hard disk drive"). Although not shown in fig. 5, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data medium interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboard, pointing device, display 24, etc.), one or more devices that enable a user to interact with the computer device 12, and/or any devices (e.g., network card, modem, etc.) that enable the computer device 12 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 22. Moreover, computer device 12 may also communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through network adapter 20. As shown, network adapter 20 communicates with other modules of computer device 12 via bus 18. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with computer device 12, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the speech synthesis method provided by the embodiment of the present invention.
Example five
The fifth embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements each process of the above-mentioned speech synthesis method and achieves the same technical effects, which are not repeated here to avoid redundancy.
The computer readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (11)

1. A method of speech synthesis, comprising:
receiving a reference voice signal belonging to a non-target language and target text information belonging to a target language;
identifying the characteristic representing the tone color in the reference voice signal as a target tone color;
determining a speech synthesizer trained for the target language, the speech synthesizer comprising an acoustic model, a vocoder;
converting the target text information into a spectral feature belonging to the target language and conforming to the target tone in the acoustic model as a target spectral feature;
Converting, in the vocoder, the target spectral features into target speech signals belonging to the target language;
the acoustic model includes a CBHG module as an encoder, a progressive monotonic attention mechanism, three recurrent neural networks as a decoder, and a Post-Net network.
2. The method of claim 1, wherein said identifying a feature in said reference speech signal that characterizes a tone as a target tone comprises:
extracting acoustic features from the reference speech signal as target acoustic features;
and extracting the characteristics for classifying the speaker from the target acoustic characteristics as target timbre.
3. The method of claim 1, wherein the converting, in the acoustic model, the target text information into spectral features belonging to the target language and conforming to the target tone, as target spectral features, comprises:
in the encoder, invoking the CBHG module to encode the target text information into a target text feature;
splicing the target text feature and the target tone to obtain a target combined feature;
executing the progressive monotonic attention mechanism, adding attention to the target combination features as they are converted into spectral features, and generating target attention features;
in the decoder, sequentially calling the three recurrent neural networks to decode the target attention features into multi-frame target spectral features;
and in the Post-Net network, correcting the multi-frame target spectral features in the time dimension.
4. The method of claim 3, wherein the encoder further comprises a PreNet network;
the step of calling the CBHG module to encode the target text information into target text features in the encoder comprises the following steps:
querying the vector of each word in the target text information as a first target vector sequence;
in the PreNet network, nonlinear conversion is carried out on the first target vector sequence to obtain a second target vector sequence;
and in the CBHG module, extracting features from the second target vector sequence as target text features.
5. The method according to claim 3, wherein the executing the progressive monotonic attention mechanism, adding attention to the target combination features as they are converted into spectral features, and generating target attention features comprises:
executing the progressive monotonic attention mechanism, and calculating, for each frame, the attention over the target combination features when the target combination features of the current frame are converted into spectral features;
and linearly fusing the attention to obtain the target attention features, wherein in the progressive monotonic attention mechanism the order between the target combination features and the target attention features is kept monotonic.
6. The method of claim 3, wherein the decoder further comprises a PreNet network, and the three recurrent neural networks comprise a first long short-term memory network, a second long short-term memory network, and a gated recurrent unit;
and the sequentially calling, in the decoder, the three recurrent neural networks to decode the target attention features into multi-frame target spectral features comprises:
in the PreNet network, performing nonlinear conversion on the target spectral features of the previous frame;
in the gated recurrent unit, processing the target spectral features of the previous frame to obtain a target attention context;
decoding the target attention context in the first long short-term memory network to obtain candidate spectral features;
decoding the target attention features or the candidate spectral features in the second long short-term memory network to obtain the target spectral features of the current frame and the target spectral features of the next frame;
judging whether the decoding operation is finished;
if yes, outputting all of the target spectral features;
if not, extracting the target spectral features of the next frame, and returning to the step of performing nonlinear conversion on the target spectral features of the previous frame in the PreNet network.
7. The method of any of claims 1-6, wherein the determining a speech synthesizer trained for the target language comprises:
acquiring a sample voice signal, sample text information expressing the content of the sample voice signal, and sample spectral features converted from the sample voice signal, wherein the sample voice signal and the sample text information both belong to the target language;
identifying characteristics representing tone colors in the sample voice signals as sample tone colors;
training an acoustic model by taking the sample text information and the sample tone as samples and taking the sample spectral features as labels, to obtain predicted spectral features;
and training the vocoder by taking the predicted spectral features as samples and taking the sample audio signals as labels.
8. The method of claim 7, wherein the training the acoustic model by taking the sample text information and the sample tone as samples and taking the sample spectral features as labels comprises:
In the encoder, invoking the CBHG module to encode the sample text information into sample text features;
splicing the sample text features and the sample tone to obtain sample combination features;
executing the progressive monotonic attention mechanism, adding attention to the sample combination features as they are converted into spectral features, and generating sample attention features;
in the decoder, sequentially calling the three recurrent neural networks to decode the sample attention features into multi-frame predicted spectral features;
in the Post-Net network, correcting the multi-frame predicted spectral features in the time dimension;
calculating a difference between the predicted spectral feature and the sample spectral feature as a loss value;
judging whether the loss value converges or not;
if yes, determining that the acoustic model is trained;
if not, updating the acoustic model and returning to the step of calling the CBHG module in the encoder to encode the sample text information into sample text features.
9. A speech synthesis apparatus, comprising:
the synthetic information receiving module is used for receiving a reference voice signal belonging to a non-target language and target text information belonging to a target language;
the target tone extraction module is used for identifying the feature representing the tone in the reference voice signal as a target tone;
a speech synthesizer determining module for determining a speech synthesizer trained for the target language, the speech synthesizer comprising an acoustic model, a vocoder;
a target spectral feature generation module, configured to convert, in the acoustic model, the target text information into a spectral feature belonging to the target language and conforming to the target timbre, as a target spectral feature;
a target speech signal generation module for converting the target spectral feature into a target speech signal belonging to the target language in the vocoder;
the acoustic model includes a CBHG module as an encoder, a progressive monotonic attention mechanism, three recurrent neural networks as a decoder, and a Post-Net network.
10. A computer device, the computer device comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the speech synthesis method of any of claims 1-8.
11. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the speech synthesis method according to any of claims 1-8.
CN202110602414.1A 2021-05-31 2021-05-31 Speech synthesis method, device, computer equipment and storage medium Active CN113327575B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110602414.1A CN113327575B (en) 2021-05-31 2021-05-31 Speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113327575A (en) 2021-08-31
CN113327575B (en) 2024-03-01

Family

ID=77422880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110602414.1A Active CN113327575B (en) 2021-05-31 2021-05-31 Speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113327575B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114999443A (en) * 2022-05-27 2022-09-02 网易(杭州)网络有限公司 Voice generation method and device, storage medium and electronic equipment
CN116030792B (en) * 2023-03-30 2023-07-25 深圳市玮欧科技有限公司 Method, apparatus, electronic device and readable medium for converting voice tone


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4007997B1 (en) * 2019-08-03 2024-03-27 Google LLC Controlling expressivity in end-to-end speech synthesis systems

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017197809A1 (en) * 2016-05-18 2017-11-23 百度在线网络技术(北京)有限公司 Speech synthesis method and speech synthesis device
WO2019165748A1 (en) * 2018-02-28 2019-09-06 科大讯飞股份有限公司 Speech translation method and apparatus
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
CN111462769A (en) * 2020-03-30 2020-07-28 深圳市声希科技有限公司 End-to-end accent conversion method
CN112116904A (en) * 2020-11-20 2020-12-22 北京声智科技有限公司 Voice conversion method, device, equipment and storage medium
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112767958A (en) * 2021-02-26 2021-05-07 华南理工大学 Zero-learning-based cross-language tone conversion system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis; Mu Wang et al.; 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia); pp. 1-6 *

Also Published As

Publication number Publication date
CN113327575A (en) 2021-08-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant