CN113870827A - Training method, device, equipment and medium of speech synthesis model - Google Patents

Training method, device, equipment and medium of speech synthesis model

Info

Publication number
CN113870827A
CN113870827A
Authority
CN
China
Prior art keywords
information
training
synthesis model
speech
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111142243.5A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202111142243.5A priority Critical patent/CN113870827A/en
Publication of CN113870827A publication Critical patent/CN113870827A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application relates to artificial intelligence technology and provides a method, an apparatus, a device and a medium for training a speech synthesis model. The method comprises the following steps: acquiring a training sample, wherein the training sample comprises training speech information and training text information corresponding to the training speech information, and the training speech information and the training text information have the same content; encoding the training speech information through a parameter encoder to obtain embedded information of the training speech information; encoding the training text information through a speech synthesis model to obtain phoneme data of the training text information; decoding the embedded information and the phoneme data through the speech synthesis model to obtain target speech information; and training the speech synthesis model according to the training speech information and the target speech information to obtain a trained speech synthesis model, so that the training efficiency of the speech synthesis model can be improved.

Description

Training method, device, equipment and medium of speech synthesis model
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a medium for training a speech synthesis model.
Background
Speech synthesis models are used to implement speech synthesis, a technique that generates artificial speech by mechanical and electronic means, i.e., that converts text information into intelligible, fluent spoken output. However, training an existing speech synthesis model requires a large number of speech corpora as training samples, and a single speech corpus may be several hours or even dozens of hours long, which makes training samples difficult to obtain and leads to a complex training process for the speech synthesis model.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a medium for training a speech synthesis model, which can train the speech synthesis model by using a single speech corpus, so that the speech synthesis model can be trained conveniently and its training efficiency can be improved.
In one aspect, an embodiment of the present application provides a method for training a speech synthesis model, where the method includes:
acquiring a training sample, wherein the training sample comprises training speech information and training text information corresponding to the training speech information, and the training speech information and the training text information have the same content;
encoding the training speech information through a parameter encoder to obtain embedded information of the training speech information;
encoding the training text information through the speech synthesis model to obtain phoneme data of the training text information;
decoding the embedded information and the phoneme data through the speech synthesis model to obtain target speech information;
and training the speech synthesis model according to the training speech information and the target speech information to obtain the trained speech synthesis model.
In one embodiment, the specific implementation process of obtaining the embedded information of the training speech information by encoding the training speech information through the parameter encoder is as follows:
carrying out frequency domain conversion on the training voice information to obtain the frequency spectrum characteristics of the training voice information;
performing representation learning on the spectrum characteristics through a parameter encoder to obtain parameters of a normal distribution;
and constructing a normal distribution based on the parameters of the normal distribution, and sampling from the constructed normal distribution to obtain the embedded information.
In one embodiment, the specific implementation process of obtaining the phoneme data of the training text information by encoding the training text information through the speech synthesis model is as follows:
performing word segmentation processing on training text information through a speech synthesis model to obtain a text character string;
converting the text character string into a phoneme sequence through a Grapheme-to-Phoneme (G2P) module in the speech synthesis model;
and performing phoneme coding processing on the phoneme sequence through a phoneme coder in the speech synthesis model to obtain phoneme data.
In one embodiment, the specific implementation process of decoding the embedded information and the phoneme data through the speech synthesis model to obtain the target speech information is as follows:
decoding the embedded information and the phoneme data through a context decoder in the speech synthesis model to obtain a speech frequency spectrum;
and converting the voice frequency spectrum to obtain voice information through a vocoder in the voice synthesis model.
In one embodiment, the specific implementation process of decoding the embedded information and the phoneme data by a context decoder in the speech synthesis model to obtain the speech spectrum includes:
and context decoding is carried out on the embedded information and the phoneme data by adopting an autoregressive mode through a context decoder in the speech synthesis model to obtain a speech frequency spectrum.
In one embodiment, the following process may also be implemented:
acquiring text information to be synthesized;
coding the text information to be synthesized through the trained speech synthesis model to obtain phoneme data of the text information to be synthesized;
decoding phoneme data of text information to be synthesized through the trained speech synthesis model to obtain predicted speech information;
and outputting the predicted voice information.
In one embodiment, the specific implementation process of acquiring the text information to be synthesized is as follows:
when a voice synthesis instruction of text information to be synthesized is detected, acquiring historical voice information of a user outputting the text information to be synthesized;
analyzing and processing the historical voice information to obtain acoustic characteristics of the user;
the specific implementation process of decoding the phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain the predicted speech information is as follows:
and decoding the phoneme data and the acoustic features of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information.
In another aspect, an embodiment of the present application provides a training apparatus for a speech synthesis model, where the training apparatus for a speech synthesis model includes:
the training voice information acquisition unit is used for acquiring training samples, wherein the training samples comprise training voice information and training text information corresponding to the training voice information, and the training voice information and the training text information have the same content;
the processing unit is used for coding the training voice information through the parameter coder to obtain embedded information of the training voice information;
the processing unit is also used for coding the training text information through the speech synthesis model to obtain phoneme data of the training text information;
the processing unit is also used for decoding the embedded information and the phoneme data through a voice synthesis model to obtain target voice information;
and the processing unit is also used for training the voice synthesis model according to the training voice information and the target voice information to obtain the trained voice synthesis model.
In another aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a communication interface, where the processor, the memory, and the communication interface are connected to each other, where the memory is used to store a computer program that supports a terminal to execute the foregoing method, the computer program includes program instructions, and the processor is configured to call the program instructions, and perform the following steps: acquiring a training sample, wherein the training sample comprises training voice information and training text information corresponding to the training voice information, and the training voice information and the training text information have the same content; coding the training voice information through a parameter coder to obtain embedded information of the training voice information; coding the training text information through a speech synthesis model to obtain phoneme data of the training text information; decoding the embedded information and the phoneme data through a voice synthesis model to obtain target voice information; and training the voice synthesis model according to the training voice information and the target voice information to obtain the trained voice synthesis model.
In still another aspect, an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program including program instructions, which, when executed by a processor, cause the processor to execute the above-mentioned training method for a speech synthesis model.
In the embodiment of the application, a training sample is obtained, the training sample comprises training voice information and training text information corresponding to the training voice information, and the content indicated by the training voice information is the same as that indicated by the training text information; coding the training voice information through a parameter coder to obtain embedded information of the training voice information; coding the training text information through a speech synthesis model to obtain phoneme data of the training text information; decoding the embedded information and the phoneme data through a voice synthesis model to obtain target voice information; the speech synthesis model is trained according to the training speech information and the target speech information to obtain the trained speech synthesis model, the speech synthesis model can be trained by using one speech corpus, the training of the speech synthesis model can be conveniently realized, and the training efficiency of the speech synthesis model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a method for training a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an architecture of a training system for a speech synthesis model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a speech synthesis method provided in an embodiment of the present application;
FIG. 4 is a block diagram of a speech synthesis system according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for training a speech synthesis model according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiment of the application relates to a speech synthesis model, and provides a training method of the speech synthesis model, which can realize the training of the speech synthesis model based on training speech information and training text information corresponding to the training speech information, namely, the embodiment of the application can train the speech synthesis model by only using one speech corpus, can conveniently realize the training of the speech synthesis model, and improves the training efficiency of the speech synthesis model.
There may be one piece or multiple pieces of training voice information, which is not specifically limited by the embodiment of the present application. When there are multiple pieces of training voice information, different training voice information corresponds to different training text information. The embodiment of the application can train the speech synthesis model once based on each piece of training voice information and the training text information corresponding to it, and realize iterative training of the speech synthesis model through the multiple pieces of training voice information and training text information to obtain the trained speech synthesis model.
Each piece of training text information may consist of one or more text strings. For example, if the training text information is "wish you happy birthday", the training speech information corresponding to it is audio data about "wish you happy birthday" input by a certain user. Existing speech synthesis models require a large amount of speech information during training (e.g., audio data about the same training text information input by different users), and a single piece of speech information may be several hours or even dozens of hours long. In contrast, in the embodiment of the application the speech synthesis model can be trained based on a single piece of training speech information corresponding to one piece of training text information, that is, by using one speech corpus, while still ensuring that the trained speech synthesis model can synthesize high-quality and highly natural speech information.
The embodiment of the application can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for training a speech synthesis model according to an embodiment of the present application; the method for training the speech synthesis model shown in fig. 1 may be performed by a first electronic device and includes, but is not limited to, steps S101-S105, wherein:
S101, obtaining a training sample, wherein the training sample comprises training voice information and training text information corresponding to the training voice information, and the training voice information and the training text information indicate the same content.
The first electronic device may obtain a training sample, where training speech information in the training sample may be a single piece of training speech information, that is, audio data input by one user, for example, audio data input by one user about "wish you happy birthday", and then the first electronic device may use the audio data as training speech information, and training text information corresponding to the training speech information may be "wish you happy birthday".
It is understood that the training sample may be input to the first electronic device by a user, for example, the first electronic device collects training voice information through a microphone and collects training text information corresponding to the training voice information through an input device (e.g., a touch panel or a keyboard) of the first electronic device. Optionally, the training sample may also be obtained by the first electronic device from a local storage, or obtained by the first electronic device from another device, or obtained by downloading the first electronic device through the internet, which is not limited by the embodiment of the present application.
The first electronic device can be any one or more of a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent vehicle-mounted device and an intelligent wearable device. Optionally, the first electronic device may also be a server, and the server may be an independent physical server, or a server cluster or a distributed system formed by a plurality of physical servers. That is, the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
S102, encoding the training voice information through the parameter encoder to obtain the embedded information of the training voice information.
In an embodiment, the first electronic device may perform frequency domain conversion on the training speech information to obtain spectral features of the training speech information, perform representation learning on the spectral features through a parameter encoder to obtain the parameters of a normal distribution, that is, a mean and a variance, then construct a normal distribution based on these parameters, and sample from the constructed normal distribution to obtain the embedded information (embedding). Learning the distribution parameters and sampling the embedded information from the constructed normal distribution in this way improves the generalization capability of speech synthesis.
Taking the architecture diagram of the training system of the speech synthesis model shown in fig. 2 as an example, the first electronic device may perform frequency domain conversion on the training speech information to obtain the spectral features of the training speech information, input the spectral features to the parameter encoder, perform representation learning on the spectral features through the parameter encoder to obtain the parameters of a normal distribution, that is, a mean and a variance, then construct a normal distribution based on these parameters, and obtain the embedded information by sampling from the constructed normal distribution.
The parameter encoder may be, for example, a Variational Auto-Encoder (VAE) encoder. The VAE encoder is a neural network whose input is a data point x (here, the spectral feature of the speech information) and whose output is a latent vector z; its parameters are θ, so the VAE encoder can be represented as q_θ(z|x). To illustrate more concretely, assume x is a 784-dimensional black-and-white picture vector. The VAE encoder needs to encode the 784-dimensional data x into the latent space z, which is much smaller than 784 dimensions, so the encoder must learn to compress the data efficiently into this low-dimensional space. Furthermore, z is assumed to be normally distributed, and the process by which the encoder outputs z can be decomposed into two steps: 1) the VAE encoder first outputs the parameters of a normal distribution (mean and variance), and these parameters are different for each data point; 2) noise is then combined with this normal distribution and a sample is drawn from it to obtain the output (i.e., the embedded information).
The embedding is a way to convert discrete variables into continuous vector representation. Embedding is very useful in neural networks because it can not only reduce the spatial dimension of a discrete variable, but can also represent the variable meaningfully. In other words, embedding can convert a large sparse vector into a low dimensional space that preserves semantic relationships.
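For illustration only, the following minimal PyTorch-style sketch shows one way a parameter encoder of the kind described above could be implemented: it maps a spectral feature to the mean and log-variance of a normal distribution and samples an embedding from that distribution. All layer sizes, names and the choice of a reparameterized sample are assumptions of this sketch, not details taken from the application.

    import torch
    import torch.nn as nn

    class ParameterEncoder(nn.Module):
        """Sketch: map a spectral feature to the parameters (mean, log-variance)
        of a normal distribution and sample an embedding from it."""
        def __init__(self, spec_dim=80, embed_dim=64):  # dimensions are assumed
            super().__init__()
            self.backbone = nn.Sequential(nn.Linear(spec_dim, 256), nn.ReLU())
            self.mean_head = nn.Linear(256, embed_dim)     # mean of the normal distribution
            self.logvar_head = nn.Linear(256, embed_dim)   # log-variance of the normal distribution

        def forward(self, spectrum):
            h = self.backbone(spectrum)
            mean, logvar = self.mean_head(h), self.logvar_head(h)
            std = torch.exp(0.5 * logvar)
            noise = torch.randn_like(std)        # noise fused with the distribution
            embedding = mean + noise * std       # sampled embedded information
            return embedding, mean, logvar

    # usage: one 80-dimensional spectral feature -> one 64-dimensional embedding
    encoder = ParameterEncoder()
    embedding, mean, logvar = encoder(torch.randn(1, 80))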
S103, coding the training text information through the voice synthesis model to obtain phoneme data of the training text information.
The speech synthesis model in the embodiments of the present application may include a G2P module and a phoneme coder.
In one embodiment, the first electronic device may perform word segmentation on the training text information to obtain a text string, then convert the text string into a phoneme sequence through the G2P module, and perform phoneme encoding on the phoneme sequence through a phoneme encoder to obtain encoded phoneme data.
Taking the architecture diagram of the training system of the speech synthesis model shown in fig. 2 as an example, the first electronic device may perform word segmentation processing on training text information to obtain a text character string, input the text character string to the G2P module, convert the text character string into a phoneme sequence through the G2P module, and perform phoneme coding processing on the phoneme sequence through a phoneme coder to obtain coded phoneme data.
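As a toy illustration of this text-processing pipeline (not the application's actual G2P module, which is described below as RNN/LSTM based), the sketch below uses a small hand-written lexicon for word segmentation, grapheme-to-phoneme conversion and mapping to phoneme IDs; the lexicon and function names are assumptions.

    # Toy sketch: word segmentation, dictionary-based grapheme-to-phoneme lookup,
    # and conversion of the phoneme sequence to integer IDs for the phoneme encoder.
    G2P_DICT = {  # assumed toy lexicon, not a real pronunciation dictionary
        "happy": ["HH", "AE", "P", "IY"],
        "birthday": ["B", "ER", "TH", "D", "EY"],
    }
    PHONEME_IDS = {p: i for i, p in enumerate(sorted({p for ps in G2P_DICT.values() for p in ps}))}

    def text_to_phoneme_ids(text: str):
        words = text.lower().split()                                # word segmentation (toy)
        phonemes = [p for w in words for p in G2P_DICT.get(w, [])]  # G2P conversion
        return [PHONEME_IDS[p] for p in phonemes]                   # IDs fed to the phoneme encoder

    print(text_to_phoneme_ids("happy birthday"))  # a short sequence of phoneme IDs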
Here, a phoneme (phone) is the smallest unit of speech, divided according to the natural attributes of speech: acoustically, it is the smallest speech unit distinguished by sound quality; physiologically, one articulatory action forms one phoneme. For example, "ma" comprises the two articulatory actions "m" and "a" and is therefore two phonemes. Sounds produced by the same articulatory action are the same phoneme, and sounds produced by different articulatory actions are different phonemes. For example, in "ma-mi", the two "m" articulatory actions are the same and are the same phoneme, while "a" and "i" are produced by different articulatory actions and are different phonemes. Phonemes are generally analyzed and described in terms of articulatory actions. For example, the articulatory action of "m" is: the upper and lower lips close, the vocal cords vibrate, and the airflow exits through the nasal cavity to make the sound.
The G2P module uses a Recurrent Neural Network (RNN) and a Long Short-Term Memory network (LSTM) to convert English words into phonemes. The phoneme coder can encode the phoneme sequence according to the fundamental period parameter, the amplitude parameter and the spectrum parameter to obtain encoded phoneme data. Specifically, the fundamental period parameter, the amplitude parameter and the spectrum parameter may be smoothed by interpolation to obtain the encoded phoneme data.
The fundamental period parameter and the amplitude parameter may be calculated frame by frame (e.g., one frame is 180 samples at an 8 kHz sampling rate). The spectrum parameters are calculated in a Linear Predictive Coding (LPC) manner, with the following formula:
A_n / (1 + a_1·z^(-1) + a_2·z^(-2) + … + a_10·z^(-10))
where A_n is the amplitude parameter, z is the z-transform variable, and a_1 … a_10 are the LPC parameters.
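The per-frame parameter computation can be sketched as follows; this is a generic autocorrelation-method LPC estimate in NumPy, offered only as an illustration of the kind of calculation described above (the RMS definition of the amplitude and the regularization term are assumptions).

    import numpy as np

    def frame_parameters(frame, order=10):
        """Sketch: per-frame amplitude parameter and LPC coefficients a_1..a_10
        via the autocorrelation method; 180 samples per frame at 8 kHz."""
        amplitude = np.sqrt(np.mean(frame ** 2))  # amplitude parameter A_n (RMS, assumed)
        r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
        R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
        a = np.linalg.solve(R + 1e-6 * np.eye(order), -r[1:order + 1])  # LPC parameters
        return amplitude, a

    amplitude, lpc = frame_parameters(np.random.randn(180))  # one 180-sample frame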
S104, decoding the embedded information and the phoneme data through the voice synthesis model to obtain target voice information.
In a specific implementation, the speech synthesis model may also include a context decoder and a vocoder. The first electronic device may decode the embedded information and the encoded phoneme data through the context decoder, generating a speech spectrum in a sequence-to-sequence manner, and then convert the generated speech spectrum into a speech waveform through the vocoder to obtain the target speech information.
Illustratively, the vocoder in the embodiment of the present application may be a Griffin-Lim vocoder or a MelGAN vocoder, etc. Griffin-Lim is an iterative algorithm for reconstructing speech when only the magnitude spectrum is known and the phase spectrum is unknown. The iterative process is as follows: first, randomly initialize a phase spectrum; synthesize new speech from this phase spectrum and the known magnitude spectrum by the Inverse Short-Time Fourier Transform (ISTFT); perform the Short-Time Fourier Transform (STFT) on the synthesized speech to obtain a new magnitude spectrum and a new phase spectrum; discard the new magnitude spectrum and synthesize speech again from the new phase spectrum and the known magnitude spectrum, and so on. The MelGAN vocoder can rapidly generate audio based on a Generative Adversarial Network (GAN); it is a non-autoregressive feed-forward convolutional architecture that generates raw audio with a GAN, so high-quality speech information can be generated without introducing additional distillation or perceptual losses.
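The Griffin-Lim iteration just described can be sketched directly; the version below relies on librosa's STFT/ISTFT and uses illustrative FFT and hop sizes that are assumptions, not values from the application.

    import numpy as np
    import librosa

    def griffin_lim(magnitude, n_iter=32, n_fft=1024, hop_length=256):
        """Sketch: reconstruct a waveform from a known magnitude spectrogram
        (shape (1 + n_fft // 2, frames)) by alternating ISTFT and STFT while
        keeping only the phase of each re-analysis, as described above."""
        phase = np.exp(2j * np.pi * np.random.rand(*magnitude.shape))  # 1) random initial phase
        for _ in range(n_iter):
            audio = librosa.istft(magnitude * phase, hop_length=hop_length)  # 2) synthesize speech
            stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)   # 3) re-analyze it
            phase = np.exp(1j * np.angle(stft))                              # 4) keep only the new phase
        return librosa.istft(magnitude * phase, hop_length=hop_length)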
In one embodiment, the first electronic device may input the embedded information and the encoded phoneme data to an attention module, perform weighting processing on the embedded information and the encoded phoneme data through attention weights to obtain encoded information, and then perform decoding processing on the encoded information through a context decoder to obtain a speech spectrum.
Illustratively, the attention module may include SENet (Squeeze-and-Excitation Networks) or CBAM (Convolutional Block Attention Module), etc. The principle of SENet is to automatically learn the importance of each feature channel, and then, according to that importance, enhance useful features and suppress features that are not useful for the current task. CBAM contains two independent sub-modules, a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), which apply attention over channels and over spatial positions, respectively. This not only saves parameters and computation, but also allows CBAM to be integrated into existing network architectures as a plug-and-play module.
Taking the architecture diagram of the training system of the speech synthesis model shown in fig. 2 as an example, the first electronic device may input the embedded information and the encoded phoneme data to the attention module, and perform weighting processing on the embedded information and the encoded phoneme data through the attention weight to obtain the encoded information. The first electronic device may then input the encoded information to a context decoder, and decode the encoded information by the context decoder to obtain a speech spectrum. Further, the first electronic device may input the voice spectrum to the vocoder, and convert the generated voice spectrum into a voice waveform through the vocoder, thereby obtaining the target voice information.
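A minimal sketch of this weighting step is given below: attention weights are predicted from the embedded information and applied channel-wise to the encoded phoneme data, in the spirit of the SENet-style gating mentioned above. The module name and dimensions are assumptions, and this is not presented as the application's exact attention module.

    import torch
    import torch.nn as nn

    class FusionAttention(nn.Module):
        """Sketch: weight the encoded phoneme data with attention weights
        derived from the embedded information (SENet-style channel gating)."""
        def __init__(self, embed_dim=64, phoneme_dim=256):  # dimensions are assumed
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(embed_dim, phoneme_dim), nn.Sigmoid())

        def forward(self, embedding, phoneme_data):
            # embedding: (batch, embed_dim); phoneme_data: (batch, time, phoneme_dim)
            weights = self.gate(embedding).unsqueeze(1)  # per-channel attention weights
            return phoneme_data * weights                # weighted encoded information

    encoded = FusionAttention()(torch.randn(2, 64), torch.randn(2, 37, 256))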
In one embodiment, the first electronic device may perform context decoding on the embedded information and the phoneme data by using an autoregressive mode through a context decoder in the speech synthesis model to obtain a speech spectrum.
In a specific implementation, the context decoder may include an autoregressive module. Autoregression (AR) is a statistical method that models a time series using earlier values of the same variable, i.e., x_1 to x_{t-1}, to predict the current value x_t. That is, the autoregressive module can predict what is likely to follow from the preceding content, thereby obtaining the speech spectrum. In the method and the device, performing context decoding in an autoregressive manner makes the synthesized speech frames smoother and the synthesized content more natural.
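The autoregressive generation of the speech spectrum can be illustrated with the toy decoder below, in which each frame is predicted from the fused encoding and the previously generated frame; the GRU-based structure and all sizes are assumptions made purely for this sketch.

    import torch
    import torch.nn as nn

    class AutoregressiveDecoder(nn.Module):
        """Sketch: predict each spectrum frame x_t from the context and the
        previously generated frames, one step at a time."""
        def __init__(self, context_dim=256, n_mels=80):  # assumed sizes
            super().__init__()
            self.rnn = nn.GRUCell(context_dim + n_mels, 256)
            self.proj = nn.Linear(256, n_mels)
            self.n_mels = n_mels

        def forward(self, context, n_frames=100):
            # context: (batch, context_dim), e.g. the weighted encoded information
            h = torch.zeros(context.size(0), 256)
            prev = torch.zeros(context.size(0), self.n_mels)  # previous frame x_{t-1}
            frames = []
            for _ in range(n_frames):
                h = self.rnn(torch.cat([context, prev], dim=-1), h)
                prev = self.proj(h)                           # current frame x_t
                frames.append(prev)
            return torch.stack(frames, dim=1)                 # speech spectrum (batch, frames, n_mels)

    spectrum = AutoregressiveDecoder()(torch.randn(2, 256))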
In one embodiment, the context decoder may include a Bidirectional Encoder Representations from Transformers (BERT) module, which essentially learns good feature representations for words by running a self-supervised learning method over large corpora, i.e., supervised learning on data without manual labels.
S105, training the voice synthesis model according to the training voice information and the target voice information to obtain the trained voice synthesis model.
In specific implementation, the first electronic device may compare the training speech information with target speech information in a training sample to obtain a loss value of the speech synthesis model, and then train the speech synthesis model based on the loss value to obtain the trained speech synthesis model.
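A training step of this kind could look like the following sketch, in which the difference between the target speech spectrum produced by the model and the spectrum of the training speech is used as the loss; the L1 loss, the optimizer choice and the function names are assumptions, and `model` stands for any decoder such as the modules sketched above.

    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, training_spectrum, embedding, phoneme_data):
        """Sketch of one training iteration of the speech synthesis model."""
        target_spectrum = model(embedding, phoneme_data)        # decoded target speech
        loss = F.l1_loss(target_spectrum, training_spectrum)    # compare with training speech
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    # usage (assumed): optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)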
Through the embodiment of the application, the speech synthesis model can be trained by using a single corpus, and the trained speech synthesis model can generate high-quality, natural-sounding speech information.
In the embodiment of the application, a training sample is obtained, the training sample comprises training voice information and training text information corresponding to the training voice information, and the content indicated by the training voice information is the same as that indicated by the training text information; coding the training voice information through a parameter coder to obtain embedded information of the training voice information; coding the training text information through a speech synthesis model to obtain phoneme data of the training text information; decoding the embedded information and the phoneme data through a voice synthesis model to obtain target voice information; the speech synthesis model is trained according to the training speech information and the target speech information to obtain the trained speech synthesis model, the speech synthesis model can be trained by using one speech corpus, the training of the speech synthesis model can be conveniently realized, and the training efficiency of the speech synthesis model is improved.
Referring to fig. 3, fig. 3 is a schematic flowchart of a speech synthesis method according to an embodiment of the present application; the speech synthesis method may be performed by the second electronic device, and the scheme may include, but is not limited to, step S301 to step S304, wherein:
s301, acquiring text information to be synthesized.
For example, the second electronic device runs a reading client that provides a listen-to-book function. If a user submits a listen instruction for a certain piece of text information (such as a novel or a poem), the second electronic device may obtain that text information after detecting the instruction, and this text information is the text information to be synthesized. For another example, in a scenario where it is inconvenient for the user to look at the device, such as when driving or in a bumpy environment, a session interface in an instant messaging client may contain at least one piece of text information; if the user needs to convert a certain piece of text information into speech, the user can submit a speech conversion instruction for it, and the second electronic device obtains that text information after detecting the instruction. For another example, when the user interacts with an intelligent customer service client on the second electronic device and submits interaction information (which may be text or speech), the intelligent customer service client may determine, based on the interaction information, the text information to be output to the user, and this text information is the text information to be synthesized. For another example, during electronic navigation, the second electronic device may acquire navigation information (e.g., indicating going straight ahead, turning left or turning right), and this navigation information is the text information to be synthesized. For another example, during intelligent diagnosis and treatment or remote consultation, if a patient cannot look at the device for physical reasons (for example, the patient cannot move and the second electronic device is at some distance from the patient), the text information input by the opposite-end user is the text information to be synthesized; taking intelligent diagnosis and treatment as an example, the opposite-end user may be an intelligent diagnosis and treatment assistant, and taking remote consultation as an example, the opposite-end user may be a doctor, which is not specifically limited by the embodiment of the present application.
The second electronic device can be any one or more of a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent vehicle-mounted device and an intelligent wearable device. Illustratively, the second electronic device may be a device running a reading client, an instant messaging client, or a smart customer service client. The second electronic device and the first electronic device may be the same device or different devices, and are not specifically limited by the embodiments of the present application.
S302, the text information to be synthesized is coded through the trained speech synthesis model, and phoneme data of the text information to be synthesized is obtained.
The trained speech synthesis model may include a G2P module and a phoneme coder.
In specific implementation, when the second electronic device needs to perform speech synthesis on the text information to be synthesized, the text information to be synthesized may be input to the trained speech synthesis model, the trained speech synthesis model may perform word segmentation processing on the text information to be synthesized to obtain a text character string, then the text character string is converted into a phoneme sequence through the G2P module, and the phoneme sequence is subjected to phoneme coding processing through the phoneme coder to obtain coded phoneme data. The context decoder decodes the embedded information and the encoded phoneme data, generates a voice spectrum in a sequence-to-sequence manner, converts the generated voice spectrum into a voice waveform through a vocoder, obtains predicted voice information, and then outputs the predicted voice information.
Taking the schematic architecture of the speech synthesis system shown in fig. 4 as an example, the second electronic device may perform word segmentation processing on the text information to be synthesized to obtain a text character string, and input the text character string to the G2P module. The text string is then converted to a phoneme sequence by the G2P module, which is input to the phoneme coder. And carrying out phoneme coding processing on the phoneme sequence through a phoneme coder to obtain coded phoneme data.
It should be noted that, in the embodiment of the present application, implementation logic for performing encoding processing on text information to be synthesized is similar to implementation logic for performing encoding processing on training text information in the embodiment shown in fig. 1, and a specific implementation process of step S302 may refer to related description of a specific implementation process shown in step S103 in the embodiment shown in fig. 1, which is not described herein again.
And S303, decoding phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information.
The trained speech synthesis model may also include a context decoder and a vocoder.
In specific implementation, the second electronic device decodes the preset embedded information and the coded phoneme data through a context decoder, generates a speech spectrum in a sequence-to-sequence manner, converts the generated speech spectrum into a speech waveform through a vocoder, obtains predicted speech information, and then outputs the predicted speech information.
Taking the schematic architecture diagram of the speech synthesis system shown in fig. 4 as an example, after the second electronic device obtains the encoded phoneme data, the phoneme data may be input to a context decoder, and the context decoder decodes the preset embedded information and the encoded phoneme data to obtain a speech spectrum. The second electronic device inputs the voice spectrum into the vocoder, and then converts the voice spectrum into a voice waveform through the vocoder, so as to obtain the predicted voice information.
The preset embedded information is embedded information that has been set in advance. For example, frequency domain conversion is performed on the speech information of a specific user to obtain its spectral features, representation learning is performed on the spectral features through the parameter encoder to obtain the parameters of a normal distribution, a normal distribution is then constructed based on these parameters, and the preset embedded information is obtained by sampling from the constructed distribution. Based on this, for any text information to be synthesized, the trained speech synthesis model decodes the preset embedded information together with the phoneme data of the text information to be synthesized to obtain the predicted speech information. For another example, the speech information of each of multiple users is subjected to frequency domain conversion to obtain its spectral features, representation learning is performed on the spectral features through the parameter encoder to obtain the parameters of a normal distribution, a normal distribution is then constructed based on these parameters, and the embedded information of each user is obtained by sampling from the constructed distribution. Based on this, if the user of the second electronic device wishes to adopt the voice of a target user among the multiple users (for example, Liu Dehua), the user may select the embedded information of that target user through the second electronic device; the second electronic device takes the selected embedded information as the preset embedded information and decodes it together with the phoneme data of the text information to be synthesized to obtain the predicted speech information. This ensures that the predicted speech information closely matches the audio that the target user would produce for the text information to be synthesized, and thus that the predicted speech information meets the user's wishes, thereby improving user stickiness.
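Putting these pieces together, inference with a preset embedded information might look like the sketch below; `phoneme_encoder`, `context_decoder` and `vocoder` are hypothetical attribute and function names used only to show the data flow, not the application's actual interfaces.

    import torch

    def synthesize(text, preset_embedding, model, vocoder):
        """Sketch: synthesize predicted speech for text using a preset embedding
        selected for a target speaker."""
        phoneme_ids = torch.tensor([text_to_phoneme_ids(text)])           # G2P step (toy sketch above)
        phoneme_data = model.phoneme_encoder(phoneme_ids)                 # encode the phoneme sequence
        spectrum = model.context_decoder(preset_embedding, phoneme_data)  # decode with the preset embedding
        return vocoder(spectrum)                                          # predicted speech waveform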
It should be noted that, in the embodiment of the present application, implementation logic for performing decoding processing on phoneme data of text information to be synthesized is similar to implementation logic for performing decoding processing on phoneme data of training text information in the embodiment shown in fig. 1, and a specific implementation process of step S303 may refer to related description of a specific implementation process shown in step S104 in the embodiment shown in fig. 1, and details are not repeated here.
S304, outputting the predicted voice information.
In a specific implementation, after the second electronic device obtains the predicted voice information, it may display the voice information; after the user performs a playing operation on the voice information (for example, clicking or long-pressing it), the second electronic device may generate a playing instruction in response to the playing operation and play the predicted voice information. Alternatively, after the second electronic device acquires the predicted voice information, it may play the predicted voice information directly. By directly playing the predicted voice information, the embodiment of the application enables the user to learn the specific content of the text information to be synthesized without looking at the second electronic device.
In an embodiment, when detecting a speech synthesis instruction for the text information to be synthesized, the second electronic device may acquire historical speech information of a user outputting the text information to be synthesized, analyze and process the historical speech information to obtain acoustic features of the user, and decode phoneme data and the acoustic features of the text information to be synthesized through a trained speech synthesis model to obtain predicted speech information.
In the embodiment of the application, after detecting the voice conversion instruction of the text information to be synthesized, the second electronic device may analyze the historical voice information of the user outputting the text information to be synthesized to obtain parameters such as the tone of the user, and perform voice synthesis on the text information to be synthesized based on the determined parameters such as the tone, so that the predicted voice information after voice synthesis is closer to the pronunciation of the user.
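As an illustration only, simple acoustic features of the user's historical speech could be extracted as in the sketch below (assuming librosa is available); the specific features, their dimensions and the idea of averaging them are assumptions, not the application's analysis procedure.

    import numpy as np
    import librosa

    def extract_acoustic_features(historical_wav_path):
        """Sketch: derive simple acoustic features (pitch and MFCCs) from a
        user's historical speech to condition the decoder on that user's voice."""
        y, sr = librosa.load(historical_wav_path, sr=16000)
        f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)           # fundamental frequency track
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # timbre-related coefficients
        return np.concatenate([[np.nanmean(f0)], mfcc.mean(axis=1)])  # compact feature vector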
In an embodiment, after detecting the voice conversion instruction for the text information to be synthesized, the second electronic device may perform emotion analysis on the text information to be synthesized to obtain parameters such as a tone representing emotion, and perform speech synthesis on the text information to be synthesized based on the determined parameters, which can reduce the mechanical feel of the synthesized speech and make the predicted speech information more natural.
In the embodiment of the application, the text information to be synthesized is obtained, the trained speech synthesis model is used for coding the text information to be synthesized to obtain the phoneme data of the text information to be synthesized, the trained speech synthesis model is used for decoding the phoneme data of the text information to be synthesized to obtain the predicted speech information, the predicted speech information is output, and the high quality and the high naturalness of the predicted speech information can be ensured.
The embodiment of the present application further provides a computer storage medium, in which program instructions are stored, and when the program instructions are executed, the computer storage medium is used for implementing the corresponding method described in the above embodiment.
Referring to fig. 5 again, fig. 5 is a schematic structural diagram of a training apparatus for providing a speech synthesis model according to an embodiment of the present application.
In one implementation of the apparatus of the embodiment of the application, the apparatus includes the following structure.
An obtaining unit 501, configured to obtain a training sample, where the training sample includes training speech information and training text information corresponding to the training speech information, and contents indicated by the training speech information and the training text information are the same;
a processing unit 502, configured to perform encoding processing on the training speech information through a parameter encoder to obtain embedded information of the training speech information;
the processing unit 502 is further configured to perform coding processing on the training text information through a speech synthesis model to obtain phoneme data of the training text information;
the processing unit 502 is further configured to decode the embedded information and the phoneme data through the speech synthesis model to obtain target speech information;
the processing unit 502 is further configured to train the speech synthesis model according to the training speech information and the target speech information, so as to obtain a trained speech synthesis model.
In one embodiment, the processing unit 502 performs an encoding process on the training speech information through a parameter encoder to obtain embedded information of the training speech information, including:
performing frequency domain conversion on the training voice information to obtain the frequency spectrum characteristics of the training voice information;
performing representation learning on the spectrum characteristics through the parameter encoder to obtain parameters of a normal distribution;
and constructing a normal distribution based on the parameters of the normal distribution, and sampling from the constructed normal distribution to obtain the embedded information.
In one embodiment, the processing unit 502 performs encoding processing on the training text information through a speech synthesis model to obtain phoneme data of the training text information, including:
performing word segmentation processing on the training text information through the voice synthesis model to obtain a text character string;
converting the text character string to obtain a phoneme sequence through a G2P module in the speech synthesis model;
and performing phoneme coding processing on the phoneme sequence through a phoneme coder in the speech synthesis model to obtain the phoneme data.
In one embodiment, the processing unit 502 performs decoding processing on the embedded information and the phoneme data through the speech synthesis model to obtain target speech information, including:
decoding the embedded information and the phoneme data through a context decoder in the speech synthesis model to obtain a speech frequency spectrum;
and converting the voice frequency spectrum to obtain the target voice information through a vocoder in the voice synthesis model.
In one embodiment, the processing unit 502 performs a decoding process on the embedded information and the phoneme data through a context decoder in the speech synthesis model to obtain a speech spectrum, including:
and performing context decoding on the embedded information and the phoneme data by adopting an autoregressive mode through a context decoder in the speech synthesis model to obtain the speech frequency spectrum.
In one embodiment, the obtaining unit 501 is further configured to obtain text information to be synthesized;
the processing unit 502 is further configured to perform coding processing on the text information to be synthesized through the trained speech synthesis model to obtain phoneme data of the text information to be synthesized;
the processing unit 502 is further configured to decode phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information;
the apparatus may further comprise an output unit 503;
an output unit 503, configured to output the predicted speech information.
In one embodiment, the obtaining unit 501 obtains text information to be synthesized, including:
when a voice synthesis instruction for the text information to be synthesized is detected, acquiring historical voice information of a user outputting the text information to be synthesized;
analyzing and processing the historical voice information to obtain the acoustic characteristics of the user;
the processing unit 502 decodes the phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information, including:
and decoding the phoneme data of the text information to be synthesized and the acoustic features through the trained speech synthesis model to obtain the predicted speech information.
In the embodiment of the application, a training sample is obtained, the training sample comprises training voice information and training text information corresponding to the training voice information, and the content indicated by the training voice information is the same as that indicated by the training text information; coding the training voice information through a parameter coder to obtain embedded information of the training voice information; coding the training text information through a speech synthesis model to obtain phoneme data of the training text information; decoding the embedded information and the phoneme data through a voice synthesis model to obtain target voice information; the speech synthesis model is trained according to the training speech information and the target speech information to obtain the trained speech synthesis model, the speech synthesis model can be trained by using one speech corpus, the training of the speech synthesis model can be conveniently realized, and the training efficiency of the speech synthesis model is improved.
Referring to fig. 6 again, fig. 6 is a schematic structural diagram of an electronic device provided in the embodiment of the present application, where the electronic device in the embodiment of the present application includes a power supply module and the like, and includes a processor 601, a memory 602, and a communication interface 603. Data can be exchanged between the processor 601, the memory 602 and the communication interface 603, and a corresponding data processing scheme is implemented by the processor 601.
The memory 602 may include volatile memory (volatile memory), such as random-access memory (RAM); the memory 602 may also include a non-volatile memory (non-volatile memory), such as a flash memory (flash memory), a solid-state drive (SSD), etc.; the memory 602 may also comprise a combination of memories of the kind described above.
The processor 601 may be a Central Processing Unit (CPU). The processor 601 may also be a combination of a CPU and a GPU. The electronic device may include multiple CPUs and GPUs as necessary to perform the corresponding data processing. In one embodiment, the memory 602 is used to store program instructions. The processor 601 may invoke the program instructions to implement the various methods described above in the embodiments of the present application.
In a first possible implementation, the processor 601 of the electronic device calls the program instructions stored in the memory 602 for performing the following operations:
acquiring a training sample, wherein the training sample comprises training voice information and training text information corresponding to the training voice information, and the content indicated by the training voice information is the same as that indicated by the training text information;
coding the training voice information through a parameter coder to obtain embedded information of the training voice information;
coding the training text information through a speech synthesis model to obtain phoneme data of the training text information;
decoding the embedded information and the phoneme data through the speech synthesis model to obtain target speech information;
and training the voice synthesis model according to the training voice information and the target voice information to obtain the trained voice synthesis model.
In an embodiment, when the processor 601 performs an encoding process on the training speech information through a parameter encoder to obtain embedded information of the training speech information, the following operations are specifically performed:
performing frequency domain conversion on the training voice information to obtain the frequency spectrum characteristics of the training voice information;
performing representation learning on the spectrum characteristics through the parameter encoder to obtain parameters of a normal distribution;
and constructing a normal distribution based on the parameters of the normal distribution, and sampling from the constructed normal distribution to obtain the embedded information.
In an embodiment, when the processor 601 performs encoding processing on the training text information through a speech synthesis model to obtain phoneme data of the training text information, specifically perform the following operations:
performing word segmentation processing on the training text information through the voice synthesis model to obtain a text character string;
converting the text character string to obtain a phoneme sequence through a G2P module in the speech synthesis model;
and performing phoneme coding processing on the phoneme sequence through a phoneme coder in the speech synthesis model to obtain the phoneme data.
In an embodiment, when the processor 601 performs decoding processing on the embedded information and the phoneme data through the speech synthesis model to obtain target speech information, specifically perform the following operations:
decoding the embedded information and the phoneme data through a context decoder in the speech synthesis model to obtain a speech frequency spectrum;
and converting the speech frequency spectrum into the target speech information through a vocoder in the speech synthesis model.
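As a hedged illustration of the vocoder step, Griffin-Lim reconstruction via librosa is used below as a stand-in; the patent does not prescribe a particular vocoder, and a neural vocoder could equally be substituted.

```python
# Stand-in vocoder: convert a log-mel speech spectrum back to a waveform with
# Griffin-Lim. This is an assumption for illustration, not the patent's vocoder.
import librosa
import numpy as np
import torch

def vocode(log_mel, sr=16000):
    """log_mel: torch tensor of shape (frames, n_mels) produced by the decoder."""
    mel = np.exp(log_mel.detach().cpu().numpy().T)            # undo log, (n_mels, frames)
    return librosa.feature.inverse.mel_to_audio(mel, sr=sr)   # waveform as np.ndarray
```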
In an embodiment, when decoding the embedded information and the phoneme data through a context decoder in the speech synthesis model to obtain a speech spectrum, the processor 601 specifically performs the following operation (a code sketch follows):
and performing context decoding on the embedded information and the phoneme data by adopting an autoregressive mode through a context decoder in the speech synthesis model to obtain the speech frequency spectrum.
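A toy sketch of autoregressive context decoding: each spectrum frame is predicted from the conditioning context (embedded information plus phoneme data) together with the previously generated frame. The GRU-based module below is an assumed stand-in, not the patent's actual context decoder.

```python
# Hypothetical autoregressive context decoder; layer choices are illustrative only.
import torch
import torch.nn as nn

class ContextDecoder(nn.Module):
    def __init__(self, ctx_dim: int = 64, n_mels: int = 80, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRUCell(ctx_dim + n_mels, hidden)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, context, n_frames):
        # context: (batch, ctx_dim), pooled from the embedded information and phoneme data
        batch = context.size(0)
        h = context.new_zeros(batch, self.rnn.hidden_size)
        frame = context.new_zeros(batch, self.out.out_features)
        frames = []
        for _ in range(n_frames):                 # autoregressive loop over spectrum frames
            h = self.rnn(torch.cat([context, frame], dim=-1), h)
            frame = self.out(h)                   # next frame conditioned on the previous one
            frames.append(frame)
        return torch.stack(frames, dim=1)         # (batch, n_frames, n_mels) speech spectrum

dec = ContextDecoder()
spectrum = dec(torch.randn(2, 64), n_frames=100)   # (2, 100, 80)
```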
In one embodiment, the processor 601 calls the program instructions stored in the memory 602 and is further configured to perform the following operations (a code sketch follows the list):
acquiring text information to be synthesized;
coding the text information to be synthesized through the trained speech synthesis model to obtain phoneme data of the text information to be synthesized;
decoding the phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information;
and the communication interface 603 is used to output the predicted speech information.
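An illustrative inference flow under the assumption of hypothetical encode_text and decode methods on the trained model; writing an audio file stands in for outputting the predicted speech information through the communication interface.

```python
# Hypothetical inference sketch; the method names on trained_model are assumptions.
import soundfile as sf

def synthesize(trained_model, vocoder_fn, text, out_path="predicted.wav", sr=16000):
    phoneme_data = trained_model.encode_text(text)   # encode the text to be synthesized
    log_mel = trained_model.decode(phoneme_data)     # decode into a speech spectrum
    wav = vocoder_fn(log_mel, sr=sr)                 # e.g. the vocode() stand-in above
    sf.write(out_path, wav, sr)                      # output the predicted speech information
```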
In one embodiment, when acquiring the text information to be synthesized, the processor 601 specifically performs the following operations:
when a voice synthesis instruction for the text information to be synthesized is detected, acquiring historical voice information of a user outputting the text information to be synthesized;
analyzing and processing the historical voice information to obtain the acoustic features of the user;
when decoding the phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain the predicted speech information, the processor 601 specifically performs the following operation (a code sketch follows):
and decoding the phoneme data of the text information to be synthesized and the acoustic features through the trained speech synthesis model to obtain the predicted speech information.
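A hedged sketch of this personalized variant, reusing the spectrum_features function and ParameterEncoder class from the earlier sketches as stand-ins for the analysis of the historical speech; conditioning the decoder on the resulting acoustic features is the assumption being illustrated.

```python
# Hypothetical personalized synthesis: the user's historical speech is analyzed into
# acoustic features, and decoding is conditioned on those features so the predicted
# speech resembles the user's voice. Method names are assumptions for illustration.
# Assumes spectrum_features() and ParameterEncoder from the earlier sketch are in scope.
def synthesize_for_user(trained_model, param_enc, history_wav, text, sr=16000):
    mel_hist = spectrum_features(history_wav, sr=sr)   # analyze the historical speech
    acoustic_features = param_enc(mel_hist)            # user's acoustic features
    phoneme_data = trained_model.encode_text(text)     # phoneme data of the text
    return trained_model.decode(phoneme_data, acoustic_features)  # predicted speech
```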
In the embodiment of the application, a training sample is acquired, wherein the training sample comprises training voice information and training text information corresponding to the training voice information, and the content indicated by the training voice information is the same as that indicated by the training text information; the training voice information is encoded through a parameter encoder to obtain embedded information of the training voice information; the training text information is encoded through a speech synthesis model to obtain phoneme data of the training text information; the embedded information and the phoneme data are decoded through the speech synthesis model to obtain target speech information; and the speech synthesis model is trained according to the training voice information and the target speech information to obtain the trained speech synthesis model. In this way, the speech synthesis model can be trained with a single speech corpus, which makes the training convenient to carry out and improves the training efficiency of the speech synthesis model.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like. The computer-readable storage medium may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created according to the use of the blockchain node, and the like.
The blockchain referred to here is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to a number of embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for training a speech synthesis model, comprising:
acquiring a training sample, wherein the training sample comprises training voice information and training text information corresponding to the training voice information, and the content indicated by the training voice information is the same as that indicated by the training text information;
coding the training voice information through a parameter encoder to obtain embedded information of the training voice information;
coding the training text information through a speech synthesis model to obtain phoneme data of the training text information;
decoding the embedded information and the phoneme data through the speech synthesis model to obtain target speech information;
and training the speech synthesis model according to the training voice information and the target speech information to obtain the trained speech synthesis model.
2. The method of claim 1, wherein the coding the training voice information through the parameter encoder to obtain the embedded information of the training voice information comprises:
performing frequency-domain conversion on the training voice information to obtain spectral features of the training voice information;
performing representation learning on the spectral features through the parameter encoder to obtain parameters of a normal distribution;
and constructing a normal distribution based on the parameters of the normal distribution, and sampling from the constructed normal distribution to obtain the embedded information.
3. The method of claim 1, wherein the encoding the training text information by the speech synthesis model to obtain the phoneme data of the training text information comprises:
performing word segmentation processing on the training text information through the speech synthesis model to obtain a text character string;
converting the text character string into a phoneme sequence through a grapheme-to-phoneme (G2P) module in the speech synthesis model;
and performing phoneme coding processing on the phoneme sequence through a phoneme coder in the speech synthesis model to obtain the phoneme data.
4. The method of claim 1, wherein said decoding the embedded information and the phoneme data by the speech synthesis model to obtain target speech information comprises:
decoding the embedded information and the phoneme data through a context decoder in the speech synthesis model to obtain a speech frequency spectrum;
and converting the speech frequency spectrum into the target speech information through a vocoder in the speech synthesis model.
5. The method of claim 4, wherein said decoding the embedded information and the phoneme data by a context decoder in the speech synthesis model to obtain a speech spectrum comprises:
and performing context decoding on the embedded information and the phoneme data by adopting an autoregressive mode through a context decoder in the speech synthesis model to obtain the speech frequency spectrum.
6. The method of claim 1, wherein the method further comprises:
acquiring text information to be synthesized;
coding the text information to be synthesized through the trained speech synthesis model to obtain phoneme data of the text information to be synthesized;
decoding the phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information;
and outputting the predicted voice information.
7. The method of claim 6, wherein the obtaining text information to be synthesized comprises:
when a voice synthesis instruction for the text information to be synthesized is detected, acquiring historical voice information of a user outputting the text information to be synthesized;
analyzing and processing the historical voice information to obtain acoustic features of the user;
the decoding processing is performed on the phoneme data of the text information to be synthesized through the trained speech synthesis model to obtain predicted speech information, and the decoding processing comprises the following steps:
and decoding the phoneme data of the text information to be synthesized and the acoustic features through the trained speech synthesis model to obtain the predicted speech information.
8. An apparatus for training a speech synthesis model, the apparatus comprising:
an acquisition unit, configured to acquire a training sample, wherein the training sample comprises training voice information and training text information corresponding to the training voice information, and the training voice information and the training text information have the same content;
a processing unit, configured to code the training voice information through a parameter encoder to obtain embedded information of the training voice information;
the processing unit is further configured to perform coding processing on the training text information through a speech synthesis model to obtain phoneme data of the training text information;
the processing unit is further configured to decode the embedded information and the phoneme data through the speech synthesis model to obtain target speech information;
and the processing unit is further configured to train the speech synthesis model according to the training speech information and the target speech information, so as to obtain a trained speech synthesis model.
9. An electronic device comprising a processor, a memory and a communication interface, the processor, the memory and the communication interface being interconnected, wherein the memory is configured to store computer program instructions, and the processor is configured to execute the program instructions to implement the method of training a speech synthesis model according to any one of claims 1-7.
10. A computer-readable storage medium having computer program instructions stored thereon which, when executed by a processor, implement the method for training a speech synthesis model according to any one of claims 1-7.
CN202111142243.5A 2021-09-28 2021-09-28 Training method, device, equipment and medium of speech synthesis model Pending CN113870827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111142243.5A CN113870827A (en) 2021-09-28 2021-09-28 Training method, device, equipment and medium of speech synthesis model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111142243.5A CN113870827A (en) 2021-09-28 2021-09-28 Training method, device, equipment and medium of speech synthesis model

Publications (1)

Publication Number Publication Date
CN113870827A true CN113870827A (en) 2021-12-31

Family

ID=78991954

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111142243.5A Pending CN113870827A (en) 2021-09-28 2021-09-28 Training method, device, equipment and medium of speech synthesis model

Country Status (1)

Country Link
CN (1) CN113870827A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024093588A1 (en) * 2022-11-04 2024-05-10 腾讯科技(深圳)有限公司 Method and apparatus for training speech synthesis model, device, storage medium and program product


Similar Documents

Publication Publication Date Title
US20230064749A1 (en) Two-Level Speech Prosody Transfer
CN114175143A (en) Controlling expressiveness in an end-to-end speech synthesis system
CN112687259A (en) Speech synthesis method, device and readable storage medium
US11475874B2 (en) Generating diverse and natural text-to-speech samples
US20230035504A1 (en) Audio processing method and apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product
CN113761841B (en) Method for converting text data into acoustic features
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
KR102137523B1 (en) Method of text to speech and system of the same
CN113035228A (en) Acoustic feature extraction method, device, equipment and storage medium
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
KR20230075340A (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN113870827A (en) Training method, device, equipment and medium of speech synthesis model
CN117316140A (en) Speech synthesis method, apparatus, device, storage medium, and program product
CN116884386A (en) Speech synthesis method, speech synthesis apparatus, device, and storage medium
CN113870838A (en) Voice synthesis method, device, equipment and medium
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN114203151A (en) Method, device and equipment for training speech synthesis model
CN114495896A (en) Voice playing method and computer equipment
Nazir et al. Multi speaker text-to-speech synthesis using generalized end-to-end loss function
CN113450765B (en) Speech synthesis method, device, equipment and storage medium
US11915689B1 (en) Generating audio using auto-regressive generative neural networks
US20240233713A1 (en) Generating audio using auto-regressive generative neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination