CN116844519A - Speech synthesis method, device, electronic equipment and storage medium - Google Patents

Speech synthesis method, device, electronic equipment and storage medium

Info

Publication number
CN116844519A
Authority
CN
China
Prior art keywords
data
spectrum
sample
flow module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310654875.2A
Other languages
Chinese (zh)
Inventor
郭璇
缪陈峰
马骏
王少军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310654875.2A
Publication of CN116844519A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Learning methods
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 using predictive techniques
    • G10L 19/16 Vocoder architecture
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 characterised by the type of extracted parameters
    • G10L 25/24 the extracted parameters being the cepstrum
    • G10L 25/27 characterised by the analysis technique
    • G10L 25/30 using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application relates to the field of financial technology, and in particular to a speech synthesis method, apparatus, electronic device and storage medium. Target feature data and first spectrum coding data are input into a speech synthesis model, which outputs a target Mel spectrum of a target object; the target Mel spectrum is then input into a pre-trained vocoder model, which outputs second voice data. The vocoder model comprises a plurality of Flow modules of different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the Flow modules decrease successively along the data processing direction. In this way, the dimensions of the Flow modules in the vocoder model are enlarged, the learning capacity of the vocoder model is improved, and the quality of the voice data obtained by converting the target Mel spectrum is improved; when the output second voice data is used for business communication with a user, the user experience can be improved and the efficiency of communication between the user and the financial industry can be increased.

Description

Speech synthesis method, device, electronic equipment and storage medium
Technical Field
The application relates to the field of financial technology, and in particular to a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium.
Background
In the financial industry, it is often necessary to conduct business communication with, or send business reminders to, a called party. For example, when an insurance policy approaches its payment date or its reinstatement period, the user is reminded to pay; as another example, a bank may need to pursue collection when a borrower dies during the loan term or fails to repay on time when the loan matures. In such application scenarios in the financial industry, business communication or business reminding is generally performed by having an artificial-intelligence robot call the user, and the communication speech between the artificial-intelligence robot and the user is obtained by using speech synthesis technology.
In the above application scenarios of the financial industry, speech synthesis is generally performed based on TTS (text-to-speech), i.e., algorithms that synthesize speech from text. The currently prevailing approach to the speech synthesis task is neural-network based (Neural TTS). In such a speech synthesis task, the vocoder is a critical module: it converts the synthesized low-resolution acoustic features (such as the Mel spectrum) into a high-resolution speech waveform output, and therefore directly determines the quality of the synthesized speech.
In the prior art, in order to improve inference efficiency, a Flow-based WaveGlow vocoder is generally adopted to map an unknown, complex distribution to a known, simple distribution, after which maximum likelihood estimation is performed. Under the network structure of the WaveGlow vocoder, in order to guarantee that each layer of the network is invertible, every layer must keep the same dimension. This limits the ability of the WaveGlow vocoder to learn speech features, which is unfavourable for improving the synthesis quality of the voice data, the user experience, and the efficiency of communication between the financial industry and users.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a method, an apparatus, an electronic device, and a storage medium for speech synthesis, so as to solve the above technical problems.
In a first aspect, an embodiment of the present application provides a method for synthesizing speech, including:
respectively acquiring a phoneme sequence and first spectrum coding data according to the first voice data;
acquiring target feature data according to the phoneme sequence and the speaker feature of the target object;
inputting the target characteristic data and the first spectrum coding data into a pre-trained voice synthesis model, and outputting a target Mel spectrum of the target object;
the target Mel spectrum is input into a pre-trained vocoder model, and second voice data is output, wherein the vocoder model comprises a plurality of Flow modules with different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the plurality of Flow modules are sequentially reduced along the data processing direction.
As one embodiment, the training step of the vocoder model includes:
obtaining at least one training sample, the training sample comprising sample speech data and a sample mel spectrum extracted from the sample speech data;
Splicing the sample voice data and Gaussian noise data to form first input data, inputting the first input data to a first Flow module in a reverse processing direction, and outputting corresponding first output data;
splicing output data output by a previous Flow module with Gaussian noise data to form current input data, inputting the current input data to a current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a reverse processing direction;
repeating the steps until the last Flow module in the reverse processing direction outputs the predicted spectrum coding data;
calculating errors according to the predicted spectrum coding data and the spectrum coding data of the sample Mel spectrum, and adjusting parameters of the vocoder model according to the errors until the vocoder model reaches a training convergence condition.
As one embodiment, the (i+1)-th Flow module in the reverse processing direction satisfies:

$$z_{i+1} = f_{i+1}(\hat{z}_i), \qquad \hat{z}_i = [z_i, e_i], \qquad \log p(\hat{z}_i) = \log p(z_{i+1}) + \log\left|\det\!\left(\frac{\partial f_{i+1}(\hat{z}_i)}{\partial \hat{z}_i}\right)\right|$$

where $z_i$ is the output of the $i$-th Flow module in the reverse processing direction, $e_i$ is the Gaussian noise data spliced to $z_i$, $\hat{z}_i$ is the input of the $(i+1)$-th Flow module in the reverse processing direction, $z_{i+1}$ is the output of the $(i+1)$-th Flow module in the reverse processing direction, $p(\hat{z}_i)$ is the probability distribution of $\hat{z}_i$, $p(z_{i+1})$ is the probability distribution of $z_{i+1}$, $\partial f_{i+1}(\hat{z}_i)/\partial \hat{z}_i$ is the Jacobian matrix of $f_{i+1}(\hat{z}_i)$ with respect to $\hat{z}_i$, and $f_{i+1}$ is the transformation function of the $(i+1)$-th Flow module in the reverse processing direction.
As one embodiment, the inputting the target mel spectrum into a pre-trained vocoder model, outputting second voice data, includes:
inputting the target Mel spectrum to a Mel spectrum encoder, and outputting target spectrum characteristic data of the target Mel spectrum;
inputting the target spectrum characteristic data to a first Flow module in a data processing direction, and outputting corresponding output data;
splitting output data output by a previous Flow module to obtain current input data and Gaussian noise data, inputting the current input data to the current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a data processing direction;
repeating the steps until the last Flow module in the data processing direction, wherein the last Flow module outputs the last output data;
Splitting the final output data into the second voice data and Gaussian noise data, and outputting the second voice data.
As an embodiment, each of the gaussian noise data is derived from the spectrum encoded data of the sample mel spectrum, respectively.
As one embodiment, the training step of the speech synthesis model includes:
respectively acquiring corresponding sample mel spectrums, sample phoneme sequences and sample speaker characteristic data according to the sample audio data;
acquiring sample target feature data according to the sample phoneme sequence and the sample speaker feature data, and acquiring sample spectrum feature data according to a sample Mel spectrum;
inputting the sample target feature data and the sample spectrum feature data into a speech synthesis model to be trained, and outputting a predicted target Mel spectrum;
and calculating a prediction error according to the sample Mel spectrum and the prediction target Mel spectrum, and adjusting parameters of the speech synthesis model according to the prediction error until the speech synthesis model reaches a training convergence condition.
As an embodiment, the loss function of the speech synthesis model is:

$$\mathcal{L}_{mel} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|^2$$

where $\mathcal{L}_{mel}$ is the loss value, $\widehat{mel}_i$ is the $i$-th predicted target Mel spectrum, $mel_i$ is the $i$-th sample Mel spectrum, and $N$ is the number of samples.
In a second aspect, an embodiment of the present application further provides a speech synthesis apparatus, including:
the feature extraction module is used for respectively acquiring a phoneme sequence and first spectrum coding data according to the first voice data;
the feature fusion module is used for acquiring target feature data according to the phoneme sequence and the speaker features of the target object;
the frequency spectrum synthesis module is used for inputting the target characteristic data and the first frequency spectrum coding data into a pre-trained voice synthesis model and outputting a target Mel spectrum of the target object;
the voice reconstruction module is used for inputting the target mel spectrum into a pre-trained vocoder model and outputting second voice data, wherein the vocoder model comprises a plurality of Flow modules with different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the Flow modules are sequentially reduced along the data processing direction.
In a third aspect, an embodiment of the present application further provides an electronic device, including a processor, and a memory coupled to the processor, the memory storing program instructions executable by the processor; the processor implements the above-described speech synthesis method when executing the program instructions stored in the memory.
In a fourth aspect, an embodiment of the present application further provides a storage medium, where program instructions are stored, where the program instructions implement the above-mentioned speech synthesis method when executed by a processor.
According to the speech synthesis method, apparatus, electronic device and storage medium of the present application, a phoneme sequence and first spectrum coding data are respectively acquired according to first voice data; target feature data are acquired according to the phoneme sequence and the speaker features of a target object; the target feature data and the first spectrum coding data are input into a pre-trained speech synthesis model, which outputs a target Mel spectrum of the target object; and the target Mel spectrum is input into a pre-trained vocoder model, which outputs second voice data, wherein the vocoder model comprises a plurality of Flow modules of different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the Flow modules decrease successively along the data processing direction. In this way, the dimensions of the Flow modules in the vocoder model are enlarged, the learning capacity of the vocoder model is improved, and the quality of the voice data obtained by converting the target Mel spectrum is improved; when the output second voice data is used for business communication with a user, the user experience can be improved and the efficiency of communication between the user and the financial industry can be increased.
These and other aspects of the application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart illustrating a speech synthesis method according to an embodiment of the application.
Fig. 2 is a schematic diagram showing a structure of a vocoder model in a speech synthesis method according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of a storage medium according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present application and are not to be construed as limiting the present application.
In order to enable those skilled in the art to better understand the solution of the present application, the following description will make clear and complete descriptions of the technical solution of the present application in the embodiments of the present application with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In the embodiments of the present application, it should be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In describing embodiments of the present application, words such as "exemplary" or "such as" are used to indicate an example, instance, or illustration. Any embodiment or design described as "exemplary" or "such as" in the embodiments of the application is not necessarily to be construed as preferred or more advantageous than another embodiment or design. The use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete manner.
In addition, the term "plurality" in the embodiments of the present application means two or more, and in view of this, the term "plurality" may be understood as "at least two" in the embodiments of the present application. "at least one" may be understood as one or more, for example as one, two or more. For example, including at least one means including one, two or more, and not limiting what is included, e.g., including at least one of A, B and C, then A, B, C, A and B, A and C, B and C, or A and B and C, may be included.
An embodiment of the application provides a speech synthesis method. The execution subject of the speech synthesis method includes, but is not limited to, at least one of a server, a terminal and the like that can be configured to execute the speech synthesis method provided by the embodiments of the application. In other words, the speech synthesis method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server side includes, but is not limited to, a single server, a server cluster, a cloud server, a cloud server cluster, and the like.
According to the application, the structure of the vocoder model is improved, the quality and the synthesis efficiency of the second voice data output by the vocoder model are improved, and when the output second voice data is utilized to carry out business communication with a user, the user experience can be improved, and the communication efficiency between the user and the financial industry is improved.
It should be noted that, although the speech synthesis method of the present application is particularly suitable for application scenarios in the financial field (for example, scenarios in which an artificial-intelligence robot calls a user for business communication or business reminding), the speech synthesis method of the present application can also be applied to other scenarios outside the financial field.
Fig. 1 is a flow chart of a speech synthesis method according to an embodiment of the application. It should be noted that, provided substantially the same result is obtained, the method of the present application is not limited to the order of the flow shown in Fig. 1. In this embodiment, the speech synthesis method includes the following steps:
s10, respectively acquiring a phoneme sequence and first spectrum coding data according to first voice data;
A phoneme is the smallest sound unit in speech and can be analyzed according to the articulatory actions within a syllable, one action forming one phoneme. For example, Chinese has 32 phonemes, which can be divided into initials and finals. Phonemes are generally represented in the form of the International Phonetic Alphabet (IPA), a system used for phonetic transcription. For example, when the Chinese word "平安" ("safe") is pronounced, three phonemes, "p", "ing" and "an", are actually produced in succession, and the corresponding pinyin is "ping an".
In this embodiment, speech recognition is performed on the first voice data to obtain corresponding first text information, which records the spoken content of the first voice data; the first text information may include Chinese or English words. Each word in the first text information is converted into the phonemes of the corresponding international phonetic symbols to form phoneme information corresponding to the first text information, and the character of each phoneme in the phoneme information is converted into a corresponding phoneme coding vector to form the phoneme sequence corresponding to the phoneme information, so that each phoneme in the phoneme sequence is represented as a phoneme coding vector. For example, the phoneme sequence may be obtained by encoding the phoneme information with an encoder such as one-hot encoding, a d-vector or an x-vector.
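By way of illustration only, the conversion from text information to a phoneme sequence might be organized as in the following sketch; the phoneme inventory, the lexicon entry and the one-hot encoding scheme shown here are assumptions for illustration and are not prescribed by this disclosure.

```python
# Illustrative sketch: text information -> phonemes -> phoneme coding vectors.
# The phoneme inventory, lexicon and one-hot scheme are assumptions, not part of this disclosure.
import numpy as np

PHONEME_INVENTORY = ["p", "ing", "an", "b", "ao", "x"]        # hypothetical subset of phonemes
PHONEME_TO_ID = {ph: i for i, ph in enumerate(PHONEME_INVENTORY)}

def text_to_phonemes(words):
    """Look up each word's phonemes; a real system would use a grapheme-to-phoneme front end."""
    lexicon = {"平安": ["p", "ing", "an"]}                      # hypothetical lexicon entry
    phonemes = []
    for word in words:
        phonemes.extend(lexicon.get(word, []))
    return phonemes

def phonemes_to_sequence(phonemes):
    """Encode each phoneme character as a phoneme coding vector (one-hot here)."""
    seq = np.zeros((len(phonemes), len(PHONEME_INVENTORY)), dtype=np.float32)
    for t, ph in enumerate(phonemes):
        seq[t, PHONEME_TO_ID[ph]] = 1.0
    return seq

phoneme_sequence = phonemes_to_sequence(text_to_phonemes(["平安"]))   # shape (3, 6)
```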
In this embodiment, the first voice data may be converted from a time-domain signal into frequency-domain signals with a preset number of windows through a short-time Fourier transform; the frequency-domain signals with the preset number of windows are converted from the frequency scale to the Mel scale to obtain a first Mel spectrogram, the first Mel spectrogram is input to a Mel-spectrum encoder, and the corresponding first spectrum coding data are output. The Mel-spectrum encoder extracts high-dimensional acoustic features of the first Mel spectrogram and outputs the corresponding high-dimensional feature vectors, and these feature vectors form the first spectrum coding data.
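A minimal sketch of the short-time Fourier transform and Mel-scale conversion described above, using the librosa library, is given below; the sampling rate, window size, hop length and number of Mel bands are assumed values rather than parameters fixed by this disclosure.

```python
import numpy as np
import librosa

def first_mel_spectrogram(first_voice_data: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Convert a time-domain signal into a Mel spectrogram via STFT + Mel filter bank."""
    mel = librosa.feature.melspectrogram(
        y=first_voice_data,
        sr=sr,
        n_fft=1024,       # STFT window length (assumed)
        hop_length=256,   # frame shift (assumed)
        n_mels=80,        # number of Mel bands (assumed)
    )
    # Log compression is commonly applied before feeding a Mel-spectrum encoder.
    return np.log(np.clip(mel, 1e-5, None))
```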
S20, acquiring target feature data according to the phoneme sequence and the speaker feature of the target object;
the speaker characteristics of the target object may include tone characteristics, which are obtained by extracting characteristics of the target object according to the voice of the target object, and the speaker characteristics may be obtained by extracting characteristics of the voice by using an encoder such as one-hot encoding, d-vector or x-vector.
Feature fusion is performed on the phoneme sequence obtained in step S10 and the speaker features of the target object to obtain the target feature data.
As an embodiment, the step S20 specifically includes the following steps:
s21, splicing the phoneme sequence and the speaker characteristic of the target object to obtain a spliced characteristic vector;
the concatenation of the two feature vectors can be achieved by summing the phoneme sequence and the speaker features of the target object.
S22, inputting the spliced feature vector into a first neural network, and outputting the target feature data;
the first neural network is used for extracting invisible features of the spliced feature vector, and the first neural network can be a phoneme-speaker encoder.
In the following, the first neural network is described, by way of example, as including two fully-connected layers, namely a first fully-connected layer and a second fully-connected layer. In some embodiments, step S22 specifically includes:
S221, inputting the spliced feature vector into a first full-connection layer, and carrying out feature extraction on the spliced feature vector to obtain a high-dimensional fusion feature vector.
S222, inputting the high-dimensional fusion feature vector into a second full-connection layer, and extracting features of the high-dimensional fusion feature vector to obtain the target feature data;
specifically, the first fully connected layer includes a first number of nodes and the second fully connected layer includes a second number of nodes, the first number being greater than the second number. In step S221, the spliced feature vector is input into a first full-connection layer, and feature fusion is performed on each feature vector in the spliced feature vector at each node of the first full-connection layer, so as to obtain a first number of different first cross features, and the first number of different first cross features form a high-dimensional fused feature vector. In step S222, the high-dimensional fusion feature vector is input into a second full-connection layer, and feature fusion is performed on the first number of different first cross features at each node of the second full-connection layer, so as to obtain a second number of different second cross features, where the second number of different second cross features form target feature data.
S30, inputting the target characteristic data and the first spectrum coding data into a pre-trained voice synthesis model, and outputting a target Mel spectrum of the target object;
the speech synthesis model is obtained by training according to sample target feature data and sample spectrum feature data, the sample target feature data is obtained by feature fusion according to a sample phoneme sequence and sample speaker feature data, the sample spectrum feature data is obtained according to a sample mel spectrum, and the sample mel spectrum, the sample phoneme sequence and the sample speaker feature data are derived from the same sample audio data.
As one embodiment, the target feature data is $p = [p_0, p_1, p_2, \dots, p_i, \dots, p_{T_1-1}]$, where $0 \le i \le T_1-1$, $T_1$ is the number of feature vectors in the target feature data, and $T_1$ is an integer greater than or equal to 2. The first spectrum coding data is $q = [q_0, q_1, q_2, \dots, q_j, \dots, q_{T_2-1}]$, where $0 \le j \le T_2-1$, $T_2$ is the number of feature vectors in the first spectrum coding data, and $T_2$ is an integer greater than or equal to 2.
Then, in the present embodiment, step S30 specifically includes the steps of:
s31, acquiring a first alignment matrix according to the target characteristic data and the first spectrum coding data;
Specifically, the first alignment matrix $\alpha$ is calculated as follows:

$$\alpha_{i,j} = \frac{\exp\!\left(p_i^{\top} q_j / \sqrt{D}\right)}{\sum_{m=0}^{T_1-1}\exp\!\left(p_m^{\top} q_j / \sqrt{D}\right)}$$

where $\alpha_{i,j}$ is the matrix element in the $i$-th row and $j$-th column of the first alignment matrix $\alpha$, $p_i$ is the $i$-th feature vector in the target feature data, $q_j$ is the $j$-th feature vector in the first spectrum coding data, $p_m$ is the $m$-th feature vector in the target feature data, $D$ is the dimension of the outputs of the phoneme-speaker encoder and the Mel-spectrum encoder, $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base, and $T_1$ is the length of the target feature data. The first spectrum coding data $q$ may be calculated from the first alignment matrix $\alpha$ and the target feature data $p$.
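A short sketch of step S31 under the formula above follows; the scaled dot-product form of the energies is an assumption consistent with the symbols defined here ($p_i$, $q_j$, $D$, softmax over the $T_1$ axis).

```python
import torch

def first_alignment_matrix(p: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
    """p: target feature data (T1, D); q: first spectrum coding data (T2, D).
    Returns alpha of shape (T1, T2), normalized over the T1 axis."""
    D = p.size(-1)
    energy = p @ q.transpose(0, 1) / (D ** 0.5)   # (T1, T2) scaled dot-product energies (assumed form)
    return torch.softmax(energy, dim=0)           # alpha[i, j]
```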
S32, obtaining an index mapping vector (IMV) according to the first alignment matrix and the index vector of the target feature data;
As an embodiment, the index mapping vector $\pi'$ is calculated as follows:

$$\pi'_j = \sum_{i=0}^{T_1-1} \alpha_{i,j}\, k_i$$

where $\pi'_j$ is the $j$-th element of the index mapping vector, $\alpha_{i,j}$ is the matrix element in the $i$-th row and $j$-th column of the first alignment matrix $\alpha$, $k_i$ is the $i$-th element of the index vector $k$ of the target feature data, the index vector of the target feature data is $k = [0, 1, \dots, T_1-1]$, and the first alignment matrix $\alpha$ is a $T_1 \times T_2$ matrix.
S33, acquiring a second alignment matrix according to the index mapping vector and the index vector of the target feature data;
As an embodiment, to reduce the problem of error superposition, a bi-directional accumulation operation is designed to generate a second index map vector:
Δπ′ j =π′ j -π′ j-1 ,0<j≤T 2 -1,
Δπ j =ReLU(Δπ′ j ),0<j≤T 2 -1,
for the j-th time step, Δpi is accumulated in the forward direction and the reverse direction, respectively:
finally, the second index mapping vector is obtained by the following formula:
wherein,,
The second index mapping vector $\pi^{*}$ is used to reconstruct the second alignment matrix $\alpha'$ as follows:

$$\alpha'_{i,j} = \frac{\exp\!\left(-\sigma^{-2}\,(k_i - \pi^{*}_j)^2\right)}{\sum_{m=0}^{T_1-1}\exp\!\left(-\sigma^{-2}\,(k_m - \pi^{*}_j)^2\right)}$$

where $\alpha'_{i,j}$ is the matrix element in the $i$-th row and $j$-th column of the second alignment matrix $\alpha'$, $k_i$ is the $i$-th element of the index vector $k$ of the target feature data, the index vector of the target feature data is $k = [0, 1, \dots, T_1-1]$, $k_m$ is the $m$-th element of the index vector $k$, $\pi^{*}_j$ is the $j$-th element of the second index mapping vector $\pi^{*}$, $\exp(\cdot)$ is the exponential function with the natural constant $e$ as its base, $T_1$ is the length of the target feature data, and $\sigma^2$ is a parameter representing the alignment variance.
S34, acquiring fusion feature alignment vectors according to the second alignment matrix and the target feature data;
As one embodiment, the fusion feature alignment vector $c$ is calculated as follows:

$$c_j = \sum_{i=0}^{T_1-1} \alpha'_{i,j}\, p_i$$

where $c_j$ is the $j$-th fusion feature alignment vector, $\alpha'_{i,j}$ is the matrix element in the $i$-th row and $j$-th column of the second alignment matrix $\alpha'$, $p_i$ is the $i$-th element of the target feature data $p$, and $T_1$ is the length of the target feature data.
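A sketch tying steps S32–S34 together under the formulas above is given below. The exact bidirectional-accumulation and blending details are assumptions about points the text leaves implicit, and `sigma` stands for the alignment-variance parameter.

```python
import torch

def fusion_feature_alignment(p: torch.Tensor, alpha: torch.Tensor, sigma: float = 0.2) -> torch.Tensor:
    """Steps S32-S34 (sketch): IMV -> bidirectional accumulation -> second alignment -> context vectors.
    p: target feature data (T1, D); alpha: first alignment matrix (T1, T2)."""
    T1, T2 = alpha.shape
    k = torch.arange(T1, dtype=p.dtype)                      # index vector k = [0, ..., T1-1]
    pi_prime = k @ alpha                                     # (T2,)  pi'_j = sum_i alpha[i, j] * k_i

    delta = torch.relu(pi_prime[1:] - pi_prime[:-1])         # Δπ_j = ReLU(π'_j − π'_{j−1})
    fwd = pi_prime[0] + torch.cumsum(torch.cat([delta.new_zeros(1), delta]), dim=0)
    bwd = pi_prime[-1] - torch.flip(
        torch.cumsum(torch.flip(torch.cat([delta, delta.new_zeros(1)]), [0]), dim=0), [0])
    w = torch.arange(T2, dtype=p.dtype) / max(T2 - 1, 1)     # position-dependent blending weight (assumed)
    pi_star = (1 - w) * fwd + w * bwd                        # second index mapping vector

    # Second alignment matrix: softmax over i of -(k_i - pi*_j)^2 / sigma^2.
    energy = -((k.unsqueeze(1) - pi_star.unsqueeze(0)) ** 2) / (sigma ** 2)
    alpha2 = torch.softmax(energy, dim=0)                    # (T1, T2)
    return alpha2.transpose(0, 1) @ p                        # c_j = sum_i alpha2[i, j] * p_i -> (T2, D)
```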
S35, acquiring the target Mel spectrum according to the fusion feature alignment vector;
The fusion feature alignment vectors are input to a decoder, and the decoder outputs a predicted Mel spectrum as the target Mel spectrum of the target object. As an embodiment, the decoder may consist of several convolutional layers and one linear layer, interspersed with weight normalization, leaky ReLU activation functions and residual connections.
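A possible PyTorch sketch of such a decoder (convolutional layers plus one linear layer, with weight normalization, leaky ReLU activations and residual connections) follows; the channel count, kernel size and depth are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class MelDecoder(nn.Module):
    """Sketch of a decoder mapping fusion feature alignment vectors to a Mel spectrum."""
    def __init__(self, channels=256, n_mels=80, n_layers=4, kernel_size=5):
        super().__init__()
        self.convs = nn.ModuleList([
            weight_norm(nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2))
            for _ in range(n_layers)
        ])
        self.act = nn.LeakyReLU(0.2)
        self.linear = nn.Linear(channels, n_mels)

    def forward(self, aligned):                  # aligned: (B, T2, channels)
        x = aligned.transpose(1, 2)              # (B, channels, T2)
        for conv in self.convs:
            x = x + self.act(conv(x))            # residual connection around each convolution
        return self.linear(x.transpose(1, 2))    # predicted target Mel spectrum, (B, T2, n_mels)
```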
S40, inputting the target Mel spectrum into a pre-trained vocoder model, and outputting second voice data, wherein the vocoder model comprises a plurality of Flow modules with different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the plurality of Flow modules are sequentially reduced along a data processing direction;
referring to fig. 2, the data processing direction is a direction from the nth Flow module to the first Flow module, and the reverse processing direction is a direction from the first Flow module to the nth Flow module.
For the $i$-th Flow module, the transformation function in the reverse processing direction is $f_i$ and the transformation function in the data processing direction is $f_i^{-1}$, where $f_i$ and $f_i^{-1}$ are each invertible.
As one embodiment, the training step of the vocoder model includes:
s41, acquiring at least one training sample, wherein the training sample comprises sample voice data and sample Mel spectrums extracted from the sample voice data;
s42, splicing the sample voice data and Gaussian noise data to form first input data, inputting the first input data to a first Flow module in a reverse processing direction, and outputting corresponding first output data;
where the sample voice data is $x$, the Gaussian noise data is $e_1$, the first input data is $\hat{z}_1 = [x, e_1]$, and the first output data is $z_2$.
S43, splicing output data output by a previous Flow module with Gaussian noise data to form current input data, inputting the current input data to a current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a reverse processing direction;
where the previous Flow module is the $i$-th Flow module in the reverse processing direction, the current Flow module is the $(i+1)$-th Flow module in the reverse processing direction, the output data of the previous Flow module is $z_i$, the Gaussian noise data is $e_i$, the current input data is $\hat{z}_i = [z_i, e_i]$, and the current output data is $z_{i+1}$.
S44, repeating the steps until the last Flow module in the reverse processing direction outputs the predicted spectrum coding data;
where the last Flow module is the $n$-th Flow module in the reverse processing direction, and the predicted spectrum coding data output by the last Flow module is $z_n$.
S45, calculating errors according to the predicted spectrum coding data and the spectrum coding data of the sample Mel spectrum, and adjusting parameters of the vocoder model according to the errors until the vocoder model reaches a training convergence condition.
In the present embodiment, $x$, $z_i$, $e_i$ and $z_n$ each satisfy a Gaussian distribution, which is also called a normal distribution; the probability distribution of $x$ is $p_x$, and the probability distributions of $z_i$, $e_i$ and $z_n$ are each $N(0, 1)$.
In some embodiments, the (i+1)-th Flow module in the reverse processing direction satisfies:

$$z_{i+1} = f_{i+1}(\hat{z}_i), \qquad \hat{z}_i = [z_i, e_i], \qquad \log p(\hat{z}_i) = \log p(z_{i+1}) + \log\left|\det\!\left(\frac{\partial f_{i+1}(\hat{z}_i)}{\partial \hat{z}_i}\right)\right|$$

where $z_i$ is the output of the $i$-th Flow module in the reverse processing direction, $e_i$ is the Gaussian noise data spliced to $z_i$, $\hat{z}_i$ is the input of the $(i+1)$-th Flow module in the reverse processing direction, $z_{i+1}$ is the output of the $(i+1)$-th Flow module in the reverse processing direction, $p(\hat{z}_i)$ is the probability distribution of $\hat{z}_i$, $p(z_{i+1})$ is the probability distribution of $z_{i+1}$, $\partial f_{i+1}(\hat{z}_i)/\partial \hat{z}_i$ is the Jacobian matrix of $f_{i+1}(\hat{z}_i)$ with respect to $\hat{z}_i$, and $f_{i+1}$ is the transformation function of the $(i+1)$-th Flow module in the reverse processing direction. For example, $f_{i+1}$ may be an affine transformation function of the $(i+1)$-th Flow module in the reverse processing direction.
In some embodiments, during the training phase of the vocoder model, each piece of Gaussian noise data is derived from the spectrum coding data of the sample Mel spectrum, and each piece of Gaussian noise data is a continuous segment clipped from the spectrum coding data of the sample Mel spectrum. It should be understood by those skilled in the art that the way of clipping the Gaussian noise data is not limited in the present application; clipping from a random position may be adopted, or other methods may be adopted.
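The training pass of steps S41–S45 can be sketched as follows. The affine-coupling structure of each Flow module and the way the per-module dimensions are scheduled are assumptions; only the ideas of splicing Gaussian noise before each module in the reverse processing direction and accumulating the log-determinant terms come from the steps above.

```python
import torch
import torch.nn as nn

class CouplingFlow(nn.Module):
    """Sketch of one Flow module: an affine coupling layer whose scale/shift come from a
    convolutional neural network layer. `dim` is this module's dimension (channel count)."""
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Conv1d(self.half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, 2 * (dim - self.half), 3, padding=1),
        )

    def forward(self, x):                        # reverse processing direction (training)
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=1)
        zb = xb * torch.exp(log_s) + t           # invertible affine transform
        return torch.cat([xa, zb], dim=1), log_s.sum(dim=(1, 2))   # output, log|det Jacobian|

    def inverse(self, z):                        # data processing direction (inference)
        za, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(za).chunk(2, dim=1)
        return torch.cat([za, (zb - t) * torch.exp(-log_s)], dim=1)

def reverse_direction_pass(flows, x, noises):
    """Steps S42-S44 (sketch): splice Gaussian noise onto the running tensor before every
    Flow module in the reverse processing direction, so dimensions grow toward z_n
    (equivalently, shrink along the data processing direction)."""
    z, total_logdet = x, 0.0
    for flow, eps in zip(flows, noises):         # noises: one (B, d_i, T) tensor per module
        z = torch.cat([z, eps], dim=1)           # \hat z_i = [z_i, e_i]
        z, logdet = flow(z)
        total_logdet = total_logdet + logdet
    return z, total_logdet                       # z_n: predicted spectrum coding data
```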
In some embodiments, the inputting the target mel-spectrum into a pre-trained vocoder model, outputting second speech data, comprises:
s51, inputting the target Mel spectrum to a Mel spectrum encoder, and outputting target spectrum characteristic data of the target Mel spectrum;
s52, inputting the target spectrum characteristic data to a first Flow module in a data processing direction, and outputting corresponding output data;
The first Flow module in the data processing direction is the nth Flow module in the reverse processing direction, and the target spectrum characteristic data is directly taken as input.
S53, splitting output data output by a previous Flow module to obtain current input data and Gaussian noise data, inputting the current input data to the current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a data processing direction;
where the previous Flow module in the data processing direction is the $i$-th Flow module in the reverse processing direction, the current Flow module in the data processing direction is the $(i-1)$-th Flow module in the reverse processing direction, the output data of the previous Flow module in the data processing direction is $z'_i = [z'_{i-1}, e'_{i-1}]$, and after the Gaussian noise data $e'_{i-1}$ is discarded from $z'_i$, the remaining part $z'_{i-1}$ is the input data of the current Flow module in the data processing direction.
S54, repeating the steps until the last Flow module in the data processing direction outputs the last output data;
s55, splitting the final output data into the second voice data and Gaussian noise data, and outputting the second voice data.
That is, in the inference phase of the trained and optimized vocoder model, the training process is completely reversed, i.e., data $z$ generated from the Gaussian distribution is used to infer the speech signal $x$. First, the target spectrum feature data $z$ generated from the Gaussian distribution is obtained, and the inverse operations of the cascade of trained Flow modules are then performed to obtain the second voice data output $x$, where

$$x = f_1^{-1}\!\left(f_2^{-1}\!\left(\cdots f_n^{-1}(z)\cdots\right)\right).$$
in the deducing stage of the trained optimized vocoder model, the input data of the current Flow module is obtained by removing part of the output data of the last Flow module.
As one embodiment, the training step of the speech synthesis model includes:
s61, respectively acquiring corresponding sample Mel spectrums, sample phoneme sequences and sample speaker characteristic data according to the sample audio data;
s62, acquiring sample target feature data according to the sample phoneme sequence and the sample speaker feature data, and acquiring sample spectrum feature data according to a sample Mel spectrum;
s63, inputting the sample target feature data and the sample spectrum feature data into a speech synthesis model to be trained, and outputting a predicted target Mel spectrum;
s64, calculating a prediction error according to the sample Mel spectrum and the prediction target Mel spectrum, and adjusting parameters of the speech synthesis model according to the prediction error until the speech synthesis model reaches a training convergence condition.
In some embodiments, the loss function of the speech synthesis model is:

$$\mathcal{L}_{mel} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|^2$$

where $\mathcal{L}_{mel}$ is the loss value, $\widehat{mel}_i$ is the $i$-th predicted target Mel spectrum, $mel_i$ is the $i$-th sample Mel spectrum, and $N$ is the number of samples.
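As a brief sketch, a mean-squared-error loss of this form could be computed as follows; the batching layout is an assumption.

```python
import torch
import torch.nn.functional as F

def mel_loss(predicted_mels: torch.Tensor, sample_mels: torch.Tensor) -> torch.Tensor:
    """Mean-squared error between predicted target Mel spectra and sample Mel spectra,
    averaged over the N training samples; both tensors have shape (N, T, n_mels)."""
    return F.mse_loss(predicted_mels, sample_mels)
```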
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
As shown in fig. 3, an embodiment of the present application provides a speech synthesis apparatus, the speech synthesis apparatus 30 comprising: the device comprises a feature extraction module 31, a feature fusion module 32, a spectrum synthesis module 33 and a voice reconstruction module 34, wherein the feature extraction module 31 is used for respectively acquiring a phoneme sequence and first spectrum coding data according to first voice data; a feature fusion module 32, configured to obtain target feature data according to the phoneme sequence and the speaker feature of the target object; a spectrum synthesis module 33, configured to input the target feature data and the first spectrum coding data into a pre-trained speech synthesis model, and output a target mel spectrum of the target object; the voice reconstruction module 34 is configured to input the target mel spectrum into a pre-trained vocoder model, and output second voice data, where the vocoder model includes a plurality of Flow modules with different dimensions, each Flow module includes a convolutional neural network layer, and dimensions of the plurality of Flow modules sequentially decrease along a data processing direction.
As an embodiment, the speech reconstruction module 34 is further configured to:
obtaining at least one training sample, the training sample comprising sample speech data and a sample mel spectrum extracted from the sample speech data;
splicing the sample voice data and Gaussian noise data to form first input data, inputting the first input data to a first Flow module in a reverse processing direction, and outputting corresponding first output data;
splicing output data output by a previous Flow module with Gaussian noise data to form current input data, inputting the current input data to a current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a reverse processing direction;
repeating the steps until the last Flow module in the reverse processing direction outputs the predicted spectrum coding data;
calculating errors according to the predicted spectrum coding data and the spectrum coding data of the sample Mel spectrum, and adjusting parameters of the vocoder model according to the errors until the vocoder model reaches a training convergence condition.
As one embodiment, the (i+1) th Flow module in the reverse processing direction satisfies:
$$z_{i+1} = f_{i+1}(\hat{z}_i), \qquad \hat{z}_i = [z_i, e_i], \qquad \log p(\hat{z}_i) = \log p(z_{i+1}) + \log\left|\det\!\left(\frac{\partial f_{i+1}(\hat{z}_i)}{\partial \hat{z}_i}\right)\right|$$

where $z_i$ is the output of the $i$-th Flow module in the reverse processing direction, $e_i$ is the Gaussian noise data spliced to $z_i$, $\hat{z}_i$ is the input of the $(i+1)$-th Flow module in the reverse processing direction, $z_{i+1}$ is the output of the $(i+1)$-th Flow module in the reverse processing direction, $p(\hat{z}_i)$ is the probability distribution of $\hat{z}_i$, $p(z_{i+1})$ is the probability distribution of $z_{i+1}$, $\partial f_{i+1}(\hat{z}_i)/\partial \hat{z}_i$ is the Jacobian matrix of $f_{i+1}(\hat{z}_i)$ with respect to $\hat{z}_i$, and $f_{i+1}$ is the transformation function of the $(i+1)$-th Flow module in the reverse processing direction.
As an embodiment, the speech reconstruction module 34 is further configured to:
inputting the target Mel spectrum to a Mel spectrum encoder, and outputting target spectrum characteristic data of the target Mel spectrum;
inputting the target spectrum characteristic data to a first Flow module in a data processing direction, and outputting corresponding output data;
splitting output data output by a previous Flow module to obtain current input data and Gaussian noise data, inputting the current input data to the current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a data processing direction;
Repeating the steps until the last Flow module in the data processing direction, wherein the last Flow module outputs the last output data;
splitting the final output data into the second voice data and Gaussian noise data, and outputting the second voice data.
As an embodiment, each of the gaussian noise data is derived from the spectrum encoded data of the sample mel spectrum, respectively.
As an embodiment, the spectrum synthesis module 33 is further configured to:
respectively acquiring corresponding sample mel spectrums, sample phoneme sequences and sample speaker characteristic data according to the sample audio data;
acquiring sample target feature data according to the sample phoneme sequence and the sample speaker feature data, and acquiring sample spectrum feature data according to a sample Mel spectrum;
inputting the sample target feature data and the sample spectrum feature data into a speech synthesis model to be trained, and outputting a predicted target Mel spectrum;
and calculating a prediction error according to the sample Mel spectrum and the prediction target Mel spectrum, and adjusting parameters of the speech synthesis model according to the prediction error until the speech synthesis model reaches a training convergence condition.
As an embodiment, the loss function of the speech synthesis model is:

$$\mathcal{L}_{mel} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|^2$$

where $\mathcal{L}_{mel}$ is the loss value, $\widehat{mel}_i$ is the $i$-th predicted target Mel spectrum, $mel_i$ is the $i$-th sample Mel spectrum, and $N$ is the number of samples.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 4, the electronic device 60 includes a processor 61 and a memory 62 coupled to the processor 61.
The memory 62 stores program instructions for implementing the speech synthesis method of any of the embodiments described above.
The processor 61 is configured to execute program instructions stored in the memory 62 for speech synthesis.
The processor 61 may also be referred to as a CPU (Central Processing Unit ). The processor 61 may be an integrated circuit chip with signal processing capabilities. Processor 61 may also be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a storage medium according to an embodiment of the present application. The storage medium 70 of the embodiment of the present application stores program instructions 71 capable of implementing all the methods described above. The program instructions 71 may be stored in the storage medium in the form of a software product and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code, or a terminal device such as a computer, a server, a mobile phone or a tablet.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or as software functional units. The foregoing is only embodiments of the present application and does not limit the patent scope of the application; any equivalent structural or equivalent process transformation made using the contents of the description and the accompanying drawings of the present application, or any direct or indirect application thereof in other related technical fields, likewise falls within the patent protection scope of the present application.
Although the present application has been described in terms of the preferred embodiments, it should be understood that the present application is not limited to the specific embodiments, but is capable of numerous modifications and equivalents, and alternative embodiments and modifications of the embodiments described above, without departing from the spirit and scope of the present application.

Claims (10)

1. A method of speech synthesis, comprising:
respectively acquiring a phoneme sequence and first spectrum coding data according to the first voice data;
acquiring target feature data according to the phoneme sequence and the speaker feature of the target object;
inputting the target characteristic data and the first spectrum coding data into a pre-trained voice synthesis model, and outputting a target Mel spectrum of the target object;
the target Mel spectrum is input into a pre-trained vocoder model, and second voice data is output, wherein the vocoder model comprises a plurality of Flow modules with different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the plurality of Flow modules are sequentially reduced along the data processing direction.
2. The method of speech synthesis according to claim 1, wherein the step of training the vocoder model comprises:
obtaining at least one training sample, the training sample comprising sample speech data and a sample mel spectrum extracted from the sample speech data;
splicing the sample voice data and Gaussian noise data to form first input data, inputting the first input data to a first Flow module in a reverse processing direction, and outputting corresponding first output data;
splicing output data output by a previous Flow module with Gaussian noise data to form current input data, inputting the current input data to a current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a reverse processing direction;
repeating the steps until the last Flow module in the reverse processing direction outputs the predicted spectrum coding data;
calculating errors according to the predicted spectrum coding data and the spectrum coding data of the sample Mel spectrum, and adjusting parameters of the vocoder model according to the errors until the vocoder model reaches a training convergence condition.
3. The method of claim 2, wherein the (i+1) th Flow module of the reverse processing direction satisfies:
$$z_{i+1} = f_{i+1}(\hat{z}_i), \qquad \hat{z}_i = [z_i, e_i], \qquad \log p(\hat{z}_i) = \log p(z_{i+1}) + \log\left|\det\!\left(\frac{\partial f_{i+1}(\hat{z}_i)}{\partial \hat{z}_i}\right)\right|$$

wherein $z_i$ is the output of the $i$-th Flow module in the reverse processing direction, $e_i$ is the Gaussian noise data spliced to $z_i$, $\hat{z}_i$ is the input of the $(i+1)$-th Flow module in the reverse processing direction, $z_{i+1}$ is the output of the $(i+1)$-th Flow module in the reverse processing direction, $p(\hat{z}_i)$ is the probability distribution of $\hat{z}_i$, $p(z_{i+1})$ is the probability distribution of $z_{i+1}$, $\partial f_{i+1}(\hat{z}_i)/\partial \hat{z}_i$ is the Jacobian matrix of $f_{i+1}(\hat{z}_i)$ with respect to $\hat{z}_i$, and $f_{i+1}$ is the transformation function of the $(i+1)$-th Flow module in the reverse processing direction.
4. The method of speech synthesis according to claim 2, wherein the inputting the target mel-spectrum into a pre-trained vocoder model, outputting second speech data, comprises:
inputting the target Mel spectrum to a Mel spectrum encoder, and outputting target spectrum characteristic data of the target Mel spectrum;
inputting the target spectrum characteristic data to a first Flow module in a data processing direction, and outputting corresponding output data;
splitting output data output by a previous Flow module to obtain current input data and Gaussian noise data, inputting the current input data to the current Flow module, and outputting corresponding current output data, wherein the current Flow module is the next Flow module of the previous Flow module in a data processing direction;
Repeating the steps until the last Flow module in the data processing direction, wherein the last Flow module outputs the last output data;
splitting the final output data into the second voice data and Gaussian noise data, and outputting the second voice data.
5. The method of speech synthesis according to claim 2, wherein each of the gaussian noise data is derived from the spectrally-encoded data of the sample mel spectrum, respectively.
6. The method of speech synthesis according to claim 2, wherein the training step of the speech synthesis model comprises:
respectively acquiring corresponding sample mel spectrums, sample phoneme sequences and sample speaker characteristic data according to the sample audio data;
acquiring sample target feature data according to the sample phoneme sequence and the sample speaker feature data, and acquiring sample spectrum feature data according to a sample Mel spectrum;
inputting the sample target feature data and the sample spectrum feature data into a speech synthesis model to be trained, and outputting a predicted target Mel spectrum;
and calculating a prediction error according to the sample Mel spectrum and the prediction target Mel spectrum, and adjusting parameters of the speech synthesis model according to the prediction error until the speech synthesis model reaches a training convergence condition.
7. The method of claim 6, wherein the loss function of the speech synthesis model is:

$$\mathcal{L}_{mel} = \frac{1}{N}\sum_{i=1}^{N}\left\|\widehat{mel}_i - mel_i\right\|^2$$

wherein $\mathcal{L}_{mel}$ is the loss value, $\widehat{mel}_i$ is the $i$-th predicted target Mel spectrum, $mel_i$ is the $i$-th sample Mel spectrum, and $N$ is the number of samples.
8. A speech synthesis apparatus, comprising:
the feature extraction module is used for respectively acquiring a phoneme sequence and first spectrum coding data according to the first voice data;
the feature fusion module is used for acquiring target feature data according to the phoneme sequence and the speaker features of the target object;
the frequency spectrum synthesis module is used for inputting the target characteristic data and the first frequency spectrum coding data into a pre-trained voice synthesis model and outputting a target Mel spectrum of the target object;
the voice reconstruction module is used for inputting the target mel spectrum into a pre-trained vocoder model and outputting second voice data, wherein the vocoder model comprises a plurality of Flow modules with different dimensions, each Flow module comprises a convolutional neural network layer, and the dimensions of the Flow modules are sequentially reduced along the data processing direction.
9. An electronic device comprising a processor, and a memory coupled to the processor, the memory storing program instructions executable by the processor; the processor, when executing the program instructions stored in the memory, implements the speech synthesis method according to any one of claims 1 to 7.
10. A storage medium having stored therein program instructions which, when executed by a processor, implement the speech synthesis method of any one of claims 1 to 7.
CN202310654875.2A 2023-06-02 2023-06-02 Speech synthesis method, device, electronic equipment and storage medium Pending CN116844519A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310654875.2A CN116844519A (en) 2023-06-02 2023-06-02 Speech synthesis method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310654875.2A CN116844519A (en) 2023-06-02 2023-06-02 Speech synthesis method, device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116844519A true CN116844519A (en) 2023-10-03

Family

ID=88167952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310654875.2A Pending CN116844519A (en) 2023-06-02 2023-06-02 Speech synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116844519A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination