CN118314876A - Speech synthesis method, speech synthesis device, apparatus, and storage medium - Google Patents

Speech synthesis method, speech synthesis device, apparatus, and storage medium

Info

Publication number
CN118314876A
Authority
CN
China
Prior art keywords
target
embedding
audio
text
frequency spectrum
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410430727.7A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
唐浩彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202410430727.7A priority Critical patent/CN118314876A/en
Publication of CN118314876A publication Critical patent/CN118314876A/en
Pending legal-status Critical Current

Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

An embodiment of the present application provides a speech synthesis method, apparatus, device, and storage medium. The method comprises the following steps: acquiring a target text to be synthesized and a target audio; performing encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding; inputting the target phoneme embedding and the target timbre embedding into a target diffusion model, which outputs a corresponding target mel spectrum; and determining, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio. The embodiment of the application introduces the speaker's audio into the text-to-speech process, so that the synthesized speech is closer to the speaker's timbre. Particularly in the field of bank telephone customer service, this can effectively improve the quality of speech synthesis and, in turn, customer satisfaction.

Description

Speech synthesis method, speech synthesis device, apparatus, and storage medium
Technical Field
The present application relates to the field of financial technology, and in particular to a speech synthesis method, a speech synthesis apparatus, a computer device, and a computer-readable storage medium.
Background
For financial companies such as banks and insurance companies, intelligent voice customer service has become the near-universal channel for early-stage customer communication. It effectively reduces staffing costs, improves communication efficiency, and eases the workload of employees. Intelligent voice customer service not only greatly shortens the time human agents spend in actual conversations, but also handles customer demands quickly through human-machine collaboration.
Speech synthesis technology is developing rapidly, but the naturalness of the speech produced by related techniques is still not ideal. When such speech is used for customer service, customers can clearly hear that they are talking to a machine, so they reject the intelligent voice service outright and demand a human agent instead. A speech synthesis method is therefore needed that produces a timbre closer to a designated "speaker", that is, a distinct voice for each speaker, so as to meet customers' needs.
Disclosure of Invention
The present application provides a speech synthesis method, a speech synthesis apparatus, a computer device, and a computer-readable storage medium, with the aim of introducing the speaker's audio into the text-to-speech process so that the diffusion model can output speech closer to the speaker's timbre. Particularly in the field of bank telephone customer service, this can effectively improve the quality of speech synthesis and, in turn, customer satisfaction.
To achieve the above object, the present application provides a speech synthesis method, the method comprising:
acquiring a target text to be synthesized and a target audio;
performing encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding;
inputting the target phoneme embedding and the target timbre embedding into a target diffusion model, and outputting a corresponding target mel spectrum;
and determining, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio.
In order to achieve the above object, the present application also provides a speech synthesis apparatus, including:
an acquisition module, configured to acquire a target text to be synthesized and a target audio;
an encoding module, configured to perform encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding;
an input module, configured to input the target phoneme embedding and the target timbre embedding into a target diffusion model and output a corresponding target mel spectrum;
and a speech synthesis module, configured to determine, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio.
In addition, to achieve the above object, the present application also provides a computer apparatus including a memory and a processor; the memory is used for storing a computer program; the processor is configured to execute the computer program and implement the steps of any one of the speech synthesis methods provided by the embodiments of the present application when the computer program is executed.
In addition, to achieve the above object, the present application further provides a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to implement the steps of any one of the speech synthesis methods provided by the embodiments of the present application.
The speech synthesis method, speech synthesis apparatus, computer device, and computer-readable storage medium disclosed in the embodiments of the present application acquire a target text to be synthesized and a target audio, and encode them to obtain a corresponding target phoneme embedding and target timbre embedding. The target phoneme embedding and the target timbre embedding can then be input into a target diffusion model, which outputs a corresponding target mel spectrum; finally, a target speech waveform is determined based on the target mel spectrum. The application introduces the speaker's target audio into the speech synthesis process and encodes it, so that the target speech waveform can be made closer to the speaker's timbre. Particularly in the field of bank telephone customer service, the speech synthesis method provided by the application can effectively improve the quality of speech synthesis and, in turn, customer satisfaction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of steps of a speech synthesis method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a speech synthesis method according to an embodiment of the present application;
fig. 4 is a schematic flow chart of obtaining a target mel spectrum according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of generating a target diffusion model according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application;
fig. 7 is a schematic block diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations. In addition, although the division of the functional modules is performed in the apparatus schematic, in some cases, the division of the modules may be different from that in the apparatus schematic.
The term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, the speech synthesis method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. The application environment includes a terminal device 110 and a server 120, where the terminal device 110 can communicate with the server 120 through a network. Specifically, the server 120 can acquire a target text to be synthesized and a target audio; perform encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding; input the target phoneme embedding and the target timbre embedding into a target diffusion model, and output a corresponding target mel spectrum; and finally determine, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio, and send the target speech waveform to the terminal device 110. The server 120 may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), big data, and artificial intelligence platforms. The terminal device 110 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be connected directly or indirectly through wired or wireless communication, which is not limited in the present application.
Referring to fig. 2 and fig. 3, fig. 2 is a schematic diagram illustrating the steps of a speech synthesis method according to an embodiment of the application, and fig. 3 is a flow chart of the speech synthesis method. The speech synthesis method can be applied to a computer device.
As shown in fig. 2, the speech synthesis method includes steps S11 to S14.
Step S11: and obtaining target text to be synthesized and target audio.
The target text to be synthesized is text that needs to undergo speech synthesis. For example, in scenarios such as intelligent voice customer service or spoken report reading at financial companies such as banks and insurance companies, there is text that needs to be converted into speech. The present application is described using the intelligent voice customer service scenario of an insurance company as an example.
The target audio is the audio of the "speaker"; that is, in the insurance company's intelligent voice customer service scenario, the synthesized speech should have the same timbre as the target audio, or a timbre as close to it as possible. This improves the quality of speech synthesis in that scenario and, in turn, customer satisfaction.
Step S12: and respectively carrying out coding operation on the target text and the target audio to obtain corresponding target phoneme embedding and target tone embedding.
The target phoneme embedding contains the phoneme information of the target text, i.e., the phoneme information of the text to be synthesized in the intelligent voice customer service scenario; the target timbre embedding contains the timbre information of the target audio, i.e., the timbre information of the "speaker" audio in that scenario.
It should be noted that the present application does not limit how the encoding operation is implemented. For example, a phoneme encoder can encode the text to be synthesized in the intelligent voice customer service scenario to obtain the corresponding target phoneme embedding, and a timbre encoder can encode the "speaker" audio to obtain the corresponding target timbre embedding.
In the embodiment of the application, the text to be synthesized and the speaker's audio in the intelligent voice customer service scenario can be encoded separately, so that the phoneme information of the text and the timbre information of the speaker's audio are extracted. Speech synthesis can then be performed based on this phoneme and timbre information, making the synthesized speech closer to the speaker's timbre in the insurance company's intelligent voice customer service scenario and improving the customer's experience.
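As an illustration of this step, the following is a minimal sketch of a phoneme encoder and a timbre encoder in PyTorch. The module sizes and layer choices are assumptions for illustration; the application does not specify the encoder architectures.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Maps a phoneme ID sequence to a target phoneme embedding sequence."""
    def __init__(self, n_phonemes: int = 100, dim: int = 256):
        super().__init__()
        self.table = nn.Embedding(n_phonemes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq) -> (batch, seq, dim)
        return self.encoder(self.table(phoneme_ids))

class TimbreEncoder(nn.Module):
    """Pools a reference mel spectrogram into a single timbre embedding."""
    def __init__(self, n_mels: int = 80, dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, dim, batch_first=True)

    def forward(self, ref_mel: torch.Tensor) -> torch.Tensor:
        # ref_mel: (batch, frames, n_mels) -> (batch, dim)
        _, h = self.rnn(ref_mel)
        return h[-1]
```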
Step S13: and (3) embedding the target phonemes and the target timbre into a target diffusion model, and outputting to obtain a corresponding target Mel frequency spectrum.
The target mel spectrum is the mel spectrum corresponding to the target text and the target audio.
It should be noted that the mel spectrum is a feature representation commonly used in speech signal processing. It simulates the human ear's perception of sound by converting the linear frequency axis into a nonlinear mel scale, and extracts the key features of the sound for downstream tasks such as speech recognition, speaker recognition, and speech synthesis. Computing a mel spectrum mainly involves pre-emphasis, framing, windowing, fast Fourier transform, and mel filter-bank steps; applying a further discrete cosine transform to the log mel energies yields mel-frequency cepstral coefficients (MFCCs).
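For concreteness, the following is a minimal sketch of mel spectrum extraction using librosa; the file name, sampling rate, frame, hop, and filter-bank sizes are illustrative assumptions, not values taken from the application.

```python
import numpy as np
import librosa

wav, sr = librosa.load("speaker_reference.wav", sr=22050)   # hypothetical file
wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])          # pre-emphasis
mel = librosa.feature.melspectrogram(
    y=wav, sr=sr,
    n_fft=1024,       # framing + windowing + FFT size
    hop_length=256,   # frame shift
    n_mels=80,        # number of mel filter-bank channels
)
log_mel = librosa.power_to_db(mel)  # log compression; a further DCT would yield MFCCs
```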
Further, the diffusion model comprises a forward Markov chain, which gradually corrupts the data with noise, and a parameterized reverse Markov chain, which removes it; the output is obtained through this series of Markov transitions.
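The following is a minimal DDPM-style sketch of those two chains: a forward chain that corrupts a mel spectrum x0 with Gaussian noise, and a parameterized reverse chain, the network eps_model, that removes the noise step by step. The noise schedule values are common defaults, not taken from the application.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    """Forward chain: sample x_t ~ q(x_t | x_0) in closed form."""
    a = alphas_bar[t].sqrt().view(-1, 1, 1)
    b = (1.0 - alphas_bar[t]).sqrt().view(-1, 1, 1)
    return a * x0 + b * noise

@torch.no_grad()
def p_step(eps_model, x_t, t, c):
    """One step of the parameterized reverse (denoising) chain."""
    beta, alpha = betas[t], 1.0 - betas[t]
    eps = eps_model(x_t, torch.full((x_t.shape[0],), t), c)  # predicted noise
    mean = (x_t - beta / (1.0 - alphas_bar[t]).sqrt() * eps) / alpha.sqrt()
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(x_t)
```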
In the embodiment of the application, the target phoneme embedding and the target timbre embedding can be input into the target diffusion model, which outputs the corresponding target mel spectrum, so that speech synthesis in the insurance company's intelligent voice customer service scenario can be realized based on that spectrum.
Step S14: and determining target text and a target voice waveform corresponding to the target audio based on the target Mel frequency spectrum.
The present application does not limit how the target speech waveform is determined from the target mel spectrum. For example, a preset vocoder can convert the target mel spectrum into audio, yielding the target speech waveform corresponding to the text to be synthesized and the speaker's audio in the intelligent voice customer service scenario.
The preset vocoder includes at least one of the WaveGlow, WORLD, and STRAIGHT vocoders; the present application is described using WaveGlow as the preset vocoder.
The WaveGlow vocoder is a deep learning model that generates high-quality speech. It combines insights from WaveNet with Glow-style normalizing flows, enabling efficient, non-autoregressive generation, and is mainly applied in fields such as speech synthesis, voice conversion, and speech enhancement. Conditioned on a mel spectrum, WaveGlow can generate high-quality speech waveforms for the insurance company's intelligent voice customer service scenario.
Specifically, WaveGlow conditions on the mel spectrum and transforms samples drawn from a simple Gaussian distribution into a speech waveform through a sequence of invertible flow layers. In tasks such as speech synthesis, WaveGlow can generate a waveform very close to the original speech signal from its mel spectrum, thereby achieving high-quality synthesis.
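As a usage sketch, NVIDIA publishes a pretrained WaveGlow through torch.hub; the snippet below follows their published example, though entry-point names and availability depend on the hub repository version.

```python
import torch

# Load NVIDIA's pretrained WaveGlow (per their published torch.hub example).
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub',
                          'nvidia_waveglow', model_math='fp32')
waveglow = waveglow.remove_weightnorm(waveglow)  # fold weight norm for inference
waveglow.eval()

with torch.no_grad():
    # target_mel: (batch, n_mels, frames) mel spectrum, e.g. from the diffusion model
    audio = waveglow.infer(target_mel)           # -> (batch, samples) waveform
```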
In an embodiment of the present application, the target speech waveform can be generated from the target mel spectrum by the WaveGlow vocoder to achieve text-to-speech conversion. Because the speaker's audio is introduced into the conversion process, the converted speech is closer to the speaker's timbre, further improving the customer's experience.
According to the speech synthesis method disclosed in the embodiment of the application, the text to be synthesized and the speaker's audio can be acquired in the insurance company's intelligent voice customer service scenario and encoded to obtain the corresponding target phoneme embedding and target timbre embedding. These embeddings can then be input into a target diffusion model, which outputs the corresponding target mel spectrum; finally, the target speech waveform is determined based on that spectrum. Because the speaker's target audio is introduced and encoded during synthesis, the target speech waveform can be made closer to the speaker's timbre. Particularly in the field of bank telephone customer service, the method can effectively improve the quality of speech synthesis and, in turn, customer satisfaction.
With continued reference to fig. 4, fig. 4 is a flowchart illustrating a process for obtaining a target mel spectrum according to an embodiment of the present application. As shown in fig. 4, the target mel spectrum may be obtained through steps S131 to S132.
Step S131: and carrying out duration prediction on the target phoneme embedding and the target audio embedding to obtain target frame length information of each phoneme in the frequency spectrum.
Step S132: and inputting the target phoneme embedding, the target tone embedding and the target frame length information into a target diffusion model, and outputting to obtain a corresponding target Mel frequency spectrum.
The target frame length information includes the frame-length duration of each phoneme in the spectrum.
Further, the purpose of duration prediction is to predict the duration of each phoneme, i.e., its frame length in the spectrum.
The present application does not limit how duration prediction is performed; for example, it may be performed by a duration predictor.
The duration predictor can take the phoneme hidden sequence as input and predict the duration of each phoneme, i.e., how many mel frames the phoneme spans, converting it to the logarithmic domain to ease prediction. The duration predictor can be optimized with a mean squared error loss, using extracted durations as the training target, to improve alignment accuracy and thereby reduce the information gap between the model's input and output.
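A minimal sketch of such a duration predictor is shown below, assuming a FastSpeech-style stack of convolutions over the phoneme hidden sequence; all sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Predicts, per phoneme, the log number of mel frames it spans."""
    def __init__(self, dim: int = 256, kernel: int = 3, dropout: float = 0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.drop = nn.Dropout(dropout)
        self.proj = nn.Linear(dim, 1)

    def forward(self, h):                      # h: (batch, phonemes, dim)
        y = self.drop(torch.relu(self.conv1(h.transpose(1, 2)))).transpose(1, 2)
        y = self.norm1(y)
        y = self.drop(torch.relu(self.conv2(y.transpose(1, 2)))).transpose(1, 2)
        y = self.norm2(y)
        return self.proj(y).squeeze(-1)        # (batch, phonemes) log-durations

# Training target: extracted per-phoneme frame counts, compared in the log domain:
# loss = nn.MSELoss()(predictor(h), torch.log(frame_counts.float() + 1.0))
```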
In the embodiment of the application, duration prediction can be performed on the target phoneme embedding and the target audio embedding to obtain target frame length information for each phoneme in the spectrum. The target frame length information, target phoneme embedding, and target timbre embedding can then be input into the target diffusion model, which outputs the corresponding target mel spectrum. Because the frame length information is introduced into the model's input, phoneme alignment accuracy improves, yielding a more accurate mel spectrum; this effectively improves the quality of speech synthesis in the insurance company's intelligent voice customer service scenario and, in turn, customer satisfaction.
With continued reference to fig. 5, fig. 5 is a flowchart illustrating a process for generating a target diffusion model according to an embodiment of the application. As shown in fig. 5, the generation of the target diffusion model may be achieved through steps S21 to S24.
Step S21: and obtaining training samples, wherein the training samples comprise a plurality of text samples and audio samples corresponding to each text sample.
The training samples are samples under the scene of intelligent voice customer service of the insurance company.
The text samples are texts of a plurality of voices to be synthesized in the intelligent voice customer service scene of the insurance company; the audio samples are the audio of the corresponding "speaker", i.e. the audio samples have the same tone as the synthesized speech, or the synthesized speech tone is infinitely close to the audio samples.
Step S22: and respectively carrying out coding operation on each text sample and the corresponding audio sample to obtain corresponding input features, wherein the input features comprise text sample embedding and audio sample embedding.
The text sample is embedded into an input characteristic obtained after the text sample coding operation; the audio samples are embedded as input features obtained after the audio sample encoding operation.
It should be noted that the specific encoding steps are the same as those described in step S12 and are not repeated here.
In the embodiment of the application, a plurality of text samples and the audio sample corresponding to each text sample in the insurance company's intelligent voice customer service scenario can be acquired, and encoding operations can be performed on them to obtain the input features.
Step S23: and determining a corresponding initial Mel frequency spectrum based on each input feature, and determining the initial Mel frequency spectrum corresponding to each input feature as a label of the input feature.
Step S24: and inputting each input feature and the label corresponding to the input feature into the initial diffusion model to obtain a target diffusion model.
Specifically, the initial mel spectrum corresponding to each set of input features serves as that set's label. Each labeled set of input features is then fed into the initial diffusion model for supervised learning. Training ends when a stopping condition is met, for example when the number of training iterations reaches an iteration threshold or the model's output accuracy reaches an accuracy threshold, yielding the trained target diffusion model.
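A sketch of this supervised loop with the stopping conditions just described is given below; `diffusion_loss` is the objective defined with the loss formula later in this section, and all names and thresholds are illustrative assumptions.

```python
import torch

max_steps, loss_threshold = 100_000, 0.01
optimizer = torch.optim.Adam(eps_model.parameters(), lr=1e-4)

for step, (input_features, mel_labels) in enumerate(training_batches):
    loss = diffusion_loss(eps_model, mel_labels, cond=input_features)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # End training when the iteration budget is reached or the loss converges.
    if step + 1 >= max_steps or loss.item() < loss_threshold:
        break
```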
Optionally, determining a corresponding initial mel spectrum based on each input feature includes: performing feature conversion on each input feature, and performing variable mapping on the converted input features to obtain a corresponding hidden sequence; and decoding each hidden sequence to obtain the corresponding initial mel spectrum.
Specifically, an acoustic feature generator can perform feature conversion on each input feature, adding different variable information to it, and then perform variable mapping on the converted input features to obtain the hidden sequence corresponding to the sample phoneme features. This provides enough information to predict varied speech and alleviates the one-to-many mapping problem in speech synthesis. The hidden sequence can then be decoded into the corresponding initial mel spectrum, improving the quality of speech synthesis in the insurance company's intelligent voice customer service scenario.
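A hedged sketch of this stage is below: phoneme-level features are expanded to frame level according to predicted durations (the "variable mapping"), and the resulting hidden sequence is decoded into an initial mel spectrum. The layout follows FastSpeech-style models as an assumption; the application does not fix this architecture.

```python
import torch
import torch.nn as nn

def length_regulate(h, durations):
    """Expand phoneme-level hidden states to frame level using frame counts."""
    # h: (phonemes, dim); durations: (phonemes,) integer frame counts
    return torch.repeat_interleave(h, durations, dim=0)   # (frames, dim)

class MelDecoder(nn.Module):
    """Decodes a frame-level hidden sequence into an initial mel spectrum."""
    def __init__(self, dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, hidden_seq):             # (batch, frames, dim)
        y, _ = self.rnn(hidden_seq)
        return self.proj(y)                    # (batch, frames, n_mels)
```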
Optionally, inputting each input feature and its corresponding label into the initial diffusion model to obtain the target diffusion model includes: performing noise diffusion on each initial mel spectrum to obtain noise diffusion samples; and inputting each input feature and its corresponding noise diffusion sample into the initial diffusion model to obtain the target diffusion model.
Specifically, noise diffusion can be performed on the acoustic features of the mel spectrum: Gaussian noise is added to the mel spectrum's data structure during diffusion until the structure is completely destroyed, and a denoising function then removes the added noise and restores the structure, yielding the noise diffusion samples. The mel spectrum and the noise diffusion samples can then be input into the initial diffusion model for training to obtain the target diffusion model.
In the embodiment of the application, because noise diffusion is performed on each initial mel spectrum and the initial mel spectra and noise diffusion samples are then fed into the initial diffusion model for training, the resulting target diffusion model's denoising ability is strengthened, and with it the quality of its speech synthesis in the insurance company's intelligent voice customer service scenario.
Further, inputting each input feature and its corresponding noise diffusion sample into the initial diffusion model to obtain the target diffusion model further includes: inputting each input feature and its corresponding noise diffusion sample into the initial diffusion model, and training with a preset loss function to obtain the converged target diffusion model.
The preset loss function is:
$$\min_{\theta} L(\theta) = \mathbb{E}\big[\, \| \epsilon - \epsilon_\theta(x_t, c) \|^2 \,\big]$$
where $L(\theta)$ is the preset loss function; $\epsilon_\theta(x_t, c)$ is the noise the model predicts from the input features, with $x_t$ the variable (noised) sequence at diffusion step $t$ and $c$ the conditioning formed by the text sample embedding and the audio sample embedding; and $\epsilon$ is the noise of the noise diffusion sample.
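A sketch of this objective in code: sample a timestep, noise the label mel spectrum with the forward chain (q_sample and T from the earlier diffusion sketch), and regress the predicted noise onto the true noise; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, cond):
    """L(theta) = E[ || eps - eps_theta(x_t, c) ||^2 ] over random timesteps."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    noise = torch.randn_like(x0)        # epsilon, the noise of the diffusion sample
    x_t = q_sample(x0, t, noise)        # noised mel at step t (forward chain)
    pred = eps_model(x_t, t, cond)      # eps_theta(x_t, c), the predicted noise
    return F.mse_loss(pred, noise)
```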
In the embodiment of the application, the initial diffusion model can be iteratively trained with the preset loss function until convergence, so that the output target mel spectrum is more accurate and the target speech waveform is closer to the speaker's timbre. Particularly in the field of bank telephone customer service, the speech synthesis method provided by the application can effectively improve the quality of speech synthesis and, in turn, customer satisfaction.
Referring to fig. 6, fig. 6 is a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present application. The speech synthesis apparatus may be configured in a server for performing the aforementioned speech synthesis method.
As shown in fig. 6, the speech synthesis apparatus 200 includes: an acquisition module 201, an encoding module 202, an input module 203, and a speech synthesis module 204.
An obtaining module 201, configured to obtain a target text to be synthesized and a target audio;
an encoding module 202, configured to perform encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding;
an input module 203, configured to input the target phoneme embedding and the target timbre embedding into a target diffusion model and output a corresponding target mel spectrum;
The speech synthesis module 204 is configured to determine, based on the target mel spectrum, a target speech waveform corresponding to the target text and the target audio.
The encoding module 202 is further configured to perform duration prediction on the target phoneme embedding and the target audio embedding to obtain target frame length information for each phoneme in a spectrum, and to input the target phoneme embedding, the target timbre embedding, and the target frame length information into the target diffusion model, outputting the corresponding target mel spectrum.
The speech synthesis module 204 is further configured to perform audio conversion on the target mel spectrum through a preset vocoder to obtain the target speech waveform corresponding to the target text and the target audio, where the preset vocoder includes at least one of a WaveGlow vocoder, a WORLD vocoder, and a STRAIGHT vocoder.
The obtaining module 201 is further configured to acquire training samples, where the training samples include a plurality of text samples and an audio sample corresponding to each text sample; perform encoding operations on each text sample and its corresponding audio sample to obtain corresponding input features, where the input features include a text sample embedding and an audio sample embedding; determine a corresponding initial mel spectrum based on each input feature, and use the initial mel spectrum corresponding to each input feature as that input feature's label; and input each input feature and its corresponding label into an initial diffusion model to obtain the target diffusion model.
The obtaining module 201 is further configured to perform feature conversion on each input feature and perform variable mapping on the converted input features to obtain corresponding hidden sequences, and to decode each hidden sequence to obtain the corresponding initial mel spectrum.
The obtaining module 201 is further configured to perform noise diffusion on each initial mel spectrum to obtain noise diffusion samples, and to input each input feature and its corresponding noise diffusion sample into the initial diffusion model to obtain the target diffusion model.
The obtaining module 201 is further configured to input each input feature and its corresponding noise diffusion sample into the initial diffusion model, and to train with a preset loss function to obtain the converged target diffusion model,
where the preset loss function is:
$$\min_{\theta} L(\theta) = \mathbb{E}\big[\, \| \epsilon - \epsilon_\theta(x_t, c) \|^2 \,\big]$$
where $L(\theta)$ is the preset loss function; $\epsilon_\theta(x_t, c)$ is the noise the model predicts from the input features, with $x_t$ the variable (noised) sequence and $c$ the conditioning formed by the text sample embedding and the audio sample embedding; and $\epsilon$ is the noise of the noise diffusion sample.
It should be noted that, for convenience and brevity of description, specific working processes of the above-described apparatus and each module, unit may refer to corresponding processes in the foregoing method embodiments, which are not repeated herein.
The methods and apparatus of the present application are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
By way of example, the methods, apparatus described above may be implemented in the form of a computer program that is executable on a computer device as shown in fig. 7.
Referring to fig. 7, fig. 7 is a schematic diagram of a computer device according to an embodiment of the application. The computer device may be a server.
As shown in fig. 7, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program comprises program instructions which, when executed, cause the processor to perform any one of the speech synthesis methods.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform any of the speech synthesis methods.
The network interface is used for network communication, such as transmitting assigned tasks. Those skilled in the art will appreciate that the illustrated architecture is merely a block diagram of some of the structures relevant to the present application and does not limit the computer device to which the application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or arrange them differently.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
In some embodiments, the processor is configured to run a computer program stored in the memory to implement the following steps: acquiring a target text to be synthesized and a target audio; performing encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding; inputting the target phoneme embedding and the target timbre embedding into a target diffusion model, and outputting a corresponding target mel spectrum; and determining, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio.
In some embodiments, the processor is further configured to perform duration prediction on the target phoneme embedding and the target audio embedding to obtain target frame length information for each phoneme in a spectrum, and to input the target phoneme embedding, the target timbre embedding, and the target frame length information into the target diffusion model, outputting the corresponding target mel spectrum.
In some embodiments, the processor is further configured to perform audio conversion on the target mel spectrum through a preset vocoder to obtain the target speech waveform corresponding to the target text and the target audio, where the preset vocoder includes at least one of a WaveGlow vocoder, a WORLD vocoder, and a STRAIGHT vocoder.
In some embodiments, the processor is further configured to acquire training samples, where the training samples include a plurality of text samples and an audio sample corresponding to each text sample; perform encoding operations on each text sample and its corresponding audio sample to obtain corresponding input features, where the input features include a text sample embedding and an audio sample embedding; determine a corresponding initial mel spectrum based on each input feature, and use the initial mel spectrum corresponding to each input feature as that input feature's label; and input each input feature and its corresponding label into an initial diffusion model to obtain the target diffusion model.
In some embodiments, the processor is further configured to perform feature conversion on each input feature and perform variable mapping on the converted input features to obtain corresponding hidden sequences, and to decode each hidden sequence to obtain the corresponding initial mel spectrum.
In some embodiments, the processor is further configured to perform noise diffusion on each initial mel spectrum to obtain noise diffusion samples, and to input each input feature and its corresponding noise diffusion sample into the initial diffusion model to obtain the target diffusion model.
In some embodiments, the processor is further configured to input each input feature and its corresponding noise diffusion sample into the initial diffusion model, and to train with a preset loss function to obtain the converged target diffusion model,
where the preset loss function is:
$$\min_{\theta} L(\theta) = \mathbb{E}\big[\, \| \epsilon - \epsilon_\theta(x_t, c) \|^2 \,\big]$$
where $L(\theta)$ is the preset loss function; $\epsilon_\theta(x_t, c)$ is the noise the model predicts from the input features, with $x_t$ the variable (noised) sequence and $c$ the conditioning formed by the text sample embedding and the audio sample embedding; and $\epsilon$ is the noise of the noise diffusion sample.
The embodiment of the application also provides a computer readable storage medium, and a computer program is stored on the computer readable storage medium, wherein the computer program comprises program instructions, and when the program instructions are executed, any one of the voice synthesis methods provided by the embodiment of the application is realized.
The computer-readable storage medium may be an internal storage unit of the computer device of the foregoing embodiments, such as the hard disk or memory of the computer device. It may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like.
While the application has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring a target text to be synthesized and a target audio;
performing encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding;
inputting the target phoneme embedding and the target timbre embedding into a target diffusion model, and outputting a corresponding target mel spectrum;
and determining, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio.
2. The method of claim 1, wherein the target text comprises a plurality of phonemes, and wherein, after performing the encoding operations on the target text and the target audio respectively to obtain the corresponding target phoneme embedding and target timbre embedding, the method further comprises:
performing duration prediction on the target phoneme embedding and the target audio embedding to obtain target frame length information for each phoneme in a spectrum;
and wherein inputting the target phoneme embedding and the target timbre embedding into the target diffusion model and outputting the corresponding target mel spectrum comprises:
inputting the target phoneme embedding, the target timbre embedding, and the target frame length information into the target diffusion model, and outputting the corresponding target mel spectrum.
3. The method of claim 1, wherein determining, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio comprises:
performing audio conversion on the target mel spectrum through a preset vocoder to obtain the target speech waveform corresponding to the target text and the target audio,
wherein the preset vocoder includes at least one of a WaveGlow vocoder, a WORLD vocoder, and a STRAIGHT vocoder.
4. The method of claim 1, wherein, before inputting the target phoneme embedding and the target timbre embedding into the target diffusion model, the method comprises:
acquiring training samples, wherein the training samples comprise a plurality of text samples and an audio sample corresponding to each text sample;
performing encoding operations on each text sample and the corresponding audio sample to obtain corresponding input features, wherein the input features comprise a text sample embedding and an audio sample embedding;
determining a corresponding initial Mel frequency spectrum based on each input feature, and determining the initial Mel frequency spectrum corresponding to each input feature as a label of the input feature;
and inputting each input feature and a label corresponding to the input feature into an initial diffusion model to obtain the target diffusion model.
5. The method of claim 4, wherein said determining a corresponding initial mel frequency spectrum based on each of said input features comprises:
performing feature conversion on each input feature, and performing variable mapping on the input features after the feature conversion to obtain corresponding hidden sequences;
And decoding each hidden sequence to obtain the corresponding initial Mel frequency spectrum.
6. The method of claim 5, wherein, after decoding each of the hidden sequences to obtain the corresponding initial mel spectrum, the method further comprises:
performing noise diffusion on each initial mel spectrum to obtain noise diffusion samples;
and wherein inputting each input feature and the label corresponding to the input feature into the initial diffusion model to obtain the target diffusion model comprises:
and inputting each input characteristic and a noise diffusion sample corresponding to the input characteristic into the initial diffusion model to obtain the target diffusion model.
7. The method of claim 6, wherein inputting each input feature and the noise diffusion sample corresponding to the input feature into the initial diffusion model to obtain the target diffusion model further comprises:
inputting each input feature and the noise diffusion sample corresponding to the input feature into the initial diffusion model, and training with a preset loss function to obtain the converged target diffusion model,
wherein the preset loss function is:
$$\min_{\theta} L(\theta) = \mathbb{E}\big[\, \| \epsilon - \epsilon_\theta(x_t, c) \|^2 \,\big]$$
where $L(\theta)$ is the preset loss function; $\epsilon_\theta(x_t, c)$ is the noise the model predicts from the input features, with $x_t$ the variable (noised) sequence and $c$ the conditioning formed by the text sample embedding and the audio sample embedding; and $\epsilon$ is the noise of the noise diffusion sample.
8. A speech synthesis apparatus, characterized in that the speech synthesis apparatus comprises:
an acquisition module, configured to acquire a target text to be synthesized and a target audio;
an encoding module, configured to perform encoding operations on the target text and the target audio respectively to obtain a corresponding target phoneme embedding and target timbre embedding;
an input module, configured to input the target phoneme embedding and the target timbre embedding into a target diffusion model and output a corresponding target mel spectrum;
and a speech synthesis module, configured to determine, based on the target mel spectrum, the target speech waveform corresponding to the target text and the target audio.
9. A computer device, comprising: a memory and a processor; wherein the memory is connected to the processor for storing a program and the processor is adapted to implement the steps of the speech synthesis method according to any of claims 1-7 by running the program stored in the memory.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a processor, causes the processor to implement the steps of the speech synthesis method according to any of claims 1-7.
CN202410430727.7A 2024-04-10 2024-04-10 Speech synthesis method, speech synthesis device, apparatus, and storage medium Pending CN118314876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410430727.7A CN118314876A (en) 2024-04-10 2024-04-10 Speech synthesis method, speech synthesis device, apparatus, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410430727.7A CN118314876A (en) 2024-04-10 2024-04-10 Speech synthesis method, speech synthesis device, apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN118314876A true CN118314876A (en) 2024-07-09

Family

ID=91730629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410430727.7A Pending CN118314876A (en) 2024-04-10 2024-04-10 Speech synthesis method, speech synthesis device, apparatus, and storage medium

Country Status (1)

Country Link
CN (1) CN118314876A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination