CN115394283A - Speech synthesis method, speech synthesis device, electronic equipment and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Info

Publication number
CN115394283A
Authority
CN
China
Prior art keywords
model
codes
splicing
audio data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210885008.5A
Other languages
Chinese (zh)
Inventor
李睿端
李健
陈明
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sinovoice Technology Co Ltd
Original Assignee
Beijing Sinovoice Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sinovoice Technology Co Ltd
Priority to CN202210885008.5A
Publication of CN115394283A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a speech synthesis method, a speech synthesis device, an electronic device and a readable storage medium. The method comprises: acquiring text data and audio data corresponding to the text data; acquiring splicing codes according to the text data and the target phoneme codes; splicing the phoneme codes and the target phoneme codes to obtain the splicing codes; inputting the splicing codes into a generation model to generate simulated audio data, and outputting a first loss function; training a discrimination model according to the first loss function to obtain a trained discrimination model; inputting the simulated audio data and the audio data into the trained discrimination model for judgment; and performing iterative optimization on the generation model according to the judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold value, then outputting the target audio data. The adversarial structure formed by the generation model and the discrimination model extends their expression over more phoneme combinations, bringing the synthesized speech closer to the speaker's real voice.

Description

Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Text-to-speech (TTS) technology takes text as input and outputs the corresponding audio. Building a TTS voice library usually requires a target speaker to record, in a professional studio with professional recording equipment and in a stable vocal condition, a certain amount of speech that is then manually labeled before training. The amount of recording differs by technical route: traditional unit-selection concatenation needs 20-50 hours of recordings from the speaker, while parametric synthesis and neural-network approaches need less, but still at least 2 hours. Such a recording requirement is feasible in a business-oriented (To Business, ToB) scenario, but it is impractical in a scenario facing ordinary users (To Consumer, ToC). Speech synthesis from a very small data set therefore becomes an important problem.
To address this, the prior art trains an average model on corpora from other speakers, so that the model can represent the features of each phoneme. The target speaker's data is then fed into the model together with a speaker feature as an additional input, so that the model strengthens its ability to restore that speaker's characteristics and meets the task requirement.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a speech synthesis method, apparatus, electronic device, and storage medium.
According to a first aspect of the embodiments of the present disclosure, there is provided a speech synthesis method, including:
acquiring text data and audio data corresponding to the text data from a target data set;
acquiring splicing codes according to the text data and the target phoneme codes;
splicing the phoneme codes and the target phoneme codes to obtain splicing codes;
inputting the splicing codes into a generation model to generate simulated audio data, and outputting a first loss function;
training a discrimination model according to the first loss function to obtain a trained discrimination model;
inputting the simulated audio data and the audio data into a trained discrimination model for judgment;
and performing iterative optimization on the generation model according to the judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold value, and outputting target audio data.
Optionally, inputting the splicing codes into the generation model to generate simulated audio data and outputting the first loss function includes:
performing vectorization according to the splicing codes to obtain splicing vectors;
and converting the splicing vectors into intermediate features, generating the target feature at the second moment from the intermediate features and the target feature output at the first moment, and generating the simulated audio data and the first loss function according to the target feature.
Optionally, the performing iterative optimization on the generation model according to the judgment result includes:
obtaining a second loss function of the discrimination model;
and training the discrimination model according to the second loss function, freezing its parameters, then training the generation model according to the second loss function of the discrimination model, and performing iterative optimization.
Optionally, the obtaining of the splicing code according to the text data and the target phoneme code includes:
converting the text data into a pinyin sequence, and acquiring a corresponding phoneme code according to the pinyin sequence;
and splicing the phoneme codes and the target phoneme codes to obtain spliced codes.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech synthesis apparatus, the apparatus including:
the first acquisition module is used for acquiring text data and audio data corresponding to the text data from a target data set;
the second acquisition module is used for acquiring splicing codes according to the text data and the target phoneme codes;
the splicing module is used for splicing the phoneme codes and the target phoneme codes to obtain spliced codes;
the output module is used for inputting the splicing codes into a generation model to generate simulated audio data and outputting a first loss function;
the training module is used for training a discrimination model according to the first loss function to obtain a trained discrimination model;
the judging module is used for inputting the simulated audio data and the audio data into the trained discrimination model for judgment;
and the optimization output module is used for performing iterative optimization on the generation model according to a judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold value, and outputting target audio data.
Optionally, the output module includes:
the processing submodule is used for carrying out vectorization processing according to the splicing codes to obtain splicing vectors;
and the generation submodule is used for converting the splicing vector into intermediate features, outputting the intermediate features and the target features at the first moment to generate target features at the second moment, and generating simulated audio data and a first loss function according to the target features.
Optionally, the optimization output module includes:
the obtaining submodule is used for obtaining a second loss function of the discrimination model;
and the optimization submodule is used for training the discriminant model according to the second loss function, freezing parameters, training the generation model according to the second loss function of the discriminant model, and performing iterative optimization.
Optionally, the splicing module includes:
the conversion submodule is used for converting the text data into a pinyin sequence and acquiring a corresponding phoneme code according to the pinyin sequence;
and the splicing submodule is used for splicing the phoneme codes and the target phoneme codes to obtain spliced codes.
According to a third aspect of an embodiment of the present disclosure, there is provided an electronic apparatus including:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech synthesis method according to the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium, in which instructions, when executed by a processor of an electronic device, enable the electronic device to perform the speech synthesis method according to the first aspect.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the method and the device can acquire text data and audio data corresponding to the text data from a target data set; acquiring splicing codes according to the text data and the target phoneme codes; splicing the phoneme codes and the target phoneme codes to obtain spliced codes; inputting the splicing code into a generating model to generate analog audio data, and outputting a first loss function; training a discrimination model according to the first loss function to obtain a trained discrimination model; inputting the simulated audio data and the audio data into a trained discrimination model for judgment; and performing iterative optimization on the generated model according to the judgment result until the judgment result of the simulated audio data output in the judgment model is equal to a preset threshold value, and outputting target audio data. Namely, the countermeasure structure between the generation model and the discrimination model is expressed on more phoneme combinations, so that the countermeasure structure is closer to the real speaking effect of the speaker. Due to the confrontation of the generator and the discriminator, the model tends to generate more real audio, the mechanical feeling of the synthesized audio is reduced, and the problem that the output of the Tacotron-2 model is too smooth is solved. According to the technical scheme provided by the embodiment of the disclosure, the voice synthesis can be realized by using few data sets, the phoneme coverage is sufficient, and the simulated synthetic audio effect is more vivid.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of speech synthesis according to an exemplary embodiment;
FIG. 2 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment;
FIG. 3 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
It should be noted that, in the embodiments of the present application, TTS is divided into several parts, such as text analysis (e.g., text regularization and polyphone disambiguation), a prosody prediction module, a duration model, an acoustic model, and a vocoder. The processed text passes through the prosody prediction module, which outputs text with prosody symbols, and then goes through the grapheme-to-phoneme (G2P) conversion and subsequent links. TTS technology has gone through three main stages: unit-selection concatenation, parametric synthesis, and neural-network synthesis.
To help those skilled in the art further understand the present application, it should be noted that the embodiments of the present application are mainly applied to speech synthesis with very small data sets. For example, with the scheme in the embodiments of the present application, user A can synthesize a reading of a long novel (for example, the Romance of the Three Kingdoms) in his or her own voice from only a brief recording. Alternatively, in an in-vehicle scenario, user B may prefer navigation prompts in a family member's voice rather than the built-in voice packs or celebrity voices of the navigation software; in this case, a very small data set collected from a short recording is enough to complete the navigation broadcasting.
FIG. 1 is a flow diagram illustrating a speech synthesis method according to an exemplary embodiment. As shown in FIG. 1, the method includes the following steps.
In step 101, text data and audio data corresponding to the text data are obtained from a target data set.
It should be noted that, in the embodiments of the present application, the target data set may be a very small data set, i.e., a data set that can still yield the target output from only a small amount of data, in contrast to the usual case where a large amount of existing data is fed into a model to produce the target result. The very small data set contains text data and the audio data corresponding to that text data, and the audio data is a recording made by the speaker reading the recording script.
In step 102, a splicing code is obtained according to the text data and the target phoneme code.
It should be noted that, in the embodiments of the present application, the obtained text data needs to be converted into a machine-readable representation: specifically, the text data is converted into a pinyin sequence, and corresponding phoneme codes are obtained according to the pinyin sequence.
It should be noted that converting the text data into the pinyin sequence may be performed by serializing the text; the pinyin sequence is then converted into the phoneme codes used by the acoustic model, and acoustic features adapted to the model are extracted from the audio.
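For illustration only, the following is a minimal sketch of this front-end step, assuming the pypinyin library for the text-to-pinyin conversion and a small hypothetical phoneme table (the patent does not prescribe either choice):

```python
# Sketch of step 102: text data -> pinyin sequence -> phoneme codes.
# The phoneme inventory and table below are illustrative assumptions, not the patent's.
from pypinyin import Style, lazy_pinyin

# Hypothetical phoneme inventory: initials plus tonal finals, e.g. "n", "i3", "h", "ao3".
PHONEME_TO_ID = {p: i for i, p in enumerate(["<pad>", "n", "i3", "h", "ao3"])}

INITIALS = set("b p m f d t n l g k h j q x zh ch sh r z c s y w".split())

def syllable_to_phonemes(syllable: str):
    """Split one tone3-style syllable (e.g. 'hao3') into an initial and a tonal final."""
    for length in (2, 1):  # try two-letter initials (zh/ch/sh) first
        if syllable[:length] in INITIALS:
            return [syllable[:length], syllable[length:]]
    return [syllable]      # zero-initial syllable, e.g. 'an1'

def text_to_phoneme_codes(text: str):
    pinyin_seq = lazy_pinyin(text, style=Style.TONE3)                  # text -> pinyin sequence
    phonemes = [p for s in pinyin_seq for p in syllable_to_phonemes(s)]
    return [PHONEME_TO_ID[p] for p in phonemes]                        # pinyin -> phoneme codes

print(text_to_phoneme_codes("你好"))  # e.g. [1, 2, 3, 4]
```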
In step 103, the phoneme codes and the target phoneme codes are spliced to obtain spliced codes.
It should be noted that the phoneme codes and the target phoneme codes are spliced to obtain the splicing codes, and the splicing codes are used as part of the encoder input.
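As a purely illustrative sketch of this splicing step, assuming the target phoneme codes form a short fixed-length code prepended to the text's phoneme codes (the patent does not fix the exact layout):

```python
# Sketch of step 103: splice the phoneme codes with the target phoneme codes.
import torch

phoneme_codes = torch.tensor([[1, 2, 3, 4]])   # from step 102, shape (batch, T_text)
target_codes = torch.tensor([[7, 8]])          # hypothetical target phoneme codes
splicing_codes = torch.cat([target_codes, phoneme_codes], dim=1)  # encoder input
print(splicing_codes)  # tensor([[7, 8, 1, 2, 3, 4]])
```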
In step 104, the splice code is input into a generative model to generate simulated audio data, and a first loss function is output.
It should be noted that, in the embodiments of the present application, after the splicing codes are obtained in step 103, the splicing codes are input into a generation model to generate simulated audio data, and a first loss function is output.
It should be noted that speech synthesis with very small data sets is achieved by means of a neural network model: any TTS model can be chosen, and, on the basis of an average model trained on many speakers, a speaker code is added to accomplish the task. For example, the basic structure of the Tacotron-2 model is selected and fine-tuned on top of a multi-speaker model. The speaker code can be obtained through one-hot embedding, or by additionally training a model to extract a speaker embedding (speaker vector). The multi-speaker average model is used because, in practical application scenarios, the very small data set usually contains only 10-100 sentences (with an average sentence length under 25 characters), which makes phoneme coverage hard to achieve; even basic phoneme coverage (initials plus tonal finals) is difficult. Therefore, the prior art trains an average model with the help of other speakers' corpora (for example, 200 hours of data); such a model can characterize each phoneme, but its ability to restore a particular speaker's features is poor. Moreover, even with a multi-speaker model, insufficient phoneme coverage still affects the final result, and the over-smoothing problem of the Tacotron-2 model's output still needs some structure for correction and improvement.
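The speaker-coding idea mentioned above can be sketched as follows, assuming a simple learned embedding table that conditions the average model's encoder states on a speaker code; the class name and dimensions are illustrative assumptions, not the patent's configuration:

```python
# Sketch: condition a multi-speaker average model on a speaker code before fine-tuning.
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    def __init__(self, num_speakers: int, speaker_dim: int = 64, text_dim: int = 512):
        super().__init__()
        self.speaker_table = nn.Embedding(num_speakers, speaker_dim)  # one-hot id -> embedding
        self.proj = nn.Linear(text_dim + speaker_dim, text_dim)

    def forward(self, encoder_states, speaker_id):
        # encoder_states: (B, T, text_dim) from the average model's text encoder; speaker_id: (B,)
        spk = self.speaker_table(speaker_id)                            # (B, speaker_dim)
        spk = spk.unsqueeze(1).expand(-1, encoder_states.size(1), -1)   # broadcast over time steps
        return self.proj(torch.cat([encoder_states, spk], dim=-1))      # speaker-aware states

cond = SpeakerConditioner(num_speakers=200)
print(cond(torch.randn(2, 30, 512), torch.tensor([3, 17])).shape)  # torch.Size([2, 30, 512])
```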
Therefore, in the embodiments of the present application, a Generative Adversarial Network (GAN) is used so that the model tends to generate more realistic audio and the mechanical feel of the synthesized audio is reduced. In the embodiments of the present application, the generation model is the generator and the discrimination model is the discriminator. Their relationship can be understood as follows: the generator produces an output and feeds it to the discriminator, which judges whether the input is real data or machine-generated. If the discriminator is not fooled, the generator continues to evolve and produces a second-generation output, which is again fed to the discriminator; the discriminator is evolving at the same time and imposes stricter requirements on the generator's output. Through this continual co-evolution, the generation model and the discrimination model form a generative adversarial network.
In step 104, generating the simulated audio data and outputting the first loss function specifically includes the following steps: performing vectorization according to the splicing codes to obtain splicing vectors; converting the splicing vectors into intermediate features; generating the target feature at the second moment from the intermediate features and the target feature output at the first moment; and generating the simulated audio data and the first loss function according to the target feature.
It should be noted that, in the embodiments of the present application, the audio data is taken as input and vector mapping is performed, i.e., entities are represented by vectors; the splicing codes obtained for the target object are likewise mapped to vectors. The resulting splicing vectors then enter the encoder part of the speech synthesis model, which consists of three one-dimensional convolution layers and a bidirectional LSTM (Long Short-Term Memory) layer, after which the model enters the location-sensitive attention and decoder part. The location-sensitive attention mechanism takes the attention result of the previous time step as location feature information of the sequence and adds it to the original content-based attention, so that the attention combines two kinds of information. In a speech synthesis task, for example, the output sequence is longer than the input sequence; because the hybrid attention contains location information, the next position to attend to in the input sequence can be selected. In the location-sensitive attention and decoder part, the model converts the splicing vectors into intermediate features, generates the target feature at the second moment from the intermediate features and the target feature output at the first moment, and generates the simulated audio data and the first loss function according to the target feature.
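A minimal sketch of the encoder just described (three one-dimensional convolution layers followed by a bidirectional LSTM), in the spirit of Tacotron-2, is given below; the layer sizes are assumptions rather than the patent's exact configuration:

```python
# Sketch of the encoder part: splicing codes -> splicing vectors -> intermediate features.
import torch
import torch.nn as nn

class SpliceEncoder(nn.Module):
    def __init__(self, vocab_size: int, channels: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, channels)   # splicing codes -> splicing vectors
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size=5, padding=2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
            )
            for _ in range(3)                                  # three one-dimensional conv layers
        ])
        self.bilstm = nn.LSTM(channels, channels // 2, batch_first=True, bidirectional=True)

    def forward(self, splicing_codes):
        x = self.embedding(splicing_codes).transpose(1, 2)     # (B, C, T)
        for conv in self.convs:
            x = conv(x)
        x, _ = self.bilstm(x.transpose(1, 2))                  # (B, T, C) intermediate features
        return x

enc = SpliceEncoder(vocab_size=100)
print(enc(torch.randint(0, 100, (2, 20))).shape)  # torch.Size([2, 20, 512])
```

The intermediate features produced here are what the location-sensitive attention and decoder consume to generate the target features frame by frame.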
In step 105, the discriminant model is trained according to the first loss function to obtain a trained discriminant model.
In the embodiments of the present application, the first loss function is calculated in the decoder of the speech synthesis model in step 104, and the discrimination model is trained according to the first loss function to obtain the trained discrimination model.
In step 106, the simulated audio data and the audio data are input into the trained discrimination model for judgment.
As described above, the relationship between the generation model and the discrimination model can be understood as follows: the generator produces an output and feeds it to the discriminator, which judges whether the input is real data or machine-generated; if the discriminator is not fooled, the generator continues to evolve, and its second-generation output is again fed to the discriminator, which evolves at the same time and imposes stricter requirements on the generator's output. Therefore, in the embodiments of the present application, the discrimination model judges, based on the simulated audio data produced by the generation model and the actually collected audio data of the target user, whether the data in the audio data set is real or machine-generated, and outputs the corresponding judgment result.
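For illustration, a discrimination model over mel-spectrogram frames might look like the small convolutional classifier below; the patent does not specify the discriminator architecture, so this is only an assumed sketch:

```python
# Sketch of a discrimination model: mel features -> real/fake logit.
import torch
import torch.nn as nn

class MelDiscriminator(nn.Module):
    def __init__(self, n_mels: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=3, padding=1),       # per-frame real/fake logits
        )

    def forward(self, mel):
        # mel: (B, T, n_mels) -> one logit per utterance, averaged over frames
        return self.net(mel.transpose(1, 2)).mean(dim=(1, 2))

d = MelDiscriminator()
print(d(torch.randn(4, 120, 80)).shape)  # torch.Size([4])
```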
In step 107, the generation model is iteratively optimized according to the judgment result until the judgment result output by the discrimination model for the simulated audio data reaches the preset threshold, and the target audio data is output.
In step 107, performing iterative optimization on the generation model according to the judgment result includes the following steps: obtaining a second loss function of the discrimination model; training the discrimination model according to the second loss function and then freezing its parameters; and training the generation model according to the second loss function of the discrimination model to perform iterative optimization.
The generation model is iteratively optimized according to the discrimination model's judgment of the data in the audio data set until the judgment result output by the discrimination model for the simulated audio data reaches the preset threshold, where the preset threshold is an index for judging whether the simulated audio data is realistic.
It should be noted that, in practice, the generation model and the discrimination model are kept in constant mutual opposition, which ensures that the simulated audio produced by the generation model can closely imitate the target user's real voice. That is, in the embodiments of the present application, the discriminator is trained using the discrimination model's loss; its parameters are then frozen, the roles are swapped, and the generation model is trained using the discrimination model's loss for iterative optimization. In brief, the generator predicts and produces n batches of results and the loss is calculated; the discriminator is then trained, using the discriminator loss, on batches composed of the generated data and the real data; the discriminator's parameters are then frozen, the roles are swapped, and the generator is trained using the discriminator's loss.
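The alternating optimization described above can be sketched as the following PyTorch-style training step, where G (splicing codes to mel features), D (mel features to a logit), the optimizers, and the loss mix are placeholders rather than the patent's exact implementation:

```python
# Sketch of one adversarial training step: train D, freeze it, then train G.
import torch
import torch.nn.functional as F

def adversarial_step(G, D, opt_g, opt_d, splicing_codes, real_mel):
    # 1) The generator predicts a batch and the reconstruction ("first") loss is computed.
    fake_mel = G(splicing_codes)
    recon_loss = F.l1_loss(fake_mel, real_mel)

    # 2) Train the discriminator on real data and the (detached) generated data.
    real_logit, fake_logit = D(real_mel), D(fake_mel.detach())
    d_loss = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) \
           + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 3) Freeze the discriminator's parameters, swap roles, and train the generator so that
    #    its output is judged as real by the frozen discriminator.
    for p in D.parameters():
        p.requires_grad_(False)
    adv_logit = D(fake_mel)
    g_loss = recon_loss + F.binary_cross_entropy_with_logits(adv_logit, torch.ones_like(adv_logit))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    for p in D.parameters():
        p.requires_grad_(True)

    return float(recon_loss), float(d_loss), float(g_loss)
```

Training continues until the discriminator's judgment of the generated audio reaches the preset threshold described in step 107.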
The method and the device can acquire text data and audio data corresponding to the text data from a target data set; acquire splicing codes according to the text data and the target phoneme codes; splice the phoneme codes and the target phoneme codes to obtain the splicing codes; input the splicing codes into the generation model to generate simulated audio data and output a first loss function; train the discrimination model according to the first loss function to obtain the trained discrimination model; input the simulated audio data and the audio data into the trained discrimination model for judgment; and iteratively optimize the generation model according to the judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold, then output the target audio data. That is, the GAN structure extends the model's expression over more phoneme combinations, bringing it closer to the speaker's real speaking effect. Because the generator and the discriminator compete, the model tends to generate more realistic audio, the mechanical feel of the synthesized audio is reduced, and the over-smoothing problem of the Tacotron-2 model's output is alleviated. The technical scheme provided by the embodiments of the disclosure can achieve speech synthesis with a very small data set, with sufficient phoneme coverage and a more lifelike synthesized audio.
Fig. 2 is a block diagram illustrating a speech synthesis apparatus according to an exemplary embodiment. The apparatus includes a first obtaining module, a second obtaining module, a splicing module, an output module, a training module, a judging module, and an optimization output module.
A first obtaining module 201, configured to obtain text data and audio data corresponding to the text data from a target data set;
a second obtaining module 202, configured to obtain a splicing code according to the text data and the target phoneme code;
the splicing module 203 is configured to splice the phoneme codes and the target phoneme codes to obtain spliced codes;
an output module 204, configured to input the splicing codes into a generation model to generate simulated audio data, and output a first loss function;
a training module 205, configured to train a discriminant model according to the first loss function to obtain a trained discriminant model;
a judging module 206, configured to input the simulated audio data and the audio data into the trained discrimination model for judgment;
and the optimization output module 207 is configured to perform iterative optimization on the generation model according to a judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold, and output target audio data.
Optionally, the output module includes:
the processing submodule is used for carrying out vectorization processing according to the splicing codes to obtain splicing vectors;
and the generation submodule is used for converting the splicing vector into intermediate features, outputting the intermediate features and the target features at the first moment to generate target features at the second moment, and generating simulated audio data and a first loss function according to the target features.
Optionally, the optimization output module includes:
the obtaining submodule is used for obtaining a second loss function of the discrimination model;
and the optimization submodule is used for training the discriminant model according to the second loss function, freezing parameters, training the generation model according to the second loss function of the discriminant model, and performing iterative optimization.
Optionally, the splicing module includes:
the conversion submodule is used for converting the text data into a pinyin sequence and acquiring a corresponding phoneme code according to the pinyin sequence;
and the splicing submodule is used for splicing the phoneme codes and the target phoneme codes to obtain spliced codes.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 3 is a block diagram illustrating an electronic device 1400 in accordance with an example embodiment. For example, the electronic device 1400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and so forth.
Referring to fig. 3, electronic device 1400 may include one or more of the following components: a processing component 1402, a memory 1404, a power component 1406, a multimedia component 1408, an audio component 1410, an input/output interface 1412, a sensor component 1414, and a communication component 1416.
The processing component 1402 generally controls the overall operation of the device 1400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. Processing component 1402 may include one or more processors 1420 to execute instructions to perform all or a portion of the steps of the methods described above. Further, processing component 1402 can include one or more modules that facilitate interaction between processing component 1402 and other components. For example, the processing component 1402 can include a multimedia module to facilitate interaction between the multimedia component 1408 and the processing component 1402.
The memory 1404 is configured to store various types of data to support operation at the device 1400. Examples of such data include instructions for any application or method operating on device 1400, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 1404 may be implemented by any type or combination of volatile or non-volatile storage devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 1406 provides power to the various components of the electronic device 1400. Power components 1406 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for electronic device 1400.
The multimedia component 1408 comprises a screen that provides an output interface between the electronic device 1400 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 1408 includes a front-facing camera and/or a rear-facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 1400 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 1410 is configured to output and/or input audio signals. For example, the audio component 1410 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 1400 is in operating modes, such as a call mode, a record mode, and a voice recognition mode. The received audio signals may further be stored in the memory 1404 or transmitted via the communication component 1416. In some embodiments, audio component 1410 further includes a speaker for outputting audio signals.
Input/output interface 1412 provides an interface between processing component 1402 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 1414 includes one or more sensors for providing various aspects of status assessment for the electronic device 1400. For example, the sensor component 1414 may detect an open/closed state of the electronic device 1400, a relative positioning of components, such as a display and keypad of the electronic device 1400, a change in position of the electronic device 1400 or a component of the electronic device 1400, the presence or absence of user contact with the electronic device 1400, an orientation or acceleration/deceleration of the electronic device 1400, and a change in temperature of the electronic device 1400. The sensor assembly 1414 may include a proximity sensor configured to detect the presence of a nearby object in the absence of any physical contact. The sensor assembly 1414 may also include a photosensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 1414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 1416 is configured to facilitate wired or wireless communication between the electronic device 1400 and other devices. The electronic device 1400 may access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 1416 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 1416 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 1400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as the memory 1404 that includes instructions executable by the processor 1420 of the electronic device 1400 to perform the above-described method. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring text data and audio data corresponding to the text data from a target data set;
acquiring splicing codes according to the text data and the target phoneme codes;
splicing the phoneme codes and the target phoneme codes to obtain spliced codes;
inputting the splicing codes into a generation model to generate simulated audio data, and outputting a first loss function;
training a discrimination model according to the first loss function to obtain a trained discrimination model;
inputting the simulated audio data and the audio data into a trained discrimination model for judgment;
and performing iterative optimization on the generation model according to the judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold value, and outputting target audio data.
2. The method of claim 1, wherein inputting the splicing codes into the generation model to generate simulated audio data and outputting the first loss function comprises:
vectorizing according to the splicing codes to obtain splicing vectors;
and converting the splicing vectors into intermediate features, generating the target feature at the second moment from the intermediate features and the target feature output at the first moment, and generating the simulated audio data and the first loss function according to the target feature.
3. The method of claim 1, wherein the iteratively optimizing the generative model according to the determination comprises:
acquiring a second loss function of the discrimination model;
and training the discriminant model according to the second loss function, freezing parameters, training the generation model according to the second loss function of the discriminant model, and performing iterative optimization.
4. The method of claim 1, wherein obtaining the concatenation code from the text data and the target phoneme code comprises:
converting the text data into a pinyin sequence, and acquiring a corresponding phoneme code according to the pinyin sequence;
and splicing the phoneme codes and the target phoneme codes to obtain spliced codes.
5. A speech synthesis apparatus, characterized in that the apparatus comprises:
the first acquisition module is used for acquiring text data and audio data corresponding to the text data from a target data set;
the second acquisition module is used for acquiring splicing codes according to the text data and the target phoneme codes;
the splicing module is used for splicing the phoneme codes and the target phoneme codes to obtain spliced codes;
the output module is used for inputting the splicing codes into a generation model to generate simulated audio data and outputting a first loss function;
the training module is used for training a discrimination model according to the first loss function to obtain the trained discrimination model;
the judging module is used for inputting the simulated audio data and the audio data into the trained discrimination model for judgment;
and the optimization output module is used for performing iterative optimization on the generation model according to a judgment result until the judgment result output by the discrimination model for the simulated audio data reaches a preset threshold value, and outputting target audio data.
6. The apparatus of claim 5, wherein the output module comprises:
the processing submodule is used for carrying out vectorization processing according to the splicing codes to obtain splicing vectors;
and the generation submodule is used for converting the splicing vector into intermediate features, outputting the intermediate features and the target features at the first moment to generate target features at the second moment, and generating simulated audio data and a first loss function according to the target features.
7. The apparatus of claim 5, wherein the optimization output module comprises:
the obtaining submodule is used for obtaining a second loss function of the discrimination model;
and the optimization submodule is used for training the discriminant model according to the second loss function, freezing parameters, training the generation model according to the second loss function of the discriminant model and performing iterative optimization.
8. The apparatus of claim 5, wherein the splicing module comprises:
the conversion submodule is used for converting the text data into a pinyin sequence and acquiring a corresponding phoneme code according to the pinyin sequence;
and the splicing submodule is used for splicing the phoneme codes and the target phoneme codes to obtain spliced codes.
9. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the speech synthesis method of any of claims 1 to 4.
10. A computer-readable storage medium in which instructions, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the speech synthesis method of any of claims 1 to 4.
CN202210885008.5A 2022-07-26 2022-07-26 Speech synthesis method, speech synthesis device, electronic equipment and storage medium Pending CN115394283A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210885008.5A CN115394283A (en) 2022-07-26 2022-07-26 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210885008.5A CN115394283A (en) 2022-07-26 2022-07-26 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115394283A true CN115394283A (en) 2022-11-25

Family

ID=84116240

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210885008.5A Pending CN115394283A (en) 2022-07-26 2022-07-26 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115394283A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118197277A (en) * 2024-05-15 2024-06-14 国家超级计算天津中心 Speech synthesis method, device, electronic equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination