CN112270917A - Voice synthesis method and device, electronic equipment and readable storage medium - Google Patents

Voice synthesis method and device, electronic equipment and readable storage medium

Info

Publication number
CN112270917A
Authority
CN
China
Prior art keywords: text, voice, sample, character set, processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011128996.6A
Other languages
Chinese (zh)
Other versions
CN112270917B (en)
Inventor
詹皓粤
林悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202011128996.6A
Publication of CN112270917A
Application granted
Publication of CN112270917B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a voice synthesis method and device, an electronic device, and a readable storage medium. The voice synthesis method comprises the following steps: first acquiring a text to be processed and the tone features of the voice to be synthesized; then converting the text to be processed into a text character set represented by unified characters according to a mapping relationship between the fonts corresponding to a preset voice text and the unified characters; then extracting, from the text character set, fundamental frequency information features characterizing each character and/or word; and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features into a trained voice synthesis model to obtain synthesized voice. Because tone, fundamental frequency information, and other features are added to the text represented by unified characters while the voice synthesis model synthesizes the voice, the synthesized voice is more lifelike, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.

Description

Voice synthesis method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and readable storage medium.
Background
In recent years, voice interaction, as a new mode of interaction, has brought a brand-new user experience and expanded the design ideas and application scenarios of many products. Speech synthesis is the technology of converting text into sound. Mixed-language speech synthesis means that the text to be synthesized contains multiple languages, and that text containing multiple languages is converted into the corresponding sound.
In the prior art, when text in different languages is converted into corresponding speech, a mixed-language speech synthesis model is usually used. However, when the mixed-language speech synthesis model synthesizes speech, the pronunciation characteristics of the different languages are not considered, so the synthesized speech differs noticeably from real speech, which degrades the service experience of voice interaction.
Disclosure of Invention
In view of the above, an object of the present application is to provide a speech synthesis method, an apparatus, an electronic device, and a readable storage medium, in which features such as timbre and fundamental frequency information are added to a text represented by unified characters during the speech synthesis process of a speech synthesis model, so that the synthesized speech is more lifelike, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.
In a first aspect, an embodiment of the present application provides a speech synthesis method, where the speech synthesis method includes:
acquiring a text to be processed and the tone characteristics of a voice to be synthesized;
converting the text to be processed into a text character set represented by unified characters according to the mapping relationship between the fonts corresponding to the preset voice text and the unified characters;
extracting fundamental frequency information characteristics representing each word and/or phrase from the text character set;
and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone features into a trained voice synthesis model to obtain synthesized voice.
Preferably, the converting the text to be processed into a text character set represented by unified characters according to a mapping relationship between a font corresponding to a preset voice text and the unified characters includes:
determining a font corresponding to the text to be processed and a plurality of characters and/or words in the text to be processed;
determining a phonetic symbol corresponding to each character and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol;
and determining a text character set represented by the phonetic symbol based on the phonetic symbol corresponding to each character and/or word and the position of each character and/or word in the text to be processed.
Preferably, the converting the text to be processed into a text character set represented by unified characters according to a mapping relationship between a font corresponding to a preset voice text and the unified characters includes:
determining a font corresponding to each character and/or word in the text to be processed;
inputting the font corresponding to each character and/or word into a trained text processing model to obtain a phonetic symbol corresponding to each character and/or word;
and determining a text character set represented by phonetic symbols based on the position of each character and/or word in the text to be processed.
Preferably, the inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features into a trained speech synthesis model to obtain a synthesized speech includes:
inputting the text character set and the fundamental frequency information features into a feedforward neural network of a trained speech synthesis model for linear processing to obtain first linear features corresponding to the fundamental frequency information features and second linear features corresponding to the text character set;
inputting the feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain a synthesized voice corresponding to the text to be processed.
Preferably, the speech synthesis model is trained by:
acquiring a plurality of voice samples, a sample text corresponding to each voice sample and a tone sample characteristic corresponding to each voice sample;
for the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to a preset mapping relationship between a font corresponding to the voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each word and/or phrase from each sample character set;
and training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model.
Preferably, the training the constructed neural network model based on the sample character set, the fundamental frequency information sample feature, the tone sample feature and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model includes:
inputting the sample character set and the fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
inputting a prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multilayer convolution neural network of the constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
and inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Preferably, the text processing model is trained by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
determining the font and phonetic symbol corresponding to each character and/or word in the sample text;
inputting the font and phonetic symbol corresponding to each character and/or word into the constructed recurrent neural network model for training, and adjusting the parameters of the recurrent neural network model to obtain the trained text processing model.
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
the acquisition module is used for acquiring the text to be processed and the tone characteristics of the voice to be synthesized;
the conversion module is used for converting the text to be processed into a text character set represented by unified characters according to the mapping relationship between the fonts corresponding to the preset voice text and the unified characters;
the characteristic extraction module is used for extracting fundamental frequency information characteristics representing each character and/or word from the text character set;
and the voice synthesis module is used for inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristics and the tone characteristics into a trained voice synthesis model to obtain synthesized voice.
Preferably, when the conversion module is configured to convert the text to be processed into a text character set represented by unified characters according to a mapping relationship between a font corresponding to a preset speech text and the unified characters, the conversion module is configured to:
determining a font corresponding to the text to be processed and a plurality of characters and/or words in the text to be processed;
determining a phonetic symbol corresponding to each character and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol;
and determining a text character set represented by the phonetic symbol based on the phonetic symbol corresponding to each character and/or word and the position of each character and/or word in the text to be processed.
Preferably, when the conversion module is configured to convert the text to be processed into a text character set represented by unified characters according to a mapping relationship between a font corresponding to a preset speech text and the unified characters, the conversion module is configured to:
determining a font corresponding to each character and/or word in the text to be processed;
inputting the font corresponding to each character and/or word into a trained text processing model to obtain a phonetic symbol corresponding to each character and/or word;
and determining a text character set represented by phonetic symbols based on the position of each character and/or word in the text to be processed.
Preferably, when the speech synthesis module is configured to input the text character set, the extracted fundamental frequency information features, and the tone features corresponding to the text to be processed into a trained speech synthesis model to obtain a synthesized speech, the speech synthesis module is configured to:
inputting the text character set and the fundamental frequency information features into a feedforward neural network of a trained speech synthesis model for linear processing to obtain first linear features corresponding to the fundamental frequency information features and second linear features corresponding to the text character set;
inputting the feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain a synthesized voice corresponding to the text to be processed.
Preferably, the speech synthesis apparatus further comprises a synthesis model training module, and the synthesis model training module is configured to train the speech synthesis model by:
acquiring a plurality of voice samples, a sample text corresponding to each voice sample and a tone sample characteristic corresponding to each voice sample;
for the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to a preset mapping relationship between a font corresponding to the voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each word and/or phrase from each sample character set;
and training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model.
Preferably, when the synthesis model training module is configured to train the constructed neural network model based on the sample character set, the fundamental frequency information sample feature, the tone sample feature, and the voice sample corresponding to the sample character set, and obtain a trained voice synthesis model, the synthesis model training module is configured to:
inputting the sample character set and the fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
inputting a prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multilayer convolution neural network of the constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
and inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Preferably, the speech synthesis apparatus further comprises a process model training module, the process model training module is configured to train the text process model by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
determining the font and phonetic symbol corresponding to each character and/or word in the sample text;
inputting the font and phonetic symbol corresponding to each character and/or word into the constructed recurrent neural network model for training, and adjusting the parameters of the recurrent neural network model to obtain the trained text processing model.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the speech synthesis method according to the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to perform the steps of the speech synthesis method according to the first aspect.
The embodiment of the application provides a speech synthesis method and device, an electronic device, and a readable storage medium. The speech synthesis method comprises the following steps: first acquiring a text to be processed and the tone features of the speech to be synthesized; then converting the text to be processed into a text character set represented by unified characters according to a mapping relationship between the fonts corresponding to a preset voice text and the unified characters; then extracting, from the text character set, fundamental frequency information features characterizing each character and/or word; and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features into a trained speech synthesis model to obtain synthesized speech. Therefore, in the process of synthesizing speech with the speech synthesis model, tone, fundamental frequency information, and other features are added to the text represented by unified characters, so that the synthesized speech is more lifelike, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present application and therefore should not be regarded as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 3 is a second schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.
First, an application scenario to which the present application is applicable will be described. The present application can be applied to the technical field of speech synthesis. The text to be processed and the tone features of the speech to be synthesized are first acquired; the text to be processed is then converted into a text character set represented by unified characters according to the mapping relationship between the fonts corresponding to the preset speech text and the unified characters; fundamental frequency information features characterizing each character and/or word are then extracted from the text character set; and the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features are input into a trained speech synthesis model to obtain the synthesized speech. Because tone, fundamental frequency information, and other features are added to the text represented by unified characters while the speech synthesis model synthesizes the speech, the synthesized speech is more lifelike, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.
In the prior art, when text in different languages is converted into corresponding speech, a mixed-language speech synthesis model is usually used. However, when the mixed-language speech synthesis model synthesizes speech, the pronunciation characteristics of the different languages are not considered, so the synthesized speech differs noticeably from real speech, which degrades the service experience of voice interaction. Based on this, the embodiment of the application provides a speech synthesis method, a speech synthesis device, an electronic device, and a readable storage medium: in the process of synthesizing speech with a speech synthesis model, features such as tone and fundamental frequency information are added to a text represented by unified characters, so that the synthesized speech is more lifelike, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.
Referring to fig. 1, fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the present application. As shown in fig. 1, a speech synthesis method provided in an embodiment of the present application includes:
and S110, acquiring the text to be processed and the tone characteristics of the voice to be synthesized.
In this step, the text to be processed may be text in a single language or text in mixed languages; the tone (i.e., timbre) features of the speech to be synthesized can be added, according to the actual requirements of the user, while speech is synthesized from the text to be processed.
The required tone color (timbre) characteristics of the speaker can be set in advance, and the preset tone color characteristics can be added to the synthesis process when speech is synthesized from the text to be processed, so that speech whose pronunciation timbre corresponds to those tone color characteristics is obtained.
Therefore, by setting different tone characteristics for different application scenarios, the service experience of voice interaction can be improved.
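As a minimal illustrative sketch (not taken from the patent), one common way to realize such preset tone (timbre) features is a fixed-length speaker embedding looked up from a preconfigured table; the table contents, names, and dimension below are assumptions:

    import torch

    # Hypothetical table of preset speaker timbre features (values here are random placeholders).
    TIMBRE_DIM = 256
    timbre_table = {
        "narrator_female": torch.randn(TIMBRE_DIM),
        "assistant_male": torch.randn(TIMBRE_DIM),
    }

    def get_tone_feature(speaker: str) -> torch.Tensor:
        """Return the preset tone (timbre) feature to be fed into the synthesis model."""
        return timbre_table[speaker]

    tone_feature = get_tone_feature("narrator_female")  # shape: (256,)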
S120, converting the text to be processed into a text character set represented by unified characters according to the mapping relationship between the fonts corresponding to the preset speech text and the unified characters.
In this step, a mapping relationship between the fonts corresponding to speech text and unified characters is established in advance; the font corresponding to each character and/or word in the text to be processed is then obtained, and the text to be processed is converted into a text character set represented by unified characters based on the pre-established mapping relationship between fonts and unified characters.
The speech text can be obtained from audio data, and the obtained speech text may likewise be text in a single language or text in mixed languages. Because the languages of speech texts differ, the corresponding fonts also differ, so a mapping relationship between fonts and unified characters needs to be established. A unified character can be represented by a phonetic symbol, where the phonetic symbol is an International Phonetic Alphabet symbol that can scientifically and accurately record and distinguish speech sounds.
Furthermore, in order to unify the input representation of texts in different languages, the embodiment of the present application may perform text processing on texts in different languages and adopt a text expression in which texts in all languages use unified characters; this mainly involves processing special characters such as digits in the texts and converting the texts into a unified-character representation. In this way, no matter how many languages the input text contains, it can be represented by unified characters, and the parameters of the speech synthesis model do not need special handling during synthesis, which facilitates speech synthesis for mixed-language text.
Here, when special characters such as digits are processed, regular expressions can be used to handle them.
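For example, a sketch of such digit handling with a regular expression, assuming (purely for illustration) that every run of digits is spelled out digit by digit before the unified-character conversion; the rule and the digit-to-word table are assumptions, not rules given in the application:

    import re

    # Illustrative digit-to-word table; a real system would be language- and context-aware.
    DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
                   "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

    def normalize_digits(text: str) -> str:
        """Replace each run of digits with its spelled-out form, digit by digit."""
        return re.sub(r"\d+", lambda m: " ".join(DIGIT_WORDS[d] for d in m.group()), text)

    print(normalize_digits("room 42"))  # -> "room four two"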
S130, extracting fundamental frequency information features characterizing each word and/or phrase from the text character set.
In this step, the fundamental frequency information feature of each word and/or phrase may be extracted from the text character set represented by unified characters, where the fundamental frequency information feature is expressed as a binarized feature of fundamental-frequency variation information.
Here, the extracted feature is a binarized feature of the fundamental-frequency variation information associated with a unified character (phonetic symbol); it is generally perceived as the tone of a sound, and this tone is also referred to as a pitch variation feature.
In this way, the fundamental frequency information feature characterizing each character and/or word in the text character set can be obtained, and the pitch variation feature corresponding to the fundamental frequency information feature can then be added to the speech synthesis process, which improves the synthesis effect, especially for low-resource (minority) languages.
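A rough sketch of what such a binarized fundamental-frequency variation feature could look like, assuming one F0 value per unified character (phonetic symbol) and a simple rise/no-rise rule; the specific binarization rule is an assumption, since the application only states that the feature is a binarized form of F0 variation:

    import numpy as np

    def f0_variation_feature(f0_per_symbol: np.ndarray) -> np.ndarray:
        """Binarize F0 variation: 1 where F0 rises relative to the previous symbol, else 0."""
        delta = np.diff(f0_per_symbol, prepend=f0_per_symbol[0])
        return (delta > 0).astype(np.int64)

    f0 = np.array([120.0, 135.0, 130.0, 150.0])  # toy F0 values per phonetic symbol, in Hz
    print(f0_variation_feature(f0))              # -> [0 1 0 1]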
S140, inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features into a trained voice synthesis model to obtain synthesized voice.
In this step, the inputs of the voice synthesis model are the text character set, the fundamental frequency information features, and the tone features, and the output is the synthesized voice; thus the text to be processed can be converted into the corresponding synthesized voice based on the pre-trained voice synthesis model.
The purpose is to generate synthesized speech whose timbre and semantic content correspond to the given text to be processed and the given tone features. Specifically, text processing is performed first: the text to be processed is converted into a text character set represented by unified characters according to the preset mapping relationship between the fonts corresponding to voice text and unified characters. Feature extraction is performed next: fundamental frequency information features characterizing each character and/or word are extracted from the text character set. Finally, the converted text character set, the extracted fundamental frequency information features, and the tone features are input into the trained speech synthesis model together, and the corresponding synthesized speech is obtained.
Therefore, in the process of synthesizing speech with the speech synthesis model, texts in different languages are converted into a unified-character representation, so complete synthesized speech corresponding to texts in different languages can be obtained; the processing time of synthesis is shortened, working efficiency is improved, the continuity and integrity of the synthesized speech are guaranteed, and the service experience of voice interaction is greatly improved. In addition, because texts in different languages are expressed with unified characters, the synthesized speech can be obtained directly from the unified characters, which improves both the accuracy and the generality of speech synthesis.
The speech synthesis model is trained before it is used for synthesis, so that an optimal speech synthesis model is obtained. In this way, when speech is synthesized with the speech synthesis model, high-quality synthesized speech can be obtained.
In summary, the speech synthesis process in the embodiment of the present application is as follows: texts in different languages are represented as a text character set of unified characters, the pitch variation features expressed by the text character set are obtained, the required speaker timbre features are set, and the corresponding speech is generated by the speech synthesis model.
With the voice synthesis method, a text to be processed and the tone features of the voice to be synthesized are acquired; the text to be processed is converted into a text character set represented by unified characters according to a preset mapping relationship between the fonts corresponding to voice text and unified characters; fundamental frequency information features characterizing each character and/or word are extracted from the text character set; and the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features are input into a trained voice synthesis model to obtain the synthesized voice. Therefore, in the process of synthesizing voice with the voice synthesis model, tone, fundamental frequency information, and other features are added to the text represented by unified characters, so that the synthesized voice is more lifelike, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
In the embodiment of the present application, as a preferred embodiment, the step S120 includes: determining a font corresponding to the text to be processed and a plurality of characters and/or words in the text to be processed; determining a phonetic symbol corresponding to each character and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol; and determining a text character set represented by the phonetic symbol based on the phonetic symbol corresponding to each character and/or word and the position of each character and/or word in the text to be processed.
In this step, the text to be processed is composed of a plurality of characters and/or words. The font corresponding to the text to be processed is obtained, that is, the font corresponding to each character and/or word; the phonetic symbol corresponding to each character and/or word in the text to be processed is obtained based on the mapping relationship between the fonts corresponding to voice text and phonetic symbols; and the text character set represented by phonetic symbols is then obtained based on the position of each character and/or word in the text to be processed.
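A minimal sketch of this dictionary-based conversion, with a tiny hand-written font-to-phonetic-symbol mapping; the entries are simplified placeholders rather than exact IPA, and a real mapping would cover the full vocabulary of the speech text:

    # Illustrative mapping from fonts (character forms) to phonetic symbols.
    FONT_TO_SYMBOL = {
        "你": "ni",
        "好": "hau",
        "hello": "heloU",
    }

    def to_text_character_set(tokens: list[str]) -> list[str]:
        """Convert characters and/or words into a phonetic-symbol sequence, preserving
        the order given by each token's position in the text to be processed."""
        return [FONT_TO_SYMBOL[t] for t in tokens]

    print(to_text_character_set(["你", "好", "hello"]))  # -> ['ni', 'hau', 'heloU']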
In the embodiment of the present application, as a preferred embodiment, the step S120 includes: determining a font corresponding to each character and/or word in the text to be processed; inputting the font corresponding to each character and/or word into a trained text processing model to obtain a phonetic symbol corresponding to each character and/or word; and determining a text character set represented by phonetic symbols based on the position of each character and/or word in the text to be processed.
In this step, a text processing model is used to convert the text into unified characters: the input can be the font corresponding to each character and/or word in the text to be processed, and the output can be the phonetic symbol corresponding to each character and/or word.
For example, the text processing model may be a Long Short-Term Memory artificial neural network (LSTM), and the LSTM model may be used to convert texts in different languages into a text character set represented by phonetic symbols. Specifically, in applying the LSTM model, the input is a glyph corresponding to each word and/or phrase, and the output is a phonetic transcription corresponding to each word and/or phrase.
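A minimal sketch of such an LSTM-based text processing model, assuming one input ID per font (character form) and one output ID per phonetic symbol, aligned one to one; the vocabulary sizes and layer widths are illustrative assumptions:

    import torch
    import torch.nn as nn

    class G2PSketch(nn.Module):
        """Toy LSTM mapping font (character-form) IDs to phonetic-symbol IDs."""
        def __init__(self, n_fonts: int = 50, n_symbols: int = 40, d: int = 64):
            super().__init__()
            self.emb = nn.Embedding(n_fonts, d)
            self.lstm = nn.LSTM(d, d, batch_first=True)
            self.proj = nn.Linear(d, n_symbols)

        def forward(self, font_ids: torch.Tensor) -> torch.Tensor:
            h, _ = self.lstm(self.emb(font_ids))
            return self.proj(h)  # (batch, seq_len, n_symbols): a score per phonetic symbol

    model = G2PSketch()
    scores = model(torch.randint(0, 50, (2, 7)))  # two toy sequences of 7 characters each
    predicted_symbols = scores.argmax(dim=-1)     # phonetic-symbol IDs, shape (2, 7)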
In the embodiment of the present application, as a preferred embodiment, the step S140 includes: inputting the text character set and the fundamental frequency information features into a feedforward neural network of a trained speech synthesis model for linear processing to obtain first linear features corresponding to the fundamental frequency information features and second linear features corresponding to the text character set; inputting the feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result; and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain a synthesized voice corresponding to the text to be processed.
In this step, the text character set, the fundamental frequency information features, and the tone features are input into a trained voice synthesis model, and the synthesized voice corresponding to the text to be processed is obtained through the speech synthesis processing of the model.

Specifically, the fundamental frequency information features and the text character set each undergo one linear transformation through the feedforward neural network, yielding the first linear features corresponding to the fundamental frequency information features and the second linear features corresponding to the text character set. The two features are spliced together to obtain the integrated feature result, which is passed through the multi-layer convolutional neural network to obtain the first output result. The first output result is then input into the attention model to obtain the second output result. The second output result from the attention model and the first output result from the multi-layer convolutional neural network are input into a recurrent neural network for information fusion, and finally linear fitting is applied to the output fusion result to obtain the synthesized speech corresponding to the text to be processed.
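A rough sketch of the data flow just described, assuming mel-spectrogram frames as the acoustic output; the layer sizes, the number of attention heads, and the use of a GRU for the recurrent fusion are illustrative assumptions, not details given in the application:

    import torch
    import torch.nn as nn

    class SynthesisSketch(nn.Module):
        """Sketch of the described synthesis network; all dimensions are assumptions."""
        def __init__(self, n_chars: int = 100, d: int = 256, n_mels: int = 80):
            super().__init__()
            self.char_emb = nn.Embedding(n_chars, d)
            self.ff_char = nn.Linear(d, d)   # produces the "second linear features"
            self.ff_f0 = nn.Linear(1, d)     # produces the "first linear features"
            self.convs = nn.Sequential(      # multi-layer convolutional neural network
                nn.Conv1d(2 * d, d, kernel_size=5, padding=2), nn.ReLU(),
                nn.Conv1d(d, d, kernel_size=5, padding=2), nn.ReLU(),
            )
            self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
            self.fusion = nn.GRU(3 * d, d, batch_first=True)  # fuse conv + attention + tone
            self.out = nn.Linear(d, n_mels)  # final linear fitting

        def forward(self, char_ids, f0_bin, tone):
            # char_ids: (B, T) unified-character IDs; f0_bin: (B, T, 1); tone: (B, d)
            first_lin = self.ff_f0(f0_bin)                        # first linear features
            second_lin = self.ff_char(self.char_emb(char_ids))    # second linear features
            x = torch.cat([first_lin, second_lin], dim=-1)        # integrated feature result
            first_out = self.convs(x.transpose(1, 2)).transpose(1, 2)   # first output result
            second_out, _ = self.attn(first_out, first_out, first_out)  # second output result
            t = tone.unsqueeze(1).expand(-1, first_out.size(1), -1)
            fused, _ = self.fusion(torch.cat([first_out, second_out, t], dim=-1))
            return self.out(fused)                                # predicted mel frames

    model = SynthesisSketch()
    mel = model(torch.randint(0, 100, (2, 17)),       # unified-character IDs
                (torch.rand(2, 17, 1) > 0.5).float(),  # binarized F0 feature
                torch.randn(2, 256))                   # preset tone (timbre) feature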
Further, the speech synthesis method provided by the embodiment of the present application trains the speech synthesis model by the following steps:
a plurality of voice samples, sample text corresponding to each voice sample, and a tone sample feature corresponding to each voice sample are obtained.
In this step, the speech samples come from a corpus and public databases. A sample text is extracted from each speech sample; the sample text is used as the input sample and the speech sample as the output sample. If, after the sample text is input into the speech synthesis model, the output synthesized speech is similar to the speech sample, training of the speech synthesis model is determined to be complete, and the speech synthesis model can then be applied to perform speech synthesis.
For the sample text corresponding to each voice sample, each sample text is converted into a sample character set represented by unified characters according to the preset mapping relationship between the fonts corresponding to the voice text and the unified characters.
In this step, before each sample text is converted into a sample character set represented by unified characters, a text processing model is trained, and the mapping relationship between fonts and unified characters is determined according to the trained text processing model.
Specifically, the text processing model is trained by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample; determining the font and phonetic symbol corresponding to each character and/or word in the sample text; inputting the font and phonetic symbol corresponding to each character and/or word into the constructed recurrent neural network model for training, and adjusting the parameters of the recurrent neural network model to obtain the trained text processing model.
In this step, a recurrent neural network model is first constructed. The font corresponding to each word and/or phrase is used as the input of the recurrent neural network model, and the phonetic symbol corresponding to each word and/or phrase is used as the output; the recurrent neural network model is trained on this input and output, and its parameters are adjusted continuously until the phonetic symbols the model produces from the fonts are essentially consistent with the phonetic symbols corresponding to each word and/or phrase in the sample text, at which point training of the text processing model is determined to be complete.
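A compact training-loop sketch for the text processing model, using toy random data in place of a real corpus; the stand-in model below has the same shape as the LSTM sketch above, and the sizes, loss choice, and optimizer settings are assumptions:

    import torch
    import torch.nn as nn

    class G2PStandIn(nn.Module):
        # Stand-in for the text processing model (same shape as the earlier LSTM sketch).
        def __init__(self, n_fonts: int = 50, n_symbols: int = 40, d: int = 64):
            super().__init__()
            self.emb = nn.Embedding(n_fonts, d)
            self.rnn = nn.LSTM(d, d, batch_first=True)
            self.proj = nn.Linear(d, n_symbols)

        def forward(self, x):
            h, _ = self.rnn(self.emb(x))
            return self.proj(h)

    model = G2PStandIn()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Toy corpus: font-ID sequences paired with phonetic-symbol-ID sequences of equal length.
    fonts = torch.randint(0, 50, (32, 8))
    symbols = torch.randint(0, 40, (32, 8))

    for epoch in range(5):                       # adjust the parameters repeatedly
        scores = model(fonts)                    # (32, 8, 40)
        loss = loss_fn(scores.reshape(-1, 40), symbols.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()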
Fundamental frequency information sample features representing each word and/or phrase are extracted from each sample character set.
In this step, before the fundamental frequency information sample features are extracted from the sample character sets, a feature extraction model is trained, and the fundamental frequency information sample features characterizing each word and/or phrase are extracted from each sample character set according to the trained feature extraction model.
Here, the corpus corresponds to a feature extraction model: the feature extraction model is trained directly on the speech samples contained in the corpus, and the trained feature extraction model is connected to the text processing model, so that the output of the text processing model can be used as the input of the feature extraction model.
Furthermore, the text processing model extracts vector representations of the phonetic symbols; that is, the phonetic symbols are represented with randomly initialized vectors whose values are continuously updated during training, so that the final vectors implicitly represent the pronunciation characteristics of the corresponding phonetic symbols. The features obtained by the feature extraction model are the binarized features of the fundamental-frequency variation information associated with the phonetic symbols.
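For instance, such randomly initialized, trainable vectors for the phonetic symbols can be realized as an embedding table whose entries are updated by back-propagation during training; the sizes below are assumptions:

    import torch.nn as nn

    # One 64-dimensional trainable vector per phonetic symbol, initialized randomly and
    # updated during training so that it comes to encode that symbol's pronunciation traits.
    phonetic_symbol_vectors = nn.Embedding(num_embeddings=40, embedding_dim=64)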
And training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model.
The sample character set, the fundamental frequency information sample features corresponding to the sample character set, and the tone sample features are input into the constructed neural network model for training, and the parameters of the neural network model are adjusted continuously until the synthesized speech output by the neural network model is consistent with, or highly similar to, the speech sample corresponding to the sample character set; at that point, training of the neural network model is determined to be complete, and the trained speech synthesis model is obtained.
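A sketch of this training loop, reusing the SynthesisSketch class from the earlier forward-pass sketch and toy random tensors in place of real speech samples; using a mel-spectrogram target, L1 loss, and the Adam optimizer are assumptions made purely for illustration:

    import torch

    model = SynthesisSketch()                    # the sketch class defined earlier
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Toy batch standing in for (sample character set, F0 sample features, tone sample
    # features, speech sample); the "speech sample" target here is a random mel-spectrogram.
    char_ids = torch.randint(0, 100, (4, 20))
    f0_bin = (torch.rand(4, 20, 1) > 0.5).float()
    tone = torch.randn(4, 256)
    target_mel = torch.randn(4, 20, 80)

    for step in range(100):                      # adjust parameters until the output is close
        pred = model(char_ids, f0_bin, tone)
        loss = torch.nn.functional.l1_loss(pred, target_mel)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()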
Therefore, mixed-language speech data from the same speaker is not needed when the speech synthesis model is built; that is, if a multilingual speech synthesis model such as Chinese-Japanese-English is required, only single-language speech data from different speakers needs to be recorded. Because texts in different languages are expressed as unified characters, the pronunciation representations of different languages overlap to a certain extent, and adding the new pitch variation feature extends the applicable range of the speech synthesis model, so a speech synthesis model suitable for many languages, including both common and low-resource (minority) languages, can be built. Furthermore, the speech synthesis model can be applied to any language, and can also support speech synthesis for a custom language through the pronunciation rules defined for custom text.
Specifically, training the constructed neural network model based on the sample character set, the fundamental frequency information sample features, the tone sample features, and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model includes:
inputting the sample character set and the fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
inputting a prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multilayer convolution neural network of the constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
and inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Here, the neural network model includes a plurality of parts such as a feedforward neural network, a multilayer convolutional neural network, and an attention model, and when the neural network model is trained, parameters of the neural network model need to be continuously adjusted according to a final output result of the neural network model and output results of the parts until a trained speech synthesis model is obtained.
Furthermore, because the same text-processing procedure is applied to the text data during model training, multi-speaker, multi-language mixed speech synthesis can be achieved simply by recording single-language speech data from different speakers, which improves the efficiency and generality of mixed-language speech synthesis. With this method, a multilingual mixed speech synthesis model can be built by collecting single-language speech data from multiple speakers, so that a high-quality, general-purpose speech synthesis model is constructed efficiently; mixed-language speech data from the same speaker is not needed, the pronunciation quality of low-resource (minority) languages can be improved, and the synthesis of multilingual mixed speech is supported.
With the voice synthesis method, a text to be processed and the tone features of the voice to be synthesized are acquired; the text to be processed is converted into a text character set represented by unified characters according to a preset mapping relationship between the fonts corresponding to voice text and unified characters; fundamental frequency information features characterizing each character and/or word are extracted from the text character set; and the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features are input into a trained voice synthesis model to obtain the synthesized voice. Therefore, in the process of synthesizing voice with the voice synthesis model, tone, fundamental frequency information, and other features are added to the text represented by unified characters, so that the synthesized voice is more lifelike, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
Based on the same inventive concept, a speech synthesis apparatus corresponding to the speech synthesis method is also provided in the embodiments of the present application, and because the principle of the apparatus in the embodiments of the present application for solving the problem is similar to the speech synthesis method described above in the embodiments of the present application, the implementation of the apparatus may refer to the implementation of the method, and repeated details are not described again.
Referring to fig. 2 and fig. 3, fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application, and fig. 3 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. As shown in fig. 2, the speech synthesis apparatus 200 includes:
an obtaining module 210, configured to obtain a text to be processed and a tone characteristic of a speech to be synthesized;
the conversion module 220 is configured to convert the text to be processed into a text character set represented by unified characters according to a mapping relationship between a font corresponding to a preset voice text and the unified characters;
a feature extraction module 230, configured to extract fundamental frequency information features representing each word and/or phrase from the text character set;
and the speech synthesis module 240 is configured to input the text character set corresponding to the text to be processed, the extracted fundamental frequency information features, and the tone features into a trained speech synthesis model to obtain a synthesized speech.
Preferably, when the conversion module 220 is configured to convert the text to be processed into a text character set represented by unified characters according to a mapping relationship between a preset font corresponding to the voice text and the unified characters, the conversion module 220 is configured to:
determining a font corresponding to the text to be processed and a plurality of characters and/or words in the text to be processed;
determining a phonetic symbol corresponding to each character and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol;
and determining a text character set represented by the phonetic symbol based on the phonetic symbol corresponding to each character and/or word and the position of each character and/or word in the text to be processed.
Preferably, when the conversion module 220 is configured to convert the text to be processed into a text character set represented by unified characters according to a mapping relationship between a preset font corresponding to the voice text and the unified characters, the conversion module 220 is configured to:
determining a font corresponding to each character and/or word in the text to be processed;
inputting the font corresponding to each character and/or word into a trained text processing model to obtain a phonetic symbol corresponding to each character and/or word;
and determining a text character set represented by phonetic symbols based on the position of each character and/or word in the text to be processed.
Preferably, when the speech synthesis module 240 is configured to input the text character set, the extracted fundamental frequency information features, and the tone features corresponding to the text to be processed into a trained speech synthesis model to obtain a synthesized speech, the speech synthesis module 240 is configured to:
inputting the text character set and the fundamental frequency information features into a feedforward neural network of a trained speech synthesis model for linear processing to obtain first linear features corresponding to the fundamental frequency information features and second linear features corresponding to the text character set;
inputting the feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain a synthesized voice corresponding to the text to be processed.
Further, as shown in fig. 3, the speech synthesis apparatus 200 further comprises a synthesis model training module 250, wherein the synthesis model training module 250 is configured to train a speech synthesis model by:
acquiring a plurality of voice samples, a sample text corresponding to each voice sample and a tone sample characteristic corresponding to each voice sample;
for the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to a preset mapping relationship between a font corresponding to the voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each word and/or phrase from each sample character set;
and training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model.
Preferably, when the synthesis model training module 250 is configured to train the constructed neural network model based on the sample character set, the fundamental frequency information sample feature, the tone sample feature, and the voice sample corresponding to the sample character set, and obtain a trained voice synthesis model, the synthesis model training module 250 is configured to:
inputting the sample character set and the fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
inputting a prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multilayer convolution neural network of the constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
and inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Preferably, the speech synthesis apparatus 200 further comprises a processing model training module 260, wherein the processing model training module 260 is configured to train the text processing model by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
determining the font and phonetic symbol corresponding to each character and/or word in the sample text;
inputting the font and phonetic symbol corresponding to each character and/or word into the constructed recurrent neural network model for training, and adjusting the parameters of the recurrent neural network model to obtain the trained text processing model.
The voice synthesis device provided by the embodiment of the application comprises an acquisition module, a conversion module, a feature extraction module and a voice synthesis module. The acquisition module is used for acquiring a text to be processed and the tone features of the voice to be synthesized; the conversion module is used for converting the text to be processed into a text character set represented by unified characters according to a preset mapping relation between glyphs corresponding to the voice text and the unified characters; the feature extraction module is used for extracting fundamental frequency information features representing each character and/or word from the text character set; and the voice synthesis module is used for inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone features into a trained voice synthesis model to obtain synthesized voice. Therefore, in the process of synthesizing voice with the voice synthesis model, features such as the tone and fundamental frequency information are added to the text represented by unified characters, so that the synthesized voice is more vivid, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
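Purely as an illustration of how the four modules could be chained, with the acquisition module represented by the method arguments and all interfaces assumed rather than taken from the patent:

```python
class VoiceSynthesisDevice:
    """Illustrative module wiring; each callable stands in for one module."""

    def __init__(self, conversion_module, feature_extraction_module, synthesis_model):
        self.conversion_module = conversion_module                  # glyphs -> unified characters
        self.feature_extraction_module = feature_extraction_module  # fundamental frequency features
        self.synthesis_model = synthesis_model                      # trained voice synthesis model

    def synthesize(self, text_to_process, tone_feature):
        char_set = self.conversion_module(text_to_process)
        f0_features = self.feature_extraction_module(char_set)
        return self.synthesis_model(char_set, f0_features, tone_feature)
```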
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.
The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 runs, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the speech synthesis method in the method embodiment shown in fig. 1 may be performed.
An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the speech synthesis method in the method embodiment shown in fig. 1 may be executed.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above-described embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions of some of their technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application and shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A speech synthesis method, characterized in that the speech synthesis method comprises:
acquiring a text to be processed and the tone features of a voice to be synthesized;
converting the text to be processed into a text character set represented by unified characters according to a mapping relation between glyphs corresponding to a preset voice text and the unified characters;
extracting fundamental frequency information features representing each character and/or word from the text character set;
and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone features into a trained speech synthesis model to obtain synthesized speech.
2. The method according to claim 1, wherein the converting the text to be processed into a text character set represented by unified characters according to a mapping relation between glyphs corresponding to a preset voice text and the unified characters comprises:
determining a glyph corresponding to the text to be processed and a plurality of characters and/or words in the text to be processed;
determining a phonetic symbol corresponding to each character and/or word in the text to be processed according to a mapping relation between a glyph corresponding to a preset voice text and the phonetic symbol;
and determining a text character set represented by phonetic symbols based on the phonetic symbol corresponding to each character and/or word and the position of each character and/or word in the text to be processed.
3. The method according to claim 1, wherein the converting the text to be processed into a text character set represented by unified characters according to a mapping relation between glyphs corresponding to a preset voice text and the unified characters comprises:
determining a glyph corresponding to each character and/or word in the text to be processed;
inputting the glyph corresponding to each character and/or word into a trained text processing model to obtain a phonetic symbol corresponding to each character and/or word;
and determining a text character set represented by phonetic symbols based on the position of each character and/or word in the text to be processed.
4. The method according to claim 1, wherein the inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone features into a trained speech synthesis model to obtain synthesized speech comprises:
inputting the text character set and the fundamental frequency information features into a feedforward neural network of a trained speech synthesis model for linear processing to obtain first linear features corresponding to the fundamental frequency information features and second linear features corresponding to the text character set;
inputting a feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of the trained speech synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained speech synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone features into the trained speech synthesis model for information fusion to obtain the synthesized speech corresponding to the text to be processed.
5. The speech synthesis method of claim 1, wherein the speech synthesis model is trained by:
acquiring a plurality of voice samples, a sample text corresponding to each voice sample and a tone sample characteristic corresponding to each voice sample;
for the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to a preset mapping relation between glyphs corresponding to the voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each character and/or word from each sample character set;
and training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model.
6. The method according to claim 5, wherein the training the constructed neural network model based on the sample character set, the fundamental frequency information sample feature, the tone sample feature and the speech sample corresponding to the sample character set to obtain the trained speech synthesis model comprises:
inputting the sample character set and the fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
inputting a prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multi-layer convolutional neural network of the constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
and inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
7. A speech synthesis method according to claim 3, characterized in that the text processing model is trained by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
determining the glyph and the phonetic symbol corresponding to each character and/or word in the sample text;
inputting the glyph and the phonetic symbol corresponding to each character and/or word into the constructed recurrent neural network model for training, and adjusting the parameters of the recurrent neural network model to obtain the trained text processing model.
8. A speech synthesis apparatus, characterized in that the speech synthesis apparatus comprises:
the acquisition module is used for acquiring the text to be processed and the tone features of the voice to be synthesized;
the conversion module is used for converting the text to be processed into a text character set represented by unified characters according to a mapping relation between glyphs corresponding to a preset voice text and the unified characters;
the feature extraction module is used for extracting fundamental frequency information features representing each character and/or word from the text character set;
and the voice synthesis module is used for inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone features into a trained voice synthesis model to obtain synthesized voice.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the speech synthesis method according to one of claims 1 to 7.
CN202011128996.6A 2020-10-20 2020-10-20 Speech synthesis method, device, electronic equipment and readable storage medium Active CN112270917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011128996.6A CN112270917B (en) 2020-10-20 2020-10-20 Speech synthesis method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN112270917A true CN112270917A (en) 2021-01-26
CN112270917B CN112270917B (en) 2024-06-04

Family

ID=74342321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011128996.6A Active CN112270917B (en) 2020-10-20 2020-10-20 Speech synthesis method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112270917B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1801321A (en) * 2005-01-06 2006-07-12 台达电子工业股份有限公司 System and method for text-to-speech
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Text voice broadcasting method and system
CN109697974A (en) * 2017-10-19 2019-04-30 百度(美国)有限责任公司 Use the system and method for the neural text-to-speech that convolution sequence learns
CN110136692A (en) * 2019-04-30 2019-08-16 北京小米移动软件有限公司 Phoneme synthesizing method, device, equipment and storage medium
KR20190135853A (en) * 2018-05-29 2019-12-09 한국과학기술원 Method and system of text to multiple speech
US20200066253A1 (en) * 2017-10-19 2020-02-27 Baidu Usa Llc Parallel neural text-to-speech
CN111164674A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Speech synthesis method, device, terminal and storage medium
CN111179904A (en) * 2019-12-31 2020-05-19 出门问问信息科技有限公司 Mixed text-to-speech conversion method and device, terminal and computer readable storage medium
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111564152A (en) * 2020-07-16 2020-08-21 北京声智科技有限公司 Voice conversion method and device, electronic equipment and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668704A (en) * 2021-03-16 2021-04-16 北京世纪好未来教育科技有限公司 Training method and device of audio recognition model and audio recognition method and device
CN113539239A (en) * 2021-07-12 2021-10-22 网易(杭州)网络有限公司 Voice conversion method, device, storage medium and electronic equipment
CN113539239B (en) * 2021-07-12 2024-05-28 网易(杭州)网络有限公司 Voice conversion method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN112270917B (en) 2024-06-04

Similar Documents

Publication Publication Date Title
CN113439301B (en) Method and system for machine learning
CN111667814B (en) Multilingual speech synthesis method and device
CN110032742B (en) Response sentence generating apparatus, method and storage medium, and voice interaction system
JP4762103B2 (en) Prosodic statistical model training method and apparatus, and prosodic analysis method and apparatus
CN110459202B (en) Rhythm labeling method, device, equipment and medium
KR100659212B1 (en) Language learning system and voice data providing method for language learning
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN112257407B (en) Text alignment method and device in audio, electronic equipment and readable storage medium
CN112270917B (en) Speech synthesis method, device, electronic equipment and readable storage medium
JP2019109278A (en) Speech synthesis system, statistic model generation device, speech synthesis device, and speech synthesis method
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN112927677B (en) Speech synthesis method and device
CN113409761B (en) Speech synthesis method, speech synthesis device, electronic device, and computer-readable storage medium
CN114708848A (en) Method and device for acquiring size of audio and video file
US9570067B2 (en) Text-to-speech system, text-to-speech method, and computer program product for synthesis modification based upon peculiar expressions
CN110310620B (en) Speech fusion method based on native pronunciation reinforcement learning
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN114446304A (en) Voice interaction method, data processing method and device and electronic equipment
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language
CN114420086B (en) Speech synthesis method and device
Saeki et al. Text-Inductive Graphone-Based Language Adaptation for Low-Resource Speech Synthesis
Panda et al. A Context-based Numeral Reading Technique for Text to Speech Systems.
CN117953854B (en) Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium
WO2023047623A1 (en) Information processing device, information processing method, and information processing program
JP2023006055A (en) Program, information processing device, and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant