CN112270917B - Speech synthesis method, device, electronic equipment and readable storage medium

Info

Publication number
CN112270917B
CN112270917B
Authority
CN
China
Prior art keywords
text
voice
sample
word
character set
Prior art date
Legal status
Active
Application number
CN202011128996.6A
Other languages
Chinese (zh)
Other versions
CN112270917A (en)
Inventor
詹皓粤
林悦
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202011128996.6A
Publication of CN112270917A
Application granted
Publication of CN112270917B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 - Road transport of goods or passengers
    • Y02T 10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The application provides a speech synthesis method, an apparatus, an electronic device and a readable storage medium. The speech synthesis method comprises: first acquiring a text to be processed and a tone feature of the speech to be synthesized; then converting the text to be processed into a text character set represented by unified characters according to a preset mapping relation between the fonts of speech text and the unified characters; extracting, from the text character set, fundamental frequency information features characterizing each word and/or phrase; and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone feature into a trained speech synthesis model to obtain synthesized speech. Because tone, fundamental frequency information and other features are added to the text represented by unified characters while the speech synthesis model synthesizes the speech, the synthesized speech is more vivid and fitting, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.

Description

Speech synthesis method, device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, a device, an electronic apparatus, and a readable storage medium.
Background
In recent years, voice interaction has emerged as a new mode of interaction, bringing not only a brand-new user experience but also a wide range of new product designs and application scenarios. Speech synthesis is the technology of converting text into sound. Mixed-language speech synthesis refers to the case where multiple languages appear in the text to be synthesized, and the text comprising those languages is converted into the corresponding sounds.
In the prior art, a mixed-language speech synthesis model is generally used to convert text in different languages into the corresponding speech. However, such a model does not take the pronunciation characteristics of the different languages into account when synthesizing speech, so the synthesized speech differs considerably from real speech, which degrades the service experience of voice interaction.
Disclosure of Invention
In view of the above, an object of the present application is to provide a speech synthesis method, an apparatus, an electronic device and a readable storage medium that add tone, fundamental frequency information and other features to the text represented by unified characters while the speech synthesis model synthesizes speech, so that the synthesized speech is more vivid and fitting, thereby improving the accuracy of speech synthesis and greatly improving the service experience of voice interaction.
In a first aspect, an embodiment of the present application provides a speech synthesis method, where the speech synthesis method includes:
Acquiring a text to be processed and tone characteristics of a voice to be synthesized;
Converting the text to be processed into a text character set represented by unified characters according to a mapping relation between a font corresponding to a preset voice text and the unified characters;
Extracting fundamental frequency information features representing each word and/or phrase from the text character set;
And inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristic and the tone characteristic into a trained voice synthesis model to obtain synthesized voice.
Preferably, the converting the text to be processed into a text character set represented by unicode according to a mapping relationship between a font corresponding to a preset voice text and unicode includes:
Determining a font corresponding to the text to be processed and a plurality of words and/or words in the text to be processed;
Determining a phonetic symbol corresponding to each word and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol;
And determining a text character set represented by the phonetic symbols based on the phonetic symbols corresponding to each word and/or word and the position of each word and/or word in the text to be processed.
Preferably, the converting the text to be processed into a text character set represented by unicode according to a mapping relationship between a font corresponding to a preset voice text and unicode includes:
Determining each word and/or a font corresponding to the word in the text to be processed;
Inputting the fonts corresponding to each word and/or word into a trained text processing model to obtain phonetic symbols corresponding to each word and/or word;
A set of text characters represented in phonetic symbols is determined based on the location of each word and/or word in the text to be processed.
Preferably, the inputting of the text character set corresponding to the text to be processed, the extracted fundamental frequency information feature and the tone feature into a trained voice synthesis model to obtain synthesized voice includes:
inputting the text character set and the fundamental frequency information characteristic into a feedforward neural network of a trained voice synthesis model for linear processing to obtain a first linear characteristic corresponding to the fundamental frequency information characteristic and a second linear characteristic corresponding to the text character set;
Inputting the characteristic results obtained by integrating the first linear characteristic and the second linear characteristic into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain the synthesized voice corresponding to the text to be processed.
Preferably, the speech synthesis model is trained by:
acquiring a plurality of voice samples, sample texts corresponding to each voice sample and tone sample characteristics corresponding to each voice sample;
Aiming at the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to the mapping relation between the font corresponding to the preset voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each word and/or word from each sample character set;
And training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model.
Preferably, the training the constructed neural network model based on the sample character set, the fundamental frequency information sample feature, the tone sample feature and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model includes:
Inputting the sample character set and fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
Inputting the prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multi-layer convolution neural network of a constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Preferably, the text processing model is trained by:
Acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
Determining the corresponding font and phonetic symbol of each word and/or word in the sample text;
Inputting the fonts and phonetic symbols corresponding to each word and/or word into the constructed cyclic neural network model for training, and adjusting the parameters of the cyclic neural network model to obtain a trained text processing model.
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
the acquisition module is used for acquiring the text to be processed and tone characteristics of the voice to be synthesized;
The conversion module is used for converting the text to be processed into a text character set represented by unified characters according to the mapping relation between the font corresponding to the preset voice text and the unified characters;
the feature extraction module is used for extracting fundamental frequency information features representing each word and/or phrase from the text character set;
And the voice synthesis module is used for inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristic and the tone characteristic into a trained voice synthesis model to obtain synthesized voice.
Preferably, when converting the text to be processed into a text character set represented by unified characters according to the mapping relation between the font corresponding to a preset voice text and the unified characters, the conversion module is configured to:
Determining a font corresponding to the text to be processed and a plurality of words and/or words in the text to be processed;
Determining a phonetic symbol corresponding to each word and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol;
And determining a text character set represented by the phonetic symbols based on the phonetic symbols corresponding to each word and/or word and the position of each word and/or word in the text to be processed.
Preferably, when converting the text to be processed into a text character set represented by unified characters according to the mapping relation between the font corresponding to a preset voice text and the unified characters, the conversion module is configured to:
Determining each word and/or a font corresponding to the word in the text to be processed;
Inputting the fonts corresponding to each word and/or word into a trained text processing model to obtain phonetic symbols corresponding to each word and/or word;
A set of text characters represented in phonetic symbols is determined based on the location of each word and/or word in the text to be processed.
Preferably, when inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information feature and the tone feature into a trained voice synthesis model to obtain synthesized voice, the voice synthesis module is configured to:
inputting the text character set and the fundamental frequency information characteristic into a feedforward neural network of a trained voice synthesis model for linear processing to obtain a first linear characteristic corresponding to the fundamental frequency information characteristic and a second linear characteristic corresponding to the text character set;
Inputting the characteristic results obtained by integrating the first linear characteristic and the second linear characteristic into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain the synthesized voice corresponding to the text to be processed.
Preferably, the speech synthesis apparatus further comprises a synthesis model training module for training a speech synthesis model by:
acquiring a plurality of voice samples, sample texts corresponding to each voice sample and tone sample characteristics corresponding to each voice sample;
Aiming at the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to the mapping relation between the font corresponding to the preset voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each word and/or word from each sample character set;
And training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model.
Preferably, when training the constructed neural network model based on the sample character set, the fundamental frequency information sample features, the tone sample features and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model, the synthesis model training module is configured to:
Inputting the sample character set and fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
Inputting the prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multi-layer convolution neural network of a constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Preferably, the speech synthesis apparatus further comprises a process model training module for training a text process model by:
Acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
Determining the corresponding font and phonetic symbol of each word and/or word in the sample text;
Inputting the fonts and phonetic symbols corresponding to each word and/or word into the constructed cyclic neural network model for training, and adjusting the parameters of the cyclic neural network model to obtain a trained text processing model.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the speech synthesis method as described in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the speech synthesis method according to the first aspect.
The embodiment of the application provides a voice synthesis method, a device, electronic equipment and a readable storage medium, wherein the voice synthesis method comprises the following steps: firstly, acquiring a text to be processed and tone characteristics of a voice to be synthesized, then converting the text to be processed into a text character set represented by unified characters according to a mapping relation between fonts and unified characters corresponding to a preset voice text, extracting fundamental frequency information characteristics representing each word and/or word from the text character set, and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristics and tone characteristics into a trained voice synthesis model to obtain the synthesized voice. In this way, in the process of synthesizing the voice by the voice synthesis model, the voice synthesis model has the characteristics of voice color, fundamental frequency information and the like added into the text represented by the unicode, so that the synthesized voice is more vivid and relevant, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 3 is a second schematic diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment obtained by a person skilled in the art without making any inventive effort falls within the scope of protection of the present application.
First, an application scenario to which the present application is applicable will be described. The application can be applied to the technical field of speech synthesis. First, a text to be processed and a tone feature of the speech to be synthesized are acquired; the text to be processed is then converted into a text character set represented by unified characters according to the mapping relation between the fonts corresponding to a preset voice text and the unified characters; fundamental frequency information features characterizing each word and/or phrase are then extracted from the text character set; and the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and the tone feature are input into a trained speech synthesis model to obtain the synthesized speech. Because tone, fundamental frequency information and other features are added to the text represented by unified characters while the speech synthesis model synthesizes the speech, the synthesized speech is more vivid and fitting, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.
In the prior art, a mixed-language speech synthesis model is generally used to convert text in different languages into the corresponding speech; however, such a model does not take the pronunciation characteristics of the different languages into account when synthesizing speech, so the synthesized speech differs considerably from real speech, which degrades the service experience of voice interaction. On this basis, the embodiments of the present application provide a speech synthesis method, an apparatus, an electronic device and a readable storage medium that add tone, fundamental frequency information and other features to the text represented by unified characters during speech synthesis, so that the synthesized speech is more vivid and fitting, the accuracy of speech synthesis is improved, and the service experience of voice interaction is greatly improved.
Referring to fig. 1, fig. 1 is a flowchart of a speech synthesis method according to an embodiment of the application. As shown in fig. 1, the speech synthesis method provided by the embodiment of the present application includes:
s110, acquiring the text to be processed and the tone characteristics of the voice to be synthesized.
In this step, the text to be processed may be text in a single language or text in mixed languages; the tone feature of the speech to be synthesized can be added during the synthesis of the text to be processed according to the actual requirement of the user.
Here, the required tone characteristic of the speaker can be set in advance, and in the process of synthesizing the voice from the text to be processed, the tone characteristic set in advance can be added into the voice synthesis process, so that the voice with the pronunciation tone corresponding to the tone characteristic can be obtained.
In this way, different tone characteristics are set according to different application scenes, so that service experience of voice interaction can be improved.
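As a purely illustrative sketch (the speaker ids and vector values below are invented, not taken from the patent), the pre-set tone feature can be pictured as a vector looked up from a table keyed by the application scenario:

```python
# Illustrative only: the patent merely states that a required speaker tone
# feature is set in advance; ids and values here are assumptions.
TONE_FEATURES = {
    "customer_service_female": [0.12, -0.33, 0.58, 0.04],
    "game_npc_male":           [-0.41, 0.27, -0.09, 0.66],
}

def get_tone_feature(speaker_id: str):
    """Return the pre-set tone (timbre) feature vector for the chosen scenario."""
    return TONE_FEATURES[speaker_id]

tone = get_tone_feature("game_npc_male")   # later fed into the speech synthesis model
```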
S120, converting the text to be processed into a text character set expressed by unified characters according to a mapping relation between the preset font corresponding to the voice text and the unified characters.
In the step, a mapping relation between a font corresponding to a voice text and a unified character is established in advance, then the font corresponding to each word and/or word in the text to be processed is obtained, and the text to be processed is converted into a text character set represented by the unified character based on the mapping relation between the font and the unified character which are established in advance.
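A minimal sketch of this lookup-based conversion, under the assumption that the preset mapping relation behaves like a dictionary from each character (font) to a phonetic-symbol string; the table entries and the fall-back rule are illustrative, not the patent's actual mapping:

```python
# Hypothetical grapheme-to-phonetic lookup table; entries are examples only.
GRAPHEME_TO_PHONETIC = {
    "你": "ni3",
    "好": "hao3",
    "h": "h", "i": "i",          # Latin letters pass through (assumption)
}

def to_unified_characters(text: str):
    """Map each character to its unified (phonetic) symbol, keeping text order."""
    chars = []
    for position, ch in enumerate(text):
        symbol = GRAPHEME_TO_PHONETIC.get(ch, ch)   # fall back to the raw character
        chars.append((position, symbol))
    # The resulting text character set preserves each word's position in the text.
    return [symbol for _, symbol in sorted(chars)]

print(to_unified_characters("你好hi"))   # ['ni3', 'hao3', 'h', 'i']
```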
The voice text can be obtained from audio data, and may likewise be a single-language text or a mixed-language text. Because the sounds of different voice texts differ, the corresponding fonts also differ, so a mapping relation between fonts and unified characters needs to be established. The unified characters can be represented by phonetic symbols, i.e. the International Phonetic Alphabet, which records and distinguishes speech sounds in a more scientific and accurate way.
Furthermore, in order to unify the input representation of texts in different languages, the embodiment of the application performs text processing on texts in different languages and adopts a unified-character representation for all of them; in particular, special characters such as numerals in the texts are processed, and the texts in different languages are converted into the unified-character representation. In this way, the input text can be represented by unified characters no matter how many languages it contains, and no additional handling of the speech synthesis model parameters is needed during speech synthesis, which facilitates speech synthesis of mixed-language text.
Here, special characters such as numerals in the text may be processed using regular expressions.
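A hedged example of such regular-expression handling; the exact normalization rules are not given in the patent, so the digit spelling and symbol stripping below are assumptions:

```python
import re

# Assumed normalization: spell out digits and drop other special symbols so the
# later grapheme-to-phonetic mapping only sees ordinary words and characters.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_special_chars(text: str) -> str:
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[m.group(0)] + " ", text)
    text = re.sub(r"[^\w\s]", " ", text)      # strip punctuation and other symbols
    return re.sub(r"\s+", " ", text).strip()  # collapse the leftover whitespace

print(normalize_special_chars("房间号是203, OK?"))   # 房间号是 two zero three OK
```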
S130, extracting fundamental frequency information features representing each word and/or word from the text character set.
In this step, the fundamental frequency information feature of each word and/or phrase may be extracted from the text character set represented by unicode, where the fundamental frequency information feature is represented as a binarized feature of fundamental frequency variation information.
Here, the extracted feature is a binarized feature of fundamental frequency variation information associated with the unified characters (phonetic symbols); the fundamental frequency is generally perceived as the tone of a sound, and this tone is also referred to as the pitch variation feature.
Therefore, the fundamental frequency information characteristic of each word and/or word in the text character set can be obtained, and then the pitch change characteristic corresponding to the fundamental frequency information characteristic is added in the voice synthesis process, so that the voice synthesis effect can be improved and improved, and particularly, the synthesis effect of small-language voices can be improved.
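As one possible illustration (the patent does not disclose the exact encoding), a binarized pitch-change feature per token could be derived from tone marks carried by the phonetic symbols, for example Mandarin tone digits:

```python
# Assumption for illustration: tokens end in a Mandarin tone digit (e.g. "ma3"),
# which is turned into a small binary (rise, fall) indicator vector.
TONE_TO_BITS = {
    "1": (0, 0),   # level tone: no rise, no fall
    "2": (1, 0),   # rising tone
    "3": (1, 1),   # dipping tone: fall then rise
    "4": (0, 1),   # falling tone
    "5": (0, 0),   # neutral tone
}

def pitch_change_features(phonetic_tokens):
    """Binarized fundamental-frequency-change feature per token; untoned tokens get zeros."""
    return [TONE_TO_BITS.get(tok[-1], (0, 0)) for tok in phonetic_tokens]

print(pitch_change_features(["ni3", "hao3", "hello"]))   # [(1, 1), (1, 1), (0, 0)]
```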
S140, inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information features and tone features into a trained voice synthesis model to obtain synthesized voice.
In the step, the input of the voice synthesis model is a text character set, the fundamental frequency information characteristic and the tone characteristic, the output is the synthesized voice, and then the text to be processed can be converted into the relevant synthesized voice based on the pre-trained voice synthesis model.
The embodiment of the application generates synthesized speech matching the given tone and semantic content from the given text to be processed and tone feature. Specifically, the text to be processed is first processed and converted into a text character set represented by unified characters according to the mapping relation between the fonts corresponding to a preset voice text and the unified characters; features are then extracted, namely the fundamental frequency information features characterizing each word and/or phrase are extracted from the text character set; finally, the converted text character set, the extracted fundamental frequency information features and the tone feature are input together into a trained speech synthesis model to obtain the corresponding synthesized speech.
In this way, in the process of synthesizing speech with the speech synthesis model, texts in different languages are converted into a unified-character representation, so complete synthesized speech corresponding to texts in different languages can be obtained; the processing time of the speech synthesis model is reduced, the working efficiency is improved, the consistency and integrity of the synthesized speech are guaranteed, and the service experience of voice interaction is greatly improved. In addition, because texts in different languages are converted into the unified-character representation, the synthesized speech can be obtained directly from the unified characters, which improves both the accuracy and the universality of speech synthesis.
Before the speech synthesis model is used for speech synthesis, the model is trained to obtain an optimal speech synthesis model. In this way, higher-quality synthesized speech can be obtained when the speech synthesis model is used.
In summary, the speech synthesis flow in the embodiment of the application is as follows: and representing texts in different languages as a text character set of unified characters, acquiring pitch variation characteristics represented by the text character set, setting required speaker tone characteristics, and generating corresponding voices through the voice synthesis model.
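To make the flow above concrete, a hypothetical wrapper can chain the helper sketches shown earlier in this description; the trained synthesis model itself is sketched further below and is therefore not called here:

```python
def prepare_synthesis_inputs(text: str, speaker_id: str):
    """Assemble the three model inputs of Fig. 1 using the illustrative helpers above."""
    text = normalize_special_chars(text)         # handle numerals and special characters
    symbols = to_unified_characters(text)        # unified-character text character set
    f0_feats = pitch_change_features(symbols)    # binarized pitch variation features
    tone = get_tone_feature(speaker_id)          # pre-set speaker tone feature
    return symbols, f0_feats, tone               # fed to the trained synthesis model
```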
The voice synthesis method provided by the embodiment of the application comprises the steps of firstly obtaining a text to be processed and tone characteristics of voice to be synthesized, then converting the text to be processed into a text character set represented by unified characters according to a mapping relation between fonts and unified characters corresponding to a preset voice text, extracting fundamental frequency information characteristics representing each word and/or word from the text character set, and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristics and tone characteristics into a trained voice synthesis model to obtain the synthesized voice. In this way, in the process of synthesizing the voice by the voice synthesis model, the voice synthesis model has the characteristics of voice color, fundamental frequency information and the like added into the text represented by the unicode, so that the synthesized voice is more vivid and relevant, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
In the embodiment of the present application, as a preferred embodiment, step S120 includes: determining a font corresponding to the text to be processed and a plurality of words and/or words in the text to be processed; determining a phonetic symbol corresponding to each word and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol; and determining a text character set represented by the phonetic symbols based on the phonetic symbols corresponding to each word and/or word and the position of each word and/or word in the text to be processed.
In the step, the text to be processed consists of a plurality of words and/or words, the font corresponding to the text to be processed is obtained, the font corresponding to each word and/or word can be obtained, the phonetic symbols corresponding to each word and/or word in the text to be processed are obtained based on the mapping relation between the font corresponding to the voice text and the phonetic symbols, and then the text character set represented by the phonetic symbols is obtained based on the position of each word and/or word in the text to be processed.
In the embodiment of the present application, as a preferred embodiment, step S120 includes: determining each word and/or a font corresponding to the word in the text to be processed; inputting the fonts corresponding to each word and/or word into a trained text processing model to obtain phonetic symbols corresponding to each word and/or word; a set of text characters represented in phonetic symbols is determined based on the location of each word and/or word in the text to be processed.
In the step, a text processing model is adopted to convert text unicode, the input can be a font corresponding to each word and/or word in the text to be processed, and the output can be a phonetic symbol corresponding to each word and/or word.
For example, the text processing model may be a Long Short-Term Memory artificial neural network (LSTM), with which text in different languages may be converted into a set of text characters represented in phonetic symbols. Specifically, when the LSTM model is applied, the input is a font corresponding to each word and/or word, and the output is a phonetic symbol corresponding to each word and/or word.
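A hedged PyTorch sketch of such an LSTM text processing model: character (font) ids go in, one phonetic-symbol id per character comes out. Vocabulary sizes, dimensions and the bidirectional layout are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class GraphemeToPhoneticLSTM(nn.Module):
    """Per-character grapheme-to-phonetic-symbol tagger (illustrative sizes)."""
    def __init__(self, n_graphemes=8000, n_phonetic=200, emb=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_graphemes, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonetic)

    def forward(self, char_ids):              # (batch, seq_len) character ids
        x = self.embed(char_ids)              # (batch, seq_len, emb)
        h, _ = self.lstm(x)                   # (batch, seq_len, 2 * hidden)
        return self.out(h)                    # phonetic-symbol logits per character

model = GraphemeToPhoneticLSTM()
logits = model(torch.randint(0, 8000, (2, 16)))   # two texts of 16 characters
phonetic_ids = logits.argmax(-1)                  # predicted phonetic symbols
```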
In the embodiment of the present application, as a preferred embodiment, step S140 includes: inputting the text character set and the fundamental frequency information characteristic into a feedforward neural network of a trained voice synthesis model for linear processing to obtain a first linear characteristic corresponding to the fundamental frequency information characteristic and a second linear characteristic corresponding to the text character set; inputting the characteristic results obtained by integrating the first linear characteristic and the second linear characteristic into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result; and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain the synthesized voice corresponding to the text to be processed.
In the step, the text character set, the fundamental frequency information characteristic and the tone characteristic are input into a trained voice synthesis model, and the synthesized voice corresponding to the text to be processed is obtained through the voice synthesis processing of the voice synthesis model.
Specifically, the fundamental frequency information feature and the text character set are each passed once through a feedforward neural network for a linear transformation, yielding a first linear feature corresponding to the fundamental frequency information feature and a second linear feature corresponding to the text character set. The two features are spliced together to obtain an integrated feature result, which is passed through a multi-layer convolutional neural network to obtain a first output result; the first output result is then fed into an attention model to obtain a second output result. The second output result from the attention model and the first output result from the multi-layer convolutional neural network are input into a recurrent neural network for information fusion, and a final linear fit of the fused information yields the synthesized speech corresponding to the text to be processed.
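Purely as an illustration, the forward pass just described might look roughly like the following PyTorch sketch; the layer widths, the attention variant and the mel-spectrogram output are assumptions rather than details disclosed by the patent:

```python
import torch
import torch.nn as nn

class SynthesisSketch(nn.Module):
    def __init__(self, n_symbols=200, d=256, f0_dim=2, tone_dim=64, n_mels=80):
        super().__init__()
        self.symbol_embed = nn.Embedding(n_symbols, d)
        self.text_ff = nn.Linear(d, d)          # linear pass over the text character set
        self.f0_ff = nn.Linear(f0_dim, d)       # linear pass over the F0 features
        self.convs = nn.Sequential(             # multi-layer convolutional stack
            nn.Conv1d(2 * d, d, 5, padding=2), nn.ReLU(),
            nn.Conv1d(d, d, 5, padding=2), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.rnn = nn.GRU(2 * d + tone_dim, d, batch_first=True)   # information fusion
        self.proj = nn.Linear(d, n_mels)        # final linear fit to acoustic frames

    def forward(self, symbol_ids, f0_feats, tone):
        text = self.text_ff(self.symbol_embed(symbol_ids))       # second linear feature
        f0 = self.f0_ff(f0_feats)                                 # first linear feature
        merged = torch.cat([text, f0], dim=-1).transpose(1, 2)    # splice the two features
        first = self.convs(merged).transpose(1, 2)                # first output result
        second, _ = self.attn(first, first, first)                # second output result
        tone_seq = tone.unsqueeze(1).expand(-1, first.size(1), -1)
        fused, _ = self.rnn(torch.cat([first, second, tone_seq], dim=-1))
        return self.proj(fused)                                   # synthesized speech frames

model = SynthesisSketch()
frames = model(torch.randint(0, 200, (1, 16)),   # unified-character ids
               torch.zeros(1, 16, 2),            # binarized F0 features
               torch.zeros(1, 64))               # tone (timbre) feature vector
```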
Further, the voice synthesis method provided by the embodiment of the application trains a voice synthesis model through the following steps:
A plurality of voice samples, sample text corresponding to each voice sample, and tone sample features corresponding to each voice sample are acquired.
In this step, the voice samples come from a corpus and from public databases. A sample text is extracted from each voice sample; the sample text serves as the input sample and the voice sample as the output sample. If, after the sample text is input into the speech synthesis model, the output synthesized speech is similar to the voice sample, the training of the speech synthesis model is considered complete, and the model can then be applied for speech synthesis.
And aiming at the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by the unified character according to the mapping relation between the font corresponding to the preset voice text and the unified character.
In the step, training of a text processing model is carried out before each sample text is converted into a sample character set represented by unified characters, and the mapping relation between the fonts and the unified characters is determined according to the trained text processing model.
Specifically, a text processing model is trained by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample; determining the corresponding font and phonetic symbol of each word and/or word in the sample text; inputting the fonts and phonetic symbols corresponding to each word and/or word into the constructed cyclic neural network model for training, and adjusting the parameters of the cyclic neural network model to obtain a trained text processing model.
In the step, a cyclic neural network model is firstly constructed, fonts corresponding to each word and/or word are used as input of the cyclic neural network model, phonetic symbols corresponding to each word and/or word are used as output of the cyclic neural network model, the cyclic neural network model is trained based on the input and the output, parameters of the cyclic neural network model are continuously adjusted until phonetic symbols obtained by the cyclic neural network model based on the fonts are basically consistent with phonetic symbols corresponding to each word and/or word in a sample text, and training of the text processing model is determined to be completed.
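A hedged sketch of that training loop, reusing the GraphemeToPhoneticLSTM sketch shown earlier; the optimizer, loss and dummy data are illustrative assumptions:

```python
import torch
import torch.nn as nn

model = GraphemeToPhoneticLSTM()                 # the LSTM sketch from above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

char_ids = torch.randint(0, 8000, (8, 16))       # fonts of the sample-text characters (dummy)
phonetic_ids = torch.randint(0, 200, (8, 16))    # their labelled phonetic symbols (dummy)

for step in range(100):
    logits = model(char_ids)                     # (batch, seq, n_phonetic)
    loss = loss_fn(logits.reshape(-1, 200), phonetic_ids.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                             # adjust the recurrent network parameters
```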
And extracting fundamental frequency information sample characteristics representing each word and/or word from each sample character set.
In the step, before extracting fundamental frequency information sample characteristics from sample character sets, training a characteristic extraction model, and determining to extract fundamental frequency information sample characteristics representing each word and/or phrase from each sample character set according to the trained characteristic extraction model.
Here, the corpus corresponds to a feature extraction model, the feature extraction model is directly trained through a voice sample contained in the corpus, and the trained feature extraction model is connected with the text processing model, so that the output of the text processing model can be used as the input of the feature extraction model.
Furthermore, the text processing model extracts vector representation features of the phonetic symbols, namely, the phonetic symbols are represented by using some randomly initialized vectors, the data of the vectors are continuously updated in the training process, and the finally obtained vectors potentially represent the pronunciation features of the corresponding phonetic symbols; the feature extracted from the feature extraction model is a binarized feature of fundamental frequency variation information related to phonetic symbols.
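The randomly initialised, trainable phonetic-symbol vectors mentioned above behave like an ordinary embedding table, for example (sizes assumed):

```python
import torch.nn as nn

# Rows start as random values and are updated by gradient descent during training,
# so each row gradually comes to encode the pronunciation of one phonetic symbol.
phonetic_embedding = nn.Embedding(num_embeddings=200, embedding_dim=128)
```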
And training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model.
Here, the sample character set, the fundamental frequency information sample characteristic corresponding to the sample character set and the tone sample characteristic are input into the constructed neural network model for training, and parameters of the neural network model are continuously adjusted until the synthesized voice output by the neural network model is consistent or highly similar to the voice sample corresponding to the sample character set, and then the neural network model can be determined to be trained, so that a trained voice synthesis model is obtained.
Thus, mixed-language voice data of the same speaker is not needed when constructing the speech synthesis model; that is, if a multi-language speech synthesis model covering, for example, Chinese and English is to be built, only single-language voice data of different speakers needs to be recorded. By representing texts in different languages as unified characters, the pronunciation representations of different languages overlap to a certain extent, and adding the new pitch variation feature widens the applicability of the speech synthesis model, so that a speech synthesis model suitable for both common languages and minority languages can be built. Furthermore, the speech synthesis model can be applied to any language, and can also support speech synthesis of a custom language through a custom pronunciation scheme for the text.
Specifically, the training the constructed neural network model based on the sample character set, the fundamental frequency information sample feature, the tone sample feature and the voice sample corresponding to the sample character set to obtain a trained voice synthesis model includes:
Inputting the sample character set and fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
Inputting the prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multi-layer convolution neural network of a constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
The neural network model comprises a feedforward neural network, a multilayer convolution neural network, an attention model and the like, and when the neural network model is trained, parameters of the neural network model need to be continuously adjusted according to the final output result of the neural network model and the output result of each part until a trained voice synthesis model is obtained.
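An illustrative training loop for the SynthesisSketch model shown earlier; the L1 loss on acoustic frames and the Adam optimizer are assumptions, since the patent only states that the parameters are adjusted until the output matches the voice samples:

```python
import torch

synth = SynthesisSketch()                          # the forward-pass sketch from above
optimizer = torch.optim.Adam(synth.parameters(), lr=1e-3)

symbol_ids = torch.randint(0, 200, (4, 16))        # sample character sets (dummy)
f0_feats = torch.zeros(4, 16, 2)                   # fundamental-frequency sample features
tone = torch.randn(4, 64)                          # tone sample features
target = torch.randn(4, 16, 80)                    # acoustic frames of the voice samples

for step in range(100):
    pred = synth(symbol_ids, f0_feats, tone)
    loss = torch.nn.functional.l1_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                               # adjust the neural network parameters
```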
Furthermore, because the same text-processing procedure is applied to the text data during model training, multi-speaker, multi-language mixed speech can be synthesized simply by recording single-language voice data of different speakers, which improves the efficiency of mixed-language speech synthesis and gives the approach strong universality. With this method, a single multilingual mixed speech synthesis model can be built by collecting single-language voice data from several speakers, so a high-quality and broadly applicable speech synthesis model can be constructed efficiently without mixed-language voice data of the same speaker; the pronunciation of minority languages is also improved, and the synthesis of multilingual mixed speech is supported.
The voice synthesis method provided by the embodiment of the application comprises the steps of firstly obtaining a text to be processed and tone characteristics of voice to be synthesized, then converting the text to be processed into a text character set represented by unified characters according to a mapping relation between fonts and unified characters corresponding to a preset voice text, extracting fundamental frequency information characteristics representing each word and/or word from the text character set, and inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristics and tone characteristics into a trained voice synthesis model to obtain the synthesized voice. In this way, in the process of synthesizing the voice by the voice synthesis model, the voice synthesis model has the characteristics of voice color, fundamental frequency information and the like added into the text represented by the unicode, so that the synthesized voice is more vivid and relevant, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
Based on the same inventive concept, the embodiment of the present application further provides a voice synthesis device corresponding to the voice synthesis method, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the foregoing voice synthesis method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.
Referring to fig. 2 and 3, fig. 2 is a first schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application, and fig. 3 is a second schematic structural diagram of the speech synthesis apparatus. As shown in fig. 2, the speech synthesis apparatus 200 includes:
an obtaining module 210, configured to obtain a text to be processed and a tone characteristic of a speech to be synthesized;
The conversion module 220 is configured to convert the text to be processed into a text character set represented by unicode according to a mapping relationship between a font corresponding to a preset voice text and unicode;
A feature extraction module 230, configured to extract, from the text character set, a fundamental frequency information feature representing each word and/or phrase;
The speech synthesis module 240 is configured to input the text character set corresponding to the text to be processed, the extracted fundamental frequency information feature and the tone feature into a trained speech synthesis model, so as to obtain a synthesized speech.
Preferably, when converting the text to be processed into a text character set represented by unified characters according to the mapping relation between the font corresponding to a preset voice text and the unified characters, the conversion module 220 is configured to:
Determining a font corresponding to the text to be processed and a plurality of words and/or words in the text to be processed;
Determining a phonetic symbol corresponding to each word and/or word in the text to be processed according to a mapping relation between a font corresponding to a preset voice text and the phonetic symbol;
And determining a text character set represented by the phonetic symbols based on the phonetic symbols corresponding to each word and/or word and the position of each word and/or word in the text to be processed.
Preferably, when converting the text to be processed into a text character set represented by unified characters according to the mapping relation between the font corresponding to a preset voice text and the unified characters, the conversion module 220 is configured to:
Determining each word and/or a font corresponding to the word in the text to be processed;
Inputting the fonts corresponding to each word and/or word into a trained text processing model to obtain phonetic symbols corresponding to each word and/or word;
A set of text characters represented in phonetic symbols is determined based on the location of each word and/or word in the text to be processed.
Preferably, when inputting the text character set corresponding to the text to be processed, the extracted fundamental frequency information feature and the tone feature into a trained speech synthesis model to obtain synthesized speech, the speech synthesis module 240 is configured to:
inputting the text character set and the fundamental frequency information characteristic into a feedforward neural network of a trained voice synthesis model for linear processing to obtain a first linear characteristic corresponding to the fundamental frequency information characteristic and a second linear characteristic corresponding to the text character set;
Inputting the characteristic results obtained by integrating the first linear characteristic and the second linear characteristic into a multi-layer convolutional neural network of a trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
and inputting the first output result, the second output result and the tone characteristic into a trained voice synthesis model for information fusion to obtain the synthesized voice corresponding to the text to be processed.
Further, as shown in fig. 3, the speech synthesis apparatus 200 further comprises a synthesis model training module 250, the synthesis model training module 250 being configured to train a speech synthesis model by:
acquiring a plurality of voice samples, sample texts corresponding to each voice sample and tone sample characteristics corresponding to each voice sample;
Aiming at the sample text corresponding to each voice sample, converting each sample text into a sample character set represented by unified characters according to the mapping relation between the font corresponding to the preset voice text and the unified characters;
extracting fundamental frequency information sample characteristics representing each word and/or word from each sample character set;
And training the constructed neural network model based on the sample character set, the fundamental frequency information sample characteristics, the tone sample characteristics and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model.
Preferably, when training the constructed neural network model based on the sample character set, the fundamental frequency information sample features, the tone sample features and the voice samples corresponding to the sample character set to obtain a trained voice synthesis model, the synthesis model training module 250 is configured to:
Inputting the sample character set and fundamental frequency information sample characteristics corresponding to the sample character set into a feedforward neural network of a constructed neural network model to obtain first prediction linear characteristics corresponding to the fundamental frequency information sample characteristics and second prediction linear characteristics corresponding to the sample character set;
Inputting the prediction characteristic result obtained by integrating the first prediction linear characteristic and the second prediction linear characteristic into a multi-layer convolution neural network of a constructed neural network model to obtain a first prediction output result, and inputting the first prediction output result into an attention model of the constructed neural network model to obtain a second prediction output result;
inputting the first prediction output result, the second prediction output result, the tone sample characteristics and the voice sample into a constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
Preferably, the speech synthesis apparatus 200 further comprises a process model training module 260, the process model training module 260 being configured to train a text process model by:
Acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
Determining the corresponding font and phonetic symbol of each word and/or word in the sample text;
Inputting the fonts and phonetic symbols corresponding to each word and/or word into the constructed cyclic neural network model for training, and adjusting the parameters of the cyclic neural network model to obtain a trained text processing model.
The voice synthesis device provided by the embodiment of the application comprises an acquisition module, a conversion module, a feature extraction module and a voice synthesis module, wherein the acquisition module is used for acquiring a text to be processed and tone features of the voice to be synthesized; the conversion module is used for converting the text to be processed into a text character set represented by unified characters according to the mapping relation between the font corresponding to the preset voice text and the unified characters; the feature extraction module is used for extracting fundamental frequency information features representing each word and/or word from the text character set; the voice synthesis module is used for inputting a text character set corresponding to the text to be processed, the extracted fundamental frequency information characteristics and tone characteristics into the trained voice synthesis model to obtain synthesized voice. In this way, in the process of synthesizing the voice by the voice synthesis model, the voice synthesis model has the characteristics of voice color, fundamental frequency information and the like added into the text represented by the unicode, so that the synthesized voice is more vivid and relevant, the accuracy of voice synthesis is improved, and the service experience of voice interaction is greatly improved.
Referring to fig. 4, fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the application. As shown in fig. 4, the electronic device 400 includes a processor 410, a memory 420, and a bus 430.
The memory 420 stores machine-readable instructions executable by the processor 410. When the electronic device 400 is running, the processor 410 communicates with the memory 420 through the bus 430, and when the machine-readable instructions are executed by the processor 410, the steps of the speech synthesis method in the method embodiment shown in fig. 1 can be performed. For the specific implementation, reference may be made to the method embodiment, which is not repeated herein.
The embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the speech synthesis method in the method embodiment shown in fig. 1 can be performed. For the specific implementation, reference may be made to the method embodiment, which is not repeated herein.
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical functional division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between components may be indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, which are used to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some of the technical features within the technical scope disclosed by the present application; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (9)

1. A method of speech synthesis, the method comprising:
acquiring a text to be processed and a tone feature of a voice to be synthesized, wherein the text to be processed is a text in a single language or a text in mixed languages, and the tone feature of the voice to be synthesized is a tone feature of a preset speaker;
converting the text to be processed into a text character set represented by unified characters according to a preset mapping relation between fonts corresponding to voice texts of various languages and the unified characters;
extracting fundamental frequency information features representing each character and/or word from the text character set;
inputting the text character set and the fundamental frequency information features into a feedforward neural network of a trained voice synthesis model for linear processing to obtain a first linear feature corresponding to the fundamental frequency information features and a second linear feature corresponding to the text character set;
inputting a feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of the trained voice synthesis model to obtain a first output result, and inputting the first output result into an attention model of the trained voice synthesis model to obtain a second output result;
inputting the first output result, the second output result and the tone feature into the trained voice synthesis model for information fusion to obtain a synthesized voice corresponding to the text to be processed, wherein the voice synthesis model is constructed from single-language voice data of a plurality of speakers, and the synthesized voice is a single-language voice or a multi-language mixed voice.
2. The speech synthesis method according to claim 1, wherein the converting the text to be processed into a text character set represented by unified characters according to a preset mapping relation between fonts corresponding to voice texts of various languages and the unified characters comprises:
determining a font corresponding to the text to be processed and a plurality of characters and/or words in the text to be processed;
determining a phonetic symbol corresponding to each character and/or word in the text to be processed according to a preset mapping relation between fonts corresponding to voice texts and phonetic symbols;
and determining a text character set represented by phonetic symbols based on the phonetic symbol corresponding to each character and/or word and the position of each character and/or word in the text to be processed.
3. The speech synthesis method according to claim 1, wherein the converting the text to be processed into a text character set represented by unified characters according to a preset mapping relation between fonts corresponding to voice texts of various languages and the unified characters comprises:
determining a font corresponding to each character and/or word in the text to be processed;
inputting the font corresponding to each character and/or word into a trained text processing model to obtain a phonetic symbol corresponding to each character and/or word;
and determining a text character set represented by phonetic symbols based on the position of each character and/or word in the text to be processed.
4. The speech synthesis method according to claim 1, wherein the speech synthesis model is trained by:
acquiring a plurality of voice samples, a sample text corresponding to each voice sample and a tone sample feature corresponding to each voice sample;
for the sample text corresponding to each voice sample, converting the sample text into a sample character set represented by unified characters according to the preset mapping relation between fonts corresponding to voice texts and the unified characters;
extracting fundamental frequency information sample features representing each character and/or word from each sample character set;
and training the constructed neural network model based on the sample character set and the fundamental frequency information sample features, tone sample features and voice samples corresponding to the sample character set to obtain a trained voice synthesis model.
5. The speech synthesis method according to claim 4, wherein the training the constructed neural network model based on the sample character set and the fundamental frequency information sample features, tone sample features and voice samples corresponding to the sample character set to obtain a trained voice synthesis model comprises:
inputting the sample character set and the fundamental frequency information sample features corresponding to the sample character set into a feedforward neural network of the constructed neural network model to obtain first predicted linear features corresponding to the fundamental frequency information sample features and second predicted linear features corresponding to the sample character set;
inputting a predicted feature result obtained by integrating the first predicted linear features and the second predicted linear features into a multi-layer convolutional neural network of the constructed neural network model to obtain a first predicted output result, and inputting the first predicted output result into an attention model of the constructed neural network model to obtain a second predicted output result;
and inputting the first predicted output result, the second predicted output result, the tone sample features and the voice samples into the constructed neural network model for training, and adjusting parameters of the neural network model to obtain a trained voice synthesis model.
6. The speech synthesis method according to claim 3, wherein the text processing model is trained by:
acquiring a plurality of voice samples in a corpus, and determining a sample text corresponding to each voice sample;
determining a font and a phonetic symbol corresponding to each character and/or word in the sample text;
and inputting the font and the phonetic symbol corresponding to each character and/or word into a constructed recurrent neural network model for training, and adjusting parameters of the recurrent neural network model to obtain a trained text processing model.
7. A speech synthesis apparatus, characterized in that the speech synthesis apparatus comprises:
the acquisition module is configured to acquire a text to be processed and a tone feature of a voice to be synthesized, wherein the text to be processed is a text in a single language or a text in mixed languages, and the tone feature of the voice to be synthesized is a tone feature of a preset speaker;
the conversion module is configured to convert the text to be processed into a text character set represented by unified characters according to a preset mapping relation between fonts corresponding to voice texts of various languages and the unified characters;
the feature extraction module is configured to extract fundamental frequency information features representing each character and/or word from the text character set;
the voice synthesis module is configured to input the text character set and the fundamental frequency information features into a feedforward neural network of a trained voice synthesis model for linear processing to obtain a first linear feature corresponding to the fundamental frequency information features and a second linear feature corresponding to the text character set; input a feature result obtained by integrating the first linear feature and the second linear feature into a multi-layer convolutional neural network of the trained voice synthesis model to obtain a first output result, and input the first output result into an attention model of the trained voice synthesis model to obtain a second output result; and input the first output result, the second output result and the tone feature into the trained voice synthesis model for information fusion to obtain a synthesized voice corresponding to the text to be processed, wherein the voice synthesis model is constructed from single-language voice data of a plurality of speakers, and the synthesized voice is a single-language voice or a multi-language mixed voice.
8. An electronic device, comprising: a processor, a memory and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor communicates with the memory through the bus, and the processor executes the machine-readable instructions to perform the steps of the speech synthesis method according to any one of claims 1 to 6.
9. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, performs the steps of the speech synthesis method according to any one of claims 1 to 6.
CN202011128996.6A 2020-10-20 2020-10-20 Speech synthesis method, device, electronic equipment and readable storage medium Active CN112270917B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011128996.6A CN112270917B (en) 2020-10-20 2020-10-20 Speech synthesis method, device, electronic equipment and readable storage medium


Publications (2)

Publication Number Publication Date
CN112270917A CN112270917A (en) 2021-01-26
CN112270917B (en) 2024-06-04

Family

ID=74342321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011128996.6A Active CN112270917B (en) 2020-10-20 2020-10-20 Speech synthesis method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112270917B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668704B (en) * 2021-03-16 2021-06-29 北京世纪好未来教育科技有限公司 Training method and device of audio recognition model and audio recognition method and device
CN113539239B (en) * 2021-07-12 2024-05-28 网易(杭州)网络有限公司 Voice conversion method and device, storage medium and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1801321A (en) * 2005-01-06 2006-07-12 台达电子工业股份有限公司 System and method for text-to-speech
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text
CN109697974A (en) * 2017-10-19 2019-04-30 百度(美国)有限责任公司 Use the system and method for the neural text-to-speech that convolution sequence learns
CN110136692A (en) * 2019-04-30 2019-08-16 北京小米移动软件有限公司 Phoneme synthesizing method, device, equipment and storage medium
KR20190135853A (en) * 2018-05-29 2019-12-09 한국과학기술원 Method and system of text to multiple speech
CN111164674A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Speech synthesis method, device, terminal and storage medium
CN111179904A (en) * 2019-12-31 2020-05-19 出门问问信息科技有限公司 Mixed text-to-speech conversion method and device, terminal and computer readable storage medium
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111564152A (en) * 2020-07-16 2020-08-21 北京声智科技有限公司 Voice conversion method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11017761B2 (en) * 2017-10-19 2021-05-25 Baidu Usa Llc Parallel neural text-to-speech



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant