CN114267330A - Speech synthesis method, speech synthesis device, electronic equipment and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Info

Publication number
CN114267330A
CN114267330A · Application CN202111659164.1A
Authority
CN
China
Prior art keywords
text
discourse
features
voice
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111659164.1A
Other languages
Chinese (zh)
Inventor
刘丹 (Liu Dan)
伍芸荻 (Wu Yundi)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202111659164.1A priority Critical patent/CN114267330A/en
Publication of CN114267330A publication Critical patent/CN114267330A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium. The method comprises: determining a discourse phoneme sequence of a discourse text to be synthesized; encoding the discourse phoneme sequence to obtain phonetic features of the discourse text; and performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text. By encoding the discourse phoneme sequence of the discourse text, the method, apparatus, electronic device and storage medium obtain phonetic features that model the discourse text as a whole, so that the synthesized speech remains coherent in prosody, emotion and other aspects of language sense, and its naturalness is improved.

Description

Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Text To Speech (TTS) is a technique for converting text into speech. Existing deep-learning-based speech synthesis methods fall mainly into two categories: autoregressive and non-autoregressive.
Both categories perform well when synthesizing a single sentence. For a discourse text containing multiple sentences, however, they must synthesize each sentence independently and splice the results into one piece of speech, so prosodic and emotional incoherence between adjacent sentences easily occurs, degrading the user experience.
Disclosure of Invention
The invention provides a speech synthesis method, a speech synthesis apparatus, an electronic device and a storage medium, which are used to solve the problem in the prior art that speech synthesized from a discourse text is incoherent.
The invention provides a speech synthesis method, comprising the following steps:
determining a discourse phoneme sequence of a discourse text to be synthesized;
encoding the discourse phoneme sequence to obtain phonetic features of the discourse text;
and performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
According to the speech synthesis method provided by the invention, performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text comprises:
performing speech synthesis based on the phonetic features and the language-sense features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
According to the speech synthesis method provided by the invention, the language-sense features of each clause in the discourse text are determined based on the following steps:
performing language-sense extraction on each clause in the discourse text, based on the sample language-sense features of each clause in a sample discourse text, to obtain the language-sense features of each clause in the discourse text;
wherein the sample language-sense features are obtained by performing language-sense feature extraction on the real speech corresponding to the sample discourse text.
According to the speech synthesis method provided by the present invention, performing language-sense extraction on each clause in the discourse text based on the sample language-sense features of each clause in the sample discourse text to obtain the language-sense features of each clause in the discourse text comprises:
performing semantic extraction on each clause in the discourse text to obtain semantic features of each clause in the discourse text;
performing semantic-to-language-sense conversion on the semantic features of each clause in the discourse text, based on a semantic-to-language-sense conversion relation, to obtain the language-sense features of each clause in the discourse text;
wherein the semantic-to-language-sense conversion relation is determined based on the sample semantic features and the sample language-sense features of each clause in the sample discourse text.
According to the speech synthesis method provided by the invention, the sample language-sense features are determined based on the following steps:
encoding acoustic features of the real speech corresponding to the sample discourse text to obtain speech features of the real speech;
performing speech-to-language-sense conversion on the speech features, based on a speech-to-language-sense conversion relation, to obtain the sample language-sense features of each clause in the sample discourse text;
wherein the speech-to-language-sense conversion relation is obtained by contrastive learning that anchors on the sentence-level features of each clause in the speech features, taking local features of that clause as positive example points and local features of other clauses as negative example points.
According to the speech synthesis method provided by the present invention, performing speech synthesis based on the phonetic features and the language-sense features of each clause in the discourse text to obtain the synthesized speech of the discourse text comprises:
fusing, clause by clause, the phonetic features with the language-sense features of each clause in the discourse text to obtain fused features of each clause in the discourse text;
and performing speech synthesis based on the fused features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
According to the speech synthesis method provided by the invention, encoding the discourse phoneme sequence to obtain the phonetic features of the discourse text comprises:
encoding the discourse phoneme sequence to obtain a phoneme-level vector of the discourse text;
predicting the duration of each phoneme in the discourse phoneme sequence based on the phoneme-level vector;
and upsampling the phoneme-level vector based on the duration of each phoneme in the discourse phoneme sequence to obtain the phonetic features.
The present invention also provides a speech synthesis apparatus, comprising:
a phoneme determining unit, configured to determine a discourse phoneme sequence of a discourse text to be synthesized;
a discourse encoding unit, configured to encode the discourse phoneme sequence to obtain phonetic features of the discourse text;
and a speech synthesis unit, configured to perform speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
The invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of any of the speech synthesis methods described above.
The invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the steps of any of the speech synthesis methods described above.
By encoding the discourse phoneme sequence of the discourse text, the speech synthesis method, apparatus, electronic device and storage medium provided by the invention obtain phonetic features that model the discourse text as a whole, so that the synthesized speech remains coherent in prosody, emotion and other aspects of language sense, and its naturalness is improved.
Drawings
In order to illustrate the technical solutions of the present invention or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a speech synthesis method provided by the present invention;
FIG. 2 is a schematic flow chart of a language-sense feature extraction method provided by the present invention;
FIG. 3 is a schematic flow chart of a sample language-sense feature extraction method provided by the present invention;
FIG. 4 is a schematic structural diagram of a language-sense feature extraction model provided by the present invention;
FIG. 5 is a schematic flow chart of step 120 of the speech synthesis method provided by the present invention;
FIG. 6 is a schematic structural diagram of a speech synthesis system provided by the present invention;
FIG. 7 is a schematic structural diagram of a speech synthesis apparatus provided by the present invention;
FIG. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described below clearly and completely with reference to the accompanying drawings. The described embodiments are evidently some, but not all, embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Current deep-learning-based speech synthesis methods can be divided into two types: autoregressive and non-autoregressive. The autoregressive methods use the classic Encoder-Decoder (E-D) framework: the encoder encodes the input text features, the decoder predicts acoustic features frame by frame in an autoregressive manner, and an attention mechanism aligns the encoder and decoder sequences. The Tacotron model is the main representative of this class. The non-autoregressive methods also adopt the E-D framework with the same encoder role, but the decoder generates the whole acoustic feature sequence in a non-autoregressive manner; instead of the unstable attention mechanism, an additional duration model is added, and the durations it predicts are used to upsample the encoder output sequence to the length of the acoustic feature sequence. FastSpeech is the prime representative of this class.
However, both the autoregressive and the non-autoregressive methods model a single sentence, that is, they perform speech synthesis in units of sentences. Because no sentence can see information from its neighbors during modeling, the generated renditions are relatively random, prosodic and emotional incoherence between adjacent sentences easily appears in the spliced speech, and the user experience suffers.
In view of the foregoing problems, an embodiment of the present invention provides a speech synthesis method. Fig. 1 is a schematic flow chart of the speech synthesis method provided by the present invention; as shown in fig. 1, the method includes:
step 110, determining the chapter phoneme sequence of the chapter text to be synthesized.
Specifically, the text of the chapters to be synthesized is the text containing a plurality of sentences, and the text of the chapters may be the text of one paragraph or the text of the entire chapter containing a plurality of paragraphs. The chapter text may be directly input by a user, or may be acquired by an image acquisition device such as a scanner, a mobile phone, or a camera, and an OCR (Optical Character Recognition) is performed on the image, or may be obtained by crawling over the internet.
The discourse phoneme sequence is a phoneme-level text sequence for the whole discourse text, and the discourse phoneme sequence contains the phoneme-level text of each sentence in the discourse text, and can be obtained by splicing the phoneme-level text of each sentence in the discourse text according to the arrangement sequence of each sentence in the discourse text. Here, the phoneme-level text may be obtained by converting the corresponding text into phonemes, and for example, the whole text of the chapters may be subjected to phoneme conversion in units of words, so as to obtain a sequence of chapters and phonemes.
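As an illustration only, the following sketch shows one way such a discourse phoneme sequence might be assembled. The toy lexicon and the sentence splitter are hypothetical stand-ins for a real pronunciation lexicon or grapheme-to-phoneme model.

```python
import re
from typing import Dict, List

# Toy pronunciation lexicon; the entries are illustrative only.
# A real system would use a full lexicon or a trained g2p model.
LEXICON: Dict[str, List[str]] = {"你": ["n", "i3"], "好": ["h", "ao3"]}

def split_sentences(text: str) -> List[str]:
    # Split after Chinese/Western sentence-final punctuation.
    parts = re.split(r"(?<=[。！？.!?])", text)
    return [p.strip() for p in parts if p.strip()]

def discourse_phoneme_sequence(text: str) -> List[str]:
    # Convert each sentence to phonemes word by word, then splice the
    # phoneme-level texts in the sentences' original order.
    seq: List[str] = []
    for sent in split_sentences(text):
        for ch in sent:
            seq.extend(LEXICON.get(ch, []))
    return seq

print(discourse_phoneme_sequence("你好。你好！"))
# ['n', 'i3', 'h', 'ao3', 'n', 'i3', 'h', 'ao3']
```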
Step 120, encoding the discourse phoneme sequence to obtain the phonetic features of the discourse text.
Specifically, conventional speech synthesis can be divided into two parts: encoding, i.e., phonetic feature extraction from text, and decoding, i.e., speech decoding from phonetic features. Current sentence-level speech synthesis methods usually encode the phoneme-level text of a single sentence, ignoring the context between that sentence and the other sentences of the discourse.
To address this, the embodiment of the invention encodes the discourse phoneme sequence, thereby extracting phonetic features for the whole discourse text. Because the discourse phoneme sequence covers the phoneme-level texts of all sentences in the discourse text, encoding it can draw on the global information of the discourse, yielding phonetic features that model the discourse text as a whole. Compared with the phonetic features obtained by modeling a single sentence in the related art, these features reflect the global information of the discourse text, and the features of adjacent sentences are more coherent.
Step 130, performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
Specifically, the phonetic features obtained in step 120 by modeling the discourse text as a whole are decoded into speech, yielding the synthesized speech of the discourse text. Because these phonetic features draw on the global information of the discourse text, the synthesized speech is more coherent and natural in prosody, emotion and other aspects of language sense, effectively alleviating the prosodic and emotional incoherence of speech synthesized from discourse text.
By encoding the discourse phoneme sequence of the discourse text, the speech synthesis method provided by the embodiment of the invention obtains phonetic features that model the discourse text as a whole, so that the synthesized speech remains coherent in prosody, emotion and other aspects of language sense, and its naturalness is improved.
Considering that the modeling capability of a model is limited, modeling the discourse text as a whole strengthens the coherence between adjacent sentences and prevents the listener from perceiving abrupt transitions, but the quality of the synthesized speech can still be optimized further. Based on the above embodiment, step 130 includes:
performing speech synthesis based on the phonetic features and the language-sense features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
Specifically, when synthesizing speech for a discourse text, the phonetic features of the discourse text may be combined with the language-sense features of each clause. The language-sense features of a clause reflect its prosody, emotional tendency and similar aspects of language sense; they may be obtained by analyzing the text of each clause individually, or by analyzing each clause against the discourse text as a whole, which is not specifically limited in the embodiment of the present invention.
Before synthesis, the language-sense features of each clause may be fused with the corresponding part of the phonetic features of the discourse text, and the fused clause features applied to speech synthesis; alternatively, during speech decoding of the phonetic features, the parameters applied when decoding each clause may be adjusted based on that clause's language-sense features, i.e., the language-sense features guide the speech decoding, thereby producing the synthesized speech of the discourse text.
By incorporating the sentence-level language-sense features of each clause during synthesis, the method provided by the embodiment of the invention guides the prosody and emotional trend of the synthesized speech, so that on top of the local coherence ensured by discourse-level modeling, the long-range coherence of the discourse speech is further strengthened: the synthesized discourse speech exhibits prosodic and emotional fluctuations similar to real human speech.
Based on any of the above embodiments, in step 130 the language-sense features of each clause in the discourse text are determined based on the following steps:
performing language-sense extraction on each clause in the discourse text, based on the sample language-sense features of each clause in a sample discourse text, to obtain the language-sense features of each clause in the discourse text;
wherein the sample language-sense features are obtained by performing language-sense feature extraction on the real speech corresponding to the sample discourse text.
Specifically, language-sense extraction for each clause in the discourse text can be performed through the mapping between the clauses of the sample discourse text and their sample language-sense features. This mapping may be embodied as a language-sense extraction model obtained through model training, or as language-sense extraction rules obtained through association mining, which is not specifically limited in the embodiment of the present invention.
Further, since language-sense features reflect prosody, emotion and the like, features extracted from speech capture the delivery of a real reader more faithfully than features extracted from text. The embodiment of the invention therefore records the real speech of the sample discourse text, i.e., the speech of a real person reading it, before deriving the sample language-sense features. Sample language-sense features extracted from this real speech learn the characteristics of real reading, so their prosody, emotion and so on are more authentic, vivid and natural. On this basis, applying the clauses of the sample discourse text and their sample language-sense features to the language-sense extraction for the clauses of the discourse text further improves how faithfully the resulting language-sense features convey prosody, emotion and similar information.
Based on any of the above embodiments, fig. 2 is a schematic flow chart of the language-sense feature extraction method provided by the present invention. As shown in fig. 2, in step 130 the language-sense features of each clause in the discourse text are determined based on the following steps:
Step 210, performing semantic extraction on each clause in the discourse text to obtain semantic features of each clause in the discourse text.
The semantic extraction may be performed on each clause independently, or on each clause in the context of the whole discourse text, to obtain the semantic features of each clause in the discourse text.
Furthermore, the semantic extraction can be realized by a BERT (Bidirectional Encoder Representations from Transformers) model from the field of natural language processing, or by another language model with encoding capability, such as the encoder of a Transformer model. Taking BERT as an example, the model understands text, has strong modeling capability, and can output high-dimensional vectors containing semantic information. Its input is the word-level text sequence of the discourse text, formed by splicing the texts of the clauses in the form "<CLS>sentence 1<SEP><CLS>sentence 2<SEP>…<CLS>sentence n<SEP>"; the output is an encoding vector sequence of the same length containing semantic information. A minimal sketch follows.
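The sketch below shows this semantic extraction with the Hugging Face transformers library. Encoding each clause separately, rather than as one spliced "<CLS>…<SEP>" sequence, is a simplification, but it likewise yields one <CLS> vector per clause; the bert-base-chinese checkpoint is an assumed choice.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

def clause_semantic_features(clauses):
    # One <CLS> vector per clause, used as its semantic feature.
    feats = []
    with torch.no_grad():
        for sent in clauses:
            inputs = tokenizer(sent, return_tensors="pt")
            hidden = bert(**inputs).last_hidden_state  # (1, T, 768)
            feats.append(hidden[0, 0])                 # <CLS> position
    return torch.stack(feats)                          # (n_clauses, 768)
```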
Step 220, performing semantic-to-language-sense conversion on the semantic features of each clause in the discourse text, based on the semantic-to-language-sense conversion relation, to obtain the language-sense features of each clause in the discourse text;
wherein the semantic-to-language-sense conversion relation is determined based on the sample semantic features and the sample language-sense features of each clause in the sample discourse text.
Specifically, after the semantic features of each clause in the discourse text are obtained, they can be converted into language-sense features based on the semantic-to-language-sense conversion relation. This relation can be understood as one part of the mapping referred to in the above embodiment; that is, the mapping can be split into a text-to-semantic part and a semantic-to-language-sense part. The semantic-to-language-sense conversion relation may be embodied as a model obtained by supervised training on the sample semantic features and the sample language-sense features of each clause in the sample discourse text, or as rules obtained by association mining on those features, which is not specifically limited in the embodiment of the present invention.
Here, the semantic-to-language-sense conversion relation may form one language-sense extraction model together with the semantic extraction module of step 210, or may serve as a language-sense extraction model independent of that module. When a neural network represents the conversion relation, its structure may be an LSTM (Long Short-Term Memory) network plus a linear projection layer, or any other structure capable of representing the mapping, which is not specifically limited in the embodiment of the present invention. Accordingly, if the conversion relation and the semantic extraction module of step 210 together form one language-sense extraction model, training may take each clause of the sample discourse text and its sample language-sense features as samples; if the conversion relation stands alone as a language-sense extraction model, training may take the semantic features of each clause and its sample language-sense features as samples. A sketch of such a module is given below.
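The following sketch shows one such realization under assumed dimensions: an LSTM over the clause-level semantic vectors followed by a linear projection, trained by regressing onto the sample language-sense features.

```python
import torch
import torch.nn as nn

class SemanticToLanguageSense(nn.Module):
    # LSTM + linear projection: one possible form of the
    # semantic-to-language-sense conversion relation.
    def __init__(self, sem_dim: int = 768, hidden: int = 256, ls_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(sem_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, ls_dim)

    def forward(self, sem):          # (batch, n_clauses, sem_dim)
        out, _ = self.lstm(sem)
        return self.proj(out)        # (batch, n_clauses, ls_dim)

# Supervised training pairs each clause's semantic feature with its
# sample language-sense feature, e.g. with an MSE regression loss:
#   loss = nn.functional.mse_loss(model(sem), sample_ls)
```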
Based on any of the above embodiments, fig. 3 is a schematic flow chart of the sample language-sense feature extraction method provided by the present invention. As shown in fig. 3, the sample language-sense features are determined based on the following steps:
Step 310, encoding acoustic features of the real speech corresponding to the sample discourse text to obtain speech features of the real speech;
Step 320, performing speech-to-language-sense conversion on the speech features, based on the speech-to-language-sense conversion relation, to obtain the sample language-sense features of each clause in the sample discourse text;
wherein the speech-to-language-sense conversion relation is obtained by contrastive learning that anchors on the sentence-level features of each clause in the speech features, taking local features of that clause as positive example points and local features of other clauses as negative example points.
Specifically, for the sample discourse text, the corresponding real speech may be obtained first and the sample language-sense features then extracted from it. The acoustic features of the real speech may be obtained by framing and windowing the speech and then applying a fast Fourier transform (FFT); they may be, for example, Mel Frequency Cepstral Coefficient (MFCC) features or Perceptual Linear Prediction (PLP) features. Further feature extraction is then performed on these acoustic features to obtain frame-level speech features.
On this basis, the extracted speech features of the real speech can be converted, based on the speech-to-language-sense conversion relation, into the language-sense features of the real speech, i.e., the sample language-sense features of each clause in the sample discourse text corresponding to the real speech.
Here, the speech-to-language-sense conversion relation may be embodied as a model obtained through training, or as rules obtained through association mining. Considering that the language-sense features of the real speech are unknown at the stage when the conversion relation is learned, the embodiment of the present invention determines it by contrastive learning.
The key to contrastive learning is selecting a proper set of positive and negative example points to help the anchor learn useful information. In the embodiment of the present invention, the anchor is the sample language-sense feature c_sent of a clause in the sample discourse text, which is determined from the sentence-level features of that clause within the speech features of the real speech corresponding to the sample discourse text. The sentence-level features of a clause are the frame-level speech features belonging to that clause, obtained by segmenting the speech features of the real speech clause by clause. Since sentence-level information and the local information within it, i.e., the sentence-level features of a clause and its local features, should be highly correlated, the sample language-sense feature derived from a clause's sentence-level features should likewise be highly correlated with that clause's local features.
On this basis, with the sample language-sense feature c_sent of any clause of the sample discourse text as the anchor, local features randomly selected from that clause's sentence-level features serve as positive example points, and local features from the sentence-level features of other clauses serve as negative example points; contrastive learning over these yields a speech-to-language-sense conversion relation capable of performing the conversion.
By extracting language-sense features through a conversion relation obtained by contrastive learning, the method provided by the embodiment of the invention improves the representational capability of the sample language-sense features, and thus the fidelity of the synthesized speech.
Based on any of the above embodiments, fig. 4 is a schematic structural diagram of the language-sense feature extraction model provided by the present invention. As shown in fig. 4, the real speech is represented as a speech waveform, and acoustic feature extraction on it yields the acoustic features …, x_{t-2}, x_{t-1}, x_t, …, where x_t is the acoustic feature of the t-th frame. The encoder in the figure performs the acoustic feature encoding of step 310 to obtain the speech features of the real speech, which are likewise at the frame level, i.e., …, z_{t-2}, z_{t-1}, z_t, …. Assuming frames t-2 through t+3 correspond to one clause, the feature extractor in the figure performs language-sense conversion on the sentence-level features formed by z_{t-2} through z_{t+3}, yielding the sample language-sense feature c_sent of that clause.
Here, the language-sense conversion performed by the feature extractor may consist of further feature extraction on the sentence-level features, followed by averaging the extracted vectors within the clause to obtain the abstracted sample language-sense feature c_sent, as in the sketch below.
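As a sketch (the clause boundaries are assumed known from the sentence segmentation), the pooling step can be as simple as:

```python
import torch

def clause_language_sense(z: torch.Tensor, start: int, end: int) -> torch.Tensor:
    # z: (T, D) frame-level speech features; [start, end) is the frame
    # range of one clause, e.g. frames t-2 .. t+3 in the figure.
    # Averaging the frames yields the clause's feature c_sent.
    return z[start:end].mean(dim=0)  # (D,)
```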
Accordingly, for training the language-sense feature extraction model, with the sample language-sense feature c_sent of a clause as the anchor, the positive example point may be a local feature randomly selected from that clause's sentence-level features z_{t-2} through z_{t+3}, for example z_t. The negative example points may be local features from the sentence-level features of other clauses; for example, a set of negative example points may be constructed by randomly selecting a number of local features from other clauses, say 300 frame-level vectors.
Specifically, during contrastive learning, the InfoNCE loss can be used as the loss function to drive the update of the language-sense feature extraction model:

L_N = -E[ log ( f(c_sent, z_t) / Σ_{z_j ∈ Z} f(c_sent, z_j) ) ], where f(c_sent, z) = exp(c_sent · z)

Here L_N is the InfoNCE loss and Z is the set consisting of the positive example point z_t and the negative example points. After model training converges, inputting the acoustic features of the real speech into the language-sense feature extraction model yields the sample language-sense features of each clause in the sample discourse text corresponding to that speech. A minimal code sketch follows.
Based on any of the above embodiments, step 130 includes:
fusing, clause by clause, the phonetic features with the language-sense features of each clause in the discourse text to obtain fused features of each clause in the discourse text;
and performing speech synthesis based on the fused features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
Specifically, the phonetic features obtained by modeling the discourse text as a whole are fused, clause by clause, with the language-sense features reflecting the prosody, emotional tendency and the like of each clause. For example, the phonetic features of each clause may be located within the phonetic features of the discourse text, spliced with the language-sense features of the corresponding clause, and the spliced features taken as that clause's fused features; alternatively, the spliced features of each clause may be re-encoded by a bidirectional LSTM, an RNN (Recurrent Neural Network) or the like, and the re-encoded features taken as the fused features, as sketched below.
After the fused features of the clauses in the discourse text are obtained, speech synthesis is realized by speech-decoding them, yielding the synthesized speech of the discourse text.
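The sketch below illustrates this fusion under assumed shapes: each clause's span is sliced out of the frame-level phonetic features, its language-sense vector is broadcast over the span, and the concatenation is optionally re-encoded.

```python
import torch
import torch.nn as nn

def fuse(phonetic: torch.Tensor, ls_feats: torch.Tensor,
         spans, bilstm: nn.LSTM = None) -> torch.Tensor:
    # phonetic: (T, D1) discourse-level frame features;
    # ls_feats: (n_clauses, D2) one language-sense vector per clause;
    # spans: list of (start, end) frame ranges, one per clause.
    fused = []
    for (s, e), ls in zip(spans, ls_feats):
        seg = phonetic[s:e]                         # (e-s, D1)
        rep = ls.unsqueeze(0).expand(e - s, -1)     # (e-s, D2)
        fused.append(torch.cat([seg, rep], dim=-1))
    fused = torch.cat(fused, dim=0)                 # (T, D1+D2)
    if bilstm is not None:                          # optional re-encoding
        fused, _ = bilstm(fused.unsqueeze(0))
        fused = fused.squeeze(0)
    return fused
```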
Based on any of the above embodiments, fig. 5 is a schematic flow chart of step 120 of the speech synthesis method provided by the present invention. As shown in fig. 5, step 120 includes:
Step 121, encoding the discourse phoneme sequence to obtain a phoneme-level vector of the discourse text;
Step 122, predicting the duration of each phoneme in the discourse phoneme sequence based on the phoneme-level vector;
Step 123, upsampling the phoneme-level vector based on the duration of each phoneme in the discourse phoneme sequence to obtain the phonetic features.
Specifically, the encoding of the discourse phoneme sequence can be realized in a non-autoregressive manner. First, a self-attention network, for example multi-layer multi-head self-attention, may be applied to encode the discourse phoneme sequence non-linearly, extracting its phonetic characteristics and producing the phoneme-level vector of the discourse text, denoted memory.
Then, the duration of each phoneme in the discourse phoneme sequence is predicted from the encoded phoneme-level vector. Specifically, the phoneme-level vector may be further encoded by an LSTM, bidirectional LSTM, RNN or similar network, and duration prediction performed on the resulting features, giving the duration each phoneme of the discourse phoneme sequence occupies in the synthesized speech.
On this basis, the vector of each phoneme in the phoneme-level vector is upsampled according to its duration, so that the number of frames covered by each phoneme's vector matches that duration, yielding frame-level phonetic features. For example, if there are 3 phonemes in the discourse phoneme sequence, the phoneme-level vector memory is [h1, h2, h3] and the corresponding durations are [2, 3, 2], then the copied, upsampled output, i.e., the phonetic features, may be [h1, h1, h2, h2, h2, h3, h3]. Note that a non-integer predicted duration must be rounded; for example, predicted durations of [2.5, 4.3, 2.7] are rounded to [3, 4, 3] before upsampling, as in the sketch below.
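The rounding-and-repeat logic might look like the following sketch; the exact rounding strategy is an assumption, since the description only requires durations to be made integral.

```python
import torch

def upsample(memory: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    # memory: (P, D) phoneme-level vectors; durations: (P,) frame
    # counts, possibly fractional predictions.
    frames = torch.round(durations).long().clamp(min=1)
    # Repeat each phoneme vector for its duration in frames.
    return torch.repeat_interleave(memory, frames, dim=0)  # (T, D)

memory = torch.tensor([[1.0], [2.0], [3.0]])       # h1, h2, h3
print(upsample(memory, torch.tensor([2.0, 3.0, 2.0])).squeeze(-1))
# tensor([1., 1., 2., 2., 2., 3., 3.])
```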
By working in a non-autoregressive manner, the method provided by the embodiment of the invention achieves speech synthesis with higher generation efficiency and stability.
Based on any of the above embodiments, fig. 6 is a schematic structural diagram of the speech synthesis system provided by the present invention. As shown in fig. 6, speech synthesis relies on three parts: a non-autoregressive acoustic module, an abstract coding module, and a BERT prediction module. In fig. 6, dotted arrows take effect only during training, dashed arrows only during application, solid arrows during both, and double-headed solid arrows indicate error losses computed during training.
The main function of the non-autoregressive acoustic module is to construct the mapping between the input discourse phoneme sequence and the output acoustic features; it comprises an encoder, a duration prediction module, an upsampling module and a decoder. Its input is the sequence formed by splicing the phoneme-level text features of the whole discourse, i.e., the discourse phoneme sequence, which the encoder encodes non-linearly into the phoneme-level vector of the discourse text, denoted memory. The duration prediction module predicts each phoneme's duration from memory, and the error against each phoneme's real duration in the real speech drives its learning. Meanwhile, memory and the phoneme durations (the real durations during training, the predicted durations output by the duration prediction module during application) are fed into the upsampling module for expansion, yielding frame-level phonetic features on the same scale as the acoustic features. Finally, the frame-level phonetic features are spliced with the language-sense features (the sample language-sense features during training, the language-sense features output by the BERT prediction module during application), the decoder predicts the acoustic features, and the error between the predicted and real acoustic features drives the learning of the encoder and decoder in the non-autoregressive acoustic module.
The main function of the abstract coding module is to extract, from the acoustic features, the language-sense features of each clause, such as prosody and emotion, so that the generated speech is closer to a real person's. Its main structure comprises a feature extraction layer and a pooling layer: the input is the acoustic features of the real speech, which the feature extraction layer encodes and the pooling layer downsamples into one vector per clause of the real speech, called the sample language-sense feature of that clause. The sample language-sense features generated by this module are spliced with the frame-level phonetic features in the non-autoregressive acoustic module to jointly guide the generation of the acoustic features. Note that the abstract coding module is learned in advance by contrastive learning and is kept fixed while the non-autoregressive acoustic module is trained.
Since the real sample language-sense features are extracted from acoustic features, a tool is also needed to predict each clause's language-sense features at synthesis time; the BERT prediction module serves this purpose. It comprises a BERT module and an autoregressive prediction module, the BERT module being fixed after pre-training on massive corpus data. The input of the BERT prediction module is the discourse text, i.e., the word-level text of the whole discourse. BERT produces word-level encoding vectors, and the vector corresponding to the <CLS> tag of each clause is taken as that clause's semantic feature; the autoregressive prediction module models each clause's language-sense features from these, and the error between the predicted language-sense features and the real sample language-sense features drives the learning of the autoregressive prediction module.
In the method provided by the embodiment of the invention, speech synthesis through the non-autoregressive acoustic module ensures the stability and efficiency of synthesis; joint modeling of the whole discourse text during synthesis lets each clause see the information of several sentences before and after it, promoting coherence between adjacent sentences in the synthesized whole; and the coarse-grained, sentence-level language-sense features control the prosodic and emotional trends of the discourse speech, guaranteeing reasonable prosody and emotion for each sentence of the synthesized speech.
Based on any of the above embodiments, fig. 7 is a schematic structural diagram of a speech synthesis apparatus provided by the present invention, as shown in fig. 7, the apparatus includes:
a phoneme determining unit 710, configured to determine a discourse phoneme sequence of a discourse text to be synthesized;
a discourse encoding unit 720, configured to encode the discourse phoneme sequence to obtain phonetic features of the discourse text;
and a speech synthesis unit 730, configured to perform speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
By encoding the discourse phoneme sequence of the discourse text, the speech synthesis apparatus provided by the embodiment of the invention obtains phonetic features that model the discourse text as a whole, so that the synthesized speech remains coherent in prosody, emotion and other aspects of language sense, and its naturalness is improved.
Based on any of the above embodiments, the speech synthesis unit 730 is configured to:
perform speech synthesis based on the phonetic features and the language-sense features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
Based on any embodiment above, the apparatus further comprises:
a language-sense extraction unit, configured to perform language-sense extraction on each clause in the discourse text, based on the sample language-sense features of each clause in a sample discourse text, to obtain the language-sense features of each clause in the discourse text;
wherein the sample language-sense features are obtained by performing language-sense feature extraction on the real speech corresponding to the sample discourse text.
Based on any of the above embodiments, the language-sense extraction unit is configured to:
perform semantic extraction on each clause in the discourse text to obtain semantic features of each clause in the discourse text;
and perform semantic-to-language-sense conversion on the semantic features of each clause in the discourse text, based on the semantic-to-language-sense conversion relation, to obtain the language-sense features of each clause in the discourse text;
wherein the semantic-to-language-sense conversion relation is determined based on the sample semantic features and the sample language-sense features of each clause in the sample discourse text.
Based on any embodiment above, the apparatus further comprises:
a sample language-sense acquisition unit, configured to encode acoustic features of the real speech corresponding to the sample discourse text to obtain speech features of the real speech;
and perform speech-to-language-sense conversion on the speech features, based on the speech-to-language-sense conversion relation, to obtain the sample language-sense features of each clause in the sample discourse text;
wherein the speech-to-language-sense conversion relation is obtained by contrastive learning that anchors on the sentence-level features of each clause in the speech features, taking local features of that clause as positive example points and local features of other clauses as negative example points.
Based on any of the above embodiments, the speech synthesis unit 730 is configured to:
fuse, clause by clause, the phonetic features with the language-sense features of each clause in the discourse text to obtain fused features of each clause in the discourse text;
and perform speech synthesis based on the fused features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
Based on any of the above embodiments, the discourse encoding unit 720 is configured to:
encode the discourse phoneme sequence to obtain a phoneme-level vector of the discourse text;
predict the duration of each phoneme in the discourse phoneme sequence based on the phoneme-level vector;
and upsample the phoneme-level vector based on the duration of each phoneme in the discourse phoneme sequence to obtain the phonetic features.
Fig. 8 illustrates the physical structure of an electronic device. As shown in fig. 8, the electronic device may include: a processor 810, a communication interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a speech synthesis method comprising:
determining a discourse phoneme sequence of a discourse text to be synthesized;
encoding the discourse phoneme sequence to obtain phonetic features of the discourse text;
and performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
In addition, the logic instructions in the memory 830 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied as a software product stored in a storage medium, including instructions for causing a computer device (a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the speech synthesis method provided above, the method comprising:
determining a discourse phoneme sequence of a discourse text to be synthesized;
encoding the discourse phoneme sequence to obtain phonetic features of the discourse text;
and performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, implementing the speech synthesis method provided above, the method comprising:
determining a discourse phoneme sequence of a discourse text to be synthesized;
encoding the discourse phoneme sequence to obtain phonetic features of the discourse text;
and performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech synthesis, comprising:
determining a discourse phoneme sequence of a discourse text to be synthesized;
encoding the discourse phoneme sequence to obtain phonetic features of the discourse text;
and performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
2. The speech synthesis method of claim 1, wherein the performing speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text comprises:
performing speech synthesis based on the phonetic features and the language-sense features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
3. The speech synthesis method of claim 2, wherein the language-sense features of each clause in the discourse text are determined based on the following steps:
performing language-sense extraction on each clause in the discourse text, based on sample language-sense features of each clause in a sample discourse text, to obtain the language-sense features of each clause in the discourse text;
wherein the sample language-sense features are obtained by performing language-sense feature extraction on the real speech corresponding to the sample discourse text.
4. The speech synthesis method of claim 3, wherein the performing language-sense extraction on each clause in the discourse text based on the sample language-sense features of each clause in the sample discourse text to obtain the language-sense features of each clause in the discourse text comprises:
performing semantic extraction on each clause in the discourse text to obtain semantic features of each clause in the discourse text;
performing semantic-to-language-sense conversion on the semantic features of each clause in the discourse text, based on a semantic-to-language-sense conversion relation, to obtain the language-sense features of each clause in the discourse text;
wherein the semantic-to-language-sense conversion relation is determined based on the sample semantic features and the sample language-sense features of each clause in the sample discourse text.
5. The speech synthesis method of claim 3, wherein the sample language-sense features are determined based on the following steps:
encoding acoustic features of the real speech corresponding to the sample discourse text to obtain speech features of the real speech;
performing speech-to-language-sense conversion on the speech features, based on a speech-to-language-sense conversion relation, to obtain the sample language-sense features of each clause in the sample discourse text;
wherein the speech-to-language-sense conversion relation is obtained by contrastive learning that anchors on the sentence-level features of each clause in the speech features, taking local features of that clause as positive example points and local features of other clauses as negative example points.
6. The speech synthesis method of claim 2, wherein the performing speech synthesis based on the phonetic features and the language-sense features of each clause in the discourse text to obtain the synthesized speech of the discourse text comprises:
fusing, clause by clause, the phonetic features with the language-sense features of each clause in the discourse text to obtain fused features of each clause in the discourse text;
and performing speech synthesis based on the fused features of each clause in the discourse text to obtain the synthesized speech of the discourse text.
7. The speech synthesis method of any one of claims 1 to 6, wherein the encoding the discourse phoneme sequence to obtain the phonetic features of the discourse text comprises:
encoding the discourse phoneme sequence to obtain a phoneme-level vector of the discourse text;
predicting the duration of each phoneme in the discourse phoneme sequence based on the phoneme-level vector;
and upsampling the phoneme-level vector based on the duration of each phoneme in the discourse phoneme sequence to obtain the phonetic features.
8. A speech synthesis apparatus, comprising:
a phoneme determining unit, configured to determine a discourse phoneme sequence of a discourse text to be synthesized;
a discourse encoding unit, configured to encode the discourse phoneme sequence to obtain phonetic features of the discourse text;
and a speech synthesis unit, configured to perform speech synthesis based on the phonetic features to obtain the synthesized speech of the discourse text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the speech synthesis method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the speech synthesis method according to any one of claims 1 to 7.
CN202111659164.1A 2021-12-30 2021-12-30 Speech synthesis method, speech synthesis device, electronic equipment and storage medium Pending CN114267330A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111659164.1A CN114267330A (en) 2021-12-30 2021-12-30 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111659164.1A CN114267330A (en) 2021-12-30 2021-12-30 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114267330A true CN114267330A (en) 2022-04-01

Family

ID=80832061

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111659164.1A Pending CN114267330A (en) 2021-12-30 2021-12-30 Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114267330A (en)

Similar Documents

Publication Publication Date Title
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
Tan et al. A survey on neural speech synthesis
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
JP7504188B2 (en) Expressiveness control in end-to-end speech synthesis systems
CN116034424A (en) Two-stage speech prosody migration
Panda et al. A survey on speech synthesis techniques in Indian languages
CN113658577B (en) Speech synthesis model training method, audio generation method, equipment and medium
CN113178188B (en) Speech synthesis method, device, equipment and storage medium
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN116364055A (en) Speech generation method, device, equipment and medium based on pre-training language model
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Liu et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
López-Ludeña et al. LSESpeak: A spoken language generator for Deaf people
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113948061A (en) Speech synthesis method, system, speech synthesis model and training method thereof
JP7357518B2 (en) Speech synthesis device and program
CN114267330A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112802451A (en) Prosodic boundary prediction method and computer storage medium
Ronanki Prosody generation for text-to-speech synthesis
Barakat et al. Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources
CN113823259B (en) Method and device for converting text data into phoneme sequence
CN114373445B (en) Voice generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230518

Address after: 230026 No. 96, Jinzhai Road, Hefei, Anhui

Applicant after: University of Science and Technology of China

Applicant after: IFLYTEK Co.,Ltd.

Address before: 230088 666 Wangjiang West Road, Hefei hi tech Development Zone, Anhui

Applicant before: IFLYTEK Co.,Ltd.
