CN118298797A - Low-resource-based speech synthesis model training method, device, equipment and medium


Publication number
CN118298797A
Authority
CN
China
Prior art keywords
phoneme
sample set
preset
data
accent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410412217.7A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
夏晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202410412217.7A priority Critical patent/CN118298797A/en
Publication of CN118298797A publication Critical patent/CN118298797A/en
Pending legal-status Critical Current


Abstract

The invention relates to the field of artificial intelligence, and discloses a low-resource-based speech synthesis model training method, a device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring a training sample set, and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data; extracting fundamental frequency and unvoiced sound of voice data in the extended sample set respectively, and performing phoneme conversion on text data in the extended sample set to obtain a phoneme sequence; according to the phoneme sequence, calculating the duration time of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model; and training a preset speech synthesis model by using the extended sample set, the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration to obtain a trained speech synthesis model. The invention can reduce the resource requirement of the speech synthesis model training and improve the accuracy of speech synthesis.

Description

Low-resource-based speech synthesis model training method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to a low-resource-based speech synthesis model training method, apparatus, electronic device, and readable storage medium.
Background
Speech synthesis, the technique of converting arbitrary input text into corresponding speech, is an important research branch in the field of natural language processing. For example, in the financial field, intelligent outbound calling is often used to communicate with users when marketing insurance products or credit card products, and the main enabling technology of intelligent outbound calling is speech synthesis.
Most current speech synthesis technologies take characters as input, and because of polyphonic characters, pronunciation errors easily occur. In addition, current speech synthesis models require a large amount of data during training to ensure the accuracy of the synthesized speech, which consumes enormous resources; without the support of such resources, it is difficult to train a speech synthesis model, so the training difficulty is high.
Disclosure of Invention
The invention provides a low-resource-based speech synthesis model training method, a device, electronic equipment and a readable storage medium, and aims to reduce the resource requirement of speech synthesis model training and improve the accuracy of speech synthesis.
In order to achieve the above object, the present invention provides a low-resource-based speech synthesis model training method, which includes:
Acquiring a training sample set, and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data;
extracting fundamental frequency and unvoiced sound of voice data in the extended sample set respectively, and performing phoneme conversion on text data in the extended sample set to obtain a phoneme sequence;
According to the phoneme sequence, calculating the duration time of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model;
According to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, the phoneme sequence is sequentially processed by utilizing a coding layer, a decoding layer and a residual network layer in a preset voice synthesis model, so as to obtain a target mel frequency spectrum;
Performing parallel audio conversion on the target Mel frequency spectrum by using a vocoder in the speech synthesis model to obtain target audio;
And calculating the loss value of the voice data in the target audio and the expansion sample set by using a preset loss function, and adjusting the parameters of the voice synthesis model according to the loss value until the loss value is smaller than a preset threshold value to obtain the trained voice synthesis model.
Optionally, the processing the phoneme sequence sequentially by using an encoding layer, a decoding layer and a residual network layer in the speech synthesis model according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration to obtain a target mel spectrum includes:
carrying out convolution processing on the phoneme sequence by utilizing a convolution module in the coding layer to obtain a feature matrix;
performing linear processing on the feature matrix to obtain a linear feature matrix;
carrying out batch normalization processing on the linear feature matrix to obtain an optimized feature matrix;
Calculating the optimized feature matrix by using a bidirectional long-short-time memory network preset in a coding layer of the voice synthesis model to obtain a hidden feature matrix;
According to the hidden characteristic matrix, the fundamental frequency, the unvoiced sound and the duration, predicting a mel frequency spectrum of text data in the extended sample set by utilizing a decoding layer in the voice synthesis model to obtain a predicted mel frequency spectrum;
and residual connection is carried out on the predicted Mel frequency spectrum by utilizing a residual network layer in the voice synthesis model, so as to obtain a target Mel frequency spectrum.
Optionally, the predicting, according to the hidden feature matrix, the fundamental frequency, the unvoiced sound and the duration, the mel spectrum of the text data in the extended sample set by using a decoding layer in the speech synthesis model, to obtain a predicted mel spectrum includes:
constructing a time domain signal of voice data in the extended sample set according to the fundamental frequency and the unvoiced sound;
converting the time domain signal into a frequency domain signal by utilizing short-time Fourier transform, and performing modular square calculation on the frequency domain signal to obtain a power spectrum;
multiplying the power spectrum by a preset Mel scale mapping formula to obtain a preset Mel spectrum;
Extracting a context vector in the hidden feature matrix by using an attention network in the decoding layer to obtain a context vector of a first current time step;
Performing series connection operation on the context vector of the first current time step and the preset Mel frequency spectrum, and inputting a series connection result into a preset double-layer long-short-time memory layer to obtain the context vector of the second current time step;
performing a first linear projection on the context vector of the second current time step by using a post-processing network in the decoding layer to obtain a context scalar of the current time step;
Performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel spectrum prediction on the context vector after the second linear projection to obtain Mel spectrum of the second current time step;
Calculating the probability of the completion of the Mel spectrum prediction by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability of the completion of the Mel spectrum prediction is smaller than a preset threshold value;
And when the probability of the completion of the Mel spectrum prediction is smaller than the threshold value, carrying out series connection operation on the context vector of the second current time step and the Mel spectrum of the second current time step, and returning to the step of inputting the series connection result into a preset double-layer long-short-time memory layer until the probability of the completion of the Mel spectrum prediction is larger than the threshold value, ending the Mel spectrum prediction, and obtaining a predicted Mel spectrum.
Optionally, the expanding the training sample set to obtain an expanded sample set includes:
classifying the voice data in the training sample set according to the accent class, and sorting the classified voice data according to the accent data quantity to obtain voice data sorting;
Selecting accents corresponding to the voice data with the accent data quantity less than the preset quantity threshold value in the voice data sorting as target accents;
selecting the voice data to be converted in the voice data at will, and performing accent conversion on the voice data to be converted by using a preset accent conversion model according to the target accent to obtain target accent data;
performing voice recognition on the target accent data to obtain target accent text data;
and filling the target accent text data and the target accent data into the training sample set to obtain an extended sample set.
Optionally, the performing accent conversion on the voice data to be converted by using a preset accent conversion model according to the target accent to obtain target accent data includes:
Coding the voice data to be converted by using a coding module in a preset accent conversion model to obtain a content coding vector;
Extracting accent features of the target accent by using a feature extraction module in the preset accent conversion model to obtain target accent features;
carrying out mapping feature calculation on the target accent features to obtain target accent mapping features;
Combining the target accent mapping feature with the content coding vector to obtain a target accent coding vector;
And decoding the target accent coding vector by using a decoding module in the preset accent conversion model to obtain target accent data.
Optionally, calculating the duration of each phoneme of the text data in the extended sample set using a preset phoneme time prediction model includes:
Coding the phoneme sequence by using a coding module in a preset phoneme time prediction model to obtain a phoneme coding sequence;
convolving the phoneme coding sequence by utilizing a convolution module in the preset phoneme time prediction model to obtain a phoneme characteristic sequence;
calculating the probability of the pronunciation time of each phoneme in the phoneme characteristic sequence by using a full-connection module in the preset phoneme time prediction model;
And determining the pronunciation time length of each phoneme in the phoneme characteristic sequence according to the probability to obtain the duration time of each phoneme of the text data in the extended sample set.
Optionally, the performing phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence includes:
Performing sentence segmentation processing on the text data in the extended sample set to obtain sentence segmented text data;
According to a preset text format rule, converting non-characters in the sentence segmentation text data into characters;
word segmentation processing is carried out on sentence segmentation text data subjected to text conversion, so that word segmentation text data are obtained;
Mapping the word segmentation text data according to a preset word-phoneme mapping dictionary to obtain phoneme data;
vector conversion is carried out on the phoneme data to obtain a phoneme vector;
And carrying out coding sequencing on the phoneme vectors according to the text sequence to obtain a phoneme sequence.
In order to solve the above problems, the present invention further provides a low-resource-based speech synthesis model training apparatus, which includes:
the training sample expansion module is used for acquiring a training sample set and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data;
the phoneme duration calculation module is used for respectively extracting fundamental frequency and unvoiced sound of the voice data in the extended sample set, performing phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence, and calculating the duration of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model according to the phoneme sequence;
And the voice synthesis model training module is used for processing the phoneme sequence in sequence by utilizing a coding layer, a decoding layer and a residual error network layer in a preset voice synthesis model according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration time to obtain a target Mel frequency spectrum, performing parallel audio conversion on the target Mel frequency spectrum by utilizing a vocoder in the voice synthesis model to obtain a target audio, calculating loss values of voice data in the target audio and the expansion sample set by utilizing a preset loss function, and adjusting parameters of the voice synthesis model according to the loss values until the loss values are smaller than a preset threshold value to obtain a trained voice synthesis model.
In order to solve the above-mentioned problems, the present invention also provides an electronic apparatus including:
A memory storing at least one computer program; and
And the processor executes the computer program stored in the memory to realize the low-resource-based speech synthesis model training method.
In order to solve the above-mentioned problems, the present invention also provides a computer readable storage medium having stored therein at least one computer program that is executed by a processor in an electronic device to implement the low-resource-based speech synthesis model training method described above.
According to the embodiment of the invention, the training sample set is acquired and expanded, which reduces the difficulty of acquiring sample data and alleviates the problems of an unbalanced text data set and a poorly trained speech synthesis model caused by scarce data for some accents. Further, the fundamental frequency and unvoiced sound of the voice data in the extended sample set are extracted respectively, the text data in the extended sample set is converted into a phoneme sequence, and the preset speech synthesis model is trained by using the extended sample set, the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration to obtain the trained speech synthesis model, so that phoneme input replaces character input and the accuracy of the synthesized speech is improved. Therefore, the low-resource-based speech synthesis model training method, apparatus, device and storage medium provided by the invention can reduce the resource requirement of speech synthesis model training and improve the accuracy of speech synthesis.
Drawings
FIG. 1 is a flow chart of a low-resource-based speech synthesis model training method according to an embodiment of the present invention;
FIGS. 2 and 3 are flowcharts illustrating a detailed implementation of one of the steps in a low-resource-based speech synthesis model training method according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a low-resource-based speech synthesis model training apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an internal structure of an electronic device implementing a low-resource-based speech synthesis model training method according to an embodiment of the present invention;
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the application provides a low-resource-based speech synthesis model training method. The execution subject of the low-resource-based speech synthesis model training method includes, but is not limited to, at least one of a server, a terminal, and the like, which can be configured to execute the method provided by the embodiment of the application. In other words, the low-resource-based speech synthesis model training method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server may include an independent server, and may also include a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Referring to fig. 1, a flow chart of a low-resource-based speech synthesis model training method according to an embodiment of the present invention is shown, where in the embodiment of the present invention, the low-resource-based speech synthesis model training method includes:
S1, acquiring a training sample set, and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data.
In the embodiment of the invention, the training sample set may be an unbalanced set of voice data and text data in which some accents have few data resources. The voice data may be speech carrying the accent of the sample provider. The text data may be the text obtained by performing speech recognition on the voice data; for example, the text obtained by recognizing a Hunan-accented utterance meaning "child" may serve as text data.
In an optional implementation of the invention, voice data in a training sample set can be obtained by downloading voices through a network or inviting users with different accents to participate in voice recording and the like, and further, text data corresponding to the voice data in the training sample set can be obtained by carrying out voice recognition on the voice data in the training sample set, thereby ensuring the accuracy of samples required by training a voice synthesis model and improving the accuracy of the training of the voice synthesis model.
Further, the embodiment of the invention obtains the extended sample set by extending the training sample set, so that the voice synthesis model can learn data of different speakers and different accents more uniformly during training, and the problem of unbalanced data sets caused by less data of certain accents is solved.
In detail, as an optional embodiment of the present invention, referring to fig. 2, the expanding the training sample set to obtain an expanded sample set includes:
S11, classifying the voice data in the training sample set according to the accent class, and sorting the classified voice data according to the accent data quantity to obtain voice data sorting;
S12, selecting accents corresponding to the voice data with the accent data quantity less than a preset quantity threshold value in the voice data sorting as target accents;
S13, randomly selecting voice data to be converted in the voice data, and performing accent conversion on the voice data to be converted by using a preset accent conversion model according to the target accent to obtain target accent data;
s14, carrying out voice recognition on the target accent data to obtain target accent text data;
And S15, filling the target accent text data and the target accent data into the training sample set to obtain an extended sample set.
In the embodiment of the present invention, the preset accent conversion model may be a trained Transformer-based neural network.
In an alternative embodiment of the present invention, since the present solution requires a large amount of text data with different accents for training the speech synthesis model, the training sample set can be expanded, so as to ensure that the text data of each accent is sufficient, by converting the accent of the voice data in the training sample set into the target accent, thereby obtaining an extended sample set that can be used for training the speech synthesis model. For example, in the financial field, an intelligent outbound method can be used to promote related insurance products or credit card services. Because intelligent outbound may face users in different regions, it needs a rich language library; however, when enriching the language library, insufficient speech synthesis training samples for some languages lead to poor training results and easily cause users to misunderstand the synthesized speech. Therefore, the original training sample set can be expanded through accent conversion, which alleviates the problems of unbalanced text data and a poorly trained speech synthesis model caused by scarce voice data for certain accents.
Further, in an alternative embodiment of the present invention, in order to ensure that the voice data corresponding to the synthesized speech can be accurately located when the loss value is calculated during training of the speech synthesis model, a special flag can be added to each sample in the training sample set, so as to improve the accuracy of the speech synthesis model.
Thus, as an alternative embodiment of the present invention, the training sample set further comprises: speaker ID and accent ID.
The speaker ID may be a nickname identifying the speaker; for example, if a sentence of Cantonese voice data is spoken by a given speaker, the speaker ID corresponding to that voice data is that speaker's nickname. The accent ID may be the accent label of the speaker; for example, if the same speaker utters a sentence of Cantonese voice data, the accent ID corresponding to that voice data is Cantonese.
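By way of non-limiting illustration, the following minimal Python sketch shows one possible way to attach a speaker ID and an accent ID to each training sample; the field names and values are illustrative assumptions and are not the structure used in this embodiment.

```python
# Illustrative only: a minimal sample-record structure carrying audio, transcript,
# speaker ID and accent ID as described above. Field names are hypothetical.
from dataclasses import dataclass

@dataclass
class TrainingSample:
    audio_path: str   # path to the recorded speech waveform
    text: str         # transcript obtained by speech recognition
    speaker_id: str   # e.g. a nickname identifying the speaker
    accent_id: str    # e.g. "Cantonese", "Hunan"

sample = TrainingSample("clips/0001.wav", "...", speaker_id="speaker_01", accent_id="Cantonese")
print(sample)
```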
Further, as an optional embodiment of the present invention, according to the target accent, the performing accent conversion on the voice data to be converted by using a preset accent conversion model to obtain target accent data includes:
Coding the voice data to be converted by using a coding module in a preset accent conversion model to obtain a content coding vector;
Extracting accent features of the target accent by using a feature extraction module in the preset accent conversion model to obtain target accent features;
carrying out mapping feature calculation on the target accent features to obtain target accent mapping features;
Combining the target accent mapping feature with the content coding vector to obtain a target accent coding vector;
And decoding the target accent coding vector by using a decoding module in the preset accent conversion model to obtain target accent data.
According to the embodiment of the invention, the accent characteristics of the target accent are combined with the voice content of the voice data to be converted to obtain the target accent data, so that the voice synthesis model can learn data of different speakers and different accents more uniformly during training, and the problem of unbalanced data sets caused by less certain accent data is solved.
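By way of non-limiting illustration, the following PyTorch sketch outlines the accent conversion flow described above (content encoding, target accent feature extraction and mapping, combination, decoding); all module types, sizes and names are illustrative assumptions rather than the exact accent conversion model of this embodiment.

```python
# A minimal sketch of the accent-conversion flow: encode content, extract and map
# the target-accent features, combine, then decode. Sizes are assumptions.
import torch
import torch.nn as nn

class AccentConverter(nn.Module):
    def __init__(self, n_mels=80, content_dim=256, accent_dim=128):
        super().__init__()
        self.content_encoder = nn.GRU(n_mels, content_dim, batch_first=True)  # coding module
        self.accent_encoder = nn.GRU(n_mels, accent_dim, batch_first=True)    # feature extraction module
        self.accent_mapping = nn.Linear(accent_dim, accent_dim)               # mapping feature calculation
        self.decoder = nn.GRU(content_dim + accent_dim, n_mels, batch_first=True)

    def forward(self, source_mel, target_accent_mel):
        content, _ = self.content_encoder(source_mel)             # content coding vector
        accent_feat, _ = self.accent_encoder(target_accent_mel)   # target accent features
        accent_map = self.accent_mapping(accent_feat.mean(dim=1)) # target accent mapping feature
        accent_map = accent_map.unsqueeze(1).expand(-1, content.size(1), -1)
        combined = torch.cat([content, accent_map], dim=-1)       # target accent coding vector
        converted, _ = self.decoder(combined)                     # decoded target accent data (mel frames)
        return converted

# Example: convert a 200-frame utterance toward a 150-frame target-accent reference.
out = AccentConverter()(torch.randn(1, 200, 80), torch.randn(1, 150, 80))
print(out.shape)
```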
S2, extracting fundamental frequency and unvoiced sound of the voice data in the extended sample set respectively, and performing phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence.
In the embodiment of the invention, the fundamental frequency is an important parameter of the speech signal and is generally used to determine the tone value of a Chinese tone. The unvoiced sound may serve as a criterion of Chinese pronunciation. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analyzed from the articulatory actions within a syllable, where one action forms one phoneme. For Chinese characters, for example, the phonemes may be the pinyin and the tone.
In an alternative embodiment of the present invention, the voice data is first converted into a mel spectrum form, and further, by analyzing the mel spectrum, the extraction of the fundamental frequency and unvoiced sound of the voice data in the extended sample set can be achieved.
According to the embodiment of the invention, the fundamental frequency and unvoiced sound of the voice data in the extended sample set are respectively extracted, so that the synthetic voice output by the voice synthesis model is ensured to be more closely matched with the voice data in the extended sample set, and the accuracy of the voice synthesis model is improved.
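By way of non-limiting illustration, the following Python sketch extracts a per-frame fundamental frequency and an unvoiced flag from one utterance, using librosa's pYIN pitch tracker as an assumed stand-in for the extractor of this embodiment; the file path and pitch range are illustrative.

```python
# A minimal sketch of F0 and unvoiced-flag extraction for one utterance,
# assuming librosa's pYIN tracker; path and pitch range are illustrative.
import librosa
import numpy as np

y, sr = librosa.load("clips/0001.wav", sr=22050)
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
unvoiced = ~voiced_flag          # per-frame unvoiced marker
f0 = np.nan_to_num(f0)           # unvoiced frames carry no F0; replace NaN with 0
print(f0.shape, unvoiced.mean()) # frame count and unvoiced ratio
```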
In an alternative implementation of the present invention, when the extended sample set contains English text data, the same letters may have different pronunciations, so the complex pronunciation rules can only be learned from a large amount of training data. When a neural network learns these pronunciation rules with insufficient training data, it is difficult to learn all of them; in particular, rules that appear too rarely in the training data will not be learned sufficiently, which inevitably leads to pronunciation errors in some synthesized speech. Similarly, Chinese characters also exhibit the phenomenon of one character having different pronunciations, such as polyphones. Phonemes are the most basic units of pronunciation and directly reflect pronunciation attributes, so converting the text data in the extended sample set into a phoneme sequence for input can avoid the problem of pronunciation errors.
In detail, as an alternative embodiment of the present invention, referring to fig. 3, the performing phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence includes:
s21, performing sentence segmentation processing on the text data in the expansion sample set to obtain sentence segmented text data;
S22, converting non-characters in the sentence segmentation text data into characters according to a preset text format rule;
s23, performing word segmentation processing on sentence segmentation text data subjected to text conversion to obtain word segmentation text data;
S24, mapping the word segmentation text data according to a preset word-phoneme mapping dictionary to obtain phoneme data;
s25, carrying out vector conversion on the phoneme data to obtain a phoneme vector;
S26, carrying out coding sequencing on the phoneme vectors according to the text sequence to obtain a phoneme sequence.
In the embodiment of the present invention, the preset text format rule may be to convert Arabic numerals into characters; for example, in "there are 123 flowers", "123" is an Arabic numeral and needs to be converted into its Chinese-character form. The preset word-phoneme mapping dictionary may be a lookup table between words and phonemes, and different languages have different word-phoneme mapping dictionaries.
In an alternative embodiment of the present invention, before the sentence segmentation processing is performed on the text data in the extended sample set, language analysis may be performed on each piece of text data in the extended sample set so that its pronunciation rules can be determined. Word segmentation processing is then performed on the text data so that phoneme conversion can be carried out accurately to obtain the phonemes. Finally, the phonemes are encoded and ordered to obtain the phoneme sequence; this ordering ensures the accuracy of the phoneme sequence, avoids pronunciation confusion in the text data, and improves the precision of speech synthesis.
In another alternative embodiment of the present invention, in addition to the above-mentioned method, the phoneme conversion may also be implemented by applying an open-source grapheme-to-phoneme (G2P) conversion tool to the text data in the extended sample set to obtain the phoneme sequence.
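By way of non-limiting illustration, the following Python sketch performs grapheme-to-phoneme conversion with the open-source tools g2p_en (English) and pypinyin (Chinese pinyin with tone numbers), which are assumed stand-ins for the G2P tool and the word-phoneme mapping dictionary mentioned above; the vocabulary construction is illustrative.

```python
# A minimal sketch of text-to-phoneme conversion via open-source G2P tools.
from g2p_en import G2p                     # pip install g2p-en
from pypinyin import lazy_pinyin, Style    # pip install pypinyin

english_phonemes = G2p()("There are 123 flowers")                   # numerals expanded internally
chinese_phonemes = lazy_pinyin("这里有很多花", style=Style.TONE3)     # pinyin + tone, e.g. "hua1"

# Map phonemes to integer IDs to form the phoneme sequence fed to the model.
vocab = {p: i for i, p in enumerate(sorted(set(english_phonemes + chinese_phonemes)))}
phoneme_sequence = [vocab[p] for p in english_phonemes + chinese_phonemes]
print(english_phonemes, chinese_phonemes, phoneme_sequence)
```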
And S3, calculating the duration time of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model according to the phoneme sequence.
In the embodiment of the present invention, the preset phoneme time prediction model may be a training-completed convolutional neural network based on deep learning, and is used for predicting the duration of each phoneme in the text data in the extended sample set.
In an alternative embodiment of the present invention, since the same Chinese character spoken with different accents and degrees of stress generally has different pronunciation durations, in order to ensure the similarity of the synthesized speech, the duration of each phoneme of the text data in the extended sample set needs to be calculated by using the preset phoneme time prediction model before the speech synthesis model is trained on the extended sample set.
According to the phoneme sequence, the embodiment of the invention calculates the duration time of each phoneme of the text data in the extended sample set by using the preset phoneme time prediction model, thereby improving the accuracy of the speech synthesized by the speech synthesis model and the practicability of the speech synthesis model.
Further, as an optional embodiment of the present invention, the calculating, according to the phoneme sequence, a duration of each phoneme of the text data in the extended sample set using a preset phoneme time prediction model includes:
Coding the phoneme sequence by using a coding module in a preset phoneme time prediction model to obtain a phoneme coding sequence;
convolving the phoneme coding sequence by utilizing a convolution module in the preset phoneme time prediction model to obtain a phoneme characteristic sequence;
calculating the probability of the pronunciation time of each phoneme in the phoneme characteristic sequence by using a full-connection module in the preset phoneme time prediction model;
And determining the pronunciation time length of each phoneme in the phoneme characteristic sequence according to the probability to obtain the duration time of each phoneme of the text data in the extended sample set.
In an alternative embodiment of the invention, the duration of each phoneme of the text data in the expansion sample set is determined by extracting the characteristics of the phoneme sequence and calculating the probability of the pronunciation time of each phoneme in the phoneme sequence, so that the accuracy of the voice synthesized by the voice synthesis model is improved.
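By way of non-limiting illustration, the following PyTorch sketch mirrors the structure of the phoneme time prediction model described above (coding module, convolution module, fully connected module producing pronunciation-time probabilities); the layer sizes, the number of duration bins and the bin-to-frames mapping are illustrative assumptions.

```python
# A schematic duration predictor: encode phonemes, convolve, then predict a
# probability distribution over pronunciation-time bins per phoneme.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, n_phonemes=100, emb_dim=128, n_bins=50):
        super().__init__()
        self.encoder = nn.Embedding(n_phonemes, emb_dim)           # coding module
        self.conv = nn.Sequential(                                  # convolution module
            nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(emb_dim, n_bins)                        # fully connected module

    def forward(self, phoneme_ids):
        x = self.encoder(phoneme_ids)                       # phoneme coding sequence (batch, T, emb)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)    # phoneme feature sequence
        probs = torch.softmax(self.fc(x), dim=-1)           # probability of each pronunciation-time bin
        durations = probs.argmax(dim=-1) + 1                # duration (in frames) per phoneme
        return durations, probs

durations, probs = DurationPredictor()(torch.randint(0, 100, (1, 12)))
print(durations)
```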
And S4, according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, processing the phoneme sequence in sequence by utilizing a coding layer, a decoding layer and a residual network layer in a preset voice synthesis model to obtain a target Mel frequency spectrum.
In an embodiment of the invention, the encoder comprises a convolution layer and a bidirectional long-short-time memory network. The decoder may be an autoregressive recurrent neural network including an attention network and a post-processing network. The residual network includes a convolutional layer and a series of functions. The vocoder may be a codec that analyzes and synthesizes sound waves.
According to the embodiment of the invention, the preset speech synthesis model is trained by using the extended sample set, the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, so that the trained speech synthesis model is obtained, and the accuracy of the synthesized speech pronunciation standard and the similarity of the synthesized speech and the speech data in the extended sample set are improved.
Further, as an optional embodiment of the present invention, the processing, according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, the phoneme sequence sequentially by using an encoding layer, a decoding layer and a residual network layer in the speech synthesis model to obtain a target mel spectrum includes:
carrying out convolution processing on the phoneme sequence by utilizing a convolution module in the coding layer to obtain a feature matrix;
performing linear processing on the feature matrix to obtain a linear feature matrix;
carrying out batch normalization processing on the linear feature matrix to obtain an optimized feature matrix;
Calculating the optimized feature matrix by using a bidirectional long-short-time memory network preset in a coding layer of the voice synthesis model to obtain a hidden feature matrix;
According to the hidden characteristic matrix, the fundamental frequency, the unvoiced sound and the duration, predicting a mel frequency spectrum of text data in the extended sample set by utilizing a decoding layer in the voice synthesis model to obtain a predicted mel frequency spectrum;
and residual connection is carried out on the predicted Mel frequency spectrum by utilizing a residual network layer in the voice synthesis model, so as to obtain a target Mel frequency spectrum.
In the embodiment of the present invention, the hidden feature matrix includes information such as a context vector of the phoneme sequence.
In the embodiment of the invention, the meaning of the word corresponding to each phoneme in the phoneme sequence is often closely related to its context. For example, in the sentence "I love Chinese", the polyphonic character has two pronunciations, and the correct pronunciation cannot be determined by analyzing that character alone, which easily causes pronunciation errors. Therefore, the context feature information of each phoneme in the phoneme sequence also needs to be extracted.
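By way of non-limiting illustration, the following PyTorch sketch outlines the encoding-layer processing described above: convolution, linear processing, batch normalization and a bidirectional long short-term memory network yielding the hidden feature matrix; all dimensions are illustrative assumptions.

```python
# A schematic encoding layer: conv -> linear -> batch norm -> bidirectional LSTM.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_phonemes=100, emb_dim=256, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(n_phonemes, emb_dim)
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2)   # convolution module
        self.linear = nn.Linear(emb_dim, emb_dim)                            # linear processing
        self.batch_norm = nn.BatchNorm1d(emb_dim)                            # batch normalization
        self.bilstm = nn.LSTM(emb_dim, hidden_dim // 2, batch_first=True,
                              bidirectional=True)                            # bidirectional LSTM

    def forward(self, phoneme_ids):
        x = self.embedding(phoneme_ids)                          # (batch, T, emb)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)         # feature matrix
        x = self.linear(x)                                       # linear feature matrix
        x = self.batch_norm(x.transpose(1, 2)).transpose(1, 2)   # optimized feature matrix
        hidden, _ = self.bilstm(x)                               # hidden feature matrix with context
        return hidden

hidden = Encoder()(torch.randint(0, 100, (2, 20)))
print(hidden.shape)   # (2, 20, 256)
```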
Further, as an optional embodiment of the present invention, the predicting, according to the hidden feature matrix, the fundamental frequency, the unvoiced sound, and the duration, the mel spectrum of the text data in the extended sample set by using a decoding layer in the speech synthesis model, to obtain a predicted mel spectrum includes:
constructing a time domain signal of voice data in the extended sample set according to the fundamental frequency and the unvoiced sound;
converting the time domain signal into a frequency domain signal by utilizing short-time Fourier transform, and performing modular square calculation on the frequency domain signal to obtain a power spectrum;
multiplying the power spectrum by a preset Mel scale mapping formula to obtain a preset Mel spectrum, as illustrated in the sketch following these steps;
Extracting a context vector in the hidden feature matrix by using an attention network in the decoding layer to obtain a context vector of a first current time step;
Performing series connection operation on the context vector of the first current time step and the preset Mel frequency spectrum, and inputting a series connection result into a preset double-layer long-short-time memory layer to obtain the context vector of the second current time step;
performing a first linear projection on the context vector of the second current time step by using a post-processing network in the decoding layer to obtain a context scalar of the current time step;
Performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel spectrum prediction on the context vector after the second linear projection to obtain Mel spectrum of the second current time step;
Calculating the probability of the completion of the Mel spectrum prediction by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability of the completion of the Mel spectrum prediction is smaller than a preset threshold value;
And when the probability of the completion of the Mel spectrum prediction is smaller than the threshold value, carrying out series connection operation on the context vector of the second current time step and the Mel spectrum of the second current time step, and returning to the step of inputting the series connection result into a preset double-layer long-short-time memory layer until the probability of the completion of the Mel spectrum prediction is larger than the threshold value, ending the Mel spectrum prediction, and obtaining a predicted Mel spectrum.
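By way of non-limiting illustration, the following Python sketch shows the construction of the preset mel spectrum referenced in the steps above: a short-time Fourier transform, the modular square yielding a power spectrum, and a mel-scale filterbank mapping. librosa is an assumed stand-in, and the FFT size, hop length and mel count are illustrative.

```python
# A computational sketch of the "preset mel spectrum": STFT -> power -> mel filterbank.
import librosa
import numpy as np

def mel_from_signal(signal, sr=22050, n_fft=1024, hop=256, n_mels=80):
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop)             # time domain -> frequency domain
    power = np.abs(stft) ** 2                                            # modular square -> power spectrum
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # mel-scale mapping
    return mel_basis @ power                                             # preset mel spectrum (n_mels, frames)

# Example on a synthetic 1-second time-domain signal built from a 220 Hz fundamental.
t = np.linspace(0, 1, 22050, endpoint=False)
mel = mel_from_signal(np.sin(2 * np.pi * 220 * t).astype(np.float32))
print(mel.shape)
```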
In the embodiment of the present invention, the attention network includes the position-sensitive attention mechanism and the double-layer long short-term memory layer, which are mainly used to determine which part of the encoder input needs to be focused on. The first activation function may be a sigmoid function.
In the embodiment of the invention, the coding layer is utilized to perform feature extraction on the phoneme sequence to obtain the hidden feature matrix, and the decoder can extract the context feature of the phoneme sequence through the hidden feature matrix because the hidden feature matrix contains information such as the context vector of the phoneme sequence, thereby improving the influence of the context feature of the phoneme sequence on the phoneme sequence and improving the pronunciation accuracy of the speech synthesis model.
Further, the embodiment of the invention converts the phoneme sequence into the target Mel frequency spectrum, thereby realizing the conversion from text data to audio data, facilitating the analysis of the audio data by the vocoder, further outputting synthesized voice and realizing voice synthesis.
S5, performing parallel audio conversion on the target Mel frequency spectrum by using the vocoder in the voice synthesis model to obtain target audio.
In one embodiment of the present invention, the vocoder may be a WaveGlow vocoder, which is a flow-based model that can generate high-quality audio samples in parallel, thereby increasing the speed of speech synthesis.
In an alternative embodiment of the present invention, the vocoder is first utilized to perform parallel speech waveform conversion on the target mel spectrum to obtain a target speech waveform, and further, the target speech waveform is subjected to audio conversion, so as to obtain a target audio.
In the embodiment of the invention, performing audio conversion on the target voice waveform comprises the following steps: sampling, quantizing and encoding the target voice waveform signal to obtain the target audio. Sampling the target voice waveform signal is the process of discretizing the continuous waveform signal on the time axis, and quantizing the sampled signal means converting each amplitude sample with continuous values into a discrete-value representation.
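By way of non-limiting illustration, the following Python sketch quantizes and encodes an already-sampled synthesized waveform as 16-bit PCM in a WAV container, corresponding to the sampling/quantization/encoding step described above; the bit depth and file name are illustrative assumptions.

```python
# A minimal sketch of quantizing a synthesized waveform to 16-bit PCM and encoding it as WAV.
import numpy as np
import wave

def encode_pcm16(waveform, sr=22050, path="target_audio.wav"):
    clipped = np.clip(waveform, -1.0, 1.0)       # keep amplitudes in range
    pcm = (clipped * 32767).astype(np.int16)     # quantize continuous amplitudes to discrete values
    with wave.open(path, "wb") as f:             # encode as 16-bit mono PCM
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(sr)
        f.writeframes(pcm.tobytes())

encode_pcm16(np.sin(2 * np.pi * 440 * np.linspace(0, 1, 22050)))
```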
S6, calculating the loss value of the voice data in the target audio and the expansion sample set by using a preset loss function, and adjusting the parameters of the voice synthesis model according to the loss value until the loss value is smaller than a preset threshold value, so as to obtain the trained voice synthesis model.
In an alternative embodiment of the present invention, in order to improve the accuracy of speech synthesis performed by the speech synthesis model, the loss values of the target audio and the speech data in the extended sample set are also calculated, so that the trained speech synthesis model is ensured to be suitable for data outside the extended sample set, and the usability and accuracy of the speech synthesis model are improved.
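By way of non-limiting illustration, the following PyTorch sketch shows a training step corresponding to this stage: a preset loss compares the model output with the reference speech features, parameters are adjusted, and training stops once the loss falls below a preset threshold. The placeholder model, the choice of L1 loss and the threshold value are illustrative assumptions.

```python
# A schematic training loop: compute loss against reference features, update
# parameters, and stop when the loss drops below the preset threshold.
import torch
import torch.nn as nn

model = nn.Linear(80, 80)                      # placeholder for the speech synthesis model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()                          # preset loss function
threshold = 0.05                               # preset threshold

reference = torch.randn(8, 100, 80)            # features of speech data in the extended sample set
inputs = torch.randn(8, 100, 80)               # stand-in for the model inputs

for step in range(1000):
    predicted = model(inputs)
    loss = loss_fn(predicted, reference)       # loss value between model output and reference
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                           # adjust the parameters of the model
    if loss.item() < threshold:                # stop once the loss is below the preset threshold
        break
```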
According to the embodiment of the invention, the training sample set is acquired and expanded to obtain the extended sample set, which reduces the difficulty of acquiring sample data and alleviates the problems of an unbalanced text data set and a poorly trained speech synthesis model caused by scarce data for some accents. Further, the fundamental frequency and unvoiced sound of the voice data in the extended sample set are extracted respectively, and the text data in the extended sample set is converted into a phoneme sequence, so that phoneme input replaces character input and the accuracy of the synthesized speech is improved. Therefore, the low-resource-based speech synthesis model training method provided by the invention can reduce the resource requirement of speech synthesis model training and improve the accuracy of speech synthesis.
As shown in fig. 4, a functional block diagram of the low-resource-based speech synthesis model training apparatus of the present invention is shown.
The low-resource-based speech synthesis model training apparatus 100 of the present invention may be installed in an electronic device. Depending on the implemented functions, the low-resource-based speech synthesis model training apparatus 100 may include a training sample expansion module 101, a phoneme duration calculation module 102, and a speech synthesis model training module 103. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device, can perform a fixed function, and are stored in a memory of the electronic device.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the training sample expansion module 101 is configured to obtain a training sample set, and expand the training sample set to obtain an expanded sample set, where the training sample set includes voice data and text data.
The phoneme duration calculating module 102 is configured to extract a fundamental frequency and an unvoiced sound of the speech data in the extended sample set, and perform phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence, and calculate a duration of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model according to the phoneme sequence.
The speech synthesis model training module 103 is configured to sequentially process the phoneme sequence by using a coding layer, a decoding layer and a residual network layer in a preset speech synthesis model according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration to obtain a target mel spectrum, perform parallel audio conversion on the target mel spectrum by using a vocoder in the speech synthesis model to obtain a target audio, calculate loss values of speech data in the target audio and the extended sample set by using a preset loss function, and adjust parameters of the speech synthesis model according to the loss values until the loss values are smaller than a preset threshold value to obtain a trained speech synthesis model.
Fig. 5 is a schematic structural diagram of an electronic device implementing a low-resource-based speech synthesis model training method according to the present invention.
The electronic device may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program stored in the memory 11 and executable on the processor 10, such as a low-resource based speech synthesis model training program.
The memory 11 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, such as a mobile hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only for storing application software installed in the electronic device and various types of data, such as the code of a low-resource-based speech synthesis model training program, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be comprised of integrated circuits in some embodiments, for example, a single packaged integrated circuit, or may be comprised of multiple integrated circuits packaged with the same or different functions, including one or more central processing units (Central Processing Unit, CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the entire electronic device using various interfaces and lines, executes or executes programs or modules (e.g., a low-resource-based speech synthesis model training program, etc.) stored in the memory 11, and invokes data stored in the memory 11 to perform various functions of the electronic device and process the data.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be classified as an address bus, a data bus, a control bus, etc. The communication bus 12 is arranged to enable connection and communication between the memory 11 and the at least one processor 10, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.
Fig. 5 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 5 is not limiting of the electronic device and may include fewer or more components than shown, or may combine certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device may further include various sensors, bluetooth modules, wi-Fi modules, etc., which are not described herein.
Optionally, the communication interface 13 may comprise a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), typically used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the communication interface 13 may further comprise a user interface, which may be a Display, an input unit, such as a Keyboard (Keyboard), or a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch, or the like. The display may also be referred to as a display screen or display unit, as appropriate, for displaying information processed in the electronic device and for displaying a visual user interface.
It should be understood that the embodiments described are for illustrative purposes only and do not limit the scope of the patent application to this configuration.
The low-resource based speech synthesis model training program stored by the memory 11 in the electronic device is a combination of a plurality of computer programs, which when run in the processor 10, may implement:
Acquiring a training sample set, and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data;
extracting fundamental frequency and unvoiced sound of voice data in the extended sample set respectively, and performing phoneme conversion on text data in the extended sample set to obtain a phoneme sequence;
According to the phoneme sequence, calculating the duration time of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model;
According to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, the phoneme sequence is sequentially processed by utilizing a coding layer, a decoding layer and a residual network layer in a preset voice synthesis model, so as to obtain a target mel frequency spectrum;
Performing parallel audio conversion on the target Mel frequency spectrum by using a vocoder in the speech synthesis model to obtain target audio;
And calculating the loss value of the voice data in the target audio and the expansion sample set by using a preset loss function, and adjusting the parameters of the voice synthesis model according to the loss value until the loss value is smaller than a preset threshold value to obtain the trained voice synthesis model.
In particular, the specific implementation method of the processor 10 on the computer program may refer to the description of the relevant steps in the corresponding embodiment of fig. 1, which is not repeated herein.
Further, the electronic device integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. The computer readable medium may be non-volatile or volatile. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer readable storage medium storing a computer program which, when executed by a processor of an electronic device, may implement:
Acquiring a training sample set, and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data;
extracting fundamental frequency and unvoiced sound of voice data in the extended sample set respectively, and performing phoneme conversion on text data in the extended sample set to obtain a phoneme sequence;
According to the phoneme sequence, calculating the duration time of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model;
According to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, the phoneme sequence is sequentially processed by utilizing a coding layer, a decoding layer and a residual network layer in a preset voice synthesis model, so as to obtain a target mel frequency spectrum;
Performing parallel audio conversion on the target Mel frequency spectrum by using a vocoder in the speech synthesis model to obtain target audio;
And calculating the loss value of the voice data in the target audio and the expansion sample set by using a preset loss function, and adjusting the parameters of the voice synthesis model according to the loss value until the loss value is smaller than a preset threshold value to obtain the trained voice synthesis model.
Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed electronic device, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims can also be implemented by one unit or means through software or hardware. The terms first, second, and the like are used to denote names and do not denote any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A low-resource-based speech synthesis model training method, the method comprising:
Acquiring a training sample set, and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data;
extracting fundamental frequency and unvoiced sound of voice data in the extended sample set respectively, and performing phoneme conversion on text data in the extended sample set to obtain a phoneme sequence;
According to the phoneme sequence, calculating the duration time of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model;
According to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration, sequentially processing the phoneme sequence by using an encoding layer, a decoding layer and a residual network layer in a preset speech synthesis model to obtain a target Mel spectrum;
Performing parallel audio conversion on the target Mel frequency spectrum by using a vocoder in the speech synthesis model to obtain target audio;
And calculating a loss value between the target audio and the voice data in the extended sample set by using a preset loss function, and adjusting parameters of the speech synthesis model according to the loss value until the loss value is smaller than a preset threshold value, so as to obtain a trained speech synthesis model.
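As a hedged illustration of the feature-extraction step recited in claim 1, the sketch below obtains a fundamental-frequency (F0) contour and an unvoiced mask using librosa's pYIN tracker. The claim does not prescribe a particular extractor, so the choice of pYIN and the frequency bounds are assumptions for this example only.

```python
import librosa
import numpy as np

def extract_f0_and_unvoiced(wav_path):
    """Sketch: per-frame F0 and an unvoiced indicator for one utterance."""
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound (assumed)
        fmax=librosa.note_to_hz("C7"),   # ~2093 Hz upper bound (assumed)
        sr=sr,
    )
    f0 = np.nan_to_num(f0)               # unvoiced frames are returned as NaN
    unvoiced = ~voiced_flag              # frame-level unvoiced mask
    return f0, unvoiced
```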
2. The method of claim 1, wherein the sequentially processing the phoneme sequence with an encoding layer, a decoding layer and a residual network layer in the speech synthesis model according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration to obtain a target mel spectrum comprises:
carrying out convolution processing on the phoneme sequence by using a convolution module in the encoding layer to obtain a feature matrix;
performing linear processing on the feature matrix to obtain a linear feature matrix;
carrying out batch normalization processing on the linear feature matrix to obtain an optimized feature matrix;
Calculating the optimized feature matrix by using a preset bidirectional long short-term memory network in the encoding layer of the speech synthesis model to obtain a hidden feature matrix;
According to the hidden feature matrix, the fundamental frequency, the unvoiced sound and the duration, predicting a Mel spectrum of the text data in the extended sample set by using a decoding layer in the speech synthesis model to obtain a predicted Mel spectrum;
and performing residual connection on the predicted Mel spectrum by using a residual network layer in the speech synthesis model to obtain a target Mel spectrum.
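A minimal, non-limiting sketch of the encoding-layer flow recited in claim 2 (convolution, linear projection, batch normalization, then a bidirectional LSTM), assuming PyTorch and illustrative layer sizes; the decoding layer and residual layer are not shown.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of the encoding layer only; dimensions are assumed, not taken from the patent."""

    def __init__(self, vocab_size=100, emb_dim=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2)
        self.linear = nn.Linear(emb_dim, emb_dim)
        self.bn = nn.BatchNorm1d(emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden // 2, batch_first=True, bidirectional=True)

    def forward(self, phoneme_ids):                       # (batch, time)
        x = self.embed(phoneme_ids)                       # (batch, time, emb)
        x = self.conv(x.transpose(1, 2))                  # convolution -> feature matrix
        x = self.linear(x.transpose(1, 2))                # linear feature matrix
        x = self.bn(x.transpose(1, 2)).transpose(1, 2)    # batch-normalized (optimized) features
        hidden_features, _ = self.bilstm(x)               # hidden feature matrix
        return hidden_features
```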
3. The method of claim 2, wherein predicting the Mel spectrum of the text data in the extended sample set by the decoding layer in the speech synthesis model according to the hidden feature matrix, the fundamental frequency, the unvoiced sound, and the duration to obtain a predicted Mel spectrum comprises:
constructing a time domain signal of the voice data in the extended sample set according to the fundamental frequency and the unvoiced sound;
converting the time domain signal into a frequency domain signal by utilizing short-time Fourier transform, and performing modular square calculation on the frequency domain signal to obtain a power spectrum;
multiplying the power spectrum by a preset Mel scale mapping formula to obtain a preset Mel spectrum;
Extracting a context vector in the hidden feature matrix by using an attention network in the decoding layer to obtain a context vector of a first current time step;
Concatenating the context vector of the first current time step with the preset Mel spectrum, and inputting the concatenation result into a preset double-layer long short-term memory layer to obtain a context vector of a second current time step;
performing a first linear projection on the context vector of the second current time step by using a post-processing network in the decoding layer to obtain a context scalar of the current time step;
Performing second linear projection on the context vector of the second current time step by using the post-processing network, and performing Mel spectrum prediction on the context vector after the second linear projection to obtain Mel spectrum of the second current time step;
Calculating the probability of the completion of the Mel spectrum prediction by using a preset first activation function according to the context scalar of the current time step;
judging whether the probability of the completion of the Mel spectrum prediction is smaller than a preset threshold value;
And when the probability of the completion of the Mel spectrum prediction is smaller than the threshold value, concatenating the context vector of the second current time step with the Mel spectrum of the second current time step, and returning to the step of inputting the concatenation result into the preset double-layer long short-term memory layer, until the probability of the completion of the Mel spectrum prediction is greater than the threshold value, at which point the Mel spectrum prediction ends and a predicted Mel spectrum is obtained.
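For illustration, the spectrum construction recited at the start of claim 3 (short-time Fourier transform, modulus squared to a power spectrum, then Mel-scale mapping) could be sketched as below; the FFT size, hop length and number of Mel bands are assumed values, not parameters from the disclosure.

```python
import librosa
import numpy as np

def power_to_mel(signal, sr=22050, n_fft=1024, hop=256, n_mels=80):
    """Sketch: time-domain signal -> STFT -> power spectrum -> Mel spectrum."""
    stft = librosa.stft(signal, n_fft=n_fft, hop_length=hop)          # time domain -> frequency domain
    power = np.abs(stft) ** 2                                          # modulus squared -> power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)    # Mel-scale mapping matrix
    return mel_fb @ power                                              # (n_mels, frames)
```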
4. The method for training a low-resource-based speech synthesis model according to claim 1, wherein said expanding the training sample set to obtain an expanded sample set comprises:
classifying the voice data in the training sample set according to accent class, and sorting the classified voice data according to the amount of data per accent to obtain a voice data ranking;
selecting, as target accents, the accents whose amount of data in the voice data ranking is less than a preset quantity threshold value;
arbitrarily selecting voice data to be converted from the voice data, and performing accent conversion on the voice data to be converted by using a preset accent conversion model according to the target accent to obtain target accent data;
performing voice recognition on the target accent data to obtain target accent text data;
and filling the target accent text data and the target accent data into the training sample set to obtain an extended sample set.
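A small sketch of the accent-based selection in claim 4, assuming the sample set is available as (accent, utterance) pairs and using an illustrative count threshold; it only identifies the under-represented target accents and leaves the accent conversion itself to a model such as the one recited in claim 5.

```python
from collections import Counter

def pick_target_accents(samples, min_count=50):
    """Sketch: count utterances per accent and flag accents below the preset threshold.

    `samples` is assumed to be a list of (accent_label, wav_path) pairs;
    the threshold of 50 is an illustrative value.
    """
    counts = Counter(accent for accent, _ in samples)
    ranking = counts.most_common()                       # voice data ranked by accent data quantity
    targets = [accent for accent, n in ranking if n < min_count]
    return targets
```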
5. The method for training a low-resource-based speech synthesis model according to claim 4, wherein said performing accent conversion on the speech data to be converted using a preset accent conversion model according to the target accent to obtain target accent data comprises:
Coding the voice data to be converted by using a coding module in a preset accent conversion model to obtain a content coding vector;
Extracting accent features of the target accent by using a feature extraction module in the preset accent conversion model to obtain target accent features;
carrying out mapping feature calculation on the target accent features to obtain target accent mapping features;
Combining the target accent mapping feature with the content coding vector to obtain a target accent coding vector;
And decoding the target accent coding vector by using a decoding module in the preset accent conversion model to obtain target accent data.
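A loose PyTorch sketch of the accent conversion flow in claim 5: content encoding, accent feature extraction and mapping, combination with the content code, and decoding. The use of GRU layers and the dimensions are assumptions for illustration, not the disclosed architecture.

```python
import torch
import torch.nn as nn

class AccentConverter(nn.Module):
    """Sketch only: converts source features toward a target accent."""

    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.content_encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.accent_extractor = nn.GRU(feat_dim, hidden, batch_first=True)
        self.mapping = nn.Linear(hidden, hidden)            # mapping-feature calculation
        self.decoder = nn.GRU(2 * hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, source_feats, target_accent_feats):
        content, _ = self.content_encoder(source_feats)           # content coding vectors
        _, accent_state = self.accent_extractor(target_accent_feats)
        accent = self.mapping(accent_state[-1])                    # target accent mapping feature
        accent = accent.unsqueeze(1).expand(-1, content.size(1), -1)
        combined = torch.cat([content, accent], dim=-1)            # combine with the content code
        decoded, _ = self.decoder(combined)
        return self.out(decoded)                                   # target-accent features
```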
6. The low-resource based speech synthesis model training method of claim 1, wherein calculating the duration of each phoneme of the text data in the extended sample set using a preset phoneme time prediction model comprises:
Coding the phoneme sequence by using a coding module in a preset phoneme time prediction model to obtain a phoneme coding sequence;
convolving the phoneme coding sequence by utilizing a convolution module in the preset phoneme time prediction model to obtain a phoneme characteristic sequence;
calculating the probability of the pronunciation time of each phoneme in the phoneme characteristic sequence by using a full-connection module in the preset phoneme time prediction model;
And determining the pronunciation time length of each phoneme in the phoneme characteristic sequence according to the probability to obtain the duration time of each phoneme of the text data in the extended sample set.
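A hedged sketch of the phoneme time prediction model of claim 6, assuming the pronunciation time of each phoneme is scored over a set of discrete frame-count bins by the fully connected module; the bin formulation and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Sketch: encode phonemes, convolve, then score possible durations per phoneme."""

    def __init__(self, vocab_size=100, emb_dim=256, max_frames=50):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)             # encoding module
        self.conv = nn.Conv1d(emb_dim, emb_dim, kernel_size=3, padding=1)
        self.fc = nn.Linear(emb_dim, max_frames)                   # fully connected module

    def forward(self, phoneme_ids):                                # (batch, time)
        x = self.embed(phoneme_ids).transpose(1, 2)                # phoneme coding sequence
        x = torch.relu(self.conv(x)).transpose(1, 2)               # phoneme feature sequence
        probs = torch.softmax(self.fc(x), dim=-1)                  # probability over duration bins
        durations = probs.argmax(dim=-1) + 1                       # most likely duration per phoneme
        return durations, probs
```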
7. The low-resource based speech synthesis model training method of claim 1, wherein performing phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence comprises:
Performing sentence segmentation processing on the text data in the extended sample set to obtain sentence segmented text data;
According to a preset text format rule, converting non-characters in the sentence segmentation text data into characters;
performing word segmentation processing on the converted sentence-segmented text data to obtain word-segmented text data;
Mapping the word segmentation text data according to a preset word-phoneme mapping dictionary to obtain phoneme data;
vector conversion is carried out on the phoneme data to obtain a phoneme vector;
And encoding and ordering the phoneme vectors according to the text order to obtain a phoneme sequence.
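An illustrative sketch of the text-to-phoneme pipeline of claim 7, using a toy word-to-phoneme dictionary; a real system would use a full lexicon and would convert, rather than simply strip, non-character symbols. All names and data here are hypothetical.

```python
import re

# Toy word-to-phoneme dictionary; a real system would use a full lexicon (assumption).
PHONE_DICT = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
PHONE_IDS = {p: i for i, p in enumerate(
    sorted({p for ps in PHONE_DICT.values() for p in ps}), start=1)}

def text_to_phoneme_sequence(text):
    """Sketch: sentence segmentation, normalization, word segmentation,
    dictionary mapping, and conversion to an ordered index sequence."""
    sentences = re.split(r"[.!?]+", text)                        # sentence segmentation
    sequence = []
    for sentence in sentences:
        normalised = re.sub(r"[^a-zA-Z\s]", " ", sentence)       # strip non-character symbols (simplified)
        for word in normalised.lower().split():                  # word segmentation
            for phone in PHONE_DICT.get(word, []):               # word-to-phoneme mapping
                sequence.append(PHONE_IDS[phone])                # ordered phoneme indices
    return sequence

# Example: text_to_phoneme_sequence("Hello, world!") -> indices for HH AH L OW W ER L D
```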
8. A low-resource-based speech synthesis model training apparatus, the apparatus comprising:
the training sample expansion module is used for acquiring a training sample set and expanding the training sample set to obtain an expanded sample set, wherein the training sample set comprises voice data and text data;
the phoneme duration calculation module is used for respectively extracting fundamental frequency and unvoiced sound of the voice data in the extended sample set, performing phoneme conversion on the text data in the extended sample set to obtain a phoneme sequence, and calculating the duration of each phoneme of the text data in the extended sample set by using a preset phoneme time prediction model according to the phoneme sequence;
And the speech synthesis model training module is used for sequentially processing the phoneme sequence by using an encoding layer, a decoding layer and a residual network layer in a preset speech synthesis model according to the phoneme sequence, the fundamental frequency, the unvoiced sound and the duration to obtain a target Mel spectrum, performing parallel audio conversion on the target Mel spectrum by using a vocoder in the speech synthesis model to obtain target audio, calculating a loss value between the target audio and the voice data in the extended sample set by using a preset loss function, and adjusting parameters of the speech synthesis model according to the loss value until the loss value is smaller than a preset threshold value, so as to obtain a trained speech synthesis model.
9. An electronic device, the electronic device comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the low-resource based speech synthesis model training method of any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the low-resource based speech synthesis model training method of any of claims 1 to 7.
CN202410412217.7A 2024-04-07 2024-04-07 Low-resource-based speech synthesis model training method, device, equipment and medium Pending CN118298797A (en)

Priority Applications (1)

Application Number: CN202410412217.7A; Priority Date: 2024-04-07; Filing Date: 2024-04-07; Title: Low-resource-based speech synthesis model training method, device, equipment and medium; Status: Pending

Publications (1)

Publication Number: CN118298797A; Publication Date: 2024-07-05

Family

ID=91684298

Country Status (1)

Country: CN; Link: CN118298797A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination