CN112735389A - Voice training method, device and equipment based on deep learning and storage medium - Google Patents
- Publication number
- CN112735389A (application number CN202011593537.5A)
- Authority
- CN
- China
- Prior art keywords
- phoneme
- value
- neural network
- mel
- frequency spectrum
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G10L15/063: Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/02: Feature extraction for speech recognition; selection of recognition unit
- G10L15/16: Speech classification or search using artificial neural networks
- G10L15/26: Speech to text systems
- G10L19/16: Vocoder architecture
- G10L25/24: Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L2015/025: Phonemes, fenemes or fenones being the recognition units
Abstract
The invention discloses a voice training method, device, computer equipment and storage medium based on deep learning, applied in the technical field of artificial intelligence. It provides a method for training a speech synthesis model through a teacher-student neural network, so that the speech synthesis model can be trained efficiently, quickly, and with low resource consumption. The method comprises the following steps: coding a first phoneme sequence to obtain a first phoneme coding value; performing duration prediction processing on the first phoneme coding value to obtain a first pronunciation duration prediction value; performing extension processing on each phoneme in the first phoneme sequence to obtain an extension feature of each phoneme in the first phoneme sequence; transforming the extension features of each phoneme in the first phoneme sequence into first mel spectrum values; and training the student neural network through the hidden variables and the first mel spectrum values provided by the trained teacher neural network until the first loss function of the student neural network converges, to obtain the trained student neural network.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a voice training method and device based on deep learning, computer equipment and a storage medium.
Background
Most existing deep-learning-based speech synthesis technologies, such as Tacotron 2, are built on sequence-to-sequence (seq2seq) schemes, which significantly improve the speech synthesis effect compared with traditional statistical parametric model algorithms. However, training such a sequence-to-sequence model system requires a large amount of training data and computational resources, and efficient speech synthesis at the inference stage is difficult. Some systems attempt to reduce the pressure on computational resources with model-structure techniques on top of the sequence-to-sequence model, for example by using convolutional neural networks in the encoding-decoding stage; these can be trained quickly, but sequential inference is still required, which is relatively inefficient. To avoid the serialized inference stage, some models adopt a self-attention mechanism to parallelize spectrum generation, but training the attention layers is very difficult and time-consuming. So far, there has been no speech synthesis model that simultaneously achieves efficient training, efficient inference, and high quality.
Disclosure of Invention
The embodiments of the invention provide a deep-learning-based voice training method, device, computer equipment, and storage medium, aiming to solve the technical problem that no existing speech synthesis model simultaneously achieves efficient training, efficient inference, and high quality.
In one aspect of the present invention, a speech training method based on deep learning is provided, which includes the following steps:
coding the first phoneme sequence to obtain a first phoneme coding value;
carrying out time length prediction processing on the first phoneme coding value to obtain a first pronunciation time length prediction value;
performing extension processing on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain an extension characteristic of each phoneme in the first phoneme sequence;
transforming the extended features of each phoneme in the first phoneme sequence into first mel frequency spectrum values;
and training the student neural network through the hidden variable and the first Mel frequency spectrum value provided by the trained teacher neural network until the first loss function of the student neural network is converged to obtain the trained student neural network.
In another aspect of the present invention, a deep learning based speech training device is provided, which includes the following modules:
the first phoneme coding module is used for coding the first phoneme sequence to obtain a first phoneme coding value;
the duration prediction processing module is used for carrying out duration prediction processing on the first phoneme coding value to obtain a first pronunciation duration prediction value;
the extension processing module is used for carrying out extension processing on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain the extension characteristics of each phoneme in the first phoneme sequence;
a first mel frequency spectrum value transformation module, which is used for transforming the extended characteristic of each phoneme in the first phoneme sequence into a first mel frequency spectrum value;
and the student neural network training module is used for training the student neural network through the hidden variables and the first Mel frequency spectrum values provided by the trained teacher neural network, and obtaining the trained student neural network when the first loss function of the student neural network is converged.
In another aspect of the present invention, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above-mentioned deep learning based speech training method when executing the computer program.
In another aspect of the present invention, a computer-readable storage medium is provided, in which a computer program is stored, which, when being executed by a processor, implements the steps of the above-mentioned deep learning based speech training method.
The deep-learning-based voice training method, device, computer equipment, and storage medium address the current lack of a speech synthesis model that achieves efficient training, efficient inference, and high quality at the same time. Specifically, the sample data used for student neural network training is simultaneously input into a pre-trained teacher neural network; the teacher neural network provides hidden variables and reference mel spectrum values that supervise the machine learning process of the student neural network, improving training and inference efficiency and reducing the demand on hardware resources while preserving a good training effect as far as possible. In this teacher-student deep learning model, the teacher neural network is trained in advance and therefore occupies few system resources; the student neural network has a simple structure and occupies few system resources during training, so it can be trained on a single GPU. The trained student neural network, being structurally simple, can synthesize speech on a CPU in real time, and can be deployed quickly in many speech synthesis scenarios as an end-to-end speech synthesis solution.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a deep learning based speech training method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for deep learning based speech training in one embodiment of the present invention;
FIG. 3 is a flow chart of a teacher neural network training method in a deep learning based speech training method in one embodiment of the present invention;
FIG. 4 is a flow chart of a method for generating hidden variables in a deep learning based speech training method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a deep learning based speech training apparatus according to an embodiment of the invention;
FIG. 6 is a schematic diagram of a computer device in one embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a specific embodiment, as shown in the figure, a speech training method based on deep learning is provided, which is to perform supervised training on a student neural network through a trained teacher neural network, wherein the student neural network comprises a phoneme coder, a pronunciation duration predictor and a decoder. Specifically, the training of the student neural network comprises the following steps:
s101: and coding the first phoneme sequence to obtain a first phoneme coding value.
Phonemes are the smallest units of speech, divided according to the natural properties of speech and analyzed according to the pronunciation actions within syllables, with one action constituting one phoneme. For example, a word is decomposed into one or more syllables, and each syllable is decomposed into its corresponding phonemes. For the word sequence "peace" (ping an), there are two corresponding syllables, "ping" and "an"; the syllable "ping" can be further decomposed into the phonemes "p" and "ing", and the syllable "an" into the phonemes "a" and "n". In this embodiment, in Chinese one Chinese character corresponds to one syllable; in English one word corresponds to one or more syllables; other languages are similar.
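As an illustration of this decomposition, a toy grapheme-to-phoneme lookup can flatten syllables into the phoneme sequence fed to the encoder. The table and function names here are hypothetical, not part of the patent:

```python
# Illustrative sketch only: a tiny syllable-to-phoneme table for the example
# word sequence "ping an"; a real system would use a full G2P dictionary.
SYLLABLE_TO_PHONEMES = {
    "ping": ["p", "ing"],
    "an": ["a", "n"],
}

def word_to_phoneme_sequence(syllables):
    """Flatten a list of syllables into the phoneme sequence fed to the encoder."""
    phonemes = []
    for s in syllables:
        phonemes.extend(SYLLABLE_TO_PHONEMES[s])
    return phonemes

print(word_to_phoneme_sequence(["ping", "an"]))  # ['p', 'ing', 'a', 'n']
```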
The first phoneme sequence is used for training the student neural network and needs to be encoded to obtain the first phoneme coding value; specifically, the first phoneme sequence is transformed and compressed into a fixed-length vector by the phoneme encoder in the student neural network. The phoneme encoder has a four-layer structure: the first layer comprises an embedding layer, a fully connected layer, and a linear rectification function (ReLU, Rectified Linear Unit); the second layer comprises a one-dimensional convolutional neural network (CNN) layer; the third layer comprises a linear rectification unit; the fourth layer comprises batch normalization.
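A minimal PyTorch sketch of this four-layer encoder follows. The vocabulary size, channel width, and kernel size are illustrative assumptions; the patent does not specify them:

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of the student phoneme encoder described above.
    All sizes (vocab, channels, kernel) are illustrative assumptions."""
    def __init__(self, vocab_size=64, dim=256, kernel=5):
        super().__init__()
        # Layer 1: embedding + fully connected layer + ReLU
        self.embed = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, dim)
        # Layer 2: 1-D convolution over the phoneme axis
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        # Layers 3-4: linear rectification + batch normalization
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, phoneme_ids):                       # (batch, seq_len)
        x = torch.relu(self.fc(self.embed(phoneme_ids)))  # (B, T, D)
        x = self.conv(x.transpose(1, 2))                  # (B, D, T)
        x = self.bn(torch.relu(x))
        return x.transpose(1, 2)                          # (B, T, D)

encoder = PhonemeEncoder()
encoder.eval()
out = encoder(torch.zeros(2, 7, dtype=torch.long))        # 2 sequences, 7 phonemes
```

The output keeps one feature vector per phoneme, which the duration predictor and extension step below consume.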
S102: and carrying out time length prediction processing on the first phoneme coding value to obtain a first pronunciation time length prediction value.
According to the coded first phoneme coding value, the pronunciation duration of each phoneme is predicted by the pronunciation duration predictor of the student neural network. Specifically, the pronunciation duration predictor has a three-layer structure: the first layer comprises a one-dimensional convolutional neural network layer; the second layer comprises a linear rectification layer; the third layer comprises a batch normalization layer.
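The three-layer duration predictor might be sketched as follows. The final linear projection to one duration per phoneme, and all sizes, are assumptions added for the sketch to produce a usable output:

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Sketch of the three-layer duration predictor (conv -> ReLU -> batch norm),
    plus an assumed final projection to one duration value per phoneme."""
    def __init__(self, dim=256, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)  # layer 1
        self.bn = nn.BatchNorm1d(dim)                                 # layer 3
        self.proj = nn.Linear(dim, 1)   # assumed: one (log-)duration per phoneme

    def forward(self, encoded):                                      # (B, T, D)
        x = self.bn(torch.relu(self.conv(encoded.transpose(1, 2))))  # layer 2: ReLU
        return self.proj(x.transpose(1, 2)).squeeze(-1)              # (B, T)

predictor = DurationPredictor()
predictor.eval()
durations = predictor(torch.zeros(2, 7, 256))   # one prediction per phoneme
```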
S103: and performing extension processing on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain the extension characteristics of each phoneme in the first phoneme sequence.
The extended features of each phoneme are subjected to acoustic feature prediction processing to obtain the acoustic features of each phoneme, and the acoustic features of each phoneme are synthesized into text acoustic features, which can be realized by adopting the following method: determining a context feature corresponding to the extension feature of each phoneme; performing linear transformation on the context characteristics corresponding to the extended characteristics of each phoneme to obtain the acoustic characteristics of each phoneme; and splicing the acoustic features of each phoneme according to the sequence of each phoneme in the target text to obtain the acoustic features.
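The extension step of S103, repeating each phoneme's features to span its predicted number of frames, can be sketched with NumPy. The function name and shapes are illustrative:

```python
import numpy as np

def expand_phonemes(encodings, durations):
    """Repeat each phoneme's feature vector durations[i] times along the time
    axis, so the sequence length matches the mel-spectrogram frame count.
    encodings: (num_phonemes, dim); durations: integer frames per phoneme."""
    return np.repeat(encodings, durations, axis=0)

enc = np.arange(8, dtype=float).reshape(4, 2)   # 4 phonemes, feature dim 2
dur = np.array([2, 1, 3, 2])                    # predicted frame counts
expanded = expand_phonemes(enc, dur)
print(expanded.shape)  # (8, 2) -- total frames = sum of durations
```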
In a specific embodiment, each phoneme in the first phoneme sequence is subjected to extension processing according to the acoustic features extracted from the real value of the mel spectrum corresponding to the first phoneme sequence by the trained teacher neural network, so as to obtain the extension features of each phoneme in the first phoneme sequence.
From the input first phoneme coding value and the real mel spectrum values corresponding to the first phoneme sequence, the teacher neural network extracts the real extension feature of each phoneme, which is used as the extension feature of each phoneme in the first phoneme sequence.
S104: the extended features of each phoneme in the first phoneme sequence are transformed into first mel-frequency spectral values.
Since the human ear's perception of sound is nonlinear, in order to simulate the sensitivity of human hearing to actual frequencies, a mel filter bank is often applied to a linear spectrogram to convert it into a nonlinear mel spectrum.
The extension features of each phoneme in the first phoneme sequence are input to the decoder of the student neural network, which transforms them into the first mel spectrum values. Specifically, the decoder of the student neural network has a three-layer structure: the first layer comprises a one-dimensional convolutional neural network layer; the second layer comprises a linear rectification layer; the third layer comprises a batch normalization layer. The decoder also includes a final linear layer.
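A sketch of this decoder in the same assumed PyTorch style; the channel width, mel-bin count, and kernel size are illustrative:

```python
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Sketch of the student decoder: conv -> ReLU -> batch norm, followed by
    a linear layer projecting to mel bins. Sizes are assumptions."""
    def __init__(self, dim=256, n_mels=80, kernel=5):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.bn = nn.BatchNorm1d(dim)
        self.linear = nn.Linear(dim, n_mels)  # final linear layer to mel spectrum

    def forward(self, expanded):                    # (B, frames, D)
        x = self.bn(torch.relu(self.conv(expanded.transpose(1, 2))))
        return self.linear(x.transpose(1, 2))       # (B, frames, n_mels)

decoder = MelDecoder()
decoder.eval()
mel = decoder(torch.zeros(2, 10, 256))   # 10 expanded frames -> 10 mel frames
```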
S105: and training the student neural network through the hidden variable and the first Mel frequency spectrum value provided by the trained teacher neural network until the first loss function of the student neural network is converged to obtain the trained student neural network.
The teacher-student neural network training method belongs to transfer learning, a branch of machine learning. Transfer learning transfers the attainable performance of a trained model to another model, where the latter has a relatively simpler structure than the former. In a teacher-student neural network, the teacher neural network is usually the more complex network, with better performance and generality, but it requires more system resources to train. To save the system resources needed for training, the trained teacher neural network can provide a soft target to guide the learning of a student neural network with a simpler structure and lower system resource consumption, so that a student model with a simple structure and a small amount of parameter computation can, through training, obtain performance similar to that of the teacher network.
The first phoneme sequence used for training the student neural network, together with the real mel spectrum values corresponding to it, is also input into the pre-trained teacher neural network. The teacher neural network outputs to the student neural network the hidden variables and the mel spectrum values it generates from the first phoneme sequence, which are used to evaluate the training effect of the student neural network. Training continues until the first loss function of the student neural network converges, yielding the trained student neural network.
In a specific embodiment, the first loss function for evaluating the training effect of the student neural network is the sum of the absolute errors between the teacher's and the student's mel spectrum values:

L1 = Σ_i |f_i − g_i|

where f_i is the mel spectrum value generated by the teacher neural network from the first phoneme sequence, and g_i is the mel spectrum value generated by the student neural network from the first phoneme sequence;
or, the first loss function is a Huber loss function:

L_δ(y, f(x)) = (1/2)(y − f(x))²         if |y − f(x)| ≤ δ
L_δ(y, f(x)) = δ|y − f(x)| − (1/2)δ²    otherwise

where y is the mel spectrum value generated by the teacher neural network from the first phoneme sequence, f(x) is the mel spectrum value generated by the student neural network from the first phoneme sequence, and δ is a hyperparameter preset according to the expected training effect before training begins.
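Both candidate loss functions can be written out directly. This NumPy sketch assumes the sums run over individual mel values:

```python
import numpy as np

def mae_loss(f, g):
    """Sum of absolute errors between teacher (f) and student (g) mel values."""
    return np.abs(f - g).sum()

def huber_loss(y, fx, delta=1.0):
    """Huber loss: quadratic within delta of the target, linear outside it."""
    err = np.abs(y - fx)
    quad = 0.5 * err ** 2
    lin = delta * err - 0.5 * delta ** 2
    return np.where(err <= delta, quad, lin).sum()

teacher = np.array([1.0, 2.0, 3.0])
student = np.array([1.5, 2.0, 5.0])
print(mae_loss(teacher, student))          # 2.5
print(huber_loss(teacher, student, 1.0))   # 0.125 + 0 + 1.5 = 1.625
```

The Huber form trades off the two: small errors behave like a squared loss, while outlier frames contribute only linearly, controlled by δ.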
In a specific embodiment, after step S105, the method for training speech based on deep learning further includes:
S111: connect the trained student neural network to a pre-trained vocoder;
S112: convert an input phoneme sequence into corresponding mel spectrum values through the trained student neural network;
S113: convert the mel spectrum values into voice through the vocoder.
The vocoder is a neural network that converts mel spectrum values into speech recognizable by the human ear. An existing neural network such as WaveNet, MelGAN, or WaveGlow can be selected as the vocoder; after being trained in advance, the vocoder receives the mel spectrum values output by the trained student neural network and converts them into speech. In a preferred embodiment, the more lightweight and faster MelGAN is chosen as the vocoder.
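The resulting inference pipeline of steps S111 to S113 can be sketched with placeholder callables standing in for the trained networks; the stub models and their outputs are purely illustrative:

```python
def synthesize(phoneme_sequence, student_model, vocoder):
    """End-to-end inference sketch (S111-S113): the trained student maps
    phonemes to mel spectrum values, and the vocoder renders them as audio.
    student_model and vocoder are placeholders for the trained networks."""
    mel = student_model(phoneme_sequence)   # S112: phonemes -> mel spectrum
    audio = vocoder(mel)                    # S113: mel spectrum -> waveform
    return audio

# Stub components standing in for the trained networks:
fake_student = lambda phonemes: [[0.1] * 80 for _ in phonemes]  # one frame/phoneme
fake_vocoder = lambda mel: [sum(frame) for frame in mel]        # toy "waveform"
audio = synthesize(["p", "ing", "a", "n"], fake_student, fake_vocoder)
print(len(audio))  # 4
```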
With the deep-learning-based voice training method, device, computer equipment, and storage medium provided by the invention, the sample data used for student neural network training is simultaneously input into a pre-trained teacher neural network; the teacher neural network provides hidden variables and reference mel spectrum values and supervises the machine learning process of the student neural network, improving training and inference efficiency and reducing the demand on hardware resources while preserving a good training effect as far as possible. In this teacher-student deep learning model, the teacher neural network is trained in advance and therefore occupies few system resources; the student neural network has a simple structure and occupies few system resources during training, so it can be trained on a single GPU. The trained student neural network, being structurally simple, can synthesize speech on a CPU in real time, and can be deployed quickly in many speech synthesis scenarios as an end-to-end speech synthesis solution.
In another particular embodiment, a method of pre-training a teacher neural network for supervised training of student neural networks is provided, the teacher neural network including a phoneme encoder, a spectral encoder, an attention processing mechanism, and a decoder. Specifically, the step of pre-training the teacher neural network includes:
s201: and coding the second phoneme sequence to obtain a second phoneme key coding value.
The second phoneme sequence is the phoneme sequence used for training the teacher neural network, and needs to be encoded to obtain the second phoneme coding value; specifically, the second phoneme sequence is transformed and compressed by the phoneme encoder in the teacher neural network. This phoneme encoder has a four-layer structure: the first layer comprises an embedding layer; the second layer comprises a fully connected layer; the third layer comprises a linear rectification function; the fourth layer comprises N gated residual structures built from dilated residual convolutional networks.
S202: and encoding the real Mel frequency spectrum value corresponding to the second phoneme sequence after leftwards shifting by a preset value to obtain a second Mel frequency spectrum encoding value.
The teacher neural network also has a spectral encoder, which provides a contextual encoding of each spectral frame by taking previous spectral frames into account. The spectral encoder comprises a fully connected layer, a linear rectification function, and N gated residual structures. First, the fully connected layer and linear rectification function are applied to each frame of the input spectrum; the encoded result is then input into the N gated residual structures and into a finer-grained gated residual network. The real mel spectrum values corresponding to the second phoneme sequence are shifted left by one position when input into the teacher neural network, so that the model can predict the next spectral frame from the current input phoneme and the previous spectral frames.
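The one-position shift of the teacher-forcing spectrum input might be realized as follows; offsetting the input so each position holds the previous frame, and zero-padding the vacated leading frame, are assumptions, since the patent does not specify the padding scheme:

```python
import numpy as np

def shift_mel_frames(mel, shift=1):
    """Offset the teacher-forcing spectrum input by `shift` frames (zero-padded)
    so the spectral encoder sees only previous frames when the decoder predicts
    the current one. The exact padding scheme is an assumption."""
    pad = np.zeros((shift, mel.shape[1]))
    return np.vstack([pad, mel[:-shift]])

mel = np.arange(12, dtype=float).reshape(4, 3)   # 4 frames, 3 mel bins
shifted = shift_mel_frames(mel)
# frame 0 is now all zeros; frame t holds the original frame t-1
```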
S203: and performing attention mechanism processing on the second phoneme coding value and the second Mel frequency spectrum coding value to obtain a second phoneme coding value and a second Mel frequency spectrum coding value which are added with attention.
The attention mechanism employs dot-product attention. Here the phoneme-side input comprises the output of the phoneme encoder and the sum of the phoneme encoder output and the phoneme embedding, while the spectrum-side input is the output of the spectral encoder. The attention output is a weighted average of the sums of the phoneme encoder output and the phoneme embeddings, where each weight is the match score between such a sum and the output of the spectral encoder. In this way, the model tends to select the phoneme associated with the next spectral frame.
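A bare NumPy sketch of the dot-product attention described here; scaling and masking are omitted, and the shapes are illustrative (spectrum frames as queries, phoneme-side keys and values):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(queries, keys, values):
    """Plain dot-product attention: scores match the spectral-encoder output
    (queries) against the phoneme-side keys, and the output is the weighted
    average of the phoneme-side values."""
    weights = softmax(queries @ keys.T)          # (frames, phonemes)
    return weights @ values, weights

q = np.random.default_rng(0).normal(size=(5, 4))   # 5 spectrum frames
k = np.random.default_rng(1).normal(size=(3, 4))   # 3 phonemes (keys)
v = np.random.default_rng(2).normal(size=(3, 4))   # 3 phonemes (values)
out, weights = dot_product_attention(q, k, v)
# each row of `weights` sums to 1 over the phonemes
```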
S204: the attention-summed second prime code value and the second mel-spectrum true code value are transformed into a second mel-spectrum value.
The input of the teacher neural network's decoder is the sum of the encoder output and the attention output; it then passes in sequence through N gated residual convolutional networks and convolutional layers with linear rectification functions to obtain the correct number of channels, and is finally fed into a sigmoid prediction layer to obtain the predicted spectrum value.
S205: and self-training the teacher neural network according to the real Mel frequency spectrum value and the second Mel frequency spectrum value corresponding to the second phoneme sequence until the second loss function of the teacher neural network is converged to obtain the trained teacher neural network.
The teacher network performs self-training using a second mel frequency spectrum value generated according to the second phoneme sequence and a real mel frequency spectrum value corresponding to the second phoneme sequence as training data until the second loss function converges.
In a specific embodiment, the second loss function for evaluating the self-training effect of the teacher neural network is the sum of the absolute errors between the real mel spectrum values corresponding to the second phoneme sequence and the second mel spectrum values:

L2 = Σ_i |f_i − g_i|

where f_i is the real mel spectrum value corresponding to the second phoneme sequence, and g_i is the mel spectrum value generated by the teacher neural network from the second phoneme sequence.
In another specific embodiment, the step of generating hidden variables by the pre-trained teacher neural network comprises:
s301: the trained teacher neural network encodes the first phoneme sequence to obtain a third phoneme key encoding value;
s302: the trained teacher neural network carries out coding after leftwards shifting the real value of the Mel frequency spectrum corresponding to the first phoneme sequence by a preset value, and a third Mel frequency spectrum coding value is obtained;
s303: the trained teacher neural network carries out attention mechanism processing on the third phoneme coding value and the third Mel frequency spectrum real coding value to obtain a third phoneme coding value and a third Mel frequency spectrum real coding value which are added with attention;
s304: transforming the attention-summed third phoneme coding value and the third mel-frequency spectrum coding value into a third mel-frequency spectrum value;
s305: and outputting the attention-summed third phoneme coding value, the third Mel frequency spectrum true coding value and the third Mel frequency spectrum value as hidden variables to the student neural network through the trained teacher neural network.
The above steps S301 to S304 parallel steps S201 to S204 of pre-training the teacher neural network, except that the inputs here are the first phoneme sequence used to train the student neural network and the real mel frequency spectrum value corresponding to the first phoneme sequence; the other steps are the same as in pre-training the teacher neural network and are not repeated here.
In step S305, the input first phoneme sequence and the real mel frequency spectrum value corresponding to the first phoneme sequence are converted into the attention-summed third phoneme coding value, the attention-summed third mel frequency spectrum coding value, and the third mel frequency spectrum value, which are output to the student neural network as hidden variables. The hidden variables provide a soft target for the student neural network, whose structure is simpler, guiding its learning so as to improve learning speed and efficiency.
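The soft-target guidance described above is the standard knowledge-distillation pattern. A minimal sketch, assuming a simple weighted blend of teacher guidance and ground-truth supervision (the weight `alpha` and the mean-absolute-error form are assumptions, not specified by the patent):

```python
import numpy as np

def distillation_loss(student_mel, teacher_mel, real_mel, alpha=0.5):
    """Blend the error against the teacher's soft target (its hidden-variable
    mel output) with the error against the ground-truth mel spectrum."""
    student_mel = np.asarray(student_mel)
    soft = np.abs(student_mel - np.asarray(teacher_mel)).mean()  # guidance from the teacher
    hard = np.abs(student_mel - np.asarray(real_mel)).mean()     # ground-truth supervision
    return alpha * soft + (1.0 - alpha) * hard
```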
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In another embodiment, a deep learning based speech training device 100 is provided, corresponding one-to-one with the deep learning based speech training method in the above embodiments. The device comprises the following modules: a first phoneme coding module 101, a duration prediction processing module 102, an extension processing module 103, a first mel frequency spectrum value transformation module 104, and a student neural network training module 105.
A first phoneme coding module 101, configured to code the first phoneme sequence to obtain a first phoneme coding value;
the duration prediction processing module 102 is configured to perform duration prediction processing on the first phoneme coding value to obtain a first pronunciation duration prediction value;
the extension processing module 103 is configured to perform extension processing on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain an extension feature of each phoneme in the first phoneme sequence;
a first mel frequency spectrum value transformation module 104 for transforming the extended features of each phoneme in the first phoneme sequence into first mel frequency spectrum values;
and the student neural network training module 105 is used for training the student neural network through the hidden variable and the first mel frequency spectrum value provided by the trained teacher neural network, and obtaining the trained student neural network when the first loss function of the student neural network is converged.
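The pipeline formed by modules 101 to 104 resembles a duration-based (FastSpeech-style) front end: encoded phonemes are expanded to frame-level features according to predicted durations before being mapped to mel values. A minimal sketch of the extension step of module 103, with repeat-per-duration as an assumed expansion rule:

```python
import numpy as np

def length_regulate(phoneme_features, durations):
    """Repeat each encoded phoneme vector durations[i] times, producing the
    frame-level extension features that the mel transformation module
    (module 104) would consume. The exact expansion rule is not given in
    the patent; frame repetition is an illustrative choice."""
    return np.repeat(np.asarray(phoneme_features), durations, axis=0)
```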
In another embodiment, a teacher neural network in a deep learning based speech training device includes the following modules:
the second phoneme coding module is used for coding the second phoneme sequence to obtain a second phoneme coding value;
the second mel frequency spectrum coding module is used for shifting the real mel frequency spectrum value corresponding to the second phoneme sequence leftwards by a preset value and then coding it to obtain a second mel frequency spectrum coding value;
the attention mechanism processing module is used for carrying out attention mechanism processing on the second phoneme coding value and the second Mel frequency spectrum coding value to obtain a second phoneme coding value and a second Mel frequency spectrum coding value which are added with attention;
a second mel frequency spectrum value transformation module, which is used for transforming the attention-summed second phoneme coding value and second mel frequency spectrum coding value into a second mel frequency spectrum value;
and the teacher neural network self-training module is used for self-training the teacher neural network according to the real mel frequency spectrum value corresponding to the second phoneme sequence and the second mel frequency spectrum value until a second loss function of the teacher neural network converges, so as to obtain the trained teacher neural network.
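The "shift by a preset value before coding" step in the second mel frequency spectrum coding module reads as the usual teacher-forcing offset: when predicting frame t, the decoder only sees ground-truth frames before t. A sketch under that reading (the shift direction, a one-frame preset, and zero padding are all assumptions):

```python
import numpy as np

def shift_mel(real_mel, preset=1, pad_value=0.0):
    """Offset the real mel frames by `preset` positions so that position t of
    the shifted sequence holds frame t - preset, padding the start with
    pad_value. real_mel has the assumed shape (frames, mel_bins)."""
    real_mel = np.asarray(real_mel)
    pad = np.full((preset, real_mel.shape[1]), pad_value)
    return np.vstack([pad, real_mel[:-preset]])
```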
In another specific embodiment, the student neural network training module 105 in the deep learning based speech training apparatus further includes:
the third phoneme coding unit is used for coding the first phoneme sequence by the trained teacher neural network to obtain a third phoneme key coding value;
the third Mel frequency spectrum coding unit is used for coding the real Mel frequency spectrum value corresponding to the first phoneme sequence after the trained teacher neural network deviates a preset value leftwards to obtain a third Mel frequency spectrum coding value;
the trained teacher neural network performs attention mechanism processing on the third phoneme coding value and the third mel-frequency-spectrum real coding value to obtain an attention-summed third phoneme coding value and a third mel-frequency-spectrum real coding value;
a third mel-frequency spectrum value transforming unit for transforming the attention-summed third phoneme coded value and third mel-frequency spectrum coded value into a third mel-frequency spectrum value;
and the hidden variable output unit is used for outputting the attention-summed third phoneme coding value, a third Mel frequency spectrum real coding value and a third Mel frequency spectrum value to the student neural network as the hidden variables through the trained teacher neural network.
In another specific embodiment, the first loss function is a sum of mean absolute errors between the first mel-frequency spectrum value and the third mel-frequency spectrum value; or, the first loss function is a Huber loss function.
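For reference, the Huber loss named as the alternative first loss function is quadratic for small residuals and linear for large ones, which makes training less sensitive to outlier mel frames. A sketch with `delta=1.0` as an assumed threshold (the patent does not specify one):

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Elementwise Huber loss averaged over all entries: 0.5*r**2 where the
    residual r = |pred - target| is at most delta, and
    delta*r - 0.5*delta**2 otherwise."""
    residual = np.abs(np.asarray(pred) - np.asarray(target))
    quadratic = 0.5 * residual ** 2
    linear = delta * residual - 0.5 * delta ** 2
    return np.where(residual <= delta, quadratic, linear).mean()
```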
In another specific embodiment, the extension processing module 103 is specifically configured to perform extension processing on each phoneme in the first phoneme sequence according to the feature extracted by the trained teacher neural network from the real mel spectrum value corresponding to the first phoneme sequence, so as to obtain an extension feature of each phoneme in the first phoneme sequence.
In another specific embodiment, the deep learning based speech training apparatus 100 further comprises:
a vocoder connection module for connecting the trained student neural network to a pre-trained vocoder;
the input phoneme sequence conversion module is used for converting the input phoneme sequence into a corresponding Mel frequency spectrum value through the trained student neural network;
and the speech output module is used for converting the mel frequency spectrum value into speech through the vocoder.
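The three modules above describe the inference path. As a sketch, with the student network and vocoder treated as opaque callables (assumed interfaces, not APIs defined by the patent):

```python
def synthesize(phoneme_sequence, student_network, vocoder):
    """Convert a phoneme sequence to speech: the trained student network maps
    phonemes to mel spectrum values, and the pre-trained vocoder maps those
    values to a waveform."""
    mel_values = student_network(phoneme_sequence)  # phonemes -> mel spectrum values
    return vocoder(mel_values)                      # mel spectrum values -> speech
```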
Wherein the meaning of "first" and "second" in the above modules/units is only to distinguish different modules/units, and is not used to define which module/unit has higher priority or other defining meaning. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules explicitly listed, but may include other steps or modules not explicitly listed or inherent to such process, method, article, or apparatus, and such that a division of modules presented in this application is merely a logical division and may be implemented in a practical application in a further manner.
For specific limitations of the deep learning based speech training apparatus, reference may be made to the above limitations of the deep learning based speech training method, which are not described herein again. The modules in the deep learning based speech training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server or a workstation, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data involved in the deep learning based speech training method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a deep learning based speech training method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor executes the computer program to implement the steps of the deep learning based speech training method in the above embodiments, such as the steps 101 to 105 shown in fig. 2 and other extensions of the method and related steps. Alternatively, the processor, when executing the computer program, implements the functions of the modules/units of the deep learning based speech training apparatus in the above embodiments, such as the functions of the modules 101 to 105 shown in fig. 5. To avoid repetition, further description is omitted here.
The Processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor, or the processor may be any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, video data, etc.) created according to the use of the computer device, and the like.
The memory may be integrated in the processor or may be provided separately from the processor.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the steps of the deep learning based speech training method in the above-described embodiments, such as the steps 101 to 105 and other extensions of the method and related steps shown in fig. 2. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units of the deep learning based speech training apparatus in the above embodiments, such as the functions of the modules 101 to 105 shown in fig. 5. To avoid repetition, further description is omitted here.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.
Claims (10)
1. A speech training method based on deep learning is characterized by comprising the following steps:
coding the first phoneme sequence to obtain a first phoneme coding value;
carrying out time length prediction processing on the first phoneme coding value to obtain a first pronunciation time length prediction value;
performing extension processing on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain an extension feature of each phoneme in the first phoneme sequence;
transforming the extended features of each phoneme in the first phoneme sequence into first mel frequency spectrum values;
and training the student neural network through hidden variables provided by the trained teacher neural network and the first Mel frequency spectrum value until a first loss function of the student neural network is converged, so as to obtain the trained student neural network.
2. The deep learning based speech training method of claim 1, wherein the step of training the teacher neural network comprises:
coding the second phoneme sequence to obtain a second phoneme coding value;
the real Mel frequency spectrum value corresponding to the second phoneme sequence is subjected to coding after being deviated to the left by a preset value, and a second Mel frequency spectrum coding value is obtained;
performing attention mechanism processing on the second phoneme coding value and the second mel frequency spectrum coding value to obtain the attention-summed second phoneme coding value and the second mel frequency spectrum coding value;
transforming the attention-summed second phoneme coding value and the second mel frequency spectrum coding value into a second mel frequency spectrum value;
and self-training the teacher neural network according to the real Mel frequency spectrum value corresponding to the second phoneme sequence and the second Mel frequency spectrum value until a second loss function of the teacher neural network is converged to obtain the trained teacher neural network.
3. The method for deep learning based speech training according to claim 2, wherein the step of obtaining a trained student neural network when the student neural network is trained through hidden variables provided by a pre-trained teacher neural network and the first mel-frequency spectrum value until a first loss function of the student neural network converges further comprises:
the trained teacher neural network encodes the first phoneme sequence to obtain a third phoneme coding value;
the trained teacher neural network codes the real value of the Mel frequency spectrum corresponding to the first phoneme sequence after leftwards deviating from a preset value to obtain a third Mel frequency spectrum coding value;
the trained teacher neural network performs attention mechanism processing on the third phoneme coding value and the third mel-frequency-spectrum real coding value to obtain a third phoneme coding value and a third mel-frequency-spectrum real coding value which are subjected to attention summation;
transforming the attention-summed third phoneme-encoding value and third mel-frequency spectrum-encoding value into a third mel-frequency spectrum value;
outputting the attention-summed third phoneme code value, third mel-frequency-spectrum true code value and third mel-frequency-spectrum value as the hidden variables to the student neural network through the trained teacher neural network.
4. The deep learning based speech training method of claim 3, wherein:
the first loss function is the sum of mean absolute errors between the first mel frequency spectrum value and the third mel frequency spectrum value;
or, the first loss function is a Huber loss function.
5. The deep learning-based speech training method according to claim 3, wherein the step of performing an extension process on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain an extension feature of each phoneme in the first phoneme sequence specifically comprises:
and performing extension processing on each phoneme in the first phoneme sequence according to the feature extracted from the real value of the Mel spectrum corresponding to the first phoneme sequence by the trained teacher neural network to obtain the extension feature of each phoneme in the first phoneme sequence.
6. The method of claim 1, wherein the training of the student neural network through the hidden variables provided by the trained teacher neural network and the first mel-frequency spectrum values until the trained student neural network is obtained when the first loss function of the student neural network converges further comprises:
connecting the trained student neural network to a pre-trained vocoder;
converting the input phoneme sequence into a corresponding Mel frequency spectrum value through the trained student neural network;
converting the Mel spectral values to speech by the vocoder.
7. A speech training device based on deep learning is characterized by comprising the following modules:
the first phoneme coding module is used for coding the first phoneme sequence to obtain a first phoneme coding value;
the duration prediction processing module is used for carrying out duration prediction processing on the first phoneme coding value to obtain a first pronunciation duration prediction value;
the extension processing module is used for carrying out extension processing on each phoneme in the first phoneme sequence based on the first pronunciation duration prediction value to obtain an extension characteristic of each phoneme in the first phoneme sequence;
a first mel frequency spectrum value transformation module, configured to transform the extended features of each phoneme in the first phoneme sequence into first mel frequency spectrum values;
and the student neural network training module is used for training the student neural network through hidden variables provided by the trained teacher neural network and the first Mel frequency spectrum value until a first loss function of the student neural network converges, so as to obtain the trained student neural network.
8. The deep learning based speech training device of claim 7, wherein the teacher neural network comprises the following modules:
the second phoneme coding module is used for coding the second phoneme sequence to obtain a second phoneme coding value;
the second mel frequency spectrum coding module is used for coding the real mel frequency spectrum value corresponding to the second phoneme sequence after leftwards deviating from a preset value to obtain a second mel frequency spectrum coding value;
an attention mechanism processing module, configured to perform attention mechanism processing on the second phoneme encoded value and the second mel-frequency spectrum encoded value to obtain an attention-summed second phoneme encoded value and second mel-frequency spectrum encoded value;
a second mel frequency spectrum value transformation module for transforming the attention-summed second phoneme coding value and the second mel frequency spectrum coding value into a second mel frequency spectrum value;
and the teacher neural network self-training module is used for self-training the teacher neural network according to the real Mel frequency spectrum corresponding to the second phoneme sequence and the second Mel frequency spectrum until a second loss function of the teacher neural network converges, so as to obtain the trained teacher neural network.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the deep learning based speech training method according to any of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the deep learning based speech training method according to any one of claims 1 to 6.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011593537.5A CN112735389A (en) | 2020-12-29 | 2020-12-29 | Voice training method, device and equipment based on deep learning and storage medium |
PCT/CN2021/083233 WO2022141842A1 (en) | 2020-12-29 | 2021-03-26 | Deep learning-based speech training method and apparatus, device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011593537.5A CN112735389A (en) | 2020-12-29 | 2020-12-29 | Voice training method, device and equipment based on deep learning and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112735389A true CN112735389A (en) | 2021-04-30 |
Family
ID=75607746
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011593537.5A Pending CN112735389A (en) | 2020-12-29 | 2020-12-29 | Voice training method, device and equipment based on deep learning and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112735389A (en) |
WO (1) | WO2022141842A1 (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113327576A (en) * | 2021-06-03 | 2021-08-31 | 多益网络有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN113362836A (en) * | 2021-06-02 | 2021-09-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Vocoder training method, terminal and storage medium |
CN113707127A (en) * | 2021-08-30 | 2021-11-26 | 中国科学院声学研究所 | Voice synthesis method and system based on linear self-attention |
CN113763987A (en) * | 2021-09-06 | 2021-12-07 | 中国科学院声学研究所 | Training method and device of voice conversion model |
CN114267375A (en) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | Phoneme detection method and device, training method and device, equipment and medium |
CN115798455A (en) * | 2023-02-07 | 2023-03-14 | 深圳元象信息科技有限公司 | Speech synthesis method, system, electronic device and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116524896A (en) * | 2023-04-24 | 2023-08-01 | 北京邮电大学 | Pronunciation inversion method and system based on pronunciation physiological modeling |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10199034B2 (en) * | 2014-08-18 | 2019-02-05 | At&T Intellectual Property I, L.P. | System and method for unified normalization in text-to-speech and automatic speech recognition |
CN109979429A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A kind of method and system of TTS |
CN111583904B (en) * | 2020-05-13 | 2021-11-19 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112002303B (en) * | 2020-07-23 | 2023-12-15 | 云知声智能科技股份有限公司 | End-to-end speech synthesis training method and system based on knowledge distillation |
CN112116903A (en) * | 2020-08-17 | 2020-12-22 | 北京大米科技有限公司 | Method and device for generating speech synthesis model, storage medium and electronic equipment |
CN111968618B (en) * | 2020-08-27 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device |
2020-12-29: CN application CN202011593537.5A filed (CN112735389A, status: Pending)
2021-03-26: PCT application PCT/CN2021/083233 filed (WO2022141842A1, Application Filing)
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113362836A (en) * | 2021-06-02 | 2021-09-07 | 腾讯音乐娱乐科技(深圳)有限公司 | Vocoder training method, terminal and storage medium |
CN113327576A (en) * | 2021-06-03 | 2021-08-31 | 多益网络有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN113327576B (en) * | 2021-06-03 | 2024-04-23 | 多益网络有限公司 | Speech synthesis method, device, equipment and storage medium |
CN113707127A (en) * | 2021-08-30 | 2021-11-26 | 中国科学院声学研究所 | Voice synthesis method and system based on linear self-attention |
CN113707127B (en) * | 2021-08-30 | 2023-12-15 | 中国科学院声学研究所 | Speech synthesis method and system based on linear self-attention |
CN113763987A (en) * | 2021-09-06 | 2021-12-07 | 中国科学院声学研究所 | Training method and device of voice conversion model |
CN114267375A (en) * | 2021-11-24 | 2022-04-01 | 北京百度网讯科技有限公司 | Phoneme detection method and device, training method and device, equipment and medium |
CN114267375B (en) * | 2021-11-24 | 2022-10-28 | 北京百度网讯科技有限公司 | Phoneme detection method and device, training method and device, equipment and medium |
CN115798455A (en) * | 2023-02-07 | 2023-03-14 | 深圳元象信息科技有限公司 | Speech synthesis method, system, electronic device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2022141842A1 (en) | 2022-07-07 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112735389A (en) | Voice training method, device and equipment based on deep learning and storage medium | |
CN112133282B (en) | Lightweight multi-speaker speech synthesis system and electronic equipment | |
CN109036371B (en) | Audio data generation method and system for speech synthesis | |
CN109859736B (en) | Speech synthesis method and system | |
CN111222347B (en) | Sentence translation model training method and device and sentence translation method and device | |
CN112289342A (en) | Generating audio using neural networks | |
CN111339278B (en) | Method and device for generating training speech generating model and method and device for generating answer speech | |
CN112712813B (en) | Voice processing method, device, equipment and storage medium | |
WO2022252904A1 (en) | Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product | |
WO2022135100A1 (en) | Artificial intelligence-based audio signal generation method, apparatus, device, storage medium, and computer program product | |
CN113886643A (en) | Digital human video generation method and device, electronic equipment and storage medium | |
CN113781995A (en) | Speech synthesis method, device, electronic equipment and readable storage medium | |
CN113178200B (en) | Voice conversion method, device, server and storage medium | |
CN112735377B (en) | Speech synthesis method, device, terminal equipment and storage medium | |
CN113628608A (en) | Voice generation method and device, electronic equipment and readable storage medium | |
CN111797220A (en) | Dialog generation method and device, computer equipment and storage medium | |
CN115359780A (en) | Speech synthesis method, apparatus, computer device and storage medium | |
CN114743539A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN115240713A (en) | Voice emotion recognition method and device based on multi-modal features and contrast learning | |
CN115376484A (en) | Lightweight end-to-end speech synthesis system construction method based on multi-frame prediction | |
CN115171666A (en) | Speech conversion model training method, speech conversion method, apparatus and medium | |
CN111583902B (en) | Speech synthesis system, method, electronic device and medium | |
CN112861546A (en) | Method and device for acquiring text semantic similarity value, storage medium and electronic equipment | |
CN116168687B (en) | Voice data processing method and device, computer equipment and storage medium | |
CN113450765B (en) | Speech synthesis method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||