CN112820268A - Personalized voice conversion training method and device, computer equipment and storage medium - Google Patents

Personalized voice conversion training method and device, computer equipment and storage medium

Info

Publication number
CN112820268A
CN112820268A (application CN202011602932.5A)
Authority
CN
China
Prior art keywords
voice
data
speech
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011602932.5A
Other languages
Chinese (zh)
Inventor
黄东延
王若童
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ubtech Technology Co ltd
Original Assignee
Shenzhen Ubtech Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ubtech Technology Co ltd filed Critical Shenzhen Ubtech Technology Co ltd
Priority to CN202011602932.5A priority Critical patent/CN112820268A/en
Publication of CN112820268A publication Critical patent/CN112820268A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention provides a personalized voice conversion training method, which comprises: training an initial voice conversion model with the acquired speech parallel corpora of N speakers to obtain a voice conversion average model; acquiring the speech parallel corpus of a specific speaker and combining it with the speech parallel corpora of the N speakers respectively to obtain N groups of training voice data; training the voice conversion average model on the N groups of training voice data to obtain a specific voice conversion average model; and acquiring first sample voice data of a target speaker and second sample voice data corresponding to the specific speaker, then training the specific voice conversion average model to obtain a target voice conversion model for converting the specific voice into the target voice. The application also relates to a personalized voice conversion training device, computer equipment, and a storage medium.

Description

Personalized voice conversion training method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for personalized speech conversion training, a computer device, and a storage medium.
Background
With the continuous development of multimedia communication technology, speech synthesis, one of the important channels of human-machine communication, has received extensive attention from researchers for its convenience and speed. The goal of speech synthesis is to make the synthesized speech intelligible, clear, natural, and expressive. To that end, existing speech synthesis systems generally select a target speaker and record a large amount of that speaker's pronunciation data as the basic data for synthesis. The advantage of this approach is that the quality and timbre of the synthesized speech closely resemble the speaker's own voice, greatly improving clarity and naturalness; the disadvantage is that a large amount of sample voice data of the target speaker must be collected, and this collection consumes considerable material and financial resources, making it very difficult to prepare a unique personalized speech synthesis model for each individual user.
Disclosure of Invention
In view of the foregoing, there is a need to provide a method, an apparatus, a computer device, and a storage medium for personalized speech conversion training that require only a small amount of sample speech data from the target speaker.
In a first aspect, the present invention provides a method for personalized speech conversion training, including:
obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of persons corresponding to the same voice text content;
training an initial voice conversion model based on the voice parallel corpora of the N individual speakers to obtain a voice conversion average model;
acquiring the voice parallel corpus of a specific speaker, and combining the voice parallel corpus of the specific speaker with the voice parallel corpora of N speakers respectively to obtain N groups of training voice data;
training the voice conversion average model based on N groups of training voice data to obtain a specific voice conversion average model;
acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to a specific speaker, wherein the first sample voice data and the second sample voice data have the same text content, and the scale of the first sample voice data is far smaller than that of voice parallel linguistic data;
and training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
In a second aspect, the present invention provides a personalized speech conversion training device, comprising:
the first obtaining module is used for obtaining voice corpus data in a voice corpus, and the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of persons corresponding to the same voice text content;
the first training module is used for training the initial voice conversion model based on the voice parallel corpora of the N speakers to obtain a voice conversion average model;
the second acquisition module is used for acquiring the voice parallel linguistic data of the specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of the N speakers respectively to obtain N groups of training voice data;
the second training module is used for training the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model;
the third acquisition module is used for acquiring first sample voice data of a target speaker and acquiring second sample voice data corresponding to a specific speaker, wherein the text contents corresponding to the first sample voice data and the second sample voice data are the same, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus;
and the third training module is used for training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
In a third aspect, the present invention provides a computer apparatus comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of:
obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of persons corresponding to the same voice text content;
training an initial voice conversion model based on the voice parallel corpora of the N individual speakers to obtain a voice conversion average model;
acquiring the voice parallel linguistic data of a specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of N speakers respectively to obtain N groups of training voice data;
training the voice conversion average model based on N groups of training voice data to obtain a specific voice conversion average model;
acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to a specific speaker, wherein the first sample voice data and the second sample voice data have the same text content, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus;
and training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
In a fourth aspect, the present invention provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of:
obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of persons corresponding to the same voice text content;
training an initial voice conversion model based on the voice parallel corpora of the N individual speakers to obtain a voice conversion average model;
acquiring the voice parallel linguistic data of a specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of N speakers respectively to obtain N groups of training voice data;
training the voice conversion average model based on N groups of training voice data to obtain a specific voice conversion average model;
acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to a specific speaker, wherein the first sample voice data and the second sample voice data have the same text content, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus;
and training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
The invention provides a personalized voice conversion training method, a device, and computer equipment. A voice conversion average model is obtained by training an initial voice conversion model with the acquired speech parallel corpora of N speakers; the speech parallel corpus of a specific speaker is acquired and combined with the speech parallel corpora of the N speakers respectively to obtain N groups of training voice data; the voice conversion average model is trained on the N groups of training voice data to obtain a specific voice conversion average model; first sample voice data of a target speaker and second sample voice data corresponding to the specific speaker are acquired, and the specific voice conversion average model is trained to obtain a target voice conversion model for converting the specific voice into the target voice. Because the scale of the first sample voice data of the target speaker is far smaller than that of the speech parallel corpus, the invention can synthesize high-quality personalized voice with only a small amount of the target speaker's sample voice data, greatly reducing the production cost of personalized voice and making it feasible to build a unique personalized speech synthesis model for each individual user.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flow diagram of a personalized speech conversion training method in one embodiment;
FIG. 2 is a flow diagram of a personalized speech conversion training method in another embodiment;
FIG. 3 is a diagram of an algorithm for aligning source speech acoustic features with desired speech acoustic features, according to one embodiment;
FIG. 4 is a flow diagram illustrating alignment of a source speech acoustic feature with a desired speech acoustic feature, according to one embodiment;
FIG. 5 is a flow diagram of a method for personalized speech conversion in one embodiment;
FIG. 6 is a flow diagram of a personalized speech conversion training method in yet another embodiment;
FIG. 7 is a block diagram of a personalized speech conversion training device in one embodiment;
FIG. 8 is a block diagram of a personalized speech conversion training device in another embodiment;
FIG. 9 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, the present invention provides a personalized speech conversion training method, which includes:
step 102, obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the speech parallel linguistic data of the N speakers means that the speech linguistic data of a plurality of speakers correspond to the same speech text content.
The voice corpus is where voice corpus data is stored; it contains a sufficient quantity of voice corpora, and a voice corpus may include a voice sample together with its corresponding text sample. A speech parallel corpus means that the spoken text content of every speaker is the same. For example, each speaker has 300 sentences of speech, and the text content corresponding to those 300 sentences is identical.
And 104, training the initial voice conversion model based on the voice parallel corpora of the N speakers to obtain a voice conversion average model.
The speech parallel corpora of the N speakers are combined pairwise to obtain N × (N − 1) groups of training speech data, one group for each ordered source-target speaker pair, and the initial voice conversion model is trained on these N × (N − 1) groups to obtain the voice conversion average model. Specifically, the initial speech conversion model is built on a neural-network deep learning model.
In one embodiment, the neural network model may be a BiLSTM (Bi-directional Long Short-Term Memory) model; the initial speech conversion model built on the BiLSTM model is trained with the speech parallel corpora of the N speakers to obtain the speech conversion average model. A sketch of the pairwise combination is given below.
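The following is a minimal Python sketch of the pairwise combination described above. All names are assumptions for illustration; in particular, the ordered (directional) pairing that yields N × (N − 1) source-target pairs is an interpretation of the patent's "combined pairwise" wording, since voice conversion has a direction.

```python
from itertools import permutations

def build_training_pairs(corpus):
    """corpus: dict mapping speaker id -> list of utterances, where
    utterance i carries the same text content for every speaker."""
    pairs = []
    # Conversion is directional, so N speakers yield N * (N - 1) ordered pairs.
    for src, tgt in permutations(corpus.keys(), 2):
        for src_utt, tgt_utt in zip(corpus[src], corpus[tgt]):
            pairs.append((src_utt, tgt_utt))  # same sentence, two voices
    return pairs

# e.g. 10 speakers with 300 parallel sentences each:
# 10 * 9 speaker pairs * 300 sentences = 27000 (source, target) pairs
```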
And 106, acquiring the voice parallel linguistic data of the specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of the N speakers respectively to obtain N groups of training voice data.
The speech parallel corpus of the specific speaker is combined with the speech parallel corpora of the N speakers respectively, yielding N groups of training voice data for conversion from the specific speaker to each of the N speakers. The N groups of training speech data can be stored in the cloud or, equally, on the local device.
In one embodiment, the specific speaker is A, there are 10 speakers, and each has a 300-sentence parallel corpus. The 300-sentence speech parallel corpus of speaker A is obtained and combined with the speech parallel corpora of the 10 speakers respectively, yielding 1 × 10 × 300 = 3000 groups of training speech data.
And 108, training the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model.
The voice conversion average model is trained with the N groups of training speech data to obtain a specific voice conversion average model, i.e., a model that can convert from the specific speaker's voice to the voices of the N speakers.
And step 110, acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to a specific speaker, wherein the text contents corresponding to the first sample voice data and the second sample voice data are the same, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus.
The second sample voice data corresponding to the specific speaker is obtained from the voice corpus, while the first sample voice data of the target speaker is obtained through a voice acquisition device, for example in a recording studio with the corresponding equipment. Because its scale is far smaller than that of the speech parallel corpus, the first sample voice data constitutes only a small voice sample of the target speaker.
And 112, training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
The first sample voice data of the target speaker and the second sample voice data corresponding to the specific speaker are combined to obtain training speech data for conversion from the specific speaker to the target speaker. The specific speech conversion average model is trained with this data, resulting in a target speech conversion model from the specific speech to the target speech.
The invention provides a personalized voice conversion training method, a device, and computer equipment. A voice conversion average model is obtained by training an initial voice conversion model with the acquired speech parallel corpora of N speakers; the speech parallel corpus of a specific speaker is acquired and combined with the speech parallel corpora of the N speakers respectively to obtain N groups of training voice data; the voice conversion average model is trained on the N groups of training voice data to obtain a specific voice conversion average model; first sample voice data of a target speaker and second sample voice data corresponding to the specific speaker are acquired, and the specific voice conversion average model is trained to obtain a target voice conversion model for converting the specific voice into the target voice. Because the scale of the first sample voice data of the target speaker is far smaller than that of the speech parallel corpus, the invention can synthesize high-quality personalized voice with only a small amount of the target speaker's sample voice data, greatly reducing the production cost of personalized voice and making it feasible to build a unique personalized speech synthesis model for each individual user.
In one embodiment, as shown in FIG. 2, within the N groups of training speech data the speech parallel corpus of the specific speaker serves as the source speech and the speech parallel corpora of the N speakers serve as the expected speech; the method further comprises the following steps:
step 202, performing acoustic feature extraction on the source speech and the expected speech respectively by using a speech feature analyzer to obtain acoustic features of the source speech and acoustic features of the expected speech.
To reduce computational and storage complexity, the source speech and the expected speech are audio-resampled before acoustic feature extraction; an audio resampling algorithm can convert an audio signal between arbitrary sampling rates. The speech feature analyzer then extracts acoustic features from the resampled source speech and expected speech, converting their speech signals into acoustic features such as the frequency spectrum (Spectrum), fundamental frequency (Fundamental Frequency), and aperiodic component (Aperiodic Spectrum). A sketch of this analysis step is given below.
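As one concrete possibility, the WORLD vocoder (exposed in Python by the pyworld package) extracts exactly these three features; the sketch below uses it together with librosa for resampling. The 16 kHz rate and the function name extract_acoustic_features are assumptions, not details from the patent.

```python
import librosa
import numpy as np
import pyworld as pw

def extract_acoustic_features(path, sr=16000):
    # librosa resamples on load, implementing the audio-resampling step
    wav, _ = librosa.load(path, sr=sr)
    wav = wav.astype(np.float64)  # WORLD expects float64 samples
    # WORLD analysis: fundamental frequency, spectral envelope, aperiodicity
    f0, spectrum, aperiodicity = pw.wav2world(wav, sr)
    return f0, spectrum, aperiodicity
```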
Step 204, aligning the source speech acoustic features with the expected speech acoustic features on the time axis.
In one embodiment, as shown in FIGS. 3 and 4, a Dynamic Time Warping (DTW) method is used to align the acoustic features of the source speech to the acoustic feature length of the expected speech. Since the acoustic features are extracted frame by frame, the distance between frames at time t needs to be measured; it can be taken as the Euclidean distance between the two frame vectors,

$$d(I_t, J_t) = \lVert I_t - J_t \rVert_2 = \sqrt{\sum_{n=1}^{N} (I_{t,n} - J_{t,n})^2},$$

where I and J are feature matrices of dimension T (number of frames) × N (feature dimension).
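A minimal sketch of this alignment step, assuming the Euclidean frame distance above; librosa's DTW routine is one convenient implementation, not something the patent specifies.

```python
import numpy as np
import librosa

def align_to_target(source_feats, target_feats):
    """source_feats, target_feats: arrays of shape (T, N) = (frames, features)."""
    # librosa expects (features, frames), hence the transposes
    _, wp = librosa.sequence.dtw(X=source_feats.T, Y=target_feats.T,
                                 metric='euclidean')
    wp = wp[::-1]  # warping path ordered from the first frame to the last
    aligned = np.zeros_like(target_feats)
    for i, j in wp:
        aligned[j] = source_feats[i]  # stretch/compress source onto the target timeline
    return aligned
```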
And step 206, training a preset neural network model by using the aligned source speech acoustic features and expected speech acoustic features to obtain an initial speech conversion model.
The aligned source speech acoustic features and expected speech acoustic features are fed into a bidirectional long short-term memory recurrent neural network (BLSTM) model to obtain the initial speech conversion model, i.e., a model capable of converting the source speech into the expected speech. A minimal model sketch follows.
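A minimal PyTorch sketch of such a BLSTM converter. The hidden size and layer count are assumptions; the 130-dimensional feature size follows the example below.

```python
import torch
import torch.nn as nn

class BLSTMConverter(nn.Module):
    def __init__(self, feat_dim=130, hidden=256, layers=2):
        super().__init__()
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=layers,
                             batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, feat_dim)  # 2x: forward + backward states

    def forward(self, x):      # x: (batch, T, feat_dim) aligned source features
        h, _ = self.blstm(x)
        return self.proj(h)    # converted features, same shape as the input
```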
In one embodiment, the distance function between frames at time t is the Euclidean distance given above, where I and J are feature matrices of dimension T (number of frames) × N (feature dimension). With N = 130, the aligned source speech acoustic features x (T × N) are fed into the bidirectional long short-term memory recurrent neural network (BLSTM) model, whose relevant parameters are listed in Table 1 (BLSTM model parameters; the table image is not reproduced here). The BLSTM model outputs the converted acoustic features $\hat{y}$ (T × N, N = 130 here).
The loss is then computed as the error between the converted features $\hat{y}$ and the correctly labelled acoustic features y, for example the mean squared error

$$\mathcal{L} = \frac{1}{T}\sum_{t=1}^{T} \lVert \hat{y}_t - y_t \rVert_2^2,$$

and gradient descent on this loss updates the parameter weights of the BLSTM model, yielding the initial voice conversion model. A sketch of this update follows.
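A minimal training-step sketch matching the update just described; it reuses the BLSTMConverter sketch above, and the optimizer and learning rate are assumptions.

```python
import torch

model = BLSTMConverter(feat_dim=130)  # from the sketch above
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.MSELoss()

def train_step(x, y):
    """x, y: tensors of shape (batch, T, 130), time-aligned feature pairs."""
    y_hat = model(x)              # converted acoustic features
    loss = loss_fn(y_hat, y)      # error against the labelled features y
    optim.zero_grad()
    loss.backward()               # gradient descent on the BLSTM weights
    optim.step()
    return loss.item()
```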
In one embodiment, training the initial speech conversion model based on the speech parallel corpora of the N speakers to obtain a speech conversion average model comprises: combining the speech parallel corpora of the N speakers pairwise to obtain N × (N − 1) groups of training speech data, and training the initial speech conversion model on these groups to obtain the speech conversion average model.
Combining the speech parallel corpora of the N speakers pairwise yields the speech data for pairwise conversion between speakers, i.e., N × (N − 1) groups of training speech data. Since each speaker has the same number of parallel-corpus sentences, each of the N × (N − 1) groups contains multiple sub-groups of training speech data. For example, when each of the N speakers has a 300-sentence speech parallel corpus, combining the corpora pairwise yields N × (N − 1) × 300 groups of training speech data. In one embodiment, with 300 parallel utterances from each of 10 different speakers, combining them pairwise yields 10 × 9 × 300 = 27000 groups of training speech data.
In one embodiment, as shown in fig. 5, the method further comprises: acquiring a voice text to be converted, and converting the voice text to be converted into voice data of a specific speaker through a voice synthesis model; and taking the voice data of the specific speaker as the input of the target voice conversion model, and acquiring the target voice data output by the target voice conversion model.
After the voice text to be converted is turned into the specific speaker's voice data by the speech synthesis model, that voice data is fed into the target voice conversion model, and the target voice data output by the model is obtained. Speech synthesis is thus combined with voice conversion: the voice conversion model is appended after the speech synthesis model, so each target speaker can obtain high-quality personalized speech synthesis by providing only a small amount of sample voice data, as sketched below.
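A minimal sketch of this chaining; the callables are placeholders for the trained models, not names from the patent.

```python
def personalized_tts(text, synthesis_model, target_conversion_model):
    specific_speech = synthesis_model(text)           # specific speaker's voice
    return target_conversion_model(specific_speech)   # converted into the target voice
```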
In one embodiment, as shown in fig. 6, before converting the speech text to be converted into the speech data of the specific speaker by the speech synthesis model, the method further includes:
step 602, obtaining target speech corpus data corresponding to a specific speaker.
The target voice corpus data is contained in the target voice corpus, and the target voice corpus data comprises target voice data and target text data corresponding to the target voice data.
Step 604, performing text analysis and voice analysis on the target voice corpus data to obtain a voice corpus text feature and a voice corpus sound feature, respectively.
The voice corpus sound features are obtained through a speech analyzer and include at least one of timbre, pitch, and loudness parameters. The text analysis may be lexical or syntactic analysis, and the text features include the phoneme sequence, part of speech, word length, prosodic pauses, and the like.
Step 606, training the preset neural network model by using the text features and the sound features of the speech corpus to obtain a speech synthesis model corresponding to the specific speaker.
The preset neural network model is trained by utilizing the text features and the sound features of the voice corpus, so that a voice synthesis model corresponding to a specific speaker is obtained.
In one embodiment, the neural-network deep learning model chosen is the bidirectional long short-term memory (BiLSTM) model. The BiLSTM model is a variant of the LSTM model, formed by combining a forward LSTM with a backward LSTM. The LSTM has a three-gate structure comprising a forget gate, an input gate, and an output gate. The LSTM structure captures relations between samples over long time sequences, and the input, forget, and output gates decide which history states to keep or discard, so long-range history information is cached effectively; a BiLSTM model is therefore selected and trained to obtain the speech synthesis model corresponding to the specific speaker. The standard LSTM gate equations are reproduced below.
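For reference, the standard LSTM gate equations (the textbook formulation; the patent does not spell them out), with $\sigma$ the logistic sigmoid, $\odot$ elementwise multiplication, $x_t$ the input, $c_t$ the cell state, and $h_t$ the hidden state:

$$
\begin{aligned}
f_t &= \sigma\!\left(W_f [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\!\left(W_i [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\!\left(W_c [h_{t-1}, x_t] + b_c\right) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state)} \\
o_t &= \sigma\!\left(W_o [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

A BiLSTM runs one such LSTM forward and another backward over the sequence and concatenates their hidden states.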
In one embodiment, the speech synthesis model includes a duration model and an acoustic model, obtains a speech text to be converted, and converts the speech text to be converted into speech data of a specific speaker through the speech synthesis model, and further includes: performing text analysis on the voice text to be converted to obtain the characteristics of the text to be converted; the text features to be converted are used as the input of a duration model, and duration features corresponding to the text features to be converted are obtained; and inputting the time length characteristic and the text characteristic to be converted into the acoustic model to obtain the voice data of the specific speaker.
Phonemes (phones) are the smallest phonetic units, divided according to the natural attributes of speech; the duration model predicts the pronunciation length of each phoneme and thereby controls the speaking rate. The acoustic model obtains the specific speaker's voice data from the duration features and the text features to be converted. Specifically, the text features to be converted are produced by an optimized front-end module; the front-end module is built on a neural-network deep learning model and comprises a prosody prediction module, a part-of-speech module, a word-length module, a phoneme-sequence module, and the like. A sketch of the two-stage pipeline follows.
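A minimal sketch of the duration-then-acoustic pipeline just described; every name is a placeholder for a trained component, not an identifier from the patent.

```python
def synthesize(text, front_end, duration_model, acoustic_model):
    text_feats = front_end(text)            # phoneme sequence, part of speech, word length, prosody
    durations = duration_model(text_feats)  # predicted pronunciation length per phoneme
    return acoustic_model(text_feats, durations)  # the specific speaker's voice data
```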
As shown in fig. 7, the present invention provides a personalized speech conversion training device, which comprises:
a first obtaining module 702, configured to obtain speech corpus data in a speech corpus, where the speech corpus data includes: the speech parallel linguistic data of the N speakers means that the speech linguistic data of a plurality of speakers correspond to the same speech text content.
The first training module 704 is configured to train the initial speech conversion model based on the parallel speech corpus of the N speakers to obtain an average speech conversion model.
The second obtaining module 706 is configured to obtain the speech parallel corpus of the specific speaker, and combine the speech parallel corpus of the specific speaker with the speech parallel corpora of the N speakers, respectively, to obtain N groups of training speech data.
The second training module 708 is configured to train the speech conversion average model based on the N groups of training speech data to obtain a specific speech conversion average model.
In one embodiment, the second training module 708 is further configured to combine the speech parallel corpora of the N speakers pairwise to obtain N × (N − 1) groups of training speech data, and to train the initial voice conversion model with the N × (N − 1) groups of training speech data to obtain the voice conversion average model.
In one embodiment, the N groups of training speech data take the speech parallel corpus of the specific speaker as the source speech and the speech parallel corpora of the N speakers as the expected speech. The second training module 708 is further configured to perform acoustic feature extraction on the source speech and the expected speech with the speech feature analyzer, obtaining the acoustic features of each; to align the source speech acoustic features with the expected speech acoustic features on the time axis; and to train a preset neural network model with the aligned source and expected speech acoustic features to obtain the initial speech conversion model.
The third obtaining module 710 is configured to obtain first sample voice data of a target speaker, and obtain second sample voice data corresponding to a specific speaker, where text contents corresponding to the first sample voice data and the second sample voice data are the same, and a scale of the first sample voice data is far smaller than a scale of the parallel corpus.
And a third training module 712, configured to train the specific speech conversion average model based on the first sample speech data and the second sample speech data, to obtain a target speech conversion model from the specific speech to the target speech.
In one embodiment, as shown in fig. 8, a personalized speech conversion training device further comprises:
a fourth obtaining module 714, configured to obtain a speech text to be converted.
And a speech synthesis module 716, configured to convert the speech text to be converted into speech data of a specific speaker through a speech synthesis model.
In one embodiment, the speech synthesis module 716 is further configured to obtain target speech corpus data corresponding to a specific speaker; performing text analysis and voice analysis on the target voice corpus data to respectively obtain voice corpus text characteristics and voice corpus sound characteristics; and training a preset neural network model by using the text characteristics and the sound characteristics of the voice corpus to obtain a voice synthesis model corresponding to a specific speaker.
The voice conversion module 718 is configured to use voice data of a specific speaker as an input of the target voice conversion model, and obtain target voice data output by the target voice conversion model.
As shown in FIG. 9, in one embodiment an internal structure diagram of a computer device is provided. The computer device may be a personalized speech conversion training apparatus, or a terminal or server connected to a personalized speech conversion training apparatus. As shown in FIG. 9, the computer device includes a processor, a memory, and a network interface connected by a system bus, where the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program that, when executed by the processor, causes the processor to implement the personalized speech conversion training method. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform the personalized speech conversion training method. The network interface is used for communicating with an external device. Those skilled in the art will appreciate that the architecture shown in FIG. 9 is merely a block diagram of some of the structures relevant to the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or arrange components differently.
In one embodiment, the personalized speech conversion training method provided by the present application may be implemented in the form of a computer program that is executable on a computer device such as the one shown in fig. 9. The memory of the computer device may store various program templates that make up the personalized speech conversion training apparatus. For example, the first obtaining module 702, the first training module 704, the second obtaining module 706, the second training module 708, the third obtaining module 710, the third training module 712, the fourth obtaining module 714, the speech synthesizing module 716, and the speech converting module 718.
A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of: obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of persons corresponding to the same voice text content; training an initial voice conversion model based on the voice parallel corpora of the N individual speakers to obtain a voice conversion average model; acquiring the voice parallel linguistic data of a specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of N speakers respectively to obtain N groups of training voice data; training the voice conversion average model based on N groups of training voice data to obtain a specific voice conversion average model; acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to a specific speaker, wherein the first sample voice data and the second sample voice data have the same text content, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus; and training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
In one embodiment, the N sets of training speech data have the parallel corpus of speeches of a particular speaker as source speech and the N sets of training speech data have the parallel corpus of speeches of N speakers as desired speech, the computer program when executed by the processor causes the processor to perform the steps of: respectively extracting acoustic features of source speech and expected speech by using a speech feature analyzer to obtain acoustic features of the source speech and the expected speech; controlling alignment of a source speech acoustic feature with a desired speech acoustic feature on a time axis; and training a preset neural network model by using the aligned source speech acoustic characteristics and expected speech acoustic characteristics to obtain an initial speech conversion model.
In one embodiment, in training the initial voice conversion model based on the speech parallel corpora of the N speakers to obtain a voice conversion average model, the computer program, when executed by the processor, causes the processor to perform the steps of: combining the speech parallel corpora of the N speakers pairwise to obtain N × (N − 1) groups of training speech data; and training the initial voice conversion model with the N × (N − 1) groups of training speech data to obtain the voice conversion average model.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring a voice text to be converted, and converting the voice text to be converted into voice data of a specific speaker through a voice synthesis model; and taking the voice data of the specific speaker as the input of the target voice conversion model, and acquiring the target voice data output by the target voice conversion model.
In one embodiment, prior to converting the phonetic text to be converted to speech data for a particular speaker by the speech synthesis model, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring target voice corpus data corresponding to a specific speaker; performing text analysis and voice analysis on the target voice corpus data to respectively obtain voice corpus text characteristics and voice corpus sound characteristics; and training a preset neural network model by using the text characteristics and the sound characteristics of the voice corpus to obtain a voice synthesis model corresponding to a specific speaker.
In one embodiment, the speech synthesis model includes a duration model and an acoustic model, the speech text to be converted is obtained, the speech text to be converted is converted into speech data of a specific speaker by the speech synthesis model, and the computer program, when executed by the processor, causes the processor to perform the steps of: performing text analysis on the voice text to be converted to obtain the characteristics of the text to be converted; the text features to be converted are used as the input of a duration model, and duration features corresponding to the text features to be converted are obtained; and inputting the time length characteristic and the text characteristic to be converted into the acoustic model to obtain the voice data of the specific speaker.
A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of persons corresponding to the same voice text content; training an initial voice conversion model based on the voice parallel corpora of the N individual speakers to obtain a voice conversion average model; acquiring the voice parallel linguistic data of a specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of N speakers respectively to obtain N groups of training voice data; training the voice conversion average model based on N groups of training voice data to obtain a specific voice conversion average model; acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to a specific speaker, wherein the first sample voice data and the second sample voice data have the same text content, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus; and training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
In one embodiment, the N sets of training speech data have the parallel corpus of speeches of a particular speaker as source speech and the N sets of training speech data have the parallel corpus of speeches of N speakers as desired speech, the computer program when executed by the processor causes the processor to perform the steps of: respectively extracting acoustic features of source speech and expected speech by using a speech feature analyzer to obtain acoustic features of the source speech and the expected speech; controlling aligning a source speech acoustic feature with a desired speech acoustic feature on a time axis; and training a preset neural network model by using the aligned source speech acoustic characteristics and expected speech acoustic characteristics to obtain an initial speech conversion model.
In one embodiment, in training the initial speech conversion model based on the speech parallel corpora of the N speakers to obtain a speech conversion average model, the computer program, when executed by the processor, causes the processor to perform the steps of: combining the speech parallel corpora of the N speakers pairwise to obtain N × (N − 1) groups of training speech data; and training the initial speech conversion model with the N × (N − 1) groups of training speech data to obtain the speech conversion average model.
In one embodiment, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring a voice text to be converted, and converting the voice text to be converted into voice data of a specific speaker through a voice synthesis model; and taking the voice data of the specific speaker as the input of the target voice conversion model, and acquiring the target voice data output by the target voice conversion model.
In one embodiment, prior to converting the phonetic text to be converted to speech data for a particular speaker by the speech synthesis model, the computer program, when executed by the processor, causes the processor to perform the steps of: acquiring target voice corpus data corresponding to a specific speaker; performing text analysis and voice analysis on the target voice corpus data to respectively obtain voice corpus text characteristics and voice corpus sound characteristics; and training a preset neural network model by using the text characteristics and the sound characteristics of the voice corpus to obtain a voice synthesis model corresponding to a specific speaker.
In one embodiment, the speech synthesis model includes a duration model and an acoustic model, the speech text to be converted is obtained, the speech text to be converted is converted into speech data of a specific speaker by the speech synthesis model, and the computer program, when executed by the processor, causes the processor to perform the steps of: performing text analysis on the voice text to be converted to obtain the characteristics of the text to be converted; the text features to be converted are used as the input of a duration model, and duration features corresponding to the text features to be converted are obtained; and inputting the time length characteristic and the text characteristic to be converted into the acoustic model to obtain the voice data of the specific speaker.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when executed. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory, among others. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDRSDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus Direct RAM (RDRAM), direct bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present application, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present application. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for personalized speech conversion training, the method comprising:
obtaining voice corpus data in a voice corpus, wherein the voice corpus data comprises: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of speakers corresponding to the same voice text content;
training an initial voice conversion model based on the voice parallel corpora of the N personal speakers to obtain a voice conversion average model;
acquiring the voice parallel corpus of a specific speaker, and combining the voice parallel corpus of the specific speaker with the voice parallel corpora of the N speakers respectively to obtain N groups of training voice data;
training the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model;
acquiring first sample voice data of a target speaker, and acquiring second sample voice data corresponding to the specific speaker, wherein the text contents corresponding to the first sample voice data and the second sample voice data are the same, and the scale of the first sample voice data is far smaller than that of the voice parallel corpus;
and training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
2. The method according to claim 1, wherein the N sets of training speech data have the parallel corpora of speech of a specific speaker as source speech, and the N sets of training speech data have the parallel corpora of speech of N speakers as expected speech; the method further comprises the following steps:
respectively extracting acoustic features of the source speech and the expected speech by using a speech feature analyzer to obtain acoustic features of the source speech and the expected speech;
aligning the source speech acoustic features with the expected speech acoustic features on a time axis;
and training a preset neural network model by using the aligned source speech acoustic features and the expected speech acoustic features to obtain an initial speech conversion model.
3. The method according to claim 1, wherein training an initial speech conversion model based on the parallel corpus of speeches of the N speakers to obtain a speech conversion mean model comprises:
combining the voice parallel corpora of the N speakers pairwise to obtain N × (N − 1) groups of training speech data; and
training the initial voice conversion model with the N × (N − 1) groups of training speech data to obtain the voice conversion average model.
4. The method of claim 1, further comprising:
acquiring a voice text to be converted, and converting the voice text to be converted into voice data of the specific speaker through a voice synthesis model;
and taking the voice data of the specific speaker as the input of the target voice conversion model, and acquiring the target voice data output by the target voice conversion model.
5. The method of claim 4, wherein before converting the speech text to be converted into the speaker-specific speech data by the speech synthesis model, the method further comprises:
acquiring target voice corpus data corresponding to the specific speaker;
performing text analysis and voice analysis on the target voice corpus data to respectively obtain voice corpus text characteristics and voice corpus sound characteristics;
and training a preset neural network model by using the voice corpus text characteristics and the voice corpus sound characteristics to obtain a voice synthesis model corresponding to the specific speaker.
6. The method according to claim 4, wherein the speech synthesis model comprises a duration model and an acoustic model, the obtaining the speech text to be converted, and the converting the speech text to be converted into the speech data of the specific speaker by the speech synthesis model, further comprising:
performing text analysis on the voice text to be converted to obtain the characteristics of the text to be converted;
taking the text features to be converted as the input of the duration model to obtain duration features corresponding to the text features to be converted;
and inputting the duration characteristic and the text characteristic to be converted into the acoustic model to obtain the voice data of the specific speaker.
7. A personalized speech conversion training device, the device comprising:
the first obtaining module is configured to obtain speech corpus data in a speech corpus, where the speech corpus data includes: the voice parallel linguistic data of N speakers refer to the voice linguistic data of a plurality of speakers corresponding to the same voice text content;
the first training module is used for training an initial voice conversion model based on the voice parallel corpora of the N individual speakers to obtain a voice conversion average model;
the second acquisition module is used for acquiring the voice parallel linguistic data of a specific speaker, and combining the voice parallel linguistic data of the specific speaker with the voice parallel linguistic data of the N speakers respectively to obtain N groups of training voice data;
the second training module is used for training the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model;
a third obtaining module, configured to obtain first sample voice data of a target speaker, and obtain second sample voice data corresponding to the specific speaker, where text contents corresponding to the first sample voice data and the second sample voice data are the same, and a scale of the first sample voice data is far smaller than a scale of the voice parallel corpus;
and the third training module is used for training the specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice into the target voice.
8. The apparatus of claim 7, further comprising:
the fourth acquisition module is used for acquiring the voice text to be converted;
the voice synthesis module is used for converting the voice text to be converted into the voice data of the specific speaker through a voice synthesis model;
and the voice conversion module is used for taking the voice data of the specific speaker as the input of the target voice conversion model and acquiring the target voice data output by the target voice conversion model.
9. A computer device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 6.
CN202011602932.5A 2020-12-29 2020-12-29 Personalized voice conversion training method and device, computer equipment and storage medium Pending CN112820268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011602932.5A CN112820268A (en) 2020-12-29 2020-12-29 Personalized voice conversion training method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011602932.5A CN112820268A (en) 2020-12-29 2020-12-29 Personalized voice conversion training method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112820268A true CN112820268A (en) 2021-05-18

Family

ID=75855293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011602932.5A Pending CN112820268A (en) 2020-12-29 2020-12-29 Personalized voice conversion training method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112820268A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005266349A (en) * 2004-03-18 2005-09-29 Nec Corp Device, method, and program for voice quality conversion
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
US20110282668A1 (en) * 2010-05-14 2011-11-17 General Motors Llc Speech adaptation in speech synthesis
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
KR20200092505A (en) * 2019-01-13 2020-08-04 네오데우스 주식회사 Method for generating speaker-adapted speech synthesizer model with a few samples using a fine-tuning based on deep convolutional neural network ai
CN111164674A (en) * 2019-12-31 2020-05-15 深圳市优必选科技股份有限公司 Speech synthesis method, device, terminal and storage medium
CN111433847A (en) * 2019-12-31 2020-07-17 深圳市优必选科技股份有限公司 Speech conversion method and training method, intelligent device and storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495898A (en) * 2022-04-15 2022-05-13 中国科学院自动化研究所 Training method and system for unified speech synthesis and speech conversion
CN114495898B (en) * 2022-04-15 2022-07-01 中国科学院自动化研究所 Unified speech synthesis and speech conversion training method and system
WO2023206928A1 (en) * 2022-04-27 2023-11-02 网易(杭州)网络有限公司 Speech processing method and apparatus, computer device, and computer-readable storage medium
CN114945110A (en) * 2022-05-31 2022-08-26 深圳市优必选科技股份有限公司 Speaking head video synthesis method and device, terminal equipment and readable storage medium
CN114945110B (en) * 2022-05-31 2023-10-24 深圳市优必选科技股份有限公司 Method and device for synthesizing voice head video, terminal equipment and readable storage medium

Similar Documents

Publication Publication Date Title
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
CN112005298B (en) Clock type hierarchical variational encoder
US11264010B2 (en) Clockwork hierarchical variational encoder
US10186252B1 (en) Text to speech synthesis using deep neural network with constant unit length spectrogram
CN111433847B (en) Voice conversion method, training method, intelligent device and storage medium
Rudnicky et al. Survey of current speech technology
CN112820268A (en) Personalized voice conversion training method and device, computer equipment and storage medium
US11881210B2 (en) Speech synthesis prosody using a BERT model
DE102017124264B4 (en) Computer implemented method and computing system for determining phonetic relationships
US20090063153A1 (en) System and method for blending synthetic voices
CN102543081B (en) Controllable rhythm re-estimation system and method and computer program product
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
Van Santen Prosodic modeling in text-to-speech synthesis
CN112712789B (en) Cross-language audio conversion method, device, computer equipment and storage medium
WO2021134591A1 (en) Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
WO2023279976A1 (en) Speech synthesis method, apparatus, device, and storage medium
CN112634866A (en) Speech synthesis model training and speech synthesis method, apparatus, device and medium
CN111370001A (en) Pronunciation correction method, intelligent terminal and storage medium
CN113506586A (en) Method and system for recognizing emotion of user
JP2001034280A (en) Electronic mail receiving device and electronic mail system
JP6864322B2 (en) Voice processing device, voice processing program and voice processing method
RU61924U1 (en) STATISTICAL SPEECH MODEL
Hsieh et al. A speaking rate-controlled mandarin TTS system
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
WO2022141126A1 (en) Personalized speech conversion training method, computer device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination