CN111261177A - Voice conversion method, electronic device and computer readable storage medium

Voice conversion method, electronic device and computer readable storage medium

Info

Publication number
CN111261177A
Authority
CN
China
Prior art keywords
voice
acoustic
conversion
spectrogram
target
Prior art date
Legal status
Pending
Application number
CN202010063801.8A
Other languages
Chinese (zh)
Inventor
马坤
赵之砚
施奕明
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010063801.8A priority Critical patent/CN111261177A/en
Publication of CN111261177A publication Critical patent/CN111261177A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to voice processing technology and discloses a voice conversion method, which comprises the following steps: receiving a conversion instruction sent by a user that carries a real voice and a target timbre; extracting a first acoustic feature from the real voice; inputting the first acoustic feature into a first conversion model for timbre conversion to obtain a second acoustic feature; constructing a first spectrogram with low tone quality based on the second acoustic feature; inputting the first spectrogram into a second conversion model for tone quality conversion to obtain a second spectrogram with high tone quality; restoring a voice signal from the second spectrogram to obtain a target voice corresponding to the target timbre; and feeding the target voice back to the user. The invention also discloses an electronic device and a computer-readable storage medium. The invention can realize real-time, high-quality voice conversion.

Description

Voice conversion method, electronic device and computer readable storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech conversion method, an electronic device, and a computer-readable storage medium.
Background
Style transfer is an important emerging field in artificial intelligence; in particular, many advances have been made in the image domain, such as image-to-image translation and painting style transfer.
In the speech field, however, related research has progressed relatively little. At present, the voice conversion technique whose signal-to-waveform stage comes closest to a natural human voice uses WaveNet, which is autoregressive and must be learned and trained on all of the sample data; its sound quality is particularly good. However, this approach has the following problems: 1) a large amount of paired voice data of the user and the conversion target is needed, and in practical applications it is difficult to obtain enough paired voice data to support training, so the model performs poorly and high-quality converted voice cannot be obtained; 2) the training process is particularly slow because all of the sample data in the entire sample set must be learned.
Therefore, it is desirable to provide a method that can quickly produce high-quality converted speech.
Disclosure of Invention
In view of the foregoing, the present invention provides a voice conversion method, an electronic device and a computer readable storage medium, which mainly aims to realize real-time and high-quality voice conversion.
To achieve the above object, the present invention provides a voice conversion method, including:
step S1, receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target tone;
step S2, extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target tone for tone conversion, and outputting a second acoustic feature of the real voice corresponding to the target tone;
step S3, constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature;
step S4, inputting the first spectrogram into a pre-trained second conversion model for sound quality conversion, and outputting a second spectrogram related to the real voice and corresponding to the target tone; and
step S5, restoring the second spectrogram based on a voice reconstruction algorithm to obtain a target voice related to the real voice and corresponding to the target tone, and feeding the target voice back to the user through the client.
In addition, to achieve the above object, the present invention also provides an electronic device, including: the system comprises a memory and a processor, wherein the memory stores a voice conversion program which can run on the processor, and the voice conversion program can realize any step of the voice conversion method when being executed by the processor.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium, which includes a voice conversion program, and when the voice conversion program is executed by a processor, the voice conversion program can implement any step of the voice conversion method as described above.
The voice conversion method, the electronic device and the computer readable storage medium provided by the invention receive a conversion instruction carrying a real voice and a target tone sent by a user, extract a first acoustic feature from the real voice, input the first acoustic feature into a first conversion model to obtain a second acoustic feature, construct a first spectrogram with low tone quality based on the second acoustic feature, input the first spectrogram into a second conversion model to obtain a second spectrogram with high tone quality, and restore a voice signal from the second spectrogram to obtain the target voice corresponding to the target tone. 1. The voice conversion process is divided into two parts, and the first part only needs to achieve low-tone-quality voice conversion, so only a small number of voice data pairs are needed; this solves the problem of poor model training caused by the lack of sample data when a large number of voice data pairs cannot be obtained, and improves voice conversion efficiency. 2. The second conversion model converts the first spectrogram with low tone quality into a second spectrogram with high tone quality, laying the foundation for subsequent high-quality voice conversion. 3. The first conversion model and the second conversion model are built from the pix2pix model used in the field of image processing, and the model structure is changed correspondingly, so that it is adapted to audio processing, the model converges faster and model training is accelerated.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a voice conversion method according to the present invention;
FIG. 2 is a diagram of an electronic device according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of the program modules of the voice conversion program in FIG. 2.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a voice conversion method. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
Referring to fig. 1, a flow chart of a voice conversion method according to a preferred embodiment of the invention is shown.
In this embodiment, the method includes: step S1-step S5.
Step S1, receiving a voice conversion instruction sent by a user through a client, where the voice conversion instruction includes a real voice to be converted and a target tone.
The following describes embodiments of the present invention with an electronic device as the execution subject. The user sends a voice conversion instruction carrying the real voice and the target tone through the client; after receiving the voice conversion instruction, the electronic device extracts the real voice for voice conversion processing, converts the timbre of the real voice into the target tone, and feeds the converted voice segment back to the client.
The real voice is a voice segment in the user's own timbre, and the target tone is the timbre of a target person, where the target person is selected by the user from a specified set of target persons, such as cartoon characters, celebrities, and the like.
The client is provided with a voice acquisition unit, such as a microphone. An APP is installed on the client, and the user issues the voice conversion instruction through the client APP.
Step S2, extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target timbre for timbre conversion, and outputting a second acoustic feature of the real voice corresponding to the target timbre.
The first acoustic feature is an acoustic feature corresponding to real voice, and the second acoustic feature is an acoustic feature which is converted according to the first acoustic feature and corresponds to the real voice and the target tone.
It should be noted that the first acoustic feature does not refer to a single acoustic-related feature, but is a combined feature vector obtained by combining a plurality of preset acoustic-related features. The first acoustic feature is extracted using the open-source voice toolkit PyWorld. In this embodiment, extracting the first acoustic feature from the real voice includes:
calculating a first preset acoustic relevant feature and a second preset acoustic relevant feature in the real voice;
converting the second preset acoustic relevant features to obtain converted second preset acoustic relevant features; and
and combining and generating the first acoustic feature based on the first preset acoustic relevant feature and the converted second preset acoustic relevant feature.
The first preset acoustic-related features include the fundamental frequency (F0) and the aperiodic signal (AP), the second preset acoustic-related feature includes the spectral envelope, and the converted second preset acoustic-related feature is the mel-cepstrum.
In this embodiment, the fundamental frequency of the real voice is calculated using the DIO algorithm, and the spectral envelope of the real voice is calculated using the CheapTrick algorithm. In addition, the spectral envelope is converted into the mel-cepstrum using sp2mc of the pysptk voice toolkit. These acoustic-related features are spliced into a combined feature vector, which is input into the first conversion model as the first acoustic feature of the real voice; the output of the model is the combined feature vector corresponding to the converted target timbre, namely the second acoustic feature.
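For illustration, a minimal feature-extraction sketch in Python using the PyWorld and pysptk toolkits mentioned above might look as follows; the mel-cepstral order, the use of PyWorld's D4C estimator for the aperiodic signal, and the ordering inside the combined vector are assumptions of this sketch, not values fixed by this embodiment.
import numpy as np
import pyworld
import pysptk

def extract_first_acoustic_feature(wav, fs, mcep_order=24):
    """Build the per-frame combined feature vector (F0, aperiodic signal, mel-cepstrum)."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.dio(wav, fs)                 # fundamental frequency via the DIO algorithm
    sp = pyworld.cheaptrick(wav, f0, t, fs)      # spectral envelope via the CheapTrick algorithm
    ap = pyworld.d4c(wav, f0, t, fs)             # aperiodic signal (PyWorld's D4C estimator)
    alpha = pysptk.util.mcepalpha(fs)            # frequency-warping factor for this sample rate
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)  # spectral envelope -> mel-cepstrum
    # splice the individual acoustic-related features into one combined vector per frame
    return np.hstack([f0[:, None], ap, mcep])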
In this embodiment, the pre-trained first conversion model is a pix2pix model. It can be understood that, since the first acoustic feature is a one-dimensional feature sequence along the time axis, the two-dimensional convolution layers in the model are changed into one-dimensional convolution layers in this embodiment, so that the pix2pix model can be applied to one-dimensional acoustic feature conversion.
It should be noted that the first conversion model solves one-to-one voice conversion, i.e. conversion from speaker A to speaker B, and does not support multi-speaker conversion; that is, an A->B model can only convert from A to B, and each different A->B conversion model needs to be trained separately.
Step S3, constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature.
The first spectrogram is a spectrogram with low tone quality. It should be noted that, like the first acoustic feature, the second acoustic feature is a combined feature vector formed by combining a plurality of independent acoustic-related features.
In this embodiment, the constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature includes:
splitting the second acoustic feature to obtain a third preset acoustic related feature and a fourth preset acoustic related feature corresponding to the second acoustic feature;
converting the fourth preset acoustic relevant feature to obtain a converted fourth preset acoustic relevant feature; and
and taking the converted fourth preset acoustic related feature as the first spectrogram.
The third preset acoustic-related features include the fundamental frequency and the aperiodic signal corresponding to the second acoustic feature, the fourth preset acoustic-related feature includes the mel-cepstrum corresponding to the second acoustic feature, and the converted fourth preset acoustic-related feature is the spectral envelope converted from that mel-cepstrum.
Firstly, the fundamental frequency, the aperiodic signal, the mel-cepstrum and other acoustic-related features are split out of the combined feature vector of the second acoustic feature; then the extracted mel-cepstrum is converted with mc2sp of the pysptk voice toolkit to obtain the spectral envelope, which is used as the first spectrogram of the real voice.
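A minimal sketch of this splitting and mel-cepstrum-to-envelope conversion, assuming the same feature ordering and FFT size as in the extraction sketch above (both are illustrative assumptions); pysptk's mc2sp is the inverse of the sp2mc call used earlier.
import numpy as np
import pysptk

def build_first_spectrogram(second_feature, fs, fft_size=1024):
    """Split the converted combined feature vector and rebuild the (low-tone-quality) first spectrogram."""
    alpha = pysptk.util.mcepalpha(fs)
    n_ap = fft_size // 2 + 1
    f0 = second_feature[:, 0]                    # third preset feature: fundamental frequency
    ap = second_feature[:, 1:1 + n_ap]           # third preset feature: aperiodic signal
    mcep = second_feature[:, 1 + n_ap:]          # fourth preset feature: mel-cepstrum
    # mel-cepstrum -> spectral envelope, used as the first spectrogram of the real voice
    sp = pysptk.mc2sp(np.ascontiguousarray(mcep), alpha=alpha, fftlen=fft_size)
    return f0, ap, sp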
It should be noted that the mel-cepstrum -> spectral envelope operation follows the same idea as the spectral envelope -> mel-cepstrum process; the conversion between the mel-cepstrum and the spectrum is realized by the Fourier transform in signal processing, which is not described here.
And step S4, inputting the first spectrogram into a pre-trained second conversion model for voice quality conversion, and outputting a second spectrogram related to the real voice and corresponding to the target tone.
The second spectrogram is a spectrogram with high tone quality, and the second conversion model is used for converting the spectrogram with low tone quality into the spectrogram with high tone quality, so that a foundation is laid for subsequently outputting conversion voice with high tone quality.
In this embodiment, the second conversion model trained in advance is a pix2pix model. It should be noted that, in view of the first spectrogram, the first spectrogram includes: information such as time, frequency, amplitude, etc. can be regarded as a single-channel two-dimensional image, so the two-dimensional pix2pix model is used as the second conversion model in this embodiment.
And step S5, restoring the second spectrogram based on a voice reconstruction algorithm to obtain target voice related to the real voice and corresponding to the target tone, and feeding the target voice back to the user through the client.
After the second spectrogram with high tone quality is obtained, it needs to be converted into a voice signal and fed back to the user. Restoring the second spectrogram based on a voice reconstruction algorithm to obtain the target voice, related to the real voice, corresponding to the target tone includes:
acquiring a third preset acoustic relevant characteristic of the second acoustic characteristic;
synthesizing the third preset acoustic relevant feature and the second spectrogram by using a preset voice reconstruction algorithm to generate a voice signal corresponding to the second spectrogram; and
and taking the voice signal corresponding to the second spectrogram as target voice related to the real voice corresponding to the target tone.
In this embodiment, the synthesis method of the PyWorld toolkit (pyworld.synthesize) is used to directly synthesize the voice signal from acoustic-related features such as the fundamental frequency, the spectral envelope and the aperiodic signal.
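For reference, a minimal reconstruction sketch with PyWorld's synthesis routine; the frame period below is the toolkit's default value and the inputs are assumed to come from the sketches above.
import pyworld

def restore_target_voice(f0, high_quality_sp, ap, fs, frame_period=5.0):
    """WORLD-vocoder synthesis: F0 + high-tone-quality spectral envelope + aperiodic signal -> waveform."""
    return pyworld.synthesize(f0, high_quality_sp, ap, fs, frame_period)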
Among them, WORLD is a vocoder-based speech synthesis tool, and the role of vocoder is mainly: extracting relevant parameters of the voice signal; and synthesizing final voice according to the related parameters. Some vocoders in the prior art are as follows:
STRAIGHT - can produce a high-quality synthesis result, but is slow;
Real-time STRAIGHT - simplifies the algorithm on the basis of STRAIGHT; although it is faster, the cost is a loss of performance;
TANDEM-STRAIGHT - has performance similar to STRAIGHT, but cannot achieve real-time synthesis;
Compared with TANDEM-STRAIGHT, WORLD reduces the computational complexity and achieves real-time synthesis while its performance remains unchanged.
In other embodiments, the pre-trained first conversion model training step includes:
acquiring a first preset number of voice data pairs of the original speaker (source speaker) and the target speaker (target speaker);
respectively extracting acoustic features of each voice data in the first preset number of voice data pairs, and generating the first preset number of acoustic feature pairs as sample data;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the one-dimensional pix2pix model by using the training set;
and calculating the loss value of the one-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the one-dimensional pix2pix model as a first conversion model corresponding to the target speaker.
The above-mentioned voice data pairs, i.e. paired voices (paired data) in which the original speaker A and the target speaker B speak the same content, do not need to be numerous; a small number (e.g. 80-100 pairs) is sufficient.
In the process of generating the sample data, each acoustic feature in an acoustic feature pair needs to be labeled with the corresponding speaker, such as the original speaker A or the target speaker B. The step of extracting the acoustic features of the voice data is consistent with the step of extracting the first acoustic feature in the above embodiment and is not repeated here. During training, the model input is the acoustic feature pair (acoustic feature of the original speaker A, acoustic feature of the target speaker) and the output is the acoustic feature of speech that imitates the timbre of the target speaker.
It should be noted that the convergence conditions corresponding to sample data collected in different environments are not exactly the same. Taking a voice training set recorded with a mobile phone in a quiet environment at a 16 kHz sampling rate as an example, when the generator loss is about 0.2 the model is close to stable and no longer converging, and training can be terminated at this point.
Training the first conversion model yields an acoustic feature conversion model. The goal of the first conversion model is to use a small amount of voice data to fit a conversion model that approximates the timbre of the target speaker and so complete the conversion from speaker A to speaker B, which alleviates the problem of poor model performance caused by the inability to obtain a large amount of voice data. To ensure that the conversion result is closer to the target timbre, the conversion effect can additionally be checked manually.
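The sample preparation described above can be sketched as follows; the pairing logic and the 4:1 split follow this embodiment, while the helper signature (a feature-extraction callable such as the one sketched in step S2) is an assumption for illustration.
import random

def build_first_model_samples(paired_wavs, fs, extract_feature, split_ratio=0.8):
    """paired_wavs: list of (source_wav, target_wav) pairs with identical spoken content (e.g. 80-100 pairs);
    extract_feature: callable returning the acoustic feature of one utterance."""
    samples = []
    for src_wav, tgt_wav in paired_wavs:
        src_feat = extract_feature(src_wav, fs)   # acoustic feature of the original speaker A
        tgt_feat = extract_feature(tgt_wav, fs)   # acoustic feature of the target speaker B
        samples.append((src_feat, tgt_feat))      # one acoustic feature pair
    random.shuffle(samples)
    cut = int(len(samples) * split_ratio)         # preset proportion, e.g. 4:1
    return samples[:cut], samples[cut:]           # training set, verification set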
In other embodiments, the training step of the pre-trained second conversion model includes:
acquiring a second preset number of target-speaker acoustic features by using the first conversion model, and constructing first spectrograms based on these acoustic features;
acquiring a third preset number of pieces of target-speaker voice data, extracting the acoustic features of each piece of voice data, and constructing a third preset number of second spectrograms based on these acoustic features;
generating sample data based on the first spectrogram and the second spectrogram;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the two-dimensional pix2pix model by using the training set;
and calculating the loss value of the two-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the two-dimensional pix2pix model as a second conversion model.
In this embodiment, the steps of extracting the acoustic features and constructing the spectrograms are the same as the steps of extracting the first acoustic feature and constructing the first spectrogram in the above embodiment, and are not repeated here. It should be noted that, in order to improve the model accuracy, a large number of low-tone-quality and high-tone-quality spectrograms of the target speaker are required, so the second and third preset numbers need to be set to relatively large values, both much larger than the first preset number; the second and third preset numbers may be the same or different. The first, second and third preset numbers are preset values and can be adjusted according to the actual situation.
The first spectrogram is a low-tone-quality spectrogram and serves as the independent variable X in the sample data; the second spectrogram is a high-tone-quality spectrogram, which may also be called a full-tone-quality spectrogram, and serves as the dependent variable Y in the sample data. During model training the loss value of the model is calculated; when the loss value tends to be stable and no longer converges, training can be stopped.
The task of voice conversion from low tone quality to high tone quality aims to provide a conversion model from a low-tone-quality spectrogram to a high-tone-quality spectrogram. During model training, a large amount of data of the target speaker B (i.e., voice data of the conversion target) is prepared as high-tone-quality spectrogram data, and a (low quality -> high quality) data relationship is constructed between the output of the first conversion model (i.e., the low-tone-quality spectrograms) and the high-tone-quality spectrogram data. These data pairs are then fed into the constructed two-dimensional pix2pix model, which is trained to obtain a conversion model from low-tone-quality spectrograms to high-tone-quality spectrograms.
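As an illustration of how such a two-dimensional pix2pix model could be trained on the (low-tone-quality, high-tone-quality) spectrogram pairs, a single training step is sketched below in PyTorch; the framework, the conditional discriminator interface and the L1 weight are assumptions of this sketch, not details fixed by this embodiment.
import torch
import torch.nn.functional as F

def pix2pix_training_step(gen, disc, opt_g, opt_d, low_sp, high_sp, l1_weight=100.0):
    """gen: U-Net generator; disc: conditional discriminator taking (input, output) spectrogram pairs.
    low_sp / high_sp: tensors of shape (batch, 1, freq, time) treated as single-channel images."""
    # discriminator step: distinguish real pairs from generated pairs
    fake_sp = gen(low_sp)
    d_real = disc(low_sp, high_sp)
    d_fake = disc(low_sp, fake_sp.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # generator step: fool the discriminator while staying close to the real spectrogram (L1 term)
    d_fake = disc(low_sp, fake_sp)
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + l1_weight * F.l1_loss(fake_sp, high_sp))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
Training is stopped once the loss value stabilizes and no longer converges, as described above.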
In other embodiments, to increase the training speed of the first transformation model, other structures of the one-dimensional pix2pix model may be adaptively adjusted.
In this embodiment, the U-Net structure of the pix2pix model is used. The convolution kernel size in the original structure is 4 x 4. In order to adapt to voice processing, when constructing the model structure, in addition to changing the convolution and deconvolution layers into one-dimensional ones, the convolution kernel sizes of the first down-sampling layer and the last up-sampling layer in the U-Net structure are changed to 3 x 3.
Taking the down sampling in the U-Net structure as an example, the original structure is as follows:
in_size=input_shape,out_size=64,kernel_size=4->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the structure after adjustment is as follows:
in_size=input_shape,out_size=64,kernel_size=3->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the above network structure is changed to adapt to the audio processing on one hand, and to accelerate the convergence speed of the model on the other hand, thereby increasing the training speed of the model.
It should be noted that the above changes of the network structure are also used in the second conversion model, and are not described herein.
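A sketch of the adjusted down-sampling path in the one-dimensional case, assuming a PyTorch implementation; the normalization and activation choices follow common pix2pix implementations and are assumptions of this sketch, while the channel sizes and kernel sizes mirror the adjusted structure listed above.
import torch.nn as nn

def make_downsampling_path(input_channels):
    """Down-sampling half of the one-dimensional U-Net generator."""
    specs = [(input_channels, 64, 3),   # first layer: kernel size changed from 4 to 3
             (64, 128, 4), (128, 256, 4), (256, 512, 4),
             (512, 512, 4), (512, 512, 4), (512, 512, 4), (512, 512, 4)]
    layers = []
    for in_size, out_size, kernel_size in specs:
        layers += [nn.Conv1d(in_size, out_size, kernel_size, stride=2, padding=1),
                   nn.InstanceNorm1d(out_size),
                   nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)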
The voice conversion method provided in the above embodiment receives a conversion instruction carrying a real voice and a target tone sent by a user, extracts a first acoustic feature from the real voice, inputs the first acoustic feature into a first conversion model to obtain a second acoustic feature, constructs a first spectrogram with low tone quality based on the second acoustic feature, inputs the first spectrogram into a second conversion model to obtain a second spectrogram with high tone quality, and restores a voice signal from the second spectrogram to obtain a target voice corresponding to the target tone. 1. The voice conversion process is divided into two parts, and the first part only needs to achieve low-tone-quality voice conversion, so only a small number of voice data pairs are needed; this solves the problem of poor model training caused by the lack of sample data when a large number of voice data pairs cannot be obtained, and improves voice conversion efficiency. 2. The second conversion model converts the first spectrogram with low tone quality into a second spectrogram with high tone quality, laying the foundation for subsequent high-quality voice conversion. 3. The first conversion model and the second conversion model are built from the pix2pix model used in the field of image processing, and the model structure is changed correspondingly, so that it is adapted to audio processing, the model converges faster and model training is accelerated.
The invention also provides an electronic device. Fig. 2 is a schematic view of an electronic device according to a preferred embodiment of the invention.
In this embodiment, the electronic device 1 may be a server, a smart phone, a tablet computer, a portable computer, a desktop computer, or other terminal equipment with a data processing function, and the server may be a rack server, a blade server, a tower server, or a cabinet server.
The electronic device 1 includes a memory 11, a processor 12, and a network interface 13.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1.
The memory 11 may also be an external storage device of the electronic apparatus 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic apparatus 1.
The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the voice conversion program 10, but also to temporarily store data that has been output or is to be output.
Processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip, executes program code or processes data stored in memory 11, such as voice conversion program 10.
The network interface 13 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used for establishing a communication connection between the electronic apparatus 1 and other electronic devices, such as a client (not shown in the figure).
Fig. 2 only shows the electronic device 1 with the components 11-13, and it will be understood by a person skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface, a wireless interface.
Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
In the embodiment of the electronic device 1 shown in fig. 2, the memory 11 as a kind of computer storage medium stores the program code of the speech conversion program 10, and when the processor 12 executes the program code of the speech conversion program 10, the following steps are implemented:
and a receiving step, receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target tone.
The user sends a voice conversion instruction carrying the real voice and the target tone through the client; after receiving the voice conversion instruction, the electronic device extracts the real voice for voice conversion processing, converts the timbre of the real voice into the target tone, and feeds the converted voice segment back to the client.
The real voice is a voice segment in the user's own timbre, and the target tone is the timbre of a target person, where the target person is selected by the user from a specified set of target persons, such as cartoon characters, celebrities, and the like.
The client is provided with a voice acquisition unit, such as a microphone. An APP is installed on the client, and the user issues the voice conversion instruction through the client APP.
A first conversion step of extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target tone for tone conversion, and outputting a second acoustic feature of the real voice corresponding to the target tone.
The first acoustic feature is an acoustic feature corresponding to real voice, and the second acoustic feature is an acoustic feature which is converted according to the first acoustic feature and corresponds to the real voice and the target tone.
It should be noted that the first acoustic feature does not refer to a single acoustic-related feature, but is a combined feature vector obtained by combining a plurality of preset acoustic-related features. The first acoustic feature is extracted using the open-source voice toolkit PyWorld. In this embodiment, extracting the first acoustic feature from the real voice includes:
calculating a first preset acoustic relevant feature and a second preset acoustic relevant feature in the real voice;
converting the second preset acoustic relevant features to obtain converted second preset acoustic relevant features; and
and combining and generating the first acoustic feature based on the first preset acoustic relevant feature and the converted second preset acoustic relevant feature.
The first preset acoustic-related features include the fundamental frequency (F0) and the aperiodic signal (AP), the second preset acoustic-related feature includes the spectral envelope, and the converted second preset acoustic-related feature is the mel-cepstrum.
In this embodiment, the fundamental frequency of the real voice is calculated using the DIO algorithm, and the spectral envelope of the real voice is calculated using the CheapTrick algorithm. In addition, the spectral envelope is converted into the mel-cepstrum using sp2mc of the pysptk voice toolkit. These acoustic-related features are spliced into a combined feature vector, which is input into the first conversion model as the first acoustic feature of the real voice; the output of the model is the combined feature vector corresponding to the converted target timbre, namely the second acoustic feature.
In this embodiment, the pre-trained first conversion model is a pix2pix model. It can be understood that, since the first acoustic feature is a one-dimensional feature sequence along the time axis, the two-dimensional convolution layers in the model are changed into one-dimensional convolution layers in this embodiment, so that the pix2pix model can be applied to one-dimensional acoustic feature conversion.
It should be noted that the first conversion model solves one-to-one voice conversion, i.e. conversion from speaker A to speaker B, and does not support multi-speaker conversion; that is, an A->B model can only convert from A to B, and each different A->B conversion model needs to be trained separately.
And a construction step of constructing a first spectrogram related to the real voice corresponding to the target timbre based on the second acoustic feature.
The first spectrogram is a spectrogram with low tone quality. It should be noted that, like the first acoustic feature, the second acoustic feature is a combined feature vector formed by combining a plurality of independent acoustic-related features.
In this embodiment, the constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature includes:
splitting the second acoustic feature to obtain a third preset acoustic related feature and a fourth preset acoustic related feature corresponding to the second acoustic feature;
converting the fourth preset acoustic relevant feature to obtain a converted fourth preset acoustic relevant feature; and
and taking the converted fourth preset acoustic related feature as the first spectrogram.
The third preset acoustic-related features include the fundamental frequency and the aperiodic signal corresponding to the second acoustic feature, the fourth preset acoustic-related feature includes the mel-cepstrum corresponding to the second acoustic feature, and the converted fourth preset acoustic-related feature is the spectral envelope converted from that mel-cepstrum.
Firstly, the fundamental frequency, the aperiodic signal, the mel-cepstrum and other acoustic-related features are split out of the combined feature vector of the second acoustic feature; then the extracted mel-cepstrum is converted with mc2sp of the pysptk voice toolkit to obtain the spectral envelope, which is used as the first spectrogram of the real voice.
It should be noted that the mel-cepstrum -> spectral envelope operation follows the same idea as the spectral envelope -> mel-cepstrum process; the conversion between the mel-cepstrum and the spectrum is realized by the Fourier transform in signal processing, which is not described here.
And a second conversion step of inputting the first spectrogram into a pre-trained second conversion model for voice quality conversion, and outputting a second spectrogram related to the real voice and corresponding to the target tone.
The second spectrogram is a spectrogram with high tone quality, and the second conversion model is used for converting the spectrogram with low tone quality into the spectrogram with high tone quality, so that a foundation is laid for subsequently outputting conversion voice with high tone quality.
In this embodiment, the second conversion model trained in advance is a pix2pix model. It should be noted that, in view of the first spectrogram, the first spectrogram includes: information such as time, frequency, amplitude, etc. can be regarded as a single-channel two-dimensional image, so the two-dimensional pix2pix model is used as the second conversion model in this embodiment.
And a restoring and feedback step, namely restoring the second spectrogram based on a voice reconstruction algorithm to obtain target voice related to the real voice and corresponding to the target tone, and feeding the target voice back to a user through the client.
After the second spectrogram with high tone quality is obtained, it needs to be converted into a voice signal and fed back to the user. Restoring the second spectrogram based on a voice reconstruction algorithm to obtain the target voice, related to the real voice, corresponding to the target tone includes:
acquiring a third preset acoustic relevant characteristic of the second acoustic characteristic;
synthesizing the third preset acoustic relevant feature and the second spectrogram by using a preset voice reconstruction algorithm to generate a voice signal corresponding to the second spectrogram; and
and taking the voice signal corresponding to the second spectrogram as target voice related to the real voice corresponding to the target tone.
In this embodiment, the synthesis method of the PyWorld toolkit (pyworld.synthesize) is used to directly synthesize the voice signal from acoustic-related features such as the fundamental frequency, the spectral envelope and the aperiodic signal.
Among them, WORLD is a vocoder-based speech synthesis tool, and the role of vocoder is mainly: extracting relevant parameters of the voice signal; and synthesizing final voice according to the related parameters. Some vocoders in the prior art are as follows:
STRAIGHT - can produce a high-quality synthesis result, but is slow;
Real-time STRAIGHT - simplifies the algorithm on the basis of STRAIGHT; although it is faster, the cost is a loss of performance;
TANDEM-STRAIGHT - has performance similar to STRAIGHT, but cannot achieve real-time synthesis;
Compared with TANDEM-STRAIGHT, WORLD reduces the computational complexity and achieves real-time synthesis while its performance remains unchanged.
In other embodiments, the pre-trained first conversion model training step includes:
acquiring a first preset number of voice data pairs of the original speaker (source speaker) and the target speaker (target speaker);
respectively extracting acoustic features of each voice data in the first preset number of voice data pairs, and generating the first preset number of acoustic feature pairs as sample data;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the one-dimensional pix2pix model by using the training set;
and calculating the loss value of the one-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the one-dimensional pix2pix model as a first conversion model corresponding to the target speaker.
The above-mentioned voice data pairs, i.e. paired voices (paired data) in which the original speaker A and the target speaker B speak the same content, do not need to be numerous; a small number (e.g. 80-100 pairs) is sufficient.
In the process of generating the sample data, each acoustic feature in an acoustic feature pair needs to be labeled with the corresponding speaker, such as the original speaker A or the target speaker B. The step of extracting the acoustic features of the voice data is consistent with the step of extracting the first acoustic feature in the above embodiment and is not repeated here. During training, the model input is the acoustic feature pair (acoustic feature of the original speaker A, acoustic feature of the target speaker) and the output is the acoustic feature of speech that imitates the timbre of the target speaker.
It should be noted that the convergence conditions corresponding to sample data collected in different environments are not exactly the same. Taking a voice training set recorded with a mobile phone in a quiet environment at a 16 kHz sampling rate as an example, when the generator loss is about 0.2 the model is close to stable and no longer converging, and training can be terminated at this point.
Training the first conversion model yields an acoustic feature conversion model. The goal of the first conversion model is to use a small amount of voice data to fit a conversion model that approximates the timbre of the target speaker and so complete the conversion from speaker A to speaker B, which alleviates the problem of poor model performance caused by the inability to obtain a large amount of voice data. To ensure that the conversion result is closer to the target timbre, the conversion effect can additionally be checked manually.
In other embodiments, the training step of the pre-trained second conversion model includes:
acquiring a second preset number of target-speaker acoustic features by using the first conversion model, and constructing first spectrograms based on these acoustic features;
acquiring a third preset number of pieces of target-speaker voice data, extracting the acoustic features of each piece of voice data, and constructing a third preset number of second spectrograms based on these acoustic features;
generating sample data based on the first spectrogram and the second spectrogram;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the two-dimensional pix2pix model by using the training set;
and calculating the loss value of the two-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the two-dimensional pix2pix model as a second conversion model.
In this embodiment, the steps of extracting the acoustic features and constructing the spectrograms are the same as the steps of extracting the first acoustic feature and constructing the first spectrogram in the above embodiment, and are not repeated here. It should be noted that, in order to improve the model accuracy, a large number of low-tone-quality and high-tone-quality spectrograms of the target speaker are required, so the second and third preset numbers need to be set to relatively large values, both much larger than the first preset number; the second and third preset numbers may be the same or different. The first, second and third preset numbers are preset values and can be adjusted according to the actual situation.
The first spectrogram is a low-tone-quality spectrogram and serves as the independent variable X in the sample data; the second spectrogram is a high-tone-quality spectrogram, which may also be called a full-tone-quality spectrogram, and serves as the dependent variable Y in the sample data. During model training the loss value of the model is calculated; when the loss value tends to be stable and no longer converges, training can be stopped.
The task of voice conversion from low tone quality to high tone quality aims to provide a conversion model from a low-tone-quality spectrogram to a high-tone-quality spectrogram. During model training, a large amount of data of the target speaker B (i.e., voice data of the conversion target) is prepared as high-tone-quality spectrogram data, and a (low quality -> high quality) data relationship is constructed between the output of the first conversion model (i.e., the low-tone-quality spectrograms) and the high-tone-quality spectrogram data. These data pairs are then fed into the constructed two-dimensional pix2pix model, which is trained to obtain a conversion model from low-tone-quality spectrograms to high-tone-quality spectrograms.
In other embodiments, to increase the training speed of the first transformation model, other structures of the one-dimensional pix2pix model may be adaptively adjusted.
In this embodiment, the U-Net structure of the pix2pix model is used. The convolution kernel size in the original structure is 4 x 4. In order to adapt to voice processing, when constructing the model structure, in addition to changing the convolution and deconvolution layers into one-dimensional ones, the convolution kernel sizes of the first down-sampling layer and the last up-sampling layer in the U-Net structure are changed to 3 x 3.
Taking the down sampling in the U-Net structure as an example, the original structure is as follows:
in_size=input_shape,out_size=64,kernel_size=4->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the structure after adjustment is as follows:
in_size=input_shape,out_size=64,kernel_size=3->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the above network structure is changed to adapt to the audio processing on one hand, and to accelerate the convergence speed of the model on the other hand, thereby increasing the training speed of the model.
It should be noted that the above changes of the network structure are also used in the second conversion model, and are not described herein.
The electronic device 1 provided in the above embodiment receives a conversion instruction carrying a real voice and a target timbre sent by a user, extracts a first acoustic feature from the real voice, inputs the first acoustic feature into a first conversion model to obtain a second acoustic feature, constructs a first spectrogram with low tone quality based on the second acoustic feature, inputs the first spectrogram into a second conversion model to obtain a second spectrogram with high tone quality, restores a voice signal from the second spectrogram to obtain a target voice corresponding to the target timbre, and feeds the target voice back to the client. 1. The voice conversion process is divided into two parts, and the first part only needs to achieve low-tone-quality voice conversion, so only a small number of voice data pairs are needed; this solves the problem of poor model training caused by the lack of sample data when a large number of voice data pairs cannot be obtained, and improves voice conversion efficiency. 2. The second conversion model converts the first spectrogram with low tone quality into a second spectrogram with high tone quality, laying the foundation for subsequent high-quality voice conversion. 3. The first conversion model and the second conversion model are built from the pix2pix model used in the field of image processing, and the model structure is changed correspondingly, so that it is adapted to audio processing, the model converges faster and model training is accelerated.
Alternatively, in other embodiments, the speech conversion program 10 may be divided into one or more modules, one or more modules being stored in the memory 11 and executed by the one or more processors 12 to implement the present invention, where a module refers to a series of computer program instruction segments capable of performing a specific function.
Referring to FIG. 3, which is a schematic diagram of the program modules of the voice conversion program 10 in FIG. 2, in this embodiment the voice conversion program 10 may be divided into modules 110 to 150; the functions or operation steps implemented by modules 110 to 150 are similar to those described above and are not detailed here, for example:
a receiving module 110, configured to receive a voice conversion instruction sent by a user through a client, where the voice conversion instruction includes a real voice to be converted and a target tone;
a first conversion module 120, configured to extract a first acoustic feature from the real voice, input the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target tone, perform tone conversion, and output a second acoustic feature of the real voice corresponding to the target tone;
a construction module 130, configured to construct a first spectrogram, corresponding to the target timbre, of the real speech based on the second acoustic feature;
a second conversion module 140, configured to input the first spectrogram into a second conversion model trained in advance for voice quality conversion, and output a second spectrogram, which corresponds to the target tone and is related to the real voice; and
and a restoring and feedback module 150, configured to restore the second spectrogram, obtain a target voice corresponding to the target tone and related to the real voice, and feed the target voice back to the user through the client.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a speech conversion program 10, and when the speech conversion program 10 is executed by a processor, any step in the speech conversion method is implemented, which is not described herein again.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A voice conversion method, applicable to an electronic device, characterized by comprising the following steps:
step S1, receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target timbre;
step S2, extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target timbre for timbre conversion, and outputting a second acoustic feature of the real voice corresponding to the target timbre;
step S3, constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature;
step S4, inputting the first spectrogram into a pre-trained second conversion model for sound quality conversion, and outputting a second spectrogram of the real voice corresponding to the target timbre; and
step S5, restoring the second spectrogram based on a voice reconstruction algorithm to obtain a target voice of the real voice corresponding to the target timbre, and feeding the target voice back to the user through the client.
2. The voice conversion method according to claim 1, wherein the extracting of the first acoustic feature from the real voice comprises:
calculating a first preset acoustic-related feature and a second preset acoustic-related feature of the real voice;
converting the second preset acoustic-related feature to obtain a converted second preset acoustic-related feature; and
generating the first acoustic feature by combining the first preset acoustic-related feature and the converted second preset acoustic-related feature.
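A minimal sketch of this extraction step, assuming WORLD-style acoustic analysis via the pyworld package and a mel-cepstrum conversion via pysptk, consistent with the feature definitions in claim 6 below; the cepstral order and warping coefficient are illustrative values, not fixed by the claim.

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_first_acoustic_feature(wave: np.ndarray, fs: int,
                                   order: int = 40, alpha: float = 0.42):
    wave = np.ascontiguousarray(wave, dtype=np.float64)
    # First preset acoustic-related features: fundamental frequency (f0) and
    # aperiodic information (ap); second preset acoustic-related feature:
    # spectral envelope (sp).
    f0, sp, ap = pw.wav2world(wave, fs)
    # Converted second preset acoustic-related feature: mel-cepstrum of the
    # spectral envelope, converted frame by frame.
    mcep = np.vstack([pysptk.sp2mc(frame, order, alpha) for frame in sp])
    # Combine per frame into the first acoustic feature.
    first_feature = np.hstack([f0[:, None], ap, mcep])
    return first_feature, f0, ap, sp
```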
3. The voice conversion method according to claim 1, wherein the constructing of the first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature comprises:
splitting the second acoustic feature to obtain a third preset acoustic-related feature and a fourth preset acoustic-related feature corresponding to the second acoustic feature;
converting the fourth preset acoustic-related feature to obtain a converted fourth preset acoustic-related feature; and
taking the converted fourth preset acoustic-related feature as the first spectrogram.
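A minimal sketch of this construction step, continuing the assumptions of the previous sketch: the converted second acoustic feature is laid out as [f0 | aperiodicity | mel-cepstrum], the mel-cepstrum (fourth preset acoustic-related feature) is converted back to a spectral envelope with pysptk, and that envelope is taken as the first spectrogram. The feature layout, FFT length, and warping coefficient are illustrative assumptions.

```python
import numpy as np
import pysptk

def build_first_spectrogram(second_feature: np.ndarray, ap_dim: int,
                            alpha: float = 0.42, fftlen: int = 1024):
    # Third preset acoustic-related feature: the converted fundamental
    # frequency and aperiodic information (kept for reconstruction).
    third = second_feature[:, :1 + ap_dim]
    # Fourth preset acoustic-related feature: the converted mel-cepstrum.
    fourth = second_feature[:, 1 + ap_dim:]
    # Convert the mel-cepstrum back to a spectral envelope, frame by frame,
    # and take it as the first spectrogram.
    first_spectrogram = np.vstack([pysptk.mc2sp(mc, alpha, fftlen) for mc in fourth])
    return third, first_spectrogram
```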
4. The voice conversion method according to claim 1, wherein the restoring of the second spectrogram based on a voice reconstruction algorithm to obtain a target voice of the real voice corresponding to the target timbre comprises:
acquiring a third preset acoustic-related feature of the second acoustic feature;
synthesizing the third preset acoustic-related feature and the second spectrogram by using a preset voice reconstruction algorithm to generate a voice signal corresponding to the second spectrogram; and
taking the voice signal corresponding to the second spectrogram as the target voice of the real voice corresponding to the target timbre.
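A minimal sketch of this restoring step, assuming the WORLD synthesizer (pyworld) as the preset voice reconstruction algorithm: the third preset acoustic-related feature supplies the fundamental frequency and aperiodicity, and the refined second spectrogram supplies the spectral envelope. The frame period is an illustrative value.

```python
import numpy as np
import pyworld as pw

def reconstruct_target_voice(f0: np.ndarray, ap: np.ndarray,
                             second_spectrogram: np.ndarray, fs: int,
                             frame_period: float = 5.0) -> np.ndarray:
    # WORLD expects float64, C-contiguous arrays.
    f0 = np.ascontiguousarray(f0, dtype=np.float64)
    ap = np.ascontiguousarray(ap, dtype=np.float64)
    sp = np.ascontiguousarray(second_spectrogram, dtype=np.float64)
    # Synthesize the voice signal corresponding to the second spectrogram.
    return pw.synthesize(f0, sp, ap, fs, frame_period)
```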
5. The voice conversion method according to any one of claims 1 to 4, wherein the first acoustic feature is a combined feature vector including the fundamental frequency, aperiodic information, and spectral envelope of the real voice, and the second acoustic feature is an acoustic feature obtained by performing timbre conversion on the first acoustic feature.
6. The voice conversion method according to claim 5, wherein the first preset acoustic-related feature comprises the fundamental frequency and aperiodic information of the real voice; the second preset acoustic-related feature comprises the spectral envelope of the real voice; and the converted second preset acoustic-related feature is a Mel cepstrum corresponding to the spectral envelope of the real voice.
7. The voice conversion method according to claim 6, wherein the first conversion model is a one-dimensional pix2pix model, and the training step of the first conversion model comprises:
acquiring a first preset number of voice data pairs of an original speaker and a target speaker;
extracting acoustic features from each piece of voice data in the first preset number of voice data pairs, and generating the first preset number of acoustic feature pairs as sample data;
dividing the sample data into a training set and a verification set according to a preset proportion, and training the one-dimensional pix2pix model by using the training set;
calculating a loss value of the one-dimensional pix2pix model, finishing the training when the loss value meets a preset condition, and determining the one-dimensional pix2pix model as the first conversion model corresponding to the target speaker.
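A minimal PyTorch sketch of one training step for such a model. Here "one-dimensional pix2pix" is read as a conditional GAN with 1-D convolutions over frame sequences of paired source/target acoustic features, optimized with an adversarial loss plus an L1 reconstruction term; the network sizes, loss weight, and training-step structure are illustrative assumptions rather than values given in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator1D(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=5, padding=2),
        )

    def forward(self, x):                       # x: (batch, feat_dim, frames)
        return self.net(x)

class Discriminator1D(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim * 2, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=5, padding=2),   # frame-wise real/fake scores
        )

    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

def train_step(gen, disc, opt_g, opt_d, src, tgt, l1_weight: float = 100.0):
    bce = nn.BCEWithLogitsLoss()
    # Discriminator: real source/target pairs vs. source/generated pairs.
    fake = gen(src).detach()
    real_scores, fake_scores = disc(src, tgt), disc(src, fake)
    d_loss = bce(real_scores, torch.ones_like(real_scores)) + \
             bce(fake_scores, torch.zeros_like(fake_scores))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator and stay close to the target features.
    fake = gen(src)
    scores = disc(src, fake)
    g_loss = bce(scores, torch.ones_like(scores)) + l1_weight * F.l1_loss(fake, tgt)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Under these assumptions, training would be stopped once the monitored loss value meets the preset condition (for example, falling below a threshold on the verification set), and the generator would then be kept as the first conversion model for that target speaker.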
8. The voice conversion method according to claim 7, wherein the second conversion model is a two-dimensional pix2pix model, and the training step of the second conversion model comprises:
acquiring acoustic features of a second preset number of target speakers by using the first conversion model, and constructing a large number of first spectrograms based on the acoustic features of the second preset number of target speakers;
acquiring voice data of a third preset number of target speakers, extracting acoustic features from each piece of voice data of the third preset number of target speakers, and constructing a third preset number of second spectrograms based on the acoustic features;
generating sample data based on the first spectrograms and the second spectrograms;
dividing the sample data into a training set and a verification set according to a preset proportion, and training the two-dimensional pix2pix model by using the training set;
calculating a loss value of the two-dimensional pix2pix model, finishing the training when the loss value meets a preset condition, and determining the two-dimensional pix2pix model as the second conversion model.
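A minimal sketch of assembling the spectrogram training pairs for the two-dimensional pix2pix model, reusing the helpers sketched under claims 2 and 3 and a trained first conversion model. It assumes, for illustration only, that both spectrograms of a pair are derived from the same target-speaker utterance so that they are frame-aligned; the 2-D network itself can follow the same conditional-GAN pattern as the 1-D sketch above, with Conv2d layers over time-frequency spectrogram patches.

```python
def build_spectrogram_pairs(target_waves, fs: int, first_model, ap_dim: int):
    """Return (first_spectrogram, second_spectrogram) training pairs."""
    pairs = []
    for wave in target_waves:
        # Second spectrogram: spectral envelope extracted directly from the
        # real target-speaker recording.
        first_feature, _, _, sp = extract_first_acoustic_feature(wave, fs)
        # First spectrogram: rebuilt from the acoustic features produced by
        # the first conversion model (the converted, lower-quality side).
        second_feature = first_model(first_feature)
        _, first_spectrogram = build_first_spectrogram(second_feature, ap_dim)
        pairs.append((first_spectrogram, sp))
    return pairs
```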
9. An electronic device comprising a memory and a processor, wherein the memory stores a speech conversion program operable on the processor, and the speech conversion program, when executed by the processor, performs the following steps:
receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target timbre;
extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target timbre for timbre conversion, and outputting a second acoustic feature of the real voice corresponding to the target timbre;
constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature;
inputting the first spectrogram into a pre-trained second conversion model for sound quality conversion, and outputting a second spectrogram of the real voice corresponding to the target timbre; and
restoring the second spectrogram based on a voice reconstruction algorithm to obtain a target voice of the real voice corresponding to the target timbre, and feeding the target voice back to the user through the client.
10. A computer-readable storage medium, comprising a speech conversion program which, when executed by a processor, implements the steps of the voice conversion method according to any one of claims 1 to 8.
CN202010063801.8A 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium Pending CN111261177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010063801.8A CN111261177A (en) 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010063801.8A CN111261177A (en) 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111261177A true CN111261177A (en) 2020-06-09

Family

ID=70949020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010063801.8A Pending CN111261177A (en) 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111261177A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186251B1 (en) * 2015-08-06 2019-01-22 Oben, Inc. Voice conversion using deep neural network with intermediate voice training
CN107767879A (en) * 2017-10-25 2018-03-06 北京奇虎科技有限公司 Audio conversion method and device based on tone color
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
JP2019168608A (en) * 2018-03-23 2019-10-03 カシオ計算機株式会社 Learning device, acoustic generation device, method, and program
CN109671442A (en) * 2019-01-14 2019-04-23 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu x vector
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HIROKAZU AKADOMARI et al.: "Comparison of the number of training data", 《2019 IEEE 8TH GLOBAL CONFERENCE ON CONSUMER ELECTRONICS (GCCE)》 *
MASANORI MORISE et al.: "WORLD: A Vocoder-Based High-Quality Speech Synthesis System", 《IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E99D (7)》 *
S. MIYAMOTO et al.: "Two-stage sequence-to-sequence neural voice conversion with low-to-high definition spectrogram mapping", 《RECENT ADVANCES IN INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112216293A (en) * 2020-08-28 2021-01-12 北京捷通华声科技股份有限公司 Tone conversion method and device
CN112164407A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Tone conversion method and device
CN112382297A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112652292A (en) * 2020-11-13 2021-04-13 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN112466275B (en) * 2020-11-30 2023-09-22 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
WO2023168813A1 (en) * 2022-03-09 2023-09-14 平安科技(深圳)有限公司 Timbre model construction method, timbre conversion method, apparatus, device, and medium

Similar Documents

Publication Publication Date Title
CN111261177A (en) Voice conversion method, electronic device and computer readable storage medium
US10553201B2 (en) Method and apparatus for speech synthesis
WO2022033327A1 (en) Video generation method and apparatus, generation model training method and apparatus, and medium and device
WO2020073944A1 (en) Speech synthesis method and device
JP2022137201A (en) Synthesis of speech from text in voice of target speaker using neural networks
CN110223705A (en) Phonetics transfer method, device, equipment and readable storage medium storing program for executing
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN105206257B (en) A kind of sound converting method and device
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN111433847A (en) Speech conversion method and training method, intelligent device and storage medium
US10854182B1 (en) Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
WO2022126904A1 (en) Voice conversion method and apparatus, computer device, and storage medium
CA3195582A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN113421584B (en) Audio noise reduction method, device, computer equipment and storage medium
WO2019218773A1 (en) Voice synthesis method and device, storage medium, and electronic device
CN113327594B (en) Speech recognition model training method, device, equipment and storage medium
CN116095357B (en) Live broadcasting method, device and system of virtual anchor
JP7360814B2 (en) Audio processing device and audio processing program
CN112951256B (en) Voice processing method and device
CN113066472B (en) Synthetic voice processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609