CN111261177A - Voice conversion method, electronic device and computer readable storage medium

Voice conversion method, electronic device and computer readable storage medium

Info

Publication number
CN111261177A
Authority
CN
China
Prior art keywords
voice
acoustic
conversion
spectrogram
target
Prior art date
Legal status
Pending
Application number
CN202010063801.8A
Other languages
Chinese (zh)
Inventor
马坤
赵之砚
施奕明
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010063801.8A priority Critical patent/CN111261177A/en
Publication of CN111261177A publication Critical patent/CN111261177A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/003 Changing voice quality, e.g. pitch or formants
    • G10L 21/007 Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L 21/013 Adapting to target pitch
    • G10L 2021/0135 Voice conversion or morphing
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 characterised by the type of extracted parameters
    • G10L 25/18 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L 25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The invention relates to voice processing technology and discloses a voice conversion method, which comprises the following steps: receiving a conversion instruction sent by a user that carries a real voice and a target timbre; extracting a first acoustic feature from the real voice; inputting the first acoustic feature into a first conversion model for timbre conversion to obtain a second acoustic feature; constructing a first spectrogram with low tone quality based on the second acoustic feature; inputting the first spectrogram into a second conversion model for tone quality conversion to obtain a second spectrogram with high tone quality; restoring a voice signal from the second spectrogram to obtain a target voice corresponding to the target timbre; and feeding the target voice back to the user. The invention also discloses an electronic device and a computer-readable storage medium. The invention can realize real-time, high-quality voice conversion.

Description

Voice conversion method, electronic device and computer readable storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech conversion method, an electronic device, and a computer-readable storage medium.
Background
Style transfer is an important emerging field in artificial intelligence; in particular, many advances have been made in the image domain, such as image-to-image translation and painting style transfer.
In the speech field, however, related research has progressed relatively little. At present, the voice conversion technique whose signal-to-waveform stage comes closest to a natural human voice uses WaveNet, which is autoregressive and must be learned and trained on all of the sample data; its sound quality is particularly good. However, this approach has the following problems: 1) a large amount of paired voice data of the user and the conversion target is needed, and in practical applications it is difficult to obtain enough paired voice data to support training, so the model performs poorly and high-quality converted voice cannot be obtained; 2) the training process is particularly slow because all of the sample data in the entire sample set must be learned.
Therefore, it is desirable to provide a method that can quickly produce high-quality converted speech.
Disclosure of Invention
In view of the foregoing, the present invention provides a voice conversion method, an electronic device and a computer readable storage medium, which mainly aims to realize real-time and high-quality voice conversion.
To achieve the above object, the present invention provides a voice conversion method, including:
step S1, receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target tone;
step S2, extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target tone for tone conversion, and outputting a second acoustic feature of the real voice corresponding to the target tone;
step S3, constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature;
step S4, inputting the first spectrogram into a pre-trained second conversion model for sound quality conversion, and outputting a second spectrogram related to the real voice and corresponding to the target tone; and
step S5, restoring the second spectrogram based on a voice reconstruction algorithm to obtain a target voice related to the real voice and corresponding to the target tone, and feeding the target voice back to the user through the client.
In addition, to achieve the above object, the present invention also provides an electronic device, including: the system comprises a memory and a processor, wherein the memory stores a voice conversion program which can run on the processor, and the voice conversion program can realize any step of the voice conversion method when being executed by the processor.
In addition, to achieve the above object, the present invention also provides a computer-readable storage medium, which includes a voice conversion program, and when the voice conversion program is executed by a processor, the voice conversion program can implement any step of the voice conversion method as described above.
The voice conversion method, the electronic device and the computer readable storage medium provided by the invention receive a conversion instruction carrying a real voice and a target tone sent by a user, extract a first acoustic feature from the real voice, input the first acoustic feature into a first conversion model to obtain a second acoustic feature, construct a first spectrogram with low tone quality based on the second acoustic feature, input the first spectrogram into a second conversion model to obtain a second spectrogram with high tone quality, and restore a voice signal from the second spectrogram to obtain the target voice corresponding to the target tone. 1. The voice conversion process is divided into two parts, and the first part only needs to achieve low-tone-quality voice conversion, so only a small number of voice data pairs are needed; this solves the problem of poor model training caused by the lack of sample data when a large number of voice data pairs cannot be obtained, and improves voice conversion efficiency. 2. The second conversion model converts the first spectrogram with low tone quality into a second spectrogram with high tone quality, laying the foundation for subsequent high-quality voice conversion. 3. The first conversion model and the second conversion model are built from the pix2pix model used in the field of image processing, and the model structure is changed correspondingly, so that it is adapted to audio processing, the model converges faster and model training is accelerated.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a voice conversion method according to the present invention;
FIG. 2 is a diagram of an electronic device according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of the program modules of the voice conversion program in FIG. 2.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a voice conversion method. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
Referring to fig. 1, a flow chart of a voice conversion method according to a preferred embodiment of the invention is shown.
In this embodiment, the method includes: step S1-step S5.
Step S1, receiving a voice conversion instruction sent by a user through a client, where the voice conversion instruction includes a real voice to be converted and a target tone.
The following describes embodiments of the present invention with an electronic device as the execution subject. The user sends a voice conversion instruction carrying the real voice and the target tone through the client; after receiving the voice conversion instruction, the electronic device extracts the real voice for voice conversion processing, converts the timbre of the real voice into the target tone, and feeds the converted voice segment back to the client.
The real voice is a voice segment in the user's own timbre, and the target tone is the timbre of a target person, where the target person is selected by the user from a specified set of target persons, such as cartoon characters, celebrities, and the like.
The client is provided with a voice acquisition unit, such as a microphone. An APP is installed on the client, and the user issues the voice conversion instruction through the client APP.
Step S2, extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target timbre for timbre conversion, and outputting a second acoustic feature of the real voice corresponding to the target timbre.
The first acoustic feature is an acoustic feature corresponding to real voice, and the second acoustic feature is an acoustic feature which is converted according to the first acoustic feature and corresponds to the real voice and the target tone.
It should be noted that the first acoustic feature does not refer to a single acoustic-related feature, but is a combined feature vector obtained by combining a plurality of preset acoustic-related features. The first acoustic feature is extracted using the open-source voice toolkit PyWorld. In this embodiment, extracting the first acoustic feature from the real voice includes:
calculating a first preset acoustic relevant feature and a second preset acoustic relevant feature in the real voice;
converting the second preset acoustic relevant features to obtain converted second preset acoustic relevant features; and
and combining and generating the first acoustic feature based on the first preset acoustic relevant feature and the converted second preset acoustic relevant feature.
The first preset acoustic-related features include the fundamental frequency (F0) and the aperiodic signal (AP), the second preset acoustic-related feature includes the spectral envelope, and the converted second preset acoustic-related feature is the mel-cepstrum.
In this embodiment, the fundamental frequency of the real voice is calculated using the DIO algorithm, and the spectral envelope of the real voice is calculated using the CheapTrick algorithm. In addition, the spectral envelope is converted into the mel-cepstrum using sp2mc of the pysptk voice toolkit. These acoustic-related features are spliced into a combined feature vector, which is input into the first conversion model as the first acoustic feature of the real voice; the output of the model is the combined feature vector corresponding to the converted target timbre, namely the second acoustic feature.
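For illustration, a minimal feature-extraction sketch in Python using the PyWorld and pysptk toolkits mentioned above might look as follows; the mel-cepstral order, the use of PyWorld's D4C estimator for the aperiodic signal, and the ordering inside the combined vector are assumptions of this sketch, not values fixed by this embodiment.
import numpy as np
import pyworld
import pysptk

def extract_first_acoustic_feature(wav, fs, mcep_order=24):
    """Build the per-frame combined feature vector (F0, aperiodic signal, mel-cepstrum)."""
    wav = wav.astype(np.float64)
    f0, t = pyworld.dio(wav, fs)                 # fundamental frequency via the DIO algorithm
    sp = pyworld.cheaptrick(wav, f0, t, fs)      # spectral envelope via the CheapTrick algorithm
    ap = pyworld.d4c(wav, f0, t, fs)             # aperiodic signal (PyWorld's D4C estimator)
    alpha = pysptk.util.mcepalpha(fs)            # frequency-warping factor for this sample rate
    mcep = pysptk.sp2mc(sp, order=mcep_order, alpha=alpha)  # spectral envelope -> mel-cepstrum
    # splice the individual acoustic-related features into one combined vector per frame
    return np.hstack([f0[:, None], ap, mcep])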
In this embodiment, the pre-trained first conversion model is a pix2pix model. It can be understood that, since the first acoustic feature is a one-dimensional feature sequence along the time axis, the two-dimensional convolution layers in the model are changed into one-dimensional convolution layers in this embodiment, so that the pix2pix model can be applied to one-dimensional acoustic feature conversion.
It should be noted that the first conversion model solves one-to-one voice conversion, i.e. conversion from speaker A to speaker B, and does not support multi-speaker conversion; that is, an A->B model can only convert from A to B, and each different A->B conversion model needs to be trained separately.
Step S3, constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature.
The first spectrogram is a spectrogram with low tone quality. It should be noted that, like the first acoustic feature, the second acoustic feature is a combined feature vector formed by combining a plurality of independent acoustic-related features.
In this embodiment, the constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature includes:
splitting the second acoustic feature to obtain a third preset acoustic related feature and a fourth preset acoustic related feature corresponding to the second acoustic feature;
converting the fourth preset acoustic relevant feature to obtain a converted fourth preset acoustic relevant feature; and
and taking the converted fourth preset acoustic related feature as the first spectrogram.
The third preset acoustic-related features include the fundamental frequency and the aperiodic signal corresponding to the second acoustic feature, the fourth preset acoustic-related feature includes the mel-cepstrum corresponding to the second acoustic feature, and the converted fourth preset acoustic-related feature is the spectral envelope converted from that mel-cepstrum.
Firstly, the fundamental frequency, the aperiodic signal, the mel-cepstrum and other acoustic-related features are split out of the combined feature vector of the second acoustic feature; then the extracted mel-cepstrum is converted with mc2sp of the pysptk voice toolkit to obtain the spectral envelope, which is used as the first spectrogram of the real voice.
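A minimal sketch of this splitting and mel-cepstrum-to-envelope conversion, assuming the same feature ordering and FFT size as in the extraction sketch above (both are illustrative assumptions); pysptk's mc2sp is the inverse of the sp2mc call used earlier.
import numpy as np
import pysptk

def build_first_spectrogram(second_feature, fs, fft_size=1024):
    """Split the converted combined feature vector and rebuild the (low-tone-quality) first spectrogram."""
    alpha = pysptk.util.mcepalpha(fs)
    n_ap = fft_size // 2 + 1
    f0 = second_feature[:, 0]                    # third preset feature: fundamental frequency
    ap = second_feature[:, 1:1 + n_ap]           # third preset feature: aperiodic signal
    mcep = second_feature[:, 1 + n_ap:]          # fourth preset feature: mel-cepstrum
    # mel-cepstrum -> spectral envelope, used as the first spectrogram of the real voice
    sp = pysptk.mc2sp(np.ascontiguousarray(mcep), alpha=alpha, fftlen=fft_size)
    return f0, ap, sp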
It should be noted that the mel-cepstrum -> spectral envelope operation follows the same idea as the spectral envelope -> mel-cepstrum process; the conversion between the mel-cepstrum and the spectrum is realized by the Fourier transform in signal processing, which is not described here.
And step S4, inputting the first spectrogram into a pre-trained second conversion model for voice quality conversion, and outputting a second spectrogram related to the real voice and corresponding to the target tone.
The second spectrogram is a spectrogram with high tone quality, and the second conversion model is used for converting the spectrogram with low tone quality into the spectrogram with high tone quality, so that a foundation is laid for subsequently outputting conversion voice with high tone quality.
In this embodiment, the second conversion model trained in advance is a pix2pix model. It should be noted that, in view of the first spectrogram, the first spectrogram includes: information such as time, frequency, amplitude, etc. can be regarded as a single-channel two-dimensional image, so the two-dimensional pix2pix model is used as the second conversion model in this embodiment.
And step S5, restoring the second spectrogram based on a voice reconstruction algorithm to obtain target voice related to the real voice and corresponding to the target tone, and feeding the target voice back to the user through the client.
After the second spectrogram with high tone quality is obtained, it needs to be converted into a voice signal and fed back to the user. Restoring the second spectrogram based on a voice reconstruction algorithm to obtain the target voice, related to the real voice, corresponding to the target tone includes:
acquiring a third preset acoustic relevant characteristic of the second acoustic characteristic;
synthesizing the third preset acoustic relevant feature and the second spectrogram by using a preset voice reconstruction algorithm to generate a voice signal corresponding to the second spectrogram; and
and taking the voice signal corresponding to the second spectrogram as target voice related to the real voice corresponding to the target tone.
In this embodiment, the synthesis method of the PyWorld toolkit (pyworld.synthesize) is used to directly synthesize the voice signal from acoustic-related features such as the fundamental frequency, the spectral envelope and the aperiodic signal.
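For reference, a minimal reconstruction sketch with PyWorld's synthesis routine; the frame period below is the toolkit's default value and the inputs are assumed to come from the sketches above.
import pyworld

def restore_target_voice(f0, high_quality_sp, ap, fs, frame_period=5.0):
    """WORLD-vocoder synthesis: F0 + high-tone-quality spectral envelope + aperiodic signal -> waveform."""
    return pyworld.synthesize(f0, high_quality_sp, ap, fs, frame_period)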
Among them, WORLD is a vocoder-based speech synthesis tool, and the role of vocoder is mainly: extracting relevant parameters of the voice signal; and synthesizing final voice according to the related parameters. Some vocoders in the prior art are as follows:
STRAIGHT - can produce a high-quality synthesis result, but is slow;
Real-time STRAIGHT - simplifies the algorithm on the basis of STRAIGHT; although it is faster, the cost is a loss of performance;
TANDEM-STRAIGHT - has performance similar to STRAIGHT, but cannot achieve real-time synthesis;
Compared with TANDEM-STRAIGHT, WORLD reduces the computational complexity and achieves real-time synthesis while its performance remains unchanged.
In other embodiments, the pre-trained first conversion model training step includes:
acquiring a first preset number of voice data pairs of the original speaker (source speaker) and the target speaker (target speaker);
respectively extracting acoustic features of each voice data in the first preset number of voice data pairs, and generating the first preset number of acoustic feature pairs as sample data;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the one-dimensional pix2pix model by using the training set;
and calculating the loss value of the one-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the one-dimensional pix2pix model as a first conversion model corresponding to the target speaker.
The above-mentioned voice data pairs, i.e. paired voices (paired data) in which the original speaker A and the target speaker B speak the same content, do not need to be numerous; a small number (e.g. 80-100 pairs) is sufficient.
In the process of generating the sample data, each acoustic feature in an acoustic feature pair needs to be labeled with the corresponding speaker, such as the original speaker A or the target speaker B. The step of extracting the acoustic features of the voice data is consistent with the step of extracting the first acoustic feature in the above embodiment and is not repeated here. During training, the model input is the acoustic feature pair (acoustic feature of the original speaker A, acoustic feature of the target speaker) and the output is the acoustic feature of speech that imitates the timbre of the target speaker.
It should be noted that the convergence conditions corresponding to sample data collected in different environments are not exactly the same. Taking a voice training set recorded with a mobile phone in a quiet environment at a 16 kHz sampling rate as an example, when the generator loss is about 0.2 the model is close to stable and no longer converging, and training can be terminated at this point.
Training the first conversion model yields an acoustic feature conversion model. The goal of the first conversion model is to use a small amount of voice data to fit a conversion model that approximates the timbre of the target speaker and so complete the conversion from speaker A to speaker B, which alleviates the problem of poor model performance caused by the inability to obtain a large amount of voice data. To ensure that the conversion result is closer to the target timbre, the conversion effect can additionally be checked manually.
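The sample preparation described above can be sketched as follows; the pairing logic and the 4:1 split follow this embodiment, while the helper signature (a feature-extraction callable such as the one sketched in step S2) is an assumption for illustration.
import random

def build_first_model_samples(paired_wavs, fs, extract_feature, split_ratio=0.8):
    """paired_wavs: list of (source_wav, target_wav) pairs with identical spoken content (e.g. 80-100 pairs);
    extract_feature: callable returning the acoustic feature of one utterance."""
    samples = []
    for src_wav, tgt_wav in paired_wavs:
        src_feat = extract_feature(src_wav, fs)   # acoustic feature of the original speaker A
        tgt_feat = extract_feature(tgt_wav, fs)   # acoustic feature of the target speaker B
        samples.append((src_feat, tgt_feat))      # one acoustic feature pair
    random.shuffle(samples)
    cut = int(len(samples) * split_ratio)         # preset proportion, e.g. 4:1
    return samples[:cut], samples[cut:]           # training set, verification set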
In other embodiments, the training step of the pre-trained second conversion model includes:
acquiring a second preset number of target-speaker acoustic features by using the first conversion model, and constructing first spectrograms based on these acoustic features;
acquiring a third preset number of pieces of target-speaker voice data, extracting the acoustic features of each piece of voice data, and constructing a third preset number of second spectrograms based on these acoustic features;
generating sample data based on the first spectrogram and the second spectrogram;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the two-dimensional pix2pix model by using the training set;
and calculating the loss value of the two-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the two-dimensional pix2pix model as a second conversion model.
In this embodiment, the steps of extracting the acoustic features and constructing the spectrograms are the same as the steps of extracting the first acoustic feature and constructing the first spectrogram in the above embodiment, and are not repeated here. It should be noted that, in order to improve the model accuracy, a large number of low-tone-quality and high-tone-quality spectrograms of the target speaker are required, so the second and third preset numbers need to be set to relatively large values, both much larger than the first preset number; the second and third preset numbers may be the same or different. The first, second and third preset numbers are preset values and can be adjusted according to the actual situation.
The first spectrogram is a low-tone-quality spectrogram and serves as the independent variable X in the sample data; the second spectrogram is a high-tone-quality spectrogram, which may also be called a full-tone-quality spectrogram, and serves as the dependent variable Y in the sample data. During model training the loss value of the model is calculated; when the loss value tends to be stable and no longer converges, training can be stopped.
The task of voice conversion from low tone quality to high tone quality aims to provide a conversion model from a low-tone-quality spectrogram to a high-tone-quality spectrogram. During model training, a large amount of data of the target speaker B (i.e., voice data of the conversion target) is prepared as high-tone-quality spectrogram data, and a (low quality -> high quality) data relationship is constructed between the output of the first conversion model (i.e., the low-tone-quality spectrograms) and the high-tone-quality spectrogram data. These data pairs are then fed into the constructed two-dimensional pix2pix model, which is trained to obtain a conversion model from low-tone-quality spectrograms to high-tone-quality spectrograms.
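As an illustration of how such a two-dimensional pix2pix model could be trained on the (low-tone-quality, high-tone-quality) spectrogram pairs, a single training step is sketched below in PyTorch; the framework, the conditional discriminator interface and the L1 weight are assumptions of this sketch, not details fixed by this embodiment.
import torch
import torch.nn.functional as F

def pix2pix_training_step(gen, disc, opt_g, opt_d, low_sp, high_sp, l1_weight=100.0):
    """gen: U-Net generator; disc: conditional discriminator taking (input, output) spectrogram pairs.
    low_sp / high_sp: tensors of shape (batch, 1, freq, time) treated as single-channel images."""
    # discriminator step: distinguish real pairs from generated pairs
    fake_sp = gen(low_sp)
    d_real = disc(low_sp, high_sp)
    d_fake = disc(low_sp, fake_sp.detach())
    loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()
    # generator step: fool the discriminator while staying close to the real spectrogram (L1 term)
    d_fake = disc(low_sp, fake_sp)
    loss_g = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + l1_weight * F.l1_loss(fake_sp, high_sp))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_g.item(), loss_d.item()
Training is stopped once the loss value stabilizes and no longer converges, as described above.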
In other embodiments, to increase the training speed of the first transformation model, other structures of the one-dimensional pix2pix model may be adaptively adjusted.
In this embodiment, the U-Net structure of the pix2pix model is used. The convolution kernel size in the original structure is 4 x 4. In order to adapt to voice processing, when constructing the model structure, in addition to changing the convolution and deconvolution layers into one-dimensional ones, the convolution kernel sizes of the first down-sampling layer and the last up-sampling layer in the U-Net structure are changed to 3 x 3.
Taking the down sampling in the U-Net structure as an example, the original structure is as follows:
in_size=input_shape,out_size=64,kernel_size=4->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the structure after adjustment is as follows:
in_size=input_shape,out_size=64,kernel_size=3->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the above network structure is changed to adapt to the audio processing on one hand, and to accelerate the convergence speed of the model on the other hand, thereby increasing the training speed of the model.
It should be noted that the above changes of the network structure are also used in the second conversion model, and are not described herein.
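A sketch of the adjusted down-sampling path in the one-dimensional case, assuming a PyTorch implementation; the normalization and activation choices follow common pix2pix implementations and are assumptions of this sketch, while the channel sizes and kernel sizes mirror the adjusted structure listed above.
import torch.nn as nn

def make_downsampling_path(input_channels):
    """Down-sampling half of the one-dimensional U-Net generator."""
    specs = [(input_channels, 64, 3),   # first layer: kernel size changed from 4 to 3
             (64, 128, 4), (128, 256, 4), (256, 512, 4),
             (512, 512, 4), (512, 512, 4), (512, 512, 4), (512, 512, 4)]
    layers = []
    for in_size, out_size, kernel_size in specs:
        layers += [nn.Conv1d(in_size, out_size, kernel_size, stride=2, padding=1),
                   nn.InstanceNorm1d(out_size),
                   nn.LeakyReLU(0.2, inplace=True)]
    return nn.Sequential(*layers)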
The voice conversion method provided in the above embodiment receives a conversion instruction carrying a real voice and a target tone sent by a user, extracts a first acoustic feature from the real voice, inputs the first acoustic feature into a first conversion model to obtain a second acoustic feature, constructs a first spectrogram with low tone quality based on the second acoustic feature, inputs the first spectrogram into a second conversion model to obtain a second spectrogram with high tone quality, and restores a voice signal from the second spectrogram to obtain a target voice corresponding to the target tone. 1. The voice conversion process is divided into two parts, and the first part only needs to achieve low-tone-quality voice conversion, so only a small number of voice data pairs are needed; this solves the problem of poor model training caused by the lack of sample data when a large number of voice data pairs cannot be obtained, and improves voice conversion efficiency. 2. The second conversion model converts the first spectrogram with low tone quality into a second spectrogram with high tone quality, laying the foundation for subsequent high-quality voice conversion. 3. The first conversion model and the second conversion model are built from the pix2pix model used in the field of image processing, and the model structure is changed correspondingly, so that it is adapted to audio processing, the model converges faster and model training is accelerated.
The invention also provides an electronic device. Fig. 2 is a schematic view of an electronic device according to a preferred embodiment of the invention.
In this embodiment, the electronic device 1 may be a server, a smart phone, a tablet computer, a portable computer, a desktop computer, or other terminal equipment with a data processing function, and the server may be a rack server, a blade server, a tower server, or a cabinet server.
The electronic device 1 includes a memory 11, a processor 12, and a network interface 13.
The memory 11 includes at least one type of readable storage medium, which includes a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, and the like. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1.
The memory 11 may also be an external storage device of the electronic apparatus 1 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic apparatus 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic apparatus 1.
The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the voice conversion program 10, but also to temporarily store data that has been output or is to be output.
Processor 12, which in some embodiments may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data Processing chip, executes program code or processes data stored in memory 11, such as voice conversion program 10.
The network interface 13 may optionally comprise a standard wired interface, a wireless interface (e.g. WI-FI interface), typically used for establishing a communication connection between the electronic apparatus 1 and other electronic devices, such as a client (not shown in the figure).
Fig. 2 only shows the electronic device 1 with the components 11-13, and it will be understood by a person skilled in the art that the structure shown in fig. 2 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
Optionally, the electronic device 1 may further comprise a user interface, the user interface may comprise a Display (Display), an input unit such as a Keyboard (Keyboard), and the optional user interface may further comprise a standard wired interface, a wireless interface.
Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch screen, or the like. The display, which may also be referred to as a display screen or display unit, is used for displaying information processed in the electronic apparatus 1 and for displaying a visualized user interface.
In the embodiment of the electronic device 1 shown in fig. 2, the memory 11 as a kind of computer storage medium stores the program code of the speech conversion program 10, and when the processor 12 executes the program code of the speech conversion program 10, the following steps are implemented:
and a receiving step, receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target tone.
The user sends a voice conversion instruction carrying the real voice and the target tone through the client; after receiving the voice conversion instruction, the electronic device extracts the real voice for voice conversion processing, converts the timbre of the real voice into the target tone, and feeds the converted voice segment back to the client.
The real voice is a voice segment in the user's own timbre, and the target tone is the timbre of a target person, where the target person is selected by the user from a specified set of target persons, such as cartoon characters, celebrities, and the like.
The client is provided with a voice acquisition unit, such as a microphone. An APP is installed on the client, and the user issues the voice conversion instruction through the client APP.
A first conversion step of extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target tone for tone conversion, and outputting a second acoustic feature of the real voice corresponding to the target tone.
The first acoustic feature is an acoustic feature corresponding to real voice, and the second acoustic feature is an acoustic feature which is converted according to the first acoustic feature and corresponds to the real voice and the target tone.
It should be noted that the first acoustic feature does not refer to a single acoustic-related feature, but is a combined feature vector obtained by combining a plurality of preset acoustic-related features. The first acoustic feature is extracted using the open-source voice toolkit PyWorld. In this embodiment, extracting the first acoustic feature from the real voice includes:
calculating a first preset acoustic relevant feature and a second preset acoustic relevant feature in the real voice;
converting the second preset acoustic relevant features to obtain converted second preset acoustic relevant features; and
and combining and generating the first acoustic feature based on the first preset acoustic relevant feature and the converted second preset acoustic relevant feature.
The first preset acoustic-related features include the fundamental frequency (F0) and the aperiodic signal (AP), the second preset acoustic-related feature includes the spectral envelope, and the converted second preset acoustic-related feature is the mel-cepstrum.
In this embodiment, the fundamental frequency of the real voice is calculated using the DIO algorithm, and the spectral envelope of the real voice is calculated using the CheapTrick algorithm. In addition, the spectral envelope is converted into the mel-cepstrum using sp2mc of the pysptk voice toolkit. These acoustic-related features are spliced into a combined feature vector, which is input into the first conversion model as the first acoustic feature of the real voice; the output of the model is the combined feature vector corresponding to the converted target timbre, namely the second acoustic feature.
In this embodiment, the pre-trained first conversion model is a pix2pix model. It can be understood that, since the first acoustic feature is a one-dimensional feature sequence along the time axis, the two-dimensional convolution layers in the model are changed into one-dimensional convolution layers in this embodiment, so that the pix2pix model can be applied to one-dimensional acoustic feature conversion.
It should be noted that the first conversion model solves one-to-one voice conversion, i.e. conversion from speaker A to speaker B, and does not support multi-speaker conversion; that is, an A->B model can only convert from A to B, and each different A->B conversion model needs to be trained separately.
And a construction step of constructing a first spectrogram related to the real voice corresponding to the target timbre based on the second acoustic feature.
The first spectrogram is a spectrogram with low tone quality. It should be noted that, like the first acoustic feature, the second acoustic feature is a combined feature vector formed by combining a plurality of independent acoustic-related features.
In this embodiment, the constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature includes:
splitting the second acoustic feature to obtain a third preset acoustic related feature and a fourth preset acoustic related feature corresponding to the second acoustic feature;
converting the fourth preset acoustic relevant feature to obtain a converted fourth preset acoustic relevant feature; and
and taking the converted fourth preset acoustic related feature as the first spectrogram.
The third preset acoustic-related features include the fundamental frequency and the aperiodic signal corresponding to the second acoustic feature, the fourth preset acoustic-related feature includes the mel-cepstrum corresponding to the second acoustic feature, and the converted fourth preset acoustic-related feature is the spectral envelope converted from that mel-cepstrum.
Firstly, the fundamental frequency, the aperiodic signal, the mel-cepstrum and other acoustic-related features are split out of the combined feature vector of the second acoustic feature; then the extracted mel-cepstrum is converted with mc2sp of the pysptk voice toolkit to obtain the spectral envelope, which is used as the first spectrogram of the real voice.
It should be noted that the mel-cepstrum -> spectral envelope operation follows the same idea as the spectral envelope -> mel-cepstrum process; the conversion between the mel-cepstrum and the spectrum is realized by the Fourier transform in signal processing, which is not described here.
And a second conversion step of inputting the first spectrogram into a pre-trained second conversion model for voice quality conversion, and outputting a second spectrogram related to the real voice and corresponding to the target tone.
The second spectrogram is a spectrogram with high tone quality, and the second conversion model is used for converting the spectrogram with low tone quality into the spectrogram with high tone quality, so that a foundation is laid for subsequently outputting conversion voice with high tone quality.
In this embodiment, the second conversion model trained in advance is a pix2pix model. It should be noted that, in view of the first spectrogram, the first spectrogram includes: information such as time, frequency, amplitude, etc. can be regarded as a single-channel two-dimensional image, so the two-dimensional pix2pix model is used as the second conversion model in this embodiment.
And a restoring and feedback step, namely restoring the second spectrogram based on a voice reconstruction algorithm to obtain target voice related to the real voice and corresponding to the target tone, and feeding the target voice back to a user through the client.
After the second spectrogram with high tone quality is obtained, it needs to be converted into a voice signal and fed back to the user. Restoring the second spectrogram based on a voice reconstruction algorithm to obtain the target voice, related to the real voice, corresponding to the target tone includes:
acquiring a third preset acoustic relevant characteristic of the second acoustic characteristic;
synthesizing the third preset acoustic relevant feature and the second spectrogram by using a preset voice reconstruction algorithm to generate a voice signal corresponding to the second spectrogram; and
and taking the voice signal corresponding to the second spectrogram as target voice related to the real voice corresponding to the target tone.
In this embodiment, the synthesis method of the PyWorld toolkit (pyworld.synthesize) is used to directly synthesize the voice signal from acoustic-related features such as the fundamental frequency, the spectral envelope and the aperiodic signal.
Among them, WORLD is a vocoder-based speech synthesis tool, and the role of vocoder is mainly: extracting relevant parameters of the voice signal; and synthesizing final voice according to the related parameters. Some vocoders in the prior art are as follows:
STRAIGHT - can produce a high-quality synthesis result, but is slow;
Real-time STRAIGHT - simplifies the algorithm on the basis of STRAIGHT; although it is faster, the cost is a loss of performance;
TANDEM-STRAIGHT - has performance similar to STRAIGHT, but cannot achieve real-time synthesis;
Compared with TANDEM-STRAIGHT, WORLD reduces the computational complexity and achieves real-time synthesis while its performance remains unchanged.
In other embodiments, the pre-trained first conversion model training step includes:
acquiring a first preset number of voice data pairs of the original speaker (source speaker) and the target speaker (target speaker);
respectively extracting acoustic features of each voice data in the first preset number of voice data pairs, and generating the first preset number of acoustic feature pairs as sample data;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the one-dimensional pix2pix model by using the training set;
and calculating the loss value of the one-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the one-dimensional pix2pix model as a first conversion model corresponding to the target speaker.
The above-mentioned voice data pairs, i.e. paired voices (paired data) in which the original speaker A and the target speaker B speak the same content, do not need to be numerous; a small number (e.g. 80-100 pairs) is sufficient.
In the process of generating the sample data, each acoustic feature in an acoustic feature pair needs to be labeled with the corresponding speaker, such as the original speaker A or the target speaker B. The step of extracting the acoustic features of the voice data is consistent with the step of extracting the first acoustic feature in the above embodiment and is not repeated here. During training, the model input is the acoustic feature pair (acoustic feature of the original speaker A, acoustic feature of the target speaker) and the output is the acoustic feature of speech that imitates the timbre of the target speaker.
It should be noted that the convergence conditions corresponding to sample data collected in different environments are not exactly the same. Taking a voice training set recorded with a mobile phone in a quiet environment at a 16 kHz sampling rate as an example, when the generator loss is about 0.2 the model is close to stable and no longer converging, and training can be terminated at this point.
Training the first conversion model yields an acoustic feature conversion model. The goal of the first conversion model is to use a small amount of voice data to fit a conversion model that approximates the timbre of the target speaker and so complete the conversion from speaker A to speaker B, which alleviates the problem of poor model performance caused by the inability to obtain a large amount of voice data. To ensure that the conversion result is closer to the target timbre, the conversion effect can additionally be checked manually.
In other embodiments, the training step of the pre-trained second conversion model includes:
acquiring a second preset number of target-speaker acoustic features by using the first conversion model, and constructing first spectrograms based on these acoustic features;
acquiring a third preset number of pieces of target-speaker voice data, extracting the acoustic features of each piece of voice data, and constructing a third preset number of second spectrograms based on these acoustic features;
generating sample data based on the first spectrogram and the second spectrogram;
dividing the sample data into a training set and a verification set according to a preset proportion (for example, 4:1), and training the two-dimensional pix2pix model by using the training set;
and calculating the loss value of the two-dimensional pix2pix model, finishing training when the loss value meets a preset condition, and determining the two-dimensional pix2pix model as a second conversion model.
In this embodiment, the steps of extracting the acoustic features and constructing the spectrograms are the same as the steps of extracting the first acoustic feature and constructing the first spectrogram in the above embodiment, and are not repeated here. It should be noted that, in order to improve the model accuracy, a large number of low-tone-quality and high-tone-quality spectrograms of the target speaker are required, so the second and third preset numbers need to be set to relatively large values, both much larger than the first preset number; the second and third preset numbers may be the same or different. The first, second and third preset numbers are preset values and can be adjusted according to the actual situation.
The first spectrogram is a low-tone-quality spectrogram and serves as the independent variable X in the sample data; the second spectrogram is a high-tone-quality spectrogram, which may also be called a full-tone-quality spectrogram, and serves as the dependent variable Y in the sample data. During model training the loss value of the model is calculated; when the loss value tends to be stable and no longer converges, training can be stopped.
The task of voice conversion from low tone quality to high tone quality aims to provide a conversion model from a low-tone-quality spectrogram to a high-tone-quality spectrogram. During model training, a large amount of data of the target speaker B (i.e., voice data of the conversion target) is prepared as high-tone-quality spectrogram data, and a (low quality -> high quality) data relationship is constructed between the output of the first conversion model (i.e., the low-tone-quality spectrograms) and the high-tone-quality spectrogram data. These data pairs are then fed into the constructed two-dimensional pix2pix model, which is trained to obtain a conversion model from low-tone-quality spectrograms to high-tone-quality spectrograms.
In other embodiments, to increase the training speed of the first transformation model, other structures of the one-dimensional pix2pix model may be adaptively adjusted.
In this embodiment, the U-Net structure of the pix2pix model is used. The convolution kernel size in the original structure is 4 x 4. In order to adapt to voice processing, when constructing the model structure, in addition to changing the convolution and deconvolution layers into one-dimensional ones, the convolution kernel sizes of the first down-sampling layer and the last up-sampling layer in the U-Net structure are changed to 3 x 3.
Taking the down sampling in the U-Net structure as an example, the original structure is as follows:
in_size=input_shape,out_size=64,kernel_size=4->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the structure after adjustment is as follows:
in_size=input_shape,out_size=64,kernel_size=3->
in_size=64,out_size=128,kernel_size=4->
in_size=128,out_size=256,kernel_size=4->
in_size=256,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4->
in_size=512,out_size=512,kernel_size=4
the above network structure is changed to adapt to the audio processing on one hand, and to accelerate the convergence speed of the model on the other hand, thereby increasing the training speed of the model.
It should be noted that the above changes of the network structure are also used in the second conversion model, and are not described herein.
The electronic device 1 provided in the above embodiment receives a conversion instruction carrying a real voice and a target timbre sent by a user, extracts a first acoustic feature from the real voice, inputs the first acoustic feature into a first conversion model to obtain a second acoustic feature, constructs a first spectrogram with low tone quality based on the second acoustic feature, inputs the first spectrogram into a second conversion model to obtain a second spectrogram with high tone quality, restores a voice signal from the second spectrogram to obtain a target voice corresponding to the target timbre, and feeds the target voice back to the client. 1. The voice conversion process is divided into two parts, and the first part only needs to achieve low-tone-quality voice conversion, so only a small number of voice data pairs are needed; this solves the problem of poor model training caused by the lack of sample data when a large number of voice data pairs cannot be obtained, and improves voice conversion efficiency. 2. The second conversion model converts the first spectrogram with low tone quality into a second spectrogram with high tone quality, laying the foundation for subsequent high-quality voice conversion. 3. The first conversion model and the second conversion model are built from the pix2pix model used in the field of image processing, and the model structure is changed correspondingly, so that it is adapted to audio processing, the model converges faster and model training is accelerated.
Alternatively, in other embodiments, the speech conversion program 10 may be divided into one or more modules, one or more modules being stored in the memory 11 and executed by the one or more processors 12 to implement the present invention, where a module refers to a series of computer program instruction segments capable of performing a specific function.
Referring to FIG. 3, which is a schematic diagram of the program modules of the voice conversion program 10 in FIG. 2, in this embodiment the voice conversion program 10 may be divided into modules 110 to 150; the functions or operation steps implemented by modules 110 to 150 are similar to those described above and are not detailed here, for example:
a receiving module 110, configured to receive a voice conversion instruction sent by a user through a client, where the voice conversion instruction includes a real voice to be converted and a target tone;
a first conversion module 120, configured to extract a first acoustic feature from the real voice, input the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target tone, perform tone conversion, and output a second acoustic feature of the real voice corresponding to the target tone;
a construction module 130, configured to construct a first spectrogram, corresponding to the target timbre, of the real speech based on the second acoustic feature;
a second conversion module 140, configured to input the first spectrogram into a second conversion model trained in advance for voice quality conversion, and output a second spectrogram, which corresponds to the target tone and is related to the real voice; and
and a restoring and feedback module 150, configured to restore the second spectrogram, obtain a target voice corresponding to the target tone and related to the real voice, and feed the target voice back to the user through the client.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a speech conversion program 10, and when the speech conversion program 10 is executed by a processor, any step in the speech conversion method is implemented, which is not described herein again.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A voice conversion method, applicable to an electronic device, characterized by comprising the following steps:
step S1, receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target timbre;
step S2, extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target timbre for timbre conversion, and outputting a second acoustic feature of the real voice corresponding to the target timbre;
step S3, constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature;
step S4, inputting the first spectrogram into a pre-trained second conversion model for sound quality conversion, and outputting a second spectrogram of the real voice corresponding to the target timbre; and
step S5, restoring the second spectrogram based on a voice reconstruction algorithm to obtain a target voice of the real voice corresponding to the target timbre, and feeding the target voice back to the user through the client.
2. The voice conversion method according to claim 1, wherein the extracting of the first acoustic feature from the real voice comprises:
calculating a first preset acoustic-related feature and a second preset acoustic-related feature of the real voice;
converting the second preset acoustic-related feature to obtain a converted second preset acoustic-related feature; and
generating the first acoustic feature by combining the first preset acoustic-related feature and the converted second preset acoustic-related feature.
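A minimal sketch of this extraction step, assuming WORLD-style acoustic analysis via the pyworld package and a mel-cepstrum conversion via pysptk, consistent with the feature definitions in claim 6 below; the cepstral order and warping coefficient are illustrative values, not fixed by the claim.

```python
import numpy as np
import pyworld as pw
import pysptk

def extract_first_acoustic_feature(wave: np.ndarray, fs: int,
                                   order: int = 40, alpha: float = 0.42):
    wave = np.ascontiguousarray(wave, dtype=np.float64)
    # First preset acoustic-related features: fundamental frequency (f0) and
    # aperiodic information (ap); second preset acoustic-related feature:
    # spectral envelope (sp).
    f0, sp, ap = pw.wav2world(wave, fs)
    # Converted second preset acoustic-related feature: mel-cepstrum of the
    # spectral envelope, converted frame by frame.
    mcep = np.vstack([pysptk.sp2mc(frame, order, alpha) for frame in sp])
    # Combine per frame into the first acoustic feature.
    first_feature = np.hstack([f0[:, None], ap, mcep])
    return first_feature, f0, ap, sp
```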
3. The voice conversion method according to claim 1, wherein the constructing of the first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature comprises:
splitting the second acoustic feature to obtain a third preset acoustic-related feature and a fourth preset acoustic-related feature corresponding to the second acoustic feature;
converting the fourth preset acoustic-related feature to obtain a converted fourth preset acoustic-related feature; and
taking the converted fourth preset acoustic-related feature as the first spectrogram.
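A minimal sketch of this construction step, continuing the assumptions of the previous sketch: the converted second acoustic feature is laid out as [f0 | aperiodicity | mel-cepstrum], the mel-cepstrum (fourth preset acoustic-related feature) is converted back to a spectral envelope with pysptk, and that envelope is taken as the first spectrogram. The feature layout, FFT length, and warping coefficient are illustrative assumptions.

```python
import numpy as np
import pysptk

def build_first_spectrogram(second_feature: np.ndarray, ap_dim: int,
                            alpha: float = 0.42, fftlen: int = 1024):
    # Third preset acoustic-related feature: the converted fundamental
    # frequency and aperiodic information (kept for reconstruction).
    third = second_feature[:, :1 + ap_dim]
    # Fourth preset acoustic-related feature: the converted mel-cepstrum.
    fourth = second_feature[:, 1 + ap_dim:]
    # Convert the mel-cepstrum back to a spectral envelope, frame by frame,
    # and take it as the first spectrogram.
    first_spectrogram = np.vstack([pysptk.mc2sp(mc, alpha, fftlen) for mc in fourth])
    return third, first_spectrogram
```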
4. The voice conversion method according to claim 1, wherein the restoring of the second spectrogram based on a voice reconstruction algorithm to obtain a target voice of the real voice corresponding to the target timbre comprises:
acquiring a third preset acoustic-related feature of the second acoustic feature;
synthesizing the third preset acoustic-related feature and the second spectrogram by using a preset voice reconstruction algorithm to generate a voice signal corresponding to the second spectrogram; and
taking the voice signal corresponding to the second spectrogram as the target voice of the real voice corresponding to the target timbre.
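A minimal sketch of this restoring step, assuming the WORLD synthesizer (pyworld) as the preset voice reconstruction algorithm: the third preset acoustic-related feature supplies the fundamental frequency and aperiodicity, and the refined second spectrogram supplies the spectral envelope. The frame period is an illustrative value.

```python
import numpy as np
import pyworld as pw

def reconstruct_target_voice(f0: np.ndarray, ap: np.ndarray,
                             second_spectrogram: np.ndarray, fs: int,
                             frame_period: float = 5.0) -> np.ndarray:
    # WORLD expects float64, C-contiguous arrays.
    f0 = np.ascontiguousarray(f0, dtype=np.float64)
    ap = np.ascontiguousarray(ap, dtype=np.float64)
    sp = np.ascontiguousarray(second_spectrogram, dtype=np.float64)
    # Synthesize the voice signal corresponding to the second spectrogram.
    return pw.synthesize(f0, sp, ap, fs, frame_period)
```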
5. The voice conversion method according to any one of claims 1 to 4, wherein the first acoustic feature is a combined feature vector including the fundamental frequency, aperiodic information, and spectral envelope of the real voice, and the second acoustic feature is an acoustic feature obtained by performing timbre conversion on the first acoustic feature.
6. The voice conversion method according to claim 5, wherein the first preset acoustic-related feature comprises the fundamental frequency and aperiodic information of the real voice; the second preset acoustic-related feature comprises the spectral envelope of the real voice; and the converted second preset acoustic-related feature is a Mel cepstrum corresponding to the spectral envelope of the real voice.
7. The voice conversion method according to claim 6, wherein the first conversion model is a one-dimensional pix2pix model, and the training step of the first conversion model comprises:
acquiring a first preset number of voice data pairs of an original speaker and a target speaker;
extracting acoustic features from each piece of voice data in the first preset number of voice data pairs, and generating the first preset number of acoustic feature pairs as sample data;
dividing the sample data into a training set and a verification set according to a preset proportion, and training the one-dimensional pix2pix model by using the training set;
calculating a loss value of the one-dimensional pix2pix model, finishing the training when the loss value meets a preset condition, and determining the one-dimensional pix2pix model as the first conversion model corresponding to the target speaker.
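A minimal PyTorch sketch of one training step for such a model. Here "one-dimensional pix2pix" is read as a conditional GAN with 1-D convolutions over frame sequences of paired source/target acoustic features, optimized with an adversarial loss plus an L1 reconstruction term; the network sizes, loss weight, and training-step structure are illustrative assumptions rather than values given in the claim.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator1D(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, feat_dim, kernel_size=5, padding=2),
        )

    def forward(self, x):                       # x: (batch, feat_dim, frames)
        return self.net(x)

class Discriminator1D(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(feat_dim * 2, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=5, padding=2),   # frame-wise real/fake scores
        )

    def forward(self, src, tgt):
        return self.net(torch.cat([src, tgt], dim=1))

def train_step(gen, disc, opt_g, opt_d, src, tgt, l1_weight: float = 100.0):
    bce = nn.BCEWithLogitsLoss()
    # Discriminator: real source/target pairs vs. source/generated pairs.
    fake = gen(src).detach()
    real_scores, fake_scores = disc(src, tgt), disc(src, fake)
    d_loss = bce(real_scores, torch.ones_like(real_scores)) + \
             bce(fake_scores, torch.zeros_like(fake_scores))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator and stay close to the target features.
    fake = gen(src)
    scores = disc(src, fake)
    g_loss = bce(scores, torch.ones_like(scores)) + l1_weight * F.l1_loss(fake, tgt)
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Under these assumptions, training would be stopped once the monitored loss value meets the preset condition (for example, falling below a threshold on the verification set), and the generator would then be kept as the first conversion model for that target speaker.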
8. The voice conversion method according to claim 7, wherein the second conversion model is a two-dimensional pix2pix model, and the training step of the second conversion model comprises:
acquiring acoustic features of a second preset number of target speakers by using the first conversion model, and constructing a large number of first spectrograms based on the acoustic features of the second preset number of target speakers;
acquiring voice data of a third preset number of target speakers, extracting acoustic features from each piece of voice data of the third preset number of target speakers, and constructing a third preset number of second spectrograms based on the acoustic features;
generating sample data based on the first spectrograms and the second spectrograms;
dividing the sample data into a training set and a verification set according to a preset proportion, and training the two-dimensional pix2pix model by using the training set;
calculating a loss value of the two-dimensional pix2pix model, finishing the training when the loss value meets a preset condition, and determining the two-dimensional pix2pix model as the second conversion model.
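A minimal sketch of assembling the spectrogram training pairs for the two-dimensional pix2pix model, reusing the helpers sketched under claims 2 and 3 and a trained first conversion model. It assumes, for illustration only, that both spectrograms of a pair are derived from the same target-speaker utterance so that they are frame-aligned; the 2-D network itself can follow the same conditional-GAN pattern as the 1-D sketch above, with Conv2d layers over time-frequency spectrogram patches.

```python
def build_spectrogram_pairs(target_waves, fs: int, first_model, ap_dim: int):
    """Return (first_spectrogram, second_spectrogram) training pairs."""
    pairs = []
    for wave in target_waves:
        # Second spectrogram: spectral envelope extracted directly from the
        # real target-speaker recording.
        first_feature, _, _, sp = extract_first_acoustic_feature(wave, fs)
        # First spectrogram: rebuilt from the acoustic features produced by
        # the first conversion model (the converted, lower-quality side).
        second_feature = first_model(first_feature)
        _, first_spectrogram = build_first_spectrogram(second_feature, ap_dim)
        pairs.append((first_spectrogram, sp))
    return pairs
```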
9. An electronic device comprising a memory and a processor, wherein the memory stores a speech conversion program operable on the processor, and the speech conversion program, when executed by the processor, performs the following steps:
receiving a voice conversion instruction sent by a user through a client, wherein the voice conversion instruction comprises real voice to be converted and a target timbre;
extracting a first acoustic feature from the real voice, inputting the first acoustic feature of the real voice into a pre-trained first conversion model corresponding to the target timbre for timbre conversion, and outputting a second acoustic feature of the real voice corresponding to the target timbre;
constructing a first spectrogram of the real voice corresponding to the target timbre based on the second acoustic feature;
inputting the first spectrogram into a pre-trained second conversion model for sound quality conversion, and outputting a second spectrogram of the real voice corresponding to the target timbre; and
restoring the second spectrogram based on a voice reconstruction algorithm to obtain a target voice of the real voice corresponding to the target timbre, and feeding the target voice back to the user through the client.
10. A computer-readable storage medium, comprising a speech conversion program which, when executed by a processor, implements the steps of the voice conversion method according to any one of claims 1 to 8.
CN202010063801.8A 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium Pending CN111261177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010063801.8A CN111261177A (en) 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010063801.8A CN111261177A (en) 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111261177A true CN111261177A (en) 2020-06-09

Family

ID=70949020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010063801.8A Pending CN111261177A (en) 2020-01-19 2020-01-19 Voice conversion method, electronic device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111261177A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10186251B1 (en) * 2015-08-06 2019-01-22 Oben, Inc. Voice conversion using deep neural network with intermediate voice training
CN107767879A (en) * 2017-10-25 2018-03-06 北京奇虎科技有限公司 Audio conversion method and device based on tone color
CN108461079A (en) * 2018-02-02 2018-08-28 福州大学 A kind of song synthetic method towards tone color conversion
JP2019168608A (en) * 2018-03-23 2019-10-03 カシオ計算機株式会社 Learning device, acoustic generation device, method, and program
CN109671442A (en) * 2019-01-14 2019-04-23 南京邮电大学 Multi-to-multi voice conversion method based on STARGAN Yu x vector
CN110136690A (en) * 2019-05-22 2019-08-16 平安科技(深圳)有限公司 Phoneme synthesizing method, device and computer readable storage medium
CN110223705A (en) * 2019-06-12 2019-09-10 腾讯科技(深圳)有限公司 Phonetics transfer method, device, equipment and readable storage medium storing program for executing
CN110619885A (en) * 2019-08-15 2019-12-27 西北工业大学 Method for generating confrontation network voice enhancement based on deep complete convolution neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HIROKAZU AKADOMARI et al.: "Comparison of the number of training data", 《2019 IEEE 8TH GLOBAL CONFERENCE ON CONSUMER ELECTRONICS (GCCE)》 *
MASANORI MORISE et al.: "WORLD: A Vocoder-Based High-Quality Speech Synthesis System", 《IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E99D (7)》 *
S. MIYAMOTO et al.: "Two-stage sequence-to-sequence neural voice conversion with low-to-high definition spectrogram mapping", 《RECENT ADVANCES IN INTELLIGENT INFORMATION HIDING AND MULTIMEDIA SIGNAL PROCESSING》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112216293A (en) * 2020-08-28 2021-01-12 北京捷通华声科技股份有限公司 Tone conversion method and device
CN112164407A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Tone conversion method and device
CN112382297A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112652292A (en) * 2020-11-13 2021-04-13 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112466275A (en) * 2020-11-30 2021-03-09 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
CN112466275B (en) * 2020-11-30 2023-09-22 北京百度网讯科技有限公司 Voice conversion and corresponding model training method, device, equipment and storage medium
WO2023168813A1 (en) * 2022-03-09 2023-09-14 平安科技(深圳)有限公司 Timbre model construction method, timbre conversion method, apparatus, device, and medium

Similar Documents

Publication Publication Date Title
CN111261177A (en) Voice conversion method, electronic device and computer readable storage medium
US10553201B2 (en) Method and apparatus for speech synthesis
WO2022033327A1 (en) Video generation method and apparatus, generation model training method and apparatus, and medium and device
WO2020073944A1 (en) Speech synthesis method and device
JP2022137201A (en) Synthesis of speech from text in voice of target speaker using neural networks
CN110223705A (en) Phonetics transfer method, device, equipment and readable storage medium storing program for executing
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN105206257B (en) A kind of sound converting method and device
WO2018223727A1 (en) Voiceprint recognition method, apparatus and device, and medium
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN111433847A (en) Speech conversion method and training method, intelligent device and storage medium
US10854182B1 (en) Singing assisting system, singing assisting method, and non-transitory computer-readable medium comprising instructions for executing the same
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
WO2022126904A1 (en) Voice conversion method and apparatus, computer device, and storage medium
CA3195582A1 (en) Audio generator and methods for generating an audio signal and training an audio generator
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN113421584B (en) Audio noise reduction method, device, computer equipment and storage medium
WO2019218773A1 (en) Voice synthesis method and device, storage medium, and electronic device
CN113327594B (en) Speech recognition model training method, device, equipment and storage medium
CN116095357B (en) Live broadcasting method, device and system of virtual anchor
JP7360814B2 (en) Audio processing device and audio processing program
CN112951256B (en) Voice processing method and device
CN113066472B (en) Synthetic voice processing method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200609