CN113539239A - Voice conversion method, device, storage medium and electronic equipment - Google Patents


Info

Publication number
CN113539239A
Authority
CN
China
Prior art keywords: sample, cross-language, voice, feature representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110785424.3A
Other languages
Chinese (zh)
Other versions
CN113539239B (en)
Inventor
詹皓粤
林悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202110785424.3A
Publication of CN113539239A
Application granted
Publication of CN113539239B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to the field of speech processing, and in particular to a voice conversion method, apparatus, storage medium, and electronic device. The voice conversion method includes: obtaining original voice data and preset timbre information; extracting a cross-language feature representation and an emotion feature representation of the original voice data; and performing voice conversion based on the cross-language feature representation, the emotion feature representation, and the timbre information to obtain target voice data. The voice conversion method provided by the disclosure addresses cross-language, multi-timbre voice conversion while preserving emotion.

Description

Voice conversion method, device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of speech processing, and in particular, to a speech conversion method, apparatus, storage medium, and electronic device.
Background
In recent years, the rapid progress of machine learning, and in particular research in the deep learning field, has driven great changes in human-computer interaction, and more and more commercial products have been brought to market. As a new mode of interaction, voice not only provides a brand-new user experience but also broadens the design ideas and application scenarios of many products; in addition, to protect personal privacy in the Internet era, speech is commonly subjected to emotion voice conversion processing.
Currently, two emotion voice conversion approaches are common: one is to record and produce a corpus containing multiple emotions, but its use is limited to the emotion types in the corpus and it does not generalize; the other is to record and produce a small corpus with several fixed emotions and then perform voice emotion conversion, but the emotion of the converted speech is hard to control and conversion often fails.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure provides a voice conversion method and apparatus, a storage medium, and an electronic device, aiming to solve the problem of cross-language, multi-timbre voice conversion while preserving emotion.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of an embodiment of the present disclosure, there is provided a voice conversion method including: acquiring original voice data and preset tone information; extracting cross-language feature representation and emotion feature representation of the original voice data; and performing voice conversion based on the cross-language feature representation, the emotion feature representation and the tone color information to obtain target voice data.
According to some embodiments of the present disclosure, based on the foregoing scheme, extracting a cross-language feature representation of the original speech data includes: carrying out feature extraction on the original voice data to obtain audio features; and inputting the audio features into a pre-trained cross-language feature extraction model to obtain the cross-language feature representation.
According to some embodiments of the present disclosure, based on the foregoing solution, the method further includes pre-training the cross-language feature extraction model, including: acquiring a voice sample and a text sample corresponding to contents; performing feature extraction on the voice sample to obtain sample audio features, and performing text processing on the text sample to obtain sample cross-language features; and performing model training by using the sample audio features and the sample cross-language features to obtain the cross-language feature extraction model.
According to some embodiments of the present disclosure, based on the foregoing scheme, performing text processing on the text sample to obtain the sample cross-language feature includes: converting the text sample into a text character set represented by unified characters according to a preset mapping relation between text content and the unified characters; and obtaining the sample cross-language feature based on the text character set.
According to some embodiments of the present disclosure, based on the foregoing solution, when the text sample includes a language type, the performing text processing on the text sample to obtain a sample cross-language feature includes: determining a language type of the text sample; converting the text sample into a text phoneme set according to the mapping relation between the text content and the phonemes of the language type; and obtaining the sample cross-language features based on the text phoneme set.
According to some embodiments of the present disclosure, based on the foregoing scheme, extracting an emotional feature representation of the raw speech data includes: extracting emotion information of the original voice data; converting the emotion information into a feature vector as the emotion feature representation.
According to some embodiments of the present disclosure, based on the foregoing scheme, the performing voice conversion based on the cross-language feature representation, the emotion feature representation, and the timbre information to obtain target voice data includes: and inputting the cross-language feature representation, the emotion feature representation and the tone information into a pre-trained voice conversion model to obtain the output target voice data.
According to some embodiments of the present disclosure, based on the foregoing scheme, the method further includes pre-training the speech conversion model, including: acquiring a voice sample, a converted voice sample corresponding to the voice sample and preset sample tone information; extracting sample cross-language feature representation of the voice sample by utilizing a pre-trained cross-language feature extraction model; and extracting a sample emotional feature representation of the speech sample; and performing model training by using the sample cross-language feature representation, the sample emotion feature representation, the converted voice sample and the sample tone information to obtain the voice conversion model.
According to a second aspect of the embodiments of the present disclosure, there is provided a speech conversion apparatus including: the device comprises a preparation module, a voice processing module and a voice processing module, wherein the preparation module is used for acquiring original voice data and preset tone information; the extraction module is used for extracting cross-language feature representation and emotion feature representation of the original voice data; and the conversion module is used for carrying out voice conversion on the basis of the cross-language feature representation, the emotion feature representation and the tone color information so as to obtain target voice data.
According to a third aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a speech conversion method as in the above embodiments.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus, including: one or more processors; a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the speech conversion method as in the above embodiments.
Exemplary embodiments of the present disclosure may have some or all of the following benefits:
In the technical solutions provided by some embodiments of the present disclosure, the cross-language feature representation and emotion feature representation of the original voice data are extracted, and the original voice data is then converted into target voice data according to the extracted cross-language feature representation, the emotion feature representation, and preset timbre information. First, because a cross-language feature representation is extracted, the language of the original voice data is not limited and no mixed-language speech corpus needs to be built in advance, which simplifies preparation for conversion. Second, extracting an emotion feature representation preserves the emotional characteristics of the speech, so the converted target voice data restores the original speech as faithfully as possible. Third, timbre information can be preset to obtain target voice data with the desired timbre, enabling emotional voice conversion under different timbres.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty. In the drawings:
FIG. 1 schematically illustrates a flow chart of a method of speech conversion in an exemplary embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram for training a cross-language feature extraction model in an exemplary embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow chart for training a speech conversion model in an exemplary embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow chart of a method of speech conversion in an exemplary embodiment of the present disclosure;
fig. 5 schematically illustrates a composition diagram of a voice conversion apparatus in an exemplary embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of a computer-readable storage medium in an exemplary embodiment of the disclosure;
fig. 7 schematically shows a structural diagram of a computer system of an electronic device in an exemplary embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the subject matter of the present disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Implementation details of the technical solution of the embodiments of the present disclosure are set forth in detail below.
In recent years, the rapid progress of machine learning, and in particular research in the deep learning field, has driven great changes in human-computer interaction, and more and more commercial products have been brought to market. As a new mode of interaction, voice not only provides a brand-new user experience but also broadens the design ideas and application scenarios of many products; meanwhile, balancing data exploitation with personal privacy protection will remain a recurring theme throughout the information era.
Existing emotion voice conversion approaches fall into two categories:
First, recording and producing a corpus that contains several emotions. The problems with this approach are that it is limited to the emotions present in the corpus, the conversion effect is tightly bound to the speaker's timbre so that other timbres cannot achieve a similar emotional conversion effect, and it generally does not transfer to other languages.
Second, recording and producing small corpora with several fixed emotions and then performing voice emotion conversion on corpora of other, single styles. The problems with this approach are that, because of timbre and emotion-expression differences between corpora, the emotion of the converted speech does not always match the target emotion, i.e. controllability is weak; because of insufficient data, conversion often fails; and it generally does not transfer to other languages.
In contrast, the voice conversion method provided by the present disclosure can realize cross-language voice conversion without collecting mixed-language speech data from the speaker in advance, preserves the emotion of the original speech during conversion, keeps the timbre stable after conversion, and yields more emotional and engaging speech, achieving an effect similar to a 'voice skin'.
Fig. 1 schematically illustrates a flow chart of a voice conversion method in an exemplary embodiment of the present disclosure. As shown in fig. 1, the voice conversion method includes steps S1 to S3:
step S1, acquiring original voice data and preset tone information;
step S2, extracting cross-language feature representation and emotion feature representation of the original voice data;
and step S3, performing voice conversion based on the cross-language feature representation, the emotion feature representation and the tone color information to obtain target voice data.
In the technical solutions provided by some embodiments of the present disclosure, the cross-language feature representation and emotion feature representation of the original voice data are extracted, and the original voice data is then converted into target voice data according to the extracted cross-language feature representation, the emotion feature representation, and preset timbre information. First, because a cross-language feature representation is extracted, the language of the original voice data is not limited and no mixed-language speech corpus needs to be built in advance, which simplifies preparation for conversion. Second, extracting an emotion feature representation preserves the emotional characteristics of the speech, so the converted target voice data restores the original speech as faithfully as possible. Third, timbre information can be preset to obtain target voice data with the desired timbre, enabling emotional voice conversion under different timbres.
Hereinafter, each step of the voice conversion method in the present exemplary embodiment will be described in more detail with reference to the drawings and the examples.
In step S1, the original voice data and the preset tone information are obtained.
In one embodiment of the present disclosure, the original voice data is the audio to be converted. It may include voice data from the same person or from several different persons, and it may consist of one or more audio clips.
When the original voice data is acquired, the audio to be converted can be fed into an ASR (automatic speech recognition) model, i.e. a system that automatically converts speech into characters or text that a computer can understand.
The timbre information is a timbre identifier for the target voice data; multiple timbres are available for selection, and the timbre information can be customized according to the desired result of the conversion. Note that when presetting the timbre information, different audio segments of the target voice data can be assigned different timbres, yielding a multi-timbre conversion result.
And step S2, extracting cross-language feature representation and emotion feature representation of the original voice data.
In an embodiment of the present disclosure, step S2 specifically comprises: S21, extracting the cross-language feature representation of the original voice data; and S22, extracting the emotion feature representation of the original voice data.
For step S21, extracting the cross-language feature representation of the original speech data, the specific process includes: carrying out feature extraction on the original voice data to obtain audio features; and inputting the audio features into a pre-trained cross-language feature extraction model to obtain the cross-language feature representation.
When extracting the cross-language feature representation, a pre-trained cross-language feature extraction model needs to be used, so before step S21, the cross-language feature extraction model needs to be trained, and the specific process is as follows: acquiring a voice sample and a text sample corresponding to contents; performing feature extraction on the voice sample to obtain sample audio features, and performing text processing on the text sample to obtain sample cross-language features; and performing model training by using the sample audio features and the sample cross-language features to obtain the cross-language feature extraction model.
Specifically, firstly, a pair of a voice sample and a text sample is obtained, and the contents of the voice sample and the text sample correspond to each other; then, extracting the characteristics of the voice sample, and performing text processing on the text sample; and finally, training a cross-language feature extraction model by using the extraction and processing results through a machine learning method.
FIG. 2 schematically illustrates a flowchart of training a cross-language feature extraction model in an exemplary embodiment of the disclosure. The following will describe the process of training the cross-language feature extraction model in detail with reference to fig. 2:
s201, extracting a voice sample and a text sample corresponding to contents from a corpus;
in one embodiment of the present disclosure, the corpus may be pre-recorded and produced. The corpus comprises paired voices and texts, paired voices and texts of different users can be collected, and audio can be recorded and then converted into corresponding characters.
S202, performing feature extraction on the voice sample to obtain sample audio features;
in one embodiment of the present disclosure, the feature extraction module may be designed to perform feature extraction on speech samples. The module aims to extract input features suitable for different languages, whether the extracted audio features are completely unrelated to the regional language languages of speakers and the semantic content of voice is reserved, so that the final voice conversion effect is determined.
It should be noted that feature extraction mainly extracts key features from the audio for matching against the text-processing result. The extracted features need to meet the following requirements:
1) Distinguishability: for audio with the same pronunciation the features should lie as close together in feature space as possible, while for audio with different pronunciations they should lie as far apart as possible;
2) Robustness: speakers may be in a variety of complex environments, so for the same spoken content the extracted features must resist environmental interference;
3) Separability: speaker verification is optional, so speaker information and speech-content information should be separable within the features; if speaker verification is not required, speaker-related features can be discarded.
There are many implementations in the art. Besides common speech audio features such as MFCC (Mel-frequency cepstral coefficients), FBank (filter-bank features) and the spectrogram, features can also be extracted with unsupervised neural networks, pre-trained network models, and the like.
Taking MFCC as an example, it is a cepstral parameter extracted in the Mel-scale frequency domain and is widely used in automatic speech and speaker recognition. Pre-emphasis, framing and Hanning windowing are applied to the audio samples, a short-time Fourier transform then yields a linear spectrum, which is passed through a Mel filter bank to obtain a Mel spectrum; taking the logarithm followed by a discrete cosine transform finally gives the MFCC features.
FBank features come from a front-end processing pipeline that treats audio in a way similar to the human ear, which can improve speech recognition performance. The usual steps to obtain the FBank features of a speech signal are pre-emphasis, framing, windowing, short-time Fourier transform (STFT), Mel filtering, mean removal, and so on.
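As a concrete illustration of these two feature types, the sketch below extracts FBank (log-Mel) and MFCC features with librosa; the sample rate, frame lengths and feature dimensions are illustrative assumptions rather than values fixed by this disclosure.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=16000, n_mels=80, n_mfcc=13):
    """Extract FBank (log-Mel) and MFCC features from an audio file.

    Sample rate, filter-bank size and MFCC order are illustrative defaults.
    """
    y, sr = librosa.load(wav_path, sr=sr)
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])            # pre-emphasis
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160,            # 25 ms frames, 10 ms hop
        window="hann", n_mels=n_mels)
    fbank = np.log(mel + 1e-6)                            # log-Mel spectrum = FBank
    mfcc = librosa.feature.mfcc(S=fbank, n_mfcc=n_mfcc)   # DCT of the log-Mel spectrum
    return fbank.T, mfcc.T                                # shape: (frames, dims)
```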
S203, performing text processing on the text sample to obtain a sample cross-language feature;
the text processing of the text sample aims to unify the input representations of different languages, and the unified input representation is used for assisting the voice conversion of different languages.
In one embodiment of the present disclosure, the text processing may use the commonly used International Phonetic Alphabet (IPA) as the unified representation. The specific process includes: converting the text sample into a text character set represented by unified characters according to a preset mapping relation between text content and the unified characters; and obtaining the sample cross-language features based on the text character set.
Concretely, special characters such as digits and letters in texts of different languages are first normalized, the text is mapped to phonemes using a dictionary, and the phonemes are then mapped to an International Phonetic Alphabet representation according to a custom dictionary; the resulting phonetic symbols are taken as text characters and assembled into the sample cross-language features. The sample cross-language features are represented as vectors whose length is related to the length of the speech.
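The following toy sketch illustrates the kind of phoneme-to-unified-character mapping described above. The dictionaries here are invented for illustration only; the actual custom mapping tables are not disclosed.

```python
# Hypothetical mapping tables; the patent relies on custom dictionaries that
# are not published, so every entry below is illustrative only.
PINYIN_TO_IPA = {"ni3": ["n", "i"], "hao3": ["x", "a", "u"]}                  # Mandarin
ARPABET_TO_IPA = {"HH": ["h"], "AH0": ["ə"], "L": ["l"], "OW1": ["o", "ʊ"]}   # English

def to_unified_characters(phonemes, table):
    """Map language-specific phonemes into the shared IPA character space."""
    unified = []
    for p in phonemes:
        unified.extend(table.get(p, [p]))   # fall back to the raw symbol
    return unified

# Chinese "你好" -> ["ni3", "hao3"] and English "hello" -> ["HH", "AH0", "L", "OW1"]
# both end up in the same unified IPA character set, so a single model can
# consume text from either language.
print(to_unified_characters(["ni3", "hao3"], PINYIN_TO_IPA))                  # ['n', 'i', 'x', 'a', 'u']
print(to_unified_characters(["HH", "AH0", "L", "OW1"], ARPABET_TO_IPA))       # ['h', 'ə', 'l', 'o', 'ʊ']
```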
Note that because text in any language is expressed with the same standard phonetic symbols, the language of the text is not restricted: multiple languages can be handled, and single-language input is unaffected.
In an embodiment of the present disclosure, when the language text in the corpus is in a single language, that is, the text sample includes a language type, the performing text processing on the text sample to obtain sample cross-language features includes: determining a language type of the text sample; converting the text sample into a text phoneme set according to the mapping relation between the text content and the phonemes of the language type; and obtaining the sample cross-language features based on the text phoneme set.
Specifically, for text in a single language, the International Phonetic Alphabet need not be used; instead, the phoneme set fixed for that language can be used to simplify extraction of the sample cross-language features. For example, if the text is Chinese the phonemes are Pinyin, and if the text is Japanese the phonemes are Hiragana.
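As an illustration of the single-language case, the snippet below converts Chinese text to tone-marked Pinyin with the third-party pypinyin package; the choice of tool is an assumption, not something specified by this disclosure.

```python
from pypinyin import lazy_pinyin, Style   # third-party grapheme-to-phoneme tool for Chinese

def chinese_text_to_phonemes(text):
    """Convert a single-language (Chinese) text sample to tone-marked pinyin 'phonemes'."""
    return lazy_pinyin(text, style=Style.TONE3)

print(chinese_text_to_phonemes("你好"))   # ['ni3', 'hao3']
```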
Accordingly, because the speech conversion method provided by the disclosure extracts cross-language features with the trained cross-language feature extraction model, mixed-language speech data of the same speaker does not need to be recorded when building the corpus: to convert speech in different languages into emotional speech of the target speaker, only single-language speech recognition data needs to be collected.
S204, performing model training by using the sample audio features and the sample cross-language features to obtain the cross-language feature extraction model.
In the training stage, the sample audio features extracted from the audio and the sample cross-language features extracted from the corresponding text are fed into a model for training; through information decoupling, compression and similar mechanisms the model learns to produce a cross-language feature representation, and the cross-language feature extraction model is finally obtained.
For example, model training can use a convolutional neural network (CNN) and a recurrent neural network (RNN); the characters obtained from text processing are used during training, a classification loss function (such as cross entropy) is set for gradient-descent optimization, and the dimensionality of the network's internal vectors is compressed to achieve information decoupling and compression. After training, the model parameters of the cross-language feature extraction model are obtained; these parameters are mainly the coefficients of the matrices multiplied inside the CNN and RNN.
The model's input is audio features, such as the MFCC features described above; the intermediate outputs obtained by repeatedly applying the matrix coefficients of each layer of the cross-language feature extraction model are finally mapped to the character ids produced by text processing. Because a loss between this output and the character ids must be computed during training, the character ids also need to be mapped to vectors during training.
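A minimal PyTorch sketch of such a CNN + RNN extractor is shown below: it predicts unified character ids with a cross-entropy loss and keeps a narrow bottleneck layer as the cross-language feature representation. All layer sizes, the frame-level supervision, and the use of a GRU are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class CrossLingualExtractor(nn.Module):
    """CNN + RNN network trained to predict unified character ids;
    the low-dimensional bottleneck is kept as the cross-language feature."""

    def __init__(self, feat_dim=80, bottleneck_dim=64, num_chars=100):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU())
        self.rnn = nn.GRU(256, 256, batch_first=True, bidirectional=True)
        self.bottleneck = nn.Linear(512, bottleneck_dim)   # compressed representation
        self.classifier = nn.Linear(bottleneck_dim, num_chars)

    def forward(self, feats):                  # feats: (batch, frames, feat_dim)
        x = self.cnn(feats.transpose(1, 2)).transpose(1, 2)
        x, _ = self.rnn(x)
        bottleneck = self.bottleneck(x)        # cross-language feature representation
        logits = self.classifier(bottleneck)   # per-frame character predictions
        return logits, bottleneck

# One training step (frame-level character targets assumed for simplicity):
# logits, _ = model(feats)
# loss = nn.CrossEntropyLoss()(logits.reshape(-1, num_chars), char_ids.reshape(-1))
# loss.backward(); optimizer.step()
```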
It should be noted that the cross-language feature extraction model trained in the present disclosure is not limited to a specific model, and common machine learning models can be used, such as a neural network based on deep learning, a support vector machine, and the like.
After the cross-language feature extraction model is trained, it can be used to extract the cross-language feature representation. Note that although the model's input is audio features and its output is character ids, only the cross-language feature representation is needed, i.e. the intermediate output is extracted.
Therefore, when step S21 is executed, the audio features of the original speech data are extracted first, using the same feature extraction module as in training; the audio features are then input into the cross-language feature extraction model to obtain the cross-language feature representation.
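Continuing the sketch above, at inference time only the intermediate (bottleneck) output is kept and the character predictions are discarded:

```python
# Assuming `model` is a trained CrossLingualExtractor and `fbank` is the
# (frames, 80) feature matrix from the extraction sketch above:
model.eval()
with torch.no_grad():
    _, cross_lingual_repr = model(torch.from_numpy(fbank).float().unsqueeze(0))
# cross_lingual_repr has shape (1, frames, bottleneck_dim); the character
# logits are discarded, only the intermediate output is kept.
```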
The cross-language features encode the semantics, prosody and other information of the speech; extracting them is the key to making the method applicable to voice conversion across different languages.
For step S22, extracting an emotional feature representation of the raw speech data includes: extracting emotion information of the original voice data; converting the emotion information into a feature vector as the emotion feature representation.
In one embodiment of the present disclosure, the emotion feature representation is extracted by first extracting the emotion information present in the speech and then converting it into a fixed-length feature vector. There are many possible implementations, for example using common speech features such as fundamental frequency and energy, or using speech emotion classification features. Extracting emotional features is the key to making the converted speech more emotional.
Taking the fundamental frequency as an example, the pitch period is the inverse of the vocal-fold vibration frequency: when a person produces voiced sound, the airflow through the vocal tract makes the vocal folds vibrate, and the period of that vibration is the pitch period. Estimating the pitch period is called pitch detection. The fundamental frequency carries many features that characterize speech emotion and is of great importance in speech emotion recognition.
The fundamental frequency varies over a wide range, roughly 50-500 Hz, and is difficult to detect. Commonly used fundamental-frequency extraction methods include the autocorrelation function (ACF) and the average magnitude difference function (AMDF), which work in the time domain, and wavelet methods, which work in the frequency domain.
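As one possible realization, the sketch below builds a fixed-length emotion feature vector from fundamental-frequency and energy statistics using librosa's YIN implementation (an autocorrelation-style detector); the 50-500 Hz search range follows the text above, while the choice of summary statistics is an illustrative assumption.

```python
import librosa
import numpy as np

def emotion_feature_vector(wav_path, sr=16000):
    """Fixed-length emotion feature vector from F0 and energy statistics."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0 = librosa.yin(y, fmin=50, fmax=500, sr=sr, frame_length=1024)   # per-frame F0
    energy = librosa.feature.rms(y=y, frame_length=1024, hop_length=256)[0]
    voiced = f0[(f0 > 50) & (f0 < 500)]        # crude removal of silence / octave errors
    return np.array([voiced.mean(), voiced.std(),
                     energy.mean(), energy.std()], dtype=np.float32)
```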
It should be noted that the present disclosure does not specifically limit the execution sequence of steps S21 and S22, and step S21 may be executed first, step S22 may be executed first, or both steps may be executed simultaneously.
And step S3, performing voice conversion based on the cross-language feature representation, the emotion feature representation and the tone color information to obtain target voice data.
In one embodiment of the present disclosure, the process of step S3 is: and inputting the cross-language feature representation, the emotion feature representation and the tone information into a pre-trained voice conversion model to obtain the output target voice data.
Specifically, a pre-trained speech conversion model is used for speech conversion, and the model is input with cross-language feature representation, emotion feature representation and timbre information, and output with target speech data converted from original speech data.
The process of pre-training the speech conversion model is as follows: acquiring a voice sample, a converted voice sample corresponding to the voice sample and preset sample tone information; extracting sample cross-language feature representation of the voice sample by utilizing a pre-trained cross-language feature extraction model; and extracting a sample emotional feature representation of the speech sample; and performing model training by using the sample cross-language feature representation, the sample emotion feature representation, the converted voice sample and the sample tone information to obtain the voice conversion model.
Fig. 3 is a schematic flow chart of training the speech conversion model in an exemplary embodiment of the disclosure. With reference to fig. 3, the training process is as follows: step S301, obtaining a voice sample; step S302, obtaining the converted voice sample the voice sample is expected to become; step S303, marking the timbre information of the voice sample; step S304, performing feature extraction on the voice sample to obtain sample audio features; step S305, inputting the sample audio features into the cross-language feature extraction model; step S306, obtaining the sample cross-language feature representation output by the model; step S307, performing emotion extraction on the voice sample; step S308, obtaining the sample emotion feature representation of the voice sample; step S309, inputting the sample cross-language feature representation, the sample emotion feature representation, the converted voice sample and the timbre label into a VC (Voice Conversion) model for training, where VC denotes the voice conversion technique that converts the speech of the original speaker into the speech of another speaker.
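A schematic PyTorch sketch of such a conversion model is given below: it conditions on the sample cross-language representation, the emotion vector and a timbre id, and regresses the acoustic features of the converted voice sample. The architecture, dimensions and L1 loss are assumptions for illustration; the disclosure does not prescribe a specific network.

```python
import torch
import torch.nn as nn

class VoiceConversionModel(nn.Module):
    """Decoder mapping (cross-language repr, emotion vector, timbre id)
    to acoustic features of the converted speech (illustrative layout)."""

    def __init__(self, bottleneck_dim=64, emo_dim=4, n_speakers=10, mel_dim=80):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, 64)        # timbre information
        self.rnn = nn.GRU(bottleneck_dim + emo_dim + 64, 512, batch_first=True)
        self.out = nn.Linear(512, mel_dim)

    def forward(self, cross_repr, emo_vec, spk_id):
        frames = cross_repr.size(1)
        emo = emo_vec.unsqueeze(1).expand(-1, frames, -1)           # broadcast per frame
        spk = self.spk_emb(spk_id).unsqueeze(1).expand(-1, frames, -1)
        x, _ = self.rnn(torch.cat([cross_repr, emo, spk], dim=-1))
        return self.out(x)                                          # predicted mel frames

# One training step against the target (converted) voice sample's mel features:
# pred = vc_model(cross_repr, emo_vec, spk_id)
# loss = nn.L1Loss()(pred, target_mel); loss.backward(); optimizer.step()
```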
The trained voice conversion model can be embedded as the vocoder model of a TTS (text-to-speech) system. A TTS system automatically converts text that a computer can understand into speech, and the vocoder is the component of a speech synthesis system that converts frequency-domain acoustic features into time-domain speech samples.
Fig. 4 schematically shows a flow chart of a voice conversion method in an exemplary embodiment of the present disclosure. Referring to fig. 4, voice conversion first requires step S401, obtaining the original voice data, and step S402, marking the timbre information; feature extraction is then performed on the original voice data in step S403 to obtain audio features, the audio features are input into the cross-language feature extraction model in step S404, and the cross-language feature representation output by the model is obtained in step S405; in parallel, emotion extraction is performed on the original voice data in step S406, and the emotion feature representation of the original voice data is obtained in step S407; finally, in step S408, the cross-language feature representation, the emotion feature representation and the timbre information are input into the VC voice conversion model, and the target voice data is obtained in step S409.
It should be noted that once the speech conversion model is trained, its inputs are the cross-language feature representation, the emotion feature representation and the timbre information; different input combinations can therefore be supplied without new original speech, yielding the corresponding converted speech. After a conversion result is obtained, any one of the inputs can be changed to obtain a new conversion result.
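Building on the sketches above, re-rendering the same utterance with a different preset timbre only requires changing the timbre id fed to the conversion model (a neural vocoder, not shown, would turn the predicted features into a waveform):

```python
# cross_lingual_repr: (1, frames, 64) from the extractor sketch;
# emo_vec: the 4-dim numpy vector from the emotion sketch above.
emo = torch.from_numpy(emo_vec).float().unsqueeze(0)
with torch.no_grad():
    mel_a = vc_model(cross_lingual_repr, emo, torch.tensor([3]))  # preset timbre id 3
    mel_b = vc_model(cross_lingual_repr, emo, torch.tensor([7]))  # same content and emotion, new timbre
# mel_a / mel_b would then be passed to a vocoder to synthesize the target waveforms.
```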
Based on this method, mixed-language speech data of the same speaker is not needed before conversion: to convert speech in different languages into emotional speech of a target speaker, only single-language speech recognition data needs to be collected. The converted speech retains the emotional characteristics of the original speech in addition to its semantics. The method applies to any language, and changing the emotion feature representation allows conversion with different emotions. It can thus realize cross-language emotional voice conversion, improve the voice interaction experience, broaden the design ideas and application scenarios of products, and lay a good foundation for building a closed-loop voice interaction system.
Fig. 5 schematically illustrates a composition diagram of a speech conversion apparatus in an exemplary embodiment of the disclosure, and as shown in fig. 5, the speech conversion apparatus 500 may include a preparation module 501, an extraction module 502, and a conversion module 503. Wherein:
a preparation module 501, configured to obtain original voice data and preset tone information;
an extraction module 502, configured to extract cross-language feature representation and emotion feature representation of the original voice data;
a conversion module 503, configured to perform voice conversion based on the cross-language feature representation, the emotion feature representation, and the timbre information to obtain target voice data.
According to an exemplary embodiment of the present disclosure, the extracting module 502 includes a first extracting module, configured to perform feature extraction on the original voice data to obtain an audio feature; and inputting the audio features into a pre-trained cross-language feature extraction model to obtain the cross-language feature representation.
According to an exemplary embodiment of the present disclosure, the speech conversion apparatus 500 further includes a first training module (not shown in the figure) for obtaining a speech sample and a text sample corresponding to contents; performing feature extraction on the voice sample to obtain sample audio features, and performing text processing on the text sample to obtain sample cross-language features; and performing model training by using the sample audio features and the sample cross-language features to obtain the cross-language feature extraction model.
According to an exemplary embodiment of the present disclosure, the first training module includes a text processing unit, configured to convert the text sample into a text character set represented by a unicode according to a mapping relationship between preset text content and the unicode; and obtaining the sample cross-language features based on the text character set.
According to an exemplary embodiment of the present disclosure, the text processing unit is further configured to determine a language type of the text sample when the text sample includes one language type; converting the text sample into a text phoneme set according to the mapping relation between the text content and the phonemes of the language type; and obtaining the sample cross-language features based on the text phoneme set.
According to an exemplary embodiment of the present disclosure, the extraction module 502 includes a second extraction module for extracting emotion information of the original voice data; converting the emotion information into a feature vector as the emotion feature representation.
According to an exemplary embodiment of the present disclosure, the conversion module 503 is configured to input the cross-language feature representation, the emotion feature representation, and the timbre information into a pre-trained speech conversion model to obtain the output target speech data.
According to an exemplary embodiment of the present disclosure, the voice conversion apparatus 500 further includes a second training module (not shown in the figure) for obtaining a voice sample and a converted voice sample corresponding to the voice sample, and preset sample tone information; extracting sample cross-language feature representation of the voice sample by utilizing a pre-trained cross-language feature extraction model; and extracting a sample emotional feature representation of the speech sample; and performing model training by using the sample cross-language feature representation, the sample emotion feature representation, the converted voice sample and the sample tone information to obtain the voice conversion model.
The details of each module in the voice converting apparatus 500 are already described in detail in the corresponding voice converting method, and therefore are not described herein again.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
In an exemplary embodiment of the present disclosure, there is also provided a storage medium capable of implementing the above-described method. Fig. 6 schematically illustrates a schematic diagram of a computer-readable storage medium in an exemplary embodiment of the disclosure, and as shown in fig. 6, a program product 600 for implementing the above method according to an embodiment of the disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a mobile phone. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided. Fig. 7 schematically shows a structural diagram of a computer system of an electronic device in an exemplary embodiment of the disclosure.
It should be noted that the computer system 700 of the electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of the application of the embodiments of the present disclosure.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for system operation are also stored. The CPU 701, the ROM702, and the RAM 703 are connected to each other via a bus 704. An Input/Output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a Network interface card such as a LAN (Local Area Network) card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, the processes described below with reference to the flowcharts may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU)701, performs various functions defined in the system of the present disclosure.
It should be noted that the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer-readable signal medium may include a propagated data signal with computer-readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software, or may be implemented by hardware, and the described units may also be disposed in a processor. Wherein the names of the elements do not in some way constitute a limitation on the elements themselves.
As another aspect, the present disclosure also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by an electronic device, cause the electronic device to implement the method described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method of speech conversion, comprising:
acquiring original voice data and preset tone information;
extracting cross-language feature representation and emotion feature representation of the original voice data;
and performing voice conversion based on the cross-language feature representation, the emotion feature representation and the tone color information to obtain target voice data.
2. The method of speech conversion according to claim 1, wherein extracting a cross-linguistic feature representation of the raw speech data comprises:
carrying out feature extraction on the original voice data to obtain audio features;
and inputting the audio features into a pre-trained cross-language feature extraction model to obtain the cross-language feature representation.
3. The method of speech conversion according to claim 2, further comprising pre-training the cross-language feature extraction model, comprising:
acquiring a voice sample and a text sample corresponding to contents;
performing feature extraction on the voice sample to obtain sample audio features, and performing text processing on the text sample to obtain sample cross-language features;
and performing model training by using the sample audio features and the sample cross-language features to obtain the cross-language feature extraction model.
4. The method of claim 3, wherein the text processing the text sample to obtain a sample cross-language feature comprises:
converting the text sample into a text character set represented by unified characters according to a preset mapping relation between text content and the unified characters;
and obtaining the sample cross-language features based on the text character set.
5. The method of claim 3, wherein when the text sample includes a language type, the text processing the text sample to obtain sample cross-language features comprises:
determining a language type of the text sample;
converting the text sample into a text phoneme set according to the mapping relation between the text content and the phonemes of the language type;
and obtaining the sample cross-language features based on the text phoneme set.
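A minimal, non-limiting sketch of claim 5: the determined language type selects a language-specific grapheme-to-phoneme mapping. The tiny phoneme tables and the tokenization rule below are placeholders; a real system would use full pronunciation lexicons or a trained G2P model.

```python
# Hypothetical per-language phoneme tables; a real system would use full
# pronunciation lexicons or a trained grapheme-to-phoneme model.
PHONEME_TABLES = {
    "zh": {"你": ["n", "i3"], "好": ["h", "ao3"]},
    "en": {"hello": ["HH", "AH0", "L", "OW1"]},
}

def text_to_phonemes(text_sample, language):
    """Convert the text sample into a phoneme set using the mapping that
    belongs to the determined language type (claim 5, illustrative only)."""
    table = PHONEME_TABLES[language]
    tokens = text_sample.split() if language == "en" else list(text_sample)
    phonemes = []
    for token in tokens:
        phonemes.extend(table.get(token, ["<unk>"]))
    return phonemes

print(text_to_phonemes("你好", "zh"))        # ['n', 'i3', 'h', 'ao3']
print(text_to_phonemes("hello", "en"))       # ['HH', 'AH0', 'L', 'OW1']
```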
6. The speech conversion method according to claim 1, wherein extracting the emotion feature representation of the original voice data comprises:
extracting emotion information of the original voice data;
and converting the emotion information into a feature vector as the emotion feature representation.
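One assumed way to realize claim 6 is to take prosodic measurements (pitch and energy contours) as the emotion information and summarize them into a fixed-length vector; the sketch below uses librosa's pyin and rms for this, which is an illustrative choice, not the claimed implementation.

```python
import librosa
import numpy as np

def emotion_feature_vector(wav_path, sr=16000):
    """Summarize pitch and energy statistics into a fixed-length vector
    (an assumed stand-in for the emotion information of claim 6)."""
    y, sr = librosa.load(wav_path, sr=sr)
    f0, _, _ = librosa.pyin(y, fmin=65.0, fmax=600.0, sr=sr)  # pitch contour
    rms = librosa.feature.rms(y=y)[0]                         # energy contour
    voiced = f0[~np.isnan(f0)]
    if voiced.size == 0:
        voiced = np.zeros(1)
    return np.array([voiced.mean(), voiced.std(),
                     rms.mean(), rms.std()], dtype=np.float32)
```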
7. The speech conversion method according to claim 1, wherein performing voice conversion based on the cross-language feature representation, the emotion feature representation, and the timbre information to obtain the target voice data comprises:
inputting the cross-language feature representation, the emotion feature representation, and the timbre information into a pre-trained voice conversion model to obtain the target voice data output by the model.
8. The speech conversion method according to claim 7, further comprising pre-training the voice conversion model, the pre-training comprising:
acquiring a voice sample, a converted voice sample corresponding to the voice sample, and preset sample timbre information;
extracting a sample cross-language feature representation of the voice sample by using a pre-trained cross-language feature extraction model;
extracting a sample emotion feature representation of the voice sample; and
performing model training by using the sample cross-language feature representation, the sample emotion feature representation, the converted voice sample, and the sample timbre information to obtain the voice conversion model.
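A non-limiting sketch of the pre-training in claim 8: a conditional decoder consumes the sample cross-language feature representation, the sample emotion feature representation, and a speaker (timbre) embedding, and is trained to reconstruct acoustic frames of the converted voice sample with an L1 loss. The module layout, the embedding-based timbre representation, and the loss are assumptions of this example.

```python
import torch
import torch.nn as nn

class VoiceConversionModel(nn.Module):
    """Assumed conditional decoder: content + emotion + timbre -> acoustic frames."""
    def __init__(self, content_dim=256, emotion_dim=4, n_speakers=10,
                 speaker_dim=64, n_mels=80, hidden=512):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, speaker_dim)  # sample timbre info
        self.rnn = nn.GRU(content_dim + emotion_dim + speaker_dim, hidden,
                          num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, content, emotion, speaker_id):
        # content: (B, T, content_dim); emotion: (B, emotion_dim); speaker_id: (B,)
        frames = content.size(1)
        cond = torch.cat([emotion, self.speaker_emb(speaker_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, frames, -1)
        decoded, _ = self.rnn(torch.cat([content, cond], dim=-1))
        return self.out(decoded)                     # predicted acoustic frames

model = VoiceConversionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(content, emotion, speaker_id, converted_mel):
    """Fit the model so its output matches frames of the converted voice sample."""
    predicted = model(content, emotion, speaker_id)
    loss = nn.functional.l1_loss(predicted, converted_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```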
9. A speech conversion apparatus, comprising:
a preparation module, configured to acquire original voice data and preset timbre information;
an extraction module, configured to extract a cross-language feature representation and an emotion feature representation of the original voice data; and
a conversion module, configured to perform voice conversion based on the cross-language feature representation, the emotion feature representation, and the timbre information to obtain target voice data.
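Purely as an illustration of the module split in claim 9, the class below wires together three assumed callables corresponding to the preparation, extraction, and conversion modules; it is a structural sketch, not the claimed apparatus.

```python
class SpeechConversionApparatus:
    """Structural sketch of the module split in claim 9; the three module
    objects are assumed callables, not the claimed implementations."""
    def __init__(self, preparation_module, extraction_module, conversion_module):
        self.preparation_module = preparation_module  # acquires voice data and timbre info
        self.extraction_module = extraction_module    # returns (cross-language, emotion) reprs
        self.conversion_module = conversion_module    # produces the target voice data

    def run(self):
        original_wav, timbre_info = self.preparation_module()
        content_repr, emotion_repr = self.extraction_module(original_wav)
        return self.conversion_module(content_repr, emotion_repr, timbre_info)
```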
10. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the speech conversion method according to any one of claims 1 to 8.
11. An electronic device, comprising:
one or more processors;
a storage device configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the speech conversion method according to any one of claims 1 to 8.
CN202110785424.3A 2021-07-12 2021-07-12 Voice conversion method and device, storage medium and electronic equipment Active CN113539239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110785424.3A CN113539239B (en) 2021-07-12 2021-07-12 Voice conversion method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110785424.3A CN113539239B (en) 2021-07-12 2021-07-12 Voice conversion method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113539239A (en) 2021-10-22
CN113539239B (en) 2024-05-28

Family

ID=78098622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110785424.3A Active CN113539239B (en) 2021-07-12 2021-07-12 Voice conversion method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113539239B (en)

Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09305197A (en) * 1996-05-16 1997-11-28 N T T Data Tsushin Kk Method and device for voice conversion
JP2007004011A * 2005-06-27 2007-01-11 Nippon Telegr & Teleph Corp <Ntt> Voice synthesizer, method, and program, and its recording medium
CN101256558A (en) * 2007-02-26 2008-09-03 株式会社东芝 Apparatus and method for translating speech in source language into target language
US20170315986A1 (en) * 2016-04-28 2017-11-02 International Business Machines Corporation Cross-lingual information extraction program
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN110874537A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Generation method of multi-language translation model, translation method and translation equipment
WO2020147404A1 (en) * 2019-01-17 2020-07-23 平安科技(深圳)有限公司 Text-to-speech synthesis method, device, computer apparatus, and non-volatile computer readable storage medium
CN112749569A (en) * 2019-10-29 2021-05-04 阿里巴巴集团控股有限公司 Text translation method and device
US20210200965A1 (en) * 2019-12-30 2021-07-01 Tmrw Foundation Ip S. À R.L. Cross-lingual voice conversion system and method
CN111325039A (en) * 2020-01-21 2020-06-23 陈刚 Language translation method, system, program and handheld terminal based on real-time call
CN112270917A (en) * 2020-10-20 2021-01-26 网易(杭州)网络有限公司 Voice synthesis method and device, electronic equipment and readable storage medium
CN112382270A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Speech synthesis method, apparatus, device and storage medium
CN112382267A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and storage medium for converting accents
CN112382274A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Audio synthesis method, device, equipment and storage medium
CN112712789A (en) * 2020-12-21 2021-04-27 深圳市优必选科技股份有限公司 Cross-language audio conversion method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116741149A (en) * 2023-06-08 2023-09-12 北京家瑞科技有限公司 Cross-language voice conversion method, training method and related device
CN116741149B (en) * 2023-06-08 2024-05-14 北京家瑞科技有限公司 Cross-language voice conversion method, training method and related device

Also Published As

Publication number Publication date
CN113539239B (en) 2024-05-28

Similar Documents

Publication Publication Date Title
CN107195296B (en) Voice recognition method, device, terminal and system
CN111312245B (en) Voice response method, device and storage medium
EP4016526A1 (en) Sound conversion system and training method for same
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN107633842A (en) Audio recognition method, device, computer equipment and storage medium
CN111862954A (en) Method and device for acquiring voice recognition model
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN113658577A (en) Speech synthesis model training method, audio generation method, device and medium
CN112927674A (en) Voice style migration method and device, readable medium and electronic equipment
Kumar et al. Machine learning based speech emotions recognition system
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
CN116543797A (en) Emotion recognition method and device based on voice, electronic equipment and storage medium
CN113948062B (en) Data conversion method and computer storage medium
CN112687257B (en) Sentence similarity judging method and device, electronic equipment and readable storage medium
CN113539236B (en) Speech synthesis method and device
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
CN115132170A (en) Language classification method and device and computer readable storage medium
CN115171660A (en) Voiceprint information processing method and device, electronic equipment and storage medium
Ajayi et al. Systematic review on speech recognition tools and techniques needed for speech application development
CN114333903A (en) Voice conversion method and device, electronic equipment and storage medium
CN113948061A (en) Speech synthesis method, system, speech synthesis model and training method thereof
Singh et al. Speech recognition system for north-east Indian accent
CN118197277B (en) Speech synthesis method, device, electronic equipment and storage medium
Gadekar et al. Analysis of speech recognition techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant