CN114067806A - Voice conversion method and related equipment - Google Patents

Voice conversion method and related equipment

Info

Publication number
CN114067806A
Authority
CN
China
Prior art keywords
target
training
voice
speech
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111362172.XA
Other languages
Chinese (zh)
Inventor
刘皓冬
李栋梁
刘恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd filed Critical Beijing Sogou Technology Development Co Ltd
Priority to CN202111362172.XA priority Critical patent/CN114067806A/en
Publication of CN114067806A publication Critical patent/CN114067806A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 17/00 - Speaker identification or verification techniques
    • G10L 17/22 - Interactive procedures; Man-machine interfaces
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application relates to speech processing in artificial intelligence. In a scenario where arbitrary source speech must be converted into target speech that retains the content of the source speech while carrying the target timbre of a specified target sound-producing object, speech recognition is first performed on the source speech. After the speech recognition result is obtained, it is input, together with the target object identifier of the target sound-producing object, into a pre-trained speech conversion model. Because the speech conversion model is obtained by synchronously and jointly training an acoustic model and a vocoder, the acoustic features used to train the vocoder are the predicted acoustic features output by the acoustic model being trained alongside it, which ensures the speech synthesis quality of the trained vocoder and improves the accuracy of the target speech it outputs. Where needed, the application may also involve blockchain technology: the pre-trained speech conversion model and the related data generated during its training may be stored in nodes of a blockchain.

Description

Voice conversion method and related equipment
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech conversion method and related devices.
Background
With the development of multimedia communication technology and Artificial Intelligence (AI), speech synthesis and speech recognition have become key technologies for human-machine speech interaction. To meet application requirements such as personalized voice applications and personal voice security, Voice Conversion (VC) technology, which builds on these technologies, can be used to convert one person's voice into another person's voice without changing the linguistic content.
At present, a speech recognition model can be used to perform speech recognition on sample speech. After the sample speech recognition result is obtained, the acoustic model is trained on the basis of the sample speech, the vocoder is trained using the real acoustic features of the sample speech, and the trained acoustic model and vocoder together form the speech conversion model. In a speech conversion application scenario, the pre-trained acoustic model then performs feature extraction on the speech recognition result of the source speech to obtain the predicted acoustic features of the target sound-producing object, and the predicted acoustic features are input into the vocoder to synthesize target speech that matches the timbre of the target sound-producing object.
However, with this voice conversion method, the error of the target speech synthesized by the pre-trained vocoder is large, resulting in low similarity between the target speech and the speech the target sound-producing object would actually produce.
Disclosure of Invention
In view of the above, the present application provides a method for converting speech, the method comprising:
acquiring source speech of any sound-producing object and a target object identifier of a target sound-producing object;
performing voice recognition on the source voice to obtain a voice recognition result;
inputting the voice recognition result and the target object identification into a voice conversion model, and outputting target voice with target tone color characteristics corresponding to the target object identification and the content of the source voice;
the voice conversion model comprises an acoustic model and a vocoder which are obtained through synchronous training, and input information used for training the vocoder comprises output information of the acoustic model.
In some embodiments, the speech conversion model is obtained by pre-training, and the training method includes:
acquiring a training voice recognition result of training voice generated by a training object; wherein the training object comprises at least one sound-emitting object configured with a corresponding object identifier; the training speech is from speech generated by a corresponding sound object in the training data set;
inputting the training voice recognition result and the object identification into an acoustic model to obtain a predicted acoustic characteristic of the training object, and recording the tone characteristic of the sounding object corresponding to the object identification;
inputting the predicted acoustic features into a vocoder to obtain predicted voice of the training object;
acquiring a first error between the predicted acoustic feature and a reference acoustic feature of the training object, and a second error between the predicted speech and the training speech;
in the back propagation process, updating a first parameter of the acoustic model according to the first error, updating a second parameter of the vocoder according to the second error, and training the updated acoustic model and the vocoder to obtain a voice conversion model.
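As an illustration of the joint training scheme above, the following is a minimal PyTorch sketch of one synchronous training step, assuming toy stand-ins for the acoustic model and the vocoder. The module architectures, feature dimensions, loss functions, and the choice not to back-propagate the vocoder loss into the acoustic model are all assumptions made for exposition, not details taken from this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAcousticModel(nn.Module):
    """Maps frame-level recognition features plus an object identifier to predicted acoustic features."""
    def __init__(self, in_dim=256, mel_dim=80, num_objects=8, timbre_dim=64):
        super().__init__()
        self.timbre_table = nn.Embedding(num_objects, timbre_dim)   # learned timbre per object ID
        self.net = nn.Sequential(nn.Linear(in_dim + timbre_dim, 256), nn.ReLU(), nn.Linear(256, mel_dim))

    def forward(self, recog_feats, object_id):
        # recog_feats: (batch, frames, in_dim); object_id: (batch,)
        timbre = self.timbre_table(object_id).unsqueeze(1).expand(-1, recog_feats.size(1), -1)
        return self.net(torch.cat([recog_feats, timbre], dim=-1))   # predicted acoustic features

class ToyVocoder(nn.Module):
    """Turns acoustic feature frames into waveform samples (hop samples per frame)."""
    def __init__(self, mel_dim=80, hop=200):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(mel_dim, 256), nn.ReLU(), nn.Linear(256, hop))

    def forward(self, acoustic_feats):
        return self.net(acoustic_feats).flatten(1)                  # (batch, frames * hop)

acoustic_model, vocoder = ToyAcousticModel(), ToyVocoder()
opt_acoustic = torch.optim.Adam(acoustic_model.parameters(), lr=1e-4)
opt_vocoder = torch.optim.Adam(vocoder.parameters(), lr=1e-4)

def joint_training_step(recog_feats, object_id, reference_acoustic, training_wave):
    predicted_acoustic = acoustic_model(recog_feats, object_id)
    # Key point of the joint scheme: the vocoder is trained on the *predicted* acoustic features
    # of the synchronously trained acoustic model, not on features extracted from the training speech.
    # (Detaching so the second error only updates the vocoder is one possible reading of the text.)
    predicted_wave = vocoder(predicted_acoustic.detach())

    first_error = F.l1_loss(predicted_acoustic, reference_acoustic)  # acoustic model vs. reference features
    second_error = F.l1_loss(predicted_wave, training_wave)          # vocoder output vs. training speech

    opt_acoustic.zero_grad(); first_error.backward(); opt_acoustic.step()   # update the first parameter
    opt_vocoder.zero_grad(); second_error.backward(); opt_vocoder.step()    # update the second parameter
    return first_error.item(), second_error.item()
```

In practice the two optimizers could be merged into one, and the L1 losses replaced by whatever preset loss function the application intends; the sketch only fixes the data flow of the joint scheme.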
In some embodiments, the inputting the training speech recognition result and the object identifier into an acoustic model to obtain a predicted acoustic feature of the training object includes:
performing feature extraction on the training speech recognition result to obtain speech coding features and tone features corresponding to the object identification;
performing fusion processing on the voice coding features and the tone features to obtain predicted acoustic features of the training object;
the inputting the predicted acoustic features into a vocoder to obtain the predicted speech of the training object includes:
and inputting the voice coding characteristics and the predicted acoustic characteristics into a vocoder to obtain the predicted voice of the training object.
In some embodiments, the training method further comprises:
when the training object is a specified sound-producing object, recording the correspondence between the object identifier of the specified sound-producing object and the timbre feature extracted by the acoustic model, the target sound-producing object comprising any specified sound-producing object;
and updating an embedded tone table represented by a coding embedded layer of the acoustic model by using the corresponding relation, and acquiring target tone characteristics corresponding to the target object identifier by inquiring the embedded tone table.
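Isolating the embedded timbre table described above, it can be thought of as an embedding layer keyed by object identifier that is updated while the acoustic model trains and queried with the target object identifier at conversion time. The sizes and names in the following sketch are illustrative assumptions only.

```python
import torch
import torch.nn as nn

NUM_OBJECTS, TIMBRE_DIM = 16, 64                      # illustrative sizes
timbre_table = nn.Embedding(NUM_OBJECTS, TIMBRE_DIM)  # one learned timbre vector per object ID

# During training, the timbre vector of each uttering object is looked up by its ID and
# optimised with the rest of the acoustic model, which "records" the correspondence.
train_ids = torch.tensor([3, 3, 7])                   # object IDs of the training utterances
timbre_features = timbre_table(train_ids)             # (3, TIMBRE_DIM)

# At conversion time, the target timbre feature is obtained by querying the table
# with the target object identifier.
target_object_id = torch.tensor([7])
target_timbre = timbre_table(target_object_id)        # (1, TIMBRE_DIM)
```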
In some embodiments, the training method further comprises:
calling a reference fundamental frequency of the training voice;
inputting the reference fundamental frequency of the training voice into a fundamental frequency processing model to obtain the fundamental frequency characteristics of the corresponding sounding object;
updating a third parameter of the fundamental frequency processing model according to the second error in the back propagation process;
the inputting the predicted acoustic features into a vocoder to obtain the predicted speech of the training object includes:
and inputting the fundamental frequency characteristic and the predicted acoustic characteristic into a vocoder to obtain the predicted voice of the training object.
In some embodiments, the training method further comprises:
inputting prosodic information contained in the training speech recognition result into a fundamental frequency prediction model to obtain respective prediction fundamental frequencies of the sounding objects contained in the training object;
calling a fundamental frequency prediction target value of a sounding object corresponding to the object identification;
acquiring a third error between the predicted fundamental frequency and the fundamental frequency prediction target value of the same sounding object;
and updating the fourth parameter of the fundamental frequency prediction model according to the third error in the back propagation process.
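A minimal sketch of such a fundamental frequency prediction model follows, under assumed dimensions and an assumed MSE loss: it maps prosody information from the training speech recognition result to a predicted (normalised) fundamental frequency, and its parameters are updated from the third error during back propagation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyF0Predictor(nn.Module):
    """Predicts a normalised fundamental frequency per frame from prosody information
    (architecture and dimensions are illustrative assumptions)."""
    def __init__(self, prosody_dim=218, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(prosody_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, prosody_feats):                 # (batch, frames, prosody_dim)
        return self.net(prosody_feats).squeeze(-1)    # (batch, frames) predicted normalised F0

f0_predictor = ToyF0Predictor()
opt_f0 = torch.optim.Adam(f0_predictor.parameters(), lr=1e-4)

def f0_predictor_step(prosody_feats, f0_target):
    predicted_f0 = f0_predictor(prosody_feats)
    third_error = F.mse_loss(predicted_f0, f0_target)            # third error vs. the prediction target value
    opt_f0.zero_grad(); third_error.backward(); opt_f0.step()    # updates the fourth parameter
    return third_error.item()
```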
In some embodiments, training the updated acoustic model and the vocoder to obtain a speech conversion model includes:
training the updated acoustic model and the vocoder until a first training constraint condition is met, and stopping training the acoustic model;
inputting the predicted voice output by the vocoder for the next training and the corresponding training voice into a discriminator, updating a second parameter of the vocoder according to a discrimination result, and training the vocoder with the updated second parameter until a second training constraint condition is met to obtain a voice conversion model;
wherein the first training constraint is configured for a training process of the acoustic model; the second training constraint is configured for a training process of the vocoder.
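The second-stage training described above could look roughly like the following sketch, in which the acoustic model is no longer updated and the vocoder continues training against a discriminator. The discriminator architecture and the binary cross-entropy adversarial losses are assumptions, since the application does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDiscriminator(nn.Module):
    """Scores an utterance-length waveform as real (training speech) or synthesized (vocoder output)."""
    def __init__(self, wave_len=4000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(wave_len, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

    def forward(self, wave):
        return self.net(wave)          # one real/fake logit per utterance

def second_stage_step(vocoder, discriminator, opt_vocoder, opt_disc, predicted_acoustic, training_wave):
    # 1) update the discriminator on real training speech vs. the current vocoder output
    with torch.no_grad():
        fake_wave = vocoder(predicted_acoustic)
    real_logits, fake_logits = discriminator(training_wave), discriminator(fake_wave)
    d_loss = (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # 2) update the vocoder's second parameter according to the discrimination result
    fake_logits = discriminator(vocoder(predicted_acoustic))
    g_loss = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    opt_vocoder.zero_grad(); g_loss.backward(); opt_vocoder.step()
    return d_loss.item(), g_loss.item()
```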
In some embodiments, the method for obtaining the fundamental frequency prediction target value of a sound-producing object includes:
acquiring a plurality of voices generated by the same sound-producing object in a training data set;
extracting respective reference fundamental frequencies of the plurality of voices;
and normalizing the respective reference fundamental frequencies of the multiple voices of the same sounding object to obtain the fundamental frequency prediction target values of the corresponding sounding objects.
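One way to realise the normalisation step above is shown below, as a sketch only; the log-domain mean/standard-deviation statistics are an assumption, since the application does not specify the normalisation used.

```python
import numpy as np

def f0_prediction_target(reference_f0_list):
    """reference_f0_list: list of per-utterance F0 arrays (Hz) for one sound-producing object."""
    voiced = np.concatenate([f0[f0 > 0] for f0 in reference_f0_list])  # drop unvoiced frames
    log_f0 = np.log(voiced)
    mean, std = log_f0.mean(), log_f0.std()
    # Normalised F0 targets for each utterance, plus the statistics needed later for
    # de-normalising a predicted F0 back into this speaker's natural range.
    targets = [np.where(f0 > 0, (np.log(np.maximum(f0, 1e-8)) - mean) / std, 0.0)
               for f0 in reference_f0_list]
    return targets, {"mean": float(mean), "std": float(std)}
```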
In some embodiments, said inputting said speech recognition result and said target object identification into a speech conversion model, outputting a target speech having a target timbre characteristic corresponding to said target object identification and contents of said source speech, comprises:
inputting the voice recognition result and the target object identification into the acoustic model which is pre-trained to obtain target voice coding characteristics and target acoustic characteristics;
inputting prosodic information contained in the voice recognition result into the pre-trained fundamental frequency prediction model to obtain a target prediction fundamental frequency;
calling normalized fundamental frequency information corresponding to the target object identification, and performing reverse normalization processing on the target prediction fundamental frequency by using the normalized fundamental frequency information of the target sounding object to obtain a prediction reference fundamental frequency of the target sounding object;
inputting the prediction reference fundamental frequency into the pre-trained fundamental frequency processing model to obtain the target fundamental frequency characteristic of the target sounding object;
inputting the target voice coding features, the target acoustic features and the target fundamental frequency features into a pre-trained vocoder, and outputting target voice of the target sounding object; the target speech has target timbre features corresponding to the target object identification and the content of the source speech.
In some embodiments, the inputting the target speech coding features, the target acoustic features, and the target fundamental frequency features into a pre-trained vocoder and outputting the target speech of the target sound-producing object comprises:
inputting the target timbre feature, the target fundamental frequency feature, and the target pronunciation feature and target prosody feature in the target voice coding features into a pre-trained vocoder, and outputting the target voice of the target sound-producing object; or,
inputting the target tone color feature, the target fundamental frequency feature and a target pronunciation feature in the target voice coding feature into a pre-trained vocoder, and outputting a target voice of the target pronunciation object; or,
and inputting the target voice coding features, the target acoustic features, the target fundamental frequency features and preset voice energy features into a pre-trained vocoder, and outputting the target voice of the target sounding object.
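Putting the conversion-time steps together, the following is a hypothetical sketch of the pipeline; every callable and field name is an assumed placeholder for the corresponding pre-trained component described above, not an actual API.

```python
import numpy as np

def convert(recognition_result, target_object_id,
            acoustic_model, f0_predictor, f0_processor, vocoder, f0_stats):
    """Illustrative end-to-end conversion; each *_model/*_predictor argument is assumed to be a
    pre-trained callable, and f0_stats holds the target speaker's normalised-F0 statistics
    recorded during training."""
    # 1) acoustic model: target speech-coding features and target acoustic features with target timbre
    coding_feats, target_acoustic = acoustic_model(recognition_result, target_object_id)

    # 2) predict a normalised F0 from the prosody part of the recognition result,
    #    then de-normalise it with the target speaker's statistics
    norm_f0 = f0_predictor(recognition_result["prosody"])
    predicted_ref_f0 = np.exp(norm_f0 * f0_stats["std"] + f0_stats["mean"])

    # 3) the F0 processing model turns the predicted reference F0 into the target F0 feature
    target_f0_feat = f0_processor(predicted_ref_f0)

    # 4) the vocoder synthesises the target speech from the combined inputs
    return vocoder(coding_feats, target_acoustic, target_f0_feat)
```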
In another aspect, the present application further provides a speech conversion apparatus, including:
the data acquisition module is used for acquiring source speech of a source speech object and a target object identifier of a target speech object;
the voice recognition module is used for extracting the characteristics of the source voice to obtain a voice recognition result;
a voice conversion module, configured to input the voice recognition result and the target object identifier into a voice conversion model, and output a target voice having a target timbre characteristic corresponding to the target object identifier and a content of the source voice;
the voice conversion model comprises an acoustic model and a vocoder which are obtained through synchronous training, and input information used for training the vocoder comprises output information of the acoustic model.
In yet another aspect, the present application further proposes a computer device, comprising: at least one memory and at least one processor, wherein:
the memory for storing a program for implementing the voice conversion method as described above;
the processor is used for loading and executing the program stored in the memory so as to realize the voice conversion method.
In yet another aspect, the present application further proposes a computer-readable storage medium, on which a computer program is stored, the computer program being loaded and executed by a processor to implement the voice conversion method as described above.
In yet another aspect, the present application further proposes a computer program product, which includes computer instructions, which are read and executed by a processor, to implement the voice conversion method as described above.
Therefore, the application relates to voice processing in artificial intelligence, such as a voice conversion method, which is characterized in that under the scene that any source voice needs to be converted into target tone which accords with a specified target sound-producing object, the content of the source voice is reserved, the source voice is subjected to voice recognition, and after a voice recognition result is obtained, the voice recognition result and the target object identifier of the target sound-producing object can be input into a pre-trained voice conversion model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 shows an architecture diagram of an alternative example of a speech conversion system suitable for the speech conversion method proposed in the present application;
FIG. 2 is a diagram illustrating a hardware architecture of an alternative example of a computer device suitable for use in the speech conversion method proposed in the present application;
FIG. 3 is a diagram showing a hardware configuration of yet another alternative example of a computer device suitable for the speech conversion method proposed in the present application;
FIG. 4 is a flow chart diagram illustrating an alternative example of the speech conversion method proposed by the present application;
FIG. 5 shows a schematic flow chart of yet another alternative example of the speech conversion method proposed by the present application;
FIG. 6 shows a schematic flow chart of yet another alternative example of the speech conversion method proposed by the present application;
FIG. 7 is a schematic diagram of an alternative training method for a speech conversion model in the speech conversion method proposed in the present application;
FIG. 8 is a flow chart diagram illustrating yet another alternative example of the speech conversion method proposed by the present application;
FIG. 9 is a flowchart illustrating an alternative example of invoking a pre-trained speech conversion model to implement a speech conversion method in the speech conversion method proposed in the present application;
fig. 10 is a schematic structural diagram showing an alternative example of the speech conversion apparatus proposed in the present application;
fig. 11 shows a schematic structural diagram of yet another alternative example of the speech conversion apparatus proposed by the present application.
Detailed Description
As can be seen from the above description, because the acoustic model and the vocoder in the speech conversion system are trained independently, the real acoustic features (acoustic features extracted directly from the speech of the target sound-producing object) that the vocoder receives in the training stage are inconsistent with the predicted acoustic features (i.e., the acoustic features output by the pre-trained acoustic model) that the vocoder receives in the use stage. As a result, when the vocoder is applied, the prediction error of the acoustic features produced by the acoustic model is superimposed on the prediction error of the vocoder itself, which degrades the synthesis quality of the vocoder; that is, the output of a vocoder trained on real acoustic features is less accurate, and the overall voice-changing effect suffers.
In order to solve the above problems, the present application proposes joint training of the acoustic model and the vocoder: in the pre-training stage of the speech conversion model, the acoustic model and the vocoder are trained synchronously, so that the input information used to train the vocoder comes from the output information of the acoustic model rather than from acoustic features extracted directly from the training speech. In this way, the training process can take into account both the prediction error of the acoustic model that starts training synchronously and the prediction error of the vocoder itself when adjusting the vocoder's model parameters. This ensures that, in use, the trained vocoder can fuse input information that includes the acoustic features output by the acoustic model and obtain target speech with high accuracy, that is, target speech with high similarity to the speech the target sound-producing object would produce, improving the overall voice-changing effect.
The speech conversion model can be trained and processed by using an Artificial Intelligence (AI) technology, and in the processing process, speech technologies such as a speech recognition technology and a speech synthesis technology, and machine learning and deep learning algorithms such as an Artificial neural network, a belief network and reinforcement learning can be used.
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be understood that "system", "apparatus", "unit" and/or "module" as used herein is a method for distinguishing different components, elements, parts or assemblies at different levels. However, other words may be substituted by other expressions if they accomplish the same purpose.
As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In general, the terms "comprises" and "comprising" merely indicate that the explicitly identified steps and elements are included; these steps and elements do not form an exclusive list, and a method or apparatus may include other steps or elements. An element preceded by "comprising a(n) ..." does not, without more constraints, exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
In the description of the embodiments herein, "/" means "or" unless otherwise specified, for example, a/B may mean a or B; "and/or" herein is merely an association describing an associated object, and means that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of the present application, "a plurality" means two or more than two. The terms "first", "second" and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
Additionally, flow charts are used herein to illustrate operations performed by systems according to embodiments of the present application. It should be understood that the preceding or following operations are not necessarily performed in the exact order in which they are performed. Rather, the various steps may be processed in reverse order or simultaneously. Meanwhile, other operations may be added to the processes, or a certain step or several steps of operations may be removed from the processes.
Referring to fig. 1, a schematic architecture diagram of an optional example of a speech conversion system suitable for the speech conversion method provided in the present application is shown. An application scenario of the speech conversion system is one in which the semantic content of source speech (i.e., speech produced by any sound-producing object) is kept unchanged while the source speech is converted into target speech that matches the target timbre of a target sound-producing object (i.e., any specified sound-producing object), such as short-video production, live-streaming voice changing, or voiced broadcast scenarios. As shown in fig. 1, the speech conversion system proposed by the present application may include, but is not limited to: a client 11 and a server 12, wherein:
the client 11 may be an application program that runs on the terminal device and supports the voice conversion function, such as a voice assistant, a voice synthesis engine, or a dedicated application installed in the terminal device, or may be a web application program of a website that supports the voice conversion function, and the type of the client 11 is not limited in the present application. The terminal device may include, but is not limited to, a smart phone, a tablet computer, a desktop computer, a wearable device (such as a smart watch, an Augmented Reality (AR) device, a Virtual Reality (VR) device, and the like), a netbook, an e-book reader, an audio player, a vehicle-mounted terminal, a smart home device, a smart medical device, a smart transportation device, a robot, and other electronic devices.
In some embodiments, the client 11 may be an application program that supports multiple speech processing functions such as speech recognition and speech change, in which case, the terminal device may directly send any collected or received source speech to the client 11, and the client executes the speech conversion method provided by the present application, converts the source speech into a target speech of a specified target sound object, and sends the target speech to the audio/video player for playing. Optionally, the client 11 may also include APPs (applications) supporting different voice processing functions, such as a voice recognition APP, a voice conversion APP/variable-pitch processing APP, and the like.
In still other embodiments, especially for a terminal device with low data processing capability, the client 11 may also send the received source speech and the target object identifier of the target utterance object to the server 12, and the server 12 executes the speech conversion method provided in the present application, obtains the required target speech, and feeds the obtained target speech back to the client 11 for output.
The server 12 may be a service device that provides a corresponding service for the voice conversion function implemented by the client 11, and may be an independent physical server, a server cluster formed by multiple physical servers, a cloud server with cloud computing capability, or the like. In combination with the above analysis, the server 12 can implement data communication with the terminal device 11 through a wired network or a wireless network to meet application requirements, and the implementation process may be determined as appropriate, which is not described in detail herein.
Following the above analysis, the models such as the speech recognition model and the speech conversion model in the speech conversion method proposed in the present application are usually trained in advance based on corresponding training samples, and after obtaining models that satisfy corresponding training constraints, the trained models are stored, so that in a speech conversion application scenario, a desired model can be directly called to process speech or speech sequence features in the application scenario, and a desired processing result is obtained.
In the embodiment of the present application, for a speech conversion model for implementing speech conversion processing, a server may execute a corresponding training method, so that the speech conversion model may have the capability of filtering out the tone features of a source sound object and fusing the tone features of a target sound object to convert source speech into target speech. It should be noted that the speech conversion model may be integrated with the speech recognition model, and of course, the speech recognition model may also be an independent model different from the speech conversion model, which is not limited in this application.
It should be understood that the system architecture in the voice conversion scenario shown in fig. 1 does not constitute a limitation to the voice conversion system in the embodiment of the present application, and in practical applications, the system may include more devices than those shown in fig. 1, or a combination of devices, such as a database server, or other communication servers of a third party platform, and the present application is not specifically mentioned here.
In combination with the above analysis, referring to fig. 2, a schematic diagram of a hardware structure of an optional example of a computer device suitable for the speech conversion method provided in the present application is shown, where the computer device may be the server 12 or the terminal device, so that in an actual application, the server 12 or the client 11 operated by the terminal device may execute the speech conversion method provided in the present application, or the server 12 and the client 11 may cooperate to implement the speech conversion method, which may be determined according to actual application requirements. Taking the example of the computer device being a server, as shown in fig. 2, the computer device may include, but is not limited to: at least one memory 21 and at least one processor 22, wherein:
the memory 21 may be used to store a program implementing the voice conversion method described in the embodiments of the present application; the processor 22 may be configured to load and execute the program stored in the memory 21 to implement the steps of the voice conversion method described in the corresponding method embodiment, and the specific implementation process may refer to, but is not limited to, the description of the corresponding parts of the following embodiments.
In the present embodiment, the memory 21 may include a high-speed random access memory, and may further include a non-volatile memory, such as at least one magnetic disk storage device or other non-volatile solid-state storage device. The processor 22 may be a Central Processing Unit (CPU), an Application-Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), or another programmable logic device. The present application does not limit the number and types of the memories 21 and processors 22 included in the computer device, which may be determined as appropriate.
In some embodiments, the computer device proposed by the present application may further include: a communication interface, which may include network interfaces for wireless and/or wired communication networks, specifically data interfaces of communication modules such as a GSM module, a WiFi module, a 4G/5G/6G (fourth/fifth/sixth generation mobile communication network) module, or Near Field Communication (NFC), and may also include interfaces such as a USB interface, serial/parallel ports, and I/O interfaces, so as to meet the communication requirements between different components within the computer device and between the computer device and other local devices. The number and types of communication interfaces are not limited by the present application and may be determined according to the functional requirements of the computer device; this embodiment does not enumerate them.
Based on this, in this embodiment of the present application, the terminal device running the client 11 may implement data communication between the two through a communication interface matched with the server 12, that is, a data communication link between the two is constructed, so that the client 11 may send information such as source speech and a target object identifier of a target sound-emitting object to the server 12 through the data communication link, the server 12 executes the speech conversion processing method proposed in the present application, and feeds back the obtained target speech to the client 11 through the data communication link for playing.
In still other embodiments presented in the present application, as in the case where the computer device is a terminal device running the client 11, as shown in fig. 3, the computer device may further include: at least one input device such as a touch sensing unit, a camera, a microphone, etc., which senses a touch event on the touch display panel; at least one output device such as a display, speaker, etc.; the power supply component is used for supplying power to each component in the computer equipment and can comprise a power supply management module, at least one power supply, an associated component for realizing the functions of power supply distribution, management and the like, and the like; the sensor module comprises one or more sensors and is used for providing state information of different aspects for the computer equipment, such as input state, motion posture, distance of a sensing approaching object and the like; a communication module for supporting wired or wireless communication between the computer device and other devices, such as a communication module corresponding to the above-listed communication interface, an antenna, etc. The components listed in this application may be determined according to application function requirements of the terminal device, and may include, but are not limited to, components that may be different for different types of terminal devices, which is not exhaustive, and fig. 3 does not show components that are included in the computer device listed in this application.
Therefore, the structure of the computer device shown in fig. 2 and 3 does not constitute a limitation to the computer device in the embodiment of the present application, and in practical applications, the computer device may include more or less components than those shown in fig. 2 or 3, or may combine some components, which is not listed here.
It can be understood that, for the voice conversion method or apparatus provided by the present application, the server may be a node on the blockchain, and the synthesized target data, the pre-trained voice conversion model and other related data may be stored on the blockchain for the user to call at any time when using, and the implementation process is not described in detail in the present application.
Based on the above application scenarios and the related description of the system architecture thereof, the solution proposed in the embodiment of the present application relates to the technologies of artificial intelligence, such as speech recognition, speech synthesis, machine learning/deep learning, and the like, and can be illustrated by the following embodiments, but is not limited to the following embodiments:
referring to fig. 4, a flowchart of an alternative example of the speech conversion method proposed by the present application is shown, in any speech conversion scenario, in response to a change of voice requirement for converting a source speech of any sound object into a target speech of a specified target sound object, the speech conversion method proposed by the present application may be executed by a server or a client or by a combination of the server and the client. The embodiment of the present application is described by taking a server as an example, and as shown in fig. 4, the voice conversion method may include, but is not limited to, the following steps:
step S41, obtaining source speech of any sound object and target object identification of target sound object;
In the speech conversion application scenario described in this embodiment, the scenario may refer to a voice-changing application in which the source speech of any sound-producing object is converted into the target speech of a target sound-producing object without changing information unrelated to the sound-producing object, such as the audio content and prosody. This application places no constraint on the source of the source speech or on its sound-producing object, which can be determined according to application requirements; the target sound-producing object can be a specified sound-producing object designated by the user or assigned by the system. To distinguish different sound-producing objects, each sound-producing object can be configured with a unique object identifier. In this way, when the user submits the selected or uploaded audio to be converted (i.e., the source speech) to the server through the client, the target object identifier of the target sound-producing object can be uploaded at the same time, so that the server knows into which sound-producing object's voice the audio should be converted.
Based on the above analysis, in some embodiments, the server may receive a voice conversion request sent by the client, where the voice conversion request may carry source voice and the target object identifier, and the server may obtain the source voice and the target object identifier by parsing the voice conversion request. Of course, the source speech and the target object identifier are not limited to such a simultaneous obtaining manner, and may also be obtained sequentially according to the actual application requirements, which is not described in detail in this application.
The object identifier is the unique identifier of a sound-producing object; it can be a number, personal biometric information, identity information, or the like, and is used to distinguish the different timbre characteristics of different sound-producing objects. The content of the object identifier is not limited in this application.
It should be understood that, in different voice conversion scenarios, the implementation manner of the server obtaining the source voice and/or the target object identifier may be different, for example, in the case that the source voice is sent from the server to the client, when the client requests to convert the source voice into the target voice of the target voice object, the client may directly send the voice identifier capable of representing the source voice or the information source thereof to the server, so that the server may determine the source voice to be converted according to the voice identifier, and execute the voice conversion method; if the source speech is collected by the terminal device running the client or sent by the external device, the client can directly upload the source speech to the server for conversion processing and the like according to the above-described mode. Of course, the voice conversion method may also be directly executed by the client to obtain the target voice, which often depends on the main body executing the voice conversion method, and the application is not limited thereto.
Step S42, recognizing the source speech to obtain a speech recognition result;
According to the working principle of speech conversion technology, for the source speech output by the source sound-producing object (i.e., the sound-producing object of the source speech, which may be the original sound-producing object or the sound-producing object of speech already obtained after one or more rounds of speech conversion processing), the present application may use, for example, Automatic Speech Recognition (ASR) technology to perform speech recognition on the source speech, so as to obtain information related to pronunciation, prosody, semantic text, and the like. The obtained speech recognition result may include, but is not limited to: discrete variables, such as labels, which may be the pronunciation information of one or more continuous speech frames contained in the source speech, and continuous variables, such as PPGs (phonetic posteriorgrams), which may include the prosody information of the source speech. It can be seen that speech recognition results such as pronunciation information and prosody information are time-series feature information extracted from different speech frames; the speech recognition processing method is not limited in the present application.
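For illustration only, a recognition result of this kind might be organised as below; the field names, frame count, and phone-class count are assumptions, not taken from this application.

```python
import numpy as np

# Illustrative shape of a recognition result for a 3-second utterance at a 10 ms frame shift.
num_frames, num_phones = 300, 218
recognition_result = {
    # discrete variable: one pronunciation label (e.g. phone ID) per speech frame
    "pronunciation": np.random.randint(0, num_phones, size=num_frames),
    # continuous variable: phonetic posteriorgram (PPG), a per-frame posterior over phone classes,
    # carrying prosody/duration information independent of the source speaker's timbre
    "ppg": np.random.rand(num_frames, num_phones).astype(np.float32),
}
```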
Optionally, the speech recognition model used to implement the speech recognition processing may be trained in advance. It may have a neural network structure trained on sample audio based on ASR techniques, or it may be obtained, for example, from an unsupervised pre-trained wav2vec-style convolutional neural network that learns the structure of speech from raw audio, and so on.
Step S43, the speech recognition result and the target object identification are input to the speech conversion model, and the target speech having the target tone characteristic corresponding to the target object identification and the content of the source speech is output.
In combination with the above description of the technical solution of the present application, in order to solve the technical problem that independently training the acoustic model and the vocoder makes the source of the vocoder's input information in the training stage inconsistent with its source in the use stage, which lowers the output accuracy of the vocoder and degrades the voice-changing effect, the present application proposes joint training of the acoustic model and the vocoder: the acoustic model and the vocoder included in the speech conversion model are trained synchronously, and the input information used to train the vocoder includes the output information of the acoustic model.
Therefore, in the model training stage, the vocoder and the acoustic model can start training synchronously. In each training pass, the predicted acoustic features output by the acoustic model in that pass can be fed directly into the vocoder to complete one training pass of the vocoder, and this is repeated over many passes. Compared with a vocoder trained on real acoustic features, the vocoder obtained in this way can effectively reduce its own prediction error in actual speech conversion applications and improve the overall voice-changing effect.
Based on this, in the implementation of step S43, the speech recognition result and the target object identifier are fed into the pre-trained acoustic model. The acoustic model determines the target timbre feature corresponding to the target object identifier and, combining that target timbre feature, encodes and decodes the pronunciation information, prosody information, and other content contained in the speech recognition result to obtain target acoustic features fused with the target timbre feature of the target sound-producing object. The target acoustic features are then input into the pre-trained vocoder for synthesis, yielding target speech that retains the content of the source speech and matches the timbre of the target sound-producing object, which is equivalent to the target sound-producing object directly speaking the content of the source speech, thus meeting the user's voice-changing requirements.
Then, under the condition that the server executes the voice conversion method, the server can feed back the obtained target voice to a client end which provides source voice or is preset to run by a terminal for outputting, or the client end executes other application processing according to the obtained target voice, such as fusion with other multimedia information, and the required target video and the like are obtained.
Illustratively, in a live-streaming voice-changing scenario, the live-streaming server can perform speech conversion processing on the recorded anchor speech to obtain live speech with the timbre of a specified target sound-producing object (such as the timbre of a celebrity, or the timbre configured for a virtual object personalized by the anchor), fuse it with the live image content, and then send the resulting live video to the live-streaming client of each viewer who has entered the live room for playback. In this way, when watching the live stream, viewers hear the voice of the target sound-producing object instead of the anchor's own voice, which can make the live stream more engaging and also meet the need to protect the anchor's real voice.
In a short-video production scenario, based on the characteristics of each sound-producing object contained in the short video and the personal production requirements of the short-video producer, the speech of any sound-producing object in the short video can be extracted as source speech, and the sound-producing object whose voice it should be converted into can be specified. After the target sound-producing object is specified, the source speech is converted, according to the speech conversion method provided in this application, into target speech that matches the timbre of the target sound-producing object; the source short video is then updated with the target speech corresponding to the source speech of each sound-producing object, and the produced target short video can be obtained and published. It is understood that the speech conversion implementation process in other speech conversion application scenarios is similar, and this application does not describe them in detail here.
In scenarios such as phone calls and voice calls in social software, the user can also configure in advance, on the local client, the timbre of the target sound-producing object to convert into, so that the client can take speech received in real time (such as audio collected by an audio collector) as source speech, convert the source speech into target speech according to the speech conversion method described above, and send it to the other party of the call. It should be understood that the voice-changing process in a voice-call scenario may also be implemented by a communication server providing the voice-call service, which executes the speech conversion method described above, converts the received source speech sent by either call client, and sends the obtained target speech to the other party for output, and so on. In this implementation, the target sound-producing object may be specified in advance by a party to the communication or by the communication server, which is not limited in this application.
It should be noted that, regarding the sound change scenarios applicable to the speech conversion method proposed in the present application, including but not limited to the above-listed application scenarios, in the face of the sound change requirement of the target speech intended to convert the source speech of any one of the uttered objects into the timbre of the specified target uttered object, the above-described speech conversion method can be implemented, and the present application is not described in detail by way of example.
In conclusion, because the acoustic model and the vocoder are trained synchronously, the predicted acoustic features output by the acoustic model serve as the input during vocoder training. Thus, when the speech recognition result of the source speech is processed using the speech conversion model formed by the acoustic model and the vocoder, the vocoder can retain the content of the source speech and perform synthesis according to the acoustic features output by the acoustic model, predicting the target speech of the target sound-producing object accurately and reliably, improving the overall voice-changing effect, and better meeting voice-changing application requirements.
Referring to fig. 5, which is a schematic flow chart of yet another optional example of a speech conversion method proposed in the present application, an embodiment of the present application may describe a training process of a speech conversion model in the speech conversion method, but is not limited to the model training implementation method described in the present embodiment, and the model training method may be executed by the computer device, and this embodiment takes a scenario in which a server implements the model training method as an example, as shown in fig. 5, the method may include:
step S51, acquiring training voice generated by a training object;
in this embodiment of the present application, the training object may include at least one vocalizing object, and the vocalizing object is configured with a corresponding object identifier, and the training speech may be from a speech generated by a vocalizing object included in a training data set (for example, a set formed by a plurality of speeches generated by different vocalizing objects, which may be extracted from a corpus database, and the present application does not limit the content and the source of the content included in the training data set).
In this case, the obtained training speech may include one or more utterances from each of a plurality of sound-producing objects, which are combined into the training speech in a certain order.
For convenience of subsequent processing, the present application may record the ordering of the sound-producing objects corresponding to each utterance forming the training speech, for example by recording the ordering result using the object identifiers of the multiple sound-producing objects, although the implementation is not limited to this. In addition, when obtaining the training speech, one or more utterances can be randomly selected from the training data set to form the training speech; the method of obtaining the training speech is not limited in the present application.
Step S52, inputting the training speech into the pre-training speech recognition model to obtain the training speech recognition result;
for the related description of the speech recognition model, reference may be made to the description of the corresponding parts of the above embodiments, and the embodiments of the present application are not described in detail herein.
In the embodiment of the application, because the training processes of the voice recognition model and the voice conversion model are mutually independent, after the voice recognition model is trained independently, the pre-trained voice recognition model is directly called to carry out feature extraction on training voice so as to meet the training requirement of the voice conversion model, thereby ensuring the consistency of the method for acquiring the voice recognition result input in the training stage and the use stage of the voice conversion model and improving the training efficiency and the accuracy of the voice conversion model.
Step S53, inputting the training speech recognition result and the object identification into an acoustic model to obtain the predicted acoustic characteristics of the training object, and recording the tone characteristics of the sounding object corresponding to the object identification;
in the embodiment of the application, the acoustic model adopts a neural network structure formed by an acoustic Encoder (Voice Conversion Encoder) and an acoustic Decoder (Voice Conversion Decoder), the acoustic Encoder is utilized to realize feature extraction of a training Voice recognition result, an input Voice recognition result with an indefinite length is converted into a feature vector with a corresponding length, and a hidden layer feature of input information is extracted in the Conversion process to obtain information related to Voice content irrelevant to a source sounding object.
For example, the acoustic encoder may be an encoding network obtained based on convolutional neural network training, and in the network structure, input information (e.g., recognition features such as pronunciation information and prosody information included in a training speech recognition result) may be mapped to a hidden layer feature space through a plurality of convolutional layers of the convolutional neural network, and then the distributed hidden layer features are mapped to a sample label space through a full connection layer, so as to implement classification and integration of hidden layer features, which is not described in detail in this application. It should be noted that the network structure of the acoustic encoder is not limited to the convolutional neural network, and other types of neural networks such as a cyclic neural network may also be used, and according to the actual application requirements, feature extraction may also be implemented in combination with a spatial/semantic attention mechanism, for example, so as to improve the reliability and accuracy of the obtained audio coding features.
In addition, in the process of extracting features by the acoustic encoder, the extracted tone color features of the sounding object may be associated with the object identifier, and an embedded tone color table (such as an embedding table) for recording the correspondence between the tone color features and the object identifier is characterized in an embedded layer of the acoustic encoder.
It should be noted that, in the process of continuously training the acoustic encoder, the tone features extracted from the speech of the same utterance object input in each training may be utilized to update the tone features corresponding to the object identifier of the utterance object in the embedded tone table, so as to implement the learning optimization of the tone features of different utterance objects in the embedded tone table, and improve the reliability and accuracy of the recorded tone features.
Then, the acoustic decoder may perform fusion processing on the encoding processing result output by the acoustic encoder, and predict an acoustic feature that has the tone of the training object and is consistent with the training speech content, and record the acoustic feature as a predicted acoustic feature. The feature fusion processing method implemented by the acoustic decoder may be determined according to a network operating principle of the decoder, and is not described in detail in this application. Optionally, the acoustic decoder may also adopt a neural network structure, and fusion processing of different dimensional features is implemented according to feature fusion processing capability of the neural network, so as to obtain the required predicted acoustic features.
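A minimal PyTorch sketch of such an acoustic encoder and decoder follows, under assumed layer sizes; the convolutional stack, the fully connected projection, and the way the timbre feature is concatenated before decoding are illustrative choices rather than the application's exact structure.

```python
import torch
import torch.nn as nn

class ToyAcousticEncoder(nn.Module):
    """Convolutional encoder: maps frame-level recognition features to hidden features that
    keep content information but no source-speaker timbre (dimensions are assumptions)."""
    def __init__(self, in_dim=218, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU())
        self.proj = nn.Linear(hidden, hidden)        # the "full connection layer" in the description

    def forward(self, recog_feats):                  # (batch, frames, in_dim)
        h = self.conv(recog_feats.transpose(1, 2)).transpose(1, 2)
        return self.proj(h)                          # (batch, frames, hidden) speech-coding features

class ToyAcousticDecoder(nn.Module):
    """Decoder: fuses the coding features with the looked-up timbre feature and predicts
    acoustic features (e.g. mel-spectrogram frames)."""
    def __init__(self, hidden=256, timbre_dim=64, mel_dim=80):
        super().__init__()
        self.out = nn.Sequential(nn.Linear(hidden + timbre_dim, 256), nn.ReLU(),
                                 nn.Linear(256, mel_dim))

    def forward(self, coding_feats, timbre):         # timbre: (batch, timbre_dim)
        t = timbre.unsqueeze(1).expand(-1, coding_feats.size(1), -1)
        return self.out(torch.cat([coding_feats, t], dim=-1))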
Step S54, inputting the predicted acoustic feature into a vocoder to obtain the predicted voice of the training object;
the Vocoder is a speech signal codec that combines model-parameter prediction with speech synthesis technology. In voice conversion applications, the vocoder synthesizes, from the predicted acoustic features of the target sounding object, target speech that conforms to the tone of that target sounding object.
Therefore, in the process of training the voice conversion model, the acoustic model and the vocoder included in the voice conversion model are jointly trained, so that in each training pass the predicted acoustic features output by the acoustic model being trained at the same time are fed directly into the vocoder. In this way, the vocoder's inputs in the training stage and in the use stage are consistent, both coming from the output of the acoustic model, which avoids the mismatch that arises when an independently trained vocoder is fed real acoustic features during training but acoustic model outputs during use.
Step S55, acquiring a first error between the predicted acoustic feature and a reference acoustic feature of the training object, and a second error between the predicted speech and the training speech;
the embodiment of the application may adopt a multi-task training mode to train the voice conversion model, improving model training efficiency and prediction accuracy. Accordingly, over the whole training process of the voice conversion model, the parameters of different parts of the model may be adjusted according to their respective prediction targets, and the specific adjustments may be determined according to the training requirements of each part.
In combination with the training implementation process of the acoustic model and the vocoder described above, since the two parts adopt a joint training mode and feature association is realized in the forward-propagation calculation of the whole model, the present application may obtain the prediction error of each part of the speech conversion model separately from the forward-propagation processing. For example, in step S55, a first error of the prediction result of the acoustic model and a second error of the prediction result of the vocoder are obtained respectively, each characterizing the accuracy of the prediction result of the corresponding part.
Based on this, in the embodiment of the present application, a suitable preset loss function, such as a cross-entropy loss function, may be called to obtain the first loss between the predicted acoustic feature and the reference acoustic feature and the second loss between the predicted speech and the training speech, so that the error of the prediction result of the corresponding portion is characterized by the obtained loss and the model parameters of that portion can subsequently be adjusted accordingly. In addition, for the prediction results of different components of the speech conversion model, the same or different error acquisition modes may be adopted to acquire the errors of the corresponding prediction results.
In some embodiments, in the process of obtaining the second error, the loss calculation may not be performed directly on the two speech signals, namely the predicted speech and the training speech. Instead, a unified signal processing method may be adopted: the predicted speech and the training speech are first preprocessed to obtain the corresponding predicted speech features and training speech features, and the loss between these features, i.e., the loss between the predicted speech and the training speech, is then calculated through a loss function to obtain the second error.
In a possible implementation manner, the preprocessing may include, but is not limited to, a Short-Time Fourier Transform (STFT), which determines the frequency and phase of the local sinusoidal components of the corresponding speech signal (such as the predicted speech or the training speech), thereby producing speech features of the signal that are suited to the loss calculation.
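For illustration, the following sketch computes such a second error from STFT magnitude features of the two waveforms; the FFT size, hop length and the choice of an L1 loss are assumptions of this example rather than requirements of the embodiment.

```python
# Illustrative sketch only: STFT-based feature loss between two waveforms.
import torch
import torch.nn.functional as F

def stft_feature(wave, n_fft=1024, hop=256):
    spec = torch.stft(wave, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft),
                      return_complex=True)
    return spec.abs()              # magnitude spectrogram as the speech feature

def second_error(predicted_wave, training_wave):
    # loss between predicted-speech features and training-speech features
    return F.l1_loss(stft_feature(predicted_wave), stft_feature(training_wave))

pred = torch.randn(1, 16000)       # one second of predicted speech at 16 kHz
ref = torch.randn(1, 16000)        # the corresponding training speech
print(second_error(pred, ref))
```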
In still other embodiments, the present application may compute the error of a prediction result of the voice conversion model in a GAN (Generative Adversarial Network) training manner. In this case, both the acoustic model and the vocoder may serve as generators, each configured with a corresponding discriminator; the prediction result output by the generator and the prediction target corresponding to that result are input into the discriminator for prediction scoring, and the accuracy or error of the prediction result is characterized by the magnitude of the obtained prediction score.
Therefore, in these further embodiments, the present application may input the predicted acoustic feature and the reference acoustic feature of the corresponding utterance object (i.e., the actual acoustic feature obtained by feature extraction from that object's speech in the training speech) into a first discriminator, score the accuracy of the predicted acoustic feature, obtain a first discrimination result for the predicted acoustic feature, and determine the first error based on that result. The first error may thus be the score obtained by the scoring operation, or an error value determined from that score, and so on. Similarly, the predicted speech and the training speech may be input to a second discriminator for scoring, and the second error determined based on the second discrimination result. The implementation of the adversarial training method itself is not detailed in this application.
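A hedged sketch of this adversarial scoring for the acoustic-feature branch follows; the tiny discriminator, the hinge-style losses and all shapes are assumptions made only to show how a discrimination result can yield the first error.

```python
# Illustrative sketch only: scoring a predicted acoustic feature with a discriminator.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    def __init__(self, feat_dim=80):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, feats):
        # feats: (batch, frames, feat_dim) -> one score per frame, averaged
        return self.net(feats).mean()

disc = Discriminator()
predicted_acoustic = torch.randn(1, 100, 80)   # output of the acoustic model (generator)
reference_acoustic = torch.randn(1, 100, 80)   # real acoustic feature of the utterance object

# Hinge-style example: the discriminator learns to separate real from predicted,
# while the "first error" fed back to the generator grows when the discriminator
# rates the prediction as synthetic.
d_loss = torch.relu(1 - disc(reference_acoustic)) + torch.relu(1 + disc(predicted_acoustic))
first_error = -disc(predicted_acoustic)
print(d_loss.item(), first_error.item())
```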
It should be noted that, when the training object includes a plurality of sounding objects and the training speech contains the speech of each of them (the individual utterances being combined into the training speech in a certain order), in the process of obtaining the predicted acoustic features the speech recognition result of each sounding object in the training speech may be fused with that sounding object's timbre features to obtain the predicted acoustic features of that sounding object.
It can be seen that, in this mixed speech training scenario, the first error may include multiple prediction errors for different sound-producing objects, and the prediction error acquisition process for the acoustic features of different sound-producing objects is similar, which is not described in detail in this application. It is understood that if the training subject is a sound emitting subject, the first error is a prediction error of the acoustic feature of the training subject.
Step S56, updating a first parameter of the acoustic model according to the first error, updating a second parameter of the vocoder according to the second error, training the updated acoustic model and the vocoder until a first training constraint condition is met, and stopping training the acoustic model;
since the first error and the second error respectively characterize the prediction accuracy of the corresponding parts of the speech conversion model, and in order to avoid the model instability caused when the same model parameter is updated by multiple prediction targets simultaneously during gradient back-propagation, the embodiment of the present application proposes that the error of each prediction target of the acoustic model and of the vocoder is used only to update the parameters of its own part. That is, the acoustic feature prediction target is responsible for training the acoustic model; therefore, in the back-propagation process, the first parameter of the acoustic model is updated according to the first error, for example by updating the network parameters of the neural network forming the acoustic model. The first parameter may include, but is not limited to, the weight matrix, and may be determined as the case requires.
Similarly, the speech prediction target is responsible for training the vocoder, so that during the same back-propagation process the second parameter of the vocoder is updated according to the second error. The second parameter may include, but is not limited to, the weight matrix constituting the vocoder, and may be determined according to the vocoder's network structure and the network parameters that affect the accuracy of its output, which is not described in detail in this embodiment.
Therefore, in order to improve the training efficiency and reliability of the model, the acoustic model and the vocoder may begin training at the same time; in the training stage, each of the two models updates its own parameters according to the error of its own prediction result, so that the acoustic model and the vocoder are trained synchronously at this stage, which saves training time. It should be noted, however, that if model training efficiency is not a concern, the parameter updates of the two models may also be performed sequentially rather than synchronously.
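The per-target parameter update described above can be sketched as follows, assuming stand-in networks and mean-squared-error losses; detaching the predicted acoustic feature before it enters the vocoder is one possible way to keep the second error from updating the acoustic model, and is an assumption of this example.

```python
# Illustrative sketch only: each error updates only the parameters of its own part.
import torch
import torch.nn as nn

acoustic_model = nn.Linear(256, 80)          # stand-in acoustic model
vocoder = nn.Linear(80, 200)                 # stand-in vocoder
opt_acoustic = torch.optim.Adam(acoustic_model.parameters(), lr=1e-4)
opt_vocoder = torch.optim.Adam(vocoder.parameters(), lr=1e-4)

recog_feat = torch.randn(4, 256)             # training speech recognition result
ref_acoustic = torch.randn(4, 80)            # reference acoustic feature
ref_wave = torch.randn(4, 200)               # training speech (as a feature target here)

pred_acoustic = acoustic_model(recog_feat)
pred_wave = vocoder(pred_acoustic.detach())  # vocoder input is the acoustic model's output

first_error = nn.functional.mse_loss(pred_acoustic, ref_acoustic)   # acoustic target
second_error = nn.functional.mse_loss(pred_wave, ref_wave)          # speech target

# First error updates only the acoustic model; second error updates only the vocoder.
opt_acoustic.zero_grad(); first_error.backward(); opt_acoustic.step()
opt_vocoder.zero_grad(); second_error.backward(); opt_vocoder.step()
```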
It can be understood that, since each training pass of the speech conversion model is implemented in a similar manner, after the model parameters of the corresponding portions have been updated once with the obtained errors, the above method may be repeated: the acoustic model with the updated first parameter and the vocoder with the updated second parameter continue to be trained until a preset training constraint condition is satisfied, and the acoustic model and the vocoder finally obtained constitute the desired speech conversion model. The training constraint conditions may include the number of training passes reaching a preset number, each model error reaching a minimum or converging stably, and so on, at which point training may end; the present application does not limit the content of the training constraint conditions, which may be determined as the case requires.
In addition, in practical application, the acoustic model and the vocoder can be trained according to the method described in the above steps to obtain a speech conversion model, which the computer device calls in a voice-changing scenario to convert any source speech into target speech conforming to the target sounding object while retaining the content information of the source speech. Adaptive adjustments can be made on the basis of the above training method according to the voice-changing requirements; the implementation process is similar to the training process described above and is not detailed in the present application.
Based on this, in order to further improve the synthesis stability and synthesis effect of the vocoder, in still other embodiments the whole training process of the voice conversion model may be divided into two stages, the above steps belonging to the first training stage. In this stage, as described above, the acoustic feature prediction target and the speech prediction target are used to train the acoustic model and the vocoder synchronously, yielding a stable acoustic model and a relatively stable vocoder. The first training stage may determine whether to end according to the training effect of the acoustic model: once the acoustic model satisfies the first training constraint condition, it is no longer trained, and the trained acoustic model is used as the final acoustic model in the speech conversion model.
The first training constraint condition may be determined according to a training requirement of the acoustic model, and may include whether the training frequency of the acoustic model reaches a preset frequency, whether a prediction error of the acoustic model on an acoustic feature of a sound object in the training object is convergent and stable, and the like.
It should be noted that, when the training object includes a plurality of sound generating objects, the first parameter of the acoustic model may be updated according to the obtained first errors of the plurality of sound generating objects, so that the acoustic model obtained by training can be suitable for predicting acoustic features of different sound generating objects, and the implementation process may refer to the implementation process described above, which is not described in detail in this embodiment. Similarly, in the training process of the vocoder, the second parameters of the vocoder can be sequentially updated by using the respective second errors of the multiple sounding objects, so as to obtain the vocoder suitable for synthesizing the voices of different sounding objects, and the implementation process is not repeated.
Step S57, inputting the predicted speech output by the vocoder in subsequent training, together with the corresponding training speech, into a discriminator, updating the second parameter of the vocoder according to the discrimination result, and training the vocoder with the updated second parameter until a second training constraint condition is met, to obtain the speech conversion model.
Following the above analysis, after a stable acoustic model has been trained, a discriminator may further be introduced to train the vocoder. In this case the vocoder serves as the generator, and the discriminator computes the probability that the input predicted speech is non-synthesized speech (such as the corresponding training speech), which characterizes the synthesis accuracy of the predicted speech output by the vocoder; the higher the probability value, the higher the accuracy of the predicted speech is considered to be, and the probability, or a score based on it, may be determined as the error between the vocoder's predicted speech and the corresponding training speech.
This GAN training mode aims to have the predicted speech output by the vocoder judged by the discriminator to be non-synthesized speech, so as to improve the vocoder's synthesis effect. It should be noted that the discrimination result output by the discriminator (which may characterize the error between the predicted speech output by the vocoder in this training pass and the training speech participating in this pass) is used to continue updating the second parameter of the vocoder during back-propagation; at this point the first parameter of the acoustic model is no longer updated, which ensures that the vocoder and acoustic model finally obtained are more stable and better meet the application requirements of speech conversion.
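A minimal sketch of this second training stage follows, assuming stand-in networks: the acoustic model's parameters are frozen and only the vocoder is updated from the discriminator's judgement of its predicted speech; the specific loss form is an assumption of this example.

```python
# Illustrative sketch only: stage-two adversarial refinement of the vocoder.
import torch
import torch.nn as nn

acoustic_model = nn.Linear(256, 80)          # already trained in stage one
vocoder = nn.Linear(80, 200)                 # refined in stage two
discriminator = nn.Sequential(nn.Linear(200, 64), nn.ReLU(), nn.Linear(64, 1))

# The acoustic model's first parameters are no longer updated.
for p in acoustic_model.parameters():
    p.requires_grad_(False)

opt_vocoder = torch.optim.Adam(vocoder.parameters(), lr=1e-4)

recog_feat = torch.randn(4, 256)
with torch.no_grad():
    pred_acoustic = acoustic_model(recog_feat)   # frozen acoustic model output

pred_speech = vocoder(pred_acoustic)
# Push the vocoder towards predictions the discriminator rates as non-synthesized.
vocoder_error = -discriminator(pred_speech).mean()
opt_vocoder.zero_grad(); vocoder_error.backward(); opt_vocoder.step()
```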
It should be noted that the second training constraint condition may also include that the number of training times of the vocoder model reaches a preset number of training times, the prediction error convergence of the vocoder reaches a minimum value, and the like.
To sum up, in the training process of the voice conversion model described in the embodiment of the present application, the acoustic model and the vocoder are trained synchronously in each pass of the first training stage, so that the acoustic features input to the vocoder are the predicted acoustic features output by the acoustic model in the current pass; the vocoder's training input and use input therefore both come from the acoustic model's output, which improves the vocoder's speech synthesis effect. After the first-stage synchronous training has produced a stable acoustic model and a relatively stable vocoder, this embodiment adopts an adversarial training mode and introduces the discriminator to continue training the vocoder, improving its stability and output accuracy. This improves the voice-changing effect of the speech conversion model when converting any source speech into the target speech of a specified target sounding object and better meets the voice-changing application requirements.
Referring to fig. 6, a schematic flow chart of yet another optional example of the speech conversion method proposed in the present application, this implementation may be a further optional training method for the speech conversion model and an optional optimization of the model training method described in the foregoing embodiment, although the training is not limited to the method described in this embodiment. The method may still be executed by a computer device. In combination with the optional network structure diagram of the speech conversion model shown in fig. 7, as shown in fig. 6, the training method of the speech conversion model may include:
step S61, acquiring training voice generated by a training object;
step S62, inputting the training speech into the pre-training speech recognition model to obtain the training speech recognition result;
regarding the implementation processes of step S61 and step S62, reference may be made to the description of the corresponding parts in the above embodiments, which is not repeated in this embodiment.
Step S63, inputting the training speech recognition result and the object identification into an acoustic encoder to obtain speech coding characteristics and tone characteristics corresponding to the object identification;
in the embodiment of the present application, in combination with the above description of the network structure of the acoustic model, the acoustic encoder may be used to extract features from the training speech recognition result to obtain speech coding features, including pronunciation features, prosodic features and the like, that are independent of the training object. At the same time, the extracted timbre features may be associated with the object identifier of the corresponding utterance object, so that the timbre recorded for that identifier is the timbre into which the training speech is to be converted in the predicted speech; this is not described in detail in the embodiment of the present application.
In the model training process, the embedding layer of the acoustic encoder may maintain an embedded timbre table, and during the continuous training of the acoustic encoder the timbre features recorded in the table for each object identifier may be continuously updated, as described in the corresponding parts of the above embodiments. Therefore, in the final training stage of the model, when the speech of a specified sounding object is used as training speech to train the speech conversion model in a targeted manner, the correspondence between the object identifier of that specified sounding object and the timbre features extracted by the acoustic model can be recorded, and the embedded timbre table represented by the encoding embedding layer of the acoustic encoder is updated and maintained with this correspondence.
Based on this, in the actual speech conversion application, the selected target sound-emitting object may be the specified sound-emitting object, and in the speech conversion process, in the processing process of the acoustic encoder, the target sound-color feature corresponding to the target object identifier may be obtained directly by querying the embedded sound-color table, so that the acoustic decoder may predict the acoustic feature of the target sound-emitting object accordingly, and the implementation process may be combined with the description of the corresponding part of the context.
Step S64, inputting the tone color characteristic and the voice coding characteristic into an acoustic decoder to obtain the predicted acoustic characteristic of the training object;
for the acoustic decoder in the acoustic model, the acoustic decoder is configured to fuse the input speech coding features with the timbre features and predict an acoustic feature that has the timbre of the sounding object into which the speech is to be converted (in this embodiment, a sounding object included in the training object) and is consistent with the content of the training speech; the details of the fusion processing are not described in this embodiment.
In still other embodiments, in combination with the model training implementation method described in the above embodiments, the acoustic model may output the speech coding feature and the predicted acoustic feature, so that the speech coding feature and the predicted acoustic feature may be input to the vocoder to obtain the predicted speech of the training object, and compared with inputting the predicted acoustic feature to the vocoder only, this embodiment considers the original feature of the training speech included in the speech coding feature during the speech synthesis process, thereby improving the speech synthesis effect.
Step S65, inputting prosodic information contained in the training speech recognition result into a fundamental frequency prediction model to obtain respective prediction fundamental frequencies of the vocal objects contained in the training object;
in voice conversion applications, although the general range of the fundamental frequency of different sounding objects is known, for example 50 Hz to 400 Hz (the fundamental frequency being the lowest frequency of the speech signal wave, which generally affects the pitch of the voice), the real fundamental frequencies of different sounding objects often differ, and the fundamental frequency of an unknown sounding object (such as a source sounding object) cannot be extracted directly. The structure of the fundamental frequency prediction model is not limited here; it can be determined by analyzing the relationship between prosodic information and the fundamental frequency of different voices.
After the fundamental frequency prediction model is trained in the manner described in the corresponding part below, in voice conversion applications the computer device directly calls the fundamental frequency prediction model to predict the fundamental frequency under the standard distribution, which is subsequently de-normalized with the real fundamental frequency information of the target sounding object to obtain the target fundamental frequency of that object. In the model training process, however, the fundamental frequency prediction model participates in training but is not used to obtain the predicted fundamental frequency features of the training object.
Step S66, a reference fundamental frequency of the training voice is called;
step S67, inputting the reference fundamental frequency of the training voice into a fundamental frequency processing model to obtain the fundamental frequency characteristics of the corresponding sounding object;
in the training process of the speech conversion model provided by the application, the training object serves both as the source sounding object and as the sounding object of the speech to be synthesized, and the training speech serves both as the source speech and as the speech prediction target. Therefore, to obtain the fundamental frequency features of the training object, the real fundamental frequency of the training speech can be called directly as the reference fundamental frequency and input into the fundamental frequency processing model to obtain a hidden-layer feature representation, i.e., the fundamental frequency features of the corresponding sounding object. It is understood that, when the training speech includes the speech of multiple sounding objects, the reference fundamental frequencies called may be the real fundamental frequencies of each object's speech; the method of obtaining the fundamental frequency of a speech signal is not detailed in this application.
Step S68, inputting the fundamental frequency characteristic, the voice coding characteristic and the prediction acoustic characteristic into a vocoder to obtain the prediction voice of the training object;
therefore, in the speech synthesis processing of the vocoder, the fundamental frequency features of the sounding object of the speech to be converted, the predicted acoustic features, and the speech coding features of the training speech are considered comprehensively, yielding a reliable and accurate speech synthesis result and improving the reliability and accuracy of the predicted speech output by the vocoder.
Step S69, obtaining a first error between the predicted acoustic feature and the reference acoustic feature of the training object, a second error between the predicted speech and the corresponding training speech, and a third error between the predicted fundamental frequency and the fundamental frequency prediction target value of the same sounding object;
for example, by obtaining a plurality of voices generated by the same vocal subject in the training data set, extracting respective reference fundamental frequencies of the plurality of voices, and then normalizing the respective reference fundamental frequencies of the plurality of voices of the same vocal subject, the fundamental frequency prediction target value of the corresponding vocal subject is obtained. Optionally, the method may use a variance normalization processing method to implement calculation of the fundamental frequency prediction target value, for example, mean and standard deviation calculation are performed on fundamental frequencies of multiple voices of the same utterance object to obtain a corresponding fundamental frequency mean and a corresponding fundamental frequency standard deviation, and the fundamental frequency prediction target value of the utterance object is obtained according to a calculation manner of a fundamental frequency prediction target value (reference fundamental frequency-fundamental frequency mean)/fundamental frequency standard deviation by combining with a reference fundamental frequency of a corresponding utterance object in a training object, but is not limited to this calculation method.
For the above method for acquiring the first error and the second error, reference may be made to the description of the corresponding parts in the above embodiments, which is not described in detail in this embodiment; in the process of obtaining the third error, a preset loss function may also be called to calculate the loss between the predicted fundamental frequency and the fundamental frequency predicted target value of the same sounding object, so as to obtain the third error, which is the prediction error of the fundamental frequency of the sounding object by the fundamental frequency prediction model.
Step S610, updating a first parameter of the acoustic model according to the first error, updating a second parameter of the vocoder and a third parameter of the fundamental frequency processing model according to the second error, updating a fourth parameter of the fundamental frequency prediction model according to the third error, and training the updated model to obtain the speech conversion model.
Therefore, in the multi-task training method provided by the application, in order to avoid overfitting of the fundamental frequency prediction model, the fundamental frequency prediction model is trained independently according to the fundamental frequency prediction target; during training, once the third error of a training pass is detected to satisfy the third training constraint condition configured for the fundamental frequency prediction model, the training of that model may be stopped. The embodiment of the present application may use the speech prediction target to train the vocoder and the fundamental frequency processing model; the training implementation process may refer to the description of the corresponding part of the above embodiment and is not detailed in this embodiment.
In summary, each part in the speech conversion model of the embodiment of the present application has its own prediction target, and in the model training process, the stable update of the corresponding part of the model parameters can be realized according to the prediction target. In the process of training the vocoder, the discriminator is introduced to continue to perform stable training on the vocoder according to the method described above after the training of the acoustic model is finished, and the implementation process is not repeated.
In some embodiments provided by the application, for the model training methods described in each of the above embodiments, the acoustic model and the vocoder in the speech conversion model can be integrated into one network, so that the encoding result of the acoustic encoder can be passed to the vocoder more conveniently. The vocoder thereby has multi-dimensional reference information, the speech synthesis effect is improved, the user's acceptance of and satisfaction with the voice-changing function increase, and the product is easier to popularize. Moreover, compared with data transmission between two independent models, data transmission within one network reduces the first-packet delay and better meets the low-latency requirement of real-time voice-changing applications. In addition, this structural linking reduces the number of model calls in the voice-changing application and improves voice-changing efficiency.
It should be noted that the speech recognition model in the above embodiment may also be merged into the system architecture of the whole speech conversion model, and a technique such as wav2vec may be adopted to train the speech recognition model and complete the speech recognition task, further reducing the first-packet delay. However, this processing method may reduce the pronunciation accuracy after voice changing and make the overall audio quality unstable. Therefore, in practical application, the components of the system structure and the structural relationships between them, whether components need to be merged, and whether the network structure of the acoustic model and/or vocoder needs to be adjusted, may be determined flexibly and reasonably according to the practical application requirements; the embodiment of the present application does not detail each case.
In another embodiment, as described above, the present application may input the speech coding features output by the acoustic model into the vocoder. The speech coding features are obtained by the acoustic encoder's feature extraction on the training speech recognition result, which often contains pronunciation information, prosody information and so on; if the speech coding features input into the vocoder include both pronunciation features and prosody features, the prosodic fluctuation of the training speech will be clearly perceptible in the output predicted speech. In practical application, if the target sounding object's own timbre calls for a different prosodic expression, inputting the prosody feature to the vocoder in the training stage will affect the listener's judgment of the voice-changing effect of the synthesized target speech. In this case, the present application considers that, when training the vocoder, the prosody feature of the input speech coding features is no longer input, so as to obtain target speech that better meets the actual voice-changing requirement.
Based on this, in the model training process, in one implementation the speech coding features input to the vocoder may include both pronunciation features and prosody features, so as to obtain a speech conversion model suitable for the first type of application scenario; in another implementation the speech coding features input to the vocoder may include pronunciation features but no prosodic features, so as to obtain a speech conversion model suitable for the second type of application scenario. The corresponding speech conversion model is then called in the corresponding application scenario to meet its voice-changing requirements and obtain target speech that conforms both to the timbre features of the target sounding object and to the requirements of the particular scenario. The application does not limit the specific contents of the first and second types of application scenarios, which can be configured in advance according to actual conditions.
In addition, in order to meet the voice-changing requirements of a third type of application scenario, such as live broadcasting, other dimensional features such as speech energy features can be added in the process of training the vocoder, further improving the overall effect after voice changing.
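As a purely illustrative sketch of this scenario-dependent assembly of the vocoder input (keeping or dropping the prosody feature, and optionally appending an energy feature), consider the following; the feature names and dimensions are assumptions of this example.

```python
# Illustrative sketch only: concatenating vocoder input features per scenario.
import torch

def build_vocoder_input(pronunciation, prosody, timbre, f0_feat,
                        energy=None, use_prosody=True):
    feats = [pronunciation, timbre, f0_feat]
    if use_prosody:            # first-type scenario: keep the prosody feature
        feats.append(prosody)
    if energy is not None:     # third-type scenario (e.g. live broadcasting): add energy
        feats.append(energy)
    return torch.cat(feats, dim=-1)

frames = 100
x1 = build_vocoder_input(torch.randn(frames, 128), torch.randn(frames, 16),
                         torch.randn(frames, 64), torch.randn(frames, 4))
x2 = build_vocoder_input(torch.randn(frames, 128), torch.randn(frames, 16),
                         torch.randn(frames, 64), torch.randn(frames, 4),
                         use_prosody=False)
print(x1.shape, x2.shape)   # the second-type scenario drops the prosody dimensions
```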
Based on the description of the above embodiments, in the training process of the speech conversion model, when the training data set contains a large number of sounding objects and each sounding object has sufficient speech, the speech conversion model can be trained directly with multiple groups of training speech composed of the speech of multiple sounding objects, so that the model learns features common to the multiple sounding objects. The trained speech conversion model can then be applied to acoustic feature prediction for any sounding object and can synthesize that object's speech; in this case it can also convert the source speech of any sounding object into the target speech of a specified target sounding object with high accuracy.
When the training data set contains only a small number of utterances, however, the present application may first train the speech conversion model in the manner described above and then, after obtaining speech of a specified target sounding object (i.e., the training object) as training speech, use at least one training utterance of that target sounding object to continue training the speech conversion model previously trained on the speech of multiple sounding objects. This finally yields a speech conversion model, specific to the target sounding object, that converts the source speech of any sounding object into the target speech of the specified target sounding object with high accuracy. When a voice-changing requirement then arises that requires converting any source speech into the target speech of that target sounding object, the speech conversion model obtained from this final training can be called to perform the conversion.
The method can also use several training sentences of different target sounding objects in turn, continue training the speech conversion model according to the training method described above, record the object identifier of the target sounding object used in each training pass, and record in the embedded timbre table the correspondence between that object identifier and the timbre features of the target sounding object obtained by training. For how a training sentence of one sounding object is used to train the speech conversion model, reference may be made to the description of the corresponding part in the above embodiment, which is not detailed in this embodiment.
Referring to fig. 8, a flow chart of yet another alternative example of the speech conversion method proposed by the present application, this embodiment details how the computer device invokes the pre-trained speech conversion model described in the above embodiments to convert any source speech into the target speech of a specified target sounding object, although the implementation is not limited to the details given in this embodiment. In connection with the schematic diagram of the voice conversion method implemented by invoking the voice conversion model shown in fig. 9, as shown in fig. 8, the method may include, but is not limited to, the following steps:
step S81, obtaining source speech of the source speech object and target object identification of the target speech object;
step S82, inputting the source speech into the pre-trained speech recognition model to obtain a speech recognition result;
step S83, inputting the voice recognition result and the target object identification into a pre-trained acoustic encoder for encoding processing to obtain target voice encoding characteristics and target tone characteristics corresponding to the target object identification;
regarding the implementation process of step S81-step S83, reference may be made to the description of the corresponding parts in the above embodiments, which is not repeated in this embodiment.
In this embodiment, in combination with the description of the training process of the acoustic encoder, an embedded timbre table may be maintained, and after the target object identifier is obtained, the target timbre feature can be obtained directly by a table lookup. It can be understood that the acoustic encoder can also update the timbre feature of the corresponding sounding object in the embedded timbre table according to feedback information from the voice conversion application, so as to improve the accuracy of the timbre feature; the implementation is similar to the learning of timbre features in the model training process and is not detailed in this application.
Step S84, inputting the target voice coding characteristics and the target tone characteristics into a pre-trained acoustic decoder for fusion processing to obtain target acoustic characteristics;
step S85, inputting prosodic information contained in the voice recognition result into a pre-trained fundamental frequency prediction model to obtain a target prediction fundamental frequency;
step S86, the normalization fundamental frequency information corresponding to the target object identification is called, and the normalization fundamental frequency information of the target sounding object is utilized to perform inverse normalization processing on the target prediction fundamental frequency to obtain the prediction reference fundamental frequency of the target sounding object;
in combination with the description of the fundamental frequency prediction model in the above model training process, that model predicts the fundamental frequency under the standard distribution, i.e., a normalized fundamental frequency. Therefore, as shown in fig. 9, the application can call the pre-stored normalized fundamental frequency information of the target sounding object, such as the fundamental frequency mean and the fundamental frequency standard deviation, to perform inverse normalization on the target predicted fundamental frequency and obtain the predicted reference fundamental frequency of the target sounding object, for example as predicted reference fundamental frequency = target predicted fundamental frequency × fundamental frequency standard deviation + fundamental frequency mean, although the method is not limited to this processing.
Step S87, inputting the prediction reference fundamental frequency into a pre-trained fundamental frequency processing model to obtain the target fundamental frequency characteristic of the target sounding object;
step S88, inputting the target voice coding feature, the target tone feature and the target base frequency feature into the vocoder to obtain the target voice with the target tone feature and the content of the source voice.
In conjunction with the above analysis, in different application scenarios, the step S88 may include, but is not limited to, the following implementation methods:
inputting the target tone color characteristic, the target fundamental frequency characteristic and the target pronunciation characteristic and the target rhythm characteristic in the target voice coding characteristic into a pre-trained vocoder, and outputting the target voice of a target sounding object; or inputting the target tone color characteristic, the target fundamental frequency characteristic and the target pronunciation characteristic in the target voice coding characteristic into a pre-trained vocoder, and outputting the target voice of the target pronunciation object; or inputting the target voice coding characteristics, the target acoustic characteristics, the target fundamental frequency characteristics and the preset voice energy characteristics into a pre-trained vocoder, and outputting the target voice of the target sounding object.
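For illustration only, the conversion flow of steps S81 to S88 can be wired together as in the following sketch, assuming each component is an already trained callable; the interfaces, the toy stand-ins and the per-speaker statistics are assumptions of this example, not the embodiment's actual APIs.

```python
# Illustrative sketch only: wiring of the inference steps S81-S88.
import torch

def convert(source_wave, target_id, recognizer, encoder, decoder,
            f0_predictor, f0_model, vocoder, f0_mean, f0_std):
    recog = recognizer(source_wave)                        # S82: speech recognition result
    coding, target_timbre = encoder(recog, target_id)      # S83: target coding + timbre features
    target_acoustic = decoder(coding, target_timbre)       # S84: target acoustic features
    norm_f0 = f0_predictor(recog)                          # S85: target predicted (normalized) f0
    ref_f0 = norm_f0 * f0_std + f0_mean                    # S86: inverse normalization
    f0_feat = f0_model(ref_f0)                             # S87: target fundamental-frequency features
    return vocoder(coding, target_timbre, f0_feat, target_acoustic)   # S88: target speech

# Toy stand-ins only, to show the call pattern; real components are the trained models.
frame_feat = lambda *_: torch.randn(100, 32)
out = convert(torch.randn(16000), torch.tensor([0]),
              recognizer=frame_feat,
              encoder=lambda recog, sid: (frame_feat(), frame_feat()),
              decoder=lambda c, t: frame_feat(),
              f0_predictor=lambda recog: torch.randn(100),
              f0_model=lambda f0: frame_feat(),
              vocoder=lambda *feats: torch.randn(16000),
              f0_mean=120.0, f0_std=20.0)
print(out.shape)   # the synthesized target speech waveform
```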
Regarding the processing process of the input information by each model, reference may be made to the description of the corresponding part in the model training process, which is not described in detail in this embodiment.
In conclusion, in the process of training the voice conversion model, the acoustic features input to the vocoder are the predicted acoustic features output by the acoustic model; when the vocoder obtained from this training is used in a voice-changing scenario, the predicted acoustic features of the target sounding object output by the acoustic model are likewise input into the vocoder. This improves the vocoder's speech synthesis effect, makes the target speech more similar to the real speech of the target sounding object, better meets the voice-changing requirement of converting any source speech into the target speech of a specified target sounding object, and improves the user experience.
It can be understood that, in the above speech conversion process, the present application may also invoke the discriminator in the manner described above to judge the similarity between the obtained target speech and the reference speech of the target sounding object, further optimize the vocoder according to the discrimination result, re-synthesize the input information with the optimized vocoder, and update the target speech; for the implementation, reference may be made to the description of the corresponding part of the vocoder training process above, which is not detailed in this embodiment.
Referring to fig. 10, a schematic diagram of an alternative example of a speech conversion apparatus proposed in the present application may include, but is not limited to:
a data obtaining module 101, configured to obtain source speech of a source speech object and a target object identifier of a target speech object;
the voice recognition module 102 is configured to perform feature extraction on the source voice to obtain a voice recognition result;
a voice conversion module 103, configured to input the voice recognition result and the target object identifier into a voice conversion model, and output a target voice having a target timbre characteristic corresponding to the target object identifier and a content of a source voice;
the voice conversion model comprises an acoustic model and a vocoder which are obtained through synchronous training, and input information used for training the vocoder comprises output information of the acoustic model.
In still other embodiments, to implement the training of the speech conversion model, the apparatus may include a speech conversion model training module, as shown in fig. 11, which may include:
a training speech recognition result acquisition module 104, configured to acquire a training speech recognition result of training speech generated by a training object;
the training object comprises at least one sound-emitting object, and the sound-emitting object is configured with a corresponding object identifier; the training speech is from speech produced by a corresponding sound producing object in the training data set.
A feature extraction module 105, configured to input the training speech recognition result and the object identifier into an acoustic model, obtain a predicted acoustic feature of the training object, and record a tone feature of the sounding object corresponding to the object identifier;
a speech synthesis module 106, configured to input the predicted acoustic feature into a vocoder to obtain a predicted speech of the training object;
an error obtaining module 107, configured to obtain a first error between the predicted acoustic feature and a reference acoustic feature of the training subject, and a second error between the predicted speech and the training speech;
a model parameter updating module 108, configured to update a first parameter of the acoustic model according to the first error, update a second parameter of the vocoder according to the second error, and train the updated acoustic model and the vocoder to obtain a speech conversion model.
Optionally, the feature extraction module 105 may include:
the feature extraction unit is used for extracting features of the training voice recognition result to obtain voice coding features and tone features corresponding to the object identification;
a feature fusion unit, configured to perform fusion processing on the speech coding features and the tone features to obtain predicted acoustic features of the training object.
Based on this, the speech synthesis module 106 may include:
and the first voice synthesis unit is used for inputting the voice coding characteristics and the predicted acoustic characteristics into a vocoder to obtain the predicted voice of the training object.
In still other embodiments, the aforementioned speech conversion model training module may further include:
The tone characteristic recording module is used for recording the corresponding relation between the object identification of the specified sound production object and the tone characteristic extracted by the acoustic model under the condition that the training object is a specified sound production object; the target sound-emitting object comprises any specified sound-emitting object;
and the embedded tone table updating module is used for updating the embedded tone table represented by the coding embedded layer of the acoustic model by utilizing the corresponding relation and obtaining the target tone characteristic corresponding to the target object identifier by inquiring the embedded tone table.
In still other embodiments, the aforementioned speech conversion model training module may further include:
The reference fundamental frequency calling module is used for calling the reference fundamental frequency of the training voice;
a fundamental frequency characteristic obtaining module, configured to input the reference fundamental frequency of the training speech into a fundamental frequency processing model to obtain a fundamental frequency characteristic of a corresponding utterance object;
a fundamental frequency processing model parameter updating module, configured to update a third parameter of the fundamental frequency processing model according to the second error in the back propagation process;
based on this, the speech synthesis module 106 may include:
and the second voice synthesis unit is used for inputting the fundamental frequency characteristic and the predicted acoustic characteristic into a vocoder to obtain the predicted voice of the training object.
In still other embodiments, the aforementioned speech conversion model training module may further include:
A prediction fundamental frequency obtaining module, configured to input prosodic information included in the training speech recognition result into a fundamental frequency prediction model, so as to obtain respective prediction fundamental frequencies of the uttered objects included in the training object;
the fundamental frequency prediction target value calling module is used for calling the fundamental frequency prediction target value of the sounding object corresponding to the object identification;
the fundamental frequency prediction error acquisition module is used for acquiring a third error between the prediction fundamental frequency and the fundamental frequency prediction target value of the same sounding object;
and the fundamental frequency prediction model parameter updating module is used for updating the fourth parameter of the fundamental frequency prediction model according to the third error in the reverse transmission process.
Optionally, in order to obtain the fundamental frequency prediction target value of the utterance object, the speech conversion model training module may further include:
the voice acquisition module is used for acquiring a plurality of voices generated by the same voice object in the training data set;
a reference fundamental frequency extracting module, configured to extract respective reference fundamental frequencies of the multiple voices;
and the normalization processing module is used for performing normalization processing on the respective reference fundamental frequencies of the multiple pieces of voice of the same voice-making object to obtain the fundamental frequency prediction target values of the corresponding voice-making objects.
Based on the above description of the embodiments, the model parameter update module 108 may include:
the model training monitoring unit is used for training the updated acoustic model and the vocoder until a first training constraint condition is met, and stopping training the acoustic model;
a vocoder training unit, which is used for inputting the predicted voice output by the vocoder in the next training and the corresponding training voice into a discriminator, updating the second parameter of the vocoder according to the discrimination result, training the vocoder with the updated second parameter until the second training constraint condition is satisfied, and obtaining a voice conversion model;
wherein the first training constraint is configured for a training process of the acoustic model; the second training constraint is configured for a training process of the vocoder.
In some embodiments, in combination with the above analysis, the voice conversion module 103 may include:
The feature extraction unit is used for inputting the voice recognition result and the target object identification into the acoustic model which is pre-trained to obtain target voice coding features and target acoustic features;
the fundamental frequency prediction unit is used for inputting prosodic information contained in the voice recognition result into the pre-trained fundamental frequency prediction model to obtain a target prediction fundamental frequency;
the prediction reference fundamental frequency obtaining unit is used for calling the normalized fundamental frequency information corresponding to the target object identification, and performing reverse normalization processing on the target prediction fundamental frequency by using the normalized fundamental frequency information of the target sounding object to obtain the prediction reference fundamental frequency of the target sounding object;
the fundamental frequency processing unit is used for inputting the prediction reference fundamental frequency into the pre-trained fundamental frequency processing model to obtain the target fundamental frequency characteristic of the target sounding object;
a voice synthesis unit, configured to input the target voice coding feature, the target acoustic feature, and the target fundamental frequency feature into a pre-trained vocoder, and output a target voice of a target sound generation object; the target speech has target timbre features corresponding to the target object identification and the content of the source speech.
Optionally, the speech synthesis unit may include at least one of the following synthesis units:
a first synthesizing unit, configured to input the target timbre feature, the target fundamental frequency feature, and a target pronunciation feature and a target prosody feature in the target speech coding feature into a pre-trained vocoder, and output a target speech of the target utterance object;
a second synthesis unit, configured to input the target timbre feature, the target fundamental frequency feature, and a target pronunciation feature in the target speech coding feature into a pre-trained vocoder, and output a target speech of the target pronunciation object; or,
and a third synthesis unit, configured to input the target speech coding feature, the target acoustic feature, the target fundamental frequency feature, and a preset speech energy feature into a pre-trained vocoder, and output a target speech of the target sound generation object.
It should be noted that, various modules, units, and the like in the embodiments of the foregoing apparatuses may be stored in the memory as program modules, and the processor executes the program modules stored in the memory to implement corresponding functions, and for the functions implemented by the program modules and their combinations and the achieved technical effects, reference may be made to the description of corresponding parts in the embodiments of the foregoing methods, which is not described in detail in this embodiment.
The present application also proposes a computer-readable storage medium on which a computer program can be stored, which can be called and loaded by a processor to implement the steps of the speech conversion method described in the above embodiments.
The present application also proposes a computer program product or a computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes various optional embodiments of the voice conversion method and apparatus.
Finally, the embodiments in the present specification are described in a progressive or parallel manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device, the computer device and the system disclosed by the embodiment correspond to the method disclosed by the embodiment, so that the description is relatively simple, and the relevant points can be referred to the method part for description.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (14)

1. A voice conversion method, comprising:
acquiring source speech of any sound-producing object and a target object identifier of a target sound-producing object;
performing speech recognition on the source speech to obtain a speech recognition result;
inputting the speech recognition result and the target object identifier into a voice conversion model, and outputting target speech having a target timbre feature corresponding to the target object identifier and the content of the source speech;
the voice conversion model comprises an acoustic model and a vocoder which are obtained through synchronous training, and input information used for training the vocoder comprises output information of the acoustic model.
2. The method of claim 1, wherein the voice conversion model is obtained by pre-training, and the training method comprises:
acquiring a training speech recognition result of training speech generated by a training object; wherein the training object comprises at least one sound-producing object configured with a corresponding object identifier, and the training speech is speech generated by the corresponding sound-producing object in a training data set;
inputting the training speech recognition result and the object identifier into an acoustic model to obtain a predicted acoustic feature of the training object, and recording a timbre feature of the sound-producing object corresponding to the object identifier;
inputting the predicted acoustic feature into a vocoder to obtain predicted speech of the training object;
acquiring a first error between the predicted acoustic feature and a reference acoustic feature of the training object, and a second error between the predicted speech and the training speech;
in a back propagation process, updating a first parameter of the acoustic model according to the first error, updating a second parameter of the vocoder according to the second error, and training the updated acoustic model and vocoder to obtain the voice conversion model.
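The following is a minimal, non-limiting sketch of one way such a synchronous training step could look, assuming PyTorch-style modules; the L1 losses, the optimizers, and the detach that keeps each error on its own sub-model are illustrative assumptions, not features recited by the claim.

```python
import torch.nn.functional as F

def synchronous_train_step(acoustic_model, vocoder, opt_acoustic, opt_vocoder,
                           train_asr_result, object_id,
                           reference_acoustic, training_speech):
    # Acoustic model: recognition result + object identifier -> predicted acoustic feature
    predicted_acoustic = acoustic_model(train_asr_result, object_id)
    # Vocoder: predicted acoustic feature -> predicted speech (detached so that
    # the second error only reaches the vocoder's parameters)
    predicted_speech = vocoder(predicted_acoustic.detach())

    first_error = F.l1_loss(predicted_acoustic, reference_acoustic)
    second_error = F.l1_loss(predicted_speech, training_speech)

    opt_acoustic.zero_grad()
    first_error.backward()    # updates the first parameter (acoustic model)
    opt_acoustic.step()

    opt_vocoder.zero_grad()
    second_error.backward()   # updates the second parameter (vocoder)
    opt_vocoder.step()
    return first_error.item(), second_error.item()
```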
3. The method of claim 2, wherein inputting the training speech recognition result and the object identifier into an acoustic model to obtain a predicted acoustic feature of the training object comprises:
performing feature extraction on the training speech recognition result to obtain a speech coding feature and the timbre feature corresponding to the object identifier; and
performing fusion processing on the speech coding feature and the timbre feature to obtain the predicted acoustic feature of the training object;
wherein inputting the predicted acoustic feature into a vocoder to obtain the predicted speech of the training object comprises:
inputting the speech coding feature and the predicted acoustic feature into the vocoder to obtain the predicted speech of the training object.
4. The method of claim 3, wherein the training method further comprises:
wherein the training object is a specified sound-producing object, and a correspondence between an object identifier of the specified sound-producing object and the timbre feature extracted by the acoustic model is recorded, the target sound-producing object comprising any specified sound-producing object; and
updating an embedded timbre table, represented by an encoding embedding layer of the acoustic model, with the correspondence, and obtaining the target timbre feature corresponding to the target object identifier by querying the embedded timbre table.
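A minimal sketch of how such an embedded timbre table might be maintained and queried, assuming it is held as an embedding layer whose rows are overwritten with the recorded timbre features; the table capacity and feature dimension are arbitrary placeholders.

```python
import torch
import torch.nn as nn

NUM_OBJECTS, TIMBRE_DIM = 128, 256   # assumed table capacity and feature size
embedded_timbre_table = nn.Embedding(NUM_OBJECTS, TIMBRE_DIM)

def record_timbre(object_id: int, timbre_feature: torch.Tensor) -> None:
    """Write the timbre feature extracted by the acoustic model into the row
    of the corresponding specified sound-producing object."""
    with torch.no_grad():
        embedded_timbre_table.weight[object_id] = timbre_feature

def query_target_timbre(target_object_id: int) -> torch.Tensor:
    """Look up the target timbre feature for the target object identifier."""
    return embedded_timbre_table(torch.tensor(target_object_id))
```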
5. The method of claim 2, wherein the training method further comprises:
calling a reference fundamental frequency of the training speech;
inputting the reference fundamental frequency of the training speech into a fundamental frequency processing model to obtain a fundamental frequency feature of the corresponding sound-producing object; and
updating a third parameter of the fundamental frequency processing model according to the second error in the back propagation process;
wherein inputting the predicted acoustic feature into a vocoder to obtain the predicted speech of the training object comprises:
inputting the fundamental frequency feature and the predicted acoustic feature into the vocoder to obtain the predicted speech of the training object.
6. The method of claim 5, wherein the training method further comprises:
inputting prosodic information contained in the training speech recognition result into a fundamental frequency prediction model to obtain a predicted fundamental frequency of each sound-producing object contained in the training object;
calling a fundamental frequency prediction target value of the sound-producing object corresponding to the object identifier;
acquiring a third error between the predicted fundamental frequency and the fundamental frequency prediction target value of the same sound-producing object; and
updating a fourth parameter of the fundamental frequency prediction model according to the third error in the back propagation process.
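A minimal sketch of one such update step for the fundamental frequency prediction model, assuming an L1 loss and PyTorch-style modules (neither is fixed by the claim):

```python
import torch.nn.functional as F

def f0_predictor_step(f0_prediction_model, optimizer, prosodic_info, f0_target):
    predicted_f0 = f0_prediction_model(prosodic_info)   # predicted fundamental frequency
    third_error = F.l1_loss(predicted_f0, f0_target)    # vs. the prediction target value
    optimizer.zero_grad()
    third_error.backward()                              # updates the fourth parameter
    optimizer.step()
    return third_error.item()
```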
7. The method of any one of claims 2 to 6, wherein training the updated acoustic model and vocoder to obtain the voice conversion model comprises:
training the updated acoustic model and vocoder until a first training constraint condition is met, and then stopping training of the acoustic model; and
inputting the predicted speech output by the vocoder in subsequent training and the corresponding training speech into a discriminator, updating the second parameter of the vocoder according to a discrimination result, and training the vocoder with the updated second parameter until a second training constraint condition is met, to obtain the voice conversion model;
wherein the first training constraint condition is configured for the training process of the acoustic model, and the second training constraint condition is configured for the training process of the vocoder.
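One possible, non-authoritative reading of this second training stage, sketched with a hinge-style adversarial loss; the claim only states that the vocoder's second parameter is updated according to the discrimination result, so the discriminator form and the losses below are assumptions.

```python
import torch.nn.functional as F

def vocoder_adversarial_step(vocoder, discriminator, opt_vocoder, opt_disc,
                             predicted_acoustic, training_speech):
    predicted_speech = vocoder(predicted_acoustic)

    # Discriminator sees the training speech and the (detached) predicted speech.
    d_loss = (F.relu(1.0 - discriminator(training_speech)).mean()
              + F.relu(1.0 + discriminator(predicted_speech.detach())).mean())
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()

    # The vocoder's second parameter is updated from the discrimination result.
    g_loss = -discriminator(predicted_speech).mean()
    opt_vocoder.zero_grad(); g_loss.backward(); opt_vocoder.step()
    return d_loss.item(), g_loss.item()
```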
8. The method of claim 6, wherein obtaining the fundamental frequency prediction target value of a sound-producing object comprises:
acquiring a plurality of speech samples generated by the same sound-producing object in a training data set;
extracting a reference fundamental frequency of each of the plurality of speech samples; and
normalizing the reference fundamental frequencies of the plurality of speech samples of the same sound-producing object to obtain the fundamental frequency prediction target value of the corresponding sound-producing object.
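A minimal sketch of one common way to realize this normalization, using the per-speaker mean and standard deviation over voiced frames; the claim does not fix the particular statistics, so these are assumptions.

```python
import numpy as np

def f0_prediction_targets(reference_f0s):
    """reference_f0s: list of 1-D arrays, one per speech sample of the same
    sound-producing object; unvoiced frames are assumed to be 0."""
    voiced = np.concatenate([f0[f0 > 0] for f0 in reference_f0s])
    mean, std = voiced.mean(), voiced.std() + 1e-8
    targets = [np.where(f0 > 0, (f0 - mean) / std, 0.0) for f0 in reference_f0s]
    # mean/std double as the "normalized fundamental frequency information"
    # needed for the inverse normalization at conversion time (claim 9).
    return targets, (mean, std)
```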
9. The method of claim 6, wherein inputting the speech recognition result and the target object identifier into a voice conversion model, and outputting target speech having a target timbre feature corresponding to the target object identifier and the content of the source speech comprises:
inputting the speech recognition result and the target object identifier into the pre-trained acoustic model to obtain a target speech coding feature and a target acoustic feature;
inputting prosodic information contained in the speech recognition result into the pre-trained fundamental frequency prediction model to obtain a target predicted fundamental frequency;
calling normalized fundamental frequency information corresponding to the target object identifier, and performing inverse normalization on the target predicted fundamental frequency by using the normalized fundamental frequency information of the target sound-producing object to obtain a prediction reference fundamental frequency of the target sound-producing object;
inputting the prediction reference fundamental frequency into the pre-trained fundamental frequency processing model to obtain a target fundamental frequency feature of the target sound-producing object; and
inputting the target speech coding feature, the target acoustic feature, and the target fundamental frequency feature into the pre-trained vocoder, and outputting the target speech of the target sound-producing object; the target speech has the target timbre feature corresponding to the target object identifier and the content of the source speech.
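An illustrative sketch wiring the above conversion-time steps together; every callable stands in for a pre-trained model, and representing the normalized fundamental frequency information as a per-speaker mean and standard deviation is an assumption.

```python
def convert(speech_recognition_result, prosodic_info, target_object_id,
            acoustic_model, f0_prediction_model, f0_processing_model,
            vocoder, f0_stats):
    # Pre-trained acoustic model -> target speech coding / acoustic features
    coding_feature, acoustic_feature = acoustic_model(speech_recognition_result,
                                                      target_object_id)
    # Fundamental frequency prediction model -> normalized target prediction
    predicted_f0_norm = f0_prediction_model(prosodic_info)
    # Inverse normalization with the target object's stored statistics
    mean, std = f0_stats[target_object_id]
    prediction_reference_f0 = predicted_f0_norm * std + mean
    # Fundamental frequency processing model -> target fundamental frequency feature
    f0_feature = f0_processing_model(prediction_reference_f0)
    # Pre-trained vocoder -> target speech of the target sound-producing object
    return vocoder(coding_feature, acoustic_feature, f0_feature)
```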
10. The method of claim 9, wherein inputting the target speech coding feature, the target acoustic feature, and the target fundamental frequency feature into the pre-trained vocoder, and outputting the target speech of the target sound-producing object comprises:
inputting the target timbre feature, the target fundamental frequency feature, and a target pronunciation feature and a target prosody feature in the target speech coding feature into the pre-trained vocoder, and outputting the target speech of the target sound-producing object; or
inputting the target timbre feature, the target fundamental frequency feature, and a target pronunciation feature in the target speech coding feature into the pre-trained vocoder, and outputting the target speech of the target sound-producing object; or
inputting the target speech coding feature, the target acoustic feature, the target fundamental frequency feature, and a preset speech energy feature into the pre-trained vocoder, and outputting the target speech of the target sound-producing object.
11. A voice conversion apparatus, characterized in that the apparatus comprises:
a data acquisition module, configured to acquire source speech of a source sound-producing object and a target object identifier of a target sound-producing object;
a speech recognition module, configured to perform feature extraction on the source speech to obtain a speech recognition result; and
a voice conversion module, configured to input the speech recognition result and the target object identifier into a voice conversion model, and output target speech having a target timbre feature corresponding to the target object identifier and the content of the source speech;
the voice conversion model comprises an acoustic model and a vocoder which are obtained through synchronous training, and input information used for training the vocoder comprises output information of the acoustic model.
12. A computer device, characterized in that the computer device comprises: at least one memory and at least one processor, wherein:
the memory is configured to store a program for implementing the voice conversion method according to any one of claims 1 to 10; and
the processor is configured to load and execute the program stored in the memory to implement the voice conversion method according to any one of claims 1 to 10.
13. A computer-readable storage medium, having stored thereon a computer program which, when loaded and executed by a processor, implements the voice conversion method according to any one of claims 1 to 10.
14. A computer program product comprising computer instructions, wherein the computer instructions, when read and executed by a processor, implement the voice conversion method according to any one of claims 1 to 10.
CN202111362172.XA 2021-11-17 2021-11-17 Voice conversion method and related equipment Pending CN114067806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111362172.XA CN114067806A (en) 2021-11-17 2021-11-17 Voice conversion method and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111362172.XA CN114067806A (en) 2021-11-17 2021-11-17 Voice conversion method and related equipment

Publications (1)

Publication Number Publication Date
CN114067806A true CN114067806A (en) 2022-02-18

Family

ID=80273279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111362172.XA Pending CN114067806A (en) 2021-11-17 2021-11-17 Voice conversion method and related equipment

Country Status (1)

Country Link
CN (1) CN114067806A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132204A (en) * 2022-06-10 2022-09-30 腾讯科技(深圳)有限公司 Voice processing method, device, storage medium and computer program product
CN115132204B (en) * 2022-06-10 2024-03-22 腾讯科技(深圳)有限公司 Voice processing method, equipment, storage medium and computer program product

Similar Documents

Publication Publication Date Title
US11854563B2 (en) System and method for creating timbres
US11676575B2 (en) On-device learning in a hybrid speech processing system
WO2020253509A1 (en) Situation- and emotion-oriented chinese speech synthesis method, device, and storage medium
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN113129914A (en) Cross-language speech conversion system and method
CN113168832A (en) Alternating response generation
CN112992109B (en) Auxiliary singing system, auxiliary singing method and non-transient computer readable recording medium
KR20230133362A (en) Generate diverse and natural text-to-speech conversion samples
CN116798405B (en) Speech synthesis method, device, storage medium and electronic equipment
CN117373431A (en) Audio synthesis method, training method, device, equipment and storage medium
US20240185831A1 (en) Generating a synthetic voice using neural networks
US20210183358A1 (en) Speech processing
CN114067806A (en) Voice conversion method and related equipment
US11769491B1 (en) Performing utterance detection using convolution
Shechtman et al. Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture.
CN112908293A (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN117292022A (en) Video generation method and device based on virtual object and electronic equipment
KR20220154655A (en) Device, method and computer program for generating voice data based on family relationship
WO2022039636A1 (en) Method for synthesizing speech and transmitting the authentic intonation of a clonable sample
CN117014675B (en) Video generation method, device and computer readable storage medium for virtual object
CN116863909B (en) Speech synthesis method, device and system based on factor graph
CN113870875A (en) Tone feature extraction method, device, computer equipment and storage medium
CN117524229A (en) Video processing method, device, equipment and medium
CN115116426A (en) Voice generation method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination