CN111161695A - Song generation method and device - Google Patents

Song generation method and device

Info

Publication number
CN111161695A
Authority
CN
China
Prior art keywords
singing voice
accompaniment
signal
song audio
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911362233.5A
Other languages
Chinese (zh)
Other versions
CN111161695B (en)
Inventor
熊皓
何中军
李芝
吴华
王海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201911362233.5A priority Critical patent/CN111161695B/en
Publication of CN111161695A publication Critical patent/CN111161695A/en
Application granted granted Critical
Publication of CN111161695B publication Critical patent/CN111161695B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H - ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 - Details of electrophonic musical instruments
    • G10H1/0008 - Associated control or indicating means
    • G10H1/0025 - Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G10H1/36 - Accompaniment arrangements
    • G10H2210/00 - Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/005 - Musical accompaniment, i.e. complete instrumental rhythm synthesis added to a performed melody, e.g. as output by drum machines
    • G10H2210/101 - Music composition or musical creation; Tools or processes therefor
    • G10H2210/111 - Automatic composing, i.e. using predefined musical rules

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)

Abstract

The present disclosure relates to the field of audio data processing. Embodiments of the disclosure disclose a song generation method and device. The method comprises the following steps: extracting a first accompaniment signal, lyrics of a first language and a first singing voice signal from song audio of the first language; translating the lyrics of the first language into lyrics of a second language; inputting the first accompaniment signal and the lyrics of the second language into a trained accompaniment generation model to obtain a second accompaniment signal; inputting the first singing voice signal and the lyrics of the second language into a trained singing voice generation model to generate a second singing voice signal; and synthesizing the second accompaniment signal and the second singing voice signal into song audio of the second language. The method realizes automatic generation of songs in different languages and reduces the production cost of multi-language songs.

Description

Song generation method and device
Technical Field
The embodiment of the disclosure relates to the technical field of computers, in particular to the technical field of audio data processing, and particularly relates to a song generation method and device.
Background
A song is a vocal work that combines human voice with music. A song is typically made by recording a singer's singing voice and then combining the singing voice with an accompaniment.
For an existing song, translating its lyrics into versions in other languages can widen the circulation of the song and enrich the forms of musical works. The current approach to producing versions of a song in different languages requires a singer to re-sing the original song in each target language. This approach is costly and is not conducive to batch production of songs in different language versions.
Disclosure of Invention
Embodiments of the present disclosure propose a song generation method and apparatus, an electronic device, and a computer-readable medium.
In a first aspect, an embodiment of the present disclosure provides a song generating method, including: extracting a first accompaniment signal, lyrics of a first language and a first singing voice signal from the song audio of the first language; translating the lyrics of the first language into the lyrics of the second language; inputting the first accompaniment signal and the lyrics of the second language into a trained accompaniment generation model to obtain a second accompaniment signal; inputting the first singing voice signal and the lyrics of the second language into a trained singing voice generation model to generate a second singing voice signal; the second accompaniment signal and the second singing voice signal are synthesized into the song audio of the second language.
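For illustration only, the following minimal sketch shows the data flow of these five steps; the helper callables (separate_tracks, translate_lyrics, accompaniment_model, singing_voice_model, synthesize_song) are hypothetical placeholders passed in by the caller and are not part of the claimed method.
```python
def generate_song_in_second_language(song_audio, second_language,
                                     separate_tracks, translate_lyrics,
                                     accompaniment_model, singing_voice_model,
                                     synthesize_song):
    """Data-flow sketch of the five steps; every callable is a hypothetical placeholder."""
    accompaniment_1, lyrics_1, singing_1 = separate_tracks(song_audio)       # step 1: extract signals and lyrics
    lyrics_2 = translate_lyrics(lyrics_1, target_language=second_language)   # step 2: translate lyrics
    accompaniment_2 = accompaniment_model(accompaniment_1, lyrics_2)         # step 3: second accompaniment signal
    singing_2 = singing_voice_model(singing_1, lyrics_2)                     # step 4: second singing voice signal
    return synthesize_song(accompaniment_2, singing_2)                       # step 5: song audio of the second language
```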
In some embodiments, the above method further comprises: training an accompaniment generation model based on a first sample song audio set, comprising: acquiring lyrics of a corresponding language of a first sample song audio in a first sample song audio set; extracting an accompaniment signal from the first sample song audio, inputting the accompaniment signal of the first sample song audio and the lyrics of the corresponding language of the first sample song into an accompaniment generation model to be trained, and obtaining a prediction result of the accompaniment signal of the first sample song audio; and iteratively adjusting parameters of the accompaniment generation model based on the difference between the prediction result of the accompaniment signals of the first sample song audio by the accompaniment generation model to be trained and the accompaniment signals extracted from the corresponding first sample song audio.
In some embodiments, the accompaniment generation model includes a first music encoder, a first text encoder, a first spectrum decoder, and a first vocoder; a first music encoder encoding an accompaniment signal inputted to the accompaniment generation model; a first text encoder performs text encoding on the lyrics of the input accompaniment generation model; the first spectrum decoder decodes based on the coding results of the first music coder and the first text coder to obtain corresponding spectrum signals; the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
In some embodiments, the above method further comprises: training a singing voice generation model based on the second sample song audio set, wherein the singing voice generation model comprises a speaker voice print coder and a singing voice generation sub-model; training a singing voice generation model based on the second sample song audio set, comprising: training a speaker voiceprint encoder based on the speaker voiceprint recognition task; acquiring lyrics of a corresponding language of a second sample song audio in a second sample song audio set; extracting a singing voice signal from the second sample song audio, and extracting the speaker voice print characteristic of the second sample song audio from the singing voice signal of the second sample song audio by using the trained speaker voice print encoder; inputting the singing voice signal of the second sample song audio, the lyrics of the corresponding language of the second sample song audio and the voice print characteristics of the speaker of the second sample song audio into a singing voice generation sub-model to be trained to obtain a prediction result of the singing voice signal of the second sample song audio; and iteratively adjusting parameters of the singing voice generation submodel based on the difference between the prediction result of the singing voice signal of the second sample song audio by the singing voice generation model to be trained and the singing voice signal extracted from the corresponding second sample song audio.
In some embodiments, the singing voice generation submodel includes: a second music encoder, a second text encoder, a second spectrum decoder, and a second vocoder; a second music encoder encodes the singing voice signal of the input singing voice generation submodel; the second text encoder performs text encoding on the lyrics of the input singing voice generation submodel; the second frequency spectrum decoder decodes based on the coding results of the speaker voiceprint coder, the second music coder and the second text coder to obtain corresponding frequency spectrum signals; the second vocoder generates a singing voice signal of the song based on the spectrum signal decoded by the second spectrum decoder.
In a second aspect, an embodiment of the present disclosure provides a song generating apparatus, including: an extraction unit configured to extract a first accompaniment signal, lyrics of a first language, and a first singing voice signal from a song audio of the first language; a translation unit configured to translate the lyrics of the first language into the lyrics of the second language; a first generation unit configured to input the first accompaniment signal and the lyrics of the second language into a trained accompaniment generation model to obtain a second accompaniment signal; a second generation unit configured to input the first singing voice signal and the lyrics of the second language into the trained singing voice generation model, generating a second singing voice signal; a conversion unit configured to synthesize the second accompaniment signal and the second singing voice signal into a song audio of a second language.
In some embodiments, the above apparatus further comprises: a first training unit configured to train the accompaniment generation model based on the first sample song audio set as follows: acquiring lyrics of a corresponding language of a first sample song audio in a first sample song audio set; extracting an accompaniment signal from the first sample song audio, inputting the accompaniment signal of the first sample song audio and the lyrics of the corresponding language of the first sample song into an accompaniment generation model to be trained, and obtaining a prediction result of the accompaniment signal of the first sample song audio; and iteratively adjusting parameters of the accompaniment generation model based on the difference between the prediction result of the accompaniment signals of the first sample song audio by the accompaniment generation model to be trained and the accompaniment signals extracted from the corresponding first sample song audio.
In some embodiments, the accompaniment generation model includes a first music encoder, a first text encoder, a first spectrum decoder, and a first vocoder; a first music encoder encoding an accompaniment signal inputted to the accompaniment generation model; a first text encoder performs text encoding on the lyrics of the input accompaniment generation model; the first spectrum decoder decodes based on the coding results of the first music coder and the first text coder to obtain corresponding spectrum signals; the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
In some embodiments, the above apparatus further comprises: a second training unit configured to train a singing voice generation model based on the second sample song audio set, wherein the singing voice generation model comprises a speaker voice print coder and a singing voice generation submodel; the second training unit is configured to train the singing voice generation model as follows: training a speaker voiceprint encoder based on the speaker voiceprint recognition task; acquiring lyrics of a corresponding language of a second sample song audio in a second sample song audio set; extracting a singing voice signal from a second sample song audio, and extracting the speaker voice print characteristic of the second sample song audio from the singing voice signal of the second sample song audio by using a trained speaker voice print encoder; inputting the singing voice signal of the second sample song audio, the lyrics of the corresponding language of the second sample song audio and the voice print characteristics of the speaker of the second sample song audio into a singing voice generation sub-model to be trained to obtain a prediction result of the singing voice signal of the second sample song audio; and iteratively adjusting parameters of the singing voice generation submodel based on the difference between the prediction result of the singing voice signal of the second sample song audio by the singing voice generation model to be trained and the singing voice signal extracted from the corresponding second sample song audio.
In some embodiments, the singing voice generation submodel includes: a second music encoder, a second text encoder, a second spectrum decoder, and a second vocoder; a second music encoder encodes the singing voice signal of the input singing voice generation submodel; the second text encoder performs text encoding on the lyrics of the input singing voice generation submodel; the second frequency spectrum decoder decodes based on the coding results of the speaker voiceprint coder, the second music coder and the second text coder to obtain corresponding frequency spectrum signals; the second vocoder generates a singing voice signal of the song based on the spectrum signal decoded by the second spectrum decoder.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device for storing one or more programs which, when executed by one or more processors, cause the one or more processors to implement the song generation method as provided in the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the song generation method provided in the first aspect.
According to the song generation method and device disclosed by the embodiments of the present disclosure, a first accompaniment signal, lyrics of a first language and a first singing voice signal are extracted from song audio of the first language; the lyrics of the first language are translated into lyrics of a second language; the first accompaniment signal and the lyrics of the second language are input into a trained accompaniment generation model to obtain a second accompaniment signal; the first singing voice signal and the lyrics of the second language are input into a trained singing voice generation model to generate a second singing voice signal; and the second accompaniment signal and the second singing voice signal are synthesized into song audio of the second language. Automatic generation of songs in different languages is thereby realized, the cost of producing multi-language songs is reduced, and the circulation of song works can be effectively improved.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which embodiments of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a song generation method according to the present disclosure;
FIG. 3 is an exemplary block diagram of an accompaniment generation model;
FIG. 4 is a schematic diagram of a first spectral decoder in an accompaniment generation model;
FIG. 5 is a diagram of an exemplary structure of a singing voice generation model;
FIG. 6 is a flow diagram of another embodiment of a song generation method according to the present disclosure;
FIG. 7 is a schematic block diagram of one embodiment of a song generation apparatus of the present disclosure;
FIG. 8 is a schematic block diagram of a computer system suitable for use in implementing an electronic device of an embodiment of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which the song generation method or song generation apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may be user terminal devices on which various audio service applications may be installed, such as singing applications, audio and video playing applications, voice service applications, and the like. The user 110 can record audio files using the terminal devices 101, 102, 103, or play audio files through the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smart phones, tablet computers, e-book readers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server running various services, for example a server providing background support for applications running on the terminal devices 101, 102, 103. The server 105 may receive the audio transmitted by the terminal apparatuses 101, 102, 103, process the audio data, and feed back the processing result to the terminal apparatuses 101, 102, 103.
In a particular application scenario, server 105 may be a server providing an automatic translation service for songs. The server 105 may receive the audio file of the song desired to be translated uploaded by the user from the terminal device 101, 102, 103, then translate the song into another language using the trained model, generate a new language song audio, and transmit the generated new language song audio to the terminal device 101, 102, 103. The terminal apparatuses 101, 102, 103 can output song audio of a new language to the user through the audio output device.
It should be noted that the song generation method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the song generation apparatus is generally disposed in the server 105.
In some scenarios, the server 105 may retrieve song audio to be translated from a database, memory, or other device, in which case the exemplary system architecture 100 may be absent the terminal devices 101, 102, 103 and the network 104.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a song generation method according to the present disclosure is shown. The song generation method comprises the following steps:
step 201, a first accompaniment signal, lyrics of a first language and a first singing voice signal are extracted from the song audio of the first language.
In this embodiment, the execution subject of the song generation method may first obtain the song audio of the first language as the song audio to be translated. The song audio of the first language may be obtained from a music file in which a singer sings the song.
In practice, a user may specify a song and issue a request to translate the song into a version in another language. The execution subject may acquire the audio of the designated song as the song audio of the first language according to the request of the user. Alternatively, the execution subject may select at least one song audio of the first language from a song audio set stored in a database or downloaded from a network.
Then, the singing voice signal and the accompaniment signal may be separated from the song audio of the first language. Specifically, the higher-frequency vocal signal and the lower-frequency accompaniment signal may be filtered out separately, so as to separate the singing voice signal and the accompaniment signal in the song. The lyrics of the first language of the song may be obtained from a database or a network, or may be obtained by performing speech recognition on the song audio or on the singing voice signal extracted from the song audio.
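As a rough illustration of this separation step, the following sketch splits a song into a lower-frequency band and a higher-frequency band with Butterworth filters; the cutoff frequency, file name and use of scipy/soundfile are assumptions, and a production system would more likely rely on a dedicated source-separation model.
```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

def split_song(path: str, cutoff_hz: float = 500.0):
    audio, sr = sf.read(path)
    if audio.ndim > 1:
        audio = audio.mean(axis=1)                       # mix down to mono
    low = butter(6, cutoff_hz, btype="lowpass", fs=sr, output="sos")
    high = butter(6, cutoff_hz, btype="highpass", fs=sr, output="sos")
    accompaniment = sosfiltfilt(low, audio)              # keep the lower-frequency accompaniment band
    singing_voice = sosfiltfilt(high, audio)             # keep the higher-frequency vocal band
    return accompaniment, singing_voice, sr

# accompaniment_1, singing_1, sr = split_song("song_first_language.wav")  # placeholder file name
```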
Step 202, the lyrics of the first language are translated into the lyrics of the second language.
The lyrics in the first language may be translated into lyrics in the second language using a trained text translation model. Here, the second language may be specified in advance. In a particular scenario, the user may specify the language into which the song is to be translated, such as English, Spanish or Japanese, and the language specified by the user is taken as the second language.
The text translation model described above may be trained as follows: text training corpora are collected, and the text translation model is pre-trained using the text training corpora. The text training corpora may include text sentences in the first language and corresponding text sentences in the second language. After the pre-training is completed, lyric translation texts (e.g., lyrics of different language versions of the same song) can be collected to construct lyric sentence pairs, and the pre-trained text translation model can be fine-tuned using the lyric sentence pairs.
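As an illustration of this pre-train-then-fine-tune procedure, a hedged sketch using a publicly available translation checkpoint is given below; the Helsinki-NLP/opus-mt-zh-en model, the transformers API (including the text_target argument of recent versions) and the toy lyric pair are assumptions, not part of this disclosure.
```python
import torch
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-zh-en"                   # assumed pre-trained checkpoint
tokenizer = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)           # pre-trained on general text corpora
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

lyric_pairs = [("第一语种的歌词句子", "a lyric sentence in the second language")]  # toy lyric sentence pair

model.train()
for src, tgt in lyric_pairs:                          # fine-tune on lyric sentence pairs
    batch = tokenizer(src, text_target=tgt, return_tensors="pt", padding=True, truncation=True)
    loss = model(**batch).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After fine-tuning, lyrics of the first language can be translated, e.g.:
# out = model.generate(**tokenizer("第一语种的歌词", return_tensors="pt"))
# print(tokenizer.decode(out[0], skip_special_tokens=True))
```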
Step 203, inputting the first accompaniment signal and the lyrics of the second language into the trained accompaniment generation model to obtain a second accompaniment signal.
In this embodiment, a second accompaniment signal adapted to the lyrics of the second language may be generated based on the first accompaniment signal of the original song audio and the translated lyrics. Specifically, an accompaniment generation model may be trained, and the trained accompaniment generation model may be used to generate the second accompaniment signal.
The accompaniment generation model may be a model constructed based on a neural network. Accompaniment signal pairs may be constructed from the accompaniment signals of the song audio of the same song in different languages and used to train the accompaniment generation model. Alternatively, corresponding lyrics and accompaniment signals may be extracted from sample song audio and used as the input of the accompaniment generation model to be trained; a quality score of the accompaniment signal output by the accompaniment generation model to be trained is obtained, and the quality score is fed back to adjust the parameters of the accompaniment generation model to be trained by back propagation or by reinforcement learning. In this way, the accompaniment generation model can learn to add lyric information to the musical accompaniment during training. When the trained accompaniment generation model is applied, it can fuse the information of the lyrics of the second language into the accompaniment signal of the first language to obtain a second accompaniment signal containing the lyric information of the second language.
Optionally, the accompaniment generation model may include a first music encoder, a first text encoder, a first spectrum decoder, and a first vocoder. Wherein, the first music coder encodes the accompaniment signal input into the accompaniment generation model; a first text encoder performs text encoding on the lyrics of the input accompaniment generation model; the first spectrum decoder decodes based on the coding results of the first music coder and the first text coder to obtain corresponding spectrum signals; the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
Please refer to fig. 3, which shows a schematic structural diagram of an accompaniment generation model.
As shown in fig. 3, spectral features are first extracted from the accompaniment signal by MFCC (Mel Frequency Cepstral Coefficient) sampling and then input to the first music encoder. The first music encoder may have a structure similar to that of the encoder in a natural language processing Transformer unit, including a plurality of Self-Attention layers. The lyrics are converted into text embeddings and then input to the first text encoder for encoding. The structure of the first text encoder may also be similar to that of the encoder in a Transformer unit, including a plurality of Self-Attention layers.
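As an illustration of this feature extraction step, a short sketch using librosa is given below; the file name and the number of coefficients are placeholders chosen for demonstration.
```python
import librosa

accompaniment, sr = librosa.load("accompaniment_first_language.wav", sr=None)  # placeholder file
mfcc = librosa.feature.mfcc(y=accompaniment, sr=sr, n_mfcc=20)   # shape: (n_mfcc, n_frames)
mfcc_frames = mfcc.T                                             # (n_frames, n_mfcc) sequence fed to the music encoder
```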
The first spectrum decoder decodes the result of encoding the accompaniment signal by the first music encoder and the result of encoding the lyrics by the first text encoder. Here, the first spectrum decoder may employ a structure similar to that of the decoder in a Transformer unit, including a plurality of Multi-Head Attention units.
Fig. 4 shows a schematic of the structure of the first spectrum decoder. As shown in fig. 4, the first spectrum decoder includes at least three Multi-Head Attention units. The first Multi-Head Attention unit 1 receives the spectral signal obtained by MFCC feature extraction from the accompaniment signal already predicted by the accompaniment generation model, and the second Multi-Head Attention unit 2 and the third Multi-Head Attention unit 3 receive the outputs of the first music encoder and the first text encoder, respectively. In this way, the first spectrum decoder may fuse the information of the accompaniment signal and the lyrics, so that the output of the accompaniment generation model contains the information of the lyrics.
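A minimal PyTorch sketch of such a decoder block with three attention units is given below; the layer sizes, the use of nn.MultiheadAttention and the residual/normalization scheme are assumptions for illustration only, not the implementation of this disclosure.
```python
import torch
import torch.nn as nn

class SpectrumDecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # unit 1: previously predicted spectrum
        self.music_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # unit 2: music encoder output
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # unit 3: text encoder output
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(4)])
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, prev_spec, music_enc, text_enc):
        x, _ = self.self_attn(prev_spec, prev_spec, prev_spec)
        x = self.norms[0](prev_spec + x)
        m, _ = self.music_attn(x, music_enc, music_enc)   # fuse accompaniment information
        x = self.norms[1](x + m)
        t, _ = self.text_attn(x, text_enc, text_enc)      # fuse lyric information
        x = self.norms[2](x + t)
        return self.norms[3](x + self.ffn(x))
```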
The first spectrum decoder decodes the signal to obtain a spectrum signal of the predicted accompaniment signal, and a first vocoder in the accompaniment generation model can convert the spectrum signal into a corresponding accompaniment signal.
Step 204, inputting the first singing voice signal and the lyrics of the second language into the trained singing voice generation model to generate a second singing voice signal.
The singing voice generation model is used for generating a converted singing voice signal based on the input singing voice signal and lyrics. The singing voice generation model may be trained in a similar manner to the accompaniment generation model described above. In one implementation, the singing voice signal of the first language and the singing voice signal of the second language of the same song may be obtained, and sample singing voice signal pairs may be constructed to train the singing voice generation model. Alternatively, in other implementations, the singing voice generation model may be trained based on sample song audio. The singing voice signal in the sample song audio can be used as the input of the singing voice generation model, and the lyrics of the second language of the sample song audio can also be input into the singing voice generation model to be trained. The singing voice signal output by the singing voice generation model to be trained is then compared with the singing voice signal of the sample song audio, and the singing voice generation model is iteratively trained according to the difference between the two. Thus, the singing voice generation model can learn to fuse the singing voice signal of a certain language with the lyrics of another language during training, so that after training it can generate the singing voice signal of the second language according to the singing voice signal of the first language and the lyrics of the second language.
In some embodiments, the singing voice generation model may include a speaker voice print coder and a singing voice generation submodel. The speaker voiceprint encoder can be a model constructed based on a convolutional neural network and is used for encoding the voiceprint of the speaker in the audio signal.
Fig. 5 shows a schematic structural view of the singing voice generation model. As shown in fig. 5, the singing voice generation model includes a speaker voiceprint encoder and a singing voice generation submodel. The singing voice generation submodel includes: a second music encoder, a second text encoder, a second spectrum decoder, and a second vocoder. The second music encoder encodes the singing voice signal input to the singing voice generation submodel; the second text encoder performs text encoding on the lyrics input to the singing voice generation submodel; the second spectrum decoder decodes based on the encoding results of the speaker voiceprint encoder, the second music encoder and the second text encoder to obtain the corresponding spectrum signal; and the second vocoder generates the singing voice signal of the song based on the spectrum signal decoded by the second spectrum decoder.
The structure of the singing voice generation submodel is the same as that of the accompaniment generation model. The output of the speaker voiceprint encoder is concatenated (concat) with the encoding of the singing voice signal produced by the second music encoder in the singing voice generation submodel. The second vocoder converts the spectrum signal decoded by the second spectrum decoder to obtain the predicted singing voice signal.
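A minimal sketch of this concatenation (concat) step is given below, assuming an utterance-level voiceprint embedding that is broadcast over time before being concatenated with the frame-level music encoding; the tensor shapes are illustrative assumptions.
```python
import torch

def fuse_voiceprint(music_encoding: torch.Tensor, voiceprint: torch.Tensor) -> torch.Tensor:
    """music_encoding: (batch, time, d_music); voiceprint: (batch, d_voice)."""
    # Repeat the utterance-level voiceprint embedding along the time axis,
    # then concatenate it with the frame-level music encoding on the feature dimension.
    voice = voiceprint.unsqueeze(1).expand(-1, music_encoding.size(1), -1)
    return torch.cat([music_encoding, voice], dim=-1)

fused = fuse_voiceprint(torch.randn(2, 100, 256), torch.randn(2, 64))  # -> (2, 100, 320)
```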
Step 205, the second accompaniment signal and the second singing voice signal are synthesized into the song audio of the second language.
The second accompaniment signal generated at step 203 and the second singing voice signal generated at step 204 may be synthesized into the song audio of the second language using a sound synthesizer. Alternatively, a song synthesis model may be trained in advance; for example, the accompaniment signal and the singing voice signal may be extracted from a song and then input to the song synthesis model, and the song synthesis model may be iteratively adjusted based on the spectral difference between the output of the song synthesis model and the original song. The trained song synthesis model is then used to synthesize the second accompaniment signal and the second singing voice signal to obtain the song audio of the second language.
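For illustration, a simple way to mix the two generated signals into one waveform is sketched below; the sample rate, gain and peak normalization are assumptions, and this is not the sound synthesizer or the song synthesis model described above.
```python
import numpy as np
import soundfile as sf

def mix_song(accompaniment: np.ndarray, singing: np.ndarray, sr: int, out_path: str) -> None:
    n = min(len(accompaniment), len(singing))           # align lengths
    mix = accompaniment[:n] + singing[:n]                # sum the two signals
    peak = np.max(np.abs(mix)) or 1.0
    sf.write(out_path, mix / peak * 0.95, sr)            # normalize to avoid clipping

# mix_song(second_accompaniment, second_singing_voice, 22050, "song_second_language.wav")  # placeholder names
```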
The song generation method of the above embodiment of the present disclosure extracts the first accompaniment signal, the lyrics of the first language and the first singing voice signal from the song audio of the first language, translates the lyrics of the first language into the lyrics of the second language, inputs the first accompaniment signal and the lyrics of the second language into the trained accompaniment generation model to obtain the second accompaniment signal, inputs the first singing voice signal and the lyrics of the second language into the trained singing voice generation model to generate the second singing voice signal, and synthesizes the second accompaniment signal and the second singing voice signal into the song audio of the second language, thereby realizing automatic generation of songs in different languages, reducing the cost of producing multi-language songs, and effectively helping to improve the circulation of song works.
With continued reference to fig. 6, a flow diagram of another embodiment of a song generation method of the present disclosure is shown. As shown in fig. 6, a flow 600 of the song generating method of the present embodiment includes the following steps:
step 601, training an accompaniment generation model based on the first sample song audio set.
In this embodiment, the execution subject of the song generating method may acquire the first sample song audio. In practice, the audio of the acoustic songs in different languages may be collected to construct a first sample song audio set.
Specifically, the lyrics of the corresponding language of a first sample song audio in the first sample song audio set are acquired; an accompaniment signal is extracted from the first sample song audio; the accompaniment signal of the first sample song audio and the lyrics of the corresponding language are input into the accompaniment generation model to be trained to obtain a prediction result of the accompaniment signal of the first sample song audio; and the parameters of the accompaniment generation model are iteratively adjusted based on the difference between the prediction result of the accompaniment signal of the first sample song audio produced by the accompaniment generation model to be trained and the accompaniment signal extracted from the corresponding first sample song audio.
The accompaniment generation model to be trained may include a first music encoder, a first text encoder, a first spectrum decoder and a first vocoder as described in fig. 3. A first music encoder encoding an accompaniment signal inputted to the accompaniment generation model; a first text encoder performs text encoding on the lyrics of the input accompaniment generation model; the first spectrum decoder decodes based on the coding results of the first music coder and the first text coder to obtain corresponding spectrum signals; the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
In the process of training the accompaniment generation model, a loss function can be constructed according to the difference between the prediction result of the accompaniment signal of the audio of the first sample song output by the accompaniment generation model in the training and the accompaniment signal extracted from the audio of the first sample song, and the parameters of the first music encoder, the first text encoder, the first spectrum decoder and the first vocoder are iteratively adjusted based on the loss function until the value of the loss function is converged to a certain range.
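As a simplified illustration of this iterative adjustment, the following self-contained sketch trains a toy stand-in model against an L1 spectral loss; the toy architecture, tensor shapes, loss and optimizer are assumptions and do not reproduce the encoder/decoder/vocoder stack described above.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAccompanimentModel(nn.Module):
    """Stand-in for the accompaniment generation model; only the training mechanics matter here."""
    def __init__(self, n_mfcc=20, vocab=1000, d_model=64, n_spec=80):
        super().__init__()
        self.music_proj = nn.Linear(n_mfcc, d_model)
        self.text_emb = nn.Embedding(vocab, d_model)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_spec)

    def forward(self, mfcc, lyric_ids):
        music = self.music_proj(mfcc)                               # (B, T, d)
        text = self.text_emb(lyric_ids).mean(dim=1, keepdim=True)   # crude lyric summary
        hidden, _ = self.decoder(music + text)                       # fuse accompaniment and lyric info
        return self.out(hidden)                                      # predicted spectrum (B, T, n_spec)

model = ToyAccompanimentModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
mfcc = torch.randn(4, 200, 20)             # toy batch: accompaniment MFCC features
lyric_ids = torch.randint(0, 1000, (4, 32))
target_spec = torch.randn(4, 200, 80)      # toy spectrum standing in for the extracted accompaniment

for step in range(100):
    pred_spec = model(mfcc, lyric_ids)
    loss = F.l1_loss(pred_spec, target_spec)   # difference between prediction and extracted accompaniment
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```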
The first sample song audio may include a first sample song audio in a first language and a first sample song audio in a second language. Therefore, the accompaniment generation model can learn and generate the accompaniment signals suitable for the songs in different languages in the training process.
At step 602, a singing voice generation model is trained based on the second sample song audio set.
In this embodiment, the executing subject of the song generating method may acquire the second sample song audio. In practice, the audio of the acoustic songs in different languages may be collected to construct the second sample set of song audio. The second sample song audio set may be the same as the first sample song audio set. Optionally, the second set of sample song audio includes a second sample song audio in the first language and a second sample song audio in the second language. Thus, after the training is completed, the singing voice generation model can learn to fuse the lyrics of different languages with the singing voice.
The singing voice generation model comprises a speaker voiceprint encoder and a singing voice generation submodel. The speaker voiceprint encoder may first be trained based on a speaker voiceprint recognition task. The singing voice generation submodel is then trained.
Specifically, the singer corresponding to each second sample song audio in the second sample song audio set may be labeled. The singing voice signal in the second sample song audio is extracted and input to the speaker voiceprint encoder for encoding; a classifier is then used to classify the encoding result and identify the speaker voiceprint. The speaker voiceprint recognition result is compared with the singer labeling information, the misclassification rate is calculated from the comparison result, and the parameters of the speaker voiceprint encoder are adjusted through multiple iterations according to the misclassification rate until the misclassification rate is smaller than a preset threshold.
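The following compact sketch illustrates such speaker voiceprint training as a classification task: a convolutional encoder plus a classifier head trained with cross-entropy against singer labels. The layer layout, feature sizes and toy data are assumptions rather than the architecture described in this disclosure.
```python
import torch
import torch.nn as nn

class VoiceprintEncoder(nn.Module):
    def __init__(self, n_mels=80, d_embed=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.ReLU())
        self.proj = nn.Linear(128, d_embed)

    def forward(self, mel):                        # mel: (B, n_mels, T)
        h = self.conv(mel).mean(dim=-1)            # average over time -> utterance-level embedding
        return self.proj(h)                        # speaker voiceprint feature

n_singers = 50
encoder, classifier = VoiceprintEncoder(), nn.Linear(64, n_singers)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss()

mel = torch.randn(8, 80, 300)                      # toy batch of singing-voice mel spectrograms
singer_labels = torch.randint(0, n_singers, (8,))  # labeled singer ids

for step in range(200):
    logits = classifier(encoder(mel))
    loss = criterion(logits, singer_labels)        # drives down the misclassification rate
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```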
The singing voice generation submodel may be trained as follows: firstly, the lyrics of the corresponding language of a second sample song audio in the second sample song audio set are obtained; then a singing voice signal is extracted from the second sample song audio, and the speaker voiceprint feature of the second sample song audio is extracted from the singing voice signal of the second sample song audio by using the trained speaker voiceprint encoder; then the singing voice signal of the second sample song audio, the lyrics of the corresponding language of the second sample song audio and the speaker voiceprint feature of the second sample song audio are input into the singing voice generation submodel to be trained to obtain a prediction result of the singing voice signal of the second sample song audio, and the parameters of the singing voice generation submodel are iteratively adjusted multiple times based on the difference between the prediction result of the singing voice signal of the second sample song audio and the singing voice signal extracted from the corresponding second sample song audio.
Because the voiceprint characteristics of the singer are extracted, the singing voice signal and the lyrics are separated and input into the singing voice generation submodel when the singing voice generation model is trained, the singing voice generation submodel can learn to fuse the voice characteristics of the singer with the singing voice signal and the lyrics. After the training is completed, the singing voice generation model may synthesize the separated singing voice signals, lyrics, and vocal features of the singer into a complete song.
Optionally, the singing voice generation submodel to be trained may include a second music encoder, a second text encoder, a second spectrum decoder and a second vocoder, with the structure shown in fig. 5. The second music encoder encodes the singing voice signal input to the singing voice generation submodel; the second text encoder performs text encoding on the lyrics input to the singing voice generation submodel; the second spectrum decoder decodes based on the encoding results of the speaker voiceprint encoder, the second music encoder and the second text encoder to obtain the corresponding spectrum signal; and the second vocoder generates the singing voice signal of the song based on the spectrum signal decoded by the second spectrum decoder. During the training of the singing voice generation submodel, each iteration adjusts the parameters of the second music encoder, the second text encoder, the second spectrum decoder and the second vocoder according to the difference between the singing voice signal predicted by the singing voice generation submodel and the singing voice signal extracted from the second sample song audio.
It should be noted that, in other embodiments of the present disclosure, the singing voice generation submodel may also be constructed based on structural units of other types of neural networks, for example, the singing voice generation submodel may also be constructed based on a recurrent neural network.
Step 603, extracting a first accompaniment signal, lyrics of the first language and a first singing voice signal from the song audio of the first language.
Step 604, translate the lyrics of the first language into the lyrics of the second language.
Step 605, inputting the first accompaniment signal and the lyric of the second language into the trained accompaniment generation model to obtain a second accompaniment signal.
Step 606, inputting the first singing voice signal and the lyrics of the second language into the trained singing voice generation model to generate a second singing voice signal.
Step 607, the second accompaniment signal and the second singing voice signal are synthesized into the song audio of the second language.
Steps 603 to 607 of this embodiment correspond to steps 201 to 205 of the foregoing embodiment one to one, and specific implementation manners of steps 603 to 607 may refer to descriptions of steps 201 to 205 of the foregoing embodiment, which are not described herein again.
It should be noted that, in some embodiments of the present disclosure, the flow of the song generating method may not include step 601, and the executing entity of the song generating method may directly obtain the trained accompaniment generation model, and execute steps 602 to 607 to generate song audio of different languages. Alternatively, in some embodiments, the flow of the song generating method may not include step 602, and the executing entity of the song generating method may directly obtain the already trained singing voice generating model, and execute step 601, step 603 to step 607 to generate the audio of the songs in different languages.
According to the embodiment, the collected song audio data can be used for training the more reliable accompaniment generation model and/or singing voice generation model, so that the songs in the second language generated based on the accompaniment generation model and/or the singing voice generation model are more natural, and the automatic translation effect of the songs is further improved.
Referring to fig. 7, as an implementation of the above-described song generating method, the present disclosure provides an embodiment of a song generating apparatus, which corresponds to the method embodiments shown in fig. 2 and fig. 6, and which is particularly applicable to various electronic devices.
As shown in fig. 7, the song generating apparatus 700 of the present embodiment includes an extracting unit 701, a translating unit 702, a first generating unit 703, a second generating unit 704, and a converting unit 705. Wherein the extracting unit 701 is configured to extract a first accompaniment signal, lyrics of a first language, and a first singing voice signal from a song audio of the first language; the translation unit 702 is configured to translate the lyrics of the first language into the lyrics of the second language; the first generation unit 703 is configured to input the first accompaniment signal and the lyrics of the second language into the trained accompaniment generation model, resulting in a second accompaniment signal; the second generating unit 704 is configured to input the first singing voice signal and the lyrics of the second language into the trained singing voice generation model, generating a second singing voice signal; the conversion unit 705 is configured to synthesize the second accompaniment signal and the second singing voice signal into a song audio of the second language.
In some embodiments, the above apparatus further comprises: a first training unit configured to train the accompaniment generation model based on the first sample song audio set as follows: acquiring lyrics of a corresponding language of a first sample song audio in a first sample song audio set; extracting an accompaniment signal from the first sample song audio, inputting the accompaniment signal of the first sample song audio and the lyrics of the corresponding language of the first sample song into an accompaniment generation model to be trained, and obtaining a prediction result of the accompaniment signal of the first sample song audio; and iteratively adjusting parameters of the accompaniment generation model based on the difference between the prediction result of the accompaniment signals of the first sample song audio by the accompaniment generation model to be trained and the accompaniment signals extracted from the corresponding first sample song audio.
In some embodiments, the accompaniment generation model includes a first music encoder, a first text encoder, a first spectrum decoder, and a first vocoder; a first music encoder encoding an accompaniment signal inputted to the accompaniment generation model; a first text encoder performs text encoding on the lyrics of the input accompaniment generation model; the first spectrum decoder decodes based on the coding results of the first music coder and the first text coder to obtain corresponding spectrum signals; the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
In some embodiments, the above apparatus further comprises: a second training unit configured to train a singing voice generation model based on the second sample song audio set, wherein the singing voice generation model comprises a speaker voice print coder and a singing voice generation submodel; the second training unit is configured to train the singing voice generation model as follows: training a speaker voiceprint encoder based on the speaker voiceprint recognition task; acquiring lyrics of a corresponding language of a second sample song audio in a second sample song audio set; extracting a singing voice signal from the second sample song audio, and extracting the speaker voice print characteristic of the second sample song audio from the singing voice signal of the second sample song audio by using the trained speaker voice print encoder; inputting the singing voice signal of the second sample song audio, the lyrics of the corresponding language of the second sample song audio and the voice print characteristics of the speaker of the second sample song audio into a singing voice generation sub-model to be trained to obtain a prediction result of the singing voice signal of the second sample song audio; and iteratively adjusting parameters of the singing voice generation submodel based on the difference between the prediction result of the singing voice signal of the second sample song audio by the singing voice generation model to be trained and the singing voice signal extracted from the corresponding second sample song audio.
In some embodiments, the singing voice generation submodel includes: a second music encoder, a second text encoder, a second spectrum decoder, and a second vocoder; a second music encoder encodes the singing voice signal of the input singing voice generation submodel; the second text encoder performs text encoding on the lyrics of the input singing voice generation submodel; the second frequency spectrum decoder decodes based on the coding results of the speaker voiceprint coder, the second music coder and the second text coder to obtain corresponding frequency spectrum signals; the second vocoder generates a singing voice signal of the song based on the spectrum signal decoded by the second spectrum decoder.
The units in the apparatus 700 described above correspond to the steps in the method described with reference to fig. 2 and 6. Thus, the operations, features and technical effects described above for the song generating method are also applicable to the apparatus 700 and the units included therein, and are not described herein again.
Referring now to FIG. 8, a block diagram of an electronic device (e.g., the server shown in FIG. 1) 800 suitable for use in implementing embodiments of the present disclosure is shown. The electronic device shown in fig. 8 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 8, an electronic device 800 may include a processing means (e.g., central processing unit, graphics processor, etc.) 801 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 802 or a program loaded from a storage means 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data necessary for the operation of the electronic apparatus 800 are also stored. The processing apparatus 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to bus 804.
Generally, the following devices may be connected to the I/O interface 805: input devices 806 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 807 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage 808 including, for example, a hard disk; and a communication device 809. The communication means 809 may allow the electronic device 800 to communicate wirelessly or by wire with other devices to exchange data. While fig. 8 illustrates an electronic device 800 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 8 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 809, or installed from the storage means 808, or installed from the ROM 802. The computer program, when executed by the processing apparatus 801, performs the above-described functions defined in the methods of the embodiments of the present disclosure. It should be noted that the computer readable medium described in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: extracting a first accompaniment signal, lyrics of a first language and a first singing voice signal from the song audio of the first language; translating the lyrics of the first language into the lyrics of the second language; inputting the first accompaniment signal and the lyrics of the second language into a trained accompaniment generation model to obtain a second accompaniment signal; inputting the first singing voice signal and the lyrics of the second language into a trained singing voice generation model to generate a second singing voice signal; the second accompaniment signal and the second singing voice signal are synthesized into the song audio of the second language.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or by hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising an extraction unit, a translation unit, a first generation unit, a second generation unit, and a conversion unit. The names of these units do not, in some cases, constitute a limitation of the units themselves; for example, the extraction unit may also be described as a "unit that extracts the first accompaniment signal, the lyrics of the first language, and the first singing voice signal from the song audio of the first language".
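For illustration only, the five units named above could be collected into a single processor-side object. The class below is a hypothetical arrangement (all names are assumptions) in which each unit is injected as a callable; it is not a description of the actual apparatus.

```python
# Hypothetical arrangement of the five units; not part of the disclosure.
class SongGenerationProcessor:
    def __init__(self, extraction_unit, translation_unit,
                 first_generation_unit, second_generation_unit,
                 conversion_unit):
        self.extraction_unit = extraction_unit                 # extracts accompaniment, lyrics, singing voice
        self.translation_unit = translation_unit               # first-language lyrics -> second-language lyrics
        self.first_generation_unit = first_generation_unit     # accompaniment generation model
        self.second_generation_unit = second_generation_unit   # singing voice generation model
        self.conversion_unit = conversion_unit                  # synthesizes the final song audio

    def run(self, song_audio_lang1):
        accompaniment_1, lyrics_1, singing_voice_1 = self.extraction_unit(song_audio_lang1)
        lyrics_2 = self.translation_unit(lyrics_1)
        accompaniment_2 = self.first_generation_unit(accompaniment_1, lyrics_2)
        singing_voice_2 = self.second_generation_unit(singing_voice_1, lyrics_2)
        return self.conversion_unit(accompaniment_2, singing_voice_2)
```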
The foregoing description covers only the preferred embodiments of the present disclosure and the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the present disclosure is not limited to the specific combination of the features described above, but also covers other technical solutions formed by any combination of those features or their equivalents without departing from the inventive concept, for example, technical solutions in which the above features are replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (12)

1. A song generation method, comprising:
extracting a first accompaniment signal, lyrics of a first language and a first singing voice signal from the song audio of the first language;
translating the lyrics of the first language into lyrics of a second language;
inputting the first accompaniment signal and the lyrics of the second language into a trained accompaniment generation model to obtain a second accompaniment signal;
inputting the first singing voice signal and the lyrics of the second language into a trained singing voice generation model to generate a second singing voice signal;
and synthesizing the second accompaniment signal and the second singing voice signal into song audio of the second language.
2. The method of claim 1, wherein the method further comprises: training an accompaniment generation model based on a first sample song audio set, comprising:
acquiring lyrics of a corresponding language of a first sample song audio in the first sample song audio set;
extracting an accompaniment signal from the first sample song audio, and inputting the accompaniment signal of the first sample song audio and the lyrics of the corresponding language of the first sample song audio into an accompaniment generation model to be trained, to obtain a prediction result of the accompaniment signal of the first sample song audio;
iteratively adjusting parameters of the accompaniment generation model to be trained based on the difference between its prediction result for the accompaniment signal of the first sample song audio and the accompaniment signal extracted from the corresponding first sample song audio.
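One way to read the training procedure of claim 2 is the reconstruction-style loop sketched below. This is a hedged, PyTorch-flavoured illustration rather than the claimed method: the sample iterator and the extraction and lyric-lookup helpers are injected assumptions, and since the claim does not name a loss function, mean squared error merely stands in for the "difference" that drives the parameter updates.

```python
# Illustrative training loop for the accompaniment generation model
# (assumptions: sample_set yields sample song audio; extract_accompaniment
# and get_lyrics are caller-supplied helpers; MSE is a stand-in loss).
import torch

def train_accompaniment_model(model, sample_set, extract_accompaniment,
                              get_lyrics, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for song_audio in sample_set:
            lyrics = get_lyrics(song_audio)                    # lyrics of the corresponding language
            accompaniment = extract_accompaniment(song_audio)  # accompaniment signal extracted from the sample
            prediction = model(accompaniment, lyrics)          # prediction result of the accompaniment signal
            loss = loss_fn(prediction, accompaniment)          # difference between prediction and extraction
            optimizer.zero_grad()
            loss.backward()                                    # iteratively adjust the model parameters
            optimizer.step()
    return model
```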
3. The method of claim 1 or 2, wherein the accompaniment generation model comprises a first music encoder, a first text encoder, a first spectrum decoder and a first vocoder;
the first music encoder encodes an accompaniment signal input to the accompaniment generation model;
the first text encoder performs text encoding on the lyrics input into the accompaniment generation model;
the first spectrum decoder decodes based on the encoding results of the first music encoder and the first text encoder to obtain a corresponding spectrum signal;
the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
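A minimal sketch of the four components named in claim 3 follows. It is an assumption-laden illustration, not the claimed model: the layer types, the sizes, and the way the two encodings are fused before decoding are all choices the claim leaves open, and the first vocoder is left as an injected component.

```python
# Illustrative only: claim 3 fixes the roles of the four components, not
# these layer choices or the fusion of the two encodings.
import torch
import torch.nn as nn

class AccompanimentGenerationModel(nn.Module):
    def __init__(self, n_mels=80, vocab_size=5000, hidden=256):
        super().__init__()
        self.music_encoder = nn.GRU(n_mels, hidden, batch_first=True)         # first music encoder
        self.text_embedding = nn.Embedding(vocab_size, hidden)
        self.text_encoder = nn.GRU(hidden, hidden, batch_first=True)          # first text encoder
        self.spectrum_decoder = nn.GRU(2 * hidden, hidden, batch_first=True)  # first spectrum decoder
        self.to_spectrum = nn.Linear(hidden, n_mels)

    def forward(self, accompaniment_spec, lyric_ids, vocoder=None):
        # accompaniment_spec: (batch, frames, n_mels); lyric_ids: (batch, tokens)
        music_enc, _ = self.music_encoder(accompaniment_spec)
        text_enc, _ = self.text_encoder(self.text_embedding(lyric_ids))
        # Summarize the lyric encoding and broadcast it over the audio frames
        # before decoding; the claim does not fix this alignment step.
        text_summary = text_enc.mean(dim=1, keepdim=True).expand_as(music_enc)
        decoded, _ = self.spectrum_decoder(torch.cat([music_enc, text_summary], dim=-1))
        spectrum = self.to_spectrum(decoded)      # spectrum signal from the first spectrum decoder
        if vocoder is None:                       # the first vocoder is an assumed external component
            return spectrum
        return vocoder(spectrum)                  # accompaniment signal of the song
```

Any spectrum-to-waveform method could play the vocoder role here, since the claim only requires that it turn the decoded spectrum signal into an accompaniment signal.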
4. The method according to claim 1 or 2, wherein the method further comprises:
training a singing voice generation model based on a second sample song audio set, wherein the singing voice generation model comprises a speaker voiceprint encoder and a singing voice generation submodel;
the training of the singing voice generation model based on the second sample song audio set comprises:
training the speaker voiceprint encoder based on a speaker voiceprint recognition task;
acquiring lyrics of a corresponding language of a second sample song audio in the second sample song audio set;
extracting a singing voice signal from the second sample song audio, and extracting the speaker voiceprint feature of the second sample song audio from the singing voice signal of the second sample song audio by using the trained speaker voiceprint encoder;
inputting the singing voice signal of the second sample song audio, the lyrics of the corresponding language of the second sample song audio and the speaker voiceprint feature of the second sample song audio into a singing voice generation submodel to be trained, to obtain a prediction result of the singing voice signal of the second sample song audio;
and iteratively adjusting the parameters of the singing voice generation submodel to be trained based on the difference between its prediction result for the singing voice signal of the second sample song audio and the singing voice signal extracted from the corresponding second sample song audio.
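The two-stage training of claim 4 can be pictured as below: the speaker voiceprint encoder is assumed to have been trained separately on a speaker recognition task and is frozen, and only the singing voice generation submodel is updated. As in the claim-2 sketch, the data iterator and helpers are injected assumptions and mean squared error stands in for the unspecified loss.

```python
# Illustrative training loop for the singing voice generation submodel
# (assumptions: voiceprint_encoder is pre-trained and frozen; sample_set,
# extract_singing_voice and get_lyrics are caller-supplied; MSE is a stand-in).
import torch

def train_singing_voice_submodel(voiceprint_encoder, submodel, sample_set,
                                 extract_singing_voice, get_lyrics,
                                 epochs=10, lr=1e-4):
    voiceprint_encoder.eval()                      # stage 1 (voiceprint recognition training) assumed done
    optimizer = torch.optim.Adam(submodel.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for song_audio in sample_set:
            lyrics = get_lyrics(song_audio)                         # lyrics of the corresponding language
            singing_voice = extract_singing_voice(song_audio)       # singing voice signal from the sample
            with torch.no_grad():
                voiceprint = voiceprint_encoder(singing_voice)      # speaker voiceprint feature
            prediction = submodel(singing_voice, lyrics, voiceprint)
            loss = loss_fn(prediction, singing_voice)               # difference between prediction and extraction
            optimizer.zero_grad()
            loss.backward()                                         # iteratively adjust the submodel parameters
            optimizer.step()
    return submodel
```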
5. The method of claim 4, wherein the singing voice generation submodel comprises: a second music encoder, a second text encoder, a second spectrum decoder, and a second vocoder;
the second music encoder encodes the singing voice signal input into the singing voice generation submodel;
the second text encoder performs text encoding on the lyrics input into the singing voice generation submodel;
the second spectrum decoder decodes based on the encoding results of the speaker voiceprint encoder, the second music encoder and the second text encoder to obtain a corresponding spectrum signal;
the second vocoder generates a singing voice signal of a song based on the spectrum signal decoded by the second spectrum decoder.
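For comparison with the claim-3 sketch above, the forward pass below illustrates how the extra speaker voiceprint conditioning of claim 5 could be wired in. It assumes the encoders and the spectrum decoder expose the same recurrent interface as in that earlier sketch and that the voiceprint feature has the same dimensionality as the encoder outputs; the broadcast-and-concatenate fusion is an assumption, not something the claim prescribes.

```python
# Illustrative forward pass of the singing voice generation submodel;
# interface and shape assumptions are noted in the comments.
import torch

def singing_voice_submodel_forward(music_encoder, text_encoder, spectrum_decoder,
                                   to_spectrum, vocoder, singing_voice_spec,
                                   lyric_embeddings, speaker_voiceprint):
    music_enc, _ = music_encoder(singing_voice_spec)   # second music encoder
    text_enc, _ = text_encoder(lyric_embeddings)       # second text encoder
    # The second spectrum decoder conditions on all three encodings; the
    # fusion below (broadcast then concatenate) is only one possible choice.
    text_summary = text_enc.mean(dim=1, keepdim=True).expand_as(music_enc)
    speaker = speaker_voiceprint.unsqueeze(1).expand_as(music_enc)
    decoded, _ = spectrum_decoder(torch.cat([music_enc, text_summary, speaker], dim=-1))
    spectrum = to_spectrum(decoded)                    # corresponding spectrum signal
    return vocoder(spectrum)                           # second vocoder produces the singing voice signal
```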
6. A song generation apparatus comprising:
an extraction unit configured to extract a first accompaniment signal, lyrics of a first language, and a first singing voice signal from a song audio of the first language;
a translation unit configured to translate the lyrics of the first language into lyrics of a second language;
a first generation unit configured to input the first accompaniment signal and the lyrics of the second language into a trained accompaniment generation model to obtain a second accompaniment signal;
a second generation unit configured to input the first singing voice signal and the lyrics of the second language into a trained singing voice generation model, generating a second singing voice signal;
a conversion unit configured to synthesize the second accompaniment signal and the second singing voice signal into song audio of the second language.
7. The apparatus of claim 6, wherein the apparatus further comprises: a first training unit configured to train the accompaniment generation model based on a first sample song audio set as follows:
acquiring lyrics of a corresponding language of a first sample song audio in the first sample song audio set;
extracting an accompaniment signal from the first sample song audio, and inputting the accompaniment signal of the first sample song audio and the lyrics of the corresponding language of the first sample song audio into an accompaniment generation model to be trained, to obtain a prediction result of the accompaniment signal of the first sample song audio;
iteratively adjusting parameters of the accompaniment generation model to be trained based on the difference between its prediction result for the accompaniment signal of the first sample song audio and the accompaniment signal extracted from the corresponding first sample song audio.
8. The apparatus of claim 6 or 7, wherein the accompaniment generation model comprises a first music encoder, a first text encoder, a first spectrum decoder and a first vocoder;
the first music encoder encodes an accompaniment signal input to the accompaniment generation model;
the first text encoder performs text encoding on the lyrics input into the accompaniment generation model;
the first spectrum decoder decodes based on the encoding results of the first music encoder and the first text encoder to obtain a corresponding spectrum signal;
the first vocoder generates an accompaniment signal of a song based on the spectrum signal decoded by the first spectrum decoder.
9. The apparatus of claim 6 or 7, wherein the apparatus further comprises:
a second training unit configured to train a singing voice generation model based on a second sample song audio set, wherein the singing voice generation model comprises a speaker voiceprint encoder and a singing voice generation submodel;
the second training unit is configured to train the singing voice generation model as follows:
training the speaker voiceprint encoder based on a speaker voiceprint recognition task;
acquiring lyrics of a corresponding language of a second sample song audio in the second sample song audio set;
extracting a singing voice signal from the second sample song audio, and extracting the speaker voiceprint feature of the second sample song audio from the singing voice signal of the second sample song audio by using the trained speaker voiceprint encoder;
inputting the singing voice signal of the second sample song audio, the lyrics of the corresponding language of the second sample song audio and the speaker voiceprint feature of the second sample song audio into a singing voice generation submodel to be trained, to obtain a prediction result of the singing voice signal of the second sample song audio;
and iteratively adjusting the parameters of the singing voice generation submodel to be trained based on the difference between its prediction result for the singing voice signal of the second sample song audio and the singing voice signal extracted from the corresponding second sample song audio.
10. The apparatus of claim 9, wherein the singing voice generation submodel comprises: a second music encoder, a second text encoder, a second spectrum decoder, and a second vocoder;
the second music encoder encodes the singing voice signal input into the singing voice generation submodel;
the second text encoder performs text encoding on the lyrics input into the singing voice generation submodel;
the second spectrum decoder decodes based on the encoding results of the speaker voiceprint encoder, the second music encoder and the second text encoder to obtain a corresponding spectrum signal;
the second vocoder generates a singing voice signal of a song based on the spectrum signal decoded by the second spectrum decoder.
11. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
12. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-5.
CN201911362233.5A 2019-12-26 2019-12-26 Song generation method and device Active CN111161695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911362233.5A CN111161695B (en) 2019-12-26 2019-12-26 Song generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911362233.5A CN111161695B (en) 2019-12-26 2019-12-26 Song generation method and device

Publications (2)

Publication Number Publication Date
CN111161695A true CN111161695A (en) 2020-05-15
CN111161695B CN111161695B (en) 2022-11-04

Family

ID=70556643

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911362233.5A Active CN111161695B (en) 2019-12-26 2019-12-26 Song generation method and device

Country Status (1)

Country Link
CN (1) CN111161695B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6931377B1 (en) * 1997-08-29 2005-08-16 Sony Corporation Information processing apparatus and method for generating derivative information from vocal-containing musical information
KR20020034144A (en) * 2002-04-20 2002-05-08 승 훈 최 Education method for foreign conversation on internet
US20080097754A1 (en) * 2006-10-24 2008-04-24 National Institute Of Advanced Industrial Science And Technology Automatic system for temporal alignment of music audio signal with lyrics
CN105989823A (en) * 2015-02-03 2016-10-05 ***通信集团四川有限公司 Automatic meter-following accompanying method and device
CN108806655A (en) * 2017-04-26 2018-11-13 微软技术许可有限责任公司 Song automatically generates
CN110189741A (en) * 2018-07-05 2019-08-30 腾讯数码(天津)有限公司 Audio synthetic method, device, storage medium and computer equipment
CN109801644A (en) * 2018-12-20 2019-05-24 北京达佳互联信息技术有限公司 Separation method, device, electronic equipment and the readable medium of mixed sound signal

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111653256A (en) * 2020-08-10 2020-09-11 浙江大学 Music accompaniment automatic generation method and system based on coding-decoding network
CN112382274A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Audio synthesis method, device, equipment and storage medium
CN112382269A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Audio synthesis method, device, equipment and storage medium
CN113409747A (en) * 2021-05-28 2021-09-17 北京达佳互联信息技术有限公司 Song generation method and device, electronic equipment and storage medium
CN113409747B (en) * 2021-05-28 2023-08-29 北京达佳互联信息技术有限公司 Song generation method and device, electronic equipment and storage medium
CN113836344A (en) * 2021-09-30 2021-12-24 广州艾美网络科技有限公司 Personalized song file generation method and device and music singing equipment

Also Published As

Publication number Publication date
CN111161695B (en) 2022-11-04

Similar Documents

Publication Publication Date Title
CN111161695B (en) Song generation method and device
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
CN111091800B (en) Song generation method and device
CN111899719A (en) Method, apparatus, device and medium for generating audio
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
US11727922B2 (en) Systems and methods for deriving expression of intent from recorded speech
CN111899720A (en) Method, apparatus, device and medium for generating audio
US11120785B2 (en) Voice synthesis device
CN110599998B (en) Voice data generation method and device
CN111369971A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN114207706A (en) Generating acoustic sequences via neural networks using combined prosodic information
CN112802446B (en) Audio synthesis method and device, electronic equipment and computer readable storage medium
CN111354343B (en) Voice wake-up model generation method and device and electronic equipment
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
CN110930975B (en) Method and device for outputting information
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN112927674A (en) Voice style migration method and device, readable medium and electronic equipment
CN112185340B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
CN116129859A (en) Prosody labeling method, acoustic model training method, voice synthesis method and voice synthesis device
CN116994553A (en) Training method of speech synthesis model, speech synthesis method, device and equipment
KR102277205B1 (en) Apparatus for converting audio and method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant