WO2021218324A1 - Song synthesis method, apparatus, readable medium, and electronic device - Google Patents

Song synthesis method, apparatus, readable medium, and electronic device

Info

Publication number
WO2021218324A1
Authority
WO
WIPO (PCT)
Prior art keywords
song
information
network
acoustic
sequence
Prior art date
Application number
PCT/CN2021/077986
Other languages
English (en)
French (fr)
Inventor
顾宇
殷翔
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Publication of WO2021218324A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the technical field of speech synthesis, and in particular, to a song synthesis method, device, readable medium, and electronic equipment.
  • song synthesis technology has attracted the attention of all walks of life.
  • the main convenience of this technology is that it can synthesize lyrics and a music score into singing-voice audio, which makes song synthesis applicable in fields closely related to singing, such as music production and entertainment.
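  • the present disclosure provides a song synthesis method, including:
  • the duration feature information of a target song is acquired according to the song information of the target song, wherein the song information includes lyrics and a music score, and the duration feature information includes the number of speech frames corresponding to each phoneme contained in the lyrics;
  • the duration feature information and the song information are input into a preset song synthesis model to obtain the acoustic feature information corresponding to the target song, wherein the preset song synthesis model is a sequence-to-sequence model based on an attention mechanism;
  • the acoustic feature information is synthesized by a vocoder to obtain the singing audio of the target song.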
  • the present disclosure provides a song synthesizing device, including:
  • the duration feature acquisition module is configured to acquire duration feature information of the target song according to the song information of the target song, wherein the song information includes lyrics and a music score, and the duration feature information includes the number of speech frames corresponding to each phoneme contained in the lyrics;
  • the acoustic feature acquisition module is configured to input the duration feature information and the song information acquired by the duration feature acquisition module into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where:
  • the preset song synthesis model is a sequence-to-sequence model based on an attention mechanism;
  • the audio synthesis module is used to synthesize the acoustic feature information acquired by the acoustic feature acquisition module through a vocoder to obtain the singing audio of the target song.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the song synthesis method provided in the first aspect of the present disclosure are realized.
  • the present disclosure provides an electronic device, including: a storage device on which one or more computer programs are stored; and one or more processing devices for executing the one or more computer programs in the storage device to implement the steps of the song synthesis method provided in the first aspect of the present disclosure.
  • the present disclosure provides a computer program product, the program product comprising: a computer program that, when executed by a processing device, implements the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides a computer program that, when executed by a processing device, implements the steps of the method described in the first aspect of the present disclosure.
  • the number of speech frames corresponding to each phoneme contained in the lyrics of the target song is first obtained;
  • then, based on this number of speech frames and the song information, the acoustic characteristic information corresponding to the target song is obtained through the sequence-to-sequence model based on the attention mechanism;
  • finally, the vocoder is used to synthesize the above-mentioned acoustic characteristic information to obtain the singing audio of the target song.
  • because the sequence-to-sequence model based on the attention mechanism adopts an end-to-end architecture, it can extract richer acoustic feature information and has better temporal modeling capability, so that the synthesized singing audio is pronounced more clearly, goes out of tune less often, and covers a wider vocal range. As a result, the naturalness and fluency of the synthesized singing audio are improved, making it closer to real human singing and giving the user a better listening experience.
  • Fig. 1 is a flow chart showing a method for synthesizing songs according to an exemplary embodiment.
  • Fig. 2 is a schematic diagram showing a framework of a song synthesis method according to an exemplary embodiment.
  • Fig. 3 is a block diagram showing a song synthesizing device according to an exemplary embodiment.
  • Fig. 4 is a schematic diagram showing the structure of an electronic device according to an exemplary embodiment.
  • Fig. 1 is a flow chart showing a method for synthesizing songs according to an exemplary embodiment. As shown in Figure 1, the method may include the following steps 101 to 103.
  • in step 101, the duration characteristic information of the target song is obtained according to the song information of the target song.
  • the song information may include lyrics and a music score, and the duration characteristic information may include the number of speech frames corresponding to each phoneme contained in the lyrics.
  • a phoneme is the smallest phonetic unit divided according to the natural attributes of speech. It is analyzed according to the articulatory actions within a syllable: one action constitutes one phoneme. Phonemes are divided into two categories, vowels and consonants.
  • for example, in Chinese, phonemes include initials (consonants placed before a final, which together with the final form a complete syllable) and finals (that is, the vowel part); in English, phonemes include vowels and consonants.
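The split of a syllable into a leading consonant and the vowel part that follows it can be illustrated with a small sketch. It assumes Mandarin pinyin as the input notation, which the text does not specify; the function name `split_syllable` and the initial table are illustrative only, and `y` and `w` are treated as initials here purely for simplicity.

```python
from typing import Tuple

# Pinyin initials (assumption: pinyin input). Two-letter initials come first
# so that "zh" is matched before "z".
INITIALS = (
    "zh", "ch", "sh",
    "b", "p", "m", "f", "d", "t", "n", "l",
    "g", "k", "h", "j", "q", "x", "r", "z", "c", "s",
    "y", "w",  # treated as initials here for simplicity
)


def split_syllable(syllable: str) -> Tuple[str, str]:
    """Split a toneless pinyin syllable into (leading consonant, vowel part)."""
    for initial in INITIALS:
        if syllable.startswith(initial):
            return initial, syllable[len(initial):]
    # Syllables such as "an" have no leading consonant.
    return "", syllable


print(split_syllable("zhong"))  # ('zh', 'ong')
print(split_syllable("an"))     # ('', 'an')
```

A production front end would normally rely on a pronunciation lexicon rather than prefix matching; the sketch only shows how one syllable yields the two phoneme units that later receive separate frame counts.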
  • each phoneme contained in the lyrics corresponds to multiple speech frames.
  • the number of speech frames corresponding to each phoneme is $N = \lceil T / t \rceil$, where $T$ is the pronunciation duration of the phoneme as specified by the music score, and $t$ is the duration of one speech frame, for example, 5 ms.
  • for example, when $T$ is an exact multiple of $t$, the number of speech frames corresponding to the phoneme is 40 (which, at 5 ms per frame, corresponds to a duration of 200 ms).
  • when $T$ is not an exact multiple of $t$, the quotient is rounded up; that is, if the last piece is shorter than 5 ms, it is still processed as one frame.
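The frame count described above (the phoneme duration divided by the frame length, with a short final piece still counted as one frame) amounts to a ceiling division. A minimal sketch follows; the function name and the use of milliseconds are assumptions made for illustration.

```python
import math


def frames_per_phoneme(duration_ms: float, frame_ms: float = 5.0) -> int:
    """Number of speech frames needed to cover one phoneme.

    duration_ms: pronunciation duration of the phoneme taken from the score.
    frame_ms:    duration of one speech frame (5 ms in the example above).
    A trailing piece shorter than one frame still counts as a full frame.
    """
    if duration_ms < 0 or frame_ms <= 0:
        raise ValueError("duration must be non-negative and frame length positive")
    return math.ceil(duration_ms / frame_ms)


print(frames_per_phoneme(200))  # 40
print(frames_per_phoneme(203))  # 41
```

Working in milliseconds keeps the example values exactly representable; with durations in fractional seconds, floating-point rounding could push an exact multiple up by one frame.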
  • in step 102, the duration characteristic information and the song information are input into a preset song synthesis model to obtain the acoustic characteristic information corresponding to the target song.
  • the aforementioned preset song synthesis model may be a sequence-to-sequence (Seq2seq) model based on an attention mechanism.
  • the above-mentioned acoustic feature information may include fundamental frequency features, spectral envelope features, and the like.
  • the aforementioned acoustic feature information may include Mel spectrum features. Since the Mel spectrum feature simulates the voice processing characteristics of the human ear to a certain extent, it can better reflect the human auditory characteristics, thereby enhancing the user's auditory experience.
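To make the perceptual motivation for Mel spectrum features concrete, the sketch below converts between hertz and the mel scale and places band edges evenly in mel. The text does not say which mel formula is used; the widely used `2595 * log10(1 + f / 700)` form is assumed here, and all function names are illustrative.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (assumed common formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_min: float, f_max: float, n_bands: int) -> List[float]:
    """Edges (in Hz) of n_bands bands spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


edges = mel_band_edges(0.0, 8000.0, 80)
# Bands are narrow at low frequencies and wide at high frequencies.
print(edges[1] - edges[0] < edges[-1] - edges[-2])  # True
```

The uneven spacing gives low frequencies finer resolution than high frequencies, which is the sense in which the feature is said to follow human hearing.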
  • in step 103, the acoustic feature information is synthesized by the vocoder to obtain the singing audio of the target song.
  • after the acoustic feature information is obtained through the above step 102, it can be input to a vocoder (for example, WaveNet, Griffin-Lim, or the single-layer recurrent neural network model WaveRNN) for song synthesis to obtain the singing audio.
  • a WaveRNN vocoder can be used to obtain better sound quality, close to that of a real person singing.
  • the number of speech frames corresponding to each phoneme contained in the lyrics of the target song is first obtained;
  • then, based on this number of speech frames and the song information, the acoustic characteristic information corresponding to the target song is obtained through the sequence-to-sequence model based on the attention mechanism;
  • finally, the vocoder is used to synthesize the above-mentioned acoustic characteristic information to obtain the singing audio of the target song.
  • because the sequence-to-sequence model based on the attention mechanism adopts an end-to-end architecture, it can extract richer acoustic feature information and has better temporal modeling capability, so that the synthesized singing audio is pronounced more clearly, goes out of tune less often, and covers a wider vocal range. As a result, the naturalness and fluency of the synthesized singing audio are improved, making it closer to real human singing and giving the user a better listening experience.
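The three stages summarized above (duration prediction, acoustic feature prediction, vocoding) can be sketched as a chain of callables. Everything here is a hypothetical stand-in: the `SongInfo` layout, the function names, and the toy models only show how data flows between stages and do not represent the trained networks described in the text.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical layout: (phonemes, duration of each phoneme in ms from the score).
SongInfo = Tuple[Sequence[str], Sequence[float]]
Features = List[List[float]]


def synthesize_song(
    song_info: SongInfo,
    duration_model: Callable[[SongInfo], List[int]],
    acoustic_model: Callable[[List[int], SongInfo], Features],
    vocoder: Callable[[Features], List[float]],
) -> List[float]:
    frame_counts = duration_model(song_info)             # step 101
    features = acoustic_model(frame_counts, song_info)   # step 102
    return vocoder(features)                             # step 103


# Toy stand-ins so the sketch runs end to end.
def toy_duration_model(song_info: SongInfo) -> List[int]:
    _, durations_ms = song_info
    return [math.ceil(d / 5) for d in durations_ms]


def toy_acoustic_model(frame_counts: List[int], song_info: SongInfo) -> Features:
    # One dummy 1-dimensional feature vector per speech frame.
    return [[float(i)] for i, n in enumerate(frame_counts) for _ in range(n)]


def toy_vocoder(features: Features) -> List[float]:
    return [frame[0] for frame in features]


audio = synthesize_song((["n", "i"], [10, 12]),
                        toy_duration_model, toy_acoustic_model, toy_vocoder)
print(len(audio))  # 5
```

Passing the stages in as callables mirrors the way the text lets each one be swapped independently, for example an HMM, DNN, or BLSTM for durations and different vocoders for the last stage.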
  • the song information (i.e., the lyrics and the music score) of the target song may be input into a Hidden Markov Model (HMM) to obtain the duration characteristic information of the target song.
  • the song information of the target song (that is, the lyrics and the music score) can be input into a preset Deep Neural Network (DNN) model to obtain the duration characteristic information of the target song.
  • the song information of the target song can be input into a preset Bidirectional Long Short-Term Memory (BLSTM) network model to obtain the duration characteristic information of the target song.
  • the BLSTM model has stronger modeling capability and takes long-term information into account (that is, it uses the current input together with the input at the previous moment to jointly predict the current output), so its duration modeling is more accurate and its prediction error is smaller. This makes the proportion of time allotted to each phoneme more reasonable, thereby improving the naturalness of the subsequently synthesized singing audio.
  • the sequence-to-sequence model based on the attention mechanism can include an encoding network, an attention network (in Figure 2, a GMM attention network, that is, an attention network based on a Gaussian Mixture Model (GMM), is used as an example), and a decoding network.
  • the encoding network can be used to obtain the representation sequence corresponding to the duration feature information and the song information; the attention network can be used to generate fixed-length semantic representations according to the representation sequence; the decoding network can be used to obtain acoustic feature information based on the semantic representations.
  • the encoding network can include a feature embedding (Feature Embedding) layer, a Convolutional Pre-net, a Dense Pre-net, a CBHG (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit; that is, CBHG is composed of a convolution bank, a highway network, and a bidirectional recurrent neural network) sub-model, and a down-sampling convolution (Down-sampling Convolution) layer.
  • the Feature Embedding layer is used to encode the song information, and the encoded song information is input into the Convolutional Pre-net for nonlinear transformation, which improves the convergence and generalization capability of the sequence-to-sequence model based on the attention mechanism;
  • the number of speech frames corresponding to each phoneme contained in the lyrics is input into the Dense Pre-net to obtain the corresponding deep features;
  • the output of the Convolutional Pre-net and the output of the Dense Pre-net are combined and input into the CBHG sub-model to extract the corresponding context features, which are then input into the Down-sampling Convolution layer to reduce the amount of calculation and the receptive field, finally yielding the corresponding representation sequence.
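How a down-sampling convolution shortens the sequence passed on to the attention network, and so reduces later computation, can be seen in a scalar toy version. The kernel, stride, and function name are assumptions chosen for illustration; the text gives no layer sizes, and a real layer would operate on multi-channel features with learned weights.

```python
from typing import List


def downsample_conv1d(seq: List[float], kernel: List[float], stride: int) -> List[float]:
    """Strided 1-D convolution (cross-correlation form) without padding.

    Output length is floor((len(seq) - len(kernel)) / stride) + 1.
    """
    if stride < 1 or not kernel:
        raise ValueError("stride must be >= 1 and kernel non-empty")
    k = len(kernel)
    out = []
    for start in range(0, len(seq) - k + 1, stride):
        window = seq[start:start + k]
        out.append(sum(w * x for w, x in zip(kernel, window)))
    return out


# Six input steps become three output steps with stride 2.
print(downsample_conv1d([1, 2, 3, 4, 5, 6], [0.5, 0.5], 2))  # [1.5, 3.5, 5.5]
```

With stride 2 the attention network has half as many positions to score at every decoder step.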
  • the aforementioned attention network may be Location Sensitive Attention, or GMM attention (as shown in FIG. 2).
  • the attention network may be GMM attention, which can further improve the stability of song synthesis and avoid dropped phonemes, repeated phonemes, or a failure to stop.
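One reason GMM attention tends to avoid skipped or repeated units is that the centre of each Gaussian can only move forward along the encoder sequence. The sketch below shows a single attention step in one common formulation (softmax mixture weights, softplus step sizes and widths, normalized alignment). The text does not specify the variant used, so the parameterization and all names here are assumptions.

```python
import math
from typing import List, Tuple


def softplus(x: float) -> float:
    return math.log1p(math.exp(x))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gmm_attention_step(
    prev_means: List[float],
    raw_weights: List[float],
    raw_deltas: List[float],
    raw_scales: List[float],
    num_positions: int,
) -> Tuple[List[float], List[float]]:
    """Return (alignment over encoder positions, updated means)."""
    weights = softmax(raw_weights)
    # softplus keeps every step positive, so the means never move backward.
    means = [m + softplus(d) for m, d in zip(prev_means, raw_deltas)]
    scales = [softplus(s) for s in raw_scales]
    alignment = []
    for j in range(num_positions):
        alignment.append(sum(
            w * math.exp(-((j - mu) ** 2) / (2.0 * sd ** 2))
            for w, mu, sd in zip(weights, means, scales)
        ))
    total = sum(alignment)
    if total > 0:  # normalize so the weights sum to one
        alignment = [a / total for a in alignment]
    return alignment, means


align, means = gmm_attention_step([0.0], [0.0], [0.0], [0.0], 10)
print(align.index(max(align)))  # 1
```

In a full model the raw parameters would be predicted from the decoder state at each step, and the alignment would weight the encoder outputs to form the context vector.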
  • the above-mentioned decoding network may be an autoregressive neural network.
  • the autoregressive neural network may include: a pre-processing network (a two-layer pre-net), a recurrent neural network (Decoder RNN), a linear projection (Linear Projection) module, and a post-processing network (a 5-layer convolutional Postnet).
  • the acoustic feature information can be obtained in the following ways:
  • the target acoustic sub-feature at each time step is determined as the acoustic feature information corresponding to the target song.
  • the aforementioned preset sequence-to-sequence model based on the attention mechanism can be constructed in the following ways:
  • because the decoding network is an autoregressive neural network, the acoustic sub-feature of the previous time step can be fed into the model's inference at the current time step. This allows the model to produce high-fidelity, natural audio even with a small amount of training data, and also speeds up song synthesis.
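The autoregressive behaviour described above, where each time step consumes the acoustic sub-feature produced at the previous step and a stop flag ends the loop, can be sketched as follows. The `step_fn` interface, the all-zero start frame, and the `max_steps` safety limit are assumptions; in the described model the step would run the pre-net, decoder RNN, linear projection, and post-processing network.

```python
from typing import Callable, List, Tuple

Frame = List[float]


def autoregressive_decode(
    step_fn: Callable[[Frame, int], Tuple[Frame, bool]],
    feature_dim: int,
    max_steps: int = 1000,
) -> List[Frame]:
    """Generate frames one at a time until step_fn raises the stop flag."""
    prev = [0.0] * feature_dim  # assumed all-zero start frame
    frames: List[Frame] = []
    for t in range(max_steps):
        frame, stop = step_fn(prev, t)  # uses the frame from step t-1
        frames.append(frame)
        if stop:
            break
        prev = frame
    return frames


def toy_step(prev: Frame, t: int) -> Tuple[Frame, bool]:
    # Hypothetical stand-in for one decoder step: add 1 and stop at t == 3.
    return [x + 1.0 for x in prev], t >= 3


frames = autoregressive_decode(toy_step, feature_dim=2, max_steps=10)
print(len(frames), frames[-1])  # 4 [4.0, 4.0]
```

The `max_steps` guard is a practical safeguard against a stop flag that never fires, which is the "inability to stop" failure mentioned for less stable attention mechanisms.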
  • Fig. 3 is a block diagram showing a song synthesizing device according to an exemplary embodiment.
  • the device 300 may include: a duration characteristic acquisition module 301, configured to acquire duration characteristic information of the target song according to the song information of the target song, wherein the song information includes lyrics and a music score, and the duration characteristic information includes the number of speech frames corresponding to each phoneme contained in the lyrics; an acoustic feature acquisition module 302, configured to input the duration characteristic information acquired by the duration characteristic acquisition module 301 and the song information into the preset song synthesis model to obtain the acoustic feature information corresponding to the target song, wherein the preset song synthesis model is a sequence-to-sequence model based on the attention mechanism; and an audio synthesis module 303, configured to synthesize, through the vocoder, the acoustic feature information acquired by the acoustic feature acquisition module 302 to obtain the singing audio of the target song.
  • the sequence-to-sequence model based on the attention mechanism includes an encoding network, an attention network, and a decoding network; wherein, the encoding network is used to obtain the representation sequence corresponding to the duration characteristic information and the song information
  • the attention network is used to generate a fixed-length semantic representation according to the representation sequence;
  • the decoding network is an autoregressive neural network, which is used to obtain the acoustic feature information according to the semantic representation.
  • the autoregressive neural network includes: a pre-processing network, a recurrent neural network, a linear projection module, and a post-processing network;
  • the step of performing a linear transformation on the acoustic sub-feature of time step t-1 is executed again until the stop flag indicates that the loop should stop; the target acoustic sub-feature of each time step is determined as the acoustic feature information corresponding to the target song.
  • the attention network is an attention network based on a Gaussian mixture model.
  • the duration characteristic acquisition module 301 is configured to input the song information into a preset bidirectional long short-term memory (BLSTM) network model to obtain the duration characteristic information of the target song.
  • the vocoder is a single-layer recurrent neural network model WaveRNN.
  • the acoustic feature information includes Mel spectrum feature information.
  • the present disclosure also provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing device, the steps of the above-mentioned song synthesis method provided in the present disclosure are realized.
  • FIG. 4 shows a schematic structural diagram of an electronic device (such as a terminal device or a server) 400 suitable for implementing the embodiments of the present disclosure.
  • the terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistants), PADs (Portable Android Devices, i.e., tablet computers), PMPs (Portable Media Players), and vehicle-mounted terminals (for example, vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 4 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 400 may include a processing device (such as a central processing unit or a graphics processor) 401, which can execute various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403.
  • the RAM 403 also stores various programs and data required for the operation of the electronic device 400.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (Input/Output, I/O) interface 405 is also connected to the bus 404.
  • the following devices can be connected to the I/O interface 405: input devices 406 including, for example, touch screens, touch pads, keyboards, mice, cameras, microphones, accelerometers, and gyroscopes; output devices 407 including, for example, liquid crystal displays (LCDs), speakers, and vibrators; storage devices 408 including, for example, magnetic tapes and hard disks; and communication devices 409.
  • the communication device 409 may allow the electronic device 400 to perform wireless or wired communication with other devices to exchange data.
  • although FIG. 4 shows an electronic device 400 having various devices, it should be understood that it is not required to implement or have all of the illustrated devices. More or fewer devices may alternatively be implemented or provided.
  • an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer readable medium, and the computer program contains program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from the network through the communication device 409, or installed from the storage device 408, or installed from the ROM 402.
  • when the computer program is executed by the processing device 401, the above-mentioned functions defined in the method of the embodiment of the present disclosure are executed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above.
  • Computer-readable storage media may include, but are not limited to: electrical connections with one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in a baseband or as a part of a carrier wave, and a computer-readable program code is carried therein.
  • This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • the computer-readable signal medium may send, propagate or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including but not limited to: wire, optical cable, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
  • the client and server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (for example, the Internet), and peer-to-peer networks (for example, ad hoc peer-to-peer networks), as well as any networks currently known or developed in the future.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or it may exist alone without being assembled into the electronic device.
  • the aforementioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to: obtain the duration characteristic information of the target song according to the song information of the target song, wherein the song information includes lyrics and a music score, and the duration characteristic information includes the number of speech frames corresponding to each phoneme contained in the lyrics; input the duration characteristic information and the song information into a preset song synthesis model to obtain the acoustic feature information corresponding to the target song, wherein the preset song synthesis model is a sequence-to-sequence model based on the attention mechanism; and synthesize the acoustic feature information through a vocoder to obtain the singing audio of the target song.
  • the computer program code used to perform the operations of the present disclosure can be written in one or more programming languages or a combination thereof.
  • the above-mentioned programming languages include, but are not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer, or entirely executed on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in the flowchart or block diagram can represent a module, program segment, or part of code, which contains one or more executable instructions for realizing the specified logical function.
  • in some alternative implementations, the functions marked in the blocks can also occur in a different order from the order marked in the drawings. For example, two blocks shown one after another can actually be executed substantially in parallel, and they can sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and any combination of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments described in the present disclosure can be implemented in software or hardware.
  • the name of the module does not constitute a limitation on the module itself under certain circumstances.
  • the duration feature acquisition module can also be described as "a module that acquires duration feature information of the target song according to the song information of the target song".
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Parts (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and so on.
  • a machine-readable medium may be a tangible medium, which may contain or store a program for use by the instruction execution system, apparatus, or device or in combination with the instruction execution system, apparatus, or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • the machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • machine-readable storage media would include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the foregoing.
  • Example 1 provides a method for synthesizing a song, which includes: obtaining the duration characteristic information of the target song according to the song information of the target song, wherein the song information includes lyrics and a music score, and the duration characteristic information includes the number of speech frames corresponding to each phoneme contained in the lyrics; inputting the duration characteristic information and the song information into a preset song synthesis model to obtain the acoustic feature information corresponding to the target song, wherein the preset song synthesis model is a sequence-to-sequence model based on an attention mechanism; and synthesizing the acoustic feature information through a vocoder to obtain the singing audio of the target song.
  • Example 2 provides the method of Example 1, wherein the sequence-to-sequence model based on the attention mechanism includes an encoding network, an attention network, and a decoding network; the encoding network is used to obtain a representation sequence corresponding to the duration feature information and the song information; the attention network is used to generate a fixed-length semantic representation according to the representation sequence; and the decoding network is an autoregressive neural network used to obtain the acoustic feature information according to the semantic representation.
  • Example 3 provides the method of Example 2.
  • Example 4 provides the method of Example 2, and the attention network is an attention network based on a Gaussian mixture model.
  • Example 5 provides the method of Example 1, where obtaining the duration feature information of the target song according to the song information of the target song includes: inputting the song information into a preset bidirectional long short-term memory (BLSTM) network model to obtain the duration feature information of the target song.
  • Example 6 provides the method of any one of Examples 1-5, and the vocoder is a single-layer recurrent neural network model WaveRNN.
  • Example 7 provides the method of any one of Examples 1-5, and the acoustic feature information includes Mel spectrum feature information.
  • Example 8 provides a song synthesis apparatus, including: a duration feature acquisition module configured to obtain duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics; an acoustic feature acquisition module configured to input the duration feature information acquired by the duration feature acquisition module and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and an audio synthesis module configured to synthesize, with a vocoder, the acoustic feature information acquired by the acoustic feature acquisition module to obtain the singing audio of the target song.
  • Example 9 provides a computer-readable medium storing a computer program that, when executed by a processing device, implements the steps of the method described in any one of Examples 1-7.
  • Example 10 provides an electronic device, including: a storage device on which one or more computer programs are stored; and one or more processing devices configured to execute the one or more computer programs in the storage device to implement the steps of the method described in any one of Examples 1-7.
  • A computer program product is further provided, including a computer program that, when executed by a processing device, implements the steps of the method described in any one of Examples 1-7 of the present disclosure.
  • a computer program which, when executed by a processing device, implements the steps of the method described in any one of Examples 1-7 of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

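A song synthesis method, apparatus, electronic device, computer-readable medium, and computer program. The method includes: obtaining duration feature information of a target song according to song information of the target song (101); inputting the duration feature information and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model (102); and synthesizing the acoustic feature information with a vocoder to obtain singing audio of the target song (103). Because the attention-based sequence-to-sequence model adopts an end-to-end architecture, it can extract richer acoustic feature information and has good temporal modeling capability, so the synthesized singing audio is pronounced more clearly, goes out of tune less often, and covers a wider vocal range. This improves the naturalness and fluency of the synthesized singing audio, bringing it closer to a real human performance and giving the user a good listening experience.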

Description

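Song synthesis method, apparatus, readable medium, and electronic device
This application claims priority to Chinese patent application No. 202010346431.9, entitled "Song synthesis method, apparatus, readable medium, and electronic device" and filed with the China Patent Office on April 27, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of speech synthesis technology, and in particular to a song synthesis method, apparatus, readable medium, and electronic device.
Background
In recent years, song synthesis technology has attracted wide attention. Its greatest convenience is that it can turn lyrics and a musical score into audio sung by a human voice, so fields closely tied to singing, such as music production and entertainment, eagerly await advances in the technology. One of the biggest difficulties in song synthesis is that the result has low naturalness and a strongly mechanical singing voice, which seriously degrades the listening experience. How to synthesize, from lyrics and a score, songs that are highly natural and close to a real human performance has therefore become a research focus in song synthesis.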
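Summary
This summary is provided to introduce concepts in a brief form; these concepts are described in detail in the Detailed Description below. This summary is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
In a first aspect, the present disclosure provides a song synthesis method, including:
obtaining duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics;
inputting the duration feature information and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and
synthesizing the acoustic feature information with a vocoder to obtain singing audio of the target song.
In a second aspect, the present disclosure provides a song synthesis apparatus, including:
a duration feature acquisition module configured to obtain duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics;
an acoustic feature acquisition module configured to input the duration feature information acquired by the duration feature acquisition module and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and
an audio synthesis module configured to synthesize, with a vocoder, the acoustic feature information acquired by the acoustic feature acquisition module to obtain singing audio of the target song.
In a third aspect, the present disclosure provides a computer-readable medium storing a computer program that, when executed by a processing device, implements the steps of the song synthesis method provided in the first aspect of the present disclosure.
In a fourth aspect, the present disclosure provides an electronic device, including: a storage device on which one or more computer programs are stored; and one or more processing devices configured to execute the one or more computer programs in the storage device to implement the steps of the song synthesis method provided in the first aspect of the present disclosure.
In a fifth aspect, the present disclosure provides a computer program product, including a computer program that, when executed by a processing device, implements the steps of the method of the first aspect of the present disclosure.
In a sixth aspect, the present disclosure provides a computer program that, when executed by a processing device, implements the steps of the method of the first aspect of the present disclosure.
In the above technical solution, the number of speech frames corresponding to each phoneme in the lyrics of the target song is obtained first; then, based on those frame counts, the lyrics, and the score, the acoustic feature information corresponding to the target song is obtained through the attention-based sequence-to-sequence model; finally, a vocoder synthesizes the acoustic feature information into the singing audio of the target song. Because the attention-based sequence-to-sequence model adopts an end-to-end architecture, it can extract richer acoustic feature information and has good temporal modeling capability, so the synthesized singing audio is pronounced more clearly, goes out of tune less often, and covers a wider vocal range. This improves the naturalness and fluency of the synthesized singing audio, bringing it closer to a real human performance and giving the user a good listening experience.
Other features and advantages of the present disclosure are described in detail in the Detailed Description below.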
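Brief Description of the Drawings
The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale. In the drawings:
Fig. 1 is a flowchart of a song synthesis method according to an exemplary embodiment.
Fig. 2 is a schematic framework diagram of a song synthesis method according to an exemplary embodiment.
Fig. 3 is a block diagram of a song synthesis apparatus according to an exemplary embodiment.
Fig. 4 is a schematic structural diagram of an electronic device according to an exemplary embodiment.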
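Detailed Description
Embodiments of the present disclosure are described in more detail below with reference to the drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit its scope of protection.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, method embodiments may include additional steps and/or omit some of the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used herein are open-ended, meaning "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one other embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms are given in the description below.
Note that terms such as "first" and "second" in the present disclosure are used only to distinguish different devices, modules, or units, and are not intended to limit the order of, or interdependence between, the functions performed by these devices, modules, or units.
Note that the modifiers "a/one" and "multiple" in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be read as "one or more".
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of those messages or information.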
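Fig. 1 is a flowchart of a song synthesis method according to an exemplary embodiment. As shown in Fig. 1, the method may include the following steps 101 to 103.
In step 101, duration feature information of the target song is obtained according to the song information of the target song.
In the present disclosure, the song information may include lyrics and a musical score, and the duration feature information may include the number of speech frames corresponding to each phoneme in the lyrics.
A phoneme is the smallest unit of speech, divided according to the natural properties of speech. Analyzed by the articulatory actions within a syllable, one action constitutes one phoneme; phonemes fall into two broad classes, vowels and consonants. For example, in Chinese, phonemes include initials (an initial is the consonant placed before a final, and together with the final it forms a complete syllable) and finals (i.e., vowels). In English, phonemes include vowels and consonants.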
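In addition, each phoneme in the lyrics corresponds to multiple speech frames. The number of speech frames corresponding to a phoneme is $N = \lceil T / T_f \rceil$, where $T$ is the phoneme's sounding duration in the score and $T_f$ is the time length of one speech frame, for example 5 ms.
For example, if a phoneme's sounding duration in the score is 200 ms and the frame length is 5 ms, the phoneme corresponds to $N = 200 / 5 = 40$ speech frames.
As another example, if a phoneme's sounding duration in the score is 203 ms and the frame length is 5 ms, the phoneme corresponds to $N = \lceil 203 / 5 \rceil = 41$ speech frames; that is, the final segment shorter than 5 ms is treated as one frame.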
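The frame-count rule above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `frames_for_phoneme` and its parameter names are hypothetical and do not come from the patent; the 5 ms default is the example frame length quoted in the text.

```python
import math


def frames_for_phoneme(duration_ms: float, frame_ms: float = 5.0) -> int:
    """Return the number of speech frames that cover one phoneme.

    A trailing segment shorter than one frame still counts as a full frame,
    which is why the quotient is rounded up.
    """
    if duration_ms < 0 or frame_ms <= 0:
        raise ValueError("duration must be >= 0 and frame length must be > 0")
    return math.ceil(duration_ms / frame_ms)


# The two worked examples from the text.
print(frames_for_phoneme(200))  # 40
print(frames_for_phoneme(203))  # 41
```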
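In step 102, the duration feature information and the song information are input into a preset song synthesis model to obtain acoustic feature information corresponding to the target song.
In the present disclosure, the preset song synthesis model may be a sequence-to-sequence (Seq2seq) model based on an attention mechanism.
In one implementation, the acoustic feature information may include fundamental frequency features, spectral envelope features, and the like.
In another implementation, the acoustic feature information may include Mel spectrum features. Because Mel spectrum features simulate, to some extent, how the human ear processes speech, they better reflect human auditory characteristics and thus improve the listening experience.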
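To make the perceptual motivation for Mel features concrete, the sketch below converts between hertz and the mel scale. The patent does not state which mel formula it uses; the widely used 2595·log10(1 + f/700) variant is assumed here, and the function names are illustrative.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Map a frequency in Hz to mels (assumed formula: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


# Equal steps in Hz shrink on the mel scale as frequency rises, mirroring the
# ear's coarser resolution at high frequencies.
low_step = hz_to_mel(600.0) - hz_to_mel(500.0)
high_step = hz_to_mel(5100.0) - hz_to_mel(5000.0)
print(low_step > high_step)  # True
```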
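In step 103, the acoustic feature information is synthesized with a vocoder to obtain the singing audio of the target song.
In the present disclosure, after the acoustic feature information is obtained in step 102, it can be input into a vocoder (for example, Wavenet, Griffin-Lim, or the single-layer recurrent neural network model WaveRNN) to synthesize the song and obtain the singing audio. Preferably, a WaveRNN vocoder is used to obtain better sound quality, close to that of a real human performance.
In the above technical solution, the number of speech frames corresponding to each phoneme in the lyrics of the target song is obtained first; then, based on those frame counts, the lyrics, and the score, the acoustic feature information corresponding to the target song is obtained through the attention-based sequence-to-sequence model; finally, a vocoder synthesizes the acoustic feature information into the singing audio of the target song. Because the attention-based sequence-to-sequence model adopts an end-to-end architecture, it can extract richer acoustic feature information and has good temporal modeling capability, so the synthesized singing audio is pronounced more clearly, goes out of tune less often, and covers a wider vocal range. This improves the naturalness and fluency of the synthesized singing audio, bringing it closer to a real human performance and giving the user a good listening experience.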
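The three steps can be read as a simple pipeline. The sketch below only wires the stages together; `synthesize_song` and the three callables passed to it are hypothetical placeholders for the duration model, the attention-based sequence-to-sequence model, and the vocoder, none of which is implemented here.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one frame of acoustic features, e.g. one mel-spectrum column


def synthesize_song(
    phonemes: Sequence[str],
    score: Sequence[int],
    duration_model: Callable[[Sequence[str], Sequence[int]], List[int]],
    acoustic_model: Callable[[Sequence[str], Sequence[int], List[int]], List[Frame]],
    vocoder: Callable[[List[Frame]], List[float]],
) -> List[float]:
    """Run steps 101-103 in order and return audio samples."""
    # Step 101: one frame count per phoneme in the lyrics.
    frame_counts = duration_model(phonemes, score)
    if len(frame_counts) != len(phonemes):
        raise ValueError("duration model must return one frame count per phoneme")
    # Step 102: acoustic features from the song information plus the durations.
    features = acoustic_model(phonemes, score, frame_counts)
    # Step 103: waveform from the acoustic features.
    return vocoder(features)
```

Passing the stages in as callables keeps the sketch independent of any particular model library; in practice each would wrap a trained network.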
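Obtaining the duration feature information of the target song according to its song information, as in step 101, is described in detail below.
In one implementation, the song information of the target song (i.e., the lyrics and score) can be input into a Hidden Markov Model (HMM) to obtain the duration feature information of the target song.
In another implementation, the song information of the target song (i.e., the lyrics and score) can be input into a preset Deep Neural Network (DNN) model to obtain the duration feature information of the target song.
When an HMM or DNN is used to obtain the duration feature information, it predicts the current output from the current input only and does not consider how predictions at other time steps affect the prediction at the next step. Its duration modeling is therefore weak and its prediction error large; that is, the predicted duration proportions of the phonemes are unreasonable, which in turn lowers the naturalness of the subsequently synthesized singing audio. To improve naturalness, in yet another implementation, as shown in Fig. 2, the song information of the target song can be input into a preset Bidirectional Long Short Term Memory Network (BLSTM) model to obtain the duration feature information of the target song.
The BLSTM model has stronger modeling capability and takes long-term information into account (i.e., it uses the current input together with the input of the previous time step to predict the current output). Its duration modeling is therefore more accurate and its prediction error smaller, so the duration proportions of the phonemes are more reasonable, which improves the naturalness of the subsequently synthesized singing audio.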
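In addition, the preset BLSTM model can be built as follows:
(1) for each of multiple existing songs (each having lyrics, a score, and singing audio), obtain the song's lyrics and score, and label the number of speech frames corresponding to each phoneme in the lyrics (i.e., the duration feature information);
(2) input the lyrics and scores of the existing songs as training samples into an initial BLSTM model to obtain predicted duration feature information for each existing song;
(3) train the initial BLSTM model according to the comparison between the predicted and labeled duration feature information of each existing song, obtaining the preset BLSTM model.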
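The attention-based sequence-to-sequence model of step 102 is described in detail below. As shown in Fig. 2, the attention-based sequence-to-sequence model may include an encoding network, an attention network (Fig. 2 uses a GMM attention network as the example, i.e., an attention network based on a Gaussian Mixture Model (GMM)), and a decoding network.
The encoding network can be used to obtain a representation sequence corresponding to the duration feature information and the song information; the attention network can be used to generate a fixed-length semantic representation from the representation sequence; and the decoding network can be used to obtain the acoustic feature information from the semantic representation.
Specifically, as shown in Fig. 2, the encoding network may include a feature embedding layer (Feature Embedding), a convolutional pre-processing network (Convolutional Pre-net), a dense pre-processing network (Dense Pre-net), a CBHG sub-model (Convolution Bank + Highway network + bidirectional Gated Recurrent Unit; that is, CBHG consists of convolutional layers, a highway network, and a bidirectional recurrent neural network), and a down-sampling convolution (Down-sampling Convolution) layer. First, the song information is encoded by the Feature Embedding layer and input into the Convolutional Pre-net, which applies a nonlinear transformation to the encoded song information to improve the convergence and generalization of the attention-based sequence-to-sequence model. At the same time, the number of speech frames corresponding to each phoneme in the lyrics is input into the Dense Pre-net to obtain the corresponding deep features. Then the outputs of the Convolutional Pre-net and the Dense Pre-net are input together into the CBHG sub-model to extract the corresponding contextual features, which are then input into the Down-sampling Convolution layer to reduce the computation and the receptive field, finally yielding the corresponding representation sequence.
In addition, the attention network may be location-sensitive attention (Location Sensitive Attention) or GMM attention (as shown in Fig. 2). Preferably, the attention network is GMM attention, which further improves the stability of song synthesis and avoids dropped vowels or consonants, repeated vowels or consonants, and failure to stop.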
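One common formulation of GMM attention keeps a set of Gaussian components whose means can only move forward along the encoder sequence, which is one way to see why it resists skipping, repeating, and failing to stop. The patent does not give its equations, so the sketch below is an assumption: the parameterization (non-negative mean increments, normalized weights) and all names are illustrative rather than the patent's own method.

```python
import math
from typing import List, Sequence, Tuple


def gmm_attention_step(
    prev_means: Sequence[float],
    deltas: Sequence[float],
    sigmas: Sequence[float],
    mix_weights: Sequence[float],
    memory_len: int,
) -> Tuple[List[float], List[float]]:
    """One decoder step: advance the component means and weight each encoder position."""
    # Clamp increments at zero so the attention can never move backwards (assumed).
    means = [m + max(d, 0.0) for m, d in zip(prev_means, deltas)]
    scores = []
    for j in range(memory_len):
        score = sum(
            w * math.exp(-((j - mu) ** 2) / (2.0 * s * s))
            for w, mu, s in zip(mix_weights, means, sigmas)
        )
        scores.append(score)
    total = sum(scores)
    if total > 0.0:
        weights = [s / total for s in scores]
    else:  # numerical underflow: fall back to uniform weights
        weights = [1.0 / memory_len] * memory_len
    return means, weights


def context_vector(weights: Sequence[float], memory: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of encoder outputs; its length is fixed by the encoder feature size."""
    dim = len(memory[0])
    return [sum(w * h[d] for w, h in zip(weights, memory)) for d in range(dim)]
```

Whatever the length of the representation sequence, `context_vector` returns a vector of the same dimension, which matches the fixed-length semantic representation described above.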
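In addition, the decoding network may be an autoregressive neural network. As shown in Fig. 2, the autoregressive neural network may include a pre-processing network (a 2-layer pre-net), a recurrent neural network (Decoder RNN), a linear projection (Linear Projection) module, and a post-processing network (a 5-layer convolutional post-net, labeled "5conv layer Posnet" in Fig. 2).
Specifically, the acoustic feature information can be obtained as follows:
(1) use the pre-processing network to apply a linear transformation to the acoustic sub-feature of time step t-1, where the current time step starts at t=1 and the acoustic sub-feature of time step 0 is the initial frame (Initial frame), a vector frame whose elements are all 0 (i.e., an all-zero frame);
(2) use the recurrent neural network to decode from the linearly transformed acoustic sub-feature of time step t-1 and the semantic representation, obtaining a decoded sequence and a stop token (Stop token);
(3) use the linear projection module to linearly project the decoded sequence, obtaining the acoustic sub-feature of the current time step t;
(4) use the post-processing network to predict a residual from the acoustic sub-feature of the current time step t, and add the residual to that acoustic sub-feature to obtain the target acoustic sub-feature of the current time step t;
(5) update the current time step to t=t+1 and return to step (1), repeating until the stop token indicates that the loop should stop;
(6) finally, take the target acoustic sub-features of all time steps as the acoustic feature information corresponding to the target song.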
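The loop in steps (1) to (6) can be sketched as follows. The four sub-networks are passed in as plain callables, so this shows only the control flow, not trained models. The `state` value threaded through the RNN and the `max_steps` safety cap are assumptions added for the sketch, and the semantic representation is held fixed across steps for simplicity.

```python
from typing import Any, Callable, List, Tuple

Vector = List[float]


def autoregressive_decode(
    context: Vector,
    prenet: Callable[[Vector], Vector],
    decoder_rnn: Callable[[Vector, Vector, Any], Tuple[Vector, bool, Any]],
    projection: Callable[[Vector], Vector],
    postnet: Callable[[Vector], Vector],
    feature_dim: int,
    max_steps: int = 1000,
) -> List[Vector]:
    """Generate target acoustic sub-features frame by frame until the stop token fires."""
    prev_frame = [0.0] * feature_dim  # time step 0: the all-zero initial frame
    state = None                      # recurrent state (assumed interface)
    targets: List[Vector] = []
    for _ in range(max_steps):        # cap is an added safeguard, not from the patent
        transformed = prenet(prev_frame)                                  # step (1)
        decoded, stop, state = decoder_rnn(transformed, context, state)  # step (2)
        frame = projection(decoded)                                       # step (3)
        residual = postnet(frame)                                         # step (4)
        targets.append([f + r for f, r in zip(frame, residual)])
        if stop:                                                          # step (5)
            break
        prev_frame = frame
    return targets                                                        # step (6)
```

Note that the frame fed back into step (1) is the projection output, not the post-net-corrected target, which follows the wording of steps (1) and (3).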
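In addition, the preset attention-based sequence-to-sequence model can be built as follows:
(1) for each of multiple existing songs (each having lyrics, a score, and singing audio), obtain the song's lyrics and score (i.e., the song information), and label the duration feature information, i.e., the number of speech frames corresponding to each phoneme in the lyrics;
(2) input the lyrics, scores, and per-phoneme speech frame counts of the existing songs as training samples into an initial attention-based sequence-to-sequence model to obtain predicted acoustic feature information for each existing song;
(3) train the initial attention-based sequence-to-sequence model according to the comparison between the predicted acoustic feature information of each existing song and the labeled data, obtaining the preset attention-based sequence-to-sequence model, where the labeled data is the acoustic feature information corresponding to the singing audio of each existing song.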
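The comparison in step (3) needs some numeric measure of how far the predicted features are from the labeled ones. The patent does not name a loss function; mean squared error over all feature values is assumed below purely for illustration, and the function name is hypothetical.

```python
from typing import Sequence


def mean_squared_error(
    predicted: Sequence[Sequence[float]],
    reference: Sequence[Sequence[float]],
) -> float:
    """Average squared difference between predicted and labeled feature frames."""
    if len(predicted) != len(reference):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for p_frame, r_frame in zip(predicted, reference):
        if len(p_frame) != len(r_frame):
            raise ValueError("feature dimensions differ")
        for p, r in zip(p_frame, r_frame):
            total += (p - r) ** 2
            count += 1
    if count == 0:
        raise ValueError("no feature values to compare")
    return total / count
```

During training, a value like this would be minimized by adjusting the model parameters; the optimizer and schedule are outside what the text specifies.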
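Because the preset attention-based sequence-to-sequence model introduces autoregression (i.e., the decoding network is an autoregressive neural network), the acoustic sub-feature of the previous time step can be fed into the model's temporal inference, so the model can produce audio with high fidelity and naturalness even with little training data, while also speeding up song synthesis.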
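Fig. 3 is a block diagram of a song synthesis apparatus according to an exemplary embodiment. Referring to Fig. 3, the apparatus 300 may include: a duration feature acquisition module 301 configured to obtain duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics; an acoustic feature acquisition module 302 configured to input the duration feature information acquired by the duration feature acquisition module 301 and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and an audio synthesis module 303 configured to synthesize, with a vocoder, the acoustic feature information acquired by the acoustic feature acquisition module 302 to obtain the singing audio of the target song.
Optionally, the attention-based sequence-to-sequence model includes an encoding network, an attention network, and a decoding network, where the encoding network is used to obtain a representation sequence corresponding to the duration feature information and the song information; the attention network is used to generate a fixed-length semantic representation from the representation sequence; and the decoding network is an autoregressive neural network used to obtain the acoustic feature information from the semantic representation.
Optionally, the autoregressive neural network includes a pre-processing network, a recurrent neural network, a linear projection module, and a post-processing network;
the autoregressive neural network is used to: apply, with the pre-processing network, a linear transformation to the acoustic sub-feature of time step t-1, where the current time step starts at t=1 and the acoustic sub-feature of time step 0 is the initial frame, a vector frame whose elements are all 0; decode, with the recurrent neural network, from the linearly transformed acoustic sub-feature of time step t-1 and the semantic representation, obtaining a decoded sequence and a stop token; linearly project the decoded sequence with the linear projection module to obtain the acoustic sub-feature of the current time step t; predict, with the post-processing network, a residual from the acoustic sub-feature of the current time step t and add the residual to that acoustic sub-feature to obtain the target acoustic sub-feature of the current time step t; update the current time step to t=t+1; return to the step of applying the linear transformation with the pre-processing network to the acoustic sub-feature of time step t-1, until the stop token indicates that the loop should stop; and take the target acoustic sub-features of all time steps as the acoustic feature information corresponding to the target song.
Optionally, the attention network is an attention network based on a Gaussian mixture model.
Optionally, the duration feature acquisition module 301 is configured to input the song information into a preset bidirectional long short-term memory network model to obtain the duration feature information of the target song.
Optionally, the vocoder is the single-layer recurrent neural network model WaveRNN.
Optionally, the acoustic feature information includes Mel spectrum feature information.
For the apparatus in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.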
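The present disclosure further provides a computer-readable medium storing a computer program that, when executed by a processing device, implements the steps of the song synthesis method provided by the present disclosure.
Referring now to Fig. 4, it shows a schematic structural diagram of an electronic device (e.g., a terminal device or server) 400 suitable for implementing the embodiments of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (Personal Digital Assistant), PADs (Portable Android Device, tablet computer), PMPs (Portable Media Player), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in Fig. 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data required for the operation of the electronic device 400. The processing device 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
Generally, the following devices may be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, and gyroscope; output devices 407 including, for example, a liquid crystal display (LCD), speaker, and vibrator; storage devices 408 including, for example, magnetic tape and hard disk; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although Fig. 4 shows an electronic device 400 with various devices, it should be understood that not all of the illustrated devices need to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product that includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such embodiments, the computer program may be downloaded and installed from a network through the communication device 409, installed from the storage device 408, or installed from the ROM 402. When the computer program is executed by the processing device 401, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described in the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device. A computer-readable signal medium, by contrast, may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any appropriate medium, including but not limited to electric wire, optical cable, RF (radio frequency), or any suitable combination of the above.
In some implementations, the client and server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer-readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: obtain duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics; input the duration feature information and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and synthesize the acoustic feature information with a vocoder to obtain the singing audio of the target song.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any kind of network, including a local area network (LAN) or wide area network (WAN), or it may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the drawings illustrate the possible architecture, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a module does not, in some cases, limit the module itself; for example, the duration feature acquisition module may also be described as "a module that obtains the duration feature information of the target song according to the song information of the target song".
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Array (FPGA), Application Specific Integrated Circuit (ASIC), Application Specific Standard Product (ASSP), System on Chip (SOC), Complex Programmable Logic Device (CPLD), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.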
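According to one or more embodiments of the present disclosure, Example 1 provides a song synthesis method, including: obtaining duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics; inputting the duration feature information and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and synthesizing the acoustic feature information with a vocoder to obtain the singing audio of the target song.
According to one or more embodiments of the present disclosure, Example 2 provides the method of Example 1, where the attention-based sequence-to-sequence model includes an encoding network, an attention network, and a decoding network; the encoding network is used to obtain a representation sequence corresponding to the duration feature information and the song information; the attention network is used to generate a fixed-length semantic representation from the representation sequence; and the decoding network is an autoregressive neural network used to obtain the acoustic feature information from the semantic representation.
According to one or more embodiments of the present disclosure, Example 3 provides the method of Example 2, where the autoregressive neural network includes a pre-processing network, a recurrent neural network, a linear projection module, and a post-processing network; and obtaining the acoustic feature information from the semantic representation includes: applying, with the pre-processing network, a linear transformation to the acoustic sub-feature of time step t-1, where the current time step starts at t=1 and the acoustic sub-feature of time step 0 is the initial frame, a vector frame whose elements are all 0; decoding, with the recurrent neural network, from the linearly transformed acoustic sub-feature of time step t-1 and the semantic representation, obtaining a decoded sequence and a stop token; linearly projecting the decoded sequence with the linear projection module to obtain the acoustic sub-feature of the current time step t; predicting, with the post-processing network, a residual from the acoustic sub-feature of the current time step t and adding the residual to that acoustic sub-feature to obtain the target acoustic sub-feature of the current time step t; updating the current time step to t=t+1; returning to the step of applying the linear transformation with the pre-processing network to the acoustic sub-feature of time step t-1, until the stop token indicates that the loop should stop; and taking the target acoustic sub-features of all time steps as the acoustic feature information corresponding to the target song.
According to one or more embodiments of the present disclosure, Example 4 provides the method of Example 2, where the attention network is an attention network based on a Gaussian mixture model.
According to one or more embodiments of the present disclosure, Example 5 provides the method of Example 1, where obtaining the duration feature information of the target song according to the song information of the target song includes: inputting the song information into a preset bidirectional long short-term memory network model to obtain the duration feature information of the target song.
According to one or more embodiments of the present disclosure, Example 6 provides the method of any one of Examples 1-5, where the vocoder is the single-layer recurrent neural network model WaveRNN.
According to one or more embodiments of the present disclosure, Example 7 provides the method of any one of Examples 1-5, where the acoustic feature information includes Mel spectrum feature information.
According to one or more embodiments of the present disclosure, Example 8 provides a song synthesis apparatus, including: a duration feature acquisition module configured to obtain duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics; an acoustic feature acquisition module configured to input the duration feature information acquired by the duration feature acquisition module and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and an audio synthesis module configured to synthesize, with a vocoder, the acoustic feature information acquired by the acoustic feature acquisition module to obtain the singing audio of the target song.
According to one or more embodiments of the present disclosure, Example 9 provides a computer-readable medium storing a computer program that, when executed by a processing device, implements the steps of the method of any one of Examples 1-7.
According to one or more embodiments of the present disclosure, Example 10 provides an electronic device, including: a storage device on which one or more computer programs are stored; and one or more processing devices configured to execute the one or more computer programs in the storage device to implement the steps of the method of any one of Examples 1-7.
According to one or more embodiments of the present disclosure, a computer program product is further provided, including a computer program that, when executed by a processing device, implements the steps of the method of any one of Examples 1-7 of the present disclosure.
According to one or more embodiments of the present disclosure, a computer program is further provided that, when executed by a processing device, implements the steps of the method of any one of Examples 1-7 of the present disclosure.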
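The above description covers only preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that they be performed in the specific order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims. For the apparatus in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.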

Claims (12)

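  1. A song synthesis method, characterized by including:
    obtaining duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics;
    inputting the duration feature information and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and
    synthesizing the acoustic feature information with a vocoder to obtain singing audio of the target song.
  2. The method according to claim 1, characterized in that the attention-based sequence-to-sequence model includes an encoding network, an attention network, and a decoding network;
    where the encoding network is used to obtain a representation sequence corresponding to the duration feature information and the song information;
    the attention network is used to generate a fixed-length semantic representation from the representation sequence; and
    the decoding network is an autoregressive neural network used to obtain the acoustic feature information from the semantic representation.
  3. The method according to claim 2, characterized in that the autoregressive neural network includes a pre-processing network, a recurrent neural network, a linear projection module, and a post-processing network;
    and obtaining the acoustic feature information from the semantic representation includes:
    applying, with the pre-processing network, a linear transformation to the acoustic sub-feature of time step t-1, where the current time step starts at t=1 and the acoustic sub-feature of time step 0 is the initial frame, a vector frame whose elements are all 0;
    decoding, with the recurrent neural network, from the linearly transformed acoustic sub-feature of time step t-1 and the semantic representation, obtaining a decoded sequence and a stop token;
    linearly projecting the decoded sequence with the linear projection module to obtain the acoustic sub-feature of the current time step t;
    predicting, with the post-processing network, a residual from the acoustic sub-feature of the current time step t, and adding the residual to that acoustic sub-feature to obtain the target acoustic sub-feature of the current time step t;
    updating the current time step to t=t+1;
    returning to the step of applying the linear transformation with the pre-processing network to the acoustic sub-feature of time step t-1, until the stop token indicates that the loop should stop; and
    taking the target acoustic sub-features of all time steps as the acoustic feature information corresponding to the target song.
  4. The method according to claim 2, characterized in that the attention network is an attention network based on a Gaussian mixture model.
  5. The method according to any one of claims 1-4, characterized in that obtaining the duration feature information of the target song according to the song information of the target song includes:
    inputting the song information into a preset bidirectional long short-term memory network model to obtain the duration feature information of the target song.
  6. The method according to any one of claims 1-5, characterized in that the vocoder is the single-layer recurrent neural network model WaveRNN.
  7. The method according to any one of claims 1-6, characterized in that the acoustic feature information includes Mel spectrum feature information.
  8. A song synthesis apparatus, characterized by including:
    a duration feature acquisition module configured to obtain duration feature information of a target song according to song information of the target song, where the song information includes lyrics and a musical score, and the duration feature information includes the number of speech frames corresponding to each phoneme in the lyrics;
    an acoustic feature acquisition module configured to input the duration feature information acquired by the duration feature acquisition module and the song information into a preset song synthesis model to obtain acoustic feature information corresponding to the target song, where the preset song synthesis model is an attention-based sequence-to-sequence model; and
    an audio synthesis module configured to synthesize, with a vocoder, the acoustic feature information acquired by the acoustic feature acquisition module to obtain singing audio of the target song.
  9. A computer-readable medium storing a computer program, characterized in that the program, when executed by a processing device, implements the steps of the method according to any one of claims 1-7.
  10. An electronic device, characterized by including:
    a storage device on which one or more computer programs are stored; and
    one or more processing devices configured to execute the one or more computer programs in the storage device to implement the steps of the method according to any one of claims 1-7.
  11. A computer program product including a computer program that, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
  12. A computer program that, when executed by a processor, implements the steps of the method according to any one of claims 1-7.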
PCT/CN2021/077986 2020-04-27 2021-02-25 Song synthesis method, apparatus, readable medium, and electronic device WO2021218324A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010346431.9A CN111583900B (zh) 2020-04-27 2020-04-27 Song synthesis method, apparatus, readable medium, and electronic device
CN202010346431.9 2020-04-27

Publications (1)

Publication Number Publication Date
WO2021218324A1 true WO2021218324A1 (zh) 2021-11-04

Family

ID=72124546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/077986 WO2021218324A1 (zh) 2020-04-27 2021-02-25 Song synthesis method, apparatus, readable medium, and electronic device

Country Status (2)

Country Link
CN (1) CN111583900B (zh)
WO (1) WO2021218324A1 (zh)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
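CN111583900B (zh) * 2020-04-27 2022-01-07 北京字节跳动网络技术有限公司 Song synthesis method, apparatus, readable medium, and electronic device
CN112037757B (zh) * 2020-09-04 2024-03-15 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesis method, device, and computer-readable storage medium
CN112231512B (zh) * 2020-10-20 2023-11-14 标贝(青岛)科技有限公司 Song annotation detection method, apparatus and system, and storage medium
CN112542155B (zh) * 2020-11-27 2021-09-21 北京百度网讯科技有限公司 Song synthesis method and model training method, apparatus, device, and storage medium
CN112767914B (zh) * 2020-12-31 2024-04-30 科大讯飞股份有限公司 Singing voice synthesis method and synthesis device, and computer storage medium
CN113781993A (zh) * 2021-01-20 2021-12-10 北京沃东天骏信息技术有限公司 Method and apparatus for synthesizing singing voice with a customized timbre, electronic device, and storage medium
CN112906369A (zh) * 2021-02-19 2021-06-04 脸萌有限公司 Lyrics file generation method and apparatus
CN112905835B (zh) * 2021-02-26 2022-11-11 成都潜在人工智能科技有限公司 Multimodal music title generation method, apparatus, and storage medium
CN113053355A (zh) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Vocal synthesis method, apparatus, device, and storage medium for Buddhist music
CN113035228A (zh) * 2021-03-23 2021-06-25 广州酷狗计算机科技有限公司 Acoustic feature extraction method, apparatus, device, and storage medium
CN113066459B (zh) * 2021-03-24 2023-05-30 平安科技(深圳)有限公司 Melody-based song information synthesis method, apparatus, device, and storage medium
CN113409759B (zh) * 2021-07-07 2023-04-07 浙江工业大学 End-to-end real-time speech synthesis method
CN113593520B (zh) * 2021-09-08 2024-05-17 广州虎牙科技有限公司 Singing voice synthesis method and apparatus, electronic device, and storage medium
CN113808555A (zh) * 2021-09-17 2021-12-17 广州酷狗计算机科技有限公司 Song synthesis method and apparatus, device, medium, and product thereof
CN115457923B (zh) * 2022-10-26 2023-03-31 北京红棉小冰科技有限公司 Singing voice synthesis method, apparatus, device, and storage medium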

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7977562B2 (en) * 2008-06-20 2011-07-12 Microsoft Corporation Synthesized singing voice waveform generator
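CN106373580A (zh) * 2016-09-05 2017-02-01 北京百度网讯科技有限公司 Artificial-intelligence-based method and apparatus for synthesizing singing voice
JP2017107228A (ja) * 2017-02-20 2017-06-15 株式会社テクノスピーチ Singing voice synthesis device and singing voice synthesis method
CN109801608A (zh) * 2018-12-18 2019-05-24 武汉西山艺创文化有限公司 Neural-network-based song generation method and system
CN110164460A (zh) * 2019-04-17 2019-08-23 平安科技(深圳)有限公司 Singing synthesis method and apparatus
CN110310621A (zh) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Singing synthesis method, apparatus, device, and computer-readable storage medium
CN110992926A (zh) * 2019-12-26 2020-04-10 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system, and storage medium
CN111583900A (zh) * 2020-04-27 2020-08-25 北京字节跳动网络技术有限公司 Song synthesis method, apparatus, readable medium, and electronic device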

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2501067B (en) * 2012-03-30 2014-12-03 Toshiba Kk A text to speech system
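CN109192199A (zh) * 2018-06-30 2019-01-11 中国人民解放军战略支援部队信息工程大学 Data processing method incorporating a bottleneck-feature acoustic model
CN109036375B (zh) * 2018-07-25 2023-03-24 腾讯科技(深圳)有限公司 Speech synthesis method, model training method, apparatus, and computer device
CN109036377A (zh) * 2018-07-26 2018-12-18 ***股份有限公司 Speech synthesis method and apparatus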

Also Published As

Publication number Publication date
CN111583900B (zh) 2022-01-07
CN111583900A (zh) 2020-08-25

Similar Documents

Publication Publication Date Title
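WO2021218324A1 (zh) Song synthesis method, apparatus, readable medium, and electronic device
CN111402855B (zh) Speech synthesis method, apparatus, storage medium, and electronic device
CN111369967B (zh) Virtual-character-based speech synthesis method, apparatus, medium, and device
CN111899719B (zh) Method, apparatus, device, and medium for generating audio
WO2022151931A1 (zh) Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN111292720A (zh) Speech synthesis method, apparatus, computer-readable medium, and electronic device
WO2022141678A1 (zh) Speech synthesis method, apparatus, device, and storage medium
WO2022151930A1 (zh) Speech synthesis method, synthesis model training method, apparatus, medium, and device
CN110246488B (zh) Voice conversion method and apparatus using a semi-optimized CycleGAN model
CN112687259B (zh) Speech synthesis method, apparatus, and readable storage medium
CN111292719A (zh) Speech synthesis method, apparatus, computer-readable medium, and electronic device
CN111369971A (zh) Speech synthesis method, apparatus, storage medium, and electronic device
CN111899720A (zh) Method, apparatus, device, and medium for generating audio
CN110210310A (zh) Video processing method and apparatus, and apparatus for video processing
CN112927674B (zh) Speech style transfer method, apparatus, readable medium, and electronic device
WO2023160553A1 (zh) Speech synthesis method, apparatus, computer-readable medium, and electronic device
CN112185363B (zh) Audio processing method and apparatus
WO2022142850A1 (zh) Audio processing method and apparatus, vocoder, electronic device, computer-readable storage medium, and computer program product
CN111028824A (zh) Synthesis method and apparatus for Minnan (Southern Min)
WO2022237665A1 (zh) Speech synthesis method, apparatus, electronic device, and storage medium
CN113205793B (zh) Audio generation method, apparatus, storage medium, and electronic device
WO2022042418A1 (zh) Music synthesis method, apparatus, device, and computer-readable medium
CN111161695A (zh) Song generation method and apparatus
CN112786013A (zh) Songbook-based speech synthesis method, apparatus, readable medium, and electronic device
CN114678032B (zh) Training method, voice conversion method and apparatus, and electronic device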

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21795498

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.02.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21795498

Country of ref document: EP

Kind code of ref document: A1