CN112802446B - Audio synthesis method and device, electronic equipment and computer readable storage medium - Google Patents

Audio synthesis method and device, electronic equipment and computer readable storage medium

Info

Publication number
CN112802446B
Authority
CN
China
Prior art keywords
vector
neural network
digital interface
sample
musical instrument
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911114561.3A
Other languages
Chinese (zh)
Other versions
CN112802446A (en)
Inventor
张黄斌
李辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201911114561.3A priority Critical patent/CN112802446B/en
Publication of CN112802446A publication Critical patent/CN112802446A/en
Application granted granted Critical
Publication of CN112802446B publication Critical patent/CN112802446B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Acoustics & Sound (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The embodiment of the disclosure provides an audio synthesis method and device, electronic equipment and a computer readable storage medium. The method comprises the following steps: obtaining a music score to be processed and lyrics to be processed; extracting music characteristics from the music score to be processed; extracting text features from the lyrics to be processed; processing the music characteristics and the text characteristics through an end-to-end neural network model to obtain frequency spectrum information; and synthesizing singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information. By the scheme provided by the embodiment of the disclosure, the synthesis of singing audio can be realized through an end-to-end neural network model.

Description

Audio synthesis method and device, electronic equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to an audio synthesis method and apparatus, an electronic device, and a computer readable storage medium.
Background
In existing singing synthesis schemes, a DNN (Deep Neural Network) is used for modeling, and the synthesized singing dry voice has low naturalness and poor tone quality, far from the singing level of a real person.
Accordingly, there is a need for a new audio synthesis method and apparatus, an electronic device, and a computer-readable storage medium.
It should be noted that the information disclosed in the foregoing background section is only for enhancing understanding of the background of the present disclosure.
Disclosure of Invention
The embodiment of the disclosure provides an audio synthesis method and device, electronic equipment and a computer readable storage medium, which can realize the synthesis of singing audio through an end-to-end neural network model, and the synthesized singing audio has higher naturalness and tone quality and is more similar to the singing level of a real person.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to an aspect of the embodiments of the present disclosure, there is provided an audio synthesis method, the method including: obtaining a music score to be processed and lyrics to be processed; extracting music characteristics from the music score to be processed; extracting text features from the lyrics to be processed; processing the music characteristics and the text characteristics through an end-to-end neural network model to obtain frequency spectrum information; and synthesizing singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information.
According to an aspect of the disclosed embodiments, there is provided an audio synthesis apparatus, the apparatus comprising: the music score lyrics acquisition module is configured to acquire a music score to be processed and lyrics to be processed; a music feature extraction module configured to extract music features from the score to be processed; the text feature extraction module is configured to extract text features from the lyrics to be processed; the frequency spectrum information obtaining module is configured to process the music characteristics and the text characteristics through an end-to-end neural network model to obtain frequency spectrum information; and the singing audio synthesis module is configured to synthesize singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information.
In some exemplary embodiments of the present disclosure, the musical features include a musical instrument digital interface feature and a time value feature. Wherein, the music feature extraction module includes: a musical instrument digital interface feature obtaining sub-module configured to obtain the musical instrument digital interface feature according to the pitch in the score to be processed; and a time value feature obtaining sub-module configured to normalize the note lengths in the score to be processed to obtain the time value feature.
In some exemplary embodiments of the present disclosure, the end-to-end neural network model includes a text encoder, a musical instrument digital interface encoder, and a time value encoder. Wherein, the spectrum information obtaining module includes: a text embedding vector obtaining sub-module configured to obtain a text embedding vector by processing the text feature through the text encoder; a musical instrument digital interface embedded vector obtaining sub-module configured to process the musical instrument digital interface feature through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector; a time value embedding vector obtaining sub-module configured to obtain a time value embedding vector by processing the time value feature through the time value encoder; a fusion embedded vector obtaining sub-module configured to obtain a fusion embedded vector according to the text embedding vector, the musical instrument digital interface embedded vector and the time value embedding vector; and a spectrum information obtaining sub-module configured to obtain the spectrum information according to the fusion embedded vector.
In some exemplary embodiments of the present disclosure, the musical instrument digital interface encoder includes a first dense neural network. Wherein the musical instrument digital interface embedded vector obtaining sub-module includes: a first dense vector obtaining unit configured to obtain a first dense vector by processing the musical instrument digital interface feature through the first dense neural network; and a musical instrument digital interface embedded vector obtaining unit configured to obtain the musical instrument digital interface embedded vector according to the first dense vector.
In some exemplary embodiments of the present disclosure, the musical instrument digital interface encoder further includes a second dense neural network, a forward gated recurrent neural network, and a backward gated recurrent neural network. Wherein the musical instrument digital interface embedded vector obtaining unit includes: a second dense vector obtaining subunit configured to process the first dense vector through the second dense neural network to obtain a second dense vector; a first feature map obtaining subunit configured to process the second dense vector through the forward gated recurrent neural network to obtain a first feature map; a second feature map obtaining subunit configured to process the first feature map through the backward gated recurrent neural network to obtain a second feature map; and a musical instrument digital interface embedded vector obtaining subunit configured to concatenate the second feature map and the second dense vector to obtain the musical instrument digital interface embedded vector.
In some exemplary embodiments of the present disclosure, the time value encoder includes a third dense neural network. Wherein the time value embedding vector obtaining sub-module includes: a time value embedding vector obtaining unit configured to obtain the time value embedding vector by processing the time value feature through the third dense neural network.
In some exemplary embodiments of the present disclosure, the end-to-end neural network model further includes an attention mechanism module and a spectrum decoder. Wherein the spectrum information obtaining submodule includes: an attention context vector obtaining unit configured to obtain an attention context vector by processing the fused embedded vector by the attention mechanism module; and a spectrum information obtaining unit configured to obtain the spectrum information by processing the attention context vector by the spectrum decoder.
In some exemplary embodiments of the present disclosure, the spectral information includes mel-spectral parameters and linear spectral parameters. Wherein, the singing audio synthesis module includes: and the singing audio synthesis submodule is configured to process the Mel spectrum parameters and the linear spectrum parameters through a neural network vocoder and synthesize the singing audio.
In some exemplary embodiments of the present disclosure, the apparatus further comprises: a sample information acquisition module configured to acquire a sample score, sample lyrics, and sample singing audio thereof; a sample music feature extraction module configured to extract sample music features from the sample score; a sample text feature extraction module configured to extract sample text features from the sample lyrics; a sample spectrum information obtaining module configured to obtain sample spectrum information from the sample singing audio; the frequency spectrum prediction module is configured to process the sample music characteristics and the sample text characteristics through the end-to-end neural network model to obtain predicted frequency spectrum information; and the model training module is configured to train the end-to-end neural network model according to the sample spectrum information and the prediction spectrum information.
According to an aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio synthesis method as described in the above embodiments.
According to an aspect of the embodiments of the present disclosure, there is provided an electronic device including: one or more processors; and a storage configured to store one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the audio synthesis method as described in the above embodiments.
In the technical solutions provided by some embodiments of the present disclosure, music features are extracted from a music score to be processed and text features are extracted from the lyrics to be processed; the music features and the text features are then processed through an end-to-end neural network model to obtain spectrum information, so that singing audio corresponding to the music score to be processed and its lyrics can be synthesized according to the spectrum information. Compared with the DNN singing synthesis scheme, the singing audio synthesized with the end-to-end neural network model has higher naturalness and better tone quality, and is closer to the singing level of a real person.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which an audio synthesis method or audio synthesis apparatus of embodiments of the present disclosure may be applied;
FIG. 2 illustrates a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure;
fig. 3 is a schematic diagram showing the implementation of singing dry voice synthesis using DNN in the related art;
FIG. 4 schematically illustrates a flow chart of an audio synthesis method according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram showing the processing procedure of step S2 shown in FIG. 4 in one embodiment;
FIG. 6 is a schematic diagram showing the processing procedure of step S4 shown in FIG. 4 in one embodiment;
FIG. 7 is a schematic view showing the processing procedure of step S4 shown in FIG. 4 in another embodiment;
FIG. 8 schematically illustrates a structural schematic of an end-to-end neural network model, according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram showing the processing procedure of step S42 shown in FIG. 6 in one embodiment;
FIG. 10 schematically shows a schematic structure of a MIDI encoder according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram showing the processing procedure of step S422 shown in FIG. 9 in one embodiment;
FIG. 12 schematically shows a schematic structure of a MIDI encoder according to another embodiment of the present disclosure;
fig. 13 schematically illustrates a structural diagram of a time value encoder according to an embodiment of the present disclosure;
fig. 14 schematically illustrates a flow chart of an audio synthesis method according to another embodiment of the disclosure;
FIG. 15 schematically illustrates a training process diagram of an end-to-end neural network model, according to an embodiment of the present disclosure;
FIG. 16 schematically illustrates a predictive process schematic of an end-to-end neural network model, according to an embodiment of the disclosure;
FIG. 17 schematically shows an effect evaluation schematic of singing synthesis using different approaches;
Fig. 18-21 schematically illustrate application scenarios of the singing synthesis method according to the embodiments of the present disclosure;
fig. 22 schematically shows a block diagram of an audio synthesis device according to an embodiment of the disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a schematic diagram of an exemplary system architecture 100 to which an audio synthesis method or audio synthesis device of embodiments of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 105 may be a server cluster formed by a plurality of servers.
The user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or send messages or the like. The terminal devices 101, 102, 103 may be a variety of electronic devices with display screens including, but not limited to, smartphones, tablet computers, portable computers, wearable smart devices, smart home devices and desktop computers, digital cinema projectors, and the like.
The server 105 may be a server providing various services. For example, the user sends various requests to the server 105 using the terminal device 103 (which may also be the terminal device 101 or 102). Based on the relevant information carried in a request, the server 105 may obtain feedback information in response to the request and return it to the terminal device 103, so that the user may view the displayed feedback information on the terminal device 103.
As another example, the terminal device 103 (may also be the terminal device 101 or 102) may be a smart tv, a VR (Virtual Reality)/AR (Augmented Reality) head-mounted display, or a mobile terminal such as a smart phone, a tablet computer, etc. with an instant messenger, a video Application (APP) installed thereon, through which a user may send various requests to the server 105. The server 105 may acquire feedback information in response to the request based on the request, and return the feedback information to the smart tv, VR/AR head mounted display or the instant messaging and video APP, so that the feedback information returned is displayed through the smart tv, VR/AR head mounted display or the instant messaging and video APP.
Fig. 2 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present disclosure.
As shown in fig. 2, the computer system 200 includes a central processing unit (CPU, central Processing Unit) 201 that can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 202 or a program loaded from a storage portion 208 into a random access Memory (RAM, random Access Memory) 203. In the RAM 203, various programs and data required for the system operation are also stored. The CPU 201, ROM 202, and RAM 203 are connected to each other through a bus 204. An input/output (I/O) interface 205 is also connected to bus 204.
The following components are connected to the I/O interface 205: an input section 206 including a keyboard, a mouse, and the like; an output section 207 including a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like; a storage section 208 including a hard disk or the like; and a communication section 209 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. The drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 210 as needed, so that a computer program read therefrom is installed into the storage section 208 as needed.
In particular, according to embodiments of the present disclosure, the processes described below with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 209, and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU) 201, performs the various functions defined in the method and/or apparatus of the present application.
It should be noted that the computer readable storage medium shown in the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM (Erasable Programmable Read Only Memory, erasable programmable read-only memory) or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable storage medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF (Radio Frequency), and the like, or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods, apparatus, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules, sub-modules, units and sub-units described in the embodiments of the present disclosure may be implemented in software, or may be implemented in hardware, where the described modules, sub-modules, units and sub-units may also be provided in a processor. Wherein the names of the modules, sub-modules, units and sub-units do not constitute limitations of the modules, sub-modules, units and sub-units themselves in some cases.
As another aspect, the present application also provides a computer-readable storage medium that may be contained in the electronic device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the embodiments below. For example, the electronic device may implement the steps shown in fig. 4 or fig. 5 or fig. 6 or fig. 7 or fig. 9 or fig. 11 or fig. 14.
Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (Automatic Speech Recognition, ASR) and speech synthesis (Text To Speech, TTS). Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the most favored modes of human-computer interaction.
Machine learning (Machine Learning, ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer simulates or implements human learning behavior to acquire new knowledge or skills and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and advancement of artificial intelligence technology, artificial intelligence is being researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, and smart customer service. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and play an increasingly important role.
The scheme provided by the embodiment of the application relates to artificial intelligence voice technology, machine learning technology and other technologies, and is specifically described by the following embodiments:
In recent years, the singing synthesis technology has been attracting attention from various social circles, and the biggest convenience of the singing synthesis technology is that it can make a computer sing songs of any melody, which makes the fields of music making, entertainment and the like closely related to singing have urgent expectations for the progress of the singing synthesis technology.
In the related art, the singing synthesis technology includes a parameter synthesis system based on a hidden markov model (Hidden Markov Model, HMM) and a system based on waveform splicing synthesis.
The core of waveform splicing is to record in advance each pronunciation of a given language at different pitches and then concatenate the recordings according to the lyrics and score. However, a system based on waveform splicing and synthesis has two major difficulties: first, waveform distortion easily occurs during waveform splicing, making the synthesized sound unnatural; second, waveform splicing depends on a huge amount of recording data, so collecting the data consumes a great deal of time and labor, making singing synthesis difficult.
The singing synthesis system based on parameter synthesis firstly determines a duration parameter sequence, a fundamental frequency parameter sequence and a frequency spectrum parameter sequence of each basic synthesis unit (such as syllables, phonemes and the like) respectively, and then utilizes a parameter synthesizer to obtain continuous singing signals according to each parameter sequence.
Fig. 3 shows a schematic diagram of the related art implementation of singing dry voice synthesis using DNN.
As shown in fig. 3, the singing synthesis solution in the related art uses DNN for modeling, and the solution is divided into two parts: model training and prediction. In the model training part, a duration model and an acoustic model are trained using the acquired singing dry voice, score information and lyric information. The duration model is used to predict the pronunciation duration of each phoneme, and the acoustic model is used to predict the spectral parameters of the phonemes. In the model prediction part, a singing dry voice is synthesized from a score and lyrics input by a user, using the acoustic model and duration model obtained by training. Here, a phoneme is the smallest phonetic unit divided from the perspective of timbre, such as initials and finals in Chinese, or consonants and vowels in English.
Specifically, in the model training phase, the method comprises the following steps:
1.1, preparing data: singing dry voice, score information and lyric information of a singer are collected from a singing database.
1.2, Labeling data: the singing audio, namely the collected singing dry voice, is segmented into sentences. Word boundaries, phoneme boundaries, notes, pitches, time values, etc. of each sentence are labeled manually. The labeled data is used as training data.
1.3, HTS (HMM-based Speech Synthesis System, hidden-Markov-model-based speech synthesis system) segmentation: the HTS system is trained using the training data obtained in step 1.2 above, and then each phoneme is further segmented into state levels.
Specifically, an HTS system is constructed in advance, each sentence of singing dry voice in training data is input to the HTS system, the HTS system is utilized to predict information such as word boundary, phoneme boundary, notes, pitch, time value and the like of each sentence, and the information such as word boundary, phoneme boundary, notes, pitch, time value and the like of each sentence is compared with the manually marked information such as word boundary, phoneme boundary, notes, pitch, time value and the like, so that a loss function is calculated, and the HTS system is trained. This HTS system is then used to further segment each phoneme.
For example, the "a" sound is artificially marked as a whole complete "a" sound, but in practice, the sound may be further divided into several parts (i.e. several states) according to the spectrum characteristics, and when the model is actually trained, the duration of the state level is predicted, and not the duration of the phoneme is directly predicted. HTS is a method for cutting each phoneme into 5 states and then outputting the length of each state of each phoneme.
1.4, Extracting text characteristics of a duration model: duration model text features are extracted from the score and lyrics.
1.5, Extracting phoneme duration parameters: and (3) extracting a duration parameter from the HTS segmentation result obtained in the step 1.3. Here training is performed using the duration of the phoneme state level.
1.6, Training a duration model: training a duration model by using the duration model text features extracted in the step 1.4 and the phoneme duration parameters extracted in the step 1.5.
1.7, Extracting singing audio acoustic parameters: and (2) further dividing each sentence of audio divided in the step (1.2) into N equal-length pieces, wherein N is a positive integer greater than or equal to 1, and each piece of small audio extracts acoustic parameters capable of expressing the frequency spectrum characteristics of the audio and a fundamental frequency and tremolo parameter sequence related to the pitch.
1.8, Extracting the text characteristics of the acoustic model: and (3) generating the text characteristics of the acoustic model which are the same as the audio length through a means of sequence expansion by combining the data extracted in the steps 1.4 and 1.5.
1.9, Training an acoustic model: using the data extracted in steps 1.7 and 1.8 above, an acoustic model is trained.
In the model prediction stage, the method comprises the following steps:
2.1, extracting music score and lyric information provided by a user.
2.2, Extracting a time length model text feature sequence from the music score and lyrics information.
And 2.3, inputting a time length model by using the time length model text feature sequence obtained in the step 2.2, and predicting the time length of each phoneme.
And 2.4, combining the time length model text feature sequence obtained in the step 2.2 and the phoneme time length predicted in the step 2.3, and obtaining an acoustic model text feature sequence through a sequence expansion means.
And 2.5, inputting an acoustic model by using the acoustic model text feature sequence obtained in the step 2.4, and predicting acoustic parameters, fundamental frequency and tremolo parameter sequences of each piece of audio.
And 2.6, inputting the acoustic parameters, the fundamental frequency and the tremolo parameter sequences obtained in the step 2.5 into a vocoder to obtain synthesized singing dry voice audio.
However, in the related art, the phoneme transitions of the singing dry voice synthesized through DNN modeling are not well coordinated, the timbre sounds mechanical and unnatural, the overall listening experience is not "good", and the result is far from the level of real singing.
In the related art, end-to-end speech synthesis is a TTS technology, i.e., a text-to-speech synthesis technology, but the synthesized audio is ordinary spoken speech; at present, no end-to-end technology has successfully synthesized singing dry voice whose pitch and duration conform to the score and which has high naturalness.
Fig. 4 schematically illustrates a flow chart of an audio synthesis method according to an embodiment of the present disclosure. The method provided by the embodiments of the present disclosure may be performed by any electronic device having computing processing capabilities, such as any one or more of the terminal devices 101, 102, 103 and/or the server 105 in fig. 1. In the following illustration, the server 105 is exemplified as an execution subject.
As shown in fig. 4, the audio synthesis method provided by the embodiment of the present disclosure may include the following steps.
In step S1, a score to be processed and lyrics to be processed thereof are acquired.
For example, a user sends a singing synthesis request to a server through a client, the singing synthesis request carries a to-be-processed music score and to-be-processed lyrics information corresponding to current to-be-synthesized singing audio selected by the user, and the server obtains the to-be-processed music score and the to-be-processed lyrics from the to-be-processed music score and the to-be-processed lyrics after receiving the singing synthesis request.
In the embodiment of the present disclosure, the score to be processed may be in any format such as MIDI (Musical Instrument Digital Interface), ASCII (American Standard Code for Information Interchange), or MusicXML (Music Extensible Markup Language). The score to be processed may be a paper score and/or an electronic score, and the lyrics to be processed, which correspond to the score to be processed, may likewise be paper lyrics and/or electronic lyrics. Paper scores and lyrics may be printed or handwritten; they can be converted into an electronic score and lyrics by scanning and then processed by software to obtain the score and lyrics in the required digital format. Electronic scores and lyrics can be downloaded from a website or edited by the user, and then processed by software to obtain the score and lyrics in the required digital format.
In step S2, musical features are extracted from the score to be processed.
In the disclosed embodiment, the music features may include MIDI features related to note pitch and time value (Interval) features related to note duration. The specific extraction process may be described with reference to the embodiment of fig. 5 below.
In step S3, text features are extracted from the lyrics to be processed.
In an embodiment of the present disclosure, the text feature extracted from the lyrics to be processed may include: any one or a combination of more of phoneme features, prosodic boundary features, whether a phrase ends, how many phonemes the current pinyin has, whether the current phoneme is a voice, etc.
For example, taking lyrics to be processed whose pinyin is "ping yin": the text is first converted into the corresponding pinyin "ping yin", which is then split into the four phoneme sequences "p", "ing", "y" and "in". Taking as examples how many phonemes the current pinyin has (feature 2, where the number represents the phoneme count), whether the current pinyin is an erhua (r-colored, "child tone") syllable (feature 3, where 0 means it is not and 1 means it is), and whether the current pinyin is a zero-initial syllable (feature 4, where 0 means it is not and 1 means it is), the generation process of the text features is described below:
TABLE 1
Pinyin   Phoneme   Feature 2 (phoneme count)   Feature 3 (erhua)   Feature 4 (zero-initial)
ping     p         2                           0                   0
ping     ing       2                           0                   0
yin      y         2                           0                   0
yin      in        2                           0                   0
Zero-initial syllables are syllables composed only of vowels, with no initial consonant; syllables that start with a vowel (a, o, e, i, u and ü) are all zero-initial syllables. Thus, the synthesized text features are: "p-2-0-0", "ing-2-0-0", "y-2-0-0" and "in-2-0-0".
It should be understood that the above-exemplified features are for illustration only and may actually comprise hundreds of dimensional features.
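The feature-generation steps above can be sketched as follows. The syllable-splitting rule, the erhua test and the function names are illustrative assumptions, and the three features shown are only the example subset described above, not the full feature set.

```python
# Minimal sketch of the text-feature generation illustrated above. Real
# systems would use many more feature dimensions; only features 2-4 from
# the example are reproduced here.

ZERO_INITIAL_VOWELS = ("a", "o", "e", "i", "u", "v")  # "v" stands in for ü

def split_syllable(syllable: str) -> list[str]:
    """Split a pinyin syllable into an initial and a final (simplified rule)."""
    initials = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
                "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
    for ini in initials:
        if syllable.startswith(ini) and len(syllable) > len(ini):
            return [ini, syllable[len(ini):]]
    return [syllable]  # zero-initial syllable: final only

def text_features(syllables: list[str]) -> list[str]:
    feats = []
    for syl in syllables:
        phonemes = split_syllable(syl)
        n = len(phonemes)                                        # feature 2
        erhua = 1 if syl.endswith("r") and syl != "er" else 0    # feature 3 (assumed rule)
        zero_initial = 1 if syl[0] in ZERO_INITIAL_VOWELS else 0  # feature 4
        for ph in phonemes:
            feats.append(f"{ph}-{n}-{erhua}-{zero_initial}")
    return feats

print(text_features(["ping", "yin"]))
# ['p-2-0-0', 'ing-2-0-0', 'y-2-0-0', 'in-2-0-0']
```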
In an embodiment of the present disclosure, one or more notes in a musical feature correspond to one lyrics in a text feature. The notes in the music feature also have timing, that is, the notes in the score need to be performed according to the predetermined timing to form the predetermined melody, and the lyrics in the text feature correspond to the notes, so the lyrics also have the predetermined timing.
In step S4, the music feature and the text feature are processed through an end-to-end neural network model, so as to obtain spectrum information.
In step S5, singing audio corresponding to the score to be processed and lyrics to be processed thereof is synthesized according to the spectrum information.
Specifically, the spectrum information may be input to a pre-trained neural network vocoder, and the sampling signal of singing audio corresponding to the score to be processed and the lyrics to be processed may be automatically output, for example, refer to the content of the neural network vocoder as shown in fig. 8 below.
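As a rough illustration of this step, the sketch below assumes a hypothetical pre-trained vocoder object exposing an infer() method (the actual vocoder interface is not specified here) and simply writes the returned waveform to a file; the sample rate is also an assumption.

```python
# Illustrative sketch only: the vocoder object and its infer() method are
# hypothetical stand-ins for a pre-trained neural network vocoder.
import numpy as np
import soundfile as sf

def synthesize_singing(mel_spectrogram: np.ndarray, vocoder,
                       sample_rate: int = 24000) -> np.ndarray:
    """Convert predicted spectrum information into a singing-audio waveform."""
    waveform = vocoder.infer(mel_spectrogram)        # hypothetical inference call
    sf.write("singing_dry_voice.wav", waveform, sample_rate)
    return waveform
```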
In the embodiment of the disclosure, the singing audio may be a singing dry voice or a finally synthesized song, to be returned for playback to the client of the user who sent the singing synthesis request. Here, dry voice refers to the original singing voice without any post-processing (such as reverberation, modulation, compression/limiting, or speed changing); correspondingly, wet voice refers to singing voice after post-processing. After the singing dry voice corresponding to the score to be processed and the lyrics to be processed is synthesized according to the spectrum information, the singing dry voice and the corresponding background music can be mixed to obtain the finally synthesized song.
According to the audio synthesis method provided by the embodiment of the disclosure, music features are extracted from the score to be processed and text features are extracted from the lyrics to be processed; the music features and the text features are then processed through the end-to-end neural network model to obtain spectrum information, so that singing audio corresponding to the score to be processed and its lyrics can be synthesized according to the spectrum information. On the one hand, because the end-to-end neural network model models directly from text to speech, the duration model used in DNN modeling is not needed, nor is an HTS system needed for data preprocessing; the scheme provided by the embodiment of the disclosure can use phoneme durations directly, without further segmentation down to the phoneme state level. On the other hand, compared with the DNN singing synthesis scheme, the singing audio synthesized with the end-to-end neural network model has higher naturalness and better tone quality, and is closer to the level of real singing.
Fig. 5 shows a schematic diagram of the processing procedure of step S2 shown in fig. 4 in an embodiment. In embodiments of the present disclosure, the musical features may include musical instrument digital interface features and a time value feature.
As shown in fig. 5, in the embodiment of the present disclosure, the above step S2 may further include the following steps.
In step S21, the musical instrument digital interface feature is obtained according to the pitch in the score to be processed.
In the embodiment of the disclosure, the MIDI feature is word-level: one word in the lyrics corresponds to one MIDI value. Pitch refers to the vibration frequency of the vocal cords during singing. The pitch f in the score to be processed may be converted into the MIDI feature p according to the following formula (1): p = 69 + 12 × log2(f / 440).
for example, a numbered musical notation of a song is as follows:
The original key of the score is B♭, i.e., 233.08 Hz is taken as the "1" tone. Looking up Table 2 below, the frequencies corresponding to the above numbered musical notation are 196.00 Hz, 233.08 Hz, 293.66 Hz, 196.00 Hz and 466.16 Hz, which convert to MIDI values (rounded) of 55, 58, 62, 55 and 70, in that order.
Table 2 Pitch versus frequency table
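The MIDI conversion of formula (1) can be checked with a small script; the function name is illustrative, and the frequencies are those quoted in the example above.

```python
import math

def pitch_to_midi(frequency_hz: float) -> int:
    """Convert a pitch frequency f into a rounded MIDI note number
    using the relation of formula (1): p = 69 + 12 * log2(f / 440)."""
    return round(69 + 12 * math.log2(frequency_hz / 440.0))

# Frequencies from the numbered-notation example above.
for f in (196.00, 233.08, 293.66, 196.00, 466.16):
    print(f, "->", pitch_to_midi(f))
# Prints MIDI values 55, 58, 62, 55 and 70 respectively.
```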
In step S22, normalization processing is performed on the note lengths in the score to be processed, so as to obtain the time value features.
In the disclosed embodiment, Interval (time value) is the duration of a note. Interval and Duration are not the same concept: the Interval is the length that the score indicates should be sung, but a person's pronunciation cannot be exactly the same as that indicated by the score, whereas Duration refers to the length of time that a person actually pronounces each phoneme (e.g., a, ao, etc.).
The time values in the score may be converted into the time value feature interval_feature according to the following formula (2): interval_feature = interval / 1500, where interval is the note length in milliseconds.
Again taking the above numbered musical notation as an example, 234 beats per minute are indicated in the score, so the length of one beat is 60 s / 234 ≈ 0.256 s = 256 ms. The notation sequence above corresponds to one beat (256 ms) followed by half beats (128 ms), so the values of interval_feature are: 256/1500 = 0.1707, 128/1500 = 0.0853, 128/1500 = 0.0853, 128/1500 = 0.0853 and 128/1500 = 0.0853.
In the embodiment of the disclosure, the extracted musical instrument digital interface feature and time value feature are taken as the music features. The time value feature does not discretize the time values in the score (for example, converting them into a sequence of 0s and 1s) but normalizes them using formula (2) above; since normalization gives better results than discretization, the extracted music features are better.
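A minimal sketch of the time value normalization of formula (2), assuming the 1500 ms constant used in the worked example; the function name and rounding are illustrative.

```python
def interval_features(beats: list[float], bpm: float,
                      norm_ms: float = 1500.0) -> list[float]:
    """Normalize note lengths (given in beats) into the time value feature.

    norm_ms = 1500 is the normalization constant from the worked example above.
    """
    beat_ms = 60_000.0 / bpm          # length of one beat in milliseconds
    return [round(b * beat_ms / norm_ms, 4) for b in beats]

print(interval_features([1.0, 0.5, 0.5, 0.5, 0.5], bpm=234))
# Approximately [0.1709, 0.0855, 0.0855, 0.0855, 0.0855]; the text above
# rounds the beat length to 256 ms first, giving 0.1707 and 0.0853.
```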
Fig. 6 shows a schematic diagram of the processing procedure of step S4 shown in fig. 4 in an embodiment. In an embodiment of the present disclosure, the end-to-end neural network model may include a Text encoder (Text encoder), a musical instrument digital interface encoder (MIDI encoder), and a time value encoder (Interval encoder).
As shown in fig. 6, in the embodiment of the present disclosure, the above step S4 may further include the following steps.
In step S41, the text feature is processed by the text encoder to obtain a text embedding (Embedding) vector.
The Text encoder is used to convert the phoneme-level text into a vector, i.e., to convert a character sequence into a hidden feature representation. For a specific implementation, reference may be made to the text encoder in the embodiment of fig. 8 below. Input text, such as the text feature described above, is fed to the text encoder, which may for example comprise a character embedding (Character Embedding) layer, 3 convolution layers (3 Conv Layers) and a bidirectional LSTM (Bidirectional LSTM, Long Short-Term Memory neural network) connected in sequence. The character embedding layer converts the text feature into a 512-dimensional character embedding vector representation, which is then input to the 3 convolution layers. Each convolution layer comprises 512 filters (shape 5×1), i.e. each filter spans 5 characters, followed by batch normalization (BN) and a ReLU (Rectified Linear Unit) activation. The output of the last convolution layer is input to a bidirectional LSTM containing 512 units (256 per direction), which outputs the text embedding vector.
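A minimal PyTorch-style sketch of such a text encoder is shown below. The layer sizes follow the description above (512-dimensional embedding, 3 convolution layers of 512 filters spanning 5 characters with BN and ReLU, and a 512-unit bidirectional LSTM); padding, module names and other details are assumptions rather than the patented implementation.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch: character embedding -> 3 conv layers -> bidirectional LSTM."""
    def __init__(self, n_symbols: int, dim: int = 512):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, dim)
        self.convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(dim, dim, kernel_size=5, padding=2),  # 512 filters spanning 5 characters
                nn.BatchNorm1d(dim),
                nn.ReLU(),
            )
            for _ in range(3)
        ])
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)  # 256 per direction

    def forward(self, text_ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(text_ids).transpose(1, 2)   # (batch, dim, time)
        for conv in self.convs:
            x = conv(x)
        x = x.transpose(1, 2)                          # (batch, time, dim)
        text_embedding, _ = self.lstm(x)
        return text_embedding                          # text embedding vector sequence
```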
In step S42, the musical instrument digital interface features are processed by the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded (MIDI Embedding) vector.
For example, reference may be made to the MIDI encoder in the embodiment of fig. 8 below, where the extracted MIDI features described above are input to the MIDI encoder and the MIDI Embedding vectors are output. The specific structure of the MIDI encoder can be described with reference to the embodiments shown in fig. 9 to 12 below.
In step S43, the time value characteristics are processed by the time value encoder to obtain a time value embedding (Interval Embedding) vector.
For example, referring to the following value encoder in the embodiment of fig. 8, the above-mentioned extracted value features are input to the value encoder, and the Interval Embedding vectors are output. The specific structure of the time value encoder can be described with reference to the embodiment shown in fig. 13 below.
In step S44, a fusion embedded vector is obtained from the text embedded vector, the musical instrument digital interface embedded vector, and the value embedded vector.
For example, referring to the embodiment of fig. 8 below, the output of the text encoder (i.e., the bidirectional LSTM output), the output of the MIDI encoder and the output of the time value encoder are concatenated (spliced) to obtain the fusion embedded vector.
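The fusion step can be sketched as a simple concatenation along the feature dimension; the tensor shapes are assumptions, and the three sequences are assumed to have already been aligned to the same time length.

```python
import torch

def fuse_embeddings(text_emb: torch.Tensor,      # (batch, time, 512)
                    midi_emb: torch.Tensor,      # (batch, time, d_midi)
                    interval_emb: torch.Tensor   # (batch, time, d_interval)
                    ) -> torch.Tensor:
    # Concatenate the three encoder outputs to form the fusion embedded vector.
    return torch.cat([text_emb, midi_emb, interval_emb], dim=-1)
```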
In step S45, the spectrum information is obtained according to the fusion embedding vector.
The spectral information described herein may include mel-spectra and/or linear spectra. Mel-spectra may be converted to linear spectra. For example, the fused embedded vector may be input to a location-based attention mechanism (Location Sensitive Attention) module, to generate an attention context vector, and then to a Mel-spectrum Decoder (Mel Decoder) to generate a Mel-spectrum (Mel Spectrogram), as described in the embodiment of fig. 8 below.
Fig. 7 shows a schematic diagram of the processing procedure of step S4 shown in fig. 4 in another embodiment. In embodiments of the present disclosure, the end-to-end neural network model may further include an Attention mechanism (Attention) module and a spectrum decoder.
As shown in fig. 7, in the embodiment of the present disclosure, the above step S4 may further include the following steps.
In step S46, the attention mechanism module processes the fused embedded vector to obtain an attention context vector.
For example, referring to the embodiment of fig. 8 below, a location-sensitive attention mechanism may be used, which extends the additive attention mechanism to use cumulative attention weights from previous decoder time steps as an additional feature. This encourages the model to keep advancing through the input, reducing potential failure modes in which the decoder repeats or ignores certain sub-sequences. After the input and location features are mapped to a 128-dimensional hidden representation, the attention probabilities are computed. The location features are computed using 32 1-D convolution filters of length 31.
In step S47, the attention context vector is processed by the spectrum decoder to obtain the spectrum information.
For example, reference may be made to the embodiment of fig. 8 below, where the spectrum decoder is a mel-spectrum decoder comprising a 2-layer Pre-Net (2 Layer Pre-Net), 2 LSTM layers, two linear projection (Linear Projection) layers, and a 5-convolution-layer Post-Net (5 Conv Layer Post-Net, a network structure that maps the mel-spectrum into a linear spectrum). The mel-spectrum decoder is an autoregressive recurrent neural network that predicts mel-spectrum frames one step at a time from the encoded input sequence. The prediction from the previous time step first passes through a small Pre-Net containing 2 fully connected layers of 256 hidden ReLU units. The Pre-Net output and the attention context vector are concatenated and input to 2 unidirectional LSTM layers (1024 units). The concatenation of the LSTM output and the attention context vector is mapped to the predicted target spectral frame through a linear transformation. Finally, the predicted mel-spectrum is passed through the 5-convolution-layer Post-Net, whose predicted residual is added to improve the overall reconstruction. Each Post-Net layer comprises 512 filters (shape 5×1) with batch normalization, followed by a tanh activation on all but the final layer. The mel-spectrum is then input to a neural network vocoder (Neural Network Vocoder), which outputs Audio Samples, i.e., the singing dry voice.
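A sketch of such a 5-layer Post-Net is given below, following the description above (512 filters of shape 5×1 with batch normalization, and tanh on all but the last layer); the input and output channel counts are assumptions.

```python
import torch.nn as nn

def build_postnet(n_mels: int = 80, channels: int = 512) -> nn.Sequential:
    """Sketch of a 5-convolution-layer Post-Net for mel-spectrum refinement."""
    layers = []
    dims = [n_mels] + [channels] * 4 + [n_mels]
    for i in range(5):
        layers += [
            nn.Conv1d(dims[i], dims[i + 1], kernel_size=5, padding=2),  # 512 filters, shape 5x1
            nn.BatchNorm1d(dims[i + 1]),
        ]
        if i < 4:                     # tanh on all but the final layer
            layers.append(nn.Tanh())
    return nn.Sequential(*layers)
```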
Fig. 8 schematically illustrates a structural schematic of an end-to-end neural network model according to an embodiment of the present disclosure.
As shown in fig. 8, the Mel Decoder in the embodiment of the present disclosure is an autoregressive recurrent neural network that generates the mel spectrum of the audio autoregressively in combination with the output of the text encoder. The mel spectrogram is output recursively: for example, initially three frames of the mel spectrogram are predicted from one frame of random numbers combined with the context vector; then the next three frames are predicted using the last of those three frames as input, again combined with the context vector; and so on until the entire mel spectrogram has been predicted.
In parallel with spectral frame prediction, the concatenation of the mel-spectrum decoder LSTM output and the attention context is projected to a scalar and passed through a sigmoid activation function to predict the probability that the output sequence is complete. This "Stop Token" prediction is used during inference so that the model can dynamically decide when to terminate generation rather than always generating for a fixed duration.
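The decoder step and stop-token prediction described in the last two paragraphs can be sketched as below: the previous frame passes through a small pre-net, is concatenated with the attention context, runs through two unidirectional LSTM cells, and the concatenation of the LSTM output and the context is projected both to the next three mel frames and to a scalar stop-token probability. The mel dimension, the context size, and the fixed context in the usage loop (which the full model would recompute with attention at every step) are simplifying assumptions.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One autoregressive mel-decoder step (illustrative sketch, not the patented model)."""
    def __init__(self, n_mels=80, r=3, context_dim=2688, lstm_dim=1024, prenet_dim=256):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, prenet_dim), nn.ReLU(),
                                    nn.Linear(prenet_dim, prenet_dim), nn.ReLU())
        self.lstm1 = nn.LSTMCell(prenet_dim + context_dim, lstm_dim)
        self.lstm2 = nn.LSTMCell(lstm_dim, lstm_dim)
        self.mel_proj = nn.Linear(lstm_dim + context_dim, n_mels * r)   # next r frames
        self.stop_proj = nn.Linear(lstm_dim + context_dim, 1)           # stop-token logit

    def forward(self, prev_frame, context, states):
        (h1, c1), (h2, c2) = states
        x = torch.cat([self.prenet(prev_frame), context], dim=-1)
        h1, c1 = self.lstm1(x, (h1, c1))
        h2, c2 = self.lstm2(h1, (h2, c2))
        z = torch.cat([h2, context], dim=-1)
        mel_frames = self.mel_proj(z)                  # (batch, n_mels * r)
        stop_prob = torch.sigmoid(self.stop_proj(z))   # probability the sequence is complete
        return mel_frames, stop_prob, ((h1, c1), (h2, c2))

# Usage: start from one frame of random numbers and recurse on the last predicted frame.
# In the full model the context would be recomputed by the attention module at each step.
step, B = DecoderStep(), 1
states = ((torch.zeros(B, 1024), torch.zeros(B, 1024)),
          (torch.zeros(B, 1024), torch.zeros(B, 1024)))
prev, context = torch.randn(B, 80), torch.randn(B, 2688)
for _ in range(5):
    frames, stop, states = step(prev, context, states)
    prev = frames.view(B, 3, 80)[:, -1]               # last of the three predicted frames
    if stop.item() > 0.5:
        break
```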
In embodiments of the present disclosure, a neural network vocoder may employ WaveNet Vocoder to convert Mel Spectrogram feature representations into time domain waveform samples.
The method provided by the embodiments of the present disclosure is based on end-to-end TTS synthesis technology with a modified model structure, and provides an end-to-end synthesis technique, MusicTactron, suitable for TTM (Text To Music, i.e., text-to-singing synthesis of singing audio). Compared with a DNN singing synthesis scheme, the audio synthesized by the embodiments of the present disclosure has higher naturalness and better timbre quality, and is very close to the singing level of a real person.
Note that, although the end-to-end neural network model in fig. 8 takes Tacotron as an example of the model structure, with MIDI encoding and Interval encoding added, the disclosure is not limited thereto, and other end-to-end neural network model structures may also be improved. For example, the structure of Tacotron may be improved by splicing the outputs of the MIDI encoder and the Interval encoder together with the output of the Tacotron text encoder and inputting the spliced result into the attention module. For another example, the structure of Deep Voice 3 may be improved by splicing the outputs of the MIDI encoder and the Interval encoder together with the output of the Tacotron text encoder and the output of the Deep Voice 3 text encoder and inputting the spliced result into the Attention Block module.
Fig. 9 is a schematic diagram showing the processing procedure of step S42 shown in fig. 6 in one embodiment. In an embodiment of the present disclosure, the instrumental digital interface encoder may include a first Dense neural network (D1).
As shown in fig. 9, in the embodiment of the present disclosure, the above step S42 may further include the following steps.
In step S421, the instrumental digital interface feature is processed through the first dense neural network to obtain a first dense vector.
In step S422, the musical instrument digital interface embedded vector is obtained from the first dense vector.
Fig. 10 schematically shows a schematic structure of a MIDI encoder according to an embodiment of the present disclosure.
As shown in fig. 10, in the embodiment of the disclosure, the MIDI encoder includes a first dense neural network (D1): the extracted MIDI features are input into a Dense1 layer with 1024 nodes and a ReLU activation function, and the output serves as the MIDI embedded vector.
Fig. 11 is a schematic diagram showing the processing procedure of step S422 shown in fig. 9 in an embodiment. In an embodiment of the disclosure, the musical instrument digital interface encoder may further include a second dense neural network, a forward gate recurrent neural network (Recurrent Neural Network, RNN), and a reverse gate recurrent neural network.
As shown in fig. 11, in the embodiment of the present disclosure, the above step S422 may further include the following steps.
In step S4221, the first dense vector is processed through the second dense neural network to obtain a second dense vector.
In the embodiment of the disclosure, the output of D1 in fig. 10 may be input into the next layer, Dense2, whose number of nodes may also be 1024 with a ReLU activation function.
In step S4222, the second dense vector is processed through the forward gate recurrent neural network to obtain a first feature map.
The output of Dense2 from step S4221 may be input into a GRU1 layer, whose number of nodes may be 128.
In step S4223, the first feature map is processed through the inverse portal recurrent neural network to obtain a second feature map.
The output of GRU1 from step S4222 may be input into the next layer, GRU2, whose number of nodes may also be 128.
In step S4224, the second feature map and the second dense vector are concatenated to obtain the musical instrument digital interface embedded vector.
The outputs of Dense2 and GRU2 can be spliced together and output to the attention mechanism module as the MIDI embedding vector.
Fig. 12 schematically shows a schematic structure of a MIDI encoder according to another embodiment of the present disclosure.
As shown in fig. 12, in the embodiment of the present disclosure, the input of the MIDI encoder is the MIDI features, and the MIDI encoder may include a first dense neural network (Dense1, D1), a second dense neural network (Dense2, D2), a forward gate recurrent neural network (GRU1), a backward gate recurrent neural network (GRU2), and a concatenation layer. The D2 output and the GRU2 output are concatenated, and the MIDI embedded vector is then output.
In the embodiment of fig. 12, the two Dense layers in the MIDI encoder serve as MIDI embedding layers, and the two GRUs together form a bidirectional GRU (Bidirectional GRU). Concatenating the outputs of Dense2 and GRU2 ensures pitch stability: for the same MIDI features, the Dense portion of the output is identical. Here, concatenation means that the second feature map and the second dense vector are spliced along the channel dimension and recombined into a larger musical instrument digital interface embedded vector that contains more feature information.
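A minimal PyTorch sketch of the fig. 12 structure is given below. The 1024-node Dense layers with ReLU, the 128-node GRUs, and the concatenation of the Dense2 and GRU2 outputs follow the text; the input MIDI feature dimension and the modelling of the reverse GRU by flipping the time axis are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MidiEncoder(nn.Module):
    """Sketch of the MIDI encoder of fig. 12: Dense1 -> Dense2 -> forward GRU1
    -> reverse GRU2, then concatenation of the Dense2 and GRU2 outputs."""
    def __init__(self, midi_dim=64, dense_dim=1024, gru_dim=128):
        super().__init__()
        self.dense1 = nn.Linear(midi_dim, dense_dim)
        self.dense2 = nn.Linear(dense_dim, dense_dim)
        self.gru_fwd = nn.GRU(dense_dim, gru_dim, batch_first=True)
        self.gru_bwd = nn.GRU(gru_dim, gru_dim, batch_first=True)

    def forward(self, midi_feat):                        # (B, T, midi_dim)
        d1 = torch.relu(self.dense1(midi_feat))          # first dense vector
        d2 = torch.relu(self.dense2(d1))                 # second dense vector
        f1, _ = self.gru_fwd(d2)                         # first feature map (forward pass)
        f2, _ = self.gru_bwd(torch.flip(f1, dims=[1]))   # second feature map (reverse pass)
        f2 = torch.flip(f2, dims=[1])                    # restore original time order
        return torch.cat([d2, f2], dim=-1)               # MIDI embedded vector (B, T, 1152)

print(MidiEncoder()(torch.randn(1, 50, 64)).shape)       # torch.Size([1, 50, 1152])
```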
It will be appreciated that the network structure of the MIDI encoder is not limited to the structures illustrated in fig. 10 and fig. 12 above; it may include three or more Dense layers and/or three or more GRU layers, and the GRU layers and/or Dense layers may be replaced by other recurrent neural networks such as LSTM, or by any deep neural network, which is not limited in this disclosure.
In an exemplary embodiment, the time value encoder may include a third dense neural network. Processing the time value feature by the time value encoder to obtain a time value embedded vector may then include: processing the time value feature through the third dense neural network to obtain the time value embedded vector.
Fig. 13 schematically illustrates a structural diagram of a time value encoder according to an embodiment of the present disclosure.
As shown in fig. 13, the time value encoder provided in the embodiment of the present disclosure may include a third dense neural network (Dense3, D3): the extracted Interval feature is input into a Dense3 layer, whose number of nodes may be 1024 with a ReLU activation function. The output of Dense3 is input into the attention mechanism module as the Interval embedding vector.
It will be appreciated that the network structure of the time value encoder is not limited to the structure of fig. 13; for example, it may also be a structure in which at least one Dense layer is stacked with at least one GRU layer, which is not limited in this disclosure.
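A minimal sketch of the single-Dense-layer encoder of fig. 13 follows; the Interval feature dimension is an assumption, while the 1024 nodes and ReLU activation follow the text.

```python
import torch
import torch.nn as nn

class TimeValueEncoder(nn.Module):
    """Single Dense layer (1024 nodes, ReLU) producing the Interval embedding vector."""
    def __init__(self, interval_dim=8, dense_dim=1024):
        super().__init__()
        self.dense3 = nn.Linear(interval_dim, dense_dim)

    def forward(self, interval_feat):                 # (B, T, interval_dim)
        return torch.relu(self.dense3(interval_feat))

print(TimeValueEncoder()(torch.randn(1, 50, 8)).shape)   # torch.Size([1, 50, 1024])
```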
Fig. 14 schematically illustrates a flow chart of an audio synthesis method according to another embodiment of the present disclosure.
As shown in fig. 14, differing from the above embodiments, the method provided by the embodiments of the present disclosure may further include the following steps. In embodiments of the present disclosure, a model training part and a model prediction part may be included, where the model training part includes the following steps.
In step S6, a sample score, sample lyrics and sample singing audio thereof are acquired.
A singer's singing dry vocal (unaccompanied vocal) is collected as the sample singing audio, and the music score and lyrics corresponding to the dry vocal are taken as the sample score and sample lyrics, respectively.
In step S7, sample music features are extracted from the sample score.
The sample score is segmented by sentence, and information such as the notes, pitches, and time values of each sentence is annotated. MIDI features are obtained by converting the note pitches, and time value features are obtained from the time value information; the MIDI features and time value features serve as the sample music features.
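As a small illustration of converting pitches to MIDI features and normalizing note lengths, consider the sketch below. The note-name mapping helper and the normalization scheme (dividing by the longest note in the sentence) are hypothetical choices; the disclosure only states that pitches are converted and note lengths are normalized.

```python
# Hypothetical helper: map note names to MIDI numbers and normalize note lengths.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_to_midi(name: str, octave: int) -> int:
    """E.g. note_to_midi("A", 4) == 69 (A4, 440 Hz)."""
    return 12 * (octave + 1) + NOTE_OFFSETS[name]

def normalize_durations(beats):
    """Assumed normalization: divide each note length by the longest note in the sentence."""
    longest = max(beats)
    return [b / longest for b in beats]

midi_feature = [note_to_midi(n, o) for n, o in [("C", 4), ("E", 4), ("G", 4)]]
time_value_feature = normalize_durations([1.0, 0.5, 2.0])
print(midi_feature, time_value_feature)   # [60, 64, 67] [0.5, 0.25, 1.0]
```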
In step S8, sample text features are extracted from the sample lyrics.
The sample lyrics are segmented by sentence, and information such as the word boundaries and phoneme boundaries of each sentence is annotated. Sample text features are then obtained from the word boundary, phoneme boundary, and other information.
In step S9, sample spectrum information is obtained from the sample singing audio.
Sample spectral information is extracted from the sample singing audio using a spectrum extraction tool; for example, the sample spectral information may include the annotated real mel-spectrum parameters and linear-spectrum parameters.
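As an illustration of this step, the sketch below extracts mel and linear spectra with librosa; the choice of tool, the file name, and the sampling rate, FFT size, hop length, and number of mel bands are all assumptions, since the disclosure does not name a specific spectrum extraction tool or its settings.

```python
import librosa
import numpy as np

# "sample_singing.wav" is a placeholder path; all analysis settings are illustrative.
wav, sr = librosa.load("sample_singing.wav", sr=22050)
linear_spec = np.abs(librosa.stft(wav, n_fft=1024, hop_length=256))   # linear spectrum
mel_spec = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                          hop_length=256, n_mels=80)  # mel spectrum
print(linear_spec.shape, mel_spec.shape)
```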
In step S10, the sample music feature and the sample text feature are processed through the end-to-end neural network model, so as to obtain prediction spectrum information.
The MIDI features in the sample music features are input to the MIDI encoder, the time value features in the sample music features are input to the time value encoder, and the sample text features are input to the text encoder, so as to obtain the predicted spectrum information.
In step S11, the end-to-end neural network model is trained according to the sample spectrum information and the predicted spectrum information.
In the embodiment of the disclosure, when the end-to-end neural network model is trained, the following loss function may be adopted:
loss = Mel_loss + Spec_loss (3)
In the above formula, loss represents the total loss, Mel_loss represents the mel-spectrum loss, and Spec_loss represents the linear spectrum loss.
The mel-spectrum loss can be calculated by the following formula:
In the above formula, n1 is the number of mel-spectrum frames and is a positive integer greater than or equal to 1, y_i1 is the annotated real mel spectrum of the i1-th frame, x_i1 is the predicted mel spectrum of the i1-th frame, and i1 is a positive integer greater than or equal to 1 and less than or equal to n1.
The linear spectral loss can be calculated by the following formula:
In the above formula, n2 is the number of linear-spectrum frames and is a positive integer greater than or equal to 1, y_i2 is the annotated real linear spectrum of the i2-th frame, x_i2 is the predicted linear spectrum of the i2-th frame, and i2 is a positive integer greater than or equal to 1 and less than or equal to n2.
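The per-frame formulas referenced above are not reproduced in this text. A minimal PyTorch sketch of the combined loss is given below, assuming a mean-squared-error form for both terms; the actual per-frame error used in the disclosure may differ.

```python
import torch
import torch.nn as nn

# Sketch of loss = Mel_loss + Spec_loss; MSE over annotated and predicted frames is assumed.
mse = nn.MSELoss()

def total_loss(pred_mel, true_mel, pred_linear, true_linear):
    mel_loss = mse(pred_mel, true_mel)          # averaged over the n1 mel-spectrum frames
    spec_loss = mse(pred_linear, true_linear)   # averaged over the n2 linear-spectrum frames
    return mel_loss + spec_loss

print(total_loss(torch.randn(80, 100), torch.randn(80, 100),
                 torch.randn(513, 100), torch.randn(513, 100)))
```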
Fig. 15 schematically illustrates a training process diagram of an end-to-end neural network model according to an embodiment of the present disclosure.
As shown in fig. 15, a sample score, sample lyrics, and the corresponding sample singing audio can be obtained from a singing database. The obtained data are segmented and the scores are annotated, sample text features and sample music features are extracted, mel-spectrum parameters and linear-spectrum parameters are predicted from the extracted sample text features and sample music features, and the acoustic model, i.e., the end-to-end neural network model, is trained against the annotated real mel-spectrum parameters and linear-spectrum parameters.
In an exemplary embodiment, the spectrum information may include mel-spectrum parameters and linear-spectrum parameters, and synthesizing singing audio corresponding to the music score to be processed and the lyrics to be processed according to the spectrum information may include: processing the mel-spectrum parameters and the linear-spectrum parameters through a neural network vocoder to synthesize the singing audio.
Fig. 16 schematically illustrates a predictive process schematic of an end-to-end neural network model according to an embodiment of the disclosure.
As shown in fig. 16, in the model prediction part, music features are extracted from the score information provided by the user, and text features are extracted from the lyric information provided by the user. The extracted music features and text features are input into the MusicTactron model to obtain a predicted mel spectrum and linear spectrum, and the predicted mel spectrum and/or linear spectrum are input into a neural network vocoder to generate a singing dry vocal whose lyrics and tune both meet the user's requirements.
According to the audio synthesis method provided by the embodiments of the present disclosure, the structure of an end-to-end neural network model in the related art is improved by adding the MIDI embedded vector and the Interval embedded vector to the information input to the attention mechanism module, so that a model originally used for synthesizing normal speech can be used for synthesizing a singing dry vocal. On the one hand, adding the MIDI encoder enables the model to synthesize a singing dry vocal whose pitch meets the requirements of the score; on the other hand, adding the time value encoder enables the model to synthesize a singing dry vocal whose note durations are consistent with the score. This provides a set of end-to-end singing synthesis techniques that can ultimately synthesize a singing dry vocal at the level of a real person, with a synthesis effect far stronger than DNN in naturalness and timbre fidelity.
The novel end-to-end model structure provided by the embodiments of the disclosure achieves very good results in the singing synthesis field and can synthesize high-quality singing dry vocals from a music score. Compared with the DNN technical scheme in the related art, naturalness and timbre fidelity are greatly improved; the MOS (Mean Opinion Score, a speech quality evaluation index) evaluation results from 50 listeners are shown in fig. 17:
The MOS score of the present scheme is 4.1, far higher than the 3.6 of the DNN technical scheme. The MOS score of real human singing is 4.3, which shows that the singing dry vocal synthesized by the present scheme is very close to the level of real human singing.
The technical scheme provided by the embodiments of the disclosure is a middle-platform technology and can be used to support products such as music and social applications. Figs. 18-21 schematically show application scenarios of the singing synthesis method according to the embodiments of the present disclosure. Fig. 18 shows the home page of the application, through which the user can write lyrics and even compose music themselves and then have the application sing the result automatically.
As shown in fig. 19, opening the "museum" in the application of fig. 18 allows the user to select songs. As shown in fig. 20, the free-creation category may include three sub-categories: free creation, themed lyric writing, and theme recommendation. The user inputs a keyword or a sentence in the theme field and clicks the confirm virtual button; the application then completes the context according to the keyword or sentence input by the user and automatically generates the complete lyrics of a song, so that the user does not need to enter all the lyrics manually. As shown in fig. 21, the user opens the free-creation sub-category and can directly input lyrics to replace the original lyrics, so that the desired lyrics can be sung directly over the music score of the original song.
Fig. 22 schematically shows a block diagram of an audio synthesis device according to an embodiment of the disclosure.
As shown in fig. 22, an audio synthesizing apparatus 2200 provided by an embodiment of the present disclosure may include: a score lyrics acquisition module 2210, a music feature extraction module 2220, a text feature extraction module 2230, a spectrum information acquisition module 2240, and a singing audio synthesis module 2250.
The score lyrics acquisition module 2210 may be configured to acquire the music score to be processed and the lyrics to be processed. The music feature extraction module 2220 may be configured to extract music features from the score to be processed. The text feature extraction module 2230 may be configured to extract text features from the lyrics to be processed. The spectrum information obtaining module 2240 may be configured to obtain spectrum information by processing the music features and the text features through an end-to-end neural network model. The singing audio synthesis module 2250 may be configured to synthesize, according to the spectrum information, singing audio corresponding to the music score to be processed and the lyrics to be processed.
In an exemplary embodiment, the musical features may include musical instrument digital interface features and a time value feature. The music feature extraction module 2220 may include: the musical instrument digital interface feature obtaining submodule can be configured to obtain the musical instrument digital interface feature according to the pitch in the music score to be processed; the time value characteristic obtaining sub-module may be configured to normalize the note lengths in the score to be processed to obtain the time value characteristic.
In an exemplary embodiment, the end-to-end neural network model may include a text encoder, a musical instrument digital interface encoder, and a time value encoder. The spectrum information obtaining module 2240 may include: the text embedding vector obtaining sub-module may be configured to process the text feature by the text encoder to obtain a text embedding vector; the musical instrument digital interface embedded vector obtaining sub-module may be configured to process the musical instrument digital interface features through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector; the time value embedding vector obtaining sub-module may be configured to obtain a time value embedding vector by processing the time value features through the time value encoder; the fusion embedded vector obtaining sub-module may be configured to obtain a fusion embedded vector according to the text embedded vector, the musical instrument digital interface embedded vector and the time value embedded vector; the spectrum information obtaining sub-module may be configured to obtain the spectrum information according to the fusion embedded vector.
In an exemplary embodiment, the instrument digital interface encoder may include a first dense neural network. Wherein the musical instrument digital interface embedded vector obtaining sub-module may include: a first dense vector obtaining unit that may be configured to obtain a first dense vector by processing the musical instrument digital interface feature through the first dense neural network; the musical instrument digital interface embedded vector obtaining unit may be configured to obtain the musical instrument digital interface embedded vector from the first dense vector.
In an exemplary embodiment, the instrumental digital interface encoder may further include a second dense neural network, a forward portal recurrent neural network, and a reverse portal recurrent neural network. Wherein the musical instrument digital interface embedded vector obtaining unit may include: a second dense vector obtaining subunit, which may be configured to process the first dense vector through the second dense neural network to obtain a second dense vector; the first feature map obtaining subunit may be configured to process the second dense vector through the forward gate recurrent neural network to obtain a first feature map; the second feature map obtaining subunit may be configured to process the first feature map through the inverse gate recurrent neural network to obtain a second feature map; the musical instrument digital interface embedded vector obtaining subunit may be configured to concatenate the second feature map and the second dense vector to obtain the musical instrument digital interface embedded vector.
In an exemplary embodiment, the time value encoder may include a third dense neural network. Wherein the value embedding vector obtaining sub-module may include: the value embedding vector obtaining unit may be configured to obtain the value embedding vector by processing the value feature through the third dense neural network.
In an exemplary embodiment, the end-to-end neural network model may further include an attention mechanism module and a spectrum decoder. Wherein, the spectrum information obtaining sub-module may include: the attention context vector obtaining unit may be configured to obtain an attention context vector by processing the fused embedded vector by the attention mechanism module; the spectrum information obtaining unit may be configured to obtain the spectrum information by processing the attention context vector by the spectrum decoder.
In an exemplary embodiment, the spectral information may include mel-spectrum parameters and linear-spectrum parameters. Wherein the singing audio synthesis module 2250 may include: the singing audio synthesis submodule can be configured to process the mel spectrum parameters and the linear spectrum parameters through a neural network vocoder to synthesize the singing audio.
In an exemplary embodiment, the audio synthesizing apparatus 2200 may further include: the sample information acquisition module can be configured to acquire a sample music score, sample lyrics and sample singing audio thereof; a sample music feature extraction module, which may be configured to extract sample music features from the sample score; a sample text feature extraction module may be configured to extract sample text features from the sample lyrics; the sample spectrum information obtaining module may be configured to obtain sample spectrum information from the sample singing audio; the spectrum prediction module can be configured to process the sample music features and the sample text features through the end-to-end neural network model to obtain prediction spectrum information; the model training module may be configured to train the end-to-end neural network model based on the sample spectrum information and the predicted spectrum information.
The specific implementation of each module, sub-module, unit and sub-unit in the audio synthesis device provided in the embodiments of the present disclosure may refer to the content in the above audio synthesis method, and will not be described herein again.
It should be noted that although the above detailed description mentions several modules, sub-modules, units, and sub-units of the apparatus for performing actions, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules, sub-modules, units, or sub-units described above may be embodied in a single module, sub-module, unit, or sub-unit. Conversely, the features and functions of one module, sub-module, unit, or sub-unit described above may be further divided and embodied by a plurality of modules, sub-modules, units, or sub-units.
From the above description of the embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (18)

1. A method of audio synthesis, comprising:
obtaining a music score to be processed and lyrics to be processed;
extracting music features from the score to be processed, wherein the music features comprise musical instrument digital interface features and time value features;
extracting text features from the lyrics to be processed;
processing the music features and the text features through an end-to-end neural network model to obtain spectrum information, wherein the end-to-end neural network model comprises a text encoder, a musical instrument digital interface encoder and a time value encoder, and the obtaining of the spectrum information comprises: processing the text features through the text encoder to obtain text embedded vectors; processing the musical instrument digital interface features through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector; processing the time value features through the time value encoder to obtain a time value embedded vector; obtaining a fusion embedded vector according to the text embedded vector, the musical instrument digital interface embedded vector and the time value embedded vector; obtaining the spectrum information according to the fusion embedded vector;
and synthesizing singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information.
2. The method of claim 1, wherein extracting music features from the score to be processed comprises:
obtaining the musical instrument digital interface features according to the pitch in the music score to be processed;
and normalizing the note lengths in the music score to be processed to obtain the time value features.
3. The method of claim 1, wherein the musical instrument digital interface encoder comprises a first dense neural network; wherein processing the musical instrument digital interface features through the musical instrument digital interface encoder to obtain the musical instrument digital interface embedded vector comprises:
processing the musical instrument digital interface features through the first dense neural network to obtain a first dense vector;
and obtaining the musical instrument digital interface embedded vector according to the first dense vector.
4. The method of claim 3, wherein the instrumental digital interface encoder further comprises a second dense neural network, a forward portal recurrent neural network, and a reverse portal recurrent neural network; wherein obtaining the musical instrument digital interface embedded vector from the first dense vector comprises:
Processing the first dense vector through the second dense neural network to obtain a second dense vector;
processing the second dense vector through the forward gate recurrent neural network to obtain a first feature map;
Processing the first characteristic map through the reverse gate recurrent neural network to obtain a second characteristic map;
and cascading the second feature map and the second dense vector to obtain the musical instrument digital interface embedded vector.
5. The method according to claim 1, wherein the time value encoder comprises a third dense neural network; wherein the processing of the time value features by the time value encoder to obtain a time value embedded vector comprises:
and processing the time value characteristic through the third dense neural network to obtain the time value embedded vector.
6. The method of claim 1, wherein the end-to-end neural network model further comprises an attention mechanism module and a spectrum decoder; wherein, according to the fusion embedded vector, obtaining the spectrum information includes:
Processing the fusion embedded vector through the attention mechanism module to obtain an attention context vector;
And processing the attention context vector through the spectrum decoder to obtain the spectrum information.
7. The method of claim 1, wherein the spectral information comprises mel-spectral parameters and linear spectral parameters; wherein, synthesizing singing audio corresponding to the music score to be processed and lyrics to be processed according to the frequency spectrum information comprises:
And processing the Mel spectrum parameters and the linear spectrum parameters by a neural network vocoder to synthesize the singing audio.
8. The method as recited in claim 1, further comprising:
Acquiring a sample music score, sample lyrics and sample singing audio;
Extracting sample music characteristics from the sample music score;
Extracting sample text features from the sample lyrics;
obtaining sample spectrum information according to the sample singing audio;
Processing the sample music characteristics and the sample text characteristics through the end-to-end neural network model to obtain prediction spectrum information;
And training the end-to-end neural network model according to the sample spectrum information and the predicted spectrum information.
9. An audio synthesis device, comprising:
the music score lyrics acquisition module is configured to acquire a music score to be processed and lyrics to be processed;
a music feature extraction module configured to extract music features from the score to be processed, the music features including musical instrument digital interface features and time value features;
the text feature extraction module is configured to extract text features from the lyrics to be processed;
A spectrum information obtaining module configured to obtain spectrum information by processing the music feature and the text feature through an end-to-end neural network model including a text encoder, an instrument digital interface encoder, and a time value encoder, the spectrum information obtaining module comprising: a text embedding vector obtaining sub-module configured to obtain a text embedding vector by processing the text feature by the text encoder; the musical instrument digital interface embedded vector obtaining submodule is configured to process the musical instrument digital interface characteristics through the musical instrument digital interface encoder to obtain a musical instrument digital interface embedded vector; a time value embedding vector obtaining sub-module configured to obtain a time value embedding vector by processing the time value characteristics through the time value encoder; the fusion embedded vector obtaining submodule is configured to obtain a fusion embedded vector according to the text embedded vector, the musical instrument digital interface embedded vector and the time value embedded vector; the spectrum information obtaining sub-module is configured to obtain the spectrum information according to the fusion embedded vector;
And the singing audio synthesis module is configured to synthesize singing audio corresponding to the music score to be processed and the lyrics to be processed according to the frequency spectrum information.
10. The apparatus of claim 9, wherein the music feature extraction module comprises:
a musical instrument digital interface feature obtaining sub-module configured to obtain the musical instrument digital interface feature according to the pitch in the score to be processed;
And the time value characteristic obtaining sub-module is configured to normalize the note lengths in the score to be processed to obtain the time value characteristic.
11. The apparatus of claim 9, wherein the musical instrument digital interface encoder comprises a first dense neural network; wherein the musical instrument digital interface embedded vector obtaining sub-module comprises:
A first dense vector obtaining unit configured to obtain a first dense vector by processing the musical instrument digital interface feature through the first dense neural network;
And the musical instrument digital interface embedded vector obtaining unit is configured to obtain the musical instrument digital interface embedded vector according to the first dense vector.
12. The apparatus of claim 11, wherein the instrumental digital interface encoder further comprises a second dense neural network, a forward portal recurrent neural network, and a reverse portal recurrent neural network; wherein the musical instrument digital interface embedding vector obtaining unit includes:
A second dense vector obtaining subunit configured to process the first dense vector through the second dense neural network to obtain a second dense vector;
A first feature map obtaining subunit configured to process the second dense vector through the forward gate recurrent neural network to obtain a first feature map;
A second feature map obtaining subunit configured to process the first feature map through the inverse gate recurrent neural network to obtain a second feature map;
and an instrumental digital interface embedded vector obtaining subunit configured to concatenate the second feature map and the second dense vector to obtain the instrumental digital interface embedded vector.
13. The apparatus of claim 9, wherein the time value encoder comprises a third dense neural network; wherein the time value embedding vector obtaining sub-module comprises:
a time value embedding vector obtaining unit configured to obtain the time value embedded vector by processing the time value features through the third dense neural network.
14. The apparatus of claim 9, wherein the end-to-end neural network model further comprises an attention mechanism module and a spectrum decoder; wherein the spectrum information obtaining submodule includes:
An attention context vector obtaining unit configured to obtain an attention context vector by processing the fused embedded vector by the attention mechanism module;
And a spectrum information obtaining unit configured to obtain the spectrum information by processing the attention context vector by the spectrum decoder.
15. The apparatus of claim 9, wherein the spectral information comprises mel-spectral parameters and linear spectral parameters; wherein, the singing audio synthesis module includes:
and the singing audio synthesis submodule is configured to process the Mel spectrum parameters and the linear spectrum parameters through a neural network vocoder and synthesize the singing audio.
16. The apparatus as recited in claim 9, further comprising:
a sample information acquisition module configured to acquire a sample score, sample lyrics, and sample singing audio thereof;
A sample music feature extraction module configured to extract sample music features from the sample score;
A sample text feature extraction module configured to extract sample text features from the sample lyrics;
a sample spectrum information obtaining module configured to obtain sample spectrum information from the sample singing audio;
The frequency spectrum prediction module is configured to process the sample music characteristics and the sample text characteristics through the end-to-end neural network model to obtain predicted frequency spectrum information;
and the model training module is configured to train the end-to-end neural network model according to the sample spectrum information and the prediction spectrum information.
17. An electronic device, comprising:
One or more processors;
Storage means configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the audio synthesis method of any of claims 1 to 8.
18. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the audio synthesis method according to any one of claims 1 to 8.
CN201911114561.3A 2019-11-14 2019-11-14 Audio synthesis method and device, electronic equipment and computer readable storage medium Active CN112802446B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911114561.3A CN112802446B (en) 2019-11-14 2019-11-14 Audio synthesis method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911114561.3A CN112802446B (en) 2019-11-14 2019-11-14 Audio synthesis method and device, electronic equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112802446A CN112802446A (en) 2021-05-14
CN112802446B true CN112802446B (en) 2024-05-07

Family

ID=75804047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911114561.3A Active CN112802446B (en) 2019-11-14 2019-11-14 Audio synthesis method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112802446B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421589B (en) * 2021-06-30 2024-03-01 平安科技(深圳)有限公司 Singer identification method, singer identification device, singer identification equipment and storage medium
CN113506560B (en) * 2021-07-21 2023-06-09 四川启睿克科技有限公司 Singing voice synthesizing method and device capable of keeping pitch
CN113516964B (en) * 2021-08-13 2022-05-27 贝壳找房(北京)科技有限公司 Speech synthesis method and readable storage medium
CN113593520B (en) * 2021-09-08 2024-05-17 广州虎牙科技有限公司 Singing voice synthesizing method and device, electronic equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104766602A (en) * 2014-01-06 2015-07-08 安徽科大讯飞信息科技股份有限公司 Fundamental synthesis parameter generation method and system in singing synthesis system
CN109326280A (en) * 2017-07-31 2019-02-12 科大讯飞股份有限公司 One kind singing synthetic method and device, electronic equipment
US10068557B1 (en) * 2017-08-23 2018-09-04 Google Llc Generating music with deep neural networks
KR101939001B1 (en) * 2017-12-06 2019-01-15 한국과학기술원 Method and System for Audio and Score Alignment of Music Using Neural Network-Based Automatic Music Transcription
CN110415677A (en) * 2018-04-26 2019-11-05 腾讯科技(深圳)有限公司 Audio generation method and device and storage medium
CN109801608A (en) * 2018-12-18 2019-05-24 武汉西山艺创文化有限公司 A kind of song generation method neural network based and system
CN110148394A (en) * 2019-04-26 2019-08-20 平安科技(深圳)有限公司 Song synthetic method, device, computer equipment and storage medium
CN110164412A (en) * 2019-04-26 2019-08-23 吉林大学珠海学院 A kind of music automatic synthesis method and system based on LSTM

Also Published As

Publication number Publication date
CN112802446A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
CN112802446B (en) Audio synthesis method and device, electronic equipment and computer readable storage medium
CN111369967B (en) Virtual character-based voice synthesis method, device, medium and equipment
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN115485766A (en) Speech synthesis prosody using BERT models
CN111161695B (en) Song generation method and device
CN113658577B (en) Speech synthesis model training method, audio generation method, equipment and medium
CN112102811B (en) Optimization method and device for synthesized voice and electronic equipment
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
JP2020034883A (en) Voice synthesizer and program
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
KR102272554B1 (en) Method and system of text to multiple speech
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN112035699A (en) Music synthesis method, device, equipment and computer readable medium
CN115620699A (en) Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
Gibbon Prosody: The rhythms and melodies of speech
CN115101046A (en) Method and device for synthesizing voice of specific speaker
Anumanchipalli Intra-lingual and cross-lingual prosody modelling
Guennec Study of unit selection text-to-speech synthesis algorithms
CN116645957B (en) Music generation method, device, terminal, storage medium and program product
Kwon Voice-driven sound effect manipulation
Сатыбалдиыева et al. Analysis of methods and models for automatic processing systems of speech synthesis
CN116312465A (en) Singing voice conversion method, tone conversion model training method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40044661

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant