WO2020248388A1 - Method and device for training singing voice synthesis model, computer apparatus, and storage medium - Google Patents

Method and device for training singing voice synthesis model, computer apparatus, and storage medium

Info

Publication number
WO2020248388A1
Authority
WO
WIPO (PCT)
Prior art keywords
singing voice
model
score
synthesis model
data
Prior art date
Application number
PCT/CN2019/103426
Other languages
French (fr)
Chinese (zh)
Inventor
王健宗
曾振
罗剑
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020248388A1

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L13/00 Speech synthesis; Text to speech systems
            • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
              • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
          • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
            • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
            • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
              • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
                • G10L25/60 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals

Definitions

  • This application relates to the field of audio signal processing applied to singing, in particular to a training method, device, computer equipment and storage medium of a singing voice synthesis model.
  • Singing voice synthesis technology refers to the process by which a machine automatically synthesizes a human-like singing voice from a provided music score and lyrics.
  • Singing voice synthesis involves two aspects: speech synthesis and pitch. Compared with speech synthesis, singing voice synthesis additionally has to process pitch information. Because of this added pitch dimension, singing data is more complex than speech data; in particular, the control of voice frequency during simulation is stricter and the signal is more variable, so singing voice synthesis is more difficult than speech synthesis.
  • Singing voice synthesis must output a smooth audio curve over the course of the song so that the sound is sufficiently natural, while also reproducing the detailed characteristics of the human voice at note onsets and transitions, so that the synthesized singing voice sounds like a person.
  • Existing solutions are based on voice simulation models for speech. To obtain a better singing voice synthesis effect, they usually either train the model with more singing voice data or use a more complex singing voice synthesis model, feeding large amounts of recorded singing into loop training to extract singing-voice features and improve the accuracy of the model for singing voice synthesis.
  • However, recording singing voice data consumes a great deal of manpower and financial resources, so researchers must instead continuously optimize the model itself.
  • The main technical problem addressed by this application is to provide a method and device for training a singing voice synthesis model, a computer apparatus, and a storage medium, which can solve the prior-art problem that training a singing voice synthesis model requires large amounts of manpower and financial resources to record singing voice data.
  • The training method for the singing voice synthesis model includes: preprocessing music score data and recorded singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data; inputting the score features into the singing voice synthesis model to generate a synthesized singing voice; determining whether the score value of the synthesized singing voice under a singing voice evaluation model is lower than the score value of the first acoustic feature parameters under the singing voice evaluation model; and, if so, performing a first model parameter optimization on the singing voice synthesis model according to the score value of the synthesized singing voice, until the score value of the singing voice generated by the optimized synthesis model is greater than or equal to the score value of the first acoustic feature parameters under the optimized singing voice evaluation model.
  • Another technical solution adopted by this application is to provide a computer device that includes a processor and a memory, where the memory stores computer-readable instructions and the processor, in operation, executes the computer-readable instructions to implement any of the training methods of the singing voice synthesis model described above.
  • Yet another technical solution adopted by this application is to provide a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement any of the training methods of the singing voice synthesis model described above.
  • FIG. 1 is a schematic flowchart of an implementation manner of a training method for a singing voice synthesis model of the present application
  • FIG. 2 is a schematic flowchart of an embodiment of step S100 in FIG. 1;
  • FIG. 3 is a schematic diagram of the framework of an embodiment of the training method for a singing voice synthesis model according to this application;
  • FIG. 4 is a schematic flowchart of an embodiment of step S200 in FIG. 1;
  • Figure 5 is a schematic diagram of the causal convolutional network structure of the present application.
  • FIG. 6 is a schematic flowchart of an embodiment of step S300 in FIG. 1;
  • FIG. 7 is a schematic flowchart of an embodiment of step S400 in FIG. 1;
  • FIG. 8 is a schematic flowchart of an embodiment of step S420 in FIG. 7;
  • FIG. 9 is a schematic block diagram of a first embodiment of a training device for a singing voice synthesis model provided by the present application.
  • Figure 10 is a schematic block diagram of an embodiment of a computer device provided by the present application.
  • Fig. 11 is a schematic block diagram of an embodiment of a computer-readable storage medium provided by the present application.
  • The terms “first”, “second”, and “third” in this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, features qualified by “first”, “second”, or “third” may explicitly or implicitly include at least one such feature.
  • “A plurality of” means at least two, for example two or three, unless specifically defined otherwise. All directional indications (such as up, down, left, right, front, back and so on) in the embodiments of this application are used only to explain the relative positional relationship between components in a specific posture (as shown in the drawings); if that specific posture changes, the directional indication changes accordingly.
  • FIG. 1 is a schematic flowchart of an embodiment of a training method for a singing voice synthesis model according to this application. As shown in FIG. 1, the training method for a singing voice synthesis model according to this application includes the following steps:
  • S100 Perform preprocessing on the score data and the recorded singing voice data to extract score features in the score data and first acoustic feature parameters in the singing voice data.
  • the data required for singing voice synthesis mainly includes score data and singing voice data.
  • The singing voice data in this application refers to audio data recorded by a singer, not singing voice data synthesized by a machine. Its preprocessing flow is shown in FIG. 2, a flowchart of one implementation of step S100.
  • As shown in FIG. 2, step S100 further includes the following sub-steps:
  • S110 Perform pre-emphasis processing on the singing voice data to make the signal spectrum of the singing voice data flat.
  • In step S110, the purpose of pre-emphasis is to boost the high-frequency part of the singing voice data so that the signal spectrum becomes flat, allowing the spectrum to be computed with the same signal-to-noise ratio over the whole band from low to high frequencies.
  • Pre-emphasis also counteracts the effects of the vocal cords and lips during phonation, compensating for the high-frequency components of the singing voice data (the speech signal) that the vocal apparatus suppresses, and highlights the high-frequency formants. In practice, pre-emphasis amounts to passing the singing voice data through a high-pass filter.
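  • For illustration, such a filter is commonly implemented as the first-order high-pass y[n] = x[n] - a*x[n-1]; the coefficient a = 0.97 in the sketch below is an assumed value, since the application does not fix one:

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```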
  • S120: Divide the pre-emphasized singing voice data into an integer number of frames. N sampling points of the pre-emphasized singing voice data are grouped into one observation unit, called a frame.
  • Typically N is 256 or 512, covering roughly 20 to 30 ms. To avoid excessive change between two adjacent frames, adjacent frames share an overlapping region of M sampling points, where M is usually about 1/2 or 1/3 of N.
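  • A minimal framing sketch under these values (N = 512 samples per frame with an assumed overlap of M = N/2):

```python
import numpy as np

def frame_signal(signal: np.ndarray, N: int = 512, M: int = 256) -> np.ndarray:
    """Split the signal into overlapping frames of N samples; adjacent frames
    share M samples, so the hop between frame starts is N - M."""
    step = N - M
    n_frames = max(1 + (len(signal) - N) // step, 0)
    if n_frames == 0:
        return np.empty((0, N))
    return np.stack([signal[i * step : i * step + N] for i in range(n_frames)])
```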
  • Optionally, windowing is performed on each frame of the singing voice data signal.
  • For the selection of the window function, the nature of the analyzed signal and the processing requirements should be considered.
  • In this embodiment, a rectangular window, a triangular window (also known as the Fejér window), a Hanning window, a Hamming window, or a Gaussian window can be used; no specific limitation is made here.
  • Different window functions affect the signal spectrum differently, mainly because different window functions produce different amounts of spectral leakage and have different frequency resolution capabilities.
  • In one specific embodiment of this application, the window function is a Hamming window, and each frame of the singing voice data signal is multiplied by the Hamming window to increase the continuity between the left end and the right end of each frame.
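  • Applied to the frames produced above, the Hamming windowing step is a single elementwise multiplication:

```python
import numpy as np

def window_frames(frames: np.ndarray) -> np.ndarray:
    """Multiply every N-sample frame by a Hamming window to smooth the frame
    edges and improve continuity between each frame's left and right ends."""
    return frames * np.hamming(frames.shape[1])
```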
  • S130 Acquire frequency spectrum information of each frame of singing voice data to obtain a first acoustic characteristic parameter of each frame of singing voice data.
  • Because the characteristics of a signal are usually difficult to see from its variation in the time domain, the signal is normally converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different singing voice data signals.
  • Therefore, after multiplication by the Hamming window, each frame of the singing voice data signal must further undergo a fast Fourier transform to obtain its energy distribution over the spectrum. Applying the fast Fourier transform to each framed and windowed frame of singing data yields the spectrum information of that frame, from which the first acoustic feature parameters of each frame of singing data are obtained. The first acoustic feature parameters include at least the spectral line energy, the fundamental frequency feature, and the Mel-frequency cepstral coefficients (MFCC).
  • The spectral line energy is obtained by taking the squared magnitude of the spectrum of each frame of the singing voice data signal.
  • The Mel-frequency cepstral coefficients are obtained by passing the above spectrum through a Mel filter bank to obtain the Mel spectrum; the Mel spectrum converts the linear natural spectrum into one that reflects the characteristics of human hearing. Cepstral analysis is then performed on the Mel spectrum, that is, the logarithm is taken and an inverse transform applied; in practice the inverse transform is generally realized by the discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients, giving the Mel-frequency cepstral coefficients.
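  • One possible per-frame implementation of this pipeline is sketched below; the sample rate, FFT size, and filter count are assumed values, and the Mel filter bank is taken from librosa rather than from the application:

```python
import numpy as np
import librosa
from scipy.fftpack import dct

def frame_features(frame: np.ndarray, sr: int = 16000, n_fft: int = 512):
    """Return (spectral line energy, MFCC) for one windowed frame."""
    spectrum = np.fft.rfft(frame, n_fft)
    line_energy = np.abs(spectrum) ** 2             # squared magnitude per bin
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=26)
    log_mel = np.log(mel_fb @ line_energy + 1e-10)  # cepstral analysis: log ...
    cepstrum = dct(log_mel, norm="ortho")           # ... then DCT as inverse transform
    mfcc = cepstrum[1:13]                           # 2nd..13th coefficients as MFCC
    return line_energy, mfcc
```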
  • Step S100 also includes preprocessing the score data to obtain the score features. Specifically, the system extracts the lyrics, the note pitch sequence, the note duration sequence, and other information from the score data; the lyric text is then turned into a text annotation sequence by text-analysis computer-readable instructions, a trained speech synthesis model generates the hidden Markov model (HMM) sequence corresponding to the lyrics, and the spectral line parameters are further predicted from that sequence, in effect telling the processor (computer) how the lyrics are pronounced in speech.
  • Since the pronunciation duration of each syllable is jointly determined by the note duration and the note type in the score, the spectral feature sequence of the speech must be stretched or compressed under the note duration constraints given by the score to generate the spectral feature sequence corresponding to the singing voice.
  • Since the fundamental frequency of the singing voice is determined by the note pitch of the musical score, an initial discrete step-wise fundamental frequency trajectory must be generated from the note pitch and note duration given by the score; after a fundamental frequency control model adds fundamental frequency overshoot, preparatory dynamic features, and vibrato, the fundamental frequency trajectory corresponding to the singing voice is generated, and finally a speech vocoder synthesizes the singing voice from the spectral feature sequence and the fundamental frequency trajectory.
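  • The initial step-wise trajectory can be illustrated as follows; MIDI note numbers, the frame rate, and the equal-temperament conversion are assumptions made for the sketch:

```python
import numpy as np

def step_f0_trajectory(midi_pitches, durations_s, frames_per_s=100):
    """Build the initial discrete step F0 trajectory from note pitches and
    durations; overshoot, preparatory dynamics, and vibrato would then be
    added by the fundamental frequency control model."""
    f0 = []
    for pitch, dur in zip(midi_pitches, durations_s):
        hz = 440.0 * 2 ** ((pitch - 69) / 12)   # equal-temperament pitch -> Hz
        f0.extend([hz] * round(dur * frames_per_s))
    return np.array(f0)

# e.g., two half-second notes, A4 then C5
trajectory = step_f0_trajectory([69, 72], [0.5, 0.5])
```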
  • S200 Input the music score feature into the singing voice synthesis model to generate a synthesized singing voice.
  • FIG. 3 is a schematic diagram of the framework of an embodiment of the singing voice synthesis training method of this application
  • FIG. 4 is a schematic flowchart of an embodiment of step S200 of this application. Step S200 further includes the following sub-steps:
  • S210: Input the score features into the singing voice synthesis model in sequence. The score features extracted from the score data in step S100, that is, the HMM sequence, the spectral feature sequence, and so on generated from the lyrics, note pitch sequence, note duration sequence, and other information in the score data, are input to the singing voice synthesis model.
  • the singing voice synthesis system in the present application may mainly include two models, namely a singing voice synthesis model and a singing voice discriminant model.
  • the singing voice synthesis model can be constructed by a causal convolutional network (WaveNet).
  • Referring to FIG. 5, a schematic diagram of the causal convolutional network structure of this application, the network structure of the causal convolutional neural network specifically includes an input layer (Input), multiple hidden layers (Hidden Layer), and an output layer (Output).
  • WaveNet is an autoregressive deep generative model; it models directly at the speech waveform level and decomposes the joint probability of the waveform sequence into a product of conditional probabilities.
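  • A minimal causal convolution layer in the spirit of such a network is sketched below; PyTorch and the layer sizes are assumptions, as the application does not name a framework:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution whose output at time t depends only on inputs at <= t."""
    def __init__(self, channels: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-pad so no future leaks in
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); pad only on the left (past) side
        return self.conv(F.pad(x, (self.pad, 0)))

# Stacking layers with doubling dilations gives the exponentially growing
# receptive field characteristic of WaveNet-style models.
stack = nn.Sequential(*[CausalConv1d(64, 2, dilation=2 ** i) for i in range(6)])
```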
  • the singing voice discriminant model in this application can also be constructed using a causal convolutional network.
  • In a specific implementation, the acoustic feature parameters of a piece of audio data are input into the singing voice discriminant model frame by frame, and the singing voice discriminant model gives a naturalness score for that piece of audio data. The higher the score, the better the synthesis effect of the singing voice synthesis model, or the more likely it is that the audio data represented by the acoustic parameters was recorded by a singer.
  • S220 Acquire a second acoustic feature parameter of each frame according to the music score feature.
  • Optionally, after the score feature parameters are input into the singing voice synthesis model in sequence, the second acoustic feature parameters corresponding to each frame are obtained.
  • The second acoustic feature parameters are of the same kind as the first acoustic feature parameters of the singing voice data, and the data processing is similar; see the acquisition process of the first acoustic feature parameters, not repeated here.
  • That is, the second acoustic feature parameters in this embodiment also include the spectral line energy, the fundamental frequency feature, and the Mel-frequency cepstral coefficients.
  • S230 Input the second acoustic characteristic parameter into the speech synthesis vocoder to generate a synthesized singing voice.
  • a speech synthesis vocoder is used to convert the second acoustic characteristic parameter into a synthesized singing voice.
  • S300 Determine whether the score value of the synthesized singing voice in the singing voice evaluation model is lower than the score value of the first acoustic feature parameter in the singing voice evaluation model.
  • FIG. 6 is a schematic flowchart of an embodiment of step S300 in this application, and step S300 in FIG. 6 further includes the following sub-steps:
  • The singing voice evaluation model generally scores the naturalness of a piece of synthesized or recorded singing.
  • In mathematical terms it is a classification model, and it tries to distinguish the synthesized singing voice from the recorded singing voice as far as possible.
  • S320 Distinguish the second acoustic feature parameter from the first acoustic feature parameter through a two-classification algorithm to obtain score values of the first acoustic feature parameter and the second acoustic feature parameter in the singing voice evaluation model, respectively.
  • the singing voice evaluation model distinguishes the second acoustic feature parameter from the first acoustic feature parameter through a two-classification algorithm.
  • The binary classification algorithms that can be used in this application include the decision tree algorithm, the random forest algorithm, naive Bayes, logistic regression, and the like; no specific limitation is made here.
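  • For instance, with logistic regression (one of the options above), the evaluation model can be sketched as follows; scikit-learn, the feature shapes, and the random placeholder data are assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Rows are per-frame acoustic feature vectors (e.g., 13 MFCCs); label 1 marks
# recorded singing (first acoustic features), 0 marks synthesized singing.
recorded = np.random.randn(500, 13)        # placeholder for real features
synthesized = np.random.randn(500, 13)
X = np.vstack([recorded, synthesized])
y = np.concatenate([np.ones(500), np.zeros(500)])

clf = LogisticRegression(max_iter=1000).fit(X, y)

# A clip's "score value" can then be its mean predicted probability of being
# recorded singing, i.e., its naturalness score under the evaluation model.
score = clf.predict_proba(synthesized)[:, 1].mean()
```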
  • It should be noted that the first acoustic feature parameters in the present application are not the acoustic feature parameters of a synthesized singing voice but those of a singing voice formed through recording, and they serve as the criterion against which the second acoustic feature parameters are judged.
  • The singing voice evaluation model scores the first acoustic feature parameters and the second acoustic feature parameters separately to obtain the score values of the two.
  • The score value of an acoustic feature parameter under the singing voice evaluation model represents the naturalness and human-likeness of the singing voice, thereby distinguishing singing synthesized by a machine from singing recorded by a singer. Here, the degree of human-likeness indicates the similarity between the synthesized singing voice represented by the second acoustic parameters and the singing voice data represented by the first acoustic feature parameters; the higher the score value, the more natural the singing voice.
  • The score value of the first acoustic feature parameters is used as the reference standard, and the score value of the second acoustic feature parameters is compared with it. If the score value of the second acoustic feature parameters is lower than that of the first, the naturalness and human-likeness of the synthesized singing voice are worse than those of the recorded singing voice data; in that case the first model parameter optimization of the singing voice synthesis model is required, and the method proceeds to step S400.
  • Otherwise, the method proceeds to step S500 and ends.
  • S400: The first model parameter optimization of the singing voice synthesis model is performed according to the score value of the synthesized singing voice, until the score value of the singing voice generated by the optimized synthesis model is greater than or equal to the score value of the first acoustic feature parameters under the optimized singing voice evaluation model.
  • Fig. 7 is a schematic flowchart of an embodiment of step S400 of this application, and step S400 further includes the following sub-steps:
  • S410: Obtain the deviation of the singing voice synthesis model according to the score value of the second acoustic feature parameters.
  • S420: Perform parameter optimization on the singing voice synthesis model to reduce the deviation of the singing voice synthesis model.
  • a gradient descent algorithm can be used to optimize the model, so that the value of the above formula is continuously reduced and the deviation of the synthetic model is reduced.
  • Fig. 8 is a schematic flowchart of an implementation manner of step S420 of this application, and step S420 further includes the following sub-steps:
  • S421: Obtain the gradient descent distance of each first model parameter. The hypothesis function is taken as known; its selection, together with that of the loss function, is related to the regression method adopted by the gradient descent algorithm applied during optimization. That is, if a different regression method is used, there will be a different hypothesis function and loss function.
  • When the curve corresponding to the hypothesis function differs from the actual distribution of the data, the model cannot fit it (the fitting process is called regression); the loss function is therefore needed to make the difference between the estimate obtained by the model and the actual value as small as possible.
  • The gradient descent distance is the step size α multiplied by the gradient of the loss function, i.e. l_i = α · ∂J(θ)/∂θ_i, where ∂J(θ)/∂θ_i is the gradient expression of the loss function with respect to the first model parameter θ_i.
  • S422 Determine whether the gradient descent distances of all the first model parameters are less than the termination distance.
  • In step S422, it is determined whether the gradient descent distance l_i of every first model parameter θ_i is less than the termination distance ε. If the determination is yes, step S423 is entered (stop optimizing the first model parameters of the singing voice synthesis model); otherwise, step S424 is entered (update the first model parameters).
  • Optionally, the first model parameter θ_i takes its peak value when i equals 1.
  • The value of the first model parameter θ_i gradually decreases as the number of gradient descent iterations increases, and the value of the gradient descent distance l_i also gradually decreases, so the termination distance ε is approached more and more closely.
  • Optimizing the first model parameters θ_i by gradient descent in the manner described above benefits the fitting of the singing voice synthesis model and prevents the first model parameter θ_i from skipping past the optimal value, which would cause the singing voice synthesis model to overfit.
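  • A compact sketch of this loop (the toy loss and its gradient are assumptions for illustration):

```python
import numpy as np

def gradient_descent(theta, grad_fn, alpha=0.1, eps=1e-6, max_iter=10000):
    """Stop once every per-parameter descent distance l_i = alpha * |dJ/dtheta_i|
    falls below the termination distance eps; otherwise keep updating."""
    for _ in range(max_iter):
        step = alpha * grad_fn(theta)       # gradient descent distances l_i (S421)
        if np.all(np.abs(step) < eps):      # S422: all l_i < eps?
            break                           # S423: stop optimizing
        theta = theta - step                # S424: update the first model parameters
    return theta

# Toy quadratic loss J(theta) = ||theta||^2 / 2, whose gradient is theta itself
theta_opt = gradient_descent(np.array([3.0, -2.0]), grad_fn=lambda t: t)
```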
  • the training method of the singing voice synthesis model in the present application may further include performing second model parameter optimization on the singing voice evaluation model according to the score value of the synthesized singing voice.
  • the singing voice evaluation model in this application also uses a causal convolutional neural network, and the optimization process of the second model parameters for the singing voice evaluation model can refer to the process of optimizing the first model parameters for the singing voice synthesis model.
  • In contrast, the singing voice evaluation model performs gradient ascent on its loss to maximize the difference between the second acoustic feature parameters and the first acoustic feature parameters, that is, to keep the score value of the recorded singing data high and the score value of the synthesized singing voice low. The singing voice synthesis model and the singing voice evaluation model thus compete with each other during training in adversarial learning, thereby continuously improving the synthesis effect of the singing voice synthesis model.
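  • A compact sketch of this adversarial alternation is given below; PyTorch, the toy module shapes, and the placeholder batches are assumptions, not elements of the application:

```python
import torch
import torch.nn as nn

# Toy stand-ins: synth maps score features to acoustic parameters, and the
# evaluator maps acoustic parameters to a naturalness score in (0, 1).
synth = nn.Linear(8, 13)
evaluator = nn.Sequential(nn.Linear(13, 1), nn.Sigmoid())
opt_g = torch.optim.SGD(synth.parameters(), lr=1e-3)
opt_d = torch.optim.SGD(evaluator.parameters(), lr=1e-3)

for _ in range(100):                              # training iterations
    score_feats = torch.randn(32, 8)              # placeholder score features
    recorded_feats = torch.randn(32, 13)          # placeholder recorded singing
    synthesized = synth(score_feats)

    # Evaluation model: gradient ascent on the score gap (descend its negation)
    # so recorded singing scores high and synthesized singing scores low.
    d_loss = -(torch.log(evaluator(recorded_feats) + 1e-8).mean()
               + torch.log(1 - evaluator(synthesized.detach()) + 1e-8).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Synthesis model: first model parameter optimization, raising the
    # evaluator's score of the synthesized singing.
    g_loss = -torch.log(evaluator(synthesized) + 1e-8).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```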
  • In the above scheme, singing voice synthesis is performed by inputting the score features into the singing voice synthesis model, then obtaining the score value of the synthesized singing voice under the singing voice evaluation model and comparing it with the score value of the recorded singing voice.
  • When the synthesized singing voice scores lower, the parameters of the singing voice synthesis model are optimized and adjusted according to the score value of the synthesized singing voice, so that the score value of the singing voice synthesized by the model becomes, as far as possible, greater than or equal to that of the recorded singing voice under the singing voice evaluation model.
  • After such training, the singing voice synthesis model takes score data including the lyrics, the note pitch sequence, and the note duration sequence as input for singing voice synthesis; the singing voice it synthesizes has a high degree of naturalness, simulating a natural and smooth singing effect, and a high degree of human-likeness, so the synthesized singing better matches the characteristics of the human voice.
  • FIG. 9 is a schematic block diagram of a first embodiment of a training device for a singing voice synthesis model provided by the present application.
  • The training device in this embodiment includes a processing module 31, a generating module 32, a judgment module 33, and an optimization module 34.
  • the processing module 31 is used to preprocess the score data and the singing voice data to extract the score features in the score data and the first acoustic feature parameter in the singing voice data.
  • the processing module 31 is configured to perform pre-emphasis processing on the singing voice data to make the signal spectrum of the singing voice data flat; divide the pre-emphasized singing voice data into integer frames; obtain the spectrum information of each frame of the singing voice data to obtain The first acoustic characteristic parameter of each frame of singing voice data.
  • the generating module 32 is used to input the music score feature into the singing voice synthesis model to generate the synthesized singing voice.
  • the generating module 32 is configured to sequentially input the music score features into the singing voice synthesis model; obtain the second acoustic feature parameters of each frame according to the music score features; input the second acoustic feature parameters into the speech synthesis vocoder to generate the synthesized singing voice.
  • the judgment module 33 is used to judge whether the score value of the synthesized singing voice in the singing voice evaluation model is lower than the score value of the first acoustic feature parameter in the singing voice evaluation model.
  • Optionally, the judging module 33 is configured to input the first acoustic feature parameters and the second acoustic feature parameters into the singing voice evaluation model; distinguish the second acoustic feature parameters from the first through a binary classification algorithm to obtain the respective score values of the first and second acoustic feature parameters under the singing voice evaluation model; and compare those score values.
  • The optimization module 34 is configured to perform the first model parameter optimization on the singing voice synthesis model according to the score value when it is judged that the score value of the synthesized singing voice under the singing voice evaluation model is lower than that of the first acoustic feature parameters.
  • Optionally, the optimization module 34 is configured to obtain the deviation of the singing voice synthesis model according to the score value of the second acoustic feature parameters, and to optimize the parameters of the singing voice synthesis model to reduce that deviation. This includes obtaining the gradient descent distance of the first model parameters; determining whether the gradient descent distances of all the first model parameters are less than the termination distance; if so, stopping the optimization of the first model parameters of the singing voice synthesis model; and if not, updating the first model parameters.
  • The processing module 31, the generating module 32, the judgment module 33, and the optimization module 34 in this embodiment correspond to steps S100 to S400 in the first embodiment above; for details, refer to the related descriptions of steps S100 to S400, which are not repeated here.
  • FIG. 10 is a schematic block diagram of an embodiment of a computer device provided by the present application.
  • the computer device in this embodiment includes a processor 51 and a memory 52.
  • the memory 52 stores computer-readable instructions.
  • The processor 51, in operation, executes the computer-readable instructions to implement the training method of the singing voice synthesis model in any of the above embodiments.
  • Specifically, the processor 51 is configured to: preprocess the score data and the singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data; input the score features into the singing voice synthesis model to generate the synthesized singing voice; determine whether the score value of the synthesized singing voice under the singing voice evaluation model is lower than the score value of the first acoustic feature parameters under that model; and, if the determination is yes, perform the first model parameter optimization on the singing voice synthesis model according to the score value of the synthesized singing voice, until the score value of the singing voice generated by the optimized synthesis model is greater than or equal to the score value of the first acoustic feature parameters under the optimized singing voice evaluation model.
  • the processor 51 controls the operation of the mobile terminal, and the processor 51 may also be referred to as a CPU (Central Processing Unit, central processing unit).
  • the processor 51 may be an integrated circuit chip with signal processing capability.
  • The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor, but is not limited to this.
  • FIG. 11 is a schematic structural diagram of an embodiment of a computer-readable non-volatile storage medium according to this application.
  • The computer-readable non-volatile storage medium of the present application stores computer-readable instructions 21 capable of implementing all of the above methods. The computer-readable instructions 21 may be stored in the above storage device in the form of a software product comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage devices include media that can store computer-readable instruction code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc, as well as terminal devices such as computers, servers, mobile phones, and tablets.
  • the disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • The functional units in the various embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • In summary, this application provides a method and device for training a singing voice synthesis model, a computer apparatus, and a non-volatile storage medium. Singing voice synthesis is performed by inputting the score features into the singing voice synthesis model, after which the score value of the synthesized singing voice under the singing voice evaluation model is obtained and compared with the score value of the recorded singing voice. The parameters of the singing voice synthesis model are then optimized and adjusted according to the score value of the synthesized singing voice, so that the score value of the singing voice synthesized by the model becomes greater than or equal to that of the recorded singing voice under the singing voice evaluation model, thereby continuously improving the synthesis effect of the singing voice synthesis model and making the synthesized singing voice more realistic.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Reverberation, Karaoke And Other Acoustics (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

The present application discloses an artificial intelligence method and device for training a singing voice synthesis model, a computer apparatus, and a storage medium. The method for training a singing voice synthesis model comprises: preprocessing music score data and singing voice data, and extracting a music score feature from the music score data and a first acoustic feature parameter from the singing voice data; inputting the music score feature into a singing voice synthesis model and generating a synthesized singing voice; determining whether or not the score value of the synthesized singing voice is lower than that of the first acoustic feature parameter in a singing voice evaluation model; and if so, performing first model parameter optimization on the singing voice synthesis model according to the score value of the synthesized singing voice, until the score value of a synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to that of the first acoustic feature parameter in the singing voice evaluation model that has undergone optimization. By using the technique, the present application can improve the synthesis effects of the singing voice synthesis model.

Description

Training method, device, computer equipment and storage medium of singing voice synthesis model
[Cross Reference]
This application is based on, and claims priority from, Chinese invention patent application No. 201910500699.0, filed on June 11, 2019 and titled "Training method, device, computer equipment and storage medium of singing voice synthesis model".
[Technical Field]
This application relates to the field of audio signal processing applied to singing, and in particular to a training method, device, computer equipment, and storage medium of a singing voice synthesis model.
[Background]
Singing voice synthesis technology refers to the process by which a machine automatically synthesizes a human-like singing voice from a provided music score and lyrics. Singing voice synthesis involves two aspects: speech synthesis and pitch. Compared with speech synthesis, singing voice synthesis additionally has to process pitch information. Because of this added pitch dimension, singing data is more complex than speech data; in particular, the control of voice frequency during simulation is stricter and the signal is more variable, so singing voice synthesis is more difficult than speech synthesis. Singing voice synthesis must output a smooth audio curve over the course of the song so that the sound is sufficiently natural, while also reproducing the detailed characteristics of the human voice at note onsets and transitions, so that the synthesized singing voice sounds like a person. Existing solutions are based on voice simulation models for speech; to obtain a better singing voice synthesis effect, they usually either train the model with more singing voice data or use a more complex singing voice synthesis model, feeding large amounts of recorded singing into loop training to extract singing-voice features and improve the accuracy of the model for singing voice synthesis. However, recording singing voice data consumes a great deal of manpower and financial resources, so researchers must instead continuously optimize the model itself.
[Summary of the Invention]
The main technical problem addressed by this application is to provide a method and device for training a singing voice synthesis model, a computer apparatus, and a storage medium, which can solve the prior-art problem that training a singing voice synthesis model requires large amounts of manpower and financial resources to record singing voice data.
To solve the above technical problem, one technical solution adopted in this application is to provide a training method for a singing voice synthesis model, which includes: preprocessing music score data and recorded singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data; inputting the score features into the singing voice synthesis model to generate a synthesized singing voice; determining whether the score value of the synthesized singing voice under a singing voice evaluation model is lower than the score value of the first acoustic feature parameters under the singing voice evaluation model; and, if so, performing a first model parameter optimization on the singing voice synthesis model according to the score value of the synthesized singing voice, until the score value of the singing voice generated by the optimized synthesis model is greater than or equal to the score value of the first acoustic feature parameters under the optimized singing voice evaluation model.
To solve the above technical problem, another technical solution adopted in this application is to provide a computer device that includes a processor and a memory, where the memory stores computer-readable instructions and the processor, in operation, executes the computer-readable instructions to implement any of the training methods of the singing voice synthesis model described above.
To solve the above technical problem, yet another technical solution adopted in this application is to provide a computer-readable storage medium storing computer-readable instructions that, when executed by a processor, implement any of the training methods of the singing voice synthesis model described above.
The details of one or more embodiments of this application are set out in the following drawings and description; other features and advantages of this application will be apparent from the description, the drawings, and the claims.
[Brief Description of the Drawings]
In order to describe the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative work, in which:
FIG. 1 is a schematic flowchart of an implementation of the training method for a singing voice synthesis model of this application;
FIG. 2 is a schematic flowchart of an implementation of step S100 in FIG. 1;
FIG. 3 is a schematic diagram of the framework of an implementation of the training method for a singing voice synthesis model of this application;
FIG. 4 is a schematic flowchart of an implementation of step S200 in FIG. 1;
FIG. 5 is a schematic diagram of the causal convolutional network structure of this application;
FIG. 6 is a schematic flowchart of an implementation of step S300 in FIG. 1;
FIG. 7 is a schematic flowchart of an implementation of step S400 in FIG. 1;
FIG. 8 is a schematic flowchart of an implementation of step S420 in FIG. 7;
FIG. 9 is a schematic block diagram of a first embodiment of the training device for a singing voice synthesis model provided by this application;
FIG. 10 is a schematic block diagram of an embodiment of the computer device provided by this application;
FIG. 11 is a schematic block diagram of an embodiment of the computer-readable storage medium provided by this application.
[Detailed Description]
The technical solutions in the embodiments of this application will be described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
The terms "first", "second", and "third" in this application are used only for descriptive purposes and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Therefore, features qualified by "first", "second", or "third" may explicitly or implicitly include at least one such feature. In the description of this application, "a plurality of" means at least two, for example two or three, unless specifically defined otherwise. All directional indications (such as up, down, left, right, front, back and so on) in the embodiments of this application are used only to explain the relative positional relationship between components in a specific posture (as shown in the drawings); if that specific posture changes, the directional indication changes accordingly. In addition, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally includes other steps or units inherent to the process, method, product, or device.
Reference to an "embodiment" herein means that a specific feature, structure, or characteristic described in conjunction with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
Referring to FIG. 1, a schematic flowchart of an implementation of the training method for a singing voice synthesis model of this application, the method includes the following steps:
S100: Preprocess the score data and the recorded singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data.
Optionally, the data required for singing voice synthesis mainly include score data and singing voice data. The singing voice data in this application refers to audio data recorded by a singer, not singing voice data synthesized by a machine; its preprocessing flow is shown in FIG. 2, a flowchart of an implementation of step S100. As shown in FIG. 2, step S100 further includes the following sub-steps:
S110: Perform pre-emphasis processing on the singing voice data to make the signal spectrum of the singing voice data flat.
In step S110, the purpose of pre-emphasis is to boost the high-frequency part of the singing voice data so that the signal spectrum becomes flat, allowing the spectrum to be computed with the same signal-to-noise ratio over the whole band from low to high frequencies. Pre-emphasis also counteracts the effects of the vocal cords and lips during phonation, compensating for the high-frequency components of the singing voice data (the speech signal) that the vocal apparatus suppresses, and highlights the high-frequency formants. In practice, pre-emphasis amounts to passing the singing voice data through a high-pass filter.
S120: Divide the pre-emphasized singing voice data into an integer number of frames.
Further, N sampling points of the pre-emphasized singing voice data are grouped into one observation unit, called a frame. Typically N is 256 or 512, covering roughly 20 to 30 ms. To avoid excessive change between two adjacent frames, adjacent frames share an overlapping region of M sampling points, where M is usually about 1/2 or 1/3 of N.
Optionally, windowing is performed on each frame of the singing voice data signal. For the selection of the window function, the nature of the analyzed signal and the processing requirements should be considered; in this embodiment a rectangular window, a triangular window (also known as the Fejér window), a Hanning window, a Hamming window, or a Gaussian window can be used, without specific limitation here. Different window functions affect the signal spectrum differently, mainly because different window functions produce different amounts of spectral leakage and have different frequency resolution capabilities. In one specific embodiment of this application, the window function is a Hamming window, and each frame of the singing voice data signal is multiplied by the Hamming window to increase the continuity between the left end and the right end of each frame.
S130: Acquire the spectrum information of each frame of singing voice data to obtain the first acoustic feature parameters of each frame of singing voice data.
Because the characteristics of a signal are usually difficult to see from its variation in the time domain, the signal is normally converted into an energy distribution in the frequency domain for observation; different energy distributions represent the characteristics of different singing voice data signals. Therefore, after multiplication by the Hamming window, each frame of the singing voice data signal must further undergo a fast Fourier transform to obtain its energy distribution over the spectrum. Applying the fast Fourier transform to each framed and windowed frame of singing data yields the spectrum information of that frame, from which the first acoustic feature parameters of each frame of singing data are obtained; the first acoustic feature parameters include at least the spectral line energy, the fundamental frequency feature, and the Mel-frequency cepstral coefficients (MFCC).
The spectral line energy and the Mel-frequency cepstral coefficients are obtained in the following way:
The spectral line energy is obtained by taking the squared magnitude of the spectrum of each frame of the singing voice data signal. The Mel-frequency cepstral coefficients are obtained by passing the above spectrum through a Mel filter bank to obtain the Mel spectrum; the Mel spectrum converts the linear natural spectrum into one that reflects the characteristics of human hearing. Cepstral analysis is then performed on the Mel spectrum, that is, the logarithm is taken and an inverse transform applied; in practice the inverse transform is generally realized by the discrete cosine transform (DCT), and the 2nd through 13th coefficients after the DCT are taken as the MFCC coefficients, giving the Mel-frequency cepstral coefficients.
Further, step S100 also includes preprocessing the score data to obtain the score features. Specifically, the system extracts the lyrics, the note pitch sequence, the note duration sequence, and other information from the score data; the lyric text is then turned into a text annotation sequence by text-analysis computer-readable instructions, a trained speech synthesis model generates the hidden Markov model (HMM) sequence corresponding to the lyrics, and the spectral line parameters are further predicted from that sequence, in effect telling the processor (computer) how the lyrics are pronounced in speech.
Further, since the pronunciation duration of each syllable is jointly determined by the note duration and the note type in the score, the spectral feature sequence of the speech must be stretched or compressed under the note duration constraints given by the score to generate the spectral feature sequence corresponding to the singing voice. And since the fundamental frequency of the singing voice is determined by the note pitch of the musical score, an initial discrete step-wise fundamental frequency trajectory must be generated from the note pitch and note duration given by the score; after a fundamental frequency control model adds fundamental frequency overshoot, preparatory dynamic features, and vibrato, the fundamental frequency trajectory corresponding to the singing voice is generated, and finally a speech vocoder synthesizes the singing voice from the spectral feature sequence and the fundamental frequency trajectory.
S200: Input the score features into the singing voice synthesis model to generate a synthesized singing voice.
Please refer to Figures 3 and 4 together. Figure 3 is a schematic framework diagram of an embodiment of the singing voice synthesis training method of this application, and Figure 4 is a schematic flowchart of an embodiment of step S200 of this application. Step S200 further includes the following sub-steps:
S210: Input the score features into the singing voice synthesis model in sequence.
The score features extracted from the score data in step S100, that is, the HMM sequence, the spectral feature sequence, and so on generated from the lyrics, note pitch sequence, note duration sequence, and other information in the score data, are input into the singing voice synthesis model.
Optionally, with reference to Figure 3, the singing voice synthesis system in this application mainly includes two models: a singing voice synthesis model and a singing voice discriminant model. The singing voice synthesis model can be built from a causal convolutional network (WaveNet). With reference to Figure 5, which is a schematic diagram of the causal convolutional network structure of this application, the network specifically includes an input layer (Input), multiple hidden layers (Hidden Layer), and an output layer (Output). WaveNet is an autoregressive deep generative model that models directly at the speech waveform level and factorizes the joint probability of a waveform sequence into a product of conditional probabilities.
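A minimal numpy sketch of the causal, dilated convolution idea that Figure 5 depicts follows: each output sample depends only on the current and past input samples, and the dilation doubles per layer as in WaveNet-style stacks. The layer count, kernel size, and random weights are illustrative assumptions, not values from the publication.

import numpy as np

def causal_conv1d(x, weights, dilation):
    # x: (T,), weights: (kernel,); left-pad so output t sees only x[<= t].
    k = len(weights)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([sum(weights[j] * xp[t + pad - j * dilation]
                         for j in range(k)) for t in range(len(x))])

def wavenet_like_stack(x, n_layers=4, kernel=2, rng=np.random.default_rng(0)):
    h = x
    for layer in range(n_layers):
        w = rng.normal(scale=0.5, size=kernel)
        h = np.tanh(causal_conv1d(h, w, dilation=2 ** layer))  # dilation doubles
    return h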
Optionally, the singing voice discriminant model in this application can also be built with a causal convolutional network. In a specific embodiment, the acoustic feature parameters of a piece of audio data are input into the singing voice discriminant model frame by frame, and the model outputs a naturalness score for that piece of audio data. A higher score indicates a better synthesis result from the singing voice synthesis model, or indicates that the audio data represented by those acoustic parameters was recorded by a singer.
S220: Acquire the second acoustic feature parameters of each frame according to the score features.
Optionally, after the score feature parameters are sequentially input into the singing voice synthesis model, the second acoustic feature parameters corresponding to each frame are obtained. The second acoustic feature parameters are of the same kind as the first acoustic feature parameters of the singing voice data, and the data processing is similar; see the acquisition process of the first acoustic feature parameters for details, which will not be repeated here. That is, the second acoustic feature parameters in this embodiment also include the spectral line energy, the fundamental frequency features, and the Mel-frequency cepstral coefficients.
S230: Input the second acoustic feature parameters into a speech synthesis vocoder to generate the synthesized singing voice.
Optionally, a speech synthesis vocoder is used to convert the second acoustic feature parameters into the synthesized singing voice.
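One concrete way to realize this vocoder step is the WORLD vocoder via the pyworld package, sketched below as an analysis-synthesis round trip. The publication does not name a specific vocoder, so this pairing (and the hypothetical file names) is an assumption; in the trained system the F0 trajectory, spectral envelope, and aperiodicity would come from the synthesis model rather than from analysis of a recording.

import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("recorded.wav")            # hypothetical input file
x = x.astype(np.float64)
f0, t = pw.dio(x, fs)                      # fundamental-frequency trajectory
sp = pw.cheaptrick(x, f0, t, fs)           # spectral envelope
ap = pw.d4c(x, f0, t, fs)                  # aperiodicity
y = pw.synthesize(f0, sp, ap, fs)          # waveform from acoustic parameters
sf.write("synthesized.wav", y, fs)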
S300: Determine whether the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model.
Optionally, with reference to Figure 6, which is a schematic flowchart of an embodiment of step S300 of this application, step S300 further includes the following sub-steps:
S310: Input the first acoustic feature parameters and the second acoustic feature parameters into the singing voice evaluation model respectively.
Specifically, the singing voice evaluation model generally scores the naturalness of a piece of synthesized or recorded singing. Mathematically it is a classification model that tries to separate the synthesized singing from the recorded singing as well as possible.
S320: Distinguish the second acoustic feature parameters from the first acoustic feature parameters through a binary classification algorithm, so as to obtain the respective scores of the first and second acoustic feature parameters in the singing voice evaluation model.
Furthermore, the singing voice evaluation model distinguishes the second acoustic feature parameters from the first acoustic feature parameters through a binary classification algorithm. Binary classification algorithms that can be used in this application include decision trees, random forests, naive Bayes, logistic regression, and so on, without specific limitation here.
Optionally, in this application the first acoustic feature parameters are not the acoustic feature parameters of synthesized singing but those of recorded singing, and they serve as the standard against which the second acoustic feature parameters are judged.
Optionally, the singing voice evaluation model scores the first and second acoustic feature parameters respectively to obtain their scores. The score of an acoustic feature parameter in the singing voice evaluation model represents the naturalness and human-likeness of the singing, thereby distinguishing machine-synthesized singing from singing recorded by a singer. Human-likeness here denotes the similarity between the synthesized singing represented by the second acoustic parameters and the recorded singing data represented by the first acoustic feature parameters. Naturally, in specific embodiments, the higher the score, the more natural the singing.
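As a sketch of this two-class scoring step, the following uses logistic regression, one of the algorithm options listed above: frame-level acoustic features are labeled 1 for recorded singing and 0 for synthesized singing, and the predicted class probability then serves as the naturalness score. The feature arrays here are hypothetical random placeholders standing in for real first and second acoustic feature parameters.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
recorded_feats = rng.normal(0.0, 1.0, size=(500, 13))  # first acoustic params
synth_feats = rng.normal(0.5, 1.0, size=(500, 13))     # second acoustic params
X = np.vstack([recorded_feats, synth_feats])
y = np.concatenate([np.ones(500), np.zeros(500)])

clf = LogisticRegression(max_iter=1000).fit(X, y)
score_recorded = clf.predict_proba(recorded_feats)[:, 1].mean()
score_synth = clf.predict_proba(synth_feats)[:, 1].mean()
print(score_recorded, score_synth)  # higher score = judged more natural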
S330: Compare the scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model.
Optionally, this application takes the score of the first acoustic feature parameters as the reference baseline and compares the score of the second acoustic feature parameters against it. If the score of the second acoustic feature parameters is lower than that of the first, the naturalness and human-likeness of the synthesized singing are worse than those of the recorded singing data, and the first model parameters of the singing voice synthesis model must be optimized, i.e., the method proceeds to step S400. Conversely, if the score of the second acoustic feature parameters is greater than or equal to that of the first, the singing synthesized by the singing voice synthesis model is highly natural and human-like, indicating that the second acoustic feature parameters are barely distinguishable from the first; in this case the parameters of the singing voice synthesis model need not be adjusted, and the method proceeds to step S500 and ends.
S400: Optimize the first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice, until the score of the synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to the score of the first acoustic feature parameters in the optimized singing voice evaluation model.
With reference to Figure 7, which is a schematic flowchart of an embodiment of step S400 of this application, step S400 further includes the following sub-steps:
S410: Obtain the deviation of the singing voice synthesis model according to the score of the second acoustic feature parameters.
Obtain the deviation of the singing voice synthesis model as

    L(θ) = MSE(y, ŷ) − D(ŷ)

where MSE(y, ŷ) is the mean square error between the acoustic parameters of the recorded singing voice data y and the synthesized singing voice ŷ, and D(ŷ) reflects the score given by the singing voice evaluation model to the synthesized singing voice.
S420: Optimize the parameters of the singing voice synthesis model to reduce the deviation of the singing voice synthesis model.
In this application, a gradient descent algorithm can be used to optimize the model, so that the value of the above expression keeps decreasing and the deviation of the synthesis model is reduced.
With reference to Figure 8, which is a schematic flowchart of an embodiment of step S420 of this application, step S420 further includes the following sub-steps:
Before obtaining the gradient descent distances of the first model parameters, the singing voice synthesis model must first be fitted. Specifically, the hypothesis function is known to be

    h_θ(x) = θ_0·x_0 + θ_1·x_1 + … + θ_n·x_n

and the loss function for the parameter optimization of the singing voice synthesis model is determined to be

    J(θ) = (1/2m) · Σ_{j=1}^{m} (h_θ(x^(j)) − y^(j))²

The choice of hypothesis function and loss function is related to the regression method adopted by the gradient descent algorithm applied during optimization; when linear regression is applied, the hypothesis function and loss function are as shown above, i.e., different regression methods yield different hypothesis and loss functions. The curve corresponding to the hypothesis function differs from the actual distribution of the data, so the model cannot fit it exactly (the fitting process is called regression); the loss function is therefore needed to compensate, making the difference between the estimates produced by the model and the actual values as small as possible.
Here θ_i (i = 0, 1, 2, …, n) are the first model parameters, and x_i (i = 0, 1, 2, …, n) are the n features of a sample.
S421: Obtain the gradient descent distance of each first model parameter.
Specifically, the gradient descent distance is obtained by multiplying the step size α by the gradient of the loss function:

    l_i = α · ∂J(θ)/∂θ_i

where ∂J(θ)/∂θ_i is the gradient expression of the loss function.
S422: Determine whether the gradient descent distances of all the first model parameters are less than the termination distance.
In step S422 it is determined whether the gradient descent distances l_i of all the first model parameters θ_i are less than the termination distance ε. If the determination is yes, the method proceeds to step S423; otherwise, it proceeds to step S424.
S423: Stop optimizing the first model parameters of the singing voice synthesis model.
When the gradient descent distances l_i of all the first model parameters θ_i are less than the termination distance ε, the optimization of the first model parameters of the singing voice synthesis model stops.
S424: Update the first model parameters.
Optionally, following the idea of the gradient descent algorithm, the first model parameter θ_i takes its peak value when i equals 1. The value of θ_i gradually decreases as the number of gradient descent steps increases, and the gradient descent distance l_i also gradually decreases, so it approaches the termination distance ε ever more closely.
The first model parameters θ_i are updated according to the formula

    θ_i = θ_i − α · ∂J(θ)/∂θ_i

where the equals sign in the formula should be understood as an assignment.
Optimizing the first model parameters θ_i by gradient descent in this way facilitates the fitting of the singing voice synthesis model and prevents θ_i from overshooting the optimal value, which would cause the singing voice synthesis model to overfit.
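The loop below is a compact sketch of steps S421-S424 under the linear-regression assumption above: compute each parameter's gradient descent distance l_i = α·∂J/∂θ_i, stop when every l_i is below the termination distance ε, and otherwise update θ_i by assignment. The data, step size α, and ε are illustrative.

import numpy as np

def optimize_first_model_params(X, y, alpha=0.01, eps=1e-6, max_steps=100000):
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])       # x_0 = 1 for the bias theta_0
    theta = np.zeros(n + 1)
    for _ in range(max_steps):
        residual = Xb @ theta - y              # h_theta(x) - y per sample
        grad = Xb.T @ residual / m             # dJ/dtheta_i
        step = alpha * grad                    # gradient-descent distances l_i
        if np.all(np.abs(step) < eps):         # S422/S423: all l_i < epsilon
            break
        theta -= step                          # S424: theta_i := theta_i - l_i
    return theta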
Optionally, the training method of the singing voice synthesis model in this application may further include optimizing the second model parameters of the singing voice evaluation model according to the score of the synthesized singing voice.
Specifically, the singing voice evaluation model in this application also uses a causal convolutional neural network, and the process of optimizing its second model parameters can follow the process of optimizing the first model parameters of the singing voice synthesis model. It should be noted that the singing voice evaluation model performs gradient ascent on its loss, so as to widen the gap between the second acoustic parameters and the first acoustic feature parameters as much as possible, i.e., to make the score of the recorded singing data high and the score of the synthesized singing low. The singing voice synthesis model and the singing voice evaluation model compete with each other during training in an adversarial fashion, continuously improving the synthesis quality of the singing voice synthesis model.
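A hedged sketch of this alternating adversarial training follows, assuming PyTorch models for the two networks: the evaluation model (discriminator) ascends the gap between recorded and synthesized scores, implemented as descent on the negated objective, while the synthesis model (generator) descends its MSE plus a term pushing its output's score upward. The model definitions, optimizers, and data are placeholders, and the exact loss combination is an assumption rather than the publication's formula.

import torch

def train_step(gen, disc, score_feats, recorded, g_opt, d_opt):
    synth = gen(score_feats)
    # Evaluation model: gradient *ascent* on the score gap (maximize separation).
    d_opt.zero_grad()
    d_gap = disc(recorded).mean() - disc(synth.detach()).mean()
    (-d_gap).backward()
    d_opt.step()
    # Synthesis model: descend MSE plus a term raising its output's score.
    g_opt.zero_grad()
    g_loss = torch.mean((synth - recorded) ** 2) - disc(synth).mean()
    g_loss.backward()
    g_opt.step()
    return g_loss.item(), d_gap.item()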
In the above embodiment, the score features are input into the singing voice synthesis model to synthesize singing; the score of the synthesized singing in the singing voice evaluation model is then obtained and compared with the score of the recorded singing. If the score of the synthesized singing is lower than that of the recorded singing, the parameters of the singing voice synthesis model are optimized and adjusted according to the score of the synthesized singing, so that the score of the singing synthesized by the model becomes, as far as possible, greater than or equal to the score of the recorded singing in the singing voice evaluation model, continuously improving the synthesis quality of the singing voice synthesis model. Using this singing voice synthesis model, the lyrics, note pitch sequence, and note duration sequence contained in the score data are input for singing synthesis; the singing synthesized by the model has high naturalness, can simulate a natural and smooth singing effect, and has a high degree of human-likeness, with phrasing and transitions that better match the characteristics of the human voice.
Referring to Figure 9, which is a schematic block diagram of a first embodiment of the training device for a singing voice synthesis model provided by this application, the training device in this embodiment includes a processing module 31, a generating module 32, a judging module 33, and an optimizing module 34.
The processing module 31 is used to preprocess the score data and the singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data.
Specifically, the processing module 31 is configured to pre-emphasize the singing voice data so that the signal spectrum of the singing voice data is flattened; divide the pre-emphasized singing voice data into an integer number of frames; and acquire the spectrum information of each frame of singing voice data to obtain the first acoustic feature parameters of each frame of singing voice data.
The generating module 32 is used to input the score features into the singing voice synthesis model to generate the synthesized singing voice.
Specifically, the generating module 32 is configured to input the score features into the singing voice synthesis model in sequence; acquire the second acoustic feature parameters of each frame according to the score features; and input the second acoustic feature parameters into a speech synthesis vocoder to generate the synthesized singing voice.
The judging module 33 is used to determine whether the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model.
Specifically, the judging module 33 is configured to input the first and second acoustic feature parameters into the singing voice evaluation model respectively; distinguish the second acoustic feature parameters from the first through a binary classification algorithm to obtain the respective scores of the first and second acoustic feature parameters in the singing voice evaluation model; and compare those scores.
The optimizing module 34 is used to optimize the first model parameters of the singing voice synthesis model according to the score when it is determined that the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model.
Specifically, the optimizing module 34 is configured to obtain the deviation of the singing voice synthesis model according to the score of the second acoustic feature parameters, and to optimize the parameters of the singing voice synthesis model to reduce that deviation, including: obtaining the gradient descent distances of the first model parameters; determining whether the gradient descent distances of all the first model parameters are less than the termination distance; if so, stopping the optimization of the first model parameters of the singing voice synthesis model; if not, updating the first model parameters.
It can be understood that the processing module 31, generating module 32, judging module 33, and optimizing module 34 in this embodiment correspond to steps S100 to S400 of the first embodiment above; for details, refer to the related descriptions of steps S100 to S400 in the first embodiment, which will not be repeated here.
Referring to Figure 10, which is a schematic block diagram of an embodiment of the computer apparatus provided by this application, the computer apparatus in this embodiment includes a processor 51 and a memory 52. The memory 52 stores computer-readable instructions, and the processor 51 executes the computer-readable instructions in operation to implement the training method of the singing voice synthesis model in any of the above embodiments.
Specifically, the processor 51 is used to preprocess the score data and singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data; input the score features into the singing voice synthesis model to generate a synthesized singing voice; determine whether the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model; and if so, optimize the first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice, until the score of the synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to the score of the first acoustic feature parameters in the optimized singing voice evaluation model.
For the specific execution by the processor 51 in this embodiment, refer to the related descriptions in the above embodiments of the training method of the singing voice synthesis model, which will not be repeated here.
The processor 51 controls the operation of the mobile terminal, and the processor 51 may also be called a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal-processing capability. The processor 51 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, but is not limited thereto.
Referring to Figure 11, which is a schematic structural diagram of an embodiment of the computer-readable non-volatile storage medium of this application: the computer-readable non-volatile storage medium of this application stores computer-readable instructions 21 capable of implementing all of the above methods. The computer-readable instructions 21 may be stored in the above storage device in the form of a software product and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage devices include: USB flash drives, removable hard disks, read-only memory (ROM), magnetic disks, optical disks, and other media that can store computer-readable instruction code, as well as terminal devices such as computers, servers, mobile phones, and tablets.
In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
In summary, those skilled in the art will readily understand that this application provides a training method and device for a singing voice synthesis model, a computer apparatus, and a non-volatile storage medium. Singing is synthesized by inputting score features into the singing voice synthesis model; the score of the synthesized singing in the singing voice evaluation model is then obtained and compared with the score of the recorded singing. If the score of the synthesized singing is lower than that of the recorded singing, the parameters of the singing voice synthesis model are optimized and adjusted according to the score of the synthesized singing, so that the score of the singing synthesized by the model becomes greater than or equal to the score of the recorded singing in the singing voice evaluation model, continuously improving the synthesis quality of the singing voice synthesis model and making the synthesized singing more realistic.
The above are merely embodiments of this application and do not thereby limit its patent scope. Any equivalent structural or flow transformation made using the contents of the specification and drawings of this application, whether applied directly or indirectly in other related technical fields, is likewise included within the patent protection scope of this application.

Claims (20)

  1. A training method of a singing voice synthesis model, wherein the training method of the singing voice synthesis model comprises:
    preprocessing score data and recorded singing voice data to extract score features from the score data and first acoustic feature parameters from the singing voice data;
    inputting the score features into a singing voice synthesis model to generate a synthesized singing voice;
    determining whether a score of the synthesized singing voice in a singing voice evaluation model is lower than a score of the first acoustic feature parameters in the singing voice evaluation model;
    if so, optimizing first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice, until the score of the synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to the score of the first acoustic feature parameters in the optimized singing voice evaluation model.
  2. The training method of the singing voice synthesis model according to claim 1, wherein the preprocessing of the score data and the recorded singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data comprises:
    pre-emphasizing the singing voice data so that the signal spectrum of the singing voice data is flattened;
    dividing the pre-emphasized singing voice data into an integer number of frames;
    acquiring spectrum information of each frame of the singing voice data to obtain the first acoustic feature parameters of each frame of the singing voice data.
  3. The training method of the singing voice synthesis model according to claim 1, wherein the inputting of the score features into the singing voice synthesis model to generate the synthesized singing voice comprises:
    inputting the score features into the singing voice synthesis model in sequence;
    acquiring second acoustic feature parameters of each frame according to the score features;
    inputting the second acoustic feature parameters into a speech synthesis vocoder to generate the synthesized singing voice.
  4. The training method of the singing voice synthesis model according to claim 3, wherein the determining of whether the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model comprises:
    inputting the first acoustic feature parameters and the second acoustic feature parameters into the singing voice evaluation model respectively;
    distinguishing the second acoustic feature parameters from the first acoustic feature parameters through a binary classification algorithm to obtain the respective scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model;
    comparing the scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model.
  5. The training method of the singing voice synthesis model according to claim 3, wherein the optimizing of the first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice comprises:
    obtaining a deviation of the singing voice synthesis model according to the score of the second acoustic feature parameters;
    optimizing the parameters of the singing voice synthesis model to reduce the deviation of the singing voice synthesis model.
  6. The training method of the singing voice synthesis model according to claim 5, wherein the optimizing of the parameters of the singing voice synthesis model to reduce the deviation of the singing voice synthesis model comprises:
    obtaining gradient descent distances of the first model parameters;
    determining whether the gradient descent distances of all the first model parameters are less than a termination distance;
    if so, stopping the optimization of the first model parameters of the singing voice synthesis model;
    if not, updating the first model parameters.
  7. The training method of the singing voice synthesis model according to claim 1, wherein the training method of the singing voice synthesis model further comprises:
    optimizing second model parameters of the singing voice evaluation model according to the score of the synthesized singing voice.
  8. A training device for a singing voice synthesis model, wherein the training device for the singing voice synthesis model comprises:
    a processing module for preprocessing score data and recorded singing voice data to extract score features from the score data and first acoustic feature parameters from the singing voice data;
    a generating module for inputting the score features into a singing voice synthesis model to generate a synthesized singing voice;
    a judging module for determining whether a score of the synthesized singing voice in a singing voice evaluation model is lower than a score of the first acoustic feature parameters in the singing voice evaluation model;
    an optimizing module for optimizing, when it is determined that the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model, first model parameters of the singing voice synthesis model according to the score, until the score of the synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to the score of the first acoustic feature parameters in the optimized singing voice evaluation model.
  9. The training device for a singing voice synthesis model according to claim 8, wherein the processing module comprises:
    a preprocessing sub-module for pre-emphasizing the singing voice data so that the signal spectrum of the singing voice data is flattened;
    a segmentation sub-module for dividing the pre-emphasized singing voice data into an integer number of frames;
    a first feature parameter acquisition sub-module for acquiring spectrum information of each frame of the singing voice data to obtain the first acoustic feature parameters of each frame of the singing voice data.
  10. The training device for a singing voice synthesis model according to claim 8, wherein the generating module specifically comprises:
    an input sub-module for inputting the score features into the singing voice synthesis model in sequence;
    a second feature acquisition sub-module for acquiring second acoustic feature parameters of each frame according to the score features;
    a synthesis sub-module for inputting the second acoustic feature parameters into a speech synthesis vocoder to generate the synthesized singing voice.
  11. A computer apparatus, wherein the computer apparatus comprises a processor, a memory, and computer-readable instructions stored in the memory and executable on the processor, the processor executing the computer-readable instructions to implement the steps of a training method of a singing voice synthesis model as follows:
    preprocessing score data and recorded singing voice data to extract score features from the score data and first acoustic feature parameters from the singing voice data;
    inputting the score features into a singing voice synthesis model to generate a synthesized singing voice;
    determining whether a score of the synthesized singing voice in a singing voice evaluation model is lower than a score of the first acoustic feature parameters in the singing voice evaluation model;
    if so, optimizing first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice, until the score of the synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to the score of the first acoustic feature parameters in the optimized singing voice evaluation model.
  12. The computer apparatus according to claim 11, wherein the preprocessing of the score data and the singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data comprises:
    pre-emphasizing the singing voice data so that the signal spectrum of the singing voice data is flattened;
    dividing the pre-emphasized singing voice data into an integer number of frames;
    acquiring spectrum information of each frame of the singing voice data to obtain the first acoustic feature parameters of each frame of the singing voice data.
  13. The computer apparatus according to claim 11, wherein the inputting of the score features into the singing voice synthesis model to generate the synthesized singing voice comprises:
    inputting the score features into the singing voice synthesis model in sequence;
    acquiring second acoustic feature parameters of each frame according to the score features;
    inputting the second acoustic feature parameters into a speech synthesis vocoder to generate the synthesized singing voice.
  14. The computer apparatus according to claim 13, wherein the determining of whether the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model comprises:
    inputting the first acoustic feature parameters and the second acoustic feature parameters into the singing voice evaluation model respectively;
    distinguishing the second acoustic feature parameters from the first acoustic feature parameters through a binary classification algorithm to obtain the respective scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model;
    comparing the scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model.
  15. The computer apparatus according to claim 13, wherein the optimizing of the first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice comprises:
    obtaining a deviation of the singing voice synthesis model according to the score of the second acoustic feature parameters;
    optimizing the parameters of the singing voice synthesis model to reduce the deviation of the singing voice synthesis model.
  16. One or more non-volatile readable storage media storing computer-readable instructions, wherein the computer-readable instructions are executed by one or more processors to implement the steps of a training method of a singing voice synthesis model as follows:
    preprocessing score data and recorded singing voice data to extract score features from the score data and first acoustic feature parameters from the singing voice data;
    inputting the score features into a singing voice synthesis model to generate a synthesized singing voice;
    determining whether a score of the synthesized singing voice in a singing voice evaluation model is lower than a score of the first acoustic feature parameters in the singing voice evaluation model;
    if so, optimizing first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice, until the score of the synthesized singing voice generated by the optimized singing voice synthesis model is greater than or equal to the score of the first acoustic feature parameters in the optimized singing voice evaluation model.
  17. The non-volatile readable storage medium according to claim 16, wherein the preprocessing of the score data and the singing voice data to extract the score features from the score data and the first acoustic feature parameters from the singing voice data comprises:
    pre-emphasizing the singing voice data so that the signal spectrum of the singing voice data is flattened;
    dividing the pre-emphasized singing voice data into an integer number of frames;
    acquiring spectrum information of each frame of the singing voice data to obtain the first acoustic feature parameters of each frame of the singing voice data.
  18. The non-volatile readable storage medium according to claim 16, wherein the inputting of the score features into the singing voice synthesis model to generate the synthesized singing voice comprises:
    inputting the score features into the singing voice synthesis model in sequence;
    acquiring second acoustic feature parameters of each frame according to the score features;
    inputting the second acoustic feature parameters into a speech synthesis vocoder to generate the synthesized singing voice.
  19. The non-volatile readable storage medium according to claim 18, wherein the determining of whether the score of the synthesized singing voice in the singing voice evaluation model is lower than the score of the first acoustic feature parameters in the singing voice evaluation model comprises:
    inputting the first acoustic feature parameters and the second acoustic feature parameters into the singing voice evaluation model respectively;
    distinguishing the second acoustic feature parameters from the first acoustic feature parameters through a binary classification algorithm to obtain the respective scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model;
    comparing the scores of the first acoustic feature parameters and the second acoustic feature parameters in the singing voice evaluation model.
  20. The non-volatile readable storage medium according to claim 18, wherein the optimizing of the first model parameters of the singing voice synthesis model according to the score of the synthesized singing voice comprises:
    obtaining a deviation of the singing voice synthesis model according to the score of the second acoustic feature parameters;
    optimizing the parameters of the singing voice synthesis model to reduce the deviation of the singing voice synthesis model.
PCT/CN2019/103426 2019-06-11 2019-08-29 Method and device for training singing voice synthesis model, computer apparatus, and storage medium WO2020248388A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910500699.0 2019-06-11
CN201910500699.0A CN110364140B (en) 2019-06-11 2019-06-11 Singing voice synthesis model training method, singing voice synthesis model training device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2020248388A1 true WO2020248388A1 (en) 2020-12-17

Family

ID=68217115

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103426 WO2020248388A1 (en) 2019-06-11 2019-08-29 Method and device for training singing voice synthesis model, computer apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN110364140B (en)
WO (1) WO2020248388A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210158706A (en) 2020-06-24 2021-12-31 현대자동차주식회사 Vehicle and control method for the same
KR20220000654A (en) 2020-06-26 2022-01-04 현대자동차주식회사 Vehicle and control method for the same
KR20220000655A (en) 2020-06-26 2022-01-04 현대자동차주식회사 Driving sound library, apparatus for generating driving sound library and vehicle comprising driving sound library
CN112185343B (en) * 2020-09-24 2022-07-22 长春迪声软件有限公司 Method and device for synthesizing singing voice and audio
CN112466313B (en) * 2020-11-27 2022-03-15 四川长虹电器股份有限公司 Method and device for synthesizing singing voices of multiple singers
CN112786013A (en) * 2021-01-11 2021-05-11 北京有竹居网络技术有限公司 Voice synthesis method and device based on album, readable medium and electronic equipment
CN112951200B (en) * 2021-01-28 2024-03-12 北京达佳互联信息技术有限公司 Training method and device for speech synthesis model, computer equipment and storage medium
CN113053355A (en) * 2021-03-17 2021-06-29 平安科技(深圳)有限公司 Fole human voice synthesis method, device, equipment and storage medium
CN113569196A (en) * 2021-07-15 2021-10-29 苏州仰思坪半导体有限公司 Data processing method, device, medium and equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053814A (en) * 2017-11-06 2018-05-18 芋头科技(杭州)有限公司 A kind of speech synthesis system and method for analog subscriber song
CN109785823A (en) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 Phoneme synthesizing method and system
CN109817197A (en) * 2019-03-04 2019-05-28 天翼爱音乐文化科技有限公司 Song generation method, device, computer equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008015214A (en) * 2006-07-06 2008-01-24 Dds:Kk Singing skill evaluation method and karaoke machine
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN109817191B (en) * 2019-01-04 2023-06-06 平安科技(深圳)有限公司 Tremolo modeling method, device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053814A (en) * 2017-11-06 2018-05-18 芋头科技(杭州)有限公司 A kind of speech synthesis system and method for analog subscriber song
CN109785823A (en) * 2019-01-22 2019-05-21 中财颐和科技发展(北京)有限公司 Phoneme synthesizing method and system
CN109817197A (en) * 2019-03-04 2019-05-28 天翼爱音乐文化科技有限公司 Song generation method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN110364140A (en) 2019-10-22
CN110364140B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
WO2020248388A1 (en) Method and device for training singing voice synthesis model, computer apparatus, and storage medium
Venkataramanan et al. Emotion recognition from speech
US11322155B2 (en) Method and apparatus for establishing voiceprint model, computer device, and storage medium
US10176811B2 (en) Neural network-based voiceprint information extraction method and apparatus
CN105023573B (en) It is detected using speech syllable/vowel/phone boundary of auditory attention clue
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN111433847B (en) Voice conversion method, training method, intelligent device and storage medium
CN110021308A (en) Voice mood recognition methods, device, computer equipment and storage medium
CN108831437A (en) A kind of song generation method, device, terminal and storage medium
CN101178896A (en) Unit selection voice synthetic method based on acoustics statistical model
CN108962231B (en) Voice classification method, device, server and storage medium
CN113539240B (en) Animation generation method, device, electronic equipment and storage medium
CN109036437A (en) Accents recognition method, apparatus, computer installation and computer readable storage medium
US20210073611A1 (en) Dynamic data structures for data-driven modeling
US10452996B2 (en) Generating dynamically controllable composite data structures from a plurality of data segments
CN108766409A (en) A kind of opera synthetic method, device and computer readable storage medium
CN110970036A (en) Voiceprint recognition method and device, computer storage medium and electronic equipment
JP7124373B2 (en) LEARNING DEVICE, SOUND GENERATOR, METHOD AND PROGRAM
CN114999441A (en) Avatar generation method, apparatus, device, storage medium, and program product
Xue et al. Cross-modal information fusion for voice spoofing detection
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
Hu et al. Generating synthetic dysarthric speech to overcome dysarthria acoustic data scarcity
Reimao Synthetic speech detection using deep neural networks
US20230368777A1 (en) Method And Apparatus For Processing Audio, Electronic Device And Storage Medium
Chit et al. Myanmar continuous speech recognition system using convolutional neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19933144

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19933144

Country of ref document: EP

Kind code of ref document: A1