WO2019218773A1 - Voice synthesis method and device, storage medium, and electronic device

Voice synthesis method and device, storage medium, and electronic device

Info

Publication number
WO2019218773A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
parameter
prosody
emotional
conversion rule
Prior art date
Application number
PCT/CN2019/079582
Other languages
French (fr)
Chinese (zh)
Inventor
解俊
朱杰
汤梦
李斌
Original Assignee
中兴通讯股份有限公司 (ZTE Corporation)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2019218773A1 publication Critical patent/WO2019218773A1/en

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present application relates to the field of communications, and in particular, to a method and device for synthesizing speech, a storage medium, and an electronic device.
  • speech synthesis technology is used very widely, but conventional speech synthesis can generally produce only neutral speech (that is, speech without emotion). It is therefore poorly suited to occasions that call for emotion, such as reading poetry or novels aloud, and to occasions that require voice reminders, such as the voice prompts of a mobile terminal or in-car driving prompts. If emotional speech is used instead, the interaction is closer to human communication and feels more approachable.
  • the emotional speech synthesis method in the related art usually splices emotional speech segments together. This method requires a speech database of the desired emotion; emotional segments are then spliced to form emotional speech. Specifically, a large emotional speech library is needed, with a separate library for each emotion. Synthesis follows existing prosody rules, after which an algorithm adjusts the prosody parameters of the emotional speech, and the speech units are waveform-spliced into the corresponding emotional sentence. In the related art, the speech parameters are determined entirely by manual tuning.
  • the embodiment of the present application provides a method and device for synthesizing voice, a storage medium, and an electronic device.
  • a method for synthesizing speech includes: acquiring an emotional feature parameter of a first speech; converting the emotional feature parameter into a prosody parameter according to a conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and synthesizing a second speech according to the prosody parameter and the first speech.
  • a voice synthesizing apparatus includes: an acquisition module configured to acquire an emotional feature parameter of a first voice; a conversion module configured to convert the emotional feature parameter into a prosody parameter according to a conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and a synthesis module configured to synthesize a second speech according to the prosody parameter and the first speech.
  • a storage medium having stored therein a computer program, wherein the computer program is configured to execute the steps of any one of the method embodiments described above.
  • an electronic device comprising a memory and a processor, wherein the memory stores a computer program, the processor being configured to run the computer program to perform the steps in any one of the above method embodiments.
  • the emotional feature parameter is converted into a prosody parameter using the conversion rule, and a second voice is synthesized that carries a sense of prosody when played. Emotion is thus carried in the synthesized voice, the technical problem that synthesizing emotional speech is overly complex is solved, the emotional speech synthesis system is simplified, and the synthesis efficiency of emotional speech is improved.
  • FIG. 1 is a block diagram showing the hardware structure of a mobile terminal for synthesizing a voice according to an embodiment of the present application
  • FIG. 2 is a flowchart of a method of synthesizing speech according to an embodiment of the present application
  • FIG. 3 is a structural block diagram of a voice synthesizing apparatus according to an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an emotional voice conversion system based on the PSOLA method in the embodiment of the present application
  • FIG. 5 is a schematic diagram showing a probability density distribution of a smooth approximation of an arbitrary shape in the embodiment
  • FIG. 6 is a schematic diagram of fitting a certain distribution with a weighted sum of three Gaussian probability density functions in this embodiment.
  • FIG. 7 is a flowchart of determining a prosodic feature parameter based on a GMM according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram of distribution of fundamental frequency average data according to an embodiment of the present application.
  • FIG. 9 is a diagram showing a fundamental frequency mean probability density distribution of an embodiment of the present application.
  • FIG. 1 is a hardware structural block diagram of a mobile terminal for synthesizing a voice according to an embodiment of the present application.
  • the mobile terminal can include one or more processors 102 (only one is shown in FIG. 1; the processor 102 can include, but is not limited to, a processing device such as a MicroController Unit (MCU) or a Field-Programmable Gate Array (FPGA)) and a memory 104 for storing data; optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108.
  • the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the above mobile terminal; for example, the mobile terminal may also include more or fewer components than those shown in FIG. 1, or have a different configuration.
  • the memory 104 can be used to store a computer program, for example, software programs and modules of application software, such as the computer program corresponding to the method of synthesizing speech in the embodiments of the present application; by running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method.
  • Memory 104 may include high speed random access memory, and may also include non-volatile memory such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 104 may further include memory remotely located relative to processor 102, which may be connected to mobile terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission device 106 is configured to receive or send data via a network; specific examples of such a network may include a wireless network provided by the communication operator of the mobile terminal 10.
  • the transmission device 106 includes a Network Interface Controller (NIC) that can be connected to other network devices through a base station to communicate with the Internet.
  • the transmission device 106 can be a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the input/output device 108 may be at least one of a touch screen, a button, a microphone, and an audio output device.
  • FIG. 2 is a flowchart of a method for synthesizing voice according to an embodiment of the present application. As shown in FIG. 2, the process includes the following steps:
  • Step S202: acquire an emotional feature parameter of the first voice;
  • Step S204: convert the emotional feature parameter into a prosody parameter according to the conversion rule, where the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter;
  • Step S206: synthesize the second speech according to the prosody parameter and the first speech.
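As a reading aid, the three steps line up as a small pipeline. The sketch below is illustrative only: the three callables are hypothetical stand-ins for the feature extraction, conversion rule, and resynthesis components described later in this document, not APIs from the patent.

```python
from typing import Callable
import numpy as np

def synthesize_emotional_speech(
    first_voice: np.ndarray,
    sample_rate: int,
    extract_features: Callable[[np.ndarray, int], np.ndarray],          # S202
    conversion_rule: Callable[[np.ndarray], np.ndarray],                # S204
    resynthesize: Callable[[np.ndarray, int, np.ndarray], np.ndarray],  # S206
) -> np.ndarray:
    """Sketch of steps S202-S206; the three callables are hypothetical
    stand-ins for components described later in this document."""
    emotional_features = extract_features(first_voice, sample_rate)  # S202
    prosody_params = conversion_rule(emotional_features)             # S204
    return resynthesize(first_voice, sample_rate, prosody_params)    # S206
```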
  • the related art places high demands on the emotional speech library, and the final quality of the synthesized emotional speech is affected both by the quality of the library and by the quality of the splicing.
  • another drawback is that one emotional speech library can synthesize only one emotion, whereas real life involves many emotions; if every emotion requires a corresponding emotional speech library, the system becomes very complicated and ill-suited to terminal products such as mobile phones.
  • through the above steps, the emotional feature parameter is converted into a prosody parameter using the conversion rule, and a second voice is synthesized that carries a sense of prosody when played; emotion is thus carried in the voice, the related-art technical problem that synthesizing emotional speech is overly complex is solved, the emotional speech synthesis system is simplified, and the synthesis efficiency of emotional speech is improved.
  • the execution body of the foregoing steps in this embodiment may be a terminal, a voice synthesis platform, a processor, or the like, but is not limited thereto.
  • before the emotional feature parameter is converted into the prosody parameter according to the conversion rule, the method further includes one of the following: training the conversion rule; or presetting the conversion rule.
  • presetting means configuring the conversion rule in advance; for example, it may be purchased from a supplier or obtained from another device, which avoids the inconvenience of ad hoc training at the time of use.
  • training the conversion rule includes: setting up a Gaussian Mixture Model (GMM), feeding multiple types of emotional feature parameters and multiple types of prosody parameters into the Gaussian mixture model as labeled data, and training to obtain the conversion rule.
  • training the conversion rule includes: selecting initial values for the Gaussian mixture model, and determining computation parameters from the data distribution of the initial values, where the computation parameters include: weight values, expected values, variance values, and the number of models; and estimating the initial values and the computation parameters with the Expectation-Maximization (EM) algorithm to obtain the maximum likelihood value.
  • in one example, the expression p(x) of the Gaussian mixture model is given by the following formula:

    $$p(x) = \sum_{m=1}^{M} c_m \, N(x;\, \mu_m, \Sigma_m)$$

    where each component is the single Gaussian density

    $$N(x;\, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)$$

    in which c_m is the weight value, μ_m is the expected value, Σ_m is the variance (covariance) value, M is the number of single Gaussian models, x is the emotional feature parameter value, p(x) is the prosody parameter (density) value, d is the dimensionality of x, and the superscript T denotes transposition.
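For concreteness, the mixture density above can be evaluated directly. The sketch below is a generic implementation of the formula, not the patent's code; the parameter values in the toy example are invented.

```python
import numpy as np

def gaussian_pdf(x: np.ndarray, mu: np.ndarray, sigma: np.ndarray) -> float:
    """N(x; mu, Sigma) for a d-dimensional x."""
    d = x.shape[0]
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    expo = -0.5 * diff @ np.linalg.solve(sigma, diff)
    return float(np.exp(expo) / norm)

def gmm_pdf(x, weights, means, covs):
    """p(x) = sum_m c_m * N(x; mu_m, Sigma_m)."""
    return sum(c * gaussian_pdf(x, mu, s)
               for c, mu, s in zip(weights, means, covs))

# Toy 1-D example with M = 2 components (illustrative values only).
weights = [0.6, 0.4]
means = [np.array([100.0]), np.array([160.0])]      # e.g., f0 means in Hz
covs = [np.array([[25.0]]), np.array([[100.0]])]
print(gmm_pdf(np.array([120.0]), weights, means, covs))
```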
  • synthesizing the second speech according to the prosody parameter and the first speech includes one of the following: synthesizing the second speech on a Text To Speech (TTS) platform according to the prosody parameter and the first speech; or synthesizing the second speech from the prosody parameter and the first speech using the Pitch Synchronized Overlap-Add (PSOLA) algorithm.
  • after the second speech is synthesized according to the prosody parameter and the first speech, the method further includes: smoothing the second speech to obtain a third speech, so that the result sounds natural as well as emotional.
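PSOLA itself is a standard time-domain technique. The sketch below shows a heavily simplified pitch-scaling variant: it assumes a constant, known pitch period with uniform pitch marks, whereas real implementations estimate pitch marks per cycle and renormalize amplitude.

```python
import numpy as np

def psola_pitch_scale(signal: np.ndarray, period: int, factor: float) -> np.ndarray:
    """Heavily simplified TD-PSOLA sketch: scale the fundamental frequency
    of `signal` by `factor`, assuming a constant pitch `period` (in samples).
    Amplitude renormalization is omitted."""
    marks = np.arange(period, len(signal) - period, period)  # analysis marks
    hop = max(1, int(round(period / factor)))  # synthesis mark spacing
    window = np.hanning(2 * period)
    out = np.zeros_like(signal)
    t = period
    while t < len(signal) - period:
        src = marks[np.argmin(np.abs(marks - t))]   # nearest analysis grain
        grain = signal[src - period: src + period] * window
        out[t - period: t + period] += grain        # overlap-add
        t += hop
    return out

# Toy usage: raise the pitch of a 100 Hz tone by 20% (duration unchanged).
sr = 16000
period = sr // 100                       # 160 samples per cycle
tone = np.sin(2 * np.pi * 100 * np.arange(sr) / sr)
higher = psola_pitch_scale(tone, period, 1.2)
```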
  • the method according to the above embodiment can be implemented by means of software plus the necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation.
  • the part of the technical solution of the present application that is essential, or that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods described in the various embodiments of the present application.
  • a voice synthesizing device is also provided, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated.
  • the term "module" may refer to a combination of software and/or hardware that implements a predetermined function.
  • although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
  • FIG. 3 is a structural block diagram of a voice synthesizing apparatus according to an embodiment of the present application. As shown in FIG. 3, the apparatus includes:
  • the obtaining module 30 is configured to acquire an emotional feature parameter of the first voice;
  • the conversion module 32 is configured to convert the emotion feature parameter into a prosody parameter according to the conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotion feature parameter and the prosody parameter;
  • the synthesis module 34 is configured to synthesize the second speech according to the prosody parameter and the first speech.
  • the apparatus further includes one of the following: a training module, configured to train the conversion rule before the conversion module converts the emotional feature parameter into the prosody parameter according to the conversion rule; and a setting module, configured to preset the conversion rule before the conversion module converts the emotional feature parameter into the prosody parameter according to the conversion rule.
  • the training module includes a training unit configured to set up a Gaussian mixture model, feed multiple types of emotional feature parameters and multiple types of prosody parameters into the Gaussian mixture model as labeled data, and train to obtain the conversion rule.
  • the training module is configured to select initial values for the Gaussian mixture model and determine computation parameters from the data distribution of the initial values, where the computation parameters include: weight values, expected values, variance values, and the number of models; and to estimate the initial values and the computation parameters with the EM algorithm to obtain the maximum likelihood value.
  • the expression p(x) of the Gaussian mixture model may be expressed by the above formula.
  • the synthesizing module 34 is configured to synthesize the second speech according to the prosody parameter and the first speech on the TTS platform; or use the PSOLA algorithm according to the prosody parameter and the The first speech synthesizes the second speech.
  • the apparatus further includes a smoothing module configured to perform smoothing on the second voice to obtain a third voice.
  • each of the above modules may be implemented by software or hardware; for the latter, this may be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
  • this embodiment provides a method for automatically determining emotional speech feature parameters based on a GMM training algorithm. The solution of the present application is described in detail below in conjunction with a specific implementation.
  • the core of the emotional speech synthesis technique proposed in this embodiment is a prosody parameter adjustment technique: it can "reshape" any neutral speech to meet a specified emotional speech requirement. The method is simple, practical, easy to implement, and does not consume many computing resources. After practical processing, the various key prosody parameters required by this technique can be determined automatically; placing these automatically determined parameters into a TTS synthesis platform lets a neutral TTS produce emotional speech.
  • the technical solution of this embodiment can be grafted directly onto an existing TTS platform, so that a neutral TTS software platform produces the effect of emotional speech, thereby avoiding both the collection of an emotional corpus and the development of an emotional TTS platform.
  • the embodiments of the present application propose a GMM-based method for training and automatically determining prosody parameters: by training on an emotional corpus in advance, the mapping between the key prosody parameters of different emotional speech and of neutral speech is obtained, so the prosody parameters can be determined automatically. This removes the original manual adjustment step, greatly improves adjustment efficiency, and ensures that the generated emotional speech is of good quality.
  • the present application addresses the automatic determination of emotional speech prosody parameters and improves adjustment efficiency.
  • the embodiments of the present application can be applied in an emotional speech generation environment based on the PSOLA synthesis method, to determine key prosody parameter values automatically and to optimize the quality of the emotional speech. Since this embodiment concerns how to generate key parameters automatically, once these parameters are obtained they can in principle be applied to any speech synthesis platform capable of producing neutral TTS: the emotional speech parameters are obtained automatically using the embodiments of the present application and are then placed into the neutral TTS speech platform to produce satisfactory emotional speech.
  • as shown in FIG. 4, the system mainly includes a training phase and a conversion phase; the training phase is used to train and obtain the emotion conversion rule, and the conversion phase performs emotion conversion based on the rule obtained in the training phase.
  • the emotional speech conversion system based on the PSOLA method mainly consists of the following four parts (a sketch of the preprocessing step follows this list):
  • (1) speech signal preprocessing and feature parameter extraction: the speech signal is framed, windowed, pre-emphasized, and endpoint-detected, and the emotional feature parameters and prosody parameters are extracted; as one implementation, feature parameters are extracted from neutral speech and from emotional speech respectively to obtain the emotional feature parameters and prosody parameters, which are then time-aligned to serve as sample data for training the emotion conversion rule;
  • (2) model training: the extracted emotional feature parameters and prosody parameters are modeled, and the conversion rule is obtained by training;
  • (3) feature conversion: the original speech is analyzed and its feature parameters are extracted to obtain the corresponding emotional feature parameters; emotion conversion is performed according to the trained conversion rule to obtain the prosody parameters;
  • (4) speech synthesis and post-processing: the speech signal is reconstructed from the prosody parameters to obtain the synthesized emotional speech, which is smoothed and otherwise processed so that it sounds as natural as possible.
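A sketch of the preprocessing in part (1) is below. The frame sizes, window choice, and energy threshold are common defaults assumed here, not values specified in the patent.

```python
import numpy as np

def preprocess(signal: np.ndarray, sr: int,
               frame_ms: float = 25.0, hop_ms: float = 10.0,
               pre_emphasis: float = 0.97):
    """Sketch of part (1): pre-emphasis, framing, windowing, and a crude
    energy-based endpoint check. Assumes the signal is at least one
    frame long; parameter values are common defaults, not the patent's."""
    # pre-emphasis: y[n] = x[n] - a * x[n - 1]
    y = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(y) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = y[idx] * np.hamming(frame_len)          # framing + windowing

    # crude endpoint detection: keep frames above 5% of the peak energy
    energy = (frames ** 2).sum(axis=1)
    voiced = energy > 0.05 * energy.max()
    return frames, voiced
```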
  • Gaussian density function estimation is a parametric modeling approach.
  • parameter estimation for the GMM means using a given set of speech training data to determine, under some criterion, the model parameters under which the model best describes the probability distribution of that training data.
  • using this model, the prosodic feature parameters of emotional speech can be obtained fairly accurately, so that the emotional expressiveness of the synthesized speech is more natural.
  • the Gaussian Mixture Model is an extension of the single Gaussian probability density function. Because it can smoothly approximate the probability density distribution of arbitrary shapes, it is widely used and has excellent effects.
  • FIG. 5 is a schematic diagram of smoothly approximating a probability density distribution of arbitrary shape in this embodiment; it shows a weighted sum of three Gaussian probability density functions fitted to a certain distribution, where the mean parameter gives the position of each Gaussian, the variance parameter gives its spread, and the weight parameter gives its magnitude, that is, how much of the data falls in that Gaussian.
  • FIG. 6 is a schematic diagram of the resulting fit: the weighted sum of the three Gaussian probability density functions approximates the overall density function of the distribution.
  • the mixture parameters are usually estimated with the EM algorithm: the probability distribution of the samples is assumed to be a Gaussian mixture model (GMM), and the GMM parameters are then learned from many samples. Through repeated iterations, the model parameters gradually approach the target model. This probability distribution model is then used to find the optimal emotional speech parameters corresponding to the neutral speech feature parameters.
  • FIG. 7 is a flowchart of determining prosodic feature parameters based on a GMM according to an embodiment of the present application, as shown in FIG. 7.
  • the basic idea of the EM algorithm is to randomly initialize a parameter set θ(0), use the posterior probability Pr(Y | X, θ(0)) to update the expectation E(Y) of the hidden variable Y, and then re-estimate the model parameters to obtain θ(1); this is iterated until θ becomes stable.
  • the EM algorithm alternates two steps; with θ^(t) denoting the estimate of the parameter θ at the t-th iteration:

    E-step: $Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}}\left[\log L(\theta; X, Z)\right]$

    M-step: $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

    where X is the observed variable, Z is the hidden variable, E denotes expectation, L is the likelihood function (so log L is the log-likelihood), and Q is the auxiliary functional.
  • the maximum likelihood value is approached gradually by iterating these two steps.
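A plain EM loop for a one-dimensional Gaussian mixture illustrates the two steps. This is a generic sketch: the patent's feature vectors would generally be multi-dimensional, and the sample data here is synthetic.

```python
import numpy as np

def fit_gmm_em(x: np.ndarray, M: int = 4, iters: int = 100, seed: int = 0):
    """Plain EM for a 1-D Gaussian mixture (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.full(M, 1.0 / M)                        # weights c_m
    mu = rng.choice(x, size=M)                     # means mu_m
    var = np.full(M, x.var())                      # variances Sigma_m
    for _ in range(iters):
        # E-step: responsibilities r[n, m] = Pr(component m | x_n, theta^(t))
        dens = (w / np.sqrt(2 * np.pi * var)
                * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var))
        r = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate theta to maximize Q(theta | theta^(t))
        Nm = r.sum(axis=0)
        w = Nm / len(x)
        mu = (r * x[:, None]).sum(axis=0) / Nm
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nm
    return w, mu, var

# e.g., fit 4 single Gaussians to synthetic fundamental-frequency means
f0_means = np.concatenate([np.random.default_rng(1).normal(110, 8, 200),
                           np.random.default_rng(2).normal(170, 12, 200)])
weights, means, variances = fit_gmm_em(f0_means)
```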
  • the present application approximates the distribution of the parameter samples with a multi-Gaussian mixture model.
  • several key parameters determine the emotional voice, such as the fundamental frequency and the duration; the technique of this embodiment may be used to determine each such parameter value. The following takes the determination of the fundamental frequency parameter as an example.
  • FIG. 8 is a schematic diagram of the distribution of the fundamental frequency mean data in the embodiment of the present application, where the X direction is the "neutral" voice and the Y direction is the "happy" emotional voice: x represents the value of the "neutral" speech fundamental frequency parameter and y represents the value of the "happy" speech fundamental frequency parameter, together forming "neutral-happy" data pairs.
  • FIG. 9 is a schematic diagram of a fundamental frequency mean probability density distribution of an embodiment of the present application, which is a schematic diagram of a surface fitted by a plurality of Gaussian functions.
  • the determination of the speech prosody feature parameters proceeds as follows: neutral parameter (x coordinate) → through the EM-trained model (the E-step and M-step given above), find the y value with the highest probability for that x value in the model, and output it as the emotional parameter → the emotional speech prosodic feature parameter is obtained.
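One way to realize this mapping, assuming a joint GMM over (neutral, emotional) fundamental-frequency pairs as in FIG. 8 and FIG. 9, is a brute-force search for the conditional mode. The component parameters below are invented for illustration.

```python
import numpy as np

def map_neutral_to_emotional(x0: float, weights, means, covs,
                             y_grid: np.ndarray) -> float:
    """Given a joint GMM over z = (neutral f0 x, emotional f0 y), return
    the y maximizing p(y | x = x0), found by a grid search over p(x0, y),
    which is proportional to p(y | x0)."""
    def joint(z):
        p = 0.0
        for c, mu, S in zip(weights, means, covs):
            diff = z - mu
            norm = 2 * np.pi * np.sqrt(np.linalg.det(S))  # (2*pi)^(d/2), d=2
            p += c * np.exp(-0.5 * diff @ np.linalg.solve(S, diff)) / norm
        return p
    dens = [joint(np.array([x0, y])) for y in y_grid]
    return float(y_grid[int(np.argmax(dens))])

# Toy 2-component joint model of (neutral, happy) f0 means (made-up values).
weights = [0.5, 0.5]
means = [np.array([110.0, 150.0]), np.array([140.0, 195.0])]
covs = [np.diag([60.0, 90.0]), np.diag([70.0, 120.0])]
happy_f0 = map_neutral_to_emotional(120.0, weights, means, covs,
                                    np.linspace(100, 250, 601))
```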
  • multiple groups of sentences were tested at random using the method of the embodiment of the present application; after syllable-synchronization processing, the speech parameters of the synthesized emotional speech were compared with those of real human speech, with a "fundamental frequency distance value" used to represent the gap.
  • new1 is emotional speech synthesized after training on 100 data sets; new2 is emotional speech synthesized after training on 200 data sets with a mixture of 4 single Gaussians, male and female data combined; new3 is emotional speech synthesized after training on 200 data sets each for male and female voices separately, each with 4 single Gaussians.
  • comparing the fundamental frequency distance values shows that increasing the amount of training data, increasing the number of single Gaussians in the mixture, and training the male and female voice data separately all improve the modeling effect.
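The patent does not define the fundamental frequency distance value; the sketch below assumes a simple root-mean-square distance between syllable-synchronized f0 contours.

```python
import numpy as np

def f0_distance(f0_synth: np.ndarray, f0_human: np.ndarray) -> float:
    """Assumed metric: RMS distance (in Hz) between two f0 contours that
    have already been syllable-synchronized to equal length."""
    assert f0_synth.shape == f0_human.shape
    return float(np.sqrt(np.mean((f0_synth - f0_human) ** 2)))
```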
  • the algorithm can be placed in the TTS system architecture of the mobile phone, so that the TTS synthesized voice of the mobile phone has a specified emotion.
  • existing emotional speech application systems usually splice segments selected directly from an emotional corpus into emotional speech and must load a corpus for each emotion; this occupies a large amount of storage space, inevitably slows the application, and degrades the user experience.
  • the solution of the embodiment only needs to provide a neutral corpus, and then adjust the parameters to obtain the desired emotional voice.
  • Embodiments of the present application also provide a storage medium having stored therein a computer program, wherein the computer program is configured to execute the steps of any one of the method embodiments described above.
  • the foregoing storage medium may be configured to store a computer program for performing the following steps: acquiring an emotional feature parameter of the first voice; converting the emotional feature parameter into a prosody parameter according to a conversion rule, where the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and synthesizing the second speech according to the prosody parameter and the first speech.
  • the foregoing storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
  • Embodiments of the present application also provide an electronic device including a memory and a processor having a computer program stored therein, the processor being configured to execute a computer program to perform the steps of any of the above method embodiments.
  • the electronic device may further include a transmission device and an input and output device, wherein the transmission device is connected to the processor, and the input and output device is connected to the processor.
  • the foregoing processor may be configured to perform, by means of a computer program, the following steps: acquiring an emotional feature parameter of the first voice; converting the emotional feature parameter into a prosody parameter according to a conversion rule, where the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and synthesizing the second speech according to the prosody parameter and the first speech.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function; in actual implementation there may be other divisions, for example: multiple units or components may be combined, or may be integrated into another system, or some features may be ignored or not executed.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may stand alone as one unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • the foregoing program may be stored in a computer-readable storage medium; when executed, the program performs the steps of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disk.
  • the above-described integrated unit of the present application may be stored in a computer readable storage medium if it is implemented in the form of a software function module and sold or used as a stand-alone product.
  • in essence, or in the part that contributes to the prior art, the technical solutions of the embodiments of the present application may be embodied in the form of a software product stored in a storage medium, including a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the various embodiments of the present application.
  • the foregoing storage medium includes various media that can store program codes, such as a mobile storage device, a ROM, a RAM, a magnetic disk, or an optical disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephone Function (AREA)
  • Machine Translation (AREA)

Abstract

A voice synthesis method and device, the method comprising: acquiring emotion characteristic parameters of a first voice (S202); converting, according to a conversion rule, the emotion characteristic parameters into prosody parameters, the conversion rule being used for describing a mapping relationship between emotion characteristic parameters and prosody parameters (S204); and generating a synthesized second voice according to the prosody parameters and the first voice (S206).

Description

Speech synthesis method and device, storage medium, and electronic device

Cross-reference to related applications

This application is based on, and claims priority to, Chinese Patent Application No. 201810462450.0, filed on May 15, 2018, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of communications, and in particular to a speech synthesis method and device, a storage medium, and an electronic device.
Background

Speech synthesis technology is used very widely, but conventional speech synthesis can generally produce only neutral speech (that is, speech without emotion). It is therefore poorly suited to occasions that call for emotion, such as reading poetry or novels aloud, and to occasions that require voice reminders, such as the voice prompts of a mobile terminal or in-car driving prompts. If emotional speech is used instead, the interaction is closer to human communication and feels more approachable.

The emotional speech synthesis method in the related art usually splices emotional speech segments together. This method requires a speech database of the desired emotion; emotional segments are then spliced to form emotional speech. Specifically, a large emotional speech library is needed, with a separate library for each emotion. Synthesis follows existing prosody rules, after which an algorithm adjusts the prosody parameters of the emotional speech, and the speech units are waveform-spliced into the corresponding emotional sentence. In the related art, the speech parameters are determined entirely by manual tuning.

In view of the above problems in the related art, no effective solution has yet been found.
Summary

The embodiments of the present application provide a speech synthesis method and device, a storage medium, and an electronic device.

According to one embodiment of the present application, a speech synthesis method is provided, including: acquiring an emotional feature parameter of a first speech; converting the emotional feature parameter into a prosody parameter according to a conversion rule, where the conversion rule describes a mapping relationship between the emotional feature parameter and the prosody parameter; and synthesizing a second speech according to the prosody parameter and the first speech.

According to another embodiment of the present application, a speech synthesis device is provided, including: an acquisition module configured to acquire an emotional feature parameter of a first speech; a conversion module configured to convert the emotional feature parameter into a prosody parameter according to a conversion rule, where the conversion rule describes a mapping relationship between the emotional feature parameter and the prosody parameter; and a synthesis module configured to synthesize a second speech according to the prosody parameter and the first speech.

According to yet another embodiment of the present application, a storage medium is provided, storing a computer program configured to perform, when run, the steps in any one of the above method embodiments.

According to yet another embodiment of the present application, an electronic device is provided, including a memory and a processor. The memory stores a computer program, and the processor is configured to run the computer program to perform the steps in any one of the above method embodiments.

Through the embodiments of the present application, the emotional feature parameter is converted into a prosody parameter using the conversion rule, and a second speech is synthesized that carries a sense of prosody when played. Emotion is thus carried in the synthesized speech, the technical problem that synthesizing emotional speech is overly complex is solved, the emotional speech synthesis system is simplified, and synthesis efficiency is improved.
Brief description of the drawings

The drawings described here provide a further understanding of the embodiments of the present application and form a part of them; the illustrative embodiments of the present application and their descriptions explain the present application and do not unduly limit it. In the drawings:

FIG. 1 is a block diagram of the hardware structure of a mobile terminal running a speech synthesis method according to an embodiment of the present application;

FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present application;

FIG. 3 is a structural block diagram of a speech synthesis device according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an emotional speech conversion system based on the PSOLA method according to an embodiment of the present application;

FIG. 5 is a schematic diagram of smoothly approximating a probability density distribution of arbitrary shape in this embodiment;

FIG. 6 is a schematic diagram of fitting a certain distribution with a weighted sum of three Gaussian probability density functions in this embodiment;

FIG. 7 is a flowchart of GMM-based prosodic feature parameter determination according to an embodiment of the present application;

FIG. 8 is a schematic diagram of the distribution of fundamental frequency mean data according to an embodiment of the present application;

FIG. 9 is a plot of the fundamental frequency mean probability density distribution according to an embodiment of the present application.
Detailed description

The present application is described in detail below with reference to the drawings and in conjunction with the embodiments. It should be noted that the embodiments in the present application, and the features within them, may be combined with each other as long as they do not conflict.

It should also be noted that the terms "first", "second", and the like in the specification, claims, and drawings of the present application are used to distinguish similar objects and do not necessarily describe a particular order or sequence.

The method embodiments provided in this application may be executed on a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, FIG. 1 is a block diagram of the hardware structure of a mobile terminal running a speech synthesis method according to an embodiment of the present application. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a MicroController Unit (MCU) or a Field-Programmable Gate Array (FPGA)) and a memory 104 for storing data. Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will understand that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the mobile terminal; for example, the mobile terminal may include more or fewer components than shown in FIG. 1, or have a different configuration.

The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the speech synthesis method in the embodiments of the present application. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, connected to the mobile terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is configured to receive or send data via a network. Specific examples of such a network may include a wireless network provided by the communication operator of the mobile terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC) that can connect to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module configured to communicate with the Internet wirelessly. The input/output device 108 may be at least one of a touch screen, a button, a microphone, and an audio output device.
This embodiment provides a speech synthesis method running on the above mobile terminal. FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:

Step S202: acquire an emotional feature parameter of a first speech;

Step S204: convert the emotional feature parameter into a prosody parameter according to a conversion rule, where the conversion rule describes a mapping relationship between the emotional feature parameter and the prosody parameter;

Step S206: synthesize a second speech according to the prosody parameter and the first speech.

The related art places high demands on the emotional speech library, and the final quality of the synthesized emotional speech is affected both by the quality of the library and by the quality of the splicing. A further drawback is that one emotional speech library can synthesize only one emotion, whereas real life involves many emotions; if every emotion requires its own library, the system becomes very complex and ill-suited to terminal products such as mobile phones.

Through the above steps, the emotional feature parameter is converted into a prosody parameter using the conversion rule, and a second speech is synthesized that carries a sense of prosody when played. Emotion is thus carried in the speech, the related-art technical problem that synthesizing emotional speech is overly complex is solved, the emotional speech synthesis system is simplified, and synthesis efficiency is improved.
The steps in this embodiment may be executed by a terminal, a speech synthesis platform, a processor, or the like, but are not limited thereto.

In an optional embodiment of the present application, before the emotional feature parameter is converted into the prosody parameter according to the conversion rule, the method further includes one of the following: training the conversion rule; or presetting the conversion rule. Presetting means configuring the conversion rule in advance; for example, it may be purchased from a supplier or obtained from another device, which avoids the inconvenience of ad hoc training at the time of use.

Optionally, training the conversion rule includes: setting up a Gaussian Mixture Model (GMM), feeding multiple types of emotional feature parameters and multiple types of prosody parameters into the Gaussian mixture model as labeled data, and training to obtain the conversion rule.

In one example, training the conversion rule includes: selecting initial values for the Gaussian mixture model and determining computation parameters from the data distribution of the initial values, where the computation parameters include weight values, expected values, variance values, and the number of models; and estimating the initial values and the computation parameters with the Expectation-Maximization (EM) algorithm to obtain the maximum likelihood value.
In one example, the expression p(x) of the Gaussian mixture model is given by the following formula:

$$p(x) = \sum_{m=1}^{M} c_m \, N(x;\, \mu_m, \Sigma_m)$$

where each component is the single Gaussian density

$$N(x;\, \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}(x-\mu)^{T}\,\Sigma^{-1}\,(x-\mu)\right)$$

in which c_m is the weight value, μ_m is the expected value, Σ_m is the variance (covariance) value, M is the number of single Gaussian models, x is the emotional feature parameter value, p(x) is the prosody parameter (density) value, d is the dimensionality of x, and the superscript T denotes transposition.
In an optional embodiment of the present application, synthesizing the second speech according to the prosody parameter and the first speech includes one of the following: synthesizing the second speech on a Text To Speech (TTS) platform according to the prosody parameter and the first speech; or synthesizing the second speech from the prosody parameter and the first speech using the Pitch Synchronized Overlap-Add (PSOLA) algorithm.

In an optional embodiment of the present application, after the second speech is synthesized according to the prosody parameter and the first speech, the method further includes: smoothing the second speech to obtain a third speech, so that the result sounds natural as well as emotional.

Through the description of the above implementations, those skilled in the art will clearly understand that the method according to the above embodiments may be implemented with software plus the necessary general-purpose hardware platform, or of course with hardware, but in many cases the former is the better implementation. Based on this understanding, the part of the technical solution of the present application that is essential, or that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and includes a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods described in the various embodiments of the present application.
This embodiment further provides a speech synthesis device, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may refer to a combination of software and/or hardware that implements a predetermined function. Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.

FIG. 3 is a structural block diagram of a speech synthesis device according to an embodiment of the present application. As shown in FIG. 3, the device includes:

an acquisition module 30, configured to acquire an emotional feature parameter of a first speech;

a conversion module 32, configured to convert the emotional feature parameter into a prosody parameter according to a conversion rule, where the conversion rule describes a mapping relationship between the emotional feature parameter and the prosody parameter;

a synthesis module 34, configured to synthesize a second speech according to the prosody parameter and the first speech.
In an optional embodiment of the present application, the device further includes one of the following: a training module, configured to train the conversion rule before the conversion module converts the emotional feature parameter into the prosody parameter according to the conversion rule; and a setting module, configured to preset the conversion rule before the conversion module converts the emotional feature parameter into the prosody parameter according to the conversion rule.

In an optional embodiment of the present application, the training module includes a training unit configured to set up a Gaussian mixture model, feed multiple types of emotional feature parameters and multiple types of prosody parameters into the Gaussian mixture model as labeled data, and train to obtain the conversion rule.

As an example, the training module is configured to select initial values for the Gaussian mixture model and determine computation parameters from the data distribution of the initial values, where the computation parameters include weight values, expected values, variance values, and the number of models; and to estimate the initial values and the computation parameters with the EM algorithm to obtain the maximum likelihood value.

Optionally, the expression p(x) of the Gaussian mixture model may be given by the formula above.

In an optional embodiment of the present application, the synthesis module 34 is configured to synthesize the second speech on a TTS platform according to the prosody parameter and the first speech, or to synthesize the second speech from the prosody parameter and the first speech using the PSOLA algorithm.

In an optional embodiment of the present application, the device further includes a smoothing module configured to smooth the second speech to obtain a third speech.

It should be noted that each of the above modules may be implemented by software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the above modules are all located in the same processor; or the above modules are located in different processors in any combination.
本实施例提供了一种基于GMM训练算法的情感语音特征参数自动确定方法。结合具体的实施方式进行对本申请的方案进行了详细说明:This embodiment provides an automatic determination method for emotional speech feature parameters based on the GMM training algorithm. The solution of the present application is described in detail in conjunction with a specific implementation manner:
本实施例提出的情感语音合成技术,其技术核心是采用了一种韵律参数调整技术来进行实施的,采用该技术可以“改变”任何的中性语音来达到所设定的情感语音要求,方法简单实用,易于实施,无需占用很多的计算资源,在经过实用化处理后,通过本实施例技术可以自动确定所需要的各种关键韵律参数,只要将这些自动确定的参数置入TTS合成平台,就可以使得中性TTS产生情感语音。本实施例技术方案可以直接嫁接于现有的TTS平台,使得中性TTS软件平台产生出情感语音的效果,从而避免了情感语料的采集和情感TTS平台的开发。而按照现有的方法,只能是通过统计参数的预置,在产生过程中需要多次进行人工调整,才能获得比较满意的效果,效率比较低,不适合参数的自动确定和调整。本申请实施例提出了一种基于GMM方法的韵律参数训练和自动确定的方法,可以通过事先对情感语料的训练,得到不同的情感语音与中性语音关键韵律参数的映射关系,从而可以自动确定韵律参数,免去了原来的手工调整环节,可以大大提高调整效率,并能保证产生的情感语音具有较好的质量。The emotional speech synthesis technology proposed in this embodiment is implemented by using a prosody parameter adjustment technology, and the technology can be used to "change" any neutral speech to achieve the set emotional speech requirement. It is simple and practical, easy to implement, and does not need to occupy a lot of computing resources. After the practical processing, the various key rhythm parameters required by the technology of the embodiment can be automatically determined, as long as these automatically determined parameters are placed in the TTS synthesis platform. It is possible to make the neutral TTS produce emotional speech. The technical solution of the embodiment can be directly grafted to the existing TTS platform, so that the neutral TTS software platform produces the effect of emotional speech, thereby avoiding the collection of emotional corpus and the development of the emotional TTS platform. According to the existing method, only through the preset of the statistical parameters, the manual adjustment needs to be performed multiple times in the production process, in order to obtain a satisfactory effect, the efficiency is relatively low, and it is not suitable for the automatic determination and adjustment of the parameters. The embodiment of the present application proposes a method for training and automatically determining prosody parameters based on the GMM method. The mapping relationship between different emotional speech and neutral speech key prosody parameters can be obtained by training the emotional corpus in advance, thereby automatically determining The rhythm parameter eliminates the original manual adjustment link, which can greatly improve the adjustment efficiency and ensure that the generated emotional speech has better quality.
The present application addresses the automatic determination of emotional speech prosody parameters and improves adjustment efficiency. The embodiment of the present application can be applied in an emotional speech generation environment based on the PSOLA synthesis method to automatically determine key prosody parameter values and optimize the quality of the emotional speech. Since this embodiment concerns how to generate the key parameters automatically, once these parameters are obtained it can in principle be applied to any speech synthesis platform capable of producing neutral TTS: the emotional speech parameters are obtained automatically using the embodiment of the present application, and placing these parameters into a neutral TTS platform produces satisfactory emotional speech.
FIG. 4 is a schematic structural diagram of an emotional speech conversion system based on the PSOLA method according to an embodiment of the present application. As shown in FIG. 4, the system mainly comprises a training stage and a conversion stage: the training stage is mainly used to obtain the emotion conversion rules through training, and the conversion stage performs emotion conversion mainly on the basis of the emotion conversion rules obtained in the training stage. As an example, the PSOLA-based emotional speech conversion system is mainly composed of the following four parts:
(1) Speech signal preprocessing and feature parameter extraction.
The speech signal is framed, windowed, pre-emphasized, and endpoint-detected, and emotional feature parameters and prosody parameters are extracted.
As an implementation, feature parameter extraction is performed on the neutral speech and the emotional speech separately to obtain emotional feature parameters and prosody parameters; the emotional feature parameters and the prosody parameters are time-aligned and used as sample data for training the emotion conversion rules.
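As a non-limiting illustration, step (1) might be realized as in the following Python sketch, which assumes NumPy. The frame length, frame shift, pre-emphasis coefficient, and energy threshold are illustrative choices rather than values fixed by this application, and the input signal is assumed to be at least one frame long.

import numpy as np

def preprocess(signal, fs, frame_ms=25, shift_ms=10, alpha=0.97):
    # Pre-emphasis: y[n] = x[n] - alpha * x[n-1], boosting high frequencies
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    window = np.hamming(frame_len)
    starts = range(0, len(emphasized) - frame_len + 1, shift)
    frames = np.stack([emphasized[s:s + frame_len] * window for s in starts])
    # Crude energy-based endpoint detection: keep frames above a threshold
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > 0.1 * energy.mean()]

The retained frames would then feed the fundamental frequency and duration estimators that produce the parameters discussed below.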
(2) Model training.
The extracted emotional feature parameters and prosody parameters are modeled, and the conversion rules are obtained through training.
(3) Feature conversion.
First, the source speech is analyzed and its feature parameters are extracted to obtain the corresponding emotional feature parameters; emotion conversion is then performed according to the conversion rules obtained in training, yielding the prosody parameters.
(4) Speech synthesis and post-processing.
The speech signal is reconstructed based on the prosody parameters to obtain the synthesized emotional speech, which is then smoothed and otherwise post-processed so that it sounds as natural as possible.
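For step (4), a heavily simplified, non-limiting sketch of pitch modification in the spirit of TD-PSOLA is given below. It assumes the pitch marks of the neutral speech are already known and only rescales the fundamental frequency; a practical system would also adjust duration and smooth segment boundaries. The function name and all settings are illustrative.

import numpy as np

def psola_pitch_scale(signal, marks, factor):
    # Pitch-synchronous overlap-add: factor > 1 raises F0, factor < 1 lowers it
    out = np.zeros(len(signal))
    marks = np.asarray(marks)
    periods = np.diff(marks)
    # Synthesis marks spaced by the locally rescaled pitch period
    new_marks, t = [], float(marks[0])
    while t < marks[-1]:
        new_marks.append(int(t))
        p = periods[min(np.searchsorted(marks, t), len(periods) - 1)]
        t += p / factor
    for nm in new_marks:
        i = int(np.argmin(np.abs(marks - nm)))      # nearest analysis mark
        p = int(periods[min(i, len(periods) - 1)])  # local pitch period
        lo, hi = max(marks[i] - p, 0), min(marks[i] + p, len(signal))
        seg = signal[lo:hi] * np.hanning(hi - lo)   # two-period analysis window
        o_lo = nm - (marks[i] - lo)                 # center segment on synthesis mark
        if o_lo >= 0 and o_lo + len(seg) <= len(out):
            out[o_lo:o_lo + len(seg)] += seg        # overlap-add
    return out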
Gaussian density function estimation is a parametric modeling approach. Parameter estimation for a GMM means finding, from a given set of speech training data and according to some criterion, the model parameters such that the resulting model best describes the probability distribution of that training data. This embodiment uses this model to obtain the prosody feature parameters of emotional speech fairly accurately, so that the emotional expressiveness of the synthesized speech is more natural.
The Gaussian Mixture Model (GMM) is an extension of the single Gaussian probability density function. Because it can smoothly approximate a probability density distribution of arbitrary shape, it is widely used and performs well.
According to statistical theory, a linear weighted combination of several Gaussian probability densities can approximate an arbitrary distribution; in theory it can therefore describe the statistical distribution of speech features of any form. FIG. 5 is a schematic diagram of smoothly approximating a probability density distribution of arbitrary shape in this embodiment, showing a weighted sum of three Gaussian probability density functions fitted to a given distribution: the mean parameter indicates the position of each Gaussian distribution, the variance parameter indicates its spread, and the weight parameter indicates its magnitude, that is, how much of the data falls in that Gaussian. FIG. 6 is a schematic diagram of the weighted sum of the three Gaussian probability density functions fitted to the distribution, namely the approximate probability density function of the distribution. In this way, an arbitrary distribution can be fitted through the model's mean, variance, and weight parameters.
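The idea of FIG. 5 and FIG. 6 can be reproduced in a few lines. The following non-limiting sketch fits a weighted sum of three Gaussians to synthetic one-dimensional data using scikit-learn, which is one convenient choice of library and is not prescribed by this application.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from three overlapping modes
samples = np.concatenate([rng.normal(-2.0, 0.5, 300),
                          rng.normal(0.0, 1.0, 400),
                          rng.normal(3.0, 0.8, 300)]).reshape(-1, 1)
gmm = GaussianMixture(n_components=3, random_state=0).fit(samples)
# Position (mean), spread (variance), and magnitude (weight) of each Gaussian
print(gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel())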
The probability density expression of the GMM is as follows:
$$p(x) = \sum_{m=1}^{M} c_m\, N(x;\, \mu_m,\, \Sigma_m)$$

where

$$N(x;\, \mu_m,\, \Sigma_m) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma_m\rvert^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x-\mu_m)^{T}\,\Sigma_m^{-1}\,(x-\mu_m)\right)$$
where c_m is the weight value, μ_m is the expected value, Σ_m is the variance value, M is the number of single Gaussian models, x is the emotional feature parameter value, p(x) is the prosody parameter value, d is a constant, μ is a constant, and T is a constant.
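Written out directly, the density above can be evaluated as in the following sketch, assuming NumPy, full d x d covariance matrices, and the conventional reading in which d is the dimension of x and T denotes transposition.

import numpy as np

def gmm_density(x, weights, means, covs):
    # p(x) = sum over m of c_m * N(x; mu_m, Sigma_m)
    d = x.shape[0]
    p = 0.0
    for c, mu, cov in zip(weights, means, covs):
        diff = x - mu
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
        p += c * norm * np.exp(-0.5 * diff @ np.linalg.solve(cov, diff))
    return p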
The parameters are usually estimated with the EM algorithm. In essence, the probability distribution of the samples is first assumed to be a Gaussian mixture model (GMM), and multiple samples are then used to learn the parameters of this GMM. Through repeated iterative computation, the resulting model parameters gradually approach the target model. With this probability distribution model we find the optimal emotional speech parameters corresponding to the neutral speech feature parameters.
FIG. 7 is a flowchart of GMM-based prosody feature parameter determination according to an embodiment of the present application. As shown in FIG. 7:
Taking the determination of the fundamental frequency, the most important of the prosody feature parameters, as an example, the steps for training and determining the parameters to produce a HAPPY emotion are as follows:
1. Generate a Gaussian mixture model (GMM).
The concept and expression of the Gaussian mixture model (GMM) were briefly introduced above; the specific model generation steps are as follows.
When performing the modeling training, initial values for the GMM parameters must first be chosen; the weight values c_m, expected values μ_m, variances Σ_m, and the number M of single Gaussian models can be roughly determined from the data distribution.
2. Estimate the parameters with the EM algorithm.
The basic idea of the EM algorithm is as follows: randomly initialize a set of parameters θ(0), update the expectation E(Y) of Y according to the posterior probability Pr(Y|X; θ), then substitute E(Y) for Y to obtain new model parameters θ(1), and iterate in this way until θ stabilizes.
The EM algorithm has two steps:
(1) By setting the initial θ value, the Q_t value that maximizes the likelihood equation is found; this step is called the E-step (expectation step):
$$Q(\theta \mid \theta^{(t)}) = E_{Z \mid X,\, \theta^{(t)}}\!\left[\log L(\theta;\, X, Z)\right] \quad (1)$$

$$Q(\theta \mid \theta^{(t)}) = \sum_{Z} \Pr(Z \mid X;\, \theta^{(t)})\, \log L(\theta;\, X, Z)$$
where θ = (c, μ, Σ) are the GMM model parameters, comprising the three parameters weight c, expectation μ, and variance Σ; θ^(t) is the estimate of the parameter θ at the t-th iteration; expression (1) computes the log-likelihood function of L(θ; X, Z); X is the observed variable, Z is the hidden variable, E denotes expectation, L is the log-likelihood function, and Q denotes a functional.
(2) Using the obtained Q_t value, update θ. This step is called the M-step (maximization step).
Through this two-step iteration, the maximum likelihood value is approached gradually.
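For a one-dimensional GMM the two-step iteration takes only a few lines. The following sketch assumes NumPy arrays and initial values c, mu, var chosen from the data distribution as in step 1 above; the fixed iteration count stands in for a convergence test on θ.

import numpy as np

def em_gmm_1d(x, c, mu, var, n_iter=50):
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each sample
        dens = c * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weight, mean, and variance of each component
        n_m = r.sum(axis=0)
        c = n_m / len(x)
        mu = (r * x[:, None]).sum(axis=0) / n_m
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n_m
    return c, mu, var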
The present application approximates the values of the parameter samples with a multi-Gaussian mixture model. There are several key parameters that determine emotional speech, such as fundamental frequency and duration, and the technique of this embodiment can be applied to determine the value of each of them. The determination of the fundamental frequency parameter is taken as an example below.
FIG. 8 is a schematic diagram of the distribution of fundamental frequency mean data according to an embodiment of the present application, in which the X direction corresponds to "neutral" speech and the Y direction to "happy" emotional speech.
In each two-dimensional data point (x, y), x is the fundamental frequency parameter value of the "neutral" speech and y is the fundamental frequency parameter value of the "happy" emotional speech; together they form the neutral-happy parameter mapping.
FIG. 9 is the probability density distribution of the fundamental frequency means according to an embodiment of the present application, a schematic diagram of the surface fitted by a plurality of Gaussian functions. The determination of the speech prosody feature parameters comprises:
neutral parameter (x coordinate) → through the EM algorithm ((i) and (ii) below), find the y value with the highest probability for that x value in the model and take it as the output emotional parameter → the resulting emotional speech prosody feature parameter (a sketch of one possible realization of this conversion follows step (ii) below).
(i) By setting the initial θ value, the Q_t value that maximizes the likelihood equation is found according to the following expressions; this step is called the E-step:
$$Q(\theta \mid \theta^{(t)}) = E_{Z \mid X,\, \theta^{(t)}}\!\left[\log L(\theta;\, X, Z)\right] \quad (1)$$

$$Q(\theta \mid \theta^{(t)}) = \sum_{Z} \Pr(Z \mid X;\, \theta^{(t)})\, \log L(\theta;\, X, Z)$$
where θ = (c, μ, Σ) are the GMM model parameters, comprising the three parameters weight c, expectation μ, and variance Σ; θ^(t) is the estimate of the parameter θ at the t-th iteration; expression (1) computes the log-likelihood function of L(θ; X, Z); X is the observed variable, Z is the hidden variable, E denotes expectation, L is the log-likelihood function, and Q denotes a functional.
(ii) Using the obtained Q_t value, update θ; this step is called the M-step.
Through this two-step iteration, the maximum likelihood value is approached gradually.
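One non-limiting way to realize the conversion described above is sketched below: a GMM is fitted to joint (neutral, happy) fundamental frequency pairs, and for a given neutral value x the component that best explains x supplies the output y through its conditional mean. The function names are illustrative; the application only requires that the y with the highest model probability for x be selected.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_mapping(xy_pairs, n_components=4):
    # xy_pairs: array of shape (N, 2) holding (neutral F0, happy F0) pairs
    return GaussianMixture(n_components=n_components).fit(xy_pairs)

def convert(gmm, x):
    w, mu, cov = gmm.weights_, gmm.means_, gmm.covariances_
    # Score each component on the neutral coordinate alone (constants dropped)
    px = w * np.exp(-0.5 * (x - mu[:, 0]) ** 2 / cov[:, 0, 0]) / np.sqrt(cov[:, 0, 0])
    m = int(np.argmax(px))
    # Conditional mean of y given x within the winning component
    return mu[m, 1] + cov[m, 1, 0] / cov[m, 0, 0] * (x - mu[m, 0])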
Several groups of sentences were randomly tested with the method of the embodiment of the present application. After syllable synchronization, the gap between the speech parameters of the synthesized emotional speech and those of real human speech was compared, the gap being expressed as a "fundamental frequency distance value". Taking real human speech as the reference template, a smaller fundamental frequency distance value means the two utterances are more similar, that is, closer to real human speech; a larger fundamental frequency distance value means the speech is further from real human speech. Some of the results are shown in Table 1; they indicate that the emotional speech synthesized with the embodiment of the present application outperforms other available emotional speech synthesis results.
Table 1
[Table 1 appears as an image in the original filing; it reports the fundamental frequency distance values of the synthesized voices new1, new2, and new3 against the real-speech reference.]
Here, new1 is emotional speech synthesized after training on 100 data groups with 4 single Gaussians; new2 is emotional speech synthesized after training on 200 data groups with 4 single Gaussians; and new3 is emotional speech synthesized after training on 200 data groups with 4 single Gaussians each for male and female voices. Comparison of the fundamental frequency distance values shows that increasing the amount of training data, increasing the number of single Gaussian components, and training male and female voice data separately all improve the modeling effect.
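The evaluation metric can be sketched as follows, assuming the synthesized and reference F0 contours have already been syllable-aligned to equal length. Root-mean-square error over voiced frames is one natural reading of the "fundamental frequency distance value"; the filing does not spell out the exact definition.

import numpy as np

def f0_distance(f0_synth, f0_human):
    f0_synth, f0_human = np.asarray(f0_synth), np.asarray(f0_human)
    voiced = (f0_synth > 0) & (f0_human > 0)  # compare voiced frames only
    return np.sqrt(np.mean((f0_synth[voiced] - f0_human[voiced]) ** 2))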
As an application example of this embodiment, the algorithm can be embedded in the TTS framework of a mobile phone so that the TTS synthesized speech of the phone carries a specified emotion.
Existing emotional speech application systems usually select segments from an emotional corpus and splice them directly into emotional speech, which requires loading various emotional corpora. This method occupies a large amount of storage space, inevitably slows the application down, and harms the user experience. With the solution of this embodiment, only a neutral corpus needs to be provided; adjusting the parameters then yields the desired emotional speech.
An embodiment of the present application further provides a storage medium storing a computer program, wherein the computer program is configured to perform, when run, the steps of any one of the method embodiments above.
Optionally, in this embodiment, the storage medium may be configured to store a computer program for performing the following steps: acquiring an emotional feature parameter of a first speech; converting the emotional feature parameter into a prosody parameter according to a conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and synthesizing a second speech according to the prosody parameter and the first speech.
Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing a computer program, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
An embodiment of the present application further provides an electronic device including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps of any one of the method embodiments above.
Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to perform, by means of the computer program, the following steps: acquiring an emotional feature parameter of a first speech; converting the emotional feature parameter into a prosody parameter according to a conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and synthesizing a second speech according to the prosody parameter and the first speech.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, which are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation, for instance multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, each unit may stand alone as a unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be accomplished by hardware related to program instructions. The aforementioned program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments; the aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. On this understanding, the technical solution of the embodiments of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
The above is only a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims (12)

  1. A method for synthesizing speech, comprising:
    acquiring an emotional feature parameter of a first speech;
    converting the emotional feature parameter into a prosody parameter according to a conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and
    synthesizing a second speech according to the prosody parameter and the first speech.
  2. The method according to claim 1, wherein, before the emotional feature parameter is converted into the prosody parameter according to the conversion rule, the method further comprises one of the following:
    training the conversion rule;
    presetting the conversion rule.
  3. The method according to claim 2, wherein training the conversion rule comprises:
    setting a Gaussian mixture model, inputting a plurality of types of emotional feature parameters and a plurality of types of prosody parameters into the Gaussian mixture model as label data, and training to obtain the conversion rule.
  4. The method according to claim 2, wherein training the conversion rule comprises:
    selecting initial values of a Gaussian mixture model, and determining calculation parameters according to a data distribution of the initial values, wherein the calculation parameters comprise a weight value, an expected value, a variance value, and a number of models; and
    estimating the initial values and the calculation parameters by means of an expectation-maximization (EM) algorithm to obtain a maximum likelihood value.
  5. The method according to claim 3, wherein an expression p(x) of one level of the Gaussian mixture model is given by the following formula:
    $$p(x) = \sum_{m=1}^{M} c_m\, N(x;\, \mu_m,\, \Sigma_m)$$

    where

    $$N(x;\, \mu_m,\, \Sigma_m) = \frac{1}{(2\pi)^{d/2}\,\lvert\Sigma_m\rvert^{1/2}}\, \exp\!\left(-\tfrac{1}{2}\,(x-\mu_m)^{T}\,\Sigma_m^{-1}\,(x-\mu_m)\right)$$
    where c_m is a weight value, μ_m is an expected value, Σ_m is a variance value, M is the number of single Gaussian models, x is an emotional feature parameter value, p(x) is a prosody parameter value, d is a constant, μ is a constant, and T is a constant.
  6. The method according to claim 1, wherein synthesizing the second speech according to the prosody parameter and the first speech comprises one of the following:
    synthesizing the second speech according to the prosody parameter and the first speech on a text-to-speech (TTS) platform;
    synthesizing the second speech according to the prosody parameter and the first speech by using a pitch synchronous overlap-add (PSOLA) algorithm.
  7. The method according to claim 1, wherein, after the second speech is synthesized according to the prosody parameter and the first speech, the method further comprises:
    smoothing the second speech to obtain a third speech.
  8. A parameter determination device, comprising:
    an acquisition module configured to acquire an emotional feature parameter of a first speech;
    a conversion module configured to convert the emotional feature parameter into a prosody parameter according to a conversion rule, wherein the conversion rule is used to describe a mapping relationship between the emotional feature parameter and the prosody parameter; and
    a synthesis module configured to synthesize a second speech according to the prosody parameter and the first speech.
  9. The device according to claim 8, further comprising one of the following:
    a training module configured to train the conversion rule before the conversion module converts the emotional feature parameter into the prosody parameter according to the conversion rule;
    a setting module configured to preset the conversion rule before the conversion module converts the emotional feature parameter into the prosody parameter according to the conversion rule.
  10. The device according to claim 9, wherein the training module comprises:
    a training unit configured to set a Gaussian mixture model, input a plurality of types of emotional feature parameters and a plurality of types of prosody parameters into the Gaussian mixture model as label data, and train to obtain the conversion rule.
  11. A storage medium storing a computer program, wherein the computer program is configured to perform, when run, the method according to any one of claims 1 to 7.
  12. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor is configured to run the computer program to perform the method according to any one of claims 1 to 7.
PCT/CN2019/079582 2018-05-15 2019-03-25 Voice synthesis method and device, storage medium, and electronic device WO2019218773A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810462450.0A CN110556092A (en) 2018-05-15 2018-05-15 Speech synthesis method and device, storage medium and electronic device
CN201810462450.0 2018-05-15

Publications (1)

Publication Number Publication Date
WO2019218773A1 true WO2019218773A1 (en) 2019-11-21

Family

ID=68539473

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/079582 WO2019218773A1 (en) 2018-05-15 2019-03-25 Voice synthesis method and device, storage medium, and electronic device

Country Status (2)

Country Link
CN (1) CN110556092A (en)
WO (1) WO2019218773A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021127979A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device, and computer readable storage medium
CN112349272A (en) * 2020-10-15 2021-02-09 北京捷通华声科技股份有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic device
CN112837700A (en) * 2021-01-11 2021-05-25 网易(杭州)网络有限公司 Emotional audio generation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101064104A (en) * 2006-04-24 2007-10-31 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN102184731A (en) * 2011-05-12 2011-09-14 北京航空航天大学 Method for converting emotional speech by combining rhythm parameters with tone parameters
CN102385858A (en) * 2010-08-31 2012-03-21 国际商业机器公司 Emotional voice synthesis method and system
CN105280179A (en) * 2015-11-02 2016-01-27 小天才科技有限公司 Text-to-speech processing method and system
CN106531150A (en) * 2016-12-23 2017-03-22 上海语知义信息技术有限公司 Emotion synthesis method based on deep neural network model
CN106688034A (en) * 2014-09-11 2017-05-17 微软技术许可有限责任公司 Text-to-speech with emotional content
CN107221344A (en) * 2017-04-07 2017-09-29 南京邮电大学 A kind of speech emotional moving method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7401020B2 (en) * 2002-11-29 2008-07-15 International Business Machines Corporation Application of emotion-based intonation and prosody to speech in text-to-speech systems
CN101452699A (en) * 2007-12-04 2009-06-10 株式会社东芝 Rhythm self-adapting and speech synthesizing method and apparatus
US8374873B2 (en) * 2008-08-12 2013-02-12 Morphism, Llc Training and applying prosody models
CN101685633A (en) * 2008-09-28 2010-03-31 富士通株式会社 Voice synthesizing apparatus and method based on rhythm reference
KR101203188B1 (en) * 2011-04-14 2012-11-22 한국과학기술원 Method and system of synthesizing emotional speech based on personal prosody model and recording medium
CN103198827B (en) * 2013-03-26 2015-06-17 合肥工业大学 Voice emotion correction method based on relevance of prosodic feature parameter and emotion parameter
JP6433650B2 (en) * 2013-11-15 2018-12-05 国立大学法人佐賀大学 Mood guidance device, mood guidance program, and computer operating method
CN104217721B (en) * 2014-08-14 2017-03-08 东南大学 Based on the phonetics transfer method under the conditions of the asymmetric sound bank that speaker model aligns
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text
CN107146217B (en) * 2017-04-07 2020-03-06 北京工业大学 Image detection method and device
CN107301859B (en) * 2017-06-21 2020-02-21 南京邮电大学 Voice conversion method under non-parallel text condition based on self-adaptive Gaussian clustering

Also Published As

Publication number Publication date
CN110556092A (en) 2019-12-10


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19802635

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.04.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19802635

Country of ref document: EP

Kind code of ref document: A1