CN109285536A - Voice special effect synthesis method and device, electronic equipment and storage medium - Google Patents

Voice special effect synthesis method and device, electronic equipment and storage medium

Info

Publication number
CN109285536A
CN109285536A (application CN201811413566.1A)
Authority
CN
China
Prior art keywords
target
basic
prosodic features
voice
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811413566.1A
Other languages
Chinese (zh)
Other versions
CN109285536B (en)
Inventor
张冉
张征
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Innovation Technology Co Ltd
Original Assignee
Beijing Yufanzhi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yufanzhi Information Technology Co Ltd
Priority to CN201811413566.1A
Publication of CN109285536A
Application granted
Publication of CN109285536B
Legal status: Active (current)
Anticipated expiration

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 - Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention disclose a voice special effect synthesis method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring text data corresponding to original voice data, and acquiring basic prosodic features and basic acoustic features matching the text data; acquiring, according to a pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter comprising a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter; adjusting the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and generating special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features. The technical solution of the embodiments of the invention can satisfy the diversity requirements of special-effect voice.

Description

Voice special effect synthesis method and apparatus, electronic device, and storage medium
Technical field
Embodiments of the present invention relate to the field of voice processing technology, and in particular to a voice special effect synthesis method and apparatus, an electronic device, and a storage medium.
Background art
Speech synthesis, also known as text-to-speech (TTS) technology, can convert arbitrary text information into smooth speech in real time.
Existing speech synthesis techniques generally process the text data of original voice data using a pre-trained prosody model and acoustic model to obtain the synthesized voice corresponding to the original voice data. In the course of implementation, the inventors found the following defect in the prior art: processing the text data of original voice data with a pre-trained prosody model and acoustic model yields only a single fixed type of synthesized voice, which cannot satisfy the diversity requirements of special-effect voice.
Summary of the invention
In view of this, embodiments of the present invention provide a voice special effect synthesis method and apparatus, an electronic device, and a storage medium, with the main purpose of solving the problem that the synthesized voice produced by existing speech synthesis systems is relatively monotonous and cannot satisfy the diversity requirements of special-effect voice.
To solve the above problem, embodiments of the present invention mainly provide the following technical solutions:
In a first aspect, an embodiment of the invention provides a voice special effect synthesis method, comprising:
acquiring text data corresponding to original voice data, and acquiring basic prosodic features and basic acoustic features matching the text data;
acquiring, according to a pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter;
adjusting the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and
generating special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features.
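The four steps of the first aspect can be sketched end to end as follows. This is an illustrative Python sketch, not part of the patent disclosure; all function names, field names, and parameter values are assumptions.

```python
# Illustrative sketch of the four claimed steps; all names and values are
# hypothetical, and a real system would end by rendering audio with a vocoder.
from dataclasses import dataclass

@dataclass
class Features:
    prosodic: dict   # e.g. per-syllable tone values and inter-character pauses
    acoustic: dict   # e.g. an F0 contour and spectral parameters

# Hypothetical pre-established mapping from effect name to adjustment parameters.
EFFECT_PARAMS = {
    "robot": {"tone": 1, "pause_level": 2, "f0": 260.0},
}

def synthesize_with_effect(base: Features, effect: str) -> Features:
    """Adjust base prosodic/acoustic features per the requested effect."""
    params = EFFECT_PARAMS[effect]
    target = Features(dict(base.prosodic), dict(base.acoustic))
    if "tone" in params:        # target prosodic feature adjustment
        target.prosodic["tones"] = [params["tone"]] * len(base.prosodic["tones"])
    if "pause_level" in params:
        target.prosodic["pauses"] = [params["pause_level"]] * len(base.prosodic["pauses"])
    if "f0" in params:          # target acoustic feature adjustment
        target.acoustic["f0"] = [params["f0"]] * len(base.acoustic["f0"])
    return target  # a vocoder would then render these features into a waveform

base = Features({"tones": [3, 3, 3], "pauses": [2, 3]},
                {"f0": [258.0, 263.0, 275.0]})
out = synthesize_with_effect(base, "robot")
print(out.prosodic["tones"])   # [1, 1, 1]
print(out.acoustic["f0"])      # [260.0, 260.0, 260.0]
```

The sketch keeps the basic features untouched and writes the adjusted values into a separate target structure, matching the method's separation of basic and target features.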
In a second aspect, an embodiment of the invention further provides a voice special effect synthesis apparatus, comprising:
a data acquisition module, configured to acquire the text data corresponding to original voice data and to acquire basic prosodic features and basic acoustic features matching the text data;
a parameter acquisition module, configured to acquire, according to a pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter;
a parameter adjustment module, configured to adjust the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and
a voice synthesis module, configured to generate special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features.
In a third aspect, an embodiment of the invention further provides an electronic device, comprising:
at least one processor; and
at least one memory and a bus connected to the processor; wherein
the processor and the memory communicate with each other through the bus; and
the processor is configured to call program instructions in the memory to execute the voice special effect synthesis method provided by any embodiment of the invention.
In a fourth aspect, an embodiment of the invention further provides a non-transitory computer-readable storage medium storing computer instructions, the computer instructions causing a computer to execute the voice special effect synthesis method provided by any embodiment of the invention.
Through the above technical solutions, the technical solution provided by embodiments of the invention has at least the following advantage:
The embodiment of the invention obtains the text data corresponding to original voice data together with the basic prosodic features and basic acoustic features matching the text data; obtains, according to the pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to the required special effect; adjusts the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and finally generates special-effect synthesized voice corresponding to the original voice data from the obtained target prosodic features and/or target acoustic features. Multiple special effects can thus be adjusted and synthesized on demand, solving the problem that the synthesized voice of existing speech synthesis systems is monotonous and thereby satisfying the diversity requirements of special-effect voice.
The above description is merely an overview of the technical solutions of the embodiments of the invention. In order that the technical means of the embodiments may be understood more clearly and implemented in accordance with the contents of the specification, and in order that the above and other objects, features, and advantages of the embodiments may be more readily apparent, specific embodiments of the invention are set forth below.
Brief description of the drawings
By reading the following detailed description of the preferred embodiments, various other advantages and benefits will become clear to those of ordinary skill in the art. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the embodiments of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a flowchart of a voice special effect synthesis method provided by Embodiment 1 of the invention;
Fig. 2 is a flowchart of a voice special effect synthesis method provided by Embodiment 2 of the invention;
Fig. 3 is a schematic diagram of a voice special effect synthesis apparatus provided by Embodiment 3 of the invention;
Fig. 4 is a structural schematic diagram of an electronic device provided by Embodiment 4 of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and its scope fully conveyed to those skilled in the art.
Embodiment 1
Fig. 1 is a flowchart of a voice special effect synthesis method provided by Embodiment 1 of the invention. This embodiment is applicable to synthesizing special-effect synthesized voice according to different special effect requirements. The method may be executed by a voice special effect synthesis apparatus, which may be realized in software and/or hardware. Correspondingly, as shown in Fig. 1, the method includes the following operations:
S110: Acquire the text data corresponding to original voice data, and acquire the basic prosodic features and basic acoustic features matching the text data.
Here, the original voice data may be manually input voice data that needs to be converted into special-effect synthesized voice. The basic prosodic features may be the prosodic features of the original voice data, for example the tone of each syllable, stress, word segmentation, and pause features. The basic acoustic features may be the acoustic features of the original voice data, for example the fundamental frequency (F0) parameters and spectral parameters of the voice data.
In embodiments of the invention, the original voice data may be any manually input voice data. Optionally, it may be input in Mandarin, or input with emotion; embodiments of the invention do not limit the input manner or content of the original voice data. The text data corresponding to the original voice data may be obtained using any speech recognition technique; likewise, embodiments of the invention do not limit the manner of obtaining the text data. Correspondingly, after the text data corresponding to the original voice data is obtained, the basic prosodic features and basic acoustic features matching the text data can be further acquired.
S120: Acquire, according to the pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to the required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter.
Here, a feature adjustment parameter may be a parameter for adjusting the basic prosodic features and/or basic acoustic features of the original voice data, such as tone, stress, word segmentation, pause features, fundamental frequency parameters, and spectral parameters. The target prosodic feature adjustment parameter may be the parameter available for adjusting the basic prosodic features of the original voice data, and the target acoustic feature adjustment parameter may be the parameter available for adjusting the basic acoustic features of the original voice data.
In embodiments of the invention, before synthesizing special-effect voice from the original voice data, mapping relations between at least two special effects and their corresponding feature adjustment parameters may first be established, so as to realize voice synthesis for different special effect requirements. For example, mapping relations with feature adjustment parameters are established separately for multiple special effects such as a machine-voice effect, an old-man-voice effect, and a child-voice effect. For different special effects, the target prosodic feature adjustment parameter and/or target acoustic feature adjustment parameter included in the corresponding feature adjustment parameter also differ. The feature adjustment parameter corresponding to some special effects may involve both a target prosodic feature adjustment parameter and a target acoustic feature adjustment parameter, while other special effects may involve only a target prosodic feature adjustment parameter or only a target acoustic feature adjustment parameter.
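Such a pre-established mapping table can be sketched as a plain dictionary. The effect names and parameter layouts below are illustrative assumptions (the patent discloses concrete values only for the machine-voice example later); note that an entry may carry prosodic parameters, acoustic parameters, or both.

```python
# Hypothetical mapping between special effects and feature adjustment
# parameters; effect names and values are assumptions for illustration.
EFFECT_MAP = {
    # both target prosodic and target acoustic adjustment parameters
    "robot":   {"prosodic": {"tone": 1, "pause_level": 2},
                "acoustic": {"f0": 260.0}},
    # only target acoustic adjustment parameters
    "old_man": {"acoustic": {"f0_scale": 0.7}},
    # only target prosodic adjustment parameters
    "child":   {"prosodic": {"pause_level": 1}},
}

def get_adjustment_params(effect: str) -> dict:
    """Look up the feature adjustment parameters for the required effect."""
    try:
        return EFFECT_MAP[effect]
    except KeyError:
        raise ValueError(f"no mapping established for effect {effect!r}")

params = get_adjustment_params("old_man")
print("prosodic" in params, "acoustic" in params)  # False True
```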
S130: Adjust the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features.
Here, the target prosodic features may be the prosodic features of the required special-effect synthesized voice, and the target acoustic features may be the acoustic features of the required special-effect synthesized voice.
Correspondingly, when performing special effect synthesis on the original voice data, the basic prosodic features and/or basic acoustic features of the original voice data may be adjusted according to the feature adjustment parameter corresponding to the special effect of the synthesized voice, thereby obtaining the adjusted target prosodic features and/or target acoustic features. That is, embodiments of the invention may adjust both the basic prosodic features and the basic acoustic features to obtain target prosodic features and target acoustic features, or may adjust only the basic prosodic features or only the basic acoustic features to obtain target prosodic features or target acoustic features.
S140: Generate special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features.
In embodiments of the invention, after the target prosodic features and/or target acoustic features are obtained, special-effect synthesized voice corresponding to the original voice data may be generated according to them.
The embodiment of the invention obtains the text data corresponding to original voice data together with the basic prosodic features and basic acoustic features matching the text data; obtains, according to the pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to the required special effect; adjusts the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and finally generates special-effect synthesized voice corresponding to the original voice data from the obtained target prosodic features and/or target acoustic features. Multiple special effects can thus be adjusted and synthesized on demand, solving the problem that the synthesized voice of existing speech synthesis systems is monotonous and thereby satisfying the diversity requirements of special-effect voice.
Embodiment 2
Fig. 2 is a flowchart of a voice special effect synthesis method provided by Embodiment 2 of the invention. This embodiment is elaborated on the basis of the above embodiment, and gives a specific implementation of adjusting the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features. Correspondingly, as shown in Fig. 2, the method includes the following operations:
S210: Acquire the text data corresponding to original voice data, and acquire the basic prosodic features and basic acoustic features matching the text data.
S220: Acquire, according to the pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to the required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter.
S230: Adjust the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features.
Here, the basic prosodic features include: basic pronunciation features corresponding to the text characters in the text data, and the basic pause values between every two text characters in the text data; a basic pronunciation feature includes the pinyin element corresponding to a text character and the tone value corresponding to the pinyin element. The target prosodic feature adjustment parameter comprises: a target tone adjustment value and a target pause value. The basic acoustic features include: basic fundamental frequency parameters and basic spectral parameters corresponding to the original voice data. The target acoustic feature adjustment parameter comprises: target fundamental frequency parameters and target spectral parameters.
In embodiments of the invention, a basic pronunciation feature may be the pronunciation feature of each text character of the text data corresponding to the original voice data, specifically the pinyin element of the text character and the tone value corresponding to that pinyin element. A basic pause value may be the pause value between two adjacent text characters of the text data. The basic fundamental frequency parameters and basic spectral parameters are the fundamental frequency parameters and spectral parameters of the original voice data. The target tone adjustment value gives the tone values, corresponding to the pinyin elements, of the text data of the required special-effect synthesized voice, and the target pause value gives the pause value between two adjacent text characters of that text data. The target fundamental frequency parameters and target spectral parameters are the fundamental frequency parameters and spectral parameters of the required special-effect synthesized voice.
Correspondingly, when the target prosodic feature adjustment parameter is used to adjust the basic prosodic features to obtain the target prosodic features, S230 may specifically include the following operations:
S231a: Update, with the target tone adjustment value, the tone value corresponding to each pinyin element in the basic prosodic features.
Specifically, the target tone adjustment value may be used to update the tone value corresponding to each pinyin element in the basic prosodic features of the text data corresponding to the original voice data.
S232a: Update, with the target pause value, the basic pause values between every two text characters of the text data in the basic prosodic features.
Similarly, the target pause value may be used to update the basic pause values between every two text characters of the text data in the basic prosodic features of the text data corresponding to the original voice data.
S233a: Take the update result as the target prosodic features.
Correspondingly, the update result of the tone values and pause values may finally be taken as the target prosodic features.
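Steps S231a to S233a can be sketched as follows, under assumed data shapes: pronunciation features as (pinyin, tone) pairs and pauses as a list of inter-character pause levels. The function name and shapes are hypothetical.

```python
# Sketch of S231a-S233a: overwrite every tone value with the target tone
# adjustment value and every pause value with the target pause value.
def adjust_prosody(pronunciations, pauses, target_tone, target_pause):
    """Return the updated (pronunciation features, pause values) pair."""
    new_pron = [(pinyin, target_tone) for pinyin, _tone in pronunciations]
    new_pauses = [target_pause for _ in pauses]
    # S233a: the update result is the target prosodic feature set
    return new_pron, new_pauses

pron = [("ni", 3), ("hao", 3)]
pauses = [2]
print(adjust_prosody(pron, pauses, 1, 2))
# ([('ni', 1), ('hao', 1)], [2])
```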
Correspondingly, when the target acoustic feature adjustment parameter is used to adjust the basic acoustic features to obtain the target acoustic features, S230 may specifically include the following operations:
S231b: Update, with the target fundamental frequency parameters, the basic fundamental frequency parameters in the basic acoustic features.
Specifically, the target fundamental frequency parameters may be used to update the basic fundamental frequency parameters in the basic acoustic features corresponding to the original voice data.
S232b: Update, with the target spectral parameters, the basic spectral parameters in the basic acoustic features.
Similarly, the target spectral parameters may be used to update the basic spectral parameters in the basic acoustic features corresponding to the original voice data.
S233b: Take the update result as the target acoustic features.
Correspondingly, the update result of the fundamental frequency parameters and spectral parameters may finally be taken as the target acoustic features.
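Steps S231b to S233b can be sketched in the same style, assuming an F0 contour as a list of per-frame values and the spectrum as an opaque parameter list; when no target is supplied for one of the two, the basic value passes through unchanged. Names and shapes are hypothetical.

```python
# Sketch of S231b-S233b: replace the base F0 contour and/or base spectrum
# with the target parameters; untargeted features pass through unchanged.
def adjust_acoustics(base_f0, base_spectrum, target_f0=None, target_spectrum=None):
    """Return the updated (F0 parameters, spectral parameters) pair."""
    f0 = [target_f0] * len(base_f0) if target_f0 is not None else list(base_f0)
    spectrum = (list(target_spectrum) if target_spectrum is not None
                else list(base_spectrum))
    # S233b: the update result is the target acoustic feature set
    return f0, spectrum

f0, spec = adjust_acoustics([258.0, 263.0, 275.0], [0.1, 0.2], target_f0=260.0)
print(f0)    # [260.0, 260.0, 260.0]
print(spec)  # [0.1, 0.2]
```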
S240: Generate, with a vocoder, special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features.
In embodiments of the invention, after the target prosodic features and/or target acoustic features are obtained, a vocoder may be used to synthesize the target prosodic features and/or target acoustic features, thereby generating the special-effect synthesized voice corresponding to the original voice data.
In an optional embodiment of the invention, generating special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features may include one of the following:
generating special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and the basic acoustic features;
generating special-effect synthesized voice corresponding to the original voice data according to the basic prosodic features and the target acoustic features; or
generating special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and the target acoustic features.
It should be noted that, for different special effects, the feature adjustment parameter corresponding to some special effects may involve both a target prosodic feature adjustment parameter and a target acoustic feature adjustment parameter, while other special effects may involve only a target prosodic feature adjustment parameter or only a target acoustic feature adjustment parameter. Therefore, when generating the special-effect synthesized voice corresponding to the original voice data, both target prosodic features and target acoustic features may be involved, or only target prosodic features or target acoustic features. Specifically, when target prosodic features and target acoustic features are both obtained, they may be combined to obtain the corresponding special-effect synthesized voice; when only target prosodic features are obtained, they may be combined with the basic acoustic features; and when only target acoustic features are obtained, they may be combined with the basic prosodic features.
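The three combination cases above reduce to a simple selection rule: each target feature set replaces its basic counterpart when present. A minimal sketch, with hypothetical names:

```python
# Sketch of the three combination cases: each target feature set, when
# present, replaces its basic counterpart before vocoder synthesis.
def pick_features(base_prosody, base_acoustics,
                  target_prosody=None, target_acoustics=None):
    """Return the (prosody, acoustics) pair to hand to the vocoder."""
    prosody = target_prosody if target_prosody is not None else base_prosody
    acoustics = target_acoustics if target_acoustics is not None else base_acoustics
    return prosody, acoustics

print(pick_features("P_base", "A_base", target_acoustics="A_target"))
# ('P_base', 'A_target')
```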
In addition, it should be noted that different prosodic feature models and acoustic feature models corresponding to the special effects may also be used to adjust the basic prosodic features and/or basic acoustic features so as to obtain the target prosodic features and/or target acoustic features.
In a specific example, take machine voice as the special effect of the synthesized voice. Suppose the text data corresponding to the original voice data is: "Hello, nice to meet you" (你好，很高兴认识你). The basic prosodic features matching the text data are: 1) ni3 hao3 hen3 gao1 xing4 ren4 shi1 ni3; 2) 你好#2很高兴认识你#3. The first feature is the basic pronunciation feature corresponding to the text characters, and the second feature is the basic pause values between every two text characters in the text data. The basic fundamental frequency parameters in the basic acoustic features matching the text data are ... 258 263 275 .... Correspondingly, suppose the mapping to the target prosodic feature adjustment parameter corresponding to the machine-voice effect is that the target tone adjustment value is uniformly the first tone and the target pause value is uniformly level 2, and the mapping to the target acoustic feature adjustment parameter is that the target fundamental frequency parameter is a fixed value of 260 while the spectral parameters remain unchanged. Then the target prosodic features of the final special-effect synthesized voice are: 1) ni1 hao1 hen1 gao1 xing1 ren1 shi1 ni1; 2) 你好#2很高兴认识你#2. The target fundamental frequency parameters included in the target acoustic features are ... 260 260 260 ..., and the target spectral parameters included in the target acoustic features are consistent with the basic spectral parameters and need no adjustment. Finally, the vocoder may combine and synthesize the target prosodic features and target acoustic features corresponding to the machine voice, thereby generating the final machine-voice special-effect synthesized voice.
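The machine-voice example can be coded up as a self-contained sketch. The data layout follows the example text (pinyin syllables ending in a tone digit, #-prefixed pause levels, an F0 sequence); everything else is an assumption.

```python
# The machine-voice example as code: tones -> first tone, pauses -> level 2,
# F0 fixed at 260. Data layout follows the example; names are hypothetical.
syllables = ["ni3", "hao3", "hen3", "gao1", "xing4", "ren4", "shi1", "ni3"]
pause_levels = {"#2": 2, "#3": 3}   # pause markers from 你好#2很高兴认识你#3
base_f0 = [258, 263, 275]           # excerpt of the base F0 parameters

# Target prosodic features: unify every tone digit to 1, every pause to level 2.
target_syllables = [s[:-1] + "1" for s in syllables]
target_pauses = {marker: 2 for marker in pause_levels}

# Target acoustic features: fix F0 at 260; spectrum is left unchanged.
target_f0 = [260 for _ in base_f0]

print(target_syllables[:3])  # ['ni1', 'hao1', 'hen1']
print(target_f0)             # [260, 260, 260]
```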
It should be noted that Fig. 2 is only a schematic diagram of one implementation. There is no ordering relation between S231a-S233a and S231b-S233b: S231a-S233a may be performed first and then S231b-S233b, or the reverse; the two may also be performed in parallel, or only one of them may be performed.
With the above technical solution, the feature adjustment parameter corresponding to the required special effect is determined according to the pre-established mapping relation between special effects and their corresponding feature adjustment parameters, and the basic prosodic features and/or basic acoustic features are adjusted according to the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; finally, special-effect synthesized voice corresponding to the original voice data is generated from the obtained target prosodic features and/or target acoustic features. This enables multiple special effects to be adjusted and synthesized on demand and solves the problem that the synthesized voice of existing speech synthesis systems is monotonous, thereby satisfying the diversity requirements of special-effect voice.
It should be noted that any combination of the technical features in the above embodiments also falls within the protection scope of the present invention.
Embodiment 3
Fig. 3 is a schematic diagram of a voice special effect synthesis apparatus provided by Embodiment 3 of the invention. As shown in Fig. 3, the apparatus includes: a data acquisition module 310, a parameter acquisition module 320, a parameter adjustment module 330, and a voice synthesis module 340, in which:
the data acquisition module 310 is configured to acquire the text data corresponding to original voice data and to acquire basic prosodic features and basic acoustic features matching the text data;
the parameter acquisition module 320 is configured to acquire, according to a pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to the required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter;
the parameter adjustment module 330 is configured to adjust the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and
the voice synthesis module 340 is configured to generate special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and/or target acoustic features.
The embodiment of the invention obtains the text data corresponding to original voice data together with the basic prosodic features and basic acoustic features matching the text data; obtains, according to the pre-established mapping relation between at least two special effects and their corresponding feature adjustment parameters, the feature adjustment parameter corresponding to the required special effect; adjusts the basic prosodic features and/or basic acoustic features with the feature adjustment parameter to obtain target prosodic features and/or target acoustic features; and finally generates special-effect synthesized voice corresponding to the original voice data from the obtained target prosodic features and/or target acoustic features. Multiple special effects can thus be adjusted and synthesized on demand, solving the problem that the synthesized voice of existing speech synthesis systems is monotonous and thereby satisfying the diversity requirements of special-effect voice.
Optionally, the basic prosodic features include: basic pronunciation features corresponding to the text characters in the text data, and basic pause values between every two adjacent text characters in the text data; the basic pronunciation features include pinyin elements corresponding to the text characters and tone values corresponding to the pinyin elements. The target prosodic feature adjustment parameter includes: a target tone adjustment value and a target pause value.
Optionally, the parameter adjustment module 330 is specifically configured to: update, using the target tone adjustment value, the tone value corresponding to each pinyin element in the basic prosodic features; update, using the target pause value, the basic pause values between every two adjacent text characters in the text data of the basic prosodic features; and take the updated result as the target prosodic features.
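A minimal sketch of this prosody-adjustment step, assuming each syllable is a (pinyin, tone) pair with Mandarin tone values 0–4 (0 for the neutral tone) and each inter-character pause is a duration in seconds; the field layout and the clamping of tones are assumptions, not details from the patent:

```python
def adjust_prosody(syllables, pauses, tone_shift, target_pause):
    """Update tone values with the target tone adjustment value, and replace
    the basic inter-character pause values with the target pause value."""
    adjusted = [
        (pinyin, max(0, min(4, tone + tone_shift)))  # clamp to tone range 0-4
        for pinyin, tone in syllables
    ]
    # Every inter-character pause is replaced by the target pause value.
    return adjusted, [target_pause] * len(pauses)
```

For example, `adjust_prosody([("ni", 3), ("hao", 3)], [0.05], tone_shift=1, target_pause=0.2)` raises both tones to 4 and sets the single pause to 0.2 s.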
Optionally, the basic acoustic features include: basic fundamental frequency parameters and basic spectral parameters corresponding to the original voice data; the target acoustic feature adjustment parameter includes: target fundamental frequency parameters and target spectral parameters.

Optionally, the parameter adjustment module 330 is specifically configured to: update the basic fundamental frequency parameters in the basic acoustic features using the target fundamental frequency parameters; update the basic spectral parameters in the basic acoustic features using the target spectral parameters; and take the updated result as the target acoustic features.
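A hedged sketch of this acoustic-adjustment step, assuming the fundamental frequency (F0) is a per-frame track in Hz with 0 marking unvoiced frames, and the spectral parameters are per-frame coefficient lists; the multiplicative update is an illustrative choice, not dictated by the patent:

```python
def adjust_acoustics(base_f0, base_spectrum, f0_scale=1.0, spec_gain=1.0):
    """Derive target F0 and spectral parameters from the basic acoustic
    features by applying the target adjustment factors."""
    # Scale voiced frames only; unvoiced frames stay at 0 Hz.
    target_f0 = [f * f0_scale if f > 0 else 0.0 for f in base_f0]
    target_spectrum = [[c * spec_gain for c in frame] for frame in base_spectrum]
    return target_f0, target_spectrum
```

Halving `f0_scale`, for instance, lowers the perceived pitch by roughly an octave while leaving the unvoiced frames untouched.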
Optionally, the voice synthesis module 340 is specifically configured to generate the special-effect synthesized voice corresponding to the original voice data according to one of: the target prosodic features and the basic acoustic features; the basic prosodic features and the target acoustic features; or the target prosodic features and the target acoustic features.
Optionally, the voice synthesis module 340 is specifically configured to use a vocoder to generate, according to the target prosodic features and/or the target acoustic features, the special-effect synthesized voice corresponding to the original voice data.
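The three synthesis variants above amount to choosing which feature pair is handed to the vocoder; a minimal dispatch sketch (the mode names are hypothetical labels, not terms from the patent):

```python
def select_vocoder_inputs(mode, basic_prosody, basic_acoustics,
                          target_prosody, target_acoustics):
    """Pick the prosodic/acoustic feature pair fed to the vocoder for one of
    the three synthesis variants described above."""
    if mode == "prosody_only":    # target prosody + basic acoustics
        return target_prosody, basic_acoustics
    if mode == "acoustics_only":  # basic prosody + target acoustics
        return basic_prosody, target_acoustics
    if mode == "both":            # target prosody + target acoustics
        return target_prosody, target_acoustics
    raise ValueError(f"unknown synthesis mode {mode!r}")
```

Keeping the selection separate from the vocoder call lets the same vocoder back end serve all three variants.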
The voice special effect synthesis apparatus described above can perform the voice special effect synthesis method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method. For technical details not described in this embodiment, reference may be made to the voice special effect synthesis method provided by any embodiment of the present invention.

Since the voice special effect synthesis apparatus introduced in this embodiment is an apparatus capable of executing the voice special effect synthesis method in the embodiments of the present invention, those skilled in the art can, based on the method described in the embodiments of the present invention, understand the specific implementations and variations of the apparatus of this embodiment; therefore, how the apparatus implements the method is not described in detail here. Any apparatus used by those skilled in the art to implement the voice special effect synthesis method in the embodiments of the present invention falls within the scope of protection of the present application.
Embodiment Four
Fig. 4 is a structural schematic diagram of an electronic device provided by Embodiment Four of the present invention. As shown in Fig. 4, the electronic device includes: at least one processor 41; at least one memory 42 connected to the processor 41; and a bus 43; wherein,

the processor 41 and the memory 42 complete mutual communication through the bus 43;

the processor 41 is configured to call program instructions in the memory 42 to execute the steps in the above voice special effect synthesis method embodiments. For example, the processor 41 executes: obtaining text data corresponding to original voice data, and obtaining basic prosodic features and basic acoustic features matching the text data; obtaining, according to pre-established mapping relations between at least two special effects and corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter including: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter; adjusting the basic prosodic features and/or the basic acoustic features using the feature adjustment parameter, to obtain target prosodic features and/or target acoustic features; and generating, according to the target prosodic features and/or the target acoustic features, special-effect synthesized voice corresponding to the original voice data.
Embodiment Five

This embodiment provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the voice special effect synthesis method provided by any of the above method embodiments: obtaining text data corresponding to original voice data, and obtaining basic prosodic features and basic acoustic features matching the text data; obtaining, according to pre-established mapping relations between at least two special effects and corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter including: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter; adjusting the basic prosodic features and/or the basic acoustic features using the feature adjustment parameter, to obtain target prosodic features and/or target acoustic features; and generating, according to the target prosodic features and/or the target acoustic features, special-effect synthesized voice corresponding to the original voice data.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, CD-ROM (Compact Disc Read-Only Memory), optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.

The memory may include non-volatile memory in a computer-readable medium, random access memory (RAM), and/or other forms such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random-access memory (SRAM), dynamic random-access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission media, and can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
The above are only embodiments of the present application and are not intended to limit the present application. For those skilled in the art, various modifications and variations of the present application are possible. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included within the scope of the claims of the present application.

Claims (10)

1. A voice special effect synthesis method, characterized by comprising:

obtaining text data corresponding to original voice data, and obtaining basic prosodic features and basic acoustic features matching the text data;

obtaining, according to pre-established mapping relations between at least two special effects and corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter;

adjusting the basic prosodic features and/or the basic acoustic features using the feature adjustment parameter, to obtain target prosodic features and/or target acoustic features;

generating, according to the target prosodic features and/or the target acoustic features, special-effect synthesized voice corresponding to the original voice data.
2. The method according to claim 1, characterized in that:

the basic prosodic features comprise: basic pronunciation features corresponding to the text characters in the text data, and basic pause values between every two adjacent text characters in the text data, the basic pronunciation features comprising pinyin elements corresponding to the text characters and tone values corresponding to the pinyin elements;

the target prosodic feature adjustment parameter comprises: a target tone adjustment value and a target pause value.
3. The method according to claim 2, characterized in that adjusting the basic prosodic features using the target prosodic feature adjustment parameter to obtain the target prosodic features comprises:

updating, using the target tone adjustment value, the tone value corresponding to each pinyin element in the basic prosodic features;

updating, using the target pause value, the basic pause values between every two adjacent text characters in the text data of the basic prosodic features;

taking the updated result as the target prosodic features.
4. The method according to claim 1, characterized in that:

the basic acoustic features comprise: basic fundamental frequency parameters and basic spectral parameters corresponding to the original voice data;

the target acoustic feature adjustment parameter comprises: target fundamental frequency parameters and target spectral parameters.
5. The method according to claim 4, characterized in that adjusting the basic acoustic features using the target acoustic feature adjustment parameter to obtain the target acoustic features comprises:

updating the basic fundamental frequency parameters in the basic acoustic features using the target fundamental frequency parameters;

updating the basic spectral parameters in the basic acoustic features using the target spectral parameters;

taking the updated result as the target acoustic features.
6. The method according to any one of claims 1 to 5, characterized in that generating, according to the target prosodic features and/or the target acoustic features, the special-effect synthesized voice corresponding to the original voice data comprises one of the following:

generating the special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and the basic acoustic features;

generating the special-effect synthesized voice corresponding to the original voice data according to the basic prosodic features and the target acoustic features;

generating the special-effect synthesized voice corresponding to the original voice data according to the target prosodic features and the target acoustic features.
7. The method according to any one of claims 1 to 5, characterized in that generating, according to the target prosodic features and/or the target acoustic features, the special-effect synthesized voice corresponding to the original voice data comprises:

using a vocoder to generate, according to the target prosodic features and/or the target acoustic features, the special-effect synthesized voice corresponding to the original voice data.
8. A voice special effect synthesis apparatus, characterized by comprising:

a data acquisition module, configured to obtain text data corresponding to original voice data, and to obtain basic prosodic features and basic acoustic features matching the text data;

a parameter acquisition module, configured to obtain, according to pre-established mapping relations between at least two special effects and corresponding feature adjustment parameters, the feature adjustment parameter corresponding to a required special effect, the feature adjustment parameter comprising: a target prosodic feature adjustment parameter and/or a target acoustic feature adjustment parameter;

a parameter adjustment module, configured to adjust the basic prosodic features and/or the basic acoustic features using the feature adjustment parameter, to obtain target prosodic features and/or target acoustic features;

a voice synthesis module, configured to generate, according to the target prosodic features and/or the target acoustic features, special-effect synthesized voice corresponding to the original voice data.
9. An electronic device, characterized by comprising:

at least one processor;

at least one memory connected to the processor; and a bus; wherein,

the processor and the memory complete mutual communication through the bus;

the processor is configured to call program instructions in the memory to execute the voice special effect synthesis method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the voice special effect synthesis method according to any one of claims 1 to 7.
CN201811413566.1A 2018-11-23 2018-11-23 Voice special effect synthesis method and device, electronic equipment and storage medium Active CN109285536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811413566.1A CN109285536B (en) 2018-11-23 2018-11-23 Voice special effect synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811413566.1A CN109285536B (en) 2018-11-23 2018-11-23 Voice special effect synthesis method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109285536A true CN109285536A (en) 2019-01-29
CN109285536B CN109285536B (en) 2022-05-13

Family

ID=65172650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811413566.1A Active CN109285536B (en) 2018-11-23 2018-11-23 Voice special effect synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109285536B (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050267758A1 (en) * 2004-05-31 2005-12-01 International Business Machines Corporation Converting text-to-speech and adjusting corpus
CN103035251A (en) * 2011-09-30 2013-04-10 西门子公司 Method for building voice transformation model and method and system for voice transformation
CN104916284A (en) * 2015-06-10 2015-09-16 百度在线网络技术(北京)有限公司 Prosody and acoustics joint modeling method and device for voice synthesis system
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108140393A (en) * 2016-09-28 2018-06-08 华为技术有限公司 A kind of methods, devices and systems for handling multi-channel audio signal
CN112672259A (en) * 2019-10-16 2021-04-16 北京地平线机器人技术研发有限公司 Loudspeaker control method and device
CN112672259B (en) * 2019-10-16 2023-03-10 北京地平线机器人技术研发有限公司 Loudspeaker control method and device

Also Published As

Publication number Publication date
CN109285536B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
US10410621B2 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
CN105244020B (en) Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN106688034B (en) Text-to-speech conversion with emotional content
JP5293460B2 (en) Database generating apparatus for singing synthesis and pitch curve generating apparatus
JP5471858B2 (en) Database generating apparatus for singing synthesis and pitch curve generating apparatus
CN109285537A (en) Acoustic model foundation, phoneme synthesizing method, device, equipment and storage medium
CN104916284A (en) Prosody and acoustics joint modeling method and device for voice synthesis system
WO2022121176A1 (en) Speech synthesis method and apparatus, electronic device, and readable storage medium
US8380508B2 (en) Local and remote feedback loop for speech synthesis
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
CN110176237A (en) A kind of audio recognition method and device
US20180232363A1 (en) System and method for audio dubbing and translation of a video
CN108766413A (en) Phoneme synthesizing method and system
JP6680933B2 (en) Acoustic model learning device, speech synthesis device, acoustic model learning method, speech synthesis method, program
CN110599998A (en) Voice data generation method and device
US20210366461A1 (en) Generating speech signals using both neural network-based vocoding and generative adversarial training
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
CN109285536A (en) Voice special effect synthesis method and device, electronic equipment and storage medium
CN111785248A (en) Text information processing method and device
US9159329B1 (en) Statistical post-filtering for hidden Markov modeling (HMM)-based speech synthesis
US8781835B2 (en) Methods and apparatuses for facilitating speech synthesis
JP6082657B2 (en) Pose assignment model selection device, pose assignment device, method and program thereof
CN111048065B (en) Text error correction data generation method and related device
JP2013164609A (en) Singing synthesizing database generation device, and pitch curve generation device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220414

Address after: 210034 floor 8, building D11, Hongfeng Science Park, Nanjing Economic and Technological Development Zone, Jiangsu Province

Applicant after: New Technology Co.,Ltd.

Address before: 100080 Room 501, 5th floor, NO.67, North Fourth Ring Road West, Haidian District, Beijing

Applicant before: Beijing Yufanzhi Information Technology Co.,Ltd.

GR01 Patent grant