CN107507619A - Voice conversion method, apparatus, electronic device and readable storage medium - Google Patents
Voice conversion method, apparatus, electronic device and readable storage medium
- Publication number
- CN107507619A CN107507619A CN201710812770.XA CN201710812770A CN107507619A CN 107507619 A CN107507619 A CN 107507619A CN 201710812770 A CN201710812770 A CN 201710812770A CN 107507619 A CN107507619 A CN 107507619A
- Authority
- CN
- China
- Prior art keywords
- target
- converted
- frame unit
- voice
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L21/013 — Adapting to target pitch (under G10L21/00 speech or voice signal processing techniques to produce another audible or non-audible signal in order to modify its quality or its intelligibility; G10L21/003 changing voice quality, e.g. pitch or formants; G10L21/007 characterised by the process used)
- G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser (under G10L13/00 speech synthesis, text-to-speech systems; G10L13/02 methods for producing synthetic speech)
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2021/0135 — Voice conversion or morphing
Abstract
The present invention provides a voice conversion method, apparatus, electronic device and readable storage medium. The method includes: segmenting the speech to be converted into multiple frame units to be converted based on a preset segmentation rule; extracting the mel-cepstral feature of each frame unit to be converted; computing multiple candidate frame units according to a phoneme dictionary and the mel-cepstral feature of each frame unit to be converted; matching target frame units according to the correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker; computing conversion costs to obtain the optimal path; and processing the target frame units on the optimal path to obtain the target speech. Because the candidate frame units are computed within a phoneme dictionary rather than by searching the entire feature dictionary as in the prior art, the method saves computing resources and increases computation speed; at the same time, by improving the traditional single-frame computation into a multi-frame computation, it significantly alleviates the technical problems of discontinuous synthesized speech and poor sound quality.
Description
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a voice conversion method, apparatus, electronic device and readable storage medium.
Background art
After nearly half a century of development, speech synthesis technology has achieved great success and plays an extremely important role in fields such as artificial intelligence. TTS (Text-to-Speech) is the technology of converting text, whether generated by a computer or input from outside, into intelligible and fluent spoken output. However, TTS-synthesized speech generally suffers from two problems: first, the timbre is confined to a small number of announcer samples and cannot meet the demand for personalization; second, the prosody is unnatural and synthesis artifacts are obvious.
Timbre conversion (also known as voice conversion) is the technology of converting the current speaker's timbre directly into the output speaker's timbre without changing the speech content; its advantages are natural prosody and good preservation of the personalized timbre. At present, voice conversion based on speech-feature-dictionary lookup is the mainstream method among non-parametric voice conversion techniques. The idea of this method is as follows: 1. extract features from the source speech corpus and the target speech corpus, build feature dictionaries, and perform parallel training to obtain mapping rules; 2. extract the feature vectors of the speech to be converted and, according to the mapping rules, find the k-nearest-neighbour target feature vectors in the target feature dictionary for each feature vector; 3. compute target costs and concatenation costs, and search for the optimal path in the k-nearest-neighbour feature matrix using the Viterbi algorithm; 4. concatenate the selected target speech feature vectors and convert them back into speech. The shortcoming of this method is that each k-nearest-neighbour lookup must traverse the entire target feature dictionary, so computation is slow and the demands on system performance are high. Moreover, the concatenation cost is computed with the single frame as its unit, which ignores the smoothness of speech across frames; the resulting loss of contextual cues makes the synthesized speech discontinuous and greatly degrades the speech quality.
Summary of the invention
To overcome the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a voice conversion method, apparatus, electronic device and readable storage medium that can guarantee the continuity of the synthesized speech while ensuring that spectral detail is not lost.
The object of the first aspect of the present invention is to provide a voice conversion method, the method including:
segmenting the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
extracting the mel-cepstral feature of each frame unit to be converted;
computing multiple candidate frame units according to a previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted;
matching the target frame units corresponding to the candidate frame units according to a previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker;
computing conversion costs to obtain the optimal path for converting the speech to be converted into the target-timbre speaker's speech;
processing the target frame units on the optimal path to obtain the target speech, in the target-timbre speaker's voice, corresponding to the speech to be converted.
Optionally, the method further includes preprocessing the speech data;
the step of preprocessing the speech data includes:
segmenting, with the preset segmentation rule, the source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker, to obtain multiple frame units of the source speech and multiple frame units of the target speech;
extracting the mel-cepstral features of the source speech and the target speech, and building a source speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the source speech and the frame units of the target speech;
classifying the source speech feature dictionary according to annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental-frequency (f0) features of the source speech and the target speech, and computing the f0 mean and f0 variance;
establishing the f0 mapping relation between the source speaker and the target-timbre speaker according to the f0 mean and f0 variance.
The object of the second aspect of the present invention is to provide a voice conversion apparatus, the apparatus including:
a segmentation module, configured to segment the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
an extraction module, configured to extract the mel-cepstral feature of each frame unit to be converted;
a computing module, configured to compute multiple candidate frame units according to a previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted;
a matching module, configured to match the target frame units corresponding to the candidate frame units according to a previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker;
the computing module, further configured to compute conversion costs and obtain the optimal path for converting the speech to be converted into the target-timbre speaker's speech;
a processing module, configured to process the target frame units on the optimal path and obtain the target speech, in the target-timbre speaker's voice, corresponding to the speech to be converted.
Optionally, the apparatus further includes a preprocessing module;
the preprocessing module preprocesses the speech data by:
segmenting, with the preset segmentation rule, the source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker, to obtain multiple frame units of the source speech and multiple frame units of the target speech;
extracting the mel-cepstral features of the source speech and the target speech, and building a source speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the source speech and the frame units of the target speech;
classifying the source speech feature dictionary according to annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental-frequency (f0) features of the source speech and the target speech, and computing the f0 mean and f0 variance;
establishing the f0 mapping relation between the source speaker and the target-timbre speaker according to the f0 mean and f0 variance.
The object of the third aspect of the present invention is to provide an electronic device, the electronic device including a processor and a memory, the memory being coupled to the processor and storing instructions that, when executed by the processor, cause the electronic device to perform the voice conversion method of the first aspect of the present invention.
The object of the fourth aspect of the present invention is to provide a readable storage medium, the readable storage medium including a computer program that, when run, controls the electronic device on which the readable storage medium resides to perform the voice conversion method of the first aspect of the present invention.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention provides a voice conversion method, apparatus, electronic device and readable storage medium. The method includes: segmenting the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule; extracting the mel-cepstral feature of each frame unit to be converted; computing multiple candidate frame units according to the previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted; matching the target frame units corresponding to the candidate frame units according to the previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker; computing conversion costs to obtain the optimal path for converting the speech to be converted into the target-timbre speaker's speech; and processing the target frame units on the optimal path to obtain the target speech, in the target-timbre speaker's voice, corresponding to the speech to be converted. Because the candidate frame units are computed within the phoneme dictionary of the source speaker rather than by searching the entire feature dictionary as in the prior art, the method saves computing resources and increases computation speed; at the same time, by improving the traditional single-frame computation into a multi-frame computation, it significantly alleviates the technical problems of discontinuous synthesized speech and poor sound quality.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and are therefore not to be regarded as limiting its scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
Fig. 1 is the block diagram of the electronic device provided by an embodiment of the present invention.
Fig. 2 is a flow chart of the steps of the voice conversion method provided by the first embodiment of the present invention.
Fig. 3 is another flow chart of the steps of the voice conversion method provided by the first embodiment of the present invention.
Fig. 4 is the sub-step flow chart of step S170 in Fig. 3.
Fig. 5 is a schematic diagram of the frame unit structure.
Fig. 6 is a schematic diagram of adding one frame unit to multiple corresponding phoneme sets simultaneously.
Fig. 7 is a schematic diagram of the Viterbi path search provided by an embodiment of the present invention.
Fig. 8 is the sub-step flow chart of step S160 in Fig. 2 or Fig. 3.
Fig. 9 is the structural block diagram of the voice conversion apparatus provided by the second embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention, as generally described and illustrated in the drawings here, may be arranged and designed in a variety of configurations.
Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the present invention, but merely represents selected embodiments of it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Referring to Fig. 1, it is a block diagram of an electronic device 100 provided by a preferred embodiment of the present invention. The electronic device 100 may include a voice conversion apparatus 300, a memory 111, a storage controller 112 and a processor 113.
The memory 111, the storage controller 112 and the processor 113 are electrically connected to one another, directly or indirectly, to enable the transmission of and interaction with data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The voice conversion apparatus 300 may include at least one software function module that can be stored in the memory 111 in the form of software or firmware, or solidified in the operating system (OS) of the electronic device 100. The processor 113 is configured to execute the executable modules stored in the memory 111, such as the software function modules and computer programs included in the voice conversion apparatus 300.
The memory 111 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like. The memory 111 is configured to store a program, and the processor 113 executes the program after receiving an execution instruction. Access to the memory 111 by the processor 113, and by other possible components, may be carried out under the control of the storage controller 112.
The processor 113 may be an integrated circuit chip with signal processing capability. The processor 113 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can implement or perform the methods, steps and logic diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
First embodiment
Referring to Fig. 2, Fig. 2 is a flow chart of the steps of the voice conversion method provided by a preferred embodiment of the present invention. The method is applied to the electronic device 100 described above; the steps of the voice conversion method are described in detail below.
Step S110: segment the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule.
In this embodiment, the speech range to be converted may be selected by manual annotation; optionally, an automatic speech annotation tool may be called to select the speech to be converted from the speech of the source speaker.
After the annotated speech to be converted is obtained, it is segmented with the preset segmentation rule so that each frame unit after segmentation includes multiple consecutive speech frames.
Step S120: extract the mel-cepstral feature of each frame unit to be converted.
In this embodiment, step S120 includes:
performing a time-to-frequency-domain transform on each frame unit to be converted to obtain its spectral information;
extracting the mel-cepstral feature of the frame unit with a mel filter bank.
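The mel filter bank referred to above spaces its filters uniformly on the mel scale. As a minimal sketch (assuming the common HTK-style mel formula, which the patent does not specify, and with all names illustrative), the Hz/mel conversion and the centre frequencies of a toy filter bank can be written as:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centres(n_filters: int, f_min: float, f_max: float) -> list:
    """Centre frequencies (Hz) of n_filters triangular filters spaced
    uniformly on the mel scale between f_min and f_max."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_filters centres require n_filters + 2 equally spaced mel points
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]

centres = mel_filter_centres(26, 0.0, 8000.0)
```

In a full pipeline, the log energies of these filters applied to each frame's spectrum would then be decorrelated (e.g. by a DCT) to give the mel-cepstral coefficients.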
Step S130: compute multiple candidate frame units according to the previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted.
Step S130 may include the following sub-steps:
composing the feature vector of each frame unit to be converted from its mel-cepstral features;
computing, and sorting by, the Euclidean distance between the feature vector of each frame unit to be converted and the feature vector of each frame unit in the phoneme dictionary;
filtering out, with the k-nearest-neighbour algorithm, the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary.
The k-nearest-neighbour algorithm is a classification algorithm: if most of the k samples most similar to a given sample in feature space (i.e., its nearest neighbours in that space) belong to some category, the sample is assigned to that category as well.
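A minimal sketch of the candidate search described above. The bucketed dictionary layout and all names are assumptions, not the patent's data structures; the point it illustrates is that the k-nearest-neighbour ranking runs only over the frame units filed under the relevant phoneme, not over the whole feature dictionary:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_candidates(query_vec, phoneme_dict, phoneme, k=3):
    """Return the k entries (unit_id, feature_vector) closest to query_vec,
    searching only the bucket for the given phoneme rather than the
    whole dictionary."""
    bucket = phoneme_dict[phoneme]          # list of (unit_id, feature_vector)
    ranked = sorted(bucket, key=lambda e: euclidean(query_vec, e[1]))
    return ranked[:k]

# Toy dictionary: two phoneme buckets of 2-D "mel-cepstral" vectors.
phoneme_dict = {
    "a": [("a0", (0.0, 0.0)), ("a1", (1.0, 1.0)), ("a2", (5.0, 5.0))],
    "i": [("i0", (9.0, 9.0))],
}
cands = knn_candidates((0.9, 1.1), phoneme_dict, "a", k=2)
# cands holds the two bucket entries nearest the query
```

With buckets of average size B instead of a dictionary of size D, each lookup costs O(B) distance computations rather than O(D), which is the speed-up the patent claims.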
Step S140: match the target frame units corresponding to the candidate frame units according to the previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker.
Referring to Fig. 3, in this embodiment the method also includes step S170.
Step S170: preprocess the speech data.
The source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker are trained in parallel, to establish both the correspondence between the frame units of the source speaker and those of the target-timbre speaker and the f0 mapping relation between the two speakers. Because the source speech and the target speech are trained in parallel in this process, their contents are required to correspond one to one and to be consistent.
Referring to Fig. 4, in this embodiment step S170 includes the following sub-steps.
Sub-step S171: segment, with the preset segmentation rule, the source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker, obtaining multiple frame units of the source speech and multiple frame units of the target speech.
In this embodiment, in order to establish the mapping relations between the source speech and the target speech, parallel training is required: the source corpus and the target corpus must have consistent content and sufficient duration, and, to guarantee the effect of the timbre conversion, they must contain enough of all the phoneme information.
Referring to Fig. 5, in this embodiment, taking into account the smooth connection between frame units and the contextual cues of the speech, this scheme selects an odd number q = 2p + 1 of consecutive frames as one frame unit; its centre frame is the (p+1)-th frame of the unit, with p frames before and after it, and two adjacent frame units overlap by 2p frames. It should be understood that the preset segmentation rule used in sub-step S171 is the same as the one used in step S110.
For the source speech, the frame sequence can be expressed as X = [x(1), x(2), x(3), ..., x(n), ..., x(N)], and the n-th unit can be expressed as x(n) = [x_{n-p}, x_{n-p+1}, ..., x_n, ..., x_{n+p-1}, x_{n+p}], where x_n denotes the n-th frame of the sequence. The same unit-division operation is also applied to the target speech.
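Under the definitions above (units of q = 2p + 1 consecutive frames, centred one frame apart, so adjacent units overlap by 2p frames), the segmentation can be sketched as follows; the function name and the use of plain Python lists for frames are illustrative assumptions:

```python
def split_into_units(frames, p):
    """Split a frame sequence into overlapping units of q = 2p + 1
    consecutive frames. Unit n is centred on frame index n (for
    n = p .. len(frames) - 1 - p), so adjacent units share 2p frames."""
    return [frames[n - p: n + p + 1] for n in range(p, len(frames) - p)]

frames = list(range(7))           # stand-in for 7 speech frames
units = split_into_units(frames, p=1)
# units[0] == [0, 1, 2]; units[1] == [1, 2, 3]; each unit spans 3 frames
```

Note that the first and last p frames only ever appear as context, never as a unit centre; a real system might pad the sequence at both ends to avoid this edge effect.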
Sub-step S172: extract the mel-cepstral features of the source speech and the target speech, and build the source speech feature dictionary and the target speech feature dictionary.
In this embodiment, the spectral information of each frame is obtained by a fast Fourier transform (FFT), and the mel-cepstral feature is then extracted through a mel filter bank. The source speech feature dictionary and the target speech feature dictionary are built from the extracted mel-cepstral features.
Sub-step S173: establish the correspondence between the frame units of the source speech and the frame units of the target speech.
In this embodiment, the correspondence between source speech frames and target speech frames is established with the DTW (Dynamic Time Warping) algorithm. The correspondence between the source speech and the target speech can be expressed as Z = [z_1, z_2, ..., z_l, ..., z_L], where each z_l is a pairing of a frame unit of the source speech with a frame unit of the target speech. Establishing this correspondence provides the basis for looking up the target-speech frame units from the source-speech frame units during the timbre-conversion stage.
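A minimal DTW alignment in the spirit of sub-step S173, sketched over one-dimensional features with the standard three-direction recursion (the patent does not fix the local path constraints or the frame distance, so both are assumptions here):

```python
def dtw_align(src, tgt, dist=lambda a, b: abs(a - b)):
    """Return (total_cost, path) aligning sequence src to sequence tgt,
    where path is a list of (i, j) frame pairings in order."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # src frame repeated
                                 cost[i][j - 1],       # tgt frame repeated
                                 cost[i - 1][j - 1])   # one-to-one match
    # Backtrack greedily along the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    path.reverse()
    return cost[n][m], path

total, path = dtw_align([1, 2, 3, 3], [1, 3, 3])
```

Each (i, j) pairing in `path` corresponds to one entry z_l of the correspondence Z above; a production system would align the mel-cepstral vectors of whole frame units rather than scalars.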
Sub-step S174: classify the source speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionary.
In this embodiment, each piece of phoneme information in the source speech is annotated in advance, and each frame unit of the source speech is filed into a phoneme dictionary according to its position in the source speech. Referring to Fig. 6, because a frame unit includes multiple consecutive frames, one frame unit may span two (or more) phoneme sets; in order to guarantee the quality of the conversion, such a frame unit is added to every corresponding phoneme dictionary simultaneously.
Obtaining the phoneme dictionary by classification, and computing the multiple candidate frame units within the phoneme dictionary, saves computing resources and increases computation speed relative to the prior art, which searches the entire feature dictionary.
Sub-step S175, extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances.
Sub-step S176, establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
In the present embodiment, the excitation of voiced sound is a periodic pulse train, and the frequency of that pulse train is the fundamental frequency. The fundamental frequency is therefore a key feature of speech, and the accuracy of its extraction directly affects how well the synthesized speech preserves the personalized timbre and prosodic rhythm. Statistically, two distributions of the same family (for example, normal distributions) with different statistics (mean, variance) can be converted into each other. Therefore, the fundamental frequency features of the original speech and the target speech are each regarded as normally distributed; once the fundamental frequency means and variances are calculated, a fundamental frequency mapping between the original speech and the target speech can be established. This mapping allows the fundamental frequency feature of the target speech to be obtained from the speech to be converted in the subsequent speech conversion stage.
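The statistics of sub-steps S175 and S176 can be sketched as follows. This is an illustrative sketch; the convention that unvoiced frames carry f0 = 0 is an assumption on our part, not stated in the patent.

```python
import numpy as np

def f0_statistics(f0):
    """Mean and variance of the fundamental frequency over voiced frames
    only (unvoiced frames are assumed to be marked with f0 = 0)."""
    voiced = f0[f0 > 0]
    return float(voiced.mean()), float(voiced.var())

# Treating both speakers' f0 as normally distributed (sub-step S176),
# the four statistics below fully define the mapping between them.
src_f0 = np.array([0.0, 120.0, 130.0, 125.0, 0.0])
tgt_f0 = np.array([0.0, 220.0, 240.0, 230.0, 0.0])
sf0m, sf0v = f0_statistics(src_f0)
tf0m, tf0v = f0_statistics(tgt_f0)
print(sf0m, sf0v, tf0m, tf0v)
```

These four numbers (sf0m, sf0v, tf0m, tf0v) are exactly the quantities consumed by the fundamental frequency conversion formula of sub-step S163.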
Step S150, calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker.
In the present embodiment, step S150 obtains this optimal path in the following manner.
First, the target cost between each frame unit to be converted and each target frame unit is calculated, together with the transfer cost between target frame units at adjacent moments.
Then, according to the calculated target costs and transfer costs, the optimal path is obtained by searching with the Viterbi algorithm.
Optionally, both the target cost between a frame unit to be converted and a target frame unit, and the transfer cost between target frame units at adjacent moments, are calculated using the Euclidean distance. The Viterbi search is then equivalent to a minimum-cost path search over a weighted directed acyclic graph.
The target cost can be calculated as follows:
w(X^(t), Y_k'^(t)) = sqrt( Σ_i Σ_d ( X^(t)(i, d) − Y_k'^(t)(i, d) )^2 )
where w(X^(t), Y_k'^(t)) can be regarded as the weight of a node in the weighted directed acyclic graph, that is, the target cost in the present embodiment. It describes the degree of similarity between the frame unit X^(t) to be converted and the target frame unit Y_k'^(t): the smaller the weight, the more similar the two. Here X^(t)(i, d) and Y_k'^(t)(i, d) denote the d-th dimension of the i-th frame of the unit at time t.
The transfer weights between nodes in the weighted directed acyclic graph are the connection costs. The connection cost describes the degree of similarity between the target frame unit at time t and the target frame unit at time t+1: the smaller the weight, the more similar the two units and the smoother the transition. On this principle, the optimal path can be searched for in the target frame unit matrix. Referring to Fig. 7, each node on the path (formed by the arrowed lines in the figure) is the optimal selection at its moment.
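The minimum-cost path search of step S150 can be sketched as follows. This is an illustrative Viterbi implementation over precomputed cost arrays (array shapes are our assumption; the patent does not specify data layouts):

```python
import numpy as np

def viterbi_path(target_cost, transition_cost):
    """Minimum-cost path through the candidate matrix (step S150).

    target_cost: (T, K) array, node weight of candidate k at time t.
    transition_cost: (T-1, K, K) array, edge weight from candidate k at
    time t to candidate k' at time t+1.
    Returns the list of chosen candidate indices, one per time step.
    """
    T, K = target_cost.shape
    acc = target_cost[0].copy()          # accumulated cost per candidate
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        trans = acc[:, None] + transition_cost[t - 1]   # (K, K)
        back[t] = trans.argmin(axis=0)
        acc = trans.min(axis=0) + target_cost[t]
    # Backtrack the optimal path from the cheapest final candidate
    path = [int(acc.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

tc = np.array([[1.0, 5.0], [5.0, 1.0], [1.0, 5.0]])
trc = np.zeros((2, 2, 2))   # free transitions for this toy example
p = viterbi_path(tc, trc)
print(p)
```

With zero transition costs the search simply picks the cheapest node at each step; with real connection costs it trades node similarity against transition smoothness, as the weighted-graph description above explains.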
Step S160, processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
In the present embodiment, referring to Fig. 8, step S160 can include the following sub-steps.
Sub-step S161, obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech.
Sub-step S162, performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule.
In the present embodiment, because adjacent target frame units overlap by 2p frames, an instantaneous window must be applied for smoothing when joining them into a feature matrix, to ensure the auditory continuity of the speech. The following operation is performed for each target frame unit.
Each frame in the target frame unit is multiplied by a weight coefficient. In the present embodiment the instantaneous window w is expressed with an exponential function:
w = exp(−λ|a|), a = [p, p−1, ..., 0, ..., p−1, p]
where λ is a scalar that adjusts the shape of the instantaneous window w. The larger λ is, the more the center-frame information is emphasized and the neighbouring-frame information is weakened; conversely, the smaller λ is, the more the neighbouring-frame information is taken into account and the center frame is weakened. A suitable λ therefore balances the two. Before windowing, each element of the instantaneous window must be normalized so that the elements sum to 1.
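The instantaneous window of sub-step S162 can be sketched directly from the formula above (an illustrative sketch; the helper name is our own):

```python
import numpy as np

def instantaneous_window(p, lam):
    """Normalized exponential window w = exp(-lam * |a|) over the
    2p+1 frames of a unit, with a = [p, ..., 1, 0, 1, ..., p]
    (sub-step S162).  Larger lam emphasizes the center frame; the
    window is normalized so its elements sum to 1 before windowing."""
    a = np.abs(np.arange(-p, p + 1))
    w = np.exp(-lam * a)
    return w / w.sum()

w = instantaneous_window(p=2, lam=1.0)
print(w)   # symmetric, peaked at the center frame
```

Multiplying each frame of a unit by the corresponding window element, then overlap-adding adjacent units, yields the smoothed feature matrix the embodiment describes.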
Sub-step S163, obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker.
The fundamental frequency mean of the speech to be converted is subtracted from its fundamental frequency sequence; the difference is multiplied by the quotient of the fundamental frequency variance of the target speech and the fundamental frequency variance of the speech to be converted; and the product is added to the fundamental frequency mean of the target speech, yielding the fundamental frequency sequence of the target speech. The fundamental frequency sequence of the target speech can be calculated as:
f0(i) = (sf0(i) − sf0m) × (tf0v / sf0v) + tf0m
where f0(i) is the fundamental frequency sequence of the target speech, sf0(i) is the fundamental frequency sequence of the speech to be converted, sf0m and tf0m are respectively the fundamental frequency means of the speech to be converted and the target speech, and sf0v and tf0v are respectively the fundamental frequency variances of the speech to be converted and the target speech.
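The mean-variance formula of sub-step S163 can be sketched as follows. Note this follows the patent's variance-quotient wording literally; many voice-conversion systems instead scale by the ratio of standard deviations, often in the log-f0 domain. The unvoiced-frame pass-through is our assumption.

```python
import numpy as np

def convert_f0(src_f0, sf0m, sf0v, tf0m, tf0v):
    """f0(i) = (sf0(i) - sf0m) * (tf0v / sf0v) + tf0m  (sub-step S163).
    Unvoiced frames (f0 == 0) are passed through unchanged (assumed)."""
    src_f0 = np.asarray(src_f0, dtype=float)
    out = (src_f0 - sf0m) * (tf0v / sf0v) + tf0m
    out[src_f0 == 0] = 0.0
    return out

f0 = convert_f0([0.0, 120.0, 130.0], sf0m=125.0, sf0v=25.0,
                tf0m=230.0, tf0v=50.0)
print(f0)
```

The converted sequence is centered on the target speaker's mean and spread according to the target speaker's variance, which is exactly the Gaussian-to-Gaussian mapping established in sub-step S176.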
Sub-step S164, converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech.
In the present embodiment, the STRAIGHT toolkit can optionally be called to convert the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech.
Sub-step S165, performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
In the present embodiment, the spectrum of the target speech is converted into the target speech of the target timbre speaker using an inverse Fourier transform.
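The frequency-to-time conversion of sub-step S165 can be illustrated with a bare-bones inverse-FFT plus overlap-add sketch. This is only an illustration of the inverse transform idea using numpy; the embodiment's actual synthesis goes through the STRAIGHT toolkit's vocoder, which also consumes the f0 track.

```python
import numpy as np

def spectrum_to_waveform(frames_spec, hop):
    """Frequency-to-time conversion by inverse real FFT plus overlap-add
    (illustrating sub-step S165).

    frames_spec: (n_frames, n_fft//2 + 1) complex frame spectra.
    hop: frame shift in samples.
    """
    frames = np.fft.irfft(frames_spec, axis=1)           # back to time domain
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    win = np.hanning(frame_len)
    for t, frame in enumerate(frames):
        out[t * hop:t * hop + frame_len] += frame * win  # overlap-add
    return out

spec = np.fft.rfft(np.random.randn(4, 64), axis=1)
wav = spectrum_to_waveform(spec, hop=32)
print(wav.shape)
```

The output length is (n_frames − 1) × hop + frame_len samples; a real vocoder would additionally impose the converted fundamental frequency on the excitation.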
Second embodiment
Referring to Fig. 9, Fig. 9 is a structural block diagram of the voice conversion device 300 provided by the preferred embodiment of the present invention. The voice conversion device 300 includes: a segmentation module 310, an extraction module 320, a computing module 330, a matching module 340 and a processing module 350.
The segmentation module 310 is configured to segment the speech to be converted of the speaker to be converted into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames.
The extraction module 320 is configured to extract the mel cepstrum feature of each frame unit to be converted.
In the present embodiment, the extraction module 320 extracts the mel cepstrum feature of a frame unit to be converted by:
performing a time-to-frequency-domain conversion on the frame unit to be converted to obtain the spectrum information of each frame unit;
extracting the mel cepstrum feature of the frame unit using a mel filter bank.
The computing module 330 is configured to calculate multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted.
In the present embodiment, the computing module 330 calculates the multiple candidate frame units by:
composing the feature vector of each frame unit to be converted from its mel cepstrum features;
calculating and sorting the Euclidean distances between the feature vector of each frame unit to be converted and the feature vectors of the frame units in the phoneme dictionary;
filtering out the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary using the k-nearest-neighbour algorithm.
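The candidate selection performed by the computing module 330 can be sketched as follows (an illustrative sketch with assumed dimensions; the 24-dimensional feature size and function name are our own):

```python
import numpy as np

def knn_candidates(query_vec, dict_vecs, k=4):
    """Select the k nearest frame units from a phoneme dictionary by
    Euclidean distance over mel-cepstral feature vectors (computing
    module 330).  Returns candidate indices, nearest first."""
    dists = np.linalg.norm(dict_vecs - query_vec, axis=1)
    order = np.argsort(dists)
    return order[:k].tolist()

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(50, 24))   # 50 frame units, 24-dim features
query = dictionary[7] + 0.01 * rng.normal(size=24)   # near unit 7
cands = knn_candidates(query, dictionary, k=3)
print(cands)
```

Because the search is restricted to the phoneme dictionary matching the frame unit to be converted, the distance computation runs over far fewer entries than a whole-dictionary search, which is the speed advantage claimed above.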
The matching module 340 is configured to match, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units.
The computing module 330 is further configured to calculate the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker.
In the present embodiment, the computing module 330 obtains this optimal path by:
calculating the target cost between each frame unit to be converted and each target frame unit, and the transfer cost between target frame units at adjacent moments;
searching for the optimal path with the Viterbi algorithm according to the calculated target costs and transfer costs.
The processing module 350 is configured to process the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
In the present embodiment, the processing module 350 processes the target frame units on the optimal path by:
obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech;
performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule;
obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker;
converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech;
performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
Referring again to Fig. 9, in the present embodiment, the voice conversion device 300 further includes a pre-processing module 360.
The pre-processing module 360 pre-processes the speech data by:
segmenting, using the preset segmentation rule, the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker, to obtain the multiple frame units corresponding to the original speech and the multiple frame units corresponding to the target speech;
extracting the mel cepstrum features of the original speech and the target speech, and building the original speech feature dictionary and the target speech feature dictionary;
establishing the correspondence between the frame units of the original speech and the frame units of the target speech;
sorting the original speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionaries;
extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances;
establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
The present invention provides a voice conversion method, device, electronic equipment and readable storage medium. The method includes: segmenting the speech to be converted of a speaker to be converted into multiple frame units to be converted based on a preset segmentation rule; extracting the mel cepstrum feature of each frame unit to be converted; calculating multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted; matching, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units; calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker; and processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted. The method calculates the multiple candidate frame units within the phoneme dictionary of the speaker to be converted, which saves computing resources and improves calculation speed compared with the prior-art approach of searching the entire feature dictionary. At the same time, taking inter-frame smoothness and the contextual information of speech into account, the traditional single-frame calculation is improved into a calculation over units containing multiple frames, and windowed smoothing is performed when joining units, which significantly alleviates the technical problems of discontinuous synthesized speech and poor sound quality.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Claims (12)
1. A voice conversion method, characterized in that the method includes:
segmenting the speech to be converted of a speaker to be converted into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
extracting the mel cepstrum feature of each frame unit to be converted;
calculating multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted;
matching, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units;
calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker;
processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
2. The method according to claim 1, characterized in that the method further includes a step of pre-processing the speech data, the step including:
segmenting, using the preset segmentation rule, the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker, to obtain multiple frame units corresponding to the original speech and multiple frame units corresponding to the target speech;
extracting the mel cepstrum features of the original speech and the target speech, and building an original speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the original speech and the frame units of the target speech;
sorting the original speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances;
establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
3. The method according to claim 2, characterized in that the step of calculating multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted includes:
composing the feature vector of each frame unit to be converted from its mel cepstrum features;
calculating and sorting the Euclidean distances between the feature vector of each frame unit to be converted and the feature vectors of the frame units in the phoneme dictionary;
filtering out the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary using the k-nearest-neighbour algorithm.
4. The method according to claim 2, characterized in that the step of calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker includes:
calculating the target cost between each frame unit to be converted and each target frame unit, and the transfer cost between target frame units at adjacent moments;
searching for the optimal path with the Viterbi algorithm according to the calculated target costs and transfer costs.
5. The method according to claim 2, characterized in that the step of processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted includes:
obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech;
performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule;
obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker;
converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech;
performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
6. A voice conversion device, characterized in that the device includes:
a segmentation module, configured to segment the speech to be converted of a speaker to be converted into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
an extraction module, configured to extract the mel cepstrum feature of each frame unit to be converted;
a computing module, configured to calculate multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted;
a matching module, configured to match, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units;
the computing module being further configured to calculate the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker;
a processing module, configured to process the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
7. The voice conversion device according to claim 6, characterized in that the device further includes a pre-processing module;
the pre-processing module pre-processes the speech data by:
segmenting, using the preset segmentation rule, the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker, to obtain multiple frame units corresponding to the original speech and multiple frame units corresponding to the target speech;
extracting the mel cepstrum features of the original speech and the target speech, and building an original speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the original speech and the frame units of the target speech;
sorting the original speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances;
establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
8. The voice conversion device according to claim 7, characterized in that the computing module calculates the multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted by:
composing the feature vector of each frame unit to be converted from its mel cepstrum features;
calculating and sorting the Euclidean distances between the feature vector of each frame unit to be converted and the feature vectors of the frame units in the phoneme dictionary;
filtering out the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary using the k-nearest-neighbour algorithm.
9. The voice conversion device according to claim 7, characterized in that the computing module calculates the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker by:
calculating the target cost between each frame unit to be converted and each target frame unit, and the transfer cost between target frame units at adjacent moments;
searching for the optimal path with the Viterbi algorithm according to the calculated target costs and transfer costs.
10. The voice conversion device according to claim 7, characterized in that the processing module processes the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted by:
obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech;
performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule;
obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker;
converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech;
performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
11. An electronic equipment, characterized in that the electronic equipment includes a processor and a memory, the memory being coupled to the processor and storing instructions which, when executed by the processor, cause the electronic equipment to perform the voice conversion method according to any one of claims 1-5.
12. A readable storage medium, the readable storage medium including a computer program, characterized in that the computer program, when running, controls the electronic equipment where the readable storage medium is located to perform the voice conversion method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710812770.XA CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107507619A true CN107507619A (en) | 2017-12-22 |
CN107507619B CN107507619B (en) | 2021-08-20 |
Family
ID=60695368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710812770.XA Active CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107507619B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817197A (en) * | 2019-03-04 | 2019-05-28 | 天翼爱音乐文化科技有限公司 | Song generation method, device, computer equipment and storage medium |
CN111048109A (en) * | 2019-12-25 | 2020-04-21 | 广州酷狗计算机科技有限公司 | Acoustic feature determination method and apparatus, computer device, and storage medium |
CN111213205A (en) * | 2019-12-30 | 2020-05-29 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and device, computer equipment and storage medium |
CN112562728A (en) * | 2020-11-13 | 2021-03-26 | 百果园技术(新加坡)有限公司 | Training method for generating confrontation network, and audio style migration method and device |
CN112614481A (en) * | 2020-12-08 | 2021-04-06 | 浙江合众新能源汽车有限公司 | Voice tone customization method and system for automobile prompt tone |
CN112634920A (en) * | 2020-12-18 | 2021-04-09 | 平安科技(深圳)有限公司 | Method and device for training voice conversion model based on domain separation |
CN113345453A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113782050A (en) * | 2021-09-08 | 2021-12-10 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic device and storage medium |
CN114582365A (en) * | 2022-05-05 | 2022-06-03 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090089063A1 (en) * | 2007-09-29 | 2009-04-02 | Fan Ping Meng | Voice conversion method and system |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103531196A (en) * | 2013-10-15 | 2014-01-22 | 中国科学院自动化研究所 | Sound selection method for waveform concatenation speech synthesis |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN104575488A (en) * | 2014-12-25 | 2015-04-29 | 北京时代瑞朗科技有限公司 | Text information-based waveform concatenation voice synthesizing method |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817197B (en) * | 2019-03-04 | 2021-05-11 | 天翼爱音乐文化科技有限公司 | Singing voice generation method and device, computer equipment and storage medium |
CN109817197A (en) * | 2019-03-04 | 2019-05-28 | 天翼爱音乐文化科技有限公司 | Song generation method, device, computer equipment and storage medium |
CN111048109A (en) * | 2019-12-25 | 2020-04-21 | 广州酷狗计算机科技有限公司 | Acoustic feature determination method and apparatus, computer device, and storage medium |
CN111213205A (en) * | 2019-12-30 | 2020-05-29 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and device, computer equipment and storage medium |
CN111213205B (en) * | 2019-12-30 | 2023-09-08 | 深圳市优必选科技股份有限公司 | Stream-type voice conversion method, device, computer equipment and storage medium |
WO2021134232A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and apparatus, and computer device and storage medium |
CN112562728A (en) * | 2020-11-13 | 2021-03-26 | 百果园技术(新加坡)有限公司 | Training method for generating confrontation network, and audio style migration method and device |
CN112614481A (en) * | 2020-12-08 | 2021-04-06 | 浙江合众新能源汽车有限公司 | Voice tone customization method and system for automobile prompt tone |
CN112634920A (en) * | 2020-12-18 | 2021-04-09 | 平安科技(深圳)有限公司 | Method and device for training voice conversion model based on domain separation |
CN112634920B (en) * | 2020-12-18 | 2024-01-02 | 平安科技(深圳)有限公司 | Training method and device of voice conversion model based on domain separation |
CN113345453A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113345453B (en) * | 2021-06-01 | 2023-06-16 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113782050A (en) * | 2021-09-08 | 2021-12-10 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic device and storage medium |
CN114582365A (en) * | 2022-05-05 | 2022-06-03 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107507619B (en) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107507619A (en) | Voice conversion method and device, electronic equipment, and readable storage medium | |
CN109272988B (en) | Speech recognition method based on multi-path convolutional neural network | |
CN107154260B (en) | Domain-adaptive speech recognition method and device | |
CN101178896B (en) | Unit selection voice synthetic method based on acoustics statistical model | |
Wang et al. | Word embedding for recurrent neural network based TTS synthesis | |
CN107705802A (en) | Voice conversion method and device, electronic equipment, and readable storage medium | |
CN106486121B (en) | Voice optimization method and device applied to intelligent robot | |
CN110288980A (en) | Speech recognition method, model training method, device, equipment and storage medium | |
CN103280216B (en) | Context-dependent speech recognition device with improved robustness to environmental changes | |
CN108984529A (en) | Automatic error correction method for real-time courtroom speech recognition, storage medium and computing device | |
CN105810191B (en) | Chinese dialect identification method incorporating prosodic information | |
CN109036467A (en) | CFFD extraction method, and TF-LSTM-based speech emotion recognition method and system | |
CN108986798B (en) | Method, device and equipment for processing voice data | |
CN107526826A (en) | Voice search processing method, device and server | |
CN107122492A (en) | Lyric generation method and device based on picture content | |
CN107291775A (en) | Repair corpus generation method and device for error samples | |
CN111599339B (en) | High-naturalness concatenative speech synthesis method, system, equipment and medium | |
CN109147771A (en) | Audio splitting method and system | |
An et al. | Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features | |
Zhang et al. | Automatic synthesis technology of music teaching melodies based on recurrent neural network | |
CN114927126A (en) | Scheme output method, device and equipment based on semantic analysis and storage medium | |
CN104751856B (en) | Speech sentence recognition method and device | |
US20240161727A1 (en) | Training method for speech synthesis model and speech synthesis method and related apparatuses | |
CN107680584A (en) | Method and apparatus for cutting audio | |
CN112686041A (en) | Pinyin marking method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||