CN107507619A - Voice conversion method, apparatus, electronic device and readable storage medium - Google Patents
Voice conversion method, apparatus, electronic device and readable storage medium
- Publication number
- CN107507619A CN107507619A CN201710812770.XA CN201710812770A CN107507619A CN 107507619 A CN107507619 A CN 107507619A CN 201710812770 A CN201710812770 A CN 201710812770A CN 107507619 A CN107507619 A CN 107507619A
- Authority
- CN
- China
- Prior art keywords
- target
- converted
- frame unit
- voice
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L21/013 — Adapting to target pitch (under G10L21/00 speech or voice signal processing techniques to produce another audible or non-audible signal in order to modify its quality or its intelligibility; G10L21/003 changing voice quality, e.g. pitch or formants; G10L21/007 characterised by the process used)
- G10L13/033 — Voice editing, e.g. manipulating the voice of the synthesiser (under G10L13/00 speech synthesis, text-to-speech systems; G10L13/02 methods for producing synthetic speech)
- G10L25/24 — Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L25/48 — Speech or voice analysis techniques specially adapted for particular use
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L2021/0135 — Voice conversion or morphing
Abstract
The present invention provides a voice conversion method, apparatus, electronic device and readable storage medium. The method includes: segmenting the speech to be converted into multiple frame units to be converted based on a preset segmentation rule; extracting the mel-cepstral feature of each frame unit to be converted; computing multiple candidate frame units according to a phoneme dictionary and the mel-cepstral feature of each frame unit to be converted; matching target frame units according to the correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker; computing conversion costs to obtain the optimal path; and processing the target frame units on the optimal path to obtain the target speech. Because the candidate frame units are computed within a phoneme dictionary rather than by searching the entire feature dictionary as in the prior art, the method saves computing resources and increases computation speed; at the same time, by improving the traditional single-frame computation into a multi-frame computation, it significantly alleviates the technical problems of discontinuous synthesized speech and poor sound quality.
Description
Technical field
The present invention relates to the technical field of speech signal processing, and in particular to a voice conversion method, apparatus, electronic device and readable storage medium.
Background art
After nearly half a century of development, speech synthesis technology has achieved great success and plays an extremely important role in fields such as artificial intelligence. TTS (Text-to-Speech) is the technology of converting text, whether generated by a computer or input from outside, into intelligible and fluent spoken output. However, TTS-synthesized speech generally suffers from two problems: first, the timbre is confined to a small number of announcer samples and cannot meet the demand for personalization; second, the prosody is unnatural and synthesis artifacts are obvious.
Timbre conversion (also known as voice conversion) is the technology of converting the current speaker's timbre directly into the output speaker's timbre without changing the speech content; its advantages are natural prosody and good preservation of the personalized timbre. At present, voice conversion based on speech-feature-dictionary lookup is the mainstream method among non-parametric voice conversion techniques. The idea of this method is as follows: 1. extract features from the source speech corpus and the target speech corpus, build feature dictionaries, and perform parallel training to obtain mapping rules; 2. extract the feature vectors of the speech to be converted and, according to the mapping rules, find the k-nearest-neighbour target feature vectors in the target feature dictionary for each feature vector; 3. compute target costs and concatenation costs, and search for the optimal path in the k-nearest-neighbour feature matrix using the Viterbi algorithm; 4. concatenate the selected target speech feature vectors and convert them back into speech. The shortcoming of this method is that each k-nearest-neighbour lookup must traverse the entire target feature dictionary, so computation is slow and the demands on system performance are high. Moreover, the concatenation cost is computed with the single frame as its unit, which ignores the smoothness of speech across frames; the resulting loss of contextual cues makes the synthesized speech discontinuous and greatly degrades the speech quality.
Summary of the invention
To overcome the above shortcomings of the prior art, the technical problem to be solved by the present invention is to provide a voice conversion method, apparatus, electronic device and readable storage medium that can guarantee the continuity of the synthesized speech while ensuring that spectral detail is not lost.
The object of the first aspect of the present invention is to provide a voice conversion method, the method including:
segmenting the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
extracting the mel-cepstral feature of each frame unit to be converted;
computing multiple candidate frame units according to a previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted;
matching the target frame units corresponding to the candidate frame units according to a previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker;
computing conversion costs to obtain the optimal path for converting the speech to be converted into the target-timbre speaker's speech;
processing the target frame units on the optimal path to obtain the target speech, in the target-timbre speaker's voice, corresponding to the speech to be converted.
Optionally, the method further includes preprocessing the speech data;
the step of preprocessing the speech data includes:
segmenting, with the preset segmentation rule, the source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker, to obtain multiple frame units of the source speech and multiple frame units of the target speech;
extracting the mel-cepstral features of the source speech and the target speech, and building a source speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the source speech and the frame units of the target speech;
classifying the source speech feature dictionary according to annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental-frequency (f0) features of the source speech and the target speech, and computing the f0 mean and f0 variance;
establishing the f0 mapping relation between the source speaker and the target-timbre speaker according to the f0 mean and f0 variance.
The object of the second aspect of the present invention is to provide a voice conversion apparatus, the apparatus including:
a segmentation module, configured to segment the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
an extraction module, configured to extract the mel-cepstral feature of each frame unit to be converted;
a computing module, configured to compute multiple candidate frame units according to a previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted;
a matching module, configured to match the target frame units corresponding to the candidate frame units according to a previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker;
the computing module, further configured to compute conversion costs and obtain the optimal path for converting the speech to be converted into the target-timbre speaker's speech;
a processing module, configured to process the target frame units on the optimal path and obtain the target speech, in the target-timbre speaker's voice, corresponding to the speech to be converted.
Optionally, the apparatus further includes a preprocessing module;
the preprocessing module preprocesses the speech data by:
segmenting, with the preset segmentation rule, the source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker, to obtain multiple frame units of the source speech and multiple frame units of the target speech;
extracting the mel-cepstral features of the source speech and the target speech, and building a source speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the source speech and the frame units of the target speech;
classifying the source speech feature dictionary according to annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental-frequency (f0) features of the source speech and the target speech, and computing the f0 mean and f0 variance;
establishing the f0 mapping relation between the source speaker and the target-timbre speaker according to the f0 mean and f0 variance.
The object of the third aspect of the present invention is to provide an electronic device, the electronic device including a processor and a memory, the memory being coupled to the processor and storing instructions that, when executed by the processor, cause the electronic device to perform the voice conversion method of the first aspect of the present invention.
The object of the fourth aspect of the present invention is to provide a readable storage medium, the readable storage medium including a computer program that, when run, controls the electronic device on which the readable storage medium resides to perform the voice conversion method of the first aspect of the present invention.
Compared with the prior art, the present invention has the following beneficial effects:
The present invention provides a voice conversion method, apparatus, electronic device and readable storage medium. The method includes: segmenting the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule; extracting the mel-cepstral feature of each frame unit to be converted; computing multiple candidate frame units according to the previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted; matching the target frame units corresponding to the candidate frame units according to the previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker; computing conversion costs to obtain the optimal path for converting the speech to be converted into the target-timbre speaker's speech; and processing the target frame units on the optimal path to obtain the target speech, in the target-timbre speaker's voice, corresponding to the speech to be converted. Because the candidate frame units are computed within the phoneme dictionary of the source speaker rather than by searching the entire feature dictionary as in the prior art, the method saves computing resources and increases computation speed; at the same time, by improving the traditional single-frame computation into a multi-frame computation, it significantly alleviates the technical problems of discontinuous synthesized speech and poor sound quality.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention more clearly, the drawings required by the embodiments are briefly described below. It should be understood that the following drawings illustrate only certain embodiments of the present invention and are therefore not to be regarded as limiting its scope; for those of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.
Fig. 1 is the block diagram of the electronic device provided by an embodiment of the present invention.
Fig. 2 is a flow chart of the steps of the voice conversion method provided by the first embodiment of the present invention.
Fig. 3 is another flow chart of the steps of the voice conversion method provided by the first embodiment of the present invention.
Fig. 4 is the sub-step flow chart of step S170 in Fig. 3.
Fig. 5 is a schematic diagram of the frame unit structure.
Fig. 6 is a schematic diagram of adding one frame unit to multiple corresponding phoneme sets simultaneously.
Fig. 7 is a schematic diagram of the Viterbi path search provided by an embodiment of the present invention.
Fig. 8 is the sub-step flow chart of step S160 in Fig. 2 or Fig. 3.
Fig. 9 is the structural block diagram of the voice conversion apparatus provided by the second embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The components of the embodiments of the present invention, as generally described and illustrated in the drawings here, may be arranged and designed in a variety of configurations.
Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the present invention, but merely represents selected embodiments of it. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Referring to Fig. 1, it is a block diagram of an electronic device 100 provided by a preferred embodiment of the present invention. The electronic device 100 may include a voice conversion apparatus 300, a memory 111, a storage controller 112 and a processor 113.
The memory 111, the storage controller 112 and the processor 113 are electrically connected to one another, directly or indirectly, to enable the transmission of and interaction with data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The voice conversion apparatus 300 may include at least one software function module that can be stored in the memory 111 in the form of software or firmware, or solidified in the operating system (OS) of the electronic device 100. The processor 113 is configured to execute the executable modules stored in the memory 111, such as the software function modules and computer programs included in the voice conversion apparatus 300.
The memory 111 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), and the like. The memory 111 is configured to store a program, and the processor 113 executes the program after receiving an execution instruction. Access to the memory 111 by the processor 113, and by other possible components, may be carried out under the control of the storage controller 112.
The processor 113 may be an integrated circuit chip with signal processing capability. The processor 113 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can implement or perform the methods, steps and logic diagrams disclosed in the embodiments of the present invention. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on.
First embodiment
Referring to Fig. 2, Fig. 2 is a flow chart of the steps of the voice conversion method provided by a preferred embodiment of the present invention. The method is applied to the electronic device 100 described above; the steps of the voice conversion method are described in detail below.
Step S110: segment the speech to be converted of the source speaker into multiple frame units to be converted based on a preset segmentation rule.
In this embodiment, the speech range to be converted may be selected by manual annotation; optionally, an automatic speech annotation tool may be called to select the speech to be converted from the speech of the source speaker.
After the annotated speech to be converted is obtained, it is segmented with the preset segmentation rule so that each frame unit after segmentation includes multiple consecutive speech frames.
Step S120: extract the mel-cepstral feature of each frame unit to be converted.
In this embodiment, step S120 includes:
performing a time-to-frequency-domain transform on each frame unit to be converted to obtain its spectral information;
extracting the mel-cepstral feature of the frame unit with a mel filter bank.
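The mel filter bank referred to above spaces its filters uniformly on the mel scale. As a minimal sketch (assuming the common HTK-style mel formula, which the patent does not specify, and with all names illustrative), the Hz/mel conversion and the centre frequencies of a toy filter bank can be written as:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """HTK-style mel scale: 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centres(n_filters: int, f_min: float, f_max: float) -> list:
    """Centre frequencies (Hz) of n_filters triangular filters spaced
    uniformly on the mel scale between f_min and f_max."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_filters centres require n_filters + 2 equally spaced mel points
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]

centres = mel_filter_centres(26, 0.0, 8000.0)
```

In a full pipeline, the log energies of these filters applied to each frame's spectrum would then be decorrelated (e.g. by a DCT) to give the mel-cepstral coefficients.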
Step S130: compute multiple candidate frame units according to the previously obtained phoneme dictionary of the source speaker and the mel-cepstral feature of each frame unit to be converted.
Step S130 may include the following sub-steps:
composing the feature vector of each frame unit to be converted from its mel-cepstral features;
computing, and sorting by, the Euclidean distance between the feature vector of each frame unit to be converted and the feature vector of each frame unit in the phoneme dictionary;
filtering out, with the k-nearest-neighbour algorithm, the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary.
The k-nearest-neighbour algorithm is a classification algorithm: if most of the k samples most similar to a given sample in feature space (i.e., its nearest neighbours in that space) belong to some category, the sample is assigned to that category as well.
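A minimal sketch of the candidate search described above. The bucketed dictionary layout and all names are assumptions, not the patent's data structures; the point it illustrates is that the k-nearest-neighbour ranking runs only over the frame units filed under the relevant phoneme, not over the whole feature dictionary:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_candidates(query_vec, phoneme_dict, phoneme, k=3):
    """Return the k entries (unit_id, feature_vector) closest to query_vec,
    searching only the bucket for the given phoneme rather than the
    whole dictionary."""
    bucket = phoneme_dict[phoneme]          # list of (unit_id, feature_vector)
    ranked = sorted(bucket, key=lambda e: euclidean(query_vec, e[1]))
    return ranked[:k]

# Toy dictionary: two phoneme buckets of 2-D "mel-cepstral" vectors.
phoneme_dict = {
    "a": [("a0", (0.0, 0.0)), ("a1", (1.0, 1.0)), ("a2", (5.0, 5.0))],
    "i": [("i0", (9.0, 9.0))],
}
cands = knn_candidates((0.9, 1.1), phoneme_dict, "a", k=2)
# cands holds the two bucket entries nearest the query
```

With buckets of average size B instead of a dictionary of size D, each lookup costs O(B) distance computations rather than O(D), which is the speed-up the patent claims.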
Step S140: match the target frame units corresponding to the candidate frame units according to the previously obtained correspondence between the frame units of the source speaker and the frame units of the target-timbre speaker.
Referring to Fig. 3, in this embodiment the method also includes step S170.
Step S170: preprocess the speech data.
The source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker are trained in parallel, to establish both the correspondence between the frame units of the source speaker and those of the target-timbre speaker and the f0 mapping relation between the two speakers. Because the source speech and the target speech are trained in parallel in this process, their contents are required to correspond one to one and to be consistent.
Referring to Fig. 4, in this embodiment step S170 includes the following sub-steps.
Sub-step S171: segment, with the preset segmentation rule, the source speech in the source speech corpus of the source speaker and the target speech in the target speech corpus of the target-timbre speaker, obtaining multiple frame units of the source speech and multiple frame units of the target speech.
In this embodiment, in order to establish the mapping relations between the source speech and the target speech, parallel training is required: the source corpus and the target corpus must have consistent content and sufficient duration, and, to guarantee the effect of the timbre conversion, they must contain enough of all the phoneme information.
Referring to Fig. 5, in this embodiment, taking into account the smooth connection between frame units and the contextual cues of the speech, this scheme selects an odd number q = 2p + 1 of consecutive frames as one frame unit; its centre frame is the (p+1)-th frame of the unit, with p frames before and after it, and two adjacent frame units overlap by 2p frames. It should be understood that the preset segmentation rule used in sub-step S171 is the same as the one used in step S110.
For the source speech, the frame sequence can be expressed as X = [x(1), x(2), x(3), ..., x(n), ..., x(N)], and the n-th unit can be expressed as x(n) = [x_{n-p}, x_{n-p+1}, ..., x_n, ..., x_{n+p-1}, x_{n+p}], where x_n denotes the n-th frame of the sequence. The same unit-division operation is also applied to the target speech.
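Under the definitions above (units of q = 2p + 1 consecutive frames, centred one frame apart, so adjacent units overlap by 2p frames), the segmentation can be sketched as follows; the function name and the use of plain Python lists for frames are illustrative assumptions:

```python
def split_into_units(frames, p):
    """Split a frame sequence into overlapping units of q = 2p + 1
    consecutive frames. Unit n is centred on frame index n (for
    n = p .. len(frames) - 1 - p), so adjacent units share 2p frames."""
    return [frames[n - p: n + p + 1] for n in range(p, len(frames) - p)]

frames = list(range(7))           # stand-in for 7 speech frames
units = split_into_units(frames, p=1)
# units[0] == [0, 1, 2]; units[1] == [1, 2, 3]; each unit spans 3 frames
```

Note that the first and last p frames only ever appear as context, never as a unit centre; a real system might pad the sequence at both ends to avoid this edge effect.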
Sub-step S172: extract the mel-cepstral features of the source speech and the target speech, and build the source speech feature dictionary and the target speech feature dictionary.
In this embodiment, the spectral information of each frame is obtained by a fast Fourier transform (FFT), and the mel-cepstral feature is then extracted through a mel filter bank. The source speech feature dictionary and the target speech feature dictionary are built from the extracted mel-cepstral features.
Sub-step S173: establish the correspondence between the frame units of the source speech and the frame units of the target speech.
In this embodiment, the correspondence between source speech frames and target speech frames is established with the DTW (Dynamic Time Warping) algorithm. The correspondence between the source speech and the target speech can be expressed as Z = [z_1, z_2, ..., z_l, ..., z_L], where each z_l is a pairing of a frame unit of the source speech with a frame unit of the target speech. Establishing this correspondence provides the basis for looking up the target-speech frame units from the source-speech frame units during the timbre-conversion stage.
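A minimal DTW alignment in the spirit of sub-step S173, sketched over one-dimensional features with the standard three-direction recursion (the patent does not fix the local path constraints or the frame distance, so both are assumptions here):

```python
def dtw_align(src, tgt, dist=lambda a, b: abs(a - b)):
    """Return (total_cost, path) aligning sequence src to sequence tgt,
    where path is a list of (i, j) frame pairings in order."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # src frame repeated
                                 cost[i][j - 1],       # tgt frame repeated
                                 cost[i - 1][j - 1])   # one-to-one match
    # Backtrack greedily along the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = min((cost[i - 1][j - 1], i - 1, j - 1),
                   (cost[i - 1][j], i - 1, j),
                   (cost[i][j - 1], i, j - 1))
        i, j = step[1], step[2]
    path.reverse()
    return cost[n][m], path

total, path = dtw_align([1, 2, 3, 3], [1, 3, 3])
```

Each (i, j) pairing in `path` corresponds to one entry z_l of the correspondence Z above; a production system would align the mel-cepstral vectors of whole frame units rather than scalars.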
Sub-step S174: classify the source speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionary.
In this embodiment, each piece of phoneme information in the source speech is annotated in advance, and each frame unit of the source speech is filed into a phoneme dictionary according to its position in the source speech. Referring to Fig. 6, because a frame unit includes multiple consecutive frames, one frame unit may span two (or more) phoneme sets; in order to guarantee the quality of the conversion, such a frame unit is added to every corresponding phoneme dictionary simultaneously.
Obtaining the phoneme dictionary by classification, and computing the multiple candidate frame units within the phoneme dictionary, saves computing resources and increases computation speed relative to the prior art, which searches the entire feature dictionary.
Sub-step S175, extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances.
Sub-step S176, establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
In the present embodiment, the excitation of voiced sound is a periodic pulse train, and the frequency of that pulse train is the fundamental frequency. The fundamental frequency is therefore a key feature of speech, and the accuracy of its extraction directly affects how well the synthesized speech preserves the personalized timbre and prosodic rhythm. Statistically, two distributions of the same family (for example, normal distributions) with different statistics (mean, variance) can be converted into each other. Therefore, the fundamental frequency features of the original speech and the target speech are each regarded as normally distributed; once the fundamental frequency means and variances are calculated, a fundamental frequency mapping between the original speech and the target speech can be established. This mapping allows the fundamental frequency feature of the target speech to be obtained from the speech to be converted in the subsequent speech conversion stage.
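The statistics of sub-steps S175 and S176 can be sketched as follows. This is an illustrative sketch; the convention that unvoiced frames carry f0 = 0 is an assumption on our part, not stated in the patent.

```python
import numpy as np

def f0_statistics(f0):
    """Mean and variance of the fundamental frequency over voiced frames
    only (unvoiced frames are assumed to be marked with f0 = 0)."""
    voiced = f0[f0 > 0]
    return float(voiced.mean()), float(voiced.var())

# Treating both speakers' f0 as normally distributed (sub-step S176),
# the four statistics below fully define the mapping between them.
src_f0 = np.array([0.0, 120.0, 130.0, 125.0, 0.0])
tgt_f0 = np.array([0.0, 220.0, 240.0, 230.0, 0.0])
sf0m, sf0v = f0_statistics(src_f0)
tf0m, tf0v = f0_statistics(tgt_f0)
print(sf0m, sf0v, tf0m, tf0v)
```

These four numbers (sf0m, sf0v, tf0m, tf0v) are exactly the quantities consumed by the fundamental frequency conversion formula of sub-step S163.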
Step S150, calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker.
In the present embodiment, step S150 obtains this optimal path in the following manner.
First, the target cost between each frame unit to be converted and each target frame unit is calculated, together with the transfer cost between target frame units at adjacent moments.
Then, according to the calculated target costs and transfer costs, the optimal path is obtained by searching with the Viterbi algorithm.
Optionally, both the target cost between a frame unit to be converted and a target frame unit, and the transfer cost between target frame units at adjacent moments, are calculated using the Euclidean distance. The Viterbi search is then equivalent to a minimum-cost path search over a weighted directed acyclic graph.
The target cost can be calculated as follows:
w(X^(t), Y_k'^(t)) = sqrt( Σ_i Σ_d ( X^(t)(i, d) − Y_k'^(t)(i, d) )^2 )
where w(X^(t), Y_k'^(t)) can be regarded as the weight of a node in the weighted directed acyclic graph, that is, the target cost in the present embodiment. It describes the degree of similarity between the frame unit X^(t) to be converted and the target frame unit Y_k'^(t): the smaller the weight, the more similar the two. Here X^(t)(i, d) and Y_k'^(t)(i, d) denote the d-th dimension of the i-th frame of the unit at time t.
The transfer weights between nodes in the weighted directed acyclic graph are the connection costs. The connection cost describes the degree of similarity between the target frame unit at time t and the target frame unit at time t+1: the smaller the weight, the more similar the two units and the smoother the transition. On this principle, the optimal path can be searched for in the target frame unit matrix. Referring to Fig. 7, each node on the path (formed by the arrowed lines in the figure) is the optimal selection at its moment.
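The minimum-cost path search of step S150 can be sketched as follows. This is an illustrative Viterbi implementation over precomputed cost arrays (array shapes are our assumption; the patent does not specify data layouts):

```python
import numpy as np

def viterbi_path(target_cost, transition_cost):
    """Minimum-cost path through the candidate matrix (step S150).

    target_cost: (T, K) array, node weight of candidate k at time t.
    transition_cost: (T-1, K, K) array, edge weight from candidate k at
    time t to candidate k' at time t+1.
    Returns the list of chosen candidate indices, one per time step.
    """
    T, K = target_cost.shape
    acc = target_cost[0].copy()          # accumulated cost per candidate
    back = np.zeros((T, K), dtype=int)   # backpointers
    for t in range(1, T):
        trans = acc[:, None] + transition_cost[t - 1]   # (K, K)
        back[t] = trans.argmin(axis=0)
        acc = trans.min(axis=0) + target_cost[t]
    # Backtrack the optimal path from the cheapest final candidate
    path = [int(acc.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

tc = np.array([[1.0, 5.0], [5.0, 1.0], [1.0, 5.0]])
trc = np.zeros((2, 2, 2))   # free transitions for this toy example
p = viterbi_path(tc, trc)
print(p)
```

With zero transition costs the search simply picks the cheapest node at each step; with real connection costs it trades node similarity against transition smoothness, as the weighted-graph description above explains.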
Step S160, processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
In the present embodiment, referring to Fig. 8, step S160 can include the following sub-steps.
Sub-step S161, obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech.
Sub-step S162, performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule.
In the present embodiment, because adjacent target frame units overlap by 2p frames, an instantaneous window must be applied for smoothing when joining them into a feature matrix, to ensure the auditory continuity of the speech. The following operation is performed for each target frame unit.
Each frame in the target frame unit is multiplied by a weight coefficient. In the present embodiment the instantaneous window w is expressed with an exponential function:
w = exp(−λ|a|), a = [p, p−1, ..., 0, ..., p−1, p]
where λ is a scalar that adjusts the shape of the instantaneous window w. The larger λ is, the more the center-frame information is emphasized and the neighbouring-frame information is weakened; conversely, the smaller λ is, the more the neighbouring-frame information is taken into account and the center frame is weakened. A suitable λ therefore balances the two. Before windowing, each element of the instantaneous window must be normalized so that the elements sum to 1.
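The instantaneous window of sub-step S162 can be sketched directly from the formula above (an illustrative sketch; the helper name is our own):

```python
import numpy as np

def instantaneous_window(p, lam):
    """Normalized exponential window w = exp(-lam * |a|) over the
    2p+1 frames of a unit, with a = [p, ..., 1, 0, 1, ..., p]
    (sub-step S162).  Larger lam emphasizes the center frame; the
    window is normalized so its elements sum to 1 before windowing."""
    a = np.abs(np.arange(-p, p + 1))
    w = np.exp(-lam * a)
    return w / w.sum()

w = instantaneous_window(p=2, lam=1.0)
print(w)   # symmetric, peaked at the center frame
```

Multiplying each frame of a unit by the corresponding window element, then overlap-adding adjacent units, yields the smoothed feature matrix the embodiment describes.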
Sub-step S163, obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker.
The fundamental frequency mean of the speech to be converted is subtracted from its fundamental frequency sequence; the difference is multiplied by the quotient of the fundamental frequency variance of the target speech and the fundamental frequency variance of the speech to be converted; and the product is added to the fundamental frequency mean of the target speech, yielding the fundamental frequency sequence of the target speech. The fundamental frequency sequence of the target speech can be calculated as:
f0(i) = (sf0(i) − sf0m) × (tf0v / sf0v) + tf0m
where f0(i) is the fundamental frequency sequence of the target speech, sf0(i) is the fundamental frequency sequence of the speech to be converted, sf0m and tf0m are respectively the fundamental frequency means of the speech to be converted and the target speech, and sf0v and tf0v are respectively the fundamental frequency variances of the speech to be converted and the target speech.
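The mean-variance formula of sub-step S163 can be sketched as follows. Note this follows the patent's variance-quotient wording literally; many voice-conversion systems instead scale by the ratio of standard deviations, often in the log-f0 domain. The unvoiced-frame pass-through is our assumption.

```python
import numpy as np

def convert_f0(src_f0, sf0m, sf0v, tf0m, tf0v):
    """f0(i) = (sf0(i) - sf0m) * (tf0v / sf0v) + tf0m  (sub-step S163).
    Unvoiced frames (f0 == 0) are passed through unchanged (assumed)."""
    src_f0 = np.asarray(src_f0, dtype=float)
    out = (src_f0 - sf0m) * (tf0v / sf0v) + tf0m
    out[src_f0 == 0] = 0.0
    return out

f0 = convert_f0([0.0, 120.0, 130.0], sf0m=125.0, sf0v=25.0,
                tf0m=230.0, tf0v=50.0)
print(f0)
```

The converted sequence is centered on the target speaker's mean and spread according to the target speaker's variance, which is exactly the Gaussian-to-Gaussian mapping established in sub-step S176.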
Sub-step S164, converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech.
In the present embodiment, the STRAIGHT toolkit can optionally be called to convert the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech.
Sub-step S165, performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
In the present embodiment, the spectrum of the target speech is converted into the target speech of the target timbre speaker using an inverse Fourier transform.
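The frequency-to-time conversion of sub-step S165 can be illustrated with a bare-bones inverse-FFT plus overlap-add sketch. This is only an illustration of the inverse transform idea using numpy; the embodiment's actual synthesis goes through the STRAIGHT toolkit's vocoder, which also consumes the f0 track.

```python
import numpy as np

def spectrum_to_waveform(frames_spec, hop):
    """Frequency-to-time conversion by inverse real FFT plus overlap-add
    (illustrating sub-step S165).

    frames_spec: (n_frames, n_fft//2 + 1) complex frame spectra.
    hop: frame shift in samples.
    """
    frames = np.fft.irfft(frames_spec, axis=1)           # back to time domain
    n_frames, frame_len = frames.shape
    out = np.zeros((n_frames - 1) * hop + frame_len)
    win = np.hanning(frame_len)
    for t, frame in enumerate(frames):
        out[t * hop:t * hop + frame_len] += frame * win  # overlap-add
    return out

spec = np.fft.rfft(np.random.randn(4, 64), axis=1)
wav = spectrum_to_waveform(spec, hop=32)
print(wav.shape)
```

The output length is (n_frames − 1) × hop + frame_len samples; a real vocoder would additionally impose the converted fundamental frequency on the excitation.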
Second embodiment
Referring to Fig. 9, Fig. 9 is a structural block diagram of the voice conversion device 300 provided by the preferred embodiment of the present invention. The voice conversion device 300 includes: a segmentation module 310, an extraction module 320, a computing module 330, a matching module 340 and a processing module 350.
The segmentation module 310 is configured to segment the speech to be converted of the speaker to be converted into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames.
The extraction module 320 is configured to extract the mel cepstrum feature of each frame unit to be converted.
In the present embodiment, the extraction module 320 extracts the mel cepstrum feature of a frame unit to be converted by:
performing a time-to-frequency-domain conversion on the frame unit to be converted to obtain the spectrum information of each frame unit;
extracting the mel cepstrum feature of the frame unit using a mel filter bank.
The computing module 330 is configured to calculate multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted.
In the present embodiment, the computing module 330 calculates the multiple candidate frame units by:
composing the feature vector of each frame unit to be converted from its mel cepstrum features;
calculating and sorting the Euclidean distances between the feature vector of each frame unit to be converted and the feature vectors of the frame units in the phoneme dictionary;
filtering out the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary using the k-nearest-neighbour algorithm.
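The candidate selection performed by the computing module 330 can be sketched as follows (an illustrative sketch with assumed dimensions; the 24-dimensional feature size and function name are our own):

```python
import numpy as np

def knn_candidates(query_vec, dict_vecs, k=4):
    """Select the k nearest frame units from a phoneme dictionary by
    Euclidean distance over mel-cepstral feature vectors (computing
    module 330).  Returns candidate indices, nearest first."""
    dists = np.linalg.norm(dict_vecs - query_vec, axis=1)
    order = np.argsort(dists)
    return order[:k].tolist()

rng = np.random.default_rng(0)
dictionary = rng.normal(size=(50, 24))   # 50 frame units, 24-dim features
query = dictionary[7] + 0.01 * rng.normal(size=24)   # near unit 7
cands = knn_candidates(query, dictionary, k=3)
print(cands)
```

Because the search is restricted to the phoneme dictionary matching the frame unit to be converted, the distance computation runs over far fewer entries than a whole-dictionary search, which is the speed advantage claimed above.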
The matching module 340 is configured to match, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units.
The computing module 330 is further configured to calculate the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker.
In the present embodiment, the computing module 330 obtains this optimal path by:
calculating the target cost between each frame unit to be converted and each target frame unit, and the transfer cost between target frame units at adjacent moments;
searching for the optimal path with the Viterbi algorithm according to the calculated target costs and transfer costs.
The processing module 350 is configured to process the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
In the present embodiment, the processing module 350 processes the target frame units on the optimal path by:
obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech;
performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule;
obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker;
converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech;
performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
Referring again to Fig. 9, in the present embodiment, the voice conversion device 300 further includes a pre-processing module 360.
The pre-processing module 360 pre-processes the speech data by:
segmenting, using the preset segmentation rule, the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker, to obtain the multiple frame units corresponding to the original speech and the multiple frame units corresponding to the target speech;
extracting the mel cepstrum features of the original speech and the target speech, and building the original speech feature dictionary and the target speech feature dictionary;
establishing the correspondence between the frame units of the original speech and the frame units of the target speech;
sorting the original speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionaries;
extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances;
establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
The present invention provides a voice conversion method, device, electronic equipment and readable storage medium. The method includes: segmenting the speech to be converted of a speaker to be converted into multiple frame units to be converted based on a preset segmentation rule; extracting the mel cepstrum feature of each frame unit to be converted; calculating multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted; matching, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units; calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker; and processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted. The method calculates the multiple candidate frame units within the phoneme dictionary of the speaker to be converted, which saves computing resources and improves calculation speed compared with the prior-art approach of searching the entire feature dictionary. At the same time, taking inter-frame smoothness and the contextual information of speech into account, the traditional single-frame calculation is improved into a calculation over units containing multiple frames, and windowed smoothing is performed when joining units, which significantly alleviates the technical problems of discontinuous synthesized speech and poor sound quality.
The foregoing is merely the preferred embodiments of the present invention and is not intended to limit the invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention shall be included in the scope of protection. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined and explained in subsequent drawings.
Claims (12)
1. A voice conversion method, characterized in that the method includes:
segmenting the speech to be converted of a speaker to be converted into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
extracting the mel cepstrum feature of each frame unit to be converted;
calculating multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted;
matching, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units;
calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker;
processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
2. The method according to claim 1, characterized in that the method further includes a step of pre-processing the speech data, the step including:
segmenting, using the preset segmentation rule, the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker, to obtain multiple frame units corresponding to the original speech and multiple frame units corresponding to the target speech;
extracting the mel cepstrum features of the original speech and the target speech, and building an original speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the original speech and the frame units of the target speech;
sorting the original speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances;
establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
3. The method according to claim 2, characterized in that the step of calculating multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted includes:
composing the feature vector of each frame unit to be converted from its mel cepstrum features;
calculating and sorting the Euclidean distances between the feature vector of each frame unit to be converted and the feature vectors of the frame units in the phoneme dictionary;
filtering out the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary using the k-nearest-neighbour algorithm.
4. The method according to claim 2, characterized in that the step of calculating the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker includes:
calculating the target cost between each frame unit to be converted and each target frame unit, and the transfer cost between target frame units at adjacent moments;
searching for the optimal path with the Viterbi algorithm according to the calculated target costs and transfer costs.
5. The method according to claim 2, characterized in that the step of processing the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted includes:
obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech;
performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule;
obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker;
converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech;
performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
6. A voice conversion device, characterized in that the device includes:
a segmentation module, configured to segment the speech to be converted of a speaker to be converted into multiple frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted includes multiple consecutive speech frames;
an extraction module, configured to extract the mel cepstrum feature of each frame unit to be converted;
a computing module, configured to calculate multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted;
a matching module, configured to match, according to the previously obtained correspondence between the frame units of the speaker to be converted and the frame units of the target timbre speaker, the target frame units corresponding to the candidate frame units;
the computing module being further configured to calculate the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker;
a processing module, configured to process the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted.
7. The voice conversion device according to claim 6, characterized in that the device further includes a pre-processing module;
the pre-processing module pre-processes the speech data by:
segmenting, using the preset segmentation rule, the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker, to obtain multiple frame units corresponding to the original speech and multiple frame units corresponding to the target speech;
extracting the mel cepstrum features of the original speech and the target speech, and building an original speech feature dictionary and a target speech feature dictionary;
establishing the correspondence between the frame units of the original speech and the frame units of the target speech;
sorting the original speech feature dictionary according to the annotated phoneme information to obtain the phoneme dictionary;
extracting the fundamental frequency features of the original speech and the target speech, and calculating the fundamental frequency means and variances;
establishing the fundamental frequency mapping between the speaker to be converted and the target timbre speaker according to the fundamental frequency means and variances.
8. The voice conversion device according to claim 7, characterized in that the computing module calculates the multiple candidate frame units according to the previously obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted by:
composing the feature vector of each frame unit to be converted from its mel cepstrum features;
calculating and sorting the Euclidean distances between the feature vector of each frame unit to be converted and the feature vectors of the frame units in the phoneme dictionary;
filtering out the multiple candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary using the k-nearest-neighbour algorithm.
9. The voice conversion device according to claim 7, characterized in that the computing module calculates the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker by:
calculating the target cost between each frame unit to be converted and each target frame unit, and the transfer cost between target frame units at adjacent moments;
searching for the optimal path with the Viterbi algorithm according to the calculated target costs and transfer costs.
10. The voice conversion device according to claim 7, characterized in that the processing module processes the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted by:
obtaining the mel cepstrum features of the target frame units corresponding to the frame units to be converted, according to the correspondence between the frame units of the original speech and the frame units of the target speech;
performing smoothing and joining on the mel cepstrum features of the target frame units on the optimal path, in time order and according to the preset segmentation rule;
obtaining the fundamental frequency features of the target frame units corresponding to the frame units to be converted, according to the fundamental frequency mapping between the speaker to be converted and the target timbre speaker;
converting the mel cepstrum features and fundamental frequency features of the target frame units into the spectrum of the target speech;
performing a frequency-to-time-domain conversion on the spectrum of the target speech to obtain the target speech of the target timbre speaker.
11. An electronic equipment, characterized in that the electronic equipment includes a processor and a memory, the memory being coupled to the processor and storing instructions which, when executed by the processor, cause the electronic equipment to perform the voice conversion method according to any one of claims 1-5.
12. A readable storage medium, the readable storage medium including a computer program, characterized in that the computer program, when running, controls the electronic equipment where the readable storage medium is located to perform the voice conversion method according to any one of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710812770.XA CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107507619A true CN107507619A (en) | 2017-12-22 |
CN107507619B CN107507619B (en) | 2021-08-20 |
Family
ID=60695368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710812770.XA Active CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107507619B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817197A (en) * | 2019-03-04 | 2019-05-28 | 天翼爱音乐文化科技有限公司 | Song generation method, device, computer equipment and storage medium |
CN111048109A (en) * | 2019-12-25 | 2020-04-21 | 广州酷狗计算机科技有限公司 | Acoustic feature determination method and apparatus, computer device, and storage medium |
CN111213205A (en) * | 2019-12-30 | 2020-05-29 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and device, computer equipment and storage medium |
CN112562728A (en) * | 2020-11-13 | 2021-03-26 | 百果园技术(新加坡)有限公司 | Training method for generating confrontation network, and audio style migration method and device |
CN112614481A (en) * | 2020-12-08 | 2021-04-06 | 浙江合众新能源汽车有限公司 | Voice tone customization method and system for automobile prompt tone |
CN112634920A (en) * | 2020-12-18 | 2021-04-09 | 平安科技(深圳)有限公司 | Method and device for training voice conversion model based on domain separation |
CN113345453A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113782050A (en) * | 2021-09-08 | 2021-12-10 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic device and storage medium |
CN114582365A (en) * | 2022-05-05 | 2022-06-03 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090089063A1 (en) * | 2007-09-29 | 2009-04-02 | Fan Ping Meng | Voice conversion method and system |
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103531196A (en) * | 2013-10-15 | 2014-01-22 | 中国科学院自动化研究所 | Sound selection method for waveform concatenation speech synthesis |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN104575488A (en) * | 2014-12-25 | 2015-04-29 | 北京时代瑞朗科技有限公司 | Text information-based waveform concatenation voice synthesizing method |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817197B (en) * | 2019-03-04 | 2021-05-11 | 天翼爱音乐文化科技有限公司 | Singing voice generation method and device, computer equipment and storage medium |
CN109817197A (en) * | 2019-03-04 | 2019-05-28 | 天翼爱音乐文化科技有限公司 | Song generation method, device, computer equipment and storage medium |
CN111048109A (en) * | 2019-12-25 | 2020-04-21 | 广州酷狗计算机科技有限公司 | Acoustic feature determination method and apparatus, computer device, and storage medium |
CN111213205A (en) * | 2019-12-30 | 2020-05-29 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and device, computer equipment and storage medium |
CN111213205B (en) * | 2019-12-30 | 2023-09-08 | 深圳市优必选科技股份有限公司 | Stream-type voice conversion method, device, computer equipment and storage medium |
WO2021134232A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and apparatus, and computer device and storage medium |
CN112562728A (en) * | 2020-11-13 | 2021-03-26 | 百果园技术(新加坡)有限公司 | Training method for generating confrontation network, and audio style migration method and device |
CN112614481A (en) * | 2020-12-08 | 2021-04-06 | 浙江合众新能源汽车有限公司 | Voice tone customization method and system for automobile prompt tone |
CN112634920A (en) * | 2020-12-18 | 2021-04-09 | 平安科技(深圳)有限公司 | Method and device for training voice conversion model based on domain separation |
CN112634920B (en) * | 2020-12-18 | 2024-01-02 | 平安科技(深圳)有限公司 | Training method and device of voice conversion model based on domain separation |
CN113345453A (en) * | 2021-06-01 | 2021-09-03 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113345453B (en) * | 2021-06-01 | 2023-06-16 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113782050A (en) * | 2021-09-08 | 2021-12-10 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic device and storage medium |
CN114582365A (en) * | 2022-05-05 | 2022-06-03 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Also Published As
Publication number | Publication date |
---|---|
CN107507619B (en) | 2021-08-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107507619A (en) | Voice conversion method and device, electronic equipment, and readable storage medium | |
CN109272988B (en) | Speech recognition method based on multi-path convolutional neural network | |
CN107154260B (en) | Domain-adaptive speech recognition method and device | |
CN101178896B (en) | Unit selection voice synthetic method based on acoustics statistical model | |
Wang et al. | Word embedding for recurrent neural network based TTS synthesis | |
CN107705802A (en) | Voice conversion method and device, electronic equipment, and readable storage medium | |
CN106486121B (en) | Voice optimization method and device applied to intelligent robot | |
CN110288980A (en) | Speech recognition method, model training method, device, equipment and storage medium | |
CN103280216B (en) | Context-dependent speech recognition device with improved robustness to environmental changes | |
CN108984529A (en) | Automatic error correction method for real-time courtroom speech recognition, storage medium and computing device | |
CN105810191B (en) | Chinese dialect identification method incorporating prosodic information | |
CN109036467A (en) | CFFD extraction method, and TF-LSTM-based speech emotion recognition method and system | |
CN108986798B (en) | Method, device and equipment for processing voice data | |
CN107526826A (en) | Voice search processing method, device and server | |
CN107122492A (en) | Lyric generation method and device based on picture content | |
CN107291775A (en) | Repair corpus generation method and device for error samples | |
CN111599339B (en) | High-naturalness concatenative speech synthesis method, system, equipment and medium | |
CN109147771A (en) | Audio splitting method and system | |
An et al. | Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features | |
Zhang et al. | Automatic synthesis technology of music teaching melodies based on recurrent neural network | |
CN114927126A (en) | Scheme output method, device and equipment based on semantic analysis and storage medium | |
CN104751856B (en) | Speech sentence recognition method and device | |
US20240161727A1 (en) | Training method for speech synthesis model and speech synthesis method and related apparatuses | |
CN107680584A (en) | Method and apparatus for cutting audio | |
CN112686041A (en) | Pinyin marking method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||