CN109637551A - Phonetics transfer method, device, equipment and storage medium - Google Patents
- Publication number: CN109637551A
- Application number: CN201811604615.XA
- Authority: CN (China)
- Prior art keywords: voice, speaker, characteristic parameter, target, source
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L21/01 — Correction of time axis
- G10L21/013 — Adapting to target pitch
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The present disclosure provides a voice conversion method, comprising: obtaining a predetermined number of utterances from a source speaker and a predetermined number of utterances from a target speaker; training on the characteristic parameters of the acquired source-speaker speech and target-speaker speech to obtain the conversion function of a trained model; extracting characteristic parameters from the source speaker's real-time speech and converting the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the conversion function of the trained model; and generating the target speaker's voice from the converted target-speaker speech characteristic parameters. The disclosure further provides a voice conversion device, an electronic apparatus, and a readable storage medium.
Description
Technical field
This disclosure relates to a voice conversion method, a voice conversion device, an electronic apparatus, and a readable storage medium.
Background art
In telephone customer-service scenarios, customers sometimes make specific requests of the service: they may ask for a female agent, for an agent with a pleasant voice, or for a male agent with a deep, resonant voice.
Current telephone customer-service systems can, by and large, solve this problem only by adding human resources. For example, an operator may spend more money recruiting female agents with sweet voices and male agents with resonant voices to meet customers' requirements and improve the user experience. Another possible approach is to recruit voice artists with exceptional vocal control who can produce both a sweet female voice and a resonant male voice.
Both existing approaches add considerable labor cost and can satisfy only some customers' requirements.
Summary of the invention
To solve at least one of the above technical problems, the present disclosure provides a voice conversion method, a voice conversion device, an electronic apparatus, and a readable storage medium.
According to one aspect of the disclosure, a voice conversion method comprises: obtaining a predetermined number of utterances from a source speaker and a predetermined number of utterances from a target speaker; training on the characteristic parameters of the acquired source-speaker speech and the characteristic parameters of the target-speaker speech to obtain the conversion function of a trained model; extracting characteristic parameters from the source speaker's real-time speech and converting the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the conversion function of the trained model; and generating the target speaker's voice from the converted target-speaker speech characteristic parameters.
According to at least one embodiment of the disclosure, training on the acquired source-speaker and target-speaker speech to obtain the conversion function of the trained model comprises: extracting the characteristic parameters of the source speaker's speech and of the target speaker's speech from the acquired utterances; time-aligning the source speaker's speech characteristic parameters with the target speaker's speech characteristic parameters; and training on the time-aligned source-speaker and target-speaker characteristic parameters to obtain the conversion function of the trained model.
According to at least one embodiment of the disclosure, the trained model uses an LSTM architecture.
According to at least one embodiment of the disclosure, the trained model's structure is three feedforward neural-network layers, two bidirectional LSTM layers, and one feedforward neural-network layer.
According to at least one embodiment of the disclosure, the source speaker's and the target speaker's utterances form a parallel corpus.
According to at least one embodiment of the disclosure, the source speaker and the target speaker provide the same number of utterances, fewer than 200 each.
According to at least one embodiment of the disclosure, the characteristic parameters include spectral parameters, line spectral pairs, and the fundamental frequency.
According to at least one embodiment of the disclosure, the source speaker is a telephone customer-service agent, the target speaker is a target-voice speaker, and the agent's voice is converted online, in real time, into the voice of the target speaker.
According to another aspect of the disclosure, a voice conversion device comprises: a voice acquisition module, which obtains a predetermined number of utterances from the source speaker and a predetermined number of utterances from the target speaker; a voice training module, which trains on the characteristic parameters of the acquired source-speaker speech and target-speaker speech to obtain a trained model; an extraction and conversion module, which extracts characteristic parameters from the source speaker's real-time speech and converts the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the trained model; and a speech generation module, which generates the target speaker's voice from the converted target-speaker speech characteristic parameters.
According to a further aspect of the disclosure, an electronic apparatus comprises: a memory storing computer-executable instructions; and a processor that executes the instructions stored in the memory so as to perform the above method.
According to yet another aspect of the disclosure, a readable storage medium stores computer-executable instructions which, when executed by a processor, implement the above method.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure and to provide a further understanding of it.
Fig. 1 is a schematic flow chart of a voice conversion method according to an embodiment of the disclosure.
Fig. 2 is a schematic flow chart of the training stage of the voice conversion method according to an embodiment of the disclosure.
Fig. 3 is a schematic flow chart of the conversion stage of the voice conversion method according to an embodiment of the disclosure.
Fig. 4 is a schematic block diagram of the trained-model structure according to an embodiment of the disclosure.
Fig. 5 is a schematic block diagram of a voice conversion device according to an embodiment of the disclosure.
Fig. 6 is a schematic block diagram of a voice training device according to an embodiment of the disclosure.
Fig. 7 is a schematic block diagram of a real-time voice conversion device according to an embodiment of the disclosure.
Fig. 8 is a schematic view of an electronic apparatus according to an embodiment of the disclosure.
Detailed description of embodiments
The disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related content and do not limit the disclosure. It should also be noted that, for ease of description, the drawings show only the parts relevant to the disclosure.
It should be noted that, in the absence of conflict, the embodiments of the disclosure and the features within them may be combined with one another. The disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
Voice conversion (VC) refers to changing the personal voice characteristics of one speaker (the source speaker) into the personal voice characteristics of another speaker (the target speaker) without changing the semantic content of the original speech. The information carried in speech is highly complex: the most important component is the semantic information, and another very important component is the speaker's individual voice characteristics. The premise of voice conversion is that the original semantic content must remain unchanged during conversion; the personalized information of the voice is altered so that the converted speech carries more of the target speaker's vocal characteristics while retaining high clarity, intelligibility, and naturalness.
In accordance with one embodiment of the disclosure, a voice conversion method is provided. As shown in Fig. 1, the voice conversion method 10 includes: step S11, obtaining a predetermined number of utterances from the source speaker and the target speaker; step S12, training to obtain a trained model; step S13, extracting the characteristic parameters of the source speaker's real-time speech and converting them into target-speaker speech characteristic parameters; and step S14, generating the target speaker's voice.
In step S11, a predetermined number of utterances is obtained from the source speaker and a predetermined number from the target speaker. In an optional embodiment, the utterances of the source speaker and those of the target speaker form a parallel corpus. Optionally, the two sets are equal in number, each containing fewer than 200 utterances; for example, 50 to 200 utterances may be needed from each speaker. A parallel corpus means that the two speakers record the same content, carrying identical semantic and affective characteristics.
In step S12, characteristic parameters are extracted from the acquired source-speaker speech and target-speaker speech, and the model is trained on the extracted speech characteristic parameters to obtain its conversion function. According to the disclosure, the acquired speech characteristic parameters include spectral parameters, line spectral pairs (LSP), and the fundamental frequency. Optionally, these parameters may be extracted with the STRAIGHT feature-extraction algorithm. The trained model uses an LSTM architecture; optionally its structure is three feedforward neural-network layers, two bidirectional LSTM layers, and one feedforward neural-network layer.
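The three kinds of characteristic parameters named above can be grouped per analysis frame. The sketch below illustrates only the data layout, not the STRAIGHT analyzer itself; the field names and dimensions are assumptions, since the disclosure does not fix them.

```python
# Per-frame container for the speech characteristic parameters described in
# step S12. Dimensions (25 spectral values, 10 LSPs) are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class FrameFeatures:
    spectrum: List[float]  # spectral envelope parameters
    lsp: List[float]       # line spectral pairs
    f0: float              # fundamental frequency in Hz (0.0 = unvoiced)

frame = FrameFeatures(spectrum=[0.1] * 25, lsp=[0.2] * 10, f0=210.0)
print(len(frame.spectrum), len(frame.lsp), frame.f0)  # → 25 10 210.0
```

In a real pipeline, one such record would be produced for every short-time analysis frame of an utterance.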
In step S13, characteristic parameters are extracted from the source speaker's real-time speech, and the trained model converts the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters. For example, while a telephone customer-service agent talks with a customer, the agent's (source speaker's) speech characteristic parameters can be converted into the target speaker's speech characteristic parameters. The parameters extracted here may include spectral parameters, line spectral pairs, and the fundamental frequency.
In step S14, the target speaker's voice (for example a sweet female voice or a resonant male voice) is generated from the converted target-speaker speech characteristic parameters.
The method 10 is described in detail below with reference to Figs. 2 and 3.
Classified by implementation phase, the steps of the voice conversion method 10 can be divided into training-stage steps and conversion-stage steps. The training stage may include steps S11 and S12, and the conversion stage may include steps S13 and S14. The training stage can perform model training offline, while the conversion stage can convert online in real time. In the training stage, the characteristic parameters of the source speaker's and target speaker's speech are extracted by a speech analysis model, the extracted speech characteristic parameters are time-aligned, and training then yields the voice conversion function. In the conversion stage, the speech to be converted is analyzed as a signal and its characteristic parameters are extracted; the conversion function obtained in the training stage maps these parameters to the converted speech characteristic parameters; and finally the converted speech is synthesized from the converted parameters.
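The two stages described above can be sketched end to end in miniature. This is a hedged illustration, not the patented method: the LSTM conversion function is replaced by a per-dimension linear least-squares fit on frames assumed to be already time-aligned, and all function names are invented for the sketch.

```python
# Offline stage: fit a "conversion function" on aligned source/target frames.
# Online stage: map each new source frame through the fitted function.

def fit_conversion(src_frames, tgt_frames):
    """Fit y = a*x + b independently per feature dimension (least squares)."""
    dims = len(src_frames[0])
    n = len(src_frames)
    params = []
    for d in range(dims):
        xs = [f[d] for f in src_frames]
        ys = [f[d] for f in tgt_frames]
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var = sum((x - mx) ** 2 for x in xs) or 1e-9
        a = cov / var
        params.append((a, my - a * mx))
    return params

def convert(frame, params):
    """Apply the fitted conversion function to one source frame."""
    return [a * x + b for x, (a, b) in zip(frame, params)]

# Toy aligned frames: the target is exactly 2*source + 1 in each dimension.
src = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
tgt = [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]]
f = fit_conversion(src, tgt)
print(convert([3.0, 4.0], f))  # → [7.0, 9.0]
```

The offline stage corresponds to `fit_conversion` and the online stage to `convert` applied frame by frame.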
Referring to Fig. 2, a specific example of the training stage is described in detail. As shown in Fig. 2, the training-stage method may include step S21 of obtaining speech, step S22 of extracting speech characteristic parameters, step S23 of time alignment, and step S24 of model training. Here the source speaker may be a telephone customer-service agent, and the target speaker is a target-voice speaker.
Step S21 corresponds to step S11 of method 10. Here, 100 to 200 utterances may be obtained from the source speaker, and the same 100 to 200 utterances from the target speaker. These utterances may take the form of a parallel corpus and are then used for training.
In step S22, the characteristic parameters of the source speaker's speech and of the target speaker's speech are extracted from the acquired utterances. In the disclosure, spectral parameters are extracted from both speakers' speech; the extracted features are line spectral pairs and the fundamental frequency, and in the disclosure the extraction can be implemented with the STRAIGHT feature-extraction algorithm.
In step S23, the extracted speech characteristic parameters are aligned in time. Because speakers differ in speech rate and rhythm, the same sentence takes a different duration for different speakers; even the same speaker saying the same sentence at different times takes unequal durations, and the durations of individual words and syllables within a sentence also vary randomly. Time alignment is therefore needed before the neural network can be trained. In this disclosure, dynamic time warping (DTW) may be used as the time-alignment method.
In step S24, the model is trained on the time-aligned characteristic parameters of the source speaker and the target speaker. The model may use an LSTM architecture, for example three feedforward neural-network layers + two bidirectional LSTM layers + one feedforward neural-network layer (this structure is described in detail below). Model training may run on a single GPU; the number of training epochs may be 30, the learning rate may follow an exponential-decay strategy, and the model may be smoothed with a moving-average method. Finally, the best model obtained in the training stage can be used in the subsequent conversion stage.
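The training schedule mentioned above (30 epochs, an exponentially decaying learning rate, moving-average model smoothing) can be sketched as follows. The base rate, decay factor, and averaging coefficient are illustrative assumptions; the disclosure does not specify them.

```python
# Exponentially decaying learning rate plus an exponential moving average of
# the weights, as a stand-in for the schedule described in step S24.

EPOCHS = 30

def lr_schedule(epoch, base_lr=0.01, decay=0.9):
    """Exponential decay: base_lr * decay**epoch."""
    return base_lr * decay ** epoch

class MovingAverage:
    """Exponential moving average of a scalar weight."""
    def __init__(self, beta=0.99):
        self.beta = beta
        self.value = None
    def update(self, w):
        if self.value is None:
            self.value = w
        else:
            self.value = self.beta * self.value + (1 - self.beta) * w
        return self.value

rates = [lr_schedule(e) for e in range(EPOCHS)]
print(round(rates[0], 6), round(rates[-1], 6))  # → 0.01 0.000471

ema = MovingAverage(beta=0.5)  # large steps so the smoothing is visible
for w in [1.0, 2.0, 3.0]:
    smoothed = ema.update(w)
print(smoothed)  # → 2.25
```

The moving-average copy of the weights changes more slowly than the raw weights, which is what makes it useful as the "best model" kept for the conversion stage.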
Fig. 3 shows the conversion-stage method, in which the source speaker is a telephone customer-service agent, the target speaker is a target-voice speaker, and the agent's voice is converted online, in real time, into the voice of the target speaker. The method includes: step S31, acquiring the source speaker's real-time speech; step S32, extracting the characteristic parameters of the source speaker's real-time speech; step S33, converting the extracted characteristic parameters into target-speaker speech characteristic parameters through the model trained in the training stage; and step S34, generating the target speaker's voice from the converted target-speaker speech characteristic parameters.
In step S31, while the source speaker (the telephone customer-service agent) communicates with the customer, the agent's voice is first captured so that it is obtained in real time.
In step S32, speech characteristic parameters are extracted from the agent's voice obtained in step S31; the parameters extracted here may be spectral parameters, line spectral pairs, and the fundamental frequency.
In step S33, the extracted speech characteristic parameters are fed into the model obtained in the training stage, whose conversion function converts them into the speech characteristic parameters of the target speaker (for example, a person with a sweet female voice or a resonant male voice).
In step S34, the target speaker's voice is generated from the converted target-speaker speech characteristic parameters. In this disclosure, a vocoder can be used to generate the voice, and the vocoder may use a source-filter model. Finally, while the agent communicates with the customer, the generated target-speaker voice can be output in real time over the telephone channel and thereby delivered to the customer.
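The source-filter idea behind the vocoder in step S34 can be illustrated with a toy synthesis: a pulse-train excitation at the fundamental frequency is passed through a crude one-pole resonator standing in for the vocal-tract envelope. This is an illustration of the model, not of any production vocoder; the sampling rate, f0, and filter coefficient are assumed values.

```python
# Toy source-filter synthesis: voiced excitation (pulse train at f0) filtered
# by a one-pole recursion that plays the role of the spectral envelope.

SAMPLE_RATE = 8000  # telephone-band sampling rate

def pulse_train(f0, n_samples, rate=SAMPLE_RATE):
    """Excitation: a unit impulse every rate/f0 samples (voiced source)."""
    period = int(rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def one_pole_filter(x, a=0.95):
    """y[n] = x[n] + a * y[n-1]: a crude resonant 'vocal tract'."""
    y, prev = [], 0.0
    for s in x:
        prev = s + a * prev
        y.append(prev)
    return y

excitation = pulse_train(f0=200, n_samples=80)  # 200 Hz -> 40-sample period
speech = one_pole_filter(excitation)
print(len(speech), speech[0])  # → 80 1.0
```

A real vocoder would instead shape the excitation with the converted spectral parameters and mix in noise for unvoiced frames; only the excitation/filter split is faithful here.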
An example of the trained-model structure used in the disclosure is described in detail below.
Referring to Fig. 4, the trained-model structure may consist of three feedforward neural-network layers (1-1, 1-2, 1-3) + two bidirectional LSTM layers (2-1, 2-2) + one feedforward neural-network layer (3-1). The three feedforward layers 1-1, 1-2, 1-3 extract the source speaker's original speech characteristic parameters; the two bidirectional LSTM layers 2-1, 2-2 learn the temporally contextual information of the speech; and the feedforward layer 3-1 is the forward layer that converts toward the target speaker and finally outputs the target voice features.
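A rough parameter count for the Fig. 4 stack makes the structure concrete. The layer widths below are assumptions (the disclosure does not give them); the formulas are the standard ones: a biased feedforward layer has (n_in + 1) * n_out weights, and one LSTM direction has 4 * (n_in + n_hid + 1) * n_hid weights across its four gates.

```python
# Parameter bookkeeping for: 3 feedforward layers + 2 bidirectional LSTM
# layers + 1 output feedforward layer (the Fig. 4 stack).

def ff_params(n_in, n_out):
    return (n_in + 1) * n_out

def bilstm_params(n_in, n_hid):
    # Two independent directions; each sees n_in inputs and outputs n_hid,
    # so the layer as a whole outputs 2 * n_hid (forward + backward).
    one_direction = 4 * ((n_in + n_hid + 1) * n_hid)
    return 2 * one_direction

# Assumed widths: 65-dim acoustic features in and out, 256-unit layers.
feat, hid = 65, 256
total = 0
total += ff_params(feat, hid)         # feedforward layer 1-1
total += ff_params(hid, hid)          # feedforward layer 1-2
total += ff_params(hid, hid)          # feedforward layer 1-3
total += bilstm_params(hid, hid)      # bidirectional LSTM 2-1
total += bilstm_params(2 * hid, hid)  # bidirectional LSTM 2-2 (fwd+bwd in)
total += ff_params(2 * hid, feat)     # output feedforward layer 3-1
print(total)  # → 2807361
```

Under these assumed widths the model stays below three million parameters, consistent with training on a single GPU as described in step S24.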
According to other embodiments of the disclosure, a voice conversion device is provided. As shown in Fig. 5, the voice conversion device 500 may include: a voice acquisition module 501, which obtains a predetermined number of utterances from the source speaker and a predetermined number from the target speaker; a voice training module 502, which trains on the characteristic parameters of the acquired source-speaker speech and target-speaker speech to obtain a trained model; an extraction and conversion module 503, which extracts characteristic parameters from the source speaker's real-time speech and converts the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the trained model; and a speech generation module 504, which generates the target speaker's voice from the converted target-speaker speech characteristic parameters.
In the voice training module 502, the characteristic parameters of the source speaker's speech and of the target speaker's speech are extracted from the acquired utterances; the source speaker's speech characteristic parameters are time-aligned with the target speaker's; and the time-aligned source-speaker and target-speaker characteristic parameters are used for training, yielding the conversion function of the trained model.
According to other embodiments of the disclosure, a voice training device is provided. As shown in Fig. 6, the voice training device 600 may include: a voice acquisition module 601, a feature extraction module 602, a time alignment module 603, and a model training module 604.
The voice acquisition module 601 obtains 100 to 200 utterances from the source speaker and the same 100 to 200 utterances from the target speaker. These utterances may take the form of a parallel corpus and are then used for training.
The feature extraction module 602 extracts the characteristic parameters of the source speaker's speech and of the target speaker's speech from the acquired utterances. In the disclosure, spectral parameters can be extracted from both speakers' speech; the extracted features are line spectral pairs and the fundamental frequency.
The time alignment module 603 aligns the extracted speech characteristic parameters in time. In the disclosure, dynamic time warping (DTW) may be used as the time-alignment method.
The model training module 604 trains the model on the time-aligned characteristic parameters of the source speaker and the target speaker. The model may use an LSTM architecture, such as the three feedforward neural-network layers + two bidirectional LSTM layers + one feedforward neural-network layer illustrated earlier.
The source speaker may be a telephone customer-service agent, and the target speaker a target-voice speaker. According to a further embodiment of the disclosure, a real-time voice conversion device is provided. As shown in Fig. 7, the real-time voice conversion device 700 includes a source-speaker real-time voice acquisition module 701, a source-speaker real-time voice feature extraction module 702, a conversion module 703, and a target-speaker voice generation module 704. Here the source speaker may be a telephone customer-service agent, the target speaker is a target-voice speaker, and the agent's voice is converted online, in real time, into the voice of the target speaker.
In the source-speaker real-time voice acquisition module 701, while the source speaker (the telephone agent) communicates with the customer, the agent's voice is first captured so that it is obtained in real time. In the source-speaker real-time voice feature extraction module 702, speech characteristic parameters are extracted from the obtained voice; the parameters extracted here may be spectral parameters, line spectral pairs, and the fundamental frequency. In the conversion module 703, the extracted speech characteristic parameters are fed into the model obtained in the training stage, whose conversion function converts them into the target speaker's speech characteristic parameters. In the target-speaker voice generation module 704, the target speaker's voice is generated from the converted target-speaker speech characteristic parameters; in the disclosure, a vocoder can be used to generate the voice, and the vocoder may use a source-filter model. Finally, while the agent communicates with the customer, the generated target-speaker voice can be output in real time over the telephone channel and thereby delivered to the customer.
The technical solution of the disclosure requires very low cost. Unlike traditional speech synthesis, which needs large amounts of speech data (10 hours or more), the disclosure needs only 50 to 200 utterances per speaker, i.e. 5 to 20 minutes of data, to convert to a different speaker's timbre. Because voice conversion needs little data, it both reduces labor costs and satisfies customers' demands, and it can greatly reduce unnecessary expenditure.
The disclosure also provides an electronic apparatus. As shown in Fig. 8, the apparatus includes a communication interface 1000, a memory 2000, and a processor 3000. The communication interface 1000 communicates with external devices for data interaction. The memory 2000 stores a computer program that can run on the processor 3000; when the processor 3000 executes the program, it implements the method of the embodiments above. There may be one or more memories 2000 and one or more processors 3000.
The memory 2000 may include high-speed RAM and may further include non-volatile memory, for example at least one magnetic disk storage device.
If the communication interface 1000, the memory 2000, and the processor 3000 are implemented independently, they may be connected to each other by a bus and complete mutual communication. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is shown in the figure, but this does not mean there is only one bus or one type of bus.
Optionally, in a specific implementation, if the communication interface 1000, the memory 2000, and the processor 3000 are integrated on one chip, they may complete mutual communication through an internal interface.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of executable instruction code that includes one or more steps for implementing a specific logical function or process. The scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of the order shown or discussed, including substantially simultaneously or in reverse order according to the functions involved, as should be understood by those of ordinary skill in the art to which the embodiments of the present disclosure belong. The processor executes the various methods and processing described above. For example, the method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as the memory. In some embodiments, some or all of the software program may be loaded and/or installed via the memory and/or the communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the above-described methods may be executed. Alternatively, in other embodiments, the processor may be configured in any other suitable manner (for example, by means of firmware) to execute one of the above methods.
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium, for use by an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them), or for use in connection with such an instruction execution system, apparatus, or device.
For the purposes of this specification, a "readable storage medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection portion (electronic device) having one or more wirings, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a memory.
It should be understood that each part of the present disclosure may be implemented by hardware, software, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software stored in a memory and executed by a suitable instruction execution system. For example, if implemented by hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), etc.
Those skilled in the art will understand that all or part of the steps of the above embodiment methods may be completed by instructing the relevant hardware through a program. The program may be stored in a readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, etc.
In the description of this specification, reference to the terms "an embodiment/mode", "some embodiments/modes", "example", "specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment/mode or example is included in at least one embodiment/mode or example of the present application. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment/mode or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, where no contradiction arises, those skilled in the art may combine different embodiments/modes or examples described in this specification, and features of different embodiments/modes or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance, or as implicitly indicating the quantity of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of the present application, "plurality" means at least two, for example two, three, etc., unless otherwise specifically defined.
It will be understood by those skilled in the art that the above embodiments are only intended to clearly illustrate the present disclosure and are not intended to limit its scope. For those skilled in the art, other variations or modifications may be made on the basis of the above disclosure, and such variations or modifications remain within the scope of the present disclosure.
Claims (10)
1. A voice conversion method, comprising:
acquiring a predetermined quantity of voices of a source speaker and a predetermined quantity of voices of a target speaker;
performing training based on characteristic parameters of the acquired voices of the source speaker and characteristic parameters of the voices of the target speaker, to obtain a conversion function of a training model;
extracting characteristic parameters from real-time voice of the source speaker, and converting the extracted source speaker speech characteristic parameters into target speaker speech characteristic parameters by the conversion function of the training model; and
obtaining the voice of the target speaker according to the converted target speaker speech characteristic parameters.
2. The method according to claim 1, wherein performing training based on the acquired voices of the source speaker and the target speaker to obtain the conversion function of the training model comprises:
extracting the characteristic parameters of the voices of the source speaker and the characteristic parameters of the voices of the target speaker according to the acquired voices of the source speaker and the target speaker;
performing time alignment between the characteristic parameters of the voices of the source speaker and the characteristic parameters of the voices of the target speaker; and
performing training on the time-aligned characteristic parameters of the source speaker and characteristic parameters of the target speaker, to obtain the conversion function of the training model.
3. The method according to claim 1 or 2, wherein the training model uses an LSTM structure.
4. The method according to claim 3, wherein the structure of the training model is three layers of feedforward neural networks, two layers of bidirectional LSTM, and one layer of feedforward neural network.
5. The method according to any one of claims 1 to 4, wherein the predetermined quantity of voices of the source speaker and the predetermined quantity of voices of the target speaker are voices in the form of a parallel corpus; and the predetermined quantity of voices of the source speaker and the predetermined quantity of voices of the target speaker are identical in quantity, with fewer than 200 utterances.
6. The method according to any one of claims 1 to 5, wherein the characteristic parameters include: spectrum parameters, line spectral pairs, and fundamental frequency.
7. The method according to any one of claims 1 to 6, wherein the source speaker is telephone customer service staff, the target speaker is a target voice speaker, and the voice of the telephone customer service staff is converted into the voice of the target voice speaker online in real time.
8. A voice conversion apparatus, comprising:
a voice acquisition module, which acquires a predetermined quantity of voices of a source speaker and a predetermined quantity of voices of a target speaker;
a voice training module, which performs training based on characteristic parameters of the acquired voices of the source speaker and characteristic parameters of the voices of the target speaker, to obtain a conversion function of a training model;
an extraction and conversion module, which extracts characteristic parameters from real-time voice of the source speaker, and converts the extracted source speaker speech characteristic parameters into target speaker speech characteristic parameters by the conversion function of the training model; and
a speech generation module, which obtains the voice of the target speaker according to the converted target speaker speech characteristic parameters.
9. An electronic device, comprising:
a memory, the memory storing execution instructions; and
a processor, the processor executing the execution instructions stored in the memory, so that the processor executes the method according to any one of claims 1 to 7.
10. A readable storage medium, wherein execution instructions are stored in the readable storage medium, and the execution instructions, when executed by a processor, are used to implement the method according to any one of claims 1 to 7.
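The time-alignment step recited in claim 2 is commonly realized with dynamic time warping (DTW), which matches source and target feature sequences of different lengths frame by frame. The patent does not name a specific algorithm, so the following minimal numpy sketch is an assumption, not the claimed implementation:

```python
import numpy as np

def dtw_align(src: np.ndarray, tgt: np.ndarray):
    """Classic DTW over per-frame feature vectors; returns index pairs
    that time-align source frames to target frames (an assumed
    realization of the alignment step in claim 2)."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # fill the accumulated-cost matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# two short 1-D feature sequences of different lengths
src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.array([[0.0], [2.0], [3.0]])
path = dtw_align(src, tgt)
print(path)  # monotonic path from (0, 0) to (3, 2)
```

The aligned frame pairs would then form the parallel training data for learning the conversion function.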
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811604615.XA CN109637551A (en) | 2018-12-26 | 2018-12-26 | Phonetics transfer method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109637551A true CN109637551A (en) | 2019-04-16 |
Family
ID=66077876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811604615.XA Pending CN109637551A (en) | 2018-12-26 | 2018-12-26 | Phonetics transfer method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109637551A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110223705A (en) * | 2019-06-12 | 2019-09-10 | 腾讯科技(深圳)有限公司 | Phonetics transfer method, device, equipment and readable storage medium storing program for executing |
CN110246489A (en) * | 2019-06-14 | 2019-09-17 | 苏州思必驰信息科技有限公司 | Audio recognition method and system for children |
CN111247584A (en) * | 2019-12-24 | 2020-06-05 | 深圳市优必选科技股份有限公司 | Voice conversion method, system, device and storage medium |
CN111433847A (en) * | 2019-12-31 | 2020-07-17 | 深圳市优必选科技股份有限公司 | Speech conversion method and training method, intelligent device and storage medium |
WO2021028236A1 (en) * | 2019-08-12 | 2021-02-18 | Interdigital Ce Patent Holdings, Sas | Systems and methods for sound conversion |
CN112712813A (en) * | 2021-03-26 | 2021-04-27 | 北京达佳互联信息技术有限公司 | Voice processing method, device, equipment and storage medium |
CN112992162A (en) * | 2021-04-16 | 2021-06-18 | 杭州一知智能科技有限公司 | Tone cloning method, system, device and computer readable storage medium |
WO2021120145A1 (en) * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
CN113345451A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN115064177A (en) * | 2022-06-14 | 2022-09-16 | 中国第一汽车股份有限公司 | Voice conversion method, apparatus, device and medium based on voiceprint encoder |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103886859A (en) * | 2014-02-14 | 2014-06-25 | 河海大学常州校区 | Voice conversion method based on one-to-many codebook mapping |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
2018-12-26 CN CN201811604615.XA patent/CN109637551A/en active Pending
Non-Patent Citations (3)
Title |
---|
L. Sun, S. Kang, K. Li et al.: "Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks", ICASSP *
LIU Zhengchen: "Doctoral Dissertation", 30 October 2018, University of Science and Technology of China *
MIAO Xiaokong et al.: "Research on a voice conversion method implementing fundamental frequency (F0) fusion transformation based on BLSTM", <SCIENCE PC> *
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110223705B (en) * | 2019-06-12 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Voice conversion method, device, equipment and readable storage medium |
CN110223705A (en) * | 2019-06-12 | 2019-09-10 | 腾讯科技(深圳)有限公司 | Phonetics transfer method, device, equipment and readable storage medium storing program for executing |
CN110246489A (en) * | 2019-06-14 | 2019-09-17 | 苏州思必驰信息科技有限公司 | Audio recognition method and system for children |
WO2021028236A1 (en) * | 2019-08-12 | 2021-02-18 | Interdigital Ce Patent Holdings, Sas | Systems and methods for sound conversion |
WO2021120145A1 (en) * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
WO2021127985A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Voice conversion method, system and device, and storage medium |
CN111247584B (en) * | 2019-12-24 | 2023-05-23 | 深圳市优必选科技股份有限公司 | Voice conversion method, system, device and storage medium |
CN111247584A (en) * | 2019-12-24 | 2020-06-05 | 深圳市优必选科技股份有限公司 | Voice conversion method, system, device and storage medium |
CN111433847A (en) * | 2019-12-31 | 2020-07-17 | 深圳市优必选科技股份有限公司 | Speech conversion method and training method, intelligent device and storage medium |
CN111433847B (en) * | 2019-12-31 | 2023-06-09 | 深圳市优必选科技股份有限公司 | Voice conversion method, training method, intelligent device and storage medium |
CN112712813A (en) * | 2021-03-26 | 2021-04-27 | 北京达佳互联信息技术有限公司 | Voice processing method, device, equipment and storage medium |
CN112712813B (en) * | 2021-03-26 | 2021-07-20 | 北京达佳互联信息技术有限公司 | Voice processing method, device, equipment and storage medium |
CN112992162A (en) * | 2021-04-16 | 2021-06-18 | 杭州一知智能科技有限公司 | Tone cloning method, system, device and computer readable storage medium |
CN113345451A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN113345451B (en) * | 2021-04-26 | 2023-08-22 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN115064177A (en) * | 2022-06-14 | 2022-09-16 | 中国第一汽车股份有限公司 | Voice conversion method, apparatus, device and medium based on voiceprint encoder |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637551A (en) | Phonetics transfer method, device, equipment and storage medium | |
Delić et al. | Speech technology progress based on new machine learning paradigm | |
US11514888B2 (en) | Two-level speech prosody transfer | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN108847249A (en) | Sound converts optimization method and system | |
CN110223705A (en) | Phonetics transfer method, device, equipment and readable storage medium storing program for executing | |
CN111433847B (en) | Voice conversion method, training method, intelligent device and storage medium | |
Kelly et al. | Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors | |
CN108305641A (en) | The determination method and apparatus of emotion information | |
Song et al. | ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems | |
CN101578659A (en) | Voice tone converting device and voice tone converting method | |
CN110246488A (en) | Half optimizes the phonetics transfer method and device of CycleGAN model | |
CN112184859B (en) | End-to-end virtual object animation generation method and device, storage medium and terminal | |
CN109599094A (en) | The method of sound beauty and emotion modification | |
CN108010516A (en) | Semantic independent speech emotion feature recognition method and device | |
WO2020175810A1 (en) | Electronic apparatus and method for controlling thereof | |
WO2021212954A1 (en) | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources | |
CN109065073A (en) | Speech-emotion recognition method based on depth S VM network model | |
Fahmy et al. | A transfer learning end-to-end arabic text-to-speech (tts) deep architecture | |
Luong et al. | Laughnet: synthesizing laughter utterances from waveform silhouettes and a single laughter example | |
Van Rijn et al. | Exploring emotional prototypes in a high dimensional TTS latent space | |
Zhang et al. | AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents | |
Johar | Paralinguistic profiling using speech recognition | |
CN112185341A (en) | Dubbing method, apparatus, device and storage medium based on speech synthesis | |
CN116798405A (en) | Speech synthesis method, device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190416 |