CN109637551A - Phonetics transfer method, device, equipment and storage medium - Google Patents
- Publication number: CN109637551A
- Application number: CN201811604615.XA
- Authority: CN (China)
- Prior art keywords: voice, speaker, characteristic parameter, target, source
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L21/007 — Changing voice quality, e.g. pitch or formants, characterised by the process used
- G10L21/01 — Correction of time axis
- G10L21/013 — Adapting to target pitch
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique, using neural networks
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The present disclosure provides a voice conversion method, comprising: obtaining a predetermined number of utterances from a source speaker and a predetermined number of utterances from a target speaker; training on the characteristic parameters of the acquired source-speaker speech and target-speaker speech to obtain the conversion function of a trained model; extracting characteristic parameters from the source speaker's real-time speech and converting the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the conversion function of the trained model; and generating the target speaker's voice from the converted target-speaker speech characteristic parameters. The disclosure further provides a voice conversion device, an electronic apparatus, and a readable storage medium.
Description
Technical field
This disclosure relates to a voice conversion method, a voice conversion device, an electronic apparatus, and a readable storage medium.
Background art
In telephone customer-service scenarios, customers sometimes make specific requests of the service: they may ask for a female agent, for an agent with a pleasant voice, or for a male agent with a deep, resonant voice.
Current telephone customer-service systems can, by and large, solve this problem only by adding human resources. For example, an operator may spend more money recruiting female agents with sweet voices and male agents with resonant voices to meet customers' requirements and improve the user experience. Another possible approach is to recruit voice artists with exceptional vocal control who can produce both a sweet female voice and a resonant male voice.
Both existing approaches add considerable labor cost and can satisfy only some customers' requirements.
Summary of the invention
To solve at least one of the above technical problems, the present disclosure provides a voice conversion method, a voice conversion device, an electronic apparatus, and a readable storage medium.
According to one aspect of the disclosure, a voice conversion method comprises: obtaining a predetermined number of utterances from a source speaker and a predetermined number of utterances from a target speaker; training on the characteristic parameters of the acquired source-speaker speech and the characteristic parameters of the target-speaker speech to obtain the conversion function of a trained model; extracting characteristic parameters from the source speaker's real-time speech and converting the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the conversion function of the trained model; and generating the target speaker's voice from the converted target-speaker speech characteristic parameters.
According to at least one embodiment of the disclosure, training on the acquired source-speaker and target-speaker speech to obtain the conversion function of the trained model comprises: extracting the characteristic parameters of the source speaker's speech and of the target speaker's speech from the acquired utterances; time-aligning the source speaker's speech characteristic parameters with the target speaker's speech characteristic parameters; and training on the time-aligned source-speaker and target-speaker characteristic parameters to obtain the conversion function of the trained model.
According to at least one embodiment of the disclosure, the trained model uses an LSTM architecture.
According to at least one embodiment of the disclosure, the trained model's structure is three feedforward neural-network layers, two bidirectional LSTM layers, and one feedforward neural-network layer.
According to at least one embodiment of the disclosure, the source speaker's and the target speaker's utterances form a parallel corpus.
According to at least one embodiment of the disclosure, the source speaker and the target speaker provide the same number of utterances, fewer than 200 each.
According to at least one embodiment of the disclosure, the characteristic parameters include spectral parameters, line spectral pairs, and the fundamental frequency.
According to at least one embodiment of the disclosure, the source speaker is a telephone customer-service agent, the target speaker is a target-voice speaker, and the agent's voice is converted online, in real time, into the voice of the target speaker.
According to another aspect of the disclosure, a voice conversion device comprises: a voice acquisition module, which obtains a predetermined number of utterances from the source speaker and a predetermined number of utterances from the target speaker; a voice training module, which trains on the characteristic parameters of the acquired source-speaker speech and target-speaker speech to obtain a trained model; an extraction and conversion module, which extracts characteristic parameters from the source speaker's real-time speech and converts the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the trained model; and a speech generation module, which generates the target speaker's voice from the converted target-speaker speech characteristic parameters.
According to a further aspect of the disclosure, an electronic apparatus comprises: a memory storing computer-executable instructions; and a processor that executes the instructions stored in the memory so as to perform the above method.
According to yet another aspect of the disclosure, a readable storage medium stores computer-executable instructions which, when executed by a processor, implement the above method.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and, together with the description, serve to explain the principles of the disclosure and to provide a further understanding of it.
Fig. 1 is a schematic flow chart of a voice conversion method according to an embodiment of the disclosure.
Fig. 2 is a schematic flow chart of the training stage of the voice conversion method according to an embodiment of the disclosure.
Fig. 3 is a schematic flow chart of the conversion stage of the voice conversion method according to an embodiment of the disclosure.
Fig. 4 is a schematic block diagram of the trained-model structure according to an embodiment of the disclosure.
Fig. 5 is a schematic block diagram of a voice conversion device according to an embodiment of the disclosure.
Fig. 6 is a schematic block diagram of a voice training device according to an embodiment of the disclosure.
Fig. 7 is a schematic block diagram of a real-time voice conversion device according to an embodiment of the disclosure.
Fig. 8 is a schematic view of an electronic apparatus according to an embodiment of the disclosure.
Detailed description of embodiments
The disclosure is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the related content and do not limit the disclosure. It should also be noted that, for ease of description, the drawings show only the parts relevant to the disclosure.
It should be noted that, in the absence of conflict, the embodiments of the disclosure and the features within them may be combined with one another. The disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.
Voice conversion (VC) refers to changing the personal voice characteristics of one speaker (the source speaker) into the personal voice characteristics of another speaker (the target speaker) without changing the semantic content of the original speech. The information carried in speech is highly complex: the most important component is the semantic information, and another very important component is the speaker's individual voice characteristics. The premise of voice conversion is that the original semantic content must remain unchanged during conversion; the personalized information of the voice is altered so that the converted speech carries more of the target speaker's vocal characteristics while retaining high clarity, intelligibility, and naturalness.
In accordance with one embodiment of the disclosure, a voice conversion method is provided. As shown in Fig. 1, the voice conversion method 10 includes: step S11, obtaining a predetermined number of utterances from the source speaker and the target speaker; step S12, training to obtain a trained model; step S13, extracting the characteristic parameters of the source speaker's real-time speech and converting them into target-speaker speech characteristic parameters; and step S14, generating the target speaker's voice.
In step S11, a predetermined number of utterances is obtained from the source speaker and a predetermined number from the target speaker. In an optional embodiment, the utterances of the source speaker and those of the target speaker form a parallel corpus. Optionally, the two sets are equal in number, each containing fewer than 200 utterances; for example, 50 to 200 utterances may be needed from each speaker. A parallel corpus means that the two speakers record the same content, carrying identical semantic and affective characteristics.
In step S12, characteristic parameters are extracted from the acquired source-speaker speech and target-speaker speech, and the model is trained on the extracted speech characteristic parameters to obtain its conversion function. According to the disclosure, the acquired speech characteristic parameters include spectral parameters, line spectral pairs (LSP), and the fundamental frequency. Optionally, these parameters may be extracted with the STRAIGHT feature-extraction algorithm. The trained model uses an LSTM architecture; optionally its structure is three feedforward neural-network layers, two bidirectional LSTM layers, and one feedforward neural-network layer.
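The three kinds of characteristic parameters named above can be grouped per analysis frame. The sketch below illustrates only the data layout, not the STRAIGHT analyzer itself; the field names and dimensions are assumptions, since the disclosure does not fix them.

```python
# Per-frame container for the speech characteristic parameters described in
# step S12. Dimensions (25 spectral values, 10 LSPs) are illustrative.
from dataclasses import dataclass
from typing import List

@dataclass
class FrameFeatures:
    spectrum: List[float]  # spectral envelope parameters
    lsp: List[float]       # line spectral pairs
    f0: float              # fundamental frequency in Hz (0.0 = unvoiced)

frame = FrameFeatures(spectrum=[0.1] * 25, lsp=[0.2] * 10, f0=210.0)
print(len(frame.spectrum), len(frame.lsp), frame.f0)  # → 25 10 210.0
```

In a real pipeline, one such record would be produced for every short-time analysis frame of an utterance.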
In step S13, characteristic parameters are extracted from the source speaker's real-time speech, and the trained model converts the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters. For example, while a telephone customer-service agent talks with a customer, the agent's (source speaker's) speech characteristic parameters can be converted into the target speaker's speech characteristic parameters. The parameters extracted here may include spectral parameters, line spectral pairs, and the fundamental frequency.
In step S14, the target speaker's voice (for example a sweet female voice or a resonant male voice) is generated from the converted target-speaker speech characteristic parameters.
The method 10 is described in detail below with reference to Figs. 2 and 3.
Classified by implementation phase, the steps of the voice conversion method 10 can be divided into training-stage steps and conversion-stage steps. The training stage may include steps S11 and S12, and the conversion stage may include steps S13 and S14. The training stage can perform model training offline, while the conversion stage can convert online in real time. In the training stage, the characteristic parameters of the source speaker's and target speaker's speech are extracted by a speech analysis model, the extracted speech characteristic parameters are time-aligned, and training then yields the voice conversion function. In the conversion stage, the speech to be converted is analyzed as a signal and its characteristic parameters are extracted; the conversion function obtained in the training stage maps these parameters to the converted speech characteristic parameters; and finally the converted speech is synthesized from the converted parameters.
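The two stages described above can be sketched end to end in miniature. This is a hedged illustration, not the patented method: the LSTM conversion function is replaced by a per-dimension linear least-squares fit on frames assumed to be already time-aligned, and all function names are invented for the sketch.

```python
# Offline stage: fit a "conversion function" on aligned source/target frames.
# Online stage: map each new source frame through the fitted function.

def fit_conversion(src_frames, tgt_frames):
    """Fit y = a*x + b independently per feature dimension (least squares)."""
    dims = len(src_frames[0])
    n = len(src_frames)
    params = []
    for d in range(dims):
        xs = [f[d] for f in src_frames]
        ys = [f[d] for f in tgt_frames]
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        var = sum((x - mx) ** 2 for x in xs) or 1e-9
        a = cov / var
        params.append((a, my - a * mx))
    return params

def convert(frame, params):
    """Apply the fitted conversion function to one source frame."""
    return [a * x + b for x, (a, b) in zip(frame, params)]

# Toy aligned frames: the target is exactly 2*source + 1 in each dimension.
src = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
tgt = [[1.0, 3.0], [3.0, 5.0], [5.0, 7.0]]
f = fit_conversion(src, tgt)
print(convert([3.0, 4.0], f))  # → [7.0, 9.0]
```

The offline stage corresponds to `fit_conversion` and the online stage to `convert` applied frame by frame.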
Referring to Fig. 2, a specific example of the training stage is described in detail. As shown in Fig. 2, the training-stage method may include step S21 of obtaining speech, step S22 of extracting speech characteristic parameters, step S23 of time alignment, and step S24 of model training. Here the source speaker may be a telephone customer-service agent, and the target speaker is a target-voice speaker.
Step S21 corresponds to step S11 of method 10. Here, 100 to 200 utterances may be obtained from the source speaker, and the same 100 to 200 utterances from the target speaker. These utterances may take the form of a parallel corpus and are then used for training.
In step S22, the characteristic parameters of the source speaker's speech and of the target speaker's speech are extracted from the acquired utterances. In the disclosure, spectral parameters are extracted from both speakers' speech; the extracted features are line spectral pairs and the fundamental frequency, and in the disclosure the extraction can be implemented with the STRAIGHT feature-extraction algorithm.
In step S23, the extracted speech characteristic parameters are aligned in time. Because speakers differ in speech rate and rhythm, the same sentence takes a different duration for different speakers; even the same speaker saying the same sentence at different times takes unequal durations, and the durations of individual words and syllables within a sentence also vary randomly. Time alignment is therefore needed before the neural network can be trained. In this disclosure, dynamic time warping (DTW) may be used as the time-alignment method.
In step S24, the model is trained on the time-aligned characteristic parameters of the source speaker and the target speaker. The model may use an LSTM architecture, for example three feedforward neural-network layers + two bidirectional LSTM layers + one feedforward neural-network layer (this structure is described in detail below). Model training may run on a single GPU; the number of training epochs may be 30, the learning rate may follow an exponential-decay strategy, and the model may be smoothed with a moving-average method. Finally, the best model obtained in the training stage can be used in the subsequent conversion stage.
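The training schedule mentioned above (30 epochs, an exponentially decaying learning rate, moving-average model smoothing) can be sketched as follows. The base rate, decay factor, and averaging coefficient are illustrative assumptions; the disclosure does not specify them.

```python
# Exponentially decaying learning rate plus an exponential moving average of
# the weights, as a stand-in for the schedule described in step S24.

EPOCHS = 30

def lr_schedule(epoch, base_lr=0.01, decay=0.9):
    """Exponential decay: base_lr * decay**epoch."""
    return base_lr * decay ** epoch

class MovingAverage:
    """Exponential moving average of a scalar weight."""
    def __init__(self, beta=0.99):
        self.beta = beta
        self.value = None
    def update(self, w):
        if self.value is None:
            self.value = w
        else:
            self.value = self.beta * self.value + (1 - self.beta) * w
        return self.value

rates = [lr_schedule(e) for e in range(EPOCHS)]
print(round(rates[0], 6), round(rates[-1], 6))  # → 0.01 0.000471

ema = MovingAverage(beta=0.5)  # large steps so the smoothing is visible
for w in [1.0, 2.0, 3.0]:
    smoothed = ema.update(w)
print(smoothed)  # → 2.25
```

The moving-average copy of the weights changes more slowly than the raw weights, which is what makes it useful as the "best model" kept for the conversion stage.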
Fig. 3 shows the conversion-stage method, in which the source speaker is a telephone customer-service agent, the target speaker is a target-voice speaker, and the agent's voice is converted online, in real time, into the voice of the target speaker. The method includes: step S31, acquiring the source speaker's real-time speech; step S32, extracting the characteristic parameters of the source speaker's real-time speech; step S33, converting the extracted characteristic parameters into target-speaker speech characteristic parameters through the model trained in the training stage; and step S34, generating the target speaker's voice from the converted target-speaker speech characteristic parameters.
In step S31, while the source speaker (the telephone customer-service agent) communicates with the customer, the agent's voice is first captured so that it is obtained in real time.
In step S32, speech characteristic parameters are extracted from the agent's voice obtained in step S31; the parameters extracted here may be spectral parameters, line spectral pairs, and the fundamental frequency.
In step S33, the extracted speech characteristic parameters are fed into the model obtained in the training stage, whose conversion function converts them into the speech characteristic parameters of the target speaker (for example, a person with a sweet female voice or a resonant male voice).
In step S34, the target speaker's voice is generated from the converted target-speaker speech characteristic parameters. In this disclosure, a vocoder can be used to generate the voice, and the vocoder may use a source-filter model. Finally, while the agent communicates with the customer, the generated target-speaker voice can be output in real time over the telephone channel and thereby delivered to the customer.
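The source-filter idea behind the vocoder in step S34 can be illustrated with a toy synthesis: a pulse-train excitation at the fundamental frequency is passed through a crude one-pole resonator standing in for the vocal-tract envelope. This is an illustration of the model, not of any production vocoder; the sampling rate, f0, and filter coefficient are assumed values.

```python
# Toy source-filter synthesis: voiced excitation (pulse train at f0) filtered
# by a one-pole recursion that plays the role of the spectral envelope.

SAMPLE_RATE = 8000  # telephone-band sampling rate

def pulse_train(f0, n_samples, rate=SAMPLE_RATE):
    """Excitation: a unit impulse every rate/f0 samples (voiced source)."""
    period = int(rate / f0)
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def one_pole_filter(x, a=0.95):
    """y[n] = x[n] + a * y[n-1]: a crude resonant 'vocal tract'."""
    y, prev = [], 0.0
    for s in x:
        prev = s + a * prev
        y.append(prev)
    return y

excitation = pulse_train(f0=200, n_samples=80)  # 200 Hz -> 40-sample period
speech = one_pole_filter(excitation)
print(len(speech), speech[0])  # → 80 1.0
```

A real vocoder would instead shape the excitation with the converted spectral parameters and mix in noise for unvoiced frames; only the excitation/filter split is faithful here.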
An example of the trained-model structure used in the disclosure is described in detail below.
Referring to Fig. 4, the trained-model structure may consist of three feedforward neural-network layers (1-1, 1-2, 1-3) + two bidirectional LSTM layers (2-1, 2-2) + one feedforward neural-network layer (3-1). The three feedforward layers 1-1, 1-2, 1-3 extract the source speaker's original speech characteristic parameters; the two bidirectional LSTM layers 2-1, 2-2 learn the temporally contextual information of the speech; and the feedforward layer 3-1 is the forward layer that converts toward the target speaker and finally outputs the target voice features.
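A rough parameter count for the Fig. 4 stack makes the structure concrete. The layer widths below are assumptions (the disclosure does not give them); the formulas are the standard ones: a biased feedforward layer has (n_in + 1) * n_out weights, and one LSTM direction has 4 * (n_in + n_hid + 1) * n_hid weights across its four gates.

```python
# Parameter bookkeeping for: 3 feedforward layers + 2 bidirectional LSTM
# layers + 1 output feedforward layer (the Fig. 4 stack).

def ff_params(n_in, n_out):
    return (n_in + 1) * n_out

def bilstm_params(n_in, n_hid):
    # Two independent directions; each sees n_in inputs and outputs n_hid,
    # so the layer as a whole outputs 2 * n_hid (forward + backward).
    one_direction = 4 * ((n_in + n_hid + 1) * n_hid)
    return 2 * one_direction

# Assumed widths: 65-dim acoustic features in and out, 256-unit layers.
feat, hid = 65, 256
total = 0
total += ff_params(feat, hid)         # feedforward layer 1-1
total += ff_params(hid, hid)          # feedforward layer 1-2
total += ff_params(hid, hid)          # feedforward layer 1-3
total += bilstm_params(hid, hid)      # bidirectional LSTM 2-1
total += bilstm_params(2 * hid, hid)  # bidirectional LSTM 2-2 (fwd+bwd in)
total += ff_params(2 * hid, feat)     # output feedforward layer 3-1
print(total)  # → 2807361
```

Under these assumed widths the model stays below three million parameters, consistent with training on a single GPU as described in step S24.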
According to other embodiments of the disclosure, a voice conversion device is provided. As shown in Fig. 5, the voice conversion device 500 may include: a voice acquisition module 501, which obtains a predetermined number of utterances from the source speaker and a predetermined number from the target speaker; a voice training module 502, which trains on the characteristic parameters of the acquired source-speaker speech and target-speaker speech to obtain a trained model; an extraction and conversion module 503, which extracts characteristic parameters from the source speaker's real-time speech and converts the extracted source-speaker speech characteristic parameters into target-speaker speech characteristic parameters through the trained model; and a speech generation module 504, which generates the target speaker's voice from the converted target-speaker speech characteristic parameters.
In the voice training module 502, the characteristic parameters of the source speaker's speech and of the target speaker's speech are extracted from the acquired utterances; the source speaker's speech characteristic parameters are time-aligned with the target speaker's; and the time-aligned source-speaker and target-speaker characteristic parameters are used for training, yielding the conversion function of the trained model.
According to other embodiments of the disclosure, a voice training device is provided. As shown in Fig. 6, the voice training device 600 may include: a voice acquisition module 601, a feature extraction module 602, a time alignment module 603, and a model training module 604.
The voice acquisition module 601 obtains 100 to 200 utterances from the source speaker and the same 100 to 200 utterances from the target speaker. These utterances may take the form of a parallel corpus and are then used for training.
The feature extraction module 602 extracts the characteristic parameters of the source speaker's speech and of the target speaker's speech from the acquired utterances. In the disclosure, spectral parameters can be extracted from both speakers' speech; the extracted features are line spectral pairs and the fundamental frequency.
The time alignment module 603 aligns the extracted speech characteristic parameters in time. In the disclosure, dynamic time warping (DTW) may be used as the time-alignment method.
The model training module 604 trains the model on the time-aligned characteristic parameters of the source speaker and the target speaker. The model may use an LSTM architecture, such as the three feedforward neural-network layers + two bidirectional LSTM layers + one feedforward neural-network layer illustrated earlier.
The source speaker may be a telephone customer-service agent, and the target speaker a target-voice speaker. According to a further embodiment of the disclosure, a real-time voice conversion device is provided. As shown in Fig. 7, the real-time voice conversion device 700 includes a source-speaker real-time voice acquisition module 701, a source-speaker real-time voice feature extraction module 702, a conversion module 703, and a target-speaker voice generation module 704. Here the source speaker may be a telephone customer-service agent, the target speaker is a target-voice speaker, and the agent's voice is converted online, in real time, into the voice of the target speaker.
In the source-speaker real-time voice acquisition module 701, while the source speaker (the telephone agent) communicates with the customer, the agent's voice is first captured so that it is obtained in real time. In the source-speaker real-time voice feature extraction module 702, speech characteristic parameters are extracted from the obtained voice; the parameters extracted here may be spectral parameters, line spectral pairs, and the fundamental frequency. In the conversion module 703, the extracted speech characteristic parameters are fed into the model obtained in the training stage, whose conversion function converts them into the target speaker's speech characteristic parameters. In the target-speaker voice generation module 704, the target speaker's voice is generated from the converted target-speaker speech characteristic parameters; in the disclosure, a vocoder can be used to generate the voice, and the vocoder may use a source-filter model. Finally, while the agent communicates with the customer, the generated target-speaker voice can be output in real time over the telephone channel and thereby delivered to the customer.
The technical solution of the disclosure requires very low cost. Unlike traditional speech synthesis, which needs large amounts of speech data (10 hours or more), the disclosure needs only 50 to 200 utterances per speaker, i.e. 5 to 20 minutes of data, to convert to a different speaker's timbre. Because voice conversion needs little data, it both reduces labor costs and satisfies customers' demands, and it can greatly reduce unnecessary expenditure.
The disclosure also provides an electronic apparatus. As shown in Fig. 8, the apparatus includes a communication interface 1000, a memory 2000, and a processor 3000. The communication interface 1000 communicates with external devices for data interaction. The memory 2000 stores a computer program that can run on the processor 3000; when the processor 3000 executes the program, it implements the method of the embodiments above. There may be one or more memories 2000 and one or more processors 3000.
The memory 2000 may include high-speed RAM and may further include non-volatile memory, for example at least one magnetic disk storage device.
If the communication interface 1000, the memory 2000, and the processor 3000 are implemented independently, they may be connected to each other by a bus and complete mutual communication. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of representation, only one thick line is shown in the figure, but this does not mean there is only one bus or one type of bus.
Optionally, in a specific implementation, if the communication interface 1000, the memory 2000, and the processor 3000 are integrated on one chip, they may complete mutual communication through an internal interface.
Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of executable instruction code that includes one or more steps for implementing a specific logical function or process. The scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of the order shown or discussed, including substantially simultaneously or in reverse order according to the functions involved, as should be understood by those of ordinary skill in the art to which the embodiments of the present disclosure belong. The processor executes the various methods and processing described above. For example, the method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as the memory. In some embodiments, some or all of the software program may be loaded and/or installed via the memory and/or the communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the above-described methods may be executed. Alternatively, in other embodiments, the processor may be configured in any other suitable manner (for example, by means of firmware) to execute one of the above methods.
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium, for use by an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them), or for use in connection with such an instruction execution system, apparatus, or device.
For the purposes of this specification, a "readable storage medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection portion (electronic device) having one or more wirings, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the readable storage medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a memory.
It should be understood that each part of the present disclosure may be implemented by hardware, software, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented by software stored in a memory and executed by a suitable instruction execution system. For example, if implemented by hardware, as in another embodiment, any one or a combination of the following techniques well known in the art may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), etc.
Those skilled in the art will understand that all or part of the steps of the above embodiment methods may be completed by instructing the relevant hardware through a program. The program may be stored in a readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present disclosure may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware, or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, etc.
In the description of this specification, reference to the terms "an embodiment/mode", "some embodiments/modes", "example", "specific example", or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment/mode or example is included in at least one embodiment/mode or example of the present application. In this specification, the schematic expressions of the above terms do not necessarily refer to the same embodiment/mode or example. Moreover, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, where no contradiction arises, those skilled in the art may combine different embodiments/modes or examples described in this specification, and features of different embodiments/modes or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance, or as implicitly indicating the quantity of the indicated technical features. Thus, a feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of the present application, "plurality" means at least two, for example two, three, etc., unless otherwise specifically defined.
It will be understood by those skilled in the art that the above embodiments are only intended to clearly illustrate the present disclosure and are not intended to limit its scope. For those skilled in the art, other variations or modifications may be made on the basis of the above disclosure, and such variations or modifications remain within the scope of the present disclosure.
Claims (10)
1. A voice conversion method, comprising:
acquiring a predetermined quantity of voices of a source speaker and a predetermined quantity of voices of a target speaker;
performing training based on characteristic parameters of the acquired voices of the source speaker and characteristic parameters of the voices of the target speaker, to obtain a conversion function of a training model;
extracting characteristic parameters from real-time voice of the source speaker, and converting the extracted source speaker speech characteristic parameters into target speaker speech characteristic parameters by the conversion function of the training model; and
obtaining the voice of the target speaker according to the converted target speaker speech characteristic parameters.
2. The method according to claim 1, wherein performing training based on the acquired voices of the source speaker and the target speaker to obtain the conversion function of the training model comprises:
extracting the characteristic parameters of the voices of the source speaker and the characteristic parameters of the voices of the target speaker according to the acquired voices of the source speaker and the target speaker;
performing time alignment between the characteristic parameters of the voices of the source speaker and the characteristic parameters of the voices of the target speaker; and
performing training on the time-aligned characteristic parameters of the source speaker and characteristic parameters of the target speaker, to obtain the conversion function of the training model.
3. The method according to claim 1 or 2, wherein the training model uses an LSTM structure.
4. The method according to claim 3, wherein the structure of the training model is three layers of feedforward neural networks, two layers of bidirectional LSTM, and one layer of feedforward neural network.
5. The method according to any one of claims 1 to 4, wherein the predetermined quantity of voices of the source speaker and the predetermined quantity of voices of the target speaker are voices in the form of a parallel corpus; and the predetermined quantity of voices of the source speaker and the predetermined quantity of voices of the target speaker are identical in quantity, with fewer than 200 utterances.
6. The method according to any one of claims 1 to 5, wherein the characteristic parameters include: spectrum parameters, line spectral pairs, and fundamental frequency.
7. The method according to any one of claims 1 to 6, wherein the source speaker is telephone customer service staff, the target speaker is a target voice speaker, and the voice of the telephone customer service staff is converted into the voice of the target voice speaker online in real time.
8. A voice conversion apparatus, comprising:
a voice acquisition module, which acquires a predetermined quantity of voices of a source speaker and a predetermined quantity of voices of a target speaker;
a voice training module, which performs training based on characteristic parameters of the acquired voices of the source speaker and characteristic parameters of the voices of the target speaker, to obtain a conversion function of a training model;
an extraction and conversion module, which extracts characteristic parameters from real-time voice of the source speaker, and converts the extracted source speaker speech characteristic parameters into target speaker speech characteristic parameters by the conversion function of the training model; and
a speech generation module, which obtains the voice of the target speaker according to the converted target speaker speech characteristic parameters.
9. An electronic device, comprising:
a memory, the memory storing execution instructions; and
a processor, the processor executing the execution instructions stored in the memory, so that the processor executes the method according to any one of claims 1 to 7.
10. A readable storage medium, wherein execution instructions are stored in the readable storage medium, and the execution instructions, when executed by a processor, are used to implement the method according to any one of claims 1 to 7.
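The time-alignment step recited in claim 2 is commonly realized with dynamic time warping (DTW), which matches source and target feature sequences of different lengths frame by frame. The patent does not name a specific algorithm, so the following minimal numpy sketch is an assumption, not the claimed implementation:

```python
import numpy as np

def dtw_align(src: np.ndarray, tgt: np.ndarray):
    """Classic DTW over per-frame feature vectors; returns index pairs
    that time-align source frames to target frames (an assumed
    realization of the alignment step in claim 2)."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    # fill the accumulated-cost matrix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# two short 1-D feature sequences of different lengths
src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.array([[0.0], [2.0], [3.0]])
path = dtw_align(src, tgt)
print(path)  # monotonic path from (0, 0) to (3, 2)
```

The aligned frame pairs would then form the parallel training data for learning the conversion function.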
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811604615.XA CN109637551A (en) | 2018-12-26 | 2018-12-26 | Phonetics transfer method, device, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109637551A true CN109637551A (en) | 2019-04-16 |
Family
ID=66077876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811604615.XA Pending CN109637551A (en) | 2018-12-26 | 2018-12-26 | Phonetics transfer method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109637551A (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110223705A (en) * | 2019-06-12 | 2019-09-10 | 腾讯科技(深圳)有限公司 | Phonetics transfer method, device, equipment and readable storage medium storing program for executing |
CN110246489A (en) * | 2019-06-14 | 2019-09-17 | 苏州思必驰信息科技有限公司 | Audio recognition method and system for children |
CN111247584A (en) * | 2019-12-24 | 2020-06-05 | 深圳市优必选科技股份有限公司 | Voice conversion method, system, device and storage medium |
CN111433847A (en) * | 2019-12-31 | 2020-07-17 | 深圳市优必选科技股份有限公司 | Speech conversion method and training method, intelligent device and storage medium |
WO2021028236A1 (en) * | 2019-08-12 | 2021-02-18 | Interdigital Ce Patent Holdings, Sas | Systems and methods for sound conversion |
CN112712813A (en) * | 2021-03-26 | 2021-04-27 | 北京达佳互联信息技术有限公司 | Voice processing method, device, equipment and storage medium |
CN112992162A (en) * | 2021-04-16 | 2021-06-18 | 杭州一知智能科技有限公司 | Tone cloning method, system, device and computer readable storage medium |
WO2021120145A1 (en) * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
CN113345451A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN115064177A (en) * | 2022-06-14 | 2022-09-16 | 中国第一汽车股份有限公司 | Voice conversion method, apparatus, device and medium based on voiceprint encoder |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103531205A (en) * | 2013-10-09 | 2014-01-22 | 常州工学院 | Asymmetrical voice conversion method based on deep neural network feature mapping |
CN103886859A (en) * | 2014-02-14 | 2014-06-25 | 河海大学常州校区 | Voice conversion method based on one-to-many codebook mapping |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
2018-12-26 CN CN201811604615.XA patent/CN109637551A/en active Pending
Non-Patent Citations (3)
Title |
---|
L. Sun, S. Kang, K. Li et al.: "Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks", ICASSP *
LIU Zhengchen: "Doctoral Dissertation", 30 October 2018, University of Science and Technology of China *
MIAO Xiaokong et al.: "Research on a voice conversion method implementing fundamental frequency (F0) fusion transformation based on BLSTM", <SCIENCE PC> *
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110223705B (en) * | 2019-06-12 | 2023-09-15 | 腾讯科技(深圳)有限公司 | Voice conversion method, device, equipment and readable storage medium |
CN110223705A (en) * | 2019-06-12 | 2019-09-10 | 腾讯科技(深圳)有限公司 | Phonetics transfer method, device, equipment and readable storage medium storing program for executing |
CN110246489A (en) * | 2019-06-14 | 2019-09-17 | 苏州思必驰信息科技有限公司 | Audio recognition method and system for children |
WO2021028236A1 (en) * | 2019-08-12 | 2021-02-18 | Interdigital Ce Patent Holdings, Sas | Systems and methods for sound conversion |
WO2021120145A1 (en) * | 2019-12-20 | 2021-06-24 | 深圳市优必选科技股份有限公司 | Voice conversion method and apparatus, computer device and computer-readable storage medium |
WO2021127985A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Voice conversion method, system and device, and storage medium |
CN111247584B (en) * | 2019-12-24 | 2023-05-23 | 深圳市优必选科技股份有限公司 | Voice conversion method, system, device and storage medium |
CN111247584A (en) * | 2019-12-24 | 2020-06-05 | 深圳市优必选科技股份有限公司 | Voice conversion method, system, device and storage medium |
CN111433847A (en) * | 2019-12-31 | 2020-07-17 | 深圳市优必选科技股份有限公司 | Speech conversion method and training method, intelligent device and storage medium |
CN111433847B (en) * | 2019-12-31 | 2023-06-09 | 深圳市优必选科技股份有限公司 | Voice conversion method, training method, intelligent device and storage medium |
CN112712813A (en) * | 2021-03-26 | 2021-04-27 | 北京达佳互联信息技术有限公司 | Voice processing method, device, equipment and storage medium |
CN112712813B (en) * | 2021-03-26 | 2021-07-20 | 北京达佳互联信息技术有限公司 | Voice processing method, device, equipment and storage medium |
CN112992162A (en) * | 2021-04-16 | 2021-06-18 | 杭州一知智能科技有限公司 | Tone cloning method, system, device and computer readable storage medium |
CN113345451A (en) * | 2021-04-26 | 2021-09-03 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN113345451B (en) * | 2021-04-26 | 2023-08-22 | 北京搜狗科技发展有限公司 | Sound changing method and device and electronic equipment |
CN115064177A (en) * | 2022-06-14 | 2022-09-16 | 中国第一汽车股份有限公司 | Voice conversion method, apparatus, device and medium based on voiceprint encoder |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109637551A (en) | Phonetics transfer method, device, equipment and storage medium | |
Delić et al. | Speech technology progress based on new machine learning paradigm | |
US11514888B2 (en) | Two-level speech prosody transfer | |
CN111276120B (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
CN108847249A (en) | Sound converts optimization method and system | |
CN110223705A (en) | Phonetics transfer method, device, equipment and readable storage medium storing program for executing | |
CN111433847B (en) | Voice conversion method, training method, intelligent device and storage medium | |
Kelly et al. | Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors | |
CN108305641A (en) | The determination method and apparatus of emotion information | |
Song et al. | ExcitNet vocoder: A neural excitation model for parametric speech synthesis systems | |
CN101578659A (en) | Voice tone converting device and voice tone converting method | |
CN110246488A (en) | Half optimizes the phonetics transfer method and device of CycleGAN model | |
CN112184859B (en) | End-to-end virtual object animation generation method and device, storage medium and terminal | |
CN109599094A (en) | The method of sound beauty and emotion modification | |
CN108010516A (en) | Semantic independent speech emotion feature recognition method and device | |
WO2020175810A1 (en) | Electronic apparatus and method for controlling thereof | |
WO2021212954A1 (en) | Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources | |
CN109065073A (en) | Speech-emotion recognition method based on depth S VM network model | |
Fahmy et al. | A transfer learning end-to-end arabic text-to-speech (tts) deep architecture | |
Luong et al. | Laughnet: synthesizing laughter utterances from waveform silhouettes and a single laughter example | |
Van Rijn et al. | Exploring emotional prototypes in a high dimensional TTS latent space | |
Zhang et al. | AccentSpeech: Learning accent from crowd-sourced data for target speaker TTS with accents | |
Johar | Paralinguistic profiling using speech recognition | |
CN112185341A (en) | Dubbing method, apparatus, device and storage medium based on speech synthesis | |
CN116798405A (en) | Speech synthesis method, device, storage medium and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190416 |