CN107507619B - Voice conversion method and device, electronic equipment and readable storage medium - Google Patents
- Publication number: CN107507619B (application CN201710812770.XA)
- Authority
- CN
- China
- Prior art keywords
- target
- voice
- frame unit
- converted
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The invention provides a voice conversion method, a voice conversion device, an electronic device and a readable storage medium. The method comprises: segmenting a voice to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum feature of each frame unit to be converted; calculating a plurality of candidate frame units according to a phoneme dictionary and the Mel cepstrum feature of each frame unit to be converted; matching to obtain a target frame unit according to the correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker; calculating the conversion cost to obtain an optimal path; and processing the target frame units on the optimal path to obtain the target voice. Because the candidate frame units are obtained by searching within the phoneme dictionary rather than the entire feature dictionary as in the prior art, the method saves computing resources and improves computation speed; meanwhile, by improving the traditional single-frame calculation into multi-frame calculation, it greatly alleviates the problems of discontinuous synthesized voice and poor sound quality.
Description
Technical Field
The invention relates to the technical field of voice information processing, in particular to a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium.
Background
Speech synthesis technology has achieved fruitful results over nearly half a century of development and plays an extremely important role in fields such as artificial intelligence. Among these technologies, TTS (Text-to-Speech) converts text information generated by a computer or input from outside into intelligible, fluent spoken language, but speech synthesized by TTS generally has the following two problems: first, the timbre is limited to a small number of announcer samples and cannot meet personalized requirements; second, the rhythm is unnatural and the synthesis traces are obvious.
Timbre conversion (also called voice conversion) is a technology for directly converting the timbre of the current speaker into that of an output speaker without changing the speech content; it has the advantages of natural rhythm and better retention of personalized timbre. At present, speech conversion based on speech feature dictionary lookup is the mainstream non-parametric voice conversion technique, and its idea is as follows: first, extract features from an original voice library and a target voice library, establish feature dictionaries, and perform parallel training to obtain a mapping rule; second, extract the feature vectors of the voice to be converted and, for each feature vector, search the target feature dictionary for the K nearest target feature vectors according to the mapping rule; third, calculate the target cost and the connection cost, and search for an optimal path in the K-nearest-neighbor feature matrix using the Viterbi algorithm; fourth, connect the selected target voice feature vectors and convert them into voice. The drawback of this method is that the whole target feature dictionary must be traversed each time the K nearest feature vectors are searched, so the computation is slow and the demand on system performance is high. Moreover, because the connection cost is calculated in units of a single frame, the smoothness between voice frames is not considered, causing loss of transient speech information; the synthesized voice is therefore discontinuous and the sound quality is greatly affected.
Disclosure of Invention
To overcome the above deficiencies in the prior art, the present invention provides a voice conversion method, apparatus, electronic device and readable storage medium that preserve spectral detail while ensuring the continuity of the synthesized voice.
It is an object of a first aspect of the present invention to provide a method of speech conversion, the method comprising:
segmenting the voice to be converted of the speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted comprises a plurality of continuous voice frames;
extracting the Mel cepstrum characteristic of each frame unit to be converted;
calculating to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted;
matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, which is obtained in advance;
calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
Optionally, the method further comprises pre-processing the speech data;
the step of preprocessing the speech data comprises:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
It is an object of a second aspect of the present invention to provide a speech conversion apparatus, comprising:
the segmentation module is used for segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, wherein each to-be-converted frame unit comprises a plurality of continuous voice frames;
the extraction module is used for extracting the Mel cepstrum characteristics of each frame unit to be converted;
the computing module is used for computing a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted;
the matching module is used for matching to obtain a target frame unit corresponding to the candidate frame unit according to the corresponding relation between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker obtained in advance;
the computing module is also used for computing the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and the processing module is used for processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
Optionally, the apparatus further comprises: a preprocessing module;
the mode of the preprocessing module for preprocessing the voice data comprises the following steps:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
It is an object of a third aspect of the present invention to provide an electronic apparatus, comprising: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method of speech conversion according to the first aspect of the invention.
It is an object of a fourth aspect of the present invention to provide a readable storage medium comprising a computer program which, when run, controls an electronic device in which the readable storage medium resides to execute the voice conversion method according to the first aspect of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium. The method comprises the steps of segmenting the voice to be converted of a speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum characteristic of each frame unit to be converted; calculating to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted; matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, which is obtained in advance; calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker; and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted. The method obtains a plurality of candidate frame units by calculation in the phoneme dictionary of the speaker to be converted, saves calculation resources and improves calculation speed compared with the prior art in which the candidate frame units are searched from the whole technical feature dictionary, and simultaneously improves the technical problems of discontinuous synthesized voice and poor tone quality by improving the traditional single-frame calculation into the multi-frame calculation.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a flowchart of a voice conversion method according to a first embodiment of the present invention.
Fig. 3 is a flowchart illustrating another step of the voice conversion method according to the first embodiment of the present invention.
Fig. 4 is a flowchart of sub-steps of step S170 in fig. 3.
Fig. 5 is a schematic diagram of a frame unit structure.
Fig. 6 is a schematic diagram of adding frame units to a corresponding plurality of speech phoneme sets at the same time.
FIG. 7 is a schematic diagram of a Viterbi path search provided by an embodiment of the invention.
Fig. 8 is a flowchart of sub-steps of step S160 in fig. 1 or 3.
Fig. 9 is a block diagram of a speech conversion apparatus according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Fig. 1 is a block diagram of an electronic device 100 according to a preferred embodiment of the invention. The electronic device 100 may include a voice conversion apparatus 300, a memory 111, a storage controller 112, and a processor 113.
The memory 111, the memory controller 112 and the processor 113 are electrically connected to one another, directly or indirectly, to realize data transmission or interaction. For example, these components may be electrically connected to one another via one or more communication buses or signal lines. The voice conversion apparatus 300 may include at least one software functional module stored in the memory 111 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 100. The processor 113 executes the executable modules stored in the memory 111, such as the software functional modules and computer programs included in the voice conversion apparatus 300.
The memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 stores a program, and the processor 113 executes the program after receiving an execution instruction. Access to the memory 111 by the processor 113, and possibly by other components, may be under the control of the memory controller 112.
The processor 113 may be an integrated circuit chip having signal processing capabilities. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
First embodiment
Referring to fig. 2, fig. 2 is a flowchart illustrating a voice conversion method according to a preferred embodiment of the invention. The method is applied to the electronic device 100 described above, and the steps of the voice conversion method are described in detail below.
Step S110, segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule.
In this embodiment, the range of speech to be converted may be selected by labeling; optionally, an automatic speech labeling tool may be invoked to label the speech of the speaker to be converted and thereby select the speech to be converted.
After the marked voice to be converted is obtained, the voice to be converted is segmented by adopting a preset segmentation rule, so that each segmented frame unit comprises a plurality of continuous voice frames.
And step S120, extracting the Mel cepstrum characteristic of each frame unit to be converted.
In this embodiment, step S120 includes:
and carrying out time-frequency domain change on the frame units to be converted to obtain the frequency spectrum information of each frame unit to be converted.
And extracting the Mel cepstrum characteristics of the frame unit by adopting a Mel filter bank.
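As an illustrative sketch only (not the patent's actual implementation), the extraction described above — a time-to-frequency-domain transformation followed by a mel filter bank — might be written as follows in Python; the FFT size, filter count and cepstral order are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_cepstrum(frame, sr=16000, n_filters=24, n_ceps=13):
    # Time -> frequency domain: magnitude spectrum of one frame.
    spec = np.abs(np.fft.rfft(frame))
    n_fft = len(frame)
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_mel = np.log(fb @ spec + 1e-10)
    # A DCT-II of the log mel spectrum yields the mel cepstrum.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return dct @ log_mel
```

In practice this would be applied to every windowed frame of each frame unit, and the resulting coefficients concatenated into the unit's feature vector.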
Step S130, a plurality of candidate frame units are obtained through calculation according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted.
The step S130 may include the following sub-steps.
And forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted.
And calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary.
And screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm.
The K-nearest neighbor algorithm is a classification algorithm: a sample is assigned to a class when the majority of its K most similar samples (i.e., its nearest neighbors in the feature space) belong to that class.
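A minimal sketch of the candidate screening described above — Euclidean distances ranked and the K nearest frame units kept — could look like this; the function name and the array layout of the phoneme dictionary are our assumptions:

```python
import numpy as np

def k_nearest_units(query_vec, phoneme_dict, k=5):
    """Rank the frame units of one phoneme set by Euclidean distance to
    the query feature vector and keep the k nearest as candidate frame
    units. `phoneme_dict` is an (N, D) array of mel-cepstral feature
    vectors; this layout is illustrative, not from the patent."""
    dists = np.linalg.norm(phoneme_dict - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]
```

Because the search is restricted to the phoneme set matching the unit's labeled phoneme, N here is far smaller than the full feature dictionary, which is where the claimed speed-up comes from.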
Step S140, according to the pre-obtained corresponding relationship between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker, the target frame unit corresponding to the candidate frame unit is obtained through matching.
Referring to fig. 3, in the present embodiment, the method further includes step S170.
Step S170, preprocessing the voice data.
And performing parallel training on the original voice in the original voice library corresponding to the speaker to be converted and the target voice in the target voice library corresponding to the target timbre speaker to establish a corresponding relation between a frame unit of the speaker to be converted and a frame unit of the target timbre speaker and a mapping relation of fundamental frequencies between the speaker to be converted and the target timbre speaker. In the process, the original voice and the target voice are trained in parallel, so that the contents of the original voice and the target voice are required to correspond one by one and are consistent.
Referring to fig. 4, in the present embodiment, the step S170 includes the following sub-steps.
And a substep S171, segmenting the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker by using the preset segmentation rule, so as to obtain a plurality of frame units corresponding to the original speech and a plurality of frame units corresponding to the target speech.
In this embodiment, in order to establish the mapping relationship between the original speech and the target speech, parallel training needs to be performed, that is, the contents of the original speech library and the target speech library are consistent, and the duration is long enough.
Referring to fig. 5, in the present embodiment, in consideration of the smooth connection between frame units and the transient information of speech, the present solution selects an odd number q of consecutive frames (q = 2p + 1) as a frame unit, where the center frame is the (p + 1)-th frame with p frames before and p frames after it, and two adjacent frame units overlap by 2p frames. It is understood that the preset segmentation rule employed in sub-step S171 is the same as the preset segmentation rule employed in step S110.
For the original speech, the frame sequence may be denoted X = [x^(1), x^(2), x^(3), ..., x^(n), ..., x^(N)], and the n-th unit may be represented as x^(n) = [x_{n-p}, x_{n-p+1}, ..., x_n, ..., x_{n+p-1}, x_{n+p}], where x_n denotes the n-th frame in the frame sequence. Similarly, the same unit division operation can be performed on the target speech.
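The unit division just described (units of q = 2p + 1 consecutive frames, adjacent units overlapping by 2p frames) can be sketched as follows; this is an illustration, not the patent's code:

```python
def split_into_units(frames, p=2):
    """Slice a frame sequence into overlapping units of q = 2p + 1
    consecutive frames: unit n is centred on frame n, carries p frames
    of context on each side, and adjacent units overlap by 2p frames
    (i.e., a hop of one frame)."""
    units = [frames[n - p : n + p + 1] for n in range(p, len(frames) - p)]
    return units
```

With p = 2 each unit holds q = 5 frames, and consecutive units share 4 frames, which is what preserves the transient information across unit boundaries.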
And a substep S172 of extracting Mel cepstrum characteristics of the original voice and the target voice and constructing an original voice characteristic dictionary and a target voice characteristic dictionary.
In the embodiment, each frame of spectrum information is obtained after fast fourier transform, and mel cepstrum features are extracted through a mel filter bank. And constructing an original speech feature dictionary and a target speech feature dictionary through the extracted Mel cepstrum features.
And a sub-step S173 of establishing a corresponding relationship between the frame unit of the original speech and the frame unit of the target speech.
In this embodiment, a DTW (Dynamic Time Warping) algorithm is adopted to establish the correspondence between original speech frames and target speech frames. The correspondence between the original speech and the target speech may be expressed as Z = [z_1, z_2, ..., z_l, ..., z_L], where each z_l is a pairing of a frame unit of the original speech with a frame unit of the target speech. Establishing this correspondence provides the basis for looking up frame units of the target voice from frame units of the original voice in the timbre conversion stage.
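A textbook DTW alignment recovers exactly such a pairing sequence Z. The sketch below uses scalar features for brevity; the patent does not specify its DTW variant, so the step set and distance function are assumptions:

```python
def dtw_align(src, tgt, dist=lambda a, b: abs(a - b)):
    """Minimal DTW sketch: returns the warping path pairing source
    frame units with target frame units (illustrative, 1-D features)."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack to recover the pairings z_l = (i, j).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    return path[::-1]
```

In the real pipeline `dist` would be the Euclidean distance between mel-cepstral feature vectors rather than a scalar difference.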
And a substep S174, classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary.
In this embodiment, each piece of speech phoneme information in the original speech is labeled in advance, and the frame unit of each original speech is classified into each phoneme dictionary according to the position of the frame unit of each original speech in the original speech. Referring to fig. 6, since a frame unit includes a plurality of continuous frames, it may happen that one frame unit spans two (or more) speech phoneme sets, and in order to ensure the conversion quality, the frame unit is added to at least one phoneme dictionary at the same time.
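The classification into phoneme dictionaries, including the case of a frame unit straddling a phoneme boundary, might be sketched as follows; the function name, the span representation and the overlap criterion are illustrative assumptions:

```python
def build_phoneme_dicts(units, unit_spans, phone_segs):
    """Assign each frame unit to every phoneme whose labeled time span
    it overlaps, so a unit straddling a phoneme boundary is added to
    both (or more) phoneme sets at the same time.
    unit_spans: list of (start_frame, end_frame) per unit;
    phone_segs: list of (phone_label, start_frame, end_frame)."""
    dicts = {}
    for u, (us, ue) in zip(units, unit_spans):
        for phone, ps, pe in phone_segs:
            if us < pe and ue > ps:  # half-open spans overlap
                dicts.setdefault(phone, []).append(u)
    return dicts
```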
The phoneme dictionary is obtained through the classification mode, and the mode of obtaining a plurality of candidate frame units based on the phoneme dictionary calculation can save calculation resources and improve calculation speed compared with the mode of searching from the whole technical feature dictionary in the prior art.
The substep S175 is to extract the fundamental frequency features of the original speech and the target speech, and calculate the fundamental frequency mean and the fundamental frequency variance.
And a substep S176, establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
In this embodiment, the excitation of voiced sound is a periodic pulse train whose frequency is the fundamental frequency, so the fundamental frequency is also an important feature of speech; the accuracy of fundamental-frequency extraction directly affects how well the synthesized speech preserves personalized timbre and rhythm. Statistically, two distributions of the same family (e.g., normal distributions) with different statistics (mean, variance) can be transformed into each other. Therefore, the fundamental-frequency features of the original speech and the target speech are treated as obeying normal distributions, and the fundamental-frequency mean and variance are calculated so that a fundamental-frequency mapping between the original speech and the target speech can be established. This mapping makes it possible to obtain the fundamental-frequency feature of the target speech from the speech to be converted in the subsequent conversion stage.
And step S150, calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker.
In this embodiment, the step S150 obtains the optimal path for converting the voice to be converted into the voice of the speaker with the target timbre in the following manner.
And calculating the target cost between the frame unit to be converted and the target frame unit and the transfer cost between the target frame units at adjacent moments.
And searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
Optionally, the Euclidean distance is used to calculate both the target cost between the frame unit to be converted and a target frame unit, and the transfer cost between target frame units at adjacent times. The Viterbi algorithm is then equivalent to searching for a minimum-cost path in a weighted directed acyclic graph.
The calculation formula of the target cost may be as follows:

ω_t(k') = √( Σ_i Σ_d ( X^(t)(i,d) − Y_{k'}^(t)(i,d) )² )

where ω_t(k') represents the self weight of each node in the weighted directed acyclic graph, which can be understood as the target cost in this embodiment. It describes the distance between the frame unit X^(t) to be converted and the target frame unit Y_{k'}^(t): the smaller the weight, the more similar the two are. X^(t)(i,d) and Y_{k'}^(t)(i,d) denote the d-th dimensional data of the i-th frame in the respective unit at time t.
The transfer weight between nodes in the weighted directed acyclic graph is the connection cost. It describes the distance between the target frame unit Y_k^(t) at time t and the target frame unit Y_{k'}^(t+1) at time t+1: the smaller the weight, the more similar the two units and the smoother the transition. According to the above principle, the optimal path can be searched for in the target frame unit matrix. Referring to fig. 7, each node on the path (formed by the arrowed line in fig. 7) is the optimal choice at the corresponding time.
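A minimal sketch of the Viterbi search over the candidate lattice might look like this, assuming precomputed target-cost and concatenation-cost arrays (the array shapes and function name are illustrative assumptions, not from the patent):

```python
import numpy as np

def viterbi_path(target_cost, concat_cost):
    """Minimum-cost path through a T x K candidate lattice.

    target_cost: (T, K) node weights -- distance between the unit to be
                 converted and each candidate target unit at each time.
    concat_cost: (T-1, K, K) edge weights between candidates at
                 adjacent times.
    Returns the index of the chosen candidate at each time step.
    """
    T, K = target_cost.shape
    acc = np.zeros((T, K))          # accumulated minimum cost
    back = np.zeros((T, K), dtype=int)  # backpointers
    acc[0] = target_cost[0]
    for t in range(1, T):
        # trans[j, k] = best cost of reaching candidate k via candidate j
        trans = acc[t - 1][:, None] + concat_cost[t - 1]
        back[t] = np.argmin(trans, axis=0)
        acc[t] = trans[back[t], np.arange(K)] + target_cost[t]
    # Trace the best path backwards from the cheapest final node.
    path = [int(np.argmin(acc[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With zero concatenation costs the search degenerates to picking the cheapest candidate per time step, which is a useful sanity check.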
Step S160, processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
In this embodiment, referring to fig. 8, the step S160 may include the following sub-steps.
And a substep S161, obtaining the mel cepstrum feature of the target frame unit corresponding to the frame unit to be converted according to the corresponding relationship between the frame unit of the original voice and the frame unit of the target voice.
And a substep S162, performing smooth connection processing on the mel-frequency cepstrum features of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule.
In this embodiment, adjacent target frame units overlap by 2p frames, so transient-window smoothing is needed when they are concatenated into a feature matrix to ensure auditory continuity. The following operation is performed for each target frame unit: each frame in the target frame unit is multiplied by a weighting factor. In this embodiment the transient window w is expressed by an exponential function, as follows,
w=exp(-λ|a|),a=[p,p-1,...,0,...,p-1,p]
where λ is a scalar that adjusts the shape of the transient window w. The larger λ is, the more prominent the center-frame information and the weaker the transient information of the adjacent frames; conversely, the smaller λ is, the more the transient information of adjacent frames is taken into account and the weaker the center-frame information. A suitable λ therefore balances the transient information against the center frame. Before windowing, the elements of the transient window are normalized so that they sum to 1.
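The transient window and its normalization follow directly from the formula above (the function name is an assumption):

```python
import numpy as np

def transient_window(p, lam):
    """Exponential transient window w = exp(-lambda * |a|),
    a = [p, p-1, ..., 0, ..., p-1, p], normalized to sum to 1
    before the weighted overlap-add of adjacent frame units."""
    a = np.abs(np.arange(-p, p + 1))  # |a| over the 2p+1 frames
    w = np.exp(-lam * a)
    return w / w.sum()
```

The window is symmetric, peaks at the center frame, and decays towards the overlapped edge frames; larger `lam` concentrates more weight on the center frame, as the text describes.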
And a substep S163 of obtaining the fundamental frequency characteristic of the target frame unit corresponding to the frame unit to be converted according to the mapping relationship of the fundamental frequency between the speaker to be converted and the target timbre speaker.
The fundamental-frequency mean of the speech to be converted is subtracted from the fundamental-frequency sequence of the speech to be converted; the resulting difference is multiplied by the quotient of the fundamental-frequency variance of the target speech and the fundamental-frequency variance of the speech to be converted; and the product is added to the fundamental-frequency mean of the target speech, giving the fundamental-frequency sequence of the target speech. The calculation formula of the fundamental-frequency sequence of the target speech may be as follows:

f0(i) = (fs0(i) − sf0m) × (tf0v / sf0v) + tf0m

where f0(i) is the target-speech fundamental-frequency sequence, fs0(i) is the fundamental-frequency sequence of the speech to be converted, sf0m and tf0m are the fundamental-frequency means of the speech to be converted and of the target speech, respectively, and sf0v and tf0v are the fundamental-frequency variances of the speech to be converted and of the target speech, respectively.
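Under the reading that the source pitch contour is shifted and scaled by the two speakers' statistics, the conversion can be sketched as below (note the patent states a variance quotient; many implementations instead use a standard-deviation ratio in the log-F0 domain — the function name and the plain-variance form here follow the text):

```python
import numpy as np

def convert_f0(f0_src, sf0m, sf0v, tf0m, tf0v):
    """f0(i) = (fs0(i) - sf0m) * (tf0v / sf0v) + tf0m
    -- map the source pitch contour onto the target speaker's
    mean/variance statistics."""
    f0_src = np.asarray(f0_src, dtype=float)
    return (f0_src - sf0m) * (tf0v / sf0v) + tf0m
```

A source frame at the source mean lands exactly on the target mean, and deviations from the mean are rescaled by the variance quotient.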
And a substep S164 of converting the mel cepstral feature and the fundamental frequency feature of the target frame unit into a frequency spectrum of the target speech.
In this embodiment, the STRAIGHT toolkit is optionally invoked to convert the Mel cepstrum feature and the fundamental-frequency feature of the target frame unit into the frequency spectrum of the target speech.
And a substep S165 of performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
In this embodiment, an inverse Fourier transform is used to convert the frequency spectrum of the target speech into the time-domain target speech of the target timbre speaker.
Second embodiment
Referring to fig. 9, fig. 9 is a block diagram of a voice conversion apparatus 300 according to a preferred embodiment of the invention. The voice conversion apparatus 300 includes: a segmentation module 310, an extraction module 320, a calculation module 330, a matching module 340, and a processing module 350.
The segmentation module 310 is configured to segment the to-be-converted speech of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, where each to-be-converted frame unit includes a plurality of continuous speech frames.
The extracting module 320 is configured to extract the Mel cepstrum feature of each frame unit to be converted.
In this embodiment, the manner of extracting the mel cepstrum feature of the frame unit to be converted by the extracting module 320 includes:
carrying out time-frequency domain transformation on the frame units to be converted to obtain frequency spectrum information of each frame unit;
and extracting the Mel cepstrum characteristics of the frame unit by adopting a Mel filter bank.
The calculating module 330 is configured to calculate to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted.
In this embodiment, the calculating module 330 calculates a plurality of candidate frame units according to the pre-obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted, including:
forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
and screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm.
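The distance ranking and K-nearest-neighbour screening can be sketched as follows (a plain brute-force version; the `(N, D)` entry layout and function name are assumptions for illustration):

```python
import numpy as np

def knn_candidates(query_vec, phoneme_entries, k=5):
    """Rank the frame units of one phoneme sub-dictionary by Euclidean
    distance to the query feature vector and keep the K nearest as
    candidate frame units.

    phoneme_entries: (N, D) matrix of feature vectors for one phoneme.
    Returns the indices of the K nearest entries, nearest first.
    """
    dists = np.linalg.norm(phoneme_entries - query_vec, axis=1)
    order = np.argsort(dists)  # ascending distance = most similar first
    return order[: min(k, len(order))]
```

Because the search is restricted to one phoneme sub-dictionary rather than the whole feature dictionary, N stays small and the brute-force distance computation remains cheap.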
The matching module 340 is configured to match the frame unit of the speaker to be converted with the frame unit of the target timbre speaker according to a correspondence relationship between the frame units of the speaker to be converted and the frame units of the target timbre speaker, so as to obtain a target frame unit corresponding to the candidate frame unit.
The calculating module 330 is further configured to calculate a conversion cost, and obtain an optimal path for converting the voice to be converted into the voice of the target timbre speaker.
In this embodiment, the calculating module 330 calculates the conversion cost, and the method for obtaining the optimal path for converting the voice to be converted into the target timbre speaker voice includes:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
The processing module 350 is configured to process the target frame unit on the optimal path to obtain a target voice of the target timbre speaker corresponding to the voice to be converted.
In this embodiment, the processing module 350 processes the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted includes:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
Referring to fig. 9 again, in the present embodiment, the voice conversion apparatus 300 further includes: a pre-processing module 360.
The way of the preprocessing module 360 preprocessing the voice data includes:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
The invention provides a voice conversion method and device, an electronic device, and a readable storage medium. The method segments the speech to be converted of a speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracts the Mel cepstrum feature of each frame unit to be converted; calculates a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum feature of each frame unit to be converted; matches a target frame unit corresponding to each candidate frame unit according to a pre-obtained correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker; calculates the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker; and processes the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted. Because the candidate frame units are computed within the phoneme dictionary of the speaker to be converted, computing resources are saved and calculation speed is improved compared with the prior-art practice of searching the entire feature dictionary. At the same time, inter-frame smoothing and transient speech information are both taken into account: the traditional single-frame calculation is improved into calculation over units containing multiple frames, and windowed smoothing is performed when the units are concatenated, which greatly alleviates the problems of discontinuous synthesized speech and poor sound quality.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Claims (10)
1. A method of speech conversion, the method comprising:
segmenting the voice to be converted of the speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted comprises a plurality of continuous voice frames;
extracting the Mel cepstrum characteristic of each frame unit to be converted;
forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm;
matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, wherein the frame unit of the speaker to be converted is obtained by segmenting original voice in an original voice library corresponding to the speaker to be converted according to the preset segmentation rule;
calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
2. The method of claim 1, wherein the method further comprises: a step of preprocessing speech data, the step comprising:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
3. The method of claim 2, wherein the step of calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target speaker with timbre comprises:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
4. The method according to claim 2, wherein the step of processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted comprises:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
5. A speech conversion apparatus, characterized in that the apparatus comprises:
the segmentation module is used for segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, wherein each to-be-converted frame unit comprises a plurality of continuous voice frames;
the extraction module is used for extracting the Mel cepstrum characteristics of each frame unit to be converted;
the computing module is used for forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm;
the matching module is used for matching and obtaining a target frame unit corresponding to the candidate frame unit according to the corresponding relation between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker, wherein the frame unit of the speaker to be converted is obtained by segmenting the original voice in the original voice library corresponding to the speaker to be converted according to the preset segmentation rule;
the computing module is also used for computing the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and the processing module is used for processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
6. The speech conversion device of claim 5, wherein the device further comprises: a preprocessing module;
the mode of the preprocessing module for preprocessing the voice data comprises the following steps:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
7. The speech conversion device of claim 6, wherein the computing module computes the conversion cost, and the means for obtaining the optimal path for converting the speech to be converted into the target timbre speaker speech comprises:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
8. The speech conversion device according to claim 6, wherein the processing module processes the target frame unit on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted comprises:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the speech conversion method of any of claims 1-4.
10. A readable storage medium, the readable storage medium comprising a computer program, characterized in that: the computer program controls the electronic device in which the readable storage medium is located to execute the speech conversion method according to any one of claims 1 to 4 when running.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710812770.XA CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107507619A CN107507619A (en) | 2017-12-22 |
CN107507619B true CN107507619B (en) | 2021-08-20 |
Family
ID=60695368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710812770.XA Active CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107507619B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817197B (en) * | 2019-03-04 | 2021-05-11 | 天翼爱音乐文化科技有限公司 | Singing voice generation method and device, computer equipment and storage medium |
CN111048109A (en) * | 2019-12-25 | 2020-04-21 | 广州酷狗计算机科技有限公司 | Acoustic feature determination method and apparatus, computer device, and storage medium |
WO2021134232A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and apparatus, and computer device and storage medium |
CN112562728B (en) * | 2020-11-13 | 2024-06-18 | 百果园技术(新加坡)有限公司 | Method for generating countermeasure network training, method and device for audio style migration |
CN112614481A (en) * | 2020-12-08 | 2021-04-06 | 浙江合众新能源汽车有限公司 | Voice tone customization method and system for automobile prompt tone |
CN112634920B (en) * | 2020-12-18 | 2024-01-02 | 平安科技(深圳)有限公司 | Training method and device of voice conversion model based on domain separation |
CN113345453B (en) * | 2021-06-01 | 2023-06-16 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113782050A (en) * | 2021-09-08 | 2021-12-10 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic device and storage medium |
CN114582365B (en) * | 2022-05-05 | 2022-09-06 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103531196A (en) * | 2013-10-15 | 2014-01-22 | 中国科学院自动化研究所 | Sound selection method for waveform concatenation speech synthesis |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN104575488A (en) * | 2014-12-25 | 2015-04-29 | 北京时代瑞朗科技有限公司 | Text information-based waveform concatenation voice synthesizing method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101399044B (en) * | 2007-09-29 | 2013-09-04 | 纽奥斯通讯有限公司 | Voice conversion method and system |
-
2017
- 2017-09-11 CN CN201710812770.XA patent/CN107507619B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107507619B (en) | Voice conversion method and device, electronic equipment and readable storage medium | |
CN107705802B (en) | Voice conversion method and device, electronic equipment and readable storage medium | |
US10891944B2 (en) | Adaptive and compensatory speech recognition methods and devices | |
Kameoka et al. | ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion | |
CN106683677B (en) | Voice recognition method and device | |
CN111048064B (en) | Voice cloning method and device based on single speaker voice synthesis data set | |
US11810546B2 (en) | Sample generation method and apparatus | |
US11049491B2 (en) | System and method for prosodically modified unit selection databases | |
CN111599339B (en) | Speech splicing synthesis method, system, equipment and medium with high naturalness | |
CN111508466A (en) | Text processing method, device and equipment and computer readable storage medium | |
US10079011B2 (en) | System and method for unit selection text-to-speech using a modified Viterbi approach | |
Marxer et al. | Low-latency instrument separation in polyphonic audio using timbre models | |
CN112686041A (en) | Pinyin marking method and device | |
CN114566156A (en) | Keyword speech recognition method and device | |
CN113314101B (en) | Voice processing method and device, electronic equipment and storage medium | |
Patil et al. | Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states | |
US20080147385A1 (en) | Memory-efficient method for high-quality codebook based voice conversion | |
Xiao et al. | Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN | |
US20240161727A1 (en) | Training method for speech synthesis model and speech synthesis method and related apparatuses | |
CN112786017B (en) | Training method and device of speech speed detection model, and speech speed detection method and device | |
CN112885380B (en) | Method, device, equipment and medium for detecting clear and voiced sounds | |
Yarra et al. | A frame selective dynamic programming approach for noise robust pitch estimation | |
Park et al. | Discriminative weight training for unit-selection based speech synthesis. | |
CN117975931A (en) | Speech synthesis method, electronic device and computer program product | |
Kim et al. | Discriminative training for concatenative speech synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||