CN107507619B - Voice conversion method and device, electronic equipment and readable storage medium - Google Patents

Voice conversion method and device, electronic equipment and readable storage medium

Info

Publication number
CN107507619B
Authority
CN
China
Prior art keywords
target
voice
frame unit
converted
speaker
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710812770.XA
Other languages
Chinese (zh)
Other versions
CN107507619A (en)
Inventor
方博伟
张康
卓鹏鹏
张伟
尤嘉华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Meitu Technology Co Ltd
Original Assignee
Xiamen Meitu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Meitu Technology Co Ltd filed Critical Xiamen Meitu Technology Co Ltd
Priority to CN201710812770.XA priority Critical patent/CN107507619B/en
Publication of CN107507619A publication Critical patent/CN107507619A/en
Application granted granted Critical
Publication of CN107507619B publication Critical patent/CN107507619B/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003Changing voice quality, e.g. pitch or formants
    • G10L21/007Changing voice quality, e.g. pitch or formants characterised by the process used
    • G10L21/013Adapting to target pitch
    • G10L2021/0135Voice conversion or morphing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium. The method comprises: segmenting a voice to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum feature of each frame unit to be converted; calculating a plurality of candidate frame units according to a phoneme dictionary and the Mel cepstrum feature of each frame unit to be converted; matching to obtain target frame units according to the correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker; calculating the conversion cost to obtain an optimal path; and processing the target frame units on the optimal path to obtain the target voice. Because candidate frame units are computed within a phoneme dictionary rather than searched in the entire target feature dictionary as in the prior art, the method saves computing resources and increases computing speed; at the same time, by upgrading the traditional single-frame calculation to multi-frame calculation, it greatly alleviates the problems of discontinuous synthesized speech and poor sound quality.

Description

Voice conversion method and device, electronic equipment and readable storage medium
Technical Field
The invention relates to the technical field of voice information processing, in particular to a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium.
Background
Speech synthesis technology has achieved fruitful results over nearly half a century of development and plays an extremely important role in fields such as artificial intelligence. Among these technologies, TTS (Text-to-Speech) converts text information generated by a computer or input from outside into intelligible, fluent spoken language, but speech synthesized by TTS generally suffers from two problems: first, the timbre is limited to a small number of announcer samples and cannot meet personalized requirements; second, the rhythm is unnatural and the synthesis traces are obvious.
Timbre conversion (also called voice conversion) is a technology that directly converts the timbre of the current speaker into that of a target speaker without changing the speech content; it offers natural rhythm and better preservation of personalized timbre. At present, speech conversion based on speech feature dictionary lookup is the mainstream non-parametric voice conversion technique, and its idea is as follows: first, extract features from an original speech library and a target speech library, build feature dictionaries, and perform parallel training to obtain a mapping rule; second, extract the feature vectors of the speech to be converted and, for each feature vector, search the target feature dictionary for the K nearest target feature vectors according to the mapping rule; third, calculate the target cost and the connection cost and search for an optimal path in the K-nearest-neighbor feature matrix using the Viterbi algorithm; fourth, concatenate the selected target speech feature vectors and convert them into speech. The drawback of this method is that finding the K nearest feature vectors requires traversing the entire target feature dictionary every time, so the calculation is slow and the demand on system performance is high. Moreover, the connection cost is calculated on single frames, ignoring the smoothness between speech frames; this loses transient speech information, makes the synthesized speech discontinuous, and greatly degrades the sound quality.
Disclosure of Invention
In order to overcome the above-mentioned deficiencies in the prior art, the present invention provides a voice conversion method, apparatus, electronic device and readable storage medium that preserve spectral details while guaranteeing the continuity of the synthesized speech.
It is an object of a first aspect of the present invention to provide a method of speech conversion, the method comprising:
segmenting the voice to be converted of the speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted comprises a plurality of continuous voice frames;
extracting the Mel cepstrum characteristic of each frame unit to be converted;
calculating to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted;
matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, which is obtained in advance;
calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
Optionally, the method further comprises pre-processing the speech data;
the step of preprocessing the speech data comprises:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
It is an object of a second aspect of the present invention to provide a speech conversion apparatus, comprising:
the segmentation module is used for segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, wherein each to-be-converted frame unit comprises a plurality of continuous voice frames;
the extraction module is used for extracting the Mel cepstrum characteristics of each frame unit to be converted;
the computing module is used for computing a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted;
the matching module is used for matching to obtain a target frame unit corresponding to the candidate frame unit according to the corresponding relation between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker obtained in advance;
the computing module is also used for computing the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and the processing module is used for processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
Optionally, the apparatus further comprises: a preprocessing module;
the mode of the preprocessing module for preprocessing the voice data comprises the following steps:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
It is an object of a third aspect of the present invention to provide an electronic apparatus, comprising: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method of speech conversion according to the first aspect of the invention.
A fourth aspect of the present invention is directed to a readable storage medium, which includes a computer program, where the computer program controls an electronic device where the readable storage medium is located to execute the voice conversion method according to the first aspect of the present invention when the computer program runs.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium. The method comprises the steps of segmenting the voice to be converted of a speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum characteristic of each frame unit to be converted; calculating to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted; matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, which is obtained in advance; calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker; and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted. The method obtains a plurality of candidate frame units by calculation in the phoneme dictionary of the speaker to be converted, saves calculation resources and improves calculation speed compared with the prior art in which the candidate frame units are searched from the whole technical feature dictionary, and simultaneously improves the technical problems of discontinuous synthesized voice and poor tone quality by improving the traditional single-frame calculation into the multi-frame calculation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be regarded as limiting the scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a flowchart of a voice conversion method according to a first embodiment of the present invention.
Fig. 3 is a flowchart illustrating another step of the voice conversion method according to the first embodiment of the present invention.
Fig. 4 is a flowchart of sub-steps of step S170 in fig. 3.
Fig. 5 is a schematic diagram of a frame unit structure.
Fig. 6 is a schematic diagram of adding frame units to a corresponding plurality of speech phoneme sets at the same time.
Fig. 7 is a schematic diagram of a Viterbi path search provided by an embodiment of the invention.
Fig. 8 is a flowchart of sub-steps of step S160 in fig. 2 or fig. 3.
Fig. 9 is a block diagram of a speech conversion apparatus according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Fig. 1 is a block diagram of an electronic device 100 according to a preferred embodiment of the invention. The electronic device 100 may include a voice conversion apparatus 300, a memory 111, a storage controller 112, and a processor 113.
The memory 111, the memory controller 112 and the processor 113 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines. The voice conversion apparatus 300 may include at least one software functional module which may be stored in the memory 111 in the form of software or firmware (firmware) or solidified in an Operating System (OS) of the electronic device 100. The processor 113 is used for executing executable modules stored in the memory 111, such as software functional modules and computer programs included in the speech conversion apparatus 300.
The memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 is used for storing a program, and the processor 113 executes the program after receiving an execution instruction. Access to the memory 111 by the processor 113 and possibly other components may be under the control of the memory controller 112.
The processor 113 may be an integrated circuit chip having signal processing capabilities. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
First embodiment
Referring to fig. 2, fig. 2 is a flowchart illustrating a voice conversion method according to a preferred embodiment of the invention. The method is applied to the electronic device 100 described above, and the steps of the voice conversion method are described in detail below.
Step S110, segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule.
In this embodiment, the range of speech to undergo voice conversion may be selected by labeling; optionally, the voice to be converted may be selected from the voices of the speaker to be converted by invoking an automatic voice labeling tool.
After the marked voice to be converted is obtained, the voice to be converted is segmented by adopting a preset segmentation rule, so that each segmented frame unit comprises a plurality of continuous voice frames.
And step S120, extracting the Mel cepstrum characteristic of each frame unit to be converted.
In this embodiment, step S120 includes:
and carrying out time-frequency domain change on the frame units to be converted to obtain the frequency spectrum information of each frame unit to be converted.
And extracting the Mel cepstrum characteristics of the frame unit by adopting a Mel filter bank.
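As an illustration of steps S110 and S120, the following is a minimal Python sketch that builds overlapping frame units and extracts per-frame cepstral features. It assumes the librosa library, uses MFCCs as a stand-in for the Mel cepstrum features named in the text, and treats the sample rate, FFT size, hop length and context width p as illustrative values rather than values from the patent.

```python
import numpy as np
import librosa

def extract_unit_features(wav_path, p=2, n_mfcc=24):
    # Load the speech to be converted (16 kHz is an assumed sample rate).
    y, sr = librosa.load(wav_path, sr=16000)
    # Time-to-frequency analysis, Mel filter bank and cepstral transform,
    # yielding one feature vector per frame (shape: n_mfcc x n_frames).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=160)
    frames = mfcc.T                      # n_frames x n_mfcc
    units = []
    # Each frame unit is q = 2p + 1 consecutive frames centered on frame n,
    # so adjacent units overlap by 2p frames, matching the segmentation rule.
    for n in range(p, len(frames) - p):
        unit = frames[n - p:n + p + 1]   # q x n_mfcc
        units.append(unit.reshape(-1))   # flatten into one feature vector
    return np.array(units)
```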
Step S130, a plurality of candidate frame units are obtained through calculation according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted.
The step S130 may include the following sub-steps.
And forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted.
And calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary.
And screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm.
The K-nearest neighbor algorithm is a classification algorithm: a sample is assigned to a class when most of its K most similar samples (i.e., its nearest neighbors in the feature space) belong to that class.
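A minimal sketch of the candidate search in step S130 follows, assuming the phoneme dictionary is held as a mapping from a phoneme label to a matrix of frame-unit feature vectors; the names and the value of K are illustrative, not taken from the patent.

```python
import numpy as np

def candidate_units(unit_vec, phoneme_dict, phoneme_label, K=8):
    # Feature vectors of all frame units filed under this phoneme (M x D).
    entries = phoneme_dict[phoneme_label]
    # Euclidean distance between the unit to convert and each dictionary unit.
    dists = np.linalg.norm(entries - unit_vec, axis=1)
    # The K nearest neighbors are the K candidate frame units for this unit.
    idx = np.argsort(dists)[:K]
    return idx, dists[idx]
```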
Step S140, according to the pre-obtained corresponding relationship between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker, the target frame unit corresponding to the candidate frame unit is obtained through matching.
Referring to fig. 3, in the present embodiment, the method further includes step S170.
Step S170, preprocessing the voice data.
The original voice in the original voice library corresponding to the speaker to be converted and the target voice in the target voice library corresponding to the target timbre speaker are trained in parallel to establish the correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker, as well as the mapping relation of fundamental frequencies between the two speakers. Because the original voice and the target voice are trained in parallel, their contents must correspond one-to-one and be consistent.
Referring to fig. 4, in the present embodiment, the step S170 includes the following sub-steps.
And a substep S171, segmenting the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker by using the preset segmentation rule, so as to obtain a plurality of frame units corresponding to the original speech and a plurality of frame units corresponding to the target speech.
In this embodiment, in order to establish the mapping relationship between the original speech and the target speech, parallel training needs to be performed, that is, the contents of the original speech library and the target speech library are consistent, and the duration is long enough.
Referring to fig. 5, in the present embodiment, in consideration of the smooth connection between frame units and the transient information of speech, the present solution selects an odd number q = 2p + 1 of consecutive frames as a frame unit, where the center frame is the (p+1)-th frame with p frames on each side, and two adjacent frame units overlap by 2p frames. It is understood that the preset segmentation rule employed in sub-step S171 is the same as that employed in step S110.
For the original speech, the frame-unit sequence may be denoted X = [x^(1), x^(2), x^(3), ..., x^(n), ..., x^(N)], and the n-th unit may be represented as x^(n) = [x_{n-p}, x_{n-p+1}, ..., x_n, ..., x_{n+p-1}, x_{n+p}], where x_n denotes the n-th frame in the frame sequence. Similarly, the same unit division can be performed on the target speech.
And a substep S172 of extracting Mel cepstrum characteristics of the original voice and the target voice and constructing an original voice characteristic dictionary and a target voice characteristic dictionary.
In this embodiment, the spectrum information of each frame is obtained after a fast Fourier transform, and Mel cepstrum features are extracted through a Mel filter bank. The original speech feature dictionary and the target speech feature dictionary are constructed from the extracted Mel cepstrum features.
And a sub-step S173 of establishing a corresponding relationship between the frame unit of the original speech and the frame unit of the target speech.
In this embodiment, a DTW (Dynamic Time Warping) algorithm is adopted to establish the correspondence between original speech frame units and target speech frame units. The correspondence between the original speech and the target speech may be expressed as Z = [z_1, z_2, ..., z_l, ..., z_L], where each element z_l = (x^(i), y^(j)) is a pairing of a frame unit of the original speech with a frame unit of the target speech. Establishing this correspondence provides the basis for looking up a frame unit of the target speech from a frame unit of the original speech in the timbre conversion stage.
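Sub-step S173 could be sketched as follows with a plain quadratic-time DTW over the Euclidean distance between unit feature vectors; a production system would likely use an optimized or path-constrained variant, so this is only a minimal illustration.

```python
import numpy as np

def dtw_align(X, Y):
    """X: N x D original units, Y: M x D target units -> list of (i, j) pairs."""
    N, M = len(X), len(Y)
    cost = np.full((N + 1, M + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # step in X only
                                 cost[i, j - 1],      # step in Y only
                                 cost[i - 1, j - 1])  # diagonal match
    # Backtrack to recover the pairing Z = [(i_1, j_1), ..., (i_L, j_L)].
    i, j, path = N, M, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```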
And a substep S174, classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary.
In this embodiment, the phoneme information of each utterance in the original speech is labeled in advance, and each frame unit of the original speech is classified into a phoneme dictionary according to its position in the original speech. Referring to fig. 6, since a frame unit contains a plurality of consecutive frames, one frame unit may span two (or more) speech phoneme sets; to ensure the conversion quality, such a frame unit is added to each of the corresponding phoneme dictionaries simultaneously.
The phoneme dictionary is obtained through this classification, and computing the candidate frame units within a phoneme dictionary saves computing resources and increases computing speed compared with searching the entire target feature dictionary as in the prior art.
The substep S175 is to extract the fundamental frequency features of the original speech and the target speech, and calculate the fundamental frequency mean and the fundamental frequency variance.
And a substep S176, establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
In this embodiment, the excitation of voiced sound is a periodic pulse train whose frequency is the fundamental frequency, so the fundamental frequency is an important feature of speech, and the accuracy of fundamental frequency extraction directly affects the preservation of personalized timbre and the rhythm of the synthesized speech. Statistically, two distributions of the same type (e.g., normal distributions) with different statistics (mean and variance) can be transformed into each other. Therefore, by regarding the fundamental frequency features of the original voice and the target voice as obeying normal distributions and calculating the fundamental frequency mean and variance, the mapping relation of the fundamental frequency between the original voice and the target voice can be established. This mapping makes it possible, in the subsequent voice conversion stage, to obtain the fundamental frequency feature of the target voice from the voice to be converted.
And step S150, calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker.
In this embodiment, the step S150 obtains the optimal path for converting the voice to be converted into the voice of the speaker with the target timbre in the following manner.
And calculating the target cost between the frame unit to be converted and the target frame unit and the transfer cost between the target frame units at adjacent moments.
And searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
Optionally, the Euclidean distance is used to calculate the target cost between a frame unit to be converted and a target frame unit, and the transfer cost between target frame units at adjacent times. The Viterbi algorithm is then equivalent to searching for a minimum-cost path in a weighted directed acyclic graph.
The target cost may be calculated as the Euclidean distance

C_t(X^(t), Y_{k'}^(t)) = sqrt( Σ_{i=1..q} Σ_{d=1..D} [ X^(t)(i,d) − Y_{k'}^(t)(i,d) ]² )

which can be regarded as the self-weight of each node in the weighted directed acyclic graph, i.e., the target cost in this embodiment. It describes the similarity between a frame unit to be converted X^(t) and a candidate target frame unit Y_{k'}^(t): the smaller the weight, the more similar the two. Here X^(t)(i,d) and Y_{k'}^(t)(i,d) denote the d-th dimensional datum of the i-th frame in a unit at time t.

The transfer weight between nodes in the weighted directed acyclic graph is the connection cost

C_c(Y_{k}^(t), Y_{k'}^(t+1)) = sqrt( Σ_{i=1..q} Σ_{d=1..D} [ Y_{k}^(t)(i,d) − Y_{k'}^(t+1)(i,d) ]² )

which describes the similarity between a target frame unit at time t and a target frame unit at time t+1: the smaller the weight, the more similar the two and the smoother the transition. According to the above principle, the optimal path can be searched for in the target frame unit matrix. Referring to fig. 7, each node on the path (the chain of arrowed lines in fig. 7) is the optimal choice at the corresponding time.
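The path search of step S150 can be sketched as a Viterbi pass over a T x K matrix of candidate target frame units, using the Euclidean target and connection costs described above; the array layout and names are assumptions made for illustration.

```python
import numpy as np

def viterbi_path(to_convert, candidates):
    """to_convert: T x D units to convert; candidates: T x K x D target units."""
    T, K, _ = candidates.shape
    # Target cost: self-weight of each node (distance to the unit to convert).
    tgt = np.linalg.norm(candidates - to_convert[:, None, :], axis=2)
    acc = tgt[0].copy()                   # accumulated cost per candidate
    back = np.zeros((T, K), dtype=int)    # backpointers
    for t in range(1, T):
        # Connection cost: transfer weight between consecutive target units.
        con = np.linalg.norm(candidates[t - 1][:, None, :]
                             - candidates[t][None, :, :], axis=2)  # K x K
        total = acc[:, None] + con
        back[t] = np.argmin(total, axis=0)
        acc = total[back[t], np.arange(K)] + tgt[t]
    # Backtrack the minimum-cost path through the candidate matrix.
    path = [int(np.argmin(acc))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]                     # candidate index chosen at each time
```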
Step S160, processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
In this embodiment, referring to fig. 8, the step S160 may include the following sub-steps.
And a substep S161, obtaining the mel cepstrum feature of the target frame unit corresponding to the frame unit to be converted according to the corresponding relationship between the frame unit of the original voice and the frame unit of the target voice.
And a substep S162, performing smooth connection processing on the mel-frequency cepstrum features of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule.
In this embodiment, because adjacent target frame units overlap by 2p frames, instantaneous-window smoothing is needed when connecting them into a feature matrix to ensure auditory continuity. For each target frame unit, each frame is multiplied by a weighting coefficient; in this embodiment the instantaneous window w is expressed by an exponential function:

w = exp(−λ|a|), a = [p, p−1, ..., 0, ..., p−1, p]

where λ is a scalar used to adjust the shape of the instantaneous window w. The larger λ is, the more prominent the center-frame information and the weaker the transient information of adjacent frames; conversely, the smaller λ is, the more the transient information of adjacent frames is taken into account and the weaker the center-frame information. An appropriate λ therefore balances the transient information and the center frame. Before windowing, the elements of the instantaneous window are normalized so that they sum to 1.
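The windowed connection of sub-step S162 could be sketched as an overlap-add with the normalized exponential window; the values of p and λ (lam) are illustrative.

```python
import numpy as np

def smooth_units(units, p=2, lam=0.5):
    """units: length-T list of q x D target frame units along the optimal path."""
    a = np.abs(np.arange(-p, p + 1))     # a = [p, ..., 1, 0, 1, ..., p]
    w = np.exp(-lam * a)                 # instantaneous window
    w /= w.sum()                         # normalize the elements to sum to 1
    q, D = units[0].shape
    T = len(units)
    out = np.zeros((T + 2 * p, D))       # overlap-add buffer
    for t, unit in enumerate(units):
        out[t:t + q] += w[:, None] * unit  # weight each frame, then accumulate
    return out[p:-p]                     # drop the boundary padding
```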
And a substep S163 of obtaining the fundamental frequency characteristic of the target frame unit corresponding to the frame unit to be converted according to the mapping relationship of the fundamental frequency between the speaker to be converted and the target timbre speaker.
The fundamental frequency mean of the voice to be converted is subtracted from the fundamental frequency sequence of the voice to be converted; the difference is multiplied by the quotient of the fundamental frequency variance of the target voice and that of the voice to be converted; and the product is added to the fundamental frequency mean of the target voice to obtain the fundamental frequency sequence of the target voice:

f0(i) = ( f̂0(i) − sf0m ) · ( tf0v / sf0v ) + tf0m

where f0(i) is the fundamental frequency sequence of the target speech, f̂0(i) is the fundamental frequency sequence of the voice to be converted, sf0m and tf0m are respectively the fundamental frequency means of the voice to be converted and of the target voice, and sf0v and tf0v are respectively the fundamental frequency variances of the voice to be converted and of the target voice.
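The mapping transcribes directly into code. As in the text, the ratio is written with variances (a mean-variance transform between normal distributions is often formulated with standard deviations instead), and voiced/unvoiced handling is omitted for brevity.

```python
import numpy as np

def convert_f0(src_f0, sf0m, sf0v, tf0m, tf0v):
    # src_f0: fundamental frequency sequence of the voice to be converted.
    # Shift by the source mean, rescale, then shift by the target mean.
    return (np.asarray(src_f0) - sf0m) * (tf0v / sf0v) + tf0m
```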
And a substep S164 of converting the mel cepstral feature and the fundamental frequency feature of the target frame unit into a frequency spectrum of the target speech.
In this embodiment, the STRAIGHT toolkit is optionally invoked to convert the Mel cepstrum feature and the fundamental frequency feature of the target frame unit into the frequency spectrum of the target speech.
And a substep S165 of performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
In this embodiment, an inverse Fourier transform is used to convert the frequency spectrum of the target voice into the time-domain target voice of the target timbre speaker.
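As an illustration of sub-steps S164 and S165, the sketch below uses the WORLD vocoder (the pyworld package) as a stand-in for the STRAIGHT toolkit named in the text; WORLD is STRAIGHT's successor and exposes a comparable interface. The sample rate, FFT size and the fully voiced aperiodicity are simplifying assumptions.

```python
import numpy as np
import pyworld

def units_to_waveform(mcep, f0, fs=16000, fft_size=512):
    """mcep: T x D mel-cepstral coefficients; f0: length-T fundamental frequency."""
    # Decode the mel-cepstral coefficients back into a spectral envelope.
    sp = pyworld.decode_spectral_envelope(
        np.ascontiguousarray(mcep, dtype=np.float64), fs, fft_size)
    ap = np.zeros_like(sp)               # fully voiced aperiodicity (simplified)
    # Frequency-domain to time-domain conversion yields the target waveform.
    return pyworld.synthesize(
        np.ascontiguousarray(f0, dtype=np.float64), sp, ap, fs)
```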
Second embodiment
Referring to fig. 9, fig. 9 is a block diagram of a voice conversion apparatus 300 according to a preferred embodiment of the invention. The voice conversion apparatus 300 includes: a segmentation module 310, an extraction module 320, a calculation module 330, a matching module 340, and a processing module 350.
The segmentation module 310 is configured to segment the to-be-converted speech of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, where each to-be-converted frame unit includes a plurality of continuous speech frames.
The extracting module 320 is configured to extract mel cepstrum features of each frame unit to be transformed.
In this embodiment, the manner of extracting the mel cepstrum feature of the frame unit to be converted by the extracting module 320 includes:
carrying out time-frequency domain change on the frame unit to be converted to obtain frequency spectrum information of each frame unit;
and extracting the Mel cepstrum characteristics of the frame unit by adopting a Mel filter bank.
The calculating module 330 is configured to calculate to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted.
In this embodiment, the calculating module 330 calculates a plurality of candidate frame units according to the pre-obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted, including:
forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
and screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm.
The matching module 340 is configured to obtain, by matching, the target frame unit corresponding to each candidate frame unit according to the pre-obtained correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker.
The calculating module 330 is further configured to calculate a conversion cost, and obtain an optimal path for converting the voice to be converted into the voice of the target timbre speaker.
In this embodiment, the calculating module 330 calculates the conversion cost, and the method for obtaining the optimal path for converting the voice to be converted into the target timbre speaker voice includes:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
The processing module 350 is configured to process the target frame unit on the optimal path to obtain a target voice of the target timbre speaker corresponding to the voice to be converted.
In this embodiment, the processing module 350 processes the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted includes:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
Referring to fig. 9 again, in the present embodiment, the voice conversion apparatus 300 further includes: a pre-processing module 360.
The way of the preprocessing module 360 preprocessing the voice data includes:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
The invention provides a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium. The method comprises the steps of segmenting the voice to be converted of a speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum feature of each frame unit to be converted; calculating a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum feature of each frame unit to be converted; matching to obtain a target frame unit corresponding to each candidate frame unit according to a pre-obtained correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker; calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker; and processing the target frame units on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted. Because candidate frame units are computed within the phoneme dictionary of the speaker to be converted rather than searched in the entire target feature dictionary as in the prior art, the method saves computing resources and increases computing speed. At the same time, it takes inter-frame smoothness and transient speech information into account: the traditional single-frame calculation is upgraded to calculation on units containing several frames, and windowed smoothing is performed when the units are connected, which greatly alleviates the problems of discontinuous synthesized speech and poor sound quality.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method of speech conversion, the method comprising:
segmenting the voice to be converted of the speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted comprises a plurality of continuous voice frames;
extracting the Mel cepstrum characteristic of each frame unit to be converted;
forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm;
matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, wherein the frame unit of the speaker to be converted is obtained by segmenting original voice in an original voice library corresponding to the speaker to be converted according to the preset segmentation rule;
calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
2. The method of claim 1, wherein the method further comprises: a step of preprocessing speech data, the step comprising:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
3. The method of claim 2, wherein the step of calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target speaker with timbre comprises:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
4. The method according to claim 2, wherein the step of processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted comprises:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
5. A speech conversion apparatus, characterized in that the apparatus comprises:
the segmentation module is used for segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, wherein each to-be-converted frame unit comprises a plurality of continuous voice frames;
the extraction module is used for extracting the Mel cepstrum characteristics of each frame unit to be converted;
the computing module is used for forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm;
the matching module is used for matching and obtaining a target frame unit corresponding to the candidate frame unit according to the corresponding relation between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker, wherein the frame unit of the speaker to be converted is obtained by segmenting the original voice in the original voice library corresponding to the speaker to be converted according to the preset segmentation rule;
the computing module is also used for computing the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and the processing module is used for processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
6. The speech conversion device of claim 5, wherein the device further comprises: a preprocessing module;
the mode of the preprocessing module for preprocessing the voice data comprises the following steps:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
7. The speech conversion device of claim 6, wherein the computing module computes the conversion cost, and the means for obtaining the optimal path for converting the speech to be converted into the target timbre speaker speech comprises:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
8. The speech conversion device according to claim 6, wherein the processing module processes the target frame unit on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted comprises:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the speech conversion method of any of claims 1-4.
10. A readable storage medium, the readable storage medium comprising a computer program, characterized in that: the computer program controls the electronic device in which the readable storage medium is located to execute the speech conversion method according to any one of claims 1 to 4 when running.
CN201710812770.XA 2017-09-11 2017-09-11 Voice conversion method and device, electronic equipment and readable storage medium Active CN107507619B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710812770.XA CN107507619B (en) 2017-09-11 2017-09-11 Voice conversion method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710812770.XA CN107507619B (en) 2017-09-11 2017-09-11 Voice conversion method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN107507619A CN107507619A (en) 2017-12-22
CN107507619B 2021-08-20

Family

ID=60695368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710812770.XA Active CN107507619B (en) 2017-09-11 2017-09-11 Voice conversion method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN107507619B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109817197B (en) * 2019-03-04 2021-05-11 天翼爱音乐文化科技有限公司 Singing voice generation method and device, computer equipment and storage medium
CN111048109A (en) * 2019-12-25 2020-04-21 广州酷狗计算机科技有限公司 Acoustic feature determination method and apparatus, computer device, and storage medium
WO2021134232A1 (en) * 2019-12-30 2021-07-08 深圳市优必选科技股份有限公司 Streaming voice conversion method and apparatus, and computer device and storage medium
CN112562728B (en) * 2020-11-13 2024-06-18 百果园技术(新加坡)有限公司 Method for generating countermeasure network training, method and device for audio style migration
CN112614481A (en) * 2020-12-08 2021-04-06 浙江合众新能源汽车有限公司 Voice tone customization method and system for automobile prompt tone
CN112634920B (en) * 2020-12-18 2024-01-02 平安科技(深圳)有限公司 Training method and device of voice conversion model based on domain separation
CN113345453B (en) * 2021-06-01 2023-06-16 平安科技(深圳)有限公司 Singing voice conversion method, device, equipment and storage medium
CN113782050A (en) * 2021-09-08 2021-12-10 浙江大华技术股份有限公司 Sound tone changing method, electronic device and storage medium
CN114582365B (en) * 2022-05-05 2022-09-06 阿里巴巴(中国)有限公司 Audio processing method and device, storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN102982809A (en) * 2012-12-11 2013-03-20 中国科学技术大学 Conversion method for sound of speaker
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis
CN104123933A (en) * 2014-08-01 2014-10-29 中国科学院自动化研究所 Self-adaptive non-parallel training based voice conversion method
CN104575488A (en) * 2014-12-25 2015-04-29 北京时代瑞朗科技有限公司 Text information-based waveform concatenation voice synthesizing method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101399044B (en) * 2007-09-29 2013-09-04 纽奥斯通讯有限公司 Voice conversion method and system

Also Published As

Publication number Publication date
CN107507619A (en) 2017-12-22

Similar Documents

Publication Publication Date Title
CN107507619B (en) Voice conversion method and device, electronic equipment and readable storage medium
CN107705802B (en) Voice conversion method and device, electronic equipment and readable storage medium
US10891944B2 (en) Adaptive and compensatory speech recognition methods and devices
Kameoka et al. ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion
CN106683677B (en) Voice recognition method and device
CN111048064B (en) Voice cloning method and device based on single speaker voice synthesis data set
US11810546B2 (en) Sample generation method and apparatus
US11049491B2 (en) System and method for prosodically modified unit selection databases
CN111599339B (en) Speech splicing synthesis method, system, equipment and medium with high naturalness
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
US10079011B2 (en) System and method for unit selection text-to-speech using a modified Viterbi approach
Marxer et al. Low-latency instrument separation in polyphonic audio using timbre models
CN112686041A (en) Pinyin marking method and device
CN114566156A (en) Keyword speech recognition method and device
CN113314101B (en) Voice processing method and device, electronic equipment and storage medium
Patil et al. Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states
US20080147385A1 (en) Memory-efficient method for high-quality codebook based voice conversion
Xiao et al. Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN
US20240161727A1 (en) Training method for speech synthesis model and speech synthesis method and related apparatuses
CN112786017B (en) Training method and device of speech speed detection model, and speech speed detection method and device
CN112885380B (en) Method, device, equipment and medium for detecting clear and voiced sounds
Yarra et al. A frame selective dynamic programming approach for noise robust pitch estimation
Park et al. Discriminative weight training for unit-selection based speech synthesis.
CN117975931A (en) Speech synthesis method, electronic device and computer program product
Kim et al. Discriminative training for concatenative speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant