CN107507619B - Voice conversion method and device, electronic equipment and readable storage medium - Google Patents
- Publication number: CN107507619B (application CN201710812770.XA)
- Authority
- CN
- China
- Prior art keywords
- target
- voice
- frame unit
- converted
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Abstract
The invention provides a voice conversion method, a voice conversion device, an electronic device and a readable storage medium. The method comprises: segmenting a voice to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum feature of each frame unit to be converted; calculating a plurality of candidate frame units according to a phoneme dictionary and the Mel cepstrum feature of each frame unit to be converted; matching to obtain a target frame unit according to the correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker; calculating the conversion cost to obtain an optimal path; and processing the target frame units on the optimal path to obtain the target voice. Because the candidate frame units are obtained by searching within the phoneme dictionary rather than the entire feature dictionary as in the prior art, the method saves computing resources and improves computation speed; meanwhile, by improving the traditional single-frame calculation into multi-frame calculation, it greatly alleviates the problems of discontinuous synthesized voice and poor sound quality.
Description
Technical Field
The invention relates to the technical field of voice information processing, in particular to a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium.
Background
Speech synthesis technology has achieved fruitful results over nearly half a century of development and plays an extremely important role in fields such as artificial intelligence. Among these technologies, TTS (Text-to-Speech) converts text information generated by a computer or input from outside into intelligible, fluent spoken language, but speech synthesized by TTS generally has the following two problems: first, the timbre is limited to a small number of announcer samples and cannot meet personalized requirements; second, the rhythm is unnatural and the synthesis traces are obvious.
Timbre conversion (also called voice conversion) is a technology for directly converting the timbre of the current speaker into that of an output speaker without changing the speech content; it has the advantages of natural rhythm and better retention of personalized timbre. At present, speech conversion based on speech feature dictionary lookup is the mainstream non-parametric voice conversion technique, and its idea is as follows: first, extract features from an original voice library and a target voice library, establish feature dictionaries, and perform parallel training to obtain a mapping rule; second, extract the feature vectors of the voice to be converted and, for each feature vector, search the target feature dictionary for the K nearest target feature vectors according to the mapping rule; third, calculate the target cost and the connection cost, and search for an optimal path in the K-nearest-neighbor feature matrix using the Viterbi algorithm; fourth, connect the selected target voice feature vectors and convert them into voice. The drawback of this method is that the whole target feature dictionary must be traversed each time the K nearest feature vectors are searched, so the computation is slow and the demand on system performance is high. Moreover, because the connection cost is calculated in units of a single frame, the smoothness between voice frames is not considered, causing loss of transient speech information; the synthesized voice is therefore discontinuous and the sound quality is greatly affected.
Disclosure of Invention
To overcome the above deficiencies in the prior art, the present invention provides a voice conversion method, apparatus, electronic device and readable storage medium that preserve spectral detail while ensuring the continuity of the synthesized voice.
It is an object of a first aspect of the present invention to provide a method of speech conversion, the method comprising:
segmenting the voice to be converted of the speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted comprises a plurality of continuous voice frames;
extracting the Mel cepstrum characteristic of each frame unit to be converted;
calculating to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted;
matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, which is obtained in advance;
calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
Optionally, the method further comprises pre-processing the speech data;
the step of preprocessing the speech data comprises:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
It is an object of a second aspect of the present invention to provide a speech conversion apparatus, comprising:
the segmentation module is used for segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, wherein each to-be-converted frame unit comprises a plurality of continuous voice frames;
the extraction module is used for extracting the Mel cepstrum characteristics of each frame unit to be converted;
the computing module is used for computing a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted;
the matching module is used for matching to obtain a target frame unit corresponding to the candidate frame unit according to the corresponding relation between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker obtained in advance;
the computing module is also used for computing the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and the processing module is used for processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
Optionally, the apparatus further comprises: a preprocessing module;
the mode of the preprocessing module for preprocessing the voice data comprises the following steps:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
It is an object of a third aspect of the present invention to provide an electronic apparatus, comprising: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the method of speech conversion according to the first aspect of the invention.
It is an object of a fourth aspect of the present invention to provide a readable storage medium comprising a computer program which, when run, controls an electronic device in which the readable storage medium resides to execute the voice conversion method according to the first aspect of the present invention.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a voice conversion method, a voice conversion device, electronic equipment and a readable storage medium. The method comprises the steps of segmenting the voice to be converted of a speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracting the Mel cepstrum characteristic of each frame unit to be converted; calculating to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted; matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, which is obtained in advance; calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker; and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted. The method obtains a plurality of candidate frame units by calculation in the phoneme dictionary of the speaker to be converted, saves calculation resources and improves calculation speed compared with the prior art in which the candidate frame units are searched from the whole technical feature dictionary, and simultaneously improves the technical problems of discontinuous synthesized voice and poor tone quality by improving the traditional single-frame calculation into the multi-frame calculation.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and should therefore not be considered as limiting its scope; for those skilled in the art, other related drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present invention.
Fig. 2 is a flowchart of a voice conversion method according to a first embodiment of the present invention.
Fig. 3 is a flowchart illustrating another step of the voice conversion method according to the first embodiment of the present invention.
Fig. 4 is a flowchart of sub-steps of step S170 in fig. 3.
Fig. 5 is a schematic diagram of a frame unit structure.
Fig. 6 is a schematic diagram of adding frame units to a corresponding plurality of speech phoneme sets at the same time.
FIG. 7 is a schematic diagram of a Viterbi path search provided by an embodiment of the invention.
Fig. 8 is a flowchart of sub-steps of step S160 in fig. 1 or 3.
Fig. 9 is a block diagram of a speech conversion apparatus according to a second embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Fig. 1 is a block diagram of an electronic device 100 according to a preferred embodiment of the invention. The electronic device 100 may include a voice conversion apparatus 300, a memory 111, a storage controller 112, and a processor 113.
The memory 111, the memory controller 112 and the processor 113 are electrically connected to one another, directly or indirectly, to realize data transmission or interaction. For example, these components may be electrically connected to one another via one or more communication buses or signal lines. The voice conversion apparatus 300 may include at least one software functional module stored in the memory 111 in the form of software or firmware, or embedded in the operating system (OS) of the electronic device 100. The processor 113 executes the executable modules stored in the memory 111, such as the software functional modules and computer programs included in the voice conversion apparatus 300.
The memory 111 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like. The memory 111 stores a program, and the processor 113 executes the program after receiving an execution instruction. Access to the memory 111 by the processor 113, and possibly by other components, may be under the control of the memory controller 112.
The processor 113 may be an integrated circuit chip having signal processing capabilities. The processor 113 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. The various methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed by such a processor. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
First embodiment
Referring to fig. 2, fig. 2 is a flowchart illustrating a voice conversion method according to a preferred embodiment of the invention. The method is applied to the electronic device 100 described above, and the steps of the voice conversion method are described in detail below.
Step S110, segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule.
In this embodiment, the range of speech to be converted may be selected by labeling; optionally, an automatic speech labeling tool may be invoked to label the speech of the speaker to be converted and thereby select the speech to be converted.
After the marked voice to be converted is obtained, the voice to be converted is segmented by adopting a preset segmentation rule, so that each segmented frame unit comprises a plurality of continuous voice frames.
And step S120, extracting the Mel cepstrum characteristic of each frame unit to be converted.
In this embodiment, step S120 includes:
and carrying out time-frequency domain change on the frame units to be converted to obtain the frequency spectrum information of each frame unit to be converted.
And extracting the Mel cepstrum characteristics of the frame unit by adopting a Mel filter bank.
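As an illustrative sketch only (not the patent's actual implementation), the extraction described above — a time-to-frequency-domain transformation followed by a mel filter bank — might be written as follows in Python; the FFT size, filter count and cepstral order are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mel_cepstrum(frame, sr=16000, n_filters=24, n_ceps=13):
    # Time -> frequency domain: magnitude spectrum of one frame.
    spec = np.abs(np.fft.rfft(frame))
    n_fft = len(frame)
    fb = mel_filterbank(n_filters, n_fft, sr)
    log_mel = np.log(fb @ spec + 1e-10)
    # A DCT-II of the log mel spectrum yields the mel cepstrum.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2.0 * n_filters)))
    return dct @ log_mel
```

In practice this would be applied to every windowed frame of each frame unit, and the resulting coefficients concatenated into the unit's feature vector.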
Step S130, a plurality of candidate frame units are obtained through calculation according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum characteristics of each frame unit to be converted.
The step S130 may include the following sub-steps.
And forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted.
And calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary.
And screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm.
The K-nearest neighbor algorithm is a classification algorithm: a sample is assigned to a class when the majority of its K most similar samples (i.e., its nearest neighbors in the feature space) belong to that class.
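A minimal sketch of the candidate screening described above — Euclidean distances ranked and the K nearest frame units kept — could look like this; the function name and the array layout of the phoneme dictionary are our assumptions:

```python
import numpy as np

def k_nearest_units(query_vec, phoneme_dict, k=5):
    """Rank the frame units of one phoneme set by Euclidean distance to
    the query feature vector and keep the k nearest as candidate frame
    units. `phoneme_dict` is an (N, D) array of mel-cepstral feature
    vectors; this layout is illustrative, not from the patent."""
    dists = np.linalg.norm(phoneme_dict - query_vec, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]
```

Because the search is restricted to the phoneme set matching the unit's labeled phoneme, N here is far smaller than the full feature dictionary, which is where the claimed speed-up comes from.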
Step S140, according to the pre-obtained corresponding relationship between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker, the target frame unit corresponding to the candidate frame unit is obtained through matching.
Referring to fig. 3, in the present embodiment, the method further includes step S170.
Step S170, preprocessing the voice data.
And performing parallel training on the original voice in the original voice library corresponding to the speaker to be converted and the target voice in the target voice library corresponding to the target timbre speaker to establish a corresponding relation between a frame unit of the speaker to be converted and a frame unit of the target timbre speaker and a mapping relation of fundamental frequencies between the speaker to be converted and the target timbre speaker. In the process, the original voice and the target voice are trained in parallel, so that the contents of the original voice and the target voice are required to correspond one by one and are consistent.
Referring to fig. 4, in the present embodiment, the step S170 includes the following sub-steps.
And a substep S171, segmenting the original speech in the original speech library corresponding to the speaker to be converted and the target speech in the target speech library corresponding to the target timbre speaker by using the preset segmentation rule, so as to obtain a plurality of frame units corresponding to the original speech and a plurality of frame units corresponding to the target speech.
In this embodiment, in order to establish the mapping relationship between the original speech and the target speech, parallel training needs to be performed, that is, the contents of the original speech library and the target speech library are consistent, and the duration is long enough.
Referring to fig. 5, in the present embodiment, in consideration of the smooth connection between frame units and the transient information of speech, the present solution selects an odd number q of consecutive frames (q = 2p + 1) as a frame unit, where the center frame is the (p + 1)-th frame with p frames before and p frames after it, and two adjacent frame units overlap by 2p frames. It is understood that the preset segmentation rule employed in sub-step S171 is the same as the preset segmentation rule employed in step S110.
For the original speech, the frame sequence may be denoted X = [x^(1), x^(2), x^(3), ..., x^(n), ..., x^(N)], and the n-th unit may be represented as x^(n) = [x_{n-p}, x_{n-p+1}, ..., x_n, ..., x_{n+p-1}, x_{n+p}], where x_n denotes the n-th frame in the frame sequence. Similarly, the same unit division operation can be performed on the target speech.
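The unit division just described (units of q = 2p + 1 consecutive frames, adjacent units overlapping by 2p frames) can be sketched as follows; this is an illustration, not the patent's code:

```python
def split_into_units(frames, p=2):
    """Slice a frame sequence into overlapping units of q = 2p + 1
    consecutive frames: unit n is centred on frame n, carries p frames
    of context on each side, and adjacent units overlap by 2p frames
    (i.e., a hop of one frame)."""
    units = [frames[n - p : n + p + 1] for n in range(p, len(frames) - p)]
    return units
```

With p = 2 each unit holds q = 5 frames, and consecutive units share 4 frames, which is what preserves the transient information across unit boundaries.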
And a substep S172 of extracting Mel cepstrum characteristics of the original voice and the target voice and constructing an original voice characteristic dictionary and a target voice characteristic dictionary.
In the embodiment, each frame of spectrum information is obtained after fast fourier transform, and mel cepstrum features are extracted through a mel filter bank. And constructing an original speech feature dictionary and a target speech feature dictionary through the extracted Mel cepstrum features.
And a sub-step S173 of establishing a corresponding relationship between the frame unit of the original speech and the frame unit of the target speech.
In this embodiment, a DTW (Dynamic Time Warping) algorithm is adopted to establish the correspondence between original speech frames and target speech frames. The correspondence between the original speech and the target speech may be expressed as Z = [z_1, z_2, ..., z_l, ..., z_L], where each z_l is a pairing of a frame unit of the original speech with a frame unit of the target speech. Establishing this correspondence provides the basis for looking up frame units of the target voice from frame units of the original voice in the timbre conversion stage.
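A textbook DTW alignment recovers exactly such a pairing sequence Z. The sketch below uses scalar features for brevity; the patent does not specify its DTW variant, so the step set and distance function are assumptions:

```python
def dtw_align(src, tgt, dist=lambda a, b: abs(a - b)):
    """Minimal DTW sketch: returns the warping path pairing source
    frame units with target frame units (illustrative, 1-D features)."""
    n, m = len(src), len(tgt)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack to recover the pairings z_l = (i, j).
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)
    return path[::-1]
```

In the real pipeline `dist` would be the Euclidean distance between mel-cepstral feature vectors rather than a scalar difference.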
And a substep S174, classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary.
In this embodiment, each piece of speech phoneme information in the original speech is labeled in advance, and the frame unit of each original speech is classified into each phoneme dictionary according to the position of the frame unit of each original speech in the original speech. Referring to fig. 6, since a frame unit includes a plurality of continuous frames, it may happen that one frame unit spans two (or more) speech phoneme sets, and in order to ensure the conversion quality, the frame unit is added to at least one phoneme dictionary at the same time.
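The classification into phoneme dictionaries, including the case of a frame unit straddling a phoneme boundary, might be sketched as follows; the function name, the span representation and the overlap criterion are illustrative assumptions:

```python
def build_phoneme_dicts(units, unit_spans, phone_segs):
    """Assign each frame unit to every phoneme whose labeled time span
    it overlaps, so a unit straddling a phoneme boundary is added to
    both (or more) phoneme sets at the same time.
    unit_spans: list of (start_frame, end_frame) per unit;
    phone_segs: list of (phone_label, start_frame, end_frame)."""
    dicts = {}
    for u, (us, ue) in zip(units, unit_spans):
        for phone, ps, pe in phone_segs:
            if us < pe and ue > ps:  # half-open spans overlap
                dicts.setdefault(phone, []).append(u)
    return dicts
```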
The phoneme dictionary is obtained through the classification mode, and the mode of obtaining a plurality of candidate frame units based on the phoneme dictionary calculation can save calculation resources and improve calculation speed compared with the mode of searching from the whole technical feature dictionary in the prior art.
The substep S175 is to extract the fundamental frequency features of the original speech and the target speech, and calculate the fundamental frequency mean and the fundamental frequency variance.
And a substep S176, establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
In this embodiment, the excitation of voiced sound is a periodic pulse train whose frequency is the fundamental frequency, so the fundamental frequency is also an important feature of speech; the accuracy of fundamental-frequency extraction directly affects how well the synthesized speech preserves personalized timbre and rhythm. Statistically, two distributions of the same family (e.g., normal distributions) with different statistics (mean, variance) can be transformed into each other. Therefore, the fundamental-frequency features of the original speech and the target speech are treated as obeying normal distributions, and the fundamental-frequency mean and variance are calculated so that a fundamental-frequency mapping between the original speech and the target speech can be established. This mapping makes it possible to obtain the fundamental-frequency feature of the target speech from the speech to be converted in the subsequent conversion stage.
And step S150, calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker.
In this embodiment, the step S150 obtains the optimal path for converting the voice to be converted into the voice of the speaker with the target timbre in the following manner.
And calculating the target cost between the frame unit to be converted and the target frame unit and the transfer cost between the target frame units at adjacent moments.
And searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
Optionally, the Euclidean distance is used to calculate both the target cost between the frame unit to be converted and a target frame unit, and the transfer cost between target frame units at adjacent times. The Viterbi algorithm is then equivalent to searching for a minimum-cost path in a weighted directed acyclic graph.
The calculation formula of the target cost may be as follows:

ω_t(k') = √( Σ_i Σ_d ( X^(t)(i,d) − Y_{k'}^(t)(i,d) )² )

where ω_t(k') represents the self weight of each node in the weighted directed acyclic graph, which can be understood as the target cost in this embodiment. It describes the distance between the frame unit X^(t) to be converted and the target frame unit Y_{k'}^(t): the smaller the weight, the more similar the two are. X^(t)(i,d) and Y_{k'}^(t)(i,d) denote the d-th dimensional data of the i-th frame in the respective unit at time t.
The transfer weight between nodes in the weighted directed acyclic graph is the connection cost. It describes the distance between the target frame unit Y_k^(t) at time t and the target frame unit Y_{k'}^(t+1) at time t+1: the smaller the weight, the more similar the two units and the smoother the transition. According to the above principle, the optimal path can be searched for in the target frame unit matrix. Referring to fig. 7, each node on the path (formed by the arrowed line in fig. 7) is the optimal choice at the corresponding time.
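A minimal sketch of the Viterbi search over the candidate lattice might look like this, assuming precomputed target-cost and concatenation-cost arrays (the array shapes and function name are illustrative assumptions, not from the patent):

```python
import numpy as np

def viterbi_path(target_cost, concat_cost):
    """Minimum-cost path through a T x K candidate lattice.

    target_cost: (T, K) node weights -- distance between the unit to be
                 converted and each candidate target unit at each time.
    concat_cost: (T-1, K, K) edge weights between candidates at
                 adjacent times.
    Returns the index of the chosen candidate at each time step.
    """
    T, K = target_cost.shape
    acc = np.zeros((T, K))          # accumulated minimum cost
    back = np.zeros((T, K), dtype=int)  # backpointers
    acc[0] = target_cost[0]
    for t in range(1, T):
        # trans[j, k] = best cost of reaching candidate k via candidate j
        trans = acc[t - 1][:, None] + concat_cost[t - 1]
        back[t] = np.argmin(trans, axis=0)
        acc[t] = trans[back[t], np.arange(K)] + target_cost[t]
    # Trace the best path backwards from the cheapest final node.
    path = [int(np.argmin(acc[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With zero concatenation costs the search degenerates to picking the cheapest candidate per time step, which is a useful sanity check.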
Step S160, processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
In this embodiment, referring to fig. 8, the step S160 may include the following sub-steps.
And a substep S161, obtaining the mel cepstrum feature of the target frame unit corresponding to the frame unit to be converted according to the corresponding relationship between the frame unit of the original voice and the frame unit of the target voice.
And a substep S162, performing smooth connection processing on the mel-frequency cepstrum features of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule.
In this embodiment, adjacent target frame units overlap by 2p frames, so transient-window smoothing is needed when they are concatenated into a feature matrix to ensure auditory continuity. The following operation is performed for each target frame unit: each frame in the target frame unit is multiplied by a weighting factor. In this embodiment the transient window w is expressed by an exponential function, as follows,
w=exp(-λ|a|),a=[p,p-1,...,0,...,p-1,p]
where λ is a scalar that adjusts the shape of the transient window w. The larger λ is, the more prominent the center-frame information and the weaker the transient information of the adjacent frames; conversely, the smaller λ is, the more the transient information of adjacent frames is taken into account and the weaker the center-frame information. A suitable λ therefore balances the transient information against the center frame. Before windowing, the elements of the transient window are normalized so that they sum to 1.
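The transient window and its normalization follow directly from the formula above (the function name is an assumption):

```python
import numpy as np

def transient_window(p, lam):
    """Exponential transient window w = exp(-lambda * |a|),
    a = [p, p-1, ..., 0, ..., p-1, p], normalized to sum to 1
    before the weighted overlap-add of adjacent frame units."""
    a = np.abs(np.arange(-p, p + 1))  # |a| over the 2p+1 frames
    w = np.exp(-lam * a)
    return w / w.sum()
```

The window is symmetric, peaks at the center frame, and decays towards the overlapped edge frames; larger `lam` concentrates more weight on the center frame, as the text describes.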
And a substep S163 of obtaining the fundamental frequency characteristic of the target frame unit corresponding to the frame unit to be converted according to the mapping relationship of the fundamental frequency between the speaker to be converted and the target timbre speaker.
The fundamental-frequency mean of the speech to be converted is subtracted from the fundamental-frequency sequence of the speech to be converted; the resulting difference is multiplied by the quotient of the fundamental-frequency variance of the target speech and the fundamental-frequency variance of the speech to be converted; and the product is added to the fundamental-frequency mean of the target speech, giving the fundamental-frequency sequence of the target speech. The calculation formula of the fundamental-frequency sequence of the target speech may be as follows:

f0(i) = (fs0(i) − sf0m) × (tf0v / sf0v) + tf0m

where f0(i) is the target-speech fundamental-frequency sequence, fs0(i) is the fundamental-frequency sequence of the speech to be converted, sf0m and tf0m are the fundamental-frequency means of the speech to be converted and of the target speech, respectively, and sf0v and tf0v are the fundamental-frequency variances of the speech to be converted and of the target speech, respectively.
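Under the reading that the source pitch contour is shifted and scaled by the two speakers' statistics, the conversion can be sketched as below (note the patent states a variance quotient; many implementations instead use a standard-deviation ratio in the log-F0 domain — the function name and the plain-variance form here follow the text):

```python
import numpy as np

def convert_f0(f0_src, sf0m, sf0v, tf0m, tf0v):
    """f0(i) = (fs0(i) - sf0m) * (tf0v / sf0v) + tf0m
    -- map the source pitch contour onto the target speaker's
    mean/variance statistics."""
    f0_src = np.asarray(f0_src, dtype=float)
    return (f0_src - sf0m) * (tf0v / sf0v) + tf0m
```

A source frame at the source mean lands exactly on the target mean, and deviations from the mean are rescaled by the variance quotient.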
And a substep S164 of converting the mel cepstral feature and the fundamental frequency feature of the target frame unit into a frequency spectrum of the target speech.
In this embodiment, the STRAIGHT toolkit is optionally invoked to convert the Mel cepstrum feature and the fundamental-frequency feature of the target frame unit into the frequency spectrum of the target speech.
And a substep S165 of performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
In this embodiment, an inverse Fourier transform is used to convert the frequency spectrum of the target speech into the time-domain target speech of the target timbre speaker.
Second embodiment
Referring to fig. 9, fig. 9 is a block diagram of a voice conversion apparatus 300 according to a preferred embodiment of the invention. The voice conversion apparatus 300 includes: a segmentation module 310, an extraction module 320, a calculation module 330, a matching module 340, and a processing module 350.
The segmentation module 310 is configured to segment the to-be-converted speech of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, where each to-be-converted frame unit includes a plurality of continuous speech frames.
The extracting module 320 is configured to extract the Mel cepstrum feature of each frame unit to be converted.
In this embodiment, the manner of extracting the mel cepstrum feature of the frame unit to be converted by the extracting module 320 includes:
carrying out time-frequency domain transformation on the frame units to be converted to obtain frequency spectrum information of each frame unit;
and extracting the Mel cepstrum characteristics of the frame unit by adopting a Mel filter bank.
The calculating module 330 is configured to calculate to obtain a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted.
In this embodiment, the calculating module 330 calculates a plurality of candidate frame units according to the pre-obtained phoneme dictionary of the speaker to be converted and the mel cepstrum feature of each frame unit to be converted, including:
forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
and screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm.
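The distance ranking and K-nearest-neighbour screening can be sketched as follows (a plain brute-force version; the `(N, D)` entry layout and function name are assumptions for illustration):

```python
import numpy as np

def knn_candidates(query_vec, phoneme_entries, k=5):
    """Rank the frame units of one phoneme sub-dictionary by Euclidean
    distance to the query feature vector and keep the K nearest as
    candidate frame units.

    phoneme_entries: (N, D) matrix of feature vectors for one phoneme.
    Returns the indices of the K nearest entries, nearest first.
    """
    dists = np.linalg.norm(phoneme_entries - query_vec, axis=1)
    order = np.argsort(dists)  # ascending distance = most similar first
    return order[: min(k, len(order))]
```

Because the search is restricted to one phoneme sub-dictionary rather than the whole feature dictionary, N stays small and the brute-force distance computation remains cheap.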
The matching module 340 is configured to match the frame unit of the speaker to be converted with the frame unit of the target timbre speaker according to a correspondence relationship between the frame units of the speaker to be converted and the frame units of the target timbre speaker, so as to obtain a target frame unit corresponding to the candidate frame unit.
The calculating module 330 is further configured to calculate a conversion cost, and obtain an optimal path for converting the voice to be converted into the voice of the target timbre speaker.
In this embodiment, the calculating module 330 calculates the conversion cost, and the method for obtaining the optimal path for converting the voice to be converted into the target timbre speaker voice includes:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
The processing module 350 is configured to process the target frame unit on the optimal path to obtain a target voice of the target timbre speaker corresponding to the voice to be converted.
In this embodiment, the processing module 350 processes the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted includes:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
Referring to fig. 9 again, in the present embodiment, the voice conversion apparatus 300 further includes: a pre-processing module 360.
The way of the preprocessing module 360 preprocessing the voice data includes:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
The invention provides a voice conversion method and device, an electronic device, and a readable storage medium. The method segments the speech to be converted of a speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule; extracts the Mel cepstrum feature of each frame unit to be converted; calculates a plurality of candidate frame units according to a pre-obtained phoneme dictionary of the speaker to be converted and the Mel cepstrum feature of each frame unit to be converted; matches a target frame unit corresponding to each candidate frame unit according to a pre-obtained correspondence between frame units of the speaker to be converted and frame units of the target timbre speaker; calculates the conversion cost to obtain the optimal path for converting the speech to be converted into the speech of the target timbre speaker; and processes the target frame units on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted. Because the candidate frame units are computed within the phoneme dictionary of the speaker to be converted, computing resources are saved and calculation speed is improved compared with the prior-art practice of searching the entire feature dictionary. At the same time, inter-frame smoothing and transient speech information are both taken into account: the traditional single-frame calculation is improved into calculation over units containing multiple frames, and windowed smoothing is performed when the units are concatenated, which greatly alleviates the problems of discontinuous synthesized speech and poor sound quality.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention. It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
Claims (10)
1. A method of speech conversion, the method comprising:
segmenting the voice to be converted of the speaker to be converted into a plurality of frame units to be converted based on a preset segmentation rule, wherein each frame unit to be converted comprises a plurality of continuous voice frames;
extracting the Mel cepstrum characteristic of each frame unit to be converted;
forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm;
matching to obtain a target frame unit corresponding to the candidate frame unit according to a corresponding relation between a frame unit of a speaker to be converted and a frame unit of a target timbre speaker, wherein the frame unit of the speaker to be converted is obtained by segmenting original voice in an original voice library corresponding to the speaker to be converted according to the preset segmentation rule;
calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
2. The method of claim 1, wherein the method further comprises: a step of preprocessing speech data, the step comprising:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
3. The method of claim 2, wherein the step of calculating the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target speaker with timbre comprises:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
4. The method according to claim 2, wherein the step of processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted comprises:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
5. A speech conversion apparatus, characterized in that the apparatus comprises:
the segmentation module is used for segmenting the to-be-converted voice of the speaker to be converted into a plurality of to-be-converted frame units based on a preset segmentation rule, wherein each to-be-converted frame unit comprises a plurality of continuous voice frames;
the extraction module is used for extracting the Mel cepstrum characteristics of each frame unit to be converted;
the computing module is used for forming a feature vector of each frame unit to be converted by the Mel cepstrum feature of each frame unit to be converted;
calculating and sequencing Euclidean distances between the feature vectors of each frame unit to be converted and the feature vectors of each frame unit in the phoneme dictionary;
screening a plurality of candidate frame units corresponding to each frame unit to be converted from the phoneme dictionary by adopting a K nearest neighbor algorithm;
the matching module is used for matching and obtaining a target frame unit corresponding to the candidate frame unit according to the corresponding relation between the frame unit of the speaker to be converted and the frame unit of the target timbre speaker, wherein the frame unit of the speaker to be converted is obtained by segmenting the original voice in the original voice library corresponding to the speaker to be converted according to the preset segmentation rule;
the computing module is also used for computing the conversion cost to obtain the optimal path for converting the voice to be converted into the voice of the target timbre speaker;
and the processing module is used for processing the target frame unit on the optimal path to obtain the target voice of the target timbre speaker corresponding to the voice to be converted.
6. The speech conversion device of claim 5, wherein the device further comprises: a preprocessing module;
the mode of the preprocessing module for preprocessing the voice data comprises the following steps:
segmenting an original voice in an original voice library corresponding to a speaker to be converted and a target voice in a target voice library corresponding to a target timbre speaker by adopting the preset segmentation rule to obtain a plurality of frame units corresponding to the original voice and a plurality of frame units corresponding to the target voice;
extracting Mel cepstrum characteristics of the original voice and the target voice, and constructing an original voice characteristic dictionary and a target voice characteristic dictionary;
establishing a corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
classifying the original speech feature dictionary according to the labeled phoneme information to obtain a phoneme dictionary;
extracting fundamental frequency characteristics of the original voice and the target voice, and calculating a fundamental frequency mean value and a fundamental frequency variance;
and establishing a mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker according to the mean value and the variance of the fundamental frequency.
7. The speech conversion device of claim 6, wherein the computing module computes the conversion cost, and the means for obtaining the optimal path for converting the speech to be converted into the target timbre speaker speech comprises:
calculating target cost between a frame unit to be converted and a target frame unit and transfer cost between target frame units at adjacent moments;
and searching by adopting a Viterbi algorithm according to the target cost and the transfer cost obtained by calculation to obtain an optimal path.
8. The speech conversion device according to claim 6, wherein the processing module processes the target frame unit on the optimal path to obtain the target speech of the target timbre speaker corresponding to the speech to be converted comprises:
obtaining the Mel cepstrum characteristic of a target frame unit corresponding to a frame unit to be converted according to the corresponding relation between the frame unit of the original voice and the frame unit of the target voice;
performing smooth connection processing on the Mel cepstrum characteristics of each target frame unit on the optimal path according to a time sequence and a preset segmentation rule;
obtaining the fundamental frequency characteristics of a target frame unit corresponding to a frame unit to be converted according to the mapping relation of the fundamental frequency between the speaker to be converted and the target timbre speaker;
converting the Mel cepstrum characteristic and the fundamental frequency characteristic of the target frame unit into a frequency spectrum of the target voice;
and performing frequency-time domain conversion on the frequency spectrum of the target voice to obtain the target voice of the target timbre speaker.
9. An electronic device, characterized in that the electronic device comprises: a processor and a memory coupled to the processor, the memory storing instructions that, when executed by the processor, cause the electronic device to perform the speech conversion method of any of claims 1-4.
10. A readable storage medium, the readable storage medium comprising a computer program, characterized in that: the computer program controls the electronic device in which the readable storage medium is located to execute the speech conversion method according to any one of claims 1 to 4 when running.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710812770.XA CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107507619A CN107507619A (en) | 2017-12-22 |
CN107507619B true CN107507619B (en) | 2021-08-20 |
Family
ID=60695368
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710812770.XA Active CN107507619B (en) | 2017-09-11 | 2017-09-11 | Voice conversion method and device, electronic equipment and readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107507619B (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109817197B (en) * | 2019-03-04 | 2021-05-11 | 天翼爱音乐文化科技有限公司 | Singing voice generation method and device, computer equipment and storage medium |
CN111048109A (en) * | 2019-12-25 | 2020-04-21 | 广州酷狗计算机科技有限公司 | Acoustic feature determination method and apparatus, computer device, and storage medium |
WO2021134232A1 (en) * | 2019-12-30 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Streaming voice conversion method and apparatus, and computer device and storage medium |
CN112562728B (en) * | 2020-11-13 | 2024-06-18 | 百果园技术(新加坡)有限公司 | Method for generating countermeasure network training, method and device for audio style migration |
CN112614481A (en) * | 2020-12-08 | 2021-04-06 | 浙江合众新能源汽车有限公司 | Voice tone customization method and system for automobile prompt tone |
CN112634920B (en) * | 2020-12-18 | 2024-01-02 | 平安科技(深圳)有限公司 | Training method and device of voice conversion model based on domain separation |
CN113345453B (en) * | 2021-06-01 | 2023-06-16 | 平安科技(深圳)有限公司 | Singing voice conversion method, device, equipment and storage medium |
CN113782050A (en) * | 2021-09-08 | 2021-12-10 | 浙江大华技术股份有限公司 | Sound tone changing method, electronic device and storage medium |
CN114582365B (en) * | 2022-05-05 | 2022-09-06 | 阿里巴巴(中国)有限公司 | Audio processing method and device, storage medium and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN102982809A (en) * | 2012-12-11 | 2013-03-20 | 中国科学技术大学 | Conversion method for sound of speaker |
CN103531196A (en) * | 2013-10-15 | 2014-01-22 | 中国科学院自动化研究所 | Sound selection method for waveform concatenation speech synthesis |
CN104123933A (en) * | 2014-08-01 | 2014-10-29 | 中国科学院自动化研究所 | Self-adaptive non-parallel training based voice conversion method |
CN104575488A (en) * | 2014-12-25 | 2015-04-29 | 北京时代瑞朗科技有限公司 | Text information-based waveform concatenation voice synthesizing method |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101399044B (en) * | 2007-09-29 | 2013-09-04 | 纽奥斯通讯有限公司 | Voice conversion method and system |
-
2017
- 2017-09-11 CN CN201710812770.XA patent/CN107507619B/en active Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107507619B (en) | Voice conversion method and device, electronic equipment and readable storage medium | |
CN107705802B (en) | Voice conversion method and device, electronic equipment and readable storage medium | |
US10891944B2 (en) | Adaptive and compensatory speech recognition methods and devices | |
Kameoka et al. | ConvS2S-VC: Fully convolutional sequence-to-sequence voice conversion | |
CN106683677B (en) | Voice recognition method and device | |
CN111048064B (en) | Voice cloning method and device based on single speaker voice synthesis data set | |
US11810546B2 (en) | Sample generation method and apparatus | |
US11049491B2 (en) | System and method for prosodically modified unit selection databases | |
CN111599339B (en) | Speech splicing synthesis method, system, equipment and medium with high naturalness | |
CN111508466A (en) | Text processing method, device and equipment and computer readable storage medium | |
US10079011B2 (en) | System and method for unit selection text-to-speech using a modified Viterbi approach | |
Marxer et al. | Low-latency instrument separation in polyphonic audio using timbre models | |
CN112686041A (en) | Pinyin marking method and device | |
CN114566156A (en) | Keyword speech recognition method and device | |
CN113314101B (en) | Voice processing method and device, electronic equipment and storage medium | |
Patil et al. | Hidden-Markov-model based statistical parametric speech synthesis for Marathi with optimal number of hidden states | |
US20080147385A1 (en) | Memory-efficient method for high-quality codebook based voice conversion | |
Xiao et al. | Speech intelligibility enhancement by non-parallel speech style conversion using CWT and iMetricGAN based CycleGAN | |
US20240161727A1 (en) | Training method for speech synthesis model and speech synthesis method and related apparatuses | |
CN112786017B (en) | Training method and device of speech speed detection model, and speech speed detection method and device | |
CN112885380B (en) | Method, device, equipment and medium for detecting clear and voiced sounds | |
Yarra et al. | A frame selective dynamic programming approach for noise robust pitch estimation | |
Park et al. | Discriminative weight training for unit-selection based speech synthesis. | |
CN117975931A (en) | Speech synthesis method, electronic device and computer program product | |
Kim et al. | Discriminative training for concatenative speech synthesis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||