CN111508511A - Real-time sound changing method and device - Google Patents
- Publication number
- CN111508511A CN111508511A CN201910091188.8A CN201910091188A CN111508511A CN 111508511 A CN111508511 A CN 111508511A CN 201910091188 A CN201910091188 A CN 201910091188A CN 111508511 A CN111508511 A CN 111508511A
- Authority
- CN
- China
- Prior art keywords
- specific target
- target speaker
- audio data
- voice recognition
- speaker
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/12—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being prediction coefficients
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/18—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/003—Changing voice quality, e.g. pitch or formants
- G10L21/007—Changing voice quality, e.g. pitch or formants characterised by the process used
- G10L21/013—Adapting to target pitch
- G10L2021/0135—Voice conversion or morphing
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Quality & Reliability (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Circuit For Audible Band Transducer (AREA)
Abstract
The invention discloses a real-time voice changing method and device. The method comprises: receiving source speaker audio data; extracting voice recognition acoustic features from the source speaker audio data, and obtaining hidden layer features of voice recognition by using those acoustic features; inputting the hidden layer features into a pre-constructed tone conversion model corresponding to a specific target speaker, to obtain the speech synthesis acoustic features of the specific target speaker; and generating the audio signal of the specific target speaker by using those speech synthesis acoustic features. The invention realizes real-time voice changing with low response delay and obtains a better voice changing effect.
Description
Technical Field
The invention relates to the field of voice signal processing, in particular to a real-time voice changing method and device.
Background
With the development of speech synthesis technology, making synthesized speech natural, diversified and personalized has become a hot topic of speech technology research, and voice changing is one way to diversify and personalize synthesized speech. Voice changing technology preserves the semantic content of a speech signal while altering the speaker's voice characteristics, so that one person's speech sounds like another person's. From the perspective of speaker conversion, voice changing generally falls into two categories: speech conversion between non-specific persons, such as conversion between male and female voices or between different age groups; and speech conversion between specific persons, such as converting speaker A's voice into speaker B's voice.
A conventional approach to converting an arbitrary speaker's timbre to that of a target speaker is based on speech recognition technology: parallel corpora are aligned using DTW (Dynamic Time Warping) or an attention mechanism, and timbre conversion is then performed. When training the conversion model in this approach, parallel corpora of the source and target speakers, i.e. audio corpora with identical content, must be collected, and the conversion model is trained on the aligned spectral features. During conversion, the spectral features extracted from the source speaker's audio data are converted by the model, the fundamental frequency features are linearly stretched, and the aperiodic components are left unchanged. This approach yields a poor voice changing effect and cannot satisfy application scenarios with real-time requirements.
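To make the alignment step concrete: DTW finds a minimum-cost monotonic alignment between two feature sequences of different lengths. The following is an illustrative sketch (not from the patent), assuming NumPy is available; a real system would align sequences of spectral feature frames rather than toy scalar sequences.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic Time Warping cost between two feature sequences.

    x: (n, d) array, y: (m, d) array of per-frame feature vectors.
    Returns the minimal cumulative frame-to-frame Euclidean cost.
    """
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame of y
                                 cost[i, j - 1],      # skip a frame of x
                                 cost[i - 1, j - 1])  # match both frames
    return cost[n, m]

# A sequence aligns with itself at zero cost even when frames repeat.
a = np.array([[0.0], [1.0], [2.0], [3.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0], [3.0]])
```

Because DTW only warps the time axis, it needs parallel (same-content) corpora to be meaningful — which is exactly the data-collection burden the patent's hidden-feature approach avoids.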
Disclosure of Invention
The embodiment of the invention provides a real-time sound changing method and device, which are used for realizing real-time sound changing with low response delay and obtaining a better sound changing effect.
Therefore, the invention provides the following technical scheme:
a real-time voicing method, the method comprising:
receiving source speaker audio data;
extracting voice recognition acoustic features from the source speaker audio data, and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;
inputting the hidden layer characteristics into a pre-constructed tone conversion model corresponding to a specific target speaker to obtain the voice synthesis acoustic characteristics of the specific target speaker;
and generating the audio signal of the specific target speaker by utilizing the voice synthesis acoustic characteristics of the specific target speaker.
Optionally, the method further includes constructing the tone conversion model corresponding to the specific target speaker in the following manner:
collecting audio data of a specific target speaker;
and carrying out self-adaptive training on a universal sound variation model which is constructed in advance based on the audio data of a plurality of speakers by utilizing the audio data of the specific target speaker to obtain a tone conversion model corresponding to the specific target speaker.
Optionally, the method further comprises: the method for constructing the universal sound variation model based on the audio data of a plurality of speakers specifically comprises the following steps:
collecting audio data of a plurality of speakers as training data;
extracting voice recognition acoustic features and voice synthesis acoustic features from the training data, and obtaining hidden layer features of voice recognition by using the voice recognition acoustic features;
and training to obtain a universal sound variation model by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.
Optionally, the obtaining of the hidden layer feature of the speech recognition by using the speech recognition acoustic feature includes:
and inputting the voice recognition acoustic features into a voice recognition model to obtain hidden layer features.
Optionally, the speech recognition model is a neural network model.
Optionally, the speech recognition acoustic features comprise any one or more of: mel frequency cepstrum coefficients, perceptual linear prediction parameters.
Optionally, the speech synthesis acoustic features comprise any one or more of: voiced/unvoiced features, fundamental frequency features, spectral features, and aperiodic components.
A real-time voice changing device, the device comprising:
the receiving module is used for receiving audio data of a source speaker;
the characteristic acquisition module is used for extracting voice recognition acoustic characteristics from the source speaker audio data and obtaining hidden layer characteristics of voice recognition by utilizing the voice recognition acoustic characteristics;
the characteristic conversion module is used for inputting the hidden layer characteristics into a pre-constructed tone conversion model corresponding to a specific target speaker to obtain the voice synthesis acoustic characteristics of the specific target speaker;
and the voice synthesis module is used for generating the audio signal of the specific target speaker by utilizing the voice synthesis acoustic characteristics of the specific target speaker.
Optionally, the apparatus further comprises: the tone conversion model building module is used for building a tone conversion model corresponding to a specific target speaker;
the tone conversion model construction module comprises:
a target data collection unit for collecting audio data of a specific target speaker;
and the model training unit is used for carrying out self-adaptive training on a universal sound variation model which is constructed in advance based on the audio data of a plurality of speakers by utilizing the audio data of the specific target speaker to obtain a tone conversion model corresponding to the specific target speaker.
Optionally, the apparatus further comprises: the universal model building module is used for building a universal sound variation model based on the audio data of a plurality of speakers;
the general model building module comprises:
the universal data collection unit is used for collecting audio data of a plurality of speakers as training data;
the feature acquisition unit is used for extracting voice recognition acoustic features and voice synthesis acoustic features from the training data and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;
and the universal parameter training unit is used for training to obtain the universal sound variation model by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.
Optionally, the feature obtaining module includes:
the acoustic feature extraction unit is used for extracting voice recognition acoustic features from the source speaker audio data;
and the hidden layer feature extraction unit is used for inputting the voice recognition acoustic features into a voice recognition model to obtain hidden layer features.
Optionally, the speech recognition model is a neural network model.
Optionally, the speech recognition acoustic features comprise any one or more of: mel frequency cepstrum coefficients, perceptual linear prediction parameters.
Optionally, the speech synthesis acoustic features comprise any one or more of: voiced/unvoiced features, fundamental frequency features, spectral features, and aperiodic components.
An electronic device, comprising: one or more processors, memory;
the memory is configured to store computer-executable instructions and the processor is configured to execute the computer-executable instructions to implement the method described above.
A readable storage medium having stored thereon instructions which are executed to implement the foregoing method.
The real-time voice changing method and device provided by the embodiments of the invention pre-construct a tone conversion model corresponding to a specific target speaker. Voice recognition acoustic features are extracted from the received source speaker audio data, hidden layer features of voice recognition are obtained from those acoustic features, and, with the hidden layer features as an intermediary, the tone conversion model converts the source speaker's voice recognition acoustic features into the speech synthesis acoustic features of the specific target speaker, from which the audio signal of the specific target speaker is generated. Because multiple acoustic features are jointly modeled, a better voice changing effect can be obtained; moreover, features can be extracted in a streaming fashion, realizing real-time voice changing with low response delay and meeting the application requirements of real-time voice changing.
Furthermore, in the scheme of the invention, during modeling, the audio data of a plurality of speakers are firstly utilized to carry out the training of the universal sound changing model, and then the small amount of audio data of a specific target speaker is utilized to carry out the self-adaptive training on the basis of the universal sound changing model, so as to obtain the tone color conversion model corresponding to the specific target speaker. Because the adaptive training is carried out on the audio data of the specific target speaker on the basis of the universal sound variation model, the parameters of the tone conversion model obtained by training can be more accurate, and the voice synthesis acoustic characteristics obtained by utilizing the tone conversion model are more in line with the voice characteristics of the specific target speaker, so that the finally synthesized audio signal has better effect. Moreover, when different specific target speakers are aimed at, only a small amount of audio data of the specific target speakers need to be recorded, and parallel corpora corresponding to the source speakers do not need to be recorded, so that the collection work of training corpora is greatly simplified.
Drawings
In order to more clearly illustrate the embodiments of the present application or technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a flow chart of constructing the universal sound variation model in a real-time voice changing method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the topology of the universal sound variation model in a real-time voice changing method according to an embodiment of the present invention;
FIG. 3 is a flow chart of a real-time voice changing method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the model training and testing process in a real-time voice changing method according to an embodiment of the present invention;
FIG. 5 is a structural block diagram of a real-time voice changing device according to an embodiment of the present invention;
FIG. 6 is a block diagram illustrating an apparatus for a real-time voice changing method according to an exemplary embodiment;
FIG. 7 is a schematic structural diagram of a server in an embodiment of the present invention.
Detailed Description
To enable those skilled in the art to better understand the solutions of the embodiments of the invention, the embodiments of the invention are described in further detail below with reference to the drawings and implementations.
The embodiment of the invention provides a real-time voice changing method and a device, which are characterized in that a tone color conversion model corresponding to a specific target speaker is constructed in advance, voice recognition acoustic characteristics are extracted from received source speaker audio data, hidden layer characteristics of voice recognition are obtained by utilizing the voice recognition acoustic characteristics, the hidden layer characteristics are used as an intermediary, the voice recognition acoustic characteristics corresponding to the source speaker are converted into voice synthesis acoustic characteristics corresponding to the specific target speaker by utilizing the tone color conversion model, and then voice synthesis acoustic characteristics are utilized to generate an audio signal of the specific target speaker.
In practical application, the tone conversion model can be obtained by collecting audio data of a large number of specific target speakers and training; or firstly, the audio data of a plurality of speakers are utilized to carry out the training of a universal sound changing model, and then the small amount of audio data of a specific target speaker is utilized to carry out the self-adaptive training on the basis of the universal sound changing model, so as to obtain the tone color conversion model corresponding to the specific target speaker.
As shown in FIG. 1, the construction of the universal sound variation model in the real-time voice changing method according to an embodiment of the present invention includes the following steps:
Step 101: collect audio data of a plurality of speakers as training data.

The universal sound variation model is not specific to a particular target speaker and can therefore be trained on audio data from multiple speakers.
Step 102: extract voice recognition acoustic features and voice synthesis acoustic features from the training data, and obtain hidden layer features of voice recognition by using the voice recognition acoustic features.
The speech recognition acoustic features may include, but are not limited to, any one or more of MFCC (Mel-scale Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction) parameters. MFCCs are cepstral parameters extracted in the mel-scale frequency domain, where the mel scale describes the nonlinear frequency resolution of the human ear. PLP parameters are auditory-model-based feature parameters: a set of coefficients of an all-pole model's prediction polynomial, equivalent to LPC (Linear Prediction Coefficient) features.
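As a rough illustration of how MFCC features are computed — a simplified sketch, not the patent's implementation; the frame length, hop size, filter count and coefficient count below are assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    """Toy MFCC extractor: frame -> Hamming window -> power spectrum
    -> mel filterbank -> log -> DCT-II. Returns (n_frames, n_ceps)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # DCT-II decorrelates the log-mel energies; keep the first n_ceps.
    n = np.arange(n_mels)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                    (2 * n + 1) / (2 * n_mels)))
    return log_mel @ basis.T
```

With a 25 ms frame and 10 ms hop at 16 kHz, one second of audio yields 98 frames of 13 coefficients each; it is this streamable per-frame structure that makes low-latency feature extraction possible.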
The speech synthesis acoustic features may include, but are not limited to, any one or more of voiced/unvoiced features (UV), fundamental frequency features (LF0), spectral features (MCEP), and aperiodic components (AP).
The hidden layer feature refers to an output of a hidden layer of a speech recognition model, and in the embodiment of the present invention, the speech recognition model may adopt a neural network model, and the neural network model may include one or more hidden layers. Accordingly, the speech recognition acoustic features are input into the speech recognition model, and the output of the hidden layer can be obtained. In practical application, one or more hidden layers can be output as the hidden layer features of the speech recognition.
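The idea of tapping a hidden layer instead of the recognizer's output layer can be sketched as follows. This is a toy stand-in with random weights, not a trained ASR model; the layer sizes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyRecognizer:
    """Stand-in for a trained speech recognition acoustic model.
    Only the forward pass matters here: the hidden activations, not
    the output scores, are taken as the speaker-lean features."""
    def __init__(self, n_in=13, n_hidden=64, n_out=40):
        self.w1 = rng.standard_normal((n_in, n_hidden)) * 0.1
        self.b1 = np.zeros(n_hidden)
        self.w2 = rng.standard_normal((n_hidden, n_out)) * 0.1
        self.b2 = np.zeros(n_out)

    def forward(self, feats, return_hidden=False):
        h = np.tanh(feats @ self.w1 + self.b1)   # hidden layer output
        logits = h @ self.w2 + self.b2           # phone-class scores
        return h if return_hidden else logits

model = TinyRecognizer()
mfcc_frames = rng.standard_normal((100, 13))     # 100 frames of MFCCs
hidden = model.forward(mfcc_frames, return_hidden=True)
```

Each input frame yields one hidden vector, so the extraction remains frame-synchronous; with multiple hidden layers, several layers' outputs could be concatenated, as the patent allows.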
Step 103: train the universal sound variation model by utilizing the hidden layer features and the speech synthesis acoustic features.
The universal sound variation model and the tone conversion model corresponding to the specific target speaker may employ neural network models such as CNN-LSTM (a combination of a convolutional neural network and a long short-term memory network).
Fig. 2 is a schematic diagram of a topological structure of a universal acoustic varying model in a real-time acoustic varying method according to an embodiment of the present invention.
The input of the universal sound variation model comprises speech recognition hidden layer feature A and speech recognition hidden layer feature B, and the output is the speech synthesis acoustic features of the target audio. Hidden layer feature A passes through several neural network layers, such as convolutional, pooling and residual layers, to obtain hidden layer 1 and hidden layer 2; hidden layer feature B passes through several DNN layers to obtain hidden layer 3. Hidden layers 1, 2 and 3 are combined as the input of an LSTM model.
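A minimal shape-level sketch of this two-branch topology follows. It is illustrative only: the layer sizes, kernel width and activations are assumptions, and a single 1-D convolution stands in for the conv/pool/residual stack:

```python
import numpy as np

rng = np.random.default_rng(1)

T = 50                                   # number of frames
feat_a = rng.standard_normal((T, 64))    # hidden layer feature A
feat_b = rng.standard_normal((T, 32))    # hidden layer feature B

# Branch A: a 1-D convolution over time (kernel 3, 'same' padding)
# standing in for the conv/pool/residual stack producing hidden layer 1/2.
kernel = rng.standard_normal((3, 64, 48)) * 0.1
padded = np.pad(feat_a, ((1, 1), (0, 0)))
hidden1 = np.stack([np.tanh(sum(padded[t + k] @ kernel[k] for k in range(3)))
                    for t in range(T)])

# Branch B: one dense (DNN) layer producing hidden layer 3.
w_b = rng.standard_normal((32, 24)) * 0.1
hidden3 = np.tanh(feat_b @ w_b)

# Concatenate along the feature axis: this per-frame vector would be
# the LSTM's input at each time step.
lstm_input = np.concatenate([hidden1, hidden3], axis=1)
```

The key property is that every frame of the combined input lines up in time, so a unidirectional LSTM on top can run in streaming fashion.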
On the basis of the general sound changing model, aiming at a specific target speaker, the sound color conversion model corresponding to the specific target speaker can be obtained by collecting a small amount of audio data of the specific target speaker and carrying out self-adaptive training on the general sound changing model by utilizing the audio data of the specific target speaker.
The adaptive training process is similar to that of the universal sound variation model, except for the training data.
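The warm-start idea behind adaptive training can be demonstrated on a toy linear model (illustrative only — the patent's models are neural networks, and all sizes below are assumptions): train generic weights on plentiful multi-speaker data, then continue training from those weights on a small target-speaker set.

```python
import numpy as np

rng = np.random.default_rng(2)

def train(X, Y, w, lr, steps):
    """Plain gradient descent on mean-squared error."""
    for _ in range(steps):
        grad = X.T @ (X @ w - Y) / len(X)
        w = w - lr * grad
    return w

# Stage 1: "universal" model from plentiful multi-speaker data.
X_multi = rng.standard_normal((2000, 8))
W_true_multi = rng.standard_normal((8, 4))
Y_multi = X_multi @ W_true_multi
w_generic = train(X_multi, Y_multi, np.zeros((8, 4)), lr=0.1, steps=300)

# Stage 2: adapt to a target speaker whose mapping differs slightly,
# using only a little data and starting from the generic weights.
X_tgt = rng.standard_normal((50, 8))
W_true_tgt = W_true_multi + 0.1 * rng.standard_normal((8, 4))
Y_tgt = X_tgt @ W_true_tgt
w_adapted = train(X_tgt, Y_tgt, w_generic, lr=0.05, steps=100)

# Baseline: the same tiny budget spent training from scratch.
w_scratch = train(X_tgt, Y_tgt, np.zeros((8, 4)), lr=0.05, steps=100)
err_adapted = np.mean((X_tgt @ w_adapted - Y_tgt) ** 2)
err_scratch = np.mean((X_tgt @ w_scratch - Y_tgt) ** 2)
```

Because the generic weights already sit near the target mapping, the adapted model reaches a much lower error than training from scratch on the same small dataset — the same reason the patent needs only a small amount of target-speaker audio.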
Because the hidden layer features output by the recognition acoustic model contain little of the source speaker's timbre while retaining the semantic information and part of the prosodic information, learning the mapping from the hidden layer features to the target speaker's synthesis acoustic features through the sound variation model realizes timbre conversion from the source speaker to the target speaker.
The real-time voice changing method provided by the embodiment of the invention utilizes the tone conversion model to convert the voice recognition acoustic characteristics of the source speaker into the voice synthesis acoustic characteristics of the specific target speaker, and then generates the audio signal of the specific target speaker according to the voice synthesis acoustic characteristics, thereby realizing the real-time conversion from the audio data of the source speaker to the audio signal of the specific target speaker.
As shown in fig. 3, it is a flowchart of a real-time sound-changing method according to an embodiment of the present invention, and the method includes the following steps:
Step 301: receive source speaker audio data.

The source speaker audio data may be real-time online streaming audio data or offline audio data; the embodiments of the present invention place no limitation on this.
Step 302: extract voice recognition acoustic features from the source speaker audio data, and obtain hidden layer features of voice recognition by using them.

Similar to the model training phase, the speech recognition acoustic features may include, but are not limited to, any one or more of MFCC and PLP parameters.
The hidden layer features may be obtained by inputting the speech recognition acoustic features into a speech recognition model, and specifically, one or more hidden layers in the speech recognition model may be output as the hidden layer features of the speech recognition.
And 303, inputting the hidden layer characteristics into a pre-constructed tone conversion model corresponding to the specific target speaker to obtain the voice synthesis acoustic characteristics of the specific target speaker.
The input of the tone conversion model is the hidden layer features obtained in step 302, and the output is the speech synthesis acoustic features, which may include, but are not limited to, any one or more of voiced/unvoiced features (UV), fundamental frequency features (LF0), spectral features (MCEP), and aperiodic components (AP).
By utilizing the tone conversion model, the voice recognition acoustic characteristics of the source speaker can be converted into voice synthesis acoustic characteristics with the voice characteristics of the specific target speaker.
Step 304: generate the audio signal of the specific target speaker by using the speech synthesis acoustic features. Specifically, a signal-processing-based vocoder such as WORLD or STRAIGHT can be used to synthesize the speech synthesis acoustic features into a speech signal, realizing the conversion from any source speaker's voice to the specific target speaker's voice.
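To illustrate the final synthesis stage in the simplest possible terms — this is not WORLD or STRAIGHT, just a toy excitation generator driven by frame-level F0 and voicing flags, with assumed sample rate and hop size:

```python
import numpy as np

rng = np.random.default_rng(3)

def synthesize(f0, voiced, sr=16000, hop=160):
    """Crude excitation-only resynthesis: a phase-continuous sine
    following the frame-level F0 track for voiced frames, low-level
    noise for unvoiced frames. A real vocoder would also apply the
    spectral envelope (MCEP) and aperiodicity (AP) features."""
    out = np.zeros(len(f0) * hop)
    phase = 0.0
    for i, (hz, v) in enumerate(zip(f0, voiced)):
        seg = slice(i * hop, (i + 1) * hop)
        if v:
            t = np.arange(hop)
            out[seg] = 0.5 * np.sin(phase + 2 * np.pi * hz * t / sr)
            # Carry phase across frame boundaries to avoid clicks.
            phase = (phase + 2 * np.pi * hz * hop / sr) % (2 * np.pi)
        else:
            out[seg] = 0.05 * rng.standard_normal(hop)
    return out

f0 = np.full(100, 220.0)           # 100 frames of a 220 Hz pitch track
voiced = np.ones(100, dtype=bool)
audio = synthesize(f0, voiced)
```

Because each frame of synthesis features produces one hop of audio immediately, this stage is also streamable, which is what keeps the end-to-end response delay low.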
For a better understanding of the solution of the present invention, FIG. 4 shows a schematic diagram of the model training and testing process in the real-time voice changing method according to an embodiment of the present invention.
The real-time voice changing method provided by the embodiment of the invention constructs in advance a tone conversion model corresponding to a specific target speaker, extracts voice recognition acoustic features from the received source speaker audio data, obtains hidden layer features of voice recognition from those acoustic features, and inputs the hidden layer features, as an intermediary, into the tone conversion model, which converts them into the speech synthesis acoustic features of the specific target speaker; the audio signal of the specific target speaker is then generated from those features. Because multiple acoustic features are jointly modeled, a better voice changing effect can be obtained; streaming feature extraction is also possible, realizing real-time voice changing with low response delay and meeting the application requirements of real-time voice changing. The method removes speaker-related information while preserving the content, prosody and other information of the source speaker, i.e., it changes the source speaker's voice into the target speaker's voice in real time.
In addition, in the solution of the invention, during modeling, the audio data of multiple speakers is first used to train a universal sound variation model, and then a small amount of audio data of the specific target speaker is used for adaptive training on the basis of that universal model, to obtain the tone conversion model corresponding to the specific target speaker. Because the adaptive training on the target speaker's audio data starts from the universal sound variation model, the parameters of the resulting tone conversion model can be more accurate, and the speech synthesis acoustic features it produces better match the voice characteristics of the specific target speaker, so the finally synthesized audio signal has a better effect. Moreover, for each different specific target speaker, only a small amount of that speaker's audio data needs to be recorded, and no parallel corpus corresponding to the source speaker is required, which greatly simplifies the collection of training corpora.
Correspondingly, an embodiment of the present invention further provides a real-time sound-changing device, as shown in fig. 5, which is a structural block diagram of the device.
In this embodiment, the apparatus includes the following modules:
a receiving module 501, configured to receive audio data of a source speaker;
a feature obtaining module 502, configured to extract a speech recognition acoustic feature from the source speaker audio data, and obtain a hidden layer feature of speech recognition by using the speech recognition acoustic feature;
the feature conversion module 503 is configured to input the hidden layer features into a pre-constructed tone conversion model corresponding to the specific target speaker, so as to obtain a speech synthesis acoustic feature of the specific target speaker;
a speech synthesis module 504, configured to generate the audio signal of the specific target speaker by using the speech synthesis acoustic features of the specific target speaker.
It should be noted that the real-time sound changing apparatus provided in the embodiment of the present invention may be applied to an application environment of real-time online sound changing, and may also be applied to an application environment of offline sound changing, that is, the audio data received by the receiving module 501 may be streaming audio data input by a source speaker in real time, or may be non-real-time audio data of the source speaker, for example, the audio data is obtained from an audio file of the source speaker.
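The streaming input path mentioned above can be sketched as a small frame buffer that emits fixed-size frames as soon as enough samples arrive, so downstream feature extraction can run with low latency; the frame size and API below are assumptions for illustration, not the patent's design.

```python
import numpy as np

class StreamingFrontEnd:
    """Buffers incoming audio chunks and emits complete fixed-size frames
    immediately, keeping any remainder for the next chunk."""
    def __init__(self, frame_size=160):
        self.frame_size = frame_size
        self.buffer = np.empty(0)

    def push(self, chunk):
        # Append the new chunk, then pop as many full frames as possible.
        self.buffer = np.concatenate([self.buffer, np.asarray(chunk, dtype=float)])
        frames = []
        while len(self.buffer) >= self.frame_size:
            frames.append(self.buffer[: self.frame_size])
            self.buffer = self.buffer[self.frame_size :]
        return frames
```

The same front end also covers the offline case: reading a whole audio file and pushing it once simply yields all frames in one call.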
The feature obtaining module 502 may specifically include: the device comprises an acoustic feature extraction unit and a hidden layer feature extraction unit. Wherein:
the acoustic feature extraction unit is configured to extract speech recognition acoustic features from the source speaker audio data, where the speech recognition acoustic features may include, but are not limited to, any one or more of MFCC (Mel-frequency cepstral coefficients), PLP (perceptual linear prediction), and the like.
The hidden layer feature extraction unit is used for inputting the voice recognition acoustic features into a voice recognition model to obtain hidden layer features.
The speech recognition model may adopt a neural network model, such as LSTM (Long Short-Term Memory network), LC-CLDNN (Latency-Controlled CLDNN), and the like, where CLDNN is a neural network model constructed by simultaneously using convolutional, recurrent, and fully connected structures.
In practical application, one or more hidden layers in the speech recognition model can be output as hidden layer features of the speech recognition.
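Taking one hidden layer's output as the hidden layer feature can be sketched with a toy fully connected network; the layer sizes, the tanh nonlinearity, and the `tap_layer` parameter are illustrative assumptions rather than the patent's architecture.

```python
import numpy as np

def mlp_hidden_features(x, weights, tap_layer):
    """Run a toy fully connected network and return the activations of one
    hidden layer as the 'hidden layer features' of speech recognition."""
    h = x
    taps = []
    for W in weights:
        h = np.tanh(h @ W)
        taps.append(h)            # record every layer's activations
    return taps[tap_layer]        # expose the chosen hidden layer

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 13))            # 5 frames of 13-dim acoustic features
weights = [rng.standard_normal((13, 32)),
           rng.standard_normal((32, 32)),
           rng.standard_normal((32, 40))]
bottleneck = mlp_hidden_features(x, weights, tap_layer=1)
```

Here the second hidden layer (32 units) is tapped, giving one 32-dimensional speaker-independent vector per frame.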
In the embodiment of the present invention, the speech synthesis acoustic features may include, but are not limited to, any one or more of a voiced-unvoiced feature (UV), a fundamental frequency feature (LF0), a spectral feature (MCEP), an aperiodic component (AP), and the like.
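As a hedged illustration of two of these features: a raw per-frame F0 track is commonly split into a binary voiced/unvoiced flag (UV) and a log-F0 contour (LF0). The exact convention below (zeros for unvoiced frames) is a common assumption, not necessarily the patent's recipe.

```python
import numpy as np

def f0_to_uv_lf0(f0):
    """Split a raw per-frame F0 track into UV (1.0 = voiced, 0.0 = unvoiced)
    and LF0 (log of F0 on voiced frames, 0.0 on unvoiced frames)."""
    f0 = np.asarray(f0, dtype=float)
    uv = (f0 > 0).astype(float)
    # Inner where() guards against log(0) on unvoiced frames.
    lf0 = np.where(f0 > 0, np.log(np.where(f0 > 0, f0, 1.0)), 0.0)
    return uv, lf0

uv, lf0 = f0_to_uv_lf0([0.0, 100.0, 200.0, 0.0])
```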
In the real-time voice changing device provided by the embodiment of the invention, a tone conversion model corresponding to a specific target speaker is constructed in advance. Speech recognition acoustic features are extracted from the received source speaker audio data, hidden layer features of speech recognition are obtained from those acoustic features, and with the hidden layer features as an intermediary, the tone conversion model converts the speech recognition acoustic features corresponding to the source speaker into the speech synthesis acoustic features corresponding to the specific target speaker; these are then used to generate the audio signal of the specific target speaker. Because multiple acoustic features are modeled jointly, a better voice changing effect can be obtained; moreover, feature extraction can be performed in a streaming manner, realizing real-time voice changing with low response delay and meeting the application requirements of real-time voice changing.
In practical applications, the tone conversion model may be constructed by a corresponding tone conversion model construction module, and the tone conversion model construction module may be a part of the apparatus of the present invention, or may be independent of the apparatus of the present invention, which is not limited thereto.
Specifically, the tone conversion model building module may obtain the tone conversion model by collecting a large amount of audio data of the specific target speaker for training; alternatively, it may first train a universal sound variation model using the audio data of multiple speakers, and then perform adaptive training with a small amount of audio data of the specific target speaker on the basis of the universal model, to obtain the tone conversion model corresponding to the specific target speaker.
The universal sound variation model can be constructed by a corresponding universal model building module; similarly, the universal model building module can be a part of the device of the invention or independent of it, which is not limited herein.
It should be noted that both the training of the universal sound variation model and the adaptive training based on it are iterative computation processes; the iterative procedure is the same in both cases, only the training data differs. Therefore, in practical applications, the universal model building module and the tone conversion model building module may be combined into one functional module or used as two independent functional modules, which is not limited herein.
In a specific embodiment, the tone conversion model building module may include the following units:
a target data collection unit, configured to collect a large amount of audio data of the specific target speaker as training data;
the feature acquisition unit is used for extracting voice recognition acoustic features and voice synthesis acoustic features from the training data and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;
and the parameter training unit is used for training to obtain a tone conversion model corresponding to the specific target speaker by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.
In another embodiment, the generic model building module may include the following elements:
the universal data collection unit is used for collecting audio data of a plurality of speakers as training data;
the feature acquisition unit is used for extracting voice recognition acoustic features and voice synthesis acoustic features from the training data and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;
and the universal parameter training unit is used for training to obtain the multi-person sound-changing model by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.
Accordingly, the tone conversion model building module may include the following units:
a target data collection unit for collecting audio data of a specific target speaker;
and the model training unit, configured to perform adaptive training, using the audio data of the specific target speaker, on a universal sound variation model constructed in advance from the audio data of multiple speakers, to obtain a tone conversion model corresponding to the specific target speaker. The adaptive training process mainly includes: extracting speech recognition acoustic features and speech synthesis acoustic features from the audio data of the specific target speaker, obtaining hidden layer features of speech recognition from the speech recognition acoustic features, and training, by iterative computation using the hidden layer features and the speech synthesis acoustic features, the tone conversion model corresponding to the specific target speaker.
With the solution of this embodiment, the tone conversion model corresponding to the specific target speaker can be obtained by collecting only a small amount of the target speaker's audio data and performing adaptive training based on the universal sound variation model. The parameters of the trained tone conversion model are therefore more accurate, and the speech synthesis acoustic features it produces better match the voice characteristics of the specific target speaker, so the finally synthesized audio signal has a better effect. Moreover, for each different specific target speaker, only a small amount of that speaker's audio data needs to be recorded, and no parallel corpus corresponding to the source speaker is required, which greatly simplifies the collection of training corpora.
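The two-stage procedure, training a universal model on many speakers and then adapting it with a little target-speaker data using the same iterative updates, can be sketched with a toy linear model trained by gradient descent. The model family, learning rate, and step counts are illustrative assumptions, not the patent's configuration.

```python
import numpy as np

def train_linear(X, y, W_init=None, lr=0.05, steps=300):
    """Gradient-descent training of a linear feature map. The same routine
    serves both stages: universal training starts from zeros; adaptation
    starts from the universal weights (W_init)."""
    W = np.zeros((X.shape[1], y.shape[1])) if W_init is None else W_init.copy()
    for _ in range(steps):
        grad = X.T @ (X @ W - y) / len(X)   # mean-squared-error gradient
        W -= lr * grad
    return W

rng = np.random.default_rng(2)
# Stage 1: a large multi-speaker set trains the universal model.
X_multi = rng.standard_normal((500, 8))
y_multi = X_multi @ rng.standard_normal((8, 3))
W_universal = train_linear(X_multi, y_multi)
# Stage 2: a small target-speaker set adapts it -- same iterative procedure,
# different data -- instead of training from scratch.
X_target = rng.standard_normal((20, 8))
y_target = X_target @ rng.standard_normal((8, 3))
W_target = train_linear(X_target, y_target, W_init=W_universal, steps=50)
```

Starting adaptation from `W_universal` rather than from zeros is what lets the small target corpus suffice, mirroring the argument in the text.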
Fig. 6 is a block diagram illustrating an apparatus 800 for a real-time voicing method according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 6, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing elements 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various classes of data to support operations at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, the orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact, and may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as the memory 804 including instructions, executable by the processor 820 of the device 800 to perform the real-time voice changing method described above, is also provided. For example, the non-transitory computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a non-transitory computer readable storage medium having instructions which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform all or part of the steps of the above-described method embodiments of the present invention.
Fig. 7 is a schematic structural diagram of a server in an embodiment of the present invention. The server 1900, which may vary widely in configuration or performance, may include one or more Central Processing Units (CPUs) 1922 (e.g., one or more processors), memory 1932, and one or more storage media 1930 (e.g., one or more mass storage devices) storing applications 1942 or data 1944. The memory 1932 and the storage medium 1930 may be transient or persistent storage. The program stored in the storage medium 1930 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Further, the central processor 1922 may be configured to communicate with the storage medium 1930 and to execute, on the server 1900, the series of instruction operations in the storage medium 1930.
The server 1900 may also include one or more power supplies 1926, one or more wired or wireless network interfaces 1950, one or more input/output interfaces 1958, one or more keyboards 1956, and/or one or more operating systems 1941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (10)
1. A real-time sound changing method, the method comprising:
receiving source speaker audio data;
extracting voice recognition acoustic features from the source speaker audio data, and obtaining hidden layer features of voice recognition by utilizing the voice recognition acoustic features;
inputting the hidden layer characteristics into a pre-constructed tone conversion model corresponding to a specific target speaker to obtain the voice synthesis acoustic characteristics of the specific target speaker;
and generating the audio signal of the specific target speaker by utilizing the voice synthesis acoustic characteristics of the specific target speaker.
2. The method of claim 1, further comprising constructing the timbre conversion model for the particular target speaker by:
collecting audio data of a specific target speaker;
and carrying out self-adaptive training on a universal sound variation model which is constructed in advance based on the audio data of a plurality of speakers by utilizing the audio data of the specific target speaker to obtain a tone conversion model corresponding to the specific target speaker.
3. The method of claim 2, wherein constructing the universal sound variation model based on the audio data of a plurality of speakers specifically comprises:
collecting audio data of a plurality of speakers as training data;
extracting voice recognition acoustic features and voice synthesis acoustic features from the training data, and obtaining hidden layer features of voice recognition by using the voice recognition acoustic features;
and training to obtain a universal sound variation model by utilizing the hidden layer characteristics and the voice synthesis acoustic characteristics.
4. The method of claim 1, wherein the deriving hidden layer features for speech recognition using the speech recognition acoustic features comprises:
and inputting the voice recognition acoustic features into a voice recognition model to obtain hidden layer features.
5. The method of claim 4, wherein the speech recognition model is a neural network model.
6. The method of claim 1, wherein the speech recognition acoustic features comprise any one or more of: mel frequency cepstrum coefficients, perceptual linear prediction parameters.
7. The method of claim 1, wherein the speech synthesis acoustic features comprise any one or more of: voiced-unvoiced characteristics, fundamental frequency characteristics, spectral characteristics, and aperiodic components.
8. A real-time sound changing device, comprising:
the receiving module is used for receiving audio data of a source speaker;
the characteristic acquisition module is used for extracting voice recognition acoustic characteristics from the source speaker audio data and obtaining hidden layer characteristics of voice recognition by utilizing the voice recognition acoustic characteristics;
the characteristic conversion module is used for inputting the hidden layer characteristics into a pre-constructed tone conversion model corresponding to a specific target speaker to obtain the voice synthesis acoustic characteristics of the specific target speaker;
and the voice synthesis module is used for generating the audio signal of the specific target speaker by utilizing the voice synthesis acoustic characteristics of the specific target speaker.
9. An electronic device, comprising: one or more processors, memory;
the memory is for storing computer-executable instructions, and the processor is for executing the computer-executable instructions to implement the method of any one of claims 1 to 7.
10. A readable storage medium having stored thereon instructions that are executed to implement the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910091188.8A CN111508511A (en) | 2019-01-30 | 2019-01-30 | Real-time sound changing method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111508511A true CN111508511A (en) | 2020-08-07 |
Family
ID=71873967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910091188.8A Pending CN111508511A (en) | 2019-01-30 | 2019-01-30 | Real-time sound changing method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111508511A (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111883149A (en) * | 2020-07-30 | 2020-11-03 | 四川长虹电器股份有限公司 | Voice conversion method and device with emotion and rhythm |
CN111883149B (en) * | 2020-07-30 | 2022-02-01 | 四川长虹电器股份有限公司 | Voice conversion method and device with emotion and rhythm |
CN112164387A (en) * | 2020-09-22 | 2021-01-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method and device, electronic equipment and computer-readable storage medium |
EP3859735A3 (en) * | 2020-09-25 | 2022-01-05 | Beijing Baidu Netcom Science And Technology Co. Ltd. | Voice conversion method, voice conversion apparatus, electronic device, and storage medium |
CN112309365B (en) * | 2020-10-21 | 2024-05-10 | 北京大米科技有限公司 | Training method and device of speech synthesis model, storage medium and electronic equipment |
CN112309365A (en) * | 2020-10-21 | 2021-02-02 | 北京大米科技有限公司 | Training method and device of speech synthesis model, storage medium and electronic equipment |
CN112382297A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112383721A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Method and apparatus for generating video |
CN112382268A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112382273A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112383721B (en) * | 2020-11-13 | 2023-04-07 | 北京有竹居网络技术有限公司 | Method, apparatus, device and medium for generating video |
CN112133278A (en) * | 2020-11-20 | 2020-12-25 | 成都启英泰伦科技有限公司 | Network training and personalized speech synthesis method for personalized speech synthesis model |
CN112802448A (en) * | 2021-01-05 | 2021-05-14 | 杭州一知智能科技有限公司 | Speech synthesis method and system for generating new tone |
CN112802448B (en) * | 2021-01-05 | 2022-10-11 | 杭州一知智能科技有限公司 | Speech synthesis method and system for generating new tone |
CN113362807A (en) * | 2021-04-26 | 2021-09-07 | 北京搜狗智能科技有限公司 | Real-time sound changing method and device and electronic equipment |
CN113724690A (en) * | 2021-09-01 | 2021-11-30 | 宿迁硅基智能科技有限公司 | PPG feature output method, target audio output method and device |
CN113724690B (en) * | 2021-09-01 | 2023-01-03 | 宿迁硅基智能科技有限公司 | PPG feature output method, target audio output method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20220803 Address after: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing Applicant after: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd. Address before: 100084. Room 9, floor 01, cyber building, building 9, building 1, Zhongguancun East Road, Haidian District, Beijing Applicant before: BEIJING SOGOU TECHNOLOGY DEVELOPMENT Co.,Ltd. Applicant before: SOGOU (HANGZHOU) INTELLIGENT TECHNOLOGY Co.,Ltd. |