WO2022057759A1 - Voice conversion method and related device - Google Patents

Voice conversion method and related device

Info

Publication number
WO2022057759A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
style
conversion
input
feature
Prior art date
Application number
PCT/CN2021/117945
Other languages
English (en)
French (fr)
Inventor
范泛
罗敬昊
李硕
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022057759A1
Priority to US18/186,285 (US20230223006A1)


Classifications

    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, characterised by the analysis technique, using neural networks
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 21/003: Changing voice quality, e.g. pitch or formants
    • H04M 1/72403: User interfaces specially adapted for cordless or mobile telephones, with means for local support of applications that increase the functionality
    • H04M 1/72433: User interfaces specially adapted for cordless or mobile telephones, with interactive means for internal management of messages, for voice messaging, e.g. dictaphones
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units

Definitions

  • the present application relates to the technical field of speech processing, and in particular, to a speech conversion method and related equipment.
  • Vocal beautification refers to the modification and beautification of the sound to produce pleasant auditory effects.
  • The voice recording function of many recording or social applications (APPs) on the market has some ability to beautify the human voice, for example by denoising the speech or increasing its brightness or volume. However, this only modifies the speaker's own speech characteristics, so the modes of vocal beautification are too limited.
  • Embodiments of the present application provide a voice conversion method and related equipment, which are used to provide a variety of human voice beautification modes and realize the diversity of human voice beautification.
  • An embodiment of the present application provides a voice conversion method applied to a voice conversion device, which may be a terminal. The method may include: the terminal receives a mode selection operation input by a user, where the mode selection operation is used to select a voice conversion mode; the terminal then selects a target conversion mode from multiple modes according to the mode selection operation, the multiple modes including a style conversion mode, a dialect conversion mode and a speech enhancement mode; the terminal selects the target voice conversion network corresponding to the target conversion mode and performs voice conversion through that network. The terminal obtains a first voice to be converted and extracts feature information of the first voice, where the feature information is used to retain the content information of the first voice. The terminal then inputs the feature information of the first voice into the target voice conversion network corresponding to the target conversion mode, outputs a converted second voice through the target voice conversion network, and finally outputs the second voice.
  • A variety of selectable modes are provided: a style conversion mode, used to convert the speaking style of the first voice; a dialect conversion mode, used to add or remove an accent; and a speech enhancement mode, used to implement voice enhancement. Each of the three modes has a corresponding voice conversion network.
  • The voice conversion network corresponding to the selected mode converts the first voice, so as to diversify vocal beautification and meet the needs of users in different application scenarios.
  • Extracting the feature information of the first voice may specifically include: the terminal inputs the first voice into a voice feature extraction model and extracts the phoneme posterior gram (PPG) feature of the first voice through the voice feature extraction model;
  • the PPG feature is used to preserve the content information of the first voice.
  • The PPG feature describes, for each speech frame, the probability of each phoneme in the phoneme set, which is equivalent to identifying the phoneme.
  • In this example, the speech does not need to be converted into text to retain its content information; the voice is input directly, and the content information of the first voice is retained through the PPG feature, which further increases robustness.
  • The style conversion network includes a style separation model and a speech fusion model.
  • The method may further include: the terminal obtains a third voice for extracting style features, inputs the third voice into the style separation model, and separates the style features of the third voice through the style separation model; then the style features and the feature information of the first voice are input into the speech fusion model for fusion to obtain the second voice.
  • The terminal receives the first voice to be converted and the third voice for extracting style features; the first voice is then input into the voice feature extraction model, and the PPG feature is extracted through the voice feature extraction model. The PPG feature is used to retain the content information of the first voice, which is input directly as speech.
  • The terminal inputs the third voice into the style separation model and separates the style features of the third voice through the style separation model; finally, the style features and the PPG feature are input into the speech fusion model for fusion, obtaining a second voice that fuses the content of the first voice with the style of the third voice.
  • The third voice can be the voice of any person, so the first voice can be converted into the voice style of any person, realizing the diversity of voice style conversion.
  • The style feature includes a first feature, and the first feature includes multiple sub-features.
  • Inputting the third voice into the style separation model and separating the style features of the third voice through the style separation model may specifically include: the terminal inputs the third voice into the style separation model and extracts the vector of the first feature of the third voice through the style separation model (for example, the first feature may be timbre); the terminal also inputs the third voice into the sub-feature extraction models and extracts the vector of each sub-feature through the sub-feature extraction models; the terminal receives the weight of each of the multiple sub-features input by the user; and the terminal determines the style feature of the third voice according to the vector of the first feature, the vector of each sub-feature and the weight of each sub-feature.
  • The similarity between the style of the converted voice and the style of the third voice is adjusted by the weight of each sub-feature input by the user; that is, the similarity between the final output voice style and the third voice is determined by these weights.
  • The user can therefore flexibly adjust the style of the converted voice by adjusting the input weights: the style of the converted voice can be the same as that of the third voice, or it can be varied on the basis of the style of the third voice, thereby diversifying the style of the converted voice.
  • Determining the style feature of the third voice according to the vector of the first feature, the vector of each sub-feature and the weight of each sub-feature may include: the terminal inputs the vector of the first feature into a multi-head attention structure, inputs the product of each sub-feature vector and its corresponding weight into the multi-head attention structure, and outputs the style feature of the third voice through the multi-head attention structure.
  • The multi-head attention structure enables the model to extract feature representations from different subspaces.
  • Each head corresponds to one sub-feature space of the high-dimensional space, which is equivalent to decomposing the high-dimensional space, with each head responsible for one sub-feature space.
  • The multi-head attention mechanism is equivalent to multiple attention mechanisms with the same structure, so that its output contains part of the timbre of the third voice.
  • Acquiring the third voice for extracting style features may include: the terminal receives a template selection operation input by the user, where the template selection operation is used to select a target template.
  • The target template may be, for example, the voice style of a "male announcer" or a "female announcer".
  • The terminal obtains the voice corresponding to the target template as the third voice, and the style features of the third voice are fused into the first voice, so that voice styles can be diversified.
  • Alternatively, acquiring the third voice for extracting style features may specifically include: the terminal receives a third voice input by a second speaker, where the first voice is the voice of a first speaker and the second speaker is any person different from the first speaker, so that voice styles can be diversified.
  • When the target conversion mode is the dialect conversion mode, the target voice conversion network is a dialect conversion network.
  • Inputting the feature information of the first voice into the target voice conversion network corresponding to the target conversion mode and outputting the converted second voice through the target voice conversion network may specifically include: the terminal inputs the feature information of the first voice into the dialect conversion network and outputs the second voice through the dialect conversion network, where the first voice is a voice of a first dialect and the second voice is a voice of a second dialect. Dialect conversion improves the communication convenience of users in different regions and diversifies voice conversion.
  • The dialect conversion network includes multiple dialect conversion models, each dialect conversion model targeting a different dialect to be converted. The method further includes: the terminal receives a selection operation input by the user; the feature information of the first voice is input into the dialect conversion model corresponding to the selection operation, and the second voice is output through that dialect conversion model.
  • The terminal may select the corresponding dialect conversion model according to the specific operation input by the user.
  • The method further includes: the terminal inputs the first voice into a style separation model and separates the style features of the first voice through the style separation model; then the terminal inputs the style features of the first voice together with the feature information of the first voice into the dialect conversion network, and the second voice is output through the dialect conversion network, where the second voice and the first voice have the same style.
  • That is, the content of the second voice is the same as that of the input voice (the first voice), and the speaking style of the input voice is retained.
  • When the first voice is a far-field voice, the target conversion mode is the speech enhancement mode and the target voice conversion network is a speech enhancement model.
  • Inputting the feature information of the first voice into the target voice conversion network corresponding to the target conversion mode and outputting the converted second voice through the target voice conversion network may include: the terminal inputs the feature information of the first voice into the speech enhancement model corresponding to this mode and outputs the second voice through the speech enhancement model, where the second voice is a near-field voice.
  • Converting far-field speech into near-field speech realizes speech enhancement, increases the clarity of the speech, broadens the application scenarios, and diversifies voice conversion.
  • The method further includes: the terminal inputs the first voice into a style separation model and separates the style features of the first voice through the style separation model; then the style features of the first voice and the feature information of the first voice are input into the speech enhancement model, and the second voice is output through the speech enhancement model, where the style of the second voice is the same as that of the first voice.
  • That is, the content of the converted voice is the same as that of the input voice (the first voice), and the speaking style of the input voice is preserved.
  • acquiring the first voice to be converted may include: receiving the first voice input by the first speaker; or, selecting the first voice from a locally stored file.
  • An embodiment of the present application provides a voice conversion apparatus having the function of implementing the terminal in the first aspect. The function can be implemented by hardware or by hardware executing corresponding software; the hardware or software includes one or more modules corresponding to the above functions.
  • An embodiment of the present application provides a terminal including a processor coupled to at least one memory, where the processor is configured to read a computer program stored in the at least one memory, so that the terminal executes the method of any one of the implementations of the first aspect above.
  • An embodiment of the present application provides a computer-readable storage medium for storing a computer program which, when run on a computer, causes the computer to execute the method of the first aspect above.
  • the present application provides a chip system, where the chip system includes a processor for supporting a terminal device to implement the functions involved in the above aspects.
  • the chip system further includes a memory for storing necessary program instructions and data of the terminal device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a flowchart of steps of an embodiment of a voice conversion method in an embodiment of the present application
  • FIG. 2 is a schematic diagram of a scene selection mode in an embodiment of the present application
  • FIG. 4 is a flowchart of steps of another embodiment of a voice conversion method in the embodiment of the present application.
  • FIG. 5 is a schematic diagram of an example of a process of performing style conversion of speech in an embodiment of the present application
  • FIG. 6A is a schematic diagram of a scene of an interface of the style conversion mode in an embodiment of the present application.
  • FIG. 6B is a schematic diagram of another scene of the interface of the style conversion mode in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of style transfer network training and updating in an embodiment of the present application.
  • FIG. 8 is a flowchart of steps of another embodiment of a voice conversion method in the embodiment of the present application.
  • FIG. 9 is a schematic diagram of an example of a dialect conversion process for speech in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another example of a dialect conversion process for speech in an embodiment of the application.
  • FIG. 11A is a schematic diagram of a scene of an interface of the dialect conversion mode and the speech enhancement mode in an embodiment of the present application;
  • FIG. 11B is a schematic diagram of another scene of the interface of the dialect conversion mode and the speech enhancement mode in an embodiment of the present application;
  • FIG. 12 is a schematic diagram of training and updating a dialect conversion model in an embodiment of the application.
  • FIG. 13 is a schematic structural diagram of an embodiment of a voice conversion apparatus in an embodiment of the application.
  • FIG. 14 is a schematic structural diagram of an example of a chip in an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of another embodiment of a voice conversion apparatus according to an embodiment of the present application.
  • the embodiment of the present application provides a voice conversion method, and the method is applied to a terminal.
  • The terminal may be a smart terminal with a voice function, such as a mobile phone, a tablet computer, a notebook computer or a smart watch; the terminal may also be called a terminal device or user equipment (UE).
  • In the following, the terminal is described by taking a mobile phone as an example.
  • the terminal device is equipped with a speech processing neural network, and the speech processing neural network mainly includes a style conversion network, a dialect conversion network and a speech enhancement model.
  • the speech processing neural network realizes the conversion (beautification) of speech.
  • a variety of selectable modes are provided in this application, such as the style conversion mode, which is used to convert the speaking style of the first voice; the dialect conversion mode, which is used to implement accent or de-accent; and the speech enhancement mode, which is used to implement voice enhancement.
  • The three modes have corresponding voice conversion networks. The terminal selects, according to the target conversion mode chosen by the user, the target voice conversion network corresponding to that mode to convert the acquired first voice to be converted, and outputs the converted second voice, thereby diversifying vocal beautification and meeting the needs of users in different application scenarios.
  • Sequence to sequence (seq2seq) is a type of encoder-decoder structure: the input sequence is compressed into a vector of a specified length by the encoding layer, and this vector is then fed into the decoding layer to obtain the output sequence.
  • A seq2seq neural network means that both the encoding layer and the decoding layer are composed of neural networks. The encoding layer (encoder) encodes the input sequence according to certain rules to generate a vector, and the decoding layer (decoder) converts the generated vector into the output sequence.
  • Attention structure: multiple vectors of the specified length are generated in the encoding layer, each vector being a weighted combination of the input features, with weights related to the output of the decoding layer. The purpose is to make the output of the decoding layer pay more attention to the critical parts of the input rather than to the entire input.
  • Neural network vocoder: essentially a neural network used to convert the output features of the speech processing neural network into speech with a high degree of naturalness.
  • Timbre refers to the quality of a sound; timbre can reflect the unique quality of a speaker's voice.
  • In this application, timbre includes, but is not limited to, prosody, accent and speech rate.
  • Phoneme: the smallest phonetic unit of a pronunciation action. For example, hao (good) has three phonemes in total, and wo (me) has two phonemes.
  • Phoneme posterior gram (PPG) feature: for each speech frame, the posterior probabilities that the frame belongs to each unit in a set of pre-defined phonetic units (such as phonemes or triphones); these units preserve the linguistic and phonetic information of the speech.
  • A voice signal is stored as a waveform, which represents the signal in the time domain; however, the frequency distribution of the voice signal cannot be seen from the waveform alone, so features of the speech are extracted.
  • The Mel (Mel-spectrogram) feature can clearly represent the formant characteristics of the speech.
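  • For illustration only, the following minimal Python sketch (assuming the librosa library and a 16 kHz mono recording named input.wav, neither of which is specified by this application) shows how a waveform can be converted into a log-Mel feature of the kind described above.

```python
import librosa

# Load the waveform (time-domain representation of the voice signal).
y, sr = librosa.load("input.wav", sr=16000)

# Compute an 80-band Mel spectrogram; the FFT size, hop length and number of
# Mel bands are illustrative assumptions, not values given in this application.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)   # log-Mel feature that exposes formant structure

print(log_mel.shape)                 # (80, number_of_frames)
```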
  • the present application provides an embodiment of a voice conversion method
  • the execution body of the method is a terminal device.
  • the execution body of the method may also be a processor in the terminal device, or the execution body may also be a chip in the terminal device.
  • In the following description, the execution body of the method is a terminal, and the terminal is described by taking a mobile phone as an example.
  • a speech processing neural network is configured in the terminal, and the speech processing neural network mainly includes a style conversion network, a dialect conversion network and a speech enhancement model.
  • Step 101 The terminal receives a mode selection operation input by a user, and the mode selection operation is used to select a mode of voice conversion.
  • the mode selection operation can be a click operation, and the multiple modes include a style conversion mode, a dialect conversion mode and a speech enhancement mode.
  • each mode has a corresponding voice conversion network, and the voice is converted through the corresponding voice conversion network.
  • the style transfer mode is used to convert speech styles.
  • The dialect conversion mode is used for dialect conversion, such as adding an accent or removing an accent ("de-accenting").
  • the speech enhancement mode is used to convert far-field speech to near-field speech.
  • Step 102 The terminal selects a target conversion mode from a plurality of modes according to the mode selection operation, and the plurality of modes include a style conversion mode, a dialect conversion mode and a speech enhancement mode.
  • a list of multiple modes can be displayed on the display interface of the terminal, and the user can select a target conversion mode among the multiple modes by clicking operations according to requirements, and the target conversion mode can be any one of the above three modes.
  • the terminal selects the target voice conversion network corresponding to the target conversion mode.
  • the style conversion mode corresponds to the style conversion network
  • the dialect conversion mode corresponds to the dialect conversion network
  • the speech enhancement mode corresponds to the speech enhancement model.
  • the terminal determines the target voice conversion network corresponding to the target conversion mode according to the corresponding relationship between the mode and the voice conversion network. For example, when the user selects the style transfer mode, the target speech transfer network is the style transfer network. When the user selects the dialect conversion mode, the target speech conversion network is the dialect conversion network. When the user selects the speech enhancement mode, the target speech conversion network is the speech enhancement model.
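  • As a minimal illustrative sketch of this correspondence (the class names below are placeholders, not models defined by this application), the mapping between modes and voice conversion networks could be expressed as follows.

```python
# Placeholder network classes; the real networks are the style conversion network,
# the dialect conversion network and the speech enhancement model described above.
class StyleConversionNetwork: ...
class DialectConversionNetwork: ...
class SpeechEnhancementModel: ...

TARGET_NETWORKS = {
    "style_conversion": StyleConversionNetwork(),
    "dialect_conversion": DialectConversionNetwork(),
    "speech_enhancement": SpeechEnhancementModel(),
}

def select_target_network(mode: str):
    # Return the voice conversion network corresponding to the user-selected mode.
    return TARGET_NETWORKS[mode]
```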
  • Step 103 The terminal acquires the first voice to be converted.
  • the terminal receives the first voice input by the first speaker.
  • the terminal device receives the first voice input by the first speaker through the microphone.
  • the terminal device receives a first operation input by a user (the user may be the same person as the first speaker, or may be a different person), where the first operation is an operation of recording voice.
  • the first operation may be a click operation.
  • the mobile phone starts to record the voice input by the user, which is the voice (ie, the first voice) that the user wants to be beautified (or processed).
  • the terminal selects the first voice from a locally stored file.
  • the terminal device may acquire the first voice from the locally stored file.
  • The terminal device receives a second operation input by the user, where the second operation is an operation of selecting a voice file. For example, when the user clicks the "select file" button on the screen of the mobile phone, the mobile phone displays a list of voices to be selected according to the second operation; the terminal device then receives a third operation input by the user, where the third operation is used to select a target voice, and the terminal device uses the target voice as the first voice.
  • Step 104 The terminal extracts feature information of the first voice.
  • The feature information may be a Mel-frequency cepstral coefficient (MFCC) feature.
  • Alternatively, the feature information is a phoneme posterior gram (PPG) feature.
  • the feature information of the first voice is described by taking the PPG feature as an example.
  • the PPG feature describes the probability of each phoneme in the phoneme set corresponding to the speech frame, which is equivalent to identifying the phoneme, and the PPG feature is used to retain the content information of the first speech.
  • Step 105 The terminal inputs the feature information of the first voice into the target voice conversion network corresponding to the target conversion mode, and outputs the converted second voice through the target voice conversion network.
  • the target speech transfer network is a style transfer network.
  • The terminal obtains the third voice for extracting style features, inputs the third voice into the style separation model, and separates the style features of the third voice through the style separation model; finally, the style features and the PPG feature of the first voice are input into the speech fusion model for fusion to obtain the second voice.
  • In the dialect conversion mode, the feature information of the first voice is input into the dialect conversion network, and the second voice is output through the dialect conversion network, where the first voice is a voice of a first dialect and the second voice is a voice of a second dialect.
  • the PPG feature of the first voice is input into the voice enhancement model corresponding to the mode, and the second voice is output through the voice enhancement model, and the second voice is a near-field voice.
  • the human voice beautification includes multiple modes, and the multiple modes include a style conversion mode, a dialect conversion mode and a voice enhancement mode, and the first voice can be beautified according to the mode selected by the user. For example, beautify the style of the first voice, perform dialect conversion on the first voice, or perform voice enhancement on the first voice, etc., so as to realize the diversity of human voice beautification.
  • Step 106 The terminal outputs the second voice.
  • the terminal outputs the converted second voice through the speaker.
  • A variety of selectable modes are provided: a style conversion mode, used to convert the speaking style of the first voice; a dialect conversion mode, used to add or remove an accent; and a speech enhancement mode, used to implement voice enhancement. Each of the three modes has a corresponding voice conversion network.
  • the voice conversion network corresponding to the mode can convert the first voice, so as to realize the diversification of human voice beautification and meet the needs of users in different application scenarios.
  • This embodiment of the present application provides another embodiment of voice conversion, in which the style conversion mode is described; that is, the style conversion of the first voice is described by way of example.
  • In this case, the target conversion mode is the style conversion mode and the target voice conversion network is the style conversion network.
  • Step 401 The terminal acquires the first voice to be converted and the third voice for extracting style features.
  • For the method by which the terminal acquires the first voice in this step, refer to the description of step 103 in the embodiment corresponding to FIG. 1, which is not repeated here.
  • the terminal receives the template selection operation input by the user, the terminal selects the voice corresponding to the target template according to the template selection operation, and then uses the voice corresponding to the target template as the third voice.
  • target templates include, but are not limited to, "male announcer”, “female announcer”, “actor voice”, and so on.
  • the target template can also be a category.
  • the terminal device receives the sub-template selection operation, and the terminal device selects the target voice corresponding to the sub-template according to the sub-template selection operation.
  • the list of sub-templates under the "Male Announcer” category includes “Announcer A,” “Announcer B,” and “Announcer C,” and so on.
  • the list of sub-templates under the “Female Announcer” category includes “Announcer D”, “Announcer C", and so on.
  • the list of sub-templates under the "Actor Voices” category includes “Actor D", "Actor F", and so on.
  • the terminal device selects the voice corresponding to "Announcer A" as the third voice according to the sub-template selection operation input by the user.
  • The target templates and sub-templates in this embodiment are merely examples and do not constitute a limitation.
  • the terminal may receive the third voice input by the second speaker.
  • the terminal device receives a template selection operation input by the user, the target template selected by the template selection operation is "anyone's voice", and the terminal may use the voice corresponding to the "anyone's voice” as the third voice.
  • the terminal starts recording a third voice, and the second speaker is an arbitrary person different from the first speaker.
  • the third voice may be a pre-recorded voice stored locally.
  • the third voice may be the voice of a favorite character in a movie (a cartoon character voice, or the voice of an actor).
  • the third voice may also be a voice that is pre-downloaded from the Internet and saved to the terminal device.
  • Step 402 The terminal inputs the first voice into the voice feature extraction model, and extracts the PPG feature of the first voice through the voice feature extraction model, and the PPG feature is used to retain content information of the first voice.
  • the terminal inputs the first voice into the voice feature extraction model, and extracts the PPG feature of the first voice through the voice feature extraction model.
  • the speech feature extraction model may be a deep neural network
  • the speech feature extraction model includes multiple convolutional layers, two LSTM layers and one fully connected layer
  • The speech feature extraction model outputs the PPG feature of the first voice.
  • The PPG feature describes, for each speech frame, the probability of each phoneme in the phoneme set, which is equivalent to identifying the phoneme, and the PPG feature is used to retain the content information of the first voice.
  • the speech feature extraction model is pre-trained according to a large amount of corpus.
  • the PPG feature is used as the content information input of the first speech, and these speech units (phonemes) retain the language and speech information of the speech.
  • Compared with automatic speech recognition (ASR) technology, the use of PPG features can increase robustness.
  • ASR technology needs to convert the speech into text first, which increases the probability of errors in recognizing the speech content.
  • The PPG feature, by contrast, is input as the content information of the first voice; that is, the voice can be input directly without being converted into text, which increases the robustness of the system.
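  • The following PyTorch sketch illustrates one possible shape of such a speech feature extraction model (several convolutional layers, two LSTM layers and one fully connected layer, outputting per-frame phoneme posteriors); all layer sizes and the phoneme-set size are assumptions for illustration, not values given in this application.

```python
import torch
import torch.nn as nn

class PPGExtractor(nn.Module):
    # Convolutional layers -> two LSTM layers -> fully connected layer -> softmax,
    # producing, for each frame, a probability over the phoneme set (the PPG feature).
    def __init__(self, n_mels=80, n_phonemes=218, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_phonemes)

    def forward(self, mel):                     # mel: (batch, n_mels, frames)
        x = self.convs(mel).transpose(1, 2)     # (batch, frames, hidden)
        x, _ = self.lstm(x)
        return torch.softmax(self.fc(x), dim=-1)

ppg = PPGExtractor()(torch.randn(1, 80, 120))   # -> (1, 120, 218) per-frame posteriors
```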
  • Step 403 The terminal inputs the third speech into the style separation model, and uses the style separation model to separate the style features of the third speech.
  • the style feature includes a first feature, and the first feature includes a plurality of sub-features.
  • the first feature is a timbre feature, and the multiple sub-features include prosody, accent, and speech rate.
  • the style separation model is used to separate the style features of the third speech.
  • The style separation model includes a timbre separation model.
  • the timbre separation model is used to separate the timbre features of the third speech, so as to obtain the vector of the first feature (ie, the timbre feature vector).
  • the style transfer network further includes multiple sub-feature extraction models and a multi-head attention structure.
  • the terminal inputs the third speech into the sub-feature extraction model, and extracts the sub-feature vector through the sub-feature extraction model.
  • the multiple sub-feature extraction models include a prosody extraction model, an accent extraction model, and a speech rate extraction model.
  • the prosody extraction model is used to extract prosody features in the third speech to obtain a prosody vector.
  • the accent extraction model is used to extract the accent features in the third speech to obtain the accent vector.
  • the speech rate extraction model is used to extract speech rate features in the third speech to obtain speech rate vectors.
  • the terminal receives the weight of each of the multiple sub-features input by the user.
  • adjustment bars for prosody, speech rate, and accent are displayed on the screen of the mobile phone, and the user can input the weight corresponding to each sub-feature by adjusting the adjustment bar for each sub-feature.
  • the weight corresponding to each sub-feature can be flexibly adjusted according to the user's own needs. For example, setting "prosody" to 10% means that the final output speech is 10% similar to the prosody of the target template, that is, a parameter with a value of 0.1 is passed to the built-in sub-feature extraction model.
  • Alternatively, gears (preset levels) are preconfigured.
  • For example, three gears may be provided, and the weight of each sub-feature in each gear is preconfigured according to empirical values.
  • For example, in one gear the weight of prosody is 0.1, the weight of speech rate is 0.2, and the weight of accent is 0.1.
  • The terminal determines the weight corresponding to each sub-feature by receiving the gear input by the user.
  • the user does not need to individually adjust the weight of each sub-feature, and the user only needs to select the gear, which is convenient for the user to operate.
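  • For illustration only, the sketch below shows how the slider percentages and gear presets could be turned into the sub-feature weights passed to the network; the first gear echoes the example values above, while the other gears and all names are assumptions.

```python
# Preset "gears": each gear maps to a fixed set of sub-feature weights.
GEAR_PRESETS = {
    "gear_1": {"prosody": 0.1, "speech_rate": 0.2, "accent": 0.1},  # values from the example above
    "gear_2": {"prosody": 0.3, "speech_rate": 0.3, "accent": 0.3},  # assumed values
    "gear_3": {"prosody": 0.5, "speech_rate": 0.5, "accent": 0.5},  # assumed values
}

def weights_from_sliders(prosody_pct, speech_rate_pct, accent_pct):
    # A slider value such as "prosody 10%" becomes the weight 0.1 passed to the model.
    return {"prosody": prosody_pct / 100,
            "speech_rate": speech_rate_pct / 100,
            "accent": accent_pct / 100}

print(weights_from_sliders(10, 20, 10))   # {'prosody': 0.1, 'speech_rate': 0.2, 'accent': 0.1}
```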
  • the terminal determines the style feature of the third speech according to the vector of the first feature, the vector of each sub-feature and the weight of each sub-feature.
  • the terminal multiplies the weight corresponding to each sub-feature input by the user with the vector of each sub-feature.
  • the multiplied result and the vector of the first feature are simultaneously input into the multi-head attention structure for attention alignment, so that the multi-head attention structure outputs a style vector, which is the style feature of the third speech.
  • the multi-head attention structure enables the model to extract feature representations from different subspaces.
  • Each head corresponds to a subfeature space in the high-dimensional space, which is equivalent to decomposing the high-dimensional space, and each head is responsible for a subfeature space.
  • the multi-head attention mechanism is equivalent to multiple attention mechanisms with the same structure, so that the output result of the multi-head attention mechanism contains part of the timbre of the third speech.
  • The output of the multi-head attention structure may be a 256-dimensional style embedding vector (the style feature).
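  • A minimal PyTorch sketch of this step is shown below; treating the timbre vector as the query and the weighted sub-feature vectors as keys and values is an assumption made for illustration, since this application only states that both are fed into the multi-head attention structure.

```python
import torch
import torch.nn as nn

dim, heads = 256, 4                                    # 256-dimensional style embedding
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=heads, batch_first=True)

timbre = torch.randn(1, 1, dim)                        # first feature (timbre) vector
subfeatures = torch.randn(1, 3, dim)                   # prosody, accent, speech-rate vectors
weights = torch.tensor([0.1, 0.1, 0.2]).view(1, 3, 1)  # user-selected sub-feature weights
weighted = subfeatures * weights                       # multiply each sub-feature by its weight

style_embedding, _ = attn(query=timbre, key=weighted, value=weighted)
print(style_embedding.shape)                           # (1, 1, 256) style feature
```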
  • the similarity between the style of the voice to be converted and the third voice is adjusted by the weight corresponding to each sub-feature input by the user, and the similarity is determined by the weight input by the user.
  • the user can choose whether to input the weight. If the user chooses to input the weight, the similarity between the final output voice style and the third voice is determined by the input weight. If the user chooses not to input the weight, the final output voice style will be exactly the same as the style (or timbre, the style is taken as an example) of the third voice.
  • the user can flexibly adjust the style of the voice to be converted by adjusting the weight of the input.
  • The style of the converted voice can be the same as the style of the third voice, or it can be changed on the basis of the style of the third voice, so as to diversify the style of the converted voice.
  • For example, the style of the converted voice can be the voice style of a "male announcer", the voice style of a "female announcer", or the voice style of any person, and it can also be varied on the basis of any of these styles, so that voice styles can be diversified.
  • Step 404 The terminal inputs the style feature and the PPG feature into the speech fusion model for fusion to obtain a second speech.
  • the speech fusion model is a seq2seq neural network, and an attention mechanism is introduced into the seq2seq neural network.
  • the seq2seq neural network includes an encoding layer, a decoding layer and an attention structure.
  • The models of the encoding layer and the decoding layer can adopt any combination of neural network models.
  • For example, they may include any one of a convolutional neural network (CNN), a recurrent neural network (RNN) or a long short-term memory (LSTM) network, or a combination of any two of them, etc.
  • the encoding layer may include three convolutional layers and one bidirectional LSTM layer.
  • the PPG feature is first input to the encoding layer, which encodes the input PPG sequence into a fixed-dimensional vector. Since the length of the input sequence may be long, it is difficult for a vector to express rich information during decoding, so an attention mechanism is introduced. Then, the vector of style features and the PPG vector output by the encoding layer are concatenated in width to obtain the attention input matrix. Then, this attention input matrix is fed into the attention structure frame by frame, and the attention structure cooperates with the coding layer to output the Mel (or also called Mel spectrum) features of the second speech.
  • the mel spectral feature is the feature representation of the second speech.
  • The output vectors of the encoding module are weighted differently to obtain several vectors, each of which corresponds to an output, thus ensuring that each output is no longer based on the entire input sequence but focuses on the more critical parts of the input sequence.
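  • The following PyTorch sketch illustrates the encoding side of such a fusion model (three convolutional layers, one bidirectional LSTM, and width-wise concatenation of the style vector with the encoded PPG sequence to form the attention input matrix); channel sizes are illustrative assumptions, and the attention and decoding layers are omitted.

```python
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    # Three convolutional layers followed by one bidirectional LSTM, as described above.
    def __init__(self, ppg_dim=218, hidden=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(ppg_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.blstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)

    def forward(self, ppg, style):               # ppg: (B, T, ppg_dim), style: (B, 256)
        x = self.convs(ppg.transpose(1, 2)).transpose(1, 2)
        x, _ = self.blstm(x)                     # encoded PPG sequence: (B, T, hidden)
        style = style.unsqueeze(1).expand(-1, x.size(1), -1)
        return torch.cat([x, style], dim=-1)     # concatenate in width -> attention input matrix

out = FusionEncoder()(torch.randn(2, 120, 218), torch.randn(2, 256))
print(out.shape)                                 # (2, 120, 512)
```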
  • Step 405 the terminal outputs the second voice.
  • the mel spectrum feature is an intermediate representation. Although the mel spectrum contains the information of the output speech, it cannot be played directly. It needs to go through the inverse process of converting the speech signal into a mel spectrogram.
  • a playable audio file can be generated by a vocoder through the acoustic features of the Mel spectrum.
  • the vocoder may be a neural network vocoder, which is responsible for converting the Mel features into a speech signal of high naturalness.
  • the network consists of multiple convolutional and deconvolutional layers, and the final output is a playable speech.
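  • As a rough sketch only (the layer sizes and the 256x upsampling factor are assumptions, not values given in this application), a vocoder built from convolutional and deconvolutional layers that maps an 80-band Mel feature to a waveform could look as follows.

```python
import torch
import torch.nn as nn

class ConvVocoder(nn.Module):
    # A convolution followed by transposed convolutions (deconvolutions) that upsample
    # the Mel frames to waveform samples (8 * 8 * 4 = 256 samples per frame here).
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=7, padding=3), nn.ReLU(),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=8, padding=4), nn.ReLU(),
            nn.ConvTranspose1d(64, 1, kernel_size=8, stride=4, padding=2), nn.Tanh(),
        )

    def forward(self, mel):                 # mel: (batch, n_mels, frames)
        return self.net(mel).squeeze(1)     # waveform: (batch, frames * 256)

wav = ConvVocoder()(torch.randn(1, 80, 100))
print(wav.shape)                            # (1, 25600)
```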
  • the content in the speech is the same as that of the first speech, and the speaking style is the same as that of the third speech.
  • the terminal can directly play the second voice.
  • the user can choose whether to retain the second voice. If the user chooses to keep the second voice, the terminal stores the second voice.
  • The terminal receives the first voice to be converted and the third voice for extracting style features; the first voice is then input into the voice feature extraction model, and the PPG feature is extracted through the voice feature extraction model. The PPG feature is used to retain the content information of the first voice, which is input directly as speech.
  • The terminal inputs the third voice into the style separation model and separates the style features of the third voice through the style separation model; finally, the style features and the PPG feature are input into the speech fusion model for fusion, obtaining a second voice that fuses the content of the first voice with the style of the third voice.
  • The third voice can be the voice of any person, so the first voice can be converted into the voice style of any person, realizing the diversity of voice style conversion.
  • the speech processing neural network in this application is built into the application.
  • the APP can be an instant messaging APP, or a vocal beautification APP, which can be applied to a recording or video recording scene.
  • the user can beautify the voice to be sent through the APP.
  • That is, the voice can be beautified by an APP built on the present invention.
  • the method for the user to beautify the human voice through the APP may include:
  • the user can record the voice (the first voice) that he wants to be processed.
  • the user can also click "Select File", and then select a voice from a local file as the first voice.
  • the first voice that has been read can be displayed through the audio waveform.
  • the terminal device starts to record the voice of any person, that is, receives the third voice input by the second speaker. If the user clicks "select file”, the terminal device selects the voice of any person that has been saved from the local file, and uses the voice as the third voice.
  • the user selects input style feature weights (also called “sub-feature weights”) under "style similarity adjustment". For example, setting “prosody” to 20% means that the final output speech is 20% similar to the prosody of the target template, and the app implementation passes a parameter value of 0.2 to the built-in neural network model. Similarly, setting “Speech Rate” to 30% means that the final output voice is 30% similar to the target template's speech rate, and setting “Accent” to 40% means that the final output voice is 40% similar to the target template's accent. % similarity.
  • the adjustable style features include, but are not limited to, pitch, sound intensity, sound length, timbre, and the like. Among them, timbre includes but is not limited to prosody, speech rate and accent.
  • After the user clicks the "Start Beautification" button, the inputs of steps 1, 2 and 3 are fed simultaneously into the trained voice processing neural network built into the APP, and after processing, a piece of processed voice (the second voice) is output. The content of this voice is the same as the content of the first voice, its style is similar to that of the target template, and the degree of similarity is determined by the style feature weights provided as input.
  • style transfer network and application scenarios are described above.
  • the training and updating process of the style transfer network is described below.
  • the style transfer network consists of three parts of the neural network, the speech feature extraction part (that is, the speech feature extraction model), the style separation part (that is, including the style separation model, the multi-head attention structure and multiple sub-feature extraction models), and the speech fusion model.
  • Fig. 7 shows the flow of input data; the black dashed lines represent the direction of parameter updates.
  • the style separation model and the speech feature extraction model do not participate in the network update:
  • the style separation model is pre-trained using a large number of speech data from different speakers. That is, the training data set of the style separation model includes a large number of corpora of different speakers (different people's voices have different timbre features), and the style separation model is obtained by training the style separation model through the training set. This style separation model does not participate in the parameter update of the whole network.
  • the speech feature extraction model is pre-trained through a large amount of corpus.
  • the input of the sample data included in the training data set is speech
  • the label is the PPG feature of the input speech.
  • the speech feature extraction model is used to extract PPG features, and the speech feature extraction model does not participate in the parameter update of the entire network.
  • Each sub-feature extraction model and the seq2seq neural network can each be a neural network, and each neural network can include multiple convolutional layers, fully connected layers, LSTM layers and other structural layers, whose weight parameters need to be obtained through training on a large amount of data.
  • the difference between the update of the style transfer network in this application and the traditional method is that since the style separation model and the speech feature extraction model in this application are already pre-trained, they do not participate in the network update, that is, the training of the style separation model does not need to be considered. Therefore, the input and the label do not need to be two voices with the same content (two voices spoken by two different people), which greatly reduces the amount of sample data.
  • The label is the first Mel feature, obtained from the input speech through the short-time Fourier transform (STFT) algorithm. After the speech is fed into the style transfer network, the output of the network is the second Mel feature; the loss value and gradient are obtained by comparing the first Mel feature with the second Mel feature.
  • In other words, the output of the network is compared with the Mel feature of the input obtained by the STFT algorithm to obtain the loss value and its gradient.
  • the black dotted line in Figure 7 is the gradient flow direction.
  • The loss value is used as the indicator for deciding when to stop training: when the loss drops to a certain value and shows no obvious downward trend, the network has converged and training can be stopped.
  • This network learning method belongs to the unsupervised learning method.
  • The style transfer network trained by this unsupervised learning method can support users in manually adjusting the degree of beautification of the recorded speech and in combining the speaking styles of different speakers.
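  • The training step described above can be sketched as follows; style_transfer_net and mel_from_stft are placeholder names (the first maps input speech to a predicted Mel feature, the second computes the reference Mel feature via STFT), and the frozen style separation and speech feature extraction models are assumed to be excluded from the optimizer.

```python
import torch.nn.functional as F

def training_step(style_transfer_net, optimizer, wav, mel_from_stft):
    target_mel = mel_from_stft(wav)          # first Mel feature (label, obtained via STFT)
    pred_mel = style_transfer_net(wav)       # second Mel feature (network output)
    loss = F.l1_loss(pred_mel, target_mel)   # compare the two Mel features
    optimizer.zero_grad()
    loss.backward()                          # gradients flow only to the trainable parts
    optimizer.step()
    return loss.item()                       # monitored to decide when training has converged
```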
  • an embodiment of the present application provides another embodiment of a voice conversion, and the dialect conversion mode and the voice enhancement mode are described in this embodiment.
  • the dialect conversion network includes a plurality of dialect conversion models, wherein the dialect conversion models are used for dialect conversion of the input first speech.
  • the speech enhancement model is used to convert far-field speech into near-field speech for speech enhancement.
  • the multiple dialect conversion models include at least two categories.
  • The first category removes an accent, that is, converts a dialect into Mandarin; for example, Sichuanese is converted into Mandarin.
  • The second category adds an accent, that is, converts Mandarin (Putonghua) into a dialect; for example, Mandarin is converted into Sichuanese.
  • Each dialect conversion model targets a different dialect to be converted. Dialect conversion improves the communication convenience of users in different regions and diversifies voice conversion. It should be noted that only two categories are listed here for exemplary illustration; in an optional solution, two dialect conversion models can also be used in combination to convert between two dialects. For example, to convert Sichuanese to Cantonese, Sichuanese can first be converted to Mandarin, and Mandarin can then be converted to Cantonese, as sketched below.
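  • A minimal sketch of this chained use is given below; the model objects and their convert interface are assumptions for illustration, not an API defined by this application.

```python
def convert_sichuanese_to_cantonese(first_voice, sichuanese_to_mandarin, mandarin_to_cantonese):
    # Step 1: de-accent (Sichuanese -> Mandarin). Step 2: add accent (Mandarin -> Cantonese).
    mandarin_voice = sichuanese_to_mandarin.convert(first_voice)
    return mandarin_to_cantonese.convert(mandarin_voice)
```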
  • Step 801 The terminal receives a mode selection operation input by a user.
  • processing modes for speech are displayed on the user interface of the terminal.
  • the processing modes may include "dialect switching mode" and "speech enhancement mode".
  • The mode selection operation may be a click operation; when the user clicks "dialect switching mode", the dialect switching processing mode is selected.
  • Step 802 The terminal receives a selection operation input by the user.
  • When the mode selection operation is used to select the dialect switching mode, the mode selection operation (a first-level operation) may further include multiple selection operations at the next level (second-level operations), for example a first selection operation and a second selection operation. The first selection operation is used to select "de-accent", while the second selection operation is used to select "add accent".
  • Each second-level operation in turn includes multiple third-level operations.
  • For example, the user interface of the terminal displays third-level classification labels for the accents of different regions, such as "Sichuan accent", "Cantonese accent", etc.
  • the terminal will receive the third-level selection operation input by the user, such as the first operation, the second operation, and so on.
  • the first operation is used for selecting to convert Sichuan dialect into Mandarin
  • the second operation is used for selecting to convert Mandarin into Sichuan dialect, and so on.
  • the terminal can select the corresponding dialect conversion model according to the specific operation input by the user.
  • Step 803 The terminal inputs the feature information of the first voice into the dialect conversion model corresponding to the selection operation, and outputs the second voice through the dialect conversion model corresponding to the selection operation.
  • The terminal inputs the PPG feature into the dialect conversion model (a seq2seq neural network) selected by the mode selection operation, and outputs the Mel feature of the second voice through the dialect conversion model; the Mel feature is then converted into playable speech by a neural network vocoder.
  • the first voice is the voice of the first dialect
  • the second voice is the voice of the second dialect.
  • Optionally, the terminal inputs the first voice into the style separation model and uses the style separation model to separate the style features of the first voice, obtaining a timbre feature vector of the first voice. Then, the PPG feature of the first voice and the timbre feature vector of the first voice are input into the dialect conversion model corresponding to the selection operation, and the second voice is output through that dialect conversion model.
  • the second voice content is the same as the input voice (first voice) and retains the speaking style of the input voice (first voice).
  • the dialects are "local languages", for example, Sichuan dialect, Cantonese dialect and so on.
  • In this application, Putonghua (Mandarin) is also regarded as a kind of dialect: Putonghua is based on the Beijing dialect and can be understood as a dialect based on the northern dialects.
  • For example, the first dialect may be Sichuanese and the second dialect Mandarin; or the first dialect may be Mandarin and the second dialect Sichuanese, Northeastern dialect, and so on.
  • The first dialect and the second dialect are not limited here.
  • the dialect conversion network includes N dialect conversion models, each dialect conversion model is for a different dialect to be converted, and each dialect conversion has a corresponding model.
  • For example, a first dialect conversion model among the multiple dialect conversion models is used to convert Sichuanese to Mandarin, and a second dialect conversion model is used to convert Mandarin to Cantonese; combining the first dialect conversion model and the second dialect conversion model can then convert Sichuanese to Cantonese, and so on.
  • the first dialect and the second dialect here are given only as examples, which do not limit the present application.
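  • A minimal sketch of how a direct model or a chained pair of models could be selected for a requested dialect pair is given below; the registry keys, the pivot-through-Mandarin routing, and the `plan_conversion` helper are hypothetical, not part of this application.

```python
# Minimal sketch of composing two single-direction dialect models to cover a
# dialect pair that has no direct model (e.g. Sichuanese -> Cantonese via Mandarin).
MODEL_REGISTRY = {
    ("sichuanese", "mandarin"): "model_sichuan_to_mandarin",
    ("mandarin", "cantonese"): "model_mandarin_to_cantonese",
}

def plan_conversion(src: str, dst: str, pivot: str = "mandarin"):
    """Return the list of model keys to apply in order."""
    if (src, dst) in MODEL_REGISTRY:
        return [MODEL_REGISTRY[(src, dst)]]
    # No direct model: route through the pivot dialect (here, Mandarin).
    return [MODEL_REGISTRY[(src, pivot)], MODEL_REGISTRY[(pivot, dst)]]

print(plan_conversion("sichuanese", "cantonese"))
# -> ['model_sichuan_to_mandarin', 'model_mandarin_to_cantonese']
```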
  • the terminal inputs the PPG feature of the first voice into the voice enhancement model corresponding to the mode, and outputs the second voice through the voice enhancement model, and the first voice is a far-field voice.
  • the second voice is near-field voice.
  • the far-field voice is converted into near-field voice, thereby realizing voice enhancement, increasing application scenarios, and realizing the diversification of voice conversion.
  • the speech enhancement model is learned from the sample data in the training dataset.
  • the sample data includes input and labels.
  • the input is far-field speech and the label is near-field speech.
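  • Assuming the paired far-field/near-field organization described above, a minimal sketch of how such training samples could be represented is given below; the file paths and the `EnhancementSample` type are placeholders, not data from this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EnhancementSample:
    """One training pair for the speech enhancement model (paths are placeholders)."""
    far_field_path: str    # input: far-field recording
    near_field_path: str   # label: near-field recording of the same utterance

def build_dataset(pairs: List[tuple]) -> List[EnhancementSample]:
    return [EnhancementSample(far, near) for far, near in pairs]

dataset = build_dataset([
    ("data/far/utt_0001.wav", "data/near/utt_0001.wav"),
    ("data/far/utt_0002.wav", "data/near/utt_0002.wav"),
])
print(len(dataset), dataset[0].far_field_path)
```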
  • the first voice is input into a style separation model, and the style features of the first voice are separated by the style separation model. Then, the style feature of the first voice and the feature information of the first voice are input into the voice enhancement model, and the second voice is output through the voice enhancement model; the second voice has the same style as the first voice.
  • the content of the converted voice is the same as that of the input voice (the first voice), and the speaking style of the input voice (the first voice) is preserved.
  • Exemplarily, in an application scenario, please refer to FIGS. 11A and 11B.
  • After the user clicks the "Start Recording" button, the terminal records the voice that the user wants to have processed (the first voice). The user can also click the "Select File" button, and the terminal selects a voice from a local file as the first voice. Optionally, the terminal can display the waveform of the audio that has been read.
  • the user selects the speech processing mode in the interface.
  • the interface displays the first-level classification labels.
  • the first-level classification labels include “dialect switching" and "speech enhancement".
  • when the user selects "Dialect switching", the interface displays the second-level classification labels, for example: "De-accent" and "Add accent".
  • when the user selects "Add accent", the interface will display third-level classification labels for different regional accent options, such as "Cantonese accent", "Sichuan accent", "Fujian accent", and so on.
  • the user can also select the "Voice Enhancement" mode.
  • the terminal will select the corresponding model according to the mode selected by the user.
  • after the user clicks the "Start beautification" button, the first voice is input into the selected model (for example, the dialect conversion model from Mandarin to a Cantonese accent); after the dialect conversion model processes it for a period of time, a processed second voice is output. The content of the second voice is the same as that of the input voice (the first voice), and the speaking style of the input voice (the first voice) is preserved.
  • the first voice is far-field voice
  • the second voice is near-field voice.
  • the display interface also shows three buttons for "Play”, "Save File” and "Rerecord”.
  • the processed voice will be played through the phone's speaker.
  • the terminal saves the processed voice locally.
  • the terminal processes the first voice again, selects the corresponding model according to the user's selection operation on the interface at this time, and returns to step d at the same time.
  • the following describes the training and updating process of the dialect conversion model.
  • each dialect conversion model is a seq2seq neural network.
  • the seq2seq neural network includes an encoding layer, a decoding layer, and an attention structure.
  • Different dialect conversion models have different corresponding training data sets, and the parameters of each layer are different.
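  • The following PyTorch skeleton sketches a seq2seq model with an encoding layer, an attention structure, and a decoding layer, in the spirit of the description above; the layer types, dimensions, and the non-autoregressive simplification are assumptions, not the model defined by this application.

```python
import torch
import torch.nn as nn

class Seq2SeqDialectModel(nn.Module):
    """Skeleton of the encoder / attention / decoder layout; all sizes are assumed."""
    def __init__(self, in_dim=144, hid=256, mel_dim=80, heads=4):
        super().__init__()
        self.encoder = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.attention = nn.MultiheadAttention(2 * hid, heads, batch_first=True)
        self.decoder = nn.LSTM(2 * hid, hid, batch_first=True)
        self.mel_out = nn.Linear(hid, mel_dim)

    def forward(self, feats):                      # feats: (batch, frames, in_dim)
        enc, _ = self.encoder(feats)               # (batch, frames, 2*hid)
        # Each decoder step attends over the encoder outputs; a full autoregressive
        # loop is omitted here and the whole sequence is attended at once.
        ctx, _ = self.attention(enc, enc, enc)     # (batch, frames, 2*hid)
        dec, _ = self.decoder(ctx)                 # (batch, frames, hid)
        return self.mel_out(dec)                   # predicted mel features

model = Seq2SeqDialectModel()
print(model(torch.randn(2, 120, 144)).shape)       # torch.Size([2, 120, 80])
```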
  • the first dialect conversion model is Sichuan dialect to Mandarin
  • each sample data in the training data set corresponding to the first dialect conversion model includes an input and a label.
  • the input is Sichuan dialect
  • the label is Mandarin with the same content as the Sichuan dialect.
  • the second dialect conversion model is Putonghua to Sichuan dialect.
  • Each sample data in the training data set corresponding to the second dialect conversion model includes an input and a label.
  • the input is Mandarin
  • the label is Sichuan dialect with the same content as the Mandarin.
  • Different dialect conversion models are learned on different training datasets.
  • Figure 12 takes the parameter update of each module in a dialect conversion model as an example.
  • This training scheme is basically the same as the training method of the style conversion network.
  • the style separation model and the speech feature extraction model are already pre-trained and do not participate in the network update, that is, the training of the style separation model and the speech feature extraction model does not need to be considered.
  • the label is converted into the third mel feature through the STFT algorithm, and the output of this network is the fourth mel feature; the loss value and its gradient are obtained by comparing the fourth mel feature output by the network with the third mel feature obtained from the label through the STFT algorithm.
  • the black dotted line in Figure 12 is the gradient flow direction.
  • only when the gradient flows through a corresponding structure (such as the encoding layer, attention structure, and decoding layer) will the parameters in that module be updated, while the style separation model and the speech feature extraction model are pre-trained and do not need to participate in the parameter update of the network.
  • the loss value is used as an indicator for judging when the network stops training. When the loss value drops to a certain value and there is no obvious continuous downward trend, it means that the network has converged and training can be stopped.
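  • A hedged sketch of one such training step is shown below: the label is turned into mel features with an STFT-based transform, the network output is compared against it, and only the trainable modules receive gradients. The use of torchaudio, the L1 loss, and the simple frame cropping are illustrative assumptions; the application does not specify these details.

```python
import torch
import torch.nn as nn
import torchaudio

def mel_from_waveform(wav, sample_rate=16000, n_mels=80):
    """STFT-based mel features for the label utterance (parameter values are assumed)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate, n_fft=1024, hop_length=256, n_mels=n_mels)(wav)
    return mel.transpose(-1, -2)                      # (batch, frames, n_mels)

def train_step(dialect_model, frozen_feature_extractor, optimizer, input_wav, label_wav):
    # Pre-trained modules (feature extraction, style separation) are frozen:
    # they are run without gradients and receive no parameter updates.
    with torch.no_grad():
        feats = frozen_feature_extractor(input_wav)   # e.g. PPG features of the input

    predicted_mel = dialect_model(feats)              # "fourth" mel feature (network output)
    target_mel = mel_from_waveform(label_wav)         # "third" mel feature (from the label)

    # Crude alignment: crop both to the shorter frame count (a simplification).
    frames = min(predicted_mel.shape[1], target_mel.shape[1])
    loss = nn.functional.l1_loss(predicted_mel[:, :frames], target_mel[:, :frames])

    optimizer.zero_grad()
    loss.backward()                                   # gradient flows only through dialect_model
    optimizer.step()
    return loss.item()                                # monitored to decide when training stops
```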
  • unsupervised learning is used to extract style features of speech such as timbre, rhythm, and speaking rate, so as to realize controllable beautification of the style of speech.
  • for voice beautification scenarios in which the user expects the style of the processed voice to be the same as before processing, such as dialect conversion and voice enhancement, the present application can keep the style of the processed speech unchanged while realizing the dialect conversion or voice enhancement.
  • the present application uses artificial intelligence technology to provide a more convenient and richer voice beautification method that covers more scenarios, realizing de-accenting, accent addition, or voice enhancement for any speaker while keeping the voice style of the input and output unchanged.
  • the execution body for training and updating each model may be a server, and after the server has trained the speech processing neural network, the terminal downloads it to the local end.
  • the speech processing neural network is built in an APP, for example, the APP is an instant messaging APP and the like.
  • the voice conversion device includes: an input module 1320, a processing module 1310, an acquisition module 1330, and an output module 1340.
  • an input module 1320 configured to receive a mode selection operation input by a user, where the mode selection operation is used to select a mode of voice conversion;
  • a processing module 1310 configured to select a target conversion mode from a plurality of modes according to the mode selection operation received by the input module 1320, the plurality of modes include a style conversion mode, a dialect conversion mode and a speech enhancement mode;
  • an acquisition module 1330 configured to acquire the first voice to be converted
  • the processing module 1310 is further configured to extract the feature information of the first voice obtained by the obtaining module 1330, input the feature information of the first voice into the target voice conversion network corresponding to the target conversion mode, and output the converted second voice through the target voice conversion network;
  • the output module 1340 is used for outputting the second voice.
  • the processing module 1310 is further configured to input the first voice into a speech feature extraction model, and extract the phoneme posterior probability (PPG) feature of the first voice through the speech feature extraction model; the PPG feature is used to retain the content information of the first voice.
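  • A minimal sketch of a speech feature extraction model that outputs per-frame phoneme posteriors is given below, loosely following the convolution + LSTM + fully connected layout mentioned in the description; the layer sizes and the phoneme inventory size are assumed.

```python
import torch
import torch.nn as nn

class PPGExtractor(nn.Module):
    """Sketch of a speech feature extraction model producing per-frame phoneme
    posteriors (PPG). Layer sizes and the phoneme inventory size are assumed."""
    def __init__(self, mel_dim=80, hidden=256, n_phonemes=144):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(mel_dim, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_phonemes)

    def forward(self, mel):                       # mel: (batch, frames, mel_dim)
        x = self.convs(mel.transpose(1, 2))       # convolve over time: (batch, hidden, frames)
        x, _ = self.lstm(x.transpose(1, 2))       # (batch, frames, hidden)
        return torch.softmax(self.fc(x), dim=-1)  # per-frame phoneme posterior probabilities

ppg = PPGExtractor()(torch.randn(1, 200, 80))
print(ppg.shape, float(ppg[0, 0].sum()))          # (1, 200, 144), each frame sums to 1
```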
  • when the target conversion mode is the style conversion mode and the target voice conversion network is a style conversion network, the style conversion network includes a style separation model and a voice fusion model;
  • the obtaining module 1330 is used to obtain the third voice for extracting style features
  • the processing module 1310 is further configured to input the third voice into a style separation model, and separate the style features of the third voice through the style separation model;
  • the processing module 1310 is further configured to input the style feature and the feature information of the first voice into a voice fusion model for fusion to obtain the second voice.
  • the style feature includes a first feature, and the first feature includes a plurality of sub-features;
  • the processing module 1310 is further configured to input the third voice into a style separation model, and extract the vector of the first feature in the third voice through the style separation model;
  • the processing module 1310 is further configured to input the third voice into a sub-feature extraction model, and extract the vector of each sub-feature through the sub-feature extraction model;
  • the input module 1320 is further configured to receive the weight of each of the sub-features in the plurality of sub-features input by the user;
  • the processing module 1310 is further configured to determine the style feature of the third voice according to the vector of the first feature, the vector of each sub-feature, and the weight of each sub-feature received by the input module 1320.
  • the processing module 1310 is further configured to input the vector of the first feature into a multi-head attention structure, input the product of the vector of each sub-feature and its corresponding weight into the multi-head attention structure, and output the style feature of the third voice through the multi-head attention structure.
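  • The following sketch shows one plausible way the timbre (first-feature) vector and the weighted sub-feature vectors could be fed through a multi-head attention structure to produce a style embedding. Using the timbre vector as the query and the weighted sub-features as keys and values, and the head count, are assumptions; the 256-dimensional embedding follows the example given in the description.

```python
import torch
import torch.nn as nn

class StyleFusion(nn.Module):
    """Sketch: combine the timbre vector with user-weighted sub-feature vectors
    (e.g. prosody, stress, speaking rate) through multi-head attention."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, timbre_vec, sub_vecs, weights):
        # timbre_vec: (batch, dim); sub_vecs: (batch, n_sub, dim); weights: (batch, n_sub)
        weighted = sub_vecs * weights.unsqueeze(-1)           # scale each sub-feature vector
        query = timbre_vec.unsqueeze(1)                       # (batch, 1, dim)
        style, _ = self.attn(query, weighted, weighted)       # attend over weighted sub-features
        return style.squeeze(1)                               # style embedding of the third voice

fusion = StyleFusion()
timbre = torch.randn(1, 256)
subs = torch.randn(1, 3, 256)                                 # prosody, stress, speaking rate
w = torch.tensor([[0.2, 0.4, 0.3]])                           # user-chosen weights
print(fusion(timbre, subs, w).shape)                          # torch.Size([1, 256])
```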
  • the obtaining module 1330 is further configured to receive a template selection operation input by the user, where the template selection operation is used to select a target template; the voice corresponding to the target template is obtained and used as the third voice.
  • the obtaining module 1330 is further configured to receive a third voice input by a second speaker, where the first voice is the voice of a first speaker, and the second speaker is any person different from the first speaker.
  • the target conversion mode is a dialect conversion mode
  • the target voice conversion network is a dialect conversion network
  • the processing module 1310 is further configured to input the feature information of the first voice into a dialect conversion network, and output the second voice through the dialect conversion network, where the first voice is the voice of the first dialect and the second voice is the voice of the second dialect.
  • the dialect conversion network includes a plurality of dialect conversion models, and each dialect conversion model is for a different dialect to be converted;
  • the input module 1320 is further configured to receive a selection operation input by the user;
  • the processing module 1310 is further configured to input the feature information of the first voice into the dialect conversion model corresponding to the selection operation, and output the second voice through the dialect conversion model corresponding to the selection operation.
  • the processing module 1310 is further configured to input the first voice into a style separation model, and use the style separation model to separate the style features of the first voice;
  • the processing module 1310 is further configured to input the style feature of the first voice and the feature information of the first voice into the dialect conversion network, and output the second voice through the dialect conversion network, where the second voice has the same style as the first voice.
  • the first voice is far-field voice
  • the target conversion mode is a voice enhancement mode
  • the target voice conversion network is a voice enhancement model
  • the processing module 1310 is further configured to input the feature information of the first voice into a voice enhancement model corresponding to the mode, and output a second voice through the voice enhancement model, where the second voice is a near-field voice.
  • the processing module 1310 is further configured to input the first voice into a style separation model, and use the style separation model to separate the style features of the first voice;
  • the processing module 1310 is further configured to input the style feature of the first voice and the feature information of the first voice into the voice enhancement model, and output the second voice through the voice enhancement model.
  • the second voice is of the same style as the first voice.
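  • A minimal sketch of a style-conditioned enhancement model is shown below: every frame of the content features is concatenated with the style feature of the first voice so the output keeps the input's style. The recurrent architecture and all dimensions are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpeechEnhancer(nn.Module):
    """Sketch of an enhancement model conditioned on both the content features (PPG)
    and the style feature of the first voice, so the output keeps the input's style."""
    def __init__(self, ppg_dim=144, style_dim=256, mel_dim=80, hid=256):
        super().__init__()
        self.rnn = nn.GRU(ppg_dim + style_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, mel_dim)

    def forward(self, ppg, style):               # ppg: (B, T, ppg_dim), style: (B, style_dim)
        style_seq = style.unsqueeze(1).expand(-1, ppg.shape[1], -1)
        x = torch.cat([ppg, style_seq], dim=-1)  # condition every frame on the style
        h, _ = self.rnn(x)
        return self.out(h)                       # near-field mel features of the second voice

mel = SpeechEnhancer()(torch.randn(1, 150, 144), torch.randn(1, 256))
print(mel.shape)                                 # torch.Size([1, 150, 80])
```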
  • the obtaining module 1330 is further configured to receive the first voice input by the first speaker; or, select the first voice from a locally stored file.
  • the processing module 1310 may be a processing device, and the functions of the processing device may be partially or completely implemented by software.
  • the processing device may include a memory and a processor, wherein the memory is used to store a computer program, and the processor reads and executes the computer program stored in the memory to perform corresponding processing and/or steps in any one of the method embodiments.
  • the processing means may comprise only a processor.
  • the memory for storing the computer program is located outside the processing device, and the processor is connected to the memory through a circuit/wire to read and execute the computer program stored in the memory.
  • the processing means may be one or more chips, or one or more integrated circuits.
  • an embodiment of the present application provides a chip structure, as shown in FIG. 14; the chip includes:
  • the chip can be expressed as a neural-network processing unit (NPU) 140, and the NPU is mounted on the main CPU (host CPU) as a co-processor, and the host CPU assigns tasks.
  • the core part of the NPU is the arithmetic circuit 1403, which is controlled by the controller 1404 to extract the matrix data in the memory and perform multiplication operations.
  • the arithmetic circuit 1403 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 1403 is a two-dimensional systolic array. The arithmetic circuit 1403 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1403 is a general-purpose matrix processor.
  • for example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1402 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 1401, performs the matrix operation with matrix B, and stores the partial or final result of the obtained matrix in the accumulator 1408.
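  • The NumPy sketch below mirrors this data flow at a functional level (stage B, stream A, accumulate partial products); it does not model the PE array, its scheduling, or the memories, and the function name is hypothetical.

```python
import numpy as np

def matmul_with_accumulator(A, B):
    """Reference computation mirroring the flow described above: matrix B is first
    staged (as in the weight memory), then columns of A are streamed through and the
    partial products are summed in an accumulator."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    accumulator = np.zeros((M, N), dtype=A.dtype)
    for k in range(K):                              # stream one rank-1 update at a time
        accumulator += np.outer(A[:, k], B[k, :])   # partial result added to the accumulator
    return accumulator

A = np.random.rand(4, 3).astype(np.float32)
B = np.random.rand(3, 5).astype(np.float32)
print(np.allclose(matmul_with_accumulator(A, B), A @ B, atol=1e-5))   # True
```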
  • Unified memory 1406 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 1402 through the storage unit access controller 1405 (direct memory access controller, DMAC).
  • Input data is also moved to unified memory 1406 via the DMAC.
  • the BIU is the bus interface unit 1410, which is used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer 1409.
  • the bus interface unit 1410 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 1409 to obtain instructions from the external memory, and also for the storage unit access controller 1405 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 1406 , the weight data to the weight memory 1402 , or the input data to the input memory 1401 .
  • the vector calculation unit 1407 includes a plurality of operation processing units, and further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc., if necessary.
  • It is mainly used for non-convolution/fully-connected (FC) layer computation in neural networks, such as pooling, batch normalization, local response normalization, and so on.
  • vector computation unit 1407 can store the processed output vectors to unified buffer 1406 .
  • the vector calculation unit 1407 may apply a nonlinear function to the output of the arithmetic circuit 1403, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 1407 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as an activation input to the arithmetic circuit 1403, eg, for use in subsequent layers in a neural network.
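  • The sketch below illustrates, in plain NumPy, the kind of post-processing attributed to the vector calculation unit (activation, pooling, and normalization applied to the arithmetic circuit's output); the specific operations, their order, and the helper name are assumptions.

```python
import numpy as np

def vector_unit_postprocess(matmul_output, mode="relu"):
    """Sketch of element-wise / reduction work on the arithmetic circuit's output.
    Which operations apply, and in what order, is configuration-dependent."""
    x = matmul_output
    if mode == "relu":
        x = np.maximum(x, 0.0)                       # nonlinear activation on accumulated values
    # 2x max pooling along the last axis (assumes an even length)
    pooled = x.reshape(*x.shape[:-1], -1, 2).max(axis=-1)
    # simple per-row normalization (stand-in for batch/local response normalization)
    norm = (pooled - pooled.mean(axis=-1, keepdims=True)) / (pooled.std(axis=-1, keepdims=True) + 1e-6)
    return norm

out = vector_unit_postprocess(np.random.randn(4, 8))
print(out.shape)                                     # (4, 4)
```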
  • the instruction fetch buffer 1409 connected to the controller 1404 is used to store the instructions used by the controller 1404;
  • the unified memory 1406, the input memory 1401, the weight memory 1402 and the instruction fetch memory 1409 are all On-Chip memories. External memory is private to the NPU hardware architecture.
  • in this application, the operations of each layer in the speech feature extraction model, the style separation model, the multi-head attention structure, each sub-feature extraction model, the encoding layer, decoding layer, and attention structure of the seq2seq neural network, the dialect conversion model, and the speech enhancement model can be performed by the arithmetic circuit 1403 or the vector calculation unit 1407.
  • the arithmetic circuit 1403 or the vector calculation unit 1407 calculates the parameter value (such as the first parameter value), and the main CPU is used to read the computer program stored in the at least one memory, so that the terminal executes the method performed by the terminal in the above method embodiments.
  • an embodiment of the present invention also provides another voice conversion apparatus.
  • the voice conversion device can be a terminal, and the terminal can be a mobile phone, a tablet computer, a notebook computer, a smart watch, etc.
  • the terminal is a mobile phone as an example:
  • FIG. 15 is a block diagram showing a partial structure of a mobile phone related to a terminal provided by an embodiment of the present invention.
  • the mobile phone includes: a radio frequency (RF) circuit 1510, a memory 1520, an input unit 1530, a display unit 1540, an audio circuit 1560, a processor 1580, and a power supply 1590.
  • the structure shown in FIG. 15 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, combine some components, or have a different arrangement of components.
  • the RF circuit 1510 can be used for receiving and sending signals during the transmission and reception of information or during a call; in particular, after downlink information of a base station is received, it is handed to the processor 1580 for processing.
  • the memory 1520 can be used to store software programs and modules, and the processor 1580 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1520 .
  • the memory 1520 may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function (such as a sound playback function, etc.), and the like. Additionally, memory 1520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
  • the input unit 1530 may be used for receiving inputted numerical or character information, and generating key signal input related to user setting and function control of the mobile phone.
  • the input unit 1530 may include a touch panel 1531 and other input devices 1532 .
  • the touch panel 1531, also known as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 1531 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 1531 may include two parts, a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends them to the processor 1580, and can receive and execute commands sent by the processor 1580.
  • the input unit 1530 may also include other input devices 1532 .
  • other input devices 1532 may include, but are not limited to, one or more of physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, joysticks, and the like.
  • the input unit 1530 is used to receive various operations input by the user, for example, mode selection operations and the like.
  • the function of the input module 1320 in FIG. 13 may be performed by the input unit 1530 , or the function of the acquisition module 1330 in FIG. 13 may be performed by the input unit 1530 .
  • the display unit 1540 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 1540 may include a display panel 1541, and optionally, the display panel 1541 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), and the like.
  • the touch panel 1531 may cover the display panel 1541; when the touch panel 1531 detects a touch operation on or near it, it transmits the operation to the processor 1580 to determine the type of the touch event, and then the processor 1580 provides a corresponding visual output on the display panel 1541 according to the type of the touch event.
  • although the touch panel 1531 and the display panel 1541 are shown as two independent components to realize the input and output functions of the mobile phone, in some embodiments the touch panel 1531 and the display panel 1541 can be integrated to realize the input and output functions of the mobile phone.
  • the display unit 1540 is used to display the APP interface shown in FIG. 6A , FIG. 6B , FIG. 11A , and FIG. 11B corresponding to the method embodiment.
  • the audio circuit 1560, the speaker 1561, and the microphone 1562 can provide an audio interface between the user and the mobile phone.
  • the audio circuit 1560 can convert received audio data into an electrical signal and transmit it to the speaker 1561, and the speaker 1561 converts it into a sound signal for output; on the other hand, the microphone 1562 converts a collected sound signal into an electrical signal, which is received by the audio circuit 1560 and converted into audio data, and the audio data is then output to the processor 1580 for processing.
  • the audio circuit 1560 receives the first voice of the first speaker through the microphone 1562, or receives the third voice of the second speaker.
  • the speaker 1561 is configured to output the processed second voice, for example, the second voice is a style-converted voice, or the second voice is a dialect-converted voice, or the second voice is a voice-enhanced voice.
  • the speaker 1561 is used to output the second voice.
  • the function of output module 1340 in FIG. 13 may be performed by speaker 1561.
  • the processor 1580 is the control center of the mobile phone; it uses various interfaces and lines to connect the various parts of the entire mobile phone, and executes the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1520 and calling the data stored in the memory 1520, so as to monitor the mobile phone as a whole.
  • processor 1580 may include one or more processing units.
  • the mobile phone also includes a power supply 1590 (such as a battery) that supplies power to various components.
  • the power supply can be logically connected to the processor 1580 through a power management system, so as to manage charging, discharging, and power consumption management functions through the power management system.
  • the mobile phone may also include a camera, a Bluetooth module, and the like, which will not be repeated here.
  • Embodiments of the present application further provide a computer-readable storage medium, where a program is stored in the computer-readable storage medium; when the program runs on a computer, the computer is caused to execute the steps performed by the terminal device in the methods described in the foregoing method embodiments.
  • the embodiments of the present application also provide a computer program product, which, when run on a computer, causes the computer to execute the steps performed by the terminal in the methods described in the foregoing method embodiments.
  • An embodiment of the present application further provides a circuit system, where the circuit system includes a processing circuit, and the processing circuit is configured to perform the steps performed by the terminal device in the method described in the foregoing method embodiments.
  • the chip when the device is a chip in the terminal, the chip includes: a processing unit and a communication unit, the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins or circuits, etc.
  • the processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the terminal executes the wireless communication method according to any one of the above-mentioned first aspect.
  • the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit in the terminal located outside the chip, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), and so on.
  • the processor mentioned in any one of the above may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the program of the wireless communication method of the above first aspect.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • in essence, the technical solutions of the present application, or the part that contributes to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

一种语音转换方法及相关设备,用于实现人声美化的多样化。本申请实施例方法包括: 接收用户输入的模式选择操作,所述模式选择操作用于选择语音转换的模式; 提供的多种可以选择的模式包括: 风格转换模式,用于对待转换的第一语音进行说话风格转换; 方言转换模式,用于对第一语音实现加口音或去口音; 语音增强模式,用于对第一语音实现语音增强;三种模式具有对应的语音转换网络,根据用户选择的目标转换模式,选择目标转换模式对应的目标语音转换网络对第一语音进行转换,输出转换之后的第二语音,从而实现人声美化的多样化,满足用户在不同应用场景下的需求。

Description

一种语音转换的方法及相关设备
本申请要求于2020年9月21日提交中国专利局、申请号为“202010996501.5”、申请名称为“一种语音转换的方法及相关设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音处理技术领域,尤其涉及一种语音转换的方法及相关设备。
背景技术
人声美化是指对声音进行修饰和美化,产生悦耳的听觉效果。目前市面上很多的录音软件或者社交应用(application,APP)的语音录制功能都会带有一定的人声美化的能力。例如,对语音进行去噪、提高语音的亮度或音量等,仅是对说话人自身的语音特点进行修饰,对于人声美化的模式过于单一。
发明内容
本申请实施例提供了一种语音转换的方法及相关设备,用于提供多种人声美化模式,实现人声美化的多样性。
第一方面,本申请实施例提供了一种语音转换方法,该方法应用于语音转换装置,例如,该装置可以为终端,该方法可以包括:终端接收用户输入的模式选择操作,模式选择操作用于选择语音转换的模式,然后,终端,根据模式选择操作从多个模式中选择目标转换模式,多个模式包括风格转换模式,方言转换模式和语音增强模式;终端可以选择目标转换模式对应的目标语音转换网络,通过目标语音转换网络实现语音转换;终端设备获取待转换的第一语音,进一步提取第一语音的特征信息,该特征信息用于保留第一语音的内容信息,然后,终端将第一语音的特征信息输入到目标转换模式对应的目标语音转换网络,通过目标语音转换网络输出转换后的第二语音,最后输出第二语音。
本实施例中,提供了多种可以选择的模式,如风格转换模式,用于对第一语音进行说话风格转换;方言转换模式,用于实现加口音或去口音;语音增强模式,用于实现语音增强;三种模式具有对应的语音转换网络,根据用户的需求,模式对应的语音转换网络可以对第一语音进行转换,从而实现人声美化的多样化,满足用户不同应用场景的需求。
在一种可选的实现方式中,提取第一语音的特征信息可以具体包括:终端将第一信息输入到语音特征提取模型,通过语音特征提取模型提取第一语音的音素后验概率PPG特征;PPG特征用于保留第一语音的内容信息。PPG特征描述的是语音帧所对应音素集中每个音素概率的大小,相当于对音素进行了识别,该PPG特征用于保留第一语音的内容信息,本示例中,不需要将语音转换成文本来保留语音的内容信息,而是直接通过语音输入,且通过PPG特征保留第一语音的内容信息,更能增加鲁棒性。
在一种可选的实现方式中,当目标转换模式为风格转换模式,目标语音转换网络为风格转换网络时,风格转换网络包括风格分离模型和语音融合模型,该方法还可以包括:终端获取用于提取风格特征的第三语音,并将第三语音输入到风格分离模型,通过风格分离 模型分离第三语音的风格特征;然后进一步的将风格特征和第一语音的特征信息输入到语音融合模型进行融合,得到第二语音。
本实施例中,终端接收待转换的第一语音和用于提取风格特征的第三语音,然后,第一语音输入到语音特征提取模型,通过语音特征提取模型提取PPG特征;PPG特征用于保留第一语音的内容信息,通过PPG特征实现语音的直接输入。终端将第三语音输入到风格分离模型,通过风格分离模型分离第三语音的风格特征,最后,将风格特征和PPG特征输入到语音融合模型进行融合,得到融合了第一语音内容和第三语音风格的第二语音。第三语音可以是任意人的语音,从而实现将第一语音转换成任意人的语音风格,实现语音风格转换的多样性。在一种可选的实现方式中,风格特征包括第一特征,第一特征包括多个子特征;将第三语音输入到风格分离模型,通过风格分离模型分离第三语音的风格特征可以具体包括:终端将第三语音输入到风格分离模型,通过风格分离模型提取第三语音中的第一特征的向量,例如,该第一特征可以为音色,然后,将第三语音输入到子特征提取模型,通过子特征提取模型提取子特征的向量;接收用户输入的多个子特征中每个子特征的权重;根据第一特征的向量,每个子特征的向量及每个子特征的权重确定第三语音的风格特征。
本实施例中,通过用户输入的各子特征对应的权重,调整待转换语音的风格与第三语音(即目标模板对应的语音)的相似度,该相似程度由用户输入的权重决定。最终输出的语音风格与第三语音的相似程度由该权重决定,用户可以通过调整输入的权重,灵活调整待转换语音的风格,转换语音的风格既可以与第三语音的风格完全相同,也可以在第三语音的风格的基础上进行改变,从而实现待转换语音的风格多样化。
在一种可选的实现方式中,根据第一特征的向量,每个子特征的向量及每个子特征的权重确定第三语音的风格特征可以包括:终端将第一特征的向量输入多头注意力结构,且将每个子特征的向量及与其对应权重的乘积输入到多头注意力结构,通过多头注意力结构输出第三语音的风格特征。
本实施例中,多头注意力结构使得模型能够从不同的子空间提取特征表达,每个头对应高维空间中的一个子特征空间,相当于将高维空间分解,每个头负责一个子特征空间。多头注意力机制相当于多个结构相同的注意力机制,从而使得多头注意力机制输出的结果包含第三语音的部分音色。
在一种可选的实现方式中,获取用于提取风格特征的第三语音可以包括:
终端接收用户输入的模板选择操作,模板选择操作用于选择目标模板,例如,该目标模板可以是男播音员的风格,或者,可以为某一位“女播音员”的语音风格等,终端获取目标模板对应的语音,将目标模板对应的语音作为第三语音,将第三语音的风格特征融合到第一语音中,从而可以实现语音风格的多样化。
在一种可选的实现方式中,获取用于提取风格特征的第三语音可以具体包括:终端还可以接收第二说话人输入的第三语音,第一语音为第一说话人的语音,第二说话人为与第一说话人不同的任意人,从而可以实现语音风格的多样化。
在一种可选的实现方式中,当目标转换模式为方言转换模式,目标语音转换网络为方言转换网络时,将第一语音的特征信息输入到目标转换模式对应的目标语音转换网络,通 过目标语音转换网络输出转换后的第二语音,还可以具体包括:终端将第一语音的特征信息输入到方言转换网络,通过方言转换网络输出第二语音,第一语音为第一方言的语音,第二语音为第二方言的语音,从而实现方言转换,增强不同地域用户的交流便捷性,实现语音转换的多样化。
在一种可选的实现方式中,方言转换网络包括多个方言转换模型,每个方言转换模型分别针对不同的待转换方言,方法还包括:终端可以接收用户输入的选择操作;并且将第一语音的特征信息输入到选择操作对应的方言转换模型,通过选择操作对应的方言转换模型输出第二语音。本实施例中,终端可以根据用户输入的具体操作选择对应的方言转换模型。
在一种可选的实现方式中,方法还包括:终端将第一语音输入到风格分离模型,通过风格分离模型分离第一语音的风格特征;然后,终端将第一语音的风格特征和第一语音的特征信息输入到方言转换网络,通过方言转换网络输出第二语音,第二语音和第一语音的风格相同。
本实施例中,该第二语音内容与输入语音(第一语音)相同,且保留了输入语音(第一语音)的说话风格。
在一种可选的实现方式中,第一语音为远场语音,当目标转换模式为语音增强模式,目标语音转换网络为语音增强模型时,将第一语音的特征信息输入到目标转换模式对应的目标语音转换网络,通过目标语音转换网络输出转换后的第二语音可以包括:终端将第一语音的特征信息输入到模式对应的语音增强模型,通过语音增强模型输出第二语音,第二语音为近场语音。本示例中,将远场语音转换成近场语音,实现语音增强,以增加语音的清晰度,增加了应用场景,实现语音转换的多样化。
在一种可选的实现方式中,方法还包括:终端将第一语音输入到风格分离模型,通过风格分离模型分离第一语音的风格特征;然后,将第一语音的风格特征和第一语音的特征信息输入到语音增强模型,通过语音增强模型输出第二语音,第二语音和第一语音的风格相同。本示例中,转换之后的语音与输入语音(第一语音)相同,且保留了输入语音(第一语音)的说话风格。
在一种可选的实现方式中,获取待转换的第一语音可以包括:接收第一说话人输入的第一语音;或者,从本地存储文件中选择第一语音。
第二方面,本申请实施例提供了一种语音转换装置,该装置具有实现上述第一方面终端所执行的功能,该功能可以通过硬件实现,也可以通过硬件执行相应的软件实现;该硬件或软件包括一个或多个与上述功能相对应的模块。
第三方面,本申请实施例提供了一种终端,包括处理器,处理器和至少一个存储器耦合,处理器用于读取至少一个存储器所存储的计算机程序,使得终端执行上述第一方面中任一项的方法。
第四方面,本申请实施例提供了一种计算机可读介质,所述计算机可读存储介质用于存储计算机程序,当所述计算机程序在计算机上运行时,使得所述计算机执行上述第一方面的方法。
第五方面,本申请提供了一种芯片***,该芯片***包括处理器,用于支持终端设备实现上述方面中所涉及的功能。在一种可能的设计中,所述芯片***还包括存储器,所述存储器,用于保存终端设备必要的程序指令和数据。该芯片***,可以由芯片构成,也可以包括芯片和其他分立器件。
附图说明
图1为本申请实施例中一种语音转换方法的一个实施例的步骤流程图;
图2为本申请实施例中选择模式的场景示意图;
图3为本申请实施例中各模式与各语音转换网络之间的对应关系的示意图;
图4为本申请实施例中一种语音转换方法的另一个实施例的步骤流程图;
图5为本申请实施例中语音进行风格转换过程的一个示例的示意图;
图6A为本申请实施例中风格转换模式的界面的一个场景示意图;
图6B为本申请实施例中风格转换模式的界面的另一个场景示意图;
图7为本申请实施例中风格转换网络训练并更新的示意图;
图8为本申请实施例中一种语音转换方法的另一个实施例的步骤流程图;
图9为本申请实施例中语音进行方言转换过程的一个示例的示意图;
图10为本申请实施例中语音进行方言转换过程的另一个示例的示意图;
图11A为本申请实施例中方言转换模式及语音增强模式的界面的一个场景示意图;
图11B为本申请实施例中方言转换模式及语音增强模式的界面的另一个场景示意图;
图12为本申请实施例中方言转换模型训练并更新的示意图;
图13为本申请实施例中一种语音转换装置的一个实施例的结构示意图;
图14为本申请实施例中一种芯片的一个示例的结构示意图;
图15为本申请实施例中一种语音转换装置的另一个实施例的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。本申请中出现的术语“和/或”,可以是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本申请中字符“/”,一般表示前后关联对象是一种“或”的关系。本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或模块的过程、方法、***、产品或设备不必限于清楚地列出的那些步骤或模块,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或模块。
本申请实施例提供了一种声音转换方法,该方法应用于终端,该终端可以为手机、平板电脑,笔记本电脑,智能手表等具有语音功能的智能终端,该终端设备也可以称为终端设备,用户设备(user equipment,UE)等。该终端设备可以以手机为例进行说明。该终端设备内配置有语音处理神经网络,语音处理神经网络主要包括风格转换网络,方言转换网 络和语音增强模型。语音处理神经网络实现对语音的转换(美化)。本申请中提供了多种可以选择的模式,如风格转换模式,用于对第一语音进行说话风格转换;方言转换模式,用于实现加口音或去口音;语音增强模式,用于实现语音增强;三种模式具有对应的语音转换网络,终端根据用户选择的目标转换模式,选择目标转换模式对应的目标语音转换网络对获取到的待转换的第一语音进行转换,输出转换之后的第二语音,从而实现人声美化的多样化,满足用户不同应用场景的需求。
为了更好的理解本申请,首先对本申请中涉及的词语进行说明。
序列到序列(sequence to sequence,seq2seq)神经网络:seq2seq属于编码(encoder)-解码(decoder)结构的一种,输入的序列(sequence)通过编码层压缩成指定长度的向量,然后将该向量输入到解码层中得到输出的序列(sequence)。seq2seq神经网络指编码层和解码层都由神经网络构成。其中,编码层(encoder)用于通过一定规则对输入的序列进行编码生成向量。解码层(decoder)用于将生成的向量转化成输出的序列。
注意力(attention)结构:编码层中生成多个指定长度的向量,每个向量都由输入特征加权得到,权值与解码层的输出相关,其目的是让解码层的输出更关注输入的不同关键部分,而不是关注整段输入。
神经网络声码器(vocoder):本质上是一种神经网络,用于将语音处理神经网络的输出特征转换为自然度高的语音。
风格特征:包括但不限定于音色,音色是指声音的品质,又叫音品,音色可以反映说话人发出的声音特有的品质。音色包括但不限定于韵律,重音和语速。
音素:是一个发音动作的最小语音单位,如hao(好),一共有3个音素,wo(我),一种有2个音素。
音素后验概率(phoneme posteriorgram,PPG)特征:每个语音帧属于一组预先定义的语音单元(音素或三音素/助音素)的后验概率,这些语音单元保留了语音的语言和语音信息。
梅尔(mel)特征:语音信号存储为波形图特征是表示为其时域上的特征变换,但是仅根据波形图特征是不能看出语音信号的频率分布的。通过对语音信号频率域的分析,可以提取语音的特征。梅尔特征作为语音信号的特征表示,可以清晰表现语音共振峰特性。
请参阅图1所示,本申请提供了一种语音转换方法的一个实施例,该方法的执行主体为终端设备。或者,该方法的执行主体也可以为终端设备中的处理器,或者,该执行主体也可以是终端设备中的芯片。本申请实施例中,该方法的执行主体以终端为例,该终端以手机为例进行说明。终端内配置有语音处理神经网络,语音处理神经网络主要包括风格转换网络、方言转换网络和语音增强模型。
步骤101、终端接收用户输入的模式选择操作,模式选择操作用于选择语音转换的模式。
请参阅图2所示,模式选择操作可以为点击操作,多个模式包括风格转换模式,方言 转换模式和语音增强模式。其中,每种模式具有对应的语音转换网络,通过对应的语音转换网络对语音进行转换。例如,风格转换模式用于转换语音风格。方言转换模式用于方言转换,例如“加口音”或者“去口音”等。语音增强模式用于将远场语音转换成近场语音。
步骤102、终端根据模式选择操作从多个模式中选择目标转换模式,多个模式包括风格转换模式,方言转换模式和语音增强模式。
终端的显示界面上可以显示多个模式的列表,用户根据需求通过点击操作选择多个模式中的目标转换模式,该目标转换模式可以是上述三种模式中的任一种模式。
终端选择目标转换模式对应的目标语音转换网络。
请参阅图3所示,风格转换模式对应风格转换网络,方言转换模式对应方言转换网络,语音增强模式对应语音增强模型。终端根据模式与语音转换网络之间的对应关系,确定目标转换模式对应的目标语音转换网络。例如,当用户选择风格转换模式时,目标语音转换网络为风格转换网络。当用户选择言转换模式时,目标语音转换网络为方言转换网络。当用户选择语音增强模式时,目标语音转换网络为语音增强模型。
步骤103、终端获取待转换的第一语音。
第一种实现方式,终端接收第一说话人输入的第一语音。
终端设备通过麦克风接收第一说话人输入的第一语音。终端设备接收用户(该用户可以和第一说话人为同一人,也可以为不同人)输入的第一操作,该第一操作为录制语音的操作。例如,该第一操作可以为点击操作。用户在手机屏幕上点击“开始录制”按钮,手机接收到该第一操作后,开始录制用户输入的语音,该语音作为用户希望被美化(或处理)的语音(即第一语音)。
第二种实现方式,终端从本地存储文件中选择第一语音。
终端设备可以从本地存储文件中获取第一语音。终端设备接收用户输入的第二操作,该第二操作为选择语音文件的操作。例如,用户在手机屏幕上点击“选择文件”按钮,手机可以根据该第二操作显示待选择的语音列表,然后,终端设备接收用户输入的第三操作,该第三操作用于选择目标语音,终端设备将该目标语音作为第一语音。
步骤104、终端提取第一语音的特征信息。
该特征信息可以为梅尔频率倒谱系数(mel frequency cepstrum coefficient,MFCC)特征。或者,该特征信息为音素后验概率PPG特征。本申请实施例中,该第一语音的特征信息以PPG特征为例进行说明。PPG特征描述的是语音帧所对应音素集中每个音素概率的大小,相当于对音素进行了识别,该PPG特征用于保留第一语音的内容信息。
步骤105、终端将第一语音的特征信息输入到目标转换模式对应的目标语音转换网络,通过目标语音转换网络输出转换后的第二语音。
第一种模式,目标语音转换网络为风格转换网络。示例性的,终端获取用于提取风格特征的第三语音,然后,再将第三语音输入到风格分离模型,通过风格分离模型分离第三语音的风格特征;最后,将风格特征和第一语音的PPG特征输入到语音融合模型进行融合,得到第二语音。
第二种模式,将第一语音的特征信息输入到方言转换网络,通过方言转换网络输出第 二语音,第一语音为第一方言的语音,第二语音为第二方言的语音。
第三种模式,将第一语音的PPG特征输入到模式对应的语音增强模型,通过语音增强模型输出第二语音,第二语音为近场语音。
对于人声美化包括多个模式,多个模式包括风格转换模式,方言转换模式和语音增强模式,可以根据用户选择的模式对第一语音进行美化。例如,对第一语音的风格进行美化,对第一语音进行方言转换,或对第一语音进行语音增强等,实现人声美化的多样性。
步骤106、终端输出第二语音。
终端通过扬声器输出转换后的第二语音。
本申请实施例中,提供了多种可以选择的模式,如风格转换模式,用于对第一语音进行说话风格转换;方言转换模式,用于实现加口音或去口音;语音增强模式,用于实现语音增强;三种模式具有对应的语音转换网络,根据用户的需求,模式对应的语音转换网络可以对第一语音进行转换,从而实现人声美化的多样化,满足用户不同应用场景的需求。
请参阅图4所示,本申请实施例提供了一种语音转换的另一个实施例,本实施例中针对风格转换模式进行说明,即对第一语音进行风格转换进行示例性说明。模式为风格转换模式,目标语音转换网络为风格转换网络。
步骤401、终端获取待转换的第一语音及用于提取风格特征的第三语音。
本步骤中终端获取第一语音的方法请参阅图1对应的实施例中的步骤103的说明,此处不赘述。
终端获取第三语音的方式:
第一种实现方式中,终端接收用户输入的模板选择操作,终端根据模板选择操作选择目标模板对应的语音,然后将该目标模板对应的语音作为第三语音。例如,目标模板包括但不限定“男播音员”,“女播音员”,“演员声音”等等。
可选地,该目标模板也可以为一个类别。进一步的,终端设备接收子模板选择操作,终端设备根据子模板选择操作选择子模板对应的目标语音。例如,“男播音员”类别下的子模板列表包括“播音员A”、“播音员B”和“播音员C”等。“女播音员”类别下的子模板列表包括“播音员D”、“播音员C”等。“演员声音”类别下的子模板列表包括“演员D”、“演员F”等。例如,用户选择“播音员A”,终端设备根据用户输入的子模板选择操作选择“播音员A”对应的语音作为第三语音。需要说明的是,本实施例中目标模板及子模板的示例,均是举例说明,并不造成限定。
第二种实现方式中,终端可以接收第二说话人输入第三语音。示例性的,终端设备接收用户输入的模板选择操作,该模板选择操作选择的目标模板是“任意人语音”,终端可以将该“任意人语音”对应的语音作为第三语音。例如,用户点击“任意人语音”选项,终端开始录制第三语音,该第二说话人是与第一说话人不同的任意人。
可选地,该第三语音可以是预先存储在本地的录制好的语音。例如,第三语音可以是电影中喜欢的角色的声音(卡通人物声音,或者某个演员的声音)。可选地,该第三语音也可以是预先从网上下载后,保存到终端设备的语音。
步骤402、终端将第一语音输入到语音特征提取模型,通过语音特征提取模型提取第一语音的PPG特征,PPG特征用于保留第一语音的内容信息。
请参阅图5所示,终端将第一语音输入到语音特征提取模型,通过语音特征提取模型提取第一语音的PPG特征。示例性的,语音特征提取模型可以为深度神经网络,语音特征提取模型包含多层的卷积层、两层LSTM层和一层全连接层,语音特征提取模型输出第一语音的PPG特征,PPG特征描述的是语音帧所对应音素集中每个音素概率的大小,相当于对音素进行了识别,该PPG特征用于保留第一语音的内容信息。该语音特征提取模型预先根据大量的语料进行训练得到的。
本实施例中,使用PPG特征作为第一语音的内容信息输入,这些语音单元(音素)保留了语音的语言和语音信息。使用PPG特征相比自动语音识别技术(automatic speech recognition,ASR)更能增加鲁棒性,ASR技术需要先将语音转换成文本,增加了语音内容识别错误的概率,本实施例中,通过使用PPG特征作为第一语音的内容信息输入,即直接可以实现语音输入,不需要转换成文本内容,增加***的鲁棒性。
步骤403、终端将第三语音输入到风格分离模型,通过风格分离模型分离第三语音的风格特征。
风格特征包括第一特征,第一特征包括多个子特征。例如,第一特征为音色特征,多个子特征包括韵律,重音和语速等。
风格分离模型用于分离第三语音的风格特征。该风格分离模型包括音色分离模型。该音色分离模型用于分离第三语音的音色特征,从而得到第一特征的向量(即音色特征向量)。
可选地,风格转换网络还包括多个子特征提取模型及多头注意力结构。终端将第三语音输入到子特征提取模型,通过子特征提取模型提取子特征的向量。例如,多个子特征提取模型包括韵律提取模型,重音提取模型和语速提取模型。例如,韵律提取模型用于提取第三语音中的韵律特征,得到韵律向量。重音提取模型用于提取第三语音中的重音特征,得到重音向量。语速提取模型用于提取第三语音中的语速特征,得到语速向量。
首先,终端接收用户输入的多个子特征中每个子特征的权重。一个示例中,手机的屏幕上显示韵律、语速和重音的调节条,用户可以通过对每个子特征的调节条的调节来输入每个子特征对应的权重。每个子特征对应的权重可以根据用户自己的需求灵活调整。例如,将“韵律”设为10%表示最终输出的语音跟目标模板的韵律有10%的相似度,即将一个值为0.1的参数传给了内置的子特征提取模型。
另一个示例中,预先配置几个档位。例如,可以分为三个档位,每个档位中各子特征的权重是根据经验值预先配置好的。例如,一档中,韵律的权重为0.1,语速为0.2,重音为0.1。二档中,韵律的权重为0.2,语速为0.2,重音为0.1等等。终端通过接收用户输入的档位从而确定各子特征对应的权重。本示例中,不需要用户单独调整每个子特征的权重,用户只需要选择档位即可,方便用户操作。
然后,终端根据第一特征的向量,每个子特征的向量及每个子特征的权重确定第三语音的风格特征。
终端将用户输入的各子特征对应的权重与各子特征的向量相乘。将相乘之后的结果和 第一特征的向量(如音色向量)同时输入多头注意结构中进行注意对齐,使得多头注意结构输出风格向量,该风格向量为第三语音的风格特征。多头注意力结构使得模型能够从不同的子空间提取特征表达,每个头对应高维空间中的一个子特征空间,相当于将高维空间分解,每个头负责一个子特征空间。多头注意力机制相当于多个结构相同的注意力机制,从而使得多头注意力机制输出的结果包含第三语音的部分音色。例如,多头注意力机制输出的结构可以为一个256维的风格嵌入向量(风格特征)。
本实施例中,通过用户输入的各子特征对应的权重,调整待转换语音的风格与第三语音的相似度,该相似程度由用户输入的权重决定。用户可以选择是否输入该权重,若用户选择输入该权重,则最终输出的语音风格与第三语音的相似程度由输入的权重决定。若用户选择选择不输入该权重,则最终输出的语音风格将和第三语音的风格(或音色,风格以音色为例)完全相同。用户可以通过调整输入的权重,灵活调整待转换语音的风格,转换语音的风格既可以与第三语音的风格完全相同,也可以在第三语音的风格的基础上进行改变,从而实现待转换语音的风格多样化。例如,该转换语音的风格可以为某一位“男播音员”的语音风格;或者,可以为某一位“女播音员”的语音风格;或者,可以为任意人的语音风格,而且还可以在“男播音员”、“女播音员”或任意人的语音风格的基础上进行改变,从而可以实现语音风格的多样化。
步骤404、终端将风格特征和PPG特征输入到语音融合模型进行融合,得到第二语音。
示例性的,该语音融合模型为seq2seq神经网络,在seq2seq神经网络中引入注意力机制。其中,该seq2seq神经网络包括编码层、解码层和注意力结构。编码层和解码层的模型可以采用任意神经网络模型的组合。例如,包括卷积神经网络(convolutional neural networks,CNN)、循环神经网络(recurrent neural network,RNN)、长短期记忆网络(long short-term memory,LSTM)中的任意种或任意两种的组合等等。例如,本申请中,该编码层可以包含三个卷积层和一个双向LSTM层。PPG特征首先会输入编码层,编码层将输入的PPG序列编码成一个固定维度的向量,由于输入序列长度可能较长,在解码时一个向量难以表达丰富的信息,因而引入了注意力机制。然后,风格特征的向量与编码层输出的PPG向量在宽度上拼接,得到注意力输入矩阵。然后,将这个注意力输入矩阵一帧一帧地送入注意力结构中,并注意力结构与编码层配合输出第二语音的梅尔(或也可以称为梅尔频谱)特征。梅尔频谱特征为第二语音的特征表示形式。在编码结束后,对编码模块的输出向量进行不同的加权得到若干向量,每个向量对应一个输出,从而保证了输出序列不再依据全部的输入序列,而是对输入序列有所聚焦。
步骤405、终端输出第二语音。
该梅尔频谱特征为中间表示,梅尔频谱虽然蕴含了输出语音的信息,却不能直接播放,需要经过语音信号转化为梅尔谱图的逆过程。本申请中,可以通过声码器通过梅尔频谱的声学特征生成可播放的音频文件。该声码器可以为神经网络声码器,该网络负责将梅尔特征转换为高自然度的语音信号。该网络由多个卷积层和反卷积层构成,最终输出为可以播放的语音。语音中的内容与第一语音相同,说话风格与第三语音相同。该终端可以直接播放第二语音。可选的,用户可以选择是否保留第二语音。若用户选择保留该第二语音,终 端将该第二语音进行存储。
本实施例中,终端接收待转换的第一语音和用于提取风格特征的第三语音,然后,第一语音输入到语音特征提取模型,通过语音特征提取模型提取PPG特征,PPG特征用于保留第一语音的内容信息,通过PPG特征实现语音的直接输入。终端将第三语音输入到风格分离模型,通过风格分离模型分离第三语音的风格特征,最后,将风格特征和PPG特征输入到语音融合模型进行融合,得到融合了第一语音内容和第三语音风格的第二语音。第三语音可以是任意人的语音,从而实现将第一语音转换成任意人的语音风格,实现语音风格转换的多样性。
在一个应用场景中,本申请中的语音处理神经网络内置于应用内。例如,该APP可以为即时通讯类的APP,或者,人声美化APP,人声美化APP可以应用于录音或录像的场景中。例如,用户可以通过该APP将待发送的语音进行美化。在录音录像场景中,用户利用终端录制完语音或带有语音的视频后,可以通过本发明组成的APP将语音进行美化。
请参阅图6A和6B所示,用户通过该APP进行人声美化的方法可以包括:
1)、用户点击屏幕上的“开始录制”按钮后,即可录制自己希望被处理的语音(第一语音)。或者,用户也可以点击“选择文件”,然后从本地文件中选取语音作为第一语音。并且可以将已读取到第一语音通过音频波形显示出来。
2)、用户在界面中选择目标模板,即选择第三语音。当用户选择“仿演员F”、“男播音员”或“女播音员”时,则默认使用终端内已经保存的语音。当用户选择“任意人语音”时,则需要通过用户点击“开始录制”或“选择文件”按钮,将任意人的语音输入到APP中。
若用户点击“开始录制”按钮,则终端设备开始录制任意人的语音,即接收第二说话人输入的第三语音。若用户点击“选择文件”,则终端设备从本地文件中选择已经保存的任意人的语音,将该语音作为第三语音。
3)、用户在“风格相似度调节”下选择输入风格特征权重(也称为“子特征权重”)。例如,将“韵律”设为20%表示最终输出的语音跟目标模板的韵律有20%的相似度,app实现将一个值为0.2的参数传给了内置的神经网络模型。同理,将“语速”设为30%表示最终输出的语音跟目标模板的语速有30%的相似度,将“重音”设为40%表示最终输出的语音跟目标模板的重音有40%的相似度。需要说明的是,可调节的风格特征包括但不限于音高、音强、音长、音色等。其中,音色包括但不限于韵律、语速和重音。
4)、用户点击“开始美化”按钮后,步骤1、2和3的输入将同时输进APP内置的已训练好的语音处理神经网络,经过语音处理神经网络处理之后输出一段处理完的语音(第二语音),该语音内容与第一语音的内容相同,该语音的风格与目标模板相似,相似程度由用于输入的风格特征权重决定。
5)、用户在APP界面可以看到处理完的语音波形,同时会有“播放”、“保存文件”和“重新录制”三个按钮。当用户点击“播放”时,该处理完的语音(第二语音)将通过手机的扬声器播放出来。当用户点击“保存文件”时,终端将该处理后的语音(第二语音)保存到本地。当用户点击“重新录制”时,终端开始进行重新处理,即回到步骤4。
以上对风格转换网络及应用场景进行了说明。下面对该风格转换网络的训练和更新过程进行说明。
风格转换网络包括三个部分的神经网络,语音特征提取部分(即语音特征提取模型),风格分离部分(即包括风格分离模型、多头注意力结构和多个子特征提取模型),语音融合模型。
请参阅图7所示,图7中黑色实线表示输入数据的流向,黑色虚线表示参数更新的方向。风格转换网络中,风格分离模型和语音特征提取模型不参与网络更新:
风格分离模型是预先使用大量不同说话人的语音数据训练好的。即风格分离模型的训练数据集包括大量不同说话人的语料(不同的人的语音具有不同的音色特征),通过该训练集对该风格分离模型进行训练,得到风格分离模型。该风格分离模型不参与整个网络的参数更新。
语音特征提取模型是预先通过大量的语料训练好的,例如,训练数据集中包括的样本数据的输入为语音,标签为该输入语音的PPG特征。语音特征提取模型用于提取PPG特征,该语音特征提取模型不参与整个网络的参数更新。
整个网络中除了风格分离模型和语音特征提取模型之外,其他的模型参与整个网络的训练和更新。
请参阅图7中黑色虚线所经过的模型。其中,多头注意结构、各子特征提取模型、Seq2seq神经网络(包括编码层、解码层及注意力结构)中,每个结构都可以是一个神经网络,每个神经网络可以包括多个卷积层、全连接层、LSTM层等结构层,而各层的权重参数需要通过大量的数据训练得到。
本申请中对于风格转换网络的更新与传统方法不同之处在于:由于本申请中风格分离模型和语音特征提取模型已经是预先训练好的,不参与网络更新,即不需要考虑风格分离模型的训练,由此,输入与标签不需要是同样内容的两条语音(两个不同的人说的两条语音),极大的减少了样本数据的数量。标签通过短时傅里叶变换(short-time fourier transform,STFT)算法得到第一梅尔特征,将语音输入到风格转换网络后,通过本网络的输出为第二梅尔特征,通过比对第一梅尔特征和第二梅尔特征得到loss值和梯度,本网络的输出与输入通过STFT算法得到的梅尔特征进行比对得到loss值及其梯度,图7中黑色虚线为梯度流动方向,当梯度流过相应结构时,结构中的参数才会进行更新,而风格分离模型和语音特征提取模型是预先训练得到,不需要参与本网络的参数更新。Loss值作为判断网络何时停止训练的指标,当loss值下降到一定值,且无明显继续下降趋势时,说明网络已收敛,即可停止训练。
整个风格转换网络获得语音美化的能力并没有通过使用风格标签(即输入与标签为同一条语音序列),这种网络学习方式属于无监督学习方式,通过无监督学习方式训练的风格转换网络,可以支持用户手动调节对所录语音美化的程度并可组合不同说话人的说话风格。
请参阅图8所示,本申请实施例提供了一种语音转换的另一个实施例,本实施例中针对方言转换模式和语音增强模式进行说明。
方言转换网络包括多个方言转换模型,其中,方言转换模型用于对输入的第一语音进行方言转换。语音增强模型用于将远场语音转换成近场语音,实现语音增强。
多个方言转换模型至少包括两类,第一类是去口音,即将方言转换成普通话。例如,四川话转换成普通话。第二类是加口音,即将普通话转换成方言,例如,普通话转换成四川话等。每个方言转换模型分别针对不同的待转换方言。方言转换可以增强不同地域用户的交流便捷性,实现语音转换的多样化。需要说明的是,这里仅是举了两类进行示例性说明,当然在可选的方案中,也可以将两个方言转换模型进行联合使用,从而进行两种方言之间的转换,例如,将四川话转换成广东话,可以先将四川话转换成普通话,再将普通话转换成广东话。
步骤801、终端接收用户输入的模式选择操作。
例如,终端的用户界面上显示对语音的不同的处理模式。例如,该处理模式可以包括“方言切换模式”和“语音增强模式”。该模型选择操作可以为点击操作,当用户点击“方言切换模式”时,则用户选择方言切换处理模式。
步骤802、终端接收用户输入的选择操作。
可选地,当模式选择操作用于选择方言切换模式时,模式选择操作(第一个层级的操作)还可以包括下一层级(第二个层级的操作)的多个选择操作。例如,第一选择操作和第二选择操作。该第一选择操作用于选择“去口音”,而第二选择操作用于选择“加口音”。每一个第二层级的操作又会包括多个第三层级的操作。在一个应用场景中,当用户选择“口音方言切换”时,会出现二级分类“去口音”,“加口音”选项。当用户选择“加口音”时,终端的用户界面又会显示不同地方口音选项的三级分类标签。例如,“四川口音”、“广东口音”等。终端会接收用户输入的第三个层级的选择操作,如第一操作,第二操作等。其中,第一操作用于选择将四川话转换成普通话,第二操作用于选择将普通话转换四川话等。终端可以根据用户输入的具体操作选择对应的方言转换模型。
步骤803、终端将第一语音的特征信息输入到选择操作对应的方言转换模型,通过选择操作对应的方言转换模型输出第二语音。
请参阅图9所示,当模式选择操作用于选择方言切换模式时,终端根据模式选择操作,将PPG特征输入到方言切换模型(seq2seq神经网络),通过方言切换模型输出第二语音的梅尔特征形式,然后,通过神经网络声码器将梅尔特征转换成可播放的语音形式。第一语音为第一方言的语音,第二语音为第二方言的语音。
可选地,终端将第三语音输入到风格分离模型,通过风格分离模型分离第一语音的风格特征,得到第一语音的音色特征向量。然后,将第一语音的PPG特征和第一语音的音色特征向量输入到选择操作对应的方言转换模型,通过选择操作对应的方言转换模型输出第二语音。该第二语音内容与输入语音(第一语音)相同,且保留了输入语音(第一语音)的说话风格。
需要说明的是,本申请实施例中,方言为“地方语言”,例如,四川话,广东话等等。 而普通话也为方言的一种,汉语中,普通话是以北京语音为标准音,可以理解为以北方话为基础的方言,即在本申请中,普通话为也方言的一种。
示例性的,第一方言为四川话,第二方言为普通话。或者,第一方言为普通话,第二方言为四川话,或者,第二方言为东北话等等,具体的,并不限定第一方言和第二方言。
请参阅图10所示,方言转换网络包括N个方言转换模型,每个方言转换模型分别针对不同的待转换方言,每一种方言转换都具有对应的模型。例如,多个方言转换模型中的第一方言转换模型用于四川话转换为普通话,而第二方言转换模型用于将普通话转换为广东话,而第一方言转换模型和第二方言转换模型的整合模型可以将四川话转换为广东话等等。本示例中,对于第一方言和第二方言进行举例说明,并不造成对本申请的限定。
可选地,当模式选择操作对应的模式为语音增强模式时,终端将第一语音的PPG特征输入到模式对应的语音增强模型,通过语音增强模型输出第二语音,第一语音为远场语音,第二语音为近场语音。将远场语音转换成近场语音,从而实现语音增强,增加了应用场景,实现语音转换的多样化。语音增强模型是通过训练数据集中的样本数据进行学习得到的,样本数据包括输入和标签,输入为远场语音,标签为近场语音。
可选地,将所述第一语音输入到风格分离模型,通过所述风格分离模型分离所述第一语音的风格特征。然后,将第一语音的风格特征和第一语音的特征信息输入到所述语音增强模型,通过语音增强模型输出所述第二语音,所述第二语音和所述第一语音的风格相同。本示例中,转换之后的语音与输入语音(第一语音)相同,且保留了输入语音(第一语音)的说话风格。
示例性的,在一个应用场景中,请参阅图11A和11B所示。
a、用户点击“开始录制”按钮后,终端录制自己希望被处理的语音(第一语音),用户也可以点击“选择文件”按钮,终端从本地文件中选取一段语音作为第一语音。可选的,终端可以显示已读取到音频的波形。
b、用户在界面中选择语音处理模式。界面显示一级分类标签。例如,该一级分类标签包括“方言切换”和“语音增强”。当用户选择“方言切换”时,界面显示二级分类标签,如二级分类标签包括:“去口音”,“加口音”。当用户选择“加口音”时,界面会显示不同地方口音选项的三级分类标签,例如:“广东口音”、“四川口音”、“福建口音”等。或者,用户也可以选择“语音增强”模式。
c、终端会根据用户选择的模式选取对应的模型。
d、用户点击“开始美化”按钮后,将第一语音输入已选择好的模型中(例如,普通话转广东口音的方言转换模型),经过方言转换模型处理一段时间后,输出一段处理完的第二语音,该第二语音内容与输入语音(第一语音)相同,且保留了输入语音(第一语音)的说话风格。或者,将第一语音输入已选择好的语音增强模型中,经过语音增强模型处理一段时间后,输出一段处理完的第二语音,该第二语音内容与输入语音(第一语音)相同,且保留了输入语音(第一语音)的说话风格。第一语音为远场语音,第二语音为近场语音。
e、最后用户在显示界面可以看到处理完的语音波形。显示界面还会显示“播放”、“保 存文件”和“重新录制”三个按钮。当用户点击“播放”按钮时,处理完的语音将通过手机的扬声器播放出来。当用户点击“保存文件”按钮时,终端将处理后的语音保存到本地。当用户点击“重新录制”按钮时,终端重新对该第一语音进行处理,根据此时界面上用户的选择操作选取对应的模型,同时回到步骤d。
下面对方言转换模型训练和更新过程进行说明。
请参阅图12所示,每个方言转换模型的结构相同,每个方言转换模型都为seq2seq神经网络,该seq2seq神经网络包括编码层、解码层及注意力结构。不同的方言转换模型的对应的训练数据集不同,各层的参数不同。例如,第一方言转换模型为四川话转普通话,该第一方言转换模型对应的训练数据集中的每个样本数据包括输入和标签。其中,输入为四川话,标签为与该四川话相同内容的普通话。第二方言转换模型为普通话转四川话,该第一方言转换模型对应的训练数据集中的每个样本数据包括输入和标签,输入为普通话,标签为与该普通话相同内容的四川话等等。不同的方言转换模型对不同的训练数据集进行学习。图12中以一个方言转换模型中各模块的参数更新进行示例。
本训练方案与风格转换网络的训练方式基本相同,风格分离模型和语音特征提取模型已经是预先训练好的,不参与网络更新,即不需要考虑风格分离模型和语音特征提取模型的训练,由此,标签通过STFT算法得到第三梅尔特征,通过本网络的输出为第四梅尔特征,通过比对第三梅尔特征和第四梅尔特征得到loss值和梯度,本网络的输出的第四梅尔特征与输入通过STFT算法得到的第三梅尔特征进行比对得到loss值及其梯度,图12中黑色虚线为梯度流动方向,当梯度流过相应结构(如编码层、注意结构和解码层)时,模块中的参数才会进行更新,而风格分离模型和语音特征提取模型是预先训练得到,不需要参与本网络的参数更新。loss值作为判断网络何时停止训练的指标,当loss值下降到一定值,且无明显继续下降趋势时,说明网络已收敛,即可停止训练。
本申请实施例中,利用无监督学习的方式提取语音中的音色、韵律和语速等风格特征,实现对语音的风格的可控美化。对于用户希望处理后的语音的风格跟处理前是相同的人声美化场景,比如,方言转换和语音增强的应用场景,本申请能保持处理后的语音的风格不变,同时实现方言的转换或语音增强。本申请利用人工智能技术为实现人声美化提供了一种更加便捷、丰富和覆盖场景更多的方法。实现任意人的去口音、加口音或语音增强的效果,同时可以保持输入和输出的语音风格不发生改变。
本申请实施例中,对于各模型的训练及更新的执行主体可以为服务器,服务器将该语音处理神经网络训练好后,由终端下载到本地端。实际应用中,该语音处理神经网络内置于APP中,如该APP为即时通讯类的APP等。
上面对一种语音转换方法进行了说明,请参阅图13所示,下面对该方法应用的语音转换装置1300进行说明,语音转换装置包括:输入模块1320,处理模块1310,获取模块1330,输出模块1340。
输入模块1320,用于接收用户输入的模式选择操作,所述模式选择操作用于选择语音转换的模式;
处理模块1310,用于根据所述输入模块1320接收的所述模式选择操作从多个模式中选择目标转换模式,所述多个模式包括风格转换模式,方言转换模式和语音增强模式;
获取模块1330,用于获取待转换的第一语音;
处理模块1310,还用于提取所述获取模块1330获取到的所述第一语音的特征信息;将所述第一语音的特征信息输入所述目标转换模式对应的目标语音转换网络,通过所述目标语音转换网络输出转换后的第二语音;
输出模块1340,用于输出第二语音。
在一个可选的实现方式中,所述处理模块1310,还用于将所述第一信息输入到语音特征提取模型,通过所述语音特征提取模型提取所述第一语音的音素后验概率PPG特征;所述PPG特征用于保留所述第一语音的内容信息。
在一个可选的实现方式中,当所述目标转换模式为风格转换模式,所述目标语音转换网络为风格转换网络时,风格转换网络包括风格分离模型和语音融合模型;
所述获取模块1330,用于获取用于提取风格特征的第三语音;
所述处理模块1310,还用于将所述第三语音输入到风格分离模型,通过所述风格分离模型分离所述第三语音的风格特征;
所述处理模块1310,还用于将所述风格特征和第一语音的特征信息输入到语音融合模型进行融合,得到所述第二语音。
在一个可选的实现方式中,所述风格特征包括第一特征,所述第一特征包括多个子特征;
所述处理模块1310,还用于将所述第三语音输入到风格分离模型,通过所述风格分离模型提取所述第三语音中的第一特征的向量;将所述第三语音输入到子特征提取模型,通过所述子特征提取模型提取所述子特征的向量;
所述输入模块1320,还用于接收用户输入的所述多个子特征中每个所述子特征的权重;
所述处理模块1310,还用于根据所述第一特征的向量,所述输入模块1320接收的每个所述子特征的向量及每个所述子特征的权重确定所述第三语音的风格特征。
在一个可选的实现方式中,所述处理模块1310,还用于将所述第一特征的向量输入多头注意力结构,且将每个子特征的向量及与其对应权重的乘积输入到所述多头注意力结构,通过所述多头注意力结构输出所述第三语音的风格特征。
在一个可选的实现方式中,所述获取模块1330,还用于接收用户输入的模板选择操作,所述模板选择操作用于选择目标模板;获取所述目标模板对应的语音,将所述目标模板对应的语音作为所述第三语音。
在一个可选的实现方式中,所述获取模块1330,还用于接收第二说话人输入的第三语音,所述第一语音为第一说话人的语音,所述第二说话人为与所述第一说话人不同的任意人。
在一个可选的实现方式中,所述目标转换模式为方言转换模式,所述目标语音转换网络为方言转换网络;所述处理模块1310,还用于将所述第一语音的特征信息输入到方言转换网络,通过所述方言转换网络输出所述第二语音,所述第一语音为第一方言的语音,所 述第二语音为第二方言的语音。
在一个可选的实现方式中,所述方言转换网络包括多个方言转换模型,每个方言转换模型分别针对不同的待转换方言;
所述输入模块1320,还用于接收用户输入的选择操作;
所述处理模块1310,还用于将第一语音的特征信息输入到所述选择操作对应的方言转换模型,通过所述选择操作对应的方言转换模型输出所述第二语音。
在一个可选的实现方式中,所述处理模块1310,还用于将所述第一语音输入到风格分离模型,通过所述风格分离模型分离所述第一语音的风格特征;
所述处理模块1310,还用于将所述第一语音的风格特征和所述第一语音的特征信息输入到方言转换网络,通过所述方言转换网络输出所述第二语音,所述第二语音和所述第一语音的风格相同。
在一个可选的实现方式中,所述第一语音为远场语音,当所述目标转换模式为语音增强模式,所述目标语音转换网络为语音增强模型时,
所述处理模块1310,还用于将所述第一语音的特征信息输入到所述模式对应的语音增强模型,通过所述语音增强模型输出第二语音,所述第二语音为近场语音。
在一个可选的实现方式中,所述处理模块1310,还用于将所述第一语音输入到风格分离模型,通过所述风格分离模型分离所述第一语音的风格特征;
所述处理模块1310,还用于将所述第一语音的风格特征和所述第一语音的特征信息输入到所述语音增强模型,通过所述语音增强模型输出所述第二语音,所述第二语音和所述第一语音的风格相同。
在一个可选的实现方式中,所述获取模块1330,还用于接收第一说话人输入的所述第一语音;或者,从本地存储文件中选择所述第一语音。
在一种可能的设计中,处理模块1310可以是一个处理装置,处理装置的功能可以部分或全部通过软件实现。
可选地,处理装置的功能可以部分或全部通过软件实现。此时,处理装置可以包括存储器和处理器,其中,存储器用于存储计算机程序,处理器读取并执行存储器中存储的计算机程序,以执行任意一个方法实施例中的相应处理和/或步骤。
可选地,处理装置可以仅包括处理器。用于存储计算机程序的存储器位于处理装置之外,处理器通过电路/电线与存储器连接,以读取并执行存储器中存储的计算机程序。
可选地,所述处理装置可以是一个或多个芯片,或一个或多个集成电路。
示例性的,本申请实施例提供了一种芯片结构,请参阅图14所示,该芯片包括:
所述芯片可以表现为神经网络处理器140(neural-network processing unit,NPU),NPU作为协处理器挂载到主CPU(host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1403,通过控制器1404控制运算电路1403提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路1403内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路1403是二维脉动阵列。运算电路1403还可以是一维脉动阵列或者能够 执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器1402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器1401中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器1408accumulator中。
统一存储器1406用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器1405(direct memory access controller,DMAC)被搬运到权重存储器1402中。输入数据也通过DMAC被搬运到统一存储器1406中。
BIU为Bus Interface Unit即,总线接口单元1410,用于AXI总线与DMAC和取指存储器1409Instruction Fetch Buffer的交互。
总线接口单元1410(Bus Interface Unit,简称BIU),用于取指存储器1409从外部存储器获取指令,还用于存储单元访问控制器1405从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1406或将权重数据搬运到权重存储器1402中或将输入数据数据搬运到输入存储器1401中。
向量计算单元1407包括多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现种,向量计算单元1407能将经处理的输出的向量存储到统一缓存器1406。例如,向量计算单元1407可以将非线性函数应用到运算电路1403的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1407生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1403的激活输入,例如用于在神经网络中的后续层中的使用。
控制器1404连接的取指存储器(instruction fetch buffer)1409,用于存储控制器1404使用的指令;
统一存储器1406,输入存储器1401,权重存储器1402以及取指存储器1409均为On-Chip存储器。外部存储器私有于该NPU硬件架构。
其中,本申请中,语音特征提取模型、风格分离模型、多头注意结构、各子特征提取模型、Seq2seq神经网络中的编码层、解码层和注意力结构、方言转换模型、语音增强模型中每个模块中各层的运算可以由运算电路1403或向量计算单元1407执行。
运算电路1403或向量计算单元1407计算得到参数值(如第一参数值),主CPU用于读取所述至少一个存储器所存储的计算机程序,使得终端执行上述方法实施例中终端所执行的方法。
请参阅图15所示,本发明实施例还提供了另一种语音转换装置,如图15所示,仅示 出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。语音转换装置可以为终端,该终端可以为手机、平板电脑,笔记本电脑,智能手表等,终端以为手机为例:
图15示出的是与本发明实施例提供的终端相关的手机的部分结构的框图。参考图15,手机包括:射频(radio frequency,RF)电路1510、存储器1520、输入单元1530、显示单元1540、音频电路1560、处理器1580、以及电源1590等部件。本领域技术人员可以理解,图15中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图15对手机的各个构成部件进行具体的介绍:
RF电路1510可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器1580处理。
存储器1520可用于存储软件程序以及模块,处理器1580通过运行存储在存储器1520的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器1520可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作***、至少一个功能所需的应用程序(比如声音播放功能等)等。此外,存储器1520可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元1530可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元1530可包括触控面板1531以及其他输入设备1532。触控面板1531,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1531上或在触控面板1531附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板1531可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器1580,并能接收处理器1580发来的命令并加以执行。除了触控面板1531,输入单元1530还可以包括其他输入设备1532。具体地,其他输入设备1532可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
本申请中,输入单元1530用于接收用户输入的各种操作,例如,模式选择操作等。图13中输入模块1320的功能可以由该输入单元1530来执行,或者,图13中获取模块1330的功能可以由该输入单元1530来执行。
显示单元1540可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元1540可包括显示面板1541,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板1541。进一步的,触控面板1531可覆盖显示面板1541,当触控面板1531检测到在其上或附近的触摸操作后,传送给处理器1580以确定触摸事件的类型,随后处理器1580根据触摸事件的类型在显示面板1541上提供相应的视觉输出。虽然在图15中,触控面板 1531与显示面板1541是作为两个独立的部件来实现手机的输入和输入功能,但是在某些实施例中,可以将触控面板1531与显示面板1541集成而实现手机的输入和输出功能。
本申请中,显示单元1540用于显示方法实施例中对应的图6A、图6B、图11A和图11B中示出的APP界面。
音频电路1560、扬声器1561,传声器1562可提供用户与手机之间的音频接口。音频电路1560可将接收到的音频数据转换后的电信号,传输到扬声器1561,由扬声器1561转换为声音信号输出;另一方面,传声器1562将收集的声音信号转换为电信号,由音频电路1560接收后转换为音频数据,再将音频数据输出处理器1580处理后。
本申请中,音频电路1560通过传声器1562接收第一说话人的第一语音,或者,接收第二说话人的第三语音。扬声器1561用于输出处理之后的第二语音,例如,第二语音为风格转换后的语音,或者,第二语音为方言转换的语音,或者,第二语音为语音增强后的语音。扬声器1561用于输出第二语音。
在一个可能的设计中,图13中输出模块1340的功能可以由扬声器1561来执行。
处理器1580是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器1520内的软件程序和/或模块,以及调用存储在存储器1520内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器1580可包括一个或多个处理单元。
手机还包括给各个部件供电的电源1590(比如电池),优选的,电源可以通过电源管理***与处理器1580逻辑相连,从而通过电源管理***实现管理充电、放电、以及功耗管理等功能。
尽管未示出,手机还可以包括摄像头、蓝牙模块等,在此不再赘述。
本申请中,当所述存储器存储的程序指令被所述处理器执行时实现上述各方法实施例中终端执行的方法,具体的参阅上述各方法实施例中的描述,此处不赘述。
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有程序,当其在计算机上运行时,使得计算机执行如前述方法实施例描述的方法中终端设备所执行的步骤。
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上行驶时,使得计算机执行如前述各方法实施例描述的方法中终端所执行的步骤。
本申请实施例中还提供一种电路***,所述电路***包括处理电路,所述处理电路配置为执行如前述方法实施例描述的方法中终端设备所执行的步骤。
在另一种可能的设计中,当该装置为终端内的芯片时,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使该终端内的芯片执行上述第一方面任意一项的无线通信方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述终端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。
其中,上述任一处提到的处理器,可以是一个通用中央处理器(CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制上述第一方面无线通信方法的程序执行的集成电路。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的***,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的***,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (23)

  1. 一种语音转换方法,其特征在于,包括:
    接收用户输入的模式选择操作,所述模式选择操作用于选择语音转换的模式;
    根据模式选择操作从多个模式中选择目标转换模式,所述多个模式包括风格转换模式,方言转换模式和语音增强模式;
    获取待转换的第一语音;
    提取所述第一语音的特征信息;
    将所述第一语音的特征信息输入到所述目标转换模式对应的目标语音转换网络,通过所述目标语音转换网络输出转换后的第二语音;
    输出所述第二语音。
  2. 根据权利要求1所述的方法,其特征在于,所述提取所述第一语音的特征信息,包括:
    将所述第一信息输入到语音特征提取模型,通过所述语音特征提取模型提取所述第一语音的音素后验概率PPG特征;所述PPG特征用于保留所述第一语音的内容信息。
  3. The method according to claim 1 or 2, wherein when the target conversion mode is the style conversion mode and the target voice conversion network is a style conversion network, the style conversion network comprises a style separation model and a voice fusion model, and the method further comprises:
    obtaining a third voice used for extracting a style feature; and
    inputting the third voice into the style separation model, and separating the style feature of the third voice through the style separation model;
    the inputting the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and outputting a converted second voice through the target voice conversion network comprises:
    inputting the style feature and the feature information of the first voice into the voice fusion model for fusion, to obtain the second voice.
  4. The method according to claim 3, wherein the style feature comprises a first feature, and the first feature comprises a plurality of sub-features;
    the inputting the third voice into the style separation model, and separating the style feature of the third voice through the style separation model comprises:
    inputting the third voice into the style separation model, and extracting a vector of the first feature in the third voice through the style separation model;
    inputting the third voice into a sub-feature extraction model, and extracting a vector of each sub-feature through the sub-feature extraction model;
    receiving a weight, input by the user, of each of the plurality of sub-features; and
    determining the style feature of the third voice according to the vector of the first feature, the vector of each sub-feature, and the weight of each sub-feature.
  5. The method according to claim 4, wherein the determining the style feature of the third voice according to the vector of the first feature, the vector of each sub-feature, and the weight of each sub-feature comprises:
    inputting the vector of the first feature into a multi-head attention structure, inputting the product of the vector of each sub-feature and its corresponding weight into the multi-head attention structure, and outputting the style feature of the third voice through the multi-head attention structure.
  6. The method according to claim 3, wherein the obtaining a third voice used for extracting a style feature comprises:
    receiving a template selection operation input by the user, wherein the template selection operation is used to select a target template; and
    obtaining a voice corresponding to the target template, and using the voice corresponding to the target template as the third voice.
  7. The method according to claim 3, wherein the obtaining a third voice used for extracting a style feature comprises:
    receiving the third voice input by a second speaker, wherein the first voice is a voice of a first speaker, and the second speaker is any person different from the first speaker.
  8. The method according to claim 1 or 2, wherein when the target conversion mode is the dialect conversion mode and the target voice conversion network is a dialect conversion network, the inputting the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and outputting a converted second voice through the target voice conversion network comprises:
    inputting the feature information of the first voice into the dialect conversion network, and outputting the second voice through the dialect conversion network, wherein the first voice is a voice in a first dialect and the second voice is a voice in a second dialect.
  9. The method according to claim 8, wherein the dialect conversion network comprises a plurality of dialect conversion models, each dialect conversion model corresponds to a different dialect to be converted, and the method further comprises:
    receiving a selection operation input by the user; and
    inputting the feature information of the first voice into the dialect conversion model corresponding to the selection operation, and outputting the second voice through the dialect conversion model corresponding to the selection operation.
  10. The method according to claim 8, wherein the method further comprises:
    inputting the first voice into a style separation model, and separating a style feature of the first voice through the style separation model;
    the inputting the feature information of the first voice into the dialect conversion network, and outputting the second voice through the dialect conversion network comprises:
    inputting the style feature of the first voice and the feature information of the first voice into the dialect conversion network, and outputting the second voice through the dialect conversion network, wherein the second voice and the first voice have the same style.
  11. The method according to claim 1 or 2, wherein the first voice is a far-field voice, and when the target conversion mode is the speech enhancement mode and the target voice conversion network is a speech enhancement model, the inputting the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and outputting a converted second voice through the target voice conversion network comprises:
    inputting the feature information of the first voice into the speech enhancement model corresponding to the mode, and outputting the second voice through the speech enhancement model, wherein the second voice is a near-field voice.
  12. The method according to claim 11, wherein the method further comprises:
    inputting the first voice into a style separation model, and separating a style feature of the first voice through the style separation model;
    the inputting the feature information of the first voice into the speech enhancement model corresponding to the mode, and outputting the second voice through the speech enhancement model comprises:
    inputting the style feature of the first voice and the feature information of the first voice into the speech enhancement model, and outputting the second voice through the speech enhancement model, wherein the second voice and the first voice have the same style.
  13. The method according to any one of claims 1 to 12, wherein the obtaining a first voice to be converted comprises:
    receiving the first voice input by a first speaker;
    or,
    selecting the first voice from a locally stored file.
  14. A voice conversion apparatus, comprising:
    an input module, configured to receive a mode selection operation input by a user, wherein the mode selection operation is used to select a voice conversion mode;
    a processing module, configured to select a target conversion mode from a plurality of modes according to the mode selection operation received by the input module, wherein the plurality of modes comprise a style conversion mode, a dialect conversion mode, and a speech enhancement mode;
    an obtaining module, configured to obtain a first voice to be converted;
    wherein the processing module is further configured to extract feature information of the first voice obtained by the obtaining module, input the feature information of the first voice into a target voice conversion network corresponding to the target conversion mode, and output a converted second voice through the target voice conversion network; and
    an output module, configured to output the second voice.
  15. The apparatus according to claim 14, wherein
    the processing module is further configured to input the first voice into a voice feature extraction model, and extract a phoneme posterior probability (PPG) feature of the first voice through the voice feature extraction model, wherein the PPG feature is used to preserve content information of the first voice.
  16. The apparatus according to claim 14 or 15, wherein when the target conversion mode is the style conversion mode and the target voice conversion network is a style conversion network, the style conversion network comprises a style separation model and a voice fusion model;
    the obtaining module is configured to obtain a third voice used for extracting a style feature;
    the processing module is further configured to input the third voice into the style separation model, and separate the style feature of the third voice through the style separation model; and
    the processing module is further configured to input the style feature and the feature information of the first voice into the voice fusion model for fusion, to obtain the second voice.
  17. The apparatus according to claim 16, wherein the style feature comprises a first feature, and the first feature comprises a plurality of sub-features;
    the processing module is further configured to input the third voice into the style separation model, extract a vector of the first feature in the third voice through the style separation model, input the third voice into a sub-feature extraction model, and extract a vector of each sub-feature through the sub-feature extraction model;
    the input module is further configured to receive a weight, input by the user, of each of the plurality of sub-features; and
    the processing module is further configured to determine the style feature of the third voice according to the vector of the first feature, the vector of each sub-feature, and the weight of each sub-feature received by the input module.
  18. The apparatus according to claim 16, wherein
    the obtaining module is further configured to receive a template selection operation input by the user, wherein the template selection operation is used to select a target template, obtain a voice corresponding to the target template, and use the voice corresponding to the target template as the third voice.
  19. The apparatus according to claim 14 or 15, wherein the target conversion mode is the dialect conversion mode, and the target voice conversion network is a dialect conversion network;
    the processing module is further configured to input the feature information of the first voice into the dialect conversion network, and output the second voice through the dialect conversion network, wherein the first voice is a voice in a first dialect and the second voice is a voice in a second dialect.
  20. The apparatus according to claim 19, wherein the processing module is further configured to input the first voice into a style separation model, and separate a style feature of the first voice through the style separation model; and
    the processing module is further configured to input the style feature of the first voice and the feature information of the first voice into the dialect conversion network, and output the second voice through the dialect conversion network, wherein the second voice and the first voice have the same style.
  21. The apparatus according to claim 14 or 15, wherein the first voice is a far-field voice, and when the target conversion mode is the speech enhancement mode and the target voice conversion network is a speech enhancement model,
    the processing module is further configured to input the feature information of the first voice into the speech enhancement model corresponding to the mode, and output the second voice through the speech enhancement model, wherein the second voice is a near-field voice.
  22. A terminal device, comprising a processor, wherein the processor is coupled to a memory, the memory stores program instructions, and when the program instructions stored in the memory are executed by the processor, the method according to any one of claims 1 to 13 is implemented.
  23. A computer-readable storage medium, comprising a program, wherein when the program runs on a computer, the computer is enabled to perform the method according to any one of claims 1 to 13.
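To make the flow of claims 1 and 2 concrete, here is a minimal Python sketch of the mode-dispatch pipeline. It is an illustration under stated assumptions, not the patented implementation: extract_ppg, the three network classes, and the mode names are hypothetical placeholders standing in for trained models.

import numpy as np

def extract_ppg(first_voice: np.ndarray) -> np.ndarray:
    # Claim 2: a voice feature extraction model yields a PPG-like feature
    # that preserves the content of the first voice (stubbed here).
    return first_voice

class StyleConversionNet:
    def __call__(self, features): return features      # style conversion stub

class DialectConversionNet:
    def __call__(self, features): return features      # dialect conversion stub

class SpeechEnhancementNet:
    def __call__(self, features): return features      # speech enhancement stub

# Claim 1: the mode selection operation picks one of three conversion modes,
# and the feature information is routed to the matching conversion network.
NETWORKS = {
    "style": StyleConversionNet(),
    "dialect": DialectConversionNet(),
    "enhance": SpeechEnhancementNet(),
}

def convert_voice(first_voice: np.ndarray, target_mode: str) -> np.ndarray:
    features = extract_ppg(first_voice)      # extract feature information
    network = NETWORKS[target_mode]          # target voice conversion network
    return network(features)                 # converted second voice

second_voice = convert_voice(np.zeros(16000, dtype=np.float32), "dialect")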
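Claims 4 and 5 weight user-selected sub-features and fuse them with the first-feature vector through a multi-head attention structure. The sketch below shows one plausible wiring of that step using PyTorch's nn.MultiheadAttention; the embedding size, the number of heads, and the choice of the first feature as the attention query are assumptions rather than details taken from the patent.

import torch
import torch.nn as nn

d_model, num_heads = 256, 4  # assumed hyperparameters
attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

def fuse_style(first_feature, sub_features, weights):
    # first_feature: (d_model,) vector of the first feature from the style separation model.
    # sub_features: list of (d_model,) sub-feature vectors from the sub-feature extraction model.
    # weights: list of user-chosen weights, one per sub-feature.
    query = first_feature.view(1, 1, -1)                      # (batch, 1, d)
    scaled = [w * v for w, v in zip(weights, sub_features)]   # apply user weights (claim 5)
    kv = torch.stack(scaled).unsqueeze(0)                     # (batch, n_sub, d)
    fused, _ = attention(query, kv, kv)                       # attend over weighted sub-features
    return fused.squeeze(0).squeeze(0)                        # (d_model,) style feature

# Example with random vectors standing in for real model outputs.
style = fuse_style(torch.randn(256),
                   [torch.randn(256) for _ in range(3)],
                   [0.5, 1.0, 0.2])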
PCT/CN2021/117945 2020-09-21 2021-09-13 Voice conversion method and related device WO2022057759A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/186,285 US20230223006A1 (en) 2020-09-21 2023-03-20 Voice conversion method and related device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010996501.5 2020-09-21
CN202010996501.5A CN114299908A (zh) Voice conversion method and related device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/186,285 Continuation US20230223006A1 (en) 2020-09-21 2023-03-20 Voice conversion method and related device

Publications (1)

Publication Number Publication Date
WO2022057759A1 true WO2022057759A1 (zh) 2022-03-24

Family

ID=80776453

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/117945 WO2022057759A1 (zh) Voice conversion method and related device

Country Status (3)

Country Link
US (1) US20230223006A1 (zh)
CN (1) CN114299908A (zh)
WO (1) WO2022057759A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464162A (zh) * 2022-04-12 2022-05-10 阿里巴巴达摩院(杭州)科技有限公司 Speech synthesis method, neural network model training method, and speech synthesis model

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220293098A1 (en) * 2021-03-15 2022-09-15 Lenovo (Singapore) Pte. Ltd. Dialect correction and training
CN118173110A (zh) * 2022-12-08 2024-06-11 抖音视界有限公司 Voice processing method, apparatus and electronic device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013171196A (ja) * 2012-02-21 2013-09-02 Toshiba Corp Speech synthesis apparatus, method and program
CN105551480A (zh) * 2015-12-18 2016-05-04 百度在线网络技术(北京)有限公司 Dialect conversion method and apparatus
CN106356065A (zh) * 2016-10-31 2017-01-25 努比亚技术有限公司 Mobile terminal and voice conversion method
CN110895932A (zh) * 2018-08-24 2020-03-20 中国科学院声学研究所 Multilingual speech recognition method based on collaborative classification of language type and speech content
CN111312267A (zh) * 2020-02-20 2020-06-19 广州市百果园信息技术有限公司 Voice style conversion method, apparatus, device and storage medium
CN111627457A (zh) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Voice separation method, system and computer-readable storage medium


Also Published As

Publication number Publication date
CN114299908A (zh) 2022-04-08
US20230223006A1 (en) 2023-07-13

Similar Documents

Publication Publication Date Title
WO2022057759A1 (zh) Voice conversion method and related device
WO2020182153A1 (zh) Speech recognition method based on adaptive language, and related apparatus
CN111048062B (zh) Speech synthesis method and device
JP2022107032A (ja) Text-to-speech synthesis method using machine learning, apparatus and computer-readable storage medium
CN111009237B (zh) Speech recognition method and apparatus, electronic device and storage medium
CN107481718B (zh) Speech recognition method and apparatus, storage medium and electronic device
JP2021103328A (ja) Voice conversion method, apparatus and electronic device
CN111312245B (zh) Voice response method, apparatus and storage medium
WO2022078146A1 (zh) Speech recognition method, apparatus and device, and storage medium
EP3824462B1 (en) Electronic apparatus for processing user utterance and controlling method thereof
CN110853618A (zh) Language identification method, model training method, apparatus and device
WO2004036939A1 (fr) Portable digital mobile communication apparatus, voice control method and system
CN111508511A (zh) Real-time voice changing method and apparatus
WO2019242414A1 (zh) Voice processing method and apparatus, storage medium and electronic device
CN113421547B (zh) Voice processing method and related device
CN109036395A (zh) Personalized loudspeaker control method and system, smart loudspeaker and storage medium
CN110379411B (zh) Speech synthesis method and apparatus for a target speaker
CN112837669B (zh) Speech synthesis method, apparatus and server
CN112309365A (zh) Training method and apparatus for a speech synthesis model, storage medium and electronic device
WO2023207541A1 (zh) Voice processing method and related device
CN109285548A (zh) Information processing method and system, electronic device, and computer storage medium
CN113223542B (zh) Audio conversion method and apparatus, storage medium and electronic device
WO2020145353A1 (ja) Computer program, server device, terminal device and voice signal processing method
EP4198967A1 (en) Electronic device and control method thereof
CN115148185A (zh) Speech synthesis method and apparatus, electronic device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868583

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868583

Country of ref document: EP

Kind code of ref document: A1