CN112652294A - Speech synthesis method, apparatus, computer device and storage medium

Info

Publication number: CN112652294A (application CN202011562944.XA; granted as CN112652294B)
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 刘夏冰
Applicant/Assignee: Shenzhen Zhuiyi Technology Co Ltd
Legal status: Active (granted)


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination


Abstract

The present application relates to a speech synthesis method, apparatus, computer device and storage medium in the field of computer technology. The method comprises: acquiring a target text to be synthesized, wherein the target text consists of at least two languages; inputting the target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules in one-to-one correspondence with the at least two languages, a feature fusion module and a voice conversion module; performing feature extraction processing on the target text through the at least two feature extraction modules respectively to obtain at least two text features in one-to-one correspondence with the at least two feature extraction modules; fusing the at least two text features through the feature fusion module to obtain a fusion feature; and performing voice conversion processing on the fusion feature through the voice conversion module to obtain synthesized speech corresponding to the target text. With this method, a target text composed of at least two languages can be synthesized into corresponding synthesized speech.

Description

Speech synthesis method, apparatus, computer device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a speech synthesis method, apparatus, computer device, and storage medium.
Background
With the continuous development of artificial intelligence, speech synthesis, or text-to-speech (TTS), technology is becoming more and more mature. Speech synthesis converts text into synthesized speech, and can provide users with synthesized speech that is understandable, clear, natural and expressive.
Current speech synthesis research focuses on single-language text, that is, text containing only one language, and the technology for converting single-language text into synthesized speech is by now well developed.
However, as exchanges between countries increase with globalization, mixed-language text, that is, text that includes at least two languages (for example, a Chinese sentence with an embedded English word such as "care" or "presentation"), has become increasingly common in daily life. By contrast, research on speech synthesis for mixed-language text remains scarce, and the technology is relatively immature.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a speech synthesis method, apparatus, computer device and storage medium capable of synthesizing speech from text including at least two languages.
In a first aspect, a speech synthesis method is provided, which includes:
acquiring a target text to be synthesized, wherein the target text consists of at least two languages; inputting the target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules in one-to-one correspondence with the at least two languages, a feature fusion module and a voice conversion module; performing feature extraction processing on the target text through the at least two feature extraction modules respectively to obtain at least two text features in one-to-one correspondence with the at least two feature extraction modules; fusing the at least two text features through the feature fusion module to obtain a fusion feature; and performing voice conversion processing on the fusion feature through the voice conversion module to obtain synthesized speech corresponding to the target text.
In one embodiment, inputting the target text into the text synthesis model comprises: converting the target text into target phoneme notation symbols; inputting the target phoneme notation symbols into an acoustic feature recognition model to obtain acoustic features corresponding to the target phoneme notation symbols; and inputting the acoustic features into the text synthesis model.
In one embodiment, fusing the at least two text features through the feature fusion module includes: for each text feature, determining at least two language features in the text feature, wherein the at least two language features are in one-to-one correspondence with the at least two languages, and determining the weight corresponding to each language feature according to the target language corresponding to the feature extraction module that extracted the text feature; and fusing the at least two text features according to the language features in each text feature and the weights corresponding to those language features.
In one embodiment, determining at least two language features among the text features includes: determining at least two language texts in the target text, wherein the at least two language texts are in one-to-one correspondence with the at least two languages; and determining the at least two language features among the text features according to the positions of the at least two language texts in the target text.
In one embodiment, determining the weight corresponding to each language feature according to the target language corresponding to the feature extraction module that extracted the text feature includes: determining a first language feature and a second language feature among the at least two language features according to that target language, wherein the first language feature corresponds to the target language and the second language feature does not; and taking a first weight as the weight corresponding to the first language feature and a second weight as the weight corresponding to the second language feature, wherein the first weight is greater than the second weight.
In one embodiment, the language features are matrices, and fusing the at least two text features according to the language features in each text feature and their corresponding weights includes: multiplying each language feature by its corresponding weight to obtain a plurality of corrected language features; for each of the at least two languages, adding the corrected language features corresponding to that language to obtain a candidate language feature corresponding to that language; and splicing the candidate language features together.
In one embodiment, the training process of the text synthesis model comprises: acquiring at least two training sets in one-to-one correspondence with the at least two languages, wherein each training set comprises a plurality of training samples and each training sample comprises a training text and the real synthesized speech corresponding to that text; training at least two monolingual text synthesis models in one-to-one correspondence with the at least two languages using the at least two training sets respectively; obtaining the at least two feature extraction modules based on the at least two monolingual text synthesis models; and composing the text synthesis model from the obtained at least two feature extraction modules, a preset fusion module and a preset voice conversion module.
In a second aspect, there is provided a speech synthesis apparatus, the apparatus comprising:
a first acquisition module, configured to acquire a target text to be synthesized, wherein the target text consists of at least two languages;
an input module, configured to input the target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules in one-to-one correspondence with the at least two languages, a feature fusion module and a voice conversion module;
a second acquisition module, configured to perform feature extraction processing on the target text through the at least two feature extraction modules respectively, to obtain at least two text features in one-to-one correspondence with the at least two feature extraction modules;
a third acquisition module, configured to fuse the at least two text features through the feature fusion module to obtain a fusion feature;
and a fourth acquisition module, configured to perform voice conversion processing on the fusion feature through the voice conversion module to obtain synthesized speech corresponding to the target text.
In one embodiment, the input module is specifically configured to: convert the target text into target phoneme notation symbols; input the target phoneme notation symbols into an acoustic feature recognition model to obtain acoustic features corresponding to the target phoneme notation symbols; and input the acoustic features into the text synthesis model.
In one embodiment, the third obtaining module includes:
a determining unit, configured to determine, for each text feature, at least two language features in the text feature, wherein the at least two language features are in one-to-one correspondence with the at least two languages, and to determine the weight corresponding to each language feature according to the target language corresponding to the feature extraction module that extracted the text feature;
and a fusion unit, configured to fuse the at least two text features according to the language features in each text feature and the weights corresponding to those language features.
In one embodiment, the determining unit is specifically configured to: determine at least two language texts in the target text, wherein the at least two language texts are in one-to-one correspondence with the at least two languages; and determine the at least two language features among the text features according to the positions of the at least two language texts in the target text.
In one embodiment, the fusion unit is specifically configured to: determine a first language feature and a second language feature among the at least two language features according to the target language corresponding to the feature extraction module that extracted the text feature, wherein the first language feature corresponds to the target language and the second language feature does not; and take a first weight as the weight corresponding to the first language feature and a second weight as the weight corresponding to the second language feature, wherein the first weight is greater than the second weight.
In one embodiment, the language features are matrices, and the fusion unit is specifically configured to: multiply each language feature by its corresponding weight to obtain a plurality of corrected language features; for each of the at least two languages, add the corrected language features corresponding to that language to obtain a candidate language feature corresponding to that language; and splice the candidate language features together.
In one embodiment, the speech synthesis apparatus further includes:
a fifth acquisition module, configured to acquire at least two training sets in one-to-one correspondence with the at least two languages, wherein each training set comprises a plurality of training samples and each training sample comprises a training text and the real synthesized speech corresponding to that text;
a training module, configured to train at least two monolingual text synthesis models in one-to-one correspondence with the at least two languages using the at least two training sets respectively;
a sixth acquisition module, configured to obtain the at least two feature extraction modules based on the at least two monolingual text synthesis models;
and a composition module, configured to compose the obtained at least two feature extraction modules, a preset fusion module and a preset voice conversion module into the text synthesis model.
In a third aspect, there is provided a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the speech synthesis method according to any one of the first aspect when executing the computer program.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech synthesis method according to any one of the first aspect.
According to the above speech synthesis method, apparatus, computer device and storage medium, the target text to be synthesized is acquired and input into the text synthesis model. Feature extraction processing is performed on the target text through the at least two feature extraction modules in the text synthesis model, yielding at least two text features in one-to-one correspondence with those modules. The at least two text features are then fused through the feature fusion module to obtain a fusion feature, and voice conversion processing is performed on the fusion feature through the voice conversion module to obtain synthesized speech corresponding to the target text. The embodiments of the present application thus provide a method for performing speech synthesis on mixed-language text: a target text composed of at least two languages is input into the text synthesis model, and each of the feature extraction modules, in one-to-one correspondence with the at least two languages, extracts its own text feature from the target text. The accuracy of the different language features within the text features extracted by each feature extraction module differs. Taking a target text comprising Chinese text and English text as an example: in the text features extracted by the Chinese feature extraction module, the accuracy of the Chinese language features is high and that of the English language features is low; in the text features extracted by the English feature extraction module, the accuracy of the English language features is high and that of the Chinese language features is low. Fusing the Chinese-related features from the Chinese feature extraction module with the English-related features from the English feature extraction module therefore produces a fusion feature that is accurate for every language, and voice conversion processing on this fusion feature yields synthesized speech corresponding to the target text composed of at least two languages. Speech synthesis for mixed-language text is thereby realized.
Drawings
FIG. 1 is a flow diagram illustrating a method for speech synthesis in one embodiment;
FIG. 2 is a diagram illustrating a text synthesis model in a speech synthesis method according to an embodiment;
FIG. 3 is a flow diagram illustrating the speech synthesis steps in one embodiment;
FIG. 4 is a flow chart illustrating a speech synthesis method according to another embodiment;
FIG. 5 is a flow chart illustrating a speech synthesis method according to another embodiment;
FIG. 6 is a flow chart illustrating a speech synthesis method according to another embodiment;
FIG. 7 is a flowchart illustrating a speech synthesis method according to another embodiment;
FIG. 8 is a flowchart illustrating a speech synthesis method according to another embodiment;
FIG. 9 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment;
FIG. 10 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment;
FIG. 11 is a block diagram showing the structure of a speech synthesis apparatus according to an embodiment;
FIG. 12 is a diagram illustrating an internal structure of a computer device in one embodiment when the computer device is a terminal;
FIG. 13 is a diagram illustrating an internal structure of a computer device in one embodiment when the computer device is a server.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
With the continuous development of artificial intelligence, the speech synthesis technology is more and more mature. Currently, speech synthesis techniques can be applied in a variety of scenarios.
For example, in one possible reading and listening scenario, speech synthesis technology is applied in various reading apps, which can offer users read-aloud functions with multiple voice libraries, freeing the user's hands and eyes and providing a better reading experience.
In another possible information broadcasting scenario, speech synthesis technology provides a sound library created specifically for news broadcasting, so that devices such as mobile phones and smart speakers can professionally broadcast the latest news to users anytime and anywhere.
In another possible order broadcasting scenario, speech synthesis technology can be applied to ride-hailing software, restaurant number-calling software, queuing software and the like, broadcasting order notifications so that users can conveniently receive them.
In another possible scenario, speech synthesis technology may be integrated into intelligent hardware devices such as children's story machines, intelligent robots and tablet devices, so that users interact with the devices more naturally and in a more personalized way.
Current speech synthesis research focuses on single-language text, that is, text containing only one language, and the technology for converting single-language text into synthesized speech is by now well developed.
However, as exchanges between countries increase with globalization, mixed-language text, that is, text that includes at least two languages (for example, a Chinese sentence with an embedded English word such as "care" or "presentation"), has become increasingly common in daily life. By contrast, research on speech synthesis for mixed-language text remains scarce, and the technology is relatively immature.
To address the above technical problem, the present application provides a speech synthesis method that mainly comprises: acquiring a target text to be synthesized that is composed of at least two languages, and inputting it into a text synthesis model composed of at least two feature extraction modules in one-to-one correspondence with the at least two languages, a feature fusion module and a voice conversion module. Feature extraction processing is performed on the target text through the at least two feature extraction modules respectively to obtain at least two text features in one-to-one correspondence with those modules; the at least two text features are fused through the feature fusion module to obtain a fusion feature; and finally voice conversion processing is performed on the fusion feature through the voice conversion module to obtain synthesized speech corresponding to the target text. As set out above, the accuracy of the different language features within the text features extracted by each feature extraction module differs, so fusing each module's own-language features produces a fusion feature that is accurate for every language, from which synthesized speech corresponding to a target text composed of at least two languages is obtained. Speech synthesis for mixed-language text is thereby realized.
It should be noted that the execution subject of the speech synthesis method provided in the embodiments of the present application may be a speech synthesis apparatus, which may be implemented, in software, hardware or a combination of both, as part or all of a computer device. The computer device may be a server or a terminal; the server may be a single server or a server cluster composed of multiple servers, and the terminal may be a smart phone, a personal computer, a tablet computer, a wearable device, or another intelligent hardware device such as a children's story machine or an intelligent robot. The following method embodiments take a computer device as the execution subject.
In an embodiment of the present application, as shown in fig. 1, a speech synthesis method is provided. Taking its application to a computer device as an example, the method includes the following steps:
Step 101, the computer device obtains a target text to be synthesized.
The target text may, for example, be composed of Chinese text and English text.
In this embodiment of the application, in the case that the computer device is a server, optionally, the server may receive the target text sent by a terminal, or may extract the target text from a server database.
In the case that the computer device is a terminal, optionally, the terminal may receive a target text input by a user, obtain a target text displayed on an interface, or extract the target text from terminal data. The embodiments of the present application do not specifically limit the way in which the computer device obtains the target text.
Step 102, the computer device inputs the target text into a text synthesis model.
In the embodiments of the present application, the text synthesis model is used to synthesize the input target text into the synthesized speech corresponding to the target text. Optionally, the training process of the text synthesis model may include: obtaining a plurality of training samples, each comprising a training text and the real synthesized speech corresponding to that text, and training the text synthesis model on these pairs. The text synthesis model may include at least two feature extraction modules in one-to-one correspondence with at least two languages, a feature fusion module and a voice conversion module, a structure diagram of which is shown in fig. 2.
The text synthesis model may include at least two feature extraction modules corresponding to at least two languages one to one, and the feature extraction modules are configured to perform feature extraction on an input target text to obtain at least two text features corresponding to the at least two feature extraction modules one to one. For example, at least two languages included in the target text are chinese and english, respectively, and the at least two feature extraction modules corresponding to the at least two languages one by one are a chinese feature extraction module and an english feature extraction module, respectively. The Chinese feature extraction module extracts features of the target text to obtain text features corresponding to the Chinese feature extraction module, and the English feature extraction module extracts features of the target text to obtain text features corresponding to the English feature extraction module.
The Chinese feature extraction module extracts features from Chinese text with high accuracy but extracts features from English text with low accuracy; conversely, the English feature extraction module extracts features from English text with high accuracy but from Chinese text with low accuracy. Therefore, in the text features extracted by the Chinese feature extraction module, the accuracy of the Chinese language features is high and the accuracy of the English language features is low; in the text features extracted by the English feature extraction module, the accuracy of the English language features is high and the accuracy of the Chinese language features is low.
The feature fusion module is used for performing fusion processing on the text features extracted by the at least two feature extraction modules to obtain fused features. Based on the above example, the feature fusion module performs fusion processing on the text features corresponding to the chinese feature extraction module and the text features corresponding to the english feature extraction module to obtain fused features.
As described above, the accuracy of the text features extracted by each feature extraction module differs across the languages of the texts: in the text features extracted by the Chinese feature extraction module, the accuracy of the Chinese language features is high and that of the English language features is low, and in the text features extracted by the English feature extraction module, the accuracy of the English language features is high and that of the Chinese language features is low. Therefore, optionally, the computer device may fuse the Chinese language features in the text features extracted by the Chinese feature extraction module with the English language features in the text features extracted by the English feature extraction module, so that the resulting fusion feature is accurate for each language and the finally synthesized speech is accurate, clear and natural.
The voice conversion module is used for converting the fusion characteristics fused by the characteristic fusion module to obtain the synthesized voice.
Step 103, the computer device performs feature extraction processing on the target text through the at least two feature extraction modules respectively, to obtain at least two text features in one-to-one correspondence with the at least two feature extraction modules.
In the embodiment of the present application, based on the above, the computer device inputs the target text into the at least two feature extraction modules, and the at least two feature extraction modules respectively perform feature extraction on the target text, so as to obtain at least two text features corresponding to the at least two feature extraction modules one to one.
Step 104, the computer device fuses the at least two text features through the feature fusion module to obtain a fusion feature.
In this embodiment, as described above, feature extraction is performed on the target text by the at least two feature extraction modules to obtain at least two text features in one-to-one correspondence with those modules. To make the synthesized speech output by the final text synthesis model clear and natural, the computer device fuses the obtained text features through the feature fusion module to obtain a fusion feature.
Step 105, the computer device performs voice conversion processing on the fusion feature through the voice conversion module to obtain the synthesized speech corresponding to the target text.
In the embodiment of the application, the computer device inputs the fusion feature into the voice conversion module, which converts it into the synthesized speech corresponding to the target text.
According to the above speech synthesis method, the computer device acquires the target text to be synthesized and inputs it into the text synthesis model. The computer device performs feature extraction processing on the target text through the at least two feature extraction modules in the text synthesis model, obtaining at least two text features in one-to-one correspondence with those modules. The computer device then fuses the at least two text features through the feature fusion module to obtain a fusion feature, and performs voice conversion processing on the fusion feature through the voice conversion module to obtain synthesized speech corresponding to the target text. In this method, a target text composed of at least two languages is input into the text synthesis model, and each of the feature extraction modules, in one-to-one correspondence with the at least two languages, extracts its own text feature from the target text. The accuracy of the different language features within each extracted text feature differs: taking a target text comprising Chinese and English as an example, in the text features extracted by the Chinese feature extraction module the Chinese language features are accurate and the English language features are not, while in the text features extracted by the English feature extraction module the English language features are accurate and the Chinese language features are not. Fusing the Chinese-related features from the Chinese feature extraction module with the English-related features from the English feature extraction module therefore yields a fusion feature that is accurate for every language, and voice conversion processing on this fusion feature produces synthesized speech corresponding to the target text composed of at least two languages. Speech synthesis for mixed-language text is thereby realized.
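To make the data flow of steps 101 to 105 concrete, the following is a minimal sketch of how such a model could be composed, assuming PyTorch-style modules. The class and argument names are illustrative assumptions, not taken from the patent, and the internals of the extraction, fusion and conversion modules are left abstract:

```python
# Illustrative sketch only: the patent does not specify network internals, and the
# use of PyTorch, the class name and the tensor interfaces are all assumptions.
import torch.nn as nn

class TextSynthesisModel(nn.Module):
    def __init__(self, extractors: nn.ModuleDict, fusion: nn.Module, converter: nn.Module):
        super().__init__()
        self.extractors = extractors  # one feature extraction module per language
        self.fusion = fusion          # feature fusion module
        self.converter = converter    # voice conversion module

    def forward(self, inputs, language_mask):
        # Step 103: every language's extraction module processes the whole input,
        # giving one text feature per module.
        text_features = {lang: module(inputs) for lang, module in self.extractors.items()}
        # Step 104: fuse the per-module text features into a single fusion feature.
        fused = self.fusion(text_features, language_mask)
        # Step 105: convert the fusion feature into synthesized speech.
        return self.converter(fused)
```

Here `language_mask` stands for the per-position language information that the fusion module needs in order to pick out each language's features (cf. the determination of language features in steps 501 and 502 below).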
In an alternative embodiment of the present application, as shown in fig. 3, the inputting of the target text into the text synthesis model may include the following steps:
step 301, the computer device converts the target text into target phonetic transcription symbols.
In this embodiment of the application, optionally, the computer device may display the target text to a user through a display interface; the user reads the target text on the display interface, annotates it with the target phoneme notation symbols, and inputs the annotation into the computer device. Optionally, in this embodiment of the present application, the target phoneme notation symbols may be International Phonetic Alphabet symbols.
Alternatively, the computer device may convert the target text into the target phoneme notation symbols using a pre-trained text conversion model, which converts an input target text into the target phoneme notation symbols corresponding to it. The training process of the text conversion model may include: obtaining a plurality of text conversion training samples, each comprising a training text and the phoneme notation symbols corresponding to it, and training the text conversion model on the training texts and their corresponding phoneme notation symbols.
Step 302, the computer device inputs the target phoneme notation symbol into the acoustic feature recognition model to obtain the acoustic feature corresponding to the target phoneme notation symbol.
In the embodiments of the present application, the computer device inputs the target phoneme notation symbols into an acoustic feature recognition model, which recognizes the input symbols as the acoustic features corresponding to them. Optionally, in the embodiments of the present application, the acoustic features may be Mel-frequency features.
Optionally, in this embodiment of the present application, the training process of the acoustic feature recognition model may include: obtaining a plurality of acoustic feature training samples, wherein each acoustic feature training sample comprises a phoneme notation used for training an acoustic feature model and an acoustic feature corresponding to the phoneme notation, and training an acoustic feature recognition model based on the phoneme notation in each acoustic feature training sample and the acoustic feature corresponding to the phoneme notation.
Step 303, the computer device inputs the acoustic features into the text synthesis model.
In the embodiment of the application, the computer device inputs the acoustic features output by the acoustic feature recognition model into the text synthesis model, so that the acoustic features corresponding to the target text are synthesized into the synthesized speech.
In the embodiments of the present application, the computer device converts the target text into target phoneme notation symbols, inputs these symbols into the acoustic feature recognition model to obtain the corresponding acoustic features, and then inputs the acoustic features into the text synthesis model. Because the target text is first converted into phoneme notation symbols, the texts of at least two languages within the target text reach the acoustic feature recognition model in a single shared notation, which simplifies the training and use of that model. Moreover, the acoustic features, rather than the raw target text, are input into the text synthesis model: if the target text were input directly, training of the text synthesis model would be complicated and the synthesized speech produced from it would be stiff and unclear, whereas inputting acoustic features makes the synthesized speech output by the text synthesis model more natural and clear.
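As a minimal sketch, steps 301 to 303 simply chain three models; the function and parameter names below are hypothetical stand-ins for the pre-trained text conversion model, acoustic feature recognition model and text synthesis model described above (the language mask wiring is omitted for brevity):

```python
# Sketch of steps 301-303 under the assumptions stated above.
def text_to_speech(target_text: str, text_converter, acoustic_recognizer, synthesis_model):
    # Step 301: convert the mixed-language target text into target phoneme notation
    # symbols (e.g. International Phonetic Alphabet), one shared notation for all languages.
    phonemes = text_converter(target_text)
    # Step 302: recognize the phoneme notation symbols as acoustic features
    # (e.g. Mel-frequency features).
    acoustic_features = acoustic_recognizer(phonemes)
    # Step 303: the acoustic features, not the raw text, enter the text synthesis model.
    return synthesis_model(acoustic_features)
```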
In an optional embodiment of the present application, as shown in fig. 4, the fusing at least two text features by the feature fusion module may include the following steps:
Step 401, for each text feature, the computer device determines at least two language features in the text feature, and determines the weight corresponding to each language feature according to the target language corresponding to the feature extraction module that extracted the text feature.
Wherein the at least two language features correspond to the at least two languages one to one.
In this embodiment of the application, for each text feature output by the at least two feature extraction modules in one-to-one correspondence with the at least two languages, the computer device determines the at least two language features in that text feature based on the position of each language's text in the target text.
For example, if the target text includes the two languages Chinese and English, the two feature extraction modules in one-to-one correspondence with them are a Chinese feature extraction module and an English feature extraction module, and the computer device determines the Chinese language features and the English language features in the text features output by the Chinese feature extraction module and the English feature extraction module respectively, according to the positions of the Chinese text and the English text in the target text.
Optionally, suppose the target text is "i do not care", in which "care" is an English word and the remainder is Chinese text, and the two feature extraction modules are a Chinese feature extraction module and an English feature extraction module. The computer device then determines, in the text features output by each of the two modules, the Chinese language feature corresponding to the Chinese portion and the English language feature corresponding to "care", according to the positions of the Chinese portion and of "care" in the target text.
In the embodiment of the application, after determining the at least two language features in each text feature, the computer device determines, for each text feature, the target language corresponding to the feature extraction module that extracted it, and determines the weight corresponding to each language feature according to the accuracy with which each feature extraction module extracts each language's features.
For example, based on the above example, after determining the Chinese language feature and the English language feature in each text feature, the computer device determines that the target language corresponding to the Chinese feature extraction module is Chinese and that the target language corresponding to the English feature extraction module is English. The computer device then determines the weights corresponding to the Chinese language feature and the English language feature in each text feature according to the accuracy with which the Chinese feature extraction module and the English feature extraction module extract Chinese and English language features.
Step 402, the computer device fuses the at least two text features according to the language features in each text feature and the weights corresponding to those language features.
In this embodiment of the application, after determining the language features in each text feature and their corresponding weights, the computer device may, optionally, multiply each language feature by its corresponding weight, thereby implementing the fusion of the at least two text features.
In the embodiment of the application, the computer device determines at least two language features in each obtained text feature, which makes it possible to treat the different language features within a text feature differently. The computer device determines the weight corresponding to each language feature according to the target language of the feature extraction module that extracted the text feature; since the accuracy of different language features differs across the text features extracted by different feature extraction modules, assigning weights in this way preserves the accuracy of each language feature. The computer device then fuses the at least two text features according to the language features in each text feature and their corresponding weights, which safeguards the accuracy of the resulting fusion feature and hence the accuracy, clarity and naturalness of the synthesized speech output by the text synthesis model.
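In formula form, the fusion described in steps 401 and 402 can be sketched as follows; the notation is introduced here purely for illustration, with F_{m,l} denoting the language-l feature inside the text feature extracted by module m and w_{m,l} its weight, chosen larger when l is the target language of module m:

```latex
\tilde{F}_{l} = \sum_{m=1}^{M} w_{m,l}\, F_{m,l}, \qquad
F_{\text{fused}} = \operatorname{concat}\left(\tilde{F}_{l_1}, \dots, \tilde{F}_{l_K}\right)
```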
In an alternative embodiment of the present application, as shown in fig. 5, the determining at least two language features from the text features may include the following steps:
Step 501, the computer device determines at least two language texts in the target text.
The at least two language texts are in one-to-one correspondence with the at least two languages. For example, if the at least two languages are Chinese and English, the at least two language texts are the Chinese text and the English text.
In this embodiment of the application, optionally, the target text acquired by the computer device contains identification information for the at least two languages, and the computer device determines the at least two language texts in the target text based on that identification information. For example, the target text is "i do not care", in which the identification information corresponding to "care" indicates English and the identification information corresponding to the remaining portion indicates Chinese; by reading this identification information, the computer device determines "care" to be the English text and the remaining portion to be the Chinese text.
Alternatively, the computer device may determine the at least two language texts in the target text through a text determination model, which, for an input target text, determines the language texts it contains. For example, after the Chinese portion of "i do not care" is input to the text determination model, the model determines it to be Chinese text; after "care" is input, the model determines it to be English text; and after the whole sentence is input, the model determines that the Chinese portion is Chinese text and "care" is English text.
Step 502, the computer device determines at least two language features among the text features according to the positions of the at least two language texts in the target text.
In the embodiment of the application, after determining the at least two language texts in the target text, the computer device determines at least two language features from the at least two text features output by the at least two feature extraction modules according to the positions of the at least two language texts in the target text.
In an embodiment of the application, the computer device determines at least two language texts in the target text, and determines the at least two language features in the text features according to the positions of those language texts in the target text. Determining the at least two language features in each text feature facilitates assigning weights to them and avoids weight assignment errors.
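A minimal sketch of steps 501 and 502, assuming the language identification information described above is already available as a per-character list of language tags (the function name and tag values are illustrative):

```python
# Sketch of steps 501-502; `lang_ids` stands in for the language identification
# information (or the output of the text determination model) described above.
def language_positions(lang_ids, languages=("zh", "en")):
    # Step 501: group the target-text positions by language.
    positions = {lang: [i for i, tag in enumerate(lang_ids) if tag == lang]
                 for lang in languages}
    # Step 502: the same positions index every extracted feature sequence, which
    # is how the language features are located inside each text feature.
    return positions

# Example: a Chinese sentence with the embedded English word "care",
# tagged character by character.
print(language_positions(["zh", "zh", "en", "en", "en", "en"]))
# {'zh': [0, 1], 'en': [2, 3, 4, 5]}
```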
In an optional embodiment of the present application, the language features are matrices. As shown in fig. 6, fusing the at least two text features according to the language features in each text feature and their corresponding weights may include the following steps:
Step 601, the computer device determines a first language feature and a second language feature among the at least two language features according to the target language corresponding to the feature extraction module that extracted the text feature.
In the embodiment of the present application, the first language feature corresponds to the target language, and the second language feature does not.
In the embodiment of the application, the computer device determines the target language corresponding to each feature extraction module according to the feature extraction module of the extracted text feature, and determines the first language feature and the second language feature in each text feature based on the target language of each feature extraction module.
For example, the target text includes the two languages Chinese and English, and the two feature extraction modules in one-to-one correspondence with them are a Chinese feature extraction module and an English feature extraction module. Since the target language of the Chinese feature extraction module is Chinese, the computer device determines that, among the text features output by the Chinese feature extraction module, the Chinese language feature is the first language feature and the English language feature is the second language feature; among the text features output by the English feature extraction module, the English language feature is the first language feature and the Chinese language feature is the second language feature.
Step 602, the computer device uses the first weight as a weight corresponding to the first language feature and uses the second weight as a weight corresponding to the second language feature, wherein the first weight is greater than the second weight.
In the embodiment of the present application, the first weight is greater than the second weight. Because each feature extraction module performs feature extraction on the texts of the different languages in the target text with different accuracy, the accuracy of the extracted language features differs. Therefore, the computer device determines the at least two language features in the text features output by each feature extraction module and, according to the accuracy with which each module extracts the different language features, takes the first weight as the weight corresponding to the first language feature and the second weight as the weight corresponding to the second language feature.
Optionally, following the above example: among the text features output by the Chinese feature extraction module, the Chinese language feature is determined to be the first language feature and the English language feature the second language feature, so the first weight is taken as the weight of the Chinese language feature and the second weight as the weight of the English language feature, with the first weight greater than the second; for example, the first weight may be 80% and the second weight 20%, the Chinese language feature may be a first Chinese matrix, and the English language feature may be a second English matrix. Likewise, among the text features output by the English feature extraction module, the English language feature is determined to be the first language feature and the Chinese language feature the second language feature, so the first weight (for example 80%) is taken as the weight of the English language feature and the second weight (for example 20%) as the weight of the Chinese language feature; here the English language feature may be a first English matrix and the Chinese language feature a second Chinese matrix.
Step 603, the computer device multiplies each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features.
In this embodiment of the application, after obtaining each language feature and its corresponding weight, the computer device multiplies each language feature by its corresponding weight to obtain a plurality of corrected language features.
Continuing with the above example: the computer device determines the Chinese language feature among the text features corresponding to the Chinese feature extraction module to be the first language feature, with a first weight of 80%, and the English language feature to be the second language feature, with a second weight of 20%; likewise, it determines the English language feature among the text features corresponding to the English feature extraction module to be the first language feature, with a first weight of 80%, and the Chinese language feature to be the second language feature, with a second weight of 20%. The computer device multiplies the Chinese language feature in the text features corresponding to the Chinese feature extraction module by 80% and the English language feature by 20%, and multiplies the English language feature in the text features corresponding to the English feature extraction module by 80% and the Chinese language feature by 20%; that is, the first Chinese matrix is multiplied by 80%, the second English matrix by 20%, the first English matrix by 80% and the second Chinese matrix by 20%, thereby obtaining a plurality of corrected language features.
Step 604, for each of the at least two languages, the computer device adds the corrected language features corresponding to that language to obtain the candidate language feature corresponding to that language.
In the embodiment of the application, after obtaining the plurality of corrected language features, the computer device determines the language corresponding to each corrected language feature, thereby grouping the corrected language features by language. Based on each determined language, the computer device adds the corrected language features corresponding to that language to obtain the candidate language feature corresponding to that language.
For example, based on the above example, the computer device obtains the corrected Chinese language feature and the corrected English language feature corresponding to the Chinese feature extraction module, and the corrected English language feature and the corrected Chinese language feature corresponding to the English feature extraction module. The computer device adds the corrected Chinese language feature corresponding to the Chinese feature extraction module to the corrected Chinese language feature corresponding to the English feature extraction module to obtain the candidate Chinese feature, and adds the corrected English language feature corresponding to the Chinese feature extraction module to the corrected English language feature corresponding to the English feature extraction module to obtain the candidate English feature. That is, the first Chinese matrix multiplied by 80% is added to the second Chinese matrix multiplied by 20% to obtain the candidate Chinese feature, and the first English matrix multiplied by 80% is added to the second English matrix multiplied by 20% to obtain the candidate English feature.
Step 605, the computer device performs a splicing process on each candidate language feature.
In the embodiment of the application, after the computer device obtains each candidate language feature, the computer device performs splicing processing on each candidate language feature.
Based on the above, the computer device splices the candidate Chinese feature and the candidate English feature to obtain the spliced Chinese-English language feature.
In the embodiment of the application, the computer device determines the first language feature and the second language feature among the at least two language features according to the target language corresponding to the feature extraction module that extracted the text feature, so that the first language feature is the language feature corresponding to the target language. The computer device then takes the first weight as the weight of the first language feature and the second weight as the weight of the second language feature, and multiplies each language feature by its corresponding weight to obtain a plurality of corrected language features, which preserves the accuracy of the language features in each text feature and hence of the corrected language features. For each of the at least two languages, the computer device adds the corrected language features corresponding to that language to obtain the candidate language feature for that language, and splices the candidate language features together. After the addition, each candidate language feature is the best available feature for its language, and splicing the candidate language features makes the overall language feature corresponding to the target text optimal, which ensures that the synthesized speech output by the text synthesis model is accurate, clear and natural.
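The following toy calculation walks through steps 601 to 605 with the 80% and 20% weights from the example above; the matrix sizes and values, and the use of NumPy, are assumptions made purely for illustration:

```python
# Toy version of steps 601-605; all numbers are invented for illustration.
import numpy as np

# Language features as matrices, per the example above.
first_zh = np.array([[1.0, 2.0], [3.0, 4.0]])   # first Chinese matrix (Chinese module)
second_en = np.array([[0.2, 0.1], [0.4, 0.3]])  # second English matrix (Chinese module)
first_en = np.array([[5.0, 6.0], [7.0, 8.0]])   # first English matrix (English module)
second_zh = np.array([[0.3, 0.2], [0.1, 0.5]])  # second Chinese matrix (English module)

w_first, w_second = 0.8, 0.2  # first weight > second weight

# Steps 601-603: weight every language feature -> corrected language features.
# Step 604: add the corrected features of each language -> candidate language features.
candidate_zh = w_first * first_zh + w_second * second_zh
candidate_en = w_first * first_en + w_second * second_en

# Step 605: splice (concatenate) the candidate language features together.
fused = np.concatenate([candidate_zh, candidate_en], axis=-1)
print(fused.shape)  # (2, 4)
```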
In an alternative embodiment of the present application, as shown in fig. 7, in the above-mentioned speech synthesis method, the training process of the text synthesis model may include the following steps:
step 701, a computer device obtains at least two sets of training sets corresponding to at least two languages one to one.
Each training set comprises a plurality of training samples, and each training sample comprises a training text and real synthesized voice corresponding to the training text.
In this embodiment of the present application, optionally, in a case that the computer device is a server, the server may receive at least two sets of training sets corresponding to at least two languages one to one, which are sent by a terminal; the server may also extract at least two sets of training sets in the server database that correspond one-to-one with the at least two languages.
Optionally, in a case that the computer device is a terminal, the terminal may receive at least two sets of training sets corresponding to at least two languages one to one, which are input by a user; the terminal can also obtain at least two groups of training sets which are displayed on the interface and correspond to at least two languages one by one; the terminal can also extract at least two groups of training sets corresponding to at least two languages one by one from the terminal data. The embodiment of the present application does not specifically limit the manner in which the computer device obtains at least two sets of training sets corresponding to at least two languages one to one.
Step 702, the computer device trains at least two monolingual text synthesis models corresponding to the at least two languages one to one, using the at least two sets of training sets respectively.
In this embodiment of the present application, after obtaining at least two sets of training sets corresponding to the at least two languages one to one, the computer device may use each training set to train the monolingual text synthesis model corresponding to the language of that training set. Each monolingual text synthesis model may be trained as follows: obtain the training set corresponding to the model, where the training set includes a plurality of training samples, each comprising a training text and the real synthesized speech corresponding to that training text, and train the model based on these training texts and their corresponding real synthesized speech.
For example, optionally, the computer device obtains a Chinese training set and an English training set, and trains a Chinese text synthesis model and an English text synthesis model using the Chinese training set and the English training set, respectively. Taking the Chinese model as an example: the computer device obtains a Chinese training set comprising a plurality of Chinese training samples, each including a Chinese text and the real synthesized speech corresponding to that text, and trains the Chinese text synthesis model based on the Chinese texts in the training samples and their corresponding real synthesized speech.
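As a rough sketch of what such per-language training might look like in code: the patent names no framework or architecture, so the model, dataset, and MSE loss below are all assumptions, and the function is illustrative only.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader

def train_monolingual_model(model, dataset, epochs=10, lr=1e-3):
    """Sketch: train one monolingual text synthesis model on pairs of
    (training text features, real synthesized speech features)."""
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # gap between predicted and real speech features
    model.train()
    for _ in range(epochs):
        for text_feats, real_speech in loader:
            pred_speech = model(text_feats)
            loss = criterion(pred_speech, real_speech)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# A Chinese and an English model would each be trained on their own set, e.g.:
# zh_model = train_monolingual_model(zh_model, zh_training_set)
# en_model = train_monolingual_model(en_model, en_training_set)
```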
Step 703, the computer device obtains at least two feature extraction modules based on the at least two single-language text synthesis models.
In this embodiment of the application, after the at least two monolingual text synthesis models are obtained through training, the computer device may train the at least two feature extraction modules in a knowledge distillation manner, using the output features of the penultimate layer of each monolingual text synthesis model as the tag features for the feature extraction module corresponding to that model.
Optionally, based on the above example, the computer device trains a Chinese text synthesis model and an English text synthesis model, and, in a knowledge distillation manner, uses the output features of the penultimate layers of the Chinese and English text synthesis models as the tag features of the Chinese feature extraction module and the English feature extraction module, respectively, so as to train the two feature extraction modules.
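To make the distillation step concrete, here is a minimal sketch assuming the trained monolingual model exposes a callable that returns its penultimate-layer output, and that an MSE loss measures the distance to the tag features; both assumptions go beyond what the patent specifies.

```python
import torch
from torch import nn

def distill_feature_extractor(teacher_penultimate, extractor, texts, epochs=10):
    """Knowledge-distillation sketch: the penultimate-layer outputs of a
    trained monolingual text synthesis model serve as tag features that the
    feature extraction module learns to reproduce."""
    optimizer = torch.optim.Adam(extractor.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    for _ in range(epochs):
        for text_feats in texts:
            with torch.no_grad():
                tag_feature = teacher_penultimate(text_feats)  # frozen teacher
            loss = criterion(extractor(text_feats), tag_feature)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return extractor
```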
Step 704, the computer device composes the at least two acquired feature extraction modules, the preset fusion module and the preset voice conversion module into a text synthesis model.
In the embodiment of the application, the computer device can pre-train a classification model according to the requirements of feature fusion; the classification model is used for classifying the at least two languages in the target text, and the trained classification model is then connected to the feature fusion module to guide the fusion of the text features.
Optionally, the computer device may train the voice conversion module based on the last layer of the above-mentioned monolingual text synthesis models, so that it converts the fusion feature into synthesized speech.
In the embodiment of the application, the computer device combines the obtained at least two feature extraction modules, the preset fusion module and the preset voice conversion module into a text synthesis model.
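Structurally, this composition step could be sketched as below; the constituent modules (including any language classifier wired into the fusion module) are placeholders, since the patent defines their roles but not their implementations.

```python
import torch
from torch import nn

class TextSynthesisModel(nn.Module):
    """Sketch: assemble the trained parts into one text synthesis model."""
    def __init__(self, extractors, fusion_module, voice_converter):
        super().__init__()
        self.extractors = nn.ModuleList(extractors)  # one per language
        self.fusion = fusion_module        # weights, adds, splices features
        self.converter = voice_converter   # fusion feature -> speech

    def forward(self, acoustic_features):
        text_features = [m(acoustic_features) for m in self.extractors]
        fused = self.fusion(text_features)
        return self.converter(fused)
```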
In the embodiment of the application, the computer device obtains at least two sets of training sets corresponding to the at least two languages one to one, and uses them to train at least two monolingual text synthesis models corresponding to the at least two languages one to one. Because the training sets for monolingual text synthesis models are large and the training methods are mature, the accuracy of each trained monolingual text synthesis model can be guaranteed. After the at least two monolingual text synthesis models are trained, the computer device obtains the at least two feature extraction modules based on them; since the accuracy of each monolingual model is guaranteed, the accuracy of the at least two feature extraction modules can also be guaranteed. The computer device then composes the obtained at least two feature extraction modules, the preset fusion module, and the preset voice conversion module into a text synthesis model. The preset fusion module and the preset voice conversion module can be trained multiple times and have high accuracy. Given the accuracy of the feature extraction modules, the preset fusion module, and the preset voice conversion module, the accuracy of the text synthesis model is ensured, which improves the accuracy of the synthesized speech obtained by performing speech synthesis on the target text with this model.
To better explain the speech synthesis method provided by the present application, the present application provides an embodiment that describes the overall flow of the speech synthesis method. As shown in fig. 8, the method includes:
step 801, a computer device obtains at least two sets of training sets corresponding to at least two languages one to one, wherein each set of training set comprises a plurality of training samples, and each training sample comprises a training text and a real synthetic voice corresponding to the training text.
Step 802, the computer device trains at least two monolingual text synthesis models corresponding to at least two languages one to one respectively by using at least two sets of training sets.
Step 803, the computer device obtains at least two feature extraction modules based on the at least two single-language text synthesis models.
Step 804, the computer device composes the obtained at least two feature extraction modules, the preset fusion module and the preset voice conversion module into a text synthesis model.
Step 805, the computer device obtains a target text to be synthesized, the target text being composed of at least two languages.
At step 806, the computer device converts the target text into target phoneme notation symbols.
Step 807, the computer device inputs the target phoneme notation symbols into the acoustic feature recognition model to obtain the acoustic features corresponding to the target phoneme notation symbols.
Step 808, the computer device inputs the acoustic features into the text synthesis model.
Step 809, the computer device determines at least two language texts in the target text, wherein the at least two language texts are in one-to-one correspondence with the at least two languages.
In step 810, the computer device determines at least two language features among the text features according to the positions of the at least two language texts in the target text.
Step 811, the computer device determines a first language feature and a second language feature in the at least two language features according to the target language corresponding to the feature extraction module extracting the text feature, wherein the first language feature corresponds to the target language, and the second language feature does not correspond to the target language.
In step 812, the computer device uses the first weight as a weight corresponding to the first language feature and uses the second weight as a weight corresponding to the second language feature, wherein the first weight is greater than the second weight.
In step 813, the computer device multiplies each linguistic feature by the weight corresponding to each linguistic feature to obtain a plurality of corrected linguistic features.
Step 814, the computer device adds the corrected language features corresponding to the languages for each of the at least two languages to obtain candidate language features corresponding to the languages.
Step 815, the computer device splices each candidate language feature.
Step 816, the computer device performs voice conversion processing on the fusion features through the voice conversion module to obtain the synthesized voice corresponding to the target text.
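Taken together, steps 805-816 can be compressed into the following runnable sketch. Every component here — the grapheme-to-phoneme mapping, the acoustic model, the extractors, the converter, and the position mask — is a toy stand-in introduced purely for illustration; only the weighted multiply-add-splice arithmetic mirrors the method itself.

```python
import numpy as np

# Hypothetical stand-ins for components the patent leaves unspecified.
g2p = lambda text: [ord(c) % 50 for c in text]          # text -> phoneme ids
acoustic_model = lambda ph: np.outer(ph, np.ones(16))   # ids -> acoustic feats
extractors = [lambda a: a * 1.0, lambda a: a * 0.5]     # per-language modules
converter = lambda fused: fused.mean(axis=1)            # fused feats -> "speech"

def split_by_language(feats, lang_mask):
    """Split a text feature into per-language parts by the position of each
    language's text in the target text (steps 809-810)."""
    return [feats[lang_mask == k] for k in range(int(lang_mask.max()) + 1)]

def synthesize(target_text, lang_mask, first_w=0.8, second_w=0.2):
    acoustic = acoustic_model(g2p(target_text))           # steps 806-808
    text_feats = [m(acoustic) for m in extractors]
    corrected = []
    for lang_idx, feats in enumerate(text_feats):
        parts = split_by_language(feats, lang_mask)
        corrected.append([first_w * p if i == lang_idx else second_w * p
                          for i, p in enumerate(parts)])  # steps 811-813
    candidates = [sum(per_lang) for per_lang in zip(*corrected)]  # step 814
    fused = np.concatenate(candidates, axis=0)            # step 815
    return converter(fused)                               # step 816

text = "你好world"
mask = np.array([0, 0, 1, 1, 1, 1, 1])  # 0 = Chinese, 1 = English positions
print(synthesize(text, mask).shape)     # (7,)
```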
It should be understood that, although the various steps in the flowcharts of fig. 1 and figs. 3-8 are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least some of the steps in fig. 1 and figs. 3-8 may include multiple sub-steps or stages, which are not necessarily completed at the same moment and may be executed at different times; their order of execution is also not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 9, there is provided a speech synthesis apparatus 900 comprising: a first obtaining module 901, an input module 902, a second obtaining module 903, a third obtaining module 904, and a fourth obtaining module 905, wherein:
the first obtaining module 901 is configured to obtain a target text to be synthesized, where the target text is composed of at least two languages.
The input module 902 is configured to input the target text into a text synthesis model, where the text synthesis model includes at least two feature extraction modules, a feature fusion module, and a speech conversion module that are in one-to-one correspondence with at least two languages.
And a second obtaining module 903, configured to perform feature extraction processing on the target text through the at least two feature extraction modules, respectively, to obtain at least two text features that correspond to the at least two feature extraction modules one to one.
A third obtaining module 904, configured to perform fusion processing on the at least two text features through the feature fusion module to obtain a fusion feature.
A fourth obtaining module 905, configured to perform voice conversion processing on the fusion features through the voice conversion module, so as to obtain a synthesized voice corresponding to the target text.
In one embodiment, the input module 902 is specifically configured to: converting the target text into target phoneme notation; inputting the target phoneme notation symbols into an acoustic feature recognition model to obtain acoustic features corresponding to the target phoneme notation symbols; the acoustic features are input into a text synthesis model.
In one embodiment, as shown in fig. 10, the third obtaining module 904 includes: a determination unit 9041 and a fusion unit 9042, wherein:
a determining unit 9041, configured to determine, for each text feature, at least two language features in the text feature, and determine, according to a target language corresponding to a feature extraction module that extracts the text feature, a weight corresponding to each language feature, where the at least two language features correspond to the at least two languages one to one;
and a fusion unit 9042, configured to perform fusion processing on at least two text features according to the language feature in each text feature and the weight corresponding to the language feature in each text feature.
In one embodiment, the determining unit 9041 is specifically configured to: determining at least two language texts in the target text, wherein the at least two language texts correspond to at least two languages one by one; at least two language features are determined among the text features according to the positions of the at least two language texts in the target text.
In one embodiment, the fusion unit 9042 is specifically configured to: determining a first language characteristic and a second language characteristic in at least two language characteristics according to a target language corresponding to a characteristic extraction module for extracting the text characteristic, wherein the first language characteristic corresponds to the target language, and the second language characteristic does not correspond to the target language; and taking the first weight as the weight corresponding to the first language characteristic, and taking the second weight as the weight corresponding to the second language characteristic, wherein the first weight is greater than the second weight.
In one embodiment, the language features are matrices, and the fusion unit 9042 is specifically configured to: multiplying each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features; for each language of at least two languages, adding the corrected language features corresponding to the languages to obtain candidate language features corresponding to the languages; and carrying out splicing processing on each candidate language feature.
In one embodiment, as shown in fig. 11, the speech synthesis apparatus 900 further includes: a fifth acquisition module 906, a training module 907, a sixth acquisition module 908, and a composition module 909, wherein:
a fifth obtaining module 906, configured to obtain at least two sets of training sets corresponding to at least two languages one to one, where each set of training set includes multiple training samples, and each training sample includes a training text and a real synthesized voice corresponding to the training text;
a training module 907 for training at least two monolingual text synthesis models corresponding to at least two languages one to one respectively by using at least two sets of training sets;
a sixth obtaining module 908, configured to obtain at least two feature extraction modules based on the at least two single-language text synthesis models;
a composing module 909, configured to compose the obtained at least two feature extraction modules, the preset fusion module, and the preset voice conversion module into a text synthesis model.
For the specific limitations of the speech synthesis apparatus, reference may be made to the above limitations of the speech synthesis method, which are not repeated here. The modules in the above speech synthesis apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules can be embedded in hardware form in, or be independent of, a processor in the computer device, or can be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a speech synthesis method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
Those skilled in the art will appreciate that the architecture shown in fig. 12 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 13. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing speech synthesis data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a speech synthesis method.
Those skilled in the art will appreciate that the architecture shown in fig. 13 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment of the present application, there is provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the following steps when executing the computer program:
acquiring a target text to be synthesized, wherein the target text consists of at least two languages; inputting a target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules, a feature fusion module and a voice conversion module which are in one-to-one correspondence with at least two languages; respectively carrying out feature extraction processing on the target text through at least two feature extraction modules to obtain at least two text features which are in one-to-one correspondence with the at least two feature extraction modules; fusing at least two text features through a feature fusion module to obtain fusion features; and performing voice conversion processing on the fusion characteristics through a voice conversion module to obtain the synthetic voice corresponding to the target text.
In one embodiment of the application, the processor when executing the computer program further performs the following steps: converting the target text into target phoneme notation; inputting the target phoneme notation symbols into an acoustic feature recognition model to obtain acoustic features corresponding to the target phoneme notation symbols; the acoustic features are input into a text synthesis model.
In one embodiment of the application, the processor, when executing the computer program, further performs the steps of: for each text feature, determining at least two language features in the text feature, and determining the weight corresponding to each language feature according to the target language corresponding to the feature extraction module for extracting the text feature, wherein the at least two language features correspond to the at least two languages one by one; and performing fusion processing on at least two text features according to the language features in each text feature and the weight corresponding to the language features in each text feature.
In one embodiment of the application, the processor when executing the computer program further performs the following steps: determining at least two language texts in the target text, wherein the at least two language texts correspond to at least two languages one by one; at least two language features are determined among the text features according to the positions of the at least two language texts in the target text.
In one embodiment of the application, the processor when executing the computer program further performs the following steps: determining a first language characteristic and a second language characteristic in at least two language characteristics according to a target language corresponding to a characteristic extraction module for extracting the text characteristic, wherein the first language characteristic corresponds to the target language, and the second language characteristic does not correspond to the target language; and taking the first weight as the weight corresponding to the first language characteristic, and taking the second weight as the weight corresponding to the second language characteristic, wherein the first weight is greater than the second weight.
In one embodiment of the application, the language features are matrices, and the processor, when executing the computer program, further performs the following steps: multiplying each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features; for each language of the at least two languages, adding the corrected language features corresponding to the language to obtain candidate language features corresponding to the language; and carrying out splicing processing on each candidate language feature.
In one embodiment of the application, the processor when executing the computer program further performs the following steps: acquiring at least two groups of training sets corresponding to at least two languages one by one, wherein each group of training sets comprises a plurality of training samples, and each training sample comprises a training text and real synthesized voice corresponding to the training text; respectively training at least two monolingual text synthesis models which are in one-to-one correspondence with at least two languages by utilizing at least two groups of training sets; acquiring at least two feature extraction modules based on at least two single-language text synthesis models; and forming a text synthesis model by the acquired at least two feature extraction modules, the preset fusion module and the preset voice conversion module.
In one embodiment of the present application, there is provided a computer readable storage medium having a computer program stored thereon, the computer program when executed by a processor implementing the steps of:
acquiring a target text to be synthesized, wherein the target text consists of at least two languages; inputting a target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules, a feature fusion module and a voice conversion module which are in one-to-one correspondence with at least two languages; respectively carrying out feature extraction processing on the target text through at least two feature extraction modules to obtain at least two text features which are in one-to-one correspondence with the at least two feature extraction modules; fusing at least two text features through a feature fusion module to obtain fusion features; and performing voice conversion processing on the fusion characteristics through a voice conversion module to obtain the synthetic voice corresponding to the target text.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: converting the target text into target phoneme notation; inputting the target phoneme notation symbols into an acoustic feature recognition model to obtain acoustic features corresponding to the target phoneme notation symbols; the acoustic features are input into a text synthesis model.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: for each text feature, determining at least two language features in the text feature, and determining the weight corresponding to each language feature according to the target language corresponding to the feature extraction module for extracting the text feature, wherein the at least two language features correspond to the at least two languages one by one; and performing fusion processing on at least two text features according to the language features in each text feature and the weight corresponding to the language features in each text feature.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: determining at least two language texts in the target text, wherein the at least two language texts correspond to at least two languages one by one; at least two language features are determined among the text features according to the positions of the at least two language texts in the target text.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: determining a first language characteristic and a second language characteristic in at least two language characteristics according to a target language corresponding to a characteristic extraction module for extracting the text characteristic, wherein the first language characteristic corresponds to the target language, and the second language characteristic does not correspond to the target language; and taking the first weight as the weight corresponding to the first language characteristic, and taking the second weight as the weight corresponding to the second language characteristic, wherein the first weight is greater than the second weight.
In an embodiment of the application, the language features are matrices, and the computer program, when executed by the processor, further performs the following steps: multiplying each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features; for each language of the at least two languages, adding the corrected language features corresponding to the language to obtain candidate language features corresponding to the language; and carrying out splicing processing on each candidate language feature.
In one embodiment of the application, the computer program when executed by the processor further performs the steps of: acquiring at least two groups of training sets corresponding to at least two languages one by one, wherein each group of training sets comprises a plurality of training samples, and each training sample comprises a training text and real synthesized voice corresponding to the training text; respectively training at least two monolingual text synthesis models which are in one-to-one correspondence with at least two languages by utilizing at least two groups of training sets; acquiring at least two feature extraction modules based on at least two single-language text synthesis models; and forming a text synthesis model by the acquired at least two feature extraction modules, the preset fusion module and the preset voice conversion module.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above examples express only several embodiments of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring a target text to be synthesized, wherein the target text consists of at least two languages;
inputting the target text into a text synthesis model, wherein the text synthesis model comprises at least two feature extraction modules, a feature fusion module and a voice conversion module which are in one-to-one correspondence with the at least two languages;
respectively performing feature extraction processing on the target text through the at least two feature extraction modules to obtain at least two text features which are in one-to-one correspondence with the at least two feature extraction modules;
fusing the at least two text features through the feature fusion module to obtain fusion features;
and performing voice conversion processing on the fusion characteristics through the voice conversion module to obtain synthetic voice corresponding to the target text.
2. The method of claim 1, wherein the inputting the target text into a text synthesis model comprises:
converting the target text into target phoneme notation;
inputting the target phoneme notation symbols into an acoustic feature recognition model to obtain acoustic features corresponding to the target phoneme notation symbols;
inputting the acoustic features into the text synthesis model.
3. The method according to claim 1, wherein the fusing the at least two text features by the feature fusion module comprises:
for each text feature, determining at least two language features in the text feature, and determining the weight corresponding to each language feature according to a target language corresponding to a feature extraction module for extracting the text feature, wherein the at least two language features are in one-to-one correspondence with the at least two languages;
and performing fusion processing on the at least two text features according to the language features in the text features and the weights corresponding to the language features in the text features.
4. The method of claim 3, wherein determining at least two language features among the text features comprises:
determining at least two language texts in the target text, wherein the at least two language texts correspond to the at least two languages one to one;
determining the at least two language features among the text features according to the positions of the at least two language texts in the target text.
5. The method according to claim 3, wherein the determining the weight corresponding to each language feature according to the target language corresponding to the feature extraction module extracting the text feature comprises:
determining a first language feature and a second language feature in the at least two language features according to a target language corresponding to a feature extraction module for extracting the text feature, wherein the first language feature corresponds to the target language, and the second language feature does not correspond to the target language;
and taking a first weight as a weight corresponding to the first language characteristic, and taking a second weight as a weight corresponding to the second language characteristic, wherein the first weight is greater than the second weight.
6. The method according to claim 3, wherein the language features are matrices, and the fusing the at least two text features according to the language features in the text features and the weights corresponding to the language features in the text features comprises:
multiplying each language feature by the weight corresponding to each language feature to obtain a plurality of corrected language features;
for each language of the at least two languages, adding the corrected language features corresponding to the language to obtain candidate language features corresponding to the language;
and performing splicing processing on each candidate language feature.
7. The method of claim 1, wherein the training process of the text synthesis model comprises:
acquiring at least two groups of training sets which are in one-to-one correspondence with the at least two languages, wherein each group of training sets comprises a plurality of training samples, and each training sample comprises a training text and real synthesized voice corresponding to the training text;
respectively training at least two monolingual text synthesis models which are in one-to-one correspondence with the at least two languages by utilizing the at least two groups of training sets;
acquiring the at least two feature extraction modules based on the at least two single-language text synthesis models;
and combining the acquired at least two feature extraction modules, the preset fusion module and the preset voice conversion module into the text synthesis model.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a target text to be synthesized, and the target text consists of at least two languages;
the input module is used for inputting the target text into a text synthesis model, and the text synthesis model comprises at least two feature extraction modules, a feature fusion module and a voice conversion module which are in one-to-one correspondence with the at least two languages;
the second obtaining module is used for respectively carrying out feature extraction processing on the target text through the at least two feature extraction modules to obtain at least two text features which are in one-to-one correspondence with the at least two feature extraction modules;
the third acquisition module is used for carrying out fusion processing on the at least two text features through the feature fusion module to obtain fusion features;
and the fourth acquisition module is used for performing voice conversion processing on the fusion features through the voice conversion module to obtain the synthetic voice corresponding to the target text.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202011562944.XA 2020-12-25 2020-12-25 Speech synthesis method, device, computer equipment and storage medium Active CN112652294B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011562944.XA CN112652294B (en) 2020-12-25 2020-12-25 Speech synthesis method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112652294A true CN112652294A (en) 2021-04-13
CN112652294B CN112652294B (en) 2023-10-24

Family

ID=75363002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011562944.XA Active CN112652294B (en) 2020-12-25 2020-12-25 Speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112652294B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1731510A (en) * 2004-08-05 2006-02-08 摩托罗拉公司 Text-speech conversion for amalgamated language
KR20070071675A (en) * 2005-12-30 2007-07-04 주식회사 팬택 Method for performing multiple language tts process in mibile terminal
US20120173241A1 (en) * 2010-12-30 2012-07-05 Industrial Technology Research Institute Multi-lingual text-to-speech system and method
US20150186359A1 (en) * 2013-12-30 2015-07-02 Google Inc. Multilingual prosody generation
CN105845125A (en) * 2016-05-18 2016-08-10 百度在线网络技术(北京)有限公司 Speech synthesis method and speech synthesis device
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
KR20190085879A (en) * 2018-01-11 2019-07-19 네오사피엔스 주식회사 Method of multilingual text-to-speech synthesis
CN111247581A (en) * 2019-12-23 2020-06-05 深圳市优必选科技股份有限公司 Method, device, equipment and storage medium for synthesizing voice by multi-language text
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment

Also Published As

Publication number Publication date
CN112652294B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN108447486B (en) Voice translation method and device
CN111667814B (en) Multilingual speech synthesis method and device
CN107632980A (en) Voice translation method and device, the device for voiced translation
CN110910903B (en) Speech emotion recognition method, device, equipment and computer readable storage medium
US20240070397A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
CN107291704A (en) Treating method and apparatus, the device for processing
CN113380222B (en) Speech synthesis method, device, electronic equipment and storage medium
CN114401431A (en) Virtual human explanation video generation method and related device
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN112735371A (en) Method and device for generating speaker video based on text information
CN115497448A (en) Method and device for synthesizing voice animation, electronic equipment and storage medium
CN114882862A (en) Voice processing method and related equipment
CN114550239A (en) Video generation method and device, storage medium and terminal
CN114255737B (en) Voice generation method and device and electronic equipment
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN114038484A (en) Voice data processing method and device, computer equipment and storage medium
CN117352132A (en) Psychological coaching method, device, equipment and storage medium
CN115640611B (en) Method for updating natural language processing model and related equipment
CN113409791A (en) Voice recognition processing method and device, electronic equipment and storage medium
JP4200874B2 (en) KANSEI information estimation method and character animation creation method, program using these methods, storage medium, sensitivity information estimation device, and character animation creation device
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN112652294B (en) Speech synthesis method, device, computer equipment and storage medium
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN114267324A (en) Voice generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant