CN113160793A - Speech synthesis method, device, equipment and storage medium based on low resource language - Google Patents


Info

Publication number
CN113160793A
Authority
CN
China
Prior art keywords
low resource language
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110441988.5A
Other languages
Chinese (zh)
Inventor
孙奥兰
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110441988.5A priority Critical patent/CN113160793A/en
Publication of CN113160793A publication Critical patent/CN113160793A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a speech synthesis method based on a low-resource language, comprising the following steps: determining a low-resource language and a high-resource language, and acquiring text corresponding to the low-resource language to obtain a low-resource language text; converting the low-resource language text into a low-resource language phoneme text; translating the low-resource language phoneme text into a high-resource language phoneme text by using a translation model trained on the basis of dual learning; and performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain speech. The invention also relates to blockchain technology: the low-resource language text can be stored in nodes of a blockchain. The invention further provides a speech synthesis apparatus based on a low-resource language, an electronic device, and a computer-readable storage medium. The invention provides a speech synthesis method aimed at low-resource languages that improves the speech synthesis effect.

Description

Speech synthesis method, device, equipment and storage medium based on low resource language
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a speech synthesis method and apparatus based on a low-resource language, an electronic device, and a computer-readable storage medium.
Background
In recent years, text-to-speech (TTS) technology has developed rapidly and has received wide attention from both academia and industry. Demand for speech synthesis is growing across many industries; for example, the customer service industry can use speech synthesis to provide self-service voice support.
There are roughly 6,000 languages in the world, yet traditional speech synthesis methods cover only a few dozen of them. The underlying reason is the difficulty of data set development: building a data set for a new language usually requires hiring professional voice actors to record a large amount of high-quality speech for that language, which is then used as training data for models. For many low-resource languages, such as dialects and other minority languages, there is often no International Phonetic Alphabet transcription, or even no phonetic notation at all; hiring linguists to document and analyze the language's pronunciation and phoneme tones consumes a great deal of money and time, so training data sets are scarce. Directly applying a traditional speech synthesis model to a low-resource language therefore yields poor results, and no speech synthesis method designed specifically for low-resource languages currently exists.
Disclosure of Invention
The invention provides a speech synthesis method and apparatus based on a low-resource language and a computer-readable storage medium, with the main aim of providing a speech synthesis method aimed at low-resource languages that improves the speech synthesis effect.
In order to achieve the above object, the present invention provides a speech synthesis method based on low resource language, comprising:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text;
translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain speech.
Optionally, the converting the low-resource language text into a low-resource language phoneme text includes:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
Optionally, before translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using the translation model trained on the basis of dual learning, the method further includes:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model on the basis of dual learning by using the phoneme text set and a pre-constructed reverse translation model, so as to obtain a trained translation model.
Optionally, the training of the translation model on the basis of dual learning by using the phoneme text set and the pre-constructed reverse translation model to obtain a trained translation model includes:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
and adjusting the parameters of the translation model and the reverse translation model, and repeating the training steps of the two models until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
Optionally, the translating the low-resource language phoneme text into a phoneme corresponding to the high-resource language by using a translation model trained based on dual learning to obtain a high-resource language phoneme text includes:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
Optionally, the performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a language speech includes:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic features by using a preset vocoder to obtain speech.
Optionally, before performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a language speech, the method further includes:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
performing speech synthesis on the training data set through a pre-constructed speech synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back-propagation parameter adjustment on the speech synthesis model according to the loss value, and returning to the step of performing speech synthesis on the training data set through the speech synthesis model until the loss value no longer decreases, so as to obtain the trained speech synthesis model.
In order to solve the above problem, the present invention further provides a speech synthesis apparatus based on a low-resource language, the apparatus comprising:
the text acquisition module is used for determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
a phoneme obtaining module, configured to convert the low-resource language text into a low-resource language phoneme text;
the phoneme mapping module is used for translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and the speech synthesis module is used for carrying out speech synthesis on the high-resource language phoneme text by utilizing a pre-trained speech synthesis model to obtain language speech.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and a processor that executes the instructions stored in the memory to implement the low-resource language based speech synthesis method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, which is executed by a processor in an electronic device to implement the low-resource language based speech synthesis method described above.
By converting the low-resource language phoneme text into a high-resource language phoneme text, the embodiment of the invention greatly reduces the demand on the low-resource language's data set and can realize speech synthesis for the low-resource language by reusing the speech synthesis model of the high-resource language. Meanwhile, training the translation model with dual learning overcomes the current limitation that phoneme mappings exist only for the International Phonetic Alphabet, shortens the training time of the translation model, and improves its stability and accuracy. Therefore, the speech synthesis method, apparatus, electronic device, and computer-readable storage medium based on a low-resource language provided by the invention offer a speech synthesis method aimed at low-resource languages and improve the speech synthesis effect.
Drawings
FIG. 1 is a flowchart illustrating a low-resource language-based speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a low-resource language-based speech synthesis apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing the low-resource language-based speech synthesis method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a speech synthesis method based on a low-resource language. The execution subject of the low-resource language-based speech synthesis method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, which can be configured to execute the method provided by the embodiments of the present application. In other words, the low-resource language based speech synthesis method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a speech synthesis method based on low-resource language according to an embodiment of the present invention. In this embodiment, the method for synthesizing speech based on low-resource language includes:
s1, determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text.
In the embodiment of the present invention, the low-resource language text is text composed in a low-resource language, that is, a language with little available language data, including dialects and minority languages such as Shanghainese.
The high-resource language is the counterpart of the low-resource language, chosen so that its phonemes are highly similar to those of the low-resource language; Mandarin and Shanghainese are one such pair.
In detail, determining the low-resource language and the high-resource language means selecting a language as the low-resource language according to the actual business scenario, and then selecting, from the languages with abundant data resources, the one most similar to the low-resource language as the high-resource language.
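The similarity-based selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: the phoneme inventories below are hypothetical toy sets, and Jaccard overlap is only one plausible similarity measure, not one the patent specifies.

```python
# Hypothetical sketch: among candidate high-resource languages, pick the one
# whose phoneme inventory overlaps most with the low-resource language's.
# The inventories are illustrative toy data, not real phoneme sets.

def jaccard(a, b):
    """Jaccard similarity between two phoneme sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pick_high_resource(low_inventory, candidates):
    """candidates: dict mapping language name -> phoneme inventory."""
    return max(candidates, key=lambda lang: jaccard(low_inventory, candidates[lang]))

shanghainese = {"p", "t", "k", "h", "a", "o", "u", "ng"}
candidates = {
    "mandarin": {"p", "t", "k", "h", "a", "o", "u", "ng", "sh"},
    "english":  {"p", "t", "k", "h", "ae", "ih", "uh", "th"},
}
print(pick_high_resource(shanghainese, candidates))  # -> mandarin
```

With these toy inventories, Mandarin shares eight of nine phonemes with the Shanghainese set and is selected.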
Further, in the embodiment of the present invention, the low-resource language text may be acquired from a preset database. To further emphasize the privacy and security of the low resource language text, the low resource language text may also be obtained from a node of a blockchain.
And S2, converting the low resource language text into a low resource language phoneme text.
Phonemes are the smallest phonetic units, divided according to the natural attributes of speech; they are analyzed from the articulatory actions within a syllable, one action constituting one phoneme. For example, the Mandarin word putonghua ("Mandarin") consists of three syllables and can be split into the eight phonemes p, u, t, o, ng, h, u, a. A computer can synthesize speech from such phoneme text.
In detail, the converting the low-resource language text into a low-resource language phoneme text includes:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
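The two conversion steps above can be sketched as a small grapheme-to-phoneme routine. The lexicon and the phoneme inventory in the regular expression are illustrative assumptions; a real system would use a pronunciation dictionary for the target low-resource language. The split reproduces the eight-phoneme putonghua example earlier in this section.

```python
import re

# Toy lexicon: character -> romanized pronunciation (hypothetical mapping,
# for illustration only).
LEXICON = {"普": "pu", "通": "tong", "话": "hua"}

# Multi-letter phonemes must be tried before the single-letter fallback.
PHONEME_RE = re.compile(r"ng|zh|ch|sh|[a-z]")

def pronunciation_to_phonemes(pron):
    """Step 2: split one romanized syllable into its phonemes."""
    return PHONEME_RE.findall(pron)

def text_to_phoneme_text(text):
    phonemes = []
    for ch in text:
        pron = LEXICON[ch]                                # step 1: char -> pronunciation
        phonemes.extend(pronunciation_to_phonemes(pron))  # step 2: split
    return " ".join(phonemes)

print(text_to_phoneme_text("普通话"))  # -> p u t o ng h u a
```

Note how the regex keeps ng as a single phoneme, matching the p, u, t, o, ng, h, u, a split given above.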
And S3, translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text.
In the embodiment of the invention, the translation model takes the phoneme text of the low-resource language as input and outputs the phoneme text of the high-resource language; the effect of the resulting high-resource phoneme text is similar to reading Shanghainese with Mandarin pronunciation.
Optionally, before the translation model trained on the basis of dual learning is used to translate the low-resource language phoneme text into phonemes corresponding to the high-resource language to obtain the high-resource language phoneme text, the method further includes:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model.
Here, the audio files are audio in the low-resource language. The speech recognition model is built from a CNN with CTC (Connectionist Temporal Classification) as its loss function, and can accurately convert an audio file into phoneme text of the low-resource language. The reverse translation model is the counterpart of the translation model: it takes high-resource language phoneme text as input and outputs the corresponding low-resource language phoneme text.
Further, the converting the audio file into low resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set includes:
encoding the audio file and extracting features by using the encoding layer of the speech recognition model to obtain speech features;
decoding and matching the speech features by using the decoding layer of the speech recognition model to obtain a language text set;
and performing phoneme conversion on the language text set by using the post-processing network of the speech recognition model to obtain the phoneme text set.
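Since the recognition model is trained with a CTC loss, its decoding layer typically collapses a per-frame label sequence into the final phoneme sequence. The sketch below shows greedy CTC collapsing only; the per-frame labels stand in for the CNN encoder's output and are hypothetical.

```python
# Greedy CTC decoding sketch: collapse repeated labels and drop the blank
# symbol. The frame labels below are an illustrative stand-in for the
# per-frame argmax of a CNN acoustic model's output.

BLANK = "-"

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-path label sequence into phoneme output."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Hypothetical per-frame labels for a short low-resource-language utterance:
frames = ["n", "n", "-", "o", "ng", "ng", "-", "h", "a", "a", "u"]
print(ctc_greedy_collapse(frames))  # -> ['n', 'o', 'ng', 'h', 'a', 'u']
```

The blank symbol lets CTC emit the same phoneme twice in a row: a blank between two identical labels prevents them from being merged.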
Further, the training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model includes:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
and adjusting the parameters of the translation model and the reverse translation model, and repeating the training steps of the two models until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
Here, the stop condition is that the likelihood probability P_f output by the forward translation model and the likelihood probability P_b output by the backward translation model are equal. The likelihood probability is the largest of the probability values produced by the activation function in the model.
In practice, some low-resource language phonemes may have no matching high-resource language phoneme. For example, depending on the speaker's accent, Shanghainese monophthongs include ten to twelve distinct main-vowel (vowel nucleus) phonemes, whereas the basic Latin alphabet offers only the vowels a, e, i, o, u; a complete one-to-one correspondence is impossible, so some monophthong phonemes must be represented in other ways. The embodiment of the invention trains the translation model on the basis of dual learning, which means starting from any sentence of monolingual data, translating it into the other language, and then translating it back to the original language. For example, a sentence in language A is translated into a sentence in language B by translation model X and sent to translation model Y; model Y translates the received language-B sentence back into language A and returns it to model X; translation quality improves over multiple such iterations.
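The round-trip loop with its equal-likelihood stop condition can be sketched as below. This is a minimal runnable illustration: the ToyTranslator class, its drift-toward-the-dual update rule, and the tolerance threshold are all illustrative assumptions, not the patent's actual neural models or training objective.

```python
class ToyTranslator:
    """Stand-in translator; a real system would use a neural seq2seq model.
    Its 'likelihood' drifts toward the dual model's during updates."""
    def __init__(self, likelihood):
        self.likelihood = likelihood

    def translate(self, batch):
        # Return a dummy translation plus the current likelihood probability.
        return [f"<{tok}>" for tok in batch], self.likelihood

    def update(self, dual_likelihood):
        # Parameter adjustment: move part-way toward the dual's likelihood.
        self.likelihood += 0.5 * (dual_likelihood - self.likelihood)

def train(forward, backward, batches, tolerance=1e-3, max_rounds=100):
    """Dual-learning loop: A -> B -> A round trips until P_f and P_b agree."""
    for _ in range(max_rounds):
        for batch in batches:
            high, p_f = forward.translate(batch)    # A -> B
            _back, p_b = backward.translate(high)   # B -> A
            forward.update(p_b)                     # dual feedback
            backward.update(p_f)
        if abs(p_f - p_b) < tolerance:              # stop: P_f equals P_b
            break
    return forward, backward

fwd, bwd = train(ToyTranslator(0.9), ToyTranslator(0.5), [["nong", "hau"]])
print(round(fwd.likelihood, 3), round(bwd.likelihood, 3))  # -> 0.7 0.7
```

The two toy likelihoods converge to the same value, mirroring the stop condition that P_f and P_b become equal.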
In detail, the translating the low-resource language phoneme text into a phoneme corresponding to the high-resource language by using a translation model trained based on dual learning to obtain a high-resource language phoneme text includes:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
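The encode-then-decode inference steps above can be sketched with stand-in components: the "encoder" here is a token-index lookup and the "decoder" a phoneme-mapping table in place of the trained network. The Shanghainese-to-pinyin phoneme pairs are hypothetical.

```python
# Hypothetical low-resource (Shanghainese-like) -> high-resource (Mandarin
# pinyin-like) phoneme mapping standing in for the trained translation model.
PHONEME_MAP = {"nong": "nong", "hau": "hao"}

VOCAB = sorted(PHONEME_MAP)  # toy vocabulary shared by encoder and decoder

def encode(phoneme_text):
    """Feature extraction: phoneme tokens -> vector of vocabulary indices."""
    return [VOCAB.index(tok) for tok in phoneme_text.split()]

def decode(encoding):
    """Decoding: index vector -> high-resource language phoneme text."""
    return " ".join(PHONEME_MAP[VOCAB[i]] for i in encoding)

print(decode(encode("nong hau")))  # -> nong hao
```

In the real model the encoding vector is a learned continuous representation rather than integer indices; the table lookup simply makes the two-step flow concrete.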
And S4, carrying out voice synthesis on the high-resource language phoneme text by using the pre-trained voice synthesis model to obtain language voice.
The speech synthesis model in the embodiment of the invention is an integrated end-to-end TTS model comprising an encoder, an attention-based decoder, and a post-processing network; it takes a character sequence as input and outputs the corresponding spectrum, i.e., the acoustic features.
Optionally, before performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a low-resource language speech, the method further includes:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
performing voice synthesis on the training data set through a pre-constructed voice synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back-propagation parameter adjustment on the speech synthesis model according to the loss value, and returning to the step of performing speech synthesis on the training data set through the speech synthesis model until the loss value no longer decreases, so as to obtain the trained speech synthesis model.
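The train-until-the-loss-stops-decreasing loop above can be sketched as follows. The one-parameter "model" and squared-error loss are illustrative stand-ins for a real TTS network and its preset loss function.

```python
def train_until_plateau(model_step, max_epochs=1000):
    """Run synthesize/loss/backprop cycles; stop when loss stops decreasing."""
    prev_loss = float("inf")
    for _ in range(max_epochs):
        loss = model_step()
        if loss >= prev_loss:       # loss no longer decreases -> trained
            break
        prev_loss = loss
    return prev_loss

def make_toy_step(target=3.0, lr=0.4):
    """Stand-in for one training cycle: a one-parameter model fit with a
    squared-error 'preset loss function' and a gradient parameter update."""
    state = {"w": 0.0}
    def step():
        error = state["w"] - target
        loss = error * error            # compute the loss value
        state["w"] -= lr * 2 * error    # back-propagation parameter adjustment
        return loss
    return step

final_loss = train_until_plateau(make_toy_step())
print(final_loss < 1e-6)  # -> True
```

In practice a validation loss with a patience window is usually preferred to a strict "no longer decreasing" test, but the sketch follows the text's stop rule literally.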
In detail, the performing speech synthesis on the high-resource language phoneme text by using the pre-trained speech synthesis model to obtain the low-resource language speech includes:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic features by using a preset vocoder to obtain speech.
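The four synthesis steps above can be sketched end to end with stand-in components. Here the "acoustic feature" is just a list of (frequency, duration) pairs and the "vocoder" renders them as sine tones; real systems use spectrograms and a neural or Griffin-Lim vocoder, and the per-phoneme frequencies below are arbitrary illustrative values.

```python
import math

SAMPLE_RATE = 16000

def encode_phonemes(phoneme_text):
    """Encoder: phoneme text -> feature sequence (here, the token list)."""
    return phoneme_text.split()

def decode_to_acoustics(features):
    """Decoder: map each token to an initial (freq_hz, seconds) feature."""
    return [(200.0 + 10.0 * len(tok), 0.1) for tok in features]

def postprocess(acoustics):
    """Post-processing network: 'correct' features (here, cap durations)."""
    return [(f, min(d, 0.2)) for f, d in acoustics]

def vocoder(acoustics):
    """Vocoder: inverse-decode acoustic features into waveform samples."""
    samples = []
    for freq, dur in acoustics:
        n = int(SAMPLE_RATE * dur)
        samples.extend(math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
                       for i in range(n))
    return samples

wave = vocoder(postprocess(decode_to_acoustics(encode_phonemes("n i h ao"))))
print(len(wave))  # -> 6400 (4 phonemes x 0.1 s x 16000 Hz)
```

Writing `wave` to a WAV container would complete the text-to-audio-file path described next.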
The embodiment of the invention can convert text in a low-resource language into a corresponding audio file. For example, given a passage of Shanghainese text, the method converts it into Shanghainese phoneme information, translates that phoneme information into pinyin information through the translation model, and then converts the resulting pinyin information into an audio file through the speech synthesis model.
By converting the low-resource language phoneme text into a high-resource language phoneme text, the embodiment of the invention greatly reduces the demand on the low-resource language's data set and can realize speech synthesis for the low-resource language by reusing the speech synthesis model of the high-resource language. Meanwhile, training the translation model with dual learning overcomes the current limitation that phoneme mappings exist only for the International Phonetic Alphabet, shortens the training time of the translation model, and improves its stability and accuracy. Therefore, the speech synthesis method, apparatus, electronic device, and computer-readable storage medium based on a low-resource language provided by the invention offer a speech synthesis method aimed at low-resource languages and improve the speech synthesis effect.
Fig. 2 is a functional block diagram of a speech synthesis apparatus based on low-resource language according to an embodiment of the present invention.
The low-resource language based speech synthesis apparatus 100 of the present invention can be installed in an electronic device. Depending on the implemented functions, the low-resource language-based speech synthesis apparatus 100 may include a text acquisition module 101, a phoneme acquisition module 102, a phoneme mapping module 103, and a speech synthesis module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the text obtaining module 101 is configured to determine a low resource language and a high resource language, and obtain a text corresponding to the low resource language to obtain a low resource language text.
In the embodiment of the present invention, the low-resource language text is text composed in a low-resource language, that is, a language with little available language data, including dialects and minority languages such as Shanghainese.
The high-resource language is the counterpart of the low-resource language, chosen so that its phonemes are highly similar to those of the low-resource language; Mandarin and Shanghainese are one such pair.
In detail, determining the low-resource language and the high-resource language means selecting a language as the low-resource language according to the actual business scenario, and then selecting, from the languages with abundant data resources, the one most similar to the low-resource language as the high-resource language.
Further, in the embodiment of the present invention, the low-resource language text may be acquired from a preset database. To further emphasize the privacy and security of the low resource language text, the low resource language text may also be obtained from a node of a blockchain.
The phoneme obtaining module 102 is configured to convert the low-resource language text into a low-resource language phoneme text.
Phonemes are the smallest phonetic units, divided according to the natural attributes of speech; they are analyzed from the articulatory actions within a syllable, one action constituting one phoneme. For example, the Mandarin word putonghua ("Mandarin") consists of three syllables and can be split into the eight phonemes p, u, t, o, ng, h, u, a. A computer can synthesize speech from such phoneme text.
In detail, the phoneme obtaining module 102 is specifically configured to:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
The phoneme mapping module 103 is configured to use a translation model after training based on dual learning to translate the low-resource language phoneme text into phonemes corresponding to the high-resource language, so as to obtain a high-resource language phoneme text.
In the embodiment of the invention, the translation model takes the phoneme text of the low-resource language as input and outputs the phoneme text of the high-resource language; the effect of the resulting high-resource phoneme text is similar to reading Shanghainese with Mandarin pronunciation.
Optionally, before the translation model trained on the basis of dual learning is used to translate the low-resource language phoneme text into phonemes corresponding to the high-resource language to obtain the high-resource language phoneme text, the method further includes:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model.
Here, the audio files are audio in the low-resource language. The speech recognition model is built from a CNN with CTC (Connectionist Temporal Classification) as its loss function, and can accurately convert an audio file into phoneme text of the low-resource language. The reverse translation model is the counterpart of the translation model: it takes high-resource language phoneme text as input and outputs the corresponding low-resource language phoneme text.
Further, the converting the audio file into low resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set includes:
coding and feature extraction are carried out on the audio file by utilizing a coding layer of the voice recognition model, so as to obtain voice features;
decoding and matching the voice characteristics by utilizing a decoding layer of the voice recognition model to obtain a language text set;
and performing phoneme conversion on the language text set by utilizing a post-processing network of the speech recognition model to obtain a phoneme text set.
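The three-stage recognition flow above (coding layer, decoding layer, post-processing network) amounts to a function composition. The sketch below illustrates this; the toy encoder, decoder, and lexicon-based post-processor are hypothetical stand-ins for the trained layers:

```python
def recognize(audio_frames, encode, decode, postprocess):
    """Speech recognition as the three described stages: encoding and
    feature extraction, decoding and matching, then phoneme conversion."""
    speech_features = encode(audio_frames)   # coding layer
    language_text = decode(speech_features)  # decoding layer
    return postprocess(language_text)        # post-processing network

# Toy stand-ins: features are frame averages, decoding maps them to words,
# post-processing looks up each word's phonemes in a small lexicon.
lexicon = {"nong": ["n", "o", "ng"], "hau": ["h", "au"]}
encode = lambda frames: [sum(f) / len(f) for f in frames]
decode = lambda feats: ["nong", "hau"][: len(feats)]
postprocess = lambda words: [p for w in words for p in lexicon[w]]

print(recognize([[0.1, 0.3], [0.2, 0.4]], encode, decode, postprocess))
```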
Further, the training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model includes:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
adjusting parameters of the trained translation model and the reverse translation model, and repeatedly executing the training steps of the translation model and the reverse translation model until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
Wherein the stop condition is that the likelihood probability P_f output by the forward translation model is equal to the likelihood probability P_b output by the backward translation model. The likelihood probability is the maximum of the probability values obtained from the activation function of the model.
In practice, some low-resource language phonemes may have no matching high-resource language phoneme. For example, depending on the speaker's accent, Shanghainese monophthongs have ten to twelve different main-vowel (vowel nucleus) phonemes, whereas the conventional Latin alphabet provides only the vowels a, e, i, o, and u; a complete correspondence is therefore impossible, and some of these phonemes must be represented in other ways. The embodiment of the invention trains the translation model based on dual learning, which means that, starting from any sentence of monolingual data, the sentence is first translated into another language and then translated back into the original language. For example, a sentence in language A is translated by translation model X into a sentence in language B and sent to translation model Y; translation model Y then translates the received language-B sentence back into a language-A sentence and returns it to translation model X. Translation quality is improved through multiple such iterations.
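The dual-learning round trip described above (language A to B via model X, then B back to A via model Y, with a reconstruction signal driving both models) can be sketched as follows; the dictionary "models" and phoneme tokens are hypothetical placeholders for the real neural translators:

```python
def dual_round(sentence, forward, backward):
    """One dual-learning iteration: translate forward, translate back,
    and score how well the round trip reconstructs the input."""
    translated = [forward[tok] for tok in sentence]        # model X: A -> B
    reconstructed = [backward[tok] for tok in translated]  # model Y: B -> A
    matches = sum(a == b for a, b in zip(sentence, reconstructed))
    reward = matches / len(sentence)  # reconstruction reward for both models
    return translated, reconstructed, reward

# Toy phoneme mappings standing in for model X (forward) and model Y (backward).
forward = {"ng": "n", "au": "ao"}
backward = {"n": "ng", "ao": "au"}
print(dual_round(["ng", "au"], forward, backward))
```

In a real system the reward would be a likelihood rather than an exact-match ratio, and it would be back-propagated to update both models' parameters.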
In detail, the phoneme mapping module 103 is specifically configured to:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
The speech synthesis module 104 is configured to perform speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a language speech.
The speech synthesis model in the embodiment of the invention is an integrated end-to-end TTS model, takes a character sequence as input and outputs corresponding frequency spectrum, namely acoustic characteristics, and comprises an encoder, an attention mechanism-based decoder and a post-processing network.
Optionally, before performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a low-resource language speech, the method further includes:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
performing voice synthesis on the training data set through a pre-constructed voice synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back propagation parameter adjustment on the voice synthesis model according to the loss value, and returning to the step of performing voice synthesis on the training data set through the voice synthesis model until the loss value is not reduced any more, so as to obtain the trained voice synthesis model.
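The back-propagation loop above, which repeats until the loss value no longer decreases, can be sketched as a generic early-stopping loop; `step_fn` and the sample loss schedule are illustrative assumptions standing in for one pass of synthesis plus loss computation and parameter adjustment:

```python
def train_until_plateau(step_fn, max_epochs=100):
    """Repeat training steps until the loss stops decreasing, as described."""
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        loss = step_fn(epoch)
        if loss >= prev_loss:  # loss no longer reduced: stop training
            break
        prev_loss = loss
    return prev_loss

# Hypothetical loss schedule: decreases, then plateaus at the third epoch.
losses = [5.0, 3.0, 2.0, 2.0, 2.0]
print(train_until_plateau(lambda e: losses[e]))
```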
In detail, the speech synthesis module 104 is specifically configured to:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic characteristics by using a preset vocoder to obtain language voice.
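The four-stage synthesis flow (encoder, attention-based decoder, post-processing network, vocoder) is another function chain. In the sketch below, the lambda stages are illustrative assumptions rather than the actual end-to-end TTS layers:

```python
def synthesize(phoneme_text, encode, decode, postnet, vocoder):
    """End-to-end TTS as described: sequence representation, initial
    acoustic features, corrected features, then waveform."""
    feature_seq = encode(phoneme_text)    # encoder: sequence representation
    initial_mel = decode(feature_seq)     # decoder: initial acoustic features
    standard_mel = postnet(initial_mel)   # post-net: corrected features
    return vocoder(standard_mel)          # vocoder: inverse decoding to audio

# Toy stages: integer "features", a residual correction, a fake waveform.
encode = lambda text: [ord(c) % 8 for c in text]
decode = lambda seq: [x * 1.0 for x in seq]
postnet = lambda mel: [x + 0.5 for x in mel]   # residual correction
vocoder = lambda mel: [x / 10.0 for x in mel]  # "waveform" samples

print(synthesize("ni", encode, decode, postnet, vocoder))
```

In a Tacotron-style system the decoder would be autoregressive with attention and the vocoder a separate neural network (e.g. Griffin-Lim or a trained WaveNet-like model); here each stage is collapsed to a pure function to show only the data flow.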
The embodiment of the invention can convert text in a low-resource language into a corresponding audio file. For example: given a passage of Shanghainese text, the method converts the Shanghainese text into Shanghainese phoneme information, translates the Shanghainese phoneme information into pinyin information through the translation model, and converts the obtained pinyin information into an audio file through the speech synthesis model.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a speech synthesis method based on a low-resource language according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a low resource language based speech synthesis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the low-resource language-based speech synthesis program 12, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., a low-resource language-based speech synthesis program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores a low-resource language based speech synthesis program 12 that is a combination of instructions that, when executed in the processor 10, enable:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text;
translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and carrying out voice synthesis on the high-resource language phoneme text by utilizing a pre-trained voice synthesis model to obtain language voice.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 3, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text;
translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and carrying out voice synthesis on the high-resource language phoneme text by utilizing a pre-trained voice synthesis model to obtain language voice.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names, not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for low-resource language-based speech synthesis, the method comprising:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text; translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and carrying out voice synthesis on the high-resource language phoneme text by utilizing a pre-trained voice synthesis model to obtain language voice.
2. The method of low-resource language based speech synthesis according to claim 1, wherein said converting the low-resource language text into low-resource language phoneme text comprises:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
3. The method for low-resource language based speech synthesis according to claim 1, wherein before the using the translation model trained based on dual learning to translate the low-resource language phoneme text into the phoneme corresponding to the high-resource language, the method further comprises:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model by utilizing the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model.
4. The method for low-resource language-based speech synthesis according to claim 3, wherein the training of the translation model by using the phoneme text set and the pre-constructed reverse translation model based on dual learning to obtain a trained translation model comprises:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
adjusting parameters of the trained translation model and the reverse translation model, and repeatedly executing the training steps of the translation model and the reverse translation model until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
5. The method for synthesizing speech based on low resource language according to claim 1, wherein said translating the low resource language phoneme text into the phoneme corresponding to the high resource language by using the translation model trained based on dual learning to obtain the high resource language phoneme text comprises:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
6. The method according to any one of claims 1 to 5, wherein the performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain the language speech comprises:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic characteristics by using a preset vocoder to obtain language voice.
7. The method of claim 6, wherein before the pre-trained speech synthesis model is used to perform speech synthesis on the high-resource language phoneme text to obtain the language speech, the method further comprises:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
carrying out voice synthesis on the training data set through a pre-constructed voice synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back propagation parameter adjustment on the voice synthesis model according to the loss value, and returning to the step of performing voice synthesis on the training data set through the voice synthesis model until the loss value is not reduced any more, so as to obtain the trained voice synthesis model.
8. An apparatus for low-resource language based speech synthesis, the apparatus comprising:
the text acquisition module is used for determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
a phoneme obtaining module, configured to convert the low-resource language text into a low-resource language phoneme text;
the phoneme mapping module is used for translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and the speech synthesis module is used for carrying out speech synthesis on the high-resource language phoneme text by utilizing a pre-trained speech synthesis model to obtain language speech.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a low-resource language based speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, implements a low-resource language based speech synthesis method according to any one of claims 1 to 7.
CN202110441988.5A 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language Pending CN113160793A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441988.5A CN113160793A (en) 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441988.5A CN113160793A (en) 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language

Publications (1)

Publication Number Publication Date
CN113160793A true CN113160793A (en) 2021-07-23

Family

ID=76869827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441988.5A Pending CN113160793A (en) 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language

Country Status (1)

Country Link
CN (1) CN113160793A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673261A (en) * 2021-09-07 2021-11-19 北京小米移动软件有限公司 Data generation method and device and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133705A (en) * 2017-12-21 2018-06-08 儒安科技有限公司 Speech recognition and phonetic synthesis model training method based on paired-associate learning
CN108447486A (en) * 2018-02-28 2018-08-24 科大讯飞股份有限公司 A kind of voice translation method and device
US20180307679A1 (en) * 2017-04-23 2018-10-25 Voicebox Technologies Corporation Multi-lingual semantic parser based on transferred learning
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN109887484A (en) * 2019-02-22 2019-06-14 平安科技(深圳)有限公司 A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN110765784A (en) * 2019-09-12 2020-02-07 内蒙古工业大学 Mongolian Chinese machine translation method based on dual learning
CN111144140A (en) * 2019-12-23 2020-05-12 语联网(武汉)信息技术有限公司 Zero-learning-based Chinese and Tai bilingual corpus generation method and device
CN111178097A (en) * 2019-12-24 2020-05-19 语联网(武汉)信息技术有限公司 Method and device for generating Chinese and Tai bilingual corpus based on multi-level translation model
CN111369974A (en) * 2020-03-11 2020-07-03 北京声智科技有限公司 Dialect pronunciation labeling method, language identification method and related device
CN112131368A (en) * 2020-09-27 2020-12-25 平安国际智慧城市科技股份有限公司 Dialog generation method and device, electronic equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
苏依拉等: ""基于对偶学习的西里古尔蒙古语-汉语机器翻译研究"", 《计算机应用与软件》, vol. 37, no. 1, 12 January 2020 (2020-01-12), pages 172 - 178 *


Similar Documents

Publication Publication Date Title
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN109271631B (en) Word segmentation method, device, equipment and storage medium
WO2020186778A1 (en) Error word correction method and device, computer device, and storage medium
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
CN110288972B (en) Speech synthesis model training method, speech synthesis method and device
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
WO2022227190A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111241853B (en) Session translation method, device, storage medium and terminal equipment
CN112820269A (en) Text-to-speech method, device, electronic equipment and storage medium
WO2022121158A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN113434642B (en) Text abstract generation method and device and electronic equipment
CN113870835A (en) Speech synthesis method, apparatus, device and storage medium based on artificial intelligence
CN115101042A (en) Text processing method, device and equipment
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination