CN113160793A - Speech synthesis method, device, equipment and storage medium based on low resource language - Google Patents


Info

Publication number
CN113160793A
Authority
CN
China
Prior art keywords
low resource language
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110441988.5A
Other languages
Chinese (zh)
Inventor
孙奥兰
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110441988.5A priority Critical patent/CN113160793A/en
Publication of CN113160793A publication Critical patent/CN113160793A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/40: Processing or translation of natural language
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Acoustics & Sound (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a speech synthesis method based on a low-resource language, comprising the following steps: determining a low-resource language and a high-resource language, and acquiring text corresponding to the low-resource language to obtain a low-resource language text; converting the low-resource language text into a low-resource language phoneme text; translating the low-resource language phoneme text into a high-resource language phoneme text by using a translation model trained on the basis of dual learning; and performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain speech. The invention also relates to blockchain technology: the low-resource language text can be stored in nodes of a blockchain. The invention further provides a speech synthesis apparatus based on a low-resource language, an electronic device, and a computer-readable storage medium. The invention provides a speech synthesis method aimed at low-resource languages that improves the speech synthesis effect.

Description

Speech synthesis method, device, equipment and storage medium based on low resource language
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a speech synthesis method and apparatus based on a low-resource language, an electronic device, and a computer-readable storage medium.
Background
In recent years, text-to-speech (TTS) technology has developed rapidly and has received wide attention from both academia and industry. Demand for speech synthesis is growing across many industries; for example, the customer service industry can use speech synthesis to provide self-service voice support.
There are roughly 6,000 languages in the world, yet traditional speech synthesis methods cover only a few dozen of them. The underlying reason is the difficulty of data set development: building a data set for a new language usually requires hiring professional voice actors to record a large amount of high-quality speech for that language, which is then used as training data for models. For many low-resource languages, such as dialects and other minority languages, there is often no International Phonetic Alphabet transcription, or even no phonetic notation at all; hiring linguists to document and analyze the language's pronunciation and phoneme tones consumes a great deal of money and time, so training data sets are scarce. Directly applying a traditional speech synthesis model to a low-resource language therefore yields poor results, and no speech synthesis method designed specifically for low-resource languages currently exists.
Disclosure of Invention
The invention provides a speech synthesis method and apparatus based on a low-resource language and a computer-readable storage medium, with the main aim of providing a speech synthesis method aimed at low-resource languages that improves the speech synthesis effect.
In order to achieve the above object, the present invention provides a speech synthesis method based on low resource language, comprising:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text;
translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain speech.
Optionally, the converting the low-resource language text into a low-resource language phoneme text includes:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
Optionally, before translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using the translation model trained on the basis of dual learning, the method further includes:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model on the basis of dual learning by using the phoneme text set and a pre-constructed reverse translation model, so as to obtain a trained translation model.
Optionally, the training of the translation model on the basis of dual learning by using the phoneme text set and the pre-constructed reverse translation model to obtain a trained translation model includes:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
and adjusting the parameters of the translation model and the reverse translation model, and repeating the training steps of the two models until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
Optionally, the translating the low-resource language phoneme text into a phoneme corresponding to the high-resource language by using a translation model trained based on dual learning to obtain a high-resource language phoneme text includes:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
Optionally, the performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a language speech includes:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic features by using a preset vocoder to obtain speech.
Optionally, before performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a language speech, the method further includes:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
performing speech synthesis on the training data set through a pre-constructed speech synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back-propagation parameter adjustment on the speech synthesis model according to the loss value, and returning to the step of performing speech synthesis on the training data set through the speech synthesis model until the loss value no longer decreases, so as to obtain the trained speech synthesis model.
In order to solve the above problem, the present invention further provides a speech synthesis apparatus based on a low-resource language, the apparatus comprising:
the text acquisition module is used for determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
a phoneme obtaining module, configured to convert the low-resource language text into a low-resource language phoneme text;
the phoneme mapping module is used for translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and the speech synthesis module is used for carrying out speech synthesis on the high-resource language phoneme text by utilizing a pre-trained speech synthesis model to obtain language speech.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
and a processor that executes the instructions stored in the memory to implement the low-resource language based speech synthesis method described above.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, which is executed by a processor in an electronic device to implement the low-resource language based speech synthesis method described above.
By converting the low-resource language phoneme text into a high-resource language phoneme text, the embodiment of the invention greatly reduces the demand on the low-resource language's data set and can realize speech synthesis for the low-resource language by reusing the speech synthesis model of the high-resource language. Meanwhile, training the translation model with dual learning overcomes the current limitation that phoneme mappings exist only for the International Phonetic Alphabet, shortens the training time of the translation model, and improves its stability and accuracy. Therefore, the speech synthesis method, apparatus, electronic device, and computer-readable storage medium based on a low-resource language provided by the invention offer a speech synthesis method aimed at low-resource languages and improve the speech synthesis effect.
Drawings
FIG. 1 is a flowchart illustrating a low-resource language-based speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a low-resource language-based speech synthesis apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device implementing the low-resource language-based speech synthesis method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a speech synthesis method based on a low-resource language. The execution subject of the low-resource language-based speech synthesis method includes, but is not limited to, at least one of the electronic devices, such as a server and a terminal, which can be configured to execute the method provided by the embodiments of the present application. In other words, the low-resource language based speech synthesis method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Fig. 1 is a schematic flow chart of a speech synthesis method based on low-resource language according to an embodiment of the present invention. In this embodiment, the method for synthesizing speech based on low-resource language includes:
s1, determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text.
In the embodiment of the present invention, the low-resource language text is text composed in a low-resource language, that is, a language with little available language data, including dialects and minority languages such as Shanghainese.
The high-resource language is the counterpart of the low-resource language, chosen so that its phonemes are highly similar to those of the low-resource language; Mandarin and Shanghainese are one such pair.
In detail, determining the low-resource language and the high-resource language means selecting a language as the low-resource language according to the actual business scenario, and then selecting, from the languages with abundant data resources, the one most similar to the low-resource language as the high-resource language.
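The similarity-based selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: the phoneme inventories below are hypothetical toy sets, and Jaccard overlap is only one plausible similarity measure, not one the patent specifies.

```python
# Hypothetical sketch: among candidate high-resource languages, pick the one
# whose phoneme inventory overlaps most with the low-resource language's.
# The inventories are illustrative toy data, not real phoneme sets.

def jaccard(a, b):
    """Jaccard similarity between two phoneme sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pick_high_resource(low_inventory, candidates):
    """candidates: dict mapping language name -> phoneme inventory."""
    return max(candidates, key=lambda lang: jaccard(low_inventory, candidates[lang]))

shanghainese = {"p", "t", "k", "h", "a", "o", "u", "ng"}
candidates = {
    "mandarin": {"p", "t", "k", "h", "a", "o", "u", "ng", "sh"},
    "english":  {"p", "t", "k", "h", "ae", "ih", "uh", "th"},
}
print(pick_high_resource(shanghainese, candidates))  # -> mandarin
```

With these toy inventories, Mandarin shares eight of nine phonemes with the Shanghainese set and is selected.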
Further, in the embodiment of the present invention, the low-resource language text may be acquired from a preset database. To further emphasize the privacy and security of the low resource language text, the low resource language text may also be obtained from a node of a blockchain.
And S2, converting the low resource language text into a low resource language phoneme text.
Phonemes are the smallest phonetic units, divided according to the natural attributes of speech; they are analyzed from the articulatory actions within a syllable, one action constituting one phoneme. For example, the Mandarin word putonghua ("Mandarin") consists of three syllables and can be split into the eight phonemes p, u, t, o, ng, h, u, a. A computer can synthesize speech from such phoneme text.
In detail, the converting the low-resource language text into a low-resource language phoneme text includes:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
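The two conversion steps above can be sketched as a small grapheme-to-phoneme routine. The lexicon and the phoneme inventory in the regular expression are illustrative assumptions; a real system would use a pronunciation dictionary for the target low-resource language. The split reproduces the eight-phoneme putonghua example earlier in this section.

```python
import re

# Toy lexicon: character -> romanized pronunciation (hypothetical mapping,
# for illustration only).
LEXICON = {"普": "pu", "通": "tong", "话": "hua"}

# Multi-letter phonemes must be tried before the single-letter fallback.
PHONEME_RE = re.compile(r"ng|zh|ch|sh|[a-z]")

def pronunciation_to_phonemes(pron):
    """Step 2: split one romanized syllable into its phonemes."""
    return PHONEME_RE.findall(pron)

def text_to_phoneme_text(text):
    phonemes = []
    for ch in text:
        pron = LEXICON[ch]                                # step 1: char -> pronunciation
        phonemes.extend(pronunciation_to_phonemes(pron))  # step 2: split
    return " ".join(phonemes)

print(text_to_phoneme_text("普通话"))  # -> p u t o ng h u a
```

Note how the regex keeps ng as a single phoneme, matching the p, u, t, o, ng, h, u, a split given above.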
And S3, translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text.
In the embodiment of the invention, the translation model takes the phoneme text of the low-resource language as input and outputs the phoneme text of the high-resource language; the effect of the resulting high-resource phoneme text is similar to reading Shanghainese with Mandarin pronunciation.
Optionally, before the translation model trained on the basis of dual learning is used to translate the low-resource language phoneme text into phonemes corresponding to the high-resource language to obtain the high-resource language phoneme text, the method further includes:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model.
Here, the audio files are audio in the low-resource language. The speech recognition model is built from a CNN with CTC (Connectionist Temporal Classification) as its loss function, and can accurately convert an audio file into phoneme text of the low-resource language. The reverse translation model is the counterpart of the translation model: it takes high-resource language phoneme text as input and outputs the corresponding low-resource language phoneme text.
Further, the converting the audio file into low resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set includes:
encoding the audio file and extracting features by using the encoding layer of the speech recognition model to obtain speech features;
decoding and matching the speech features by using the decoding layer of the speech recognition model to obtain a language text set;
and performing phoneme conversion on the language text set by using the post-processing network of the speech recognition model to obtain the phoneme text set.
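Since the recognition model is trained with a CTC loss, its decoding layer typically collapses a per-frame label sequence into the final phoneme sequence. The sketch below shows greedy CTC collapsing only; the per-frame labels stand in for the CNN encoder's output and are hypothetical.

```python
# Greedy CTC decoding sketch: collapse repeated labels and drop the blank
# symbol. The frame labels below are an illustrative stand-in for the
# per-frame argmax of a CNN acoustic model's output.

BLANK = "-"

def ctc_greedy_collapse(frame_labels):
    """Collapse a per-frame best-path label sequence into phoneme output."""
    out, prev = [], None
    for lab in frame_labels:
        if lab != prev and lab != BLANK:
            out.append(lab)
        prev = lab
    return out

# Hypothetical per-frame labels for a short low-resource-language utterance:
frames = ["n", "n", "-", "o", "ng", "ng", "-", "h", "a", "a", "u"]
print(ctc_greedy_collapse(frames))  # -> ['n', 'o', 'ng', 'h', 'a', 'u']
```

The blank symbol lets CTC emit the same phoneme twice in a row: a blank between two identical labels prevents them from being merged.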
Further, the training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model includes:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
and adjusting the parameters of the translation model and the reverse translation model, and repeating the training steps of the two models until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
Here, the stop condition is that the likelihood probability P_f output by the forward translation model and the likelihood probability P_b output by the backward translation model are equal. The likelihood probability is the largest of the probability values produced by the activation function in the model.
In practice, some low-resource language phonemes may have no matching high-resource language phoneme. For example, depending on the speaker's accent, Shanghainese monophthongs include ten to twelve distinct main-vowel (vowel nucleus) phonemes, whereas the basic Latin alphabet offers only the vowels a, e, i, o, u; a complete one-to-one correspondence is impossible, so some monophthong phonemes must be represented in other ways. The embodiment of the invention trains the translation model on the basis of dual learning, which means starting from any sentence of monolingual data, translating it into the other language, and then translating it back to the original language. For example, a sentence in language A is translated into a sentence in language B by translation model X and sent to translation model Y; model Y translates the received language-B sentence back into language A and returns it to model X; translation quality improves over multiple such iterations.
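The round-trip loop with its equal-likelihood stop condition can be sketched as below. This is a minimal runnable illustration: the ToyTranslator class, its drift-toward-the-dual update rule, and the tolerance threshold are all illustrative assumptions, not the patent's actual neural models or training objective.

```python
class ToyTranslator:
    """Stand-in translator; a real system would use a neural seq2seq model.
    Its 'likelihood' drifts toward the dual model's during updates."""
    def __init__(self, likelihood):
        self.likelihood = likelihood

    def translate(self, batch):
        # Return a dummy translation plus the current likelihood probability.
        return [f"<{tok}>" for tok in batch], self.likelihood

    def update(self, dual_likelihood):
        # Parameter adjustment: move part-way toward the dual's likelihood.
        self.likelihood += 0.5 * (dual_likelihood - self.likelihood)

def train(forward, backward, batches, tolerance=1e-3, max_rounds=100):
    """Dual-learning loop: A -> B -> A round trips until P_f and P_b agree."""
    for _ in range(max_rounds):
        for batch in batches:
            high, p_f = forward.translate(batch)    # A -> B
            _back, p_b = backward.translate(high)   # B -> A
            forward.update(p_b)                     # dual feedback
            backward.update(p_f)
        if abs(p_f - p_b) < tolerance:              # stop: P_f equals P_b
            break
    return forward, backward

fwd, bwd = train(ToyTranslator(0.9), ToyTranslator(0.5), [["nong", "hau"]])
print(round(fwd.likelihood, 3), round(bwd.likelihood, 3))  # -> 0.7 0.7
```

The two toy likelihoods converge to the same value, mirroring the stop condition that P_f and P_b become equal.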
In detail, the translating the low-resource language phoneme text into a phoneme corresponding to the high-resource language by using a translation model trained based on dual learning to obtain a high-resource language phoneme text includes:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
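The encode-then-decode inference steps above can be sketched with stand-in components: the "encoder" here is a token-index lookup and the "decoder" a phoneme-mapping table in place of the trained network. The Shanghainese-to-pinyin phoneme pairs are hypothetical.

```python
# Hypothetical low-resource (Shanghainese-like) -> high-resource (Mandarin
# pinyin-like) phoneme mapping standing in for the trained translation model.
PHONEME_MAP = {"nong": "nong", "hau": "hao"}

VOCAB = sorted(PHONEME_MAP)  # toy vocabulary shared by encoder and decoder

def encode(phoneme_text):
    """Feature extraction: phoneme tokens -> vector of vocabulary indices."""
    return [VOCAB.index(tok) for tok in phoneme_text.split()]

def decode(encoding):
    """Decoding: index vector -> high-resource language phoneme text."""
    return " ".join(PHONEME_MAP[VOCAB[i]] for i in encoding)

print(decode(encode("nong hau")))  # -> nong hao
```

In the real model the encoding vector is a learned continuous representation rather than integer indices; the table lookup simply makes the two-step flow concrete.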
And S4, carrying out voice synthesis on the high-resource language phoneme text by using the pre-trained voice synthesis model to obtain language voice.
The speech synthesis model in the embodiment of the invention is an integrated end-to-end TTS model comprising an encoder, an attention-based decoder, and a post-processing network; it takes a character sequence as input and outputs the corresponding spectrum, i.e., the acoustic features.
Optionally, before performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a low-resource language speech, the method further includes:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
performing voice synthesis on the training data set through a pre-constructed voice synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back-propagation parameter adjustment on the speech synthesis model according to the loss value, and returning to the step of performing speech synthesis on the training data set through the speech synthesis model until the loss value no longer decreases, so as to obtain the trained speech synthesis model.
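The train-until-the-loss-stops-decreasing loop above can be sketched as follows. The one-parameter "model" and squared-error loss are illustrative stand-ins for a real TTS network and its preset loss function.

```python
def train_until_plateau(model_step, max_epochs=1000):
    """Run synthesize/loss/backprop cycles; stop when loss stops decreasing."""
    prev_loss = float("inf")
    for _ in range(max_epochs):
        loss = model_step()
        if loss >= prev_loss:       # loss no longer decreases -> trained
            break
        prev_loss = loss
    return prev_loss

def make_toy_step(target=3.0, lr=0.4):
    """Stand-in for one training cycle: a one-parameter model fit with a
    squared-error 'preset loss function' and a gradient parameter update."""
    state = {"w": 0.0}
    def step():
        error = state["w"] - target
        loss = error * error            # compute the loss value
        state["w"] -= lr * 2 * error    # back-propagation parameter adjustment
        return loss
    return step

final_loss = train_until_plateau(make_toy_step())
print(final_loss < 1e-6)  # -> True
```

In practice a validation loss with a patience window is usually preferred to a strict "no longer decreasing" test, but the sketch follows the text's stop rule literally.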
In detail, the performing speech synthesis on the high-resource language phoneme text by using the pre-trained speech synthesis model to obtain the low-resource language speech includes:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic features by using a preset vocoder to obtain speech.
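The four synthesis steps above can be sketched end to end with stand-in components. Here the "acoustic feature" is just a list of (frequency, duration) pairs and the "vocoder" renders them as sine tones; real systems use spectrograms and a neural or Griffin-Lim vocoder, and the per-phoneme frequencies below are arbitrary illustrative values.

```python
import math

SAMPLE_RATE = 16000

def encode_phonemes(phoneme_text):
    """Encoder: phoneme text -> feature sequence (here, the token list)."""
    return phoneme_text.split()

def decode_to_acoustics(features):
    """Decoder: map each token to an initial (freq_hz, seconds) feature."""
    return [(200.0 + 10.0 * len(tok), 0.1) for tok in features]

def postprocess(acoustics):
    """Post-processing network: 'correct' features (here, cap durations)."""
    return [(f, min(d, 0.2)) for f, d in acoustics]

def vocoder(acoustics):
    """Vocoder: inverse-decode acoustic features into waveform samples."""
    samples = []
    for freq, dur in acoustics:
        n = int(SAMPLE_RATE * dur)
        samples.extend(math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
                       for i in range(n))
    return samples

wave = vocoder(postprocess(decode_to_acoustics(encode_phonemes("n i h ao"))))
print(len(wave))  # -> 6400 (4 phonemes x 0.1 s x 16000 Hz)
```

Writing `wave` to a WAV container would complete the text-to-audio-file path described next.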
The embodiment of the invention can convert text in a low-resource language into a corresponding audio file. For example, given a passage of Shanghainese text, the method converts it into Shanghainese phoneme information, translates that phoneme information into pinyin information through the translation model, and then converts the resulting pinyin information into an audio file through the speech synthesis model.
By converting the low-resource language phoneme text into a high-resource language phoneme text, the embodiment of the invention greatly reduces the demand on the low-resource language's data set and can realize speech synthesis for the low-resource language by reusing the speech synthesis model of the high-resource language. Meanwhile, training the translation model with dual learning overcomes the current limitation that phoneme mappings exist only for the International Phonetic Alphabet, shortens the training time of the translation model, and improves its stability and accuracy. Therefore, the speech synthesis method, apparatus, electronic device, and computer-readable storage medium based on a low-resource language provided by the invention offer a speech synthesis method aimed at low-resource languages and improve the speech synthesis effect.
Fig. 2 is a functional block diagram of a speech synthesis apparatus based on low-resource language according to an embodiment of the present invention.
The low-resource language based speech synthesis apparatus 100 of the present invention can be installed in an electronic device. Depending on the implemented functions, the low-resource language-based speech synthesis apparatus 100 may include a text acquisition module 101, a phoneme acquisition module 102, a phoneme mapping module 103, and a speech synthesis module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the text obtaining module 101 is configured to determine a low resource language and a high resource language, and obtain a text corresponding to the low resource language to obtain a low resource language text.
In the embodiment of the present invention, the low-resource language text is text composed in a low-resource language, that is, a language with little available language data, including dialects and minority languages such as Shanghainese.
The high-resource language is the counterpart of the low-resource language, chosen so that its phonemes are highly similar to those of the low-resource language; Mandarin and Shanghainese are one such pair.
In detail, determining the low-resource language and the high-resource language means selecting a language as the low-resource language according to the actual business scenario, and then selecting, from the languages with abundant data resources, the one most similar to the low-resource language as the high-resource language.
Further, in the embodiment of the present invention, the low-resource language text may be acquired from a preset database. To further emphasize the privacy and security of the low resource language text, the low resource language text may also be obtained from a node of a blockchain.
The phoneme obtaining module 102 is configured to convert the low-resource language text into a low-resource language phoneme text.
Phonemes are the smallest phonetic units, divided according to the natural attributes of speech; they are analyzed from the articulatory actions within a syllable, one action constituting one phoneme. For example, the Mandarin word putonghua ("Mandarin") consists of three syllables and can be split into the eight phonemes p, u, t, o, ng, h, u, a. A computer can synthesize speech from such phoneme text.
In detail, the phoneme obtaining module 102 is specifically configured to:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
The phoneme mapping module 103 is configured to use a translation model after training based on dual learning to translate the low-resource language phoneme text into phonemes corresponding to the high-resource language, so as to obtain a high-resource language phoneme text.
In the embodiment of the invention, the translation model takes the phoneme text of the low-resource language as input and outputs the phoneme text of the high-resource language; the effect of the resulting high-resource phoneme text is similar to reading Shanghainese with Mandarin pronunciation.
Optionally, before the translation model trained on the basis of dual learning is used to translate the low-resource language phoneme text into phonemes corresponding to the high-resource language to obtain the high-resource language phoneme text, the method further includes:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model.
Here, the audio files are audio in the low-resource language. The speech recognition model is built from a CNN with CTC (Connectionist Temporal Classification) as its loss function, and can accurately convert an audio file into phoneme text of the low-resource language. The reverse translation model is the counterpart of the translation model: it takes high-resource language phoneme text as input and outputs the corresponding low-resource language phoneme text.
Further, the converting the audio file into low resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set includes:
coding and feature extraction are carried out on the audio file by utilizing a coding layer of the voice recognition model, so as to obtain voice features;
decoding and matching the voice characteristics by utilizing a decoding layer of the voice recognition model to obtain a language text set;
and performing phoneme conversion on the language text set by utilizing a post-processing network of the speech recognition model to obtain a phoneme text set.
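The three-stage recognition flow above (coding layer, decoding layer, post-processing network) amounts to a function composition. The sketch below illustrates this; the toy encoder, decoder, and lexicon-based post-processor are hypothetical stand-ins for the trained layers:

```python
def recognize(audio_frames, encode, decode, postprocess):
    """Speech recognition as the three described stages: encoding and
    feature extraction, decoding and matching, then phoneme conversion."""
    speech_features = encode(audio_frames)   # coding layer
    language_text = decode(speech_features)  # decoding layer
    return postprocess(language_text)        # post-processing network

# Toy stand-ins: features are frame averages, decoding maps them to words,
# post-processing looks up each word's phonemes in a small lexicon.
lexicon = {"nong": ["n", "o", "ng"], "hau": ["h", "au"]}
encode = lambda frames: [sum(f) / len(f) for f in frames]
decode = lambda feats: ["nong", "hau"][: len(feats)]
postprocess = lambda words: [p for w in words for p in lexicon[w]]

print(recognize([[0.1, 0.3], [0.2, 0.4]], encode, decode, postprocess))
```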
Further, the training a translation model by using the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model includes:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
adjusting parameters of the trained translation model and the reverse translation model, and repeatedly executing the training steps of the translation model and the reverse translation model until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
Wherein the stop condition is that the likelihood probability P_f output by the forward translation model is equal to the likelihood probability P_b output by the backward translation model. The likelihood probability is the maximum of the probability values obtained from the activation function of the model.
In practice, some low-resource language phonemes may have no matching high-resource language phoneme. For example, depending on the speaker's accent, Shanghainese monophthongs have ten to twelve different main-vowel (vowel nucleus) phonemes, whereas the conventional Latin alphabet provides only the vowels a, e, i, o, and u; a complete correspondence is therefore impossible, and some of these phonemes must be represented in other ways. The embodiment of the invention trains the translation model based on dual learning, which means that, starting from any sentence of monolingual data, the sentence is first translated into another language and then translated back into the original language. For example, a sentence in language A is translated by translation model X into a sentence in language B and sent to translation model Y; translation model Y then translates the received language-B sentence back into a language-A sentence and returns it to translation model X. Translation quality is improved through multiple such iterations.
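The dual-learning round trip described above (language A to B via model X, then B back to A via model Y, with a reconstruction signal driving both models) can be sketched as follows; the dictionary "models" and phoneme tokens are hypothetical placeholders for the real neural translators:

```python
def dual_round(sentence, forward, backward):
    """One dual-learning iteration: translate forward, translate back,
    and score how well the round trip reconstructs the input."""
    translated = [forward[tok] for tok in sentence]        # model X: A -> B
    reconstructed = [backward[tok] for tok in translated]  # model Y: B -> A
    matches = sum(a == b for a, b in zip(sentence, reconstructed))
    reward = matches / len(sentence)  # reconstruction reward for both models
    return translated, reconstructed, reward

# Toy phoneme mappings standing in for model X (forward) and model Y (backward).
forward = {"ng": "n", "au": "ao"}
backward = {"n": "ng", "ao": "au"}
print(dual_round(["ng", "au"], forward, backward))
```

In a real system the reward would be a likelihood rather than an exact-match ratio, and it would be back-propagated to update both models' parameters.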
In detail, the phoneme mapping module 103 is specifically configured to:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
The speech synthesis module 104 is configured to perform speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a language speech.
The speech synthesis model in the embodiment of the invention is an integrated end-to-end TTS model, takes a character sequence as input and outputs corresponding frequency spectrum, namely acoustic characteristics, and comprises an encoder, an attention mechanism-based decoder and a post-processing network.
Optionally, before performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain a low-resource language speech, the method further includes:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
performing voice synthesis on the training data set through a pre-constructed voice synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back propagation parameter adjustment on the voice synthesis model according to the loss value, and returning to the step of performing voice synthesis on the training data set through the voice synthesis model until the loss value is not reduced any more, so as to obtain the trained voice synthesis model.
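The back-propagation loop above, which repeats until the loss value no longer decreases, can be sketched as a generic early-stopping loop; `step_fn` and the sample loss schedule are illustrative assumptions standing in for one pass of synthesis plus loss computation and parameter adjustment:

```python
def train_until_plateau(step_fn, max_epochs=100):
    """Repeat training steps until the loss stops decreasing, as described."""
    prev_loss = float("inf")
    for epoch in range(max_epochs):
        loss = step_fn(epoch)
        if loss >= prev_loss:  # loss no longer reduced: stop training
            break
        prev_loss = loss
    return prev_loss

# Hypothetical loss schedule: decreases, then plateaus at the third epoch.
losses = [5.0, 3.0, 2.0, 2.0, 2.0]
print(train_until_plateau(lambda e: losses[e]))
```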
In detail, the speech synthesis module 104 is specifically configured to:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic characteristics by using a preset vocoder to obtain language voice.
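The four-stage synthesis flow (encoder, attention-based decoder, post-processing network, vocoder) is another function chain. In the sketch below, the lambda stages are illustrative assumptions rather than the actual end-to-end TTS layers:

```python
def synthesize(phoneme_text, encode, decode, postnet, vocoder):
    """End-to-end TTS as described: sequence representation, initial
    acoustic features, corrected features, then waveform."""
    feature_seq = encode(phoneme_text)    # encoder: sequence representation
    initial_mel = decode(feature_seq)     # decoder: initial acoustic features
    standard_mel = postnet(initial_mel)   # post-net: corrected features
    return vocoder(standard_mel)          # vocoder: inverse decoding to audio

# Toy stages: integer "features", a residual correction, a fake waveform.
encode = lambda text: [ord(c) % 8 for c in text]
decode = lambda seq: [x * 1.0 for x in seq]
postnet = lambda mel: [x + 0.5 for x in mel]   # residual correction
vocoder = lambda mel: [x / 10.0 for x in mel]  # "waveform" samples

print(synthesize("ni", encode, decode, postnet, vocoder))
```

In a Tacotron-style system the decoder would be autoregressive with attention and the vocoder a separate neural network (e.g. Griffin-Lim or a trained WaveNet-like model); here each stage is collapsed to a pure function to show only the data flow.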
The embodiment of the invention can convert text in a low-resource language into a corresponding audio file. For example: given a passage of Shanghainese text, the method converts the Shanghainese text into Shanghainese phoneme information, translates the Shanghainese phoneme information into pinyin information through the translation model, and converts the obtained pinyin information into an audio file through the speech synthesis model.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a speech synthesis method based on a low-resource language according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a low resource language based speech synthesis program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of the low-resource language-based speech synthesis program 12, but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by running or executing programs or modules (e.g., a low-resource language-based speech synthesis program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The memory 11 in the electronic device 1 stores a low-resource language based speech synthesis program 12 that is a combination of instructions that, when executed in the processor 10, enable:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text;
translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and carrying out voice synthesis on the high-resource language phoneme text by utilizing a pre-trained voice synthesis model to obtain language voice.
Specifically, the specific implementation method of the processor 10 for the instruction may refer to the description of the relevant steps in the embodiments corresponding to fig. 1 to fig. 3, which is not repeated herein.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text;
translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and carrying out voice synthesis on the high-resource language phoneme text by utilizing a pre-trained voice synthesis model to obtain language voice.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means through software or hardware. Terms such as first and second are used to denote names, not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method for low-resource language-based speech synthesis, the method comprising:
determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
converting the low-resource language text into a low-resource language phoneme text; translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and carrying out voice synthesis on the high-resource language phoneme text by utilizing a pre-trained voice synthesis model to obtain language voice.
2. The method of low-resource language based speech synthesis according to claim 1, wherein said converting the low-resource language text into low-resource language phoneme text comprises:
determining a low-resource language pronunciation corresponding to each character in the low-resource language text;
and splitting the low-resource language pronunciation into low-resource language phonemes to obtain a low-resource language phoneme text.
3. The method for low-resource language based speech synthesis according to claim 1, wherein before the using the translation model trained based on dual learning to translate the low-resource language phoneme text into the phoneme corresponding to the high-resource language, the method further comprises:
collecting audio files, and converting the audio files into low-resource language phonemes by using a pre-trained speech recognition model to obtain a phoneme text set;
and training a translation model by utilizing the phoneme text set and a pre-constructed reverse translation model based on dual learning to obtain a trained translation model.
4. The method for low-resource language-based speech synthesis according to claim 3, wherein the training of the translation model by using the phoneme text set and the pre-constructed reverse translation model based on dual learning to obtain a trained translation model comprises:
training the translation model by using the phoneme text set to obtain a high-resource phoneme text set output by the translation model and a corresponding likelihood probability P_f;
training the reverse translation model by using the high-resource phoneme text set to obtain a low-resource phoneme text set output by the reverse translation model and a corresponding likelihood probability P_b;
adjusting parameters of the trained translation model and the reverse translation model, and repeatedly executing the training steps of the translation model and the reverse translation model until the likelihood probability P_f and the likelihood probability P_b meet a preset stop condition, so as to obtain the trained translation model.
5. The method for synthesizing speech based on low resource language according to claim 1, wherein said translating the low resource language phoneme text into the phoneme corresponding to the high resource language by using the translation model trained based on dual learning to obtain the high resource language phoneme text comprises:
performing feature extraction on the low-resource language phoneme text by using an encoder of the translation model to obtain an encoding vector;
and decoding the coding vector by using a decoder of the translation model to obtain a high-resource language phoneme text.
6. The method according to any one of claims 1 to 5, wherein the performing speech synthesis on the high-resource language phoneme text by using a pre-trained speech synthesis model to obtain the language speech comprises:
extracting sequence representation of the high-resource language phoneme text through an encoder of the speech synthesis model to obtain a characteristic sequence vector;
decoding and waveform synthesizing the characteristic sequence vector through a decoder of the voice synthesis model to obtain initial acoustic characteristics;
correcting the initial acoustic features through a post-processing network of the voice synthesis model to obtain standard acoustic features;
and performing inverse decoding on the standard acoustic characteristics by using a preset vocoder to obtain language voice.
7. The method of claim 6, wherein before the pre-trained speech synthesis model is used to perform speech synthesis on the high-resource language phoneme text to obtain the language speech, the method further comprises:
collecting a plurality of high resource language phoneme texts to obtain a training data set;
carrying out voice synthesis on the training data set through a pre-constructed voice synthesis model to obtain a training result;
calculating a loss value of the training result by using a preset loss function;
and performing back propagation parameter adjustment on the voice synthesis model according to the loss value, and returning to the step of performing voice synthesis on the training data set through the voice synthesis model until the loss value is not reduced any more, so as to obtain the trained voice synthesis model.
8. An apparatus for low-resource language based speech synthesis, the apparatus comprising:
the text acquisition module is used for determining a low resource language and a high resource language, and acquiring a text corresponding to the low resource language to obtain a low resource language text;
a phoneme obtaining module, configured to convert the low-resource language text into a low-resource language phoneme text;
the phoneme mapping module is used for translating the low-resource language phoneme text into phonemes corresponding to the high-resource language by using a translation model based on dual learning training to obtain a high-resource language phoneme text;
and the speech synthesis module is used for carrying out speech synthesis on the high-resource language phoneme text by utilizing a pre-trained speech synthesis model to obtain language speech.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a low-resource language based speech synthesis method according to any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, implements a low-resource language based speech synthesis method according to any one of claims 1 to 7.
CN202110441988.5A 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language Pending CN113160793A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441988.5A CN113160793A (en) 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441988.5A CN113160793A (en) 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language

Publications (1)

Publication Number Publication Date
CN113160793A true CN113160793A (en) 2021-07-23

Family

ID=76869827

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441988.5A Pending CN113160793A (en) 2021-04-23 2021-04-23 Speech synthesis method, device, equipment and storage medium based on low resource language

Country Status (1)

Country Link
CN (1) CN113160793A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113673261A (en) * 2021-09-07 2021-11-19 北京小米移动软件有限公司 Data generation method and device and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108133705A (en) * 2017-12-21 2018-06-08 儒安科技有限公司 Speech recognition and phonetic synthesis model training method based on paired-associate learning
CN108447486A (en) * 2018-02-28 2018-08-24 科大讯飞股份有限公司 A kind of voice translation method and device
US20180307679A1 (en) * 2017-04-23 2018-10-25 Voicebox Technologies Corporation Multi-lingual semantic parser based on transferred learning
CN108766414A (en) * 2018-06-29 2018-11-06 北京百度网讯科技有限公司 Method, apparatus, equipment and computer readable storage medium for voiced translation
CN109887484A (en) * 2019-02-22 2019-06-14 平安科技(深圳)有限公司 A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN110765784A (en) * 2019-09-12 2020-02-07 内蒙古工业大学 Mongolian Chinese machine translation method based on dual learning
CN111144140A (en) * 2019-12-23 2020-05-12 语联网(武汉)信息技术有限公司 Zero-learning-based Chinese and Tai bilingual corpus generation method and device
CN111178097A (en) * 2019-12-24 2020-05-19 语联网(武汉)信息技术有限公司 Method and device for generating Chinese and Tai bilingual corpus based on multi-level translation model
CN111369974A (en) * 2020-03-11 2020-07-03 北京声智科技有限公司 Dialect pronunciation labeling method, language identification method and related device
CN112131368A (en) * 2020-09-27 2020-12-25 平安国际智慧城市科技股份有限公司 Dialog generation method and device, electronic equipment and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
苏依拉等: ""基于对偶学习的西里古尔蒙古语-汉语机器翻译研究"", 《计算机应用与软件》, vol. 37, no. 1, 12 January 2020 (2020-01-12), pages 172 - 178 *


Similar Documents

Publication Publication Date Title
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
CN109271631B (en) Word segmentation method, device, equipment and storage medium
WO2020186778A1 (en) Error word correction method and device, computer device, and storage medium
JP2022531414A (en) End-to-end automatic speech recognition of digit strings
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
CN110288972B (en) Speech synthesis model training method, speech synthesis method and device
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
WO2022227190A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN112466273A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111241853B (en) Session translation method, device, storage medium and terminal equipment
CN112820269A (en) Text-to-speech method, device, electronic equipment and storage medium
WO2022121158A1 (en) Speech synthesis method and apparatus, and electronic device and storage medium
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN115002491A (en) Network live broadcast method, device, equipment and storage medium based on intelligent machine
CN113918031A (en) System and method for Chinese punctuation recovery using sub-character information
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN113205814A (en) Voice data labeling method and device, electronic equipment and storage medium
CN112489634A (en) Language acoustic model training method and device, electronic equipment and computer medium
CN113434642B (en) Text abstract generation method and device and electronic equipment
CN113870835A (en) Speech synthesis method, apparatus, device and storage medium based on artificial intelligence
CN115101042A (en) Text processing method, device and equipment
CN113450758B (en) Speech synthesis method, apparatus, device and medium
CN114155832A (en) Speech recognition method, device, equipment and medium based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination