CN113191164A - Dialect voice synthesis method and device, electronic equipment and storage medium - Google Patents

Dialect voice synthesis method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113191164A
CN113191164A
Authority
CN
China
Prior art keywords
dialect
text
mandarin
determining
dialect text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110616158.1A
Other languages
Chinese (zh)
Other versions
CN113191164B (en)
Inventor
孙见青
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110616158.1A priority Critical patent/CN113191164B/en
Publication of CN113191164A publication Critical patent/CN113191164A/en
Application granted granted Critical
Publication of CN113191164B publication Critical patent/CN113191164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a dialect speech synthesis method, a dialect speech synthesis device, electronic equipment and a storage medium, wherein the method comprises the following steps: obtaining a Mandarin text, and determining a first dialect text according to the Mandarin text; determining a second dialect text according to the first dialect text and a dialect text model; determining a representation corresponding to the second dialect text according to the second dialect text; determining acoustic parameters corresponding to the second dialect text according to the representation; and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters. In the embodiment of the application, synthesizing Mandarin text into dialect speech is realized; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, that is, the newspaper is read out in a Sichuan-dialect voice.

Description

Dialect voice synthesis method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of voice processing, in particular to a dialect synthesis method, a dialect synthesis device, electronic equipment and a storage medium.
Background
At present, when dialect speech is synthesized, an input Mandarin text is synthesized directly into dialect speech, so the synthesized dialect often sounds inauthentic and does not match the dialect's characteristic expressions.
Disclosure of Invention
The invention provides a dialect speech synthesis method, a device, electronic equipment and a storage medium, which can solve the technical problem that the synthesized dialect is not authentic.
The technical scheme for solving the technical problems is as follows:
in a first aspect, an embodiment of the present invention provides a dialect speech synthesis method, including:
obtaining a mandarin text, and determining a first dialect text according to the mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
In some embodiments, determining the first dialect text from the Mandarin text includes:
obtaining a plurality of mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
In some embodiments, determining the second dialect text from the first dialect text and the dialect text model comprises:
acquiring a plurality of dialect texts;
training a plurality of dialect texts to obtain a dialect text model;
and obtaining a second dialect text according to the first dialect text and the dialect text model.
In some embodiments, determining the representation corresponding to the second dialect text from the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
In some embodiments, determining the acoustic parameter corresponding to the second dialect text from the representation corresponding to the second dialect text includes:
obtaining a plurality of dialect text and voice pairs;
training according to a plurality of dialect text and voice pairs to obtain an end-to-end synthesis model;
and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameter corresponding to the second dialect text.
In some embodiments, synthesizing dialect speech corresponding to the Mandarin text from the acoustic parameters corresponding to the second dialect text includes:
acquiring voice acoustic parameters and voice pairs;
training a neural network vocoder according to the voice acoustic parameters and the voice to obtain a neural network vocoder model;
and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect voice corresponding to the Mandarin text.
In a second aspect, an embodiment of the present invention provides a dialect speech synthesis apparatus, including:
an acquisition module, configured to obtain a Mandarin text and determine a first dialect text according to the Mandarin text;
a first determination module, configured to determine a second dialect text according to the first dialect text and the dialect text model;
a second determination module, configured to determine a representation corresponding to the second dialect text according to the second dialect text;
a third determination module, configured to determine an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and a synthesis module, configured to synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameter corresponding to the second dialect text.
In some embodiments, the obtaining module is further configured to:
obtaining a plurality of mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to perform any of the dialect speech synthesis methods described above by invoking programs or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores a program or instructions for causing a computer to execute the dialect speech synthesis method according to any one of the above descriptions.
The invention has the beneficial effects that: a Mandarin text is obtained, and a first dialect text is determined according to the Mandarin text; a second dialect text is determined according to the first dialect text and a dialect text model; a representation corresponding to the second dialect text is determined according to the second dialect text; acoustic parameters corresponding to the second dialect text are determined according to that representation; and dialect speech corresponding to the Mandarin text is synthesized according to the acoustic parameters. In the embodiment of the application, synthesizing Mandarin text into dialect speech is realized; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, that is, the newspaper is read out in a Sichuan-dialect voice.
Drawings
Fig. 1 is a diagram of a dialect speech synthesis method according to an embodiment of the present invention;
fig. 2 is a second diagram of a dialect speech synthesis method according to an embodiment of the present invention;
fig. 3 is a third diagram of a dialect speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present invention;
FIG. 5 is a fifth illustration of a dialect speech synthesis method according to an embodiment of the present invention;
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present invention;
fig. 7 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order that the above objects, features and advantages of the present application can be more clearly understood, the present disclosure is further described in detail below with reference to the accompanying drawings and examples. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure; the specific embodiments described herein are merely illustrative and are not limiting. All other embodiments that can be derived by one of ordinary skill in the art from the described embodiments are intended to fall within the scope of the present disclosure.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Fig. 1 is a dialect speech synthesis method according to an embodiment of the present invention.
In a first aspect, as shown in fig. 1, an embodiment of the present invention provides a dialect speech synthesis method, including the following steps S101, S102, S103, S104, and S105:
s101: and acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text.
Specifically, the Mandarin text in the embodiment of the application can be Mandarin text printed in newspapers, magazines, periodicals and the like; the first dialect text can be a Sichuan dialect text, a Guangdong (Cantonese) dialect text, a Fujian dialect text, or the like; for example, Mandarin text in a newspaper is translated into the corresponding dialect text, such as Sichuan dialect text.
It should be understood that the Mandarin text and the first dialect text above are only examples and are not intended to limit the scope of the present invention; in the embodiments of the present application, Sichuan dialect text is used as the running example.
S102: and determining a second dialect text according to the first dialect text and the dialect text model.
Specifically, the Mandarin text is translated into a Sichuan dialect text in step S101, and a more accurate Sichuan dialect text is then obtained from that text and the dialect text model; it should be understood that introducing the dialect text model improves the accuracy of the text translation.
S103: and determining the representation corresponding to the second dialect text according to the second dialect text.
Specifically, after the higher-accuracy dialect text corresponding to the Mandarin text is determined through steps S101 and S102, the dialect text is analyzed to obtain the representation corresponding to the second dialect text.
S104: and determining the acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text.
Specifically, in the embodiment of the present application, the acoustic parameter corresponding to the dialect text with higher accuracy is determined through the representation corresponding to the dialect text with higher accuracy.
S105: and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Specifically, in the embodiment of the present application, dialect speech corresponding to mandarin text is synthesized through the acoustic parameters in step S104.
It should be understood that through the five steps S101 to S105, Mandarin text is synthesized into dialect speech; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, i.e. the newspaper is read out in a Sichuan-dialect voice. By introducing the dialect text, the higher-accuracy dialect text, and the acoustic parameters, the dialect synthesized in the embodiment of the present application is more authentic than dialect speech synthesized directly as in the prior art.
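The five steps above can be sketched as a simple pipeline. This is a hypothetical illustration only: the function names are not from the patent, and each stub body stands in for a trained model (translation model, dialect text model, text analyzer, synthesis model, and vocoder), so only the data flow between S101 and S105 is shown.

```python
# Hypothetical sketch of the S101-S105 pipeline; every stage is a stub
# standing in for a trained model, so only the data flow is illustrated.

def translate_to_dialect(mandarin_text: str) -> str:
    """S101: end-to-end translation model, Mandarin -> first dialect text."""
    return mandarin_text  # stub: a real model would rewrite the wording

def refine_with_dialect_lm(first_dialect_text: str) -> str:
    """S102: dialect text model refines the text -> second dialect text."""
    return first_dialect_text  # stub

def analyze_text(second_dialect_text: str) -> list:
    """S103: text analysis -> linguistic representation (e.g. symbols)."""
    return list(second_dialect_text)

def predict_acoustics(representation: list) -> list:
    """S104: end-to-end synthesis model -> acoustic parameters."""
    return [float(len(representation))]  # stub: one dummy parameter

def vocode(acoustic_params: list) -> bytes:
    """S105: neural network vocoder -> dialect speech waveform."""
    return bytes(int(p) % 256 for p in acoustic_params)  # stub samples

def synthesize_dialect_speech(mandarin_text: str) -> bytes:
    text1 = translate_to_dialect(mandarin_text)
    text2 = refine_with_dialect_lm(text1)
    rep = analyze_text(text2)
    params = predict_acoustics(rep)
    return vocode(params)
```

Each stub would be replaced by the corresponding trained model described in the embodiments below.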
Fig. 2 is a second dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 2, determining the first dialect text from the mandarin text includes the following three steps S201, S202 and S203:
s201: a plurality of mandarin text and dialect text pairs are obtained.
Specifically, in the embodiment of the present application, a plurality of Mandarin text and dialect text pairs are obtained, such as Mandarin and Sichuan-dialect pairs, Mandarin and Beijing-dialect pairs, Mandarin and Shanghai-dialect pairs, and the like.
S202: an end-to-end translation model is obtained from the training of the plurality of mandarin text and dialect text pairs.
Specifically, in the embodiment of the present application, the end-to-end translation model is obtained through training on the Mandarin–Sichuan, Mandarin–Beijing, Mandarin–Shanghai and similar text pairs.
S203: the Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model has been trained, the Sichuan dialect text, Beijing dialect text, Shanghai dialect text or the like corresponding to a Mandarin text can later be determined directly through the model.
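The pair-based training idea of S201–S203 can be illustrated with a toy word-substitution "translator". A real system would train a neural sequence-to-sequence model on the text pairs, as the patent states; this sketch only shows how aligned Mandarin/dialect pairs drive the learned mapping. The sample pairs (e.g. 没有→没得, 什么→啥子, commonly cited Sichuan-dialect forms) are illustrative, not from the patent.

```python
# Toy stand-in for the end-to-end Mandarin->dialect translation model:
# learn word-level substitutions from aligned text pairs (illustrative
# data; a real system would train a neural seq2seq model instead).

def train_substitutions(pairs):
    """Collect Mandarin->dialect word replacements from aligned pairs."""
    table = {}
    for mandarin, dialect in pairs:
        m_words, d_words = mandarin.split(), dialect.split()
        if len(m_words) == len(d_words):  # only use word-aligned pairs
            for m, d in zip(m_words, d_words):
                if m != d:
                    table[m] = d
    return table

def translate(text, table):
    """Apply the learned substitutions word by word."""
    return " ".join(table.get(w, w) for w in text.split())

# Illustrative Mandarin / Sichuan-dialect pairs (whitespace-tokenized)
PAIRS = [
    ("没有 问题", "没得 问题"),
    ("你 在 做 什么", "你 在 做 啥子"),
]
TABLE = train_substitutions(PAIRS)
```

The substitution table plays the role of the trained translation model: at inference time, a new Mandarin sentence is mapped to a first dialect text without any further training.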
Fig. 3 is a third diagram of a dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 3, determining the second dialect text from the first dialect text and the dialect text model includes:
s301: a plurality of dialect texts is obtained.
Specifically, in the embodiment of the present application, a plurality of dialect texts are obtained, such as Sichuan dialect, Beijing dialect, Fujian dialect and Shanghai dialect texts, and the like.
S302: and training a plurality of dialect texts to obtain a dialect text model.
Specifically, after the plurality of dialect texts (Sichuan, Beijing, Fujian, Shanghai, and so on) are obtained, the embodiment of the present application trains a dialect text model, such as a Sichuan dialect text model, on these texts.
S303: and obtaining a second dialect text according to the first dialect text and the dialect text model.
Specifically, the accuracy of the Sichuan dialect text produced by the end-to-end translation model alone is limited; to improve it further, the text is refined through the Sichuan dialect text model, so that a Sichuan dialect text with higher accuracy is obtained.
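One simple way to realize this refinement step is to let a dialect language model score candidate texts and keep the best-scoring one. The patent does not specify the model form, so the character-bigram model below is a hedged illustration with made-up training lines; any stronger dialect text model could play the same role.

```python
import math
from collections import Counter

# Toy character-bigram dialect text model (an assumption -- the patent
# does not specify the model form): score candidate dialect texts and
# keep the one the dialect model finds most fluent.

def train_bigram_lm(texts):
    bigrams, unigrams = Counter(), Counter()
    for t in texts:
        chars = list(t)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return bigrams, unigrams

def score(text, lm):
    """Add-one-smoothed log-probability of the text's bigrams."""
    bigrams, unigrams = lm
    chars = list(text)
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams)))
    return total

def refine(candidates, lm):
    """Pick the candidate second-dialect text the model scores highest."""
    return max(candidates, key=lambda c: score(c, lm))

# Illustrative Sichuan-flavored training lines
LM = train_bigram_lm(["没得问题", "啥子事"])
```

Because the model is trained only on dialect text, candidates that use genuine dialect wording score higher than literal Mandarin-like wording, which is the mechanism by which accuracy improves here.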
In some embodiments, determining the representation corresponding to the second dialect text from the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
Illustratively, in the embodiment of the present application, the representation corresponding to the Sichuan dialect text is obtained by analyzing the Sichuan dialect text.
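The analysis step can be pictured as a lexicon lookup from characters to phonetic symbols. The patent does not define the representation, so the entries below are placeholder pronunciations, not real Sichuan-dialect phonology; they serve only to show the text-to-representation shape.

```python
# Toy text-analysis step: map each dialect character to phonetic
# symbols via a lexicon. The entries are illustrative placeholders,
# not real Sichuan-dialect pronunciations.

LEXICON = {
    "没": ["m", "ei2"],
    "得": ["d", "e2"],
}

def text_to_representation(text):
    """Flatten per-character symbol sequences into one representation."""
    representation = []
    for ch in text:
        representation.extend(LEXICON.get(ch, ["<unk>"]))
    return representation
```

A production analyzer would also attach prosody and tone information, but the interface — dialect text in, symbol sequence out — stays the same.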
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 4, determining the acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text includes:
s401: a plurality of dialect text and speech pairs are obtained.
Specifically, in the embodiment of the present application, a plurality of dialect text and speech pairs are obtained, such as Sichuan dialect text with its corresponding speech, Beijing dialect text with its corresponding speech, Shanghai dialect text with its corresponding speech, and the like.
S402: an end-to-end synthesis model is derived from the training of the plurality of dialect text and speech pairs.
Specifically, in the embodiment of the present application, the end-to-end synthesis model is obtained through training in which, for example, the representation corresponding to a Sichuan dialect text is used as input and the acoustic parameters of the corresponding Sichuan speech as output, the representation corresponding to a Beijing dialect text as input and the acoustic parameters of the corresponding Beijing speech as output, and the representation corresponding to a Shanghai dialect text as input and the acoustic parameters of the corresponding Shanghai speech as output.
S403: and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameter corresponding to the second dialect text.
Specifically, for example, the representation corresponding to the Sichuan dialect text is input into the end-to-end synthesis model to obtain the acoustic parameters corresponding to the Sichuan dialect text.
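The interface of the end-to-end synthesis model — representation in, acoustic-parameter frames out — can be shown with random placeholder weights. Nothing here is trained; it only fixes the shapes, and the 80-dimensional frame size is an assumption (a common mel-spectrogram setting), not a figure from the patent.

```python
import random

# Interface sketch of the end-to-end synthesis model: a symbol
# representation goes in, a sequence of acoustic-parameter frames comes
# out. Weights are random placeholders, and the 80-dimensional frame
# size is an assumption (a common mel-spectrogram setting).
random.seed(0)

VOCAB = {"m": 0, "ei2": 1, "d": 2, "e2": 3, "<unk>": 4}
N_PARAMS = 80
FRAME_TABLE = [[random.gauss(0.0, 1.0) for _ in range(N_PARAMS)]
               for _ in VOCAB]

def predict_acoustic_params(representation):
    """Return one acoustic-parameter frame per input symbol."""
    ids = [VOCAB.get(sym, VOCAB["<unk>"]) for sym in representation]
    return [FRAME_TABLE[i] for i in ids]
```

A real model would predict many frames per symbol with learned alignment, but the contract consumed by the vocoder below is unchanged.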
Fig. 5 is a fifth diagram of a dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 5, synthesizing dialect speech corresponding to mandarin chinese text according to the acoustic parameters corresponding to the second dialect text includes:
s501: and acquiring voice acoustic parameters and voice pairs.
Specifically, in the embodiment of the present application, the speech acoustic parameters and the speech pairs are obtained, that is, the speech acoustic parameters and the speech are in one-to-one correspondence.
S502: Training a neural network vocoder according to the voice acoustic parameters and the voice to obtain a neural network vocoder model.
Specifically, in the embodiment of the present application, the acoustic parameters corresponding to speech are used as input and the speech as output, so as to train the neural network vocoder and obtain the neural network vocoder model.
S503: Inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect speech corresponding to the Mandarin text.
Specifically, as an example, the acoustic parameters corresponding to the Sichuan dialect text are input into the neural network vocoder model to synthesize Sichuan-dialect speech, that is, the Mandarin text is broadcast in a Sichuan-dialect voice.
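The vocoder contract — acoustic frames in, waveform samples out — can be mimicked with a trivial sinusoidal oscillator. A trained neural vocoder would replace this entirely; the frame length, sample rate and pitch below are illustrative assumptions, not values from the patent.

```python
import math

# Minimal stand-in for the neural network vocoder: each acoustic frame's
# mean magnitude drives a fixed-pitch sine oscillator. A trained neural
# vocoder would replace this; it only illustrates the frames-in /
# samples-out contract. Frame length, sample rate and pitch are
# illustrative assumptions.

def vocode(frames, samples_per_frame=200, sample_rate=16000, pitch_hz=200.0):
    wave = []
    for frame in frames:
        amplitude = sum(abs(v) for v in frame) / len(frame)
        for _ in range(samples_per_frame):
            t = len(wave) / sample_rate  # time of the next sample
            wave.append(amplitude * math.sin(2 * math.pi * pitch_hz * t))
    return wave
```

At 200 samples per frame and 16 kHz, each frame covers 12.5 ms of audio, which is in the range typical acoustic-model frame shifts fall into.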
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present invention.
In a second aspect, an embodiment of the present invention provides a dialect speech synthesis apparatus, including:
the acquisition module 601: the method comprises the steps of obtaining a Mandarin text, and determining a first dialect text according to the Mandarin text;
specifically, the mandarin text in the embodiment of the application can be the mandarin text recorded in newspapers, magazines, periodicals and the like; the first dialect text can be a Sichuan dialect text, a Guangdong dialect text and a Fujian dialect text; for example, based on mandarin text in a newspaper, translated into corresponding dialect text, such as tetragon text.
It should be understood that the Mandarin Chinese text and the first dialect text above are only examples and are not intended to limit the scope of the present invention; in the embodiments of the present application, the text of Sichuan is described as an example.
The first determination module 602 is configured to determine a second dialect text according to the first dialect text and the dialect text model.
Specifically, after the Mandarin text is acquired through the acquisition module 601, it is translated into a Sichuan dialect text, and a more accurate Sichuan dialect text is then obtained according to that text and the dialect text model.
The second determination module 603 is configured to determine the representation corresponding to the second dialect text according to the second dialect text.
specifically, after the dialect text with higher accuracy corresponding to the mandarin text is determined by the first determining module 602 and the second determining module 603, the dialect text is analyzed to obtain the representation corresponding to the second dialect text.
The third determination module 604 is configured to determine the acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text.
specifically, in the embodiment of the present application, the acoustic parameter corresponding to the dialect text is determined by the representation corresponding to the dialect text with higher accuracy.
The synthesis module 605 is configured to synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameter corresponding to the second dialect text.
Specifically, in the embodiment of the present application, the acoustic parameters determined by the third determining module 604 are synthesized by the synthesizing module 605 to generate dialect speech corresponding to mandarin chinese text.
It should be understood that through the five modules 601 to 605, Mandarin text is synthesized into dialect speech; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, i.e. the newspaper is read out in a Sichuan-dialect voice. By introducing the dialect text, the higher-accuracy dialect text, and the acoustic parameters, the dialect synthesized in the embodiment of the present application is more authentic than dialect speech synthesized directly as in the prior art.
In some embodiments, the obtaining module 601 is further configured to:
obtaining a plurality of mandarin text and dialect text pairs;
specifically, in the embodiment of the present application, a plurality of mandarin text and dialect text pairs, such as mandarin and tetrakawa pair, mandarin and beijing pairs, mandarin and shanghai pairs, and the like, are obtained.
An end-to-end translation model is obtained from the training of the plurality of mandarin text and dialect text pairs.
Specifically, in the embodiment of the present application, the end-to-end translation model is obtained through training on the Mandarin–Sichuan, Mandarin–Beijing, Mandarin–Shanghai and similar text pairs.
The Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model has been trained, the Sichuan dialect text, Beijing dialect text, Shanghai dialect text or the like corresponding to a Mandarin text can later be determined directly through the model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to perform any of the dialect speech synthesis methods described above by invoking programs or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores a program or instructions for causing a computer to execute the dialect speech synthesis method according to any one of the above descriptions.
Fig. 7 is a schematic block diagram of an electronic device provided by an embodiment of the disclosure.
As shown in fig. 7, the electronic apparatus includes: at least one processor 701, at least one memory 702, and at least one communication interface 703. The various components in the electronic device are coupled together by a bus system 704. A communication interface 703 for information transmission with an external device. It is understood that the bus system 704 is used to enable communications among the components. The bus system 704 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, the various buses are labeled in fig. 7 as the bus system 704.
It will be appreciated that the memory 702 in this embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
In some embodiments, memory 702 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, such as a media player, a browser, and the like, are used to implement various application services. A program implementing any one of the dialect speech synthesis methods provided by the embodiments of the present application may be included in an application program.
In this embodiment of the application, the processor 701 is configured to execute the steps of the dialect speech synthesis method provided by the embodiment of the application by calling a program or an instruction stored in the memory 702, specifically, a program or an instruction stored in an application program, for example:
Obtaining a mandarin text, and determining a first dialect text according to the mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Any one of the dialect speech synthesis methods provided in the embodiments of the present application may be applied to, or implemented by, the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be carried out by integrated logic circuits of hardware or by instructions in the form of software in the processor 701. The processor 701 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor.
The steps of any one of the dialect speech synthesis methods provided by the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software units in the decoding processor. The software units may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 702; the processor 701 reads the information in the memory 702 and completes the steps of the method in combination with its hardware.
Those skilled in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
Although the embodiments of the present application have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the application. Any person skilled in the art can readily conceive of equivalent modifications and substitutions within the technical scope of the present disclosure, and such modifications and substitutions are intended to fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A dialect speech synthesis method, comprising:
obtaining a Mandarin text, and determining a first dialect text according to the Mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and synthesizing the dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
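Taken together, the five steps of claim 1 form a pipeline from Mandarin text to dialect speech. The sketch below wires toy stand-ins for each stage end-to-end; every function name, word pair, and number in it is a hypothetical illustration for orientation only, not the trained models described in the patent.

```python
# Toy end-to-end sketch of the claim 1 pipeline. Each stage is a
# hypothetical stand-in for a trained model (see claims 2-6).

# Step 1: Mandarin text -> first dialect text (stand-in for the translation model)
def translate_to_dialect(mandarin_text):
    toy_lexicon = {"hello": "hullo", "you": "ye"}  # hypothetical word pairs
    return " ".join(toy_lexicon.get(w, w) for w in mandarin_text.split())

# Step 2: first dialect text -> second dialect text (stand-in for the dialect text model)
def apply_dialect_text_model(first_dialect_text):
    # A real system would keep the candidate the dialect model scores highest;
    # this stand-in passes the text through unchanged.
    return first_dialect_text

# Step 3: second dialect text -> symbolic representation (here, a character sequence)
def to_representation(dialect_text):
    return list(dialect_text.replace(" ", ""))

# Step 4: representation -> acoustic parameters (stand-in for the synthesis model)
def to_acoustic_params(representation):
    return [float(ord(c)) for c in representation]

# Step 5: acoustic parameters -> waveform samples (stand-in for the vocoder)
def vocode(acoustic_params):
    return [p / 128.0 for p in acoustic_params]

def synthesize_dialect_speech(mandarin_text):
    first = translate_to_dialect(mandarin_text)
    second = apply_dialect_text_model(first)
    rep = to_representation(second)
    params = to_acoustic_params(rep)
    return vocode(params)

wave = synthesize_dialect_speech("hello you")
```

In a deployed system each stand-in would be replaced by the corresponding trained model: the end-to-end translation model (claim 2), the dialect text model (claim 3), the end-to-end synthesis model (claim 5), and the neural network vocoder (claim 6).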
2. The dialect speech synthesis method of claim 1, wherein the determining a first dialect text according to the Mandarin text comprises:
obtaining a plurality of Mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
and inputting the Mandarin text into the end-to-end translation model to obtain the first dialect text.
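Claim 2 trains a translation model from parallel Mandarin/dialect text pairs and applies it to new Mandarin text. As a hedged illustration only, the toy "model" below learns word-level substitutions from position-aligned pairs; an actual system would train an end-to-end neural sequence-to-sequence model, and the word pairs here are invented.

```python
# Toy stand-in for the end-to-end translation model of claim 2:
# learn word substitutions from position-aligned Mandarin/dialect pairs.
def train_translation_model(pairs):
    table = {}
    for mandarin, dialect in pairs:
        for m_word, d_word in zip(mandarin.split(), dialect.split()):
            table[m_word] = d_word  # last observed substitution wins
    return table

def translate(model, mandarin_text):
    # Unknown words pass through unchanged.
    return " ".join(model.get(w, w) for w in mandarin_text.split())

pairs = [("I eat rice", "I chow rice"), ("you eat", "ye chow")]  # hypothetical data
model = train_translation_model(pairs)
first_dialect_text = translate(model, "you eat rice")
```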
3. The dialect speech synthesis method of claim 1, wherein determining the second dialect text from the first dialect text and the dialect text model comprises:
acquiring a plurality of dialect texts;
training according to the plurality of dialect texts to obtain the dialect text model;
and determining a second dialect text according to the first dialect text and the dialect text model.
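Claim 3 trains a dialect text model on dialect-only text and uses it to turn the first dialect text into the second. One plausible reading is language-model rescoring: among candidate dialect renderings, keep the one the dialect model scores highest. The unigram model below is a deliberately minimal stand-in for illustration, with invented data.

```python
from collections import Counter

# Toy stand-in for the dialect text model of claim 3: a unigram model
# trained on dialect-only text, used to rescore candidate dialect texts.
def train_dialect_text_model(dialect_texts):
    counts = Counter(w for text in dialect_texts for w in text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(model, text, floor=1e-6):
    # Product of unigram probabilities, with a floor for unseen words.
    p = 1.0
    for w in text.split():
        p *= model.get(w, floor)
    return p

def pick_second_dialect_text(model, candidates):
    return max(candidates, key=lambda t: score(model, t))

dialect_corpus = ["ye chow rice", "ye chow noodles", "ye go"]  # hypothetical
model = train_dialect_text_model(dialect_corpus)
second = pick_second_dialect_text(model, ["you eat rice", "ye chow rice"])
```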
4. The dialect speech synthesis method of claim 1, wherein the determining the corresponding representation of the second dialect text from the second dialect text comprises:
and analyzing the second dialect text to obtain a representation corresponding to the second dialect text.
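Claim 4 derives the representation by analyzing the second dialect text. A common representation in speech synthesis is a phone sequence looked up in a pronunciation lexicon; the sketch below assumes a tiny hypothetical lexicon and falls back to letters for unknown words.

```python
# Toy text analysis for claim 4: map each word of the second dialect text
# to a phone-like sequence via a (hypothetical) pronunciation lexicon.
toy_lexicon = {"ye": ["y", "e"], "chow": ["ch", "ow"], "rice": ["r", "ai", "s"]}

def text_to_representation(dialect_text, lexicon):
    phones = []
    for word in dialect_text.split():
        phones.extend(lexicon.get(word, list(word)))  # fall back to letters
    return phones

rep = text_to_representation("ye chow rice", toy_lexicon)
```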
5. The dialect speech synthesis method of claim 1, wherein the determining the acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text comprises:
obtaining a plurality of dialect text and voice pairs;
training to obtain an end-to-end synthesis model according to the dialect text and the voice pairs;
and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameter corresponding to the second dialect text.
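Claim 5 trains an end-to-end synthesis model from dialect text/speech pairs and uses it to map the representation to acoustic parameters. The stand-in below merely averages one toy acoustic value per symbol from position-aligned training pairs; a real model would be a neural attention-based synthesizer, and all numbers here are invented.

```python
# Toy stand-in for the end-to-end synthesis model of claim 5: average a
# single acoustic value per symbol from position-aligned training pairs.
def train_synthesis_model(pairs):
    # pairs: (symbol sequence, acoustic parameter sequence), aligned by position
    sums, counts = {}, {}
    for symbols, params in pairs:
        for s, p in zip(symbols, params):
            sums[s] = sums.get(s, 0.0) + p
            counts[s] = counts.get(s, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}

def predict_acoustic_params(model, representation, default=0.0):
    return [model.get(s, default) for s in representation]

training_pairs = [(["y", "e"], [0.2, 0.4]), (["e", "s"], [0.4, 0.6])]  # toy data
model = train_synthesis_model(training_pairs)
params = predict_acoustic_params(model, ["y", "e", "s"])
```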
6. The dialect speech synthesis method of claim 1, wherein the synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text comprises:
obtaining a plurality of acoustic parameter and speech pairs;
training a neural network vocoder according to the acoustic parameter and speech pairs to obtain a neural network vocoder model;
and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize the dialect speech corresponding to the Mandarin text.
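Claim 6 trains a neural network vocoder on acoustic parameter/speech pairs and then drives it with the predicted parameters. In place of a neural network, the toy "vocoder" below fits a single least-squares gain, purely to make the train-then-synthesize flow concrete; the data is invented.

```python
# Toy stand-in for the neural network vocoder of claim 6: fit one scalar
# gain by least squares from (acoustic parameter, speech sample) pairs,
# then use it to map new acoustic parameters to waveform samples.
def train_vocoder(param_speech_pairs):
    num = sum(p * s for p, s in param_speech_pairs)
    den = sum(p * p for p, s in param_speech_pairs)
    return num / den  # scalar gain: a deliberately simple "model"

def vocode(gain, acoustic_params):
    return [gain * p for p in acoustic_params]

training_pairs = [(0.2, 0.1), (0.4, 0.2), (0.6, 0.3)]  # toy (parameter, sample) data
gain = train_vocoder(training_pairs)
waveform = vocode(gain, [0.2, 0.4, 0.6])
```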
7. A dialect speech synthesis apparatus, comprising:
an acquisition module, configured to obtain a Mandarin text and determine a first dialect text according to the Mandarin text;
a first determination module, configured to determine a second dialect text according to the first dialect text and the dialect text model;
a second determination module, configured to determine a representation corresponding to the second dialect text according to the second dialect text;
a third determination module, configured to determine an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and a synthesis module, configured to synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameter corresponding to the second dialect text.
8. The dialect speech synthesis apparatus of claim 7, wherein the obtaining module is further configured to:
obtaining a plurality of Mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
and inputting the Mandarin text into the end-to-end translation model to obtain the first dialect text.
9. An electronic device, comprising: a processor and a memory;
the processor is configured to perform a dialect speech synthesis method according to any one of claims 1 to 6 by calling a program or instructions stored in the memory.
10. A computer-readable storage medium storing a program or instructions for causing a computer to perform a dialect speech synthesis method according to any one of claims 1 to 6.
CN202110616158.1A 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium Active CN113191164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616158.1A CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113191164A (en) 2021-07-30
CN113191164B CN113191164B (en) 2023-11-10

Family

ID=76975818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616158.1A Active CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113191164B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130110511A1 (en) * 2011-10-31 2013-05-02 Telcordia Technologies, Inc. System, Method and Program for Customized Voice Communication
CN110197655A (en) * 2019-06-28 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing voice
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112634866A (en) * 2020-12-24 2021-04-09 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, apparatus, device and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁青青; 杨鸿武; 郭威彤; 裴东; 甘振业: "Conversion from Mandarin to Lanzhou Dialect Using the Five-Degree Tone Model" (利用五度字调模型实现普通话到兰州方言的转换), Technical Acoustics (声学技术), no. 06, pages 64-69 *

Also Published As

Publication number Publication date
CN113191164B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN110136691B (en) Speech synthesis model training method and device, electronic equipment and storage medium
CN109686361B (en) Speech synthesis method, device, computing equipment and computer storage medium
WO2020248393A1 (en) Speech synthesis method and system, terminal device, and readable storage medium
DE112010005168B4 (en) Recognition dictionary generating device, speech recognition device and voice synthesizer
US11417316B2 (en) Speech synthesis method and apparatus and computer readable storage medium using the same
CN105261355A (en) Voice synthesis method and apparatus
CN113066511B (en) Voice conversion method and device, electronic equipment and storage medium
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN110211562B (en) Voice synthesis method, electronic equipment and readable storage medium
CN113470684A (en) Audio noise reduction method, device, equipment and storage medium
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN113256262A (en) Automatic generation method and system of conference summary, storage medium and electronic equipment
CN110312161B (en) Video dubbing method and device and terminal equipment
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN113191164A (en) Dialect voice synthesis method and device, electronic equipment and storage medium
CN112668704B (en) Training method and device of audio recognition model and audio recognition method and device
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113256133B (en) Conference summary management method, device, computer equipment and storage medium
CN112487771B (en) Report generation method, report generation device and terminal
CN113345408B (en) Chinese and English voice mixed synthesis method and device, electronic equipment and storage medium
CN112133279B (en) Vehicle-mounted information broadcasting method and device and terminal equipment
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language
CN112951218B (en) Voice processing method and device based on neural network model and electronic equipment
CN115171651B (en) Method and device for synthesizing infant voice, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant