CN113191164B - Dialect voice synthesis method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113191164B
Authority
CN
China
Prior art keywords
dialect
text
mandarin
voice
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110616158.1A
Other languages
Chinese (zh)
Other versions
CN113191164A (en)
Inventor
孙见青
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110616158.1A
Publication of CN113191164A
Application granted
Publication of CN113191164B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a dialect speech synthesis method, apparatus, electronic device, and storage medium, wherein the method includes the following steps: acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text; determining a second dialect text according to the first dialect text and a dialect text model; determining a representation corresponding to the second dialect text according to the second dialect text; determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text; and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text. According to the embodiment of the application, Mandarin text is synthesized into dialect speech; for example, Mandarin text from a newspaper is synthesized into Sichuan-dialect speech, that is, the Mandarin text is read aloud in the Sichuan dialect.

Description

Dialect voice synthesis method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a dialect speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
At present, dialect speech is synthesized directly from the input Mandarin text, which produces dialect speech that sounds unnatural and does not conform to the expressive characteristics of the dialect.
Disclosure of Invention
The application provides a dialect speech synthesis method, apparatus, electronic device, and storage medium, which can solve the technical problem that the synthesized dialect speech sounds unnatural.
The technical scheme for solving the technical problems is as follows:
in a first aspect, an embodiment of the present application provides a dialect speech synthesis method, including:
acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining acoustic parameters corresponding to the second dialect text according to the characterization corresponding to the second dialect text;
synthesizing dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
In some embodiments, determining the first dialect text from the Mandarin text includes:
acquiring a plurality of Mandarin text and dialect text pairs;
training according to a plurality of Mandarin texts and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into an end-to-end translation model to obtain first dialect text.
In some embodiments, determining the second dialect text from the first dialect text and the dialect text model includes:
acquiring a plurality of dialect texts;
training a plurality of dialect texts to obtain a dialect text model;
and obtaining the second dialect text according to the first dialect text and the dialect text model.
In some embodiments, determining the representation corresponding to the second dialect text according to the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
In some embodiments, determining the acoustic parameters corresponding to the second dialect text from the characterization corresponding to the second dialect text includes:
acquiring a plurality of dialect text and voice pairs;
training according to a plurality of dialect texts and voice pairs to obtain an end-to-end synthesis model;
and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameters corresponding to the second dialect text.
In some embodiments, synthesizing the dialect speech corresponding to the Mandarin text based on the acoustic parameters corresponding to the second dialect text includes:
acquiring voice acoustic parameters and voice pairs;
training a neural network vocoder according to the voice acoustic parameters and the voice pair to obtain a neural network vocoder model;
and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect voice corresponding to the Mandarin text.
In a second aspect, an embodiment of the present application provides a dialect speech synthesis apparatus, including:
the acquisition module: for acquiring a Mandarin text and determining a first dialect text according to the Mandarin text;
a first determination module: for determining a second dialect text according to the first dialect text and the dialect text model;
a second determination module: for determining a representation corresponding to the second dialect text according to the second dialect text;
a third determination module: for determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and a synthesis module: for synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
In some embodiments, the acquisition module is further configured to:
acquiring a plurality of Mandarin text and dialect text pairs;
training according to a plurality of Mandarin texts and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into an end-to-end translation model to obtain first dialect text.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is configured to perform the dialect speech synthesis method as described in any of the above by invoking a program or instructions stored in the memory.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the dialect speech synthesis method as described in any of the above.
The beneficial effects of the application are as follows: acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text; determining a second dialect text according to the first dialect text and the dialect text model; determining a representation corresponding to the second dialect text according to the second dialect text; determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text; and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text. According to the embodiment of the application, Mandarin text is synthesized into dialect speech; for example, Mandarin text from a newspaper is synthesized into Sichuan-dialect speech, that is, the Mandarin text is read aloud in the Sichuan dialect.
Drawings
FIG. 1 is a first diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 2 is a second diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 3 is a third diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 5 is a fifth diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present application;
fig. 7 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The principles and features of the present application are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the application and are not to be construed as limiting the scope of the application.
In order that the above-recited objects, features, and advantages of the present application may be more clearly understood, a more particular description of the application is given below with reference to specific embodiments illustrated in the appended drawings. It is to be understood that the described embodiments are some, but not all, of the embodiments of the present application. The specific embodiments described herein are illustrative rather than restrictive. All other embodiments obtained by a person skilled in the art based on the described embodiments fall within the scope of protection of the application.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Fig. 1 is a diagram of a dialect speech synthesis method according to an embodiment of the present application.
In a first aspect, as shown in fig. 1, an embodiment of the present application provides a dialect speech synthesis method, including the following five steps S101, S102, S103, S104, and S105:
s101: and acquiring the Mandarin text, and determining a first dialect text according to the Mandarin text.
Specifically, the Mandarin text in the embodiment of the present application may be Mandarin text from printed material such as newspapers, magazines, and journals; the first dialect text may be, for example, a Sichuan text, a Cantonese text, or a Fujian-dialect text. For example, Mandarin text in a newspaper is translated into a corresponding dialect text, such as a Sichuan text.
It should be understood that the above Mandarin text and first dialect text are exemplary only and are not intended to limit the scope of the present application; the Sichuan text is taken as an example to introduce the embodiments of the application.
S102: the second dialect text is determined from the first dialect text model.
Specifically, the Mandarin text is translated into a Sichuan text in step S101, and a more accurate Sichuan text is then obtained from that Sichuan text and the dialect text model; it is understood that introducing the dialect text model improves the accuracy of the text translation.
S103: and determining the representation corresponding to the second dialect text according to the second dialect text.
Specifically, after the higher-accuracy dialect text corresponding to the Mandarin text is determined through steps S101 and S102, the dialect text is parsed to obtain the representation corresponding to the second dialect text.
S104: and determining acoustic parameters corresponding to the second dialect text according to the characterization corresponding to the second dialect text.
Specifically, in the embodiment of the application, the acoustic parameters corresponding to the higher-accuracy dialect text are determined through the representation corresponding to that dialect text.
S105: synthesizing dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Specifically, in the embodiment of the present application, dialect speech corresponding to mandarin text is synthesized through the acoustic parameters in step S104.
It should be understood that, through the five steps S101 to S105, dialect speech is synthesized from Mandarin text; for example, Mandarin text from a newspaper is synthesized into Sichuan speech. Because the dialect text model is introduced to obtain a higher-accuracy dialect text before the acoustic parameters are determined, the synthesized dialect speech is more natural than dialect speech synthesized directly from Mandarin text in the prior art.
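To make the flow of steps S101 to S105 concrete, the following minimal Python sketch chains five placeholder stages. It is only an illustration of the data flow, not the patented implementation: every function name, the toy phrase table, and the placeholder outputs are invented stand-ins for the trained models described above.

```python
# Illustrative sketch of the five-stage pipeline (S101-S105).
# Every function here is a hypothetical placeholder; a real system would
# back each stage with the trained models the description discusses.

def translate_to_dialect(mandarin_text):          # S101: end-to-end translation model
    phrase_table = {"什么": "啥子", "没有": "莫得"}  # toy Mandarin-to-Sichuan substitutions
    for src, tgt in phrase_table.items():
        mandarin_text = mandarin_text.replace(src, tgt)
    return mandarin_text

def refine_with_dialect_lm(first_dialect_text):   # S102: dialect text model
    return first_dialect_text                     # identity stand-in for LM refinement

def text_to_representation(second_dialect_text):  # S103: parse text into a representation
    return list(second_dialect_text)              # character tokens as a toy representation

def representation_to_acoustics(representation):  # S104: end-to-end synthesis model
    return [[float(ord(ch))] for ch in representation]  # one toy frame per token

def acoustics_to_speech(acoustic_frames):         # S105: neural network vocoder
    return bytes(len(acoustic_frames))            # zero-filled placeholder waveform buffer

def synthesize_dialect_speech(mandarin_text):
    first_text = translate_to_dialect(mandarin_text)
    second_text = refine_with_dialect_lm(first_text)
    representation = text_to_representation(second_text)
    acoustic_params = representation_to_acoustics(representation)
    return acoustics_to_speech(acoustic_params)

speech = synthesize_dialect_speech("你吃了没有")
print(type(speech).__name__, len(speech))
```

In a real system each stand-in would be replaced by the corresponding trained model: the end-to-end translation model, the dialect text model, the end-to-end synthesis model, and the neural network vocoder.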
FIG. 2 is a diagram of a second method for synthesizing dialect speech according to an embodiment of the present application.
In some embodiments, as shown in fig. 2, determining the first dialect text from the Mandarin text includes the following three steps S201, S202, and S203:
s201: a plurality of Mandarin text and dialect text pairs are obtained.
Specifically, in the embodiment of the present application, a plurality of Mandarin text and dialect text pairs are acquired, such as Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
S202: and training to obtain an end-to-end translation model according to the plurality of Mandarin texts and dialect text pairs.
Specifically, in the embodiment of the application, the end-to-end translation model is obtained by training on the Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
S203: the Mandarin text is input into an end-to-end translation model to obtain first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model is trained, the Sichuan, Beijing, or Shanghai text corresponding to a Mandarin text can later be determined directly through the end-to-end translation model.
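The training and inference flow of S201 to S203 can be illustrated with a deliberately simplified stand-in: instead of an end-to-end neural translation model, a word-level substitution table is "trained" from toy aligned Mandarin and Sichuan sentence pairs. The pairs and substitutions below are invented for illustration only.

```python
# Toy stand-in for S201-S203. A real system would train a neural
# sequence-to-sequence model on Mandarin/dialect sentence pairs; here a
# word-level substitution table learned from aligned toy pairs plays that role.

def train_translation_table(pairs):
    """Learn per-token substitutions from aligned (mandarin, dialect) pairs."""
    table = {}
    for mandarin, dialect in pairs:
        for src, tgt in zip(mandarin.split(), dialect.split()):
            if src != tgt:
                table[src] = tgt
    return table

def translate(mandarin_text, table):
    """Apply the learned substitutions token by token (S203)."""
    return " ".join(table.get(tok, tok) for tok in mandarin_text.split())

# Invented parallel data: Mandarin on the left, Sichuan-style text on the right.
pairs = [
    ("你 在 做 什么", "你 在 做 啥子"),
    ("我 没有 时间", "我 莫得 时间"),
]
table = train_translation_table(pairs)
print(translate("你 没有 做 什么", table))  # -> 你 莫得 做 啥子
```

The dictionary lookup is only a sketch of "translation"; the point is the two-phase shape of the step: train once on parallel pairs, then map new Mandarin input to a first dialect text.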
FIG. 3 is a third diagram of a method for synthesizing dialect speech according to an embodiment of the present application.
In some embodiments, as shown in fig. 3, determining the second dialect text from the first dialect text and the dialect text model includes:
s301: a plurality of dialect texts is obtained.
Specifically, in the embodiment of the application, a plurality of dialect texts, such as Sichuan, Beijing, Fujian, and Shanghai texts, are acquired.
S302: training a plurality of dialect texts to obtain a dialect text model.
Specifically, after the embodiment of the application acquires a plurality of dialect texts, such as Sichuan, Beijing, Fujian, and Shanghai texts, a dialect text model, such as a Sichuan text model, is trained on the corresponding dialect texts.
S303: and obtaining the second dialect text according to the first dialect text and the dialect text model.
Specifically, in the embodiment of the application, the accuracy of the Sichuan text produced by the end-to-end translation model is limited; in order to further improve the accuracy, a higher-accuracy Sichuan text is obtained by refining it with the Sichuan text model.
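One common way a text model trained only on dialect data can refine a candidate translation is by scoring alternatives and keeping the most dialect-like one. The sketch below uses a smoothed bigram language model as a stand-in for the dialect text model of S301 to S303; the tiny corpus, the candidates, and the smoothing scheme are all invented for illustration.

```python
# Sketch of the role of the dialect text model (S301-S303): a language model
# trained only on dialect texts scores candidate translations so that the
# more idiomatic candidate wins. A bigram model stands in for the real model.
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over a whitespace-tokenized dialect corpus."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return bigrams, unigrams

def score(sentence, lm):
    """Add-one smoothed product of bigram probabilities; higher = more dialect-like."""
    bigrams, unigrams = lm
    toks = ["<s>"] + sentence.split()
    prob = 1.0
    for a, b in zip(toks, toks[1:]):
        prob *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return prob

def refine(candidates, lm):
    """Pick the candidate the dialect model considers most probable."""
    return max(candidates, key=lambda c: score(c, lm))

corpus = ["我 莫得 时间", "他 莫得 钱"]  # invented Sichuan-dialect corpus
lm = train_bigram_lm(corpus)
best = refine(["我 没有 时间", "我 莫得 时间"], lm)
print(best)  # -> 我 莫得 时间
```

The model prefers the candidate containing the dialect word it has actually seen, which mirrors the description's claim that introducing the dialect text model improves translation accuracy.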
In some embodiments, determining the representation corresponding to the second dialect text according to the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
For example, in the embodiment of the application, the representation corresponding to the Sichuan text is obtained by parsing the Sichuan text.
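The parsing step can be pictured as a front-end lookup that turns each character of the dialect text into linguistic features. Real text front ends emit phoneme sequences, tones, and prosodic features; the tiny lexicon below, including its romanizations and tone numbers, is entirely hypothetical and serves only to show the shape of the representation.

```python
# Sketch of S103: parsing the dialect text into a representation.
# The lexicon entries are invented placeholders, not real dialect phonology.

LEXICON = {  # hypothetical character -> (phoneme, tone) entries
    "我": ("ngo", 3),
    "莫": ("mo", 2),
    "得": ("de", 2),
}

def parse_to_representation(dialect_text):
    """Map each character to (phoneme, tone); unknown characters get tone 0."""
    return [LEXICON.get(ch, (ch, 0)) for ch in dialect_text]

rep = parse_to_representation("我莫得")
print(rep)  # -> [('ngo', 3), ('mo', 2), ('de', 2)]
```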
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present application.
In some embodiments, as shown in fig. 4, determining the acoustic parameters corresponding to the second dialect text from the representation corresponding to the second dialect text includes:
s401: a plurality of dialect text and speech pairs are acquired.
Specifically, in the embodiment of the application, a plurality of dialect text and speech pairs are acquired, such as Sichuan text with its corresponding Sichuan speech, Beijing text with its corresponding Beijing speech, and Shanghai text with its corresponding Shanghai speech.
S402: an end-to-end synthesis model is trained from a plurality of dialect text and speech pairs.
Specifically, in the embodiment of the application, the end-to-end synthesis model is obtained by training with the representation corresponding to each dialect text as input and the acoustic parameters of the corresponding dialect speech as output; for example, the representation of the Sichuan text as input and the acoustic parameters of the Sichuan speech as output, and likewise for the Beijing and Shanghai pairs.
S403: and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameters corresponding to the second dialect text.
Specifically, by way of example, the representation corresponding to the Sichuan text is input into an end-to-end synthesis model to obtain acoustic parameters corresponding to the Sichuan text.
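The mapping from representation to acoustic parameters can be sketched as follows. A real end-to-end synthesis model predicts frames of acoustic parameters (for example, mel-spectrogram frames) with learned durations; the toy version below emits a fixed number of deterministic frames per phoneme, and every number in it (frame count, dimension, embedding formula) is invented for illustration.

```python
# Sketch of S401-S403: mapping the text representation to acoustic parameters.
# A real end-to-end synthesis model learns this mapping from dialect
# text/speech pairs; this deterministic toy only shows the output shape.

FRAMES_PER_PHONE = 3   # hypothetical fixed duration per phoneme
DIM = 4                # hypothetical acoustic-parameter dimension

def phone_embedding(phoneme, tone):
    """Deterministic toy 'acoustic parameter' vector for one phoneme."""
    base = sum(ord(c) for c in phoneme) % 10
    return [float(base + tone + d) for d in range(DIM)]

def representation_to_acoustic_params(representation):
    """Emit FRAMES_PER_PHONE identical frames for each (phoneme, tone) entry."""
    frames = []
    for phoneme, tone in representation:
        frames.extend([phone_embedding(phoneme, tone)] * FRAMES_PER_PHONE)
    return frames

frames = representation_to_acoustic_params([("ngo", 3), ("mo", 2)])
print(len(frames), len(frames[0]))  # -> 6 4
```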
Fig. 5 is a fifth diagram of a dialect speech synthesis method according to an embodiment of the present application.
In some embodiments, as shown in fig. 5, synthesizing the dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text includes:
s501: and acquiring the acoustic parameters of the voice and the voice pair.
Specifically, in the embodiment of the application, pairs of speech and speech acoustic parameters are acquired, that is, each speech sample is in one-to-one correspondence with its acoustic parameters.
S502: training a neural network vocoder according to the voice acoustic parameters and the voice pair to obtain a neural network vocoder model;
specifically, in the embodiment of the application, acoustic parameters corresponding to voice are used as input, voice is used as output, and a neural network vocoder is trained to obtain a neural network vocoder model.
S503: and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect voice corresponding to the Mandarin text.
Specifically, for example, the acoustic parameters corresponding to the Sichuan text are input into the neural network vocoder model to synthesize Sichuan speech, so that the Mandarin text is broadcast in Sichuan speech.
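The vocoder stage consumes per-frame acoustic parameters and emits waveform samples. A real neural network vocoder is trained on pairs of acoustic parameters and speech as in S501 and S502; the toy "vocoder" below merely renders one sinusoid per frame from a hypothetical per-frame F0 value, to show the parameters-to-waveform shape of the computation. The frame length and sample rate are illustrative choices, not values from the patent.

```python
# Sketch of S503: a trivial parameters-to-waveform stand-in for the trained
# neural network vocoder model. Each input value is treated as a per-frame
# fundamental frequency (F0) in Hz; 0.0 models an unvoiced frame.
import math

SAMPLE_RATE = 16000   # assumed sample rate (Hz)
FRAME_LEN = 80        # samples per frame (5 ms at 16 kHz)

def toy_vocoder(f0_per_frame):
    """Render one sinusoidal segment per acoustic-parameter frame."""
    samples = []
    for f0 in f0_per_frame:
        for n in range(FRAME_LEN):
            samples.append(math.sin(2 * math.pi * f0 * n / SAMPLE_RATE))
    return samples

wave = toy_vocoder([120.0, 150.0, 0.0])  # last frame is unvoiced
print(len(wave))  # -> 240
```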
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present application;
in a second aspect, an embodiment of the present application provides a dialect speech synthesis apparatus, including:
acquisition module 601: for acquiring a Mandarin text and determining a first dialect text according to the Mandarin text;
Specifically, the Mandarin text in the embodiment of the present application may be Mandarin text from printed material such as newspapers, magazines, and journals; the first dialect text may be, for example, a Sichuan text, a Cantonese text, or a Fujian-dialect text. For example, Mandarin text in a newspaper is translated into a corresponding dialect text, such as a Sichuan text.
It should be understood that the above Mandarin text and first dialect text are exemplary only and are not intended to limit the scope of the present application; the Sichuan text is taken as an example to introduce the embodiments of the application.
The first determination module 602: for determining a second dialect text according to the first dialect text and the dialect text model;
Specifically, after the Mandarin text acquired by the acquisition module 601 is translated into a Sichuan text, a more accurate Sichuan text is obtained from that Sichuan text and the dialect text model; it is understood that introducing the dialect text model improves the accuracy of the text translation.
The second determination module 603: for determining a representation corresponding to the second dialect text according to the second dialect text;
Specifically, after the higher-accuracy dialect text corresponding to the Mandarin text is determined by the acquisition module 601 and the first determination module 602, the dialect text is parsed to obtain the representation corresponding to the second dialect text.
The third determination module 604: for determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text;
Specifically, in the embodiment of the application, the acoustic parameters corresponding to the higher-accuracy dialect text are determined through the representation corresponding to that dialect text.
The synthesis module 605: for synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Specifically, in the embodiment of the present application, the synthesis module 605 synthesizes the dialect speech corresponding to the Mandarin text from the acoustic parameters determined by the third determination module 604.
It should be understood that, through the five modules 601 to 605, dialect speech is synthesized from Mandarin text; for example, Mandarin text from a newspaper is synthesized into Sichuan speech. Because the embodiment introduces a higher-accuracy dialect text and its acoustic parameters, the synthesized dialect speech is more natural than dialect speech synthesized directly in the prior art.
In some embodiments, the acquisition module 601 is further configured to:
acquiring a plurality of Mandarin text and dialect text pairs;
Specifically, in the embodiment of the present application, a plurality of Mandarin text and dialect text pairs are acquired, such as Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
And training to obtain an end-to-end translation model according to the plurality of Mandarin texts and dialect text pairs.
Specifically, in the embodiment of the application, the end-to-end translation model is obtained by training on the Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
The Mandarin text is input into an end-to-end translation model to obtain first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model is trained, the Sichuan, Beijing, or Shanghai text corresponding to a Mandarin text can later be determined directly through the end-to-end translation model.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is configured to perform the dialect speech synthesis method as described in any of the above by invoking a program or instructions stored in the memory.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the dialect speech synthesis method as described in any of the above.
Fig. 7 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
As shown in fig. 7, the electronic device includes: at least one processor 701, at least one memory 702, and at least one communication interface 703. The various components in the electronic device are coupled together by a bus system 704, and the communication interface 703 is used for information transmission with external devices. It is appreciated that the bus system 704 is used to enable connection and communication among these components. In addition to a data bus, the bus system 704 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, the various buses are labeled as the bus system 704 in fig. 7.
It is to be appreciated that the memory 702 in the present embodiment may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
In some implementations, the memory 702 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs, including various applications such as a media player and a browser, are used to implement various application services. A program implementing any one of the dialect speech synthesis methods provided in the embodiments of the present application may be included in the application programs.
In the embodiment of the present application, the processor 701 is configured to execute the steps of each embodiment of the dialect speech synthesis method provided in the embodiment of the present application by calling a program or an instruction stored in the memory 702, specifically, a program or an instruction stored in an application program.
Acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining acoustic parameters corresponding to the second dialect text according to the characterization corresponding to the second dialect text;
synthesizing dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Any one of the dialect speech synthesis methods provided in the embodiments of the present application may be applied to, or implemented by, the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or any conventional processor.
The steps of any of the dialect speech synthesis methods provided by the embodiments of the application may be performed directly by a hardware decoding processor, or by a combination of hardware and software units in the decoding processor. The software units may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 702, and the processor 701 reads the information in the memory 702 and performs the steps of the method in combination with its hardware.
Those skilled in the art will appreciate that although some embodiments described herein include some features that other embodiments do not, combinations of features of different embodiments are within the scope of the application and form further embodiments.
Those skilled in the art will appreciate that the descriptions of the various embodiments are each focused on, and that portions of one embodiment that are not described in detail may be referred to as related descriptions of other embodiments.
Although the embodiments of the present application have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and substitutions without departing from the spirit and scope of the present application, and such modifications and substitutions fall within the scope of the appended claims. The described embodiments are merely illustrative of the present application; various equivalent modifications and substitutions will be readily apparent to those skilled in the art and are intended to be included within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (5)

1. A method of dialect speech synthesis, comprising:
acquiring a plurality of pairs of Mandarin text and dialect text;
training on the plurality of pairs of Mandarin text and dialect text to obtain an end-to-end translation model;
inputting a Mandarin text into the end-to-end translation model to obtain a first dialect text;
acquiring a plurality of dialect texts;
training on the plurality of dialect texts to obtain a dialect text model;
determining a second dialect text according to the first dialect text and the dialect text model;
analyzing the second dialect text to obtain a representation corresponding to the second dialect text;
acquiring a plurality of pairs of dialect text and speech;
training on the plurality of pairs of dialect text and speech to obtain an end-to-end synthesis model;
inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain acoustic parameters corresponding to the second dialect text; and
synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
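The training and synthesis steps of claim 1 can be sketched as the following pipeline. Every function here is a deliberately trivial, hypothetical stand-in — a word-level lookup table for the end-to-end translation model, a frequency counter for the dialect text model, and a per-character table for the end-to-end synthesis model — chosen only to make the data flow concrete; the patent's actual models are trained neural networks.

```python
from collections import Counter

def train_translation_model(pairs):
    # Stand-in for the "end-to-end translation model": a lookup table
    # learned from (Mandarin text, dialect text) pairs.
    return dict(pairs)

def train_dialect_text_model(dialect_texts):
    # Stand-in for the "dialect text model": a frequency table used to
    # rescore candidate dialect texts.
    return Counter(dialect_texts)

def determine_second_dialect_text(first_dialect_text, dialect_model):
    # Pick the candidate the dialect text model scores highest; the
    # first dialect text itself is always among the candidates.
    candidates = [first_dialect_text] + list(dialect_model)
    return max(candidates, key=lambda t: dialect_model.get(t, 0))

def analyze(dialect_text):
    # "Representation corresponding to the dialect text": here, simply
    # its character sequence.
    return list(dialect_text)

def train_synthesis_model(text_speech_pairs):
    # Stand-in for the "end-to-end synthesis model": a per-character
    # table of (fake) acoustic parameters taken from the paired speech.
    table = {}
    for text, speech in text_speech_pairs:
        for ch, frame in zip(text, speech):
            table[ch] = frame
    return table

def synthesize(mandarin_text, translation_model, dialect_model, synthesis_model):
    # The claimed pipeline: translate, rescore, analyze, then map the
    # representation to acoustic parameters.
    first = translation_model.get(mandarin_text, mandarin_text)
    second = determine_second_dialect_text(first, dialect_model)
    representation = analyze(second)
    return [synthesis_model.get(ch, 0.0) for ch in representation]
```

A vocoder stage, as in claim 2, would then convert the returned acoustic parameters into an audible waveform.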
2. The dialect speech synthesis method according to claim 1, wherein synthesizing the dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text comprises:
acquiring pairs of speech acoustic parameters and speech;
training a neural network vocoder according to the pairs of speech acoustic parameters and speech to obtain a neural network vocoder model; and
inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize the dialect speech corresponding to the Mandarin text.
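Claim 2's vocoder stage can likewise be sketched with a toy stand-in: here the "neural network vocoder" is replaced by a single least-squares gain fitted on (acoustic parameter, speech) pairs, purely to illustrate the train-then-infer structure the claim describes; a real implementation would train a neural vocoder on such pairs instead.

```python
def train_vocoder(param_speech_pairs):
    # Toy "vocoder training": least-squares fit of
    # sample = gain * parameter over all paired frames.
    num = sum(p * s for params, speech in param_speech_pairs
                    for p, s in zip(params, speech))
    den = sum(p * p for params, _ in param_speech_pairs for p in params)
    return num / den if den else 0.0

def run_vocoder(gain, acoustic_params):
    # Inference: map the second dialect text's acoustic parameters to
    # a (toy) waveform, one sample per parameter.
    return [gain * p for p in acoustic_params]
```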
3. A dialect speech synthesis apparatus, comprising:
an acquisition module, configured to:
acquire a plurality of pairs of Mandarin text and dialect text;
train on the plurality of pairs of Mandarin text and dialect text to obtain an end-to-end translation model; and
input a Mandarin text into the end-to-end translation model to obtain a first dialect text;
a first determination module, configured to:
acquire a plurality of dialect texts;
train on the plurality of dialect texts to obtain a dialect text model; and
determine a second dialect text according to the first dialect text and the dialect text model;
a second determination module, configured to:
analyze the second dialect text to obtain a representation corresponding to the second dialect text;
a third determination module, configured to:
acquire a plurality of pairs of dialect text and speech;
train on the plurality of pairs of dialect text and speech to obtain an end-to-end synthesis model; and
input the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain acoustic parameters corresponding to the second dialect text; and
a synthesis module, configured to:
synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
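The apparatus of claim 3 maps naturally onto an object whose collaborators correspond to the claimed modules. The class and the injected callables below are hypothetical illustrations of that decomposition, not the patent's implementation; each attribute stands for one claimed module's trained model.

```python
class DialectSpeechSynthesisApparatus:
    """Hypothetical module decomposition mirroring claim 3."""

    def __init__(self, translate, rescore, analyze, acoustics, vocoder):
        self.translate = translate    # acquisition module: Mandarin text -> first dialect text
        self.rescore = rescore        # first determination module: -> second dialect text
        self.analyze = analyze        # second determination module: -> representation
        self.acoustics = acoustics    # third determination module: -> acoustic parameters
        self.vocoder = vocoder        # synthesis module: -> dialect speech

    def __call__(self, mandarin_text):
        # Run the modules in the order the claims recite them.
        first = self.translate(mandarin_text)
        second = self.rescore(first)
        representation = self.analyze(second)
        params = self.acoustics(representation)
        return self.vocoder(params)
```

Wiring trivial callables through the constructor is enough to exercise the data flow end to end, which is how the test below uses it.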
4. An electronic device, comprising: a processor and a memory;
the processor is configured to perform the dialect speech synthesis method according to any one of claims 1 to 2 by invoking a program or instructions stored in the memory.
5. A computer-readable storage medium storing a program or instructions that cause a computer to perform the dialect speech synthesis method according to any one of claims 1 to 2.
CN202110616158.1A 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium Active CN113191164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616158.1A CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113191164A CN113191164A (en) 2021-07-30
CN113191164B true CN113191164B (en) 2023-11-10

Family

ID=76975818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616158.1A Active CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113191164B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197655A (en) * 2019-06-28 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing voice
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112634866A (en) * 2020-12-24 2021-04-09 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, apparatus, device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130110511A1 (en) * 2011-10-31 2013-05-02 Telcordia Technologies, Inc. System, Method and Program for Customized Voice Communication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Conversion from Mandarin to Lanzhou dialect using a five-degree tone model; Liang Qingqing; Yang Hongwu; Guo Weitong; Pei Dong; Gan Zhenye; Technical Acoustics (Issue 06); 64-69 *

Also Published As

Publication number Publication date
CN113191164A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
US10789938B2 (en) Speech synthesis method terminal and storage medium
CN109740053B (en) Sensitive word shielding method and device based on NLP technology
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
US11417316B2 (en) Speech synthesis method and apparatus and computer readable storage medium using the same
WO2017044415A1 (en) System and method for eliciting open-ended natural language responses to questions to train natural language processors
US8423365B2 (en) Contextual conversion platform
CN111226275A (en) Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN113066511A (en) Voice conversion method and device, electronic equipment and storage medium
WO2014183411A1 (en) Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
US11996084B2 (en) Speech synthesis method and apparatus, device and computer storage medium
CN110534115B (en) Multi-party mixed voice recognition method, device, system and storage medium
CN113191164B (en) Dialect voice synthesis method, device, electronic equipment and storage medium
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN112668704B (en) Training method and device of audio recognition model and audio recognition method and device
CN113781996B (en) Voice synthesis model training method and device and electronic equipment
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113345408B (en) Chinese and English voice mixed synthesis method and device, electronic equipment and storage medium
CN112509559A (en) Audio recognition method, model training method, device, equipment and storage medium
CN113948064A (en) Speech synthesis and speech recognition
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium
CN112951218B (en) Voice processing method and device based on neural network model and electronic equipment
CN110808957B (en) Vulnerability information matching processing method and device
US20230056128A1 (en) Speech processing method and apparatus, device and computer storage medium
CN111494950B (en) Game voice processing method and device, storage medium and electronic equipment
WO2024012040A1 (en) Method for speech generation and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant