CN113191164B - Dialect voice synthesis method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113191164B
Authority
CN
China
Prior art keywords
dialect
text
mandarin
voice
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110616158.1A
Other languages
Chinese (zh)
Other versions
CN113191164A (en)
Inventor
孙见青
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110616158.1A
Publication of CN113191164A
Application granted
Publication of CN113191164B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a dialect speech synthesis method, apparatus, electronic device, and storage medium, wherein the method includes the following steps: acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text; determining a second dialect text according to the first dialect text and a dialect text model; determining a representation corresponding to the second dialect text according to the second dialect text; determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text; and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text. According to the embodiment of the application, Mandarin text is synthesized into dialect speech; for example, Mandarin text from a newspaper is synthesized into Sichuan-dialect speech, that is, the Mandarin text is read aloud in the Sichuan dialect.

Description

Dialect voice synthesis method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a dialect speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
At present, dialect speech is synthesized directly from the input Mandarin text, which produces dialect speech that sounds unnatural and does not conform to the expressive characteristics of the dialect.
Disclosure of Invention
The application provides a dialect speech synthesis method, apparatus, electronic device, and storage medium, which can solve the technical problem that the synthesized dialect speech sounds unnatural.
The technical scheme for solving the technical problems is as follows:
in a first aspect, an embodiment of the present application provides a dialect speech synthesis method, including:
acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining acoustic parameters corresponding to the second dialect text according to the characterization corresponding to the second dialect text;
synthesizing dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
In some embodiments, determining the first dialect text from the Mandarin text includes:
acquiring a plurality of Mandarin text and dialect text pairs;
training according to a plurality of Mandarin texts and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into an end-to-end translation model to obtain first dialect text.
In some embodiments, determining the second dialect text from the first dialect text and the dialect text model includes:
acquiring a plurality of dialect texts;
training a plurality of dialect texts to obtain a dialect text model;
and obtaining the second dialect text according to the first dialect text and the dialect text model.
In some embodiments, determining the representation corresponding to the second dialect text according to the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
In some embodiments, determining the acoustic parameters corresponding to the second dialect text from the characterization corresponding to the second dialect text includes:
acquiring a plurality of dialect text and voice pairs;
training according to a plurality of dialect texts and voice pairs to obtain an end-to-end synthesis model;
and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameters corresponding to the second dialect text.
In some embodiments, synthesizing the dialect speech corresponding to the Mandarin text based on the acoustic parameters corresponding to the second dialect text includes:
acquiring voice acoustic parameters and voice pairs;
training a neural network vocoder according to the voice acoustic parameters and the voice pair to obtain a neural network vocoder model;
and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect voice corresponding to the Mandarin text.
In a second aspect, an embodiment of the present application provides a dialect speech synthesis apparatus, including:
the acquisition module: for acquiring a Mandarin text and determining a first dialect text according to the Mandarin text;
a first determination module: for determining a second dialect text according to the first dialect text and the dialect text model;
a second determination module: for determining a representation corresponding to the second dialect text according to the second dialect text;
a third determination module: for determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and a synthesis module: for synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
In some embodiments, the acquisition module is further configured to:
acquiring a plurality of Mandarin text and dialect text pairs;
training according to a plurality of Mandarin texts and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into an end-to-end translation model to obtain first dialect text.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is configured to perform the dialect speech synthesis method as described in any of the above by invoking a program or instructions stored in the memory.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the dialect speech synthesis method as described in any of the above.
The beneficial effects of the application are as follows: acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text; determining a second dialect text according to the first dialect text and the dialect text model; determining a representation corresponding to the second dialect text according to the second dialect text; determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text; and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text. According to the embodiment of the application, Mandarin text is synthesized into dialect speech; for example, Mandarin text from a newspaper is synthesized into Sichuan-dialect speech, that is, the Mandarin text is read aloud in the Sichuan dialect.
Drawings
FIG. 1 is a first diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 2 is a second diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 3 is a third diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 5 is a fifth diagram of a dialect speech synthesis method according to an embodiment of the present application;
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present application;
fig. 7 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The principles and features of the present application are described below with reference to the drawings, the examples are illustrated for the purpose of illustrating the application and are not to be construed as limiting the scope of the application.
In order that the above-recited objects, features, and advantages of the present application may be more clearly understood, a more particular description of the application is given below with reference to specific embodiments illustrated in the appended drawings. It is to be understood that the described embodiments are some, but not all, of the embodiments of the present application. The specific embodiments described herein are illustrative rather than restrictive. All other embodiments obtained by a person skilled in the art based on the described embodiments fall within the scope of protection of the application.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Fig. 1 is a diagram of a dialect speech synthesis method according to an embodiment of the present application.
In a first aspect, as shown in fig. 1, an embodiment of the present application provides a dialect speech synthesis method, including the following five steps S101, S102, S103, S104, and S105:
s101: and acquiring the Mandarin text, and determining a first dialect text according to the Mandarin text.
Specifically, the Mandarin text in the embodiment of the present application may be Mandarin text from printed material such as newspapers, magazines, and journals; the first dialect text may be, for example, a Sichuan text, a Cantonese text, or a Fujian-dialect text. For example, Mandarin text in a newspaper is translated into a corresponding dialect text, such as a Sichuan text.
It should be understood that the above Mandarin text and first dialect text are exemplary only and are not intended to limit the scope of the present application; the Sichuan text is taken as an example to introduce the embodiments of the application.
S102: the second dialect text is determined from the first dialect text model.
Specifically, the Mandarin text is translated into a Sichuan text in step S101, and a more accurate Sichuan text is then obtained from that Sichuan text and the dialect text model; it is understood that introducing the dialect text model improves the accuracy of the text translation.
S103: and determining the representation corresponding to the second dialect text according to the second dialect text.
Specifically, after the higher-accuracy dialect text corresponding to the Mandarin text is determined through steps S101 and S102, the dialect text is parsed to obtain the representation corresponding to the second dialect text.
S104: and determining acoustic parameters corresponding to the second dialect text according to the characterization corresponding to the second dialect text.
Specifically, in the embodiment of the application, the acoustic parameters corresponding to the higher-accuracy dialect text are determined through the representation corresponding to that dialect text.
S105: synthesizing dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Specifically, in the embodiment of the present application, dialect speech corresponding to mandarin text is synthesized through the acoustic parameters in step S104.
It should be understood that, through the five steps S101 to S105, dialect speech is synthesized from Mandarin text; for example, Mandarin text from a newspaper is synthesized into Sichuan speech. Because the dialect text model is introduced to obtain a higher-accuracy dialect text before the acoustic parameters are determined, the synthesized dialect speech is more natural than dialect speech synthesized directly from Mandarin text in the prior art.
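To make the flow of steps S101 to S105 concrete, the following minimal Python sketch chains five placeholder stages. It is only an illustration of the data flow, not the patented implementation: every function name, the toy phrase table, and the placeholder outputs are invented stand-ins for the trained models described above.

```python
# Illustrative sketch of the five-stage pipeline (S101-S105).
# Every function here is a hypothetical placeholder; a real system would
# back each stage with the trained models the description discusses.

def translate_to_dialect(mandarin_text):          # S101: end-to-end translation model
    phrase_table = {"什么": "啥子", "没有": "莫得"}  # toy Mandarin-to-Sichuan substitutions
    for src, tgt in phrase_table.items():
        mandarin_text = mandarin_text.replace(src, tgt)
    return mandarin_text

def refine_with_dialect_lm(first_dialect_text):   # S102: dialect text model
    return first_dialect_text                     # identity stand-in for LM refinement

def text_to_representation(second_dialect_text):  # S103: parse text into a representation
    return list(second_dialect_text)              # character tokens as a toy representation

def representation_to_acoustics(representation):  # S104: end-to-end synthesis model
    return [[float(ord(ch))] for ch in representation]  # one toy frame per token

def acoustics_to_speech(acoustic_frames):         # S105: neural network vocoder
    return bytes(len(acoustic_frames))            # zero-filled placeholder waveform buffer

def synthesize_dialect_speech(mandarin_text):
    first_text = translate_to_dialect(mandarin_text)
    second_text = refine_with_dialect_lm(first_text)
    representation = text_to_representation(second_text)
    acoustic_params = representation_to_acoustics(representation)
    return acoustics_to_speech(acoustic_params)

speech = synthesize_dialect_speech("你吃了没有")
print(type(speech).__name__, len(speech))
```

In a real system each stand-in would be replaced by the corresponding trained model: the end-to-end translation model, the dialect text model, the end-to-end synthesis model, and the neural network vocoder.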
FIG. 2 is a diagram of a second method for synthesizing dialect speech according to an embodiment of the present application.
In some embodiments, as shown in fig. 2, determining the first dialect text from the Mandarin text includes the following three steps S201, S202, and S203:
s201: a plurality of Mandarin text and dialect text pairs are obtained.
Specifically, in the embodiment of the present application, a plurality of Mandarin text and dialect text pairs are acquired, such as Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
S202: and training to obtain an end-to-end translation model according to the plurality of Mandarin texts and dialect text pairs.
Specifically, in the embodiment of the application, the end-to-end translation model is obtained by training on the Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
S203: the Mandarin text is input into an end-to-end translation model to obtain first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model is trained, the Sichuan, Beijing, or Shanghai text corresponding to a Mandarin text can later be determined directly through the end-to-end translation model.
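The training and inference flow of S201 to S203 can be illustrated with a deliberately simplified stand-in: instead of an end-to-end neural translation model, a word-level substitution table is "trained" from toy aligned Mandarin and Sichuan sentence pairs. The pairs and substitutions below are invented for illustration only.

```python
# Toy stand-in for S201-S203. A real system would train a neural
# sequence-to-sequence model on Mandarin/dialect sentence pairs; here a
# word-level substitution table learned from aligned toy pairs plays that role.

def train_translation_table(pairs):
    """Learn per-token substitutions from aligned (mandarin, dialect) pairs."""
    table = {}
    for mandarin, dialect in pairs:
        for src, tgt in zip(mandarin.split(), dialect.split()):
            if src != tgt:
                table[src] = tgt
    return table

def translate(mandarin_text, table):
    """Apply the learned substitutions token by token (S203)."""
    return " ".join(table.get(tok, tok) for tok in mandarin_text.split())

# Invented parallel data: Mandarin on the left, Sichuan-style text on the right.
pairs = [
    ("你 在 做 什么", "你 在 做 啥子"),
    ("我 没有 时间", "我 莫得 时间"),
]
table = train_translation_table(pairs)
print(translate("你 没有 做 什么", table))  # -> 你 莫得 做 啥子
```

The dictionary lookup is only a sketch of "translation"; the point is the two-phase shape of the step: train once on parallel pairs, then map new Mandarin input to a first dialect text.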
FIG. 3 is a third diagram of a method for synthesizing dialect speech according to an embodiment of the present application.
In some embodiments, as shown in fig. 3, determining the second dialect text from the first dialect text and the dialect text model includes:
s301: a plurality of dialect texts is obtained.
Specifically, in the embodiment of the application, a plurality of dialect texts, such as Sichuan, Beijing, Fujian, and Shanghai texts, are acquired.
S302: training a plurality of dialect texts to obtain a dialect text model.
Specifically, after the embodiment of the application acquires a plurality of dialect texts, such as Sichuan, Beijing, Fujian, and Shanghai texts, a dialect text model, such as a Sichuan text model, is trained on the corresponding dialect texts.
S303: and obtaining the second dialect text according to the first dialect text and the dialect text model.
Specifically, in the embodiment of the application, the accuracy of the Sichuan text produced by the end-to-end translation model is limited; in order to further improve the accuracy, a higher-accuracy Sichuan text is obtained by refining it with the Sichuan text model.
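One common way a text model trained only on dialect data can refine a candidate translation is by scoring alternatives and keeping the most dialect-like one. The sketch below uses a smoothed bigram language model as a stand-in for the dialect text model of S301 to S303; the tiny corpus, the candidates, and the smoothing scheme are all invented for illustration.

```python
# Sketch of the role of the dialect text model (S301-S303): a language model
# trained only on dialect texts scores candidate translations so that the
# more idiomatic candidate wins. A bigram model stands in for the real model.
from collections import Counter

def train_bigram_lm(corpus):
    """Count unigrams and bigrams over a whitespace-tokenized dialect corpus."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    return bigrams, unigrams

def score(sentence, lm):
    """Add-one smoothed product of bigram probabilities; higher = more dialect-like."""
    bigrams, unigrams = lm
    toks = ["<s>"] + sentence.split()
    prob = 1.0
    for a, b in zip(toks, toks[1:]):
        prob *= (bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams))
    return prob

def refine(candidates, lm):
    """Pick the candidate the dialect model considers most probable."""
    return max(candidates, key=lambda c: score(c, lm))

corpus = ["我 莫得 时间", "他 莫得 钱"]  # invented Sichuan-dialect corpus
lm = train_bigram_lm(corpus)
best = refine(["我 没有 时间", "我 莫得 时间"], lm)
print(best)  # -> 我 莫得 时间
```

The model prefers the candidate containing the dialect word it has actually seen, which mirrors the description's claim that introducing the dialect text model improves translation accuracy.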
In some embodiments, determining the representation corresponding to the second dialect text according to the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
For example, in the embodiment of the application, the representation corresponding to the Sichuan text is obtained by parsing the Sichuan text.
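The parsing step can be pictured as a front-end lookup that turns each character of the dialect text into linguistic features. Real text front ends emit phoneme sequences, tones, and prosodic features; the tiny lexicon below, including its romanizations and tone numbers, is entirely hypothetical and serves only to show the shape of the representation.

```python
# Sketch of S103: parsing the dialect text into a representation.
# The lexicon entries are invented placeholders, not real dialect phonology.

LEXICON = {  # hypothetical character -> (phoneme, tone) entries
    "我": ("ngo", 3),
    "莫": ("mo", 2),
    "得": ("de", 2),
}

def parse_to_representation(dialect_text):
    """Map each character to (phoneme, tone); unknown characters get tone 0."""
    return [LEXICON.get(ch, (ch, 0)) for ch in dialect_text]

rep = parse_to_representation("我莫得")
print(rep)  # -> [('ngo', 3), ('mo', 2), ('de', 2)]
```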
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present application.
In some embodiments, as shown in fig. 4, determining the acoustic parameters corresponding to the second dialect text from the representation corresponding to the second dialect text includes:
s401: a plurality of dialect text and speech pairs are acquired.
Specifically, in the embodiment of the application, a plurality of dialect text and speech pairs are acquired, such as Sichuan text with its corresponding Sichuan speech, Beijing text with its corresponding Beijing speech, and Shanghai text with its corresponding Shanghai speech.
S402: an end-to-end synthesis model is trained from a plurality of dialect text and speech pairs.
Specifically, in the embodiment of the application, the end-to-end synthesis model is obtained by training with the representation corresponding to each dialect text as input and the acoustic parameters of the corresponding dialect speech as output; for example, the representation of the Sichuan text as input and the acoustic parameters of the Sichuan speech as output, and likewise for the Beijing and Shanghai pairs.
S403: and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameters corresponding to the second dialect text.
Specifically, by way of example, the representation corresponding to the Sichuan text is input into an end-to-end synthesis model to obtain acoustic parameters corresponding to the Sichuan text.
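The mapping from representation to acoustic parameters can be sketched as follows. A real end-to-end synthesis model predicts frames of acoustic parameters (for example, mel-spectrogram frames) with learned durations; the toy version below emits a fixed number of deterministic frames per phoneme, and every number in it (frame count, dimension, embedding formula) is invented for illustration.

```python
# Sketch of S401-S403: mapping the text representation to acoustic parameters.
# A real end-to-end synthesis model learns this mapping from dialect
# text/speech pairs; this deterministic toy only shows the output shape.

FRAMES_PER_PHONE = 3   # hypothetical fixed duration per phoneme
DIM = 4                # hypothetical acoustic-parameter dimension

def phone_embedding(phoneme, tone):
    """Deterministic toy 'acoustic parameter' vector for one phoneme."""
    base = sum(ord(c) for c in phoneme) % 10
    return [float(base + tone + d) for d in range(DIM)]

def representation_to_acoustic_params(representation):
    """Emit FRAMES_PER_PHONE identical frames for each (phoneme, tone) entry."""
    frames = []
    for phoneme, tone in representation:
        frames.extend([phone_embedding(phoneme, tone)] * FRAMES_PER_PHONE)
    return frames

frames = representation_to_acoustic_params([("ngo", 3), ("mo", 2)])
print(len(frames), len(frames[0]))  # -> 6 4
```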
Fig. 5 is a fifth diagram of a dialect speech synthesis method according to an embodiment of the present application.
In some embodiments, as shown in fig. 5, synthesizing the dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text includes:
s501: and acquiring the acoustic parameters of the voice and the voice pair.
Specifically, in the embodiment of the application, pairs of speech and speech acoustic parameters are acquired, that is, each speech sample is in one-to-one correspondence with its acoustic parameters.
S502: training a neural network vocoder according to the voice acoustic parameters and the voice pair to obtain a neural network vocoder model;
specifically, in the embodiment of the application, acoustic parameters corresponding to voice are used as input, voice is used as output, and a neural network vocoder is trained to obtain a neural network vocoder model.
S503: and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect voice corresponding to the Mandarin text.
Specifically, for example, the acoustic parameters corresponding to the Sichuan text are input into the neural network vocoder model to synthesize Sichuan speech, so that the Mandarin text is broadcast in Sichuan speech.
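The vocoder stage consumes per-frame acoustic parameters and emits waveform samples. A real neural network vocoder is trained on pairs of acoustic parameters and speech as in S501 and S502; the toy "vocoder" below merely renders one sinusoid per frame from a hypothetical per-frame F0 value, to show the parameters-to-waveform shape of the computation. The frame length and sample rate are illustrative choices, not values from the patent.

```python
# Sketch of S503: a trivial parameters-to-waveform stand-in for the trained
# neural network vocoder model. Each input value is treated as a per-frame
# fundamental frequency (F0) in Hz; 0.0 models an unvoiced frame.
import math

SAMPLE_RATE = 16000   # assumed sample rate (Hz)
FRAME_LEN = 80        # samples per frame (5 ms at 16 kHz)

def toy_vocoder(f0_per_frame):
    """Render one sinusoidal segment per acoustic-parameter frame."""
    samples = []
    for f0 in f0_per_frame:
        for n in range(FRAME_LEN):
            samples.append(math.sin(2 * math.pi * f0 * n / SAMPLE_RATE))
    return samples

wave = toy_vocoder([120.0, 150.0, 0.0])  # last frame is unvoiced
print(len(wave))  # -> 240
```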
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present application;
in a second aspect, an embodiment of the present application provides a dialect speech synthesis apparatus, including:
acquisition module 601: for acquiring a Mandarin text and determining a first dialect text according to the Mandarin text;
Specifically, the Mandarin text in the embodiment of the present application may be Mandarin text from printed material such as newspapers, magazines, and journals; the first dialect text may be, for example, a Sichuan text, a Cantonese text, or a Fujian-dialect text. For example, Mandarin text in a newspaper is translated into a corresponding dialect text, such as a Sichuan text.
It should be understood that the above Mandarin text and first dialect text are exemplary only and are not intended to limit the scope of the present application; the Sichuan text is taken as an example to introduce the embodiments of the application.
The first determination module 602: for determining a second dialect text according to the first dialect text and the dialect text model;
Specifically, after the Mandarin text acquired by the acquisition module 601 is translated into a Sichuan text, a more accurate Sichuan text is obtained from that Sichuan text and the dialect text model; it is understood that introducing the dialect text model improves the accuracy of the text translation.
The second determination module 603: for determining a representation corresponding to the second dialect text according to the second dialect text;
Specifically, after the higher-accuracy dialect text corresponding to the Mandarin text is determined by the acquisition module 601 and the first determination module 602, the dialect text is parsed to obtain the representation corresponding to the second dialect text.
The third determination module 604: for determining acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text;
Specifically, in the embodiment of the application, the acoustic parameters corresponding to the higher-accuracy dialect text are determined through the representation corresponding to that dialect text.
The synthesis module 605: for synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Specifically, in the embodiment of the present application, the synthesis module 605 synthesizes the dialect speech corresponding to the Mandarin text from the acoustic parameters determined by the third determination module 604.
It should be understood that, through the five modules 601 to 605, dialect speech is synthesized from Mandarin text; for example, Mandarin text from a newspaper is synthesized into Sichuan speech. Because the embodiment introduces a higher-accuracy dialect text and its acoustic parameters, the synthesized dialect speech is more natural than dialect speech synthesized directly in the prior art.
In some embodiments, the acquisition module 601 is further configured to:
acquiring a plurality of Mandarin text and dialect text pairs;
Specifically, in the embodiment of the present application, a plurality of Mandarin text and dialect text pairs are acquired, such as Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
And training to obtain an end-to-end translation model according to the plurality of Mandarin texts and dialect text pairs.
Specifically, in the embodiment of the application, the end-to-end translation model is obtained by training on the Mandarin and Sichuan pairs, Mandarin and Beijing pairs, Mandarin and Shanghai pairs, and the like.
The Mandarin text is input into an end-to-end translation model to obtain first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model is trained, the Sichuan, Beijing, or Shanghai text corresponding to a Mandarin text can later be determined directly through the end-to-end translation model.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory;
the processor is configured to perform the dialect speech synthesis method as described in any of the above by invoking a program or instructions stored in the memory.
In a fourth aspect, embodiments of the present application also provide a computer-readable storage medium storing a program or instructions that cause a computer to perform the dialect speech synthesis method as described in any of the above.
Fig. 7 is a schematic block diagram of an electronic device provided by an embodiment of the present disclosure.
As shown in fig. 7, the electronic device includes: at least one processor 701, at least one memory 702, and at least one communication interface 703. The various components in the electronic device are coupled together by a bus system 704, and the communication interface 703 is used for information transmission with external devices. It is appreciated that the bus system 704 is used to enable connection and communication among these components. In addition to a data bus, the bus system 704 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, the various buses are labeled as the bus system 704 in fig. 7.
It is to be appreciated that the memory 702 in the present embodiment may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory.
In some implementations, the memory 702 stores the following elements, executable units or data structures, or a subset thereof, or an extended set thereof: an operating system and application programs.
The operating system includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs, including various applications such as a media player and a browser, are used to implement various application services. A program implementing any one of the dialect speech synthesis methods provided in the embodiments of the present application may be included in the application programs.
In the embodiment of the present application, the processor 701 is configured to execute the steps of each embodiment of the dialect speech synthesis method provided in the embodiment of the present application by calling a program or an instruction stored in the memory 702, specifically, a program or an instruction stored in an application program.
Acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining acoustic parameters corresponding to the second dialect text according to the characterization corresponding to the second dialect text;
synthesizing dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Any one of the dialect speech synthesis methods provided in the embodiments of the present application may be applied to, or implemented by, the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated hardware logic circuits in the processor 701 or by instructions in the form of software. The processor 701 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor, or any conventional processor.
The steps of any of the dialect speech synthesis methods provided by the embodiments of the application may be performed directly by a hardware decoding processor, or by a combination of hardware and software units in the decoding processor. The software units may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 702, and the processor 701 reads the information in the memory 702 and performs the steps of the method in combination with its hardware.
Those skilled in the art will appreciate that although some embodiments described herein include some features that other embodiments do not, combinations of features of different embodiments are within the scope of the application and form further embodiments.
Those skilled in the art will appreciate that the descriptions of the various embodiments are each focused on, and that portions of one embodiment that are not described in detail may be referred to as related descriptions of other embodiments.
Although the embodiments of the present application have been described with reference to the accompanying drawings, those skilled in the art may make various modifications and substitutions without departing from the spirit and scope of the present application, and such modifications and substitutions fall within the scope of the appended claims. The described embodiments are merely illustrative of the present application; various equivalent modifications and substitutions will be readily apparent to those skilled in the art and are intended to be included within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (5)

1. A method of dialect speech synthesis, comprising:
acquiring a plurality of pairs of Mandarin text and dialect text;
training on the plurality of pairs of Mandarin text and dialect text to obtain an end-to-end translation model;
inputting a Mandarin text into the end-to-end translation model to obtain a first dialect text;
acquiring a plurality of dialect texts;
training on the plurality of dialect texts to obtain a dialect text model;
determining a second dialect text according to the first dialect text and the dialect text model;
analyzing the second dialect text to obtain a representation corresponding to the second dialect text;
acquiring a plurality of pairs of dialect text and speech;
training on the plurality of pairs of dialect text and speech to obtain an end-to-end synthesis model;
inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain acoustic parameters corresponding to the second dialect text; and
synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
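The training and synthesis steps of claim 1 can be sketched as the following pipeline. Every function here is a deliberately trivial, hypothetical stand-in — a word-level lookup table for the end-to-end translation model, a frequency counter for the dialect text model, and a per-character table for the end-to-end synthesis model — chosen only to make the data flow concrete; the patent's actual models are trained neural networks.

```python
from collections import Counter

def train_translation_model(pairs):
    # Stand-in for the "end-to-end translation model": a lookup table
    # learned from (Mandarin text, dialect text) pairs.
    return dict(pairs)

def train_dialect_text_model(dialect_texts):
    # Stand-in for the "dialect text model": a frequency table used to
    # rescore candidate dialect texts.
    return Counter(dialect_texts)

def determine_second_dialect_text(first_dialect_text, dialect_model):
    # Pick the candidate the dialect text model scores highest; the
    # first dialect text itself is always among the candidates.
    candidates = [first_dialect_text] + list(dialect_model)
    return max(candidates, key=lambda t: dialect_model.get(t, 0))

def analyze(dialect_text):
    # "Representation corresponding to the dialect text": here, simply
    # its character sequence.
    return list(dialect_text)

def train_synthesis_model(text_speech_pairs):
    # Stand-in for the "end-to-end synthesis model": a per-character
    # table of (fake) acoustic parameters taken from the paired speech.
    table = {}
    for text, speech in text_speech_pairs:
        for ch, frame in zip(text, speech):
            table[ch] = frame
    return table

def synthesize(mandarin_text, translation_model, dialect_model, synthesis_model):
    # The claimed pipeline: translate, rescore, analyze, then map the
    # representation to acoustic parameters.
    first = translation_model.get(mandarin_text, mandarin_text)
    second = determine_second_dialect_text(first, dialect_model)
    representation = analyze(second)
    return [synthesis_model.get(ch, 0.0) for ch in representation]
```

A vocoder stage, as in claim 2, would then convert the returned acoustic parameters into an audible waveform.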
2. The dialect speech synthesis method according to claim 1, wherein synthesizing the dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text comprises:
acquiring pairs of speech acoustic parameters and speech;
training a neural network vocoder according to the pairs of speech acoustic parameters and speech to obtain a neural network vocoder model; and
inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize the dialect speech corresponding to the Mandarin text.
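Claim 2's vocoder stage can likewise be sketched with a toy stand-in: here the "neural network vocoder" is replaced by a single least-squares gain fitted on (acoustic parameter, speech) pairs, purely to illustrate the train-then-infer structure the claim describes; a real implementation would train a neural vocoder on such pairs instead.

```python
def train_vocoder(param_speech_pairs):
    # Toy "vocoder training": least-squares fit of
    # sample = gain * parameter over all paired frames.
    num = sum(p * s for params, speech in param_speech_pairs
                    for p, s in zip(params, speech))
    den = sum(p * p for params, _ in param_speech_pairs for p in params)
    return num / den if den else 0.0

def run_vocoder(gain, acoustic_params):
    # Inference: map the second dialect text's acoustic parameters to
    # a (toy) waveform, one sample per parameter.
    return [gain * p for p in acoustic_params]
```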
3. A dialect speech synthesis apparatus, comprising:
an acquisition module, configured to:
acquire a plurality of pairs of Mandarin text and dialect text;
train on the plurality of pairs of Mandarin text and dialect text to obtain an end-to-end translation model; and
input a Mandarin text into the end-to-end translation model to obtain a first dialect text;
a first determination module, configured to:
acquire a plurality of dialect texts;
train on the plurality of dialect texts to obtain a dialect text model; and
determine a second dialect text according to the first dialect text and the dialect text model;
a second determination module, configured to:
analyze the second dialect text to obtain a representation corresponding to the second dialect text;
a third determination module, configured to:
acquire a plurality of pairs of dialect text and speech;
train on the plurality of pairs of dialect text and speech to obtain an end-to-end synthesis model; and
input the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain acoustic parameters corresponding to the second dialect text; and
a synthesis module, configured to:
synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
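The apparatus of claim 3 maps naturally onto an object whose collaborators correspond to the claimed modules. The class and the injected callables below are hypothetical illustrations of that decomposition, not the patent's implementation; each attribute stands for one claimed module's trained model.

```python
class DialectSpeechSynthesisApparatus:
    """Hypothetical module decomposition mirroring claim 3."""

    def __init__(self, translate, rescore, analyze, acoustics, vocoder):
        self.translate = translate    # acquisition module: Mandarin text -> first dialect text
        self.rescore = rescore        # first determination module: -> second dialect text
        self.analyze = analyze        # second determination module: -> representation
        self.acoustics = acoustics    # third determination module: -> acoustic parameters
        self.vocoder = vocoder        # synthesis module: -> dialect speech

    def __call__(self, mandarin_text):
        # Run the modules in the order the claims recite them.
        first = self.translate(mandarin_text)
        second = self.rescore(first)
        representation = self.analyze(second)
        params = self.acoustics(representation)
        return self.vocoder(params)
```

Wiring trivial callables through the constructor is enough to exercise the data flow end to end, which is how the test below uses it.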
4. An electronic device, comprising: a processor and a memory;
the processor is configured to perform the dialect speech synthesis method according to any one of claims 1 to 2 by invoking a program or instructions stored in the memory.
5. A computer-readable storage medium storing a program or instructions that cause a computer to perform the dialect speech synthesis method according to any one of claims 1 to 2.
CN202110616158.1A 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium Active CN113191164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616158.1A CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113191164A CN113191164A (en) 2021-07-30
CN113191164B true CN113191164B (en) 2023-11-10

Family

ID=76975818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616158.1A Active CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113191164B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110197655A (en) * 2019-06-28 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing voice
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112634866A (en) * 2020-12-24 2021-04-09 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, apparatus, device and medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130110511A1 (en) * 2011-10-31 2013-05-02 Telcordia Technologies, Inc. System, Method and Program for Customized Voice Communication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Conversion from Mandarin to Lanzhou dialect using a five-degree tone model; Liang Qingqing; Yang Hongwu; Guo Weitong; Pei Dong; Gan Zhenye; Technical Acoustics (Issue 06); 64-69 *

Also Published As

Publication number Publication date
CN113191164A (en) 2021-07-30

Similar Documents

Publication Publication Date Title
US10789938B2 (en) Speech synthesis method terminal and storage medium
CN109740053B (en) Sensitive word shielding method and device based on NLP technology
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
US11417316B2 (en) Speech synthesis method and apparatus and computer readable storage medium using the same
WO2017044415A1 (en) System and method for eliciting open-ended natural language responses to questions to train natural language processors
US8423365B2 (en) Contextual conversion platform
CN111226275A (en) Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN113066511A (en) Voice conversion method and device, electronic equipment and storage medium
WO2014183411A1 (en) Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
US11996084B2 (en) Speech synthesis method and apparatus, device and computer storage medium
CN110534115B (en) Multi-party mixed voice recognition method, device, system and storage medium
CN113191164B (en) Dialect voice synthesis method, device, electronic equipment and storage medium
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN112668704B (en) Training method and device of audio recognition model and audio recognition method and device
CN113781996B (en) Voice synthesis model training method and device and electronic equipment
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113345408B (en) Chinese and English voice mixed synthesis method and device, electronic equipment and storage medium
CN112509559A (en) Audio recognition method, model training method, device, equipment and storage medium
CN113948064A (en) Speech synthesis and speech recognition
CN113392645B (en) Prosodic phrase boundary prediction method and device, electronic equipment and storage medium
CN112951218B (en) Voice processing method and device based on neural network model and electronic equipment
CN110808957B (en) Vulnerability information matching processing method and device
US20230056128A1 (en) Speech processing method and apparatus, device and computer storage medium
CN111494950B (en) Game voice processing method and device, storage medium and electronic equipment
WO2024012040A1 (en) Method for speech generation and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant