CN113191164A - Dialect voice synthesis method and device, electronic equipment and storage medium - Google Patents

Dialect voice synthesis method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN113191164A
CN113191164A
Authority
CN
China
Prior art keywords
dialect
text
mandarin
determining
dialect text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110616158.1A
Other languages
Chinese (zh)
Other versions
CN113191164B (en)
Inventor
孙见青
梁家恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202110616158.1A priority Critical patent/CN113191164B/en
Publication of CN113191164A publication Critical patent/CN113191164A/en
Application granted granted Critical
Publication of CN113191164B publication Critical patent/CN113191164B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a dialect speech synthesis method, a dialect speech synthesis device, electronic equipment and a storage medium, wherein the method comprises the following steps: obtaining a Mandarin text, and determining a first dialect text according to the Mandarin text; determining a second dialect text according to the first dialect text and a dialect text model; determining a representation corresponding to the second dialect text according to the second dialect text; determining acoustic parameters corresponding to the second dialect text according to the representation; and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters. In the embodiment of the application, synthesizing Mandarin text into dialect speech is realized; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, that is, the newspaper is read out in a Sichuan-dialect voice.

Description

Dialect voice synthesis method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of voice processing, in particular to a dialect synthesis method, a dialect synthesis device, electronic equipment and a storage medium.
Background
At present, when dialect speech is synthesized, an input Mandarin text is synthesized directly into dialect speech, so the synthesized dialect often sounds inauthentic and does not match the dialect's characteristic expressions.
Disclosure of Invention
The invention provides a dialect speech synthesis method, a device, electronic equipment and a storage medium, which can solve the technical problem that the synthesized dialect is not authentic.
The technical scheme for solving the technical problems is as follows:
in a first aspect, an embodiment of the present invention provides a dialect speech synthesis method, including:
obtaining a mandarin text, and determining a first dialect text according to the mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
In some embodiments, determining the first dialect text from the Mandarin text includes:
obtaining a plurality of mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
In some embodiments, determining the second dialect text from the first dialect text and the dialect text model comprises:
acquiring a plurality of dialect texts;
training a plurality of dialect texts to obtain a dialect text model;
and obtaining a second dialect text according to the first dialect text and the dialect text model.
In some embodiments, determining the representation corresponding to the second dialect text from the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
In some embodiments, determining the acoustic parameter corresponding to the second dialect text from the representation corresponding to the second dialect text includes:
obtaining a plurality of dialect text and voice pairs;
training according to a plurality of dialect text and voice pairs to obtain an end-to-end synthesis model;
and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameter corresponding to the second dialect text.
In some embodiments, synthesizing dialect speech corresponding to the Mandarin text from the acoustic parameters corresponding to the second dialect text includes:
acquiring voice acoustic parameters and voice pairs;
training a neural network vocoder according to the voice acoustic parameters and the voice to obtain a neural network vocoder model;
and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect voice corresponding to the Mandarin text.
In a second aspect, an embodiment of the present invention provides a dialect speech synthesis apparatus, including:
an acquisition module, configured to obtain a Mandarin text and determine a first dialect text according to the Mandarin text;
a first determination module, configured to determine a second dialect text according to the first dialect text and the dialect text model;
a second determination module, configured to determine a representation corresponding to the second dialect text according to the second dialect text;
a third determination module, configured to determine an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and a synthesis module, configured to synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameter corresponding to the second dialect text.
In some embodiments, the obtaining module is further configured to:
obtaining a plurality of mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
the Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to perform any of the dialect speech synthesis methods described above by invoking programs or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores a program or instructions for causing a computer to execute the dialect speech synthesis method according to any one of the above descriptions.
The invention has the beneficial effects that: a Mandarin text is obtained, and a first dialect text is determined according to the Mandarin text; a second dialect text is determined according to the first dialect text and a dialect text model; a representation corresponding to the second dialect text is determined according to the second dialect text; acoustic parameters corresponding to the second dialect text are determined according to that representation; and dialect speech corresponding to the Mandarin text is synthesized according to the acoustic parameters. In the embodiment of the application, synthesizing Mandarin text into dialect speech is realized; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, that is, the newspaper is read out in a Sichuan-dialect voice.
Drawings
Fig. 1 is a diagram of a dialect speech synthesis method according to an embodiment of the present invention;
fig. 2 is a second diagram of a dialect speech synthesis method according to an embodiment of the present invention;
fig. 3 is a third diagram of a dialect speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present invention;
FIG. 5 is a fifth illustration of a dialect speech synthesis method according to an embodiment of the present invention;
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present invention;
fig. 7 is a schematic block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
In order that the above objects, features and advantages of the present application can be more clearly understood, the present disclosure is further described in detail below with reference to the accompanying drawings and examples. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure; the specific embodiments described herein are merely illustrative and are not limiting. All other embodiments that can be derived by one of ordinary skill in the art from the described embodiments are intended to fall within the scope of the present disclosure.
It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
Fig. 1 is a dialect speech synthesis method according to an embodiment of the present invention.
In a first aspect, as shown in fig. 1, an embodiment of the present invention provides a dialect speech synthesis method, including the following steps S101, S102, S103, S104, and S105:
s101: and acquiring a Mandarin text, and determining a first dialect text according to the Mandarin text.
Specifically, the Mandarin text in the embodiment of the application can be Mandarin text printed in newspapers, magazines, periodicals and the like; the first dialect text can be a Sichuan dialect text, a Guangdong (Cantonese) dialect text, a Fujian dialect text, or the like; for example, Mandarin text in a newspaper is translated into the corresponding dialect text, such as Sichuan dialect text.
It should be understood that the Mandarin text and the first dialect text above are only examples and are not intended to limit the scope of the present invention; in the embodiments of the present application, Sichuan dialect text is used as the running example.
S102: and determining a second dialect text according to the first dialect text and the dialect text model.
Specifically, the Mandarin text is translated into a Sichuan dialect text in step S101, and a more accurate Sichuan dialect text is then obtained from that text and the dialect text model; it should be understood that introducing the dialect text model improves the accuracy of the text translation.
S103: and determining the representation corresponding to the second dialect text according to the second dialect text.
Specifically, after the higher-accuracy dialect text corresponding to the Mandarin text is determined through steps S101 and S102, the dialect text is analyzed to obtain the representation corresponding to the second dialect text.
S104: and determining the acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text.
Specifically, in the embodiment of the present application, the acoustic parameter corresponding to the dialect text with higher accuracy is determined through the representation corresponding to the dialect text with higher accuracy.
S105: and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Specifically, in the embodiment of the present application, dialect speech corresponding to mandarin text is synthesized through the acoustic parameters in step S104.
It should be understood that through the five steps S101 to S105, Mandarin text is synthesized into dialect speech; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, i.e. the newspaper is read out in a Sichuan-dialect voice. By introducing the dialect text, the higher-accuracy dialect text, and the acoustic parameters, the dialect synthesized in the embodiment of the present application is more authentic than dialect speech synthesized directly as in the prior art.
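The five steps above can be sketched as a simple pipeline. This is a hypothetical illustration only: the function names are not from the patent, and each stub body stands in for a trained model (translation model, dialect text model, text analyzer, synthesis model, and vocoder), so only the data flow between S101 and S105 is shown.

```python
# Hypothetical sketch of the S101-S105 pipeline; every stage is a stub
# standing in for a trained model, so only the data flow is illustrated.

def translate_to_dialect(mandarin_text: str) -> str:
    """S101: end-to-end translation model, Mandarin -> first dialect text."""
    return mandarin_text  # stub: a real model would rewrite the wording

def refine_with_dialect_lm(first_dialect_text: str) -> str:
    """S102: dialect text model refines the text -> second dialect text."""
    return first_dialect_text  # stub

def analyze_text(second_dialect_text: str) -> list:
    """S103: text analysis -> linguistic representation (e.g. symbols)."""
    return list(second_dialect_text)

def predict_acoustics(representation: list) -> list:
    """S104: end-to-end synthesis model -> acoustic parameters."""
    return [float(len(representation))]  # stub: one dummy parameter

def vocode(acoustic_params: list) -> bytes:
    """S105: neural network vocoder -> dialect speech waveform."""
    return bytes(int(p) % 256 for p in acoustic_params)  # stub samples

def synthesize_dialect_speech(mandarin_text: str) -> bytes:
    text1 = translate_to_dialect(mandarin_text)
    text2 = refine_with_dialect_lm(text1)
    rep = analyze_text(text2)
    params = predict_acoustics(rep)
    return vocode(params)
```

Each stub would be replaced by the corresponding trained model described in the embodiments below.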
Fig. 2 is a second dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 2, determining the first dialect text from the mandarin text includes the following three steps S201, S202 and S203:
s201: a plurality of mandarin text and dialect text pairs are obtained.
Specifically, in the embodiment of the present application, a plurality of Mandarin text and dialect text pairs are obtained, such as Mandarin and Sichuan-dialect pairs, Mandarin and Beijing-dialect pairs, Mandarin and Shanghai-dialect pairs, and the like.
S202: an end-to-end translation model is obtained from the training of the plurality of mandarin text and dialect text pairs.
Specifically, in the embodiment of the present application, the end-to-end translation model is obtained through training on the Mandarin–Sichuan, Mandarin–Beijing, Mandarin–Shanghai and similar text pairs.
S203: the Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model has been trained, the Sichuan dialect text, Beijing dialect text, Shanghai dialect text or the like corresponding to a Mandarin text can later be determined directly through the model.
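The pair-based training idea of S201–S203 can be illustrated with a toy word-substitution "translator". A real system would train a neural sequence-to-sequence model on the text pairs, as the patent states; this sketch only shows how aligned Mandarin/dialect pairs drive the learned mapping. The sample pairs (e.g. 没有→没得, 什么→啥子, commonly cited Sichuan-dialect forms) are illustrative, not from the patent.

```python
# Toy stand-in for the end-to-end Mandarin->dialect translation model:
# learn word-level substitutions from aligned text pairs (illustrative
# data; a real system would train a neural seq2seq model instead).

def train_substitutions(pairs):
    """Collect Mandarin->dialect word replacements from aligned pairs."""
    table = {}
    for mandarin, dialect in pairs:
        m_words, d_words = mandarin.split(), dialect.split()
        if len(m_words) == len(d_words):  # only use word-aligned pairs
            for m, d in zip(m_words, d_words):
                if m != d:
                    table[m] = d
    return table

def translate(text, table):
    """Apply the learned substitutions word by word."""
    return " ".join(table.get(w, w) for w in text.split())

# Illustrative Mandarin / Sichuan-dialect pairs (whitespace-tokenized)
PAIRS = [
    ("没有 问题", "没得 问题"),
    ("你 在 做 什么", "你 在 做 啥子"),
]
TABLE = train_substitutions(PAIRS)
```

The substitution table plays the role of the trained translation model: at inference time, a new Mandarin sentence is mapped to a first dialect text without any further training.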
Fig. 3 is a third diagram of a dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 3, determining the second dialect text from the first dialect text and the dialect text model includes:
s301: a plurality of dialect texts is obtained.
Specifically, in the embodiment of the present application, a plurality of dialect texts are obtained, such as Sichuan dialect, Beijing dialect, Fujian dialect and Shanghai dialect texts, and the like.
S302: and training a plurality of dialect texts to obtain a dialect text model.
Specifically, after the plurality of dialect texts (Sichuan, Beijing, Fujian, Shanghai, and so on) are obtained, the embodiment of the present application trains a dialect text model, such as a Sichuan dialect text model, on these texts.
S303: and obtaining a second dialect text according to the first dialect text and the dialect text model.
Specifically, the accuracy of the Sichuan dialect text produced by the end-to-end translation model alone is limited; to improve it further, the text is refined through the Sichuan dialect text model, so that a Sichuan dialect text with higher accuracy is obtained.
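One simple way to realize this refinement step is to let a dialect language model score candidate texts and keep the best-scoring one. The patent does not specify the model form, so the character-bigram model below is a hedged illustration with made-up training lines; any stronger dialect text model could play the same role.

```python
import math
from collections import Counter

# Toy character-bigram dialect text model (an assumption -- the patent
# does not specify the model form): score candidate dialect texts and
# keep the one the dialect model finds most fluent.

def train_bigram_lm(texts):
    bigrams, unigrams = Counter(), Counter()
    for t in texts:
        chars = list(t)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return bigrams, unigrams

def score(text, lm):
    """Add-one-smoothed log-probability of the text's bigrams."""
    bigrams, unigrams = lm
    chars = list(text)
    total = 0.0
    for a, b in zip(chars, chars[1:]):
        total += math.log((bigrams[(a, b)] + 1) / (unigrams[a] + len(unigrams)))
    return total

def refine(candidates, lm):
    """Pick the candidate second-dialect text the model scores highest."""
    return max(candidates, key=lambda c: score(c, lm))

# Illustrative Sichuan-flavored training lines
LM = train_bigram_lm(["没得问题", "啥子事"])
```

Because the model is trained only on dialect text, candidates that use genuine dialect wording score higher than literal Mandarin-like wording, which is the mechanism by which accuracy improves here.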
In some embodiments, determining the representation corresponding to the second dialect text from the second dialect text includes:
and analyzing the second dialect text to obtain the representation corresponding to the second dialect text.
Illustratively, in the embodiment of the present application, the representation corresponding to the Sichuan dialect text is obtained by analyzing the Sichuan dialect text.
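The analysis step can be pictured as a lexicon lookup from characters to phonetic symbols. The patent does not define the representation, so the entries below are placeholder pronunciations, not real Sichuan-dialect phonology; they serve only to show the text-to-representation shape.

```python
# Toy text-analysis step: map each dialect character to phonetic
# symbols via a lexicon. The entries are illustrative placeholders,
# not real Sichuan-dialect pronunciations.

LEXICON = {
    "没": ["m", "ei2"],
    "得": ["d", "e2"],
}

def text_to_representation(text):
    """Flatten per-character symbol sequences into one representation."""
    representation = []
    for ch in text:
        representation.extend(LEXICON.get(ch, ["<unk>"]))
    return representation
```

A production analyzer would also attach prosody and tone information, but the interface — dialect text in, symbol sequence out — stays the same.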
FIG. 4 is a fourth diagram of a dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 4, determining the acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text includes:
s401: a plurality of dialect text and speech pairs are obtained.
Specifically, in the embodiment of the present application, a plurality of dialect text and speech pairs are obtained, such as Sichuan dialect text with its corresponding speech, Beijing dialect text with its corresponding speech, Shanghai dialect text with its corresponding speech, and the like.
S402: an end-to-end synthesis model is derived from the training of the plurality of dialect text and speech pairs.
Specifically, in the embodiment of the present application, the end-to-end synthesis model is obtained through training in which, for example, the representation corresponding to a Sichuan dialect text is used as input and the acoustic parameters of the corresponding Sichuan speech as output, the representation corresponding to a Beijing dialect text as input and the acoustic parameters of the corresponding Beijing speech as output, and the representation corresponding to a Shanghai dialect text as input and the acoustic parameters of the corresponding Shanghai speech as output.
S403: and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameter corresponding to the second dialect text.
Specifically, for example, the representation corresponding to the Sichuan dialect text is input into the end-to-end synthesis model to obtain the acoustic parameters corresponding to the Sichuan dialect text.
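The interface of the end-to-end synthesis model — representation in, acoustic-parameter frames out — can be shown with random placeholder weights. Nothing here is trained; it only fixes the shapes, and the 80-dimensional frame size is an assumption (a common mel-spectrogram setting), not a figure from the patent.

```python
import random

# Interface sketch of the end-to-end synthesis model: a symbol
# representation goes in, a sequence of acoustic-parameter frames comes
# out. Weights are random placeholders, and the 80-dimensional frame
# size is an assumption (a common mel-spectrogram setting).
random.seed(0)

VOCAB = {"m": 0, "ei2": 1, "d": 2, "e2": 3, "<unk>": 4}
N_PARAMS = 80
FRAME_TABLE = [[random.gauss(0.0, 1.0) for _ in range(N_PARAMS)]
               for _ in VOCAB]

def predict_acoustic_params(representation):
    """Return one acoustic-parameter frame per input symbol."""
    ids = [VOCAB.get(sym, VOCAB["<unk>"]) for sym in representation]
    return [FRAME_TABLE[i] for i in ids]
```

A real model would predict many frames per symbol with learned alignment, but the contract consumed by the vocoder below is unchanged.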
Fig. 5 is a fifth diagram of a dialect speech synthesis method according to an embodiment of the present invention.
In some embodiments, as shown in fig. 5, synthesizing dialect speech corresponding to mandarin chinese text according to the acoustic parameters corresponding to the second dialect text includes:
s501: and acquiring voice acoustic parameters and voice pairs.
Specifically, in the embodiment of the present application, the speech acoustic parameters and the speech pairs are obtained, that is, the speech acoustic parameters and the speech are in one-to-one correspondence.
S502: Training a neural network vocoder according to the voice acoustic parameters and the voice to obtain a neural network vocoder model.
Specifically, in the embodiment of the present application, the acoustic parameters corresponding to speech are used as input and the speech as output, so as to train the neural network vocoder and obtain the neural network vocoder model.
S503: Inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize dialect speech corresponding to the Mandarin text.
Specifically, as an example, the acoustic parameters corresponding to the Sichuan dialect text are input into the neural network vocoder model to synthesize Sichuan-dialect speech, that is, the Mandarin text is broadcast in a Sichuan-dialect voice.
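The vocoder contract — acoustic frames in, waveform samples out — can be mimicked with a trivial sinusoidal oscillator. A trained neural vocoder would replace this entirely; the frame length, sample rate and pitch below are illustrative assumptions, not values from the patent.

```python
import math

# Minimal stand-in for the neural network vocoder: each acoustic frame's
# mean magnitude drives a fixed-pitch sine oscillator. A trained neural
# vocoder would replace this; it only illustrates the frames-in /
# samples-out contract. Frame length, sample rate and pitch are
# illustrative assumptions.

def vocode(frames, samples_per_frame=200, sample_rate=16000, pitch_hz=200.0):
    wave = []
    for frame in frames:
        amplitude = sum(abs(v) for v in frame) / len(frame)
        for _ in range(samples_per_frame):
            t = len(wave) / sample_rate  # time of the next sample
            wave.append(amplitude * math.sin(2 * math.pi * pitch_hz * t))
    return wave
```

At 200 samples per frame and 16 kHz, each frame covers 12.5 ms of audio, which is in the range typical acoustic-model frame shifts fall into.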
FIG. 6 is a diagram of a dialect speech synthesis apparatus according to an embodiment of the present invention.
In a second aspect, an embodiment of the present invention provides a dialect speech synthesis apparatus, including:
the acquisition module 601: the method comprises the steps of obtaining a Mandarin text, and determining a first dialect text according to the Mandarin text;
specifically, the mandarin text in the embodiment of the application can be the mandarin text recorded in newspapers, magazines, periodicals and the like; the first dialect text can be a Sichuan dialect text, a Guangdong dialect text and a Fujian dialect text; for example, based on mandarin text in a newspaper, translated into corresponding dialect text, such as tetragon text.
It should be understood that the Mandarin Chinese text and the first dialect text above are only examples and are not intended to limit the scope of the present invention; in the embodiments of the present application, the text of Sichuan is described as an example.
The first determination module 602 is configured to determine a second dialect text according to the first dialect text and the dialect text model.
Specifically, after the Mandarin text is acquired through the acquisition module 601, it is translated into a Sichuan dialect text, and a more accurate Sichuan dialect text is then obtained according to that text and the dialect text model.
The second determination module 603 is configured to determine the representation corresponding to the second dialect text according to the second dialect text.
specifically, after the dialect text with higher accuracy corresponding to the mandarin text is determined by the first determining module 602 and the second determining module 603, the dialect text is analyzed to obtain the representation corresponding to the second dialect text.
The third determination module 604 is configured to determine the acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text.
specifically, in the embodiment of the present application, the acoustic parameter corresponding to the dialect text is determined by the representation corresponding to the dialect text with higher accuracy.
The synthesis module 605 is configured to synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameter corresponding to the second dialect text.
Specifically, in the embodiment of the present application, the acoustic parameters determined by the third determining module 604 are synthesized by the synthesizing module 605 to generate dialect speech corresponding to mandarin chinese text.
It should be understood that through the five modules 601 to 605, Mandarin text is synthesized into dialect speech; for example, Mandarin text in a newspaper is synthesized into Sichuan-dialect speech, i.e. the newspaper is read out in a Sichuan-dialect voice. By introducing the dialect text, the higher-accuracy dialect text, and the acoustic parameters, the dialect synthesized in the embodiment of the present application is more authentic than dialect speech synthesized directly as in the prior art.
In some embodiments, the obtaining module 601 is further configured to:
obtaining a plurality of mandarin text and dialect text pairs;
specifically, in the embodiment of the present application, a plurality of mandarin text and dialect text pairs, such as mandarin and tetrakawa pair, mandarin and beijing pairs, mandarin and shanghai pairs, and the like, are obtained.
An end-to-end translation model is obtained from the training of the plurality of mandarin text and dialect text pairs.
Specifically, in the embodiment of the present application, the end-to-end translation model is obtained through training on the Mandarin–Sichuan, Mandarin–Beijing, Mandarin–Shanghai and similar text pairs.
The Mandarin text is input into the end-to-end translation model to obtain a first dialect text.
Specifically, in the embodiment of the application, once the end-to-end translation model has been trained, the Sichuan dialect text, Beijing dialect text, Shanghai dialect text or the like corresponding to a Mandarin text can later be determined directly through the model.
In a third aspect, an embodiment of the present invention further provides an electronic device, including: a processor and a memory;
the processor is configured to perform any of the dialect speech synthesis methods described above by invoking programs or instructions stored in the memory.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores a program or instructions for causing a computer to execute the dialect speech synthesis method according to any one of the above descriptions.
Fig. 7 is a schematic block diagram of an electronic device provided by an embodiment of the disclosure.
As shown in fig. 7, the electronic apparatus includes: at least one processor 701, at least one memory 702, and at least one communication interface 703. The various components in the electronic device are coupled together by a bus system 704. A communication interface 703 for information transmission with an external device. It is understood that the bus system 704 is used to enable communications among the components. The bus system 704 includes a power bus, a control bus, and a status signal bus in addition to a data bus. For clarity of illustration, the various buses are labeled in fig. 7 as the bus system 704.
It will be appreciated that the memory 702 in this embodiment can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.
In some embodiments, memory 702 stores the following elements, executable units or data structures, or a subset thereof, or an expanded set thereof: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, such as a media player, a browser, and the like, are used to implement various application services. A program implementing any one of the dialect speech synthesis methods provided by the embodiments of the present application may be included in an application program.
In this embodiment of the application, the processor 701 is configured to execute the steps of the dialect speech synthesis method provided by the embodiment of the application by calling a program or an instruction stored in the memory 702, specifically, a program or an instruction stored in an application program, for example:
Obtaining a mandarin text, and determining a first dialect text according to the mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
Any one of the dialect speech synthesis methods provided in the embodiments of the present application may be applied to, or implemented by, the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be carried out by integrated logic circuits of hardware or by instructions in the form of software in the processor 701. The processor 701 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor.
The steps of any one of the dialect speech synthesis methods provided by the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software units in the decoding processor. The software units may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 702; the processor 701 reads the information in the memory 702 and completes the steps of the method in combination with its hardware.
Those skilled in the art will appreciate that although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the application and to form different embodiments.
Those skilled in the art will appreciate that the description of each embodiment has a respective emphasis, and reference may be made to the related description of other embodiments for those parts of an embodiment that are not described in detail.
Although the embodiments of the present application have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the application. Any person skilled in the art can readily conceive of equivalent modifications and substitutions within the technical scope of the present disclosure, and such modifications and substitutions are intended to fall within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A dialect speech synthesis method, comprising:
obtaining a Mandarin text, and determining a first dialect text according to the Mandarin text;
determining a second dialect text according to the first dialect text and the dialect text model;
determining a representation corresponding to the second dialect text according to the second dialect text;
determining an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and synthesizing the dialect voice corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text.
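Taken together, the five steps of claim 1 form a pipeline from Mandarin text to dialect speech. The sketch below wires toy stand-ins for each stage end-to-end; every function name, word pair, and number in it is a hypothetical illustration for orientation only, not the trained models described in the patent.

```python
# Toy end-to-end sketch of the claim 1 pipeline. Each stage is a
# hypothetical stand-in for a trained model (see claims 2-6).

# Step 1: Mandarin text -> first dialect text (stand-in for the translation model)
def translate_to_dialect(mandarin_text):
    toy_lexicon = {"hello": "hullo", "you": "ye"}  # hypothetical word pairs
    return " ".join(toy_lexicon.get(w, w) for w in mandarin_text.split())

# Step 2: first dialect text -> second dialect text (stand-in for the dialect text model)
def apply_dialect_text_model(first_dialect_text):
    # A real system would keep the candidate the dialect model scores highest;
    # this stand-in passes the text through unchanged.
    return first_dialect_text

# Step 3: second dialect text -> symbolic representation (here, a character sequence)
def to_representation(dialect_text):
    return list(dialect_text.replace(" ", ""))

# Step 4: representation -> acoustic parameters (stand-in for the synthesis model)
def to_acoustic_params(representation):
    return [float(ord(c)) for c in representation]

# Step 5: acoustic parameters -> waveform samples (stand-in for the vocoder)
def vocode(acoustic_params):
    return [p / 128.0 for p in acoustic_params]

def synthesize_dialect_speech(mandarin_text):
    first = translate_to_dialect(mandarin_text)
    second = apply_dialect_text_model(first)
    rep = to_representation(second)
    params = to_acoustic_params(rep)
    return vocode(params)

wave = synthesize_dialect_speech("hello you")
```

In a deployed system each stand-in would be replaced by the corresponding trained model: the end-to-end translation model (claim 2), the dialect text model (claim 3), the end-to-end synthesis model (claim 5), and the neural network vocoder (claim 6).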
2. The dialect speech synthesis method of claim 1, wherein the determining a first dialect text according to the Mandarin text comprises:
obtaining a plurality of Mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
and inputting the Mandarin text into the end-to-end translation model to obtain the first dialect text.
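Claim 2 trains a translation model from parallel Mandarin/dialect text pairs and applies it to new Mandarin text. As a hedged illustration only, the toy "model" below learns word-level substitutions from position-aligned pairs; an actual system would train an end-to-end neural sequence-to-sequence model, and the word pairs here are invented.

```python
# Toy stand-in for the end-to-end translation model of claim 2:
# learn word substitutions from position-aligned Mandarin/dialect pairs.
def train_translation_model(pairs):
    table = {}
    for mandarin, dialect in pairs:
        for m_word, d_word in zip(mandarin.split(), dialect.split()):
            table[m_word] = d_word  # last observed substitution wins
    return table

def translate(model, mandarin_text):
    # Unknown words pass through unchanged.
    return " ".join(model.get(w, w) for w in mandarin_text.split())

pairs = [("I eat rice", "I chow rice"), ("you eat", "ye chow")]  # hypothetical data
model = train_translation_model(pairs)
first_dialect_text = translate(model, "you eat rice")
```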
3. The dialect speech synthesis method of claim 1, wherein determining the second dialect text from the first dialect text and the dialect text model comprises:
acquiring a plurality of dialect texts;
training according to the plurality of dialect texts to obtain the dialect text model;
and determining a second dialect text according to the first dialect text and the dialect text model.
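Claim 3 trains a dialect text model on dialect-only text and uses it to turn the first dialect text into the second. One plausible reading is language-model rescoring: among candidate dialect renderings, keep the one the dialect model scores highest. The unigram model below is a deliberately minimal stand-in for illustration, with invented data.

```python
from collections import Counter

# Toy stand-in for the dialect text model of claim 3: a unigram model
# trained on dialect-only text, used to rescore candidate dialect texts.
def train_dialect_text_model(dialect_texts):
    counts = Counter(w for text in dialect_texts for w in text.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score(model, text, floor=1e-6):
    # Product of unigram probabilities, with a floor for unseen words.
    p = 1.0
    for w in text.split():
        p *= model.get(w, floor)
    return p

def pick_second_dialect_text(model, candidates):
    return max(candidates, key=lambda t: score(model, t))

dialect_corpus = ["ye chow rice", "ye chow noodles", "ye go"]  # hypothetical
model = train_dialect_text_model(dialect_corpus)
second = pick_second_dialect_text(model, ["you eat rice", "ye chow rice"])
```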
4. The dialect speech synthesis method of claim 1, wherein the determining the corresponding representation of the second dialect text from the second dialect text comprises:
and analyzing the second dialect text to obtain a representation corresponding to the second dialect text.
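Claim 4 derives the representation by analyzing the second dialect text. A common representation in speech synthesis is a phone sequence looked up in a pronunciation lexicon; the sketch below assumes a tiny hypothetical lexicon and falls back to letters for unknown words.

```python
# Toy text analysis for claim 4: map each word of the second dialect text
# to a phone-like sequence via a (hypothetical) pronunciation lexicon.
toy_lexicon = {"ye": ["y", "e"], "chow": ["ch", "ow"], "rice": ["r", "ai", "s"]}

def text_to_representation(dialect_text, lexicon):
    phones = []
    for word in dialect_text.split():
        phones.extend(lexicon.get(word, list(word)))  # fall back to letters
    return phones

rep = text_to_representation("ye chow rice", toy_lexicon)
```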
5. The dialect speech synthesis method of claim 1, wherein the determining the acoustic parameters corresponding to the second dialect text according to the representation corresponding to the second dialect text comprises:
obtaining a plurality of dialect text and voice pairs;
training to obtain an end-to-end synthesis model according to the dialect text and the voice pairs;
and inputting the representation corresponding to the second dialect text into the end-to-end synthesis model to obtain the acoustic parameter corresponding to the second dialect text.
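Claim 5 trains an end-to-end synthesis model from dialect text/speech pairs and uses it to map the representation to acoustic parameters. The stand-in below merely averages one toy acoustic value per symbol from position-aligned training pairs; a real model would be a neural attention-based synthesizer, and all numbers here are invented.

```python
# Toy stand-in for the end-to-end synthesis model of claim 5: average a
# single acoustic value per symbol from position-aligned training pairs.
def train_synthesis_model(pairs):
    # pairs: (symbol sequence, acoustic parameter sequence), aligned by position
    sums, counts = {}, {}
    for symbols, params in pairs:
        for s, p in zip(symbols, params):
            sums[s] = sums.get(s, 0.0) + p
            counts[s] = counts.get(s, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}

def predict_acoustic_params(model, representation, default=0.0):
    return [model.get(s, default) for s in representation]

training_pairs = [(["y", "e"], [0.2, 0.4]), (["e", "s"], [0.4, 0.6])]  # toy data
model = train_synthesis_model(training_pairs)
params = predict_acoustic_params(model, ["y", "e", "s"])
```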
6. The dialect speech synthesis method of claim 1, wherein the synthesizing dialect speech corresponding to the Mandarin text according to the acoustic parameters corresponding to the second dialect text comprises:
obtaining a plurality of acoustic parameter and speech pairs;
training a neural network vocoder according to the acoustic parameter and speech pairs to obtain a neural network vocoder model;
and inputting the acoustic parameters corresponding to the second dialect text into the neural network vocoder model to synthesize the dialect speech corresponding to the Mandarin text.
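Claim 6 trains a neural network vocoder on acoustic parameter/speech pairs and then drives it with the predicted parameters. In place of a neural network, the toy "vocoder" below fits a single least-squares gain, purely to make the train-then-synthesize flow concrete; the data is invented.

```python
# Toy stand-in for the neural network vocoder of claim 6: fit one scalar
# gain by least squares from (acoustic parameter, speech sample) pairs,
# then use it to map new acoustic parameters to waveform samples.
def train_vocoder(param_speech_pairs):
    num = sum(p * s for p, s in param_speech_pairs)
    den = sum(p * p for p, s in param_speech_pairs)
    return num / den  # scalar gain: a deliberately simple "model"

def vocode(gain, acoustic_params):
    return [gain * p for p in acoustic_params]

training_pairs = [(0.2, 0.1), (0.4, 0.2), (0.6, 0.3)]  # toy (parameter, sample) data
gain = train_vocoder(training_pairs)
waveform = vocode(gain, [0.2, 0.4, 0.6])
```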
7. A dialect speech synthesis apparatus, comprising:
an acquisition module, configured to obtain a Mandarin text and determine a first dialect text according to the Mandarin text;
a first determination module, configured to determine a second dialect text according to the first dialect text and the dialect text model;
a second determination module, configured to determine a representation corresponding to the second dialect text according to the second dialect text;
a third determination module, configured to determine an acoustic parameter corresponding to the second dialect text according to the representation corresponding to the second dialect text;
and a synthesis module, configured to synthesize dialect speech corresponding to the Mandarin text according to the acoustic parameter corresponding to the second dialect text.
8. The dialect speech synthesis apparatus of claim 7, wherein the obtaining module is further configured to:
obtaining a plurality of Mandarin text and dialect text pairs;
training according to the plurality of Mandarin text and dialect text pairs to obtain an end-to-end translation model;
and inputting the Mandarin text into the end-to-end translation model to obtain the first dialect text.
9. An electronic device, comprising: a processor and a memory;
the processor is configured to perform a dialect speech synthesis method according to any one of claims 1 to 6 by calling a program or instructions stored in the memory.
10. A computer-readable storage medium storing a program or instructions for causing a computer to perform a dialect speech synthesis method according to any one of claims 1 to 6.
CN202110616158.1A 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium Active CN113191164B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110616158.1A CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113191164A (en) 2021-07-30
CN113191164B CN113191164B (en) 2023-11-10

Family

ID=76975818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110616158.1A Active CN113191164B (en) 2021-06-02 2021-06-02 Dialect voice synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113191164B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130110511A1 (en) * 2011-10-31 2013-05-02 Telcordia Technologies, Inc. System, Method and Program for Customized Voice Communication
CN110197655A (en) * 2019-06-28 2019-09-03 百度在线网络技术(北京)有限公司 Method and apparatus for synthesizing voice
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112634866A (en) * 2020-12-24 2021-04-09 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, apparatus, device and medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁青青; 杨鸿武; 郭威彤; 裴东; 甘振业: "Conversion from Mandarin to Lanzhou Dialect Using the Five-Degree Tone Model" (利用五度字调模型实现普通话到兰州方言的转换), Technical Acoustics (声学技术), no. 06, pages 64-69 *

Also Published As

Publication number Publication date
CN113191164B (en) 2023-11-10

Similar Documents

Publication Publication Date Title
CN110136691B (en) Speech synthesis model training method and device, electronic equipment and storage medium
CN109686361B (en) Speech synthesis method, device, computing equipment and computer storage medium
WO2020248393A1 (en) Speech synthesis method and system, terminal device, and readable storage medium
DE112010005168B4 (en) Recognition dictionary generating device, speech recognition device and voice synthesizer
US11417316B2 (en) Speech synthesis method and apparatus and computer readable storage medium using the same
CN105261355A (en) Voice synthesis method and apparatus
CN113066511B (en) Voice conversion method and device, electronic equipment and storage medium
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN110211562B (en) Voice synthesis method, electronic equipment and readable storage medium
CN113470684A (en) Audio noise reduction method, device, equipment and storage medium
CN113053357A (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN113345431A (en) Cross-language voice conversion method, device, equipment and medium
CN113256262A (en) Automatic generation method and system of conference summary, storage medium and electronic equipment
CN110312161B (en) Video dubbing method and device and terminal equipment
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN113191164A (en) Dialect voice synthesis method and device, electronic equipment and storage medium
CN112668704B (en) Training method and device of audio recognition model and audio recognition method and device
CN113555003B (en) Speech synthesis method, device, electronic equipment and storage medium
CN113256133B (en) Conference summary management method, device, computer equipment and storage medium
CN112487771B (en) Report generation method, report generation device and terminal
CN113345408B (en) Chinese and English voice mixed synthesis method and device, electronic equipment and storage medium
CN112133279B (en) Vehicle-mounted information broadcasting method and device and terminal equipment
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language
CN112951218B (en) Voice processing method and device based on neural network model and electronic equipment
CN115171651B (en) Method and device for synthesizing infant voice, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant