CN107705782B - Method and device for determining phoneme pronunciation duration - Google Patents

Method and device for determining phoneme pronunciation duration

Info

Publication number
CN107705782B
Authority
CN
China
Prior art keywords
pronunciation
duration
sequence
phoneme
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710911991.2A
Other languages
Chinese (zh)
Other versions
CN107705782A (en)
Inventor
张黄斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201710911991.2A priority Critical patent/CN107705782B/en
Publication of CN107705782A publication Critical patent/CN107705782A/en
Application granted granted Critical
Publication of CN107705782B publication Critical patent/CN107705782B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • G10L2013/105Duration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a method and a device for determining the pronunciation duration of phonemes. One embodiment of the method comprises: acquiring a text corresponding to a speech to be synthesized; extracting text features of the text; importing the text features into a pre-established first duration determination model to generate a pronunciation duration type sequence, wherein the pronunciation duration types in the sequence correspond to the phonemes in the phoneme sequence corresponding to the text, and the first duration determination model is used for representing the correspondence between text features and pronunciation duration type sequences; and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset correspondence information, wherein the correspondence information indicates the correspondence between pronunciation duration types and pronunciation durations. This embodiment improves the accuracy of the determined phoneme pronunciation durations.

Description

Method and device for determining phoneme pronunciation duration
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of speech synthesis, and particularly relates to a method and a device for determining pronunciation duration of phonemes.
Background
Speech synthesis, also known as text-to-speech technology, generates artificial speech by mechanical and electronic means. It converts text information, whether generated locally by a computer or input externally, into fluent spoken output that humans can understand. In the process of speech synthesis, audio corresponding to a plurality of speech units needs to be concatenated, where a speech unit may be a pinyin syllable or a phoneme.
However, existing speech synthesis methods usually do not distinguish the pronunciation durations of speech units.
Disclosure of Invention
The embodiment of the application aims to provide a method and a device for determining pronunciation duration of phonemes.
In a first aspect, an embodiment of the present application provides a method for determining a pronunciation duration of a phoneme, where the method includes: acquiring a text corresponding to a voice to be synthesized; extracting text features of the text; importing the text features into a pre-established first time length determining model to generate a pronunciation time length type sequence, wherein pronunciation time length types in the pronunciation time length type sequence correspond to phonemes in a phoneme sequence corresponding to the text, and the first time length determining model is used for representing the corresponding relation between the text features and the pronunciation time length type sequence; and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset corresponding relation information, wherein the corresponding relation information is used for indicating the corresponding relation between the pronunciation duration type and the pronunciation duration.
In a second aspect, an embodiment of the present application provides an apparatus for determining a pronunciation duration of a phoneme, where the apparatus includes: the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text corresponding to a voice to be synthesized; the extraction unit is used for extracting text features of the text; a first generating unit, configured to introduce the text feature into a pre-established first time length determining model, and generate a pronunciation time length type sequence, where a pronunciation time length type in the pronunciation time length type sequence corresponds to a phoneme in a phoneme sequence corresponding to the text, and the first time length determining model is used to represent a correspondence between the text feature and the pronunciation time length type sequence; and the determining unit is used for determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset corresponding relation information, wherein the corresponding relation information is used for indicating the corresponding relation between the pronunciation duration type and the pronunciation duration.
In a third aspect, an embodiment of the present application provides an electronic device, where the electronic device includes: one or more processors; storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, which when executed by a processor, implements the method according to the first aspect.
According to the method and the device for determining the pronunciation duration of phonemes provided by the embodiments of the application, the text features of the text corresponding to the speech to be synthesized are first extracted; the first duration determination model is then used to determine a pronunciation duration type sequence; and the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text is obtained according to the correspondence information indicating the correspondence between pronunciation duration types and pronunciation durations, so that a more accurate pronunciation duration can be determined for each phoneme.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for determining phoneme pronunciation durations in accordance with the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for determining phoneme pronunciation durations according to the present application;
FIG. 4 is a flow diagram of one exemplary implementation of a method for determining phoneme pronunciation durations in accordance with the present application;
FIG. 5A is a flow diagram of yet another embodiment of a method for determining phoneme pronunciation durations in accordance with the present application;
FIG. 5B is a flowchart of an exemplary implementation of step 506 in the flowchart shown in FIG. 5A;
FIG. 6 is a block diagram illustrating an embodiment of an apparatus for determining phoneme pronunciation durations according to the present application;
FIG. 7 is a block diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the present method for determining a phoneme pronunciation time or apparatus for determining a phoneme pronunciation time may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as a voice assistant-like application, a shopping-like application, a search-like application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a voice playing function, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for voice assistant like applications on the terminal devices 101, 102, 103. The background server may analyze and perform other processing on the received data such as the voice synthesis request, and feed back a processing result (e.g., synthesized voice data) to the terminal device.
It should be noted that the method for determining the pronunciation duration of the phoneme provided by the embodiment of the present application is generally executed by the server 105, and accordingly, the apparatus for determining the pronunciation duration of the phoneme is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. It should be noted that in some application scenarios, the system architecture 100 may not include a terminal device and a network.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for determining phoneme pronunciation durations in accordance with the present application is shown. The method for determining the pronunciation duration of the phoneme comprises the following steps:
step 201, obtaining a text corresponding to the speech to be synthesized.
In this embodiment, an electronic device (e.g., a server shown in fig. 1) on which the method for determining the pronunciation duration of a phoneme operates may acquire a text corresponding to a speech to be synthesized.
In this embodiment, the electronic device may obtain the text corresponding to the speech to be synthesized through various manners, such as local generation or reception from other electronic devices.
As an example, in an automatic question answering scenario, the user's question is "what does 1 plus 1 equal", and after determining that the answer is "2", the electronic device may generate the text "two"; that is, the speech to be synthesized is the speech corresponding to the text "two".
As an example, a user inputs a text "i eat" using a terminal, the terminal transmits the text to a server, and the server may synthesize a voice corresponding to the text "i eat".
Step 202, extracting text features of the text.
In this embodiment, the electronic device may extract text features of the text.
In this embodiment, what kind of text features of the extracted text can be flexibly adjusted in practical application. By way of example, the extracted textual features may include, but are not limited to: the corresponding phonemes, tone information, word vectors, part of speech information, punctuation information, etc. of the text.
For example, for the text "i eat", the phoneme sequence corresponding to the text may be "wochifan", and the tone information is "third tone, first tone, fourth tone", and so on. It should be noted that the representation form of the text features may be set according to the actual situation; for example, a feature may be represented in one-hot form, which is not described in detail here.
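As a minimal sketch of how such text features might be encoded, the following represents per-syllable tone information as one-hot vectors. The tone label set and vector layout are illustrative assumptions, not specified by the patent.

```python
# Hypothetical sketch: encoding per-syllable tone features as one-hot vectors.
# The tone label set and vector layout are illustrative assumptions.

TONES = ["tone1", "tone2", "tone3", "tone4", "neutral"]

def one_hot_tone(tone):
    """Return a one-hot vector marking the position of the given tone label."""
    vec = [0] * len(TONES)
    vec[TONES.index(tone)] = 1
    return vec

# "i eat" -> syllables "wo chi fan" carry the third, first, and fourth tone
features = [one_hot_tone(t) for t in ("tone3", "tone1", "tone4")]
print(features[0])  # [0, 0, 1, 0, 0]
```

In practice such vectors would be concatenated with other features (phoneme identity, part of speech, punctuation) before being fed to the duration model.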
In this embodiment, speech synthesis is performed using the phoneme as the basic unit. A phoneme is the smallest unit in speech, analyzed according to the pronunciation actions within a syllable, with one action constituting one phoneme. Determining the pronunciation duration of each phoneme is the basis of speech synthesis: the more accurate the determined phoneme durations, the more natural the synthesized speech.
Step 203, importing the text features into a pre-established first time length determination model to generate a pronunciation time length type sequence.
In this embodiment, the electronic device may import the text feature into a first time length determination model established in advance, and generate a pronunciation time length type sequence. Here, the utterance duration type in the utterance duration type sequence corresponds to a phoneme in a phoneme sequence corresponding to the text. The first time length determination model is used for representing the corresponding relation between the text characteristics and the pronunciation time length type sequence.
As an example, the phoneme sequence corresponding to the text "i eat" is "wochifan", and the generated pronunciation duration type sequence may be "1, 2, 3, 4, 5, 6, 7, 8", where "1" corresponds to the phoneme "w" and indicates that the pronunciation duration type of the phoneme "w" is the first type, and "2" corresponds to the phoneme "o" and indicates that the pronunciation duration type of the phoneme "o" is the second type. The phonemes corresponding to "3, 4, 5, 6, 7, 8" in the pronunciation duration type sequence are analogized in turn.
In some optional implementations of this embodiment, the first duration determination model may be a lookup table describing the correspondence between text features and pronunciation duration types. As an example, the text features may include tone features, and a third-tone syllable may correspond to the third pronunciation duration type. For the third-tone syllable "wo", the pronunciation duration types of the phonemes "w" and "o" are then both the third type, and if the pronunciation duration corresponding to the third type is 10 milliseconds, the pronunciation durations of the phonemes "w" and "o" are both 10 milliseconds.
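A hedged sketch of such a lookup-table model follows: one table maps a phoneme's tone feature to a duration type, a second maps the type to a duration. All keys and millisecond values are illustrative assumptions.

```python
# Hypothetical lookup-table version of the "first duration determination model".
# Tone keys, type ids, and millisecond values are illustrative assumptions.

TONE_TO_TYPE = {"tone1": 1, "tone2": 2, "tone3": 3, "tone4": 4}
TYPE_TO_MS = {1: 8, 2: 9, 3: 10, 4: 12}

def phoneme_duration_ms(tone):
    """Look up the duration type for a phoneme's tone, then the type's duration."""
    return TYPE_TO_MS[TONE_TO_TYPE[tone]]

# "wo" carries the third tone, so phonemes "w" and "o" both get type 3 -> 10 ms
print([phoneme_duration_ms("tone3") for _ in ("w", "o")])  # [10, 10]
```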
Step 204, determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset correspondence information.
In this embodiment, the electronic device may determine the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and the preset correspondence information. Here, the above correspondence information is used to indicate the correspondence between the utterance length type and the utterance length.
As an example, the phoneme sequence corresponding to the text "i eat" is "wochifan", and the generated pronunciation duration type sequence may be "1, 2, 3, 4, 5, 6, 7, 8", where "1" indicates that the pronunciation duration type of the phoneme "w" is the first type, "2" indicates that the pronunciation duration type of the phoneme "o" is the second type, and so on for "3, 4, 5, 6, 7, 8". If the correspondence information indicates, for instance, that the first type corresponds to 5 milliseconds and the second type to 6 milliseconds, then in the phoneme sequence "wochifan" the pronunciation duration of "w" is 5 milliseconds, the pronunciation duration of "o" is 6 milliseconds, and so on.
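Step 204 above can be sketched as a simple mapping from the type sequence to per-phoneme durations; the correspondence table below is an illustrative assumption, chosen so that "w" maps to 5 ms and "o" to 6 ms as in the example.

```python
# Sketch of step 204: map a generated duration-type sequence to per-phoneme
# durations via preset correspondence information. Values are illustrative.

CORRESPONDENCE_MS = {1: 5, 2: 6, 3: 7, 4: 8, 5: 9, 6: 10, 7: 11, 8: 12}

def phoneme_durations(phonemes, type_sequence):
    """Pair each phoneme with the duration of its pronunciation duration type."""
    return {p: CORRESPONDENCE_MS[t] for p, t in zip(phonemes, type_sequence)}

durations = phoneme_durations(list("wochifan"), [1, 2, 3, 4, 5, 6, 7, 8])
print(durations["w"], durations["o"])  # 5 6
```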
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of the method for determining the pronunciation duration of phonemes according to the present embodiment. In the application scenario of fig. 3, a user first sends a speech synthesis request 303 to a server 302 through a terminal 301, where the speech synthesis request includes the text "i eat" corresponding to the speech to be synthesized. The server may then obtain the text and extract its text features, as shown at 304; generate a pronunciation duration type sequence such as "1, 2, 3, 4, 5, 6, 7, 8" using the first duration determination model, as shown at 305; and determine the pronunciation durations of the phonemes in the phoneme sequence ("wochifan") corresponding to the text according to the correspondence information, as shown at 306. The server may then synthesize speech based on the determined pronunciation durations, as shown at 307, and send the synthesized speech 308 to the terminal.
In the method provided by the embodiment of the application, the text features of the text corresponding to the speech to be synthesized are extracted, then the pronunciation duration type sequence is determined by using the first duration determination model, and the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text is obtained according to the corresponding relationship information indicating the corresponding relationship between the pronunciation duration type and the pronunciation duration, so that more accurate pronunciation duration of the phoneme can be determined.
In some optional implementations of this embodiment, the first time length determination model may be obtained by: a first set of training samples is obtained, where the first training samples are audio information associated with a pronunciation duration for which a phoneme is set. The first neural network and the conditional random field model are trained using the first training sample set to obtain a first time length determination model, where the output of the first neural network is the input of the conditional random field model.
As an example, a first training sample may be audio information corresponding to "i eat", associated with the pronunciation duration of each phoneme in the phoneme sequence "wochifan".
In this implementation, training the first neural network and the conditional random field model with the first training sample set to obtain the first duration determination model may be implemented as follows: for each first training sample, extract the text features of the corresponding text, import the extracted text features into the first neural network, and take the output of the first neural network as the input of the conditional random field model. The output of the conditional random field model is a pronunciation duration type sequence. The first neural network and the conditional random field model are then adjusted according to the pronunciation durations corresponding to this pronunciation duration type sequence and the pronunciation durations set in association with the first training sample, yielding a first duration determination model comprising the adjusted first neural network and the adjusted conditional random field model.
In this implementation, the output of the first neural network includes a probability sequence corresponding to each phoneme in the phoneme sequence. Here, the probability in the probability sequence corresponding to the phoneme is the probability that the pronunciation duration of the phoneme belongs to a preset pronunciation duration type.
As an example, the preset utterance duration type includes an a type and a B type. The corresponding phoneme sequence of the text "i starve" is "woele".
For the phoneme "w" in this phoneme sequence, the first neural network outputs a probability sequence (P1, P2) corresponding to the phoneme "w", where P1 is the probability that the pronunciation duration of "w" belongs to type A, and P2 is the probability that it belongs to type B.
For the phoneme "o", the first neural network outputs a probability sequence (P3, P4), where P3 is the probability that the pronunciation duration of "o" belongs to type A, and P4 is the probability that it belongs to type B.
For the first "e" phoneme, the network outputs the probability sequence (P5, P6); for the phoneme "l", (P7, P8); and for the second "e" phoneme, (P9, P10), all defined analogously.
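The per-phoneme probability sequences above can be sketched as a softmax over per-type scores. The raw scores and the two-type (A, B) setup below are illustrative assumptions standing in for a trained network's output layer.

```python
import math

# Illustrative shape of the first neural network's output for "woele":
# one probability sequence per phoneme, one probability per duration type.
# Raw scores are made-up stand-ins for a trained network's logits.

def softmax(scores):
    """Normalize raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

phonemes = ["w", "o", "e", "l", "e"]
raw_scores = [[1.2, 0.3], [0.1, 0.9], [0.5, 0.5], [2.0, -1.0], [0.0, 0.0]]
prob_sequences = [softmax(s) for s in raw_scores]  # (P1,P2), (P3,P4), ...

# each phoneme's probabilities over (type A, type B) sum to 1
print(round(sum(prob_sequences[0]), 6))  # 1.0
```

In the full model these probability sequences would then be passed to the conditional random field, which decodes them into a single pronunciation duration type sequence.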
In some optional implementations of the embodiment, referring to fig. 4, the utterance duration type is obtained through a flow 400 shown in fig. 4:
step 401, determining the shortest pronunciation duration and the longest pronunciation duration of the phoneme.
The execution subject of the flow 400 may be the same as or different from the execution subject of the flow 200.
In this implementation, the phoneme may be any phoneme. The shortest pronunciation time and the longest pronunciation time can be obtained by statistics from a large number of voice samples or can be set manually.
As an example, the above-described shortest utterance length may be 7 milliseconds, and the longest utterance length may be 250 milliseconds.
Step 402, taking the logarithm of the shortest pronunciation duration to obtain a first logarithm value, and taking the logarithm of the longest pronunciation duration to obtain a second logarithm value.
As an example, a base 10 logarithm may be taken for 7, resulting in a first logarithm lg 7. The base-10 logarithm may be taken for 250 to obtain a second logarithm value lg 250.
In step 403, a difference between the second logarithmic value and the first logarithmic value is determined, and a ratio of the difference to a preset number is determined.
In this implementation, the preset number is a number of a preset utterance duration type.
In step 404, a pronunciation duration type sequence including a preset number of pronunciation duration types is generated.
In this implementation, for each pronunciation duration type in the sequence of pronunciation duration types, the pronunciation duration corresponding to the pronunciation duration type is determined according to the bit information of the pronunciation duration type in the sequence of pronunciation duration types and the ratio.
As an example, if the ratio is t, the first pronunciation duration type in the pronunciation duration type sequence corresponds to a log-domain duration of (lg 7 + 1 × t), and the second pronunciation duration type corresponds to (lg 7 + 2 × t); taking the antilogarithm recovers the duration in milliseconds.
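The flow of steps 401 to 404 can be sketched as follows. The 7 ms and 250 ms bounds and the choice of 8 types come from the examples above; the `classify` helper is an illustrative addition showing how a measured duration would fall into a type.

```python
import math

# Sketch of FIG. 4's flow: partition the log-duration range [lg 7, lg 250]
# into a preset number of pronunciation duration types.

def duration_type_boundaries(min_ms=7.0, max_ms=250.0, num_types=8):
    lo = math.log10(min_ms)        # first logarithm value (step 402)
    hi = math.log10(max_ms)        # second logarithm value (step 402)
    t = (hi - lo) / num_types      # ratio of the difference to the type count (step 403)
    # type i (1-indexed) corresponds to the log-domain duration lo + i * t (step 404)
    return [lo + i * t for i in range(1, num_types + 1)]

def classify(duration_ms, boundaries):
    """Illustrative helper: return the 1-indexed type covering duration_ms."""
    lg = math.log10(duration_ms)
    for i, b in enumerate(boundaries, start=1):
        if lg <= b:
            return i
    return len(boundaries)

bounds = duration_type_boundaries()
print(classify(7.0, bounds), classify(250.0, bounds))  # 1 8
```

Working in the log domain gives short phonemes finer-grained types than long ones, which matches the skew of real phoneme durations.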
In some optional implementations of this embodiment, the first training sample set is obtained as follows: for the pronunciation duration corresponding to each pronunciation duration type in the pronunciation duration type sequence, determine whether the initial sample set includes a sample associated with that pronunciation duration. Here, an initial sample is audio information with which the pronunciation durations of its phonemes are associated. In response to determining that the initial sample set includes no sample associated with that pronunciation duration, the sample whose associated pronunciation duration has the smallest difference from that duration is adjusted to obtain a sample associated with that duration. The adjusted samples are added to the initial sample set to obtain the first training sample set.
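A minimal sketch of this sample-completion step follows, with dict records standing in for audio information and a relabeled copy standing in for the actual duration adjustment (e.g., time-stretching), which the patent does not detail.

```python
# Sketch: ensure the training set has a sample for every target duration by
# adjusting the nearest existing sample. Records are simplified stand-ins.

def complete_sample_set(initial_samples, target_durations_ms):
    """initial_samples: list of {"audio": ..., "duration_ms": float} records."""
    samples = list(initial_samples)
    have = {s["duration_ms"] for s in samples}
    for target in target_durations_ms:
        if target in have:
            continue
        # pick the sample whose duration differs least from the target...
        nearest = min(samples, key=lambda s: abs(s["duration_ms"] - target))
        # ...and "adjust" it (a real system would time-stretch the audio)
        samples.append({"audio": nearest["audio"], "duration_ms": target})
        have.add(target)
    return samples

full = complete_sample_set([{"audio": "a", "duration_ms": 10.0}], [10.0, 20.0])
print(len(full))  # 2
```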
With further reference to FIG. 5A, a flow 500 of yet another embodiment of a method for determining phoneme pronunciation durations is illustrated. The process 500 of the method for determining phoneme pronunciation time includes the following steps:
step 501, obtaining a text corresponding to a speech to be synthesized.
In this embodiment, an electronic device (e.g., a server shown in fig. 1) on which the method for determining the pronunciation duration of a phoneme operates may acquire a text corresponding to a speech to be synthesized.
In this embodiment, the electronic device may obtain the text corresponding to the speech to be synthesized in a manner of local generation or reception from other electronic devices.
Step 502, extracting text features of the text.
In this embodiment, the electronic device may extract text features of the text.
Step 503, importing the text features into a pre-established first time length determination model, and generating a pronunciation time length type sequence.
In this embodiment, the electronic device may import the text feature into a first time length determination model established in advance, and generate a pronunciation time length type sequence. Here, the utterance duration type in the utterance duration type sequence corresponds to a phoneme in a phoneme sequence corresponding to the text. The first time length determination model is used for representing the corresponding relation between the text characteristics and the pronunciation time length type sequence.
For details and technical effects of step 501, step 502, and step 503, reference may be made to the descriptions in step 201, step 202, and step 203, which are not described herein again.
Step 504, importing the text features into a pre-established second duration determination model, and generating a second pronunciation duration sequence corresponding to the text.
In this embodiment, the electronic device may import the text feature into a second duration determination model established in advance, and generate a second pronunciation duration sequence corresponding to the text. Here, the second pronunciation duration sequence includes pronunciation durations of phonemes in a phoneme sequence corresponding to the text. The second duration determination model is used for representing the corresponding relation between the text feature and the second pronunciation duration sequence.
In some optional implementations of this embodiment, the second duration determination model may be obtained by: acquiring a second training sample set, where a second training sample is audio information in which the pronunciation durations of phonemes are set in association; and training a second neural network with the second training sample set to obtain the second duration determination model.
In this implementation, training the second neural network with the second training sample set to obtain the second duration determination model may be implemented as follows: for each second training sample, the text features of the text corresponding to that training sample are extracted and imported into the second neural network, which outputs a pronunciation duration sequence. The second neural network is then adjusted according to the pronunciation durations in the output pronunciation duration sequence and the pronunciation durations set in association with the second training sample, to obtain a second duration determination model comprising the adjusted second neural network.
In this implementation, the second training sample set may be the same as or different from the first training sample set.
In this implementation, the second neural network may be any kind of neural network with any structure, and may be, for example, a two-layer time-recursive neural network cascade.
In some optional implementations of this embodiment, the second duration determination model may also be a relational lookup table. The relational lookup table records the correspondence between text features and pronunciation duration types. As an example, the text features corresponding to the text may include tone features, and the third tone may correspond to a third pronunciation duration type. That is, for "wo" pronounced in the third tone, the pronunciation duration types of the phonemes "w" and "o" are both the third pronunciation duration type, and if the pronunciation duration corresponding to the third pronunciation duration type is 10 milliseconds, the pronunciation duration of each of the phonemes "w" and "o" is 10 milliseconds.
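The lookup-table variant can be illustrated with a minimal sketch. The table contents and the integer tone encoding are hypothetical; only the third-tone "wo" mapping to 10 milliseconds comes from the example above.

```python
# Hypothetical tables: tone feature -> duration type, duration type -> ms.
TONE_TO_DURATION_TYPE = {1: "type_1", 2: "type_2", 3: "type_3", 4: "type_4"}
DURATION_TYPE_TO_MS = {"type_3": 10}

def lookup_durations(phonemes, tone):
    """Assign every phoneme of a syllable the duration implied by its tone."""
    duration_type = TONE_TO_DURATION_TYPE[tone]
    ms = DURATION_TYPE_TO_MS[duration_type]
    return {p: ms for p in phonemes}
```

For the example in the text, `lookup_durations(["w", "o"], 3)` assigns both phonemes 10 milliseconds.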
Step 505, generating a first pronunciation duration sequence according to the generated pronunciation duration type sequence and the correspondence information.
In this embodiment, the electronic device may generate the first pronunciation time sequence according to the generated pronunciation time type sequence and the correspondence information.
Step 506, determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the first pronunciation duration sequence and the second pronunciation duration sequence.
In this embodiment, the electronic device may determine the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the first pronunciation duration sequence and the second pronunciation duration sequence.
In some optional implementations of this embodiment, step 506 may be implemented as follows: weights may be preset for the first pronunciation duration sequence and the second pronunciation duration sequence respectively, that is, a first weight and a second weight. The first pronunciation duration sequence and the second pronunciation duration sequence are then weighted and summed using the first weight and the second weight, to obtain the pronunciation duration of each phoneme.
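The weighted combination can be sketched as follows; the weight values in the usage note are illustrative, not taken from the patent.

```python
def combine_weighted(first_seq, second_seq, first_weight, second_weight):
    """Per-phoneme weighted sum of the two pronunciation duration sequences."""
    return [first_weight * a + second_weight * b
            for a, b in zip(first_seq, second_seq)]
```

For example, with equal weights, `combine_weighted([5, 6], [7, 8], 0.5, 0.5)` yields `[6.0, 7.0]`.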
In some optional implementations of this embodiment, please refer to fig. 5B, which illustrates a flow 506:
Step 5061, importing the text features into a pre-established adjustment parameter determination model to obtain an adjustment parameter sequence.
In this implementation, the number of adjustment parameters in the adjustment parameter sequence matches the number of phonemes in the phoneme sequence corresponding to the text.
In this implementation manner, the adjustment parameter determination model is used for characterizing a corresponding relationship between the text feature and the adjustment parameter sequence.
Step 5062, determining a pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the first pronunciation duration sequence, the second pronunciation duration sequence and the adjustment parameter sequence.
In this implementation, as an example, for the phoneme sequence "wo" corresponding to the text "I", suppose the corresponding first pronunciation duration sequence is (5, 6), the second pronunciation duration sequence is (7, 8), all in milliseconds, and the adjustment parameter sequence is (0.3, 0.4). Then the pronunciation duration of the phoneme "w" may be 7 × 0.3 + (1 − 0.3) × 5 = 5.6, and the pronunciation duration of the phoneme "o" may be 8 × 0.4 + (1 − 0.4) × 6 = 6.8.
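The worked example follows the per-phoneme blending form `second × p + (1 − p) × first`, which can be sketched as:

```python
def blend_with_adjustment(first_seq, second_seq, adjust_params):
    """Blend the two duration sequences phoneme by phoneme: each adjustment
    parameter p weights the second sequence, and (1 - p) the first."""
    return [b * p + (1 - p) * a
            for a, b, p in zip(first_seq, second_seq, adjust_params)]
```

Applied to the example, `blend_with_adjustment([5, 6], [7, 8], [0.3, 0.4])` gives approximately 5.6 ms for "w" and 6.8 ms for "o".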
In some optional implementations of this embodiment, the adjustment parameter determination model may be obtained by: acquiring a third training sample set, where a third training sample is audio information in which the pronunciation durations of phonemes are set in association; and training a third neural network based on the first duration determination model, the second duration determination model, and the third training sample set to obtain the adjustment parameter determination model.
In this implementation, training the third neural network based on the first duration determination model, the second duration determination model, and the third training sample set to obtain the adjustment parameter determination model may include: determining a training text corresponding to the third training sample, and extracting training text features of the training text; importing the training text features into the first duration determination model to obtain a first pronunciation duration sequence for training; importing the training text features into the second duration determination model to obtain a second pronunciation duration sequence for training; importing the training text features into the third neural network being trained to obtain an adjustment parameter sequence for training; determining a pronunciation duration sequence for training according to the first pronunciation duration sequence for training, the second pronunciation duration sequence for training, and the adjustment parameter sequence for training; and adjusting the third neural network being trained according to the training pronunciation duration sequence and the pronunciation durations of the phonemes set in association with the third training sample.
In this implementation, the third neural network may be a neural network of an arbitrary structure, and may be a two-layer time-recursive neural network, for example.
As can be seen from fig. 5A, compared with the embodiment shown in fig. 2, the flow 500 of the method for determining phoneme pronunciation durations in this embodiment highlights the step of determining the pronunciation duration of each phoneme according to both the second duration determination model and the first duration determination model. Therefore, the scheme described in this embodiment can improve the accuracy of the determined pronunciation durations.
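The overall flow of steps 501 through 506 can be summarized as a pipeline in which the models are stubbed out as plain callables; the stubs in the usage note are illustrative placeholders, not real trained models.

```python
def determine_phoneme_durations(text, extract_features, first_model,
                                second_model, type_to_duration, combine):
    """Flow 500 as a pipeline: features -> two duration predictions -> merge."""
    feats = extract_features(text)                       # step 502
    type_seq = first_model(feats)                        # step 503
    second_seq = second_model(feats)                     # step 504
    first_seq = [type_to_duration[t] for t in type_seq]  # step 505
    return combine(first_seq, second_seq)                # step 506
```

For instance, calling it with stub models that reproduce the "wo" example (first sequence (5, 6), second sequence (7, 8)) and an averaging `combine` returns the averaged per-phoneme durations.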
With further reference to fig. 6, as an implementation of the method shown in the above-mentioned figures, the present application provides an embodiment of an apparatus for determining a pronunciation duration of a phoneme, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 6, the apparatus 600 for determining phoneme pronunciation durations according to this embodiment includes: an acquisition unit 601, an extraction unit 602, a first generation unit 603, and a determination unit 604. The acquisition unit is configured to acquire a text corresponding to a speech to be synthesized; the extraction unit is configured to extract text features of the text; the first generation unit is configured to import the text features into a pre-established first duration determination model to generate a pronunciation duration type sequence, where a pronunciation duration type in the pronunciation duration type sequence corresponds to a phoneme in the phoneme sequence corresponding to the text, and the first duration determination model is used to characterize the correspondence between the text features and the pronunciation duration type sequence; and the determination unit is configured to determine the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset correspondence information, where the correspondence information is used to indicate the correspondence between pronunciation duration types and pronunciation durations.
In this embodiment, specific processes of the obtaining unit 601, the extracting unit 602, the first generating unit 603, and the determining unit 604 and technical effects thereof may refer to related descriptions of step 201, step 202, step 203, and step 204 in the corresponding embodiment of fig. 2, which are not described herein again.
In some optional implementations of this embodiment, the first duration determination model is obtained by: acquiring a first training sample set, where a first training sample is audio information in which the pronunciation durations of phonemes are set in association; and training a first neural network and a conditional random field model with the first training sample set to obtain the first duration determination model, where the output of the first neural network is the input of the conditional random field model.
In some optional implementations of the embodiment, the output of the first neural network includes a probability sequence corresponding to each phoneme in the phoneme sequence, where a probability in the probability sequence corresponding to the phoneme is a probability that the pronunciation duration of the phoneme belongs to a preset pronunciation duration type.
In some optional implementations of this embodiment, the pronunciation duration types are obtained by: determining the shortest pronunciation duration and the longest pronunciation duration of the phoneme; taking the logarithm of the shortest pronunciation duration to obtain a first logarithmic value, and taking the logarithm of the longest pronunciation duration to obtain a second logarithmic value; determining a difference between the second logarithmic value and the first logarithmic value, and determining a ratio of the difference to a preset number; and generating a pronunciation duration type sequence including the preset number of pronunciation duration types, where for each pronunciation duration type in the pronunciation duration type sequence, the pronunciation duration corresponding to that pronunciation duration type is determined according to the position of that pronunciation duration type in the pronunciation duration type sequence and the ratio.
In some optional implementations of this embodiment, the first training sample set is obtained by: determining, for the pronunciation duration corresponding to each pronunciation duration type in the pronunciation duration type sequence, whether the initial sample set includes a sample associated with that pronunciation duration, where an initial sample is audio information in which the pronunciation durations of phonemes are set in association; in response to determining that the initial sample set does not include a sample associated with the pronunciation duration, adjusting the sample whose associated pronunciation duration differs least from that pronunciation duration to obtain a sample of that pronunciation duration; and adding the adjusted sample to the initial sample set to obtain the first training sample set.
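The quantization described above can be sketched as follows. The text does not spell out the exact rule mapping a type's position to a duration, so the exponential mapping below is an assumption, chosen so that the types are evenly spaced in log-duration space.

```python
import math

def make_duration_types(shortest_ms, longest_ms, preset_number):
    """Split [shortest, longest] into `preset_number` duration types on a
    logarithmic scale: type i's duration is exp(first_log + i * ratio),
    where ratio = (second_log - first_log) / preset_number (assumed rule)."""
    first_log = math.log(shortest_ms)    # first logarithmic value
    second_log = math.log(longest_ms)    # second logarithmic value
    ratio = (second_log - first_log) / preset_number
    return [math.exp(first_log + i * ratio) for i in range(preset_number)]
```

Under this assumption, short durations get finer-grained types than long ones, which matches the use of logarithms in the description.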
In some optional implementations of this embodiment, the apparatus further includes: a second generating unit (not shown) configured to import the text feature into a second pre-established duration determining model, and generate a second pronunciation duration sequence corresponding to the text, where the second pronunciation duration sequence includes pronunciation durations of phonemes in a phoneme sequence corresponding to the text, and the second duration determining model is configured to characterize a correspondence between the text feature and the second pronunciation duration sequence.
In some optional implementations of this embodiment, the second duration determination model is obtained by: acquiring a second training sample set, where a second training sample is audio information in which the pronunciation durations of phonemes are set in association; and training a second neural network with the second training sample set to obtain the second duration determination model.
In some optional implementation manners of this embodiment, the determining unit is further configured to: generating a first pronunciation time sequence according to the generated pronunciation time type sequence and the corresponding relation information; and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text based on the first pronunciation duration sequence and the second pronunciation duration sequence.
In some optional implementation manners of this embodiment, the determining unit is further configured to: importing the text features into a pre-established adjustment parameter determining model to obtain an adjustment parameter sequence, wherein the number of adjustment parameters in the adjustment parameter sequence is matched with the number of phonemes in a phoneme sequence corresponding to the text, and the adjustment parameter determining model is used for representing the corresponding relation between the text features and the adjustment parameter sequence; and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the first pronunciation duration sequence, the second pronunciation duration sequence and the adjustment parameter sequence.
In some optional implementations of this embodiment, the adjustment parameter determination model is obtained by: acquiring a third training sample set, wherein the third training sample is audio information associated with the pronunciation duration of the phoneme; and training a third neural network based on the first duration determination model, the second duration determination model and the third training sample set to obtain an adjustment parameter determination model.
It should be noted that, for details of implementation and technical effects of each unit in the apparatus for determining a pronunciation duration of a phoneme provided in this embodiment, reference may be made to descriptions of other embodiments in this application, and details are not described herein again.
Referring now to FIG. 7, shown is a block diagram of a computer system 700 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 7, the computer system 700 includes a Central Processing Unit (CPU)701, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the system 700 are also stored. The CPU 701, the ROM 702, and the RAM 703 are connected to each other via a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
The following components are connected to the I/O interface 705: an input portion 706 including a keyboard, a mouse, and the like; an output section 707 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card, a modem, or the like. The communication section 709 performs communication processing via a network such as the internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 710 as necessary, so that a computer program read out therefrom is mounted into the storage section 708 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 709, and/or installed from the removable medium 711. The computer program, when executed by a Central Processing Unit (CPU)701, performs the above-described functions defined in the method of the present application.
It should be noted that the computer readable medium mentioned above in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit, an extraction unit, a first generation unit, and a determination unit. The names of these units do not in some cases constitute a limitation on the unit itself, and for example, the acquiring unit may also be described as a "unit that acquires text corresponding to speech to be synthesized".
As another aspect, the present application also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments; or may be present separately and not assembled into the device. The computer readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to: acquiring a text corresponding to a voice to be synthesized; extracting text features of the text; importing the text features into a pre-established first time length determining model to generate a pronunciation time length type sequence, wherein pronunciation time length types in the pronunciation time length type sequence correspond to phonemes in a phoneme sequence corresponding to the text, and the first time length determining model is used for representing the corresponding relation between the text features and the pronunciation time length type sequence; and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset corresponding relation information, wherein the corresponding relation information is used for indicating the corresponding relation between the pronunciation duration type and the pronunciation duration.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (13)

1. A method for determining phoneme pronunciation durations, the method comprising:
acquiring a text corresponding to a voice to be synthesized;
extracting text features of the text;
importing the text features into a pre-established first time length determining model to generate a pronunciation time length type sequence, wherein pronunciation time length types in the pronunciation time length type sequence correspond to phonemes in a phoneme sequence corresponding to the text, and the first time length determining model is used for representing the corresponding relation between the text features and the pronunciation time length type sequence;
and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset corresponding relation information, wherein the corresponding relation information is used for indicating the corresponding relation between the pronunciation duration type and the pronunciation duration.
2. The method of claim 1, wherein the first time duration determination model is obtained by:
acquiring a first training sample set, wherein the first training sample is audio information associated with a pronunciation duration provided with a phoneme;
and training a first neural network and a conditional random field model by using the first training sample set to obtain a first time length determination model, wherein the output of the first neural network is the input of the conditional random field model.
3. The method of claim 2, wherein the output of the first neural network comprises a probability sequence corresponding to each phoneme in the phoneme sequence, wherein the probability in the probability sequence corresponding to the phoneme is a probability that the pronunciation duration of the phoneme belongs to a preset pronunciation duration type.
4. The method according to claim 1, wherein the utterance duration type is obtained by:
determining the shortest pronunciation duration and the longest pronunciation duration of the phoneme;
taking a logarithm of the shortest pronunciation time to obtain a first logarithm value, and taking a logarithm of the longest pronunciation time to obtain a second logarithm value;
determining a difference between the second logarithmic value and the first logarithmic value, and determining a ratio of the difference to a preset number;
generating a pronunciation duration type sequence comprising a preset number of pronunciation duration types, wherein for each pronunciation duration type in the pronunciation duration type sequence, the pronunciation duration corresponding to the pronunciation duration type is determined according to the position information of the pronunciation duration type in the pronunciation duration type sequence and the ratio.
5. The method of claim 4, wherein the first training sample set is obtained by:
determining whether an initial sample set comprises a sample which is associated with the pronunciation duration or not for the pronunciation duration corresponding to each pronunciation duration type in the pronunciation duration type sequence, wherein the initial sample is audio information associated with the pronunciation duration provided with phonemes; in response to the fact that the initial sample set does not comprise a sample which is set with the pronunciation time length in an associated mode, adjusting the sample which is set with the pronunciation time length with the minimum difference value with the pronunciation time length in the associated mode to obtain a sample of the pronunciation time length; and adding the adjusted sample into the initial sample set to obtain the first training sample set.
6. The method according to any one of claims 1-5, further comprising:
and importing the text features into a pre-established second duration determining model to generate a second pronunciation duration sequence corresponding to the text, wherein the second pronunciation duration sequence comprises pronunciation durations of all phonemes in the phoneme sequence corresponding to the text, and the second duration determining model is used for representing a corresponding relation between the text features and the second pronunciation duration sequence.
7. The method of claim 6, wherein the second duration determination model is obtained by:
acquiring a second training sample set, wherein the second training sample is audio information with the pronunciation duration of the phoneme related to the second training sample set;
and training a second neural network by using the second training sample set to obtain a second duration determination model.
8. The method according to claim 7, wherein determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and the preset correspondence information comprises:
generating a first pronunciation time sequence according to the generated pronunciation time type sequence and the corresponding relation information;
and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text based on the first pronunciation duration sequence and the second pronunciation duration sequence.
9. The method of claim 8, wherein determining the pronunciation duration for each phoneme in the sequence of phonemes to which the text corresponds based on the first pronunciation duration sequence and the second pronunciation duration sequence comprises:
importing the text features into a pre-established adjustment parameter determining model to obtain an adjustment parameter sequence, wherein the number of adjustment parameters in the adjustment parameter sequence is matched with the number of phonemes in a phoneme sequence corresponding to the text, and the adjustment parameter determining model is used for representing the corresponding relation between the text features and the adjustment parameter sequence;
and determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the first pronunciation duration sequence, the second pronunciation duration sequence and the adjustment parameter sequence.
10. The method of claim 9, wherein the tuning parameter determination model is obtained by:
acquiring a third training sample set, wherein the third training sample is audio information associated with the pronunciation duration of the phoneme;
and training a third neural network based on the first duration determination model, the second duration determination model and the third training sample set to obtain an adjustment parameter determination model.
11. An apparatus for determining phoneme pronunciation durations, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a text corresponding to a voice to be synthesized;
the extraction unit is used for extracting text features of the text;
the first generation unit is used for importing the text features into a pre-established first time length determination model to generate a pronunciation time length type sequence, wherein pronunciation time length types in the pronunciation time length type sequence correspond to phonemes in a phoneme sequence corresponding to the text, and the first time length determination model is used for representing the corresponding relation between the text features and the pronunciation time length type sequence;
and the determining unit is used for determining the pronunciation duration of each phoneme in the phoneme sequence corresponding to the text according to the generated pronunciation duration type sequence and preset corresponding relation information, wherein the corresponding relation information is used for indicating the corresponding relation between the pronunciation duration type and the pronunciation duration.
12. An electronic device, characterized in that the electronic device comprises:
one or more processors;
a storage device for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.
13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-10.
CN201710911991.2A 2017-09-29 2017-09-29 Method and device for determining phoneme pronunciation duration Active CN107705782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710911991.2A CN107705782B (en) 2017-09-29 2017-09-29 Method and device for determining phoneme pronunciation duration

Publications (2)

Publication Number Publication Date
CN107705782A CN107705782A (en) 2018-02-16
CN107705782B true CN107705782B (en) 2021-01-05

Family

ID=61175937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710911991.2A Active CN107705782B (en) 2017-09-29 2017-09-29 Method and device for determining phoneme pronunciation duration

Country Status (1)

Country Link
CN (1) CN107705782B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109377540B (en) * 2018-09-30 2023-12-19 网易(杭州)网络有限公司 Method and device for synthesizing facial animation, storage medium, processor and terminal
CN109979428B (en) * 2019-04-02 2021-07-23 北京地平线机器人技术研发有限公司 Audio generation method and device, storage medium and electronic equipment
CN112242134A (en) * 2019-07-01 2021-01-19 北京邮电大学 Speech synthesis method and device
CN110992926B (en) * 2019-12-26 2022-06-10 标贝(北京)科技有限公司 Speech synthesis method, apparatus, system and storage medium
CN111653266B (en) * 2020-04-26 2023-09-05 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN113793589A (en) * 2020-05-26 2021-12-14 华为技术有限公司 Speech synthesis method and device
CN112542153A (en) * 2020-12-02 2021-03-23 北京沃东天骏信息技术有限公司 Duration prediction model training method and device, and speech synthesis method and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6064960A (en) * 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
CN101894547A (en) * 2010-06-30 2010-11-24 北京捷通华声语音技术有限公司 Speech synthesis method and system
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
CN103854643B (en) * 2012-11-29 2017-03-01 株式会社东芝 Method and apparatus for synthesizing voice
KR102222122B1 (en) * 2014-01-21 2021-03-03 엘지전자 주식회사 Mobile terminal and method for controlling the same
CN104538024B (en) * 2014-12-01 2019-03-08 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and equipment
US9972300B2 (en) * 2015-06-11 2018-05-15 Genesys Telecommunications Laboratories, Inc. System and method for outlier identification to remove poor alignments in speech synthesis
CN105118498B (en) * 2015-09-06 2018-07-31 百度在线网络技术(北京)有限公司 The training method and device of phonetic synthesis model
US10235989B2 (en) * 2016-03-24 2019-03-19 Oracle International Corporation Sonification of words and phrases by text mining based on frequency of occurrence
CN106601228B (en) * 2016-12-09 2020-02-04 百度在线网络技术(北京)有限公司 Sample labeling method and device based on artificial intelligence rhythm prediction


Similar Documents

Publication Publication Date Title
CN107705782B (en) Method and device for determining phoneme pronunciation duration
US10553201B2 (en) Method and apparatus for speech synthesis
CN108630190B (en) Method and apparatus for generating speech synthesis model
CN111599343B (en) Method, apparatus, device and medium for generating audio
CN107481715B (en) Method and apparatus for generating information
JP6633153B2 (en) Method and apparatus for extracting information
CN105489221B (en) A kind of audio recognition method and device
CN110473525B (en) Method and device for acquiring voice training sample
CN110197655B (en) Method and apparatus for synthesizing speech
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN108877782A (en) Audio recognition method and device
CN109920431B (en) Method and apparatus for outputting information
CN112164407B (en) Tone color conversion method and device
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
US11750898B2 (en) Method for generating target video, apparatus, server, and medium
CN107680584B (en) Method and device for segmenting audio
CN110930975A (en) Method and apparatus for outputting information
CN110933225B (en) Call information acquisition method and device, storage medium and electronic equipment
CN110138654B (en) Method and apparatus for processing speech
CN111883139A (en) Method, apparatus, device and medium for screening target voices
CN112383721A (en) Method and apparatus for generating video
CN111862933A (en) Method, apparatus, device and medium for generating synthesized speech
CN111415662A (en) Method, apparatus, device and medium for generating video
CN111968630B (en) Information processing method and device and electronic equipment
CN112562733A (en) Media data processing method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant