CN107464559B - Combined prediction model construction method and system based on Chinese prosody structure and accents - Google Patents
Combined prediction model construction method and system based on Chinese prosody structure and accents
- Publication number: CN107464559B
- Application number: CN201710561567.XA
- Authority: CN (China)
- Prior art keywords: prediction model, accent, text, decoding, Chinese
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1807—Speech classification or search using natural language modelling using prosody or stress

- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention relates to a method and a system for constructing a joint prediction model based on the combination of Chinese prosodic structure and accent. The construction method comprises the following steps: preprocessing a plurality of training corpora to obtain preprocessed text; performing word segmentation on the preprocessed text to obtain segmented text information; determining the word vector feature sequence of the corresponding text from the segmented text information; and encoding and decoding the word vector feature sequence with attention-based RNN encoding-decoding to establish a joint prediction model of Chinese prosodic structure and accent for predicting the prosodic structure and accent of the text to be processed. By preprocessing and segmenting the training corpora to obtain segmented text information and the word vector feature sequences of the corresponding texts, and then building the joint model on attention-based RNN (recurrent neural network) encoding-decoding, the invention fully accounts for the relationship between Chinese prosodic structure and accent and achieves accurate prediction for the text under test.
Description
Technical Field
The invention relates to the technical field of human-computer interaction and speech synthesis, and in particular to a method and a system for constructing a joint prediction model based on Chinese prosodic structure and accent.
Background
Accurately describing prosodic structure and stress, and predicting them from text information, have always been among the most important steps in speech synthesis; they are key components for improving the naturalness and expressiveness of synthesized speech and for building harmonious human-computer interaction technology. Prosodic structure and stress models capture the rise and fall, the pauses, and the variations of stress and tempo in speech, and thereby further improve the expressiveness and naturalness of synthesized speech. Prosodic structure and stress modeling and prediction are therefore of great significance to the development of speech synthesis, human-computer interaction, and related fields.
Although much research has been done in this area, many problems in prosodic structure and stress prediction remain unsolved. In the description of text features, the word vector features come from a word vector model trained in advance, and the word vector values are not further adjusted for the task at hand during prosodic structure and stress model training. Furthermore, in the choice of prediction models for prosodic structure and stress, contextual text features are not considered comprehensively enough. Moreover, existing research on Chinese prosodic structure and stress has shown a relatively close association between the two; yet in existing approaches to Chinese prosodic structure and stress prediction, the two are modeled as relatively independent tasks, and the relationship between them is not taken into account.
Disclosure of Invention
In order to solve the above problems in the prior art, namely to predict the prosodic structure and stress in text information accurately by combining Chinese prosodic structure and accent, the invention provides a method and a system for constructing a joint prediction model based on Chinese prosodic structure and accent.
In order to achieve the purpose, the invention provides the following scheme:
a combined prediction model construction method based on Chinese prosody structure and accent comprises the following steps:
preprocessing a plurality of training corpora to obtain preprocessed texts;
performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
and based on the coding-decoding of the recurrent neural network RNN of the attention mechanism, coding and decoding the word vector characteristic sequence, and establishing a combined prediction model based on the Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
Optionally, the preprocessing of the plurality of training corpora specifically includes:
regularizing the training corpora and correcting polyphone pronunciation errors; and/or regularizing the numbers.
Optionally, the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
according to the word segmentation text information, searching word vectors of corresponding words by a word table searching method, and determining word vector characteristic sequences of corresponding texts;
wherein the initial value of the vocabulary is obtained based on the training of the continuous bag-of-words model CBOW.
Optionally, the establishing of the joint prediction model based on the chinese prosody structure and the accent specifically includes:
a bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step;
and decoding by a decoder based on a unidirectional RNN to obtain a decoding-state function representing the joint prediction model of Chinese prosodic structure and accent, wherein the decoding-state function is used for predicting the prosodic structure and accents of the text to be processed.
Optionally, the establishing a joint prediction model based on the chinese prosody structure and the accent further includes:
extracting prosodic structures and accent labeling results in the training corpora as target values;
calculating the predicted value of each training corpus according to the decoding state function;
and adjusting the state parameters of the prediction model according to the target value and the predicted value.
Optionally, the bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions and determines the hidden state of the encoder at each time step, specifically including:
the forward RNN reads the word vector feature sequence x = (x_1, x_2, ..., x_T) in the forward direction and generates a forward hidden state fh_i at each time step i, where the forward hidden states form the sequence (fh_1, ..., fh_T), i = 1, 2, ..., T, and f denotes the forward hidden-state parameters of the prediction model;
the backward RNN reads the word vector feature sequence in the reverse direction and generates a backward hidden state bh_i, where the backward hidden states form the sequence (bh_T, ..., bh_1), and b denotes the backward hidden-state parameters of the prediction model;
the hidden state h_i of the encoder at each time step is determined from the forward hidden state fh_i and the backward hidden state bh_i, where h_i = [fh_i, bh_i].
Optionally, the decoding performed by the decoder based on the unidirectional RNN specifically includes:
obtaining the decoding state s_{i-1} of the unidirectional RNN decoder at time step (i-1) and the corresponding label y_{i-1};
obtaining the hidden state h_i of the bidirectional RNN encoder at the current time step i and the semantic vector c_i;
determining, from the decoding state s_{i-1}, the label y_{i-1}, the hidden state h_i, and the semantic vector c_i, the decoding state s_i of the unidirectional RNN decoder corresponding to the current time step i, where s_i = P(s_{i-1}, y_{i-1}, h_i, c_i) and P() denotes a relational function.
Optionally, the semantic vector c_i is determined according to:

c_i = Σ_{k=1}^{T} α_{i,k} h_k;

α_{i,k} = exp(e_{i,k}) / Σ_{j=1}^{T} exp(e_{i,j});

e_{i,k} = g(s_{i-1}, h_k);

where g() denotes a neural network, i, j, and k denote time-step indices, and i = 1, 2, ..., T.
Optionally, the prediction model is divided into three levels, a first level is prosodic words, a second level is prosodic phrases, and a third level is intonation phrases;
when predicting the prosodic structure and accent of a text to be processed, predicting the accent while predicting prosodic words at a first level;
and at the second and third levels, accent prediction is removed, the prediction result of the previous level is used as the input of the current level, and it is concatenated with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
In order to achieve the above purpose, the invention also provides the following scheme:
a prediction model construction system based on Chinese prosody structure and accent combination, the construction system comprising:
the text preprocessing module is used for preprocessing the training corpora to obtain preprocessed texts;
the word segmentation module is used for carrying out word segmentation on the preprocessed text to obtain word segmentation text information;
the word vector determining module is used for determining a corresponding word vector characteristic sequence according to the word segmentation text information;
and the modeling module is used for coding and decoding the word vector characteristic sequence based on the coding-decoding of the RNN of the attention mechanism, and establishing a combined prediction model based on the Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
According to the embodiment of the invention, the invention discloses the following technical effects:
the invention obtains word segmentation text information by preprocessing and word segmentation processing a plurality of training linguistic data to obtain a word vector characteristic sequence of a corresponding text, further establishes a combined prediction model based on a Chinese prosody structure and accents based on a recurrent neural network, fully considers the relation between the Chinese prosody structure and the accents, and realizes accurate prediction of the text to be detected.
Drawings
FIG. 1 is a flow chart of a prediction model construction method based on the combination of Chinese prosody structure and accent;
FIG. 2 is a schematic diagram of a module structure of a prediction model construction system based on the combination of Chinese prosody structure and accent.
Description of the symbols:
the system comprises a text preprocessing module-1, a word segmentation module-2, a word vector determination module-3 and a modeling module-4.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a method for constructing a prediction model based on the combination of Chinese prosodic structure and accent. A plurality of training corpora are preprocessed and segmented to obtain segmented text information and the word vector feature sequences of the corresponding texts; a joint prediction model of Chinese prosodic structure and accent is then established based on a recurrent neural network. The relationship between Chinese prosodic structure and accent is thereby fully considered, and accurate prediction of the text under test is achieved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the method for constructing a prediction model based on the combination of chinese prosody structure and accent of the present invention includes:
step 100: preprocessing a plurality of training corpora to obtain preprocessed texts;
step 200: performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
step 300: determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
step 400: and based on the coding-decoding of the RNN (Recurrent Neural Network) of the attention mechanism, coding and decoding the word vector characteristic sequence, and establishing a combined prediction model based on the Chinese prosody structure and accent for predicting the prosody structure and accent of the text to be processed.
In step 100, a large number of training corpora whose prosodic structure and accent style are similar to that of the text to be processed are collected into a database of about 15 GB; the more text corpora, the higher the prediction accuracy. The database stores the training corpora together with the corresponding prosodic structure and accent annotation results.
Further, the preprocessing of the plurality of training corpora specifically includes:
regularizing the training corpora and correcting polyphone pronunciation errors; and/or regularizing the numbers.
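As an illustrative sketch (not taken from the patent), such preprocessing might look like the following. The digit map is deliberately simplified and the `polyphone_fixes` table is a hypothetical hand-curated resource; a real system would also handle multi-digit numbers, dates, and context-dependent polyphones:

```python
import re

# Simplified digit-to-character map (assumption for illustration only);
# real Chinese number regularization must handle place values, years, etc.
DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def regularize_numbers(text: str) -> str:
    """Replace each Arabic digit with its Chinese character reading."""
    return re.sub(r"\d", lambda m: DIGIT_MAP[m.group(0)], text)

def preprocess(text: str, polyphone_fixes: dict) -> str:
    """Apply hand-curated polyphone corrections, then regularize numbers."""
    for wrong, right in polyphone_fixes.items():
        text = text.replace(wrong, right)
    return regularize_numbers(text)
```

For example, `regularize_numbers("2017年")` yields a digit-by-digit reading, which is the appropriate convention for years.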
In step 300, the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
according to the word segmentation text information, searching word vectors of corresponding words by a word table searching method, and determining word vector characteristic sequences of corresponding texts;
wherein the initial values of the word table are obtained by training a continuous bag-of-words (CBOW) model. During subsequent training, the word vector model is continuously updated, so that the word table is continuously enriched and the prediction accuracy on the text to be processed is improved.
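The word-table lookup described above can be sketched as follows. Random initial vectors stand in for CBOW-trained ones, and all names and dimensions are illustrative assumptions:

```python
import numpy as np

def build_word_table(vocab, dim=100, seed=0):
    """Word table mapping each word to a vector; in the patent the initial
    values come from CBOW training (random init here as a stand-in)."""
    rng = np.random.default_rng(seed)
    table = {w: rng.standard_normal(dim) for w in vocab}
    table["<UNK>"] = np.zeros(dim)  # fallback for out-of-vocabulary words
    return table

def to_feature_sequence(words, table):
    """Look up each segmented word to form the feature sequence x = (x_1, ..., x_T)."""
    return np.stack([table.get(w, table["<UNK>"]) for w in words])
```

In a trainable setup the table would be an embedding matrix whose rows are updated by back-propagation, matching the patent's point that the word vectors keep being adjusted during model training.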
In step 400, the establishing of the joint prediction model based on the Chinese prosodic structure and accents specifically includes:
Step 401: a bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions, determining the hidden state of the encoder at each time step.
Step 402: a decoder based on a unidirectional RNN performs decoding to obtain a decoding-state function representing the joint prediction model of Chinese prosodic structure and accent, where the decoding-state function is used for predicting the prosodic structure and accents of the text to be processed.
Further, the establishing of the joint prediction model based on the Chinese prosody structure and the accents further includes:
step 403: extracting prosodic structures and accent labeling results in the training corpora as target values;
step 404: calculating the predicted value of each training corpus according to the decoding state function;
step 405: and adjusting the state parameters of the prediction model according to the target value and the predicted value. The state parameters comprise forward hidden state parameters of the prediction model and reverse hidden state parameters of the prediction model.
In step 401, the bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions and determines the hidden state of the encoder at each time step, specifically including:
Step 4011: the forward RNN reads the word vector feature sequence x = (x_1, x_2, ..., x_T) in the forward direction and generates a forward hidden state fh_i at each time step i, where the forward hidden states form the sequence (fh_1, ..., fh_T), i = 1, 2, ..., T, and f denotes the forward hidden-state parameters of the prediction model;
Step 4012: the backward RNN reads the word vector feature sequence in the reverse direction and generates a backward hidden state bh_i, where the backward hidden states form the sequence (bh_T, ..., bh_1), and b denotes the backward hidden-state parameters of the prediction model;
Step 4013: the hidden state h_i of the encoder at each time step is determined from the forward hidden state fh_i and the backward hidden state bh_i, where h_i = [fh_i, bh_i].
Owing to the temporal modeling property of the RNN, the hidden state of the encoder at the final time step also carries the information of the entire source input sequence.
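A minimal numerical sketch of the bidirectional encoding in step 401, using a vanilla tanh RNN cell (the patent does not fix the cell type, so this choice is an assumption; biases are omitted for brevity):

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, h0):
    """Vanilla tanh RNN: h_t = tanh(Wx x_t + Wh h_{t-1})."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

def bidirectional_encode(xs, Wf_x, Wf_h, Wb_x, Wb_h, hidden_dim):
    """h_i = [fh_i, bh_i]: forward pass over x_1..x_T, backward pass over
    x_T..x_1 (re-aligned to the original order), concatenated per time step."""
    h0 = np.zeros(hidden_dim)
    fh = rnn_pass(xs, Wf_x, Wf_h, h0)
    bh = rnn_pass(xs[::-1], Wb_x, Wb_h, h0)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fh, bh)]
```

Each h_i therefore summarizes both the left context (through fh_i) and the right context (through bh_i) of word i, which is why the encoder states are suitable inputs for the attention mechanism below.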
In step 402, the decoding performed by the decoder based on the unidirectional RNN specifically includes:
Step 4021: obtain the decoding state s_{i-1} of the unidirectional RNN decoder at time step (i-1) and the corresponding label y_{i-1}.
Step 4022: obtain the hidden state h_i of the bidirectional RNN encoder at the current time step i and the semantic vector c_i.
Further, the semantic vector c_i may be determined according to equations (1)-(3):

c_i = Σ_{k=1}^{T} α_{i,k} h_k (1);

α_{i,k} = exp(e_{i,k}) / Σ_{j=1}^{T} exp(e_{i,j}) (2);

e_{i,k} = g(s_{i-1}, h_k) (3);

where g() denotes a neural network, i, j, and k denote time-step indices, and i = 1, 2, ..., T.
Step 4023: determine, from the decoding state s_{i-1}, the label y_{i-1}, the hidden state h_i, and the semantic vector c_i, the decoding state s_i corresponding to the current time step i, where s_i = P(s_{i-1}, y_{i-1}, h_i, c_i) and P() denotes a relational function.
Compared with the bidirectional RNN encoder, the unidirectional RNN decoder proceeds in a single direction during decoding. Besides using the hidden state h_i of the bidirectional RNN encoder at each time step, the decoder further introduces an attention mechanism (i.e., the semantic vector c_i). With the attention mechanism, the decoder state s_i at time step i of decoding is jointly determined by the decoder state s_{i-1} at time step (i-1), the corresponding label y_{i-1}, the encoder hidden state h_i aligned with the current time, and the semantic vector c_i.
The semantic vector c_i is a weighted average of the encoder hidden-state sequence [h_1, ..., h_T] (equation (1)), and thus provides more context information for the unidirectional RNN decoder.
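The decoder update of step 4023 can be sketched as follows. Since the patent leaves the relational function P unspecified, a single tanh layer over the four inputs is assumed purely for illustration, with hypothetical parameter matrices A, B, C, D:

```python
import numpy as np

def decoder_step(s_prev, y_prev, h_i, c_i, A, B, C, D):
    """One unidirectional-decoder step s_i = P(s_{i-1}, y_{i-1}, h_i, c_i).
    P is realized here as one tanh layer (an assumption; the patent does not
    specify the form of the relational function)."""
    return np.tanh(A @ s_prev + B @ y_prev + C @ h_i + D @ c_i)
```

In practice P would typically be a GRU- or LSTM-style cell, but any differentiable function of these four inputs fits the description in the text.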
The invention introduces a multi-task learning mechanism for joint prediction modeling of prosodic structure and stress. Specifically, the prediction model is divided into three levels: the first level is prosodic words (PW), the second level is prosodic phrases (PPH), and the third level is intonation phrases (IPH). When predicting the prosodic structure and accent of a text to be processed, accents are predicted at the same time as prosodic words at the first level; at the second and third levels, accent prediction is removed, the prediction result of the previous level is used as the input of the current level, and it is concatenated with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
In parallel with prosodic word prediction at the first level of the prosodic structure, another task (i.e., another decoder) is added to predict stress at the same time. The encoder and the word vector layer are shared between the two tasks. During model training, the loss function of the whole model is the sum of the errors of the two tasks, i.e., of the two decoders; this error is back-propagated to adjust the model parameters. When predicting the other two levels of the prosodic structure (i.e., prosodic phrases and intonation phrases), the accent prediction task is removed, and the prediction result of the previous level is used as the input of the current level: it is concatenated with the sequence produced by the word vector layer and then fed into the encoder.
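The first-level training objective described above, the sum of the two decoders' errors over the shared encoder, might be sketched as follows, with cross-entropy assumed as the per-task error (the patent does not specify the loss form):

```python
import numpy as np

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -np.log(probs[target])

def level1_joint_loss(pw_probs, pw_targets, accent_probs, accent_targets):
    """First-level multi-task loss: sum of the prosodic-word decoder error and
    the accent decoder error; gradients of this sum would be back-propagated
    into the shared encoder and word vector layer."""
    loss_pw = sum(cross_entropy(p, t) for p, t in zip(pw_probs, pw_targets))
    loss_acc = sum(cross_entropy(p, t) for p, t in zip(accent_probs, accent_targets))
    return loss_pw + loss_acc
```

At the second and third levels only the first term would remain, matching the removal of the accent task described in the text.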
Further, the attention-based RNN encoding-decoding joint prediction model of prosodic structure and accent outputs the prosodic structure and accent prediction results for the corresponding text.
In summary, the method initializes word vectors with a pre-trained word vector model and jointly models Chinese prosodic structure and accent using an attention model and multi-task learning, so as to predict the prosodic structure and accent information of the text to be processed.
The invention improves prosodic structure and accent modeling and prediction at both the feature level and the model level. At the feature level, a word vector model dedicated to prosodic structure and accent is established, making the description of text features more accurate. At the model level, the Chinese prosodic structure and accent are modeled jointly using an attention model and multi-task learning, which greatly improves the performance of Chinese prosodic structure and stress prediction. The prediction results are used to guide the back end of speech synthesis, improving the naturalness and expressiveness of the synthesized speech.
In addition, the invention also provides a combined prediction model construction system based on the Chinese prosody structure and accent. As shown in fig. 2, the system for constructing a joint prediction model based on chinese prosody structure and accent of the present invention includes a text preprocessing module 1, a word segmentation module 2, a word vector determination module 3, and a modeling module 4.
The text preprocessing module 1 is configured to preprocess a plurality of training corpora to obtain a preprocessed text; the word segmentation module 2 is used for performing word segmentation processing on the preprocessed text to obtain word segmentation text information; the word vector determining module 3 is configured to determine a corresponding word vector feature sequence according to the word segmentation text information; the modeling module 4 is used for coding and decoding the word vector feature sequence based on the coding-decoding of the RNN of the attention mechanism, and establishing a combined prediction model based on a Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
Compared with the prior art, the prediction model construction system based on the combination of the Chinese prosody structure and the accent has the same beneficial effects as the prediction model construction method based on the combination of the Chinese prosody structure and the accent, and is not repeated herein.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of the relevant technical features may be made without departing from the principle of the present invention, and the resulting technical solutions fall within the scope of the present invention.
Claims (10)
1. A combined prediction model construction method based on Chinese prosody structure and accent is characterized by comprising the following steps:
preprocessing a plurality of training corpora to obtain preprocessed texts;
performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
and coding and decoding the word vector characteristic sequence based on the coding-decoding of the recurrent neural network RNN of the attention mechanism, and establishing a prediction model based on the combination of the Chinese prosody structure and the accent for predicting the prosody structure and the accent of the text to be processed.
2. The method for constructing a combined prediction model based on chinese prosody structure and accent according to claim 1, wherein the preprocessing the plurality of corpus specifically includes:
carrying out regularization processing on the training corpus and correcting polyphone pronunciation errors; and/or regularize the numbers.
3. The method for constructing a combined prediction model based on chinese prosodic structure and accent according to claim 1, wherein the determining a word vector feature sequence of a corresponding text according to the segmented text information specifically includes:
according to the word segmentation text information, searching word vectors of corresponding words by a word table searching method, and determining word vector characteristic sequences of corresponding texts;
wherein the initial value of the vocabulary is obtained based on the training of the continuous bag-of-words model CBOW.
4. The method for constructing a combined prediction model based on chinese prosodic structure and accent according to claim 1, wherein the establishing a combined prediction model based on chinese prosodic structure and accent specifically includes:
a bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step;
and decoding by a decoder based on a unidirectional RNN to obtain a decoding-state function representing the joint prediction model of Chinese prosodic structure and accent, wherein the decoding-state function is used for predicting the prosodic structure and accents of the text to be processed.
5. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to claim 4, wherein the establishing of the combined prediction model based on Chinese prosodic structure and accent further comprises:
extracting prosodic structures and accent labeling results in the training corpora as target values;
calculating the predicted value of each training corpus according to the decoding state function;
and adjusting the state parameters of the prediction model according to the target value and the predicted value.
6. The method as claimed in claim 4, wherein the bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions to determine the hidden state of the encoder at each time step, specifically comprising:
the forward RNN reads the word vector feature sequence x = (x_1, x_2, ..., x_T) in the forward direction and generates a forward hidden state fh_i at each time step i, where the forward hidden states form the sequence (fh_1, ..., fh_T), i = 1, 2, ..., T, and f denotes the forward hidden-state parameters of the prediction model;
the backward RNN reads the word vector feature sequence in the reverse direction and generates a backward hidden state bh_i, where the backward hidden states form the sequence (bh_T, ..., bh_1), and b denotes the backward hidden-state parameters of the prediction model;
and the hidden state h_i of the encoder at each time step is determined from the forward hidden state fh_i and the backward hidden state bh_i, where h_i = [fh_i, bh_i].
7. The method for constructing a joint prediction model based on Chinese prosodic structure and accent according to claim 6, wherein the decoding performed by the decoder based on the unidirectional RNN specifically comprises:
obtaining the decoding state s_{i-1} of the unidirectional RNN decoder at time step (i-1) and the corresponding label y_{i-1};
obtaining the hidden state h_i of the bidirectional RNN encoder at the current time step i and the semantic vector c_i;
determining, from the decoding state s_{i-1}, the label y_{i-1}, the hidden state h_i, and the semantic vector c_i, the decoding state s_i corresponding to the current time step i, where s_i = P(s_{i-1}, y_{i-1}, h_i, c_i), P() denotes a relational function, and s_i denotes the decoding state corresponding to the current time step i.
9. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to any one of claims 1-8, wherein the prediction model is divided into three levels: the first level is prosodic words, the second level is prosodic phrases, and the third level is intonation phrases;
when predicting the prosodic structure and accent of a text to be processed, accent is predicted jointly with the prosodic words at the first level;
and at the second and third levels, accent prediction is removed, the prediction result of the previous level is used as the input of the current level, and it is spliced with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
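The three-level cascade in claim 9 can be sketched with toy per-word classifiers standing in for the per-level networks. The splicing of the previous level's output onto the word vectors mirrors the claim; everything else (the linear classifiers, the dimensions) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 6, 8
word_vecs = rng.standard_normal((T, D))   # word-vector sequence of the text

def predict(features, W):
    # Toy per-word binary classifier standing in for the RNN at each level
    return (features @ W > 0).astype(float)

# Level 1: prosodic-word boundaries and accent predicted jointly
W1 = rng.standard_normal((D, 2))          # column 0: boundary, column 1: accent
level1 = predict(word_vecs, W1)

# Levels 2 and 3: accent output dropped; the previous level's prediction
# is spliced onto the word vectors to form the current level's input
feat2 = np.hstack([word_vecs, level1[:, :1]])            # prosodic-phrase input
level2 = predict(feat2, rng.standard_normal((D + 1, 1)))

feat3 = np.hstack([word_vecs, level2])                   # intonation-phrase input
level3 = predict(feat3, rng.standard_normal((D + 1, 1)))
print(level1.shape, level2.shape, level3.shape)
```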
10. A combined prediction model construction system based on Chinese prosody structure and accent is characterized by comprising:
the text preprocessing module is used for preprocessing the training corpora to obtain preprocessed texts;
the word segmentation module is used for carrying out word segmentation on the preprocessed text to obtain word segmentation text information;
the word vector determining module is used for determining a corresponding word vector feature sequence according to the word segmentation text information;
and the modeling module is used for encoding and decoding the word vector feature sequence with an RNN encoder-decoder based on an attention mechanism, and establishing a combined prediction model based on Chinese prosodic structure and accents for predicting the prosodic structure and accents of the text to be processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710561567.XA CN107464559B (en) | 2017-07-11 | 2017-07-11 | Combined prediction model construction method and system based on Chinese prosody structure and accents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107464559A CN107464559A (en) | 2017-12-12 |
CN107464559B true CN107464559B (en) | 2020-12-15 |
Family
ID=60543891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710561567.XA Active CN107464559B (en) | 2017-07-11 | 2017-07-11 | Combined prediction model construction method and system based on Chinese prosody structure and accents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107464559B (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108417210B (en) * | 2018-01-10 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Word embedding language model training method, word recognition method and system |
CN108231062B (en) * | 2018-01-12 | 2020-12-22 | 科大讯飞股份有限公司 | Voice translation method and device |
CN108417202B (en) * | 2018-01-19 | 2020-09-01 | 苏州思必驰信息科技有限公司 | Voice recognition method and system |
CN110321913B (en) * | 2018-03-30 | 2023-07-25 | 杭州海康威视数字技术股份有限公司 | Text recognition method and device |
US10923107B2 (en) * | 2018-05-11 | 2021-02-16 | Google Llc | Clockwork hierarchical variational encoder |
CN108920622B (en) * | 2018-06-29 | 2021-07-20 | 北京奇艺世纪科技有限公司 | Training method, training device and recognition device for intention recognition |
CN108897894A (en) * | 2018-07-12 | 2018-11-27 | 电子科技大学 | A kind of problem generation method |
CN109271643A (en) * | 2018-08-08 | 2019-01-25 | 北京捷通华声科技股份有限公司 | A kind of training method of translation model, interpretation method and device |
KR20200020545A (en) * | 2018-08-17 | 2020-02-26 | 삼성전자주식회사 | Electronic apparatus and controlling method thereof |
CN109299273B (en) * | 2018-11-02 | 2020-06-23 | 广州语义科技有限公司 | Multi-source multi-label text classification method and system based on improved seq2seq model |
CN109615538A (en) * | 2018-12-13 | 2019-04-12 | 平安医疗健康管理股份有限公司 | Social security violation detection method, device, equipment and computer storage medium |
CN109545186B (en) * | 2018-12-16 | 2022-05-27 | 魔门塔(苏州)科技有限公司 | Speech recognition training system and method |
CN111354333B (en) * | 2018-12-21 | 2023-11-10 | 中国科学院声学研究所 | Self-attention-based Chinese prosody level prediction method and system |
CN109670185B (en) * | 2018-12-27 | 2023-06-23 | 北京百度网讯科技有限公司 | Text generation method and device based on artificial intelligence |
CN110310619A (en) * | 2019-05-16 | 2019-10-08 | 平安科技(深圳)有限公司 | Polyphone prediction technique, device, equipment and computer readable storage medium |
CN110211568A (en) * | 2019-06-03 | 2019-09-06 | 北京大牛儿科技发展有限公司 | A kind of audio recognition method and device |
CN110427608B (en) * | 2019-06-24 | 2021-06-08 | 浙江大学 | Chinese word vector representation learning method introducing layered shape-sound characteristics |
CN110277085B (en) * | 2019-06-25 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Method and device for determining polyphone pronunciation |
CN110457661B (en) * | 2019-08-16 | 2023-06-20 | 腾讯科技(深圳)有限公司 | Natural language generation method, device, equipment and storage medium |
CN111639152B (en) * | 2019-08-29 | 2021-04-13 | 上海卓繁信息技术股份有限公司 | Intention recognition method |
CN110534087B (en) * | 2019-09-04 | 2022-02-15 | 清华大学深圳研究生院 | Text prosody hierarchical structure prediction method, device, equipment and storage medium |
CN110782870B (en) * | 2019-09-06 | 2023-06-16 | 腾讯科技(深圳)有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN111061868B (en) * | 2019-11-05 | 2023-05-23 | 百度在线网络技术(北京)有限公司 | Reading method prediction model acquisition and reading method prediction method, device and storage medium |
CN112783334A (en) * | 2019-11-08 | 2021-05-11 | 阿里巴巴集团控股有限公司 | Text generation method and device, electronic equipment and computer-readable storage medium |
CN110970031B (en) * | 2019-12-16 | 2022-06-24 | 思必驰科技股份有限公司 | Speech recognition system and method |
CN113302683B (en) * | 2019-12-24 | 2023-08-04 | 深圳市优必选科技股份有限公司 | Multi-tone word prediction method, disambiguation method, device, apparatus, and computer-readable storage medium |
CN113129864A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice feature prediction method, device, equipment and readable storage medium |
CN111724765B (en) * | 2020-06-30 | 2023-07-25 | 度小满科技(北京)有限公司 | Text-to-speech method and device and computer equipment |
CN112309367B (en) * | 2020-11-03 | 2022-12-06 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112364653A (en) * | 2020-11-09 | 2021-02-12 | 北京有竹居网络技术有限公司 | Text analysis method, apparatus, server and medium for speech synthesis |
CN113808579B (en) * | 2021-11-22 | 2022-03-08 | 中国科学院自动化研究所 | Detection method and device for generated voice, electronic equipment and storage medium |
CN114333760B (en) * | 2021-12-31 | 2023-06-02 | 科大讯飞股份有限公司 | Construction method of information prediction module, information prediction method and related equipment |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7136816B1 (en) * | 2002-04-05 | 2006-11-14 | At&T Corp. | System and method for predicting prosodic parameters |
CN101650942B (en) * | 2009-08-26 | 2012-06-27 | 北京邮电大学 | Prosodic structure forming method based on prosodic phrase |
CN102254554B (en) * | 2011-07-18 | 2012-08-08 | 中国科学院自动化研究所 | Method for carrying out hierarchical modeling and predicating on mandarin accent |
JP6230606B2 (en) * | 2012-08-30 | 2017-11-15 | インタラクティブ・インテリジェンス・インコーポレイテッド | Method and system for predicting speech recognition performance using accuracy scores |
- 2017-07-11 CN CN201710561567.XA patent/CN107464559B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN107464559A (en) | 2017-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107464559B (en) | Combined prediction model construction method and system based on Chinese prosody structure and accents | |
US11676573B2 (en) | Controlling expressivity in end-to-end speech synthesis systems | |
CN111798832A (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
Zen et al. | Statistical parametric speech synthesis | |
CN115516552A (en) | Speech recognition using synthesis of unexplained text and speech | |
Liu et al. | Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability | |
Zheng et al. | BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End. | |
Liu et al. | Mongolian text-to-speech system based on deep neural network | |
CN113808571B (en) | Speech synthesis method, speech synthesis device, electronic device and storage medium | |
CN111339771B (en) | Text prosody prediction method based on multitasking multi-level model | |
Zhang et al. | Extracting and predicting word-level style variations for speech synthesis | |
CN112669809A (en) | Parallel neural text to speech conversion | |
Lazaridis et al. | Improving phone duration modelling using support vector regression fusion | |
Lin et al. | Hierarchical prosody modeling for Mandarin spontaneous speech | |
Sawada et al. | The nitech text-to-speech system for the blizzard challenge 2016 | |
Rebai et al. | Arabic speech synthesis and diacritic recognition | |
Wutiwiwatchai et al. | Thai text-to-speech synthesis: a review | |
Park et al. | Korean grapheme unit-based speech recognition using attention-ctc ensemble network | |
Rebai et al. | Arabic text to speech synthesis based on neural networks for MFCC estimation | |
JP4684770B2 (en) | Prosody generation device and speech synthesis device | |
Xu et al. | End-to-end speech synthesis for *** multidialect | |
CN113571037A (en) | Method and system for synthesizing Chinese braille voice | |
Ilyes et al. | Statistical parametric speech synthesis for Arabic language using ANN | |
CN114373445B (en) | Voice generation method and device, electronic equipment and storage medium | |
Choi et al. | Label Embedding for Chinese Grapheme-to-Phoneme Conversion. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||