CN107464559B - Combined prediction model construction method and system based on Chinese prosody structure and accents - Google Patents
Combined prediction model construction method and system based on Chinese prosody structure and accents
- Publication number: CN107464559B
- Application number: CN201710561567.XA
- Authority: CN (China)
- Prior art keywords: prediction model, accent, text, decoding, Chinese
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1807—Speech classification or search using natural language modelling using prosody or stress

- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/14—Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]

- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention relates to a method and a system for constructing a joint prediction model based on the combination of Chinese prosodic structure and accent. The construction method comprises the following steps: preprocessing a plurality of training corpora to obtain preprocessed text; performing word segmentation on the preprocessed text to obtain segmented text information; determining the word vector feature sequence of the corresponding text from the segmented text information; and encoding and decoding the word vector feature sequence with attention-based RNN encoding-decoding to establish a joint prediction model of Chinese prosodic structure and accent for predicting the prosodic structure and accent of the text to be processed. By preprocessing and segmenting the training corpora to obtain segmented text information and the word vector feature sequences of the corresponding texts, and then building the joint model on attention-based RNN (recurrent neural network) encoding-decoding, the invention fully accounts for the relationship between Chinese prosodic structure and accent and achieves accurate prediction for the text under test.
Description
Technical Field
The invention relates to the technical field of human-computer interaction and speech synthesis, and in particular to a method and a system for constructing a joint prediction model based on Chinese prosodic structure and accent.
Background
Accurately describing prosodic structure and stress, and predicting them from text information, have always been among the most important steps in speech synthesis; they are key components for improving the naturalness and expressiveness of synthesized speech and for building harmonious human-computer interaction technology. Prosodic structure and stress models capture the rise and fall, the pauses, and the variations of stress and tempo in speech, and thereby further improve the expressiveness and naturalness of synthesized speech. Prosodic structure and stress modeling and prediction are therefore of great significance to the development of speech synthesis, human-computer interaction, and related fields.
Although much research has been done in this area, many problems in prosodic structure and stress prediction remain unsolved. In the description of text features, the word vector features come from a word vector model trained in advance, and the word vector values are not further adjusted for the task at hand during prosodic structure and stress model training. Furthermore, in the choice of prediction models for prosodic structure and stress, contextual text features are not considered comprehensively enough. Moreover, existing research on Chinese prosodic structure and stress has shown a relatively close association between the two; yet in existing approaches to Chinese prosodic structure and stress prediction, the two are modeled as relatively independent tasks, and the relationship between them is not taken into account.
Disclosure of Invention
In order to solve the above problems in the prior art, namely to predict the prosodic structure and stress in text information accurately by combining Chinese prosodic structure and accent, the invention provides a method and a system for constructing a joint prediction model based on Chinese prosodic structure and accent.
In order to achieve the purpose, the invention provides the following scheme:
a combined prediction model construction method based on Chinese prosody structure and accent comprises the following steps:
preprocessing a plurality of training corpora to obtain preprocessed texts;
performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
and based on the coding-decoding of the recurrent neural network RNN of the attention mechanism, coding and decoding the word vector characteristic sequence, and establishing a combined prediction model based on the Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
Optionally, the preprocessing of the plurality of training corpora specifically includes:
regularizing the training corpora and correcting polyphone pronunciation errors; and/or regularizing the numbers.
Optionally, the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
according to the word segmentation text information, searching word vectors of corresponding words by a word table searching method, and determining word vector characteristic sequences of corresponding texts;
wherein the initial value of the vocabulary is obtained based on the training of the continuous bag-of-words model CBOW.
Optionally, the establishing of the joint prediction model based on the chinese prosody structure and the accent specifically includes:
a bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step;
and decoding by a decoder based on a unidirectional RNN to obtain a decoding-state function representing the joint prediction model of Chinese prosodic structure and accent, wherein the decoding-state function is used for predicting the prosodic structure and accents of the text to be processed.
Optionally, the establishing a joint prediction model based on the chinese prosody structure and the accent further includes:
extracting prosodic structures and accent labeling results in the training corpora as target values;
calculating the predicted value of each training corpus according to the decoding state function;
and adjusting the state parameters of the prediction model according to the target value and the predicted value.
Optionally, the bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions and determines the hidden state of the encoder at each time step, specifically including:
the forward RNN reads the word vector feature sequence x = (x_1, x_2, ..., x_T) in the forward direction and generates a forward hidden state fh_i at each time step i, where the forward hidden states form the sequence (fh_1, ..., fh_T), i = 1, 2, ..., T, and f denotes the forward hidden-state parameters of the prediction model;
the backward RNN reads the word vector feature sequence in the reverse direction and generates a backward hidden state bh_i, where the backward hidden states form the sequence (bh_T, ..., bh_1), and b denotes the backward hidden-state parameters of the prediction model;
the hidden state h_i of the encoder at each time step is determined from the forward hidden state fh_i and the backward hidden state bh_i, where h_i = [fh_i, bh_i].
Optionally, the decoding performed by the decoder based on the unidirectional RNN specifically includes:
obtaining the decoding state s_{i-1} of the unidirectional RNN decoder at time step (i-1) and the corresponding label y_{i-1};
obtaining the hidden state h_i of the bidirectional RNN encoder at the current time step i and the semantic vector c_i;
determining, from the decoding state s_{i-1}, the label y_{i-1}, the hidden state h_i, and the semantic vector c_i, the decoding state s_i of the unidirectional RNN decoder corresponding to the current time step i, where s_i = P(s_{i-1}, y_{i-1}, h_i, c_i) and P() denotes a relational function.
Optionally, the semantic vector c_i is determined according to:

c_i = Σ_{k=1}^{T} α_{i,k} h_k;

α_{i,k} = exp(e_{i,k}) / Σ_{j=1}^{T} exp(e_{i,j});

e_{i,k} = g(s_{i-1}, h_k);

where g() denotes a neural network, i, j, and k denote time-step indices, and i = 1, 2, ..., T.
Optionally, the prediction model is divided into three levels, a first level is prosodic words, a second level is prosodic phrases, and a third level is intonation phrases;
when predicting the prosodic structure and accent of a text to be processed, predicting the accent while predicting prosodic words at a first level;
and at the second and third levels, accent prediction is removed, the prediction result of the previous level is used as the input of the current level, and it is concatenated with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
In order to achieve the above purpose, the invention also provides the following scheme:
a prediction model construction system based on Chinese prosody structure and accent combination, the construction system comprising:
the text preprocessing module is used for preprocessing the training corpora to obtain preprocessed texts;
the word segmentation module is used for carrying out word segmentation on the preprocessed text to obtain word segmentation text information;
the word vector determining module is used for determining a corresponding word vector characteristic sequence according to the word segmentation text information;
and the modeling module is used for coding and decoding the word vector characteristic sequence based on the coding-decoding of the RNN of the attention mechanism, and establishing a combined prediction model based on the Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
According to the embodiment of the invention, the invention discloses the following technical effects:
the invention obtains word segmentation text information by preprocessing and word segmentation processing a plurality of training linguistic data to obtain a word vector characteristic sequence of a corresponding text, further establishes a combined prediction model based on a Chinese prosody structure and accents based on a recurrent neural network, fully considers the relation between the Chinese prosody structure and the accents, and realizes accurate prediction of the text to be detected.
Drawings
FIG. 1 is a flow chart of a prediction model construction method based on the combination of Chinese prosody structure and accent;
FIG. 2 is a schematic diagram of a module structure of a prediction model construction system based on the combination of Chinese prosody structure and accent.
Description of the symbols:
the system comprises a text preprocessing module-1, a word segmentation module-2, a word vector determination module-3 and a modeling module-4.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a method for constructing a prediction model based on the combination of Chinese prosodic structure and accent. A plurality of training corpora are preprocessed and segmented to obtain segmented text information and the word vector feature sequences of the corresponding texts; a joint prediction model of Chinese prosodic structure and accent is then established based on a recurrent neural network. The relationship between Chinese prosodic structure and accent is thereby fully considered, and accurate prediction of the text under test is achieved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in fig. 1, the method for constructing a prediction model based on the combination of chinese prosody structure and accent of the present invention includes:
step 100: preprocessing a plurality of training corpora to obtain preprocessed texts;
step 200: performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
step 300: determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
step 400: and based on the coding-decoding of the RNN (Recurrent Neural Network) of the attention mechanism, coding and decoding the word vector characteristic sequence, and establishing a combined prediction model based on the Chinese prosody structure and accent for predicting the prosody structure and accent of the text to be processed.
In step 100, a large number of training corpora whose prosodic structure and accent style are similar to that of the text to be processed are collected into a database of about 15 GB; the more text corpora, the higher the prediction accuracy. The database stores the training corpora together with the corresponding prosodic structure and accent annotation results.
Further, the preprocessing of the plurality of training corpora specifically includes:
regularizing the training corpora and correcting polyphone pronunciation errors; and/or regularizing the numbers.
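As an illustrative sketch (not taken from the patent), such preprocessing might look like the following. The digit map is deliberately simplified and the `polyphone_fixes` table is a hypothetical hand-curated resource; a real system would also handle multi-digit numbers, dates, and context-dependent polyphones:

```python
import re

# Simplified digit-to-character map (assumption for illustration only);
# real Chinese number regularization must handle place values, years, etc.
DIGIT_MAP = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
             "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

def regularize_numbers(text: str) -> str:
    """Replace each Arabic digit with its Chinese character reading."""
    return re.sub(r"\d", lambda m: DIGIT_MAP[m.group(0)], text)

def preprocess(text: str, polyphone_fixes: dict) -> str:
    """Apply hand-curated polyphone corrections, then regularize numbers."""
    for wrong, right in polyphone_fixes.items():
        text = text.replace(wrong, right)
    return regularize_numbers(text)
```

For example, `regularize_numbers("2017年")` yields a digit-by-digit reading, which is the appropriate convention for years.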
In step 300, the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
according to the word segmentation text information, searching word vectors of corresponding words by a word table searching method, and determining word vector characteristic sequences of corresponding texts;
wherein the initial values of the word table are obtained by training a continuous bag-of-words (CBOW) model. During subsequent training, the word vector model is continuously updated, so that the word table is continuously enriched and the prediction accuracy on the text to be processed is improved.
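The word-table lookup described above can be sketched as follows. Random initial vectors stand in for CBOW-trained ones, and all names and dimensions are illustrative assumptions:

```python
import numpy as np

def build_word_table(vocab, dim=100, seed=0):
    """Word table mapping each word to a vector; in the patent the initial
    values come from CBOW training (random init here as a stand-in)."""
    rng = np.random.default_rng(seed)
    table = {w: rng.standard_normal(dim) for w in vocab}
    table["<UNK>"] = np.zeros(dim)  # fallback for out-of-vocabulary words
    return table

def to_feature_sequence(words, table):
    """Look up each segmented word to form the feature sequence x = (x_1, ..., x_T)."""
    return np.stack([table.get(w, table["<UNK>"]) for w in words])
```

In a trainable setup the table would be an embedding matrix whose rows are updated by back-propagation, matching the patent's point that the word vectors keep being adjusted during model training.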
In step 400, the establishing of the joint prediction model based on the Chinese prosodic structure and accents specifically includes:
Step 401: a bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions, determining the hidden state of the encoder at each time step.
Step 402: a decoder based on a unidirectional RNN performs decoding to obtain a decoding-state function representing the joint prediction model of Chinese prosodic structure and accent, where the decoding-state function is used for predicting the prosodic structure and accents of the text to be processed.
Further, the establishing of the joint prediction model based on the Chinese prosody structure and the accents further includes:
step 403: extracting prosodic structures and accent labeling results in the training corpora as target values;
step 404: calculating the predicted value of each training corpus according to the decoding state function;
step 405: and adjusting the state parameters of the prediction model according to the target value and the predicted value. The state parameters comprise forward hidden state parameters of the prediction model and reverse hidden state parameters of the prediction model.
In step 401, the bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions and determines the hidden state of the encoder at each time step, specifically including:
Step 4011: the forward RNN reads the word vector feature sequence x = (x_1, x_2, ..., x_T) in the forward direction and generates a forward hidden state fh_i at each time step i, where the forward hidden states form the sequence (fh_1, ..., fh_T), i = 1, 2, ..., T, and f denotes the forward hidden-state parameters of the prediction model;
Step 4012: the backward RNN reads the word vector feature sequence in the reverse direction and generates a backward hidden state bh_i, where the backward hidden states form the sequence (bh_T, ..., bh_1), and b denotes the backward hidden-state parameters of the prediction model;
Step 4013: the hidden state h_i of the encoder at each time step is determined from the forward hidden state fh_i and the backward hidden state bh_i, where h_i = [fh_i, bh_i].
Owing to the temporal modeling property of the RNN, the hidden state of the encoder at the final time step also carries the information of the entire source input sequence.
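A minimal numerical sketch of the bidirectional encoding in step 401, using a vanilla tanh RNN cell (the patent does not fix the cell type, so this choice is an assumption; biases are omitted for brevity):

```python
import numpy as np

def rnn_pass(xs, Wx, Wh, h0):
    """Vanilla tanh RNN: h_t = tanh(Wx x_t + Wh h_{t-1})."""
    h, states = h0, []
    for x in xs:
        h = np.tanh(Wx @ x + Wh @ h)
        states.append(h)
    return states

def bidirectional_encode(xs, Wf_x, Wf_h, Wb_x, Wb_h, hidden_dim):
    """h_i = [fh_i, bh_i]: forward pass over x_1..x_T, backward pass over
    x_T..x_1 (re-aligned to the original order), concatenated per time step."""
    h0 = np.zeros(hidden_dim)
    fh = rnn_pass(xs, Wf_x, Wf_h, h0)
    bh = rnn_pass(xs[::-1], Wb_x, Wb_h, h0)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fh, bh)]
```

Each h_i therefore summarizes both the left context (through fh_i) and the right context (through bh_i) of word i, which is why the encoder states are suitable inputs for the attention mechanism below.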
In step 402, the decoding performed by the decoder based on the unidirectional RNN specifically includes:
Step 4021: obtain the decoding state s_{i-1} of the unidirectional RNN decoder at time step (i-1) and the corresponding label y_{i-1}.
Step 4022: obtain the hidden state h_i of the bidirectional RNN encoder at the current time step i and the semantic vector c_i.
Further, the semantic vector c_i may be determined according to equations (1)-(3):

c_i = Σ_{k=1}^{T} α_{i,k} h_k (1);

α_{i,k} = exp(e_{i,k}) / Σ_{j=1}^{T} exp(e_{i,j}) (2);

e_{i,k} = g(s_{i-1}, h_k) (3);

where g() denotes a neural network, i, j, and k denote time-step indices, and i = 1, 2, ..., T.
Step 4023: determine, from the decoding state s_{i-1}, the label y_{i-1}, the hidden state h_i, and the semantic vector c_i, the decoding state s_i corresponding to the current time step i, where s_i = P(s_{i-1}, y_{i-1}, h_i, c_i) and P() denotes a relational function.
Compared with the bidirectional RNN encoder, the unidirectional RNN decoder proceeds in a single direction during decoding. Besides using the hidden state h_i of the bidirectional RNN encoder at each time step, the decoder further introduces an attention mechanism (i.e., the semantic vector c_i). With the attention mechanism, the decoder state s_i at time step i of decoding is jointly determined by the decoder state s_{i-1} at time step (i-1), the corresponding label y_{i-1}, the encoder hidden state h_i aligned with the current time, and the semantic vector c_i.
The semantic vector c_i is a weighted average of the encoder hidden-state sequence [h_1, ..., h_T] (equation (1)), and thus provides more context information for the unidirectional RNN decoder.
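The decoder update of step 4023 can be sketched as follows. Since the patent leaves the relational function P unspecified, a single tanh layer over the four inputs is assumed purely for illustration, with hypothetical parameter matrices A, B, C, D:

```python
import numpy as np

def decoder_step(s_prev, y_prev, h_i, c_i, A, B, C, D):
    """One unidirectional-decoder step s_i = P(s_{i-1}, y_{i-1}, h_i, c_i).
    P is realized here as one tanh layer (an assumption; the patent does not
    specify the form of the relational function)."""
    return np.tanh(A @ s_prev + B @ y_prev + C @ h_i + D @ c_i)
```

In practice P would typically be a GRU- or LSTM-style cell, but any differentiable function of these four inputs fits the description in the text.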
The invention introduces a multi-task learning mechanism for joint prediction modeling of prosodic structure and stress. Specifically, the prediction model is divided into three levels: the first level is prosodic words (PW), the second level is prosodic phrases (PPH), and the third level is intonation phrases (IPH). When predicting the prosodic structure and accent of a text to be processed, accents are predicted at the same time as prosodic words at the first level; at the second and third levels, accent prediction is removed, the prediction result of the previous level is used as the input of the current level, and it is concatenated with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
In parallel with prosodic word prediction at the first level of the prosodic structure, another task (i.e., another decoder) is added to predict stress at the same time. The encoder and the word vector layer are shared between the two tasks. During model training, the loss function of the whole model is the sum of the errors of the two tasks, i.e., of the two decoders; this error is back-propagated to adjust the model parameters. When predicting the other two levels of the prosodic structure (i.e., prosodic phrases and intonation phrases), the accent prediction task is removed, and the prediction result of the previous level is used as the input of the current level: it is concatenated with the sequence produced by the word vector layer and then fed into the encoder.
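The first-level training objective described above, the sum of the two decoders' errors over the shared encoder, might be sketched as follows, with cross-entropy assumed as the per-task error (the patent does not specify the loss form):

```python
import numpy as np

def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -np.log(probs[target])

def level1_joint_loss(pw_probs, pw_targets, accent_probs, accent_targets):
    """First-level multi-task loss: sum of the prosodic-word decoder error and
    the accent decoder error; gradients of this sum would be back-propagated
    into the shared encoder and word vector layer."""
    loss_pw = sum(cross_entropy(p, t) for p, t in zip(pw_probs, pw_targets))
    loss_acc = sum(cross_entropy(p, t) for p, t in zip(accent_probs, accent_targets))
    return loss_pw + loss_acc
```

At the second and third levels only the first term would remain, matching the removal of the accent task described in the text.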
Further, the attention-based RNN encoding-decoding joint prediction model of prosodic structure and accent outputs the prosodic structure and accent prediction results for the corresponding text.
In summary, the method initializes word vectors with a pre-trained word vector model and jointly models Chinese prosodic structure and accent using an attention model and multi-task learning, so as to predict the prosodic structure and accent information of the text to be processed.
The invention improves prosodic structure and accent modeling and prediction at both the feature level and the model level. At the feature level, a word vector model dedicated to prosodic structure and accent is established, making the description of text features more accurate. At the model level, the Chinese prosodic structure and accent are modeled jointly using an attention model and multi-task learning, which greatly improves the performance of Chinese prosodic structure and stress prediction. The prediction results are used to guide the back end of speech synthesis, improving the naturalness and expressiveness of the synthesized speech.
In addition, the invention also provides a combined prediction model construction system based on the Chinese prosody structure and accent. As shown in fig. 2, the system for constructing a joint prediction model based on chinese prosody structure and accent of the present invention includes a text preprocessing module 1, a word segmentation module 2, a word vector determination module 3, and a modeling module 4.
The text preprocessing module 1 is configured to preprocess a plurality of training corpora to obtain a preprocessed text; the word segmentation module 2 is used for performing word segmentation processing on the preprocessed text to obtain word segmentation text information; the word vector determining module 3 is configured to determine a corresponding word vector feature sequence according to the word segmentation text information; the modeling module 4 is used for coding and decoding the word vector feature sequence based on the coding-decoding of the RNN of the attention mechanism, and establishing a combined prediction model based on a Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
Compared with the prior art, the prediction model construction system based on the combination of the Chinese prosody structure and the accent has the same beneficial effects as the prediction model construction method based on the combination of the Chinese prosody structure and the accent, and is not repeated herein.
So far, the technical solutions of the present invention have been described with reference to the preferred embodiments shown in the drawings. However, those skilled in the art will readily understand that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of the relevant technical features may be made without departing from the principle of the present invention, and the resulting technical solutions fall within the scope of the present invention.
Claims (10)
1. A combined prediction model construction method based on Chinese prosody structure and accent is characterized by comprising the following steps:
preprocessing a plurality of training corpora to obtain preprocessed texts;
performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
and coding and decoding the word vector characteristic sequence based on the coding-decoding of the recurrent neural network RNN of the attention mechanism, and establishing a prediction model based on the combination of the Chinese prosody structure and the accent for predicting the prosody structure and the accent of the text to be processed.
2. The method for constructing a combined prediction model based on chinese prosody structure and accent according to claim 1, wherein the preprocessing the plurality of corpus specifically includes:
carrying out regularization processing on the training corpus and correcting polyphone pronunciation errors; and/or regularize the numbers.
3. The method for constructing a combined prediction model based on chinese prosodic structure and accent according to claim 1, wherein the determining a word vector feature sequence of a corresponding text according to the segmented text information specifically includes:
according to the word segmentation text information, searching word vectors of corresponding words by a word table searching method, and determining word vector characteristic sequences of corresponding texts;
wherein the initial value of the vocabulary is obtained based on the training of the continuous bag-of-words model CBOW.
4. The method for constructing a combined prediction model based on chinese prosodic structure and accent according to claim 1, wherein the establishing a combined prediction model based on chinese prosodic structure and accent specifically includes:
a bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step;
and decoding by a decoder based on a unidirectional RNN to obtain a decoding-state function representing the joint prediction model of Chinese prosodic structure and accent, wherein the decoding-state function is used for predicting the prosodic structure and accents of the text to be processed.
5. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to claim 4, wherein the establishing of the combined prediction model based on Chinese prosodic structure and accent further comprises:
extracting prosodic structures and accent labeling results in the training corpora as target values;
calculating the predicted value of each training corpus according to the decoding state function;
and adjusting the state parameters of the prediction model according to the target value and the predicted value.
6. The method as claimed in claim 4, wherein the bidirectional RNN-based encoder reads the word vector feature sequence in the forward and reverse directions to determine the hidden state of the encoder at each time step, specifically comprising:
the forward RNN reads the word vector feature sequence x = (x_1, x_2, ..., x_T) in the forward direction and generates a forward hidden state fh_i at each time step i, where the forward hidden states form the sequence (fh_1, ..., fh_T), i = 1, 2, ..., T, and f denotes the forward hidden-state parameters of the prediction model;
the backward RNN reads the word vector feature sequence in the reverse direction and generates a backward hidden state bh_i, where the backward hidden states form the sequence (bh_T, ..., bh_1), and b denotes the backward hidden-state parameters of the prediction model;
and the hidden state h_i of the encoder at each time step is determined from the forward hidden state fh_i and the backward hidden state bh_i, where h_i = [fh_i, bh_i].
7. The method for constructing a joint prediction model based on Chinese prosodic structure and accent according to claim 6, wherein the decoding performed by the decoder based on the unidirectional RNN specifically comprises:
obtaining the decoding state s_{i-1} of the unidirectional RNN decoder at time step (i-1) and the corresponding label y_{i-1};
obtaining the hidden state h_i of the bidirectional RNN encoder at the current time step i and the semantic vector c_i;
determining, from the decoding state s_{i-1}, the label y_{i-1}, the hidden state h_i, and the semantic vector c_i, the decoding state s_i corresponding to the current time step i, where s_i = P(s_{i-1}, y_{i-1}, h_i, c_i), P() denotes a relational function, and s_i denotes the decoding state corresponding to the current time step i.
9. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to any one of claims 1-8, wherein the prediction model is divided into three levels: the first level is prosodic words, the second level is prosodic phrases, and the third level is intonation phrases;
when predicting the prosodic structure and accent of a text to be processed, accent is predicted jointly with the prosodic words at the first level;
and at the second and third levels, accent prediction is removed, the prediction result of the previous level is used as the input of the current level, and it is spliced with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
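The three-level cascade in claim 9 can be sketched with toy per-word classifiers standing in for the per-level networks. The splicing of the previous level's output onto the word vectors mirrors the claim; everything else (the linear classifiers, the dimensions) is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 6, 8
word_vecs = rng.standard_normal((T, D))   # word-vector sequence of the text

def predict(features, W):
    # Toy per-word binary classifier standing in for the RNN at each level
    return (features @ W > 0).astype(float)

# Level 1: prosodic-word boundaries and accent predicted jointly
W1 = rng.standard_normal((D, 2))          # column 0: boundary, column 1: accent
level1 = predict(word_vecs, W1)

# Levels 2 and 3: accent output dropped; the previous level's prediction
# is spliced onto the word vectors to form the current level's input
feat2 = np.hstack([word_vecs, level1[:, :1]])            # prosodic-phrase input
level2 = predict(feat2, rng.standard_normal((D + 1, 1)))

feat3 = np.hstack([word_vecs, level2])                   # intonation-phrase input
level3 = predict(feat3, rng.standard_normal((D + 1, 1)))
print(level1.shape, level2.shape, level3.shape)
```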
10. A combined prediction model construction system based on Chinese prosody structure and accent is characterized by comprising:
the text preprocessing module is used for preprocessing the training corpora to obtain preprocessed texts;
the word segmentation module is used for carrying out word segmentation on the preprocessed text to obtain word segmentation text information;
the word vector determining module is used for determining a corresponding word vector feature sequence according to the word segmentation text information;
and the modeling module is used for encoding and decoding the word vector feature sequence with an RNN encoder-decoder based on an attention mechanism, and establishing a combined prediction model based on Chinese prosodic structure and accents for predicting the prosodic structure and accents of the text to be processed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710561567.XA CN107464559B (en) | 2017-07-11 | 2017-07-11 | Combined prediction model construction method and system based on Chinese prosody structure and accents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107464559A CN107464559A (en) | 2017-12-12 |
CN107464559B true CN107464559B (en) | 2020-12-15 |
Family
ID=60543891
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710561567.XA Active CN107464559B (en) | 2017-07-11 | 2017-07-11 | Combined prediction model construction method and system based on Chinese prosody structure and accents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107464559B (en) |
Families Citing this family (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108417210B (en) * | 2018-01-10 | 2020-06-26 | 苏州思必驰信息科技有限公司 | Word embedding language model training method, word recognition method and system |
CN108231062B (en) * | 2018-01-12 | 2020-12-22 | 科大讯飞股份有限公司 | Voice translation method and device |
CN108417202B (en) * | 2018-01-19 | 2020-09-01 | 苏州思必驰信息科技有限公司 | Voice recognition method and system |
CN110321913B (en) * | 2018-03-30 | 2023-07-25 | 杭州海康威视数字技术股份有限公司 | Text recognition method and device |
US10923107B2 (en) * | 2018-05-11 | 2021-02-16 | Google Llc | Clockwork hierarchical variational encoder |
CN108920622B (en) * | 2018-06-29 | 2021-07-20 | 北京奇艺世纪科技有限公司 | Training method, training device and recognition device for intention recognition |
CN108897894A (en) * | 2018-07-12 | 2018-11-27 | 电子科技大学 | A kind of problem generation method |
CN109271643A (en) * | 2018-08-08 | 2019-01-25 | 北京捷通华声科技股份有限公司 | A kind of training method of translation model, interpretation method and device |
KR20200020545A (en) * | 2018-08-17 | 2020-02-26 | 삼성전자주식회사 | Electronic apparatus and controlling method thereof |
CN109299273B (en) * | 2018-11-02 | 2020-06-23 | 广州语义科技有限公司 | Multi-source multi-label text classification method and system based on improved seq2seq model |
CN109615538A (en) * | 2018-12-13 | 2019-04-12 | 平安医疗健康管理股份有限公司 | Social security violation detection method, device, equipment and computer storage medium |
CN109545186B (en) * | 2018-12-16 | 2022-05-27 | 魔门塔(苏州)科技有限公司 | Speech recognition training system and method |
CN111354333B (en) * | 2018-12-21 | 2023-11-10 | 中国科学院声学研究所 | Self-attention-based Chinese prosody level prediction method and system |
CN109670185B (en) * | 2018-12-27 | 2023-06-23 | 北京百度网讯科技有限公司 | Text generation method and device based on artificial intelligence |
CN110310619A (en) * | 2019-05-16 | 2019-10-08 | 平安科技(深圳)有限公司 | Polyphone prediction technique, device, equipment and computer readable storage medium |
CN110211568A (en) * | 2019-06-03 | 2019-09-06 | 北京大牛儿科技发展有限公司 | A kind of audio recognition method and device |
CN110427608B (en) * | 2019-06-24 | 2021-06-08 | 浙江大学 | Chinese word vector representation learning method introducing layered shape-sound characteristics |
CN110277085B (en) * | 2019-06-25 | 2021-08-24 | 腾讯科技(深圳)有限公司 | Method and device for determining polyphone pronunciation |
CN110457661B (en) * | 2019-08-16 | 2023-06-20 | 腾讯科技(深圳)有限公司 | Natural language generation method, device, equipment and storage medium |
CN111639152B (en) * | 2019-08-29 | 2021-04-13 | 上海卓繁信息技术股份有限公司 | Intention recognition method |
CN110534087B (en) * | 2019-09-04 | 2022-02-15 | 清华大学深圳研究生院 | Text prosody hierarchical structure prediction method, device, equipment and storage medium |
CN110782870B (en) * | 2019-09-06 | 2023-06-16 | 腾讯科技(深圳)有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN111061868B (en) * | 2019-11-05 | 2023-05-23 | 百度在线网络技术(北京)有限公司 | Reading method prediction model acquisition and reading method prediction method, device and storage medium |
CN112783334A (en) * | 2019-11-08 | 2021-05-11 | 阿里巴巴集团控股有限公司 | Text generation method and device, electronic equipment and computer-readable storage medium |
CN110970031B (en) * | 2019-12-16 | 2022-06-24 | 思必驰科技股份有限公司 | Speech recognition system and method |
CN113302683B (en) * | 2019-12-24 | 2023-08-04 | 深圳市优必选科技股份有限公司 | Multi-tone word prediction method, disambiguation method, device, apparatus, and computer-readable storage medium |
CN113129864A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice feature prediction method, device, equipment and readable storage medium |
CN111724765B (en) * | 2020-06-30 | 2023-07-25 | 度小满科技(北京)有限公司 | Text-to-speech method and device and computer equipment |
CN112309367B (en) * | 2020-11-03 | 2022-12-06 | 北京有竹居网络技术有限公司 | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN112364653A (en) * | 2020-11-09 | 2021-02-12 | 北京有竹居网络技术有限公司 | Text analysis method, apparatus, server and medium for speech synthesis |
CN113808579B (en) * | 2021-11-22 | 2022-03-08 | 中国科学院自动化研究所 | Detection method and device for generated voice, electronic equipment and storage medium |
CN114333760B (en) * | 2021-12-31 | 2023-06-02 | 科大讯飞股份有限公司 | Construction method of information prediction module, information prediction method and related equipment |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7136816B1 (en) * | 2002-04-05 | 2006-11-14 | At&T Corp. | System and method for predicting prosodic parameters |
CN101650942B (en) * | 2009-08-26 | 2012-06-27 | 北京邮电大学 | Prosodic structure forming method based on prosodic phrase |
CN102254554B (en) * | 2011-07-18 | 2012-08-08 | 中国科学院自动化研究所 | Method for carrying out hierarchical modeling and predicating on mandarin accent |
JP6230606B2 (en) * | 2012-08-30 | 2017-11-15 | インタラクティブ・インテリジェンス・インコーポレイテッド | Method and system for predicting speech recognition performance using accuracy scores |
- 2017-07-11 CN CN201710561567.XA patent/CN107464559B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN107464559A (en) | 2017-12-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107464559B (en) | Combined prediction model construction method and system based on Chinese prosody structure and accents | |
US11676573B2 (en) | Controlling expressivity in end-to-end speech synthesis systems | |
CN111798832A (en) | Speech synthesis method, apparatus and computer-readable storage medium | |
Zen et al. | Statistical parametric speech synthesis | |
CN115516552A (en) | Speech recognition using synthesis of unexplained text and speech | |
Liu et al. | Reinforcement learning for emotional text-to-speech synthesis with improved emotion discriminability | |
Zheng et al. | BLSTM-CRF Based End-to-End Prosodic Boundary Prediction with Context Sensitive Embeddings in a Text-to-Speech Front-End. | |
Liu et al. | Mongolian text-to-speech system based on deep neural network | |
CN113808571B (en) | Speech synthesis method, speech synthesis device, electronic device and storage medium | |
CN111339771B (en) | Text prosody prediction method based on multitasking multi-level model | |
Zhang et al. | Extracting and predicting word-level style variations for speech synthesis | |
CN112669809A (en) | Parallel neural text to speech conversion | |
Lazaridis et al. | Improving phone duration modelling using support vector regression fusion | |
Lin et al. | Hierarchical prosody modeling for Mandarin spontaneous speech | |
Sawada et al. | The nitech text-to-speech system for the blizzard challenge 2016 | |
Rebai et al. | Arabic speech synthesis and diacritic recognition | |
Wutiwiwatchai et al. | Thai text-to-speech synthesis: a review | |
Park et al. | Korean grapheme unit-based speech recognition using attention-ctc ensemble network | |
Rebai et al. | Arabic text to speech synthesis based on neural networks for MFCC estimation | |
JP4684770B2 (en) | Prosody generation device and speech synthesis device | |
Xu et al. | End-to-end speech synthesis for *** multidialect | |
CN113571037A (en) | Method and system for synthesizing Chinese braille voice | |
Ilyes et al. | Statistical parametric speech synthesis for Arabic language using ANN | |
CN114373445B (en) | Voice generation method and device, electronic equipment and storage medium | |
Choi et al. | Label Embedding for Chinese Grapheme-to-Phoneme Conversion. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||