CN107464559B - Combined prediction model construction method and system based on Chinese prosody structure and accents

Info

Publication number
CN107464559B
Authority
CN
China
Prior art keywords
prediction model
accent
text
decoding
chinese
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710561567.XA
Other languages
Chinese (zh)
Other versions
CN107464559A (en)
Inventor
陶建华
郑艺斌
李雅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201710561567.XA priority Critical patent/CN107464559B/en
Publication of CN107464559A publication Critical patent/CN107464559A/en
Application granted granted Critical
Publication of CN107464559B publication Critical patent/CN107464559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/18: Speech classification or search using natural language modelling
    • G10L15/1807: Speech classification or search using natural language modelling using prosody or stress
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L15/16: Speech classification or search using artificial neural networks

Abstract

The invention relates to a method and a system for constructing a joint prediction model based on Chinese prosody structure and accent. The construction method comprises the following steps: preprocessing a plurality of training corpora to obtain preprocessed text; performing word segmentation on the preprocessed text to obtain word segmentation text information; determining a word vector feature sequence of the corresponding text according to the word segmentation text information; and encoding and decoding the word vector feature sequence with an attention-based RNN encoder-decoder to establish a joint prediction model based on Chinese prosody structure and accent, used for predicting the prosody structure and accent of text to be processed. By preprocessing and segmenting a plurality of training corpora, the invention obtains word segmentation text information and, from it, the word vector feature sequence of the corresponding text; it then establishes the joint prediction model with an attention-based RNN (recurrent neural network) encoder-decoder, which takes full account of the relation between Chinese prosody structure and accent and enables accurate prediction for text to be processed.

Description

Combined prediction model construction method and system based on Chinese prosody structure and accents
Technical Field
The invention relates to the technical field of speech synthesis in human-computer interaction, and in particular to a method and a system for constructing a joint prediction model based on Chinese prosody structure and accent.
Background
Accurately describing prosodic structure and stress, and predicting them from text information, have always been the most important steps in speech synthesis; they are key to improving the naturalness and expressiveness of synthesized speech and to building harmonious human-computer interaction. A prosodic structure and accent model can capture the cadence, pauses, emphasis, and tempo of speech, and thereby improve the expressiveness and naturalness of synthesized speech. Prosodic structure and stress modeling and prediction are therefore of great significance to the development of speech synthesis, human-computer interaction, and related fields.
Although much research has been done in this area, many problems in prosodic structure and stress prediction remain poorly solved. In describing text features, the word vector features come from a word vector model trained in advance, and the word vector values are not further adjusted to the task during prosodic structure and accent model training. In selecting prediction models for prosodic structure and accent, contextual text features are not considered comprehensively enough. Moreover, existing research on Chinese prosodic structure and accent has shown that the two are closely associated, yet existing approaches model prosodic structure prediction and accent prediction as two relatively independent tasks and do not take the relationship between them into account.
Disclosure of Invention
To solve the above problem in the prior art, namely to predict the prosodic structure and accent of text accurately by combining Chinese prosody structure and accent, the invention provides a method and a system for constructing a joint prediction model based on Chinese prosody structure and accent.
In order to achieve the purpose, the invention provides the following scheme:
a combined prediction model construction method based on Chinese prosody structure and accent comprises the following steps:
preprocessing a plurality of training corpora to obtain preprocessed texts;
performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
and encoding and decoding the word vector feature sequence with an attention-based recurrent neural network (RNN) encoder-decoder, and establishing a joint prediction model based on Chinese prosody structure and accent for predicting the prosody structure and accent of text to be processed.
Optionally, the preprocessing of the plurality of training corpora specifically includes:
normalizing the training corpora and correcting polyphone pronunciation errors; and/or normalizing numbers.
Optionally, the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
looking up the word vector of each word in a word table according to the word segmentation text information, and determining the word vector feature sequence of the corresponding text;
wherein the initial values of the word table are obtained by training a continuous bag-of-words (CBOW) model.
Optionally, the establishing of the joint prediction model based on the Chinese prosody structure and the accent specifically includes:
a bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step;
and decoding by a decoder based on the undirected RNN to obtain a decoding state function representing a combined prediction model based on the Chinese prosody structure and accents, wherein the decoding state function is used for predicting the prosody structure and accents of the text to be processed.
Optionally, the establishing of the joint prediction model based on the Chinese prosody structure and the accent further includes:
extracting prosodic structures and accent labeling results in the training corpora as target values;
calculating the predicted value of each training corpus according to the decoding state function;
and adjusting the state parameters of the prediction model according to the target value and the predicted value.
Optionally, the bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step, specifically including:
a forward RNN reads the word vector feature sequence $x = (x_1, x_2, \ldots, x_T)$ in the forward direction and generates a forward hidden state $fh_i$ at each time step $i$, where $(fh_1, \ldots, fh_T)$ is the resulting sequence of forward hidden states, $i = 1, 2, \ldots, T$; $f$ represents a forward hidden state parameter of the prediction model;
a backward RNN reads the word vector feature sequence in the reverse direction and generates a reverse hidden state $bh_i$ at each time step, where $(bh_T, \ldots, bh_1)$ is the resulting sequence of reverse hidden states; $b$ represents a reverse hidden state parameter of the prediction model;
the hidden state $h_i$ of the encoder at each time step is determined from the forward hidden state $fh_i$ and the reverse hidden state $bh_i$, where $h_i = [fh_i, bh_i]$.
Optionally, the decoding performed by the undirected RNN-based decoder specifically includes:
obtaining the decoding state $s_{i-1}$ of the undirected RNN decoder at time step $(i-1)$ and the corresponding label $y_{i-1}$;
obtaining the hidden state $h_i$ of the bidirectional RNN encoder at the current time step $i$ and a semantic vector $c_i$;
determining, from the decoding state $s_{i-1}$, the label $y_{i-1}$, the hidden state $h_i$, and the semantic vector $c_i$, the decoding state $s_i$ of the undirected RNN decoder at the current time step $i$, where $s_i = P(s_{i-1}, y_{i-1}, h_i, c_i)$ and $P(\cdot)$ represents a relational function.
Optionally, the semantic vector $c_i$ is determined according to the following formulas:
[Formula image BDA0001347239600000031, not reproduced in this text]
[Formula image BDA0001347239600000032, not reproduced in this text]
$e_{i,k} = g(s_{i-1}, h_k)$;
where $g(\cdot)$ represents a neural network, and $i$, $j$, $k$ each represent a time step index, with $i = 1, 2, \ldots$.
Optionally, the prediction model is divided into three levels, a first level is prosodic words, a second level is prosodic phrases, and a third level is intonation phrases;
when predicting the prosodic structure and accent of a text to be processed, predicting the accent while predicting prosodic words at a first level;
and in the second level and the third level, accent prediction is removed, and the prediction result of the previous level is used as an input to the current level, concatenated with the word vector sequence of the text to be processed, to obtain the corresponding prediction result.
In order to achieve the above purpose, the invention also provides the following scheme:
a prediction model construction system based on Chinese prosody structure and accent combination, the construction system comprising:
the text preprocessing module is used for preprocessing the training corpora to obtain preprocessed texts;
the word segmentation module is used for carrying out word segmentation on the preprocessed text to obtain word segmentation text information;
the word vector determining module is used for determining a corresponding word vector characteristic sequence according to the word segmentation text information;
and the modeling module is used for coding and decoding the word vector characteristic sequence based on the coding-decoding of the RNN of the attention mechanism, and establishing a combined prediction model based on the Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
According to the embodiment of the invention, the invention discloses the following technical effects:
the invention preprocesses and segments a plurality of training corpora to obtain word segmentation text information and, from it, the word vector feature sequence of the corresponding text; it then establishes a joint prediction model based on Chinese prosody structure and accent using a recurrent neural network, taking full account of the relation between Chinese prosody structure and accent and enabling accurate prediction for text to be processed.
Drawings
FIG. 1 is a flow chart of a prediction model construction method based on the combination of Chinese prosody structure and accent;
FIG. 2 is a schematic diagram of a module structure of a prediction model construction system based on the combination of Chinese prosody structure and accent.
Description of the symbols:
1: text preprocessing module; 2: word segmentation module; 3: word vector determination module; 4: modeling module.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a method for constructing a joint prediction model based on Chinese prosody structure and accent. A plurality of training corpora are preprocessed and segmented to obtain word segmentation text information and, from it, the word vector feature sequence of the corresponding text; a joint prediction model based on Chinese prosody structure and accent is then established using a recurrent neural network, which takes full account of the relation between the two and enables accurate prediction for text to be processed.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the method of the present invention for constructing a joint prediction model based on Chinese prosody structure and accent includes:
step 100: preprocessing a plurality of training corpora to obtain preprocessed texts;
step 200: performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
step 300: determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
step 400: encoding and decoding the word vector feature sequence with an attention-based RNN (Recurrent Neural Network) encoder-decoder, and establishing a joint prediction model based on Chinese prosody structure and accent for predicting the prosody structure and accent of text to be processed.
In step 100, a large number of training corpora whose prosodic structure and accent style resemble those of the text to be processed are collected and stored in a database of about 15 GB; the more text corpora are available, the higher the prediction accuracy. The database stores the training corpora together with the corresponding prosodic structure and accent labeling results.
Further, the preprocessing of the plurality of training corpora specifically includes:
normalizing the training corpora and correcting polyphone pronunciation errors; and/or normalizing numbers.
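As a rough illustration of the number handling mentioned above, the sketch below rewrites each Arabic digit in a sentence as the corresponding Chinese numeral. The function name `normalize_digits` and the digit-by-digit reading are assumptions made for illustration only; the patent does not specify its normalization rules, and a practical front end would also have to handle quantities, dates, and polyphones.

```python
# Hypothetical digit-by-digit number normalization; not the patent's actual rules.
DIGIT_TO_HANZI = {
    "0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
    "5": "五", "6": "六", "7": "七", "8": "八", "9": "九",
}


def normalize_digits(text: str) -> str:
    """Replace every ASCII digit with its Chinese numeral; keep other characters."""
    return "".join(DIGIT_TO_HANZI.get(ch, ch) for ch in text)
```

Reading digits one at a time suits identifiers such as room or phone numbers; reading a number as a quantity would need a separate rule set.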
In step 300, the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
looking up the word vector of each word in a word table according to the word segmentation text information, and determining the word vector feature sequence of the corresponding text;
wherein the initial values of the word table are obtained by training a CBOW (continuous bag-of-words) model. During subsequent training, the word vectors continue to be updated, so that the word table is continually refined, which improves the prediction accuracy for the text to be processed.
In step 400, the establishing of the joint prediction model based on the Chinese prosody structure and the accent specifically includes:
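The table lookup in step 300 can be sketched as a dictionary lookup from segmented words to vectors. The function name `lookup_word_vectors`, the `<unk>` fallback entry for out-of-vocabulary words, and the toy vector values are all assumptions for illustration; in the described method the table would be initialized from a trained CBOW model.

```python
from typing import Dict, List

Vector = List[float]


def lookup_word_vectors(words: List[str], table: Dict[str, Vector],
                        unk: str = "<unk>") -> List[Vector]:
    """Map each segmented word to a copy of its vector, using `unk` for unknown words."""
    return [list(table.get(word, table[unk])) for word in words]


# Toy word table; the values are made up and stand in for CBOW-initialized vectors.
toy_table = {"<unk>": [0.0, 0.0], "今天": [0.1, 0.2], "天气": [0.3, 0.4]}
sequence = lookup_word_vectors(["今天", "天气", "晴朗"], toy_table)
```

Here the third word is missing from the toy table, so it falls back to the `<unk>` vector.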
step 401: a bi-directional RNN based encoder reads the word vector feature sequence from forward and reverse directions, determining the hidden state of the encoder at each time step.
Step 402: and decoding by a decoder based on the undirected RNN to obtain a decoding state function representing a combined prediction model based on the Chinese prosody structure and accents, wherein the decoding state function is used for predicting the prosody structure and accents of the text to be processed.
Further, the establishing of the joint prediction model based on the Chinese prosody structure and the accents further includes:
step 403: extracting prosodic structures and accent labeling results in the training corpora as target values;
step 404: calculating the predicted value of each training corpus according to the decoding state function;
step 405: and adjusting the state parameters of the prediction model according to the target value and the predicted value. The state parameters comprise forward hidden state parameters of the prediction model and reverse hidden state parameters of the prediction model.
In step 401, the bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step, which specifically includes:
step 4011: a forward RNN reads the word vector feature sequence $x = (x_1, x_2, \ldots, x_T)$ in the forward direction and generates a forward hidden state $fh_i$ at each time step $i$, where $(fh_1, \ldots, fh_T)$ is the resulting sequence of forward hidden states, $i = 1, 2, \ldots, T$; $f$ represents a forward hidden state parameter of the prediction model;
step 4012: a backward RNN reads the word vector feature sequence in the reverse direction and generates a reverse hidden state $bh_i$ at each time step, where $(bh_T, \ldots, bh_1)$ is the resulting sequence of reverse hidden states; $b$ represents a reverse hidden state parameter of the prediction model;
step 4013: the hidden state $h_i$ of the encoder at each time step is determined from the forward hidden state $fh_i$ and the reverse hidden state $bh_i$, where $h_i = [fh_i, bh_i]$.
Because an RNN models a sequence step by step, the hidden state of the encoder at the final time step also carries information from the entire source input sequence.
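Steps 4011 to 4013 can be sketched in plain Python as two passes of a simple recurrent cell followed by per-step concatenation. The tanh cell, the weight layout, and the names `rnn_pass` and `encode_bidirectional` are assumptions for illustration; the patent does not state which recurrent unit is used.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rnn_pass(xs: List[Vector], w_in: Matrix, w_rec: Matrix) -> List[Vector]:
    """Run a plain tanh RNN over xs and return one hidden state per time step."""
    h = [0.0] * len(w_rec)  # initial hidden state of zeros
    states = []
    for x in xs:
        a = matvec(w_in, x)
        r = matvec(w_rec, h)
        h = [math.tanh(ai + ri) for ai, ri in zip(a, r)]
        states.append(h)
    return states


def encode_bidirectional(xs: List[Vector],
                         fwd: Tuple[Matrix, Matrix],
                         bwd: Tuple[Matrix, Matrix]) -> List[Vector]:
    """Return h_i = [fh_i, bh_i] for every time step i."""
    fh = rnn_pass(xs, *fwd)
    # Read the sequence backwards, then flip so bh[i] lines up with xs[i].
    bh = rnn_pass(xs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(fh, bh)]
```

Each returned state has twice the hidden size, since the forward and reverse states are concatenated rather than summed.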
In step 402, the decoding performed by the undirected RNN-based decoder specifically includes:
step 4021: obtaining the decoding state $s_{i-1}$ of the undirected RNN decoder at time step $(i-1)$ and the corresponding label $y_{i-1}$;
step 4022: obtaining the hidden state $h_i$ of the bidirectional RNN encoder at the current time step $i$ and a semantic vector $c_i$.
Further, the semantic vector $c_i$ may be determined according to equations (1)-(3):
[Equation (1): formula image BDA0001347239600000071, not reproduced in this text]
[Equation (2): formula image BDA0001347239600000072, not reproduced in this text]
$e_{i,k} = g(s_{i-1}, h_k)$ (3);
where $g(\cdot)$ represents a neural network, and $i$, $j$, $k$ each represent a time step index, with $i = 1, 2, \ldots$.
step 4023: determining, from the decoding state $s_{i-1}$, the label $y_{i-1}$, the hidden state $h_i$, and the semantic vector $c_i$, the decoding state $s_i$ corresponding to the current time step $i$, where $s_i = P(s_{i-1}, y_{i-1}, h_i, c_i)$ and $P(\cdot)$ represents a relational function.
Relative to the bidirectional RNN encoder, the undirected RNN decoder has no directionality in the decoding process. When decoding, the undirected RNN-based decoder relies not only on the hidden state $h_i$ of the bidirectional RNN-based encoder at each time step but also on an attention mechanism, introduced through the semantic vector $c_i$. With the attention mechanism, the state $s_i$ of the decoder at decoding time step $i$ is jointly determined by the decoder state $s_{i-1}$ at time step $(i-1)$, the corresponding label $y_{i-1}$, the encoder hidden state $h_i$ aligned with the current time step, and the semantic vector $c_i$.
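A single decoder update $s_i = P(s_{i-1}, y_{i-1}, h_i, c_i)$ can be sketched as below. The patent leaves the relational function $P$ unspecified, so the choice here, a tanh over a linear map of the concatenated inputs, is purely an assumption, as are the name `decoder_step` and the representation of the label $y_{i-1}$ as a vector.

```python
import math
from typing import List

Vector = List[float]


def decoder_step(s_prev: Vector, y_prev: Vector, h_i: Vector, c_i: Vector,
                 weights: List[Vector]) -> Vector:
    """Return s_i = tanh(W [s_prev; y_prev; h_i; c_i]), one assumed form of P()."""
    joint = s_prev + y_prev + h_i + c_i  # concatenate the four inputs
    if any(len(row) != len(joint) for row in weights):
        raise ValueError("each weight row must match the concatenated input length")
    return [math.tanh(sum(w * v for w, v in zip(row, joint))) for row in weights]
```

The number of rows in `weights` sets the size of the new decoder state, so it should equal the length of `s_prev` when the step is applied repeatedly.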
The semantic vector $c_i$ is a weighted average of the hidden state sequence $[h_1, \ldots, h_T]$ of the bidirectional RNN encoder, as given by equation (1), and provides additional context information to the undirected RNN decoder.
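The weighted average described here can be sketched as follows. The scores $e_{i,k}$ are taken as given (in the patent they come from the network $g$), and normalizing them with a softmax is an assumption, since the formula for the weights is not reproduced in this text. The names `softmax` and `context_vector` are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw scores into non-negative weights that sum to one."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(scores: Vector, hidden_states: List[Vector]) -> Vector:
    """Weighted average of encoder hidden states, weighted by softmax(scores)."""
    weights = softmax(scores)
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states))
            for d in range(dim)]
```

With equal scores every encoder state contributes equally; a much larger score for one position makes the context vector approach that position's hidden state.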
The invention introduces a multi-task learning mechanism for joint prediction modeling of prosodic structure and stress. Specifically, the prediction model is divided into three levels: the first level is prosodic words (PW), the second level is prosodic phrases (PPH), and the third level is intonation phrases (IPH). When predicting the prosodic structure and accent of text to be processed, accent is predicted together with prosodic words at the first level. At the second and third levels, accent prediction is removed, and the prediction result of the previous level is used as an input to the current level, concatenated with the word vector sequence of the text to be processed, to obtain the corresponding prediction result.
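The concatenation used at the second and third levels can be sketched as appending an encoding of the previous level's label to each word vector. Encoding the label as a one-hot vector and the name `build_level_input` are assumptions for illustration; the patent only states that the previous level's prediction is spliced with the word vector sequence.

```python
from typing import List

Vector = List[float]


def build_level_input(word_vectors: List[Vector], prev_labels: List[int],
                      num_labels: int) -> List[Vector]:
    """Append a one-hot encoding of the previous level's label to each word vector."""
    if len(word_vectors) != len(prev_labels):
        raise ValueError("need exactly one previous-level label per word")
    inputs = []
    for vec, label in zip(word_vectors, prev_labels):
        one_hot = [0.0] * num_labels
        one_hot[label] = 1.0
        inputs.append(list(vec) + one_hot)
    return inputs
```

For example, with binary boundary labels from the prosodic word level, each two-dimensional word vector becomes a four-dimensional input to the prosodic phrase level.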
While prosodic words are predicted at the first level of the prosodic structure, a second task is added (that is, a second decoder is added) to predict stress at the same time. The encoder and the word vector layer are shared between the two tasks. In model training, the loss function of the whole model is the sum of the errors of the two tasks, that is, of the two decoders, and this error is back-propagated to adjust the model parameters. The accent prediction task is removed when predicting the other two levels of prosodic structure (prosodic phrases and intonation phrases). There, the prediction result of the previous level of the prosodic structure is used as an input to the current level: it is concatenated with the sequence produced by the word vector layer and then fed into the encoder.
Further, the joint prosody structure and accent prediction module outputs the prosody structure and accent prediction results for the corresponding text using the attention-based RNN encoder-decoder joint prediction model.
The method initializes the word vectors with a pre-trained word vector model and jointly models Chinese prosody structure and accent using an attention model and multi-task learning, in order to predict the prosody structure and accent information of text to be processed.
The invention improves prosodic structure and accent modeling and prediction mainly at two levels, the feature level and the model level. At the feature level, a word vector model specific to prosodic structure and accent is established, so that text features are described more accurately. At the model level, Chinese prosody structure and accent are modeled jointly using an attention model and multi-task learning, which greatly improves the performance of Chinese prosody structure and stress prediction. The prediction results guide the speech synthesis back end, improving the naturalness and expressiveness of the synthesized speech.
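The summed loss of the two first-level tasks can be sketched as below. The patent speaks only of the "errors" of the two decoders, so using cross-entropy for each task is an assumption, as are the names `cross_entropy` and `joint_loss`; the inputs are per-step probability distributions and integer target labels.

```python
import math
from typing import List


def cross_entropy(probs: List[List[float]], targets: List[int]) -> float:
    """Mean negative log-likelihood of the target labels."""
    return -sum(math.log(p[t]) for p, t in zip(probs, targets)) / len(targets)


def joint_loss(pw_probs: List[List[float]], pw_targets: List[int],
               accent_probs: List[List[float]], accent_targets: List[int]) -> float:
    """Total loss: prosodic word decoder error plus accent decoder error."""
    return (cross_entropy(pw_probs, pw_targets)
            + cross_entropy(accent_probs, accent_targets))
```

Because the encoder and word vector layer are shared, gradients of this single sum would update them using signals from both tasks.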
In addition, the invention also provides a system for constructing a joint prediction model based on Chinese prosody structure and accent. As shown in FIG. 2, the system includes a text preprocessing module 1, a word segmentation module 2, a word vector determination module 3, and a modeling module 4.
The text preprocessing module 1 is configured to preprocess a plurality of training corpora to obtain a preprocessed text; the word segmentation module 2 is used for performing word segmentation processing on the preprocessed text to obtain word segmentation text information; the word vector determining module 3 is configured to determine a corresponding word vector feature sequence according to the word segmentation text information; the modeling module 4 is used for coding and decoding the word vector feature sequence based on the coding-decoding of the RNN of the attention mechanism, and establishing a combined prediction model based on a Chinese prosody structure and accents for predicting the prosody structure and accents of the text to be processed.
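The way the four modules feed into one another can be sketched as a simple function pipeline. The name `build_pipeline` and the stand-in stages below (whitespace splitting, length-based vectors, a threshold "model") are placeholders invented for illustration and bear no relation to the actual modules' internals.

```python
from typing import Callable, List

Vector = List[float]


def build_pipeline(preprocess: Callable[[str], str],
                   segment: Callable[[str], List[str]],
                   vectorize: Callable[[List[str]], List[Vector]],
                   model: Callable[[List[Vector]], List[int]]) -> Callable[[str], List[int]]:
    """Chain modules 1 to 4: preprocess, segment, vectorize, then predict."""
    def run(text: str) -> List[int]:
        return model(vectorize(segment(preprocess(text))))
    return run


# Stand-in stages, for demonstration only.
predict = build_pipeline(
    preprocess=str.strip,
    segment=str.split,
    vectorize=lambda words: [[float(len(w))] for w in words],
    model=lambda vectors: [int(v[0] > 2.0) for v in vectors],
)
labels = predict("  ab cde  ")
```

Keeping each stage as a separate callable mirrors the module boundaries in FIG. 2 and lets any one stage be replaced without touching the others.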
Compared with the prior art, the prediction model construction system based on the combination of the Chinese prosody structure and the accent has the same beneficial effects as the prediction model construction method based on the combination of the Chinese prosody structure and the accent, and is not repeated herein.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (10)

1. A combined prediction model construction method based on Chinese prosody structure and accent is characterized by comprising the following steps:
preprocessing a plurality of training corpora to obtain preprocessed texts;
performing word segmentation processing on the preprocessed text to obtain word segmentation text information;
determining a word vector characteristic sequence of a corresponding text according to the word segmentation text information;
and encoding and decoding the word vector feature sequence with an attention-based recurrent neural network (RNN) encoder-decoder, and establishing a joint prediction model based on Chinese prosody structure and accent for predicting the prosody structure and accent of text to be processed.
2. The method for constructing a combined prediction model based on Chinese prosody structure and accent according to claim 1, wherein the preprocessing of the plurality of training corpora specifically includes:
normalizing the training corpora and correcting polyphone pronunciation errors; and/or normalizing numbers.
3. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to claim 1, wherein the determining a word vector feature sequence of a corresponding text according to the word segmentation text information specifically includes:
looking up the word vector of each word in a word table according to the word segmentation text information, and determining the word vector feature sequence of the corresponding text;
wherein the initial values of the word table are obtained by training a continuous bag-of-words (CBOW) model.
4. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to claim 1, wherein the establishing of the combined prediction model based on Chinese prosodic structure and accent specifically includes:
a bidirectional RNN-based encoder reads the word vector feature sequence from the forward direction and the reverse direction, and determines the hidden state of the encoder at each time step;
and decoding by a decoder based on the undirected RNN to obtain a decoding state function representing a combined prediction model based on the Chinese prosody structure and accents, wherein the decoding state function is used for predicting the prosody structure and accents of the text to be processed.
5. The method for constructing a combined prediction model based on Chinese prosodic structure and accent according to claim 4, wherein the establishing of the combined prediction model based on Chinese prosodic structure and accent further comprises:
extracting prosodic structures and accent labeling results in the training corpora as target values;
calculating the predicted value of each training corpus according to the decoding state function;
and adjusting the state parameters of the prediction model according to the target value and the predicted value.
6. The method as claimed in claim 4, wherein the bi-directional RNN based encoder reads the word vector feature sequence from forward and backward directions to determine the hidden state of the encoder at each time step, and specifically comprises:
a forward RNN reads the word vector feature sequence $x = (x_1, x_2, \ldots, x_T)$ in the forward direction and generates a forward hidden state $fh_i$ at each time step $i$, where $(fh_1, \ldots, fh_T)$ is the resulting sequence of forward hidden states, $i = 1, 2, \ldots, T$; $f$ represents a forward hidden state parameter of the prediction model;
a backward RNN reads the word vector feature sequence in the reverse direction and generates a reverse hidden state $bh_i$ at each time step, where $(bh_T, \ldots, bh_1)$ is the resulting sequence of reverse hidden states; $b$ represents a reverse hidden state parameter of the prediction model;
the hidden state $h_i$ of the encoder at each time step is determined from the forward hidden state $fh_i$ and the reverse hidden state $bh_i$, where $h_i = [fh_i, bh_i]$.
7. The method for constructing a joint prediction model based on Chinese prosodic structure and accent according to claim 6, wherein the decoding performed by the undirected RNN-based decoder specifically comprises:
obtaining the decoding state $s_{i-1}$ of the undirected RNN decoder at time step $(i-1)$ and the corresponding label $y_{i-1}$;
obtaining the hidden state $h_i$ of the bidirectional RNN encoder at the current time step $i$ and a semantic vector $c_i$;
determining, from the decoding state $s_{i-1}$, the label $y_{i-1}$, the hidden state $h_i$, and the semantic vector $c_i$, the decoding state $s_i$ of the undirected RNN decoder at the current time step $i$, where $s_i = P(s_{i-1}, y_{i-1}, h_i, c_i)$, $P(\cdot)$ represents a relational function, and $s_i$ denotes the decoding state corresponding to the current time step $i$.
8. The method of claim 7, wherein the semantic vector $c_i$ is determined according to the following formulas:
[Formula image FDA0002591799450000021, not reproduced in this text]
[Formula image FDA0002591799450000022, not reproduced in this text]
$e_{i,k} = g(s_{i-1}, h_k)$;
where $g(\cdot)$ represents a neural network, and $i$, $j$, $k$ each represent a time step index, with $i = 1, 2, \ldots$.
9. The method for constructing a combined prediction model based on Chinese prosodic structures and accents according to any one of claims 1-8, wherein the prediction model is divided into three levels: the first level is prosodic words, the second level is prosodic phrases, and the third level is intonation phrases;
when predicting the prosodic structure and accent of a text to be processed, accent is predicted jointly with prosodic words at the first level;
at the second and third levels, accent prediction is omitted, and the prediction result of the previous level is used as input to the current level, being concatenated with the word vector sequence of the text to be processed to obtain the corresponding prediction result.
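The three-level cascade of claim 9 can be sketched as follows. The level models here are toy threshold rules standing in for the trained encoder-decoder networks; `concat_features`, `predict_hierarchy`, `toy_level1`, and `toy_upper` are hypothetical names, and the feature values are made up for illustration.

```python
def concat_features(word_vectors, prev_labels):
    """Append the previous level's predicted label to each word vector."""
    return [vec + [float(lab)] for vec, lab in zip(word_vectors, prev_labels)]

def predict_hierarchy(word_vectors, level1, level2, level3):
    """Run the three levels in order; only level 1 also predicts accent."""
    pw, accent = level1(word_vectors)                    # prosodic words + accent
    pph = level2(concat_features(word_vectors, pw))      # prosodic phrases
    iph = level3(concat_features(word_vectors, pph))     # intonation phrases
    return {"prosodic_word": pw, "accent": accent,
            "prosodic_phrase": pph, "intonation_phrase": iph}

def toy_level1(feats):
    """Toy stand-in for the level-1 model: thresholds on two features."""
    pw = [1 if f[0] > 0 else 0 for f in feats]
    accent = [1 if f[1] > 0.5 else 0 for f in feats]
    return pw, accent

def toy_upper(feats):
    """Toy stand-in for levels 2 and 3: keep a boundary only where the
    previous level placed one (last feature) and feature 1 exceeds 0.5."""
    return [1 if f[-1] == 1.0 and f[1] > 0.5 else 0 for f in feats]

vectors = [[0.2, 0.9], [-0.1, 0.8], [0.7, 0.1]]
result = predict_hierarchy(vectors, toy_level1, toy_upper, toy_upper)
```

Passing the lower-level labels upward lets each higher level condition on boundaries already found, which is why accent is predicted only once, at the first level.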
10. A combined prediction model construction system based on Chinese prosody structure and accent, characterized by comprising:
a text preprocessing module, configured to preprocess the training corpus to obtain preprocessed text;
a word segmentation module, configured to perform word segmentation on the preprocessed text to obtain word segmentation text information;
a word vector determining module, configured to determine a corresponding word vector characteristic sequence from the word segmentation text information; and
a modeling module, configured to encode and decode the word vector characteristic sequence using attention-based RNN encoding-decoding, and to establish a combined prediction model based on the Chinese prosody structure and accents for predicting the prosody structure and accents of a text to be processed.
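The first three modules of claim 10 form a simple data pipeline, sketched below under strong assumptions: the input is taken to be already separated by spaces (a real system would need a Chinese word segmenter), and the embedding table is a toy dictionary. All function names and values are hypothetical; the modeling module is omitted because it corresponds to the encoder-decoder described in the method claims.

```python
import re

def preprocess(text):
    """Text preprocessing module (toy): collapse whitespace and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()

def segment(text):
    """Word segmentation module (toy): split on spaces; assumes pre-spaced input."""
    return text.split(" ") if text else []

def to_word_vectors(words, table, dim=2):
    """Word vector determining module: look up each word, zeros if unknown."""
    return [table.get(w, [0.0] * dim) for w in words]

def build_features(corpus, table):
    """Chain the three modules for every sentence in the training corpus."""
    return [to_word_vectors(segment(preprocess(s)), table) for s in corpus]

# Toy embedding table and corpus (assumed values).
table = {"今天": [0.1, 0.2], "天气": [0.3, 0.4]}
features = build_features(["  今天   天气 好 "], table)
```

The resulting per-sentence lists of vectors are what the modeling module would consume as the word vector characteristic sequence.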
CN201710561567.XA 2017-07-11 2017-07-11 Combined prediction model construction method and system based on Chinese prosody structure and accents Active CN107464559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710561567.XA CN107464559B (en) 2017-07-11 2017-07-11 Combined prediction model construction method and system based on Chinese prosody structure and accents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710561567.XA CN107464559B (en) 2017-07-11 2017-07-11 Combined prediction model construction method and system based on Chinese prosody structure and accents

Publications (2)

Publication Number Publication Date
CN107464559A CN107464559A (en) 2017-12-12
CN107464559B CN107464559B (en) 2020-12-15

Family

ID=60543891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710561567.XA Active CN107464559B (en) 2017-07-11 2017-07-11 Combined prediction model construction method and system based on Chinese prosody structure and accents

Country Status (1)

Country Link
CN (1) CN107464559B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108417210B (en) * 2018-01-10 2020-06-26 苏州思必驰信息科技有限公司 Word embedding language model training method, word recognition method and system
CN108231062B (en) * 2018-01-12 2020-12-22 科大讯飞股份有限公司 Voice translation method and device
CN108417202B (en) * 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Voice recognition method and system
CN110321913B (en) * 2018-03-30 2023-07-25 杭州海康威视数字技术股份有限公司 Text recognition method and device
US10923107B2 (en) * 2018-05-11 2021-02-16 Google Llc Clockwork hierarchical variational encoder
CN108920622B (en) * 2018-06-29 2021-07-20 北京奇艺世纪科技有限公司 Training method, training device and recognition device for intention recognition
CN108897894A (en) * 2018-07-12 2018-11-27 电子科技大学 A kind of problem generation method
CN109271643A (en) * 2018-08-08 2019-01-25 北京捷通华声科技股份有限公司 A kind of training method of translation model, interpretation method and device
KR20200020545A (en) * 2018-08-17 2020-02-26 삼성전자주식회사 Electronic apparatus and controlling method thereof
CN109299273B (en) * 2018-11-02 2020-06-23 广州语义科技有限公司 Multi-source multi-label text classification method and system based on improved seq2seq model
CN109615538A (en) * 2018-12-13 2019-04-12 平安医疗健康管理股份有限公司 Social security violation detection method, device, equipment and computer storage medium
CN109545186B (en) * 2018-12-16 2022-05-27 魔门塔(苏州)科技有限公司 Speech recognition training system and method
CN111354333B (en) * 2018-12-21 2023-11-10 中国科学院声学研究所 Self-attention-based Chinese prosody level prediction method and system
CN109670185B (en) * 2018-12-27 2023-06-23 北京百度网讯科技有限公司 Text generation method and device based on artificial intelligence
CN110310619A (en) * 2019-05-16 2019-10-08 平安科技(深圳)有限公司 Polyphone prediction technique, device, equipment and computer readable storage medium
CN110211568A (en) * 2019-06-03 2019-09-06 北京大牛儿科技发展有限公司 A kind of audio recognition method and device
CN110427608B (en) * 2019-06-24 2021-06-08 浙江大学 Chinese word vector representation learning method introducing layered shape-sound characteristics
CN110277085B (en) * 2019-06-25 2021-08-24 腾讯科技(深圳)有限公司 Method and device for determining polyphone pronunciation
CN110457661B (en) * 2019-08-16 2023-06-20 腾讯科技(深圳)有限公司 Natural language generation method, device, equipment and storage medium
CN111639152B (en) * 2019-08-29 2021-04-13 上海卓繁信息技术股份有限公司 Intention recognition method
CN110534087B (en) * 2019-09-04 2022-02-15 清华大学深圳研究生院 Text prosody hierarchical structure prediction method, device, equipment and storage medium
CN110782870B (en) * 2019-09-06 2023-06-16 腾讯科技(深圳)有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN111061868B (en) * 2019-11-05 2023-05-23 百度在线网络技术(北京)有限公司 Reading method prediction model acquisition and reading method prediction method, device and storage medium
CN112783334A (en) * 2019-11-08 2021-05-11 阿里巴巴集团控股有限公司 Text generation method and device, electronic equipment and computer-readable storage medium
CN110970031B (en) * 2019-12-16 2022-06-24 思必驰科技股份有限公司 Speech recognition system and method
CN113302683B (en) * 2019-12-24 2023-08-04 深圳市优必选科技股份有限公司 Multi-tone word prediction method, disambiguation method, device, apparatus, and computer-readable storage medium
CN113129864A (en) * 2019-12-31 2021-07-16 科大讯飞股份有限公司 Voice feature prediction method, device, equipment and readable storage medium
CN111724765B (en) * 2020-06-30 2023-07-25 度小满科技(北京)有限公司 Text-to-speech method and device and computer equipment
CN112309367B (en) * 2020-11-03 2022-12-06 北京有竹居网络技术有限公司 Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112364653A (en) * 2020-11-09 2021-02-12 北京有竹居网络技术有限公司 Text analysis method, apparatus, server and medium for speech synthesis
CN113808579B (en) * 2021-11-22 2022-03-08 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium
CN114333760B (en) * 2021-12-31 2023-06-02 科大讯飞股份有限公司 Construction method of information prediction module, information prediction method and related equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7136816B1 (en) * 2002-04-05 2006-11-14 At&T Corp. System and method for predicting prosodic parameters
CN101650942B (en) * 2009-08-26 2012-06-27 北京邮电大学 Prosodic structure forming method based on prosodic phrase
CN102254554B (en) * 2011-07-18 2012-08-08 中国科学院自动化研究所 Method for carrying out hierarchical modeling and predicating on mandarin accent
JP6230606B2 (en) * 2012-08-30 2017-11-15 インタラクティブ・インテリジェンス・インコーポレイテッド Method and system for predicting speech recognition performance using accuracy scores

Also Published As

Publication number Publication date
CN107464559A (en) 2017-12-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant