CN1670820A

CN1670820A - Method for converting words to phonetic symbols by rescoring error-prone graphemes to improve accuracy

Info

Publication number: CN1670820A
Application number: CNA2004100287756A
Authority: CN
Inventors: 林一中; 洪鹏翔; 王稔志
Original assignee: Industrial Technology Research Institute ITRI
Current assignee: Industrial Technology Research Institute ITRI
Priority date: 2004-03-17
Filing date: 2004-03-17
Publication date: 2005-09-21
Anticipated expiration: 2024-03-17
Also published as: CN1315108C

Abstract

The invention relates to a letter-to-phoneme conversion method that rescores error-prone graphemes to improve accuracy. An input word is segmented into graphemes and tagged with phonemes to produce at least one grapheme-phoneme pair sequence, in which the graphemes and phonemes are aligned. For each error-prone grapheme, features are then chosen from its preceding and following context, and the association between those features and the phoneme paired with that grapheme is evaluated in order to rescore the sequence.

Description

Letter-to-phoneme conversion method that rescores error-prone graphemes to improve accuracy

Technical field

The invention relates to a letter-to-phoneme conversion method, and in particular to one that rescores graphemes whose phonemes are error-prone in order to improve accuracy.

Background technology

Letter-to-phoneme conversion turns input text into phonetic symbols, and it is commonly used in systems for speech synthesis and speech recognition. In principle, the best way to obtain a pronunciation is to look it up in a dictionary. A dictionary cannot contain every word and its pronunciation, however, so when a speech system meets a new word that is not in the dictionary, a letter-to-phoneme technique is needed to generate the pronunciation. In speech synthesis, the technique supplies pronunciations for new words, so that the system does not fail to produce speech output for lack of a pronunciation. In speech recognition, new words are often added to enlarge the training corpus and improve recognition accuracy; letter-to-phoneme conversion can handle the new words that lack pronunciations and so makes the corpus easier to enlarge. Speech is thus an important medium for human-machine interfaces, letter-to-phoneme conversion plays an important role in both synthesis and recognition, and a good letter-to-phoneme technique is indispensable for a speech system to perform at its best.

Traditionally, letter-to-phoneme conversion has been rule-based, relying on manually written rules, and this approach requires language experts to write a large number of rules. However many rules there are, cases still arise that they cannot handle, and there is no guarantee that a newly added rule will not conflict with the existing ones. The more rules there are, the higher the cost of revising and maintaining them. The rules also differ from language to language, so extending an application to another language requires a great deal of time and labor to write new rules. A rule-based letter-to-phone system therefore lacks reusability and portability, and its performance is hard to improve.

Because of these shortcomings, more and more letter-to-phone systems adopt data-driven methods, including pronunciation by analogy (PbA), neural networks, decision trees, the joint N-gram model, and automatic rule learning. All of these methods need training data, which is usually a dictionary containing words and their corresponding phonetic transcriptions. The advantage of a data-driven method is that it needs little manual intervention or expert knowledge and is not restricted to a particular language. It is therefore superior to rule-based methods in system construction, future maintenance, and reuse. Among these methods, PbA and the joint N-gram model are the two most widely used.

PbA decomposes the input word into graphemes of different lengths, compares them with the words in the dictionary to find the most representative phoneme for each grapheme, and builds a graph from the graphemes and phonemes; the best path through the graph gives the pronunciation of the word. The joint N-gram model first decomposes words and their phonetic transcriptions into grapheme-phoneme pairs and uses these pairs to build a probability model. An input word is then likewise decomposed into grapheme-phoneme pairs, and the best phoneme sequence is found according to the previously built model. The joint N-gram model currently has the higher accuracy, but its computation is quite time-consuming. PbA is faster but less accurate than the joint N-gram model. The existing letter-to-phoneme methods therefore still have shortcomings and need to be improved.

Summary of the invention

The main object of the invention is to provide a letter-to-phoneme conversion method that rescores the graphemes whose phonemes are error-prone in order to improve accuracy, and that obtains conversion results better than the prior art within a short computation time.

To achieve this object, the letter-to-phoneme conversion method of the invention mainly comprises a grapheme-phoneme pair sequence generation step and a rescoring step. In the generation step, grapheme segmentation and phoneme tagging are performed on an input word to produce at least one grapheme-phoneme pair sequence, each of which comprises at least one grapheme and its corresponding phoneme, and a score is computed for each sequence. In the rescoring step, from among at least one of the highest-scoring grapheme-phoneme pair sequences, each sequence that contains a preset error-prone grapheme is rescored: for each error-prone grapheme, features are chosen from its preceding and following context, and the association between these features and the phoneme corresponding to the error-prone grapheme is computed. The grapheme-phoneme pair sequence with the highest score is taken as the conversion result.

Description of drawings

Fig. 1 is a flowchart of the letter-to-phoneme conversion method of the invention, which rescores error-prone graphemes to improve accuracy;

Fig. 2 is a graph built according to the steps of the method of the invention;

Fig. 3 shows the accuracy of the phonetic symbols obtained for the graphemes by the method of the invention.

Embodiment

To make the technical content of the invention easier to understand, a preferred embodiment is described below.

For the letter-to-phoneme conversion method of the invention, which rescores error-prone graphemes to improve accuracy, refer first to the flow shown in Fig. 1. The method uses a grapheme set 11 and a grapheme-to-phoneme relation table 12 to carry out the conversion. It first performs grapheme segmentation on the input word (step S1) to obtain at least one grapheme sequence, where the input word is written in a Roman alphabet or a similar script, for example English, German, or French. Next, phoneme tagging is performed on the higher-scoring grapheme sequences (step S2) to obtain phoneme sequences and hence grapheme-phoneme pair sequences. Finally, the graphemes whose phonemes are error-prone are rescored using additional features (step S3).

In step S1, the input word is segmented with an N-gram model, according to the graphemes in grapheme set 11, to obtain at least one grapheme sequence G = (g_1 g_2 ... g_i ... g_n), where g_i is a grapheme. For example, if the input word is feasible and grapheme set 11 is {a, b, e, ea, f, i, s, le, ...}, the possible grapheme sequences are f-e-a-s-i-b-le and f-ea-s-i-b-le. For each grapheme sequence, a score S_G is computed as follows:

S_{G} = \sum_{i = 1}^{n} \log P(g_{i} \mid g_{i - N + 1}^{i - 1}),

where n is the number of graphemes in the grapheme sequence and N is the order of the N-gram model, meaning that the score of g_i is determined from the graphemes that precede it.
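The segmentation and scoring of step S1 can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function names, the `<s>` start padding, the probability floor, and the bigram values are all hypothetical and do not come from the patent; only the grapheme set and the word feasible are taken from the example above.

```python
import math

def segmentations(word, graphemes):
    """Return every way to split `word` into units drawn from `graphemes`."""
    if not word:
        return [[]]
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in graphemes:
            for tail in segmentations(word[end:], graphemes):
                results.append([head] + tail)
    return results

def sequence_score(seq, cond_prob, order=2, floor=1e-6):
    """S_G: sum of log P(g_i | the order-1 graphemes before it)."""
    padded = ["<s>"] * (order - 1) + list(seq)  # assumed start-of-word padding
    total = 0.0
    for i in range(order - 1, len(padded)):
        history = tuple(padded[i - order + 1:i])
        # Unseen events fall back to a small floor (an assumption, not from the patent).
        total += math.log(cond_prob.get((history, padded[i]), floor))
    return total

GRAPHEMES = {"a", "b", "e", "ea", "f", "i", "s", "le"}
# Hypothetical bigram probabilities, for illustration only.
BIGRAM = {(("f",), "ea"): 0.4, (("f",), "e"): 0.2, (("e",), "a"): 0.1}

candidates = segmentations("feasible", GRAPHEMES)
ranked = sorted(candidates, key=lambda s: sequence_score(s, BIGRAM), reverse=True)
```

With these made-up probabilities the two candidate segmentations are ranked by S_G, and the top few would be passed on to phoneme tagging.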

In step S2, phoneme tagging is performed, according to grapheme-to-phoneme relation table 12, on at least one of the highest-scoring grapheme sequences produced by step S1. In relation table 12 each grapheme corresponds to more than two phonemes on average, and some correspond to more than ten, so each grapheme sequence can be tagged with at least one phoneme sequence P = (f_1 f_2 ... f_i ... f_n), where f_i is a phoneme. To find the best phoneme sequence, a score S_P is first computed for each phoneme sequence as follows:

S_{P} = \sum_{i = 1}^{n} \log P(f_{i} \mid g_{i - R}^{i - L}),

where L and R denote the range of context around grapheme g_i, n is the number of phonemes in the phoneme sequence, and g_i is the grapheme corresponding to f_i. For the phoneme sequences of each grapheme sequence, at least one of the highest-scoring phoneme sequences is kept, and grapheme-phoneme pair sequences are produced from them. Steps S1 and S2 can be pictured as building a graph. As shown in Fig. 2, step S1 segments the input word W into several grapheme sequences G1 to G5, from which the higher-scoring sequences G1 to G3 are taken. Step S2 tags each selected sequence G1 to G3 with several phoneme sequences (P1 to P3, P1 to P5, and P1 to P4, respectively) and keeps the n highest-scoring ones for each (n = 3 in this embodiment), namely P1 to P3 in each case. This produces the grapheme-phoneme pair sequences G1P1, G1P2, G1P3, G2P1, G2P2, G2P3, G3P1, G3P2, and G3P3, which form a graph built from pairings of grapheme sequences and phoneme sequences. Because the grapheme sequence is already fixed in step S2, the graph is built over phonemes only; it is markedly smaller than the graph that the joint N-gram model builds over grapheme-phoneme pairs, which saves computation time.
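A minimal Python sketch of the phoneme tagging in step S2 follows. It assumes that the context of g_i covers the graphemes from position i-L to i+R (clipped at the word boundaries), and it enumerates all phoneme sequences exhaustively, which is only practical for tiny examples. The relation table, the probabilities, the floor value, and the function name are hypothetical.

```python
import heapq
import itertools
import math

def tag_phonemes(graphemes, table, prob, left=1, right=2, top_n=3, floor=1e-6):
    """Score every phoneme sequence for a fixed grapheme sequence; keep the top_n."""
    options = [table[g] for g in graphemes]  # candidate phonemes per grapheme
    scored = []
    for phonemes in itertools.product(*options):
        s_p = 0.0
        for i, f in enumerate(phonemes):
            # Assumed context window: graphemes i-left .. i+right (clipped).
            context = tuple(graphemes[max(0, i - left):i + right + 1])
            s_p += math.log(prob.get((f, context), floor))
        scored.append((s_p, list(phonemes)))
    return heapq.nlargest(top_n, scored, key=lambda item: item[0])

# Hypothetical relation table and probabilities, for illustration only.
TABLE = {"f": ["F"], "ea": ["IY", "EH"], "s": ["Z", "S"]}
PROB = {
    ("IY", ("f", "ea", "s")): 0.7,
    ("EH", ("f", "ea", "s")): 0.2,
    ("Z", ("ea", "s")): 0.6,
    ("S", ("ea", "s")): 0.3,
}
best = tag_phonemes(["f", "ea", "s"], TABLE, PROB)
```

Because the grapheme sequence is fixed, only the phoneme choices vary, which is the property the description credits for the smaller graph.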

Each grapheme-phoneme pair sequence in the graph is a possible conversion result. Its score is obtained by weighting the grapheme sequence score and the phoneme sequence score, which gives the letter-to-phoneme score S_{G2P}:

S_{G2P} = w_{G} S_{G} + w_{P} S_{P},

where w_G and w_P are the weights of the grapheme sequence score S_G and the phoneme sequence score S_P, respectively.

Taking the grapheme-phoneme pair sequence with the highest score as the conversion result gives, with L = 1 and R = 2, a word accuracy of 59.71%, which exceeds the 58.54% of PbA. Further analysis shows, however, that in the grapheme-phoneme pair sequences produced by steps S1 and S2, some graphemes correspond to many phonemes, so using only the neighboring graphemes as features does not provide enough information to determine the correct pronunciation. The problem is most severe for the vowels (a, e, i, o, u): each vowel corresponds to 10.6 phonemes on average, which can cause misjudgments and in turn lower the word accuracy.

To determine the correct phoneme of each vowel, the rescoring of step S3 takes the several highest-scoring grapheme-phoneme pair sequences produced by steps S1 and S2, checks the graphemes whose phonemes are easily misjudged against additional features, and, after weighting, takes the grapheme-phoneme pair sequence with the highest score as the answer.

In step S3, among the n grapheme-phoneme pair sequences with the highest scores produced by step S2 (n is a positive integer), each sequence that contains an error-prone grapheme is rescored. For each error-prone grapheme, features are chosen from its preceding and following context (phonemes and grapheme-phoneme pairs as well as graphemes) to obtain the score needed in S3. In this embodiment, mutual information (MI) is used to measure the association between these features and the phoneme paired with the error-prone grapheme; the mutual information expresses how likely the features and that phoneme are to occur together. The grapheme-phoneme pair sequence is rescored as follows:

S_{R} = \sum_{i:\, g_{i} \in E} \sum_{j = 1}^{|X(i)|} w_{j} \log \frac{P(x_{j}, f_{i})}{P(x_{j}) P(f_{i})} \times \frac{1}{\sum_{i:\, g_{i} \in E} 1}

where w_j is a weight, x_j is a feature in X(i), and E is the set of error-prone graphemes in the grapheme-phoneme pair sequences produced by step S2; in this embodiment only the vowels are rescored. X(i) is the set of chosen features, expressed mathematically as:

where τ_i ≡ g_i f_i; L and R denote the range of context around grapheme g_i; N is the number of selected higher-scoring grapheme-phoneme pair sequences; y is g, f, or τ; and l and r denote the positions at which y occurs, which must lie between i-L and i+R.
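The rescoring score S_R can be sketched in Python as a weighted pointwise mutual information averaged over the error-prone graphemes. This is a simplified illustration, not the embodiment itself: it uses only neighboring graphemes as features (the text also allows phonemes and grapheme-phoneme pairs), a single shared weight instead of per-feature weights w_j, and entirely hypothetical probability tables and function names.

```python
import math

def context_features(pairs, i, left=1, right=1):
    """Neighbouring graphemes of position i, tagged with their offset."""
    feats = []
    for off in range(-left, right + 1):
        if off != 0 and 0 <= i + off < len(pairs):
            feats.append(("g", off, pairs[i + off][0]))
    return feats

def rescoring_score(pairs, error_prone, joint, p_x, p_f, weight=1.0):
    """S_R: weighted PMI summed over features, averaged over error-prone graphemes."""
    total, count = 0.0, 0
    for i, (g, f) in enumerate(pairs):
        if g not in error_prone:
            continue
        count += 1
        for x in context_features(pairs, i):
            total += weight * math.log(joint[(x, f)] / (p_x[x] * p_f[f]))
    return total / count if count else 0.0

VOWELS = {"a", "e", "i", "o", "u"}
# Hypothetical probabilities, for illustration only.
P_X = {("g", -1, "b"): 0.1, ("g", 1, "t"): 0.1}
P_F = {"IH": 0.1, "AY": 0.1}
JOINT = {
    (("g", -1, "b"), "IH"): 0.02, (("g", 1, "t"), "IH"): 0.03,
    (("g", -1, "b"), "AY"): 0.005, (("g", 1, "t"), "AY"): 0.004,
}
s_ih = rescoring_score([("b", "B"), ("i", "IH"), ("t", "T")], VOWELS, JOINT, P_X, P_F)
s_ay = rescoring_score([("b", "B"), ("i", "AY"), ("t", "T")], VOWELS, JOINT, P_X, P_F)
```

In this toy setting the candidate whose vowel phoneme co-occurs more often with its neighbors receives the higher S_R, which is the behavior the rescoring step relies on.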

After the n grapheme-phoneme pair sequences have been rescored in this way, each has a rescoring score S_R. Finally, this score is combined with the score S_{G2P} by weighting to obtain the final score S_{Final}:

S_{Final} = w_{G2P} S_{G2P} + w_{R} S_{R},

and the grapheme-phoneme pair sequence with the highest final score is the answer.
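The two weighted combinations, S_{G2P} and S_{Final}, can be illustrated together with a small Python sketch. All scores and weights below are hypothetical values chosen so that rescoring changes the winner; the patent does not give weight values.

```python
# Each candidate: (label, S_G, S_P, S_R). Hypothetical scores, for illustration only.
candidates = [
    ("G1P1", -4.0, -6.0, 0.2),
    ("G1P2", -4.0, -6.5, 1.5),
    ("G2P1", -5.0, -6.2, 0.9),
]

# Assumed weights; in practice they would be tuned on held-out data.
W_G, W_P = 0.5, 0.5
W_G2P, W_R = 1.0, 1.0

def s_g2p(c):
    """S_G2P = w_G * S_G + w_P * S_P."""
    return W_G * c[1] + W_P * c[2]

def s_final(c):
    """S_Final = w_G2P * S_G2P + w_R * S_R."""
    return W_G2P * s_g2p(c) + W_R * c[3]

first_pass = max(candidates, key=s_g2p)   # best after steps S1 and S2
final = max(candidates, key=s_final)      # best after rescoring in step S3
```

Here the first pass prefers G1P1, while the added rescoring term moves G1P2 to the top.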

To verify the effectiveness of the invention, experiments were carried out with the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict). The CMU Pronouncing Dictionary is a machine-readable dictionary containing more than 125,000 English words and their pronunciations, which are composed from a set of 39 phonemes. After words containing punctuation marks and words with multiple pronunciations were removed, 110,327 words remained. For each word w, its graphemes G(w) = g_1 g_2 ... g_n and phonemes P(w) = f_1 f_2 ... f_m were then passed through an automatic alignment module to obtain the grapheme-phoneme pairing GP(w) = g_1p_1 : g_2p_2 : ... : g_np_m. All the pairings were divided at random into ten sets, and the experiments were evaluated by cross-validation.
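The random division into ten sets can be sketched as follows. The text does not say how the sets were used, so this Python sketch assumes the usual ten-fold arrangement in which each set serves once as test data and the other nine as training data; the function name and the fixed seed are hypothetical.

```python
import random

def ten_fold_splits(items, folds=10, seed=0):
    """Yield (train, test) pairs; each fold is used once as the test set."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    parts = [shuffled[k::folds] for k in range(folds)]
    for k in range(folds):
        test = parts[k]
        train = [x for j, part in enumerate(parts) if j != k for x in part]
        yield train, test

# Stand-in for the aligned dictionary entries.
splits = list(ten_fold_splits(range(25)))
```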

The experiment first performs grapheme segmentation on the input word. The results show that the correct segmentation is included among the two grapheme sequences with the highest scores S_G in 98.25% of cases (the inclusion rate), which is higher than when only the single highest-scoring sequence is taken (90.61%). Phoneme tagging is therefore carried out on the top two grapheme sequences. The tagging is based on the neighboring graphemes with a context range of L = 1, R = 2, and for each grapheme sequence the twenty phoneme sequences with the highest scores S_P are kept. The twenty grapheme-phoneme pair sequences with the highest scores S_{G2P} are then selected according to S_G and S_P. The resulting word accuracy is 59.71%, which is higher than the 59.63% obtained from the single highest-scoring grapheme sequence with its top twenty phoneme sequences, and the rate at which the top twenty results contain the correct answer also improves noticeably (from 88.92% to 90.95%).
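The inclusion rate quoted above can be read as the percentage of test words whose correct answer appears among the top k candidates. The Python sketch below implements that reading; the definition is an assumption, and the two-word data set is made up for illustration.

```python
def inclusion_rate(ranked_outputs, references, k):
    """Percentage of items whose reference appears among the top-k candidates."""
    hits = sum(1 for ranked, ref in zip(ranked_outputs, references) if ref in ranked[:k])
    return 100.0 * hits / len(references)

# Hypothetical ranked segmentations for two words, best candidate first.
ranked_outputs = [
    ["f-ea-s-i-b-le", "f-e-a-s-i-b-le"],
    ["b-e-a-s", "b-ea-s"],
]
references = ["f-ea-s-i-b-le", "b-ea-s"]

top1 = inclusion_rate(ranked_outputs, references, k=1)
top2 = inclusion_rate(ranked_outputs, references, k=2)
```

Keeping more candidates can only raise this rate, which is why the top two grapheme sequences are carried forward instead of one.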

Finally, the vowels (a, e, i, o, u) are rescored. More features are added (neighboring graphemes, phonemes, and grapheme-phoneme pairs), the context range is widened from L = 1, R = 2 to L = 5, R = 5, and the vowels in the twenty input grapheme-phoneme pair sequences with the highest scores S_{G2P} are re-examined to obtain the rescoring score S_R.

The experimental results show that after rescoring, word accuracy rises from the 59.71% of the first two stages to 69.13%, an error reduction rate of 23.38%, which surpasses the 67.89% of the joint N-gram model (N = 4). Further analysis shows that, as in Fig. 3, the average accuracy of the vowel phonemes also rises from 69.72% to 81.16%, an error reduction rate of 37.78%. The method of the invention therefore does effectively improve the accuracy of letter-to-phoneme conversion.

The embodiment above is given only as an example for convenience of description. The scope of rights claimed for the invention is defined by the claims and is not limited to the embodiment.
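The error reduction rates can be checked with a few lines of Python, assuming the usual definition: the share of the original errors that the new method removes. The text does not state the formula, but this definition reproduces both reported figures.

```python
def error_reduction_rate(before, after):
    """Percentage of the original errors removed, given accuracies in percent."""
    return (after - before) / (100.0 - before) * 100.0

word = error_reduction_rate(59.71, 69.13)    # word accuracy before and after rescoring
vowel = error_reduction_rate(69.72, 81.16)   # vowel phoneme accuracy before and after
```

Rounded to two decimals, `word` is 23.38 and `vowel` is 37.78, matching the reported values.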

Claims

1. A letter-to-phoneme conversion method that rescores error-prone graphemes to improve accuracy, characterized by comprising:

a grapheme-phoneme pair sequence generation step, in which grapheme segmentation and phoneme tagging are performed on an input word to produce at least one grapheme-phoneme pair sequence, each grapheme-phoneme pair sequence comprising at least one grapheme and its corresponding phoneme, and a score is computed for each grapheme-phoneme pair sequence; and

a rescoring step, in which, from among at least one of the highest-scoring grapheme-phoneme pair sequences, each grapheme-phoneme pair sequence that contains a preset error-prone grapheme is rescored by choosing, for each error-prone grapheme, features from its preceding and following context and computing the association between these features and the phoneme corresponding to that error-prone grapheme, and the grapheme-phoneme pair sequence with the highest score is taken as the conversion result.

2. The method as claimed in claim 1, characterized in that the association computed between the error-prone grapheme and the context features is mutual information.

3. The method as claimed in claim 1, characterized in that the grapheme-phoneme pair sequence generation step comprises:

a grapheme segmentation step, in which the input word is segmented according to the graphemes in a preset grapheme set to obtain at least one grapheme sequence, each grapheme sequence comprising a plurality of graphemes, and a score is computed for each grapheme sequence; and

a phoneme tagging step, in which phoneme tagging is performed, according to a preset grapheme-to-phoneme relation, on at least one of the highest-scoring grapheme sequences so that at least one phoneme sequence is obtained for each grapheme sequence, a score is computed for each phoneme sequence, at least one of the highest-scoring phoneme sequences is kept for each grapheme sequence, and the at least one grapheme-phoneme pair sequence is produced.

4. The method as claimed in claim 2, characterized in that in the rescoring step each grapheme-phoneme pair sequence is rescored as follows:

S_{R} = \sum_{i:\, g_{i} \in E} \sum_{j = 1}^{|X(i)|} w_{j} \log \frac{P(x_{j}, f_{i})}{P(x_{j}) P(f_{i})} \times \frac{1}{\sum_{i:\, g_{i} \in E} 1}

where g_i is a grapheme of the grapheme sequence, f_i is a phoneme of the phoneme sequence, w_j is a weight, E is the set of error-prone graphemes, X(i) is the set of chosen features, and x_j is any feature in the feature set X(i).

5. The method as claimed in claim 4, characterized in that X(i) is:

where τ_i ≡ g_i f_i; L and R denote the range of context around grapheme g_i; N is the number of selected higher-scoring grapheme-phoneme pair sequences; y is g, f, or τ; and l and r denote the positions at which y occurs, which must lie between i-L and i+R.

6. The method as claimed in claim 3, characterized in that the score S_{G2P} of each grapheme-phoneme pair sequence is:

S_{G2P} = w_{G} S_{G} + w_{P} S_{P},

where S_G is the score of its grapheme sequence, S_P is the score of its phoneme sequence, and w_G and w_P are weights.

7. The method as claimed in claim 6, characterized in that in the grapheme segmentation step the score S_G computed for each grapheme sequence is:

S_{G} = \sum_{i = 1}^{n} \log P(g_{i} \mid g_{i - N + 1}^{i - 1}),

where g_i is a grapheme of the grapheme sequence, n is the number of graphemes in the grapheme sequence, and N indicates that the score of g_i is determined from the graphemes that precede it.

8. The method as claimed in claim 6, characterized in that in the phoneme tagging step the score S_P computed for each phoneme sequence is:

S_{P} = \sum_{i = 1}^{n} \log P(f_{i} \mid g_{i - R}^{i - L}),

where f_i is a phoneme of the phoneme sequence, L and R denote the range of context around grapheme g_i, and n is the number of phonemes in the phoneme sequence.

9. The method as claimed in claim 4, characterized in that in the rescoring step the score S_{Final} of each grapheme-phoneme pair sequence after rescoring is:

S_{Final} = w_{G2P} S_{G2P} + w_{R} S_{R},

where w_{G2P} and w_R are weights.

10. The method as claimed in claim 1, characterized in that the input word is written in a Roman-alphabet script.

11. The method as claimed in claim 1, characterized in that in the rescoring step the error-prone graphemes are the English vowels.

12. The method as claimed in claim 1, characterized in that in the rescoring step the context features comprise phonemes, graphemes, and grapheme-phoneme pairs.

13. The method as claimed in claim 3, characterized in that in the phoneme tagging step each grapheme in the preset grapheme-to-phoneme relation corresponds to at least one phoneme.

14. The method as claimed in claim 3, characterized in that in the grapheme segmentation step the input word is segmented with an N-gram model.