CA2160184A1 - Language identification with phonological and lexical models - Google Patents

Language identification with phonological and lexical models

Info

Publication number
CA2160184A1
Authority
CA
Canada
Prior art keywords
language
given candidate
sequences
speech
likelihood
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA 2160184
Other languages
French (fr)
Inventor
James Lee Hieronymus
Shubha Kadambe
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp filed Critical AT&T Corp
Publication of CA2160184A1 publication Critical patent/CA2160184A1/en
Abandoned legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

A method and apparatus for identifying a speech signal as representing speech in a given candidate language. First, the described illustrative embodiment performs a language-specific phoneme recognition on the speech signal for the given candidate language. Next, a corresponding phonemotactic (i.e., phoneme transition probability) model for the given language is applied to produce one or more corresponding phoneme sequences and associated likelihood scores (e.g., probabilities). Then, a corresponding lexical model for the given language is applied to the phoneme sequences and their associated likelihood scores. In this manner, the lexical characteristics of the given language are taken into account in order to identify the most likely phoneme sequence (assuming that the given candidate language is, in fact, the language which was spoken) and its associated likelihood. This associated likelihood is used to provide a resultant likelihood score for the given candidate language. Finally, the speech signal is identified as representing speech in the given language based on the resultant likelihood score so obtained. In particular, the speech signal is analyzed in accordance with the above with respect to each of a plurality of candidate languages, and is identified as representing speech in the candidate language which produces the highest likelihood score.

Description

LANGUAGE IDENTIFICATION WITH PHONOLOGICAL AND LEXICAL MODELS

Field of the Invention

The present invention relates generally to the field of speech recognition and more particularly to the problem of spoken language identification.

Background of the Invention

Spoken language identification (LID) has been the subject of research for several years. Initially, systems were developed to screen radio transmissions and telephone conversations for the intelligence community. In the future, LID systems will become an integral part of telephone and speech input computer networks which provide services in multiple languages. For example, a LID system can be used to pre-sort telephone callers (or computer users) into categories based on the language they speak, so that a required service may be provided in an appropriate language. Examples of such services include travel information, emergency assistance, language interpretation, telephone information and stock quotations.
Systems in the field of speech recognition generally perform their analysis of a given input speech signal based on certain linguistic models of language. These models include acoustic models which are commonly based on the fact that spoken language is comprised of a sequence of phonemes, which are the distinct, fundamental speech sounds of a given language. Phonemes may be combined into syllables, words and, ultimately, sentences.
Prior art LID systems have, in particular, been based on the acoustic properties of languages. Specifically, they have been based on the fact that the languages of the world differ from one another in their particular phoneme inventory and in the likelihood of occurrence of various sequences of these phonemes. Such systems may, for example, perform phoneme recognition based on a corresponding phoneme inventory for each of a set of candidate languages, followed by the application of a corresponding phonemotactic (phoneme sequence probabilities) model for each of the given languages to determine the likelihood that the recognized sequence of phonemes would occur in that language. Then, the language for which the recognized phoneme sequence is most probable may be identified as the spoken language.

Summary of the Invention

Prior art approaches fail to take into account lexical distinctions between languages. By using language-specific lexical models (as well as phonological ones), the present invention provides a method and apparatus for LID which results in superior language discrimination capability relative to prior art systems. In particular, languages differ from each other along many dimensions including syllable structure, prosodics, lexical words and grammar (in addition to phoneme inventory and phoneme sequences).
Thus, the present invention provides a superior technique for LID which uses lexical models in addition to the phonological models used by prior art systems.
Specifically, the method of the present invention identifies a speech signal as representing speech in a given candidate language. First, the method performs acoustic speech recognition on the speech signal based on the given candidate language. This speech recognition results in the generation of one or more sequences of subwords and associated acoustic likelihood scores. The acoustic speech recognition may, for example, be based on a language-specific phoneme inventory (i.e., the subwords may be phonemes), and may apply a corresponding phonemotactic (i.e., phoneme transition probability) model for the given language to produce the associated acoustic likelihood scores (e.g., probabilities) for each of the corresponding phoneme sequences.
After the acoustic-based speech recognition has been performed, a corresponding lexical model for the given language is applied to the phoneme sequences and their associated acoustic likelihood scores. In this manner, the lexical characteristics of the given language are taken into account in order to identify the most likely phoneme sequence (assuming that the given candidate language is, in fact, the language which was spoken) and to produce a resultant likelihood score. This resultant likelihood score (of the most probable phoneme sequence) may be used as an overall language likelihood score for the given candidate language. In other words, the likelihood that the speech signal, in fact, represents speech in the given candidate language may be equated with the likelihood that the speech signal comprises the most likely phoneme sequence (when both acoustic and lexical language characteristics have been taken into account).
Finally, the speech signal is identified as representing speech in the given language based on the resultant likelihood score obtained. In accordance with one illustrative embodiment, the speech signal is analyzed in accordance with the above method with respect to a plurality of candidate languages, and is identified as representing speech in the candidate language which produces the highest likelihood score.

Brief Description of the Drawings

Fig. 1 shows a prior art language identification system using phoneme recognition and phonemotactic models of phoneme sequences.

Fig. 2 shows a language identification system using both phonological and lexical models in accordance with an illustrative embodiment of the present invention.

Detailed Description

Fig. 1 shows an example prior art language identification system using phoneme recognition and phonemotactic models of phoneme sequences. The system shown classifies an input speech signal (generated from a speech utterance) into one of four candidate languages -- English, Spanish, Mandarin or German. Thus, the system comprises language subsystems 11-1 through 11-4, each for performing speech recognition in one of the four candidate languages. Specifically, language subsystem 11-1 performs English language speech recognition, language subsystem 11-2 performs Spanish language speech recognition, language subsystem 11-3 performs Mandarin language speech recognition and language subsystem 11-4 performs German language speech recognition.

Each language subsystem 11-i comprises corresponding phoneme recognizer 12-i and corresponding phonemotactics module 13-i. Thus, English language subsystem 11-1 comprises English phoneme recognizer 12-1 and English phonemotactics module 13-1, Spanish language subsystem 11-2 comprises Spanish phoneme recognizer 12-2 and Spanish phonemotactics module 13-2, Mandarin language subsystem 11-3 comprises Mandarin phoneme recognizer 12-3 and Mandarin phonemotactics module 13-3 and
German language subsystem 11-4 comprises German phoneme recognizer 12-4 and German phonemotactics module 13-4. Each language subsystem 11-i produces a corresponding log likelihood value which reflects the likelihood that the analyzed input speech signal is, in fact, speech in the given language. Finally, the system of Fig. 1 also comprises classifier 14 for classifying the input speech signal based on the log likelihood values produced by the phonemotactics modules of the language subsystems.
Phoneme recognizers 12-1 through 12-4 may, for example, each be based on conventional second order ergodic Continuous Variable Duration Hidden Markov Models (CVDHMMs). Each ergodic Hidden Markov Model (HMM) has one state per phoneme -- however, each phoneme is modeled by a time sequence of three probability distribution functions (pdfs), with each pdf representing the beginning, the middle and the end of a phoneme, respectively. Note that this structure is equivalent to a three state left-to-right hidden Markov phoneme model. The duration of each phoneme may be modeled by a four parameter gamma distribution function, where the parameters are: (1) the shortest allowed phoneme duration (the gamma distribution shift); (2) the mean duration; (3) the variance of the duration; and (4) the maximum allowed duration for the phoneme.
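The four-parameter duration model described above can be sketched as a shifted, truncated gamma density. This is a hypothetical illustration, not the patent's implementation; the function name and the mapping of the four parameters onto gamma shape and scale are assumptions.

```python
import math

def gamma_duration_logpdf(d, shift, mean, var, d_max):
    """Log-density of a shifted gamma duration model (a sketch).

    The four parameters mirror the four quantities named in the text:
    the shortest allowed duration (shift), the mean duration, the
    variance of the duration, and the maximum allowed duration.
    """
    if d <= shift or d > d_max:
        return float("-inf")      # duration outside the allowed range
    m = mean - shift              # mean of the unshifted gamma
    k = m * m / var               # shape parameter
    theta = var / m               # scale parameter
    x = d - shift
    # log of the gamma density: x^(k-1) * exp(-x/theta) / (theta^k * Gamma(k))
    return (k - 1) * math.log(x) - x / theta - k * math.log(theta) - math.lgamma(k)
```

In a CVDHMM decoder, a score of this form would be added to the acoustic log likelihood of each hypothesized phoneme segment.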
Different training procedures may be advantageously adopted to train the phoneme recognition systems depending on the type of transcription and the alignment of speech waveform with the transcription which is available. For example, when the word labels and the alignment of these labels with the speech waveform are available, the phonemically segmented data may be generated automatically by obtaining the phonemic transcription and the estimated duration for each phoneme using a Text-To-Speech (TTS) system and stretching these durations linearly to cover the word duration. The phonemically segmented data thus obtained may be used to initially train the ergodic HMM models. These models may be re-trained using a conventional segmental k-means algorithm iteratively until the models converge.
Alternatively, when the time aligned phonemic transcription of the speech data is available, the initial models may be trained using this data and the models may be re-trained using the segmental k-means algorithm iteratively until the models converge.
And when the sentence level transcription and segmentation is available, the phonemic level transcription and segmentation may be obtained automatically as above, except that the phoneme durations are stretched linearly to fit the whole sentence. The models may then be trained iteratively as described above by using the segmented data so obtained. Note that this method is similar to a conventional flat start k-means training procedure.
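The linear stretching of TTS-estimated phoneme durations described above can be sketched as follows. The function names are illustrative assumptions; the operation is simply a rescaling so that the estimated durations sum to the observed word (or sentence) duration.

```python
def stretch_durations(tts_durations, target_duration):
    """Linearly rescale TTS-estimated phoneme durations so that they sum
    to the observed word or sentence duration (a sketch of the bootstrap
    segmentation step)."""
    scale = target_duration / sum(tts_durations)
    return [d * scale for d in tts_durations]

def segment_boundaries(durations, start=0.0):
    """Turn a list of durations into (begin, end) times, one per phoneme."""
    bounds, t = [], start
    for d in durations:
        bounds.append((t, t + d))
        t += d
    return bounds
```

The resulting boundaries give an initial phonemic segmentation of the waveform, which the segmental k-means re-training then refines.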
For the transition probabilities of a second order ergodic HMM, a trigram phonemotactic model may advantageously be used for phonemotactics modules 13-1 through 13-4. Such a model provides more discriminative power than the phoneme inventory and bigram probabilities, since the trigram phonemotactic model captures the allowable phoneme sequences in any given language very efficiently. For example, given a set of candidate languages, it will often be the case that there are certain three phoneme sequences allowed in one of the candidate languages but not in the others.
The transition probabilities (i.e., the phonemotactics) may be trained using large amounts of labelled speech. Alternatively, in the absence of enough transcribed speech to train the transition probabilities, they may be approximated using large amounts of text (e.g., 10 million words per language, advantageously obtained from varying sources such as news wire services, newspapers and transcribed speech) and a conventional grapheme to phoneme converter. Specifically, the trigram phonemotactic models may, for example, be trained by converting text to phoneme strings and then by estimating the trigram probability values by applying the following equation:

Pr(s3 | s1, s2) = λ3 f(s3 | s1, s2) + λ2 f(s3 | s2) + λ1 f(s3)    (1)

where the weights λ3, λ2 and λ1 are set to 1, 0 and 0, respectively, si is the phoneme symbol "i" and f( ) is the frequency of occurrence.
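Equation (1) can be sketched as an interpolated trigram estimator over phoneme strings. This is an illustrative implementation, not the patent's code; the function name and data layout are assumptions. With the weights (1, 0, 0) used in the text, it reduces to the raw trigram relative frequency.

```python
from collections import Counter

def train_interpolated_trigram(phoneme_strings, l3=1.0, l2=0.0, l1=0.0):
    """Estimate Pr(s3 | s1, s2) by interpolating trigram, bigram and
    unigram relative frequencies, as in equation (1)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for seq in phoneme_strings:
        uni.update(seq)
        bi.update(zip(seq, seq[1:]))
        tri.update(zip(seq, seq[1:], seq[2:]))

    def prob(s1, s2, s3):
        f3 = tri[(s1, s2, s3)] / bi[(s1, s2)] if bi[(s1, s2)] else 0.0
        f2 = bi[(s2, s3)] / uni[s2] if uni[s2] else 0.0
        f1 = uni[s3] / sum(uni.values())
        return l3 * f3 + l2 * f2 + l1 * f1

    return prob
```

In practice the phoneme strings would come from running the grapheme to phoneme converter over the text corpus.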
Classifier 14 is used to classify the input speech signal (i.e., the speech utterance) as comprising speech in one of the given languages. Specifically, each language subsystem 11-i may advantageously be applied to a given speech utterance in parallel. Then, the language subsystem which produces the highest log likelihood value is chosen by classifier 14 as the language of the input speech signal. The log likelihood may, for example, be computed on a per frame basis to advantageously avoid the bias toward short utterances. In addition, since the phoneme set of each language may contain different numbers of phonemes (English, for example, has 42 phonemes whereas Spanish has 27 and Mandarin has 41), the computation of the log likelihood on a frame basis helps to achieve normalization with respect to the number of phonemes.
The log likelihood values generated by phonemotactic modules 13-1 through 13-4 and used by classifier 14 may, for example, be computed using the well-known Bayes' rule:

P(x | Li) = P(x | Φi) P(Φi | Li)    (2)

where the Ps are conditional probabilities, x is the input speech signal, Φi is the phoneme sequence and Li is the phonemotactic model of the language i.
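The per-frame normalization and the classifier's argmax decision can be sketched as below. The data layout (total log likelihood paired with a frame count per language) is an assumption made for illustration, not the patent's interface.

```python
def classify_language(scores):
    """Pick the candidate language with the highest per-frame log
    likelihood.  `scores` maps a language name to a pair
    (total_log_likelihood, n_frames); dividing by the frame count
    normalizes across phoneme inventories of different sizes and
    avoids a bias toward short utterances."""
    return max(scores, key=lambda lang: scores[lang][0] / scores[lang][1])
```

All language subsystems would be run over the same utterance in parallel before this decision is taken.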
Fig. 2 shows a language identification system using both phonological and lexical models in accordance with an illustrative embodiment of the present invention. The illustrative system shown classifies an input speech signal (generated from a speech utterance) into one of four candidate languages -- English, Spanish, Mandarin or German -- as does the prior art system of Fig. 1. However, the system of Fig. 2 advantageously uses lexical models as well as phonological models to improve system accuracy. The illustrative system of Fig. 2 comprises language subsystems 15-1 through 15-4, each for performing speech recognition in one of the four candidate languages. Specifically, language subsystem 15-1 performs English language speech recognition, language subsystem 15-2 performs Spanish language speech recognition, language subsystem 15-3 performs Mandarin language speech recognition and language subsystem 15-4 performs German language speech recognition.
Each language subsystem 15-i comprises corresponding phoneme recognizer 12-i, corresponding phonemotactics module 13-i and corresponding lexical access module 16-i. Thus, English language subsystem 15-1 comprises English phoneme recognizer 12-1, English phonemotactics module 13-1 and English lexical access module 16-1; Spanish language subsystem 15-2 comprises Spanish phoneme recognizer 12-2, Spanish phonemotactics module 13-2 and Spanish lexical access module 16-2; Mandarin language subsystem 15-3 comprises Mandarin phoneme recognizer 12-3, Mandarin phonemotactics module 13-3 and Mandarin lexical access module 16-3; and German language subsystem 15-4 comprises German phoneme recognizer 12-4, German phonemotactics module 13-4 and German lexical access module 16-4. As in the prior art system of Fig. 1, each language subsystem 15-i produces a corresponding log likelihood value which reflects the likelihood that the analyzed input speech signal is, in fact, speech in the given language. Finally, the system of Fig. 2 also comprises classifier 14 for classifying the input speech signal based on the log likelihood values produced by the lexical access modules of the language subsystems.
Lexical access modules 16-1 through 16-4 generate corresponding log likelihood values analogous to those generated by phonemotactic modules 13-1 through 13-4 in the prior art system of Fig. 1. However, in the case of the illustrative system of Fig. 2, these values have been based on a corresponding lexical model for the given language, as well as on the corresponding phonological model. In particular, each of the phonemotactic modules in the illustrative system of Fig. 2 yields one or more phoneme sequences along with their associated (phonological) log likelihood scores. These sequences and their associated scores are then provided to the corresponding lexical access modules for further analysis in order to determine the likelihood of each sequence in further view of a language-specific lexical model. (Note that in the case of the prior art system of Fig. 1 only the log likelihood score of the most likely phoneme sequence need be provided by the phonemotactic modules, since no further linguistic analysis is to be performed -- thus, the log likelihood score of the most likely phoneme sequence reflects the prior art system's best estimate of the likelihood that the spoken utterance was in the given candidate language.)

Specifically, for each phoneme sequence, the lexical access module of the illustrative system of Fig. 2 produces a lexical log likelihood score (as opposed to the phonological log likelihood scores produced by the phonemotactic modules) based on the likelihood that the given phoneme sequence comprises lexically "meaningful" speech. Then, the lexical log likelihood score is added to the phonological log likelihood score to produce an overall likelihood score for the given phoneme sequence (since addition of log values is equivalent to multiplication of the original values). Then, the highest of these overall likelihood scores is produced as the language likelihood score for the given language (i.e., the log likelihood score produced by the corresponding lexical access module and provided to classifier 14).
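The scoring step above can be sketched in a few lines: add the phonological and lexical log likelihoods of each hypothesized phoneme sequence (addition in log space corresponds to multiplying the probabilities) and keep the best combined score as the language likelihood. The function name and input layout are illustrative assumptions.

```python
def language_score(hypotheses):
    """Given (phonological_log_likelihood, lexical_log_likelihood) pairs,
    one per hypothesized phoneme sequence, return the best combined
    score as the language likelihood score for this candidate language."""
    return max(phon + lex for phon, lex in hypotheses)
```

Each language subsystem would report this score to classifier 14, which picks the language with the highest value.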


Lexical access modules 16-1 through 16-4 may, for example, be based on the lexical model described in F. Pereira, M. Riley and R. Sproat, Weighted Rational Transductions and their Application to Human Language Processing, DARPA Workshop on Human Language Tech., Princeton, NJ, 1994. This method uses the concepts of weighted language, transduction and finite state automata from algebraic automata theory to decode cascades in speech and language processing. Lexical access can be considered as a transduction cascade, since the lexical access problem can be decomposed into a transduction, "D," from phoneme sequences to word sequences (a lexicon), and a weighted language, "M," which specifies the language model. Each of these can be represented as a finite state automaton.
The automaton for the phoneme sequence to word sequence transduction "D" may be defined in terms of word models. A word model (or lexicon) is a transducer from a subsequence of phoneme labels to a specific word. To each subsequence of phonemes, a likelihood may be assigned indicating the probability that it produced the specified word. Hence, different paths through a word model correspond to different phonetic realizations of the word, which advantageously incorporates alternative pronunciations.
The language model "M," which may be an N-gram model, may be implemented as a weighted finite state acceptor. Combining the automata implementing "D" and "M" thus results in an automaton which assigns a probability to each word sequence, and the highest probability path that the automaton estimates gives the most likely word sequence for the given speech utterance. Thus, a best sequence of words which corresponds to a given speech utterance may be obtained, along with a corresponding probability therefor. The log likelihood score produced as output by lexical access modules 16-1 through 16-4 may be the logarithm of the probability so obtained.
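The effect of composing the lexicon transducer "D" with the language model acceptor "M" and taking the best path can be sketched with a toy dynamic program over word segmentations of a phoneme string. This is a simplified stand-in for weighted finite-state composition (here with only unigram language-model weights); the names and data structures are assumptions, not the patent's implementation.

```python
def best_word_sequence(phonemes, lexicon, lm):
    """Toy stand-in for composing "D" with "M": find the segmentation of
    a phoneme string into words that maximizes the sum of lexical and
    language-model log probabilities.

    `lexicon` maps a phoneme tuple to a list of (word, log_prob) pairs
    (alternative pronunciations appear as multiple keys for one word);
    `lm` maps a word to its log probability."""
    n = len(phonemes)
    best = {0: (0.0, [])}               # prefix length -> (score, words)
    for i in range(n):
        if i not in best:
            continue
        for j in range(i + 1, n + 1):
            chunk = tuple(phonemes[i:j])
            for word, lp in lexicon.get(chunk, []):
                cand = best[i][0] + lp + lm.get(word, -1e9)
                if j not in best or cand > best[j][0]:
                    best[j] = (cand, best[i][1] + [word])
    return best.get(n, (float("-inf"), []))
```

The returned log score plays the role of the lexical access module's output to classifier 14.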
The transducer "D" (lexicon or word model) and the acceptor "M" (language model) may advantageously be built using a large sample (e.g., 10,000 words per language) obtained from a commercially available (or otherwise generally available) multi-language transcribed speech data base, such as the data base compiled by the Oregon Graduate Institute and described in Y. K. Muthusamy, R. A. Cole and B. T. Oshika, The OGI Multi-Language Telephone Speech Corpus, Proc. of ICSLP 92, Banff, Canada, 1992. The lexicon for each language advantageously comprises a large number of words (e.g., 2000 unique words), which includes the most frequently used words in the language.

For clarity of explanation, the illustrative embodiment of the present invention is presented as comprising individual functional blocks. The functions these blocks represent may be provided through the use of either shared or dedicated hardware, including, but not limited to, hardware capable of executing software. For example, the functions of the system components presented in FIG. 2 may be provided by a single shared processor or a plurality of processors. (Use of the term "processor" should not be construed to refer exclusively to hardware capable of executing software.) Illustrative embodiments may comprise digital signal processor (DSP) hardware, such as the AT&T DSP16 or DSP32C, read-only memory (ROM) for storing software performing the operations discussed below, and random access memory (RAM) for storing DSP results. Very large scale integration (VLSI) hardware embodiments, as well as custom VLSI circuitry in combination with a general purpose DSP circuit, may also be provided. General purpose computer system hardware may also be used to implement LID systems in accordance with the present invention.

Although a number of specific embodiments of this invention have been shown and described herein, it is to be understood that these embodiments are merely illustrative of the many possible specific arrangements which can be devised in application of the principles of the invention. Numerous and varied other arrangements can be devised in accordance with these principles by those of ordinary skill in the art without departing from the spirit and scope of the invention.

Claims (20)

Claims:
1. A method of identifying a speech signal as representing speech in a given candidate language, the method comprising the steps of:

performing acoustic speech recognition on the speech signal to generate one or more sequences of subwords and corresponding acoustic likelihood scores, the acoustic speech recognition based on the given candidate language;

applying a lexical model of the given candidate language to one or more of said subword sequences to generate a language likelihood score for the given candidate language; and identifying the speech signal as representing speech in the given candidate language based on the language likelihood score for the given candidate language.
2. The method of claim 1 wherein the subwords comprise phonemes and wherein the step of performing acoustic speech recognition comprises the steps of:

performing phoneme recognition on the speech signal to generate the one or more subword sequences; and applying a phonemotactic model of the given candidate language to one or more of said subword sequences to generate the one or more corresponding acoustic likelihood scores.
3. The method of claim 2 wherein the step of performing phoneme recognition is based on a phoneme inventory, the phoneme inventory comprising phonemes associated with the given candidate language.
4. The method of claim 2 wherein the step of performing phoneme recognition is based on a second order ergodic Continuous Variable Duration Hidden Markov Model.
5. The method of claim 2 wherein the phonemotactic model comprises a trigram phonemotactic model of the given candidate language.
6. The method of claim 1 wherein the step of identifying the speech signal comprises comparing the language likelihood score for the given candidate language with one or more language likelihood scores for other candidate languages.
7. The method of claim 1 wherein the language likelihood score is based on one or more of the acoustic likelihood scores corresponding to the one or more subword sequences.
8. The method of claim 7 wherein the language likelihood score is computed by combining one of the acoustic likelihood scores with a lexical likelihood score, the lexical likelihood score based on said application of the lexical model to the subword sequence which corresponds to said one of the acoustic likelihood scores.
9. The method of claim 1 wherein the step of applying a lexical model of the given candidate language comprises the steps of:

transducing one or more of said subword sequences into one or more word sequences;
determining a corresponding probability for one or more of the word sequences; and generating the language likelihood score of the given candidate language based on the determined probabilities.
10. The method of claim 9 wherein the step of transducing the subword sequences into word sequences is performed with use of a finite-state automaton and the step of determining the corresponding probabilities for the word sequences is performed with use of a weighted finite-state acceptor.
11. An apparatus for identifying a speech signal as representing speech in a given candidate language, the apparatus comprising:

means for performing acoustic speech recognition on the speech signal to generate one or more sequences of subwords and corresponding acoustic likelihood scores, the acoustic speech recognition based on the given candidate language;

means for applying a lexical model of the given candidate language to one or more of said subword sequences to generate a language likelihood score for the given candidate language; and means for identifying the speech signal as representing speech in the given candidate language based on the language likelihood score for the given candidate language.
12. The apparatus of claim 11 wherein the subwords comprise phonemes and wherein the means for performing acoustic speech recognition comprises:
means for performing phoneme recognition on the speech signal to generate the one or more subword sequences; and means for applying a phonemotactic model of the given candidate language to one or more of said subword sequences to generate the one or more corresponding acoustic likelihood scores.
13. The apparatus of claim 12 wherein the means for performing phoneme recognition is based on a phoneme inventory, the phoneme inventory comprising phonemes associated with the given candidate language.
14. The apparatus of claim 12 wherein the means for performing phoneme recognition is based on a second order ergodic Continuous Variable Duration Hidden Markov Model.
15. The apparatus of claim 12 wherein the phonemotactic model comprises a trigram phonemotactic model of the given candidate language.
16. The apparatus of claim 11 wherein the means for identifying the speech signal comprises means for comparing the language likelihood score for the given candidate language with one or more language likelihood scores for other candidate languages.
17. The apparatus of claim 11 wherein the language likelihood score is based on one or more of the acoustic likelihood scores corresponding to the one or more subword sequences.
18. The apparatus of claim 17 wherein the means for applying the lexical model comprises means for computing the language likelihood score by combining one of the acoustic likelihood scores with a lexical likelihood score,the lexical likelihood score based on said application of the lexical model to the subword sequence which corresponds to said one of the acoustic likelihood scores.
19. The apparatus of claim 11 wherein the means for applying a lexical model of the given candidate language comprises:

means for transducing one or more of said subword sequences into one or more word sequences;

means for determining a corresponding probability for one or more of the word sequences; and means for generating the language likelihood score of the given candidate language based on the determined probabilities.
20. The apparatus of claim 19 wherein the means for transducing the subword sequences into word sequences comprises a finite-state automaton and the means for determining the corresponding probabilities for the word sequences comprises a weighted finite-state acceptor.
CA 2160184 1994-12-29 1995-10-10 Language identification with phonological and lexical models Abandoned CA2160184A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US36619494A 1994-12-29 1994-12-29
US366,194 1994-12-29

Publications (1)

Publication Number Publication Date
CA2160184A1 true CA2160184A1 (en) 1996-06-30

Family

ID=23442028

Family Applications (1)

Application Number Title Priority Date Filing Date
CA 2160184 Abandoned CA2160184A1 (en) 1994-12-29 1995-10-10 Language identification with phonological and lexical models

Country Status (1)

Country Link
CA (1) CA2160184A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6738745B1 (en) * 2000-04-07 2004-05-18 International Business Machines Corporation Methods and apparatus for identifying a non-target language in a speech recognition system
CN111599363A (en) * 2019-02-01 2020-08-28 浙江大学 Voice recognition method and device
CN111599363B (en) * 2019-02-01 2023-03-31 浙江大学 Voice recognition method and device

Similar Documents

Publication Publication Date Title
O'shaughnessy Interacting with computers by voice: automatic speech recognition and synthesis
Ghai et al. Literature review on automatic speech recognition
US6243680B1 (en) Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US6542866B1 (en) Speech recognition method and apparatus utilizing multiple feature streams
JP4301102B2 (en) Audio processing apparatus, audio processing method, program, and recording medium
US5865626A (en) Multi-dialect speech recognition method and apparatus
US6208964B1 (en) Method and apparatus for providing unsupervised adaptation of transcriptions
US5995928A (en) Method and apparatus for continuous spelling speech recognition with early identification
EP0533491B1 (en) Wordspotting using two hidden Markov models (HMM)
KR101153078B1 (en) Hidden conditional random field models for phonetic classification and speech recognition
EP0570660A1 (en) Speech recognition system for natural language translation
CN112435654B (en) Data enhancement of speech data by frame insertion
Rabiner et al. An overview of automatic speech recognition
Fosler-Lussier et al. Effects of speaking rate and word frequency on conversational pronunciations
Kumar et al. A comprehensive view of automatic speech recognition system-a systematic literature review
EP0562138A1 (en) Method and apparatus for the automatic generation of Markov models of new words to be added to a speech recognition vocabulary
Hieronymus et al. Spoken language identification using large vocabulary speech recognition
Kadambe et al. Language identification with phonological and lexical models
US5764851A (en) Fast speech recognition method for mandarin words
Najafian Acoustic model selection for recognition of regional accented speech
Ikawa et al. Generating sound words from audio signals of acoustic events with sequence-to-sequence model
Sharma et al. Soft-Computational Techniques and Spectro-Temporal Features for Telephonic Speech Recognition: an overview and review of current state of the art
RU2597498C1 (en) Speech recognition method based on two-level morphophonemic prefix graph
El Ouahabi et al. Amazigh speech recognition using triphone modeling and clustering tree decision
CA2160184A1 (en) Language identification with phonological and lexical models

Legal Events

Date Code Title Description
EEER Examination request
FZDE Dead