JP3240691B2 - Voice recognition method - Google Patents

Voice recognition method

Info

Publication number
JP3240691B2
JP3240691B2 (application JP17970492A)
Authority
JP
Japan
Prior art keywords
statistical language
language model
speech
recognition
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
JP17970492A
Other languages
Japanese (ja)
Other versions
JPH0627985A (en)
Inventor
Shoichi Matsunaga
Kiyohiro Shikano
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP17970492A priority Critical patent/JP3240691B2/en
Publication of JPH0627985A publication Critical patent/JPH0627985A/en
Application granted granted Critical
Publication of JP3240691B2 publication Critical patent/JP3240691B2/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Links

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method using a statistical language model (for example, Bahl, L. et al., "A Statistical Approach to Continuous Speech Recognition", IEEE Trans. on PAMI (1983)).

[0002]

[Prior Art] As a conventional speech recognition method using a statistical language model, the following has been proposed. A statistical language model concerning the occurrence order of syllables or words, together with standard patterns of the syllables or words (for example, hidden Markov models), is created in advance from a training text database. For input speech, the statistical language model is used to select, from the several syllables or words recognized immediately before, multiple syllable or word candidates with a high probability of occurring next. The standard pattern of each selected candidate is then matched against the input speech, and the syllable or word with the highest total likelihood, combining the occurrence likelihood and the likelihood of similarity to the standard pattern, is output as the recognition result.

[0003]

[Problem to Be Solved by the Invention] However, this recognition method required the statistical language model to be a large-scale model similar to the recognition task (the utterance content). For example, a statistical language model created from a large database of newspaper editorials is effective for recognizing utterances of editorial content, but is much less effective for a different task such as telephone-based conference registration. That is, to recognize a different task, large-scale text data corresponding to that task had to be created.

[0004]

[Means for Solving the Problem] According to the present invention, a group of statistical language models concerning occurrence order is prepared in advance from text databases of different tasks. From this group, the statistical language model most similar to a small amount of utterance text obtained beforehand is automatically selected, and the selected model is used as the statistical language model for speech recognition. Recognizing with such an adaptive statistical language model yields higher recognition performance than using an arbitrary statistical language model.

[0005] As a method for selecting a statistical language model, the Kullback divergence (Toshiyuki Sakai et al., "Theory of Pattern Recognition", Kyoritsu Shuppan (1967)), for example, may be used. Specifically, denote the two statistical language models by A and B, the distance between the models by D, and the occurrence probability of each model element (for example, a character trigram such as "aiu") by Pn(A) for model A and Pn(B) for model B. Then

D(A,B) = (Σ (Pn(A) − Pn(B)) (log Pn(A) − log Pn(B))) / N

where Σ runs over n = 1 to N and N is the number of elements (for example, the number of trigram types). The smaller the value of D, the more similar A and B are.

[0006]

[Embodiment] FIG. 1 shows an embodiment of the present invention. Speech entered at input terminal 1 is converted into a digital signal in feature extraction unit 2, subjected to LPC cepstrum analysis, and then converted into feature parameters for each frame (for example, 10 milliseconds). The feature parameters are, for example, LPC cepstrum coefficients.

[0007] In advance, standard patterns of syllables or words are created from a training speech database in the same format as the above feature parameters and stored in standard pattern memory 4. The statistical model selection unit 5 selects, as the statistical language model 8 (Mj) used for recognition, the model most similar to the language model 7 of the recognition task from among the statistical language models (M1, M2, ..., MI, where I is the number of language models) in the statistical language model group 6. Statistical language models concerning the occurrence order of syllables or words are created in advance from training text databases for different tasks (utterance contents), such as newspaper editorials, conference registration, and travel guidance; these constitute the statistical language model group 6.
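The selection step can be made concrete with a short sketch (illustrative only; the model names, trigram probabilities, and the smoothing floor are hypothetical, not the patent's data): Mj is the member of the group with the smallest divergence to a small model estimated from the utterance-task sample.

```python
from math import log

def divergence(p, q, floor=1e-12):
    """Symmetric Kullback divergence averaged over the element types
    seen in either distribution (smaller means more similar)."""
    keys = set(p) | set(q)
    return sum((p.get(k, floor) - q.get(k, floor))
               * (log(p.get(k, floor)) - log(q.get(k, floor)))
               for k in keys) / len(keys)

# Hypothetical trigram models M1..M3 built from different task databases.
group = {
    "M1_editorial":  {"aaa": 0.7, "bbb": 0.2, "ccc": 0.1},
    "M2_conference": {"aaa": 0.2, "bbb": 0.5, "ccc": 0.3},
    "M3_travel":     {"aaa": 0.4, "bbb": 0.1, "ccc": 0.5},
}

# Small model estimated from the utterance-task sample text.
task_model = {"aaa": 0.25, "bbb": 0.45, "ccc": 0.30}

# Mj = the group member nearest to the task model.
mj = min(group, key=lambda name: divergence(task_model, group[name]))
print(mj)  # M2_conference
```

Because the comparison is between two already-estimated models rather than between raw text databases, only the element probabilities need to be stored and compared, which is the storage advantage noted later in the text.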

[0008] In the speech recognition unit 3, for each of the multiple syllable or word candidates selected using the chosen statistical language model 8 (Mj), the candidate's standard pattern is read from standard pattern memory 4 and its similarity (likelihood) to the input speech parameters is computed. For example, to recognize the i-th unit (syllable or word) of the input speech, a trigram on the occurrence order of units from the selected statistical language model 8 is used: based on the recognition results for the (i−2)-th and (i−1)-th units, multiple units with a high predicted likelihood of appearing i-th are selected as candidate units k1 to kn (FIG. 2). For each of the selected candidate units k1 to kn, the likelihood (similarity) between its standard pattern and the input speech is computed. The sum of the likelihood of the candidate appearing i-th and the likelihood of its similarity to the standard pattern is taken as the total likelihood, and the candidate unit with the highest total likelihood, for example k2, is output to recognition result output unit 9 as the i-th recognition result.
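The candidate selection and total-likelihood scoring can be sketched as follows (illustrative only; the trigram table, the acoustic scores, and the use of log-domain likelihoods are assumptions, not the patent's data):

```python
from math import log

# Hypothetical trigram: P(unit_i | unit_{i-2}, unit_{i-1}).
trigram = {("ka", "i"): {"gi": 0.6, "sha": 0.3, "dan": 0.1}}

# Hypothetical acoustic log-likelihoods from matching each candidate's
# standard pattern against the current stretch of input speech.
acoustic = {"gi": -2.0, "sha": -0.5, "dan": -3.0}

def recognize_ith(prev2, prev1, n_best=2):
    """Select the n_best units most likely to occur i-th given the two
    previous recognition results, then return the candidate maximizing
    occurrence log-likelihood + acoustic log-likelihood."""
    dist = trigram[(prev2, prev1)]
    candidates = sorted(dist, key=dist.get, reverse=True)[:n_best]  # k1..kn
    return max(candidates, key=lambda u: log(dist[u]) + acoustic[u])

# "sha" wins despite a lower language-model score, analogous to k2
# winning in FIG. 2 of the patent.
print(recognize_ith("ka", "i"))  # sha
```

Working in the log domain turns the "sum of likelihoods" into an addition of log-probabilities, which is the usual numerically stable form of this combination.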

[0009] The selection of unit candidates, their matching against the standard patterns, and the operation of obtaining the recognition result unit from the total likelihood are repeated until the end of the speech interval. Finally, the recognition result units obtained so far are output, in order, as the recognition result sequence for the input speech. The feature extraction unit 2, recognition unit 3, recognition result output unit 9, and statistical language model selection unit 5 can each be implemented on a dedicated or shared microprocessor.

[0010] Furthermore, the selection unit 5 is not limited to the Kullback divergence; any measure with which a distance can be computed, such as the Euclidean distance, may be used. The units of the statistical language model and of the recognition standard patterns are not limited to syllables and words; character units such as phonemes, kana, or kanji may also be used. The recognition method is not limited to hidden Markov models; DP matching may also be used. The statistical language model is not limited to trigrams; bigram or unigram statistics may also be used.

[0011]

[Effects of the Invention] As described above, according to the present invention, a statistical language model similar to the utterance task is used, so higher recognition performance is expected than when an arbitrary statistical language model is used. An utterance task of 279 phrases concerning conference registration was evaluated by phrase recognition rate. For example, when text on travel guidance was used for the statistical language model, recognition performance was 42%, whereas when text on conference registration was used, recognition performance rose to 64%.

[0012] In addition, when the statistical model selection unit of the method of this invention was used to select, from four statistical language models (magazine articles, editorials, newspapers, and keyboard conversations on conference registration), the model most similar to telephone conversations on conference registration, the keyboard-conversation model on conference registration was chosen, showing that the selection was appropriate. Note that, instead of preparing a group of statistical language models, one may prepare the training text databases themselves (magazine articles, editorials, newspapers, travel guidance, conference registration, and so on), use a sample of the utterance task to select a similar training text database, create a statistical language model concerning occurrence order from the selected database, and use that model for speech recognition. In this case, however, more storage capacity is required to hold the various training text databases, and examining the similarity between the utterance task and a text database requires computing statistics, so examining similarity against the statistical language models is simpler to process.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention.

FIG. 2 is an explanatory diagram of selecting candidate units for the i-th recognition and outputting the recognition result from them.

Continuation of front page. (56) References: JP-A-59-61897 (JP, A); JP-A-4-291399 (JP, A); 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-92), pp. I-165 to I-168 (March 23-26, 1992); Proceedings of the 2nd JSAI SIG on Language/Speech Understanding and Dialogue Processing, pp. 117-121 (July 13, 1992); IPSJ 42nd National Convention (first half of 1991), 6D-5, pp. 2-114 to 2-115 (March 12-14, 1991); Proceedings of the Acoustical Society of Japan meeting (October 1991), 2-P-18, pp. 173-174. (58) Field searched (Int. Cl.7, DB name): G10L 15/18; JICST file (JOIS)

Claims (1)

(57) [Claims]

Claim 1: A speech recognition method in which input speech is treated as a time series of feature parameters; a statistical language model concerning occurrence order is used to select multiple speech recognition candidates for the feature parameter time series of the input speech; each of these speech recognition candidates is matched against a speech standard pattern and the feature parameter time series of the input speech; and a candidate with a high total likelihood of the occurrence likelihood and the similarity likelihood is taken as the recognition result, the method being characterized in that: a group of statistical language models concerning occurrence order, each created from a training text database for a different task, is prepared in advance; using a sample of the task to be uttered, a statistical language model similar to the utterance task (the content of the speech to be uttered) is selected from the group of statistical language models; and the selected statistical language model is used for selecting the multiple speech recognition candidates.
JP17970492A 1992-07-07 1992-07-07 Voice recognition method Expired - Lifetime JP3240691B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP17970492A JP3240691B2 (en) 1992-07-07 1992-07-07 Voice recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP17970492A JP3240691B2 (en) 1992-07-07 1992-07-07 Voice recognition method

Publications (2)

Publication Number Publication Date
JPH0627985A JPH0627985A (en) 1994-02-04
JP3240691B2 true JP3240691B2 (en) 2001-12-17

Family

ID=16070422

Family Applications (1)

Application Number Title Priority Date Filing Date
JP17970492A Expired - Lifetime JP3240691B2 (en) 1992-07-07 1992-07-07 Voice recognition method

Country Status (1)

Country Link
JP (1) JP3240691B2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8606583B2 (en) 2008-08-13 2013-12-10 Nec Corporation Speech synthesis system for generating speech information obtained by converting text into speech
US8620663B2 (en) 2008-08-13 2013-12-31 Nec Corporation Speech synthesis system for generating speech information obtained by converting text into speech

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3426176B2 (en) 1999-12-27 2003-07-14 International Business Machines Corporation Speech recognition apparatus, method, computer system, and storage medium

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
1992 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP-92), pp. I-165 to I-168 (March 23-26, 1992)
Proceedings of the 2nd JSAI SIG on Language/Speech Understanding and Dialogue Processing, pp. 117-121 (July 13, 1992)
IPSJ 42nd National Convention (first half of 1991), 6D-5, pp. 2-114 to 2-115 (March 12-14, 1991)
Proceedings of the Acoustical Society of Japan meeting (October 1991), 2-P-18, pp. 173-174


Also Published As

Publication number Publication date
JPH0627985A (en) 1994-02-04

Similar Documents

Publication Publication Date Title
US6694296B1 (en) Method and apparatus for the recognition of spelled spoken words
US6236964B1 (en) Speech recognition apparatus and method for matching inputted speech and a word generated from stored referenced phoneme data
US6243680B1 (en) Method and apparatus for obtaining a transcription of phrases through text and spoken utterances
US5878390A (en) Speech recognition apparatus equipped with means for removing erroneous candidate of speech recognition
US6856956B2 (en) Method and apparatus for generating and displaying N-best alternatives in a speech recognition system
US6910012B2 (en) Method and system for speech recognition using phonetically similar word alternatives
US6973427B2 (en) Method for adding phonetic descriptions to a speech recognition lexicon
US6073091A (en) Apparatus and method for forming a filtered inflected language model for automatic speech recognition
US8126714B2 (en) Voice search device
US6502072B2 (en) Two-tier noise rejection in speech recognition
JP3444108B2 (en) Voice recognition device
JP3240691B2 (en) Voice recognition method
El Méliani et al. Accurate keyword spotting using strictly lexical fillers
JPH09134192A (en) Statistical language model forming device and speech recognition device
JP2938865B1 (en) Voice recognition device
JP3009709B2 (en) Japanese speech recognition method
JPH04291399A (en) Voice recognizing method
JP3430265B2 (en) Japanese speech recognition method
Chiang et al. CCLMDS'96: Towards a speaker-independent large-vocabulary Mandarin dictation system
US20060206301A1 (en) Determining the reading of a kanji word
JPH06289894A (en) Japanese speech recognizing method
JP3001334B2 (en) Language processor for recognition
JP2979912B2 (en) Voice recognition device
JPH0612091A (en) Japanese speech recognizing method
JP3033132B2 (en) Language processor

Legal Events

Date Code Title Description
FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20071019

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20081019

Year of fee payment: 7

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20091019

Year of fee payment: 8

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20101019

Year of fee payment: 9


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20111019

Year of fee payment: 10


FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20121019

Year of fee payment: 11

EXPY Cancellation because of completion of term