JPS63292200A

JPS63292200A - Word voice recognition equipment

Info

Publication number: JPS63292200A
Application number: JP62127882A
Authority: JP
Inventors: 北井　幹雄; 川野辺　正
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1987-05-25
Filing date: 1987-05-25
Publication date: 1988-11-29

Abstract

(57) [Summary] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] (Technical field of the invention) The present invention relates to a word speech recognition device.

(Prior art) A conventional word speech recognition device converts input speech into feature data in the same format as the data representing the speech features of each recognition target word stored in a word dictionary, calculates the similarity between the feature data of the input speech and the feature data of each recognition target word in the dictionary, and outputs, as the recognition result, the code of the recognition target word whose feature data has a high similarity. Devices of this type offer the highest recognition performance among the speech recognition devices currently on the market. However, as the vocabulary to be recognized grows, the time required for recognition processing increases and recognition performance deteriorates.

(Object of the invention) An object of the present invention is to provide a speech recognition device that resolves the above drawbacks of conventional word speech recognition devices.

(Structure of the invention) (Features of the invention and differences from the prior art) To achieve the above object, the following five means are added to a conventional device that recognizes word speech using word-level feature data.

(1) Input data temporary storage means that temporarily stores the input speech after it has been converted into the same format as the data representing the speech features of each recognition target word stored in the word dictionary.

(2) Phoneme recognition means that recognizes the input speech in units of phonemes, which are smaller than words, and outputs the phoneme symbol sequence corresponding to the input speech as the recognition result.

(3) Phoneme partial sequence output means that extracts, from the phoneme symbol sequence output by the phoneme recognition means, the partial symbol sequence recognized with high confidence, and outputs that partial sequence together with information on the position at which it appears in the input word.

(4) Word search means that searches a dictionary, in which the phoneme notation of each recognition target word and the number of the corresponding recognition feature data are recorded, for all words that contain the partial symbol sequence output by the phoneme partial sequence output means and satisfy its position information, and outputs the feature data numbers corresponding to those words.

(5) Feature data loading means that retrieves from the word dictionary the recognition feature data whose numbers were output by the word search means, and outputs that data to the memory in the word speech recognition processing section.

By using these added means to perform phoneme-level recognition before recognition with word-level feature data, and by reducing the set of recognition target words according to that result, the drawbacks of the word speech recognition device described under the prior art above can be mitigated.

(Embodiment) FIG. 1 is a block diagram showing the configuration of one embodiment of the present invention.

1 is the input data conversion unit, which converts input speech into data in the same format as the data representing the speech features of each recognition target word stored in the word dictionary, and outputs it.

2 is the input data temporary storage unit, which temporarily stores the data output from the input data conversion unit 1.

3 is the phoneme recognition unit, which recognizes the data output from the input data conversion unit 1 in units of phonemes and outputs the phoneme symbol sequence corresponding to the input speech as the recognition result.

4 is the phoneme partial sequence output unit, which extracts, from the candidate phoneme symbol sequence output by the phoneme recognition unit 3, the phoneme symbol sequence of the part recognized with high confidence, and outputs that partial sequence together with information on the position at which it appears in the input word (beginning, middle, end of the word, and so on).

5 is the phoneme notation dictionary, which records, for each recognition target word, the phoneme notation of the word and the number of the recognition feature data corresponding to the word.

6 is the word search unit, which searches the phoneme notation dictionary 5 for all words that contain the phoneme sequence output by the phoneme partial sequence output unit 4 and satisfy the position information of that sequence, and outputs the numbers of the recognition feature data corresponding to those words.

7 is the word dictionary, which stores the recognition feature data of the recognition target words.

8 is the feature data loading unit, which retrieves from the word dictionary 7 the feature data whose numbers were output by the word search unit 6 and loads that data into the memory of the word speech recognition unit that follows.

9 is the word speech recognition unit, which calculates the similarity between the feature data loaded by the feature data loading unit 8 and the feature data stored in the input data temporary storage unit 2, and outputs candidate words.

Input speech is converted into feature data by the input data conversion unit 1 and expanded into a sequence of phonemes by the phoneme recognition unit 3.

At the same time, the feature data output from the input data conversion unit 1 is stored in the input data temporary storage unit 2.

The subsequent processing of each unit is briefly explained below.

The table lists the 120 recognition target words formed by rearranging the five syllables "a", "i", "u", "e", and "o" (5! = 120 orderings).
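As a minimal sketch of how such a vocabulary could be enumerated, the snippet below generates every ordering of the five vowels. The romanized symbols and the names `VOWELS` and `build_vocabulary` are illustrative assumptions and do not come from the specification.

```python
from itertools import permutations

# The five vowel symbols of the example vocabulary (romanized; an assumption).
VOWELS = "aiueo"


def build_vocabulary(symbols: str = VOWELS) -> list:
    """Return every ordering of the given symbols as a phoneme string."""
    return ["".join(p) for p in permutations(symbols)]


vocabulary = build_vocabulary()
print(len(vocabulary))   # number of recognition target words
print(vocabulary[:3])    # first few entries in lexicographic order
```

Each string doubles as the phoneme notation of the word, which is convenient for the dictionary lookups described later.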

In the following, the recognition target words are the 120 words shown in the table above, the phonemes are represented by the five alphabetic symbols (vowels) a, i, u, e, and o, and the similarity of a candidate phoneme to its feature data is assumed to be expressed as an integer from 0 to 100.

FIG. 2 is a diagram showing an example of candidate phoneme sequence extraction.

Suppose that the phoneme symbol sequence output by the phoneme recognition unit 3 as its recognition result is as shown in FIG. 2.

In the figure, two phonemes are regarded as consecutive when the temporal gap or overlap between them is at or below a tolerance value.

At this time, the phoneme partial sequence output unit 4 outputs, from the phoneme symbol sequence produced by the phoneme recognition unit 3, the sequence of phonemes whose distance from the phoneme feature data is at or below a certain threshold, together with its position information. This threshold is set before the device is used.

Now suppose the threshold is 80. Among the sequences in the candidate phoneme sequence of FIG. 2 in which every candidate phoneme has a similarity exceeding 80 (there are two in FIG. 2), one is selected, for example the one with the largest sum of similarities, "ai", and it is sent to the word search unit 6 together with its position information, which in this case is "middle".
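The selection step can be sketched as follows. This is only an illustration under stated assumptions: the candidates are given in time order and treated as consecutive (the gap and overlap tolerance is not modeled), the threshold is non-negative, the position label is derived from the run's index in the list, and the scores are hypothetical rather than the values of FIG. 2. The name `best_confident_run` is likewise invented for this example.

```python
from typing import List, Optional, Tuple

Candidate = Tuple[str, int]  # (phoneme symbol, similarity in 0..100)


def best_confident_run(
    candidates: List[Candidate], threshold: int
) -> Optional[Tuple[str, str]]:
    """Return (partial sequence, position label) for the best run, or None."""
    runs = []  # (similarity sum, start index, end index exclusive)
    start = None
    # A trailing sentinel with score -1 closes a run that reaches the end.
    for i, (_, score) in enumerate(candidates + [("", -1)]):
        if score > threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((sum(s for _, s in candidates[start:i]), start, i))
            start = None
    if not runs:
        return None
    # Pick the run with the largest similarity sum.
    _, start, end = max(runs, key=lambda r: r[0])
    sequence = "".join(p for p, _ in candidates[start:end])
    if start == 0:
        position = "beginning"
    elif end == len(candidates):
        position = "end"
    else:
        position = "middle"
    return sequence, position


# Hypothetical scores, not the values of FIG. 2.
example = [("u", 60), ("a", 90), ("i", 88), ("e", 53), ("o", 85)]
result = best_confident_run(example, threshold=80)
print(result)
```

With these made-up scores there are two runs above the threshold ("ai" and "o"), and the run with the larger sum is returned with the label "middle".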

The word search unit 6 searches the phoneme notation dictionary 5 for the recognition target words that contain the phoneme partial sequence "ai" in the "middle" position, and outputs the numbers of their feature data to the feature data loading unit 8.
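A dictionary lookup of this kind might look like the sketch below. The dictionary entries and feature data numbers are hypothetical, and the reading of "middle" as "an occurrence touching neither the first nor the last phoneme" is an assumption, since the specification does not define the position labels precisely.

```python
from typing import Dict, List


def matches_position(word: str, sub: str, position: str) -> bool:
    """Check whether `sub` occurs in `word` at the requested position."""
    if position == "beginning":
        return word.startswith(sub)
    if position == "end":
        return word.endswith(sub)
    if position == "middle":
        # Assumption: the occurrence lies strictly inside the word.
        return sub in word[1:-1]
    raise ValueError(f"unknown position label: {position}")


def search_words(phoneme_dict: Dict[str, int], sub: str, position: str) -> List[int]:
    """Return the feature data numbers of all words satisfying the constraint."""
    return sorted(
        number
        for notation, number in phoneme_dict.items()
        if matches_position(notation, sub, position)
    )


# Hypothetical phoneme notation dictionary: notation -> feature data number.
phoneme_dict = {"aiueo": 0, "uaieo": 1, "ueoai": 2, "eaiou": 3, "uieao": 4}
hits = search_words(phoneme_dict, "ai", "middle")
print(hits)
```

Only the returned numbers need to be passed on, so the later matching stage never touches the feature data of words that were ruled out here.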

The feature data loading unit 8 retrieves from the word dictionary 7 the feature data whose numbers were received from the word search unit 6, and loads that data into the memory of the word speech recognition unit 9 that follows.

The word speech recognition unit 9 calculates the similarity between the loaded feature data and the feature data of the input speech previously stored in the input data temporary storage unit 2, and outputs candidate words.
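The loading and matching steps can be illustrated with the sketch below. The specification does not state the form of the feature data or the similarity measure, so fixed-length vectors and negative squared Euclidean distance are assumptions chosen purely for illustration, as are the names `similarity` and `recognize` and all data values.

```python
from typing import Dict, List, Sequence, Tuple


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: negative squared Euclidean distance (higher is closer)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(
    input_features: Sequence[float],
    word_dictionary: Dict[int, Sequence[float]],
    candidate_numbers: List[int],
) -> List[Tuple[int, float]]:
    """Score only the shortlisted templates and rank them, best first."""
    # Feature data loading: copy just the selected entries out of the dictionary.
    loaded = {n: word_dictionary[n] for n in candidate_numbers}
    scored = [(n, similarity(input_features, t)) for n, t in loaded.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical word dictionary: feature data number -> feature vector.
word_dictionary = {0: [0.0, 0.0], 1: [1.0, 1.0], 2: [5.0, 5.0], 3: [1.0, 2.0]}
stored_input = [1.0, 1.2]  # stands in for the temporarily stored input features
ranked = recognize(stored_input, word_dictionary, candidate_numbers=[1, 3])
print([number for number, _ in ranked])
```

The cost of the matching stage scales with the length of `candidate_numbers` rather than with the size of the whole word dictionary.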

There were initially 120 recognition target words, but at this point the recognition targets have been narrowed down to only the 6 words that contain the phoneme sequence "ai" in the "middle" position.

(Effect of the invention) Because phoneme-level recognition is performed before word-level recognition, the number of words to be recognized can be reduced. This is effective in improving the recognition rate and reducing recognition processing time, particularly when a large vocabulary is to be recognized.

[Brief Description of the Drawings]

FIG. 1 is a diagram for explaining one embodiment of the present invention, and FIG. 2 is a diagram showing an example of candidate phoneme sequence extraction. 1: input data conversion unit, 2: input data temporary storage unit, 3: phoneme recognition unit, 4: phoneme partial sequence output unit, 5: phoneme notation dictionary, 6: word search unit, 7: word dictionary, 8: feature data loading unit, 9: word speech recognition unit. Patent applicant: Nippon Telegraph and Telephone Corporation

[FIG. 2: (a) candidate phoneme sequence with similarity scores (garbled in the source); (b) phoneme sequences exceeding the threshold of 80]

Claims

[Claims] A word speech recognition device that converts input speech into feature data in the same format as the data representing the speech features of each recognition target word stored in a word dictionary, calculates the similarity between the feature data of the input speech and the feature data of each recognition target word in the dictionary, and outputs, as the recognition result, the code of the recognition target word corresponding to feature data with a high similarity, the device being characterized by comprising:
input data storage means that converts the input speech into data in the same format as the data representing the speech features of each recognition target word stored in the word dictionary and temporarily stores it;
phoneme recognition means that recognizes the input speech in units of phonemes, which are smaller than words, and outputs the phoneme symbol sequence corresponding to the input speech as the recognition result;
phoneme partial sequence output means that extracts, from the phoneme symbol sequence output by the phoneme recognition means, the partial symbol sequence recognized with high confidence, and outputs that partial sequence together with information on the position at which it appears in the input word;
word search means that searches a dictionary, in which the phoneme notation of each recognition target word and the number of the corresponding recognition word template are recorded, for all words that contain the partial symbol sequence output by the phoneme partial sequence output means and satisfy its position information, and outputs the numbers of the recognition feature data corresponding to those words; and
word speech recognition means that retrieves from the word dictionary the recognition feature data corresponding to the word numbers output by the word search means, calculates its similarity to the data stored in the input data storage means, and outputs candidate words.