JPS6151799B2

Info

Publication number: JPS6151799B2
Application number: JP55143118A
Authority: JP
Inventors: Hidekazu Shiratori; Yasuo Sato; Junichi Ichikawa; Takayuki Ooyama
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1980-10-14
Filing date: 1980-10-14
Publication date: 1986-11-10
Also published as: JPS5766499A

Description

[Detailed Description of the Invention]

The present invention relates to a method for learning standard speech registration patterns. In particular, it relates to a speech recognition method in which input speech is analyzed to extract its features, the features are compared with recognition information from a dictionary section, a plurality of word candidates likely to be the input speech are displayed in descending order of probability based on the comparison result, and the correct candidate is selected from among them. In this method, a standard registered pattern is trained on the input speech when certain conditions are satisfied, so that standard registered patterns with a high recognition rate can be created automatically.

At present, data input devices include typewriters, keyboards, punching devices, and touch input devices. Making full use of these devices requires considerable operator training, and even a trained operator becomes increasingly fatigued when operating the actual device continuously for long periods.

Converting the data to be input into a written or punched format that a data processing device can understand, using input devices such as those above, normally imposes a considerable burden. It would therefore be very convenient if input data could be entered directly in the form of speech, without such manual conversion.

Research on such speech input devices has achieved considerable results, and several devices have already been realized that take words uttered in isolation as the input and presuppose registration of the user's spoken words.

In conventional devices of this kind, however, when the words to be input have mutually similar acoustic features, it is difficult to recognize and distinguish them. The recognition rate falls, and the devices could not be called practical.

In particular, when the input consists of kana characters, nearly all phonemes must be distinguished, and at the current state of the art it is extremely difficult to achieve recognition at a practical level. The reasons for this difficulty include, as already mentioned,

(1) the existence of phonemes with mutually similar features,

and, in addition,

(2) the fact that the parameters characterizing phonemes vary owing to various factors, and

(3) the fact that the features of some phonemes have themselves not been fully elucidated,

among others.

Therefore, to realize speech input in a practical sense despite these problems, it suffices to avoid solving them directly: several character candidates that may correspond to the input speech are displayed, and only the correct character is selected from among them.

One method of displaying a plurality of character candidates and selecting the correct one from among them has been disclosed by the present applicant in the previously filed Japanese Unexamined Patent Publication No. Sho 53-77402.

In the invention of that application, a plurality of characters are displayed in descending order of the probability that they correspond to the input speech, but how to display the correct candidate reliably and without omission remains an open problem. The distribution of speech patterns is usually complex, and when the matching distance to the input pattern is used, a single registered pattern is very likely to cause misrecognition or missed recognition. For this reason, multiple registered patterns are generally set for each word as needed. For example, multiple registered patterns are given to words that are difficult to recognize, such as those in the "ga" row, "za" row, "da" row, and "ba" row. On the other hand, a single registered pattern is set for words that are relatively easy to recognize, such as "a", "ka", "yu", and "yo", to prevent the recognition processing speed from dropping as the number of registered patterns increases.

Fig. 1 shows the relationship between the distribution of speech patterns and the registered patterns. The regions enclosed by the solid lines A, B, and C represent the distributions of actual speech patterns. A₁ and A₂ are the registered patterns for word A, B₁ to B₃ are the registered patterns for word B, and C₁ is the registered pattern for word C. Among the registered patterns shown, those that lie outside the true speech pattern distribution, such as A₁, B₁, and B₃, should preferably be corrected so that they lie inside it, for example B₁ → B₁′.

The object of the present invention is to enable the above correction of registered patterns to be performed automatically and efficiently. To that end, the present invention concerns a method of performing speech recognition by analyzing input speech to extract its features, comparing them with recognition information from a dictionary section, displaying a plurality of word candidates likely to be the input speech in descending order of probability based on the comparison result, and selecting the correct candidate from among them. The invention is characterized in that learning from the input speech is performed only on the standard registered pattern that, among the plurality of standard registered patterns corresponding to the selected word, has the smallest matching distance to the input speech, and only when that distance is no greater than a fixed value.

The present invention is described below with reference to the drawings. Fig. 2 is a block diagram of a speech recognition processing device according to an embodiment of the present invention. In the figure, 1 is a spectrum analysis section, 2 is a pattern extraction section, 3 is an input pattern buffer, 4 is a matching determination section, 5 is a result display section, 6 is a selection key, 7 is a distance calculation section, 8 is a minimum value determination section, 9 is a pattern learning section, and 10 is a dictionary section.

A speech signal input from a microphone (MIC) is spectrum-analyzed in the spectrum analysis section 1. The spectrum analysis section 1 includes a bank of band-pass filters, a parameter extraction circuit, and so on. It extracts feature quantities (parameters) of the input speech, that is, monosyllabic speech, such as the moment M₁ corresponding to the first formant frequency, the moment M₂ corresponding to the second formant frequency, and also low-band power and high-band power. It further determines sample points for these feature quantities and obtains time-series information on them.

The parameter time-series information obtained by the spectrum analysis section 1 is input to the pattern extraction section 2. The pattern extraction section 2 extracts an input pattern representing the features of the input speech, holds it in the input pattern buffer 3, and activates the matching determination section 4. The matching determination section 4 sequentially reads out the contents of the dictionary section 10 and matches the patterns registered there against the input pattern held in the input pattern buffer 3. It identifies the registered patterns whose matching distance to the input pattern is no greater than a predetermined value, and sends the word or words corresponding to those registered patterns, together with their matching-distance rank information, to the result display section 5 as correct-answer word candidates. The result display section 5 displays the correct-answer word candidates in ascending order of matching distance, that is, in descending order of likelihood.
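The candidate selection performed by the matching determination section 4 and the result display section 5 can be sketched roughly as follows. This is a minimal illustration, not the patented circuit: the function names `matching_distance` and `rank_candidates`, the mean-absolute-difference distance, the equal-length patterns, and the choice to list each word once at its best distance are all assumptions, since the source does not specify how the matching distance is computed.

```python
from typing import Dict, List, Sequence, Tuple

Pattern = Sequence[float]


def matching_distance(a: Pattern, b: Pattern) -> float:
    """Hypothetical distance: mean absolute difference of equal-length patterns."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("patterns must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def rank_candidates(
    input_pattern: Pattern,
    dictionary: Dict[str, List[Pattern]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (word, distance) pairs within the threshold, closest first."""
    best: Dict[str, float] = {}
    for word, patterns in dictionary.items():
        for registered in patterns:
            d = matching_distance(input_pattern, registered)
            # Keep only patterns within the predetermined value, and
            # remember the smallest distance seen for each word (assumption).
            if d <= threshold and (word not in best or d < best[word]):
                best[word] = d
    # Ascending distance corresponds to descending likelihood.
    return sorted(best.items(), key=lambda item: item[1])
```

With a dictionary in which a word has several registered patterns, the word is listed at the distance of its closest pattern, and words whose patterns all exceed the threshold do not appear.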

The user looks at the displayed result and, if the word he or she wishes to input is present, operates the selection key 6 to select that word and store it in a storage section (not shown).

Meanwhile, when the word corresponding to the input speech has been identified by the signal from the selection key 6, the distance calculation section 7 reads out from the dictionary section 10 the registered patterns corresponding to the identified word and calculates their distances to the input pattern held in the input pattern buffer 3. The results are sent to the minimum value determination section 8, which selects the smallest of them and, if that minimum distance is no greater than a fixed value, activates the pattern learning section 9. The pattern learning section 9 performs pattern learning between the registered pattern having the minimum distance and the input pattern held in the input pattern buffer 3, and stores the learned pattern back in the dictionary section 10.
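The flow through the distance calculation section 7, the minimum value determination section 8, and the pattern learning section 9 might be sketched as below. The names `matching_distance` and `learn_selected_word` are hypothetical, the distance measure is an assumed mean absolute difference, and the patterns are assumed to have equal length; the update itself uses the arithmetic mean named in the claim.

```python
from typing import List, Optional


def matching_distance(a: List[float], b: List[float]) -> float:
    """Hypothetical distance: mean absolute difference of equal-length patterns."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("patterns must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def learn_selected_word(
    input_pattern: List[float],
    registered: List[List[float]],
    max_distance: float,
) -> Optional[int]:
    """Update the closest registered pattern of the selected word in place.

    Returns the index of the updated pattern, or None if no learning occurred.
    """
    if not registered:
        return None
    # Distance calculation: input pattern against every pattern of the word.
    distances = [matching_distance(input_pattern, p) for p in registered]
    # Minimum value determination: pick the closest registered pattern.
    idx = min(range(len(distances)), key=distances.__getitem__)
    if distances[idx] > max_distance:
        return None  # too far from every registered pattern: skip learning
    # Pattern learning: arithmetic mean of registered and input patterns.
    registered[idx] = [(r + x) / 2 for r, x in zip(registered[idx], input_pattern)]
    return idx
```

Only one registered pattern is ever modified per utterance, and the other patterns of the same word are left untouched.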

Fig. 3 shows an example of pattern learning: the process of learning between pattern A (the registered pattern) and pattern B (the input pattern) to create a new pattern C.

In Fig. 3, patterns A and B have different numbers of sample points, so a suitable technique is used to make one point of pattern B, which has fewer sample points, correspond to several points of pattern A, which has more, as illustrated. The arithmetic mean is then taken between corresponding sample points of patterns A and B to obtain pattern C, which becomes the new registered pattern.
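The averaging of patterns with different numbers of sample points can be illustrated with a small sketch. The source says only that "a suitable technique" establishes the correspondence, so the proportional index mapping used here is an assumption chosen for simplicity (a time-warping alignment would be another option), and the function name `average_patterns` is hypothetical. The sketch also assumes that the new pattern C keeps the number of sample points of pattern A.

```python
from typing import List


def average_patterns(pattern_a: List[float], pattern_b: List[float]) -> List[float]:
    """Average registered pattern A with input pattern B, point by point.

    Each sample point of A is paired with one point of B by proportional
    index mapping, so one point of a shorter B may serve several points of A.
    """
    n, m = len(pattern_a), len(pattern_b)
    if n == 0 or m == 0:
        raise ValueError("patterns must be non-empty")
    pattern_c = []
    for i, a in enumerate(pattern_a):
        j = (i * m) // n  # index of the corresponding sample point in B
        pattern_c.append((a + pattern_b[j]) / 2)  # arithmetic mean
    return pattern_c
```

For a four-point pattern A and a two-point pattern B, each point of B is paired with two consecutive points of A before the means are taken.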

In the pattern learning described above, learning is performed only on the registered pattern whose matching distance to the speech is smallest and no greater than a fixed value. The reason is that learning between entirely different patterns is very likely to produce a pattern far from the true speech pattern. For example, for speech pattern B in Fig. 1, suppose that B₁ represents parameters characteristic of male speakers and B₂ represents parameters characteristic of female speakers. Averaging B₁ and B₂ would clearly be counterproductive and would lower the recognition rate. The method of the present invention thus advances learning by seeking the optimum central value among similar patterns.
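The effect of averaging dissimilar patterns can be seen in a toy calculation. All numbers below are invented purely for illustration and do not come from the source; `b1` and `b2` stand in for hypothetical male and female registered patterns, and the distance measure is an assumed mean absolute difference.

```python
from typing import List


def distance(a: List[float], b: List[float]) -> float:
    """Hypothetical distance: mean absolute difference of equal-length patterns."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def mean_pattern(a: List[float], b: List[float]) -> List[float]:
    """Point-by-point arithmetic mean of two equal-length patterns."""
    return [(x + y) / 2 for x, y in zip(a, b)]


# Invented example values: two registered patterns for the same word.
b1 = [1.0, 1.0]   # stands in for a male-speaker pattern
b2 = [5.0, 5.0]   # stands in for a female-speaker pattern
merged = mean_pattern(b1, b2)

# A new utterance close to b1 matches b1 well but matches the merged
# pattern poorly, which is why dissimilar patterns are not averaged.
utterance = [1.2, 0.8]
d_before = distance(utterance, b1)
d_after = distance(utterance, merged)
```

In this sketch the merged pattern sits between the two clusters, so an utterance near either original pattern ends up farther from it than from the pattern it replaced.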

As described above, according to the present invention, registered patterns can be learned automatically and efficiently, so a dictionary section having registered patterns with a high recognition rate can be created easily.

[Brief Description of the Drawings]

Fig. 1 is a diagram showing the relationship between the distribution of speech patterns and the registered patterns, Fig. 2 is a block diagram of a speech recognition processing device according to an embodiment of the present invention, and Fig. 3 is a diagram showing an example of pattern learning.

In Fig. 2, 1 is a spectrum analysis section, 2 is a pattern extraction section, 3 is an input pattern buffer, 4 is a matching determination section, 5 is a result display section, 7 is a distance calculation section, 8 is a minimum value determination section, 9 is a pattern learning section, and 10 is a dictionary section.

Claims

[Claims]

1. A method for learning standard speech registration patterns, in a method of performing speech recognition processing by analyzing input speech to extract its features, comparing them with recognition information from a dictionary section, displaying a plurality of word candidates likely to be the input speech in descending order of probability based on the comparison result, and selecting the correct candidate from among them, characterized in that a new registered pattern is obtained as the arithmetic mean of the input pattern extracted from the input speech and a standard registered pattern, only for the standard registered pattern that, among the plurality of standard registered patterns corresponding to the selected word, has the smallest matching distance to the input speech and whose matching distance is no greater than a fixed value.