JPH0419700A

JPH0419700A - Method for matching voice pattern

Info

Publication number: JPH0419700A
Application number: JP2123745A
Authority: JP
Inventors: Junichiro Fujimoto; 潤一郎藤本
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1990-05-14
Filing date: 1990-05-14
Publication date: 1992-01-23
Anticipated expiration: 2015-01-11
Also published as: JP2997007B2

Abstract

PURPOSE: To achieve correct matching by converting the whole pattern to a predetermined length, also converting to that length the pattern that remains after the span from near the edge of a specific part to the adjacent end of the utterance is removed, comparing both with the reference patterns, and taking the result with the higher similarity as the similarity between patterns. CONSTITUTION: The feature pattern obtained from the input is expanded or contracted to the predetermined length and registered in a reference pattern storing memory 6; when registration is finished, a switch 5 is turned to the recognition side. During recognition, when there is a part whose power is lower than a fixed value, comparing parts 13 and 15 check whether its frequency components are concentrated in the high range and whether it lies at the head or the end of the pattern. An expanding/contracting part 16 converts the whole pattern to the predetermined length, and the processed data are stored in a memory 17. A collating part 7 computes the similarity or error between the stored pattern and each reference pattern, and a maximum similarity detecting part 8 finds the reference pattern with the maximum similarity and outputs a code or the like representing it as the recognition result. Patterns can thus be matched correctly without being expanded or contracted at matching time.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a speech pattern matching method, and more particularly to a pattern matching method used in speech recognition.

Prior Art

Most current speech recognition devices use pattern matching: an unknown input speech pattern is compared with pre-registered standard patterns, and the category of the most similar standard pattern is output as the recognition result.

FIG. 3 is a diagram for explaining an example of a conventional speech pattern matching method. In the figure, 1 is a microphone, 2 is a microphone amplifier, 3 is a feature conversion section, 4 is an A/D conversion section, 5 is a changeover switch, 6 is a standard pattern storage section, 7 is a matching section, 8 is a maximum similarity detection section, and 9 is a recognition result output section. First, the switch 5 is set to the standard pattern registration side (side a), and speech is input from the microphone 1. The speech, converted into an electrical signal by the microphone 1, is amplified by the microphone amplifier 2 and converted into features by the feature conversion section 3; several feature quantities, such as the spectrum, are known for this purpose. The features are converted into discrete values and stored in the standard pattern storage section 6 as a standard pattern. For recognition, the switch 5 is turned to the matching side (side b). A speech pattern is created in the same way as at registration and matched against all previously registered standard patterns, and the pattern with the highest similarity is taken as the recognition result.

Details of this recognition method and of the feature quantities are described in, for example, "Speech Recognition" by Niimi and are well known, so a detailed explanation is omitted here.

Within this framework, one problem is how to cope with pattern variation during matching.

This variation is mainly temporal and is affected by factors such as speaking rate. There are two countermeasures. One is nonlinear matching, represented by DP matching, which dynamically aligns the two patterns so as to maximize their similarity while evaluating that similarity. The other is linear matching, which does not check similarity at all: the two patterns are brought to the same time length by uniformly inserting or thinning out data and are then compared.

The former is accurate but computationally expensive, while the latter has the advantage of requiring very little computation. In the latter case in particular, if all patterns are kept at a fixed length, the input speech pattern needs to be length-adjusted only once, and no expansion or contraction is needed during matching. This method is quite effective when the speech pattern is complete, with no omissions or additions. However, speech actually expands and contracts nonlinearly, and linear expansion/contraction is only an approximation, so when the speech pattern has omissions or additions the matching accuracy becomes very poor.
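The linear length adjustment described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `linear_resample` and the nearest-index mapping are assumptions, chosen because repeating a source frame corresponds to uniform insertion and skipping one corresponds to thinning.

```python
def linear_resample(frames, target_len):
    """Stretch or shrink a frame sequence to target_len frames.

    Source frames are repeated (insertion) or skipped (thinning) at
    evenly spaced positions; no similarity is consulted.
    """
    if not frames or target_len <= 0:
        raise ValueError("need a non-empty pattern and a positive length")
    n = len(frames)
    # Output index i reads the source frame at the same relative position.
    return [frames[(i * n) // target_len] for i in range(target_len)]


shrunk = linear_resample([1, 2, 3, 4], 2)     # thinning
stretched = linear_resample([1, 2], 4)        # insertion
```

Because every pattern ends up with the same number of frames, two patterns can afterwards be compared frame by frame without any further alignment.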

FIG. 4 shows how speech energy changes over time. As shown in the figure, when two speech patterns of the same word "staff" are both normal and are compared after linear expansion/contraction, the error between them can be made small, as in (a). However, as in (b), if speech-interval detection fails and the /f/ of one pattern is lost, leaving "sta", then different sounds are aligned with each other near the end of the speech even though the word is the same, and the difference between the two patterns becomes very large.
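The misalignment can be made concrete with a toy example. The sketch below is illustrative only: frames are stand-in phoneme labels rather than real feature vectors, the frame counts are invented, and `linear_resample` and `mismatches` are hypothetical helper names.

```python
def linear_resample(frames, target_len):
    # Uniform insertion/thinning to a fixed number of frames.
    n = len(frames)
    return [frames[(i * n) // target_len] for i in range(target_len)]


def mismatches(a, b):
    # Number of frame positions whose labels differ.
    return sum(1 for x, y in zip(a, b) if x != y)


LENGTH = 12
full = list("ssttaaaaffff")   # "staff" with the /f/ detected
clipped = list("ssttaaaa")    # same word, /f/ lost by interval detection

reference = linear_resample(full, LENGTH)
errors_full = mismatches(linear_resample(full, LENGTH), reference)
errors_clipped = mismatches(linear_resample(clipped, LENGTH), reference)

# Comparing against a reference that also lacks the /f/ removes the errors.
reference_trimmed = linear_resample(list("ssttaaaa"), LENGTH)
errors_trimmed = mismatches(linear_resample(clipped, LENGTH), reference_trimmed)
```

In this toy case the clipped utterance disagrees with the full reference at 7 of 12 positions, although only the final consonant is missing; the errors spread because the linear stretch shifts every later frame.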

Low-energy consonants such as the /f/ in "staff" often defeat speech-interval detection, so the above problem occurs very frequently. Some pattern matching methods based on nonlinear expansion/contraction leave the end points free and can match accurately even when /f/ is missing. As noted above, however, methods based on nonlinear expansion/contraction still require a large amount of computation.

As another countermeasure, there is a method in which the unstable parts of the standard patterns, such as parts prone to omission, are marked. When the input speech contains an unstable part, matching is performed with the unstable parts of the standard patterns left in place; when the input speech contains no unstable part, the unstable parts are removed from all standard patterns before matching. However, because this method alters the standard patterns according to the input pattern, it has the drawback that the standard patterns must be modified every time matching is performed.

Purpose

The present invention has been made in view of the above circumstances. Its particular object is to enable correct matching by the linear expansion/contraction method, which requires little computation, even when speech-interval detection has not succeeded.

Constitution

To achieve the above object, the present invention provides a speech pattern matching method in which feature quantities are extracted from a speech signal to form a feature pattern and matching is performed with the time length made constant. The method is characterized in that, when a specific part in which the speech energy is lower than that of a vowel and the frequency spectrum components are concentrated in the high range is found at the beginning or end of an input unknown speech, the whole pattern is converted to a predetermined length; the pattern remaining after removal of the portion from near the end of the specific part to the beginning, or of the portion from near the end of the specific part to the end, is also converted to the predetermined length; both are held; both are matched against a standard pattern; and the result with the higher similarity is defined as the similarity between the patterns. The invention is described below on the basis of an embodiment.

FIG. 1 is a flowchart for explaining one embodiment of the present invention, and FIG. 2 is a block diagram for implementing the invention shown in FIG. 1. In the figures, 11 is an expansion/contraction section, 12 is a power calculation section, 13 is a comparison section, 14 is a high-band spectrum calculation section, 15 is a comparison section, 16 is an expansion/contraction section, 17 is a memory, and 18 and 19 are thresholds.

The present invention focuses on the fact that consonants whose speech interval is difficult to detect have low energy and frequency components concentrated in the high range. Specifically, in a speech pattern matching method in which feature quantities are extracted from a speech signal to form a feature pattern and matching is performed with the time length made constant, when a specific part in which the speech energy is lower than that of a vowel and the frequency spectrum components are concentrated in the high range is found at the beginning or end of an input unknown speech, the whole pattern is converted to a predetermined length. The pattern remaining after removal of the portion from near the end of the specific part to the beginning, or of the portion from near the end of the specific part to the end, is also converted to the predetermined length. Both are held, both are matched against a standard pattern, and the result with the higher similarity is defined as the similarity between the patterns.

The method is first explained with reference to the flowchart of FIG. 1. In the speech registration flow, the whole of the input speech is converted to a fixed length and registered as a standard pattern. In the speech recognition flow, the input speech is converted into a feature pattern by the same procedure as for the standard patterns, and its beginning and end are checked for a specific part (that is, a part where the speech energy is relatively low and the frequency components are concentrated in the high range).

Whether the speech energy is low is determined by checking whether the energy at the beginning or end falls below a fixed value; this value may be set to about 1/5 of the energy observed when a vowel is input. Whether the frequency components are concentrated in the high range can be checked in various ways. For example, the analysis frequency band can be divided into two, and the condition is judged to hold when the high band contains several times the components of the low band; alternatively, a straight line can be fitted to the spectral distribution along the frequency axis, and the condition is judged to hold when its slope is negative.
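The two checks can be sketched as below, using the band-splitting criterion. This is a hedged illustration under several assumptions that are not in the source: each frame is a list of band energies ordered from low to high frequency, the loudest frame of the utterance stands in for the vowel energy, and `energy_ratio` and `high_low_factor` are hypothetical tunables whose defaults are placeholders.

```python
def frame_is_weak_high(frame, energy_floor, high_low_factor):
    """True if the frame is low in energy and dominated by its upper bands."""
    energy = sum(frame)
    half = len(frame) // 2
    low, high = sum(frame[:half]), sum(frame[half:])
    return energy < energy_floor and high >= high_low_factor * low


def weak_edge_lengths(frames, energy_ratio=0.2, high_low_factor=2.0):
    """Count consecutive weak, high-band frames at the head and at the tail."""
    # Assumption: the peak frame energy approximates the vowel energy.
    floor = energy_ratio * max(sum(f) for f in frames)

    def run_length(seq):
        count = 0
        for frame in seq:
            if not frame_is_weak_high(frame, floor, high_low_factor):
                break
            count += 1
        return count

    return run_length(frames), run_length(reversed(frames))


vowel = [50, 40, 10, 0]      # strong, low-band energy
fricative = [1, 1, 5, 6]     # weak, high-band energy
head_len, tail_len = weak_edge_lengths([vowel] * 3 + [fricative] * 2)
```

A non-zero `head_len` or `tail_len` marks a candidate specific part at that end of the utterance; a weak frame whose energy sits in the low band is not counted.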

If there is no part at the beginning or end of the speech with low energy and frequency components concentrated in the high range, that is, if there is no specific part, the handling of this speech is complete. If there is one, whether it lies at the beginning or the end is determined, that is, where in the speech an easily dropped sound such as the /f/ above is attached. A pattern with this sound dropped is then prepared in advance as well: the span from the low-energy, high-band part out to the adjacent end of the speech is removed, the remainder is converted to the fixed length, and the result is held in the buffer memory in the same way as the input pattern.
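Holding the full pattern together with the trimmed one might look like the sketch below. The names `candidate_patterns`, `weak_at`, and `weak_len` are hypothetical, and the position and length of the specific part are assumed to have been determined beforehand; the length adjustment is the same uniform insertion/thinning used for the standard patterns.

```python
def linear_resample(frames, target_len):
    # Uniform insertion/thinning to a fixed number of frames.
    n = len(frames)
    return [frames[(i * n) // target_len] for i in range(target_len)]


def candidate_patterns(frames, target_len, weak_at=None, weak_len=0):
    """Return the fixed-length patterns to hold for one input utterance.

    The first entry is always the whole utterance. When a specific part of
    weak_len frames was found at the "head" or "tail", a second entry with
    that part removed is added.
    """
    candidates = [linear_resample(frames, target_len)]
    if 0 < weak_len < len(frames):
        if weak_at == "head":
            candidates.append(linear_resample(frames[weak_len:], target_len))
        elif weak_at == "tail":
            candidates.append(linear_resample(frames[:-weak_len], target_len))
    return candidates


utterance = [10, 11, 12, 13, 1, 1]   # last two frames form the weak part
held = candidate_patterns(utterance, 4, weak_at="tail", weak_len=2)
```

Both entries have the same length, so either can be compared directly with a standard pattern without further adjustment.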

The input is then matched against all registered standard patterns. For speech whose leading or trailing consonant is easily dropped, two input patterns are produced, so matching is possible even when part of a standard pattern is missing, and recognition accuracy can be improved.

FIG. 2 is a block diagram for implementing the invention described above. Up to the point where the speech from the microphone 1 is feature-converted and turned into discrete values, it is the same as the prior art shown in FIG. 3.

Registration is explained first. With the switch 5 set to the registration side (side a), the obtained pattern of feature quantities (the feature pattern) is expanded or contracted to the predetermined length and registered in the standard pattern storage memory 6. When all the speech to be registered has been stored in the standard pattern storage section in this way, the switch 5 is turned to the recognition side (side b) for recognition.

For recognition, the speech signal is fed to the power calculation section 12, either after it has been converted into a feature pattern as in registration or before that conversion. The comparison sections 13 and 15 then check whether there is a part whose power is below a fixed value, whether the frequency components of such a part are concentrated in the high range, and whether it lies at the beginning or the end. The expansion/contraction section 16 converts the whole pattern to the fixed length, and the result is held in the memory 17. If a low-energy part with frequency components concentrated in the high range exists at the beginning or end of the speech, that part is removed as shown in the flowchart of FIG. 1, the trimmed pattern is again converted to the fixed length by the expansion/contraction section, and the result is likewise stored in the memory 17.

Matching computes the similarity between the pattern or patterns held in the memory and each standard pattern. When two patterns are held in the memory, the similarity is computed twice for each standard pattern, and the higher value is adopted as the similarity between the input and that standard pattern. The figure shows two expansion/contraction sections, but they need only have the same function and may be one and the same unit.
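Scoring one standard pattern against the held input patterns and keeping the higher value can be sketched as follows. The function names are hypothetical, and `neg_abs_diff` is only a placeholder similarity (a negated sum of absolute differences over scalar frames) chosen to keep the example short.

```python
def best_similarity(candidates, reference, similarity):
    """Score every held input pattern against one standard pattern and
    return the highest score."""
    return max(similarity(c, reference) for c in candidates)


def neg_abs_diff(a, b):
    # Placeholder similarity: larger (closer to zero) means more alike.
    return -sum(abs(x - y) for x, y in zip(a, b))


full = [1, 1, 3, 5, 9, 9]       # whole input, fixed length
trimmed = [1, 3, 3, 5, 5, 7]    # input with the weak part removed
reference = [1, 3, 3, 5, 5, 8]  # one standard pattern

score = best_similarity([full, trimmed], reference, neg_abs_diff)
```

When only one pattern is held, the same call reduces to a single similarity computation, so the extra cost is at most one additional comparison per standard pattern.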

The matching section 7 is not limited to any particular matching method: the difference between patterns may be obtained by the city-block distance, or the similarity may be computed from the inner product of vectors. The similarity, or the error, between the unknown input pattern and each standard pattern is obtained. The maximum similarity detection section 8 finds the standard pattern showing the greatest similarity and outputs its name, or a symbol or the like representing it, as the recognition result.
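The two measures mentioned above, and the selection of the best standard pattern, might be written as in this sketch. It assumes fixed-length patterns stored as lists of equal-length frame vectors; the dictionary of labelled standard patterns and the function names are illustrative assumptions.

```python
def city_block_distance(a, b):
    """Sum of absolute differences over all frames (smaller is more alike)."""
    return sum(abs(x - y) for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def inner_product(a, b):
    """Sum of element-wise products over all frames (larger is more alike)."""
    return sum(x * y for fa, fb in zip(a, b) for x, y in zip(fa, fb))


def recognize(input_pattern, references):
    """Return the label of the standard pattern with the smallest
    city-block distance to the input."""
    return min(
        references,
        key=lambda label: city_block_distance(input_pattern, references[label]),
    )


references = {
    "staff": [[1, 0], [0, 1]],
    "stop": [[1, 1], [1, 1]],
}
unknown = [[1, 0], [0, 2]]
label = recognize(unknown, references)
```

With an inner-product similarity the selection would use `max` instead of `min`, since a larger value then indicates a closer match.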

With this method, a speech pattern with part of the speech removed is also converted to the fixed length and prepared in advance, so when a consonant or the like at the beginning or end of the input speech has been dropped, matching can be done against this pattern. The amount of computation is therefore smaller than with methods that expand and contract patterns during matching, and recognition accuracy can be improved.

Effects

As is clear from the above description, according to the present invention, correct matching can be performed without expansion or contraction at matching time, even when speech-interval detection has not succeeded.

[Brief Description of the Drawings]

FIG. 1 is a flowchart for explaining one embodiment of the present invention; FIG. 2 shows an example of a block diagram used to implement the present invention; FIG. 3 is a block diagram of general pattern matching; and FIG. 4 is a diagram for explaining the alignment obtained when a weak consonant is detected and the alignment obtained when it is not.

1: microphone; 2: microphone amplifier; 3: feature conversion section; 4: A/D conversion section; 5: changeover switch; 6: standard pattern storage section; 7: matching section; 8: maximum similarity detection section; 9: recognition result output section; 11: expansion/contraction section; 12: power calculation section; 13: comparison section; 14: high-band spectrum calculation section; 15: comparison section; 16: expansion/contraction section; 17: memory; 18, 19: threshold sections.

Claims

[Claims]

1. A speech pattern matching method in which feature quantities are extracted from a speech signal to form a feature pattern and matching is performed with the time length made constant, characterized in that, when a specific part in which the speech energy is lower than that of a vowel and the frequency spectrum components are concentrated in the high range is found at the beginning or end of an input unknown speech, the whole pattern is converted to a predetermined length; the pattern remaining after removal of the portion from near the end of the specific part to the beginning, or of the portion from near the end of the specific part to the end, is also converted to the predetermined length; both are held; both are matched against a standard pattern; and the result with the higher similarity is defined as the similarity between the patterns.