JPH05313697A - Speaker recognition system

Info

Publication number: JPH05313697A
Application number: JP4117380A
Authority: JP
Inventors: Shingo Nishimura; 新吾西村; Masayuki Unno; 雅幸海野
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1992-05-11
Filing date: 1992-05-11
Publication date: 1993-11-26

Abstract

PURPOSE: To learn a variety of phonemes from a short utterance by using speech with good phoneme balance. CONSTITUTION: A neural network is trained on the envelope of the short-time spectrum obtained from training speech. Speech with good phoneme balance is used as the training speech. For recognition, the same short-time spectral envelope is computed from an arbitrary utterance, and the resulting sequence is input to the network to obtain a sequence of network outputs. Each output vector indicates the speaker for one short-time input; the vectors are judged collectively over the entire sequence by sum, product, majority vote, or the like, to obtain a single recognition result.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a speaker recognition system (in particular, speaker verification) using a neural network.

[0002]

[Prior Art] Conventional speaker recognition systems mostly recognize the speaker only for utterance content learned in advance. Speaker recognition that does not restrict the utterance content relies on speaker information common to various phonemes, so it requires a fairly long utterance, and a high recognition rate is difficult to obtain. Furthermore, because the amount of training data is very large, training takes a long time.

[0003] Accordingly, the present applicant has already proposed speaker recognition systems that do not restrict the utterance content (Japanese Patent Application No. 2-243413, "Speaker Recognition System", and Japanese Patent Application No. 3-282843, "Speaker Recognition System").

[0004]

[Problem to be Solved by the Invention] However, the speaker recognition systems already proposed by the present applicant mainly concern the technique of determining which member of a limited group of people the speaker is (called speaker identification).

[0005] An object of the present invention is to obtain a high recognition rate from a relatively short utterance in speaker recognition (in particular, speaker verification) that does not restrict the utterance content.

[0006] A further object of the present invention is to obtain a high recognition rate from a relatively short utterance, and also to reduce the training burden, in speaker recognition (in particular, speaker verification) that does not restrict the utterance content.

[0007]

[Means for Solving the Problem] The speaker recognition system according to claim 1 is described first. A neural network is trained using the envelope of the short-time spectrum obtained from training speech. Phoneme-balanced speech is used as the training speech. At recognition time, the same short-time spectral envelope is computed from an arbitrary utterance, and the resulting sequence is input to the network to obtain a sequence of network outputs. Each output vector suggests the speaker for one short-time input. A single recognition result is obtained by making a collective judgment over the entire sequence, using the sum, product, majority vote, or the like.
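The combination step described in paragraph [0007] can be sketched as follows. This is a minimal illustration in Python, not the applicant's implementation: the function name `combine_outputs` and the sample activations are hypothetical, and the outputs are assumed to lie between 0 and 1.

```python
from math import prod

def combine_outputs(frames):
    """Combine per-frame network outputs into one score per speaker.

    frames: list of output vectors, one per short-time frame; each vector
    holds one activation per registered speaker (assumed to lie in [0, 1]).
    Returns the scores produced by the three combination rules in the text.
    """
    n_speakers = len(frames[0])
    # Sum rule: add each speaker's activation over all frames.
    sums = [sum(f[s] for f in frames) for s in range(n_speakers)]
    # Product rule: multiply each speaker's activation over all frames.
    products = [prod(f[s] for f in frames) for s in range(n_speakers)]
    # Majority rule: count the frames in which each speaker has the largest output.
    votes = [0] * n_speakers
    for f in frames:
        votes[f.index(max(f))] += 1
    return {"sum": sums, "product": products, "votes": votes}

# Hypothetical outputs for 4 frames and 3 registered speakers.
frames = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.7, 0.1],  # this single frame points to the wrong speaker
    [0.9, 0.2, 0.1],
]
scores = combine_outputs(frames)
best = {rule: values.index(max(values)) for rule, values in scores.items()}
print(best)
```

In this toy input one frame favors speaker 1, yet all three rules still select speaker 0, which is the kind of robustness the collective judgment is meant to provide.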

[0008] The speaker recognition system according to claim 2 is described next. A neural network is trained using the envelope of the short-time spectrum obtained from training speech. Phoneme-balanced speech is used as the training speech. At recognition time, the same short-time spectral envelope is computed from an arbitrary utterance, and the resulting sequence is input to the network to obtain a sequence of network outputs. Each output vector suggests the speaker for one short-time input. An output-vector selection threshold is provided so that only the highly reliable output vectors are selected, and a single recognition result is obtained by making a collective judgment over all the selected vectors, using the sum, product, majority vote, or the like.
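The selection of reliable output vectors can be sketched as below. The source does not say how reliability is measured, so this sketch assumes that a frame is reliable when its largest output exceeds the selection threshold; the function name `select_reliable` and all values are hypothetical.

```python
def select_reliable(frames, selection_threshold):
    """Keep only output vectors whose largest component exceeds the threshold.

    Assumption: reliability is measured by the peak activation of a frame;
    the source does not define the criterion.
    """
    return [f for f in frames if max(f) > selection_threshold]

# Hypothetical outputs for 3 frames and 3 registered speakers.
frames = [
    [0.9, 0.1, 0.2],
    [0.4, 0.5, 0.4],  # ambiguous frame: no output is clearly active
    [0.8, 0.3, 0.1],
]
reliable = select_reliable(frames, selection_threshold=0.6)
# Combine only the selected frames (sum rule shown here).
sums = [sum(f[s] for f in reliable) for s in range(3)]
print(len(reliable), sums)
```

The ambiguous middle frame is dropped before the scores are combined, so it no longer contributes to any speaker's total.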

[0009] The speaker recognition system according to claim 3 is described next. A neural network is trained using the envelope of the short-time spectrum obtained from training speech. At this stage, the amount of training data is reduced by clustering the data for each speaker. At recognition time, the same short-time spectral envelope is computed from an arbitrary utterance, and the resulting sequence is input to the network to obtain a sequence of network outputs. Each output vector suggests the speaker for one short-time input. A single recognition result is obtained by making a collective judgment over the entire sequence, using the sum, product, majority vote, or the like.

[0010] The speaker recognition system according to claim 4 is described next. A neural network is trained using the envelope of the short-time spectrum obtained from training speech. At this stage, the amount of training data is reduced by clustering the data for each speaker. At recognition time, the same short-time spectral envelope is computed from an arbitrary utterance, and the resulting sequence is input to the network to obtain a sequence of network outputs. Each output vector suggests the speaker for one short-time input. An output-vector selection threshold is provided so that only the highly reliable output vectors are selected, and a single recognition result is obtained by making a collective judgment over all the selected vectors, using the sum, product, majority vote, or the like.

[0011] The speaker recognition system according to claim 5 is described next. It sets two speaker-decision thresholds, one high and one low, for use when making a decision about the speaker. With the two thresholds, the speaker is judged to be a registered speaker only when the output corresponding to one particular speaker, and no other output, is activated (has a large output value) (see FIG. 1).
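The two-threshold decision of paragraph [0011] can be sketched as follows. The function name `verify_two_thresholds`, the parameter names `high` and `low`, and the threshold values are hypothetical; the source only states that one large and one small threshold are set.

```python
def verify_two_thresholds(scores, high, low):
    """Return the index of the accepted registered speaker, or None.

    scores: one combined score per registered speaker.
    The speaker is accepted only if the top score exceeds `high`
    and every other score stays below `low`.
    """
    top = max(range(len(scores)), key=lambda s: scores[s])
    others = [v for s, v in enumerate(scores) if s != top]
    if scores[top] > high and all(v < low for v in others):
        return top
    return None

print(verify_two_thresholds([0.95, 0.05, 0.10], high=0.8, low=0.3))  # one clear winner
print(verify_two_thresholds([0.95, 0.60, 0.10], high=0.8, low=0.3))  # a second output is too active
print(verify_two_thresholds([0.50, 0.05, 0.10], high=0.8, low=0.3))  # no output is active enough
```

Only the first call returns a speaker index; the other two return `None`, that is, the input is treated as coming from a non-registered speaker.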

[0012] The "neural network" in the present invention is described in (1) to (4) below.

[0013] (1) Neural networks can be broadly classified by structure into two types: the hierarchical network shown in FIG. 2(A) and the mutually connected network shown in FIG. 2(B). The present invention may be configured with either type, but the hierarchical network is the more useful because a simple learning algorithm, described later, has been established for it.

[0014] (2) Network structure
As shown in FIG. 3, the hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer consists of one or more units. Connections run only forward, from the input layer to the intermediate layer to the output layer, and there are no connections within a layer.

[0015] (3) Unit structure
As shown in FIG. 4, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, takes their sum, transforms the sum according to a fixed rule (the transfer function), and outputs the result. Each connection to another unit carries a variable weight that represents the strength of the connection.
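A single unit as described in item (3) can be sketched as below. The source does not name a transfer function, so the sigmoid used here is an assumption, as are the function names and the sample weights.

```python
import math

def sigmoid(u):
    """A common choice of transfer function (assumed; the source names none)."""
    return 1.0 / (1.0 + math.exp(-u))

def unit_output(inputs, weights, transfer=sigmoid):
    """Weighted sum of the inputs, followed by the transfer function."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return transfer(total)

# Hypothetical unit with two incoming connections.
y = unit_output(inputs=[1.0, 0.5], weights=[0.4, -0.8])
print(y)
```

Here the weighted sum is 0.4 - 0.4 = 0, so the sigmoid returns its midpoint value of 0.5.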

[0016] (4) Learning (backpropagation)
Training a network means bringing its actual output closer to a target value (the desired output). In general, training is performed by changing the transfer function and the weights of each unit shown in FIG. 4.

[0017] As the learning algorithm, it is possible to use, for example, the backpropagation described in Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, PARALLEL DISTRIBUTED PROCESSING, the MIT Press, 1986.
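As a rough illustration of backpropagation on a small hierarchical network, the sketch below trains one intermediate layer and one output layer by gradient descent on a squared-error loss. It is not the configuration used in the patent: the layer sizes, learning rate, bias terms, sigmoid transfer function, and toy data are all assumptions, and only the weights are adapted.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, w_hid, w_out):
    """Forward pass input -> hidden -> output; the last weight in each row is a bias."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hid]
    y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, y

def train_step(x, target, w_hid, w_out, rate):
    """One backpropagation update for a single pattern (squared-error loss)."""
    h, y = forward(x, w_hid, w_out)
    # Output-layer error terms: (y - t) times the sigmoid derivative y (1 - y).
    d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    # Hidden-layer error terms: propagate d_out back through the output weights.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    # Gradient-descent updates of both weight layers.
    for k, row in enumerate(w_out):
        for j, v in enumerate(h + [1.0]):
            row[j] -= rate * d_out[k] * v
    for j, row in enumerate(w_hid):
        for i, v in enumerate(x + [1.0]):
            row[i] -= rate * d_hid[j] * v

def total_error(data, w_hid, w_out):
    """Summed squared error over all patterns."""
    return sum(sum((yk - tk) ** 2 for yk, tk in zip(forward(x, w_hid, w_out)[1], t))
               for x, t in data)

random.seed(0)
# Hypothetical toy data: 2-dimensional inputs, one-hot targets for 2 speakers.
data = [([0.9, 0.1], [1.0, 0.0]), ([0.8, 0.2], [1.0, 0.0]),
        ([0.1, 0.9], [0.0, 1.0]), ([0.2, 0.7], [0.0, 1.0])]
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(3)]  # 3 hidden units
w_out = [[random.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(2)]  # 2 output units

before = total_error(data, w_hid, w_out)
for _ in range(2000):
    for x, t in data:
        train_step(x, t, w_hid, w_out, rate=0.5)
after = total_error(data, w_hid, w_out)
print(before, after)
```

The printed error after training should be well below the error before training, since each update moves the outputs toward the one-hot targets.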

[0018]

[Operation] In the speaker recognition system according to claim 1, the short-time spectral envelopes used for training each correspond to various phonemes or to the transitions between phonemes. Training the neural network to extract speaker information from them allows the system to handle arbitrary utterances.

[0019] Using phoneme-balanced speech allows a variety of phonemes to be learned from a short utterance.

[0020] Judging the sequence of output vectors collectively allows a correct overall decision to be made even when a decision based on a single output vector would be wrong, which improves the recognition rate.

[0021] Furthermore, in the speaker recognition system according to claim 2, selecting the highly reliable output vectors makes the collective judgment more dependable and improves the recognition rate.

[0022] In the speaker recognition system according to claim 3, the short-time spectral envelopes used for training each correspond to various phonemes or to the transitions between phonemes. Training the neural network to extract speaker information from them allows the system to handle arbitrary utterances.

[0023] Judging the sequence of output vectors collectively allows a correct overall decision to be made even when a decision based on a single output vector would be wrong, which improves the recognition rate.

[0024] Because clustering yields representative vectors for groups of data, and these representative vectors are used as the training data, the amount of training data can be reduced while the training effect is preserved. As a result, the training time of the neural network can be shortened considerably.

[0025] Furthermore, in the speaker recognition system according to claim 4, selecting the highly reliable output vectors makes the collective judgment more dependable and improves the recognition rate.

[0026] Furthermore, in the speaker recognition system according to claim 5, setting two speaker-decision thresholds, one high and one low, makes it possible to distinguish registered from non-registered speakers more accurately and improves the recognition rate.

[0027]

[Examples]

(First Example) For 5 registered speakers and 25 non-registered speakers, a phoneme-balanced short sentence for training is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, to obtain a sequence of 68-channel (1/12 octave) power vectors in the 100 to 5000 Hz band.
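The analysis parameters quoted above can be checked with a few lines of arithmetic. The sketch below derives the frame size in samples and confirms that 68 channels is consistent with 1/12-octave spacing between 100 and 5000 Hz. The geometric placement of channel edges starting at 100 Hz and the helper `frame_count` are assumptions; the source does not describe the filter layout or how partial frames are handled.

```python
import math

SAMPLE_RATE_HZ = 10_000
FRAME_LENGTH_S = 0.0256
FRAME_PERIOD_S = 0.0128
F_LOW_HZ, F_HIGH_HZ = 100.0, 5000.0
BANDS_PER_OCTAVE = 12

frame_length = round(FRAME_LENGTH_S * SAMPLE_RATE_HZ)  # samples per frame
frame_shift = round(FRAME_PERIOD_S * SAMPLE_RATE_HZ)   # samples between frame starts
octaves = math.log2(F_HIGH_HZ / F_LOW_HZ)              # about 5.64 octaves
n_channels = math.ceil(octaves * BANDS_PER_OCTAVE)     # 1/12-octave channels needed

# Channel edges, assuming geometric spacing that starts at 100 Hz.
edges = [F_LOW_HZ * 2 ** (k / BANDS_PER_OCTAVE) for k in range(n_channels + 1)]

def frame_count(n_samples):
    """Number of full frames in a signal of n_samples samples (no padding assumed)."""
    if n_samples < frame_length:
        return 0
    return 1 + (n_samples - frame_length) // frame_shift

print(frame_length, frame_shift, n_channels, frame_count(SAMPLE_RATE_HZ))
```

Under these assumptions a frame holds 256 samples, consecutive frames overlap by half, and one second of speech yields 77 input patterns for the network.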

[0028] These vectors are used as inputs to the neural network (68 input-layer units; each utterance yields as many input patterns as it has frames), and the network is trained sufficiently so that the corresponding output unit is activated only for a registered speaker.

[0029] For an arbitrary utterance, a sequence of power vectors is obtained in the same manner as in paragraph [0027]. This sequence is input to the network trained in paragraph [0028] to obtain the sequence of output vectors $\{x^1, x^2, \ldots, x^n\}$, where $x^t = (x^t_1, \ldots, x^t_5)$ and $n$ is the number of frames.

[0030] The following three methods are applied to the above vector sequence to decide whether the input comes from a registered or a non-registered speaker.

[0031] (1) If the maximum over s (s = 1 to 5) of $\sum_t x^t_s$ exceeds a preset speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0032] (2) If the maximum over s (s = 1 to 5) of $\prod_t x^t_s$ exceeds a preset speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0033] (3) For each s (s = 1 to 5), the frames t for which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ are counted. If the largest of these counts exceeds a preset speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.
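The three single-threshold decision rules can be written as one small function. The function name `decide`, the rule labels, the sample outputs, and the threshold values below are hypothetical and chosen only for illustration.

```python
from math import prod

def decide(frames, rule, threshold):
    """Return the accepted speaker index, or None for a non-registered speaker."""
    n = len(frames[0])
    if rule == "sum":        # method (1): sum over frames
        scores = [sum(f[s] for f in frames) for s in range(n)]
    elif rule == "product":  # method (2): product over frames
        scores = [prod(f[s] for f in frames) for s in range(n)]
    elif rule == "count":    # method (3): number of frames won by each speaker
        scores = [sum(1 for f in frames if f.index(max(f)) == s) for s in range(n)]
    else:
        raise ValueError(f"unknown rule: {rule}")
    best = max(range(n), key=lambda s: scores[s])
    return best if scores[best] > threshold else None

# Hypothetical network outputs for 3 frames and 5 registered speakers.
registered = [[0.9, 0.1, 0.1, 0.0, 0.1],
              [0.8, 0.2, 0.1, 0.1, 0.0],
              [0.7, 0.1, 0.2, 0.1, 0.1]]
impostor = [[0.2, 0.3, 0.1, 0.2, 0.1]] * 3

print(decide(registered, "sum", threshold=2.0))
print(decide(registered, "product", threshold=0.3))
print(decide(registered, "count", threshold=2))
print(decide(impostor, "sum", threshold=2.0))
```

Note that the sum and the count grow with the number of frames while the product shrinks, so in practice each threshold would have to be set with the utterance length in mind.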

[0034] The following methods may be used in place of the three methods above (corresponding to claim 5).

[0035] (1) If only the maximum over s (s = 1 to 5) of $\sum_t x^t_s$ exceeds a preset first speaker-decision threshold, and the values for all other s fall below a preset second speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0036] (2) If only the maximum over s (s = 1 to 5) of $\prod_t x^t_s$ exceeds a preset first speaker-decision threshold, and the values for all other s fall below a preset second speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0037] (3) For each s (s = 1 to 5), the frames t for which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ are counted. If the largest of these counts exceeds a preset first speaker-decision threshold, and the counts for all other s fall below a preset second speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0038] As an example of arbitrary utterances, a speaker recognition experiment was carried out with the training sentence 「彼は以前から、科学技術の進歩と人間の勇気が、はるかな宇宙への旅を可能にしたのだと考えていました。」 ("He had long believed that the progress of science and technology and human courage had made the journey into distant space possible.") and the three test words 「ただいま」 ("I'm home"), 「こんにちは」 ("Hello"), and 「おはようございます」 ("Good morning"). The 5 registered speakers used in training and 26 non-registered speakers not used in training were all recognized correctly.

[0039] (Second Example) For 5 registered speakers and 25 non-registered speakers, a training text is Fourier-analyzed at a sampling frequency of 10 kHz, a frame length of 25.6 msec, and a frame period of 12.8 msec, to obtain a sequence of 68-channel (1/12 octave) power vectors in the 100 to 5000 Hz band.

[0040] Hierarchical clustering is applied to these vectors to obtain about 200 representative vectors for each speaker.
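The data reduction step described above can be sketched with a simple agglomerative (bottom-up hierarchical) clustering. The source only says that hierarchical clustering is used, so the merge criterion (closest centroids), the function name `representative_vectors`, and the toy data are assumptions. This naive version is meant for illustration only and would be slow on thousands of frames.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def representative_vectors(vectors, n_representatives):
    """Merge the two closest clusters until n_representatives remain,
    then return the centroid of each remaining cluster.

    Assumption: clusters are compared by the distance between their centroids.
    """
    # Each cluster is stored as (centroid, member_count).
    clusters = [(list(v), 1) for v in vectors]
    while len(clusters) > n_representatives:
        # Find the pair of clusters with the closest centroids.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda p: squared_distance(clusters[p[0]][0], clusters[p[1]][0]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        # The merged centroid is the member-weighted mean of the two centroids.
        merged = [(x * ni + y * nj) / (ni + nj) for x, y in zip(ci, cj)]
        del clusters[j]  # j > i, so index i stays valid
        clusters[i] = (merged, ni + nj)
    return [c for c, _ in clusters]

# Hypothetical 2-dimensional vectors standing in for 68-channel power vectors.
vectors = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9], [5.0, 5.0]]
reps = representative_vectors(vectors, n_representatives=3)
print(reps)
```

In the second example the same idea would be applied separately to each speaker's frames, with the target count set to about 200.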

[0041] These representative vectors are used as inputs to the neural network (68 input-layer units; the number of input patterns equals the number of speakers multiplied by the number of representative vectors after clustering), and the network is trained sufficiently so that the corresponding output unit is activated only for a registered speaker.

[0042] For an arbitrary utterance, a sequence of power vectors is obtained in the same manner as in paragraph [0039]. This sequence is input to the network trained in paragraph [0041] to obtain the sequence of output vectors $\{x^1, x^2, \ldots, x^n\}$, where $x^t = (x^t_1, \ldots, x^t_5)$ and $n$ is the number of frames.

[0043] The following three methods are applied to the above vector sequence to decide whether the input comes from a registered or a non-registered speaker.

[0044] (1) If the maximum over s (s = 1 to 5) of $\sum_t x^t_s$ exceeds a preset speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0045] (2) If the maximum over s (s = 1 to 5) of $\prod_t x^t_s$ exceeds a preset speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0046] (3) For each s (s = 1 to 5), the frames t for which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ are counted. If the largest of these counts exceeds a preset speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0047] The following methods may be used in place of the three methods above (corresponding to claim 5).

[0048] (1) If only the maximum over s (s = 1 to 5) of $\sum_t x^t_s$ exceeds a preset first speaker-decision threshold, and the values for all other s fall below a preset second speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0049] (2) If only the maximum over s (s = 1 to 5) of $\prod_t x^t_s$ exceeds a preset first speaker-decision threshold, and the values for all other s fall below a preset second speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0050] (3) For each s (s = 1 to 5), the frames t for which $\max\{x^t_1, \ldots, x^t_5\} = x^t_s$ are counted. If the largest of these counts exceeds a preset first speaker-decision threshold, and the counts for all other s fall below a preset second speaker-decision threshold, the speaker is judged to be a registered speaker; otherwise, a non-registered speaker.

[0051]

[Effects of the Invention] As described above, according to the present invention, a high recognition rate can be obtained from a relatively short utterance in speaker recognition (in particular, speaker verification) that does not restrict the utterance content.

[0052] In addition, according to the present invention, a high recognition rate can be obtained from a relatively short utterance, and the training burden can be reduced, in speaker recognition (in particular, speaker verification) that does not restrict the utterance content.

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram showing the speaker-decision thresholds and the output values of the network.

FIG. 2 is a schematic diagram showing neural networks.

FIG. 3 is a schematic diagram showing a hierarchical neural network.

FIG. 4 is a schematic diagram showing the structure of a unit.

Continuation of the front page: (51) Int. Cl.⁵ G10L 3/00; identification code 531 K; internal reference number 8842-5H

Claims

[Claims]

[Claim 1] A speaker recognition system using a neural network, in which a sequence of vectors representing the envelope of the short-time spectrum is input, the sequence of network outputs is combined by the sum, product, majority vote, or the like of the recognition results of the individual outputs, and the combined result is compared with a speaker-decision threshold to obtain a single recognition result, the speaker recognition system being characterized in that phoneme-balanced sentences are used as the training data for the network.

[Claim 2] A speaker recognition system using a neural network, in which a sequence of vectors representing the envelope of the short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output-vector selection threshold, the selected output vectors are combined by the sum, product, majority vote, or the like of the recognition results of the individual outputs, and the combined result is compared with a speaker-decision threshold to obtain a single recognition result, the speaker recognition system being characterized in that phoneme-balanced sentences are used as the training data for the network.

[Claim 3] A speaker recognition system using a neural network, in which a sequence of vectors representing the envelope of the short-time spectrum is input, the sequence of network outputs is combined by the sum, product, majority vote, or the like of the recognition results of the individual outputs, and the combined result is compared with a speaker-decision threshold to obtain a single recognition result, the speaker recognition system being characterized in that the amount of training data for the network is reduced by clustering.

[Claim 4] A speaker recognition system using a neural network, in which a sequence of vectors representing the envelope of the short-time spectrum is input, output vectors are selected from the sequence of network outputs using an output-vector selection threshold, the selected output vectors are combined by the sum, product, majority vote, or the like of the recognition results of the individual outputs, and the combined result is compared with a speaker-decision threshold to obtain a single recognition result, the speaker recognition system being characterized in that the amount of training data for the network is reduced by clustering.

[Claim 5] The speaker recognition system according to any one of claims 1 to 4, characterized in that two speaker-decision thresholds are set, one high and one low.