JP2510301B2

JP2510301B2 - Speaker recognition system

Info

Publication number: JP2510301B2
Application number: JP1298503A
Authority: JP
Inventors: 和彦岡下; 新吾西村; 正志宮川
Original assignee: Sekisui Chemical Co Ltd
Current assignee: Sekisui Chemical Co Ltd
Priority date: 1989-11-16
Filing date: 1989-11-16
Publication date: 1996-06-26
Anticipated expiration: 2011-06-26
Also published as: JPH03157698A

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a speaker recognition system suitable for verifying the speaker of an input voice in, for example, an electronic lock.

[Prior Art] A conventional speaker recognition system, as described for example in Japanese Examined Patent Publication No. 56-13956, operates by the following procedure.

A feature quantity characterizing the speaker is extracted from the input voice.

The distance between a standard pattern, extracted in advance in the same manner, and the feature quantity extracted above is calculated.

If the distance calculated above is smaller than a preset threshold, the current input speaker is judged to be the registered speaker of that standard pattern.
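The prior-art procedure above (extract features, measure the distance to a stored standard pattern, compare with a threshold) can be sketched as follows. This is a minimal illustration only: the source does not name the distance measure, so the Euclidean distance is an assumption, and the function names and the numeric values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Distance between two feature vectors (Euclidean is an assumption)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def is_registered_speaker(features, standard_pattern, threshold):
    """Accept the speaker only if the distance is smaller than the threshold."""
    return euclidean_distance(features, standard_pattern) < threshold


# Hypothetical feature vectors and threshold, for illustration only.
standard = [1.0, 2.0]
print(is_registered_speaker([1.0, 2.5], standard, threshold=1.0))
print(is_registered_speaker([4.0, 6.0], standard, threshold=1.0))
```

Because the standard pattern is fixed at registration time, nothing in this scheme follows changes in the speaker's voice, which is the first problem the description goes on to raise.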

[Problems to be Solved by the Invention] However, the conventional speaker recognition system described above has the following two problems.

The recognition rate deteriorates as time passes after the standard pattern is created. For example, after three months the recognition rate falls from 100.0% to 85.0%.

Real-time processing is difficult. To secure a recognition rate above a certain level, complex feature quantities must be used, but extracting them requires a complex processing device and a long processing time.

An object of the present invention is to obtain a speaker recognition system whose recognition rate deteriorates very little over time and which can easily operate in real time.

[Means for Solving the Problems] The invention according to claim 1 is a speaker recognition system using a neural network, in which a registered-speaker recognition threshold and an additional-learning threshold are set for the output value of the output unit corresponding to the registered speaker. The current input speaker is judged to be the registered speaker on condition that the output value is larger than the registered-speaker recognition threshold, and additional learning of the neural network is performed using the current input voice data on condition that the output value is larger than the registered-speaker recognition threshold and smaller than the additional-learning threshold.

The invention according to claim 2 uses, as input to the neural network, one or more of: the temporal change of the frequency characteristics of the voice; the average linear prediction coefficients of the voice; the average PARCOR coefficients of the voice; the average frequency characteristics and the pitch frequency of the voice; the average frequency characteristics of a voice waveform subjected to high-frequency emphasis; and the average frequency characteristics of the voice.

The invention according to claim 3 is one in which the neural network is a hierarchical neural network.

[Operation] (1) The recognition rate deteriorates very little over time. This is confirmed by the experimental results described later, and is presumed to be because a neural network can take a structure that is not easily affected by variation of the voice between different points in time.

(2) For the output value of the output unit of the neural network that corresponds to the registered speaker, an additional-learning threshold is provided in addition to the registered-speaker recognition threshold. That is, even when the output value exceeds the registered-speaker recognition threshold, so that the input speaker can be judged to be the registered speaker, additional learning of the neural network is performed using the current input voice data if the output value does not exceed the additional-learning threshold, which is larger than the registered-speaker recognition threshold. The neural network can thus be updated before the recognition rate deteriorates even if the speaker's characteristics change over time, and as a result a speaker recognition system that is robust to changes of the voice over time can be configured.

(3) In principle, the arithmetic processing of a neural network as a whole is simple and fast.

(4) In principle, the units constituting a neural network operate independently, so parallel arithmetic processing is possible. The arithmetic processing is therefore fast.

(5) Owing to (3) and (4) above, the speaker recognition system can easily operate in real time without a complex processing device.

According to the invention of claim 2, the following effect (6) is obtained in addition to effects (1) to (5) above.

(6) Since one or more of the elements listed in claim 2 are used as input to the neural network, the preprocessing needed to obtain the input is simpler than conventional complex feature extraction, and the time it requires is short.

According to the invention of claim 3, the following effect (7) is obtained in addition to effects (1) to (6) above.

(7) For hierarchical neural networks, a simple learning algorithm (backpropagation), described later, is already established, so a neural network capable of a high recognition rate can easily be formed.

[Embodiments] FIG. 1 is a schematic diagram showing an example of a speaker recognition system to which the present invention is applied, FIG. 2 is a schematic diagram showing an example of the voice processing unit and the neural network, FIG. 3 is a schematic diagram showing an input voice, FIG. 4 is a schematic diagram showing the output of the band-pass filter, FIG. 5 is a schematic diagram showing neural networks, FIG. 6 is a schematic diagram showing a hierarchical neural network, and FIG. 7 is a schematic diagram showing the structure of a unit.

Before the specific embodiment of the present invention is described, the configuration of neural networks and the learning algorithm are explained.

(1) By structure, neural networks can be broadly classified into two types: the hierarchical network shown in FIG. 5(A) and the interconnected network shown in FIG. 5(B). The present invention may be configured with either type, but the hierarchical network is more useful because a simple learning algorithm, described later, is established for it.

(2) Network structure
As shown in FIG. 6, a hierarchical network has a layered structure consisting of an input layer, an intermediate layer, and an output layer. Each layer consists of one or more units. The only connections are forward connections from the input layer to the intermediate layer to the output layer; there are no connections within a layer.

(3) Unit structure
As shown in FIG. 7, a unit is a model of a neuron in the brain, and its structure is simple. It receives inputs from other units, takes their sum, transforms the sum according to a fixed rule (the transformation function), and outputs the result. Each connection to another unit carries a variable weight representing the strength of the connection.
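A single unit as described above can be sketched in a few lines. The source only speaks of "a fixed rule (transformation function)"; the sigmoid used here and the bias term are assumptions commonly paired with backpropagation, and the function names are hypothetical.

```python
import math


def sigmoid(u):
    """One possible transformation function (an assumption, not from the source)."""
    return 1.0 / (1.0 + math.exp(-u))


def unit_output(inputs, weights, bias=0.0, transform=sigmoid):
    """Weighted sum of the inputs from other units, then the transformation."""
    if len(inputs) != len(weights):
        raise ValueError("one weight is needed per input connection")
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return transform(total)


# Two input connections whose weighted sum is 0.5 - 0.5 = 0, so sigmoid gives 0.5.
print(unit_output([1.0, 2.0], [0.5, -0.25]))
```

Since each unit depends only on its own inputs and weights, all units in one layer can be evaluated independently, which is the basis of the parallel-processing remark in effect (4).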

(4) Learning (backpropagation)
Learning in a network means bringing the actual output closer to the target value (the desired output). In general, learning is performed by changing the transformation function and the weights of each unit shown in FIG. 7.

As the learning algorithm, backpropagation as described in Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, PARALLEL DISTRIBUTED PROCESSING, the MIT Press, 1986, can be used, for example.
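As a rough illustration of the learning described above, the following sketch trains a small three-layer network by backpropagation on a squared-error loss. It is a minimal sketch under stated assumptions, not the patent's implementation: the sigmoid units, the learning rate, the hidden-layer size, the weight initialization, and all names (`ThreeLayerNet`, `train`) are hypothetical, and only the weights and biases are adjusted, whereas the source also mentions changing the transformation function.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


class ThreeLayerNet:
    """Input -> intermediate -> output, forward connections only."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One row of weights per unit; the last entry of each row is the bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = h + [1.0]
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return h, o

    def train_step(self, x, target, lr=0.5):
        """One backpropagation update; returns the loss before the update."""
        h, o = self.forward(x)
        # Error signal of each output unit for E = 1/2 * sum((o - t)^2).
        d_out = [(ok - tk) * ok * (1.0 - ok) for ok, tk in zip(o, target)]
        # Propagate the error back through the (not yet updated) output weights.
        d_hid = [hj * (1.0 - hj) * sum(d_out[k] * self.w_out[k][j]
                                       for k in range(len(d_out)))
                 for j, hj in enumerate(h)]
        for k, row in enumerate(self.w_out):
            for j, v in enumerate(h + [1.0]):
                row[j] -= lr * d_out[k] * v
        for j, row in enumerate(self.w_hidden):
            for i, v in enumerate(list(x) + [1.0]):
                row[i] -= lr * d_hid[j] * v
        return sum((ok - tk) ** 2 for ok, tk in zip(o, target)) / 2.0


def train(net, samples, repeats, lr=0.5):
    """Repeat the learning operation; return the first and last step losses."""
    first = last = None
    for _ in range(repeats):
        for x, target in samples:
            last = net.train_step(x, target, lr)
            if first is None:
                first = last
    return first, last
```

A call such as `train(ThreeLayerNet(2, 2, 2), [([1.0, 0.0], (1.0, 0.0)), ([0.0, 1.0], (0.0, 1.0))], repeats=2000)` drives the outputs toward the two target pairs; the toy inputs here are placeholders, not voice features.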

A specific embodiment of the present invention is described below.

As shown in FIG. 1, the speaker recognition system 10 comprises a voice input unit 11, a voice processing unit 12, a neural network 13, a determination unit 14, a memory unit 15, a network control unit 16, and a device control unit 17.

(1) A registration voice is input to the voice input unit 11. Here, both the learning word and the input word are "tadaima" (タダイマ).

There are 9 registered speakers and 27 impostors.

(2) The voice processing unit 12 applies simple preprocessing to the input voice of (1).

The preprocessing result is transferred to the neural network 13 for the current speaker recognition and also to the memory unit 15 in case additional learning is needed.

(3) The neural network 13 performs the learning operation and the evaluation operation described below.

Learning
The target values (the target output values of the output units constituting the output layer) are set to (1,0) for registered speakers and (0,1) for impostors.

The input voice "tadaima" of a registered speaker is preprocessed by the voice processing unit 12, and the preprocessing result is input to the neural network 13. The transformation function and weights of each unit of the neural network 13 are then corrected so that the output values of the neural network 13 (the output values of the output units constituting the output layer) approach the target values.

This learning operation is repeated, for example, 30,000 times.

Evaluation
The input voice of the current speaker is preprocessed, the preprocessed voice is input to the neural network 13, and the output values (X, Y) of the neural network are obtained.

The output values (X, Y) of the neural network 13 are then transferred to the determination unit 14.

(4) The determination unit 14 sets thresholds θ1, θ2, and θ3 (θ1 > θ2) for the output values (X, Y) of the neural network 13.

θ1 is the additional-learning threshold, θ2 is the registered-speaker recognition threshold, and θ3 is the impostor recognition threshold.

Using these thresholds, the determination unit 14 performs the following determination operations.

On condition that [X > θ2 and Y < θ3], the determination unit 14 judges the current input speaker to be a registered speaker and outputs a registered-speaker determination signal to the device control unit 17.

On condition that [X > θ2 and Y > θ3], [X < θ2 and Y > θ3], or [X < θ2 and Y < θ3], the determination unit 14 judges the current input speaker to be an impostor and outputs an impostor determination signal to the device control unit 17.

Only when the speaker has been judged to be a registered speaker as above, the determination unit 14 further performs the following processing (a) and (b).

(a) On condition that [X < θ1], the determination unit 14 outputs an additional-learning execution signal to the network control unit 16 so that additional learning of the neural network 13 is performed using the current input voice data.

(b) When [X > θ1], the determination unit 14 does nothing.
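The determination rules above reduce to two comparisons plus one extra check. The sketch below is one possible reading: the threshold values are hypothetical (the source gives none), the names `Thresholds` and `judge` are illustrative, and boundary cases such as X exactly equal to θ2, which the source does not specify, are treated here as "impostor" and "no additional learning".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    theta1: float  # additional-learning threshold
    theta2: float  # registered-speaker recognition threshold
    theta3: float  # impostor recognition threshold

    def __post_init__(self):
        if not self.theta1 > self.theta2:
            raise ValueError("the source requires theta1 > theta2")


def judge(x, y, th):
    """Return (is_registered, needs_additional_learning) for outputs (X, Y)."""
    is_registered = x > th.theta2 and y < th.theta3
    # Additional learning is considered only for accepted speakers, case (a).
    needs_additional_learning = is_registered and x < th.theta1
    return is_registered, needs_additional_learning


# Hypothetical thresholds, chosen only to make the three regions visible.
th = Thresholds(theta1=0.9, theta2=0.7, theta3=0.3)
print(judge(0.95, 0.05, th))  # confident accept, case (b)
print(judge(0.80, 0.10, th))  # marginal accept, case (a)
print(judge(0.80, 0.50, th))  # impostor
```

The band θ2 < X < θ1 is where the speaker is still accepted but the margin has shrunk, so retraining is triggered before recognition actually fails.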

(5) The device control unit 17 controls a device according to the registered-speaker determination signal, which is based on the determination result of the determination unit 14.

The device is, for example, an electronic lock, which is unlocked on the basis of the registered-speaker determination signal.

(6) On receiving the additional-learning execution signal, which is based on the determination result of the determination unit 14, the network control unit 16 decides to perform additional learning of the neural network 13. The network control unit 16 then retrieves the current input voice data from the memory unit 15, re-inputs it to the neural network 13, and corrects the transformation function and weights of each unit of the neural network 13 so that the output values (X, Y) of the neural network 13 for this input approach the target values (1,0) for registered speakers given in (3) above. The network control unit 16 repeats this additional learning operation, for example, 30,000 times.
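The additional-learning flow in (6) can be sketched as a small controller that keeps the latest preprocessed input and retrains toward the registered-speaker target (1,0) when signalled. Everything here is illustrative: `TinyNet` is a deliberately simplified stand-in with two sigmoid output units and no intermediate layer, not the three-layer network of the embodiment, and the class names, learning rate, and feature values are assumptions. Only the target (1,0) and the example repeat count of 30,000 come from the source.

```python
import math


class TinyNet:
    """Stand-in network: two sigmoid output units fed directly by the inputs."""

    def __init__(self, n_in):
        # One weight row per output unit; the last entry is the bias.
        self.w = [[0.0] * (n_in + 1) for _ in range(2)]

    def forward(self, x):
        v = list(x) + [1.0]
        return tuple(1.0 / (1.0 + math.exp(-sum(w * a for w, a in zip(row, v))))
                     for row in self.w)

    def train_step(self, x, target, lr=0.1):
        v = list(x) + [1.0]
        out = self.forward(x)
        for row, o, t in zip(self.w, out, target):
            delta = (o - t) * o * (1.0 - o)  # squared-error gradient
            for i, a in enumerate(v):
                row[i] -= lr * delta * a


class NetworkController:
    """Hypothetical counterpart of network control unit 16 plus memory unit 15."""

    REGISTERED_TARGET = (1.0, 0.0)

    def __init__(self, net, repeats=30000):
        self.net = net
        self.repeats = repeats
        self.memory = None

    def store(self, features):
        """Keep the current preprocessed input in case retraining is needed."""
        self.memory = list(features)

    def on_additional_learning_signal(self):
        """Re-input the stored data and pull the output toward (1, 0)."""
        if self.memory is None:
            raise RuntimeError("no stored input voice data")
        for _ in range(self.repeats):
            self.net.train_step(self.memory, self.REGISTERED_TARGET)
```

In use, `store()` would be called for every utterance after preprocessing, and `on_additional_learning_signal()` only when the determination unit reports θ2 < X < θ1.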

The following describes a specific embodiment in which, as shown in FIG. 2, a hierarchical neural network 13 is used and the temporal change of the average frequency characteristics of the voice within fixed time intervals is used as the input to the neural network 13.

As shown in FIG. 2, the voice processing unit 12 consists of a low-pass filter 21, a band-pass filter 22, and an averaging circuit 23 connected together.

The high-frequency components of the input voice signal are cut by the low-pass filter 21. The input voice is then divided in time into four equal blocks, as shown in FIG. 3.

As shown in FIG. 2, the voice waveform is passed through the band-pass filter 22 with a plurality of (n) channels, and frequency characteristics such as those shown in FIGS. 4(A) to 4(D) are obtained for each block, that is, for each fixed time interval.

The output signal of the band-pass filter 22 is averaged by the averaging circuit 23 over each block, that is, over each fixed time interval.

The above preprocessing yields the "temporal change of the average frequency characteristics of the voice within fixed time intervals".
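The averaging step of this preprocessing can be sketched as follows, assuming the band-pass filter bank has already produced one list of n channel levels per short frame. The frame representation, the way uneven frame counts are split, and the names `block_average` and `flatten` are assumptions; the four equal time blocks and the resulting 4 × n values follow the description.

```python
def block_average(frames, n_blocks=4):
    """Average n-channel filter-bank frames over n_blocks equal time blocks."""
    if len(frames) < n_blocks:
        raise ValueError("need at least one frame per block")
    n_channels = len(frames[0])
    features = []
    for b in range(n_blocks):
        # Integer boundaries split the utterance into (nearly) equal blocks.
        start = len(frames) * b // n_blocks
        end = len(frames) * (b + 1) // n_blocks
        block = frames[start:end]
        features.append([sum(f[c] for f in block) / len(block)
                         for c in range(n_channels)])
    return features


def flatten(features):
    """Arrange the block averages as one input vector of 4 * n values."""
    return [v for row in features for v in row]


# Eight hypothetical frames from a 2-channel filter bank.
frames = [[1, 10], [3, 30], [5, 50], [7, 70],
          [9, 90], [11, 110], [13, 130], [15, 150]]
print(block_average(frames))
print(len(flatten(block_average(frames))))
```

Dividing every utterance into the same number of blocks gives a fixed-length input regardless of how long the word was spoken, which a fixed-size input layer requires.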

The output of the averaging circuit 23 is transferred to the neural network 13 either directly or indirectly via the memory unit 15.

The neural network 13 is a three-layer hierarchical neural network. The input layer 31 consists of 4 × n units, corresponding to the 4 blocks and n channels of the preprocessing. The output layer 32 consists of two units, one for the registered-speaker group and one for the impostor group.

The target values of the output layer 32 are (1,0) for registered speakers and (0,1) for impostors.

Experiment
As described above, the recognition rate of a network that underwent additional learning using the additional-learning threshold θ1 was compared with that of a network without additional learning, giving the results in Table 1. The method of the present invention is found to prevent the deterioration of the recognition rate caused by the passage of time.

Next, the operation of the above embodiment is described.

(1) The recognition rate deteriorates very little over time. This is confirmed by the experimental results described above, and is presumed to be because the neural network 13 can take a structure that is not easily affected by variation of the voice between different points in time.

(2) For the output value of the output unit of the neural network 13 that corresponds to the registered speaker, the additional-learning threshold θ1 is provided in addition to the registered-speaker recognition threshold θ2. That is, even when the output value exceeds the registered-speaker recognition threshold θ2, so that the input speaker can be judged to be the registered speaker, additional learning of the neural network 13 is performed using the current input voice data if the output value does not exceed the additional-learning threshold θ1, which is larger than θ2. The neural network 13 can thus be updated before the recognition rate deteriorates even if the speaker's characteristics change over time, and as a result a speaker recognition system that is robust to changes of the voice over time can be configured.

(3) In principle, the arithmetic processing of the neural network 13 as a whole is simple and fast.

(4) In principle, the units constituting the neural network 13 operate independently, so parallel arithmetic processing is possible. The arithmetic processing is therefore fast.

(5) Owing to (3) and (4) above, the speaker recognition system 10 can easily operate in real time without a complex processing device.

(6) Since the "temporal change of the frequency characteristics of the voice" is used as input to the neural network 13, the preprocessing needed to obtain the input is simpler than conventional complex feature extraction, and the time it requires is short.

Furthermore, since the "temporal change of the average frequency characteristics of the voice within fixed time intervals" is used as this input, the processing in the neural network 13 is simplified and the time it requires is shorter still.

(7) Since the hierarchical neural network 13 is used, a high recognition rate can be achieved with the simple, already established learning algorithm (backpropagation).

In carrying out the present invention, one or more of the following can be used as input to the neural network: the temporal change of the frequency characteristics of the voice; the average linear prediction coefficients of the voice; the average PARCOR coefficients of the voice; the average frequency characteristics and the pitch frequency of the voice; the average frequency characteristics of a voice waveform subjected to high-frequency emphasis; and the average frequency characteristics of the voice.

Just as the average frequency characteristics were used above in the form of the "temporal change of the average frequency characteristics of the voice within fixed time intervals", the linear prediction coefficients can be used as the "temporal change of the average linear prediction coefficients of the voice within fixed time intervals", the PARCOR coefficients as the "temporal change of the average PARCOR coefficients of the voice within fixed time intervals", the frequency characteristics and pitch frequency as the "temporal change of the average frequency characteristics and pitch frequency of the voice within fixed time intervals", and the high-frequency-emphasized waveform as the "temporal change of the average frequency characteristics, within fixed time intervals, of a voice waveform subjected to high-frequency emphasis".

The linear prediction coefficients mentioned above are defined as follows.

It is known that the sample values $\{x_n\}$ of a voice waveform generally have a high correlation with neighboring samples. It is therefore assumed that the following linear prediction is possible.

Here, $x_t$ is the sample value of the voice waveform at time t, and $\{\alpha_i\}$ (i = 1, ..., p) are the (p-th order) linear prediction coefficients.

In carrying out the present invention, the linear prediction coefficients $\{\alpha_i\}$ are determined so that the mean square value of the linear prediction error $\varepsilon_t$ is minimized.

Specifically, $(\varepsilon_t)^2$ is computed, its time average is written as $\overline{(\varepsilon_t)^2}$, and setting $\partial \overline{(\varepsilon_t)^2} / \partial \alpha_i = 0$ for i = 1, 2, ..., p yields $\{\alpha_i\}$ from the following equation.

The PARCOR coefficients mentioned above are defined as follows.

When $\{k_n\}$ (n = 1, ..., p) denotes the (p-th order) PARCOR coefficients (partial autocorrelation coefficients), the PARCOR coefficient $k_{n+1}$ is defined by the following equation as the normalized correlation coefficient between the forward residual $\varepsilon_t^{(f)}$ and the backward residual $\varepsilon_{t-(n+1)}^{(b)}$ of linear prediction.

Here, $\{\alpha_i\}$ are the forward prediction coefficients and $\{\beta_j\}$ are the backward prediction coefficients.

The pitch frequency of the voice mentioned above is the reciprocal of the repetition period (pitch period) of the glottal wave. Adding the pitch frequency, a basic vocal-cord parameter that differs between individuals, to the input of the neural network improves the recognition rate, particularly between adult and child speakers and between male and female speakers.
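Linear prediction coefficients that minimize the mean square prediction error, as defined above, are commonly obtained with the autocorrelation method and the Levinson-Durbin recursion, which also yields one reflection coefficient per order. The sketch below is a minimal illustration under stated assumptions: it adopts the sign convention $\varepsilon_t = x_t + \sum_{i=1}^{p} \alpha_i x_{t-i}$ (the source's equations are not reproduced in this text, so the sign is an assumption), the reflection coefficients correspond to the PARCOR coefficients only up to a sign convention, and the function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[0..max_lag] of a sample sequence."""
    n = len(x)
    return [sum(x[t] * x[t - lag] for t in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for eps_t = x_t + sum(alpha_i * x_{t-i}).

    Returns (alpha, k, err): prediction coefficients alpha_1..alpha_p,
    reflection coefficients k_1..k_p, and the final prediction error energy.
    """
    if r[0] <= 0.0:
        raise ValueError("signal energy r[0] must be positive")
    a = [1.0] + [0.0] * order
    err = r[0]
    ks = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        ks.append(k)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return a[1:], ks, err


# A decaying sequence x_t = 0.9 * x_{t-1}: one coefficient explains it.
x = [0.9 ** t for t in range(200)]
alpha, ks, err = levinson_durbin(autocorrelation(x, 2), order=2)
print(alpha, ks)
```

For this test signal the first coefficient comes out close to -0.9 under the assumed sign convention and the second close to 0, since a first-order predictor already removes almost all of the correlation.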

The high-frequency emphasis mentioned above compensates for the average slope of the spectrum of the voice waveform so that energy is not concentrated in the low-frequency range. The average slope of the spectrum is common to all speakers and is irrelevant to speaker recognition. However, if a voice waveform whose average spectral slope has not been compensated is input to the neural network as it is, the network tends to extract the average spectral slope as a feature during learning, and it takes time to extract the spectral peaks and valleys needed for speaker recognition. When the input to the neural network is high-frequency emphasized, by contrast, the average spectral slope, which is common to speakers and irrelevant to recognition yet affects learning, is compensated, so learning becomes faster.
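High-frequency emphasis is often realized as a first-order difference filter, y[n] = x[n] - a * x[n-1]. The source does not say how the emphasis is implemented, so this filter form, the coefficient 0.95, and the function name are assumptions made purely for illustration.

```python
def pre_emphasis(samples, coeff=0.95):
    """First-order high-frequency emphasis: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    out = [float(samples[0])]  # the first sample has no predecessor
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


# A constant (lowest-frequency) signal is strongly attenuated,
# while a sample-by-sample alternating (highest-frequency) signal is boosted.
print(pre_emphasis([1.0, 1.0, 1.0]))
print(pre_emphasis([1.0, -1.0, 1.0]))
```

Flattening the overall spectral tilt in this way leaves the speaker-dependent peaks and valleys as the dominant structure in the network input.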

[Effects of the Invention] As described above, the present invention provides a speaker recognition system whose recognition rate deteriorates very little over time and which can easily operate in real time.

[Brief Description of the Drawings]

FIG. 1 is a schematic diagram showing an example of a speaker recognition system to which the present invention is applied, FIG. 2 is a schematic diagram showing an example of the voice processing unit and the neural network, FIG. 3 is a schematic diagram showing an input voice, FIG. 4 is a schematic diagram showing the output of the band-pass filter, FIG. 5 is a schematic diagram showing neural networks, FIG. 6 is a schematic diagram showing a hierarchical neural network, and FIG. 7 is a schematic diagram showing the structure of a unit.

10: speaker recognition system, 11: voice input unit, 12: voice processing unit, 13: neural network, 14: determination unit, 15: memory unit, 16: network control unit, 17: device control unit.

Continuation of front page
(56) References: JP-A-58-37700 (JP, A); JP-A-61-114299 (JP, A); JP-A-61-292696 (JP, A); JP-A-61-102698 (JP, A); JP-A-62-89098 (JP, A); JP-A-63-261400 (JP, A); Journal of the Acoustical Society of Japan, Vol. 44, No. 10 (1988), pp. 798-804; 1988 Joint Convention of Electrical and Information Related Societies, 31-1, P. 5-65 to 68; Niimi, "Information Science Course E-19-3: Speech Recognition" (1979), pp. 210-211

Claims

(57) [Claims]

[Claim 1] A speaker recognition system using a neural network, wherein a registered-speaker recognition threshold and an additional-learning threshold are set for the output value of the output unit corresponding to a registered speaker, the current input speaker is judged to be the registered speaker on condition that the output value is larger than the registered-speaker recognition threshold, and additional learning of the neural network is performed using the current input voice data on condition that the output value is larger than the registered-speaker recognition threshold and smaller than the additional-learning threshold.

[Claim 2] The speaker recognition system according to claim 1, wherein one or more of the following are used as input to the neural network: the temporal change of the frequency characteristics of the voice; the average linear prediction coefficients of the voice; the average PARCOR coefficients of the voice; the average frequency characteristics and the pitch frequency of the voice; the average frequency characteristics of a voice waveform subjected to high-frequency emphasis; and the average frequency characteristics of the voice.

[Claim 3] The speaker recognition system according to claim 1 or 2, wherein the neural network is a hierarchical neural network.