JP4391179B2 - Speaker recognition system and method

Info

Publication number: JP4391179B2
Application number: JP2003325119A
Authority: JP
Inventors: 聖一中川; 晋太木村
Original assignee: Animo Ltd
Current assignee: Animo Ltd
Priority date: 2003-09-17
Filing date: 2003-09-17
Publication date: 2009-12-24
Anticipated expiration: 2023-09-17
Also published as: JP2005091758A

Description

The present invention relates to speaker recognition technology.

Speaker recognition technology refers to either of two techniques. In speaker authentication, the voice of a specific speaker is registered in advance, and a voice input later is judged as to whether it is the voice of that registered speaker. In speaker identification, the voices of several people are registered in advance, and a voice input later is identified as being most similar to one of the registered voices. In either case, the basic processing is to calculate the similarity between the previously registered voice and the voice input later.

FIG. 1 shows an example of the prior art. The speaker's voice is input through a voice input unit 1100, such as a microphone, which converts the voice wave (a vibration of the air) into an electrical signal. The voice analysis unit 1102 digitizes the electrical signal and performs analysis with an analysis window (also called a frame) of about 15 ms to 30 ms at every analysis period (also called a frame period) of about 5 ms to 30 ms, generating, for example, a sequence of LPC (Linear Predictive Coding) cepstrum coefficients (vectors). The analysis that derives LPC cepstrum coefficients from a voice wave is well known and is described, for example, on pages 7 to 12 of Seiichi Nakayama, "Speech Recognition by Probability Model", published by the Institute of Electronics, Information and Communication Engineers.

When the current process is speaker verification, the switching unit 1104 outputs the analysis result of the voice analysis unit 1102 to the verification unit 1108; when the current process is speaker registration, it outputs the analysis result to the model generation unit 1106. The model generation unit 1106 models the sequence of LPC cepstrum coefficients (vectors) produced by the voice analysis unit 1102. One example of such a model is a multidimensional normal distribution model: the model generation unit 1106 calculates the mean vector μ and the covariance matrix Σ of the LPC cepstrum coefficients (vectors) and stores them in the registered model storage unit 1110. The verification unit 1108 then calculates the sequence of likelihoods λ with which the sequence of LPC cepstrum coefficients (vectors) of the voice to be verified appears under the normal distribution specified by the mean vector μ and the covariance matrix Σ. For speaker identification, the verification result determination unit 1112 outputs, for example, the attribute value (for example, the speaker ID) of the registered model with the largest overall likelihood λ_all. For speaker authentication, it compares the overall likelihood λ_all with a threshold, judges whether λ_all is equal to or greater than the threshold, and outputs the success or failure of the authentication.
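The prior-art flow of FIG. 1 can be illustrated with a minimal Python sketch. It assumes a diagonal covariance (per-dimension variances) rather than the full covariance matrix Σ of the text, purely to stay within the standard library, and it works with log-likelihoods. The function names and the variance floor are illustrative assumptions, not part of the publication.

```python
import math

def fit_gaussian(frames, var_floor=1e-6):
    """Estimate the mean and per-dimension variance of a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[j] for f in frames) / n for j in range(dim)]
    var = [max(sum((f[j] - mean[j]) ** 2 for f in frames) / n, var_floor)
           for j in range(dim)]
    return mean, var

def log_likelihood(frames, model):
    """Overall log-likelihood of all frames under a diagonal Gaussian model."""
    mean, var = model
    total = 0.0
    for f in frames:
        for x, m, v in zip(f, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total

def identify(frames, models):
    """Speaker identification: ID of the registered model with the highest score."""
    return max(models, key=lambda sid: log_likelihood(frames, models[sid]))

def authenticate(frames, model, threshold):
    """Speaker authentication: succeed when the score reaches the threshold."""
    return log_likelihood(frames, model) >= threshold

# Two hypothetical registered speakers with toy two-dimensional features.
models = {
    "a": fit_gaussian([[0.0, 0.0], [1.0, 1.0], [0.5, -0.5]]),
    "b": fit_gaussian([[5.0, 5.0], [6.0, 4.0], [5.5, 5.5]]),
}
```

Identification takes the best-scoring registered model, while authentication compares one model's score with a threshold that would have to be tuned on real data.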

Japanese Patent Laid-Open No. 2002-268674 (Patent Document 1) also discloses the prior art shown in FIG. 2. The voice input unit 1100 converts the voice wave, a vibration of the air, into an electrical signal. The voice analysis unit 1102 digitizes the electrical signal and performs analysis with a frame of about 15 ms to 30 ms at every frame period of about 5 ms to 30 ms, generating, for example, a sequence of LPC cepstrum coefficients (vectors). When the current process is speaker verification, the switching unit 1104 outputs the analysis result of the voice analysis unit 1102 to the verification unit 1108; when the current process is speaker registration, it outputs the analysis result to the model generation unit 1106. The model generation unit 1106 models the sequence of LPC cepstrum coefficients (vectors) produced by the voice analysis unit 1102 and stores the model in the registered model storage unit 1110.

The verification unit 1108 then calculates the sequence of likelihoods λ with which the sequence of LPC cepstrum coefficients (vectors) of the voice to be verified appears under the normal distribution specified by the mean vector μ and the covariance matrix Σ. In addition, a verification result correction unit 1209 is provided: when the likelihood obtained as the verification result stays below a predetermined threshold within a predetermined time (roughly the duration of one syllable), it reduces (for example, removes) the influence of that verification result. For speaker identification, the verification result determination unit 1211 outputs, for example, the attribute value (for example, the speaker ID) of the registered model with the largest overall likelihood λ_all after correction by the verification result correction unit 1209. For speaker authentication, it compares the corrected overall likelihood λ_all with a threshold, judges whether it is equal to or greater than the threshold, and outputs the success or failure of the authentication.
JP 2002-268674 A
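The correction attributed to Patent Document 1 can be pictured with the following Python sketch. It is only one possible reading of the description: runs of per-frame scores below the threshold that last no longer than about one syllable are dropped, while longer runs are kept. The function name, the run-length parameter `max_run`, and the decision to keep long runs are assumptions made for illustration.

```python
def drop_short_low_runs(frame_scores, threshold, max_run):
    """Remove runs of below-threshold scores that last at most max_run frames."""
    kept, run = [], []
    for score in frame_scores:
        if score < threshold:
            run.append(score)          # still inside a low-score run
            continue
        if len(run) > max_run:
            kept.extend(run)           # long run: keep it as evidence
        run = []
        kept.append(score)
    if len(run) > max_run:             # handle a run that reaches the end
        kept.extend(run)
    return kept

scores = [-1, -9, -9, -1, -9, -9, -9, -9, -2]
corrected = drop_short_low_runs(scores, threshold=-5, max_run=2)
```

The overall score would then be computed from `corrected` instead of the raw per-frame scores.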

When such speaker recognition technology is used, it is desirable that the speaker utter many phonemes so that the model generation unit 1106 can create the registered model, but the registered model is not always created from a sufficient variety of phonemes. If the registered model is created without a certain phoneme having been uttered, and during verification the speaker pronounces a phoneme that was not uttered when the registered model was created, the verification result for that phoneme becomes markedly worse.

Patent Document 1 was proposed to deal with this problem. However, it simply assumes that a phoneme missing from the registered model has been uttered whenever the likelihood λ is below a predetermined threshold within a predetermined time, so the correction made by the verification result correction unit 1209 is not always the right one.

Accordingly, an object of the present invention is to provide a novel technique for correcting inaccurate verification results caused by a lack of voice data when the registered model is created.

The speaker recognition system according to the present invention includes: a first registered model data storage unit that stores first registered model data generated from the voice data of the person to be verified; a second registered model data storage unit that stores second registered model data generated by adapting, to the person to be verified, unspecified-speaker model data generated from the voice data of many unspecified speakers; analysis means that analyzes the voice data of the person to be verified and generates voice analysis data; first verification processing means that performs verification using the voice analysis data and the first registered model data stored in the first registered model data storage unit; second verification processing means that performs verification using the voice analysis data and the second registered model data stored in the second registered model data storage unit; and determination means that performs the final determination for the person to be verified based on the results of the first and second verification processing means.

The result of the first verification processing means is good when the consonant-vowel composition uttered by the person when the first registered model data was generated resembles the composition uttered at verification, but tends to be poor when the two compositions differ greatly. The result of the second verification processing means, by contrast, is generally not very good, but it is stable regardless of any difference between the consonant-vowel composition uttered when the second registered model data was generated and the composition uttered at verification. Therefore, if the final determination is made by combining the results of the first and second verification processing means, the two complement each other and the determination accuracy improves. The final determination is success or failure in the case of speaker authentication, and the identity of the person to be verified in the case of speaker identification.

The determination means described above may perform the final determination for the person to be verified based on the sum of two products: the first likelihood, which is the result of the first verification processing means, multiplied by (1-α), and the second likelihood, which is the result of the second verification processing means, multiplied by α, where α is a predetermined real number from 0 to 1 inclusive. Blending the results of the first and second verification processing means in this way can improve the determination accuracy.
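The weighted combination described above can be written as a short Python sketch. The function names, the use of log-likelihood scores, and the example values are assumptions for illustration; the text only fixes the form (1-α)·(first likelihood) + α·(second likelihood) with α between 0 and 1.

```python
def blended_score(l1, l2, alpha):
    """Combine the first and second verification scores with weight alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l1 + alpha * l2

def authenticate_blended(l1, l2, alpha, threshold):
    """Speaker authentication on the blended score."""
    return blended_score(l1, l2, alpha) >= threshold

def identify_blended(scores, alpha):
    """Speaker identification; scores maps speaker ID -> (l1, l2)."""
    return max(scores, key=lambda sid: blended_score(*scores[sid], alpha))

# Hypothetical scores for two registered speakers.
candidates = {"a": (-10.0, -30.0), "b": (-25.0, -12.0)}
```

With α = 0 only the first model counts, and with α = 1 only the adapted second model counts; intermediate values trade the sharpness of the first against the stability of the second.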

The first and second registered model data described above may be data of a mixed normal distribution model (for example, a GMM (Gaussian Mixture Model)), and the verification performed by the first and second verification processing means may be verification corresponding to the mixed normal distribution model. In this way, verification can be performed even when the content to be uttered by the person to be verified (also called the text) is not specified.

Alternatively, the first registered model data may be data of a mixed normal distribution model, and the second registered model data may be model data in subword units (for example, syllables), such as an HMM (Hidden Markov Model). In that case the first verification processing means performs verification corresponding to the mixed normal distribution model, and the second verification processing means may include verification model data generation means that connects the subword-unit model data stored in the second registered model data storage unit to generate verification model data, and means that performs verification using the verification model data and the voice analysis data.

The first and second verification processing means do not necessarily have to perform the same kind of processing. When the second registered model data is subword-unit model data, the second verification processing means connects the subword-unit model data to generate verification model data, as described above, and then performs verification.

The present invention may further include means for determining the phrase (also called the text) that the person to be verified is asked to utter, and the verification model data generation means described above may connect the subword-unit model data stored in the second registered model data storage unit according to that phrase to generate the verification model data. A scheme in which the phrase to be uttered is specified at verification time can counter an impostor who has recorded the genuine speaker's voice in order to impersonate that speaker. Because the present application can connect subword-unit model data according to the specified phrase to generate the verification model data, it can deal with such impostors.
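How subword-unit models might be connected according to a prompted phrase is sketched below in Python. The data layout is an assumption: each syllable maps to a list of state labels standing in for the parameters of its HMM states, and the phrase is given as a list of syllables. Neither the layout nor the names come from the publication.

```python
def build_verification_model(syllables, subword_models):
    """Concatenate per-syllable state sequences in the order of the prompted phrase."""
    states = []
    for syllable in syllables:
        if syllable not in subword_models:
            raise KeyError(f"no model registered for syllable {syllable!r}")
        states.extend(subword_models[syllable])
    return states

# Hypothetical adapted syllable models with three states each.
subword_models = {
    "sa": ["sa_1", "sa_2", "sa_3"],
    "ka": ["ka_1", "ka_2", "ka_3"],
    "na": ["na_1", "na_2", "na_3"],
}
prompt = ["sa", "ka", "na"]
model = build_verification_model(prompt, subword_models)
```

Because the phrase can change at every verification, a replayed recording of an earlier phrase would not match the newly connected model.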

The present invention may further include means for generating the first registered model data from the voice analysis data of the person to be verified that the analysis means generates at model data registration, and second registered model data generation means for generating the second registered model data by adapting the unspecified-speaker model data stored in an unspecified-speaker model data storage unit using that voice analysis data. The second registered model data generation means may adapt, according to a predetermined method, the model data of the subwords uttered by the person to be verified at model data registration, and connect the adapted subword-unit model data to generate the second registered model data. The subword-unit model data may thus be connected either at verification or at registration.

The speaker recognition system according to the present invention can be realized by a combination of a program and a computer. In this case, the program is stored in a storage medium or storage device such as a flexible disk, a CD-ROM, a magneto-optical disk, a semiconductor memory, or a hard disk. The program may also be distributed as a digital signal over a network. Data being processed is temporarily held in the memory of the computer.

According to the present invention, inaccurate verification results caused by a lack of voice data when the registered model is created can be appropriately corrected.

FIG. 3 shows a functional block diagram of the speaker recognition system according to an embodiment of the present invention. The system includes a voice input unit 1, a voice analysis unit 3, a switching unit 5, a first verification unit 7, a model generation unit 9, a first registered model storage unit 11, a second verification unit 13, a model correction unit 15, a second registered model storage unit 17, a verification result determination unit 19, a prior model storage unit 21, and an utterance text determination unit 25. Generating the data stored in the prior model storage unit 21 requires a preprocessing unit 23 that includes a prior voice data storage unit 231, a second voice analysis unit 233, and a prior model generation unit 235, but the preprocessing unit 23 is not needed during verification or model registration. The preprocessing unit 23 may therefore be omitted from the speaker recognition system.

The output of the voice input unit 1 is input to the voice analysis unit 3, and the output of the voice analysis unit 3 is input to the switching unit 5. The output of the switching unit 5 is input to the first verification unit 7 and the second verification unit 13 during speaker verification, and to the model generation unit 9 and the model correction unit 15 during model registration. The first registered model data generated by the model generation unit 9 is stored in the first registered model storage unit 11. The first verification unit 7 can refer to the first registered model storage unit 11, and its output is input to the verification result determination unit 19. The model correction unit 15 adapts the prior model stored in the prior model storage unit 21 based on the output of the voice analysis unit 3 and stores the result in the second registered model storage unit 17 as the second registered model data. The second verification unit 13 can refer to the second registered model storage unit 17, and its output is input to the verification result determination unit 19. The verification result determination unit 19 outputs the final verification result based on the outputs of the first verification unit 7 and the second verification unit 13. In this embodiment, speaker identification and speaker authentication can both be performed by the same processing. For speaker identification, the final verification result is information indicating who the speaker is (a speaker ID, for example); for speaker authentication, it is information indicating whether the authentication succeeded or failed.

When it is necessary to determine the phrase the speaker should utter, the utterance text determination unit 25 determines that phrase and outputs the phrase data to the second verification unit 13 and to an output device not shown (for example, a display device, or a voice conversion processing unit and a speaker). The phrase data may also be output to the model correction unit 15.

The prior voice data storage unit 231 in the preprocessing unit 23 stores digitized voice data of many unspecified speakers. The second voice analysis unit 233 processes the voice data stored in the prior voice data storage unit 231 and outputs the result to the prior model generation unit 235. The output of the prior model generation unit 235 is stored in the prior model storage unit 21, which is part of the speaker recognition system.

In the following, the processing performed by the speaker recognition system and the preprocessing unit 23 shown in FIG. 3 is described for three embodiments.

1. Embodiment 1
In the present embodiment, the first registered model storage unit 11 and the second registered model storage unit 17 store data of mixed normal distribution models (GMMs), and the first verification unit 7 and the second verification unit 13 perform verification based on the mixed normal distribution model (GMM).

First, the processing performed in the preprocessing unit 23 is described with reference to FIG. 4. The prior voice data storage unit 231 of the preprocessing unit 23 stores voice data (for example, digital data) from many unspecified speakers. The voice data of each of these speakers is assumed to contain the sounds of all consonants and vowels. The second voice analysis unit 233 reads the prior voice data stored in the prior voice data storage unit 231, performs voice analysis frame by frame, and generates voice analysis data (step S1). More specifically, it performs analysis with an analysis window (frame) of about 15 ms to 30 ms at every analysis period (frame period) of about 5 ms to 30 ms and generates, for example, a sequence of LPC cepstrum coefficients (vectors). As shown in FIG. 5, the analysis window is shifted along the voice wave by one analysis period at a time, a predetermined analysis is applied to each window, and the cepstrum coefficients C_ij corresponding to that window are output. For example, one analysis yields LPC cepstrum coefficients of about 10 to 20 dimensions. Here i is the frame number, i = 1 to N, where N is the total number of frames; j is the dimension index of the LPC cepstrum coefficient, j = 1 to n, where n is the number of dimensions. Written as follows, the LPC cepstrum coefficients obtained by the i-th analysis form the feature vector X_i.
$$X_i = (C_{i1}, C_{i2}, \ldots, C_{in})^T \qquad (1)$$
This processing is performed on all the voice data stored in the prior voice data storage unit 231, and the results are stored in a storage device.
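The windowing step can be sketched in Python as follows. The sketch only splits a sampled signal into overlapping analysis windows; computing LPC cepstrum coefficients from each window is left out. The 25 ms window and 10 ms period are example values chosen from within the ranges given in the text, and the function name is an assumption.

```python
def frame_signal(samples, sample_rate, window_ms=25.0, period_ms=10.0):
    """Split a sampled signal into overlapping analysis windows (frames)."""
    window = int(sample_rate * window_ms / 1000)   # samples per analysis window
    hop = int(sample_rate * period_ms / 1000)      # samples per frame period
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop                               # shift by one frame period
    return frames

# 1000 samples at 8 kHz: 200-sample windows shifted by 80 samples.
signal = list(range(1000))
frames = frame_signal(signal, sample_rate=8000)
```

Each element of `frames` would then be passed to the analysis that produces one feature vector X_i, so the number of frames equals N.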

Next, the prior model generation unit 235 generates a mixed normal distribution model (GMM) for the voice data of the many unspecified speakers stored in the prior voice data storage unit 231 and stores the result in the prior model storage unit 21 as prior model data (step S3). The mixed normal distribution of the model of speaker λ_s is expressed by the following equation. Here, however, the speaker λ_s stands for all of the many unspecified speakers.
$$P(x_t \mid \lambda_s) = \sum_{m=1}^{M} w_{sm}\, N(x_t \mid \mu_{sm}, \Sigma_{sm}) \qquad (2)$$
Here x_t is the n-dimensional feature vector at verification time, expressed in the same way as in equation (1).
As equation (2) shows, the GMM is a probability model in which M n-dimensional Gaussian distributions N(x_t | μ_sm, Σ_sm) are linearly combined with weights w_sm. Each N(x_t | μ_sm, Σ_sm) is expressed as follows.
$$N(x_t \mid \mu_{sm}, \Sigma_{sm}) = \frac{1}{(2\pi)^{n/2}\, |\Sigma_{sm}|^{1/2}} \exp\left( -\frac{1}{2} (x_t - \mu_{sm})^T \Sigma_{sm}^{-1} (x_t - \mu_{sm}) \right) \qquad (3)$$
Here μ_sm denotes the M mean vectors calculated from the feature vectors X_t at registration of the speaker model λ_s. The mean vectors μ_sm are generated from the feature vectors X_t by vector quantization or maximum likelihood estimation. Which mean vector μ_sm each feature vector X_t is associated with can be determined by finding the mean vector μ_sm nearest to that feature vector.

Σ_sm denotes the covariance matrices of the speaker model λ_s. The M covariance matrices Σ_sm are obtained from the feature vectors X_t associated with each mean vector μ_sm. The mean vectors μ_sm and the covariance matrices Σ_sm are calculated in the same manner hereafter.
[Equations (4) to (6) are not reproduced in this text.]
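The association of feature vectors with their nearest mean vector, followed by per-group statistics, might look like the Python sketch below. It is an illustration under stated assumptions, not the publication's own formulas: it uses squared Euclidean distance, computes per-dimension variances instead of full covariance matrices, and applies a small variance floor. All names are hypothetical.

```python
def nearest(x, means):
    """Index of the mean vector closest to x in squared Euclidean distance."""
    return min(range(len(means)),
               key=lambda m: sum((a - b) ** 2 for a, b in zip(x, means[m])))

def reestimate(frames, means, var_floor=1e-6):
    """Assign each frame to its nearest mean, then recompute mean and variance per group."""
    dim = len(means[0])
    groups = [[] for _ in means]
    for x in frames:
        groups[nearest(x, means)].append(x)
    new_means, new_vars = [], []
    for m, members in enumerate(groups):
        if not members:                       # empty group: keep the old mean
            new_means.append(list(means[m]))
            new_vars.append([var_floor] * dim)
            continue
        n = len(members)
        mu = [sum(x[j] for x in members) / n for j in range(dim)]
        var = [max(sum((x[j] - mu[j]) ** 2 for x in members) / n, var_floor)
               for j in range(dim)]
        new_means.append(mu)
        new_vars.append(var)
    return new_means, new_vars

frames = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
means, variances = reestimate(frames, [[1.0, 1.0], [9.0, 9.0]])
```

Repeating `reestimate` until the means stop moving corresponds to a simple vector-quantization style training loop.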

Furthermore, the weights w_sm of the mixture distribution satisfy the following relationship.
$$\sum_{m=1}^{M} w_{sm} = 1 \qquad (7)$$

However, since the individual weights w_sm cannot be determined analytically, they are determined, for example by the well-known EM algorithm, so that the following expression is maximized.
$$\sum_{t=1}^{N} \log P(x_t \mid \lambda_s) \qquad (8)$$
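A minimal Python sketch of EM applied to the mixture weights alone is given below. It assumes the means and variances are held fixed, uses diagonal covariances for brevity, starts from uniform weights, and runs a fixed number of iterations. These choices and all names are assumptions for illustration, not details taken from the publication.

```python
import math

def diag_gauss(x, mean, var):
    """Density of a diagonal-covariance Gaussian at x."""
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
                for a, m, v in zip(x, mean, var))
    return math.exp(log_p)

def em_weights(frames, means, variances, iterations=20):
    """EM updates for the mixture weights only, with means and variances fixed."""
    count = len(means)
    weights = [1.0 / count] * count
    for _ in range(iterations):
        totals = [0.0] * count
        for x in frames:
            joint = [weights[m] * diag_gauss(x, means[m], variances[m])
                     for m in range(count)]
            norm = sum(joint)
            for m in range(count):
                totals[m] += joint[m] / norm      # E-step: responsibility of component m
        weights = [t / len(frames) for t in totals]  # M-step: average responsibility
    return weights

# Three frames near the first component, one near the second.
data = [[0.0], [0.0], [0.0], [10.0]]
weights = em_weights(data, means=[[0.0], [10.0]], variances=[[1.0], [1.0]])
```

Each update keeps the weights summing to one, and each iteration does not decrease the total log-likelihood of the frames.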

Thus, to evaluate equations (2) and (3), M mean vectors μ_sm, M covariance matrices Σ_sm, and M weights w_sm (strictly, M-1 by equation (7)) are required, and these data constitute the prior model data.
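Scoring feature vectors against a stored mixture model can be sketched in Python as follows. The sketch assumes diagonal covariances (the text uses full covariance matrices), strictly positive weights, and a log-sum-exp formulation for numerical stability; the function names are hypothetical.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (a - m) ** 2 / v)
               for a, m, v in zip(x, mean, var))

def gmm_log_prob(x, weights, means, variances):
    """log P(x | model) for a Gaussian mixture, computed with log-sum-exp."""
    terms = [math.log(w) + log_gauss_diag(x, mu, var)
             for w, mu, var in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def total_log_likelihood(frames, weights, means, variances):
    """Sum of per-frame log-likelihoods over an utterance."""
    return sum(gmm_log_prob(x, weights, means, variances) for x in frames)

# Toy one-dimensional model with two components.
w, mu, var = [0.5, 0.5], [[0.0], [4.0]], [[1.0], [1.0]]
score = total_log_likelihood([[0.1], [3.9]], w, mu, var)
```

The stored model is exactly the triple of weights, means, and variances, which mirrors the list of parameters that make up the prior model data.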

Next, the processing flow of the speaker recognition system in the present embodiment is described with reference to FIG. 6, taking speaker authentication as the example. First, the system accepts from the speaker a processing selection input, which specifies whether verification or registration is to be performed, and speaker identification information (for example, a speaker ID) (step S11).

Next, the speaker's voice is input through the voice input unit 1, such as a microphone (step S13). The voice input unit 1 converts the voice wave, a vibration of the air, into an electrical signal. The voice analysis unit 3 then digitizes the electrical signal, performs voice analysis with an analysis window of about 15 ms to 30 ms at every frame period of about 5 ms to 30 ms, and generates voice analysis data (for example, a sequence of LPC cepstrum coefficients C_ij) (step S15). That is, it generates as many feature vectors x_i as there are frames. The generated data is stored in a storage device not shown.

The switching unit 5 then determines whether the processing selection input accepted in step S11 is verification (step S17). If the input is registration rather than verification (step S17: registration route), the model generation unit 9 generates the first registered model data for the speaker's input voice and registers it in the first registered model storage unit 11 in association with the speaker ID (step S19). The processing of the model generation unit 9 is almost the same as that of the prior model generation unit 235. That is, it calculates the M mean vectors μ_sm of the feature vectors x_i (the voice analysis data), calculates the M covariance matrices Σ_sm according to equation (6), and calculates the weights w_sm so as to maximize, for example, expression (8). The data calculated in this way is registered in the first registered model storage unit 11.

また、モデル修正部１５は、話者の入力音声の音声分析データに基づき事前モデルを修正して第２登録モデル・データを生成し、第２登録モデル格納部１７に格納する（ステップＳ２１）。具体的には、事前モデル格納部２１に格納されている、特徴ベクトルの平均ベクトルをμ₀（Ｍ個の平均ベクトルμの各々）とし、音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）と定数βとを用いて以下の式にて第２登録モデルにおける特徴ベクトルの平均ベクトルμ_aを算出する。

Further, the model correction unit 15 corrects the prior model based on the voice analysis data of the speaker's input voice to generate second registration model data, and stores it in the second registration model storage unit 17 (step S21). Specifically, let μ₀ be an average vector of the feature vectors stored in the prior model storage unit 21 (each of the M average vectors μ); the average vector μ_a of the feature vectors in the second registration model is then calculated from the feature vectors x_i (1 ≦ i ≦ N), which are the voice analysis data, and a constant β using the following equation.

（９）式では事前モデルにおける平均ベクトルμ₀の重みを定数βで決定している。この定数βについては環境に依存するため実験的に適切な値を決定する。事前モデルに含まれる共分散行列Σや重みｗについても、入力音声の音声分析データを用いて話者に適応化させてもよいが、本実施の形態では平均ベクトルμ₀のみを話者に適応化させる。従って、第２登録モデルとして（９）式で計算されるＭ個の平均ベクトルμ_aと、事前モデルに含まれるＭ個の共分散行列Σ及びＭ個（又はＭ−１個）の重みｗとを、話者ＩＤに対応して第２登録モデル格納部１７に登録する。そして処理を終了する。 In equation (9), the weight of the average vector μ₀ of the prior model is determined by the constant β. Because this constant β depends on the environment, an appropriate value is determined experimentally. The covariance matrices Σ and the weights w included in the prior model may also be adapted to the speaker using the voice analysis data of the input voice, but in this embodiment only the average vectors μ₀ are adapted to the speaker. Accordingly, the M average vectors μ_a calculated by equation (9), together with the M covariance matrices Σ and the M (or M−1) weights w included in the prior model, are registered as the second registration model in the second registration model storage unit 17 in association with the speaker ID. The processing then ends.
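Equation (9) itself is not reproduced in this text. As a rough illustration only, the sketch below assumes a commonly used MAP-style interpolation, μ_a = (β·μ₀ + Σ x_i) / (β + N), in which a larger β keeps μ_a closer to the prior mean μ₀. Both this form and the function name `adapt_mean` are assumptions, not the patent's own definition, and the sketch adapts a single mean from all frames without assigning frames to the M mixture components.

```python
def adapt_mean(mu0, frames, beta):
    """Shift a prior mean vector toward the speaker's feature vectors.

    Assumed form (not confirmed by the text):
        mu_a = (beta * mu0 + sum_i x_i) / (beta + N)
    """
    n = len(frames)
    if beta + n <= 0:
        raise ValueError("beta + number of frames must be positive")
    dim = len(mu0)
    # Per-dimension sum of the N feature vectors.
    sums = [sum(x[d] for x in frames) for d in range(dim)]
    return [(beta * mu0[d] + sums[d]) / (beta + n) for d in range(dim)]


# Two identical 2-dimensional frames pull a zero prior mean halfway (beta = N = 2).
adapted = adapt_mean([0.0, 0.0], [[2.0, 4.0], [2.0, 4.0]], beta=2.0)
```

Under this assumed form, an utterance with no frames leaves the prior mean unchanged, which matches the idea that the second registration model stays close to the prior model when little speaker data is available.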

一方話者の処理選択入力が照合である場合（ステップＳ１７：照合ルート）、第１照合部７は、第１登録モデル格納部１１から話者ＩＤに対応する第１登録モデル・データを読み出し、音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）をさらに用いて照合処理を実施する（ステップＳ２３）。すなわち、各特徴ベクトルにつき（２）及び（３）式でＰ（ｘ_t|λ_s）、そして対数尤度logＰ（ｘ_t|λ_s）を算出する。さらに、（８）式に従って対数尤度の総和Ｌ１を計算する。なお、計算結果は記憶装置に格納される。 On the other hand, when the process selection input of the speaker is collation (step S17: collation route), the first collation unit 7 reads the first registration model data corresponding to the speaker ID from the first registration model storage unit 11, A matching process is further performed using the feature vector x _i (1 ≦ i ≦ N), which is speech analysis data (step S23). That is, for each feature vector, P (x _t | λ _s ) and log likelihood logP (x _t | λ _s ) are calculated according to equations (2) and (3). Furthermore, the log likelihood total L1 is calculated according to the equation (8). The calculation result is stored in the storage device.
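The computation of the log-likelihood sum L1 can be sketched as follows. Equations (2), (3) and (8) are not reproduced in this part of the text, so the sketch assumes the usual Gaussian mixture form with diagonal covariance matrices; the diagonal restriction and the function names are assumptions made to keep the example short.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log P(x | model) for an M-component mixture."""
    total = 0.0
    for x in frames:
        logs = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        # Log-sum-exp keeps the mixture sum numerically stable.
        peak = max(logs)
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total


# Single-component, one-dimensional toy model evaluated on one frame.
l1 = gmm_log_likelihood([[0.0]], [1.0], [[0.0]], [[1.0]])
```

The same routine would serve for L2 in the first embodiment, with the second registration model's parameters substituted for the first's.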

また、第２照合部１３は、第２登録モデル格納部１７から話者ＩＤに対応する第２登録モデル・データを読み出し、音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）をさらに用いて照合処理を実施する（ステップＳ２５）。ステップＳ２３と同様に、各特徴ベクトルにつき（２）式及び（３）式でＰ（ｘ_t|λ_s）、そして対数尤度logＰ（ｘ_t|λ_s）を算出する。さらに、（８）式に従って対数尤度の総和Ｌ２を計算する。なお、第２登録モデル・データは第１登録モデル・データとは異なるのでステップＳ２３とステップＳ２５の計算結果は異なる。なお、計算結果は記憶装置に格納される。 Further, the second collation unit 13 reads out the second registration model data corresponding to the speaker ID from the second registration model storage unit 17, and further extracts the feature vector x _i (1 ≦ i ≦ N) that is the voice analysis data. The collation process is performed using them (step S25). Similar to step S23, P (x _t | λ _s ) and log likelihood logP (x _t | λ _s ) are calculated for each feature vector using equations (2) and (3). Furthermore, the log likelihood total L2 is calculated according to the equation (8). Since the second registration model data is different from the first registration model data, the calculation results in step S23 and step S25 are different. The calculation result is stored in the storage device.

そして照合結果判定部１９は、ステップＳ２３とステップＳ２５の２つの照合処理結果を用いて判定処理を実施し、判定処理結果を出力する（ステップＳ２７）。ここでは、以下のような式に従って２つの照合処理結果である尤度を加算して、総合尤度Ｌを算出する。
Ｌ＝Ｌ１×（１−α）＋Ｌ２×α （１０）
但し、０≦α≦１となる。また、αの最適値については判定精度が向上するように実験的に求める。他の実験の条件にもよるが、０．９から０．９５において良い結果を示すことがわかっている。 The collation result determination unit 19 then performs determination processing using the two collation results of steps S23 and S25, and outputs the determination result (step S27). Here, the total likelihood L is calculated by adding the likelihoods obtained as the two collation results according to the following equation.
L = L1 × (1−α) + L2 × α (10)
Here, 0 ≦ α ≦ 1. The optimum value of α is found experimentally so as to improve determination accuracy. Although it depends on the other experimental conditions, values from 0.9 to 0.95 have been found to give good results.

そして、この総合尤度Ｌが所定の閾値を超えているかを判断することにより、今回の話者の認証が成功したか失敗したかが判定される。この場合判定処理結果としては、認証の成功又は失敗を表す情報が出力される。 Then, by determining whether or not the total likelihood L exceeds a predetermined threshold value, it is determined whether the current speaker authentication has succeeded or failed. In this case, information indicating the success or failure of the authentication is output as the determination processing result.
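The weighted combination of equation (10) and the threshold decision can be written as a short sketch. The function names and the numeric values of L1, L2 and the threshold are illustrative assumptions; only the formula L = L1 × (1 − α) + L2 × α and the range of α come from the text.

```python
def total_likelihood(l1, l2, alpha):
    """Equation (10): L = L1 * (1 - alpha) + L2 * alpha, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return l1 * (1.0 - alpha) + l2 * alpha


def verify(l1, l2, alpha, threshold):
    """Authentication succeeds when the total likelihood exceeds the threshold."""
    return total_likelihood(l1, l2, alpha) > threshold


# Hypothetical log-likelihood sums from the two collation units.
combined = total_likelihood(-100.0, -80.0, alpha=0.9)
accepted = verify(-100.0, -80.0, alpha=0.9, threshold=-85.0)
```

With α near 0.9, the total is dominated by L2 from the adapted prior model, while L1 still contributes when the registration and collation utterances happen to be similar.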

第１登録モデル・データを生成する際に話者により多くの子音母音を発声してもらえればよいが、実際は話者に負担がかかるため多くの子音母音を発声してもらえないことが多い。従って、第１照合部７により算出された尤度は、第１登録モデル・データを生成する際に話者が発声した子音母音の構成と照合時に話者が発声した子音母音の構成が類似している場合には良くなるが、子音母音の構成が大きく異なっていると悪くなりがちである。一方、第２照合部１３により算出された尤度は、おおむねあまりよくないが、第２登録モデル・データを生成する際に話者が発声した子音母音の構成と照合時の子音母音の構成の違いに関係なく安定したものとなる。従って、上で述べたように２つの照合処理結果を総合して最終判定処理を行えば、互いに補う形となり判定精度が向上する。 Ideally, the speaker would utter many consonants and vowels when the first registration model data is generated, but in practice this burdens the speaker, so a wide variety of consonants and vowels often cannot be obtained. Consequently, the likelihood calculated by the first collation unit 7 is good when the consonant-vowel composition uttered by the speaker at registration is similar to the one uttered at collation, but tends to be poor when the two differ greatly. The likelihood calculated by the second collation unit 13, on the other hand, is generally not very good, but it is stable regardless of differences between the consonant-vowel composition uttered when the second registration model data was generated and the one uttered at collation. Therefore, performing the final determination on the combination of the two collation results, as described above, lets the two compensate for each other and improves determination accuracy.

なお、本実施の形態では、第１照合部７も第２照合部１３も、登録時又は照合時に発声される音声の内容が限定されないテキスト独立方式についての照合処理を行う例を示している。 Note that this embodiment shows an example in which both the first collation unit 7 and the second collation unit 13 perform collation processing under a text-independent scheme, in which the content of the speech uttered at registration or collation is not restricted.

念のため話者識別の際の簡略化した処理フローについて図７を用いて説明しておく。まず、話者の音声は、マイクロフォン等である音声入力部１を介して入力される（ステップＳ３１）。音声入力部１では、空気の振動である音声波を電気信号に変換する。次に、音声分析部３は、音声の電気信号をディジタル化し、１５ｍｓから３０ｍｓ程度の分析窓で、５ｍｓから３０ｍｓ程度のフレーム毎に音声分析を実施し、音声分析データ（例えばＬＰＣケプストラム係数の系列Ｃ_ij）を生成する（ステップＳ３３）。すなわち、特徴ベクトルｘ_iをフレーム数分生成する。生成されたデータは図示しない記憶装置に格納する。ここでは登録の場合の説明は省略するので、切替部５はフレーム数分の特徴ベクトルｘ_iを第１照合部７と第２照合部１３に出力する。 For completeness, a simplified processing flow for speaker identification is described with reference to FIG. 7. First, the speaker's voice is input via the voice input unit 1, such as a microphone (step S31). The voice input unit 1 converts the voice wave, which is a vibration of air, into an electrical signal. The voice analysis unit 3 then digitizes the electrical signal and, using an analysis window of about 15 ms to 30 ms, performs voice analysis for each frame at intervals of about 5 ms to 30 ms, generating voice analysis data (for example, a sequence of LPC cepstrum coefficients C_ij) (step S33). That is, as many feature vectors x_i are generated as there are frames. The generated data is stored in a storage device (not shown). Because the description of registration is omitted here, the switching unit 5 outputs the feature vectors x_i for all frames to the first collation unit 7 and the second collation unit 13.

第１照合部７は、第１登録モデル格納部１１から順次各話者ＩＤの第１登録モデル・データを読み出し、音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）に対して照合処理を実施する（ステップＳ３５）。すなわち、話者ＩＤ毎に、各特徴ベクトルにつき（２）及び（３）式でＰ（ｘ_t|λ_s）、そして対数尤度logＰ（ｘ_t|λ_s）を算出する。さらに、（８）式に従って対数尤度の総和Ｌ１を話者ＩＤ毎に計算する。なお、計算結果は記憶装置に格納される。 The first collation unit 7 sequentially reads the first registration model data of each speaker ID from the first registration model storage unit 11 and collates it with the feature vector x _i (1 ≦ i ≦ N) which is voice analysis data. Processing is performed (step S35). That is, for each speaker ID, P (x _t | λ _s ) and log likelihood logP (x _t | λ _s ) are calculated for each feature vector using equations (2) and (3). Further, the sum of log likelihoods L1 is calculated for each speaker ID according to the equation (8). The calculation result is stored in the storage device.

また、第２照合部１３は、第２登録モデル格納部１７から順次各話者ＩＤの第２登録モデル・データを読み出し、音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）に対して照合処理を実施する（ステップＳ３７）。ステップＳ３５と同様に、話者ＩＤ毎に、各特徴ベクトルにつき（２）式及び（３）式でＰ（ｘ_t|λ_s）、そして対数尤度ｌｏｇＰ（ｘ_t|λ_s）を算出する。さらに、（８）式に従って対数尤度の総和Ｌ２を話者ＩＤ毎に計算する。なお、計算結果は記憶装置に格納される。 Further, the second collation unit 13 sequentially reads out the second registration model data of each speaker ID from the second registration model storage unit 17, and for the feature vector x _i (1 ≦ i ≦ N) that is the voice analysis data. The collation process is performed (step S37). Similarly to step S35, P (x _t | λ _s ) and log likelihood logP (x _t | λ _s ) are calculated for each feature vector using the equations (2) and (3) for each speaker ID. . Further, the sum L2 of logarithmic likelihoods is calculated for each speaker ID according to the equation (8). The calculation result is stored in the storage device.

そして照合結果判定部１９は、ステップＳ３５とステップＳ３７の２つの照合処理結果を用いて総合尤度を話者ＩＤ毎に算出し、記憶装置に格納する（ステップＳ３９）。ここでは、（１０）式に従って２つの照合処理結果である尤度Ｌ１及びＬ２を加算して、総合尤度Ｌを各話者ＩＤにつき算出する。 The collation result determination unit 19 then calculates the total likelihood for each speaker ID using the two collation results of steps S35 and S37, and stores it in the storage device (step S39). Here, the likelihoods L1 and L2 obtained as the two collation results are added according to equation (10), and the total likelihood L is calculated for each speaker ID.

そして、照合結果判定部１９は、この総合尤度Ｌが最も高い話者ＩＤなどを、最終判定結果として出力する（ステップＳ４１）。数式で示せば、以下のようになる。

話者ＩＤがｓとして出力される。なお、ここでは総合尤度Ｌが１／Ｎされているが、しなくともよい。 The collation result determination unit 19 then outputs, as the final determination result, the speaker ID or the like with the highest total likelihood L (step S41). Expressed as a formula, this is as follows:

The speaker ID is output as s. Here the total likelihood L is multiplied by 1/N, but this normalization is not required.
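The selection in step S41 amounts to taking the speaker ID whose total likelihood is largest. The sketch below assumes the per-speaker values L1 and L2 are already available in a dictionary; the function name `identify`, the dictionary layout and the sample values are hypothetical. The optional division by the number of frames N mirrors the 1/N factor mentioned above and does not change which speaker is selected.

```python
def identify(scores, alpha, n_frames=None):
    """Return the speaker ID with the highest total likelihood.

    scores maps speaker ID -> (L1, L2); equation (10) combines the pair.
    """
    if not scores:
        raise ValueError("no registered speakers to compare")
    best_id = None
    best_total = None
    for speaker_id, (l1, l2) in scores.items():
        total = l1 * (1.0 - alpha) + l2 * alpha
        if n_frames:
            total /= n_frames  # optional 1/N normalization
        if best_total is None or total > best_total:
            best_id, best_total = speaker_id, total
    return best_id


# Hypothetical log-likelihood sums for two registered speakers.
scores = {"speaker_a": (-100.0, -80.0), "speaker_b": (-90.0, -85.0)}
winner = identify(scores, alpha=0.9)
```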

このようにすれば、話者識別処理を実施することができる。最終ステップであるステップＳ３９以外は、照合処理の回数が話者ＩＤの数だけ実施されるだけであり、話者認証処理の場合と本質的な差異はない。従って、（８）式のように総合尤度を計算して判定を行うため、判定精度が向上する。 In this way, speaker identification processing can be performed. Except for step S39, which is the final step, the number of verification processes is only the number of speaker IDs, and there is no essential difference from the case of speaker authentication processing. Therefore, the determination accuracy is improved because the total likelihood is calculated and determined as in equation (8).

２．実施の形態２
次に、第２登録モデル・データにＧＭＭではなくサブワード（例えば音節あるいは音素）単位のモデル・データを採用し、第２照合部１３において当該サブワード単位のモデルを接続して照合用モデルを生成すると共にテキスト独立方式の照合処理を実施する場合の処理について説明する。 2. Embodiment 2
Next, model data in units of subwords (for example, syllables or phonemes) instead of GMM is adopted as the second registered model data, and a model for verification is generated by connecting the models in units of subwords in the second verification unit 13. In addition, a description will be given of processing in the case where text independent verification processing is performed.

最初に、図４、図８及び図９を用いて本実施の形態における事前処理部２３の処理内容について説明する。事前処理部２３の事前音声データ格納部２３１には、多数の不特定話者による音声データ（例えばディジタル・データ）が格納されている。なお、多数の不特定話者による音声データについては、各々すべての子音母音の音声のデータが含まれるものとする。そこで、第２音声分析部２３３は、事前音声データ格納部２３１に格納された事前音声データを読み出して、フレーム毎に音声分析を実施し、音声分析データを生成する（ステップＳ１）。より具体的には、１５ｍｓから３０ｍｓ程度の分析窓（フレーム）で、５ｍｓから３０ｍｓ程度の分析周期（フレーム周期）毎に分析処理を実施し、例えばＬＰＣケプストラム係数（特徴ベクトル）の系列を生成する。ここでは音節毎に特徴ベクトルＸ_iを管理する。このような処理を事前音声データ格納部２３１に格納されている音声データすべてについて実施する。処理結果については記憶装置に格納する。 First, the processing of the pre-processing unit 23 in this embodiment is described with reference to FIGS. 4, 8, and 9. The prior voice data storage unit 231 of the pre-processing unit 23 stores voice data (for example, digital data) from many unspecified speakers. The voice data of each of these speakers is assumed to include speech for all consonants and vowels. The second voice analysis unit 233 reads the prior voice data stored in the prior voice data storage unit 231 and performs voice analysis for each frame to generate voice analysis data (step S1). More specifically, analysis is performed with an analysis window (frame) of about 15 ms to 30 ms at an analysis period (frame period) of about 5 ms to 30 ms, generating, for example, a sequence of LPC cepstrum coefficients (feature vectors). Here, the feature vectors X_i are managed for each syllable. This processing is performed for all voice data stored in the prior voice data storage unit 231. The processing result is stored in the storage device.

次に、事前モデル生成部２３５は、事前音声データ格納部２３１に格納されている多数の不特定話者による音声データに対する隠れマルコフモデル（ＨＭＭ：Hidden Markov Model）を音節毎に生成するための処理を実施し、処理結果を事前モデル・データとして事前モデル格納部２１に格納する（ステップＳ３）。 Next, the prior model generation unit 235 performs processing to generate, for each syllable, a hidden Markov model (HMM: Hidden Markov Model) for the voice data of many unspecified speakers stored in the prior voice data storage unit 231, and stores the result in the prior model storage unit 21 as prior model data (step S3).

ＨＭＭの構造の一例を図８に示す。ＨＭＭは、複数の状態８０１乃至８０５（ここではＪ個の状態Ｓ₀乃至Ｓ_J-1）とその状態の間の遷移（状態間を結ぶ矢印）とで構成される。そして、入力音声の特徴ベクトルＸ_iが１つ出力されるたびに状態を１回遷移するものとする。ここで状態Ｓ_kからＳ_lに遷移する確率ａ_klは以下のように表される。
ａ_kl＝Ｐ（ｓ_l＝Ｓ_l|ｓ_l-1＝Ｓ_k）（１１） An example of the structure of the HMM is shown in FIG. 8. The HMM is composed of a plurality of states 801 to 805 (here, J states S₀ to S_J-1) and the transitions between them (the arrows connecting the states). The state is assumed to transition once each time one feature vector X_i of the input voice is output. The probability a_kl of a transition from state S_k to state S_l is expressed as follows.
a _kl = P (s _l = S _l | s _l-1 = S _k ) (11)

また、状態Ｓ_kからＳ_lに遷移するときに特徴ベクトルｘが出力される確率ｂ_klは以下のように表される。
ｂ_kl＝Ｐ（ｘ｜ｓ_l＝Ｓ_l，ｓ_l-1＝Ｓ_k）（１２）
なお、ｂ_klは、（２）式で表される。 Further, the probability b _kl that the feature vector x is output when transitioning from the state S _k to S _l is expressed as follows.
b _kl = P (x | s _l = S _l , s _l-1 = S _k ) (12)
Note that b _kl is expressed by equation (2).

このようなモデルＷから入力音声の特徴ベクトルの系列Ｘ＝｛Ｘ₀，Ｘ₁，．．．Ｘ_i，．．．Ｘ_T-1｝が出力される確率は、以下の式で表される。

すなわち、Ｓ₀乃至Ｓ_J-1までの状態遷移パターンＳ毎に、その状態遷移パターンに従って（１１）式及び（１２）式の積を全部掛けて得られる値のうち最も大きい値をＰ（Ｘ｜Ｗ）とするものである。状態Ｓ₀からＳ_J-1まで遷移する間に特徴ベクトルＸ₀乃至Ｘ_T-1が生成されるため、状態遷移パターンＳは、図９に示すように、左下のポイント９０１と右上のポイント９０２とを水平方向と斜め方向の線分のみを用いて接続することのできる１又は複数のパターンである。図９では点線で表されたパターン９０３と実線で表されたパターン９０４の２つのパターンのみ示されているが、実際には多くのパターンが存在している。 The probability that the sequence of feature vectors of the input voice X = {X₀, X₁, ..., X_i, ..., X_T-1} is output from such a model W is expressed by the following equation.

That is, for each state transition pattern S from S₀ to S_J-1, the values of equations (11) and (12) are all multiplied together along that pattern, and the largest of the resulting values is taken as P(X|W). Because the feature vectors X₀ to X_T-1 are generated during the transition from state S₀ to S_J-1, the state transition patterns S are, as shown in FIG. 9, the one or more patterns that connect the lower-left point 901 to the upper-right point 902 using only horizontal and diagonal line segments. FIG. 9 shows only two patterns, pattern 903 drawn with a dotted line and pattern 904 drawn with a solid line, but in practice many patterns exist.
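Equation (13) is not reproduced in this text, but the description corresponds to a best-path (Viterbi) score: the maximum, over all allowed state transition patterns, of the product of the transition probabilities a and output probabilities b along the path. The sketch below computes that maximum in the log domain for a left-to-right model in which each frame either stays in the current state (a horizontal segment in FIG. 9) or advances to the next state (a diagonal segment). As a simplifying assumption, the output probability is tied to the state occupied at each frame rather than to the transition (k, l) as in equation (12); the function and parameter names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_log_prob(log_stay, log_next, log_emit):
    """Best-path log probability through a left-to-right HMM.

    log_stay[k]   : log probability of remaining in state k
    log_next[k]   : log probability of moving from state k to state k + 1
    log_emit[t][k]: log output probability of frame t in state k (assumed form)
    The path must start in state 0 at frame 0 and end in the last state.
    """
    n_frames = len(log_emit)
    n_states = len(log_stay)
    score = [NEG_INF] * n_states
    score[0] = log_emit[0][0]
    for t in range(1, n_frames):
        new_score = [NEG_INF] * n_states
        for k in range(n_states):
            stay = score[k] + log_stay[k]
            move = score[k - 1] + log_next[k - 1] if k > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new_score[k] = best + log_emit[t][k]
        score = new_score
    return score[n_states - 1]


# Two states, three frames, all output probabilities equal to 1 (log 0.0).
half = math.log(0.5)
best = viterbi_log_prob([half, half], [half], [[0.0, 0.0]] * 3)
```

In this toy case both admissible paths (advance after the first frame, or after the second) have probability 0.5 × 0.5, so the best-path score is log 0.25. A sequence with fewer frames than states cannot reach the last state and yields negative infinity.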

ステップＳ３では、音節毎に、（１３）式の値を最大にするように、（１１）式のａ_klと、（２）式（（１２）式から（２）式が参照される。）における重みｗ_smというパラメータを、例えば周知のＥＭアルゴリズム等により決定する。また、音節毎に、特徴ベクトルＸ_iのＭ個の平均ベクトルμ_smとＭ個の共分散行列Σ_smも算出する。このようにして求められた子音母音毎のａ_kl、Ｍ個（又はＭ−１個）の重みｗ_sm、特徴ベクトルのＭ個の平均ベクトルμ_sm及び共分散行列Σ_smが事前モデル・データとして事前モデル格納部２１に格納される。 In step S3, for each syllable, the parameters a_kl of equation (11) and the weights w_sm in equation (2) (equation (2) is referenced from equation (12)) are determined so as to maximize the value of equation (13), for example by the well-known EM algorithm. In addition, for each syllable, M average vectors μ_sm and M covariance matrices Σ_sm of the feature vectors X_i are calculated. The a_kl, the M (or M−1) weights w_sm, the M average vectors μ_sm of the feature vectors, and the covariance matrices Σ_sm obtained in this way for each consonant-vowel are stored in the prior model storage unit 21 as prior model data.

このように音節といったサブワード単位でモデル・データを用意することにより、モデル修正部１５において適切に話者に対する適応化を行うことができるようになる。 By preparing the model data in units of subwords such as syllables in this way, the model correction unit 15 can appropriately adapt to the speaker.

次に、本実施の形態における話者認識システムの処理フローを図１０を用いて説明する。ここでは話者認証の場合の処理フローを説明する。最初に、話者から、照合と登録のいずれを実施するか指定する処理選択入力及び話者識別情報（例えば話者ＩＤ）の入力を受け付ける（ステップＳ５１）。 Next, the processing flow of the speaker recognition system in the present embodiment will be described with reference to FIG. Here, a processing flow in the case of speaker authentication will be described. First, a process selection input for designating whether collation or registration is performed and input of speaker identification information (for example, speaker ID) are received from the speaker (step S51).

次に、話者の音声は、マイクロフォン等である音声入力部１を介して入力される（ステップＳ５３）。音声入力部１では、空気の振動である音声波を電気信号に変換する。次に、音声分析部３は、音声の電気信号をディジタル化し、１５ｍｓから３０ｍｓ程度の分析窓で、５ｍｓから３０ｍｓ程度のフレーム毎に音声分析を実施し、音声分析データ（例えばＬＰＣケプストラム係数の系列Ｃ_ij）を生成する（ステップＳ５５）。すなわち、特徴ベクトルｘ_iをフレーム数分生成する。生成されたデータは図示しない記憶装置に格納する。 Next, the speaker's voice is input via the voice input unit 1, such as a microphone (step S53). The voice input unit 1 converts the voice wave, which is a vibration of air, into an electrical signal. The voice analysis unit 3 then digitizes the electrical signal and, using an analysis window of about 15 ms to 30 ms, performs voice analysis for each frame at intervals of about 5 ms to 30 ms, generating voice analysis data (for example, a sequence of LPC cepstrum coefficients C_ij) (step S55). That is, as many feature vectors x_i are generated as there are frames. The generated data is stored in a storage device (not shown).

そして切替部５は、ステップＳ５１で受け付けた処理選択入力が照合であるか判断する（ステップＳ５７）。処理選択入力が照合ではなく登録である場合（ステップＳ５７：登録ルート）には、モデル生成部９は、話者の入力音声に対する第１登録モデル・データを生成し、話者ＩＤに対応して第１登録モデル格納部１１に登録する（ステップＳ５９）。モデル生成部９の処理は、第１の実施の形態における事前モデル生成部２３５の処理とほぼ同じである。すなわち、音声分析データである特徴ベクトルｘ_iのＭ個の平均ベクトルμ_smを算出し、さらにＭ個の共分散行列Σ_smを（６）式に従って算出する。さらに例えば（８）式を最大にするように重みｗ_smを算出する。このように算出されたデータを第１登録モデル格納部１１に登録する。 Then, the switching unit 5 determines whether or not the process selection input received in step S51 is collation (step S57). When the process selection input is registration rather than collation (step S57: registration route), the model generation unit 9 generates first registration model data for the input voice of the speaker, and corresponds to the speaker ID. Register in the first registration model storage unit 11 (step S59). The process of the model generation unit 9 is almost the same as the process of the prior model generation unit 235 in the first embodiment. That is, M average vectors μ _sm of feature vectors x _i that are speech analysis data are calculated, and M covariance matrices Σ _sm are calculated according to the equation (6). Further, for example, the weight w _sm is calculated so as to maximize the expression (8). The data calculated in this way is registered in the first registration model storage unit 11.

また、モデル修正部１５は、話者の入力音声の音声分析データに基づき事前モデルを修正して第２登録モデル・データを生成し、第２登録モデル格納部１７に格納する（ステップＳ６１）。具体的には、今回入力された音声の音節単位で、事前モデル格納部２１に格納されている音節単位の事前モデル・データ全てに対して（１３）式を計算し、最も確率の高い音節を特定する。そして、特定された音節の事前モデル・データに含まれる特徴ベクトルの平均ベクトルをμ₀（Ｍ個の平均ベクトルμの各々）とし、入力音声の音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）と定数βとを用いて（９）式にて第２登録モデルにおける特徴ベクトルのＭ個の平均ベクトルμ_aを算出する。 Further, the model correction unit 15 corrects the prior model based on the voice analysis data of the speaker's input voice, generates second registration model data, and stores it in the second registration model storage unit 17 (step S61). Specifically, in the syllable unit of the speech input this time, the equation (13) is calculated for all the pre-model data of the syllable unit stored in the pre-model storage unit 21, and the syllable with the highest probability is calculated. Identify. Then, the average vector of the feature vectors included in the prior model data of the identified syllable is μ ₀ (each of the M average vectors μ), and the feature vector x _i (1 ≦ i) that is the speech analysis data of the input speech ≦ N) and the constant β are used to calculate M average vectors μ _a of the feature vectors in the second registered model according to equation (9).

（９）式では事前モデルにおける平均ベクトルμ₀の重みを定数βで決定している。この定数βについては実験的に適切な値を決定する。事前モデルに含まれる共分散行列Σや重みｗについても、入力音声の音声分析データを用いて話者に適応化させてもよいが、本実施の形態では平均ベクトルμ₀のみを話者に適応化させる。 In equation (9), the weight of the average vector μ₀ of the prior model is determined by the constant β. An appropriate value for this constant β is determined experimentally. The covariance matrices Σ and the weights w included in the prior model may also be adapted to the speaker using the voice analysis data of the input voice, but in this embodiment only the average vectors μ₀ are adapted to the speaker.

このように入力音声の各音節につき、第２登録モデルとして（９）式で計算されるＭ個の平均ベクトルμ_aと、事前モデルに含まれるＭ個の共分散行列Σ及びＭ個（又はＭ−１個）の重みｗとを、話者ＩＤに対応して第２登録モデル格納部１７に登録する。さらに、入力音声に含まれなかった子音母音については、事前モデル・データをそのまま第２登録モデル・データとして話者ＩＤに対応して第２登録モデル格納部１７に登録する。 As described above, for each syllable of the input speech, M average vectors μ _a calculated by the equation (9) as the second registration model, and M covariance matrices Σ and M (or M) included in the prior model. -1) weight w is registered in the second registration model storage unit 17 in correspondence with the speaker ID. Further, for the consonant vowels that are not included in the input speech, the prior model data is directly registered in the second registration model storage unit 17 as the second registration model data corresponding to the speaker ID.
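The assembly of the second registration model from adapted and unadapted syllables can be sketched as below. The dictionary layout, the key names (`means`, `covariances`, `weights`) and the function name `build_second_model` are assumptions for illustration; the text specifies only that adapted average vectors replace the prior ones for syllables present in the input voice, and that the prior model data is registered unchanged for syllables that were not uttered.

```python
def build_second_model(prior_models, adapted_means):
    """Combine prior syllable models with speaker-adapted mean vectors.

    prior_models : syllable -> {"means": ..., "covariances": ..., "weights": ...}
    adapted_means: syllable -> adapted mean vectors, only for uttered syllables
    """
    second_model = {}
    for syllable, params in prior_models.items():
        entry = dict(params)  # copy so the prior model itself is not modified
        if syllable in adapted_means:
            # Only the means are adapted; covariances and weights stay as in the prior.
            entry["means"] = adapted_means[syllable]
        second_model[syllable] = entry
    return second_model


# Hypothetical prior with two syllables; only "ka" was uttered at registration.
prior = {
    "ka": {"means": [[0.0]], "covariances": [[1.0]], "weights": [1.0]},
    "sa": {"means": [[5.0]], "covariances": [[2.0]], "weights": [1.0]},
}
second = build_second_model(prior, {"ka": [[0.7]]})
```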

一方話者の処理選択入力が照合である場合（ステップＳ５７：照合ルート）、第１照合部７は、第１登録モデル格納部１１から話者ＩＤに対応する第１登録モデル・データを読み出し、音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）をさらに用いて照合処理を実施する（ステップＳ６３）。すなわち、各特徴ベクトルにつき（２）及び（３）式でＰ（ｘ_t|λ_s）、そして対数尤度logＰ（ｘ_t|λ_s）を算出する。さらに、（８）式に従って対数尤度の総和Ｌ１を計算する。なお、計算結果は記憶装置に格納される。 On the other hand, when the process selection input of the speaker is collation (step S57: collation route), the first collation unit 7 reads the first registration model data corresponding to the speaker ID from the first registration model storage unit 11, A matching process is further performed using the feature vector x _i (1 ≦ i ≦ N), which is speech analysis data (step S63). That is, for each feature vector, P (x _t | λ _s ) and log likelihood logP (x _t | λ _s ) are calculated according to equations (2) and (3). Furthermore, the log likelihood total L1 is calculated according to the equation (8). The calculation result is stored in the storage device.

また、第２照合部１３は、第２登録モデル格納部１７から話者ＩＤに対応する第２登録モデル・データを読み出し、照合用モデルを構成する（ステップＳ６５）。本実施の形態では第２照合部１３でもテキスト独立方式を採用するため、例えば図１１に示すように音節のモデルを接続する。すなわち、スタートから遷移した後の状態２１１を全ての音節のモデルで共有し、全ての音節のモデル２１２乃至２１５を並列に接続する。そして、エンドに遷移する前の状態２１６も全ての音節のモデルで共有する。さらに、状態２１６から状態２１１に戻るための状態遷移２１７を設定する。すなわち、入力音声の音節毎に、全ての音節のモデルと照合を行い、最も確率の高い音節モデルからの出力を採用する。これを入力音声の最後の音節まで繰り返すものである。 Further, the second collation unit 13 reads out the second registration model data corresponding to the speaker ID from the second registration model storage unit 17 and configures a collation model (step S65). In the present embodiment, the second collating unit 13 also adopts the text independent method, and therefore, for example, a syllable model is connected as shown in FIG. That is, the state 211 after transition from the start is shared by all syllable models, and all the syllable models 212 to 215 are connected in parallel. The state 216 before transitioning to the end is also shared by all syllable models. Further, a state transition 217 for returning from the state 216 to the state 211 is set. In other words, for each syllable of the input speech, all syllable models are collated, and the output from the syllable model with the highest probability is adopted. This is repeated until the last syllable of the input speech.

そして、第２照合部１３は、図１１に示すような照合モデルを用いて照合処理を実施する（ステップＳ６７）。より具体的には、入力音声の最初の音節に係る音声分析データである特徴ベクトルと第２登録モデル・データに含まれる全音節に係るモデル・データとを用いて、第２登録モデル・データに含まれる全音節について（１３）式に従って確率を算出する。そして、最大の確率が算出された音節についての確率を例えば記憶装置に保持する。そして、入力音声の次の音節に係るモデル・データについても同様に（１３）式に従って確率を算出し、最大の確率が算出された音節についての確率を例えば記憶装置に保持する。このように入力音声の最後の音節まで上で述べたような処理を繰り返し、最終的に記憶装置に保持されている確率を全て掛け合わせ、算出された値を尤度Ｌ２とする。但し、記憶装置に保持されている確率のそれぞれの対数を算出し、それらの総和を尤度Ｌ２とする場合もある。なお、計算結果は記憶装置に格納される。 The second collation unit 13 then performs collation processing using a collation model such as that shown in FIG. 11 (step S67). More specifically, using the feature vectors that are the voice analysis data for the first syllable of the input voice and the model data for all syllables included in the second registration model data, a probability is calculated according to equation (13) for every syllable included in the second registration model data. The probability of the syllable for which the maximum probability was calculated is held, for example, in a storage device. For the next syllable of the input voice, probabilities are likewise calculated according to equation (13), and the probability of the syllable with the maximum probability is held, for example, in the storage device. This processing is repeated up to the last syllable of the input voice; finally, all the probabilities held in the storage device are multiplied together, and the resulting value is taken as the likelihood L2. Alternatively, the logarithm of each held probability may be calculated and their sum taken as the likelihood L2. The calculation result is stored in the storage device.
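The per-syllable maximization in step S67 can be sketched in its logarithmic variant as follows. The sketch assumes the input voice has already been divided into syllable-sized segments, which the text does not explain, and it takes the scoring function as a parameter standing in for equation (13); the names `syllable_loop_log_likelihood` and `score`, and the toy scoring rule, are hypothetical.

```python
def syllable_loop_log_likelihood(segments, syllable_models, score):
    """Sum, over input segments, of the best log score among all syllable models.

    segments       : list of feature-vector sequences, one per input syllable
    syllable_models: syllable label -> model data
    score          : callable(segment, model) -> log probability
    """
    if not syllable_models:
        raise ValueError("at least one syllable model is required")
    total = 0.0
    for segment in segments:
        # Every syllable model is tried; only the most probable one is kept.
        total += max(score(segment, model) for model in syllable_models.values())
    return total


# Toy scoring rule: a model is just a preferred segment length.
def toy_score(segment, model):
    return -abs(len(segment) - model)


models = {"ka": 2, "sa": 4}
l2 = syllable_loop_log_likelihood([[0, 0], [0, 0, 0]], models, toy_score)
```

Summing log probabilities in this way is equivalent to multiplying the held probabilities and taking the logarithm of the product, and it avoids numerical underflow for long utterances.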

そして照合結果判定部１９は、ステップＳ６３とステップＳ６７の２つの照合処理結果を用いて判定処理を実施し、判定処理結果を出力する（ステップＳ６９）。ここでは、（１０）式に従って２つの照合処理結果である尤度を加算して、総合尤度Ｌを算出する。 And the collation result determination part 19 implements a determination process using the two collation process results of step S63 and step S67, and outputs a determination process result (step S69). Here, the total likelihood L is calculated by adding the likelihoods that are two collation processing results according to the equation (10).

本実施の形態は、実施の形態１とは第２登録モデル・データの内容及び第２照合部１３の処理内容が異なるが、実施の形態１と同様に２つの照合処理結果を総合して最終判定処理を行うので、互いに補うことになり判定精度が向上する。 The present embodiment differs from the first embodiment in the contents of the second registered model data and the processing contents of the second collation unit 13, but the final result is obtained by combining the two collation processing results as in the first embodiment. Since the determination process is performed, the determination accuracy is improved because they are mutually supplemented.

なお、話者識別の処理については、図７のステップＳ３７を、全第２登録モデル・データに対する図１０のステップＳ６５及びＳ６７に置き換えることにより、実施可能となる。従って、話者識別処理の話者識別精度も向上する。 The speaker identification process can be implemented by replacing step S37 in FIG. 7 with steps S65 and S67 in FIG. 10 for all second registered model data. Therefore, the speaker identification accuracy of the speaker identification process is also improved.

３．実施の形態３
次に、第２登録モデル・データにＧＭＭではなくサブワード（例えば音節）単位のモデル・データを採用し、第２照合部１３において当該サブワード単位のモデルを指定テキストに従って接続して照合用モデルを生成すると共にテキスト依存方式の照合処理を実施する場合の処理について説明する。なお、テキスト依存とは、照合又は登録時に話者に発声させるテキストを限定する方式である。 3. Embodiment 3
Next, instead of GMM, model data in units of subwords (eg, syllables) is adopted as the second registered model data, and a model for verification is generated by connecting the models in units of subwords according to the designated text in the second verification unit 13 In addition, a description will be given of processing in the case of performing text-dependent collation processing. Note that the text dependence is a method of limiting the text to be uttered by the speaker at the time of collation or registration.

事前処理部２３の処理については、実施の形態２で述べたものと同一なのでここでは説明を省略する。 Since the processing of the pre-processing unit 23 is the same as that described in the second embodiment, the description thereof is omitted here.

次に、本実施の形態における話者認識システムの処理フローを図１２を用いて説明する。ここでは話者認証の場合の処理フローを説明する。最初に、話者から、照合と登録のいずれを実施するか指定する処理選択入力及び話者識別情報（例えば話者ＩＤ）の入力を受け付ける（ステップＳ７１）。そして、話者により照合ではなく登録が選択された場合には（ステップＳ７３：登録ルート）、話者の音声が、マイクロフォン等である音声入力部１を介して入力される（ステップＳ７５）。音声入力部１では、空気の振動である音声波を電気信号に変換する。なお、切換部５はこの段階でモデル生成部９及びモデル修正部１５の方に音声分析データの出力先を切り替える。次に、音声分析部３は、音声の電気信号をディジタル化し、１５ｍｓから３０ｍｓ程度の分析窓で、５ｍｓから３０ｍｓ程度のフレーム毎に音声分析を実施し、音声分析データ（例えばＬＰＣケプストラム係数の系列Ｃ_ij）を生成する（ステップＳ７７）。すなわち、特徴ベクトルｘ_iをフレーム数分生成する。生成されたデータは図示しない記憶装置に格納する。 Next, the processing flow of the speaker recognition system in this embodiment is described with reference to FIG. 12. The flow for speaker authentication is described here. First, a process selection input specifying whether collation or registration is to be performed, and speaker identification information (for example, a speaker ID), are received from the speaker (step S71). If the speaker has selected registration rather than collation (step S73: registration route), the speaker's voice is input via the voice input unit 1, such as a microphone (step S75). The voice input unit 1 converts the voice wave, which is a vibration of air, into an electrical signal. At this stage, the switching unit 5 switches the output destination of the voice analysis data to the model generation unit 9 and the model correction unit 15. The voice analysis unit 3 then digitizes the electrical signal and, using an analysis window of about 15 ms to 30 ms, performs voice analysis for each frame at intervals of about 5 ms to 30 ms, generating voice analysis data (for example, a sequence of LPC cepstrum coefficients C_ij) (step S77). That is, as many feature vectors x_i are generated as there are frames. The generated data is stored in a storage device (not shown).

そして、モデル生成部９は、話者の入力音声に対する第１登録モデル・データを生成し、話者ＩＤに対応して第１登録モデル格納部１１に登録する（ステップＳ７９）。モデル生成部９の処理は、第１の実施の形態における事前モデル生成部２３５の処理とほぼ同じである。すなわち、音声分析データである特徴ベクトルｘ_iのＭ個の平均ベクトルμ_smを算出し、さらにＭ個の共分散行列Σ_smを（６）式に従って算出する。さらに例えば（８）式を最大にするようにＭ個（又はＭ−１個）の重みｗ_smを算出する。このように算出されたデータを第１登録モデル格納部１１に登録する。 The model generation unit 9 then generates first registration model data for the speaker's input voice and registers it in the first registration model storage unit 11 in association with the speaker ID (step S79). The processing of the model generation unit 9 is almost the same as that of the prior model generation unit 235 in the first embodiment. That is, M average vectors μ_sm of the feature vectors x_i, which are the voice analysis data, are calculated, and M covariance matrices Σ_sm are calculated according to equation (6). Further, M (or M−1) weights w_sm are calculated so as to maximize, for example, equation (8). The data calculated in this way is registered in the first registration model storage unit 11.

また、モデル修正部１５は、話者の入力音声の音声分析データに基づき事前モデルを修正して第２登録モデル・データを生成し、第２登録モデル格納部１７に格納する（ステップＳ８１）。具体的には、今回入力された音声の音節単位で、事前モデル格納部２１に格納されている音節単位の事前モデル・データ全てに対して（１３）式を計算し、最も確率の高い音節を特定する。そして、特定された音節の事前モデル・データに含まれる特徴ベクトルの平均ベクトルをμ₀（Ｍ個の平均ベクトルの各々）とし、入力音声の音声分析データである特徴ベクトルｘ_i（１≦ｉ≦Ｎ）と定数βとを用いて（９）式にて第２登録モデルにおける特徴ベクトルのＭ個の平均ベクトルμ_aを算出する。事前モデルに含まれる共分散行列Σや重みｗについても、入力音声の音声分析データを用いて話者に適応化させてもよいが、本実施の形態では平均ベクトルμ₀のみを話者に適応化させる。 Further, the model correction unit 15 corrects the prior model based on the voice analysis data of the speaker's input voice to generate second registration model data, and stores it in the second registration model storage unit 17 (step S81). Specifically, for each syllable of the voice input this time, equation (13) is calculated against all of the syllable-unit prior model data stored in the prior model storage unit 21, and the syllable with the highest probability is identified. Then, letting μ₀ be an average vector of the feature vectors included in the prior model data of the identified syllable (each of the M average vectors), the M average vectors μ_a of the feature vectors in the second registration model are calculated by equation (9) from the feature vectors x_i (1 ≦ i ≦ N), which are the voice analysis data of the input voice, and the constant β. The covariance matrices Σ and the weights w included in the prior model may also be adapted to the speaker using the voice analysis data of the input voice, but in this embodiment only the average vectors μ₀ are adapted to the speaker.

In this way, for each syllable of the input speech, the M mean vectors μ_a calculated by equation (9), together with the M covariance matrices Σ and the M (or M-1) weights w contained in the prior model, are registered as the second registered model in the second registered model storage unit 17 in association with the speaker ID. For consonant-vowel units that did not occur in the input speech, the prior model data is registered unchanged as second registered model data in the second registered model storage unit 17 in association with the speaker ID.

On the other hand, when the speaker's processing selection input is verification (step S73: verification route), the utterance text determination unit 25 determines the utterance text (phrase) that the speaker is asked to utter and outputs it via a display device, or via a speech conversion device and a loudspeaker (not shown) (step S83). The speaker's speech for the specified utterance text is then input via the speech input unit 1, such as a microphone (step S85). The speech input unit 1 converts the speech wave, which is a vibration of the air, into an electrical signal. Next, the speech analysis unit 3 digitizes the electrical speech signal, performs speech analysis with an analysis window of about 15 ms to 30 ms at a frame period of about 5 ms to 30 ms, and generates speech analysis data (for example, a sequence of LPC cepstrum coefficients C_ij) (step S87). That is, one feature vector x_i is generated per frame. The generated data is stored in a storage device (not shown).
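
The windowing in step S87 can be sketched as follows. The function name `frame_signal`, the 16 kHz sample rate, and the default values of 25 ms and 10 ms are assumptions chosen from within the ranges stated above; the LPC cepstrum analysis applied to each frame is not shown.

```python
def frame_signal(samples, sample_rate, window_ms=25.0, period_ms=10.0):
    """Split a digitized signal into overlapping analysis frames.

    window_ms is the analysis window length and period_ms is the frame
    period; both defaults are illustrative values, not patent values.
    """
    window = int(sample_rate * window_ms / 1000.0)
    step = max(1, int(sample_rate * period_ms / 1000.0))
    last_start = len(samples) - window
    # Signals shorter than one window yield no frames.
    return [samples[s:s + window] for s in range(0, last_start + 1, step)]


# One second of dummy samples at an assumed 16 kHz sample rate.
signal = list(range(16000))
frames = frame_signal(signal, sample_rate=16000)
```

Each element of `frames` would then be passed to the feature extraction stage to produce one feature vector x_i.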

The first verification unit 7 then reads the first registered model data corresponding to the speaker ID from the first registered model storage unit 11 and performs verification processing using it together with the feature vectors x_i (1 ≤ i ≤ N), which are the speech analysis data (step S89). That is, for each feature vector, P(x_t|λ_s) is calculated by equations (2) and (3), followed by the log likelihood log P(x_t|λ_s). The sum L1 of the log likelihoods is then calculated according to equation (8). The calculation result is stored in the storage device.
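
A minimal sketch of the scoring in step S89 is given below. It assumes diagonal covariance matrices, which the passage does not state, and the function names are hypothetical. The per-frame mixture likelihood is evaluated in the log domain with the log-sum-exp trick to avoid underflow.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance (assumption)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log( sum_m w_m * N(x_t; mu_m, Sigma_m) ).

    All weights must be strictly positive because their log is taken.
    """
    total = 0.0
    for x in frames:
        terms = [
            math.log(w) + log_gaussian_diag(x, mu, var)
            for w, mu, var in zip(weights, means, variances)
        ]
        peak = max(terms)
        # log-sum-exp over the mixture components
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


# Hypothetical one-dimensional, two-component model.
weights = [0.5, 0.5]
means = [[0.0], [4.0]]
variances = [[1.0], [1.0]]
near = gmm_log_likelihood([[0.1], [3.9]], weights, means, variances)
far = gmm_log_likelihood([[10.0], [12.0]], weights, means, variances)
```

Frames that lie close to the model's components (`near`) receive a higher total than frames far from them (`far`).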

The second verification unit 13 reads the second registered model data corresponding to the speaker ID from the second registered model storage unit 17 and constructs a verification model corresponding to the utterance text (step S91). In this embodiment the second verification unit 13 uses a text-dependent method, so the syllable models are connected as shown, for example, in FIGS. 13(a) and 13(b). Here the utterance text is "asahi", so the model data for "a", "sa", and "hi" is read from the second registered model data as shown in FIG. 13(a), and the models are connected as shown in FIG. 13(b) by replacing the last state of each syllable, except for the model of the last syllable, with the first state of the next syllable. In other words, the models are concatenated so that a significant probability (likelihood) is obtained only when the speaker utters "asahi".
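
The connection rule of step S91 can be sketched with each syllable model reduced to an ordered list of state names. The state names and the function `connect_syllable_models` are hypothetical, and FIG. 13 is not reproduced here; transition and output probabilities are omitted.

```python
def connect_syllable_models(models):
    """Chain per-syllable state lists into one verification model.

    The last state of every model except the final one is dropped, so
    the transition that used to enter it now leads into the first state
    of the next syllable.
    """
    if not models:
        return []
    chained = []
    for states in models[:-1]:
        chained.extend(states[:-1])
    chained.extend(models[-1])
    return chained


# Hypothetical state lists for the syllables of "asahi".
second_registered = {
    "a": ["a1", "a2", "a_end"],
    "sa": ["sa1", "sa2", "sa_end"],
    "hi": ["hi1", "hi2", "hi_end"],
}
utterance = ["a", "sa", "hi"]
verification_model = connect_syllable_models(
    [second_registered[s] for s in utterance]
)
```

Only the final syllable keeps its last state, so the chained model ends where "hi" ends.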

The second verification unit 13 then performs verification processing using a verification model such as that shown in FIG. 13 (step S93). More specifically, the probability is calculated according to equation (13) from the feature vectors, which are the speech analysis data of the input speech, and the model data of the syllables contained in the utterance text. The calculated value is taken as the likelihood L2. The calculation result is stored in the storage device.

The verification result determination unit 19 then performs determination processing using the two verification results from steps S89 and S93 and outputs the determination result (step S95). Here, the likelihoods that are the two verification results are added according to equation (10) to calculate the overall likelihood L.
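
The combination in step S95 can be sketched as below. Equation (10) is not reproduced in this passage; the weighting by (1 - α) and α follows the wording of the claims. The acceptance threshold and the function names are assumptions added for illustration.

```python
def combined_likelihood(l1, l2, alpha):
    """Weighted sum (1 - alpha) * L1 + alpha * L2, with 0 <= alpha <= 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * l1 + alpha * l2


def accept_speaker(l1, l2, alpha, threshold):
    """Accept when the combined likelihood reaches a (hypothetical) threshold."""
    return combined_likelihood(l1, l2, alpha) >= threshold


# Hypothetical log-likelihood scores from the two verification units.
decision = accept_speaker(l1=-10.0, l2=-20.0, alpha=0.5, threshold=-16.0)
```

Setting α to 0 relies only on the first verification unit, and setting it to 1 relies only on the second.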

This embodiment differs from the first embodiment in the content of the second registered model data and in the processing of the second verification unit 13. As in the first embodiment, however, the final determination combines the two verification results, so the two methods compensate for each other and the determination accuracy improves. In addition, because a text-dependent method is used for the second verification unit 13, the system can also resist, for example, an impostor who uses a recording of the genuine speaker's voice.

Speaker identification processing can be carried out by replacing step S37 in FIG. 7 with steps S91 and S93 in FIG. 12, applied to the second registered model data for the utterance text. The accuracy of speaker identification therefore improves as well.
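
Speaker identification selects, among the registered speakers, the one whose models best match the input speech. The flow of FIG. 7 is not reproduced in this passage, so the sketch below is only an assumption of how the per-speaker scores might be compared: it takes precomputed pairs (L1, L2) for each speaker ID, combines them with the weight α used in the claims, and returns the best-scoring speaker. The function name and the data layout are hypothetical.

```python
def identify_speaker(scores_by_speaker, alpha):
    """Return (speaker_id, score) with the highest combined likelihood.

    scores_by_speaker maps a speaker ID to a pair (L1, L2) holding the
    results of the first and second verification processing.
    """
    best_id, best_score = None, float("-inf")
    for speaker_id, (l1, l2) in scores_by_speaker.items():
        score = (1.0 - alpha) * l1 + alpha * l2
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id, best_score


# Hypothetical scores for two registered speakers.
scores = {"speaker_A": (-10.0, -30.0), "speaker_B": (-15.0, -12.0)}
winner, winner_score = identify_speaker(scores, alpha=0.5)
```

With an empty dictionary the function returns `(None, -inf)`, which a caller can treat as "no registered speaker".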

4. Other Embodiments
(1) Model correction unit 15
The description above shows an example in which the mean vectors μ of the prior model are adapted to the speaker with the weight β (maximum a posteriori estimation, MAP), but maximum likelihood linear regression (MLLR) may be used instead.

(2) Text-dependent method
The third embodiment shows an example in which the speaker speaks freely at speaker registration and an utterance text is specified at speaker verification. However, the system may also be configured so that an utterance text is specified at speaker registration and the same utterance text is specified at speaker verification. In that case, the model correction unit 15 carries out the processing up to and including the construction of the verification model, which the second verification unit 13 performs in the third embodiment, and stores the result in the second registered model storage unit 17.

Embodiments of the present invention have been described above, but the present invention is not limited to them. For example, the functional block diagram shown in FIG. 3 does not necessarily correspond to the actual program module structure.

FIG. 1 is a functional block diagram of the first prior art.
FIG. 2 is a functional block diagram of the second prior art.
FIG. 3 is a functional block diagram according to an embodiment of the present invention.
FIG. 4 is a diagram showing the processing flow of the preprocessing unit.
FIG. 5 is a schematic diagram showing the relationship between LPC cepstrum coefficients and the speech wave.
FIG. 6 is a diagram showing the processing flow of verification and registration in the first embodiment.
FIG. 7 is a diagram showing the processing flow of speaker identification.
FIG. 8 is a schematic diagram showing an example of an HMM.
FIG. 9 is a schematic diagram for explaining state transition patterns in an HMM.
FIG. 10 is a diagram showing the processing flow of verification and registration in the second embodiment.
FIG. 11 is a diagram showing a verification model constructed from the second registered model (for the second embodiment).
FIG. 12 is a diagram showing the processing flow of verification and registration in the third embodiment.
FIG. 13 is a diagram showing a verification model constructed from the second registered model (for the third embodiment).

Explanation of symbols

1 Speech input unit
3 Speech analysis unit
5 Switching unit
7 First verification unit
9 Model generation unit
11 First registered model storage unit
13 Second verification unit
15 Model correction unit
17 Second registered model storage unit
19 Verification result determination unit
21 Prior model storage unit
23 Preprocessing unit
25 Utterance text determination unit
231 Prior speech data storage unit
233 Second speech analysis unit
235 Prior model generation unit

Claims

A speaker recognition system comprising:
a first registered model data storage unit that stores first registered model data generated only from speech data of a person to be verified;
a second registered model data storage unit that stores second registered model data generated by adapting, to the person to be verified, speaker-independent model data generated from speech data of a plurality of unspecified speakers;
analysis means for analyzing speech data of the person to be verified and generating speech analysis data;
first verification processing means for performing verification processing using the speech analysis data and the first registered model data stored in the first registered model data storage unit;
second verification processing means for performing verification processing using the speech analysis data and the second registered model data stored in the second registered model data storage unit; and
determination means for performing final determination processing for the person to be verified on the basis of the verification results of the first verification processing means and the second verification processing means,
wherein the determination means performs the final determination processing for the person to be verified on the basis of a value obtained by adding the product of a first likelihood, which is the verification result of the first verification processing means, and (1 - α), where α is a predetermined real number from 0 to 1 inclusive, to the product of a second likelihood, which is the verification result of the second verification processing means, and α.

The speaker recognition system according to claim 1,
wherein the first registered model data and the second registered model data are data of Gaussian mixture models, and
the verification processing by the first verification processing means and the verification processing by the second verification processing means are verification processing corresponding to the Gaussian mixture model.

The speaker recognition system according to claim 1,
wherein the first registered model data is data of a Gaussian mixture model,
the second registered model data is subword-unit model data,
the verification processing by the first verification processing means is verification processing corresponding to the Gaussian mixture model, and
the second verification processing means includes:
verification model data generation means for connecting the subword-unit model data stored in the second registered model data storage unit to generate verification model data; and
means for performing verification processing using the verification model data and the speech analysis data.

The speaker recognition system according to claim 3, further comprising
means for determining a phrase that the person to be verified is asked to utter,
wherein the verification model data generation means connects the subword-unit model data stored in the second registered model data storage unit in accordance with the phrase to generate the verification model data.

The speaker recognition system according to any one of claims 1 to 4, further comprising:
means for generating the first registered model data from the speech analysis data of the person to be verified generated by the analysis means at the time of model data registration; and
second registered model data generation means for adapting the speaker-independent model data stored in a speaker-independent model data storage unit using the speech analysis data of the person to be verified generated by the analysis means at the time of model data registration, and generating the second registered model data.

The speaker recognition system according to claim 5,
wherein the second registered model data generation means
performs processing that adapts, according to a predetermined method, the model data of the subwords uttered by the person to be verified at the time of model data registration, and
connects the adapted subword-unit model data to generate the second registered model data.

A speaker recognition program causing a computer to execute:
a step of analyzing speech data of a person to be verified and generating speech analysis data;
a first verification processing step of performing verification processing between the speech analysis data and first registered model data that is generated only from speech data of the person to be verified and stored in a first registered model data storage device;
a second verification processing step of performing verification processing between the speech analysis data and second registered model data that is generated by adapting, to the person to be verified, speaker-independent model data generated from speech data of a plurality of unspecified speakers and that is stored in a second registered model data storage device; and
a determination step of performing final determination processing for the person to be verified on the basis of the verification results of the first verification processing step and the second verification processing step,
wherein the determination step performs the final determination processing for the person to be verified on the basis of a value obtained by adding the product of a first likelihood, which is the verification result of the first verification processing step, and (1 - α), where α is a predetermined real number from 0 to 1 inclusive, to the product of a second likelihood, which is the verification result of the second verification processing step, and α.

A speaker recognition method executed by a computer, comprising:
a step of analyzing speech data of a person to be verified and generating speech analysis data;
a first verification processing step of performing verification processing using the speech analysis data and first registered model data that is generated only from speech data of the person to be verified and stored in a first registered model data storage unit;
a second verification processing step of performing verification processing using the speech analysis data and second registered model data that is generated by adapting, to the person to be verified, speaker-independent model data generated from speech data of a plurality of unspecified speakers and that is stored in a second registered model data storage unit; and
a determination step of performing final determination processing for the person to be verified on the basis of the verification results of the first verification processing step and the second verification processing step,
wherein the determination step performs the final determination processing for the person to be verified on the basis of a value obtained by adding the product of a first likelihood, which is the verification result of the first verification processing step, and (1 - α), where α is a predetermined real number from 0 to 1 inclusive, to the product of a second likelihood, which is the verification result of the second verification processing step, and α.