JP4318475B2

JP4318475B2 - Speaker authentication device and speaker authentication program

Info

Publication number: JP4318475B2
Application number: JP2003086865A
Authority: JP
Inventors: 史比古高井; 修一池野
Original assignee: Secom Co Ltd
Current assignee: Secom Co Ltd
Priority date: 2003-03-27
Filing date: 2003-03-27
Publication date: 2009-08-26
Anticipated expiration: 2023-03-27
Also published as: JP2004294755A

Description

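[0001]
[Technical Field]
The present invention relates to a speaker authentication device and a speaker authentication program for identifying who the speaker of an utterance is.
[0002]
[Prior Art]
Speaker authentication devices are in use that authenticate who a speaker is from a keyword uttered by the user and perform security management based on the authentication result.
[0003]
In such a device, the feature quantities of the voice signal of a keyword that identifies a user (such as a personal identification number) must be registered in a verification database, in association with that user, before authentication. These feature quantities serve as verification information when the user is authenticated. A user who wishes to be authenticated inputs speech containing the previously registered keyword into the speaker authentication device as authentication information. The authentication information is compared with the verification information, the verification information most similar to the authentication information is selected, and the user who input the authentication information is authenticated as the user associated with that verification information.
[0004]
When verification information is registered in the verification database, the speech uttered by the user contains unwanted portions such as noise sections and unvoiced sections. The section corresponding to the keyword must therefore be extracted accurately, and the feature quantities of the voice signal in that section registered as the verification information. Because the verification information acts as the key of the speaker authentication device, the accuracy with which its section is detected strongly affects the accuracy of subsequent authentication.
[0005]
For segmenting voice signals, a method has been disclosed that detects the speech section needed for registration from the amplitude information and duration of the input signal (for example, L. R. Rabiner et al., "An algorithm for determining the endpoint of isolated utterances").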
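The amplitude-and-duration approach mentioned in [0005] can be illustrated with a minimal sketch. This is not the algorithm of the cited paper; the function name, the mean-absolute-amplitude measure, and the parameters `frame_len`, `amp_threshold`, and `min_frames` are assumptions chosen for illustration.

```python
def detect_endpoints(samples, frame_len, amp_threshold, min_frames):
    """Return (start, end) sample indices of the first run of loud frames
    that lasts at least min_frames frames, or None if there is no such run."""
    n_frames = len(samples) // frame_len
    loud = []
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        # Amplitude criterion: mean absolute amplitude of the frame.
        loud.append(sum(abs(x) for x in frame) / frame_len >= amp_threshold)

    run_start = None
    for k, is_loud in enumerate(loud + [False]):  # sentinel closes a trailing run
        if is_loud and run_start is None:
            run_start = k
        elif not is_loud and run_start is not None:
            # Duration criterion: ignore bursts shorter than min_frames.
            if k - run_start >= min_frames:
                return run_start * frame_len, k * frame_len
            run_start = None
    return None
```

A detector of this kind has no notion of what the keyword sounds like, which is why fillers, noise bursts, and weak consonants cause the errors described in [0011].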
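[0006]
Word spotting is also widely used. It uses a similarity computed by pattern matching, such as DP matching, to determine whether a predetermined word occurs in a voice signal and, if so, where. Word spotting is known to be less susceptible to noise than detecting speech sections from the amplitude information and duration of the input signal (see, for example, Satoru Hayamizu et al., "Continuous word recognition experiments using continuous DP and their discussion").
[0007]
A method has also been disclosed in which speech recognition models representing the symbols used in keywords, such as digits and letters, are prepared on the basis of speech input from the user, and speaker recognition is performed using these speaker models (for example, JP 2000-99090 A).
[0008]
As time passes after the verification information is registered, the user's physical condition or manner of speaking may change. In that case, authentication becomes more likely to fail even for authentication information uttered by the same user who registered the verification information. To counter such changes in the voice over time, a method is also used in which, each time authentication is performed, the section corresponding to the keyword is extracted from the authentication information and the verification information is updated with the feature quantities of the voice signal in that section (for example, JP S57-13493 A).
[0009]
[Patent Document 1]
JP S57-13493 A
[Patent Document 2]
JP 2000-99090 A
[Non-Patent Document 1]
L. R. Rabiner et al., "An algorithm for determining the endpoint of isolated utterances", Bell Syst. Tech. J., 1975, vol. 54, pp. 297-315
[Non-Patent Document 2]
Satoru Hayamizu, "Continuous word recognition experiments using continuous DP and their discussion", Transactions of the Institute of Electrical Communication, 1984, vol. J67-D, No. 6, pp. 677-684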
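[0010]
[Problems to Be Solved by the Invention]
The conventional techniques above, however, cannot reliably extract the verification information to be registered from the voice signal acquired from the user at registration.
[0011]
Methods that detect speech sections from the amplitude information and duration of the input signal often detect the section with extra material attached before or after the verification information to be registered: filler sounds uttered by the user (such as "ah" or "er") or external noise. When a word begins or ends with a low-amplitude sound such as a consonant, that consonant may be dropped from the detected section. Furthermore, when the verification information contains a pause (silent section), part of the verification information may be lost.
[0012]
Such section detection errors are fatal to a high-accuracy authentication device. If verification information is registered with noise mixed in, the probability that the user is correctly authenticated falls. If it is registered with part missing, less information about the user's individual characteristics is available, so the probability that another person is wrongly authenticated as the registered user rises.
[0013]
In the technique of JP 2000-99090 A, standard speaker models are selected according to the user's utterance, a speech model of the keyword is created by combining the selected speaker models, and the verification information is extracted using that speech model. If the user's utterance is unclear when the speaker models are selected, inappropriate speaker models are chosen and the section of the verification information is detected incorrectly.
[0014]
At authentication time too, if the user stumbles over the keyword or does not clearly pronounce its beginning or end, part of the keyword section may be dropped or detected as a silent section, so the section corresponding to the keyword cannot be extracted accurately from the authentication information. As a result, authentication accuracy may fall or authentication may produce a wrong result.
[0015]
When the verification information is updated as needed to follow changes in the user's voice over time, it may be updated with a keyword that was incorrectly extracted from the authentication information, which harms the next authentication. As updates are repeated, the errors accumulate, and authentication accuracy degrades rapidly as the number of authentications grows.
[0016]
In view of these problems of the prior art, an object of the present invention is to provide a speaker authentication device and a speaker authentication program, for identifying who a speaker is, that can solve at least one of the problems above.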
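[0017]
[Means for Solving the Problems]
The present invention, which can solve the above problems, is a speaker authentication device comprising: verification database storage means that holds, as verification information associated with a user, the feature quantities of a voice signal in which the user uttered a keyword; authentication voice signal acquisition means that acquires a voice signal uttered by a user seeking authentication; and user identification means that identifies the user seeking authentication by comparing the feature quantities of the voice signal acquired by the authentication voice signal acquisition means with the verification information. The device is characterized by including: registration voice signal acquisition means that acquires a voice signal from a user registering the keyword; registration keyword section extraction means that extracts, from the voice signal acquired by the registration voice signal acquisition means, the feature quantities of the section most similar to a standard recognition model representing the keyword; extraction database storage means that holds the feature quantities of the section extracted by the registration keyword section extraction means, in association with the user registering the keyword, as extraction information that is not updated; authentication keyword section extraction means that extracts, from the voice signal acquired by the authentication voice signal acquisition means, the feature quantities of the section most similar to the extraction information; and database update means that updates the verification information associated with the user identified by the user identification means, based on the feature quantities extracted by the authentication keyword section extraction means.
[0018]
Preferably, the user identification means identifies the user seeking authentication by comparing the feature quantities extracted by the authentication keyword section extraction means with the verification information.
[0019]
Preferably, the device further comprises keyword acquisition means that acquires a symbol string representing the keyword, and recognition model construction means that obtains a speech recognition model for each symbol acquired by the keyword acquisition means and combines those models to construct the standard recognition model.
[0020]
Preferably, the keyword acquisition means acquires the keyword using a keyboard, a pointing device, or a touch panel.
[0023]
The device may further include preliminary search means that narrows down the verification information held in the verification database storage means by comparing the feature quantities of the voice signal acquired by the authentication voice signal acquisition means with a part of the verification information, and the user identification means may identify the user seeking authentication by using the verification information narrowed down by the preliminary search means for the comparison.
[0024]
Another aspect of the present invention that can solve the above problems is a speaker authentication program for a computer that has a verification database holding, as verification information associated with a user, the feature quantities of a voice signal in which the user uttered a keyword, and an extraction database holding, as extraction information that is not updated, the feature quantities of a voice signal used when extracting the section corresponding to the keyword from a voice signal acquired from the user. The program causes the computer to execute processing including: a registration voice signal acquisition step of acquiring a voice signal from a user registering the keyword; a registration keyword section extraction step of extracting, from the voice signal acquired in the registration voice signal acquisition step, the feature quantities of the section most similar to a standard recognition model representing the keyword; an extraction information registration step of registering the feature quantities of the section extracted in the registration keyword section extraction step in the extraction database as extraction information that is not updated; an authentication voice signal acquisition step of acquiring a voice signal uttered by a user seeking authentication; an authentication keyword section extraction step of extracting, as authentication information, the feature quantities of the section of the voice signal acquired in the authentication voice signal acquisition step that is most similar to the extraction information; a user identification step of identifying the user seeking authentication by comparing the authentication information with the verification information; and a database update step of updating, based on the authentication information, the verification information associated with the user identified in the user identification step.
[0025]
Preferably, the user identification step identifies the user seeking authentication by comparing the feature quantities extracted in the authentication keyword section extraction step with the verification information. Preferably, the program further comprises a keyword acquisition step of acquiring a symbol string representing the keyword, and a recognition model construction step of obtaining a speech recognition model for each symbol acquired in the keyword acquisition step and combining those models to construct the standard recognition model.
[0026]
Preferably, the keyword acquisition step acquires the keyword using a keyboard, a pointing device, or a touch panel.
[0029]
Furthermore, the speaker authentication program of the present invention may cause the computer to further execute a preliminary search step of narrowing down the verification information held in the verification database by comparing the feature quantities of the voice signal acquired in the authentication voice signal acquisition step with a part of the verification information, and the user identification step may identify the user seeking authentication by using the verification information narrowed down in the preliminary search step for the comparison.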
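[0030]
[Embodiments of the Invention]
<Authentication device>
An authentication device according to an embodiment of the present invention is described in detail below with reference to the drawings. The authentication device of this embodiment performs voice authentication, identifying who a user is from the speech the user utters.
[0031]
As shown in FIG. 1, the authentication device 100 of this embodiment basically consists of a control unit 10, a storage unit 12, a keyword acquisition unit 14, a voice signal acquisition unit 16, a display unit 18, and a bus 20. The control unit 10, storage unit 12, keyword acquisition unit 14, voice signal acquisition unit 16, and display unit 18 are connected through the bus 20 so that they can exchange information.
[0032]
The control unit 10 corresponds to the central processing unit (CPU) of a computer. By executing the basic software (operating system) stored in the storage unit 12, the control unit 10 obtains information from the user through the keyword acquisition unit 14 and the voice signal acquisition unit 16 and presents information to the user through the display unit 18. By executing the authentication program stored in the storage unit 12, it authenticates the user based on the speech acquired from the user. The authentication processing is described in detail later.
[0033]
The storage unit 12 stores and holds the basic software, the authentication program, and other software executed by the control unit 10. It also stores and holds, temporarily or permanently, information processed by the control unit 10, such as information acquired through the keyword acquisition unit 14 or the voice signal acquisition unit 16 and information presented to the user through the display unit 18. In addition, the storage unit 12 stores and holds the verification database, the extraction database, and the preliminary search database used in the authentication processing. The contents of these databases are described in detail later. The control unit 10 can read the information held in the storage unit 12 as needed.
[0034]
A semiconductor memory can be used as the storage unit 12. When voice databases must be kept for a large number of users, a large-capacity auxiliary storage device such as a hard disk, optical disc, magneto-optical disc, or magnetic tape may also be provided.
[0035]
The keyword acquisition unit 14 acquires the symbols contained in the keyword used for authentication. It can be, for example, a keyboard, from which the user enters the symbols representing the keyword. The entered symbols are sent to the control unit 10 for processing. A character input device other than a keyboard, such as a pointing device or a touch panel, may also be used as the keyword acquisition unit 14 to select the symbols.
[0036]
The voice signal acquisition unit 16 includes a microphone for picking up the user's speech, an amplifier, an analog-to-digital converter, and so on. The user inputs speech through the voice signal acquisition unit 16. The speech picked up by the microphone is amplified by the amplifier and converted to a digital signal by the analog-to-digital converter for processing.
[0037]
The display unit 18 provides the user with the information needed for processing. It can be, for example, a display device. On receiving an image display command from the control unit 10, the display unit 18 presents screens prompting the user to enter the keyword through the keyword acquisition unit 14 or to input speech through the voice signal acquisition unit 16. It also presents acquired information and processing results to the user. A touch-panel liquid crystal display, or an audio output device including a speaker, can also be used as the display unit 18.
[0038]
As described above, the authentication device 100 of this embodiment can basically be built from an information processing device with an embedded microcomputer. It can be installed near devices that require user authentication, such as doors and safe doors, and used to identify who a user is from the user's speech input.
[0039]
As shown in FIG. 2, network interfaces 22 and 24 may additionally be provided so that the authentication device as a whole is made up of a separate client 100a and server 100b connected over a network.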
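[0040]
<Authentication method>
Next, the user authentication method of this embodiment is described. The method is broadly divided into two parts. Registration processing acquires from each user a voice signal containing the keyword that user registers, extracts the section corresponding to the keyword, and registers the feature quantities of that section in each database. Verification processing uses those databases to actually authenticate users. The two are described separately below.
[0041]
(Registration processing)
The registration processing of this embodiment follows the flowchart in FIG. 3. It can be executed by the authentication device of this embodiment by implementing each step of the flowchart as a program stored and held in the storage unit 12.
[0042]
In step S10, the symbols contained in the keyword are acquired from the user through the keyword acquisition unit 14. The control unit 10 causes the display unit 18 to show a screen prompting the user to enter the symbol string that makes up the keyword to be registered. Through the keyword acquisition unit 14, the user enters into the authentication device the symbols of the keyword that will identify that user. The entered symbols are stored in the storage unit 12.
[0043]
The keyword can be a string of digits. It is not limited to digits, however, and can be extended to any number of letters, kana characters, digits, and other arbitrary symbols in combination. The authentication device may also define several candidate keywords and have the user select one of them.
[0044]
The description below uses an example in which a user with user name A (hereinafter, user A) registers the four-digit string "1234" as the keyword. User A enters the symbols of the keyword by pressing the "1", "2", "3", and "4" keys on a numeric keypad.
[0045]
In step S12, a recognition model is constructed from the acquired symbols. A speech recognition model for each symbol that may be used in a keyword is registered in advance in the storage unit 12 as a model construction database. The speech recognition models for the symbols acquired in step S10 are retrieved from this database and combined to construct a standard recognition model representing the keyword. An existing HMM (Hidden Markov Model) can be used to construct the recognition model. The recognition model is later used to extract the section corresponding to the keyword from the voice signal acquired from the user.
[0046]
The speech recognition models stored in the model construction database are preferably trained on voice signals for each symbol collected from a large number of users.
[0047]
When a keyboard that can enter letters, kana characters, digits, and other symbols is used, the model construction database must be built so that a recognition model can be generated for keywords made of any combination of those symbols.
[0048]
For example, as shown in FIG. 4, the speech recognition models for "1", "2", "3", and "4" are retrieved from the model construction database and combined into a standard recognition model representing the keyword "1234". Here, the speech recognition model for "1", for example, is built by statistical processing so that voice signals of the symbol "1" uttered by multiple people, or uttered multiple times by a particular speaker, are recognized as "1".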
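As a rough illustration of step S12, the sketch below concatenates per-symbol models into one keyword model. The patent builds the model from HMMs; here each symbol model is reduced to a short list of state mean vectors, and `SYMBOL_MODELS` with its values is entirely hypothetical.

```python
# Hypothetical model construction database: each symbol maps to a sequence of
# states, and each state is summarized only by a mean feature vector.
SYMBOL_MODELS = {
    "1": [[0.1, 0.2], [0.3, 0.1]],
    "2": [[0.5, 0.4], [0.6, 0.2]],
    "3": [[0.9, 0.8], [0.7, 0.9]],
    "4": [[0.2, 0.7], [0.1, 0.9]],
}

def build_keyword_model(symbols, symbol_models):
    """Concatenate the per-symbol state sequences in keyword order."""
    model = []
    for symbol in symbols:
        if symbol not in symbol_models:
            raise KeyError(f"no speech recognition model for symbol {symbol!r}")
        model.extend(symbol_models[symbol])
    return model

keyword_model = build_keyword_model("1234", SYMBOL_MODELS)
```

Because the keyword arrives as typed symbols, the model exists before the user has spoken at all, which is the property the embodiment relies on at the first registration.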
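[0049]
In step S14, a counter i is initialized. The counter i counts the number of times the user has input the keyword for registration. In this step it is set to 0.
[0050]
In step S16, the voice signal uttered by the user is acquired through the voice signal acquisition unit 16. The control unit 10 causes the display unit 18 to show a screen prompting the user to utter the keyword and puts the voice signal acquisition unit 16 into a state of waiting for voice input. The user inputs a voice signal through the voice signal acquisition unit 16 by uttering the keyword. The input voice signal is analog-to-digital converted and stored in the storage unit 12.
[0051]
User A inputs a voice signal through the voice signal acquisition unit 16 by saying "1234". As shown in FIG. 5, the acquired signal contains, along with the portion corresponding to the keyword "1234" uttered by the user, signals from unwanted sections such as external noise and unvoiced sections.
[0052]
In step S18, feature quantities are extracted from the entire digitized voice signal. Spectral envelope information is preferred as the feature quantity. Spectral envelope information is the rough shape of the distribution of frequency components contained in the voice signal at a given instant. It can be obtained by performing spectral analysis on the digitized signal sequence with a predetermined frame width (for example, 32 ms) at a predetermined frame period (for example, 8 ms) and computing the existing LPC (Linear Predictive Coefficient) cepstrum.
[0053]
The feature quantities extracted from the voice signal are not limited to spectral envelope information, however; any information that characterizes the voice signal will do. For example, the change of the signal amplitude over time, or the period at which voiced or unvoiced sections occur, may be used. The feature quantity selected here is used for matching and extraction of voice signals in the processing that follows.
[0054]
In the example in FIG. 6, the voice signal acquired from user A is divided into frames of a predetermined width at a predetermined frame period, spectral analysis is performed on each frame, and the coefficients of a 12th-order LPC cepstrum are extracted.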
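The framing and LPC cepstrum computation of step S18 can be sketched as follows. This is a textbook formulation (autocorrelation method, Levinson-Durbin recursion, and the standard LPC-to-cepstrum recursion), not code from the patent. The sampling rate is an assumption, windowing and pre-emphasis are left out, and the 0th (gain) coefficient is not computed.

```python
def split_frames(samples, rate_hz, width_ms=32, period_ms=8):
    """Cut a signal into overlapping frames (32 ms wide every 8 ms by default)."""
    width = rate_hz * width_ms // 1000
    hop = rate_hz * period_ms // 1000
    return [samples[s:s + width] for s in range(0, len(samples) - width + 1, hop)]

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def lpc_from_autocorrelation(r, order):
    """Levinson-Durbin recursion; returns a[1..order] with x[n] ~ sum a[k] x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or degenerate frame: keep remaining terms at zero
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]

def lpc_cepstrum(a):
    """Cepstral coefficients c[1..p] of the all-pole model 1 / (1 - sum a[k] z^-k)."""
    p = len(a)
    c = [0.0] * (p + 1)
    for n in range(1, p + 1):
        c[n] = a[n - 1] + sum((k / n) * c[k] * a[n - k - 1] for k in range(1, n))
    return c[1:]
```

For a 12th-order analysis as in FIG. 6, each frame would go through `autocorrelation(frame, 12)`, `lpc_from_autocorrelation(r, 12)`, and `lpc_cepstrum(a)`, giving one coefficient vector per frame.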
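[0055]
In step S20, it is determined whether the voice signal being processed is the first one acquired. If the counter i is 0, processing moves to step S22; otherwise it moves to step S26.
[0056]
In step S22, the section corresponding to the keyword is detected in and extracted from the acquired voice signal using the recognition model constructed in step S12. Word spotting with the recognition model can be used here. Using the HMM method or the like, the feature quantities of the entire voice signal are compared with those of the recognition model, and the section of the voice signal most similar to the recognition model is extracted as the section corresponding to the keyword.
[0057]
The LPC cepstrum of the recognition model for the keyword "1234" is obtained and, as shown in FIG. 7, word spotting is performed against the LPC cepstrum of the voice signal obtained in step S18. That is, the voice signal acquired from the user is scanned from beginning to end to examine its similarity to the recognition model. Using the HMM method, the similarity between the LPC cepstrum of each section of the voice signal and the recognition model is computed, and the section with the highest similarity is extracted as the section corresponding to the keyword "1234". This removes from user A's voice signal the unwanted parts unrelated to the keyword "1234".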
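The scan described in [0057] can be sketched with subsequence DP matching in place of the HMM scoring the embodiment uses. That substitution is an assumption made to keep the example short, and `spot_keyword` is a hypothetical name. The template may start and end anywhere in the signal, and the best-matching section is returned.

```python
def spot_keyword(signal, template):
    """Return (start, end, distance) so that signal[start:end] is the section
    that best matches template under subsequence DP matching."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    n, m = len(signal), len(template)
    inf = float("inf")
    # cost[j][i]: best accumulated distance for template[:j+1] ending at frame i.
    cost = [[inf] * n for _ in range(m)]
    start = [[0] * n for _ in range(m)]
    for i in range(n):  # free start: the match may begin at any frame
        cost[0][i] = dist(signal[i], template[0])
        start[0][i] = i
    for j in range(1, m):
        for i in range(n):
            candidates = [(cost[j - 1][i], start[j - 1][i])]
            if i > 0:
                candidates.append((cost[j][i - 1], start[j][i - 1]))
                candidates.append((cost[j - 1][i - 1], start[j - 1][i - 1]))
            best_cost, best_start = min(candidates)
            cost[j][i] = best_cost + dist(signal[i], template[j])
            start[j][i] = best_start
    end = min(range(n), key=lambda i: cost[m - 1][i])  # free end
    return start[m - 1][end], end + 1, cost[m - 1][end]
```

The same routine could serve later steps that spot the keyword with a stored template (the extraction information) rather than a model.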
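[0058]
In step S24, the feature quantities of the voice signal in the section extracted as the keyword section are registered in the verification database, the extraction database, and the preliminary search database. They are associated with an identifier of the registering user (for example, the user name) and registered separately in the verification database and the extraction database as verification information and extraction information, respectively.
[0059]
In the preliminary search database, a part of the feature quantities of the extracted section is registered as preliminary search information. That is, the preliminary search information is configured to contain less information than the verification information registered in the verification database or the extraction information registered in the extraction database. For example, when the verification information is the full feature data of the extracted section, the preliminary search information is only a part of that data. When the verification information is a multi-template, that is, a combination of the feature quantities of keyword sections extracted from voice signals acquired multiple times, the preliminary search information may be the feature quantities of one of those voice signals.
[0060]
When the LPC cepstrum is chosen as the feature quantity, the verification information and extraction information can be the LPC cepstrum of the voice signal in the keyword section. In that case, the preliminary search information registered in the preliminary search database can be the LPC cepstrum of the keyword section thinned out in time or in order.
[0061]
For example, as shown in FIGS. 8(a) and 8(b), the LPC cepstrum of the section corresponding to the keyword "1234" is registered in the verification database and the extraction database, in association with user name A, as verification information and extraction information, respectively. In the preliminary search database, as shown in FIG. 8(c), only the 0th- to 8th-order coefficients of that LPC cepstrum are registered in association with user name A.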
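Thinning the template "in time or in order" as in [0060] and [0061] amounts to slicing. In the sketch below, a template is assumed to be a list of frames, each frame a list of coefficients indexed from order 0; the function name and parameters are illustrative.

```python
def make_preliminary_info(template, max_order=8, time_step=1):
    """Keep coefficients 0..max_order of every time_step-th frame."""
    return [frame[:max_order + 1] for frame in template[::time_step]]

# Four frames of 13 coefficients each (orders 0 to 12), filled with dummy values.
full_template = [[float(order) for order in range(13)] for _ in range(4)]
by_order = make_preliminary_info(full_template)                # orders 0..8 only
by_time = make_preliminary_info(full_template, 12, time_step=2)  # every 2nd frame
```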
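[0062]
The verification information registered in the verification database is used to authenticate the user from speech. The extraction information registered in the extraction database is used to locate the keyword section within a voice signal and extract it. The preliminary search information registered in the preliminary search database is used for preliminary narrowing before user authentication with the verification information or keyword section extraction with the extraction information.
[0063]
Because correct registration is an important factor in the accuracy of later verification processing, it is also preferable to have the user or an administrator confirm the registered verification information, extraction information, and preliminary search information.
[0064]
In step S34, the counter i is incremented by one. In step S36, it is determined whether the counter i has reached the repetition count M. If i is less than M, processing returns to step S16 and a voice signal is acquired from the user again. If i is M or more, the registration processing ends. The repetition count M is the number of times keyword registration is repeated for the same user and can be set according to the accuracy required of authentication, the processing speed of the authentication device, and so on.
[0065]
When step S20 determines that the counter i is not 0 and processing moves to step S26, the section corresponding to the keyword is extracted from the voice signal using the extraction information already registered in the extraction database. Word spotting with the extraction information associated with the registering user can be used for the extraction. The section of the voice signal most similar to the extraction information is extracted as the new keyword section.
[0066]
The extraction information for user A already registered in the extraction database is selected and used to perform word spotting on the voice signal acquired from the user, and the section corresponding to the keyword "1234" is extracted.
[0067]
In step S28, the verification information already registered in the verification database is compared with the voice signal extracted as the keyword section. Existing DP matching or the like can be used for the comparison. The result is computed as a similarity based on the distance between the two pieces of information. In the following, the reciprocal of the distance is used as the similarity. Thus the more similar the feature quantities of the extracted section are to the registering user's verification information, the larger the similarity value. The computed similarity is held in the storage unit 12 in association with the user.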
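The similarity of step S28 can be sketched as the reciprocal of a DP matching (dynamic time warping) distance. The local Euclidean distance and the small `eps` that guards against division by zero for identical sequences are assumptions not stated in the text.

```python
def dtw_distance(a, b):
    """Accumulated DP matching distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def similarity(a, b, eps=1e-9):
    """Reciprocal of the distance, so closer sequences score higher."""
    return 1.0 / (dtw_distance(a, b) + eps)
```

Taking the reciprocal keeps the convention used throughout the description that a larger value means a closer match.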
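[0068]
The verification information for user A already registered in the verification database is selected, and the similarity between the LPC cepstrum of the section extracted as the keyword "1234" and the LPC cepstrum of the selected verification information is computed.
[0069]
In step S30, the computed similarity is compared with a predetermined threshold. If the similarity is at or above the threshold, processing moves to step S32; if it is below the threshold, processing returns to step S16 and a voice signal is acquired from the user again.
[0070]
In step S32, the contents of the verification database and the preliminary search database are updated with the feature quantities of the keyword section extracted in step S26. That is, the verification information already registered in the verification database in association with the registering user is updated with the feature quantities of the section newly extracted in step S26, and the preliminary search information already registered in the preliminary search database for that user is replaced with a part of those feature quantities.
[0071]
For example, as shown in FIG. 9, the contents of the verification database and the preliminary search database are updated. Here, the verification information and preliminary search information already registered for user A are replaced with the LPC cepstrum of the keyword section newly extracted in step S26 and with its 0th- to 8th-order coefficients, respectively.
[0072]
If the verification information is a multi-template, the already registered template with the lowest similarity may be replaced with the feature quantities of the newly extracted section. In this case, the preliminary search information is preferably the template with the highest similarity among the templates of the verification information.
[0073]
It is also preferable to update with the average of the verification information and preliminary search information already registered for the user and the feature quantities of the voice signal newly extracted in step S26.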
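Two of the update strategies in [0072] and [0073] can be sketched as below. The text does not say against what the "lowest similarity" is measured; the sketch assumes similarity to the newly extracted section. The averaging variant assumes both templates have already been aligned to the same number of frames, which the text does not address.

```python
def replace_least_similar(templates, new_template, similarity):
    """Multi-template update: swap out the stored template least similar
    to the new one (assumed interpretation of 'lowest similarity')."""
    scores = [similarity(t, new_template) for t in templates]
    worst = scores.index(min(scores))
    return templates[:worst] + [new_template] + templates[worst + 1:]

def average_update(old, new):
    """Frame-by-frame average of two templates of equal length."""
    if len(old) != len(new):
        raise ValueError("templates must be time-aligned to the same length")
    return [[(a + b) / 2 for a, b in zip(f_old, f_new)]
            for f_old, f_new in zip(old, new)]
```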
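[0074]
By acquiring voice signals containing the keyword from the same user multiple times and registering the verification information and preliminary search information based on those signals, the accuracy of user authentication can be further improved.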
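The control flow of registration steps S14 to S36 can be summarized in a short sketch. All four callables are placeholders supplied by the caller, and the simple replacement in step S32 is one of the update options described above. As in the flowchart, a rejected utterance is not counted, so the loop only ends once enough utterances pass the threshold.

```python
def register(get_features, spot_with_model, spot_with_template, similarity,
             repeats, threshold):
    """Return (extraction_info, verification_info) after `repeats` accepted inputs."""
    extraction_info = verification_info = None
    i = 0                                                  # S14
    while i < repeats:                                     # S36
        features = get_features()                          # S16, S18
        if i == 0:                                         # S20
            section = spot_with_model(features)            # S22
            extraction_info = section                      # S24: fixed from here on
            verification_info = section
        else:
            section = spot_with_template(features, extraction_info)  # S26
            if similarity(section, verification_info) < threshold:   # S28, S30
                continue                                   # back to S16, i unchanged
            verification_info = section                    # S32
        i += 1                                             # S34
    return extraction_info, verification_info
```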
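[0075]
As described above, in this embodiment the symbols of the keyword the user wishes to register are acquired through an input device such as a keyboard, and a recognition model is constructed from those symbols. At the first registration, the keyword section is extracted using this recognition model, which was constructed without relying on the user's utterance. This prevents sections containing filler sounds or noise from being attached before or after the verification information to be registered, and prevents low-amplitude sounds such as consonants at the beginning or end of a word from being dropped. It also prevents part of the verification information from being lost when it contains a pause (silent section).
[0076]
In other words, the section corresponding to the keyword can be cut out appropriately from the voice signal acquired from the user, and the verification information, extraction information, and preliminary search information can be registered accurately. As a result, the accuracy of user authentication in the verification processing below is improved.
[0077]
(Verification processing)
Next, the verification processing of this embodiment is described. It follows the flowchart in FIG. 10 and can be realized on the authentication device by implementing each step of the flowchart as a program stored and held in the storage unit 12.
[0078]
In step S40, the keyword is acquired as a voice signal from a user seeking authentication. The control unit 10 causes the display unit 18 to show a screen prompting the user to utter the keyword for authentication and puts the voice signal acquisition unit 16 into a state of waiting for voice input. The user seeking authentication utters the keyword that was registered in the registration processing to identify that user and inputs it through the voice signal acquisition unit 16. The input voice signal is analog-to-digital converted and stored in the storage unit 12 as authentication information.
[0079]
When user A seeks authentication, user A inputs a voice signal into the authentication device as authentication information by saying the keyword "1234" into the microphone. The acquired authentication information contains, along with the keyword "1234" uttered by user A, voice signals from unwanted sections such as filler sounds, external noise, and unvoiced sections.
[0080]
In step S42, feature quantities are extracted from the digitized authentication information. They are of the same kind as the feature quantities registered in the verification database, extraction database, and preliminary search database during registration. For example, when LPC cepstrum coefficient values are registered in the databases, the LPC cepstrum is computed from the authentication information as the feature quantity. The details are the same as in step S18 and are omitted here.
[0081]
Here, the authentication information acquired from user A is divided into frames of a predetermined width at a predetermined frame period, spectral analysis is performed on each frame, and 12th-order LPC cepstrum coefficients are extracted.
[0082]
In step S44, a preliminary search is performed on the authentication information using the preliminary search information registered in the preliminary search database. Using existing word spotting or the like, each item of preliminary search information in the preliminary search database is compared in turn with the feature quantities of the authentication information acquired from the user. The users associated with the preliminary search information are preliminarily selected, up to a predetermined number C, in descending order of the similarity computed by DP matching or the like.
[0083]
For example, when the 0th- to 8th-order LPC cepstrum coefficients of each user's keyword are registered in the preliminary search database as preliminary search information, the 0th- to 8th-order coefficients of the LPC cepstrum of the authentication information obtained in step S42 are matched against each item of preliminary search information in the database. User names are then selected, up to the predetermined number, in descending order of the similarity of the section of the authentication information that best matches the associated preliminary search information. The selected users are assigned identification numbers in order starting from 1.
[0084]
If the number C of users selected by the preliminary search is three, the three items of preliminary search information with the highest similarity computed by DP matching or the like are selected, and the users associated with them are preliminarily selected. In the description below, users A, B, and C are selected and assigned identification numbers 1, 2, and 3, respectively.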
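Once a similarity has been computed for every registered user, the selection in step S44 reduces to a top-C ranking. In this sketch the scores are taken as given, and the dictionary layout and the user "D" are assumptions for illustration.

```python
def preliminary_search(scores, c):
    """scores maps user name to similarity; return {identification number: user}
    for the c best-scoring users, numbered from 1 in descending order of score."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:c]
    return {number: user for number, user in enumerate(ranked, start=1)}

candidates = preliminary_search({"A": 0.9, "B": 0.7, "C": 0.5, "D": 0.1}, c=3)
```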
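[0085]
Narrowing down the users in this way with preliminary search information, which contains less information than the verification information or extraction information, reduces the load of the later processing.
[0086]
In step S46, a counter j is initialized. The counter j counts the number of users that have been verified and is set to 1 in this step.
[0087]
In step S48, the section corresponding to the keyword is extracted from the authentication information using the extraction information registered in the extraction database. Using word spotting or the like, the section most similar to the extraction information associated with the user preliminarily selected in step S44 and specified by the value of the counter j is extracted from the authentication information. The feature quantities of the extracted section are held in the storage unit 12 in association with the user specified by the counter j.
[0088]
When step S44 has narrowed the candidates to users A, B, and C with identification numbers 1, 2, and 3, and the counter j is 1, the LPC cepstrum coefficient values for user A are selected from the extraction database, and the section most similar to them is extracted from the authentication information. If j is 2, the extraction uses the LPC cepstrum coefficient values for user B; if j is 3, those for user C.
[0089]
In step S50, the similarity between the voice signal of the section extracted in step S48 and the verification information registered in the verification database is computed. Using DP matching or the like, the feature quantities of the extracted section are compared with the verification information for the user specified by the counter j, and the similarity between the two is computed. The computed similarity is held in the storage unit 12 in association with the user.
[0090]
When step S44 has narrowed the candidates to users A, B, and C with identification numbers 1, 2, and 3, and the counter j is 1, the LPC cepstrum coefficient values for user A are selected from the verification database, and their similarity to the LPC cepstrum coefficient values of the section extracted in step S48 is computed. If j is 2, the similarity is computed with the LPC cepstrum coefficient values for user B; if j is 3, with those for user C.
[0091]
In step S52, the counter j is incremented by 1. In step S54, it is determined whether the value of the counter j is at or above the number C of preliminarily selected users. If j is less than C, processing returns to step S48 and is repeated for the user assigned the next value of j. If j is C or more, processing proceeds to step S56.
[0092]
The processing so far yields, for each of users A, B, and C, the similarity between the authentication information and that user's verification information.
[0093]
In step S56, the similarities held in the storage unit 12 are read out, and the largest of the C values, that is, the one indicating the highest similarity, is selected. That value is compared with a preset threshold. If the similarity is greater than the threshold, processing moves to step S58; if it is at or below the threshold, processing moves to step S60.
[0094]
In step S58, the user undergoing authentication is authenticated as the user corresponding to the largest similarity. On authentication, a screen indicating that the user has been authenticated may be shown on the display unit 18, a door may be unlocked, and so on.
[0095]
When user A seeks authentication, the similarity to user A's verification information is the largest among users A, B, and C and exceeds the threshold. When a user who has not registered a keyword attempts authentication, on the other hand, none of the similarities for the preliminarily selected users exceeds the threshold, and that user is not authenticated.
[0096]
In step S60, processing for the case where the user is not authenticated is performed. For example, a screen indicating that the user was not authenticated may be shown on the display unit 18.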
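The decision in steps S56 to S60 can be written in a few lines. The sketch assumes the per-candidate similarities are available as a dictionary and returns `None` for the not-authenticated branch. A similarity exactly equal to the threshold is rejected, matching the "at or below" wording of step S56.

```python
def decide(similarities, threshold):
    """Return the best-matching user if its similarity exceeds the threshold
    (step S58), otherwise None (step S60)."""
    if not similarities:
        return None
    best_user = max(similarities, key=similarities.get)
    return best_user if similarities[best_user] > threshold else None
```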
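[0097]
In step S62, following authentication, the verification information and preliminary search information of the authenticated user are updated. That is, the contents of the verification database and the preliminary search database are updated with the feature quantities of the keyword section that were held in the storage unit 12 in step S48 in association with the authenticated user.
[0098]
For example, the verification information already registered in the verification database for the authenticated user is replaced with the LPC cepstrum of the voice signal held in association with the user authenticated in step S58. If the verification information is a multi-template, the template with the smallest similarity among all the templates is replaced with the LPC cepstrum of the extracted voice signal.
[0099]
The preliminary search information already registered in the preliminary search database for the authenticated user is likewise replaced based on the LPC cepstrum of the voice signal held in association with the authenticated user. For example, when the verification information is the full set of LPC cepstrum coefficient values, the preliminary search information is replaced with a part of those coefficient values. When the verification information is a multi-template of LPC cepstrum coefficient values, the preliminary search information may be replaced with the template having the largest similarity among them.
[0100]
It is also preferable to update with the average of the verification information and preliminary search information already registered for the authenticated user and, respectively, the LPC cepstrum and the partial LPC cepstrum coefficient values of the voice signal associated with the authenticated user.
[0101]
Updating the verification information and preliminary search information of the authenticated user in this way suppresses the loss of authentication accuracy caused by changes over time in the user's physical condition, manner of speaking, and so on.
[0102]
Because the extraction information registered in the extraction database is not updated, the keyword section is extracted from the authentication information in the same way as at the time of keyword registration. This reduces the effect of errors accumulating in the verification information and preliminary search information as authentication is repeated. In other words, the cutting out of the keyword section uses extraction information that is never updated, while the final authentication of the user uses verification information that is updated at each authentication. Keeping the two separate achieves accurate authentication that follows changes in the user's voice over time while reducing the accumulation of errors that accompanies updates to the verification information.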
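The separation described in [0102] can be mirrored in a data structure whose extraction information is read-only after registration while the verification information stays mutable. The class and method names below are hypothetical, and the update shown is the simple replacement variant.

```python
class UserRecord:
    """Per-user entry: fixed extraction information, updatable verification information."""

    def __init__(self, keyword_features):
        # Stored as nested tuples so the extraction information cannot be mutated.
        self._extraction_info = tuple(tuple(frame) for frame in keyword_features)
        self.verification_info = [list(frame) for frame in keyword_features]

    @property
    def extraction_info(self):
        # No setter is defined: assigning to this attribute raises AttributeError.
        return self._extraction_info

    def update_after_authentication(self, section_features):
        """Step S62 with plain replacement; the extraction information is untouched."""
        self.verification_info = [list(frame) for frame in section_features]
```

With this layout, an update that happens to be based on a poorly extracted section can degrade only the verification information. The template used to locate the keyword in the next utterance is unaffected.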
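[0103]
In this embodiment the users are preliminarily narrowed down using the preliminary search information, but this preliminary search may be omitted.
[0104]
<Variation 1>
Next, a variation of the embodiment of the present invention is described. It can be carried out with the authentication device of the embodiment. It is also broadly divided into registration and verification processing, but since the registration processing is the same as above, only the verification processing is described below.
[0105]
The verification processing of the variation follows the flowchart in FIG. 11 and can be realized on the authentication device by implementing each step of the flowchart as a program stored and held in the storage unit 12. Steps that perform the same processing as steps of the verification processing in the embodiment are given the same reference signs, and their description is omitted.
[0106]
In step S64, the feature quantities of the voice signal extracted in step S42 are compared with the verification information registered in the verification database. Using word spotting or the like, the section most similar to the verification information associated with the user selected in step S44 and specified by the counter j is extracted from the authentication information. Then, using DP matching or the like, the feature quantities of the extracted section are compared with the verification information for the user specified by the counter j, and the similarity between the two is computed. Thus the more similar the feature quantities of the extracted section are to that verification information, the larger the similarity value. The computed similarity is held in the storage unit 12 in association with the user.
[0107]
For example, when step S44 has narrowed the candidates to users A, B, and C with identification numbers 1, 2, and 3, and the counter j is 1, the verification information for user A is selected from the verification database, the section most similar to it is cut out of the authentication information, and the similarity between the feature quantities of that section and the verification information is computed. If j is 2, the similarity is computed with the verification information for user B; if j is 3, with that for user C.
[0108]
That is, in this variation the similarity between the authentication information and each user's verification information is computed without using the extraction information registered in the extraction database. The user is then authenticated in steps S56 to S60 based on those similarities.
[0109]
In step S66, the section corresponding to the keyword is extracted from the voice signal using the extraction information associated with the user corresponding to the largest similarity, that is, the authenticated user. The extraction information associated with the authenticated user is selected from the extraction database, the feature quantities of the voice signal extracted in step S42 are compared with it using word spotting or the like, and the section most similar to the extraction information is cut out as the keyword section.
[0110]
In step S62, the verification information and preliminary search information associated with the authenticated user are updated with the feature quantities of that keyword section.
[0111]
According to this variation, the keyword section is extracted only for the authenticated user, which reduces the processing load of the user authentication in steps S46 to S54. As a result, the waiting time from the user's voice input until the authentication result is obtained is shortened.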
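[0112]
[Effects of the Invention]
According to the present invention, the keyword used for authentication can be registered accurately, and speaker authentication that is little affected by changes in the user's voice over time can be realized.
[Brief Description of the Drawings]
[FIG. 1] A block diagram showing the configuration of the authentication device according to the embodiment of the present invention.
[FIG. 2] A block diagram showing another configuration of the authentication device according to the embodiment of the present invention.
[FIG. 3] A flowchart of the registration processing for speaker authentication according to the embodiment of the present invention.
[FIG. 4] A diagram illustrating an example of constructing the recognition model.
[FIG. 5] A diagram showing an example of a voice signal acquired from a user.
[FIG. 6] A diagram illustrating an example of extracting feature quantities from a voice signal.
[FIG. 7] A diagram illustrating an example of applying word spotting to a voice signal acquired from a user.
[FIG. 8] A diagram showing an example of the contents of the verification database, the extraction database, and the preliminary search database.
[FIG. 9] A diagram illustrating an example of updating the verification database and the preliminary search database.
[FIG. 10] A flowchart of the verification processing for speaker authentication according to the embodiment of the present invention.
[FIG. 11] A flowchart of the verification processing for speaker authentication in the variation of the embodiment of the present invention.
[Description of Reference Numerals]
10 control unit, 12 storage unit, 14 keyword acquisition unit, 16 voice signal acquisition unit, 18 display unit, 20 bus, 22, 24 network interfaces, 100 authentication device, 100a client, 100b server.
[0001]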
BACKGROUND OF THE INVENTION
The present invention relates to a speaker authentication device and a speaker authentication program for identifying who the speaker of an utterance is.
[0002]
[Prior art]
A speaker authentication device that authenticates who is a speaker from a keyword uttered by a user and performs security management based on the authentication result is used.
[0003]
In such a speaker authentication device, the feature quantity of the voice signal of a keyword that identifies a user (such as a personal identification number) must be registered in a verification database, in association with that user, before authentication. The feature quantity of the voice signal corresponding to this keyword is used as verification information when the user is authenticated. A user who wishes to be authenticated inputs speech containing the previously registered keyword to the speaker authentication device as authentication information. The authentication information is compared with the verification information, the verification information most similar to the authentication information is selected, and the user who input the authentication information is authenticated as the user associated with that verification information.
[0004]
When registering collation information in the collation database, since the voice uttered by the user includes an unnecessary part such as a noise section or an unvoiced section, the section corresponding to the keyword is accurately extracted and corresponds to the keyword. It is necessary to register the feature amount of the audio signal in the section as the verification information. Since the verification information plays the role of a key in the speaker authentication apparatus, the accuracy of the section detection of the verification information greatly affects the accuracy of subsequent authentication processing.
[0005]
For segmenting voice signals, a method has been disclosed that detects the speech section needed for registration from the amplitude information and duration of the input signal (for example, L. R. Rabiner et al., "An algorithm for determining the endpoint of isolated utterances").
[0006]
Word spotting is also widely used. It uses a similarity calculated by pattern matching, such as DP matching, to determine whether a predetermined word is present in a voice signal and, if so, its position. Word spotting is known to be less susceptible to noise than detecting a speech section from the amplitude information and duration of the input signal (for example, Satoru Hayamizu et al., "Continuous word recognition experiments using continuous DP and their discussion").
[0007]
A method has also been disclosed in which speech recognition models representing the symbols used in keywords, such as digits and letters, are prepared on the basis of speech input from the user, and speaker recognition is performed using these speaker models (for example, JP 2000-99090 A).
[0008]
In addition, the user's physical condition and manner of speaking may change as time passes after the verification information is registered. In that case, authentication is more likely to fail even for authentication information uttered by the same user who registered the verification information. To counter such changes in the voice over time, a method is also used in which, each time authentication is performed, the section corresponding to the keyword is extracted from the authentication information and the verification information is updated with the feature quantity of the voice signal in that section (for example, JP-A-57-13493).
[0009]
[Patent Document 1]
JP-A-57-13493
[Patent Document 2]
JP 2000-99090 A
[Non-Patent Document 1]
L. R. Rabiner et al., "An algorithm for determining the endpoint of isolated utterances", Bell Syst. Tech. J., 1975, vol. 54, pp. 297-315
[Non-Patent Document 2]
Satoru Hayamizu, "Continuous word recognition experiments using continuous DP and their discussion", Transactions of the Institute of Electrical Communication, 1984, vol. J67-D, No. 6, pp. 677-684
[0010]
[Problems to be solved by the invention]
However, in the above conventional technique, there is a problem that the verification information to be registered cannot be appropriately extracted from the audio signal acquired from the user when the verification information is registered.
[0011]
Methods that detect the speech section from the amplitude information and duration of the input signal often detect the section with extra material attached before or after the verification information to be registered: filler sounds uttered by the user (such as "ah" or "er") or external noise. When a low-amplitude consonant is present at the beginning or end of a word, that consonant may be dropped from the detected section. Furthermore, when the verification information contains a pause (silent section), part of the verification information may be lost.
[0012]
Such errors in detecting the section of the verification information are fatal to a high-accuracy authentication device. If the verification information is registered with noise mixed in, the probability that the user is correctly authenticated falls. If it is registered with part missing, less information about the user's personal characteristics is available, so the probability that another person is wrongly authenticated as the user who registered the verification information rises.
[0013]
In the technique described in JP 2000-99090 A, standard speaker models are selected according to the user's utterance, a speech model of the keyword is created by combining the selected speaker models, and the verification information is extracted using that speech model. If the user's utterance is unclear when the speaker models are selected, inappropriate speaker models are chosen and the section of the verification information is detected incorrectly.
[0014]
At authentication time too, if the user stumbles over the keyword or does not clearly pronounce its beginning or end, part of the section corresponding to the keyword may be dropped or detected as a silent section, so the section cannot be extracted accurately from the authentication information. As a result, the accuracy of user authentication may fall or authentication may produce a wrong result.
[0015]
In addition, when the verification information is updated as needed to follow changes in the user's voice over time, it may be updated with a keyword that was incorrectly extracted from the authentication information, which harms the next authentication. As updates are repeated, the errors accumulate, and the accuracy of authentication degrades rapidly as the number of authentications grows.
[0016]
In view of these problems of the prior art, an object of the present invention is to provide a speaker authentication device and a speaker authentication program, for identifying who a speaker is, that can solve at least one of the problems above.
[0017]
[Means for Solving the Problems]
  The present invention that can solve the above problemsUser uttered a keywordA database storage means for collation that stores the characteristic amount of the audio signal as collation information in association with the user, and an audio signal uttered by the user who is trying to authenticateTakeFor authenticationAudio signalAcquisition means and said authenticationAudio signalAcquired in the acquisition meansFeatures of audio signalAnd beforeNoteA user identification means for identifying a user who intends to perform the authentication by comparing application information, and a speaker authentication device comprising:The section of the section having the highest similarity between the registered voice signal acquisition means for acquiring the voice signal from the user who registers the keyword and the standard recognition model representing the keyword from the voice signal acquired by the registered voice signal acquisition means. A registered keyword section extracting means for extracting a feature quantity, and a feature quantity of the section extracted by the registered keyword section extracting means in association with a user who registers the keyword as extraction information that is not updated.Holding database storage means for extractionAnd an authentication keyword section extracting means for extracting a feature amount of a section having the highest similarity to the extraction information from the voice signal acquired by the authentication voice signal acquiring means, and a user specified by the user specifying means Database update means for updating the verification information associated with the authentication keyword section extraction means based on the feature amount extracted;It is characterized by including.
[0018]
  here,The user specifying unit specifies a user who is going to perform the authentication by comparing the feature amount extracted by the authentication keyword section extracting unit with the verification information.Is preferred.
[0019]
  Also,A keyword acquisition unit that acquires a symbol string representing a keyword, and a speech recognition model that represents each symbol acquired by the keyword acquisition unit, and a combination of these speech recognition models is used to construct the standard recognition model It is preferable to further include model construction means.
[0020]
Here, it is preferable that the keyword acquisition unit acquires a keyword using a keyboard, a pointing device, and a touch panel.
[0023]
  Further, by comparing the feature amount of the audio signal acquired by the authentication audio signal acquisition means with a part of the verification informationThe user specifying means further includes preliminary search means for narrowing down the matching information held in the matching database storage meansBeforeInformation for collation narrowed down by preliminary search meansNewsComparisonUsed forThe user who is going to authenticateMay be.
[0024]
  Another form of the present invention that can solve the above-mentioned problems isUser uttered a keywordA database for collation that stores the feature amount of the audio signal as information for collation in association with the user, and a feature amount of the audio signal used when extracting a section corresponding to the keyword from the audio signal acquired from the user.Not updatedIn a computer equipped with an extraction database to be stored as extraction information,A registration voice signal acquisition step of acquiring a voice signal from a user who registers the keyword and a section having the highest similarity between a standard recognition model representing the keyword from the voice signal acquired in the registration voice signal acquisition step. A registered keyword section extraction step for extracting a feature quantity; an extraction information registration step for registering the feature quantity of the section extracted in the registered keyword section extraction step in the extraction database as extraction information that is not updated;A voice signal uttered by the user attempting authenticationTakeFor authenticationAudio signalAn acquisition step;From the audio signal acquired in the authentication audio signal acquisition step,Information for extraction andofSimilaritymostHigh sectionFeatures as authentication informationExtractFor authenticationA keyword interval extraction step;The authentication information and the authentication informationA user specifying step of specifying a user who is going to perform the authentication by comparing the information for verification;A database update step of updating the verification information associated with the user identified in the user identification step based on the authentication information;It is characterized by executing processing includingspeakerIt is an authentication program.
[0025]
  Also, it is preferable that the user specifying step specifies the user attempting authentication by comparing the feature amount extracted in the authentication keyword section extraction step with the verification information. It is also preferable to further include a keyword acquisition step of acquiring a symbol string representing a keyword, and a recognition model construction step of obtaining a speech recognition model representing each symbol acquired in the keyword acquisition step and combining the speech recognition models to construct the standard recognition model.
[0026]
Here, it is preferable that the keyword acquisition step acquires a keyword using a keyboard, a pointing device, or a touch panel.
[0029]
  Furthermore, the speaker authentication program of the present invention may cause the computer to further execute a preliminary search step of narrowing down the verification information held in the verification database by comparing the feature amount of the audio signal acquired in the authentication audio signal acquisition step with a part of the verification information, and the user specifying step may use the verification information narrowed down by the preliminary search step for the comparison to specify the user attempting authentication.
[0030]
DETAILED DESCRIPTION OF THE INVENTION
<Authentication device>
An authentication apparatus according to an embodiment of the present invention will be described in detail with reference to the drawings. The authentication device in the present embodiment is a device that performs voice authentication for authenticating who the user is based on the voice uttered by the user.
[0031]
As shown in FIG. 1, the authentication device 100 according to the present embodiment basically includes a control unit 10, a storage unit 12, a keyword acquisition unit 14, an audio signal acquisition unit 16, a display unit 18, and a bus 20. The control unit 10, the storage unit 12, the keyword acquisition unit 14, the audio signal acquisition unit 16, and the display unit 18 are connected via a bus 20 so as to be able to transmit information to each other.
[0032]
The control unit 10 corresponds to a central processing unit (CPU) of a computer. The control unit 10 executes basic software (an operating system) stored in the storage unit 12 to acquire information from the user using the keyword acquisition unit 14 and the audio signal acquisition unit 16, and to present information to the user using the display unit 18. Further, by executing the authentication program stored in the storage unit 12, the control unit 10 performs the user authentication process based on the voice acquired from the user. The authentication process will be described in detail later.
[0033]
The storage unit 12 stores and holds basic software, an authentication program, and the like executed by the control unit 10. It also temporarily or permanently stores and holds information processed by the control unit 10, such as information acquired using the keyword acquisition unit 14 or the audio signal acquisition unit 16 and information presented to the user using the display unit 18. Further, the storage unit 12 stores and holds a verification database, an extraction database, and a preliminary search database used in the authentication process. The contents of these databases will be described in detail later. Information held in the storage unit 12 can be appropriately read out by the control unit 10.
[0034]
As the storage unit 12, a semiconductor memory can be used. When it is necessary to store a voice database for a large number of users, a large-capacity auxiliary storage device such as a hard disk, an optical disk, a magneto-optical disk, or a magnetic tape may be provided.
[0035]
The keyword acquisition unit 14 acquires symbols included in keywords used for authentication processing. The keyword acquisition unit 14 can be a keyboard, for example. The user inputs the symbols representing a keyword from the keyboard. The input symbols are sent to the control unit 10 for processing. Alternatively, the keyword acquisition unit 14 may be a character input device other than a keyboard, such as a pointing device or a touch panel, with which the user selects symbols.
[0036]
The audio signal acquisition unit 16 includes a microphone, an amplifier, an analog / digital converter, and the like for acquiring audio uttered by the user. The user inputs voice using the audio signal acquisition unit 16. The voice uttered by the user is picked up by the microphone, amplified by the amplifier, converted into a digital signal by the analog / digital converter, and used for processing.
[0037]
The display unit 18 provides information necessary for processing to the user. The display unit 18 can be a display device, for example. Upon receiving an image display command from the control unit 10, the display unit 18 presents a screen prompting the user to input a keyword using the keyword acquisition unit 14 or a screen prompting the user to input voice using the audio signal acquisition unit 16. In addition, the acquired information and processing results are presented to the user. As the display unit 18, a touch-panel liquid crystal display device, an audio output device including a speaker, or the like can be used.
[0038]
As described above, the authentication device 100 according to the present embodiment can be basically configured by an information processing device in which a microcomputer is incorporated. The authentication device 100 can be installed in the vicinity of various devices that require user authentication, such as a door or a safe door, and can be used to authenticate who the user is from the voice input by the user.
[0039]
Further, as shown in FIG. 2, network interfaces 22 and 24 may be further provided, and the entire authentication apparatus may be configured by separate clients 100a and servers 100b connected via a network.
[0040]
<Authentication method>
Next, a user authentication method in the present embodiment will be described. The authentication method according to the present embodiment consists of a registration process and a verification process. The registration process acquires, for each user, a voice signal including the keyword to be registered by that user, extracts a section corresponding to the keyword from the voice signal, and registers the feature amount of the voice signal in that section in each database. The verification process actually authenticates the user using those databases. The registration process and the verification process are described in turn below.
[0041]
(registration process)
The registration process in the present embodiment is performed according to the flowchart shown in FIG. 3. Note that the registration process can be executed by the authentication apparatus of the present embodiment by programming each step of the flowchart shown in FIG. 3 and storing and holding the program in the storage unit 12.
[0042]
In step S10, using the keyword acquisition unit 14, a symbol included in the keyword is acquired from the user. The control unit 10 causes the display unit 18 to display a screen that prompts the user to input a symbol string constituting a keyword to be registered. The user uses the keyword acquisition unit 14 to input a symbol group included in the keyword for identifying himself / herself into the authentication device. The input symbol group is stored in the storage unit 12.
[0043]
Keywords can consist of numeric strings. However, the present invention is not limited to this, and can be expanded to a combination of an arbitrary number of alphabets, kana characters, numbers, and other arbitrary symbols. Alternatively, some keyword candidates may be determined on the authentication device side, and the user may be allowed to select one of these candidates.
[0044]
Hereinafter, an example in which a user with user name A (hereinafter referred to as user A) registers a 4-digit number string “1234” as a keyword will be described. User A inputs a symbol included in the keyword by pressing numeric keys “1”, “2”, “3”, and “4” on the numeric keypad.
[0045]
In step S12, a recognition model is constructed based on the acquired symbol group. For each symbol that can be used in a keyword, a speech recognition model representing the symbol is registered in advance in the storage unit 12 as a model construction database. By extracting from the model construction database the speech recognition models corresponding to the symbols acquired in step S10 and combining them, a standard recognition model representing the keyword can be constructed. At this time, the recognition model can be constructed using an existing HMM (Hidden Markov Model). This recognition model is used to extract a section corresponding to the keyword from an audio signal acquired later from the user.
[0046]
Here, it is preferable that the speech recognition model stored in the model construction database is obtained by acquiring speech signals for each symbol from a large number of users and using those speech signals.
[0047]
When using a keyboard that can input alphabetic characters, kana characters, numbers, and other symbols, the model construction database must be built so that a recognition model can be generated for keywords consisting of combinations of those symbols.
[0048]
For example, as shown in FIG. 4, the speech recognition models corresponding to “1”, “2”, “3”, and “4” are extracted from the model construction database and combined into a standard recognition model representing the keyword “1234”. Here, the speech recognition model corresponding to “1” is, for example, a model built by statistical processing so that speech signals corresponding to the symbol “1” uttered by a plurality of people, or uttered a plurality of times by a specific speaker, are recognized as “1”.
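The combination shown in FIG. 4 can be pictured with a minimal Python sketch. It assumes, purely for illustration, that each per-symbol speech recognition model is a left-to-right list of state labels. The names `MODEL_DB` and `build_standard_model` and the state labels are hypothetical and do not appear in the original disclosure; a real HMM would also carry transition and output probabilities.

```python
# Hypothetical model construction database: one left-to-right state list per symbol.
# The state labels are placeholders, not real acoustic models.
MODEL_DB = {
    "1": ["1_a", "1_b", "1_c"],
    "2": ["2_a", "2_b"],
    "3": ["3_a", "3_b", "3_c"],
    "4": ["4_a", "4_b"],
}


def build_standard_model(keyword, model_db):
    """Concatenate the per-symbol models in keyword order (step S12)."""
    states = []
    for symbol in keyword:
        if symbol not in model_db:
            raise KeyError(f"no speech recognition model for symbol {symbol!r}")
        states.extend(model_db[symbol])
    return states


standard_model = build_standard_model("1234", MODEL_DB)
```

Because the model is assembled from the typed symbols alone, it exists before the user has spoken at all, which is what allows it to drive the first section extraction.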
[0049]
In step S14, the counter i is initialized. The counter i is used to count the number of times the user has input a keyword for registration. In this step, the counter i is set to 0.
[0050]
In step S16, an audio signal uttered by the user is acquired using the audio signal acquisition unit 16. The control unit 10 causes the display unit 18 to display a screen that prompts the user to utter a keyword, and sets the audio signal acquisition unit 16 in a voice input standby state. The user inputs an audio signal from the audio signal acquisition unit 16 by uttering the keyword. The input audio signal is analog / digital converted and stored in the storage unit 12.
[0051]
User A inputs an audio signal using the audio signal acquisition unit 16 by uttering “1234”. At this time, as shown in FIG. 5, the acquired audio signal contains not only the audio signal corresponding to the keyword “1234” uttered by the user but also signals of unnecessary sections, such as external noise and silent sections.
[0052]
In step S18, feature amounts are extracted for the entire digitally converted audio signal. The feature amount of the audio signal is preferably spectral envelope information. Spectral envelope information refers to the outline of the distribution of frequency components included in an audio signal at a certain moment. It can be obtained by performing spectrum analysis on the digitized signal sequence with a predetermined frame width (for example, 32 milliseconds) at every predetermined frame period (for example, 8 milliseconds) and calculating an existing LPC (Linear Predictive Coefficient) cepstrum.
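The framing described here can be sketched in Python as follows. The 32 ms frame width and 8 ms frame period come from the text; the 8 kHz sampling rate and the function name `split_into_frames` are assumptions made only for this illustration.

```python
def split_into_frames(samples, sample_rate_hz, width_ms=32, period_ms=8):
    """Return overlapping frames of width_ms taken every period_ms."""
    width = sample_rate_hz * width_ms // 1000
    step = sample_rate_hz * period_ms // 1000
    if width <= 0 or step <= 0:
        raise ValueError("frame width and period must span at least one sample")
    frames = []
    start = 0
    while start + width <= len(samples):
        frames.append(samples[start:start + width])
        start += step
    return frames


# One second of a dummy signal at an assumed 8 kHz sampling rate.
signal = [0.0] * 8000
frames = split_into_frames(signal, 8000)
```

With these values each frame holds 256 samples and consecutive frames overlap by 192 samples, so every part of the utterance is analyzed several times.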
[0053]
However, the feature amount extracted from the audio signal is not limited to the spectral envelope information, and may be any information indicating a feature of the audio signal. For example, feature amounts such as the temporal change in the amplitude of the audio signal or the appearance period of voiced or unvoiced sections may be used. In the following processing, matching and extraction of audio signals are performed using the feature amount selected here.
[0054]
In the example shown in FIG. 6, the audio signal acquired from the user A is divided into a plurality of frames having a predetermined frame width every predetermined frame period, and a spectrum analysis is performed for each frame to extract the coefficients of the 12th-order LPC cepstrum.
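One common way to compute a 12th-order LPC cepstrum for a single frame is the autocorrelation method followed by the Levinson-Durbin recursion and the standard LPC-to-cepstrum recursion. The Python sketch below follows that route; the source does not say which procedure is used, so the function names, the choice of method, and the omission of a window function are all assumptions for illustration.

```python
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (zero outside the frame)."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Return (a, err) with predictor x[n] ~ sum_k a[k-1] * x[n-k] and residual energy err."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                      # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:], err


def lpc_to_cepstrum(a, gain, n_ceps):
    """c[0] = ln(gain); c[n] = a[n] + sum_{k=1}^{n-1} (k/n) * c[k] * a[n-k]."""
    p = len(a)
    c = [math.log(gain)] + [0.0] * n_ceps
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c


def lpc_cepstrum(frame, order=12):
    """Coefficients of order 0..order for one frame."""
    r = autocorrelation(frame, order)
    if r[0] <= 0.0:
        raise ValueError("frame has no energy")
    a, err = levinson_durbin(r, order)
    return lpc_to_cepstrum(a, math.sqrt(err), order)


# Dummy 256-sample frame (two sinusoids), only to exercise the functions.
frame = [math.sin(0.3 * t) + 0.5 * math.sin(1.7 * t + 1.0) for t in range(256)]
coefficients = lpc_cepstrum(frame)
```

Applying `lpc_cepstrum` to every frame of an utterance yields a sequence of 13-element vectors (orders 0 to 12), which is the kind of feature sequence the later matching steps operate on.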
[0055]
In step S20, it is determined whether or not the audio signal to be processed is acquired for the first time. That is, when the counter i is 0, the process proceeds to step S22, and when the counter i is other than 0, the process proceeds to step S26.
[0056]
In step S22, a section corresponding to the keyword is detected and extracted from the acquired speech signal using the recognition model constructed in step S12. Here, a word spotting method using a recognition model can be used. Using the HMM method or the like, the feature amount of the entire speech signal is compared with the feature amount of the recognition model, and a section having the highest similarity to the recognition model is extracted from the entire speech signal as a section corresponding to the keyword.
[0057]
The LPC cepstrum of the recognition model for the keyword “1234” is obtained, and as shown in FIG. 7, word spotting is executed with the LPC cepstrum of the speech signal obtained in step S18. That is, a scan for investigating the similarity to the recognition model is performed from the beginning to the end of the audio signal acquired from the user. At this time, using the HMM method, the similarity between the LPC cepstrum and the recognition model in each section of the speech signal is obtained, and the section with the highest similarity is extracted as the section corresponding to the keyword “1234”. As a result, unnecessary portions unrelated to the keyword “1234” are excluded from the audio signal acquired from the user A.
[0058]
In step S24, the feature amount of the speech signal in the section extracted as the section corresponding to the keyword is registered in the collation database, the extraction database, and the preliminary search database. The feature amount of the audio signal in the extracted section is associated with an identifier (for example, a user name) indicating the registering user, and is registered separately in the collation database as collation information and in the extraction database as extraction information.
[0059]
In the preliminary search database, a part of the feature amount of the voice signal in the extracted section is registered as preliminary search information. That is, the preliminary search information registered in the preliminary search database is configured to have a smaller amount of information than the verification information registered in the verification database and the extraction information registered in the extraction database. For example, when the verification information consists of all the feature amount data of the speech signal in the extracted section, the preliminary search information consists of only part of that data. In addition, when the verification information is a multi-template, that is, a combination of feature amounts of the sections corresponding to the keyword extracted from speech signals acquired a plurality of times, the preliminary search information may be the feature amount of a single one of those speech signals.
[0060]
When the LPC cepstrum is selected as the feature quantity, the verification information and the extraction information can be the LPC cepstrum of the audio signal in the section corresponding to the keyword. In this case, the preliminary search information registered in the preliminary search database can be obtained by thinning out the LPC cepstrum of the section corresponding to the keyword in terms of time or order.
[0061]
For example, as shown in FIGS. 8A and 8B, the LPC cepstrum of the section corresponding to the keyword “1234” is registered in association with the user name A in the collation database as collation information and in the extraction database as extraction information. Further, as shown in FIG. 8C, only the 0th- to 8th-order coefficients of the LPC cepstrum of the section corresponding to the keyword “1234” are registered in the preliminary search database in association with the user name A.
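Thinning the LPC cepstrum by order or by time can be sketched in Python as follows. The 0th- to 8th-order range comes from the text; the function names, the factor of two for time thinning, and the dummy data are assumptions for illustration only.

```python
def thin_by_order(cepstra, max_order=8):
    """Keep only the coefficients of order 0..max_order in every frame."""
    return [frame[:max_order + 1] for frame in cepstra]


def thin_by_time(cepstra, keep_every=2):
    """Keep every keep_every-th frame of the section."""
    return cepstra[::keep_every]


# Dummy keyword section: 6 frames, each holding coefficients of order 0..12.
section = [[float(order + 100 * t) for order in range(13)] for t in range(6)]
preliminary_info = thin_by_order(section)
```

Either reduction shrinks the data compared against every registered user during the preliminary search, while the full-order data is kept for the final comparison.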
[0062]
The verification information registered in the verification database is used to authenticate the user from the voice. The extraction information registered in the extraction database is used to identify a section corresponding to the keyword in the voice signal and extract that section from the voice signal. The preliminary search information registered in the preliminary search database is used to perform preliminary narrowing prior to user authentication using the verification information and extraction of a section corresponding to a keyword using the extraction information.
[0063]
In addition, whether or not the registration has been performed correctly is an important factor that affects the accuracy of the subsequent collation processing. Therefore, it is also preferable to perform a process in which the registered collation information, extraction information, and preliminary search information are presented to the user or the administrator for confirmation.
[0064]
In step S34, the value of the counter i is incremented by one. In step S36, it is determined whether the counter i is equal to or greater than the number of repetitions M. If the counter i is smaller than the number of times M, the process returns to step S16, and the audio signal is acquired again from the user. If the counter i is equal to or greater than the number of times M, the registration process is terminated. The number of repetitions M indicates the number of times the keyword registration process is repeated by the same user, and can be determined based on the accuracy required for the authentication process, the processing speed of the authentication apparatus, and the like.
[0065]
If it is determined in step S20 that the counter i is not 0 and the process proceeds to step S26, a section corresponding to the keyword is extracted from the audio signal based on the extraction information already registered in the extraction database. For extraction, word spotting based on the extraction information associated with the registering user can be used. The section having the highest similarity to the extraction information is extracted from the audio signal as the section corresponding to the keyword in the new utterance.
[0066]
Extraction information for the user A already registered in the extraction database is selected, word spotting is performed on the audio signal acquired from the user using the selected extraction information, and the section corresponding to the keyword “1234” is extracted.
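Template-based word spotting of this kind can be illustrated with subsequence DP matching, that is, dynamic time warping with a free start and end point in the input signal. The source names word spotting but does not specify the algorithm, so this Python sketch is a stand-in under that assumption. The function names and the toy one-dimensional features are hypothetical.

```python
def frame_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def spot_keyword(template, signal):
    """Return (start, end, similarity) of the signal section that best matches template.

    start and end are inclusive frame indices into signal. The similarity is the
    reciprocal of the accumulated distance.
    """
    n, m = len(template), len(signal)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    start = [[0] * m for _ in range(n)]
    for j in range(m):                       # a match may begin at any frame
        cost[0][j] = frame_distance(template[0], signal[j])
        start[0][j] = j
    for i in range(1, n):
        for j in range(m):
            best_cost, best_start = cost[i - 1][j], start[i - 1][j]
            if j > 0:
                candidates = ((cost[i - 1][j - 1], start[i - 1][j - 1]),
                              (cost[i][j - 1], start[i][j - 1]))
                for c, s in candidates:
                    if c < best_cost:
                        best_cost, best_start = c, s
            cost[i][j] = best_cost + frame_distance(template[i], signal[j])
            start[i][j] = best_start
    end = min(range(m), key=lambda j: cost[n - 1][j])   # a match may end at any frame
    distance = cost[n - 1][end]
    similarity = inf if distance == 0 else 1.0 / distance
    return start[n - 1][end], end, similarity


# Toy data: the registered template and a new utterance with noise around the keyword.
template = [[1.0], [2.0], [3.0]]
utterance = [[0.0], [0.1], [1.1], [2.0], [2.9], [0.2]]
first, last, score = spot_keyword(template, utterance)
```

In this toy case frames 2 to 4 are returned, so the leading and trailing noise frames are excluded from the extracted section.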
[0067]
In step S28, the collation information already registered in the collation database is compared with the feature amount of the voice signal in the section extracted as the section corresponding to the keyword. For comparison, existing DP matching or the like can be used. The comparison result is calculated as a similarity based on the distance value between the two pieces of information. In the following, the reciprocal of the distance value is used as the similarity. Therefore, the closer the feature amount of the audio signal in the extracted section is to the collation information of the user performing the registration, the larger the similarity value. The calculated similarity is stored in the storage unit 12 in association with the user.
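A minimal Python sketch of DP matching with the similarity defined as the reciprocal of the distance is shown below. The Euclidean local distance, the three-way step pattern, the lack of path-length normalization, and the function names are assumptions for illustration; the source only states that existing DP matching can be used.

```python
def frame_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def dp_matching_distance(seq_a, seq_b):
    """Accumulated distance along the best time alignment of two non-empty sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = frame_distance(seq_a[i - 1], seq_b[j - 1])
            d[i][j] = local + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def similarity(seq_a, seq_b):
    """Reciprocal of the DP matching distance (infinite for identical alignments)."""
    distance = dp_matching_distance(seq_a, seq_b)
    return float("inf") if distance == 0 else 1.0 / distance


registered = [[0.0], [1.0], [2.0]]
close_utterance = [[0.0], [1.0], [3.0]]
far_utterance = [[0.0], [1.0], [5.0]]
```

Here `similarity(registered, close_utterance)` is larger than `similarity(registered, far_utterance)`, which is the ordering the threshold test in step S30 relies on.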
[0068]
The collation information already registered in the collation database for the user A is selected, and the similarity between the LPC cepstrum of the section extracted as the section corresponding to the keyword “1234” and the LPC cepstrum serving as the selected collation information is calculated.
[0069]
In step S30, the calculated similarity is compared with a predetermined threshold value. If the similarity is greater than or equal to the threshold, the process proceeds to step S32. If the similarity is less than the threshold, the process returns to step S16 to acquire the audio signal again from the user.
[0070]
In step S32, the registration contents of the collation database and the preliminary search database are updated with the feature amount of the voice signal in the section corresponding to the keyword extracted in step S26. That is, the collation information already registered in the collation database in association with the user performing the registration process is updated with the feature amount of the audio signal in the section newly extracted in step S26. In addition, the preliminary search information already registered in the preliminary search database in association with that user is replaced with a part of the feature amount of the audio signal in the newly extracted section.
[0071]
For example, as shown in FIG. 9, the registration contents of the verification database and the preliminary search database are updated. Here, the collation information and the preliminary search information already registered in association with the user A are replaced with the LPC cepstrum of the section corresponding to the keyword newly extracted in step S26 and with the 0th- to 8th-order coefficients of that LPC cepstrum, respectively.
[0072]
Further, if the collation information is composed of multi-templates, the template having the lowest similarity among the templates already registered may be replaced with the feature amount of the audio signal in the newly extracted section. In this case, it is preferable that the preliminary search information is a template having the highest similarity among the multi-templates of the verification information.
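The multi-template update can be sketched in Python as below. It assumes that the similarity of each registered template to the newly extracted section has already been computed, and that "highest similarity" refers to the templates registered before the replacement. That reading, the fallback when only one template exists, and the function name are assumptions, not statements from the source.

```python
def update_multi_template(templates, similarities, new_features):
    """Replace the least similar template and pick the source of the preliminary search info.

    similarities[k] is the similarity between templates[k] and new_features.
    Returns (updated_templates, preliminary_source).
    """
    if not templates or len(templates) != len(similarities):
        raise ValueError("need exactly one similarity per registered template")
    indices = range(len(templates))
    worst = min(indices, key=lambda k: similarities[k])
    best = max(indices, key=lambda k: similarities[k])
    updated = list(templates)
    updated[worst] = new_features
    # Assumed fallback: if the best template is the one just replaced, use the new section.
    preliminary_source = new_features if best == worst else templates[best]
    return updated, preliminary_source


updated, preliminary_source = update_multi_template(
    ["template_0", "template_1", "template_2"], [0.4, 0.1, 0.9], "new_section")
```

In this example `template_1` is replaced by the new section and `template_2` supplies the preliminary search information.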
[0073]
It is also preferable to update with the average value of the collation information and preliminary search information already associated with the user and the feature value of the voice signal newly extracted in step S26.
[0074]
As described above, by acquiring the voice signal including the keyword from the same user a plurality of times and registering the verification information and the preliminary search information based on the plurality of voice signals, the accuracy of the user authentication process can be further improved.
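The control flow of steps S14 to S36 can be summarized in a short Python sketch. The three callables stand for the model-based extraction (S22), the template-based extraction (S26), and the similarity calculation (S28); they are hypothetical placeholders, and the preliminary search database is omitted for brevity.

```python
def run_registration(utterances, repetitions, threshold,
                     extract_with_model, extract_with_template, similarity):
    """Return (collation_info, extraction_info) after processing the given utterances.

    utterances yields one feature sequence per utterance (the output of step S18).
    If the utterances run out before `repetitions` are accepted, the current state is returned.
    """
    collation_info = None
    extraction_info = None
    i = 0                                                    # S14
    for features in utterances:                              # S16, S18
        if i == 0:                                           # S20
            section = extract_with_model(features)           # S22
            collation_info = section                         # S24
            extraction_info = section                        # registered once, never updated
        else:
            section = extract_with_template(features, extraction_info)   # S26
            if similarity(section, collation_info) < threshold:          # S28, S30
                continue                                     # ask for another utterance
            collation_info = section                         # S32
        i += 1                                               # S34
        if i >= repetitions:                                 # S36
            break
    return collation_info, extraction_info


# Stub callables and string "features", used only to exercise the control flow.
result = run_registration(
    ["u1", "bad", "u2", "u3"], repetitions=3, threshold=0.5,
    extract_with_model=lambda features: features,
    extract_with_template=lambda features, template: features,
    similarity=lambda section, reference: 0.0 if section == "bad" else 1.0,
)
```

The rejected utterance does not advance the counter, and the extraction information stays fixed at the first accepted section while the collation information follows the latest one.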
[0075]
As described above, according to the present embodiment, a symbol group included in a keyword to be registered by a user is acquired using an input device such as a keyboard, and a recognition model is constructed based on the symbol group. At the time of the first registration, the section corresponding to the keyword is extracted using this recognition model, which is constructed without relying on the user's utterance. This prevents sections containing unnecessary sounds or noise from being attached before or after the section of the verification information to be registered, and prevents low-amplitude consonants at the beginning or end of a word from being dropped. Further, even when a pause (silent section) is included in the verification information, part of the verification information is prevented from being lost.
[0076]
That is, the section corresponding to the keyword can be appropriately cut out from the voice signal acquired from the user, and the matching information, extraction information, and preliminary search information can be registered appropriately. As a result, the accuracy of user authentication can be improved in the following verification process.
[0077]
(Verification process)
Next, collation processing in the present embodiment will be described. The collation process is performed according to the flowchart shown in FIG. 10. By programming each step of the flowchart shown in FIG. 10 and storing and holding the program in the storage unit 12, the verification process can be realized by the authentication device.
[0078]
In step S40, a keyword is acquired as an audio signal from a user who intends to authenticate himself / herself. The control unit 10 causes the display unit 18 to display a screen that prompts the user to utter a keyword in order to authenticate the user, and sets the audio signal acquisition unit 16 to a voice input standby state. A user who wants to be authenticated utters the keyword that he / she registered in the registration process and inputs it from the audio signal acquisition unit 16. The input audio signal is analog / digital converted and stored in the storage unit 12 as authentication information.
[0079]
When the user A tries to authenticate himself / herself, the user A utters his / her keyword “1234” toward the microphone and inputs a voice signal as authentication information to the authentication apparatus. At this time, the acquired authentication information includes the keyword “1234” uttered by the user A, as well as an unnecessary sound, an external noise, and an audio signal of an unnecessary section such as a silent section.
[0080]
In step S42, a feature amount is extracted from the digitally converted authentication information. The feature amount of the authentication information is the same type as the feature amount registered in the collation database, the extraction database, and the preliminary search database in the registration process. For example, when the coefficient value of the LPC cepstrum is registered in each database, the LPC cepstrum is obtained as a feature amount from the authentication information. The details of the process here are the same as in step S18, and thus the description thereof is omitted.
[0081]
Here, the authentication information acquired from the user A is divided into a plurality of frames having a predetermined frame width every predetermined frame period, and a spectrum analysis is performed for each frame to extract the coefficients of the 12th-order LPC cepstrum.
[0082]
In step S44, a preliminary search for the authentication information is performed using the preliminary search information registered in the preliminary search database. Using the existing word spotting method or the like, the preliminary search information included in the preliminary search database is sequentially compared with the feature amount of the authentication information acquired from the user. A predetermined number C of users are preliminarily selected in descending order of the similarity calculated using the DP matching method or the like.
[0083]
For example, when the 0th- to 8th-order coefficients of the LPC cepstrum of each user's keyword are registered as preliminary search information in the preliminary search database, the 0th- to 8th-order coefficients of the LPC cepstrum of the authentication information obtained in step S42 are matched against each item of preliminary search information in the preliminary search database. A predetermined number of user names associated with the preliminary search information are then selected in descending order of similarity to the authentication information. Identification numbers are assigned to the selected users in order from 1.
[0084]
Assuming that the number of people C to be extracted in the preliminary search is 3, three preliminary search information is selected in descending order of similarity calculated using the DP matching method or the like, and is associated with the selected preliminary search information. Users are preliminarily selected. In the following description, it is assumed that users A, B, and C are selected, and identification numbers 1, 2, and 3 are assigned to the users A, B, and C, respectively.
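The preliminary selection can be sketched in Python as follows, assuming the similarity of each user's preliminary search information to the authentication information has already been calculated. The function name and the similarity values are made up for illustration.

```python
def preliminary_select(similarities, count):
    """Return {identification_number: user} for the `count` most similar users."""
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return {number: user for number, user in enumerate(ranked[:count], start=1)}


# Hypothetical similarities between the authentication information and each
# user's preliminary search information.
scores = {"A": 0.9, "B": 0.7, "C": 0.6, "D": 0.2, "E": 0.1}
selected = preliminary_select(scores, 3)
```

With C set to 3, users A, B, and C receive identification numbers 1, 2, and 3, and users D and E are excluded from the detailed comparison.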
[0085]
In this way, by performing narrowing down of users by using preliminary search information having a smaller amount of information than collation information and extraction information, it is possible to reduce the burden of subsequent processing.
[0086]
In step S46, the counter j is initialized. The counter j is used to count the number of users that have been verified, and the counter j is set to 1 in this step.
[0087]
In step S48, the section corresponding to the keyword is extracted from the authentication information using the extraction information registered in the extraction database. Using a word spotting method or the like, a section having the highest similarity with the extraction information associated with the user that is preliminarily selected in step S44 and specified by the value of the counter j is extracted from the authentication information. The extracted feature quantity of the audio signal in the section is stored in the storage unit 12 in association with the user specified by the counter j.
[0088]
If the users A, B, and C are narrowed down in step S44 and the identification numbers 1, 2, and 3 are assigned to them, then when the counter j is 1, the coefficient values of the LPC cepstrum corresponding to the user A are selected from the extraction database, and the section having the highest similarity with those coefficient values is extracted from the authentication information. If the counter j is 2, extraction is performed using the coefficient values of the LPC cepstrum corresponding to the user B, and if the counter j is 3, extraction is performed using the coefficient values of the LPC cepstrum corresponding to the user C.
[0089]
In step S50, the similarity between the speech signal in the section extracted in step S48 and the collation information registered in the collation database is calculated. Using the DP matching method or the like, the feature amount of the voice signal in the extracted section is compared with the verification information for the user specified by the counter j, and the similarity between the two is calculated. The calculated similarity is stored in the storage unit 12 in association with the user.
[0090]
When the users A, B, and C are narrowed down in step S44 and the identification numbers 1, 2, and 3 are assigned to them, if the counter j is 1, the coefficient values of the LPC cepstrum corresponding to the user A are selected from the collation database, and their similarity to the coefficient values of the LPC cepstrum in the section extracted in step S48 is obtained. If the counter j is 2, the similarity is obtained using the coefficient values of the LPC cepstrum corresponding to the user B, and if the counter j is 3, using the coefficient values of the LPC cepstrum corresponding to the user C.
[0091]
In step S52, the value of the counter j is incremented by 1. In step S54, it is determined whether or not the value of the counter j exceeds the number C of users selected in the preliminary search. If the counter j is less than or equal to the number of users C, the process returns to step S48, and the process is repeated for the user to whom the next counter j value is assigned. If the counter j is greater than the number of users C, the process proceeds to step S56.
[0092]
Through the processing so far, the similarity between the authentication information and the verification information of each user is obtained for each of the users A, B, and C.
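The loop over the preselected users (steps S46 to S54) can be sketched in Python as below. The loop runs until every preselected user has been processed, as the preceding sentence indicates. The callables `spot` and `similarity`, the database dictionaries, and the values are hypothetical placeholders for the extraction in step S48 and the comparison in step S50.

```python
def verify_candidates(auth_features, candidates, extraction_db, collation_db,
                      spot, similarity):
    """Return {user: similarity} for every preselected user.

    candidates maps identification numbers 1..C to users (output of step S44).
    """
    results = {}
    j = 1                                                        # S46
    while j <= len(candidates):                                  # S54
        user = candidates[j]
        section = spot(auth_features, extraction_db[user])       # S48
        results[user] = similarity(section, collation_db[user])  # S50
        j += 1                                                   # S52
    return results


# Stub data and callables, used only to exercise the loop.
candidates = {1: "A", 2: "B", 3: "C"}
extraction_db = {"A": "extract_A", "B": "extract_B", "C": "extract_C"}
collation_db = {"A": "collate_A", "B": "collate_B", "C": "collate_C"}
fake_scores = {"collate_A": 0.9, "collate_B": 0.4, "collate_C": 0.2}
results = verify_candidates(
    "utterance", candidates, extraction_db, collation_db,
    spot=lambda features, template: (features, template),
    similarity=lambda section, reference: fake_scores[reference],
)
```

Note that the keyword section is re-extracted for each candidate with that candidate's own extraction information, so each similarity is computed on a section cut out specifically for that user.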
[0093]
In step S56, the similarities stored in the storage unit 12 are read, and among the C similarities the one having the largest value, that is, the highest similarity, is selected. That value is compared with a preset threshold. If the similarity is greater than the threshold, the process proceeds to step S58. If the similarity is equal to or less than the threshold, the process proceeds to step S60.
[0094]
In step S58, the user who is performing the authentication process is authenticated as the user corresponding to the similarity having the largest value. Along with the authentication, a screen indicating that the user has been authenticated may be displayed on the display unit 18 or a door lock may be unlocked.
[0095]
When user A attempts authentication, among users A, B, and C the similarity to user A's verification information is the highest, and that similarity exceeds the threshold. On the other hand, when a user who has not performed the keyword registration processing attempts authentication, the similarity to none of the preselected users exceeds the threshold, and the user is not authenticated.
[0096]
In step S60, processing for the case where the user is not authenticated is performed. For example, a screen indicating that the user was not authenticated may be displayed on the display unit 18.
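The decision in steps S56 to S60 can be sketched as below. The function name `decide` and the use of a dictionary of scores are assumptions for illustration; only the rule itself (take the highest similarity and accept it only if it is strictly greater than the threshold) comes from the description.

```python
def decide(scores, threshold):
    """Return the authenticated user, or None when no similarity exceeds the threshold."""
    if not scores:
        return None
    best_user = max(scores, key=scores.get)
    # Strictly greater than the threshold corresponds to step S58;
    # equal to or less than the threshold corresponds to step S60.
    return best_user if scores[best_user] > threshold else None

accepted = decide({"A": 0.9, "B": 0.4, "C": 0.1}, threshold=0.8)
rejected = decide({"A": 0.5, "B": 0.4, "C": 0.1}, threshold=0.8)
```

An unregistered speaker would typically produce low similarities for all candidates and therefore fall into the `None` branch.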
[0097]
In step S62, once the user has been authenticated, the verification information and the preliminary search information for the authenticated user are updated. That is, the registered contents of the collation database and the preliminary search database are updated with the feature amount of the audio signal in the section corresponding to the keyword, which is held in the storage unit 12 in association with the authenticated user in step S48.
[0098]
For example, the verification information that is associated with the authenticated user and already registered in the collation database is replaced with the LPC cepstrum of the audio signal held in association with the authenticated user in step S58. If the verification information consists of multiple templates, the template with the lowest similarity among all templates is replaced with the LPC cepstrum of the extracted audio signal.
[0099]
Further, the preliminary search information that is associated with the authenticated user and already registered in the preliminary search database is replaced based on the LPC cepstrum of the audio signal held in association with the authenticated user. For example, when the verification information consists of all coefficient values of the LPC cepstrum, the preliminary search information is replaced with a part of those coefficient values. If the verification information consists of multiple templates of LPC cepstrum coefficient values, the preliminary search information may be replaced with the template having the highest similarity among them.
[0100]
It is also preferable to update the verification information and the preliminary search information with average values: the average of the verification information already associated with the authenticated user and the LPC cepstrum of the audio signal associated with the authenticated user, and the average of the preliminary search information already associated with that user and a part of the coefficient values of that LPC cepstrum.
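The update strategies described for step S62 can be sketched as follows. All function names are hypothetical, and the LPC cepstrum is simplified to a single fixed-length list of coefficients. The description does not say which part of the coefficients forms the preliminary search information, so taking the leading coefficients is purely an assumption of this example.

```python
def replace_worst_template(templates, similarities, new_features):
    """Multi-template case: replace the template with the lowest similarity."""
    worst = min(range(len(templates)), key=lambda i: similarities[i])
    updated = [list(t) for t in templates]
    updated[worst] = list(new_features)
    return updated

def average_update(old_features, new_features):
    """Averaging case: element-wise mean of registered and newly extracted coefficients."""
    return [(o + n) / 2.0 for o, n in zip(old_features, new_features)]

def preliminary_from(features, n_coeffs):
    """Preliminary search information as the first n_coeffs coefficients (assumed choice)."""
    return list(features[:n_coeffs])

templates = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
new_templates = replace_worst_template(templates, [0.9, 0.2, 0.5], [9.0, 9.0])
averaged = average_update([1.0, 3.0], [3.0, 5.0])
prelim = preliminary_from([1.0, 2.0, 3.0, 4.0], 2)
```

Replacing a template tracks recent utterances quickly, whereas averaging changes the registered values more gradually; which is preferable depends on how fast the user's voice drifts.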
[0101]
In this way, by updating the authenticated user's verification information and preliminary search information, it is possible to suppress the decline in authentication accuracy caused by changes over time in, for example, the user's physical condition or manner of utterance.
[0102]
In addition, since the extraction information registered in the extraction database is not updated, the section corresponding to the keyword is always extracted from the authentication information under the same conditions as at the time of keyword registration. This reduces the influence of errors accumulating in the verification information and the preliminary search information as authentication is repeated. In other words, by separating the extraction of the keyword section, which uses extraction information that is never updated, from the final authentication of the user, which uses verification information updated at each authentication, it is possible to realize highly accurate authentication that follows changes in the user's utterance over time while reducing the accumulation of errors caused by updating the verification information.
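One way to picture this separation is a per-user record in which the extraction information is immutable while the other two entries are rewritten after each successful authentication. The class below is a hypothetical sketch: the name `UserRecord`, its fields, and the choice of the leading coefficients as preliminary search information are assumptions, not structures defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class UserRecord:
    # Fixed at keyword registration; stored as a tuple so it cannot be edited in place.
    extraction: Tuple[float, ...]
    # Rewritten after every successful authentication.
    verification: List[float]
    preliminary: List[float] = field(default_factory=list)

    def update_after_authentication(self, features, n_prelim=2):
        """Refresh verification and preliminary data; extraction is left untouched."""
        self.verification = list(features)
        self.preliminary = list(features[:n_prelim])

rec = UserRecord(extraction=(1.0, 2.0, 3.0), verification=[1.0, 2.0, 3.0])
rec.update_after_authentication([1.1, 2.1, 3.1])
```

Because the section is always cut out with the same `extraction` data, an imperfect update of `verification` does not feed back into the next extraction.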
[0103]
In the present embodiment, preliminary narrowing down of users is performed using preliminary search information, but the preliminary search processing may be omitted.
[0104]
<Modification 1>
Next, a modification of the embodiment of the present invention will be described. This modification can be executed using the authentication device in the above embodiment. This modification is also roughly divided into a registration process and a collation process. Since the registration process is the same as the above process, only the collation process will be described below.
[0105]
The verification process in this modification is performed according to the flowchart shown in FIG. 11. By implementing each step of the flowchart shown in FIG. 11 as a program and storing it in the storage unit 12, the verification process can be realized by the authentication device. Steps that perform the same processing as in the verification process of the above embodiment are given the same reference numerals, and their description is omitted.
[0106]
In step S64, the feature amount of the audio signal extracted in step S42 is compared with the verification information registered in the collation database. Using a word spotting method or the like, the section having the highest similarity to the verification information associated with the user who was selected in step S44 and is specified by the counter j is extracted from the authentication information. Then, using the DP matching method or the like, the feature amount of the audio signal in the extracted section is compared with the verification information for the user specified by the counter j, and the similarity between them is calculated. The closer the feature amount of the extracted section is to that user's verification information, the larger the similarity value. The calculated similarity is stored in the storage unit 12 in association with the user.
[0107]
For example, when users A, B, and C are narrowed down in step S44 and assigned the identification numbers 1, 2, and 3, then if the counter j is 1, the verification information corresponding to user A is selected from the collation database, the section having the highest similarity to that verification information is extracted from the authentication information, and the similarity between the feature amount of that section and the verification information is calculated. If the counter j is 2, the similarity is calculated using the verification information corresponding to user B, and if the counter j is 3, using the verification information corresponding to user C.
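Step S64 combines a search for the best-matching section with a DP matching score. The sketch below illustrates the idea with a textbook dynamic time warping (DTW) distance and a brute-force sliding window. It is a simplification under stated assumptions: the function names are hypothetical, candidate sections are restricted to the template's length (practical word spotting such as continuous DP allows variable-length sections), and the similarity is taken to be the negative DTW distance.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(seq_a, seq_b):
    """Classic DP matching (dynamic time warping) distance between frame sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def spot_section(signal, template):
    """Return (start, end, similarity) of the best-matching window, or None
    when the signal is shorter than the template."""
    width = len(template)
    best = None
    for start in range(len(signal) - width + 1):
        sim = -dtw_distance(signal[start:start + width], template)
        if best is None or sim > best[2]:
            best = (start, start + width, sim)
    return best

# Toy one-dimensional frames: the template occurs exactly at frames 2 to 4.
signal = [[0.0], [0.1], [1.0], [2.0], [1.0], [0.0]]
template = [[1.0], [2.0], [1.0]]
found = spot_section(signal, template)
```

Running `spot_section` once per candidate user, each time with that user's verification information as the template, yields the per-user similarities that steps S56 to S60 then compare with the threshold.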
[0108]
That is, in this modification, the degree of similarity between the authentication information and the verification information for each user is calculated without using the extraction information registered in the extraction database. And based on those similarities, a user authentication process is performed in steps S56 to S60.
[0109]
In step S66, the section corresponding to the keyword is extracted from the audio signal using the extraction information associated with the user corresponding to the maximum similarity, that is, the authenticated user. The extraction information associated with the authenticated user is selected from the extraction database, the feature amount of the audio signal extracted in step S42 is compared with this extraction information using a word spotting method or the like, and the section having the highest similarity to the extraction information is cut out as the section corresponding to the keyword.
[0110]
In step S62, the verification information and preliminary search information associated with the authenticated user are updated with the feature amount of the section corresponding to the keyword.
[0111]
According to this modification, since the section corresponding to the keyword is extracted only for the authenticated user, the processing load of the user authentication processing in steps S46 to S54 can be reduced. As a result, it is possible to shorten the waiting time until the authentication result is obtained after the user inputs voice.
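The ordering of the modification (score every candidate against its verification information, then use the extraction information only for the authenticated user) can be sketched as follows. Everything here is hypothetical scaffolding: `authenticate_modification`, the caller-supplied `spot` helper, and `toy_spot` with its sum-of-absolute-differences score are stand-ins chosen to keep the example short, and the update of the preliminary search information is omitted.

```python
def toy_spot(signal, template):
    """Stand-in for word spotting: best fixed-width window and its similarity."""
    width = len(template)
    best_section, best_sim = None, float("-inf")
    for s in range(len(signal) - width + 1):
        window = signal[s:s + width]
        sim = -sum(abs(x - y) for x, y in zip(window, template))
        if sim > best_sim:
            best_section, best_sim = window, sim
    return best_section, best_sim

def authenticate_modification(signal, candidates, collation_db, extraction_db,
                              spot, threshold):
    """Return the authenticated user or None; updates collation_db on success."""
    # Step S64: score each candidate directly against its verification information.
    scores = {u: spot(signal, collation_db[u])[1] for u in candidates}
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if scores[best] <= threshold:          # steps S56 and S60
        return None
    # Step S66: extract the keyword section for the authenticated user only,
    # using the extraction information that is never updated.
    section, _ = spot(signal, extraction_db[best])
    collation_db[best] = list(section)     # step S62
    return best

signal = [0.0, 1.0, 2.1, 1.0, 0.0]
collation_db = {"A": [1.0, 2.0, 1.0], "B": [5.0, 5.0, 5.0]}
extraction_db = {"A": [1.0, 2.0, 1.0], "B": [5.0, 5.0, 5.0]}
result = authenticate_modification(signal, ["A", "B"], collation_db,
                                   extraction_db, toy_spot, threshold=-0.5)
```

With N candidates, the extraction database is consulted once rather than N times, which is the source of the reduced processing load described above.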
[0112]
[Effects of the Invention]
According to the present invention, the keyword used for authentication can be registered accurately, and speaker authentication that is robust to changes in a user's utterance over time can be realized.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of an authentication apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing another configuration of the authentication device according to the embodiment of the present invention.
FIG. 3 is a diagram showing a flowchart of speaker authentication registration processing in the embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of construction of a recognition model.
FIG. 5 is a diagram illustrating an example of an audio signal acquired from a user.
FIG. 6 is a diagram for explaining an example of feature amount extraction from an audio signal.
FIG. 7 is a diagram illustrating an example in which a word spotting method is applied to an audio signal acquired from a user.
FIG. 8 is a diagram illustrating an example of registered contents of a collation database, an extraction database, and a preliminary search database.
FIG. 9 is a diagram for explaining an example of updating a collation database and a preliminary search database.
FIG. 10 is a diagram showing a flowchart of verification processing for speaker authentication in the embodiment of the present invention.
FIG. 11 is a diagram illustrating a flowchart of verification processing for speaker authentication according to a modification of the embodiment of the present invention.
[Explanation of symbols]
10 control unit, 12 storage unit, 14 keyword acquisition unit, 16 audio signal acquisition unit, 18 display unit, 20 bus, 22, 24 network interface, 100 authentication device, 100a client, 100b server.

Claims

A speaker authentication device comprising: collation database storage means for holding, as verification information, a feature amount of a voice signal in which a user uttered a keyword, in association with that user; authentication voice signal acquisition means for acquiring a voice signal uttered by a user attempting authentication; and user specifying means for specifying the user attempting authentication by comparing the feature amount of the voice signal acquired by the authentication voice signal acquisition means with the verification information, the speaker authentication device further comprising:
registered voice signal acquisition means for acquiring a voice signal from a user who registers the keyword;
registered keyword section extraction means for extracting, from the voice signal acquired by the registered voice signal acquisition means, a feature amount of the section having the highest similarity to a standard recognition model representing the keyword;
extraction database storage means for holding the feature amount of the section extracted by the registered keyword section extraction means, as extraction information that is not updated, in association with the user who registers the keyword;
authentication keyword section extraction means for extracting, from the voice signal acquired by the authentication voice signal acquisition means, a feature amount of the section having the highest similarity to the extraction information; and
database updating means for updating the verification information associated with the user specified by the user specifying means, based on the feature amount extracted by the authentication keyword section extraction means.

The speaker authentication device according to claim 1,
wherein the user specifying means specifies the user attempting authentication by comparing the feature amount extracted by the authentication keyword section extraction means with the verification information.

The speaker authentication device according to claim 1 or 2, further comprising:
keyword acquisition means for acquiring a symbol string representing a keyword; and
recognition model construction means for acquiring a speech recognition model representing each symbol acquired by the keyword acquisition means and combining those speech recognition models to construct the standard recognition model.

The speaker authentication device according to any one of claims 1 to 3,
further comprising preliminary search means for narrowing down the verification information held in the collation database storage means by comparing the feature amount of the voice signal acquired by the authentication voice signal acquisition means with a part of the verification information,
wherein the user specifying means specifies the user attempting authentication by using, for the comparison, the verification information narrowed down by the preliminary search means.

A speaker authentication program causing a computer, which comprises a collation database that holds, as verification information, a feature amount of a voice signal in which a user uttered a keyword, in association with that user, and an extraction database that holds, as extraction information that is not updated, a feature amount of a voice signal used when extracting a section corresponding to the keyword from a voice signal acquired from a user, to execute processing including:
a registered voice signal acquisition step of acquiring a voice signal from a user who registers the keyword;
a registered keyword section extraction step of extracting, from the voice signal acquired in the registered voice signal acquisition step, a feature amount of the section having the highest similarity to a standard recognition model representing the keyword;
an extraction information registration step of registering the feature amount of the section extracted in the registered keyword section extraction step in the extraction database as extraction information that is not updated;
an authentication voice signal acquisition step of acquiring a voice signal uttered by a user attempting authentication;
an authentication keyword section extraction step of extracting, as authentication information, a feature amount of the section having the highest similarity to the extraction information from the voice signal acquired in the authentication voice signal acquisition step;
a user specifying step of specifying the user attempting authentication by comparing the authentication information with the verification information; and
a database update step of updating the verification information associated with the user specified in the user specifying step, based on the authentication information.