JP2021135313A

JP2021135313A - Collation device, collation method, and collation program

Info

Publication number: JP2021135313A
Application number: JP2020028867A
Authority: JP
Inventors: 直弘俵; Naohiro Tawara; 厚徳小川; Atsunori Ogawa; 具治岩田; Tomoharu Iwata; マークデルクロア; Marc Delcroix; 哲司小川; Tetsuji Ogawa
Original assignee: Waseda University; Nippon Telegraph and Telephone Corp
Current assignee: Waseda University; Nippon Telegraph and Telephone Corp
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2021-09-13
Anticipated expiration: 2040-02-21
Also published as: JP7388239B2

Abstract

PROBLEM TO BE SOLVED: To improve the accuracy of speaker verification for short utterances. SOLUTION: A collation device 10 trains a speaker recognition model that includes a first NN 142, which converts a voice signal into a feature amount for each frame constituting the voice signal; a second NN 143, which recognizes the speaker of each frame based on the converted feature amount; and a third NN 144, which recognizes the phoneme of each frame based on the converted feature amount. The training is performed so that the output of the second NN 143 approaches the correct answer data while the output of the third NN 144 does not approach the correct answer data. A collation unit 146 then performs speaker collation using the feature amount output from the first NN 142, or from an intermediate layer of the second NN 143, in the trained speaker recognition model. SELECTED DRAWING: Figure 1

Description

本発明は、照合装置、照合方法、および、照合プログラムに関する。 The present invention relates to a collation device, a collation method, and a collation program.

発話内容が異なる２つの音声発話が、同じ話者による音声か異なる話者による音声かを識別する話者照合のタスクは、音声認識を活用した自動議事録作成システムや、音声による認証等への応用が期待される。 Speaker verification, the task of determining whether two voice utterances with different content come from the same speaker or from different speakers, is expected to be applied to automatic minutes creation systems that utilize voice recognition, to voice-based authentication, and the like.

話者照合では、まず、入力音声および予め登録された照合用音声それぞれの特徴量（話者ベクトル）を抽出し、抽出した特徴量の類似度に基づいて、２つの音声発話が同じ話者による音声か、異なる話者による音声かを判定する。 In speaker verification, the feature amounts (speaker vectors) of the input voice and of a pre-registered reference voice are first extracted, and it is then determined, based on the similarity of the extracted feature amounts, whether the two utterances come from the same speaker or from different speakers.

上記の話者照合と同様に、話者ベクトルを利用するタスクとして、話者認識が知られている。話者認識は、学習用に与えられた複数話者の音声から話者ベクトルを抽出し、その話者ベクトルを分類するモデルを学習させておき、学習後のモデルを用いて、入力された音声信号がどの話者によるものかを認識する。 Speaker recognition is known as another task that, like the speaker verification described above, uses speaker vectors. In speaker recognition, speaker vectors are extracted from the voices of a plurality of speakers given for learning, a model that classifies those speaker vectors is trained, and the trained model is then used to recognize which speaker an input voice signal comes from.

近年、ニューラルネットワーク（以下、適宜ＮＮと略す）を用いた話者認識技術として、セグメント単位（発話単位）の話者認識の手法（非特許文献１参照）が知られている。上記の手法は、音声信号を話者ベクトルに変換するＮＮに、話者認識のＮＮと音素認識を行うＮＮとを連結し、話者認識ＮＮの出力と音素認識ＮＮの出力との両方が教師データに近づくように各ＮＮのパラメータを同時に学習させる手法である。この手法によれば、話者認識性能が従来よりも高くなることが開示されている。 In recent years, a method of speaker recognition in segment units (utterance units) has been known as a speaker recognition technique using a neural network (hereinafter abbreviated as NN where appropriate) (see Non-Patent Document 1). In this method, an NN for speaker recognition and an NN for phoneme recognition are connected to an NN that converts a voice signal into a speaker vector, and the parameters of each NN are learned simultaneously so that both the output of the speaker recognition NN and the output of the phoneme recognition NN approach the teacher data. It is disclosed that this method yields higher speaker recognition performance than before.

Liu et al., “Speaker Embedding Extraction with Phonetic Information”, arXiv preprint arXiv:1804.04862, 2018.

ここで、例えば、スマートスピーカを経由した音声による機器操作等においては、非常に短時間の発話から発話者の照合を行うことが要求される場合がある。非特許文献１等に記載の手法は、発話単位で話者ベクトルを抽出し、話者認識を行うことを前提とした手法であるので、充分に長い時間の発話については話者認識の性能が高まる一方で、短時間の発話については話者認識の性能が低下するという問題があった。そこで、本発明は、前記した問題を解決し、短時間の発話について話者照合の精度を向上させることを課題とする。 Here, for example, in device operation by voice via a smart speaker, it may be required to collate the speaker from a very short utterance. Since the method described in Non-Patent Document 1 and the like is a method on the premise that the speaker vector is extracted for each utterance and the speaker is recognized, the speaker recognition performance is improved for a sufficiently long utterance. On the other hand, there is a problem that the speaker recognition performance deteriorates for short-time utterances. Therefore, it is an object of the present invention to solve the above-mentioned problems and improve the accuracy of speaker collation for short-time utterances.

前記した課題を解決するため、本発明は、音声信号をフレームごとの特徴量に変換する第１のニューラルネットワークと、変換された前記フレームの特徴量に基づき当該フレームの話者の認識結果を出力する第２のニューラルネットワークとを備えた第１のモデルと、前記第１のモデルに第１の音声信号と第２の音声信号とを入力する入力部と、前記第１のモデルにおける、前記第２のニューラルネットワークの中間層または前記第１のニューラルネットワークから出力される、前記第１の音声信号および前記第２の音声信号それぞれの特徴量に基づき、前記第１の音声信号の話者が、前記第２の音声信号の話者と同じか否かを示す照合結果を出力する照合部とを備え、前記第１のモデルは、前記第１のニューラルネットワークと、前記第２のニューラルネットワークと、前記第１のニューラルネットワークで変換された前記フレームの特徴量に基づき当該フレームの音素の認識結果を出力する第３のニューラルネットワークとを備える第２のモデルについて、学習用の音声信号と、前記学習用の音声信号の話者および当該音声信号に含まれる音素の正解データとを対応付けた教師データに基づき前記第２のモデルの学習を行う際、前記第２のニューラルネットワークによる出力結果は前記正解データに近づき、前記第３のニューラルネットワークによる出力結果は前記正解データに近づかないように学習させたものであることを特徴とする。 In order to solve the above-mentioned problems, the present invention comprises: a first model including a first neural network that converts an audio signal into a feature amount for each frame, and a second neural network that outputs a recognition result of the speaker of the frame based on the converted feature amount of the frame; an input unit that inputs a first voice signal and a second voice signal to the first model; and a collation unit that outputs a collation result indicating whether or not the speaker of the first voice signal is the same as the speaker of the second voice signal, based on the feature amounts of the first voice signal and the second voice signal output from the intermediate layer of the second neural network or from the first neural network in the first model. The first model is obtained by training a second model, which includes the first neural network, the second neural network, and a third neural network that outputs a recognition result of the phoneme of the frame based on the feature amount of the frame converted by the first neural network, on teacher data that associates an audio signal for learning with correct answer data for the speaker of that audio signal and for the phonemes contained in it, such that the output result of the second neural network approaches the correct answer data and the output result of the third neural network does not approach the correct answer data.

本発明によれば、短時間の発話について話者照合の精度を向上させることができる。 According to the present invention, it is possible to improve the accuracy of speaker collation for short-time utterances.

図１は、照合装置の構成例を示す図である。 FIG. 1 is a diagram showing a configuration example of a collation device.
図２は、図１の照合部による話者照合を説明するための図である。 FIG. 2 is a diagram for explaining speaker collation by the collation unit of FIG. 1.
図３は、照合装置の処理手順の例を示すフローチャートである。 FIG. 3 is a flowchart showing an example of a processing procedure of the collation device.
図４は、図３のＳ２の処理を詳細に説明するフローチャートである。 FIG. 4 is a flowchart illustrating the process of S2 of FIG. 3 in detail.
図５は、照合装置の構成例を示す図である。 FIG. 5 is a diagram showing a configuration example of the collation device.
図６は、実験条件を示す図である。 FIG. 6 is a diagram showing experimental conditions.
図７は、実験結果を示す図である。 FIG. 7 is a diagram showing the experimental results.
図８は、実験結果を示す図である。 FIG. 8 is a diagram showing the experimental results.
図９は、照合プログラムを実行するコンピュータの例を示す図である。 FIG. 9 is a diagram showing an example of a computer that executes a collation program.

以下、図面を参照しながら、本発明を実施するための形態（実施形態）について説明する。本発明は、以下に説明する実施形態に限定されない。 Hereinafter, embodiments (embodiments) for carrying out the present invention will be described with reference to the drawings. The present invention is not limited to the embodiments described below.

［構成］ [Configuration]
図１を用いて本実施形態の照合装置の構成例を説明する。照合装置１０は、入力部１１と、出力部１２と、記憶部１３と、制御部１４とを備える。 A configuration example of the collation device of the present embodiment will be described with reference to FIG. 1. The collation device 10 includes an input unit 11, an output unit 12, a storage unit 13, and a control unit 14.

入力部１１は、制御部１４が各種処理を行う際に用いるデータの入力を受け付ける。例えば、入力部１１は、話者認識モデル（話者認識部１４１）の学習に用いる教師データの入力を受け付ける。出力部１２は、制御部１４が行った処理の結果を出力する。例えば、出力部１２は、照合部１４６による音声の話者の照合結果等を出力する。 The input unit 11 receives input of data used when the control unit 14 performs various processes. For example, the input unit 11 receives input of teacher data used for learning the speaker recognition model (speaker recognition unit 141). The output unit 12 outputs the result of the processing performed by the control unit 14. For example, the output unit 12 outputs the collation result of the voice speaker by the collation unit 146.

記憶部１３は、ＲＡＭ（Random Access Memory）、フラッシュメモリ（Flash Memory）等の半導体メモリ素子、または、ハードディスク、光ディスク等の記憶装置によって実現され、照合装置１０を動作させるプログラムや、当該プログラムの実行中に使用されるデータなどが記憶される。例えば、記憶部１３は、話者認識部１４１の学習に用いる教師データを記憶する。また、記憶部１３は、話者認識部１４１に設定されるパラメータの値等を記憶する。 The storage unit 13 is realized by a semiconductor memory element such as a RAM (Random Access Memory) or a flash memory, or by a storage device such as a hard disk or an optical disk, and stores a program for operating the collation device 10, data used during execution of the program, and the like. For example, the storage unit 13 stores teacher data used for learning of the speaker recognition unit 141. Further, the storage unit 13 stores the values of the parameters set in the speaker recognition unit 141 and the like.

教師データは、複数の話者の音声信号について、当該音声信号の示す音素および当該音声信号の話者（正解データ）を対応付けたデータである。この教師データは、学習部１４５が話者認識部１４１の各ＮＮの学習を行う際に用いられる。 The teacher data is data in which the phonemes indicated by the audio signals and the speakers (correct answer data) of the audio signals are associated with the audio signals of a plurality of speakers. This teacher data is used when the learning unit 145 learns each NN of the speaker recognition unit 141.

制御部１４は、照合装置１０全体の制御を司る。制御部１４は、例えば、話者認識部１４１の学習等を行う。 The control unit 14 controls the entire collation device 10. For example, the control unit 14 performs training of the speaker recognition unit 141 and the like.

制御部１４は、話者認識部１４１と、学習部１４５と、照合部１４６とを備える。 The control unit 14 includes a speaker recognition unit 141, a learning unit 145, and a collation unit 146.

話者認識部１４１は、話者認識モデルに基づき、入力された音声データの話者の認識を行う。話者認識部１４１は、第１のＮＮ１４２と、第２のＮＮ１４３と、第３のＮＮ１４４とを備える。 The speaker recognition unit 141 recognizes the speaker of the input voice data based on the speaker recognition model. The speaker recognition unit 141 includes a first NN 142, a second NN 143, and a third NN 144.

第１のＮＮ１４２は、入力された音声信号を、当該音声信号を構成するフレームごとの中間特徴量に変換する。なお、フレームの長さは、例えば、10msである。 The first NN 142 converts the input audio signal into an intermediate feature amount for each frame constituting the audio signal. The length of the frame is, for example, 10 ms.

第２のＮＮ１４３は、第１のＮＮ１４２から出力されたフレーム単位の中間特徴量に基づき、各フレームの話者の認識を行い、各フレームの話者の認識結果を出力する。例えば、第２のＮＮ１４３は、第１のＮＮ１４２から出力されたフレーム単位の中間特徴量に基づき、各フレームの話者がどの話者であるかを推定し、推定した話者のIDを出力する。 The second NN 143 recognizes the speaker of each frame based on the frame-level intermediate feature amount output from the first NN 142, and outputs the recognition result of the speaker of each frame. For example, the second NN 143 estimates which speaker is the speaker of each frame based on the frame-level intermediate feature amount output from the first NN 142, and outputs the ID of the estimated speaker.

第３のＮＮ１４４は、第１のＮＮ１４２から出力されたフレーム単位の中間特徴量に基づき、各フレームの音素の認識を行い、各フレームの音素の認識の結果を出力する。 The third NN 144 recognizes the phonemes of each frame based on the intermediate feature amount of each frame output from the first NN 142, and outputs the result of the recognition of the phonemes of each frame.
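The three-network structure described above can be sketched as a shared frame-level encoder with two heads. The following is only an illustrative sketch: the class name `SpeakerRecognitionModel`, the single-layer networks, the layer sizes, and the random initialization are all assumptions made for the example and are not specified in this document (the 13-dimensional input and 39 phoneme classes merely echo the experimental setup described later).

```python
import math
import random


def linear(weights, bias, x):
    """Apply y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Hypothetical small random initialization for one layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class SpeakerRecognitionModel:
    """Toy stand-in for the speaker recognition unit 141 (sizes are assumptions)."""

    def __init__(self, n_input=13, n_feature=8, n_hidden=6,
                 n_speakers=4, n_phonemes=39, seed=0):
        rng = random.Random(seed)
        self.first = init_layer(rng, n_input, n_feature)           # first NN 142
        self.second_hidden = init_layer(rng, n_feature, n_hidden)  # second NN 143, intermediate layer
        self.second_out = init_layer(rng, n_hidden, n_speakers)    # second NN 143, output layer
        self.third = init_layer(rng, n_feature, n_phonemes)        # third NN 144

    def forward_frame(self, frame):
        """Process one frame; both heads read the same intermediate feature."""
        feature = relu(linear(*self.first, frame))
        hidden = relu(linear(*self.second_hidden, feature))
        speaker_probs = softmax(linear(*self.second_out, hidden))
        phoneme_probs = softmax(linear(*self.third, feature))
        return feature, hidden, speaker_probs, phoneme_probs
```

In this sketch, `feature` corresponds to the output of the first NN 142 and `hidden` to the intermediate layer of the second NN 143, which are the two candidate representations the collation unit 146 is described as using.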

学習部１４５は、教師データを用いて話者認識部１４１を構成する第１のＮＮ１４２、第２のＮＮ１４３および第３のＮＮ１４４の学習を行う。学習部１４５は、更新部１４５１と更新制御部１４５２とを備える。 The learning unit 145 learns the first NN 142, the second NN 143, and the third NN 144 that constitute the speaker recognition unit 141 using the teacher data. The learning unit 145 includes an update unit 1451 and an update control unit 1452.

更新部１４５１は、教師データを用いて話者認識部１４１を構成する第１のＮＮ１４２、第２のＮＮ１４３および第３のＮＮ１４４それぞれのパラメータを更新する。例えば、更新部１４５１は、第２のＮＮ１４３の出力と教師データにおける正解データとの損失（距離）が小さくなり、かつ、第３のＮＮ１４４の出力と教師データにおける正解データとの損失（距離）が大きくなるように、各ＮＮのパラメータを更新する。更新された各ＮＮのパラメータの値は、例えば、記憶部１３に記憶される。 The update unit 1451 updates the parameters of the first NN 142, the second NN 143, and the third NN 144 that constitute the speaker recognition unit 141 by using the teacher data. For example, the update unit 1451 updates the parameters of each NN so that the loss (distance) between the output of the second NN 143 and the correct answer data in the teacher data becomes smaller, and the loss (distance) between the output of the third NN 144 and the correct answer data in the teacher data becomes larger. The updated parameter values of each NN are stored in, for example, the storage unit 13.

例えば、更新部１４５１は、第２のＮＮ１４３の出力と正解データとの損失（L_s）と、第３のＮＮ１４４の出力と正解データとの損失（L_p）とを用いて、以下の式（１）に基づき更新対象のパラメータθ_fを更新する。 For example, the update unit 1451 updates the parameter θ_f to be updated based on the following equation (1), using the loss (L_s) between the output of the second NN 143 and the correct answer data and the loss (L_p) between the output of the third NN 144 and the correct answer data.

式（１）において、μとλは予め設定する学習重みであり、いずれも正の定数である。更新部１４５１が、上記の式（１）に基づき、パラメータを更新すると、結果として、パラメータは、L_sに対して減少し、L_pに対して増加する値で更新されることになる。 In equation (1), μ and λ are preset learning weights, both of which are positive constants. When the update unit 1451 updates the parameter based on equation (1), the parameter is, as a result, updated in a direction that decreases L_s and increases L_p.
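Equation (1) itself does not appear in this text (it was presumably an image in the original publication). One form consistent with the surrounding description, in which μ and λ are positive and the update decreases L_s while increasing L_p, is the gradient-reversal style update below. This is a reconstruction under that assumption, not the published equation; in particular, reading θ_f as the parameters of the feature-extracting first NN is suggested by the subscript but is not stated in the text.

```latex
\theta_f \leftarrow \theta_f - \mu \left( \frac{\partial L_s}{\partial \theta_f} - \lambda \, \frac{\partial L_p}{\partial \theta_f} \right) \tag{1}
```

Under this form, the term $-\mu\,\partial L_s/\partial\theta_f$ is ordinary gradient descent on the speaker loss, while $+\mu\lambda\,\partial L_p/\partial\theta_f$ is gradient ascent on the phoneme loss, with λ setting the relative strength of the adversarial term.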

更新制御部１４５２は、所定の条件を満たすまで、教師データを用いた第１のＮＮ１４２、第２のＮＮ１４３および第３のＮＮ１４４による演算と、当該演算の結果に基づく更新部１４５１による各ＮＮのパラメータの更新処理とを繰り返し実行させる。なお、上記の所定の条件は、例えば、各ＮＮのパラメータの更新回数が所定の繰り返し回数に達したこと、各ＮＮのパラメータの更新量が所定の閾値未満となったこと等である。所定の条件は、各ＮＮの学習が充分に行われた状態になったことを示す条件であれば、上記の条件に限定されない。 The update control unit 1452 causes the computation by the first NN 142, the second NN 143, and the third NN 144 using the teacher data, and the parameter update processing of each NN by the update unit 1451 based on the result of that computation, to be executed repeatedly until a predetermined condition is satisfied. The predetermined condition is, for example, that the number of parameter updates of each NN has reached a predetermined number of repetitions, or that the amount of parameter update of each NN has fallen below a predetermined threshold. The predetermined condition is not limited to these, as long as it indicates that each NN has been sufficiently trained.
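The repeat-until-converged control described here can be illustrated with a toy scalar parameter. The sketch below assumes that Equation (1) has the gradient-reversal form θ_f ← θ_f − μ(∂L_s/∂θ_f − λ ∂L_p/∂θ_f); the function name `train`, the default values, and the quadratic toy losses are hypothetical and chosen only so that the loop has a finite fixed point.

```python
def train(theta, grad_ls, grad_lp, mu=0.1, lam=0.5, max_updates=10_000, tol=1e-8):
    """Repeat the adversarial update until a stopping condition holds.

    Stops when the number of updates reaches max_updates or when the size
    of an update falls below tol (the two example conditions in the text).
    """
    for n_updates in range(1, max_updates + 1):
        # Descend on the speaker loss L_s, ascend on the phoneme loss L_p.
        step = mu * (grad_ls(theta) - lam * grad_lp(theta))
        theta -= step
        if abs(step) < tol:
            break
    return theta, n_updates


# Hypothetical toy losses: L_s = (theta - 3)^2 and L_p = 0.5 * theta^2.
theta, n_updates = train(
    theta=0.0,
    grad_ls=lambda t: 2.0 * (t - 3.0),
    grad_lp=lambda t: t,
)
```

With these toy losses the combined gradient is 1.5θ − 6, so the loop settles near θ = 4 and stops on the update-size condition well before the iteration cap. In a real network the same control logic would wrap a forward pass over the teacher data and a per-network parameter update.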

照合部１４６は、入力された音声信号の話者の照合を行う。例えば、照合部１４６は、学習部１４５による学習後の話者認識部１４１の第２のＮＮ１４３の中間層から出力される特徴量を用いて、入力された音声信号の話者の照合を行う。 The collation unit 146 collates the speaker of the input audio signal. For example, the collation unit 146 collates the speaker of the input audio signal by using the feature amount output from the intermediate layer of the second NN 143 of the speaker recognition unit 141 after learning by the learning unit 145.

上記の話者の照合処理を、図２を用いて説明する。なお、図２に示す第１のＮＮ１４２および第２のＮＮ１４３は、学習部１４５による学習後の話者認識部１４１における第１のＮＮ１４２および第２のＮＮ１４３である。まず、第１のＮＮ１４２は、入力部１１（図１参照）経由で入力された音声信号（第１の音声信号）についてフレーム単位で中間特徴量に変換する。また、第１のＮＮ１４２は、入力部１１経由で入力された照合用の音声信号（第２の音声信号）についてフレーム単位で中間特徴量に変換する。 The above speaker collation process will be described with reference to FIG. The first NN 142 and the second NN 143 shown in FIG. 2 are the first NN 142 and the second NN 143 in the speaker recognition unit 141 after learning by the learning unit 145. First, the first NN 142 converts the audio signal (first audio signal) input via the input unit 11 (see FIG. 1) into an intermediate feature amount in frame units. Further, the first NN 142 converts the matching audio signal (second audio signal) input via the input unit 11 into an intermediate feature amount in frame units.

第２のＮＮ１４３は、第１のＮＮ１４２から出力された、入力された音声信号の中間特徴量に基づき、入力された音声信号の話者の識別処理を行う。また、第２のＮＮ１４３は、第１のＮＮ１４２から出力された、照合用の音声信号の中間特徴量に基づき、照合用の音声信号の話者の識別処理を行う。 The second NN 143 performs speaker identification processing of the input audio signal based on the intermediate feature amount of the input audio signal output from the first NN 142. Further, the second NN 143 performs speaker identification processing of the matching audio signal based on the intermediate feature amount of the matching audio signal output from the first NN 142.

ここで、照合部１４６は、上記の第２のＮＮ１４３の中間層が出力する、入力された音声信号の特徴量と照合用の音声信号の特徴量とを取得する。このとき、入力された音声信号が複数のフレームからなる場合、照合部１４６は、上記の入力された音声信号の特徴量の平均ベクトルと照合用の音声信号の特徴量の平均ベクトルを算出し、それをそれぞれの音声信号の特徴量とする。そして、照合部１４６は、入力された音声信号の特徴量と、照合用の音声信号の特徴量との類似度に基づいて、入力された音声信号の話者と照合用音声信号の話者とが同じであるか否かを示す照合結果を出力する。例えば、上記の類似度が所定の閾値以上であれば、照合部１４６は、入力された音声信号の話者が、照合用の音声信号の話者と同じであると判定する。一方、類似度が所定の閾値未満であれば、照合部１４６は、入力された音声信号の話者が、照合用の音声信号の話者とは異なると判定する。そして、照合部１４６は、上記の判定結果を照合結果として出力する。 Here, the collation unit 146 acquires the feature amount of the input audio signal and the feature amount of the audio signal for collation that are output by the intermediate layer of the second NN 143. When the input audio signal consists of a plurality of frames, the collation unit 146 calculates the average vector of the feature amounts of the input audio signal and the average vector of the feature amounts of the audio signal for collation, and uses these as the feature amounts of the respective audio signals. Then, based on the similarity between the feature amount of the input audio signal and the feature amount of the audio signal for collation, the collation unit 146 outputs a collation result indicating whether or not the speaker of the input audio signal and the speaker of the audio signal for collation are the same. For example, if the similarity is equal to or higher than a predetermined threshold value, the collation unit 146 determines that the speaker of the input audio signal is the same as the speaker of the audio signal for collation. On the other hand, if the similarity is less than the predetermined threshold value, the collation unit 146 determines that the speaker of the input audio signal is different from the speaker of the audio signal for collation. The collation unit 146 then outputs this determination result as the collation result.
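The averaging and threshold decision described above can be sketched as follows. The text does not specify the similarity measure at this point, so cosine similarity is used here purely as an assumption; the function names and the threshold value 0.7 are likewise hypothetical.

```python
import math


def mean_vector(frames):
    """Average frame-level feature vectors into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors (an assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def same_speaker(input_frames, reference_frames, threshold=0.7):
    """Return (decision, similarity): True if similarity >= threshold."""
    similarity = cosine_similarity(
        mean_vector(input_frames), mean_vector(reference_frames)
    )
    return similarity >= threshold, similarity
```

For example, `same_speaker([[1.0, 0.0], [1.0, 0.2]], [[2.0, 0.1]])` compares the averaged input vector `[1.0, 0.1]` with the reference vector `[2.0, 0.1]`; the two point in nearly the same direction, so the decision is `True`. Orthogonal vectors such as `[[1.0, 0.0]]` and `[[0.0, 1.0]]` give a similarity of 0 and a decision of `False`.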

［処理手順］ [Processing procedure]
次に、図３および図４を用いて、照合装置１０の処理手順を説明する。まず、照合装置１０は、教師データを用いて話者認識モデルの学習を行う（Ｓ１）。つまり、照合装置１０の学習部１４５は、教師データを用いて、話者認識部１４１の第２のＮＮ１４３の出力と、教師データにおける正解データとの損失が小さくなり、かつ、話者認識部１４１の第３のＮＮ１４４の出力と教師データにおける正解データとの損失が大きくなるように、話者認識部１４１の各ＮＮのパラメータを更新する。その後、照合装置１０は、学習後の話者認識モデルを用いた話者の照合を行う（Ｓ２）。例えば、照合装置１０の照合部１４６は、学習後の話者認識部１４１における第２のＮＮ１４３の中間層から出力される特徴量を用いて、話者の照合を行う。 Next, the processing procedure of the collation device 10 will be described with reference to FIGS. 3 and 4. First, the collation device 10 trains the speaker recognition model using the teacher data (S1). That is, the learning unit 145 of the collation device 10 uses the teacher data to update the parameters of each NN of the speaker recognition unit 141 so that the loss between the output of the second NN 143 of the speaker recognition unit 141 and the correct answer data in the teacher data becomes smaller, and the loss between the output of the third NN 144 of the speaker recognition unit 141 and the correct answer data in the teacher data becomes larger. After that, the collation device 10 collates speakers using the trained speaker recognition model (S2). For example, the collation unit 146 of the collation device 10 collates the speaker by using the feature amount output from the intermediate layer of the second NN 143 in the trained speaker recognition unit 141.

図４を用いて、図３のＳ２における話者の照合処理を詳細に説明する。例えば、学習後の話者認識部１４１は、入力部１１経由で入力された音声信号と照合用の音声信号の入力を受け付ける（図４のＳ２１）。その後、学習後の話者認識部１４１の第１のＮＮ１４２は、入力された音声信号の中間特徴量を出力し、また、照合用の音声信号の中間特徴量を出力する。次に、学習後の話者認識部１４１の第２のＮＮ１４３は、第１のＮＮ１４２から出力された、入力された音声信号の中間特徴量に基づき、入力された音声信号の話者の認識処理を行う。また、第２のＮＮ１４３は、第１のＮＮ１４２から出力された照合用の音声信号の中間特徴量に基づき、照合用の音声信号の話者の認識処理を行う。ここで、照合部１４６は、第２のＮＮ１４３が上記の話者の認識処理を行う際、第２のＮＮ１４３の中間層から出力される、入力された音声信号の特徴量および照合用の音声信号の特徴量を取得する（Ｓ２２）。 The speaker collation process in S2 of FIG. 3 will be described in detail with reference to FIG. 4. For example, the trained speaker recognition unit 141 receives the audio signal input via the input unit 11 and the audio signal for collation (S21 in FIG. 4). After that, the first NN 142 of the trained speaker recognition unit 141 outputs the intermediate feature amount of the input audio signal and the intermediate feature amount of the audio signal for collation. Next, the second NN 143 of the trained speaker recognition unit 141 performs speaker recognition processing on the input audio signal based on the intermediate feature amount of the input audio signal output from the first NN 142. The second NN 143 likewise performs speaker recognition processing on the audio signal for collation based on the intermediate feature amount of the audio signal for collation output from the first NN 142. Here, when the second NN 143 performs this speaker recognition processing, the collation unit 146 acquires the feature amount of the input audio signal and the feature amount of the audio signal for collation that are output from the intermediate layer of the second NN 143 (S22).

Ｓ２２の後、照合部１４６は、Ｓ２２で取得した、入力された音声信号の特徴量と照合用の音声信号との類似度を計算する（Ｓ２３）。そして、計算した類似度が所定の閾値以上であれば（Ｓ２４でＹｅｓ）、照合部１４６は、入力された音声信号の話者は照合用の音声信号の話者と同じと判定し、その判定の結果を出力する（Ｓ２５）。一方、計算した類似度が所定の閾値未満であれば（Ｓ２４でＮｏ）、照合部１４６は、入力された音声信号の話者は照合用の音声信号の話者とは異なると判定し、その判定の結果を出力する（Ｓ２６）。 After S22, the collation unit 146 calculates the similarity between the feature amount of the input audio signal and the feature amount of the audio signal for collation acquired in S22 (S23). If the calculated similarity is equal to or higher than a predetermined threshold value (Yes in S24), the collation unit 146 determines that the speaker of the input audio signal is the same as the speaker of the audio signal for collation, and outputs the result of that determination (S25). On the other hand, if the calculated similarity is less than the predetermined threshold value (No in S24), the collation unit 146 determines that the speaker of the input audio signal is different from the speaker of the audio signal for collation, and outputs the result of that determination (S26).

このようにすることで、照合装置１０は、学習後の話者認識部１４１の第２のＮＮ１４３の中間層から出力される特徴量を用いて、話者照合を行うことができる。 By doing so, the collation device 10 can perform speaker collation using the feature amount output from the intermediate layer of the second NN143 of the speaker recognition unit 141 after learning.

［その他の実施形態］ [Other embodiments]
なお、照合部１４６は、学習後の話者認識部１４１の第２のＮＮ１４３の中間層から出力された音声信号の特徴量を用いて話者照合を行うこととしたがこれに限定されない。例えば、図１の破線矢印に示すように学習後の話者認識部１４１の第１のＮＮ１４２から出力された音声信号の特徴量を用いて話者照合を行ってもよい。 In the above description, the collation unit 146 performs speaker collation using the feature amount of the audio signal output from the intermediate layer of the second NN 143 of the trained speaker recognition unit 141, but the present invention is not limited to this. For example, as shown by the broken-line arrow in FIG. 1, speaker collation may be performed using the feature amount of the audio signal output from the first NN 142 of the trained speaker recognition unit 141.

また、照合装置１０で学習された話者認識部１４１の第１のＮＮ１４２および第２のＮＮ１４３は、当該照合装置１０により用いられてもよいし、他の装置により用いられてもよい。 Further, the first NN 142 and the second NN 143 of the speaker recognition unit 141 learned by the collation device 10 may be used by the collation device 10 or may be used by another device.

例えば、照合装置１０で学習された第１のＮＮ１４２および第２のＮＮ１４３が、他の照合装置において用いられる場合、例えば、図５に示す構成となる。 For example, when the first NN 142 and the second NN 143 learned by the collation device 10 are used in another collation device, for example, the configuration shown in FIG. 5 is obtained.

図５に示す照合装置１００は、入力部１１と、出力部１２と、制御部１４ａとを備える。制御部１４ａは、照合装置１０により学習された第１のＮＮ１４２および第２のＮＮ１４３と、照合部１４６とを備える。 The collation device 100 shown in FIG. 5 includes an input unit 11, an output unit 12, and a control unit 14a. The control unit 14a includes a first NN 142 and a second NN 143 learned by the collating device 10, and a collating unit 146.

照合装置１００の入力部１１において入力された音声信号と、照合用の音声信号とを受け付けると、学習後の第１のＮＮ１４２がそれぞれの音声信号の特徴量を出力し、第２のＮＮ１４３は第１のＮＮ１４２から出力された音声信号の特徴量に基づき、それぞれの音声信号の話者の認識処理を行う。ここで照合部１４６は、第２のＮＮ１４３が音声信号の話者の認識処理を行う際、当該第２のＮＮ１４３の中間層から出力される音声信号の特徴量を用いて、入力された音声信号の話者が、照合用の音声信号の話者と同じか否かの照合を行う。そして、照合部１４６は照合の結果を出力部１２へ出力する。 When the input unit 11 of the collation device 100 receives an input audio signal and an audio signal for collation, the trained first NN 142 outputs the feature amount of each audio signal, and the second NN 143 performs speaker recognition processing on each audio signal based on the feature amounts output from the first NN 142. Here, when the second NN 143 performs the speaker recognition processing, the collation unit 146 uses the feature amounts of the audio signals output from the intermediate layer of the second NN 143 to determine whether or not the speaker of the input audio signal is the same as the speaker of the audio signal for collation. The collation unit 146 then outputs the collation result to the output unit 12.

上記のように学習後の第１のＮＮ１４２および第２のＮＮ１４３を照合装置１００が用いる場合、照合装置１０は照合部１４６を含まない構成としてもよい。 When the collation device 100 uses the first NN 142 and the second NN 143 after learning as described above, the collation device 10 may be configured not to include the collation unit 146.

［効果］ [Effect]
照合装置１０が学習対象とする話者認識部１４１のＮＮの構成は、非特許文献１に記載のＮＮと同様に、音声信号を中間特徴量に変換するＮＮ（第１ＮＮ）に、話者認識のＮＮ（第２ＮＮ）と音素認識を行うＮＮ（第３ＮＮ）とを連結したものである。しかし、照合装置１０が学習対象とする話者認識部１４１と非特許文献１とでは、以下の点において相違する。 The NN configuration of the speaker recognition unit 141 trained by the collation device 10 is, like the NN described in Non-Patent Document 1, one in which an NN for speaker recognition (second NN) and an NN for phoneme recognition (third NN) are connected to an NN (first NN) that converts a voice signal into an intermediate feature amount. However, the speaker recognition unit 141 trained by the collation device 10 differs from Non-Patent Document 1 in the following respects.

第１に、照合装置１０による学習対象の第１のＮＮ１４２は、セグメント単位の音声信号をフレーム単位で中間特徴量に変換するのに対し、非特許文献１に記載の技術においては、セグメント単位で、つまり、第１のＮＮ１４２よりも長い単位の音声信号を入力として中間特徴量に変換する点が異なる。 First, the first NN142 to be learned by the collation device 10 converts the audio signal of the segment unit into the intermediate feature amount in the frame unit, whereas in the technique described in Non-Patent Document 1, it is in the segment unit. That is, the difference is that an audio signal having a unit longer than that of the first NN 142 is input and converted into an intermediate feature amount.

第２に、非特許文献１では、話者認識のＮＮの出力と音素認識を行うＮＮの出力とが、いずれも正解データに近づくように学習する。これに対して、照合装置１０は、第２のＮＮ１４３については正解データとの損失（距離）が小さくなるが、第３のＮＮ１４４と正解データとの損失（距離）が大きくなるように、つまり、音素認識のタスクについては不正解となる方向に、パラメータを学習させる点が異なる。 Second, in Non-Patent Document 1, both the output of the speaker-recognized NN and the output of the phoneme-recognized NN are learned so as to approach the correct answer data. On the other hand, in the collation device 10, the loss (distance) from the correct answer data is small for the second NN143, but the loss (distance) from the third NN144 to the correct answer data is large, that is, Regarding the phoneme recognition task, the difference is that the parameters are learned in the direction of incorrect answers.

非特許文献１に記載の技術は、話者認識モデルについて話者認識と音素認識の両方が正解データに近づくようにパラメータを学習させる。この結果、学習後の話者認識モデルの第１ＮＮから出力される中間特徴量（話者ベクトル）は、話者認識に適した特徴を含み、かつ、音素認識にも適した特徴を含むようなものが抽出されるようになる。 The technique described in Non-Patent Document 1 trains parameters of a speaker recognition model so that both speaker recognition and phoneme recognition approach correct answer data. As a result, the intermediate features (speaker vector) output from the first NN of the speaker recognition model after learning include features suitable for speaker recognition and also include features suitable for phoneme recognition. Things will be extracted.

一方、照合装置１０が目的とする話者照合のタスクは、入力される２つの音声信号が同じ話者によるものか否かを判定するタスクであり、これら２つの音声信号の内容が異なることが前提となる。ここで、音声信号の内容が異なるということは、各音声に含まれる「音素が何であるか」という情報は、話者照合においては不要な情報と言える。 On the other hand, the speaker verification task targeted by the collation device 10 is the task of determining whether or not two input audio signals come from the same speaker, and it presupposes that the contents of the two audio signals differ. Because the contents differ, the information on "what the phonemes are" contained in each voice can be regarded as unnecessary for speaker verification.

ところが、非特許文献１に記載の技術は、音素に係る情報が特徴として含まれるように第１ＮＮを学習させてしまう。結果として、非特許文献１に記載の技術は、特に短い発話においては音素の特徴が強く表出され、話者の照合に必要な特徴が充分に得られないため、学習後のモデルの話者認識や話者照合の性能は低下すると考えられる。 However, the technique described in Non-Patent Document 1 trains the first NN so that phoneme-related information is included in the features. As a result, particularly for short utterances, the phoneme characteristics are strongly expressed and the characteristics needed for speaker verification are not sufficiently obtained, so the speaker recognition and speaker verification performance of the trained model is considered to deteriorate.

そこで、照合装置１０では、学習部１４５において、音素の特徴が含まれにくくなるように、話者認識部１４１の各ＮＮのパラメータを学習させる。これにより、学習後の話者認識部１４１の第１のＮＮ１４２および第２のＮＮ１４３は短い時間区間の発話から、話者の音素に依存しない特性を効率的に抽出することができるようになる。その結果、照合装置１０は、話者照合タスクの精度向上に資する中間特徴量の抽出が可能となることが期待できる。 Therefore, in the collation device 10, the learning unit 145 learns the parameters of each NN of the speaker recognition unit 141 so that the characteristics of phonemes are less likely to be included. As a result, the first NN 142 and the second NN 143 of the speaker recognition unit 141 after learning can efficiently extract the phoneme-independent characteristics of the speaker from the utterances in a short time interval. As a result, the collation device 10 can be expected to be able to extract intermediate features that contribute to improving the accuracy of the speaker collation task.

［実験結果］ [Experimental results]
次に、照合装置１０により学習された第１のＮＮ１４２および第２のＮＮ１４３を用いた話者照合の実験結果を説明する。本実験における実験条件は、図６に示すとおり、教師データの発話者数は、2620人、発話数は2.8M、発話のトータル時間は960hであり、実験データの発話者数は、40人、発話数は2.6k、発話のトータル時間は5.3hである。それぞれのデータの特徴量は13次元のMFCCであり、音素は39音素である。また、評価方法は、各発話の音素セグメントの話者ベクトルを算出し、得られた話者ベクトル同士の類似度をProbabilistic Linear Discriminant Analysis（PLDA）で算出した。また、話者照合の精度はEqual Error Rate（EER）で評価した。 Next, the experimental results of speaker verification using the first NN 142 and the second NN 143 trained by the collation device 10 will be described. As shown in FIG. 6, the experimental conditions are as follows: the teacher data contains 2620 speakers, 2.8M utterances, and a total utterance time of 960 h, and the experimental data contains 40 speakers, 2.6k utterances, and a total utterance time of 5.3 h. The features of each data set are 13-dimensional MFCCs, and there are 39 phonemes. For evaluation, the speaker vector of each phoneme segment of each utterance was calculated, and the similarity between the obtained speaker vectors was calculated by Probabilistic Linear Discriminant Analysis (PLDA). The accuracy of speaker verification was evaluated by the Equal Error Rate (EER).
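For readers unfamiliar with the Equal Error Rate, it is the operating point at which the false acceptance rate and the false rejection rate coincide. The sketch below is a simple threshold-sweep approximation written for illustration; it takes similarity scores as given (it does not implement PLDA), assumes the "accept if score ≥ threshold" rule used by the collation unit, and the function name `equal_error_rate` is hypothetical.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    target_scores: similarities for same-speaker trials.
    nontarget_scores: similarities for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    """
    best_gap, best_eer = None, None
    for threshold in sorted(set(target_scores) | set(nontarget_scores)):
        # False rejection: same-speaker trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: different-speaker trials scored at or above it.
        far = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer
```

With perfectly separated scores, such as targets `[0.9, 0.8, 0.7]` and non-targets `[0.1, 0.2, 0.3]`, a threshold of 0.7 yields no errors of either kind and the function returns `0.0`. With overlapping scores, such as targets `[0.9, 0.4]` and non-targets `[0.6, 0.1]`, both error rates equal 0.5 at a threshold of 0.6, so the function returns `0.5`.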

The experimental results are shown in FIGS. 7 and 8. In the following, "multi-task learning" means training the speaker recognition model so that the output data of both the NN that performs speaker recognition and the NN that performs phoneme recognition approach the correct answer data indicated by the teacher data. "Adversarial learning" means training the speaker recognition model so that the output data of the NN that performs speaker recognition approaches the correct answer data indicated by the teacher data, whereas the output data of the NN that performs phoneme recognition does not approach the correct answer data indicated by the teacher data.
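The difference between the two training schemes defined above can be sketched as a sign change on the phoneme term of a combined objective. This is only an illustrative sketch: the use of cross-entropy as the measure of closeness to the correct answer data, the `weight` parameter, and the function names are assumptions for this example and are not taken from the source.

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the correct class per frame."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def combined_objective(spk_probs, spk_labels, phn_probs, phn_labels,
                       mode, weight=1.0):
    """Objective to be minimized (hypothetical formulation).

    mode="multitask":   both outputs are pushed toward the correct answers.
    mode="adversarial": the speaker output is pushed toward the correct
                        answers, the phoneme output is pushed away from them.
    """
    l_spk = cross_entropy(spk_probs, spk_labels)
    l_phn = cross_entropy(phn_probs, phn_labels)
    if mode == "multitask":
        return l_spk + weight * l_phn
    if mode == "adversarial":
        return l_spk - weight * l_phn
    raise ValueError("mode must be 'multitask' or 'adversarial'")

if __name__ == "__main__":
    spk_probs = [[0.9, 0.1], [0.2, 0.8]]  # toy per-frame speaker posteriors
    phn_probs = [[0.5, 0.5], [0.5, 0.5]]  # toy per-frame phoneme posteriors
    for mode in ("multitask", "adversarial"):
        print(mode, combined_objective(spk_probs, [0, 1], phn_probs, [0, 1], mode))
```

Minimizing the adversarial form rewards features from which the phoneme is hard to predict, which matches the stated aim of making phoneme characteristics less likely to be included in the features.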

In this experiment, as comparative examples, speaker collation was performed using the NNs of the speaker recognition models (1) to (4) shown in FIG. 7. Each speaker recognition model includes an NN that outputs intermediate features from an audio signal, an NN that performs speaker recognition based on the intermediate features, and an NN that performs phoneme recognition. Model (5), in which adversarial learning is applied to an NN that processes in units of frames (FRM-AT), corresponds to the model trained by the collation device 10 of the present embodiment.

(1) Multi-task learning applied to an NN that processes in units of segments (utterances) (SEG-MT)
(2) Adversarial learning applied to an NN that processes in units of segments (SEG-AT)
(3) An NN that processes in units of frames (FRM)
(4) Multi-task learning applied to an NN that processes in units of frames (FRM-MT)

As shown in FIG. 7, it was confirmed that models with an NN that processes in units of frames achieve higher frame-level speaker collation accuracy than models with an NN that processes in units of segments. It was also confirmed that, for NNs that process in units of frames, adversarial learning yields higher frame-level speaker collation accuracy than multi-task learning.

FIG. 8 shows the relationship between the utterance length of the audio signal to be collated and the speaker collation accuracy of models (1) to (5) above. As shown in FIG. 8, it was confirmed that, for utterances of 1400 ms or shorter, model (5), in which adversarial learning was applied to the frame-level NN (FRM-AT), achieves higher speaker collation accuracy than models (1) to (4).

[Program]
An example of a computer that executes the above program (collation program) will be described with reference to FIG. 9. As shown in FIG. 9, the computer 1000 has, for example, a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These units are connected by a bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. A mouse 1110 and a keyboard 1120, for example, are connected to the serial port interface 1050. A display 1130, for example, is connected to the video adapter 1060.

Here, as shown in FIG. 9, the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. The storage unit 13 described in the above embodiment is provided in, for example, the hard disk drive 1090 or the memory 1010.

The CPU 1020 then reads the program module 1093 and the program data 1094 stored in the hard disk drive 1090 into the RAM 1012 as needed, and executes each of the procedures described above.

The program module 1093 and the program data 1094 of the above collation program are not limited to being stored in the hard disk drive 1090. For example, they may be stored in a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 of the above program may be stored in another computer connected via a network such as a LAN or WAN (Wide Area Network) and read by the CPU 1020 via the network interface 1070.

10 Collation device
11 Input unit
12 Output unit
13 Storage unit
14 Control unit
141 Speaker recognition unit
142 First NN
143 Second NN
144 Third NN
145 Learning unit
146 Collation unit
1451 Update unit
1452 Update control unit

Claims

A collation device comprising:
a first model including a first neural network that converts an audio signal into a feature amount for each frame and a second neural network that outputs a recognition result of the speaker of the frame based on the converted feature amount of the frame;
an input unit that inputs a first audio signal and a second audio signal to the first model; and
a collation unit that outputs a collation result indicating whether or not the speaker of the first audio signal is the same as the speaker of the second audio signal, based on the feature amounts of the first audio signal and the second audio signal output from an intermediate layer of the second neural network or from the first neural network in the first model,
wherein the first model is obtained by training a second model, which includes the first neural network, the second neural network, and a third neural network that outputs a recognition result of the phoneme of the frame based on the feature amount of the frame converted by the first neural network, based on teacher data in which an audio signal for learning is associated with correct answer data of the speaker of the audio signal for learning and of the phonemes contained in that audio signal, such that the output result of the second neural network approaches the correct answer data and the output result of the third neural network does not approach the correct answer data.

The collation device according to claim 1, wherein, when the input first audio signal and second audio signal are each composed of a plurality of frames, the collation unit calculates, for each of the first audio signal and the second audio signal, the average vector of the per-frame feature amounts and uses the calculated average vector as the feature amount of that audio signal.

The collation device according to claim 1, wherein the first model is obtained by updating, when the second model is trained, the parameters of the first neural network, the second neural network, and the third neural network such that the distance between the output result of the second neural network and the correct answer data becomes smaller and the distance between the output result of the third neural network and the correct answer data becomes larger.

The collation device according to claim 1, wherein the collation unit calculates the similarity between the feature amounts of the first audio signal and the second audio signal, determines, when the calculated similarity is equal to or greater than a predetermined value, that the speaker of the first audio signal is the same as the speaker of the second audio signal, and outputs the result of the determination as the collation result.

A collation device comprising:
a learning unit that trains a speaker recognition model including a first neural network that converts an audio signal into a feature amount for each frame, a second neural network that outputs a recognition result of the speaker of the frame based on the converted feature amount of the frame, and a third neural network that outputs a recognition result of the phoneme of the frame based on the converted feature amount of the frame, the training being based on teacher data in which an audio signal is associated with correct answer data of the speaker of the speech indicated by the audio signal and of the phonemes indicated by the audio signal, and being performed such that the output result of the second neural network approaches the correct answer data and the output result of the third neural network does not approach the correct answer data;
an input unit that inputs a first audio signal and a second audio signal to a first model having the first neural network and the second neural network after the training; and
a collation unit that outputs a collation result indicating whether or not the speaker of the first audio signal is the same as the speaker of the second audio signal, based on the feature amounts of the first audio signal and the second audio signal output from an intermediate layer of the second neural network or from the first neural network in the first model after the training.

A collation method executed by a collation device, the method comprising:
an input step of inputting a first audio signal and a second audio signal to a first model including a first neural network that converts an audio signal into a feature amount for each frame and a second neural network that outputs a recognition result of the speaker of the frame based on the converted feature amount of the frame; and
a collation step of outputting a collation result indicating whether or not the speaker of the first audio signal is the same as the speaker of the second audio signal, based on the feature amounts of the first audio signal and the second audio signal output from an intermediate layer of the second neural network or from the first neural network in the first model,
wherein the first model is obtained by training a second model, which includes the first neural network, the second neural network, and a third neural network that outputs a recognition result of the phoneme of the frame based on the feature amount of the frame converted by the first neural network, based on teacher data in which an audio signal for learning is associated with correct answer data of the speaker of the audio signal for learning and of the phonemes contained in that audio signal, such that the output result of the second neural network approaches the correct answer data and the output result of the third neural network does not approach the correct answer data.

A collation program that causes a computer to execute:
an input step of inputting a first audio signal and a second audio signal to a first model including a first neural network that converts an audio signal into a feature amount for each frame and a second neural network that outputs a recognition result of the speaker of the frame based on the converted feature amount of the frame; and
a collation step of outputting a collation result indicating whether or not the speaker of the first audio signal is the same as the speaker of the second audio signal, based on the feature amounts of the first audio signal and the second audio signal output from an intermediate layer of the second neural network or from the first neural network in the first model,
wherein the first model is obtained by training a second model, which includes the first neural network, the second neural network, and a third neural network that outputs a recognition result of the phoneme of the frame based on the feature amount of the frame converted by the first neural network, based on teacher data in which an audio signal for learning is associated with correct answer data of the speaker of the audio signal for learning and of the phonemes contained in that audio signal, such that the output result of the second neural network approaches the correct answer data and the output result of the third neural network does not approach the correct answer data.