JP6646337B2

JP6646337B2 - Audio data processing device, audio data processing method, and audio data processing program

Info

Publication number: JP6646337B2
Application number: JP2016161849A
Authority: JP
Inventors: トランデュング; マークデルクロア; 小川　厚徳; 中谷　智広
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-08-22
Filing date: 2016-08-22
Publication date: 2020-02-14
Anticipated expiration: 2036-08-22
Also published as: JP2018031812A

Description

The present invention relates to an audio data processing device, an audio data processing method, and an audio data processing program.

Conventionally, audio data processing devices are known that, based on speech data, train a speech recognition model and perform speech recognition using that model. When the environment in which the speech data for recognition was created differs from the environment in which the speech data for training was created, the recognition accuracy of the device may degrade. For example, ambient noise or differences between speakers may reduce recognition accuracy. For this reason, speech recognition techniques that are robust to environmental differences, that is, to acoustic conditions, are known (see, for example, Non-Patent Document 1 or 2).

Non-Patent Document 1: R. Gemello, F. Mana, S. Scanzio, P. Laface, and R. De Mori, “Adaptation of hybrid ANN/HMM models using linear hidden transformations and conservative training,” in Proc. of ICASSP'06, vol. 1, 2006, pp. 1189-1192.
Non-Patent Document 2: D. Yu and L. Deng, “Automatic speech recognition: A deep learning approach,” Springer, 2015.

However, the conventional techniques have the problem that, when a plurality of acoustic conditions differ between the environment in which the speech data for recognition was created and the environment in which the speech data for training was created, speech recognition may not be performed with high accuracy.

For example, in the conventional techniques, the features of the speech data are transformed using parameters based on an acoustic condition in order to adapt them to the speech recognition model. However, the parameters used for this transformation are based on only one acoustic condition. Consequently, when a plurality of acoustic conditions differ between the environment in which the speech data for recognition was created and the environment in which the speech data for training was created, speech recognition may not be performed with high accuracy.

The audio data processing device of the present invention includes: an extraction unit that extracts, from speech data for adaptation created based on speech in a predetermined environment, a first input feature, which is a feature indicating characteristics of the speech, and a second input feature, which is a feature indicating characteristics of the environment; a condition feature calculation unit that inputs the second input feature to a condition feature calculation model, which is a calculation model using a neural network, and calculates a condition feature, which is a feature including elements corresponding to each of a plurality of conditions that characterize the predetermined environment; an adaptive feature calculation unit that inputs the first input feature and the condition feature to an adaptive feature calculation model, which is a calculation model including a set of parameters corresponding to each of the plurality of elements included in the condition feature, and calculates, for each element, an adaptive feature, which is a feature adapted to a speech recognition model that is a calculation model using a neural network; and an update unit that updates the parameters of the condition feature calculation model and the parameters of the adaptive feature calculation model based on an output result obtained by inputting the adaptive feature to the speech recognition model.

The audio data processing method of the present invention is an audio data processing method executed by an audio data processing device, and includes: an extraction step of extracting, from speech data for adaptation created based on speech in a predetermined environment, a first input feature, which is a feature indicating characteristics of the speech, and a second input feature, which is a feature indicating characteristics of the environment; a condition feature calculation step of inputting the second input feature to a condition feature calculation model, which is a calculation model using a neural network, and calculating a condition feature, which is a feature including elements corresponding to each of a plurality of conditions that characterize the predetermined environment; an adaptive feature calculation step of inputting the first input feature and the condition feature to an adaptive feature calculation model, which is a calculation model including a set of parameters corresponding to each of the plurality of elements included in the condition feature, and calculating, for each element, an adaptive feature, which is a feature adapted to a speech recognition model that is a calculation model using a neural network; and an update step of updating the parameters of the condition feature calculation model and the parameters of the adaptive feature calculation model based on an output result obtained by inputting the adaptive feature to the speech recognition model.

According to the present invention, speech recognition can be performed with high accuracy even when a plurality of acoustic conditions differ between the environment in which the speech data for recognition was created and the environment in which the speech data for training was created.

FIG. 1 is a diagram illustrating an example of the configuration of an audio data processing device according to the conventional technique.
FIG. 2 is a diagram for explaining an outline of the processing of the audio data processing device according to the conventional technique.
FIG. 3 is a flowchart illustrating an example of speech recognition processing according to the conventional technique.
FIG. 4 is a diagram illustrating an example of the configuration of the audio data processing device according to the first embodiment.
FIG. 5 is a diagram for explaining an outline of the processing of the audio data processing device according to the first embodiment.
FIG. 6 is a diagram for explaining the adaptation processing of the audio data processing device according to the first embodiment.
FIG. 7 is a diagram for explaining the speech recognition processing of the audio data processing device according to the first embodiment.
FIG. 8 is a flowchart illustrating the flow of the adaptation processing of the audio data processing device according to the first embodiment.
FIG. 9 is a flowchart illustrating the flow of the speech recognition processing of the audio data processing device according to the first embodiment.
FIG. 10 is a diagram illustrating an example of a computer that implements the audio data processing device according to the first embodiment by executing a program.

Before describing an example embodiment of the audio data processing device, audio data processing method, and audio data processing program disclosed in the present application, the conventional technique on which the embodiment is premised is described. An example embodiment of the audio data processing device, audio data processing method, and audio data processing program disclosed in the present application is then described.

In the following, when A is a vector it is written as “vector A”, when A is a matrix it is written as “matrix A”, and when A is a scalar it is written simply as “A”. When A is a set, it is written as “set A”. A function f of vector A is written as f(vector A). For A that is a vector, matrix, or scalar, the notation “^A” is equivalent to the symbol in which “^” is placed directly above “A”, and the notation “−A” is equivalent to the symbol in which “−” is placed directly above “A”. For A that is a vector or a matrix, A^T denotes the transpose of A.

The conventional technique is, for example, the speech recognition technique described in Reference 1: G. Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012.

First, the configuration of the audio data processing device according to the conventional technique is described with reference to FIG. 1. FIG. 1 is a diagram illustrating an example of the configuration of an audio data processing device according to the conventional technique. As shown in FIG. 1, the audio data processing device 10a according to the first conventional technique includes a feature extraction unit 21a, a posterior probability calculation unit 22a, and a word string search unit 23a. The audio data processing device 10a also includes a storage unit 30a.

The storage unit 30a stores an acoustic model and a language model in advance. The acoustic model models the acoustic characteristics of speech. The language model consists of a large number of symbol sequences such as phonemes and words. In general, an acoustic model for speech recognition is a left-to-right HMM (Hidden Markov Model) for each phoneme and includes the output probability distribution of each HMM state, calculated by a neural network (NN).

That is, the acoustic model stored in the storage unit 30a consists of neural network parameters, including the HMM state transition probabilities for each symbol such as a phoneme, the weight matrix W_i and bias vector b_i of the i-th hidden layer, and the parameters of the activation function. Here, i is the index of the hidden layer. These are referred to as acoustic model parameters, and their set is denoted Ω = {W_1, b_1, ..., W_I, b_I} (I is the total number of hidden layers). The language model consists of a large number of symbol sequences s_j such as phonemes and words, and p(s_j) is the probability (language probability) of the symbol sequence s_j given by the language model. A symbol sequence s_j is a sequence of symbols, such as phonemes or words, that can be a speech recognition result.

The feature extraction unit 21a extracts speech features from the speech data for recognition. Examples of features include MFCC (Mel Frequency Cepstral Coefficients), LMFC (log Mel filterbank coefficients), ΔMFCC (first derivative of MFCC), ΔΔMFCC (second derivative of MFCC), logarithmic (spectral) power, and Δ logarithmic power (first derivative of logarithmic power).

Then, for each frame, the feature extraction unit 21a concatenates the features obtained from that frame and from about five consecutive frames before and after it, and generates a time-series feature vector x_t (t is a natural number 1, ..., M) of about 10 to 2000 dimensions. The feature extraction unit 21a then generates a feature vector x that collects the time-series feature vectors x_t of all frames, as in equation (1) below.

x = {x_1, x_2, ..., x_M | x_t ∈ R^D}   (1)

The feature vector x is data represented by D-dimensional vectors from the first to the M-th frame. For example, the frame length is about 30 ms and the frame shift is about 10 ms.
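
The frame concatenation described above can be sketched as follows. This is a minimal illustration in Python, not the implementation of the feature extraction unit 21a: the function name `splice_frames`, the toy two-dimensional frames, and the padding of utterance boundaries by repeating the edge frame are all assumptions introduced here.

```python
from typing import List


def splice_frames(frames: List[List[float]], context: int = 5) -> List[List[float]]:
    """Concatenate each frame with its `context` left and right neighbours.

    Boundary frames are padded by repeating the first or last frame. This
    padding rule is an assumption; the text does not say how utterance
    boundaries are handled.
    """
    num_frames = len(frames)
    spliced: List[List[float]] = []
    for t in range(num_frames):
        vector: List[float] = []
        for offset in range(-context, context + 1):
            # Clamp the index so that boundary frames reuse the nearest frame.
            idx = min(max(t + offset, 0), num_frames - 1)
            vector.extend(frames[idx])
        spliced.append(vector)
    return spliced


# Toy utterance: M = 4 frames of 2-dimensional features (hypothetical values).
frames = [[float(t), float(t) + 0.5] for t in range(4)]
x = splice_frames(frames, context=1)
print(len(x), len(x[0]))  # number of frames M, spliced dimension D
```

With a context of five frames on each side, as in the text, each spliced vector has eleven times the per-frame dimension.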

The posterior probability calculation unit 22a acquires the acoustic model from the storage unit 30a and, based on the acoustic model parameters Ω, calculates the output probability of each HMM state of the acoustic model for each frame t of the feature vector x. The output probability of an HMM state is the output of a neural network as expressed, for example, by equation (2) of Reference 1 (G. Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82-97, 2012).

FIG. 2 is a diagram for explaining an outline of the processing of the audio data processing device according to the conventional technique. As shown in FIG. 2, the neural network that represents the acoustic model in conventional speech recognition has one or more hidden layers between its input and output. The input of the neural network is the time-series feature vector x_t, which is fed to the first hidden layer. The output of the neural network is the output probability of the HMM states, obtained from the last hidden layer. The calculation performed in each hidden layer by the posterior probability calculation unit 22a consists of two processes: a linear transformation and an activation function. The linear transformation in each hidden layer is given by equation (2) below.

z_{i,t} = W_i x_{i−1,t} + b_i   (2)

In equation (2), the vector z_{i,t} is the output of the linear transformation in the i-th hidden layer (i is a natural number, i = 1, 2, ..., I, where I is the total number of hidden layers), and the vector x_{i−1,t} is the output of the (i−1)-th hidden layer. The vector x_{0,t} is the time-series feature vector x_t that is the input of the neural network. The output of the activation function is given by equation (3) below.

x_{i,t} = f(z_{i,t})   (3)

In equation (3), the vector x_{i,t} is the output of the i-th hidden layer, and f() is an activation function, for example a sigmoid function, computed for each element of the vector. That is, in the i-th hidden layer, the posterior probability calculation unit 22a applies the linear transformation of equation (2) to the vector x_{i−1,t}, the output of the preceding (i−1)-th hidden layer, to obtain the vector z_{i,t}, applies the activation function of equation (3) to z_{i,t}, and outputs the resulting vector x_{i,t}. The posterior probability calculation unit 22a then calculates, based on the vectors x_{i,t} (i = 1, 2, ..., I), the output probability of each HMM state of the acoustic model for each frame of the feature vector x.
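
As a rough illustration of equations (2) and (3), the forward computation through the hidden layers can be sketched in plain Python. The layer sizes, the parameter values, and the softmax output layer that turns the last hidden-layer output into HMM-state probabilities are assumptions made for this sketch; the text above only states that the output probabilities are obtained from the hidden-layer outputs.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Linear transformation of equation (2): z = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]


def sigmoid(z: Vector) -> Vector:
    """Element-wise activation of equation (3) with f = sigmoid."""
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z: Vector) -> Vector:
    """Normalise scores into probabilities (assumed output layer)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def forward(hidden: List[Tuple[Matrix, Vector]],
            output: Tuple[Matrix, Vector],
            x_t: Vector) -> Vector:
    """Propagate one frame x_t through all hidden layers, then the output layer."""
    h = x_t  # x_{0,t} is the network input
    for W_i, b_i in hidden:
        h = sigmoid(affine(W_i, b_i, h))  # x_{i,t} = f(W_i x_{i-1,t} + b_i)
    W_out, b_out = output
    return softmax(affine(W_out, b_out, h))


# Hypothetical parameters: one hidden layer with 2 units, 3 HMM states.
hidden_layers = [([[0.5, -0.2], [0.1, 0.3]], [0.0, 0.1])]
output_layer = ([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0])
posteriors = forward(hidden_layers, output_layer, [1.0, 2.0])
print(posteriors)
```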

Based on the output probabilities of the HMM states calculated by the posterior probability calculation unit 22a, the word string search unit 23a generates J competing candidate symbol sequences s_j (J is a natural number, j = 1, 2, ..., J) and, for each candidate s_j, calculates an acoustic score indicating how well it fits the acoustic model. A symbol is, for example, a phoneme.

Next, based on the language model read from the storage unit 30a, the word string search unit 23a calculates, for each competing candidate symbol sequence s_j, a language score indicating how well it fits the language model. Then, based on the calculated acoustic and language scores, the word string search unit 23a searches the language model stored in the storage unit 30a for the candidate among the J competing candidate symbol sequences s_j that is most likely to be the word string corresponding to the speech data for recognition, that is, the candidate with the highest combined acoustic and language score, and outputs it as the recognition result, the word string ^S.
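
The selection of the candidate with the highest combined score can be sketched as follows. The text does not specify how the two scores are combined; adding log-domain scores with a language-model weight is a common convention and is assumed here. The candidate strings, score values, and the names `Candidate`, `best_candidate`, and `lm_weight` are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    symbols: str           # competing candidate symbol sequence s_j
    acoustic_score: float  # log-domain fit to the acoustic model
    language_score: float  # log-domain fit to the language model


def best_candidate(candidates: List[Candidate], lm_weight: float = 1.0) -> Candidate:
    """Return the candidate whose combined score is highest.

    The combination rule (acoustic + lm_weight * language) is an assumption.
    """
    return max(
        candidates,
        key=lambda c: c.acoustic_score + lm_weight * c.language_score,
    )


# Hypothetical scores for J = 2 competing candidates.
candidates = [
    Candidate("recognize speech", -12.0, -3.0),
    Candidate("wreck a nice beach", -11.5, -6.0),
]
result = best_candidate(candidates)
print(result.symbols)
```

In this toy case the second candidate fits the acoustic model slightly better, but the language score tips the combined score in favour of the first.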

FIG. 3 is a flowchart illustrating an example of speech recognition processing according to the conventional technique. First, the audio data processing device 10a reads the acoustic model and the language model from the storage unit 30a (step S101a). Next, the audio data processing device 10a reads the speech data for recognition (step S102a). Next, the audio data processing device 10a extracts speech features from the read speech data for recognition and generates the feature vector x_t (step S103a).

Next, based on the read acoustic model, the audio data processing device 10a calculates, as posterior probabilities, the output probability of each HMM state of the acoustic model for each frame of the feature vector x_t (step S104a). Next, based on the output probabilities of the HMM states calculated by the posterior probability calculation unit 22a, the audio data processing device 10a generates competing candidate symbol sequences s_j and searches the language model for the candidate with the highest combined acoustic and language score (step S105a). Finally, the audio data processing device 10a outputs the search result of step S105a as the recognition result, the word string ^S (step S106a).

[Configuration of First Embodiment]
Hereinafter, embodiments of the audio data processing device, audio data processing method, and audio data processing program disclosed in the present application are described. The following embodiments are merely examples and do not limit the technology disclosed in the present application. The embodiments described below and other embodiments may be combined as appropriate to the extent that they do not contradict each other.

First, the configuration of the audio data processing device according to the first embodiment is described with reference to FIG. 4. FIG. 4 is a diagram illustrating an example of the configuration of the audio data processing device according to the first embodiment. As shown in FIG. 4, the audio data processing device 10 includes a control unit 20 and a storage unit 30.

The control unit 20 controls the entire audio data processing device 10. The control unit 20 is, for example, an electronic circuit such as a CPU (Central Processing Unit) or an MPU (Micro Processing Unit), or an integrated circuit such as an ASIC (Application Specific Integrated Circuit) or an FPGA (Field Programmable Gate Array). The control unit 20 has an internal memory for storing programs that define various processing procedures and control data, and executes each process using the internal memory. The control unit 20 also functions as various processing units when the corresponding programs run. For example, the control unit 20 includes an extraction unit 21, a calculation unit 22, an update unit 23, and a recognition unit 24.

The storage unit 30 is a storage device such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), or an optical disc. The storage unit 30 may instead be a rewritable semiconductor memory such as a RAM (Random Access Memory), a flash memory, or an NVSRAM (Non Volatile Static Random Access Memory). The storage unit 30 stores the OS (Operating System) and the various programs executed by the audio data processing device 10, as well as various information used in executing the programs. The storage unit 30 also stores a speech recognition model 31, a condition feature calculation model 32, and an adaptive feature calculation model 33. Specifically, the storage unit 30 stores, for example, the parameters needed to perform calculations with each calculation model.

The audio data processing device 10 uses speech data for adaptation to adapt the condition feature calculation model 32 and the adaptive feature calculation model 33 to the acoustic conditions. The condition feature calculation model 32 is a calculation model using a neural network that calculates parameters for transforming features extracted from the speech data used for speech recognition into features matched to the acoustic conditions.

In this case, the extraction unit 21 first extracts, from speech data for adaptation created based on speech in a predetermined environment, a first input feature, which is a feature indicating characteristics of the speech, and a second input feature, which is a feature indicating characteristics of the environment.

The features extracted from the speech data for adaptation contain both characteristics of the speech and characteristics of the environment. The extraction unit 21 may use the same feature as both the first input feature and the second input feature, or it may use different features. For example, the extraction unit 21 may calculate a feature based on the difference between the first input feature and a feature indicating the speech characteristics of noise-suppressed speech data, and extract the calculated feature as the second input feature.

For example, letting x_t be the first input feature and y_t be the feature indicating the speech characteristics of the speech data after noise suppression, the extraction unit 21 can calculate the second input feature by equation (4) below.
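
Equation (4) itself is not available in this text, so the following is only a rough illustration of the idea of a difference-based second input feature. It assumes the simplest reading, an element-wise difference x_t − y_t per frame; the actual equation may include further processing. The function name and the toy values are hypothetical.

```python
from typing import List


def second_input_feature(x_t: List[float], y_t: List[float]) -> List[float]:
    """Element-wise difference between the observed feature x_t and the
    noise-suppressed feature y_t (assumed form, not the patent's equation)."""
    if len(x_t) != len(y_t):
        raise ValueError("x_t and y_t must have the same dimension")
    return [x - y for x, y in zip(x_t, y_t)]


# Hypothetical 3-dimensional features for one frame.
x_t = [1.5, 2.0, 0.5]    # first input feature (noisy speech)
y_t = [1.0, 2.0, 0.25]   # feature after noise suppression
u_t = second_input_feature(x_t, y_t)
print(u_t)
```

Intuitively, whatever the noise suppression removed remains in the difference, so this vector carries information about the environment rather than about the speech content.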

The calculation unit 22 then inputs the second input feature to the condition feature calculation model 32, which is a calculation model using a neural network, and calculates a condition feature, which is a feature including elements corresponding to each of a plurality of conditions that characterize the predetermined environment. The conditions covered by the condition feature calculation model 32 can be, for example, the speaker's gender, age, and nationality, or the type and level of noise.

The calculation unit 22 also inputs the first input feature and the condition feature to the adaptive feature calculation model 33, which is a calculation model including a set of parameters corresponding to each of the plurality of elements included in the condition feature, and calculates, for each element, an adaptive feature, which is a feature adapted to the speech recognition model 31 that is a calculation model using a neural network. In this way, the calculation unit 22 uses the adaptive feature calculation model 33 to calculate as many adaptive features as there are elements in the condition feature, that is, as many as there are conditions characterizing the environment.
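
The two calculations performed by the calculation unit 22 can be sketched as follows. The sketch makes several assumptions that go beyond the text: the condition feature calculation model is reduced to a single layer with a softmax output, each parameter set of the adaptive feature calculation model is an affine transform (A_k, b_k), and each per-element adaptive feature is scaled by its condition-feature element. All names and values are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Compute W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b_j for row, b_j in zip(W, b)]


def condition_features(W_c: Matrix, b_c: Vector, u_t: Vector) -> Vector:
    """Map the second input feature u_t to K condition-feature elements.

    A single layer with softmax is an assumption made for brevity.
    """
    z = affine(W_c, b_c, u_t)
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_features(x_t: Vector, alpha: Vector,
                      transforms: List[Tuple[Matrix, Vector]]) -> List[Vector]:
    """Return one adaptive feature per condition-feature element.

    Element k uses its own parameter set (A_k, b_k); scaling the result by
    alpha[k] is an assumed way of feeding the condition feature in.
    """
    if len(alpha) != len(transforms):
        raise ValueError("one parameter set is needed per condition element")
    return [
        [a_k * v for v in affine(A_k, b_k, x_t)]
        for a_k, (A_k, b_k) in zip(alpha, transforms)
    ]


# Hypothetical setup: K = 2 conditions, 2-dimensional features.
alpha = condition_features([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [0.0, 0.0])
transforms = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),  # parameter set for element 1
    ([[2.0, 0.0], [0.0, 2.0]], [1.0, 1.0]),  # parameter set for element 2
]
y = adaptive_features([1.0, 2.0], alpha, transforms)
print(alpha, y)
```

The result contains K vectors, matching the statement that the number of adaptive features equals the number of condition-feature elements. How these are combined before they enter the speech recognition model 31 is not stated in this passage; summing them is one possibility.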

The update unit 23 then updates the parameters of the condition feature calculation model 32 and the parameters of the adaptive feature calculation model 33 based on the output result obtained by inputting the adaptive features to the speech recognition model 31. The update unit 23 updates the parameters using a method such as error backpropagation for neural networks. Because the update unit 23 reflects the output of the speech recognition model 31 in updating the parameters of the condition feature calculation model 32 and the adaptive feature calculation model 33, the accuracy of the speech recognition model 31 under the given acoustic conditions improves as the updates proceed.
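
The update loop can be illustrated with a toy example in which the speech recognition model is kept fixed and only the condition-feature and adaptive-feature parameters are adjusted. To keep the sketch short, central finite differences stand in for error backpropagation; this is a substitution made here, not the method of the text. The model shapes, the weighted sum of per-element adaptive features, the cross-entropy loss, and all values are likewise assumptions.

```python
import math
from typing import Callable, List

X_T = [1.0, 2.0]    # first input feature (toy)
U_T = [0.3, -0.1]   # second input feature (toy)
LABEL = 0           # correct output class from a transcription (toy)
V = [[1.0, 0.0], [0.0, 1.0]]  # fixed recognition-model weights (toy)
K, D = 2, 2         # number of condition elements, feature dimension


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def loss(theta: List[float]) -> float:
    """Cross-entropy of the fixed recognition model on the adapted feature.

    theta holds K*D condition-model weights, then for each element k a
    D-dimensional scale and a D-dimensional bias (assumed parameterisation).
    """
    W_c = [theta[k * D:(k + 1) * D] for k in range(K)]
    alpha = softmax([sum(w * u for w, u in zip(row, U_T)) for row in W_c])
    offset = K * D
    y = [0.0] * D
    for k in range(K):
        start = offset + 2 * k * D
        scale = theta[start:start + D]
        bias = theta[start + D:start + 2 * D]
        for d in range(D):
            # Weighted sum over elements is an assumption of this sketch.
            y[d] += alpha[k] * (scale[d] * X_T[d] + bias[d])
    p = softmax([sum(v * y_d for v, y_d in zip(row, y)) for row in V])
    return -math.log(p[LABEL])


def numeric_gradient(f: Callable[[List[float]], float],
                     theta: List[float], eps: float = 1e-5) -> List[float]:
    """Central finite differences, used here in place of backpropagation."""
    grad = []
    for i in range(len(theta)):
        up, down = theta[:], theta[:]
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


theta = [0.0] * (K * D) + [1.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.5, -0.5]
loss_before = loss(theta)
for _ in range(20):
    g = numeric_gradient(loss, theta)
    theta = [t - 0.1 * g_i for t, g_i in zip(theta, g)]
loss_after = loss(theta)
print(loss_before, loss_after)
```

Only `theta` changes during the loop while `V` stays fixed, which mirrors the case where the speech recognition model is trained separately from the adaptation.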

The training of the speech recognition model 31 itself may be performed at the same time as the adaptation of the adaptive feature calculation model 33, or separately. Hereinafter, speech data for training the speech recognition model 31 is referred to as speech data for training, and speech data for adapting the condition feature calculation model 32 is referred to as speech data for adaptation.

なお、音声認識モデル３１および条件特徴量計算モデル３２はいずれもニューラルネットワークを用いた計算モデルとすることができるため、音声認識モデル３１の学習、条件特徴量計算モデル３２の適応および音声認識は、従来技術と同様の方法で行うこととしてもよい。 Since both the speech recognition model 31 and the condition feature calculation model 32 can be calculation models using a neural network, the learning of the speech recognition model 31, the adaptation of the condition feature calculation model 32, and speech recognition may be performed in the same manner as in the related art.

また、適応用の音声データに対応した書き起こし等の正解データが存在する場合、音声データ処理装置１０は、教師あり適応を行うことができる。また、音声データ処理装置１０は、音声認識の出力等に基づき教師なし適応を行うことができる。 When there is correct data such as a transcription corresponding to the audio data for adaptation, the audio data processing device 10 can perform supervised adaptation. Further, the voice data processing device 10 can perform unsupervised adaptation based on the output of voice recognition and the like.

音声データ処理装置１０は、音声認識用の音声データを用いて音声認識を行う。この場合、まず、抽出部２１は、所定の環境における音声を基に作成された音声認識用の音声データから、第１の入力特徴量、および第２の入力特徴量を抽出する。抽出部２１は、第１の入力特徴量および第２の入力特徴量を、適応の場合と同様の方法で抽出する。 The voice data processing device 10 performs voice recognition using voice data for voice recognition. In this case, first, the extracting unit 21 extracts a first input feature amount and a second input feature amount from speech recognition voice data created based on speech in a predetermined environment. The extraction unit 21 extracts the first input feature amount and the second input feature amount in the same manner as in the case of adaptation.

そして、計算部２２は、条件特徴量計算モデル３２に第２の入力特徴量を入力し、条件特徴量を計算する。そして、計算部２２は、適応特徴量計算モデル３３に第１の入力特徴量および条件特徴量を入力し、適応特徴量を計算する。そして、認識部２４は、音声認識モデル３１に適応特徴量を入力して得られた出力結果を基に、音声の認識を行う。これにより、音声データ処理装置１０は、音響条件を考慮した音声認識を行うことができる。 Then, the calculation unit 22 inputs the second input feature value to the condition feature value calculation model 32 and calculates the condition feature value. Then, the calculation unit 22 inputs the first input feature value and the condition feature value to the adaptive feature value calculation model 33, and calculates the adaptive feature value. Then, the recognition unit 24 performs speech recognition based on an output result obtained by inputting the adaptive feature amount to the speech recognition model 31. Thereby, the voice data processing device 10 can perform voice recognition in consideration of the acoustic conditions.

図５を用いて、音声データ処理装置１０の処理の概要について説明する。図５は、第１の実施形態に係る音声データ処理装置の処理の概要について説明するための図である。図５に示すように、計算部２２は、第２の入力特徴量σ_ｘ，ｔを、条件特徴量計算モデル３２に入力する。そして、計算部２２は、条件特徴量計算モデル３２を用いて計算した条件特徴量の要素α_１、α_２、α_３、および、要素のそれぞれに対応したパラメータ（Ｕ_１，ｖ_１）、（Ｕ_２，ｖ_２）、（Ｕ_３，ｖ_３）を用いて、第１の入力特徴量ｘ_ｔから、適応特徴量−ｘ_ｔを計算する。ここで、Ｕは変換行列であり、ｖはバイアスベクトルである。 The outline of the processing of the audio data processing device 10 will be described with reference to FIG. 5. FIG. 5 is a diagram for explaining an outline of the processing of the audio data processing device according to the first embodiment. As shown in FIG. 5, the calculation unit 22 inputs the second input feature amount σ_{x,t} to the condition feature calculation model 32. Then, using the elements α_1, α_2, and α_3 of the condition feature amount calculated with the condition feature calculation model 32 and the parameters (U_1, v_1), (U_2, v_2), and (U_3, v_3) corresponding to the respective elements, the calculation unit 22 calculates the adaptive feature amount x̄_t from the first input feature amount x_t. Here, U is a transformation matrix, and v is a bias vector.

また、更新部２３は、音声認識モデル３１に適応特徴量−ｘ_ｔを入力して得られた音声認識結果を基に、条件特徴量計算モデル３２のパラメータおよび適応特徴量計算モデル３３のパラメータの更新を行う。図５に示すように、音声認識モデル３１は、ソフトマクス層に至るまでの各層にパラメータ（Ｗ_１，ｂ_１）、（Ｗ_２，ｂ_２）、（Ｗ_３，ｂ_３）、（Ｗ_４，ｂ_４）が設定されたＤＮＮ（Deep Neural Network）である。 Further, the updating unit 23 updates the parameters of the condition feature calculation model 32 and the parameters of the adaptive feature calculation model 33 based on the speech recognition result obtained by inputting the adaptive feature amount x̄_t to the speech recognition model 31. As shown in FIG. 5, the speech recognition model 31 is a DNN (Deep Neural Network) in which the parameters (W_1, b_1), (W_2, b_2), (W_3, b_3), and (W_4, b_4) are set for the respective layers up to the softmax layer.

ここで、図６および７を用いて、第１の実施形態に係る音声データ処理装置１０の適応処理および音声認識処理について説明する。図６は、第１の実施形態に係る音声データ処理装置の適応処理について説明するための図である。また、図７は、第１の実施形態に係る音声データ処理装置の音声認識処理について説明するための図である。 Here, the adaptation processing and the speech recognition processing of the audio data processing device 10 according to the first embodiment will be described with reference to FIGS. 6 and 7. FIG. 6 is a diagram for explaining the adaptation processing of the audio data processing device according to the first embodiment. FIG. 7 is a diagram for explaining the speech recognition processing of the audio data processing device according to the first embodiment.

以降の説明では、データの流れを明確にするため、音声データ処理装置１０の各処理部が、それぞれさらに処理部を有することとして説明する。具体的には、図６および７に示すように、抽出部２１は、第１の入力特徴量抽出部２１１および第２の入力特徴量抽出部２１２を有する。また、計算部２２は、条件特徴量計算部２２１、特徴量変換部２２２、事後確率計算部２２３を有する。また、更新部２３は、エラー計算部２３１、微分値計算部２３２、パラメータ更新部２３３および収束判定部２３４を有する。また、認識部２４は、単語列検索部２４１を有する。 In the following description, in order to clarify the flow of data, it is assumed that each processing unit of the audio data processing device 10 further has a processing unit. Specifically, as shown in FIGS. 6 and 7, the extraction unit 21 includes a first input feature amount extraction unit 211 and a second input feature amount extraction unit 212. The calculating unit 22 includes a conditional feature calculating unit 221, a feature converting unit 222, and a posterior probability calculating unit 223. The updating unit 23 includes an error calculating unit 231, a differential value calculating unit 232, a parameter updating unit 233, and a convergence determining unit 234. The recognition unit 24 has a word string search unit 241.

まず、図６を用いて、音声データ処理装置１０の適応処理について説明する。図６に示すように、第１の入力特徴量抽出部２１１は、適応用の音声データから各フレームの第１の入力特徴量ｘ_ｔを抽出する。また、第２の入力特徴量抽出部２１２は、適応用の音声データから各フレームの第２の入力特徴量σ_ｘ，ｔを抽出する。 First, the adaptation processing of the audio data processing device 10 will be described with reference to FIG. 6. As shown in FIG. 6, the first input feature amount extraction unit 211 extracts the first input feature amount x_t of each frame from the adaptation speech data. The second input feature amount extraction unit 212 extracts the second input feature amount σ_{x,t} of each frame from the adaptation speech data.

次に、条件特徴量計算部２２１は、記憶部３０から条件特徴量計算モデル３２を取得し、第２の入力特徴量σ_ｘ，ｔを用いて、各フレームの条件特徴量α_ｎ，ｔを計算する。下記（５）式に示すように、条件特徴量計算部２２１は、ニューラルネットワークである条件特徴量計算モデル３２の出力として条件特徴量α_ｎ，ｔを計算する。なお、Ω´は、条件特徴量計算モデル３２の各層における線形変換のためのパラメータの集合であり、Ω´＝｛Ｗ´_１，ｂ´_１，・・・，Ｗ´_Ｉ´，ｂ´_Ｉ´｝（Ｉ´は、隠れ層の総数）とする。また、各層における線形変換の方法は、従来技術と同様である。 Next, the condition feature calculation unit 221 acquires the condition feature calculation model 32 from the storage unit 30 and uses the second input feature amount σ_{x,t} to calculate the condition feature amount α_{n,t} of each frame. As shown in equation (5) below, the condition feature calculation unit 221 calculates the condition feature amount α_{n,t} as the output of the condition feature calculation model 32, which is a neural network. Here, Ω′ is the set of parameters for the linear transformation in each layer of the condition feature calculation model 32, Ω′ = {W′_1, b′_1, ..., W′_{I′}, b′_{I′}}, where I′ is the total number of hidden layers. The method of linear transformation in each layer is the same as in the related art.
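The calculation performed by the condition feature calculation unit 221 can be pictured as a small feed-forward network. The following is a minimal sketch under stated assumptions, not the implementation of the embodiment: the sigmoid hidden activation, the softmax output (which makes the elements α_{n,t} sum to 1), and all function names are hypothetical choices, since the text only states that each layer applies a linear transformation with the parameters Ω′.

```python
import math


def affine(W, b, x):
    """Return W x + b for a matrix W (list of rows), a bias b and a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-z)) for z in v]


def softmax(v):
    m = max(v)  # subtract the maximum for numerical stability
    e = [math.exp(z - m) for z in v]
    s = sum(e)
    return [ei / s for ei in e]


def condition_features(layers, sigma_t):
    """Map the second input feature sigma_t to the condition features alpha_{n,t}.

    `layers` is a list of (W', b') pairs standing in for the parameter set Omega'.
    The hidden activation and the output softmax are assumptions of this sketch.
    """
    h = sigma_t
    for W, b in layers[:-1]:
        h = sigmoid(affine(W, b, h))
    W, b = layers[-1]
    return softmax(affine(W, b, h))
```

With N output units, the returned list has one weight per condition, which is the number of parameter pairs (U_n, v_n) that the adaptive feature calculation model 33 needs to hold.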

次に、特徴量変換部２２２は、記憶部３０から適応特徴量計算モデル３３を取得し、第１の入力特徴量ｘ_ｔ、条件特徴量α_ｎ，ｔ、パラメータの組Ｕ_１，ｖ_１，・・・，Ｕ_Ｎ，ｖ_Ｎに基づいて、下記（６）式を用いて、各フレームの適応特徴量−ｘ_ｔを計算する。ここで、前述の通り、Ｕは変換行列であり、ｖはバイアスベクトルである。 Next, the feature conversion unit 222 acquires the adaptive feature calculation model 33 from the storage unit 30 and calculates the adaptive feature amount x̄_t of each frame using equation (6) below, based on the first input feature amount x_t, the condition feature amount α_{n,t}, and the parameter set U_1, v_1, ..., U_N, v_N. Here, as described above, U is a transformation matrix, and v is a bias vector.
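Equation (6) is not reproduced in this text. One natural reading of the description is that the adaptive feature amount is the sum of N affine transformations of x_t, each weighted by its condition feature element, that is, x̄_t = Σ_n α_{n,t} (U_n x_t + v_n). The sketch below implements that reading; the weighted-sum form and the function name are assumptions, not a statement of the patented formula.

```python
def adaptive_feature(x_t, alpha_t, params):
    """Combine N affine transforms (U_n, v_n) of x_t, weighted by alpha_{n,t}.

    x_t     : first input feature (list of floats)
    alpha_t : condition feature elements, one weight per transform
    params  : list of (U_n, v_n), with U_n a list of rows and v_n a bias vector
    """
    dim = len(params[0][1])
    out = [0.0] * dim
    for a, (U, v) in zip(alpha_t, params):
        for p in range(dim):
            transformed = sum(u * x for u, x in zip(U[p], x_t)) + v[p]
            out[p] += a * transformed
    return out
```

Under this reading, setting one α to 1 and the others to 0 reduces the computation to a single linear input transformation, which is the single-condition case that the text later compares against as LIN.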

次に、事後確率計算部２２３は、記憶部３０から音声認識モデル３１を取得し、適応特徴量−ｘ_ｔに基づいて、下記（７）式のように、ＨＭＭ状態の出力確率を各フレームの事後確率ｏ_ｔとして計算する。なお、Ωは、音声認識モデル３１の各層における線形変換のためのパラメータの集合であり、Ω＝｛Ｗ_１，ｂ_１，・・・，Ｗ_Ｉ，ｂ_Ｉ｝（Ｉは、隠れ層の総数）とする。また、各層における線形変換の方法は、従来技術と同様である。 Next, the posterior probability calculation unit 223 acquires the speech recognition model 31 from the storage unit 30 and, based on the adaptive feature amount x̄_t, calculates the output probabilities of the HMM states as the posterior probability o_t of each frame, as in equation (7) below. Here, Ω is the set of parameters for the linear transformation in each layer of the speech recognition model 31, Ω = {W_1, b_1, ..., W_I, b_I}, where I is the total number of hidden layers. The method of linear transformation in each layer is the same as in the related art.

更新部２３は、条件特徴量計算モデル３２のパラメータΩ´、および適応特徴量計算モデル３３のパラメータＵ_１，ｖ_１，・・・，Ｕ_Ｎ，ｖ_Ｎの最適化を行う。更新部２３は、ニューラルネットワークの学習手順に従い、誤差逆伝搬とＳＧＤを用いてパラメータを更新し最適化する。 The updating unit 23 optimizes the parameters Ω′ of the condition feature calculation model 32 and the parameters U_1, v_1, ..., U_N, v_N of the adaptive feature calculation model 33. Following the usual neural network training procedure, the updating unit 23 updates and optimizes the parameters using error backpropagation and SGD (stochastic gradient descent).

まず、エラー計算部２３１は、下記（８）式の通りエラー、すなわち各層における逆伝搬した誤差δ_ｉ，ｔを計算する。また、エラー計算部２３１は、δ_Ｉ，ｔを下記（９）式の通り計算する。なお、ｄ_ｔは、正解データから得られる正解ＨＭＭ状態である。 First, the error calculation unit 231 calculates the error, that is, the backpropagated error δ_{i,t} in each layer, according to equation (8) below. The error calculation unit 231 also calculates δ_{I,t} according to equation (9) below. Note that d_t is the correct HMM state obtained from the correct data.

ここで、微分値計算部２３２は、下記（１０）式で表されるCross Entropy関数を各パラメータで微分した値を、それぞれ下記（１１）〜（１４）式により計算する。 Here, the differential value calculation unit 232 calculates values obtained by differentiating the Cross Entropy function represented by the following equation (10) with each parameter according to the following equations (11) to (14).
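Equations (9) and (10) are not reproduced in this text. For a softmax output layer trained with a cross-entropy criterion, it is a standard result that the derivative of the criterion with respect to the pre-softmax activations is the posterior minus the one-hot target. The sketch below illustrates that result; the function names are hypothetical, and the sign convention of equation (9) in the original may differ (it may define the error as d_t minus o_t and add it in the update).

```python
import math


def softmax(z):
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(o_t, d_t):
    """C = -sum_k d_{t,k} * log(o_{t,k}) for one frame, with d_t one-hot."""
    return -sum(d * math.log(o) for d, o in zip(d_t, o_t) if d > 0.0)


def output_error(o_t, d_t):
    """Gradient of C with respect to the pre-softmax activations: o_t - d_t."""
    return [o - d for o, d in zip(o_t, d_t)]
```

The errors of the lower layers, including the transformation layer and the condition feature calculation model, follow from this output error by the chain rule, which is what error backpropagation computes.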

また、微分値計算部２３２は、δ´_Ｉ´，ｔを、下記（１５）式により計算する。なお、（１５）式中の各インデクス０，ｔ，ｐのうち、１番目のインデクスは層のインデクスである。また、２番目のインデクスは時間フレームのインデクスである。また、３番目のインデクスは各ベクトル内の次元のインデクスである。例えば、ｚ_０,t,pは、０番目の層の時間フレームｔにおける出力であるベクトルのｐ次元目の要素である。また、（１５）式の層のインデクスが０である層は、変換層、すなわち特徴量変換部２２２によって適応特徴量計算モデル３３を用いた変換が行われる層である。 Further, the differential value calculation unit 232 calculates δ′_{I′,t} according to equation (15) below. Of the indices 0, t, and p in equation (15), the first is the layer index, the second is the time frame index, and the third is the index of the dimension within each vector. For example, z_{0,t,p} is the p-th element of the vector output by the 0th layer at time frame t. The layer whose index is 0 in equation (15) is the transformation layer, that is, the layer in which the feature conversion unit 222 performs the transformation using the adaptive feature calculation model 33.

また、微分値計算部２３２は、（１５）式中のｚ_{ｎ，０，ｔ}を、下記（１６）式により計算する。 Further, the differential value calculation unit 232 calculates z _{n, 0, t} in the equation (15) according to the following equation (16).

パラメータ更新部２３３は、微分値計算部２３２による計算結果を基に、下記（１７）〜（２０）式により各パラメータを更新する。なお、ηは、音響モデルパラメータ補正用パラメータであり、例えば0.1〜0.0001等の微小値である。 The parameter updating unit 233 updates each parameter according to equations (17) to (20) below, based on the calculation results of the differential value calculation unit 232. Here, η is a parameter for correcting the acoustic model parameters and is set to a small value, for example in the range of 0.0001 to 0.1.

収束判定部２３４は、パラメータ更新部２３３により更新されたパラメータが収束したか否かを判定する。収束判定部２３４がパラメータが収束していないと判定した場合、計算部２２および更新部２３は、さらにパラメータ更新のための処理を実行する。また、収束判定部２３４は、パラメータが収束したと判定した場合、更新後のパラメータを記憶部３０に格納する。 The convergence determination unit 234 determines whether the parameters updated by the parameter update unit 233 have converged. When the convergence determination unit 234 determines that the parameters have not converged, the calculation unit 22 and the update unit 23 further execute a process for updating the parameters. When determining that the parameters have converged, the convergence determination unit 234 stores the updated parameters in the storage unit 30.

収束判定部２３４は、例えば、１つ前のステップで得られていたパラメータと新たに求めたパラメータとの差分が閾値以下になった場合、繰り返し回数が所定の回数以上になった場合、所定の評価基準に基づく音声認識の評価が悪化した場合等に、パラメータが収束したと判定する。 The convergence determination unit 234 determines that the parameters have converged when, for example, the difference between the parameters obtained in the previous step and the newly obtained parameters is equal to or less than a threshold, the number of iterations is equal to or greater than a predetermined number, or the evaluation of speech recognition based on a predetermined evaluation criterion has deteriorated.
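The three example criteria of the convergence determination unit 234 can be sketched as a single predicate. This is an illustrative sketch only: the function name, the flat parameter list, the use of the largest absolute change as the "difference", the default threshold and iteration limit, and the use of an error score where larger means worse are all assumptions.

```python
def has_converged(prev_params, new_params, iteration, prev_error, new_error,
                  tol=1e-4, max_iters=50):
    """Return True if any of the three example stopping criteria is met."""
    # Criterion 1: the parameters changed by no more than the threshold.
    max_change = max(abs(n - p) for p, n in zip(prev_params, new_params))
    if max_change <= tol:
        return True
    # Criterion 2: the iteration count reached the predetermined number.
    if iteration >= max_iters:
        return True
    # Criterion 3: the evaluation got worse (here: a larger error score).
    if new_error > prev_error:
        return True
    return False
```

The third criterion acts as early stopping: once the evaluation on held-out data degrades, further updates are treated as no longer useful.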

次に、図７を用いて、音声データ処理装置１０の音声認識処理について説明する。図７に示すように、音声認識を行う場合、音声データ処理装置１０は、認識用の音声データの入力を受け付ける。その後、抽出部２１および計算部２２は、適応処理の場合と同様の処理を行う。そして、認識部２４の単語列検索部２４１は、事後確率ｏ_ｔに基づき、従来技術と同様の方法により認識結果である単語列＾Ｓを検索し出力する。 Next, the speech recognition processing of the audio data processing device 10 will be described with reference to FIG. 7. As shown in FIG. 7, when performing speech recognition, the audio data processing device 10 accepts input of speech data for recognition. The extraction unit 21 and the calculation unit 22 then perform the same processing as in the adaptation processing. Then, based on the posterior probability o_t, the word string search unit 241 of the recognition unit 24 searches for and outputs the word string Ŝ, which is the recognition result, by the same method as in the related art.

［第１の実施形態の処理］ [Processing of First Embodiment]
図８を用いて、音声データ処理装置１０の適応処理の流れについて説明する。図８は、第１の実施形態に係る音声データ処理装置の適応処理の流れを示すフローチャートである。 The flow of the adaptation processing of the audio data processing device 10 will be described with reference to FIG. 8. FIG. 8 is a flowchart illustrating the flow of the adaptation processing of the audio data processing device according to the first embodiment.

まず、音声データ処理装置１０は、記憶部３０から、音声認識モデル３１、条件特徴量計算モデル３２および適応特徴量計算モデル３３を読み込む（ステップＳ１０１）。次に、音声データ処理装置１０は、適応用の音声データを読み込む（ステップＳ１０２）。次に、音声データ処理装置１０は、正解データを読み込む（ステップＳ１０３）。 First, the speech data processing apparatus 10 reads the speech recognition model 31, the condition feature quantity calculation model 32, and the adaptive feature quantity calculation model 33 from the storage unit 30 (Step S101). Next, the audio data processing device 10 reads the audio data for adaptation (Step S102). Next, the audio data processing device 10 reads the correct answer data (step S103).

次に、抽出部２１は、適応用の音声データから音声の特徴を示す第１の入力特徴量を抽出する（ステップＳ１０４）。次に、抽出部２１は、適応用の音声データから環境の特徴を示す第２の入力特徴量を抽出する（ステップＳ１０５）。 Next, the extracting unit 21 extracts a first input feature amount indicating a feature of the voice from the voice data for adaptation (Step S104). Next, the extraction unit 21 extracts a second input feature amount indicating a feature of the environment from the audio data for adaptation (Step S105).

そして、計算部２２は、ニューラルネットワークである条件特徴量計算モデル３２に第２の入力特徴量を入力し、条件特徴量を計算する（ステップＳ１０６）。次に、計算部２２は、第１の入力特徴量および条件特徴量に基づき、適応特徴量を計算する（ステップＳ１０７）。次に、計算部２２は、ニューラルネットワークである音声認識モデル３１に、適応特徴量を入力し、事後確率を計算する（ステップＳ１０８）。 Then, the calculation unit 22 inputs the second input feature value to the condition feature value calculation model 32, which is a neural network, and calculates the condition feature value (Step S106). Next, the calculation unit 22 calculates an adaptive feature based on the first input feature and the condition feature (step S107). Next, the calculation unit 22 inputs the adaptive features to the speech recognition model 31 which is a neural network, and calculates the posterior probability (Step S108).

そして、更新部２３は、事後確率に基づいて、条件特徴量計算モデル３２および適応特徴量計算モデル３３のパラメータを更新する（ステップＳ１０９）。さらに、更新部２３は、パラメータの更新の結果、パラメータが収束していると判定した場合、処理を終了させる（ステップＳ１１０、Ｙｅｓ）。一方、更新部２３は、パラメータの更新の結果、パラメータが収束していないと判定した場合、処理をステップＳ１０６に戻す（ステップＳ１１０、Ｎｏ）。 Then, the updating unit 23 updates the parameters of the conditional feature calculation model 32 and the adaptive feature calculation model 33 based on the posterior probability (Step S109). Further, when the updating unit 23 determines that the parameters have converged as a result of updating the parameters, the updating unit 23 ends the process (Step S110, Yes). On the other hand, when the updating unit 23 determines that the parameters have not converged as a result of updating the parameters, the updating unit 23 returns the process to step S106 (No in step S110).
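The loop formed by steps S106 to S110 can be sketched as follows. This is a schematic outline under stated assumptions rather than the embodiment itself: the callables `forward`, `gradients`, and `converged` are hypothetical placeholders for the calculation unit 22, the differential value calculation unit 232, and the convergence determination unit 234, the parameters are flattened into one list, and the update is written as plain gradient descent with step size `eta`.

```python
def adapt(params, forward, gradients, converged, eta=0.1, max_iters=100):
    """Repeat steps S106 to S110 until the convergence check succeeds."""
    for iteration in range(1, max_iters + 1):
        posteriors = forward(params)                 # S106 to S108
        grads = gradients(params, posteriors)        # derivatives used in S109
        new_params = [p - eta * g for p, g in zip(params, grads)]  # S109
        done = converged(params, new_params, iteration)            # S110
        params = new_params
        if done:
            break                                    # S110, Yes
    return params                                    # otherwise back to S106
```

As a toy usage example, minimizing (p - 3)^2 with `gradients=lambda p, o: [2.0 * (p[0] - 3.0)]` drives the single parameter toward 3.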

次に、図９を用いて、音声データ処理装置１０の音声認識処理の流れについて説明する。図９は、第１の実施形態に係る音声データ処理装置の音声認識処理の流れを示すフローチャートである。 Next, the flow of the voice recognition process of the voice data processing device 10 will be described with reference to FIG. FIG. 9 is a flowchart illustrating a flow of a voice recognition process of the voice data processing device according to the first embodiment.

まず、音声データ処理装置１０は、記憶部３０から、音声認識モデル３１、条件特徴量計算モデル３２および適応特徴量計算モデル３３を読み込む（ステップＳ２０１）。次に、音声データ処理装置１０は、音声認識用の音声データを読み込む（ステップＳ２０２）。 First, the speech data processing device 10 reads the speech recognition model 31, the condition feature calculation model 32, and the adaptive feature calculation model 33 from the storage unit 30 (step S201). Next, the voice data processing device 10 reads voice data for voice recognition (step S202).

次に、抽出部２１は、音声認識用の音声データから音声の特徴を示す第１の入力特徴量を抽出する（ステップＳ２０３）。次に、抽出部２１は、音声認識用の音声データから環境の特徴を示す第２の入力特徴量を抽出する（ステップＳ２０４）。 Next, the extracting unit 21 extracts a first input feature amount indicating a feature of the voice from the voice data for voice recognition (step S203). Next, the extracting unit 21 extracts a second input feature quantity indicating a feature of the environment from the speech data for speech recognition (step S204).

そして、計算部２２は、ニューラルネットワークである条件特徴量計算モデル３２に第２の入力特徴量を入力し、条件特徴量を計算する（ステップＳ２０５）。次に、計算部２２は、第１の入力特徴量および条件特徴量に基づき、適応特徴量を計算する（ステップＳ２０６）。次に、計算部２２は、ニューラルネットワークである音声認識モデル３１に、適応特徴量を入力し、事後確率を計算する（ステップＳ２０７）。そして、認識部２４は、事後確率に基づいて、スコアが最も大きくなる単語列を検索し（ステップＳ２０８）、検索結果を出力する（ステップＳ２０９）。 Then, the calculation unit 22 inputs the second input feature value to the condition feature value calculation model 32, which is a neural network, and calculates the condition feature value (step S205). Next, the calculation unit 22 calculates an adaptive feature based on the first input feature and the condition feature (step S206). Next, the calculation unit 22 inputs the adaptive features to the speech recognition model 31 which is a neural network, and calculates the posterior probability (Step S207). Then, based on the posterior probability, the recognizing unit 24 searches for a word string having the highest score (Step S208), and outputs a search result (Step S209).
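Steps S203 to S209 compose into a simple pipeline, sketched below. All names are hypothetical, each stage is passed in as a callable, and the features are extracted frame by frame for brevity. `greedy_search` is only a stand-in that picks the most probable state per frame; the word string search of the embodiment is performed as in the related art and is not shown here.

```python
def recognize(frames, extract_first, extract_second, condition_model,
              adapt_feature, acoustic_model, search):
    """Run steps S203 to S209 on a list of frames and return the search result."""
    posteriors = []
    for frame in frames:
        x_t = extract_first(frame)                   # S203
        sigma_t = extract_second(frame)              # S204
        alpha_t = condition_model(sigma_t)           # S205
        x_bar_t = adapt_feature(x_t, alpha_t)        # S206
        posteriors.append(acoustic_model(x_bar_t))   # S207
    return search(posteriors)                        # S208 and S209


def greedy_search(posteriors):
    """Stand-in for the word string search: best state index for each frame."""
    return [max(range(len(o)), key=o.__getitem__) for o in posteriors]
```

Because the same extraction and calculation stages are used for adaptation and recognition, only the final stage differs between the two flows: parameter updating in one, search in the other.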

なお、第１の実施形態では、音声認識モデル３１および条件特徴量計算モデル３２がFully connectedニューラルネットワークである場合の例を説明したが、音声認識モデル３１および条件特徴量計算モデル３２は、ＲＮＮ／ＬＳＴＭ、またはＣＮＮ等の他のニューラルネットワークであってもよい。 In the first embodiment, an example has been described in which the speech recognition model 31 and the condition feature calculation model 32 are fully connected neural networks. However, the speech recognition model 31 and the condition feature calculation model 32 may be other neural networks such as an RNN/LSTM or a CNN.

［第１の実施形態の効果］ [Effect of First Embodiment]
抽出部２１は、所定の環境における音声を基に作成された適応用の音声データから、音声の特徴を示す特徴量である第１の入力特徴量、および環境の特徴を示す特徴量である第２の入力特徴量を抽出する。そして、計算部２２は、ニューラルネットワークを用いた計算モデルである条件特徴量計算モデル３２に第２の入力特徴量を入力し、所定の環境を特徴付ける複数の条件のそれぞれに対応した要素を含んだ特徴量である条件特徴量を計算する。そして、計算部２２は、条件特徴量に含まれる複数の要素のそれぞれに対応したパラメータの組を含んだ計算モデルである適応特徴量計算モデル３３に第１の入力特徴量および条件特徴量を入力し、ニューラルネットワークを用いた計算モデルである音声認識モデル３１に適応した特徴量である適応特徴量を計算する。そして、更新部２３は、音声認識モデル３１に適応特徴量を入力して得られた出力結果を基に、条件特徴量計算モデル３２のパラメータおよび適応特徴量計算モデル３３のパラメータの更新を行う。 The extraction unit 21 extracts, from adaptation speech data created based on speech in a predetermined environment, a first input feature amount, which is a feature amount indicating a feature of the speech, and a second input feature amount, which is a feature amount indicating a feature of the environment. The calculation unit 22 then inputs the second input feature amount to the condition feature calculation model 32, which is a calculation model using a neural network, and calculates a condition feature amount, which is a feature amount including elements corresponding to each of a plurality of conditions characterizing the predetermined environment. The calculation unit 22 then inputs the first input feature amount and the condition feature amount to the adaptive feature calculation model 33, which is a calculation model including a set of parameters corresponding to each of the plurality of elements included in the condition feature amount, and calculates an adaptive feature amount, which is a feature amount adapted to the speech recognition model 31, a calculation model using a neural network. The updating unit 23 then updates the parameters of the condition feature calculation model 32 and the parameters of the adaptive feature calculation model 33 based on the output result obtained by inputting the adaptive feature amount to the speech recognition model 31.

このように、第１の実施形態では、音声認識用の音声データが作成された環境と学習用の音声データが作成された環境との間で、複数の音響条件に違いがある場合であっても、それぞれの音響条件に対応した特徴量を生成することができる。これにより、第１の実施形態によれば、複数の音響条件を考慮し、音声認識モデルの精度を高めることができるようになる。 As described above, in the first embodiment, there is a case where there is a difference in a plurality of acoustic conditions between the environment in which the voice data for voice recognition is created and the environment in which the voice data for learning is created. Also, it is possible to generate a feature amount corresponding to each acoustic condition. Thus, according to the first embodiment, the accuracy of the speech recognition model can be improved in consideration of a plurality of acoustic conditions.

また、認識部２４は、音声認識モデル３１を用いて音声認識を行う。このとき、抽出部２１は、所定の環境における音声を基に作成された音声認識用の音声データから、第１の入力特徴量、および第２の入力特徴量を抽出する。そして、計算部２２は、条件特徴量計算モデル３２に第２の入力特徴量を入力し、条件特徴量を計算する。そして、計算部２２は、適応特徴量計算モデル３３に第１の入力特徴量および条件特徴量を入力し、適応特徴量を計算する。そして、認識部２４は、音声認識モデル３１に適応特徴量を入力して得られた出力結果を基に、音声の認識を行う。これにより、複数の音響条件を考慮した、音声認識モデルを用いた音声認識を行うことができる。 The recognition unit 24 performs voice recognition using the voice recognition model 31. At this time, the extraction unit 21 extracts a first input feature and a second input feature from voice data for voice recognition created based on voice in a predetermined environment. Then, the calculation unit 22 inputs the second input feature value to the condition feature value calculation model 32 and calculates the condition feature value. Then, the calculation unit 22 inputs the first input feature value and the condition feature value to the adaptive feature value calculation model 33, and calculates the adaptive feature value. Then, the recognition unit 24 performs speech recognition based on an output result obtained by inputting the adaptive feature amount to the speech recognition model 31. This makes it possible to perform speech recognition using the speech recognition model in consideration of a plurality of acoustic conditions.

また、抽出部２１は、第１の入力特徴量と、雑音抑圧処理が行われた音声データの音声の特徴を示す特徴量と、の差を基に特徴量を計算し、計算した特徴量を第２の入力特徴量として抽出してもよい。これにより、雑音に基づく音響条件を考慮した音声認識を行うことができるようになる。 Further, the extraction unit 21 calculates a feature value based on a difference between the first input feature value and a feature value indicating a voice feature of the voice data on which the noise suppression processing has been performed, and calculates the calculated feature value. You may extract as a 2nd input feature-value. This makes it possible to perform speech recognition in consideration of acoustic conditions based on noise.
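The text states only that the second input feature amount is calculated "based on" the difference between the first input feature amount and the feature of the noise-suppressed speech; the exact computation is not given. The sketch below shows the simplest possible reading, an element-wise difference for one frame. The function name and the choice to use the raw difference without any further processing are assumptions.

```python
def second_input_feature(x_t, x_denoised_t):
    """Element-wise difference between the noisy and noise-suppressed features."""
    if len(x_t) != len(x_denoised_t):
        raise ValueError("feature vectors must have the same dimension")
    return [a - b for a, b in zip(x_t, x_denoised_t)]
```

Intuitively, the larger this difference, the more the noise suppression changed the frame, so the result carries information about the noise condition rather than about the speech content.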

ここで、従来の技術と第１の実施形態との音声認識精度の比較結果について説明する。比較対象の従来の技術は、音響条件への適応を行わないベースラインである従来技術１（ＤＮＮ）、および、１つの条件に対する適応のみ行う従来技術２（ＬＩＮ）である。なお、ＬＩＮは、linear input networkの略称である。また、本発明の技術をＦａｃｔｏｒｉｚｅＬＩＮとよぶ。 Here, a comparison result of the speech recognition accuracy between the conventional technique and the first embodiment will be described. The conventional technologies to be compared are Conventional Technology 1 (DNN), which is a baseline without adaptation to acoustic conditions, and Conventional Technology 2 (LIN), which adapts only to one condition. LIN is an abbreviation for linear input network. Further, the technology of the present invention is called FactorizeLIN.

表１は、従来技術１、従来技術２および本発明を用いて、音声認識タスクＣＨＩＭＥ３に対し音声認識を行った際の単語誤り率を示している。なお、従来技術２および本発明の音響環境への適応は教師なし適応により行った。表１に示すように、本発明は、従来技術１および２のいずれよりも単語誤り率が小さくなった。これより、本発明は、従来技術１および２と比較して音声認識精度が高いといえる。 Table 1 shows the word error rates obtained when speech recognition was performed on the CHiME3 speech recognition task using Conventional Technology 1, Conventional Technology 2, and the present invention. The adaptation of Conventional Technology 2 and of the present invention to the acoustic environment was performed by unsupervised adaptation. As shown in Table 1, the present invention achieved a lower word error rate than either Conventional Technology 1 or 2. It can therefore be said that the present invention has higher speech recognition accuracy than Conventional Technologies 1 and 2.
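For reference, the word error rate used as the evaluation measure is conventionally defined as the word-level edit distance (substitutions, deletions, and insertions) between the recognized word string and the reference transcription, divided by the number of reference words. The sketch below computes it for whitespace-separated strings; the function name is hypothetical, and scoring tools used in practice may apply additional text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion
                         cur[j - 1] + 1,          # insertion
                         prev[j - 1] + (r != h))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions are counted, the value can exceed 1 when the hypothesis is much longer than the reference.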

［システム構成等］ [System Configuration, etc.]
また、図示した各装置の各構成要素は機能概念的なものであり、必ずしも物理的に図示の如く構成されていることを要しない。すなわち、各装置の分散・統合の具体的形態は図示のものに限られず、その全部または一部を、各種の負荷や使用状況等に応じて、任意の単位で機能的または物理的に分散・統合して構成することができる。さらに、各装置にて行われる各処理機能は、その全部または任意の一部が、ＣＰＵおよび当該ＣＰＵにて解析実行されるプログラムにて実現され、あるいは、ワイヤードロジックによるハードウェアとして実現され得る。 Each component of each illustrated device is a functional concept and does not necessarily need to be physically configured as illustrated. In other words, the specific form of distribution and integration of each device is not limited to the illustrated one, and all or part of each device can be functionally or physically distributed or integrated in arbitrary units according to various loads, usage conditions, and the like. Further, all or any part of each processing function performed by each device can be realized by a CPU and a program analyzed and executed by the CPU, or can be realized as hardware by wired logic.

また、本実施形態において説明した各処理のうち、自動的に行われるものとして説明した処理の全部または一部を手動的に行うこともでき、あるいは、手動的に行われるものとして説明した処理の全部または一部を公知の方法で自動的に行うこともできる。この他、上記文書中や図面中で示した処理手順、制御手順、具体的名称、各種のデータやパラメータを含む情報については、特記する場合を除いて任意に変更することができる。 Further, of the processes described in the present embodiment, all or a part of the processes described as being performed automatically can be manually performed, or the processes described as being performed manually can be performed. All or part can be performed automatically by a known method. In addition, the processing procedures, control procedures, specific names, and information including various data and parameters shown in the above documents and drawings can be arbitrarily changed unless otherwise specified.

［プログラム］ [Program]
一実施形態として、音声データ処理装置１０は、パッケージソフトウェアやオンラインソフトウェアとして上記の適応および音声認識を実行する音声データ処理プログラムを所望のコンピュータにインストールさせることによって実装できる。例えば、上記の音声データ処理プログラムを情報処理装置に実行させることにより、情報処理装置を音声データ処理装置１０として機能させることができる。ここで言う情報処理装置には、デスクトップ型またはノート型のパーソナルコンピュータが含まれる。また、その他にも、情報処理装置にはスマートフォン、携帯電話機やＰＨＳ（Personal Handyphone System）等の移動体通信端末、さらには、ＰＤＡ（Personal Digital Assistant）等のスレート端末等がその範疇に含まれる。 As one embodiment, the audio data processing device 10 can be implemented by installing, on a desired computer, an audio data processing program that performs the above-described adaptation and speech recognition as packaged software or online software. For example, causing an information processing device to execute the audio data processing program makes the information processing device function as the audio data processing device 10. The information processing device referred to here includes desktop and notebook personal computers. It also includes mobile communication terminals such as smartphones, mobile phones, and PHS (Personal Handyphone System) handsets, as well as slate terminals such as PDAs (Personal Digital Assistants).

また、音声データ処理システムは、ユーザが使用する端末装置をクライアントとし、当該クライアントに上記の適応および音声認識に関するサービスを提供する音声データ処理サーバ装置として実装することもできる。例えば、音声データ処理サーバ装置は、音声データを入力とし、音声認識結果を出力とする音声データ処理サービスを提供するサーバ装置として実装される。この場合、音声データ処理サーバ装置は、Ｗｅｂサーバとして実装することとしてもよいし、アウトソーシングによって上記の音声データ処理に関するサービスを提供するクラウドとして実装することとしてもかまわない。 Further, the voice data processing system can be implemented as a voice data processing server device that uses a terminal device used by a user as a client and provides the client with the above-described service related to adaptation and voice recognition. For example, the voice data processing server device is implemented as a server device that provides a voice data processing service that receives voice data as input and outputs voice recognition results. In this case, the audio data processing server device may be implemented as a Web server, or may be implemented as a cloud that provides the services related to the audio data processing by outsourcing.

図１０は、プログラムが実行されることにより、第１の実施形態に係る音声データ処理装置が実現されるコンピュータの一例を示す図である。コンピュータ１０００は、例えば、メモリ１０１０、ＣＰＵ１０２０を有する。また、コンピュータ１０００は、ハードディスクドライブインタフェース１０３０、ディスクドライブインタフェース１０４０、シリアルポートインタフェース１０５０、ビデオアダプタ１０６０、ネットワークインタフェース１０７０を有する。これらの各部は、バス１０８０によって接続される。 FIG. 10 is a diagram illustrating an example of a computer that implements the audio data processing device according to the first embodiment by executing a program. The computer 1000 has, for example, a memory 1010 and a CPU 1020. Further, the computer 1000 has a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.

メモリ１０１０は、ＲＯＭ（Read Only Memory）１０１１およびＲＡＭ１０１２を含む。ＲＯＭ１０１１は、例えば、ＢＩＯＳ（Basic Input Output System）等のブートプログラムを記憶する。ハードディスクドライブインタフェース１０３０は、ハードディスクドライブ１０９０に接続される。ディスクドライブインタフェース１０４０は、ディスクドライブ１１００に接続される。例えば磁気ディスクや光ディスク等の着脱可能な記憶媒体が、ディスクドライブ１１００に挿入される。シリアルポートインタフェース１０５０は、例えばマウス１１１０、キーボード１１２０に接続される。ビデオアダプタ１０６０は、例えばディスプレイ１１３０に接続される。 The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to the hard disk drive 1090. The disk drive interface 1040 is connected to the disk drive 1100. For example, a removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to the display 1130, for example.

ハードディスクドライブ１０９０は、例えば、ＯＳ１０９１、アプリケーションプログラム１０９２、プログラムモジュール１０９３、プログラムデータ１０９４を記憶する。すなわち、音声データ処理装置１０の各処理を規定するプログラムは、コンピュータにより実行可能なコードが記述されたプログラムモジュール１０９３として実装される。プログラムモジュール１０９３は、例えばハードディスクドライブ１０９０に記憶される。例えば、音声データ処理装置１０における機能構成と同様の処理を実行するためのプログラムモジュール１０９３が、ハードディスクドライブ１０９０に記憶される。なお、ハードディスクドライブ１０９０は、ＳＳＤにより代替されてもよい。 The hard disk drive 1090 stores, for example, the OS 1091, the application program 1092, the program module 1093, and the program data 1094. That is, a program that defines each process of the audio data processing device 10 is implemented as a program module 1093 in which codes executable by a computer are described. The program module 1093 is stored in, for example, the hard disk drive 1090. For example, a program module 1093 for executing the same processing as the functional configuration in the audio data processing device 10 is stored in the hard disk drive 1090. Note that the hard disk drive 1090 may be replaced by an SSD.

また、上述した実施形態の処理で用いられる設定データは、プログラムデータ１０９４として、例えばメモリ１０１０やハードディスクドライブ１０９０に記憶される。そして、ＣＰＵ１０２０が、メモリ１０１０やハードディスクドライブ１０９０に記憶されたプログラムモジュール１０９３やプログラムデータ１０９４を必要に応じてＲＡＭ１０１２に読み出して実行する。 The setting data used in the processing of the above-described embodiment is stored as the program data 1094 in, for example, the memory 1010 or the hard disk drive 1090. Then, the CPU 1020 reads out the program module 1093 and the program data 1094 stored in the memory 1010 and the hard disk drive 1090 to the RAM 1012 as needed, and executes them.

なお、プログラムモジュール１０９３やプログラムデータ１０９４は、ハードディスクドライブ１０９０に記憶される場合に限らず、例えば着脱可能な記憶媒体に記憶され、ディスクドライブ１１００等を介してＣＰＵ１０２０によって読み出されてもよい。あるいは、プログラムモジュール１０９３およびプログラムデータ１０９４は、ネットワーク（ＬＡＮ、ＷＡＮ（Wide Area Network）等）を介して接続された他のコンピュータに記憶されてもよい。そして、プログラムモジュール１０９３およびプログラムデータ１０９４は、他のコンピュータから、ネットワークインタフェース１０７０を介してＣＰＵ１０２０によって読み出されてもよい。 The program module 1093 and the program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in, for example, a removable storage medium and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and the program data 1094 may be stored in another computer connected via a network (LAN, WAN (Wide Area Network) or the like). Then, the program module 1093 and the program data 1094 may be read from another computer by the CPU 1020 via the network interface 1070.

Reference Signs List

10 audio data processing device
20 control unit
21 extraction unit
22 calculation unit
23 update unit
24 recognition unit
30 storage unit
31 speech recognition model
32 condition feature calculation model
33 adaptive feature calculation model
211 first input feature extraction unit
212 second input feature extraction unit
221 condition feature calculation unit
222 feature conversion unit
223 posterior probability calculation unit
231 error calculation unit
232 differential value calculation unit
233 parameter update unit
234 convergence determination unit
241 word string search unit

Claims

An audio data processing device comprising:
an extraction unit that extracts, from adaptation audio data created based on speech in a predetermined environment, a first input feature, which is a feature representing characteristics of the speech, and a second input feature, which is calculated based on a difference between the first input feature and a feature representing characteristics of the speech in the audio data after noise suppression processing has been applied;
a condition feature calculation unit that inputs the second input feature to a condition feature calculation model, which is a calculation model using a neural network, and calculates a condition feature, which is a feature including elements corresponding respectively to a plurality of conditions that characterize the predetermined environment;
an adaptive feature calculation unit that inputs the first input feature and the condition feature to an adaptive feature calculation model, which is a calculation model including sets of parameters corresponding respectively to the plurality of elements included in the condition feature, and calculates, for each of the elements, an adaptive feature, which is a feature adapted to a speech recognition model that is a calculation model using a neural network; and
an update unit that updates parameters of the condition feature calculation model and parameters of the adaptive feature calculation model based on an output result obtained by inputting the adaptive feature to the speech recognition model.

The audio data processing device according to claim 1, further comprising a recognition unit that performs speech recognition using the speech recognition model, wherein
the extraction unit extracts the first input feature and the second input feature from audio data for speech recognition created based on speech in a predetermined environment,
the condition feature calculation unit inputs the second input feature to the condition feature calculation model and calculates the condition feature,
the adaptive feature calculation unit inputs the first input feature and the condition feature to the adaptive feature calculation model and calculates the adaptive feature, and
the recognition unit performs speech recognition based on an output result obtained by inputting the adaptive feature to the speech recognition model.

An audio data processing method executed by an audio data processing device, the method comprising:
an extraction step of extracting, from adaptation audio data created based on speech in a predetermined environment, a first input feature, which is a feature representing characteristics of the speech, and a second input feature, which is calculated based on a difference between the first input feature and a feature representing characteristics of the speech in the audio data after noise suppression processing has been applied;
a condition feature calculation step of inputting the second input feature to a condition feature calculation model, which is a calculation model using a neural network, and calculating a condition feature, which is a feature including elements corresponding respectively to a plurality of conditions that characterize the predetermined environment;
an adaptive feature calculation step of inputting the first input feature and the condition feature to an adaptive feature calculation model, which is a calculation model including sets of parameters corresponding respectively to the plurality of elements included in the condition feature, and calculating, for each of the elements, an adaptive feature, which is a feature adapted to a speech recognition model that is a calculation model using a neural network; and
an update step of updating parameters of the condition feature calculation model and parameters of the adaptive feature calculation model based on an output result obtained by inputting the adaptive feature to the speech recognition model.

The audio data processing method according to claim 3, further comprising a recognition step of performing speech recognition using the speech recognition model, wherein
the extraction step extracts the first input feature and the second input feature from audio data for speech recognition created based on speech in a predetermined environment,
the condition feature calculation step inputs the second input feature to the condition feature calculation model and calculates the condition feature,
the adaptive feature calculation step inputs the first input feature and the condition feature to the adaptive feature calculation model and calculates the adaptive feature, and
the recognition step performs speech recognition based on an output result obtained by inputting the adaptive feature to the speech recognition model.
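
The extraction, condition feature calculation, adaptive feature calculation, update, and recognition steps of the method above can be sketched roughly as follows. This is a minimal pure-Python illustration under several assumptions that the claims do not state: the second input feature is taken as the plain element-wise difference, the condition feature calculation model is a single softmax layer (`V`, `c`), the adaptive feature is the condition-weighted sum of per-element affine transforms (`A[k]`, `b[k]`), the speech recognition model is a fixed single softmax layer (`W`, `d`), the update is plain gradient descent on a cross-entropy loss, and recognition simply picks the most probable class instead of performing a word string search. All function and variable names are hypothetical.

```python
import math
import random

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * a for m, a in zip(row, v)) for row in M]

def softmax(u):
    """Numerically stable softmax."""
    top = max(u)
    e = [math.exp(a - top) for a in u]
    total = sum(e)
    return [a / total for a in e]

def init_params(dim, n_cond, rng):
    """Hypothetical initialisation: near-identity transforms, small random weights."""
    return {
        # Condition feature calculation model (assumed: one softmax layer).
        "V": [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_cond)],
        "c": [0.0] * n_cond,
        # Adaptive feature calculation model: one (A_k, b_k) set per condition element.
        "A": [[[(1.0 if i == j else 0.0) + rng.uniform(-0.1, 0.1) for j in range(dim)]
               for i in range(dim)] for _ in range(n_cond)],
        "b": [[0.0] * dim for _ in range(n_cond)],
    }

def forward(params, x, x_enh, W, d):
    # Extraction: second input feature, assumed here to be the plain difference
    # between the first input feature and the noise-suppressed feature.
    z = [a - e for a, e in zip(x, x_enh)]
    # Condition feature: one element per environmental condition.
    y = softmax([u + c for u, c in zip(matvec(params["V"], z), params["c"])])
    # Per-element affine transforms of the first input feature.
    H = [[h + b for h, b in zip(matvec(A, x), bk)]
         for A, bk in zip(params["A"], params["b"])]
    # Adaptive feature: assumed to be the condition-weighted sum of the transforms.
    x_hat = [sum(y[k] * H[k][i] for k in range(len(y))) for i in range(len(x))]
    # Speech recognition model (stand-in: one fixed softmax layer) gives posteriors.
    o = softmax([u + b for u, b in zip(matvec(W, x_hat), d)])
    return z, y, H, x_hat, o

def adapt_step(params, x, x_enh, target, W, d, lr=0.05):
    """One gradient step on the cross-entropy loss; W and d are left unchanged."""
    z, y, H, x_hat, o = forward(params, x, x_enh, W, d)
    loss = -math.log(o[target])
    g_logit = [p - (1.0 if s == target else 0.0) for s, p in enumerate(o)]
    g_xhat = [sum(W[s][i] * g_logit[s] for s in range(len(W))) for i in range(len(x))]
    g_y = [sum(h * g for h, g in zip(H[k], g_xhat)) for k in range(len(y))]
    mean_g = sum(g * p for g, p in zip(g_y, y))
    g_u = [y[k] * (g_y[k] - mean_g) for k in range(len(y))]  # softmax backward
    for k in range(len(y)):
        for i in range(len(x)):
            params["b"][k][i] -= lr * y[k] * g_xhat[i]
            for j in range(len(x)):
                params["A"][k][i][j] -= lr * y[k] * g_xhat[i] * x[j]
        params["c"][k] -= lr * g_u[k]
        for j in range(len(z)):
            params["V"][k][j] -= lr * g_u[k] * z[j]
    return loss

def recognize(params, x, x_enh, W, d):
    """Recognition: same forward pass, then pick the most probable class."""
    o = forward(params, x, x_enh, W, d)[-1]
    return max(range(len(o)), key=o.__getitem__)

if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_cond, n_cls = 3, 2, 4
    W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_cls)]
    d = [0.0] * n_cls
    params = init_params(dim, n_cond, rng)
    x, x_enh, target = [0.5, -1.0, 0.3], [0.4, -0.7, 0.1], 2
    for _ in range(50):
        loss = adapt_step(params, x, x_enh, target, W, d)
    print(loss, recognize(params, x, x_enh, W, d))
```

In this sketch only `V`, `c`, `A`, and `b` change during adaptation, which mirrors the update step acting on the condition feature calculation model and the adaptive feature calculation model while the speech recognition model supplies the output used to compute the error.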

An audio data processing program that causes a computer to function as the audio data processing device according to claim 1 or 2.