JP6818372B2 - Denoising variational autoencoder-based integrated training method and apparatus for voice detection

Info

Publication number: JP6818372B2
Application number: JP2019158891A
Authority: JP
Inventors: フェリンキム; ヨンムンチョン; ヨンジュチェ
Original assignee: Korea Advanced Institute of Science and Technology KAIST
Current assignee: Korea Advanced Institute of Science and Technology KAIST
Priority date: 2018-11-29
Filing date: 2019-08-30
Publication date: 2021-01-20
Anticipated expiration: 2039-08-30
Also published as: KR102095132B1; JP2020086434A

Description

The present invention relates to a denoising variational autoencoder-based integrated training method and apparatus for voice detection.

Voice activity detection (VAD), the process of classifying each frame as speech or non-speech, is an important module in a variety of speech applications such as speech coding, automatic speech recognition (ASR), speech enhancement (SE), and speaker recognition.

Most early VAD approaches were based on primitive acoustic features such as time-domain energy, pitch, and zero-crossing rate. Another class of existing VAD methods is the statistical model-based approach, which models the distributions of speech and noise frames as Gaussian distributions in the discrete Fourier transform (DFT) domain and uses a likelihood ratio to decide whether a frame is speech. Later, machine learning-based methods such as the support vector machine (SVM) and the hidden Markov model (HMM) were applied to VAD. More recently, deep learning architectures such as fully connected deep neural networks (DNNs), convolutional neural networks (CNNs), and long short-term memory (LSTM) recurrent neural networks have achieved great success in VAD and have been widely adopted for VAD modeling.

Despite years of sustained development, VAD remains challenging at extremely low signal-to-noise ratios (SNRs). Integrated training methods for VAD are used to improve robustness in noisy environments. A prior-art integrated training approach that combines a speech enhancement DNN with a voice detection DNN has been confirmed to yield better VAD results.

The technical problem to be solved by the present invention is to provide a denoising variational autoencoder-based integrated training method and apparatus for voice detection that reduce the internal covariate shift phenomenon by adding a batch normalization layer between the two networks, cause the speech enhancement DNN to output features that aid voice detection through the parameter updates of the speech enhancement DNN, and apply a denoising variational autoencoder (DVAE), which incorporates a denoising process into the variational autoencoder (VAE).

In one aspect, the denoising variational autoencoder-based integrated training method for voice detection proposed in the present invention includes: using batch normalization to reduce the internal covariate shift phenomenon that occurs during training; using a gradient weighting technique so that the speech enhancement deep neural network (DNN) outputs the speech features required for voice detection; and using a denoising variational autoencoder in the speech enhancement DNN. The integrated training method for voice detection includes transforming the speech features with the speech enhancement DNN so as to remove noise from them, and performing voice detection with the voice detection DNN using the denoised speech features.

The step of using batch normalization to reduce the internal covariate shift phenomenon that occurs during training includes adding a batch normalization layer between the two networks to handle the non-stationary input distribution. This reduces the variation in the output distribution of the speech enhancement DNN that arises when the two networks are combined and integrated training is performed, and thereby reduces the internal covariate shift phenomenon.

The step of using the gradient weighting technique so that the speech enhancement DNN outputs the speech features required for voice detection includes: computing the loss functions of the speech enhancement DNN and the voice detection DNN; obtaining the gradient of each loss function by backpropagation; updating the parameters of the two networks using the computed gradients; and performing training such that the parameter updates of the speech enhancement DNN reduce not only the loss function of the speech enhancement DNN but also the loss function of the voice detection DNN, whereby the speech enhancement DNN outputs the features required for voice detection.

The step of using the denoising variational autoencoder in the speech enhancement DNN includes: assuming both the encoder probability distribution and the decoder probability distribution to be diagonal Gaussian distributions; estimating the mean and log-variance of the corresponding distribution with the encoder DNN and the decoder DNN, respectively; assuming the prior to be an isotropic Gaussian distribution; obtaining the latent variable and the observed variable deterministically from the encoder and decoder probability distributions; and updating the network parameters so as to maximize the variational lower bound.

In another aspect, the denoising variational autoencoder-based integrated training apparatus for voice detection proposed in the present invention includes: a normalization unit that uses batch normalization to reduce the internal covariate shift phenomenon that occurs during training; a weighting unit that uses a gradient weighting technique so that the speech enhancement deep neural network (DNN) outputs the speech features required for voice detection; and an encoding unit that uses a denoising variational autoencoder in the speech enhancement DNN. The integrated training for voice detection includes transforming the speech features with the speech enhancement DNN so as to remove noise from them, and performing voice detection with the voice detection DNN using the denoised speech features.

According to embodiments of the present invention, a denoising variational autoencoder-based integrated training method and apparatus for voice detection are proposed in which the internal covariate shift phenomenon can be reduced by adding a batch normalization layer between the two networks, the speech enhancement DNN outputs features that aid voice detection as a result of its parameter updates, and a DVAE, which incorporates a denoising process into the VAE, is applied.

FIG. 1 is a flowchart illustrating a denoising variational autoencoder-based integrated training method for voice detection according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the denoising variational autoencoder for the SE-DVAE according to an embodiment of the present invention.

FIG. 3 is a diagram illustrating three types of integrated training methods according to an embodiment of the present invention.

FIG. 4 is a diagram showing the configuration of a denoising variational autoencoder-based integrated training apparatus for voice detection according to an embodiment of the present invention.

Voice activity detection (VAD) is the process of classifying, for an input signal processed in units of frames, whether each frame is speech or non-speech. It is used as an important preprocessing step in a variety of speech applications such as speech recognition, speech enhancement, and speaker recognition. Voice detection performs poorly in low signal-to-noise ratio (SNR) environments. To solve this problem, the present invention proposes an integrated training method for voice activity detection. Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

The variational autoencoder (VAE) is a latent variable generative model that combines variational inference with deep learning. The generative model pθ(x|z) for the observed variable x (also called the decoder) is parameterized by a deep neural network with parameters θ. The inference model qφ(z|x) (also called the encoder) is parameterized by a second deep neural network with parameters φ. The latent variable z is defined to embed compressed information about the data x, and the encoder maps the data space to the corresponding latent space. The decoder reconstructs the data from sample points in the latent space. The parameters θ and φ are trained jointly by maximizing the variational lower bound L(θ, φ; x) of the log marginal likelihood, as in equation (1).
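Equation (1) itself is not reproduced in this text. For orientation, the variational lower bound of a VAE is conventionally written as below. This is the standard textbook form, and it is an assumption that the patent's equation (1) matches it term for term.

```latex
\mathcal{L}(\theta,\phi;x)
  = -D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\middle\|\,p(z)\right)
  + \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
  \;\le\; \log p_\theta(x)
```

The first term pulls the encoder distribution toward the prior, and the second term rewards accurate reconstruction of x from z.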

In the VAE framework of the present invention, the encoder and the decoder are parameterized using diagonal Gaussian distributions: qφ(z|x) = N(z; μ_z, σ_z² I) and pθ(x|z) = N(x; μ_x, σ_x² I). The prior is assumed to be an isotropic Gaussian distribution with no free parameters, p(z) = N(z; 0, I).
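The symbol definitions in the next paragraph refer to an equation that did not survive in this text. Under the Gaussian assumptions stated above, the lower bound has a well-known closed form for a single sample of z, sketched below. Treating this as the patent's equation (2) is an assumption.

```latex
\mathcal{L}(\theta,\phi;x) \simeq
  \frac{1}{2}\sum_{j=1}^{J}\left(1 + \log\sigma_{z_j}^{2} - \mu_{z_j}^{2} - \sigma_{z_j}^{2}\right)
  - \frac{1}{2}\sum_{i=1}^{D}\left(\log\!\left(2\pi\sigma_{x_i}^{2}\right)
  + \frac{\left(x_i - \mu_{x_i}\right)^{2}}{\sigma_{x_i}^{2}}\right)
```

The first sum is the negative KL divergence between the encoder distribution and the prior N(0, I); the second is the Gaussian log-likelihood of x under the decoder.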

Here, J and D are the dimensions of z and x, respectively, and x_i is the i-th element of the vector x. μ_xi and σ_xi denote the i-th elements of the vectors μ_x and σ_x. Likewise, μ_zj and σ_zj denote the j-th elements of the vectors μ_z and σ_z.

FIG. 1 is a flowchart illustrating a denoising variational autoencoder-based integrated training method for voice detection according to an embodiment of the present invention.

In the integrated training of a speech enhancement DNN (deep neural network) and a voice detection DNN, the speech enhancement DNN first converts the features of noisy speech into the features of clean speech, and the voice detection DNN then performs voice detection using the enhanced speech features. In the prior art, voice detection that used this kind of integrated training was confirmed to perform better than voice detection that did not. The present invention develops the integrated training method further in three respects.

The proposed denoising variational autoencoder-based integrated training method for voice detection includes step 110 of using batch normalization to reduce the internal covariate shift phenomenon that occurs during training, step 120 of using a gradient weighting technique so that the speech enhancement DNN outputs the speech features required for voice detection, and step 130 of using a denoising variational autoencoder in the speech enhancement DNN. In the proposed integrated training method, the speech enhancement DNN transforms the speech features so as to remove noise from them, and the voice detection DNN performs voice detection using the denoised speech features.

In step 110, batch normalization is used to reduce the internal covariate shift phenomenon that occurs during training. To reduce the variation in the output distribution of the speech enhancement DNN that arises when the two networks are combined and integrated training is performed, a batch normalization layer is added between the two networks to handle the non-stationary input distribution, which reduces the internal covariate shift phenomenon.

In the integrated training of speech enhancement and speech recognition, batch normalization according to an embodiment of the present invention adds a batch normalization layer between the two networks, which reduces the internal covariate shift phenomenon and makes training easier. When the two networks are combined and trained together, the output distribution of the speech enhancement DNN, in other words the input distribution of the voice detection DNN, changes continuously. This is called the internal covariate shift phenomenon, and it makes the whole network difficult to train, because the voice detection DNN must handle a non-stationary and unnormalized input distribution. Batch normalization according to an embodiment of the present invention can therefore reduce the internal covariate shift phenomenon.
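To make the role of the boundary layer concrete, the following is a minimal sketch of training-time batch normalization applied to the output of the speech enhancement network before it reaches the voice detection network. The function name, the scalar `gamma` and `beta`, and the toy feature values are illustrative assumptions, not taken from the patent.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature dimension of a mini-batch to zero mean, unit variance.

    batch: list of equal-length feature vectors (one per frame).
    gamma, beta: scale and shift; scalars here, learned per dimension in practice.
    """
    n = len(batch)
    dim = len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dim)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n for d in range(dim)]
    return [
        [gamma * (row[d] - means[d]) / math.sqrt(variances[d] + eps) + beta
         for d in range(dim)]
        for row in batch
    ]

# Hypothetical SE-network outputs whose scale drifts between training steps.
se_output_early = [[0.1, 5.0], [0.3, 7.0], [0.2, 6.0]]
se_output_late = [[10.0, 50.0], [30.0, 70.0], [20.0, 60.0]]

# After normalization the VAD network sees nearly the same input statistics.
vad_input_early = batch_norm(se_output_early)
vad_input_late = batch_norm(se_output_late)
```

Although the two toy batches differ in scale by a factor of up to 100, their normalized versions are almost identical, which is the stabilizing effect described above. This sketch uses mini-batch statistics only; at inference time, batch normalization implementations typically switch to running averages.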

In step 120, a gradient weighting technique is used so that the speech enhancement DNN outputs the speech features required for voice detection. The loss functions of the speech enhancement DNN and the voice detection DNN are computed, the gradient of each loss function is obtained by backpropagation, and the parameters of the two networks are updated using the computed gradients. Training is performed so that the parameter updates of the speech enhancement DNN reduce the loss function of the voice detection DNN, and the speech enhancement DNN thereby outputs the features required for voice detection.

In step 120, the loss functions of the speech enhancement DNN and the voice detection DNN are computed first, and the gradient of each loss function is obtained by backpropagation. The parameters of the two networks are then updated using the computed gradients.

When the gradients are computed, the voice detection gradient is backpropagated not only through the voice detection DNN but also into the speech enhancement DNN. The parameter update of the speech enhancement DNN is therefore influenced by the voice detection loss function as well as by the speech enhancement loss function.

Through these parameter updates, the speech enhancement DNN is trained to reduce the loss function of the voice detection DNN as well, so that it comes to output features that aid voice detection.
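The gradient flow described above can be illustrated with a deliberately tiny model in which each network is a single scalar weight, so the gradients can be written by hand. Everything here is a hypothetical sketch: the scalar networks, the squared-error and cross-entropy losses, the names `lam`, `lr_se`, and `lr_vad`, and the additive combination of the two gradients are assumptions chosen for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def joint_step(w_se, w_vad, noisy, clean, label, lam=0.5, lr_se=0.1, lr_vad=0.1):
    """One gradient-weighted update for a toy SE 'network' and VAD 'network'.

    SE:  enhanced = w_se * noisy,          loss_se  = 0.5 * (enhanced - clean)**2
    VAD: prob = sigmoid(w_vad * enhanced), loss_vad = binary cross-entropy(prob, label)
    """
    # Forward pass through both networks.
    enhanced = w_se * noisy
    prob = sigmoid(w_vad * enhanced)

    # Gradient of the SE loss with respect to the SE weight.
    g_se = (enhanced - clean) * noisy

    # Gradient of the VAD loss, backpropagated into both networks.
    d_logit = prob - label                 # d loss_vad / d logit
    g_vad_wrt_se = d_logit * w_vad * noisy
    g_vad_wrt_vad = d_logit * enhanced

    # The SE update mixes both gradients; lam weights the VAD contribution.
    new_w_se = w_se - lr_se * (g_se + lam * g_vad_wrt_se)
    # The VAD update depends on the VAD loss only.
    new_w_vad = w_vad - lr_vad * g_vad_wrt_vad
    return new_w_se, new_w_vad

# The SE output already matches the clean target, so the SE loss gradient is zero
# and any change in w_se comes from the VAD loss alone.
print(joint_step(1.0, 1.0, noisy=2.0, clean=2.0, label=1.0, lam=0.5))
print(joint_step(1.0, 1.0, noisy=2.0, clean=2.0, label=1.0, lam=0.0))
```

With `lam=0.0` the speech enhancement weight stays put because its own loss is already minimized; with `lam=0.5` it moves in the direction that lowers the voice detection loss.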

In step 130, a denoising variational autoencoder is used in the speech enhancement DNN. Both the encoder probability distribution and the decoder probability distribution are assumed to be diagonal Gaussian distributions, and the encoder DNN and the decoder DNN estimate the mean and log-variance of the corresponding distribution. The prior is assumed to be an isotropic Gaussian distribution, the latent variable and the observed variable are obtained deterministically from the encoder and decoder probability distributions, and the network parameters are updated so as to maximize the variational lower bound.

The variational autoencoder (VAE) is a latent variable generative model that combines deep learning with variational inference. A VAE consists broadly of an encoder and a decoder: the encoder models the probability distribution qφ(z|x) over the latent variable z with a DNN having parameters φ, and the decoder models the probability distribution pθ(x|z) over the observed variable x with a DNN having parameters θ. The variational lower bound L(θ, φ; x) of the log marginal likelihood of the observed variable x can be derived as in equation (1).

In the present invention, both the encoder probability distribution qφ(z|x) and the decoder probability distribution pθ(x|z) are assumed to be diagonal Gaussian distributions, and the encoder DNN and the decoder DNN estimate the mean and log-variance of the corresponding distribution. The prior is assumed to be an isotropic Gaussian distribution. Sampling the latent variable z and the observed variable x directly from the encoder and decoder distributions would make the overall network non-differentiable, so the reparameterization trick is adopted to obtain z and x deterministically. The variational lower bound can then be arranged as in equation (2), and the network parameters θ and φ are updated in the direction that maximizes it.
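The reparameterization trick mentioned above can be sketched in a few lines: the randomness is moved into an auxiliary noise variable so that z becomes a deterministic, differentiable function of the encoder outputs. The function names and the use of plain Python lists are illustrative assumptions; the KL expression is the standard closed form for a diagonal Gaussian against N(0, I).

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * log_var).

    Given eps, z is a deterministic function of (mu, log_var), so gradients
    can flow back to the encoder outputs.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))

rng = random.Random(0)                # fixed seed so the sketch is repeatable
mu_z = [0.5, -1.0]                    # hypothetical encoder mean
log_var_z = [0.0, -2.0]               # hypothetical encoder log-variance
z = reparameterize(mu_z, log_var_z, rng)
kl = kl_to_standard_normal(mu_z, log_var_z)
```

Predicting the log-variance rather than the variance, as the text describes, keeps the network output unconstrained while guaranteeing a positive variance after exponentiation.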

The speech enhancement DNN of the present invention applies a denoising variational autoencoder (DVAE), which incorporates a denoising process into the VAE. The DVAE training procedure is almost the same as that of the VAE; the difference is that the input is noisy speech while the output is clean speech. In an experiment that reconstructed filter-bank features with a VAE and with a plain autoencoder (AE), the VAE was confirmed to reconstruct better than the AE, and this observation motivated the use of the DVAE in the speech enhancement DNN.
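The difference between VAE and DVAE training can be seen in the objective: the reconstruction term is evaluated against the clean features, even though the encoder was fed the noisy ones. The sketch below computes such a negative lower bound from already-computed encoder and decoder outputs. The function name and argument layout are assumptions, and the encoder and decoder networks themselves are not shown.

```python
import math

def dvae_negative_elbo(clean, mu_x, log_var_x, mu_z, log_var_z):
    """Negative variational lower bound for one frame of a denoising VAE.

    clean:            clean target features (the reconstruction target)
    mu_x, log_var_x:  decoder mean and log-variance
    mu_z, log_var_z:  encoder mean and log-variance, computed from the NOISY input
    """
    # Gaussian negative log-likelihood of the clean target under the decoder.
    recon = 0.5 * sum(
        math.log(2.0 * math.pi) + lv + (t - m) ** 2 / math.exp(lv)
        for t, m, lv in zip(clean, mu_x, log_var_x)
    )
    # KL divergence between the encoder distribution and the prior N(0, I).
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu_z, log_var_z))
    return recon + kl

clean_frame = [1.0, 2.0]
good = dvae_negative_elbo(clean_frame, [1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
bad = dvae_negative_elbo(clean_frame, [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
```

Minimizing this quantity is equivalent to maximizing the variational lower bound; a decoder mean that matches the clean frame (`good`) scores lower than one that does not (`bad`).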

FIG. 2 is a diagram illustrating the denoising variational autoencoder for the SE-DVAE according to an embodiment of the present invention.

Batch normalization (BN) and dropout are used in all hidden layers except the Gaussian parameter layers. As described above, BN is known to have a significant effect on integrated training. During integrated training, the output distribution of the SE network (that is, the input distribution of the VAD network) changes considerably over the course of training, so the VAD module must handle a non-stationary and unnormalized input distribution. This problem, called internal covariate shift, makes the whole network difficult to train. Using BN reduces the internal covariate shift at the boundary between the two modules and allows the whole network to be trained efficiently without pre-training.

FIG. 3 is a diagram illustrating three types of integrated training methods according to an embodiment of the present invention.

Three main schemes are proposed for integrated training using the DVAE, shown in FIG. 3 as (a) JL-DVAE-1, (b) JL-DVAE-2, and (c) JL-DVAE-3. In JL-DVAE-1, the enhanced features output by the speech enhancement network are fed directly to the input of the voice detection DNN. In JL-DVAE-2, the latent variable z is fed to the input of the voice detection DNN. In JL-DVAE-3, the enhanced features and the latent variable are fed to the input of the voice detection DNN together. Experiments confirmed that JL-DVAE-3 performed best.
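The three schemes differ only in what the voice detection DNN receives as input, which the following sketch makes explicit. The helper name `build_vad_input` and the toy dimensions are hypothetical, and combining the two inputs by concatenation in JL-DVAE-3 is an assumption; the text only states that both are fed in together.

```python
def build_vad_input(enhanced, latent, scheme):
    """Assemble the VAD-DNN input vector for one frame under a given scheme."""
    if scheme == "JL-DVAE-1":
        return list(enhanced)                 # enhanced features only
    if scheme == "JL-DVAE-2":
        return list(latent)                   # latent variable z only
    if scheme == "JL-DVAE-3":
        return list(enhanced) + list(latent)  # both (concatenation assumed)
    raise ValueError(f"unknown scheme: {scheme}")

enhanced_features = [0.2, 0.4, 0.1, 0.9]  # hypothetical SE-DVAE decoder output
latent_z = [1.5, -0.3]                    # hypothetical latent variable

for scheme in ("JL-DVAE-1", "JL-DVAE-2", "JL-DVAE-3"):
    print(scheme, len(build_vad_input(enhanced_features, latent_z, scheme)))
```

Under this reading, the input layer of the voice detection DNN must be sized per scheme: D for JL-DVAE-1, J for JL-DVAE-2, and D + J for JL-DVAE-3.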

1. Compute the loss functions at the outputs of the SE-DVAE and the VAD-DNN.
2. Compute the loss gradients using backpropagation.
3. Update the parameters of the SE-DVAE and the VAD-DNN.

In step 2, the VAD gradient is also backpropagated through the SE-DVAE. As a result, the parameter update of the SE-DVAE depends not only on the SE loss function but also on the VAD loss function.

In equation (3), θ_SE denotes the parameters of the SE-DVAE, g_SE is the gradient of the SE loss with respect to θ_SE, and g_VAD is the gradient of the VAD loss with respect to θ_SE. Finally, λ is a hyperparameter that weights g_VAD, and α_1 is the learning rate for θ_SE. Because the enhancement process is partly guided by the VAD loss function, the front end can provide enhanced features that are better suited to, and more discriminative for, the subsequent VAD task. The parameter update of the VAD-DNN depends only on the VAD loss function.
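The update equations themselves are missing from this text. Given the symbol definitions above, one plausible reconstruction is the following pair of gradient-descent updates. The additive form of the SE update is inferred from the description of λ as a weight on g_VAD, and the symbols θ_VAD, α_2, and g'_VAD (the VAD loss gradient with respect to θ_VAD) are introduced here for illustration and do not appear in the source.

```latex
\theta_{SE} \leftarrow \theta_{SE} - \alpha_{1}\left(g_{SE} + \lambda\, g_{VAD}\right)
\qquad
\theta_{VAD} \leftarrow \theta_{VAD} - \alpha_{2}\, g'_{VAD}
```

Setting λ = 0 would decouple the front end from the VAD loss, while larger values of λ let the VAD loss steer the enhancement network more strongly.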

FIG. 4 is a diagram showing the configuration of a denoising variational autoencoder-based integrated training apparatus for voice detection according to an embodiment of the present invention.

The proposed denoising variational autoencoder-based integrated training apparatus for voice detection includes a normalization unit 410, a weighting unit 420, and an encoding unit 430.

The normalization unit 410 uses batch normalization to reduce the internal covariate shift phenomenon that occurs during training. To reduce the variation in the output distribution of the speech enhancement DNN that arises when the two networks are combined and integrated training is performed, a batch normalization layer is added between the two networks to handle the non-stationary input distribution, which reduces the internal covariate shift phenomenon.

In the integrated training of speech enhancement and speech recognition, batch normalization according to an embodiment of the present invention adds a batch normalization layer between the two networks, which reduces the internal covariate shift phenomenon and makes training easier. When the two networks are combined and trained together, the output distribution of the speech enhancement DNN, in other words the input distribution of the voice detection DNN, changes continuously. This is called the internal covariate shift phenomenon, and it makes the whole network difficult to train, because the voice detection DNN must handle a non-stationary and unnormalized input distribution. Batch normalization according to an embodiment of the present invention can therefore reduce the internal covariate shift phenomenon.

The weighting unit 420 uses a gradient weighting technique so that the speech enhancement DNN outputs the speech features required for voice detection. The loss functions of the speech enhancement DNN and the voice detection DNN are computed, the gradient of each loss function is obtained by backpropagation, and the parameters of the two networks are updated using the computed gradients. Training is performed so that the parameter updates of the speech enhancement DNN reduce the loss function of the voice detection DNN, and the speech enhancement DNN thereby outputs the features required for voice detection.

The weighting unit 420 first computes the loss functions of the speech enhancement DNN and the voice detection DNN and obtains the gradient of each loss function by backpropagation. It then updates the parameters of the two networks using the computed gradients.

When the gradients are computed, the voice detection gradient is backpropagated not only through the voice detection DNN but also into the speech enhancement DNN. The parameter update of the speech enhancement DNN is therefore influenced by the voice detection loss function as well as by the speech enhancement loss function.

Through these parameter updates, the speech enhancement DNN is trained to reduce the loss function of the voice detection DNN, so the speech enhancement DNN can output features that help voice detection.
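The update described above can be sketched with a deliberately tiny model in which each network has a single scalar parameter. This is only an illustration under stated assumptions: the text does not give the weighting formula, so the factor `vad_weight`, the learning rate `lr`, the squared-error enhancement loss, the cross-entropy detection loss, and all names and values here are hypothetical.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def losses(a, b, x, clean, label):
    """Return (enhancement loss, detection loss) for the toy two-stage model."""
    y = a * x                      # toy enhancement network: y = a * x
    p = sigmoid(b * y)             # toy detection network: p = sigmoid(b * y)
    se_loss = (y - clean) ** 2
    vad_loss = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return se_loss, vad_loss

def joint_step(a, b, x, clean, label, lr=0.1, vad_weight=0.5):
    """One integrated update; vad_weight is an assumed gradient weight."""
    y = a * x
    p = sigmoid(b * y)
    grad_a_se = 2.0 * (y - clean) * x   # d(se_loss)/da
    grad_y_vad = (p - label) * b        # d(vad_loss)/dy, backpropagated into the front end
    grad_a_vad = grad_y_vad * x         # d(vad_loss)/da
    grad_b_vad = (p - label) * y        # d(vad_loss)/db
    # The enhancement parameter sees both gradients; the detection one only its own.
    a_new = a - lr * (grad_a_se + vad_weight * grad_a_vad)
    b_new = b - lr * grad_b_vad
    return a_new, b_new

a, b = 0.5, 0.5
x, clean, label = 1.0, 1.0, 1.0
before = losses(a, b, x, clean, label)
a, b = joint_step(a, b, x, clean, label)
after = losses(a, b, x, clean, label)
```

In this toy setting a single step lowers both losses, because the enhancement parameter `a` is moved by the detection gradient as well as by its own loss gradient.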

The encoding unit 430 uses a denoising variational autoencoder in the speech enhancement DNN. Both the encoder probability distribution and the decoder probability distribution are assumed to be diagonal Gaussian distributions, and the encoder DNN and the decoder DNN estimate the mean and log-variance of their respective distributions. The prior is assumed to be an isotropic Gaussian distribution, the latent variable and the observed variable are obtained deterministically from the encoder and decoder probability distributions, and the network parameters are updated so as to maximize the variational lower bound.
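Under the Gaussian assumptions stated above, the variational lower bound has a closed form, which the following minimal sketch evaluates. It assumes a standard-normal prior N(0, I) and reads "obtained deterministically" as the usual reparameterization z = mu + sigma * eps; both are assumptions for this example, as are all function names and sample values.

```python
import math
import random

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))

def gaussian_log_likelihood(x, mu, logvar):
    """log N(x; mu, diag(exp(logvar)))."""
    return -0.5 * sum(
        math.log(2.0 * math.pi) + lv + (xi - m) ** 2 / math.exp(lv)
        for xi, m, lv in zip(x, mu, logvar)
    )

def sample_latent(mu, logvar, rng):
    """Reparameterized sample: a deterministic function of mu, logvar, and noise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]

def lower_bound(clean, dec_mu, dec_logvar, enc_mu, enc_logvar):
    """Single-sample lower bound: reconstruction term minus KL term."""
    return (gaussian_log_likelihood(clean, dec_mu, dec_logvar)
            - kl_to_standard_normal(enc_mu, enc_logvar))

# Hypothetical values standing in for encoder and decoder network outputs.
rng = random.Random(0)
enc_mu, enc_logvar = [0.0, 0.0], [0.0, 0.0]
z = sample_latent(enc_mu, enc_logvar, rng)
clean = [0.0]
bound = lower_bound(clean, [0.0], [0.0], enc_mu, enc_logvar)
```

Training would adjust the encoder and decoder parameters to increase `bound`, with the reconstruction term measured against the clean features rather than the noisy input.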

The present invention extends existing integrated training methods in three ways. First, batch normalization is used to reduce internal covariate shift during training. Batch normalization has already been shown to reduce internal covariate shift for integrated training approaches in speech recognition tasks, and the same holds for the VAD task. Second, the parameter updates of the SE network depend not only on the SE loss function but also on the VAD loss function, so the front end can provide enhanced features suited to the subsequent VAD task. Finally, a DVAE (denoising variational autoencoder) is applied for speech enhancement. The DVAE maps noisy features to a latent code and then decodes the latent code to reconstruct clean features. According to an embodiment of the present invention, the VAD network is provided with the latent code as well as the enhanced features. Experimental results showed that the proposed method outperforms existing integrated-training-based methods.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. The apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an ALU (arithmetic logic unit), a digital signal processor, a microcomputer, an FPA (field programmable array), a PLU (programmable logic unit), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but those skilled in the art will understand that the processing device may include multiple processing elements and/or multiple types of processing elements. For example, the processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual device, or computer storage medium or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over computer systems connected by a network and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy (registered trademark) disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer executes using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results can be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other embodiments also fall within the scope of the appended claims, provided they are equivalent to the claims.

410: Normalization unit
420: Weighting unit
430: Encoding unit

Claims

An integrated training method for voice detection, comprising:
a step of using batch normalization to reduce the internal covariate shift phenomenon that occurs during training;
a step of using a gradient weighting technique so that a speech enhancement DNN (deep neural network) outputs the speech features needed for voice detection; and
a step of using a denoising variational autoencoder in the speech enhancement DNN,
wherein the integrated training method for voice detection includes transforming speech features with the speech enhancement DNN so as to remove noise from the speech features, and performing voice detection with a voice detection DNN using the noise-removed speech features,
wherein the step of using the gradient weighting technique so that the speech enhancement DNN outputs the speech features needed for the voice detection includes computing loss functions of the speech enhancement DNN and the voice detection DNN, obtaining the gradient of each loss function by backpropagation, and then updating the parameters of the two networks using the computed gradients, and
wherein the integrated training method for voice detection further includes performing the training so that the parameter update of the speech enhancement DNN reduces the loss function of the voice detection DNN, and outputting, from the speech enhancement DNN, the features needed for the voice detection.

The integrated training method for voice detection according to claim 1,
wherein the step of using the batch normalization to reduce the internal covariate shift phenomenon that occurs during the training includes reducing the internal covariate shift phenomenon by adding a batch normalization layer between the two networks to handle the unnormalized input distribution, in order to reduce the variation in the output distribution of the speech enhancement DNN that arises when the two networks are combined and the integrated training is performed.

The integrated training method for voice detection according to claim 1,
wherein the step of using the denoising variational autoencoder in the speech enhancement DNN includes assuming both an encoder probability distribution and a decoder probability distribution to be diagonal Gaussian distributions, estimating the mean and log-variance of the respective probability distributions with an encoder DNN and a decoder DNN, assuming the prior to be an isotropic Gaussian distribution, obtaining a latent variable and an observed variable deterministically from the encoder probability distribution and the decoder probability distribution, and updating network parameters so as to maximize the variational lower bound.

An integrated training apparatus for voice detection, comprising:
a normalization unit that uses batch normalization to reduce the internal covariate shift phenomenon that occurs during training;
a weighting unit that uses a gradient weighting technique so that a speech enhancement DNN (deep neural network) outputs the speech features needed for voice detection; and
an encoding unit that uses a denoising variational autoencoder in the speech enhancement DNN,
wherein the integrated training method for voice detection includes transforming the speech features with the speech enhancement DNN so as to remove noise from the speech features, and performing the voice detection with a voice detection DNN using the noise-removed speech features,
wherein the weighting unit computes loss functions of the speech enhancement DNN and the voice detection DNN, obtains the gradient of each loss function by backpropagation, and then updates the parameters of the two networks using the computed gradients, and
wherein the integrated training apparatus for voice detection performs the training so that the parameter update of the speech enhancement DNN reduces the loss function of the voice detection DNN, and outputs, from the speech enhancement DNN, the features needed for the voice detection.

The integrated training apparatus for voice detection according to claim 4,
wherein the normalization unit reduces the internal covariate shift phenomenon by adding a batch normalization layer between the two networks to handle the unnormalized input distribution, in order to reduce the variation in the output distribution of the speech enhancement DNN that arises when the two networks are combined and integrated training is performed.

The integrated training apparatus for voice detection according to claim 4,
wherein the encoding unit assumes both an encoder probability distribution and a decoder probability distribution to be diagonal Gaussian distributions, estimates the mean and log-variance of the respective probability distributions with an encoder DNN and a decoder DNN, assumes the prior to be an isotropic Gaussian distribution, obtains a latent variable and an observed variable deterministically from the encoder probability distribution and the decoder probability distribution, and updates network parameters so as to maximize the variational lower bound.