JP6791816B2 - Voice section detection device, voice section detection method, and program

Info

Publication number: JP6791816B2
Application number: JP2017141793A
Authority: JP
Inventors: 勇気太刀岡
Original assignee: Denso IT Laboratory Inc
Current assignee: Denso IT Laboratory Inc
Priority date: 2017-07-21
Filing date: 2017-07-21
Publication date: 2020-11-25
Anticipated expiration: 2037-07-21
Also published as: JP2019020685A

Description

The present invention relates to a voice section detection device, a voice section detection method, and a program.

As applications and devices that use voice become widespread, voice section detection, which identifies when voice buried in noise is uttered in a noisy environment, is becoming increasingly important. Voice section detection is useful for recording only the voice portions, for voice enhancement that removes noise from voice, for voice recognition, and so on.

Various voice section detection techniques have been proposed. Patent Document 1 discloses a method that detects voice sections by exploiting the fact that the power of voice is greater than the power of noise. Patent Document 2 discloses a method that uses other physical information, such as the direction of arrival of the voice. Patent Document 3 discloses a technique that improves voice section detection performance by using a voice model trained in advance on a large amount of voice data. Voice sections could also be detected from the results of voice recognition, but this is impractical: voice recognition generally requires a large amount of computation, and voice section detection must run continuously.

Patent Document 1: Japanese Unexamined Patent Publication No. 2005-321539
Patent Document 2: Japanese Unexamined Patent Publication No. 2012-048119
Patent Document 3: Japanese Unexamined Patent Publication No. 2009-210647

The technique described in Patent Document 1 assumes that the power of the voice is greater than the power of the noise. However, as voice is increasingly used in real environments with high noise levels, there are more and more cases in which the voice power cannot be expected to exceed the noise power. The method described in Patent Document 2 requires a sensor to acquire the other physical information. The method disclosed in Patent Document 3 uses a voice model in which a large amount of voice data is all mixed together, which makes it difficult to exploit knowledge about the voice. It should also be noted that, because of the nature of voice section detection, the amount of computation cannot be increased very much.

In view of the above background, an object of the present invention is to propose a method that improves voice section detection performance while limiting the increase in the amount of computation.

The voice section detection device of the present invention includes: an input unit that receives input voice containing voice and noise; a feature amount calculation unit that obtains a voice feature amount from the input voice; an acoustic likelihood calculation unit that obtains, based on an acoustic model, the likelihood of which subset of voice the voice feature amount corresponds to; a posterior probability calculation unit that calculates the posterior probability of voice or non-voice from the voice feature amount based on a voice section detection model, and that also uses the likelihood calculated by the acoustic likelihood calculation unit to obtain the posterior probability; and a determination unit that determines voice or non-voice based on the posterior probability. Here, a subset of voice is, for example, a phoneme such as "a", "i", or "u", or a class of phonemes such as "vowel" or "consonant".

With this configuration, the posterior probability of voice or non-voice is obtained while taking into account which subset the voice feature amount of the input voice would belong to if it were voice, so the performance of voice section detection can be improved. Here, the acoustic model is a model for obtaining the likelihood that the sound characterized by the voice feature amount is a particular sound (for example, "a"). The voice section detection model, in contrast, is a model that has learned the acoustic characteristics of voice and noise in order to classify them.

The voice section detection device of the present invention may include a dimension compression unit that compresses the dimension of the likelihood data calculated by the acoustic likelihood calculation unit, and the likelihood data whose dimension has been compressed by the dimension compression unit may be input to the posterior probability calculation unit.

Likelihood data for voice tends to be high-dimensional, but compressing its dimension keeps down the amount of computation in the posterior probability calculation unit. The dimension compression unit can use, for example, vector conversion, principal component analysis, or a neural network.

The voice section detection device of the present invention may include a voice recognition unit that performs voice recognition using the likelihood data calculated by the acoustic likelihood calculation unit.

With this configuration, the voice likelihood data obtained for voice section detection can be reused for voice recognition, which reduces the computational cost of voice recognition.

A voice section detection device according to another aspect of the present invention includes: an input unit that receives input voice containing voice and noise; a feature amount calculation unit that obtains a voice feature amount from the input voice; an activation degree calculation unit that calculates, based on a voice enhancement model, the respective activation degrees obtained when the voice feature amount is separated into voice and noise; a posterior probability calculation unit that calculates the posterior probability of voice or non-voice from the voice feature amount based on a voice section detection model, and that also uses the activation degrees calculated by the activation degree calculation unit to obtain the posterior probability; and a determination unit that determines voice or non-voice based on the posterior probability.

With this configuration, the posterior probability of voice or non-voice is obtained while also taking into account the voice and noise activation degrees of the voice feature amount of the input voice, so the performance of voice section detection can be improved.

In the voice section detection device of the present invention, the activation degree calculation unit may factorize the voice feature amount into bases and activation degrees by non-negative matrix factorization, determine a voice basis and a noise basis from the obtained bases, and obtain the activation degree corresponding to each of the voice and noise bases.

With this configuration, the activation degrees corresponding to the voice basis and the noise basis can be obtained by unsupervised learning.
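As a reading aid, the factorization described above can be written in matrix form. The symbols below are assumptions introduced for illustration and do not appear in the source: V is the matrix of non-negative voice feature amounts (F frequency bins by T frames), W holds the bases as columns, H holds the activation degrees as rows, and the subscripts s and n mark the voice and noise parts.

```latex
V \approx W H
  = \begin{bmatrix} W_{s} & W_{n} \end{bmatrix}
    \begin{bmatrix} H_{s} \\ H_{n} \end{bmatrix}
  = W_{s} H_{s} + W_{n} H_{n},
\qquad
V \in \mathbb{R}_{\ge 0}^{F \times T},\quad W \ge 0,\quad H \ge 0
```

Under this notation, the rows of H_s and H_n are the voice and noise activation degrees that are passed on to the posterior probability calculation.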

The voice section detection device of the present invention may include a basis learning unit that learns a voice basis from clean voice, and the activation degree calculation unit may use the learned voice basis to factorize the voice feature amount into bases and activation degrees by non-negative matrix factorization and obtain the activation degree corresponding to each of the voice and noise bases.

Using a voice basis learned in this way from clean voice that contains no noise improves the accuracy of the bases and the activation degrees.

The voice section detection device of the present invention may include a data selection unit that selects clean voice based on the voice basis calculated by the activation degree calculation unit, and the basis learning unit may relearn the voice basis using the selected clean voice.

With this configuration, a voice basis close to the speaker and the utterance content is generated, so the accuracy of voice section detection can be further improved for the particular user.

The voice section detection device of the present invention may include a noise suppression unit that suppresses noise using the basis and activation degree data calculated by the activation degree calculation unit.

With this configuration, the voice and noise bases and their activation degrees obtained for voice section detection can be reused for noise suppression, which reduces the computational cost of noise suppression.
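The source does not say how the noise suppression unit uses the bases and activation degrees. One common choice in NMF-based enhancement is a Wiener-style soft mask, sketched below for a single frame. The function name `suppress_noise` and its arguments are hypothetical, and the mask itself is an assumption rather than the method of the patent.

```python
def suppress_noise(v, W_s, h_s, W_n, h_n, eps=1e-12):
    """Apply a Wiener-style soft mask to one spectral frame v.

    W_s / W_n: voice / noise bases (one row per frequency bin).
    h_s / h_n: voice / noise activation degrees for this frame.
    The mask is an assumed technique, not taken from the source.
    """
    # Reconstruct the voice and noise parts of the spectrum from basis x activation.
    speech = [sum(w * h for w, h in zip(row, h_s)) for row in W_s]
    noise = [sum(w * h for w, h in zip(row, h_n)) for row in W_n]
    # Keep the fraction of each bin that the model attributes to voice.
    return [x * s / (s + n + eps) for x, s, n in zip(v, speech, noise)]


# Toy frame: bin 0 is explained only by voice, bin 1 only by noise.
enhanced = suppress_noise(
    v=[4.0, 4.0],
    W_s=[[1.0], [0.0]], h_s=[2.0],
    W_n=[[0.0], [1.0]], h_n=[2.0],
)
```

Because the bases and activation degrees are already available from the detection stage, the extra work is one mask computation per frame.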

The voice section detection method of the present invention is a method of detecting a voice section with a voice section detection device, and includes: a step in which the voice section detection device receives input voice containing voice and noise; a step in which the voice section detection device obtains a voice feature amount from the input voice; a step in which the voice section detection device obtains, based on an acoustic model, the likelihood of which subset of voice the voice feature amount corresponds to; a step in which the voice section detection device calculates the posterior probability of voice or non-voice from the voice feature amount based on a voice section detection model, also using the likelihood that the voice feature amount is voice; and a step in which the voice section detection device determines voice or non-voice based on the posterior probability.

The voice section detection method of the present invention is a method of detecting a voice section with a voice section detection device, and includes: a step in which the voice section detection device receives input voice containing voice and noise; a step in which the voice section detection device obtains a voice feature amount from the input voice; a step in which the voice section detection device calculates, based on a voice enhancement model, the respective activation degrees obtained when the voice feature amount is separated into voice and noise; a step in which the voice section detection device calculates the posterior probability of voice or non-voice from the voice feature amount based on a voice section detection model, also using the activation degrees; and a step in which the voice section detection device determines voice or non-voice based on the posterior probability.

The program of the present invention is a program for detecting a voice section, and causes a computer to execute: a step of receiving input voice containing voice and noise; a step of obtaining a voice feature amount from the input voice; a step of obtaining, based on an acoustic model, the likelihood of which subset of voice the voice feature amount corresponds to; a step of calculating the posterior probability of voice or non-voice from the voice feature amount based on a voice section detection model, also using the likelihood that the voice feature amount is voice; and a step of determining voice or non-voice based on the posterior probability.

The program of the present invention is a program for detecting a voice section, and causes a computer to execute: a step of receiving input voice containing voice and noise; a step of obtaining a voice feature amount from the input voice; a step of calculating, based on a voice enhancement model, the respective activation degrees obtained when the voice feature amount is separated into voice and noise; a step of calculating the posterior probability of voice or non-voice from the voice feature amount based on a voice section detection model, also using the activation degrees; and a step of determining voice or non-voice based on the posterior probability.

According to the present invention, the detection performance for voice sections can be improved.

A diagram showing the configuration of the voice section detection device of the first embodiment.
A flowchart showing the operation of the voice section detection device of the first embodiment.
A diagram showing the configuration of the voice section detection device of the second embodiment.
A diagram showing the configuration of the voice section detection device of the third embodiment.
A diagram showing the configuration of the voice section detection device of the fourth embodiment.
A flowchart showing the operation of the voice section detection device of the fourth embodiment.
A diagram showing the configuration of the voice section detection device of the fifth embodiment.
A diagram showing the configuration of the voice section detection device of the sixth embodiment.
A diagram showing the configuration of the voice section detection device of the seventh embodiment.

Hereinafter, the voice section detection device 1 according to an embodiment of the present invention will be described with reference to the drawings.
(First Embodiment)
FIG. 1 is a diagram showing the configuration of the voice section detection device 1 of the first embodiment. The input unit 10 accepts, frame by frame, the input voice in which voice sections are to be detected. In this document, the sound data to be examined, in which voice and noise are mixed, is called the "input voice", and each of its frames is called an "input voice frame". The feature amount calculation unit 11 calculates the voice feature amount of the input voice frame received by the input unit 10. In the present embodiment, a spectral feature amount v is used as the voice feature amount. The spectral feature amount v uses the frequency components contained in the input voice frame as features.
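The source only states that the spectral feature amount v is built from the frequency components of an input voice frame. A minimal sketch of one such feature, the magnitude spectrum of a Hann-windowed frame, is given below. The window, the frame length of 64 samples, and the function name `spectral_feature` are assumptions made for illustration.

```python
import cmath
import math


def spectral_feature(frame):
    """Return the magnitude spectrum (bins 0..N/2) of one Hann-windowed frame."""
    n = len(frame)
    # A Hann window reduces spectral leakage at the frame edges.
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    v = []
    for k in range(n // 2 + 1):
        # Direct DFT of bin k; adequate for a short illustrative frame.
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        v.append(abs(acc))
    return v


# Usage: a 64-sample frame holding a sinusoid with 8 cycles per frame.
frame = [math.sin(2 * math.pi * 8 * i / 64) for i in range(64)]
v = spectral_feature(frame)
peak_bin = max(range(len(v)), key=v.__getitem__)
```

A practical implementation would use an FFT routine instead of the direct sum, and might apply a log or mel-scale compression before passing v on.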

The posterior probability calculation unit 12 calculates the posterior probability of voice or noise based on the spectral feature amount v. The posterior probability calculation unit 12 is connected to the voice model storage unit 15 and the noise model storage unit 16. The voice model and the noise model are voice section detection models for detecting voice sections: the voice model has learned the acoustic characteristics of voice, and the noise model has learned the acoustic characteristics of noise.

The voice model and the noise model could also be learned after the fact, but in the present embodiment they are learned in advance. Considering use in a wide range of situations, it is preferable to include a wide variety of voices in the training data for the voice model. The noise model storage unit 16 is optional and may be omitted.

The posterior probability calculation unit 12 calculates the posterior probability w based on the spectral feature amount v. Writing the conversion from the spectral feature amount v to the posterior probability w by the voice section detection model as f(), this is w = f(v). The posterior probability calculation unit 12 inputs the obtained posterior probability w to the determination unit 13.

The determination unit 13 determines whether the input voice frame is voice or non-voice based on the posterior probability w. For example, if w = [posterior probability of voice, posterior probability of noise] and vector indices start at 1, the input voice frame can be judged to be voice when argmax(w) = 1 and noise when argmax(w) = 2.
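The argmax rule above can be sketched in a few lines. Note that Python indexes from 0, so index 0 here corresponds to argmax(w) = 1 in the text. The function name `decide` is hypothetical.

```python
def decide(w):
    """w = [posterior of voice, posterior of noise]; return the winning label."""
    labels = ("voice", "noise")
    # argmax over the posterior vector (0-based, unlike the 1-based text).
    return labels[max(range(len(w)), key=w.__getitem__)]


print(decide([0.7, 0.3]))
print(decide([0.2, 0.8]))
```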

The output unit 14 outputs the result of the determination of whether the input voice frame is voice or noise. The output unit 14 may instead output the data of the sections determined to be voice by the determination unit 13.

The voice section detection device 1 of the present embodiment further includes an acoustic likelihood calculation unit 17. The feature amount calculation unit 11 inputs the obtained spectral feature amount v to the acoustic likelihood calculation unit 17. In the present embodiment, the same spectral feature amount v that is input to the posterior probability calculation unit 12 is used, but the feature amounts input to the posterior probability calculation unit 12 and the acoustic likelihood calculation unit 17 need not be the same, and a different feature amount may be input to each.

The acoustic likelihood calculation unit 17 applies the input spectral feature amount v to the acoustic model to obtain the likelihood L that the spectral feature amount v is voice. The acoustic likelihood calculation unit 17 obtains the likelihood L for each of an appropriately chosen set of states, such as monophone states, triphone states, or states obtained by merging these. The acoustic model is stored in the acoustic model storage unit 18. A conventionally used Gaussian mixture model, neural network, or the like can be used as the acoustic model. The acoustic likelihood calculation unit 17 inputs the calculated likelihood L to the posterior probability calculation unit 12.
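To make the per-state likelihood L concrete, the sketch below scores a feature vector against each state of a toy acoustic model. It is a simplification under stated assumptions: each state is modeled by a single diagonal-covariance Gaussian rather than a full Gaussian mixture or a neural network, and the names `log_gaussian`, `acoustic_likelihoods`, and `states` are hypothetical.

```python
import math


def log_gaussian(v, mean, var):
    """Log-density of v under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * s) + (x - m) ** 2 / s)
        for x, m, s in zip(v, mean, var)
    )


def acoustic_likelihoods(v, states):
    """Return one log-likelihood per state (e.g. per monophone state)."""
    return [log_gaussian(v, mean, var) for mean, var in states]


# Two toy states, each given as (mean vector, variance vector).
states = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]
L = acoustic_likelihoods([2.9, 3.2], states)
best_state = max(range(len(L)), key=L.__getitem__)
```

The length of L equals the number of states, which is why L can become much larger than v when triphone states are used.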

As described above, the posterior probability calculation unit 12 obtains the posterior probability w from the spectral feature amount v using the voice section detection model; in the present embodiment, it additionally uses the likelihood L. That is, the posterior probability calculation unit 12 can concatenate L with the original spectral feature amount v and compute w = f([v; L]), or it can use another model g() and compute w = α·f(v) + β·g(L), where α and β are suitable coefficients. These are only examples; the combination is not limited to a linear sum, and many ways of combining the two are conceivable. In general, the two outputs can be fed to an integration model h(), as in w = h(f(v), g(L)).
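The two combination schemes named above, w = f([v; L]) and w = α·f(v) + β·g(L), can be sketched as follows. The stand-in model `toy_model` and the coefficient values are assumptions for illustration only; a real f() and g() would be trained models.

```python
import math


def softmax(z):
    """Numerically stable softmax, returning probabilities that sum to 1."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]


def fuse_concat(f, v, L):
    """w = f([v; L]): append L to v and apply one model."""
    return f(list(v) + list(L))


def fuse_weighted(f, g, v, L, alpha, beta):
    """w = alpha * f(v) + beta * g(L): weighted sum of two model outputs."""
    return [alpha * a + beta * b for a, b in zip(f(v), g(L))]


def toy_model(x):
    """Hypothetical stand-in for f() or g(): maps a vector to [voice, noise]."""
    s = sum(x)
    return softmax([s, -s])


v = [0.2, 0.5]
L = [1.0, -0.3]
w_concat = fuse_concat(toy_model, v, L)
w_sum = fuse_weighted(toy_model, toy_model, v, L, alpha=0.6, beta=0.4)
```

With α + β = 1 and both models returning probabilities, the weighted sum is again a valid probability vector. Concatenation instead requires f() to be trained on the longer input.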

The configuration of the voice section detection device 1 of the first embodiment has been described above. An example of the hardware of the voice section detection device 1 is a computer equipped with a CPU, RAM, ROM, a hard disk, a display, a keyboard, a mouse, a communication interface, and the like. The voice section detection device 1 described above is realized by storing a program having modules that implement each of the above functions in the RAM or ROM and executing the program on the CPU. Such a program is also included in the scope of the present invention.

FIG. 2 is a flowchart showing the operation of the voice section detection device 1 of the first embodiment. The voice section detection device 1 first accepts the data of an input voice frame (S10) and obtains the spectral feature amount v of the input voice frame (S11). Next, the voice section detection device 1 obtains the acoustic likelihood L based on the spectral feature amount v (S12) and inputs the obtained acoustic likelihood L to the posterior probability calculation unit 12. The posterior probability calculation unit 12 applies the spectral feature amount v to the voice model and the noise model and, also using the acoustic likelihood L, obtains the posterior probability w of voice or noise (S13). The determination unit 13 determines whether the input voice frame is voice or noise based on the posterior probability w (S14), thereby detecting the voice sections. The voice section detection device 1 outputs the data of the detected voice sections (S15).
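Steps S14 and S15 turn per-frame decisions into voice sections. The source does not describe how frames are grouped, so the sketch below makes the simplest assumption: each run of consecutive voice frames forms one section, reported as a half-open frame range. The function name `frames_to_sections` is hypothetical.

```python
def frames_to_sections(is_voice):
    """Group consecutive voice frames into (start, end) ranges, end exclusive."""
    sections = []
    start = None
    for i, flag in enumerate(is_voice):
        if flag and start is None:
            start = i  # a voice section begins at this frame
        elif not flag and start is not None:
            sections.append((start, i))  # the section ended at the previous frame
            start = None
    if start is not None:
        # The input ended while still inside a voice section.
        sections.append((start, len(is_voice)))
    return sections


decisions = [False, True, True, False, True]
sections = frames_to_sections(decisions)
```

Practical detectors usually add smoothing, such as a minimum section length or a hangover period, so that single misclassified frames do not split a section.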

The configuration and operation of the voice section detection device 1 of the first embodiment have been described above. The voice section detection device 1 of the first embodiment obtains the posterior probability of voice or noise while also taking into account the likelihood L that the spectral feature amount of the input voice frame is voice. Because the likelihood L strongly reflects the utterance content, it is useful for detecting voice sections, and adding it to the posterior probability calculation in the posterior probability calculation unit 12 improves the performance of voice section detection.

(Second Embodiment)
FIG. 3 is a diagram showing the configuration of the voice section detection device 2 of the second embodiment. The basic configuration of the voice section detection device 2 of the second embodiment is the same as that of the voice section detection device 1 of the first embodiment. It differs in that, in addition to the configuration of the first embodiment, it includes a vector conversion unit 19 that converts the vector of the acoustic likelihood L obtained by the acoustic likelihood calculation unit 17 to a lower dimension.

The likelihood L calculated by the acoustic likelihood calculation unit 17 tends to have a larger dimension than the spectral feature amount v. The voice section detection device 2 of the second embodiment uses the vector conversion unit 19 to convert it into a low-dimensional vector. Writing the conversion realized by the vector conversion unit 19 as a function T, and taking as an example the method described above of concatenating with the original feature amount v, this gives w = f([v; T(L)]). This reduces the computational load on the posterior probability calculation unit 12.

The present embodiment has described an example in which the voice section detection device 2 performs vector conversion for the purpose of dimension compression. Even when the dimension does not need to be reduced, however, conversions such as normalizing the length of the vector or compensating for its mean value may be performed.
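A minimal sketch of the conversion T is given below, assuming T is a linear projection with optional mean compensation and length normalization. The source does not fix the form of T, so the projection matrix `A`, the `mean` vector, and the function name `transform` are hypothetical.

```python
import math


def transform(L, A, mean=None, normalize=False):
    """Compute T(L) = A (L - mean), optionally scaled to unit length."""
    # Mean compensation: subtract a precomputed mean vector if one is given.
    x = [l - m for l, m in zip(L, mean)] if mean is not None else list(L)
    # Linear projection: each row of A produces one output dimension.
    y = [sum(a * b for a, b in zip(row, x)) for row in A]
    if normalize:
        norm = math.sqrt(sum(t * t for t in y))
        if norm > 0:
            y = [t / norm for t in y]
    return y


# Project a 4-dimensional likelihood vector down to 2 dimensions.
A = [[1, 0, 0, 0], [0, 0.5, 0.5, 0]]
L = [3.0, 1.0, 3.0, 7.0]
T_L = transform(L, A)
T_L_unit = transform(L, A, normalize=True)
```

In practice A would be estimated from data, for example by principal component analysis, rather than written by hand.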

Although the present embodiment has described an example using vector conversion, principal component analysis or dimension compression by a neural network may also be used to compress the dimension.

(Third Embodiment)
FIG. 4 is a diagram showing the configuration of the voice section detection device 3 of the third embodiment. The basic configuration of the voice section detection device 3 of the third embodiment is the same as that of the voice section detection device 2 of the second embodiment. It differs in that, in addition to the configuration of the second embodiment, it includes a voice recognition unit 20 that performs voice recognition using the acoustic likelihood L obtained by the acoustic likelihood calculation unit 17.

Combining the voice recognition unit 20 in this way makes it possible to perform voice recognition at relatively low cost. The output unit 14 also outputs the result of recognition by the voice recognition unit 20. Although the present embodiment has described a configuration in which the voice recognition unit 20 is added to the voice section detection device 2 of the second embodiment, the voice recognition unit 20 may of course be added to the voice section detection device 1 of the first embodiment instead.

(Fourth Embodiment)
FIG. 5 is a diagram showing the configuration of the voice section detection device 4 of the fourth embodiment. The embodiments above showed that using the mechanism of voice recognition enables voice section detection that reflects the utterance content. The fourth embodiment describes voice section detection that uses the mechanism of voice enhancement to exploit the characteristics of the speaker, the utterance content, or both.

The voice section detection device 4 of the fourth embodiment includes an activation degree calculation unit 21 in place of the acoustic likelihood calculation unit 17 described in the first to third embodiments. The feature amount calculation unit 11 inputs the spectral feature amount obtained from the input voice frame to the activation degree calculation unit 21. The activation degree calculation unit 21 uses non-negative matrix factorization to factorize it into bases and activation degrees (activations). Specifically, the activation degree calculation unit 21 calculates a voice basis and a noise basis, calculates the corresponding activation degrees, reconstructs the feature amount from the calculated bases and activation degrees, and iteratively updates the bases and activation degrees so that the reconstruction approaches the original feature amount. In this way, the activation degree calculation unit 21 obtains the voice basis, the noise basis, and their respective activation degrees.
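The iterative update described above can be illustrated with the standard multiplicative update rules for non-negative matrix factorization under a squared-error objective. This is a sketch under assumptions: the source does not name the update rule or the objective, and the step of deciding which bases represent voice and which represent noise is left out. The names `nmf`, `matmul`, `transpose`, and `frob_error` are hypothetical.

```python
import random


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(r) for r in zip(*A)]


def frob_error(V, W, H):
    """Squared reconstruction error between V and W H."""
    R = matmul(W, H)
    return sum((v - r) ** 2 for rv, rr in zip(V, R) for v, r in zip(rv, rr))


def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Factorize V (F x T) into bases W (F x k) and activations H (k x T)."""
    rng = random.Random(seed)
    F, T = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(F)]
    H = [[rng.random() + 0.1 for _ in range(T)] for _ in range(k)]
    for _ in range(iters):
        # Update activations: H <- H * (W^T V) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # Update bases: W <- W * (V H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H


# Toy spectrogram built from two known non-negative components.
W0 = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
H0 = [[1.0, 2.0, 0.0, 0.0, 1.0], [0.0, 0.0, 3.0, 1.0, 1.0]]
V = matmul(W0, H0)
W, H = nmf(V, k=2)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative throughout. When a voice basis has been learned beforehand from clean voice, the corresponding columns of W can simply be held fixed while the rest is updated.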

The basis and activation degree data obtained by the activation degree calculation unit 21 are reduced in dimension by the vector conversion unit 19 and input to the posterior probability calculation unit 12. The posterior probability calculation unit 12 obtains the posterior probability w using the input basis and activation degree data as well.

FIG. 6 is a flowchart showing the operation of the voice section detection device 4 according to the fourth embodiment. The voice section detection device 4 first receives an input voice frame (S20) and obtains the spectral feature amount v of the input voice frame (S21). Subsequently, the voice section detection device 4 obtains the voice and noise bases and their respective activation degrees based on the spectral feature amount v (S22), and inputs the obtained basis and activation degree data to the posterior probability calculation unit 12.

The posterior probability calculation unit 12 applies the spectral feature amount v to the voice model and the noise model and, also using the data of the acoustic likelihood L, obtains the posterior probability w of voice or noise (S23). The determination unit 13 determines whether the input voice frame is voice or noise based on the posterior probability w (S24), and thereby detects the voice section. The voice section detection device 4 outputs the data of the detected voice section (S25).
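Steps S23 to S25 can be pictured with the following minimal sketch. It assumes that per-frame log-likelihoods under the voice model and the noise model are already available, converts them to a posterior probability of voice with Bayes' rule, and thresholds that posterior to obtain sections. The function names, the prior, and the threshold are hypothetical, and the way the additional data enters the models is not shown.

```python
import numpy as np

def speech_posterior(loglik_speech, loglik_noise, prior_speech=0.5):
    """Posterior probability of voice per frame from two model log-likelihoods."""
    a = np.asarray(loglik_speech, dtype=float) + np.log(prior_speech)
    b = np.asarray(loglik_noise, dtype=float) + np.log1p(-prior_speech)
    # p = exp(a) / (exp(a) + exp(b)), written in a numerically simpler form.
    return 1.0 / (1.0 + np.exp(b - a))

def detect_sections(posteriors, threshold=0.5):
    """Return (start, end) frame index pairs, end exclusive, judged as voice."""
    flags = np.asarray(posteriors) >= threshold
    sections, start = [], None
    for i, is_voice in enumerate(flags):
        if is_voice and start is None:
            start = i                      # a voice section begins
        elif not is_voice and start is not None:
            sections.append((start, i))    # the voice section ends
            start = None
    if start is not None:
        sections.append((start, len(flags)))
    return sections
```

A fixed threshold is only one possible decision rule; smoothing over neighbouring frames could be substituted without changing the overall flow.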

The configuration and operation of the voice section detection device 4 according to the fourth embodiment have been described above. The voice section detection device 4 of the present embodiment decomposes the input voice frame into voice and noise bases and their respective activation degrees, and uses this information to obtain the posterior probability of voice or noise. Because the activation degrees largely reflect the characteristics of the speaker and of the utterance content, they are useful for detecting voice sections, and adding them to the posterior probability calculation in the posterior probability calculation unit 12 can improve voice section detection performance.

(Fifth Embodiment)
FIG. 7 is a diagram showing the configuration of the voice section detection device 5 according to the fifth embodiment. The basic configuration of the voice section detection device 5 of the fifth embodiment is the same as that of the voice section detection device 4 of the fourth embodiment. In addition to that configuration, the voice section detection device 5 has a basis learning unit 22 and a clean voice storage unit 23 that stores the clean voice used for learning in the basis learning unit 22. Here, clean voice is data consisting only of human voice acquired in a noise-free environment.

The basis learning unit 22 learns the voice basis in advance using the clean voice data stored in the clean voice storage unit 23, and can thereby obtain a highly accurate voice basis. In the fifth embodiment, when factorizing the spectral feature amount into bases and activation degrees, the activation degree calculation unit 21 uses the voice basis generated in advance from the clean voice by the basis learning unit 22. This improves the accuracy of the non-negative matrix factorization and, in turn, the accuracy of voice section detection.
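A minimal sketch of factorization with a pre-learned voice basis follows. It assumes the voice basis `W_speech` is held fixed while the noise basis and all activation degrees are estimated from the input features with Euclidean multiplicative updates. Whether the noise basis is estimated from the input or prepared separately is not stated in the description, and the function name and parameters are hypothetical.

```python
import numpy as np

def activations_with_fixed_speech_bases(V, W_speech, n_noise, n_iter=200,
                                        eps=1e-9, seed=0):
    """Estimate activation degrees for a fixed voice basis and a free noise basis.

    V: non-negative features (freq x frames).
    W_speech: pre-learned voice basis (freq x k_s); it is never modified.
    Returns (H_speech, H_noise, W_noise).
    """
    rng = np.random.default_rng(seed)
    n_freq, n_frames = V.shape
    k_s = W_speech.shape[1]
    W_noise = rng.random((n_freq, n_noise)) + eps
    H = rng.random((k_s + n_noise, n_frames)) + eps
    for _ in range(n_iter):
        W = np.hstack([W_speech, W_noise])   # copy: voice basis stays untouched
        # Update all activation degrees.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        # Update only the noise basis, using the noise rows of H.
        H_n = H[k_s:]
        W_noise *= (V @ H_n.T) / (W @ H @ H_n.T + eps)
    return H[:k_s], H[k_s:], W_noise
```

Keeping the voice columns fixed means the activation degrees of those columns measure how strongly the learned voice patterns appear in each frame.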

(Sixth Embodiment)
FIG. 8 is a diagram showing the configuration of the voice section detection device 6 according to the sixth embodiment. The basic configuration of the voice section detection device 6 of the sixth embodiment is the same as that of the voice section detection device 5 of the fifth embodiment. In addition to that configuration, the voice section detection device 6 further has a data selection unit 24 that selects clean voice data.

The basis learning unit 22 first generates a voice basis by learning from clean voice; this voice basis is based on diverse utterance content from diverse speakers. In the present embodiment, the voice basis is then learned based on the input voice frames received from the input unit 10. That is, once the activation degree calculation unit 21 has obtained the voice basis and its activation degrees from the spectral feature amount of the input voice frame, the data selection unit 24 selects clean voice whose feature amount is close to the obtained voice basis. The basis learning unit 22 then relearns the voice basis using the selected clean voice.
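The selection step can be sketched as follows. The description does not define how closeness between clean voice and the obtained voice basis is measured; this sketch assumes cosine similarity between the mean spectrum of each clean utterance and each basis vector, and keeps the best-scoring utterances for relearning. The function name and the `top_k` parameter are hypothetical.

```python
import numpy as np

def select_clean_utterances(W_speech, clean_spectra, top_k):
    """Return indices of the top_k clean utterances closest to the voice basis.

    W_speech: voice basis obtained from the input (freq x k).
    clean_spectra: list of non-negative arrays, each (freq x frames).
    Assumption: closeness = max cosine similarity to any basis vector.
    """
    norms = np.linalg.norm(W_speech, axis=0, keepdims=True) + 1e-12
    bases = W_speech / norms
    scores = []
    for S in clean_spectra:
        mean_spec = S.mean(axis=1)
        mean_spec = mean_spec / (np.linalg.norm(mean_spec) + 1e-12)
        scores.append(float(np.max(bases.T @ mean_spec)))
    order = np.argsort(scores)[::-1]          # highest similarity first
    return [int(i) for i in order[:top_k]]
```

The selected utterances would then be passed back to basis learning so that the voice basis is relearned on data resembling the current speaker and utterance content.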

The voice section detection device 6 of the present embodiment can customize the voice basis according to the characteristics of the speaker and the utterance content, and can thereby improve the accuracy of voice section detection. If the speaker's gender, the utterance content (for example, when it is limited to commands to in-vehicle equipment), or the like is known in advance, the data selection unit 24 may select clean voice based on that information and pass it to the basis learning unit 22. In this way, the voice section detection device 6 of the present embodiment has a configuration that makes it very easy to customize for the user according to the speaker's characteristics and the utterance content.

(Seventh Embodiment)
FIG. 9 is a diagram showing the configuration of the voice section detection device 7 according to the seventh embodiment. The basic configuration of the voice section detection device 7 of the seventh embodiment is the same as that of the voice section detection device 4 of the fourth embodiment. It differs in that, in addition to that configuration, it includes a noise suppression unit 25 and performs noise suppression processing using the voice basis and its activation degrees obtained by the activation degree calculation unit 21.

Combining the noise suppression unit 25 in this way makes it possible to perform noise suppression processing at relatively low cost. The output unit 14 also outputs the noise-suppressed voice data. Although the present embodiment gives an example in which the noise suppression unit 25 is added to the voice section detection device 4 of the fourth embodiment, the noise suppression unit 25 can of course also be added to the voice section detection devices 5 and 6 of the fifth and sixth embodiments.
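One common way to suppress noise from bases and activation degrees is a soft mask built from the reconstructed voice and noise components, shown below as a minimal sketch. The description does not specify the suppression rule, so the Wiener-like mask and the function name `suppress_noise` are assumptions.

```python
import numpy as np

def suppress_noise(V, W_speech, H_speech, W_noise, H_noise, eps=1e-9):
    """Apply a soft mask to the features V (freq x frames).

    The mask is the share of each time-frequency bin explained by the
    voice component; values lie between 0 and 1.
    """
    S = W_speech @ H_speech      # reconstructed voice component
    N = W_noise @ H_noise        # reconstructed noise component
    mask = S / (S + N + eps)
    return mask * V
```

Because the bases and activation degrees are already computed for voice section detection, the only extra work in this sketch is two matrix products and an element-wise division, which is in line with the low additional cost noted above.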

Although the voice section detection device of the present invention has been described in detail with reference to the embodiments, it is not limited to the embodiments described above. The above embodiments described examples in which the posterior probability calculation unit uses the acoustic likelihood L to calculate the posterior probability (first to third embodiments) and examples in which it uses the bases and activation degrees (fourth to seventh embodiments). The acoustic likelihood L may also be used in combination with the bases and activation degrees, from which a further improvement in performance can be expected.

The present invention is useful as a device for detecting voice sections in input voice frames.

1-7 Voice section detection device
10 Input unit
11 Feature amount calculation unit
12 Posterior probability calculation unit
13 Determination unit
14 Output unit
15 Voice model storage unit
16 Noise model storage unit
17 Acoustic likelihood calculation unit
18 Acoustic model
19 Vector conversion unit
21 Activation degree calculation unit
22 Basis learning unit
23 Clean voice storage unit
24 Data selection unit
25 Noise suppression unit

Claims

A voice section detection device comprising:
an input unit that inputs input voice containing voice and noise;
a feature amount calculation unit that obtains a voice feature amount from the input voice;
an acoustic likelihood calculation unit that obtains, based on an acoustic model, the likelihood that the voice feature amount is voice;
a posterior probability calculation unit that calculates, based on a voice section detection model, the posterior probability of voice or non-voice from the voice feature amount, and that also uses the likelihood calculated by the acoustic likelihood calculation unit to obtain the posterior probability; and
a determination unit that determines whether the input is voice or non-voice based on the posterior probability.

The voice section detection device according to claim 1, further comprising a dimension compression unit that compresses the dimension of the likelihood data calculated by the acoustic likelihood calculation unit,
wherein the likelihood data whose dimension has been compressed by the dimension compression unit is input to the posterior probability calculation unit.

The voice section detection device according to claim 1 or 2, further comprising a voice recognition unit that performs voice recognition using the likelihood data calculated by the acoustic likelihood calculation unit.

A voice section detection device comprising:
an input unit that inputs input voice containing voice and noise;
a feature amount calculation unit that obtains a voice feature amount from the input voice;
an activation degree calculation unit that calculates, based on a voice enhancement model, the respective activation degrees obtained when the voice feature amount is separated into voice and noise;
a posterior probability calculation unit that calculates, based on a voice section detection model, the posterior probability of voice or non-voice from the voice feature amount, and that also uses the activation degrees calculated by the activation degree calculation unit to obtain the posterior probability; and
a determination unit that determines whether the input is voice or non-voice based on the posterior probability.

The voice section detection device according to claim 4, wherein the activation degree calculation unit factorizes the voice feature amount into bases and activation degrees by non-negative matrix factorization, obtains a voice basis and a noise basis from the obtained bases, and obtains the activation degrees corresponding to the voice basis and the noise basis, respectively.

The voice section detection device according to claim 4, further comprising a basis learning unit that learns a voice basis based on clean voice,
wherein the activation degree calculation unit uses the voice basis obtained by the learning to factorize the voice feature amount into bases and activation degrees by non-negative matrix factorization, and obtains the activation degrees corresponding to the voice basis and the noise basis, respectively.

The voice section detection device according to claim 6, further comprising a data selection unit that selects clean voice based on the voice basis calculated by the activation degree calculation unit,
wherein the basis learning unit relearns the voice basis using the selected clean voice.

The voice section detection device according to any one of claims 4 to 7, further comprising a noise suppression unit that suppresses noise using the basis and activation degree data calculated by the activation degree calculation unit.

A voice section detection method for detecting a voice section with a voice section detection device, the method comprising:
a step in which the voice section detection device inputs input voice containing voice and noise;
a step in which the voice section detection device obtains a voice feature amount from the input voice;
a step in which the voice section detection device obtains, based on an acoustic model, the likelihood that the voice feature amount is voice;
a step in which the voice section detection device calculates, based on a voice section detection model, the posterior probability of voice or non-voice from the voice feature amount, also using the likelihood that the voice feature amount is voice to obtain the posterior probability; and
a step in which the voice section detection device determines whether the input is voice or non-voice based on the posterior probability.

A voice section detection method for detecting a voice section with a voice section detection device, the method comprising:
a step in which the voice section detection device inputs input voice containing voice and noise;
a step in which the voice section detection device obtains a voice feature amount from the input voice;
a step in which the voice section detection device calculates, based on a voice enhancement model, the respective activation degrees obtained when the voice feature amount is separated into voice and noise;
a step in which the voice section detection device calculates, based on a voice section detection model, the posterior probability of voice or non-voice from the voice feature amount, also using the activation degrees to obtain the posterior probability; and
a step in which the voice section detection device determines whether the input is voice or non-voice based on the posterior probability.

A program for detecting a voice section, the program causing a computer to execute:
a step of inputting input voice containing voice and noise;
a step of obtaining a voice feature amount from the input voice;
a step of obtaining, based on an acoustic model, the likelihood that the voice feature amount is voice;
a step of calculating, based on a voice section detection model, the posterior probability of voice or non-voice from the voice feature amount, also using the likelihood that the voice feature amount is voice to obtain the posterior probability; and
a step of determining whether the input is voice or non-voice based on the posterior probability.

A program for detecting a voice section, the program causing a computer to execute:
a step of inputting input voice containing voice and noise;
a step of obtaining a voice feature amount from the input voice;
a step of calculating, based on a voice enhancement model, the respective activation degrees obtained when the voice feature amount is separated into voice and noise;
a step of calculating, based on a voice section detection model, the posterior probability of voice or non-voice from the voice feature amount, also using the activation degrees to obtain the posterior probability; and
a step of determining whether the input is voice or non-voice based on the posterior probability.