JP5961530B2

JP5961530B2 - Acoustic model generation apparatus, method and program thereof

Info

Publication number: JP5961530B2
Application number: JP2012244757A
Authority: JP
Inventors: 哲小橋川; 祥子山畠; 太一浅見; 済央野本; 裕司青野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-11-06
Filing date: 2012-11-06
Publication date: 2016-08-02
Anticipated expiration: 2032-11-06
Also published as: JP2014092751A

Description

The present invention relates to an acoustic model generation apparatus, method, and program for generating an acoustic model with improved noise robustness.

In speech recognition, the influence of ambient background noise (also called environmental noise) cannot be ignored. For example, when the speaker's mouth is far from the microphone, as in a hands-free call, background noise acts as additive noise and degrades the S/N ratio of the speech. In addition, because of the spatial transfer characteristics between the speaker's mouth and the microphone, the picked-up speech carries multiplicative noise, giving it frequency characteristics different from those of speech recorded with a close-talking microphone.

To recognize speech recorded in such a real environment, an acoustic model adapted to both additive noise due to background noise and multiplicative noise due to transfer characteristics is required. A known method for generating such a noise-adapted acoustic model is the noise-adapted model generation method disclosed for the speech recognition device 950 (Patent Document 1).

FIG. 13 shows the functional configuration of the speech recognition device 950, and its method of generating a noise-adapted model is briefly described here. The part of the speech recognition device 950 that generates the noise-adapted model consists of a noise model generation unit 27, a clean speech model storage unit 28, a normalized noise model generation unit 29, a speech cepstrum average calculation unit 211, a noise adaptation unit 210, and a noise-adapted model generation unit 212. A description of its operation as a speech recognition device is omitted.

The noise model generation unit 27 generates a noise model based on the acoustic features of the audio sections determined to be noise sections. The noise model is generated as a noise HMM (Hidden Markov Model), a probabilistic model that gives the relationship between noise and acoustic features as probabilities. The clean speech model storage unit 28, on the other hand, stores HMMs created in advance for each speech unit (phoneme) from clean speech picked up in a non-noise environment from which noise has been excluded as far as possible.

The noise adaptation unit 210 reads the noise model generated by the noise model generation unit 27 and the clean speech model stored in the clean speech model storage unit 28, and combines the two to generate a noise-superimposed speech model. This noise-superimposed speech model includes the approximation shown in the following equation.

Here, S is the spectrum of the clean speech signal, N is the spectrum of the additive noise signal, and H is the transfer characteristic of the multiplicative noise.

The noise-adapted model generation unit 212 averages the model parameters of the noise-superimposed speech model generated by the noise adaptation unit 210 to obtain a model parameter average, and then normalizes the model parameters of the noise-superimposed speech model by that average to generate a normalized noise-adapted model. This normalized noise-adapted model is an acoustic model adapted to additive noise due to background noise and to multiplicative noise due to transfer characteristics. In the following description, the acoustic model is assumed to include a non-speech model and a speech model.

Patent Document 1: Japanese Patent No. 4728791

The conventional noise-adapted model can handle only the noise found in sections determined to be noise sections, and is not adapted to non-stationary noise, such as the speech of other speakers superimposed on a constantly fluctuating background. Consequently, no improvement in the speech recognition rate can be expected under the influence of such non-stationary noise. A further problem is that the accuracy of the section determination that separates speech sections from noise sections strongly affects the noise robustness of the acoustic model.

The present invention has been made in view of these problems, and its object is to provide an acoustic model generation apparatus, method, and program that learn an acoustic model with which an improved speech recognition rate can be expected even under non-stationary noise, such as background noise consisting of other people's speech.

The acoustic model generation apparatus of the present invention includes a gain adjustment unit and a non-speech acoustic model learning unit. The gain adjustment unit multiplies an input interference signal by a gain smaller than 1 to generate a pseudo non-recognition target signal with a reduced volume level. The non-speech acoustic model learning unit treats the pseudo non-recognition target signal as a non-speech signal, trains the non-speech model of a base acoustic model with it, and outputs the result.

According to the acoustic model generation apparatus of the present invention, background noise, for example the speech of another person, is taken as an interference signal, and the interference signal attenuated by multiplying its amplitude by a gain smaller than 1 is used as the pseudo non-recognition target signal. Because the non-speech model of the acoustic model is trained with this pseudo non-recognition target signal, the non-speech model becomes better adapted to non-stationary noise such as human voices. The non-speech model generated by the acoustic model generation method of the present invention can therefore improve noise robustness in non-stationary noise environments. Moreover, because the method has no step that detects speech sections and noise sections separately, the accuracy of such a step cannot affect the acoustic model.

FIG. 1 is a diagram showing an example functional configuration of the acoustic model generation apparatus 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the acoustic model generation apparatus 100.
FIG. 3 is a diagram illustrating an interference signal and a pseudo non-recognition target signal.
FIG. 4 is a diagram showing an example functional configuration of the acoustic model generation apparatus 200 of the present invention.
FIG. 5 is a diagram showing an example functional configuration of the acoustic model generation apparatus 300 of the present invention.
FIG. 6 is a conceptual diagram of a non-speech model composed of three states.
FIG. 7 is a diagram showing an example functional configuration of the acoustic model generation apparatus 400 of the present invention.
FIG. 8 is a diagram showing the concept of the non-speech loop grammar.
FIG. 9 is a diagram showing an example functional configuration of the acoustic model generation apparatus 500 of the present invention.
FIG. 10 is a diagram showing an example functional configuration of the acoustic model generation apparatus 600 of the present invention.
FIG. 11 is a diagram showing an example functional configuration of the acoustic model generation apparatuses 700 and 800 of the present invention.
FIG. 12 is a diagram showing an example functional configuration in which the embodiments of the present invention are used in combination.
FIG. 13 is a diagram showing the functional configuration of the speech recognition device 950 disclosed in Patent Document 1.

Embodiments of the present invention are described below with reference to the drawings. Identical components in multiple drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows an example functional configuration of the acoustic model generation apparatus 100 of the present invention, and FIG. 2 shows its operation flow. The acoustic model generation apparatus 100 includes a gain adjustment unit 10, a base acoustic model 20, a non-speech acoustic model learning unit 30, and a control unit 50. The acoustic model generation apparatus 100 is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program.

The gain adjustment unit 10 multiplies the interference signal by a gain smaller than 1 to generate a pseudo non-recognition target signal with a reduced volume level (step S10). FIG. 3 illustrates an interference signal and a pseudo non-recognition target signal.

The interference signal is, for example, an audio signal in which human speech is superimposed on the background noise of a crowd on a station platform; that is, an audio signal in which a non-stationary human voice has been picked up overlapping stationary background noise. The crowd noise in the background need not be present, and the signal may even be a clean recording of a human voice. In short, the essential point is that the interference signal is a non-stationary audio signal.

The pseudo non-recognition target signal is an audio signal whose volume level has been reduced by multiplying the interference signal by a gain smaller than 1. Setting the gain below 1 distinguishes the signal from the speech to be recognized and reduces the chance that speech to be recognized is itself recognized as non-speech. This raises noise robustness, and if the gain is set small, accuracy on clean speech with little noise can also be kept high. The amplitude of the pseudo non-recognition target signal is set to, for example, about −60 dB.
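The fixed-gain step S10 can be sketched as follows. This is only an illustration, not the implementation of the gain adjustment unit 10: the function names are hypothetical, the waveform is assumed to be a floating-point NumPy array, and the dB-to-amplitude conversion assumes the usual 20·log10 amplitude convention, which the text does not state.

```python
import numpy as np


def db_to_amplitude(level_db: float) -> float:
    """Convert a level in dB to a linear amplitude ratio (20*log10 convention, an assumption)."""
    return 10.0 ** (level_db / 20.0)


def apply_fixed_gain(interference, gain: float) -> np.ndarray:
    """Scale an interference waveform by a fixed gain smaller than 1.

    The result plays the role of the pseudo non-recognition target signal.
    """
    if not 0.0 < gain < 1.0:
        raise ValueError("gain must be greater than 0 and smaller than 1")
    return gain * np.asarray(interference, dtype=np.float64)


if __name__ == "__main__":
    interference = np.array([1.0, -0.5, 0.25])
    pseudo = apply_fixed_gain(interference, gain=0.1)
    print(pseudo)
```

A gain of 0.1 lowers the level by 20 dB; the signal itself is otherwise unchanged, so its non-stationary character is preserved.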

The pseudo non-recognition target signal may be stored in a pseudo non-recognition target signal DB 40 (DB stands for database), as indicated by the broken line. If the computer constituting the acoustic model generation apparatus 100 is fast enough, the pseudo non-recognition target signal DB 40 may be omitted.

The non-speech acoustic model learning unit 30 treats the pseudo non-recognition target signal as a non-speech signal and trains (adapts) the non-speech model of the base acoustic model 20 with it (step S30). Here, training a (non-)speech model means optimizing the parameters of the acoustic model so that as many examples in the training data as possible are recognized correctly. When the pseudo non-recognition target signal DB 40 is provided, the non-speech acoustic model learning unit 30 may train the non-speech model repeatedly, using ML (maximum likelihood) learning or the like, on the pseudo non-recognition target signals stored in the DB 40.

The parameters of the acoustic model can be optimized by well-known methods: methods based on maximum likelihood estimation, as well as methods based on maximum a posteriori probability or regression matrices. These methods are generally implemented by digital signal processing. In FIG. 1, the A/D conversion unit that converts the pseudo non-recognition target signal into a digital signal, the feature extraction unit that extracts its acoustic features, and similar components are commonplace; they are therefore treated as part of the non-speech acoustic model learning unit 30 and are not shown.

Using the pseudo non-recognition target signal output by the gain adjustment unit 10, the non-speech acoustic model learning unit 30 trains the non-speech model in the base acoustic model 20 and generates a non-speech model adapted to the pseudo non-recognition target signal. By being adapted to the pseudo non-recognition target signal, this non-speech model can reduce the spurious recognition results caused by non-stationary interference signals, that is, by background noise.

The gain adjustment step (step S10) and the non-speech acoustic model learning step (step S30) described above are repeated until all pseudo non-recognition target signals (files) generated from the input interference signal have been processed (No in step S50). "All pseudo non-recognition target signals" refers, for example, to a plurality of non-speech models corresponding to the lengths of the pseudo non-recognition target signals. The control unit 50 controls the repetition of steps S10 and S30.

The control units of the embodiments described below likewise control the time-series operation of each unit of the acoustic model generation apparatus and are the same as the control unit 50 of the first embodiment. Because the configuration differs from embodiment to embodiment, each control unit is given a different reference numeral in the drawings, but no further description is given.

The base acoustic model 20 includes a speech model in addition to the non-speech model. The non-speech acoustic model learning unit 30 may therefore pass the speech models other than the non-speech model through unchanged and record the result externally as the acoustic model 60; or it may record only the trained non-speech model separately and later merge the speech model of the base acoustic model with the trained non-speech model to form a single acoustic model 60. The acoustic model 60 also stores the trained non-speech model, so only its non-speech model has been trained with the pseudo non-recognition target signal.

FIG. 4 shows an example functional configuration of the acoustic model generation apparatus 200 of the present invention. The acoustic model generation apparatus 200 differs from the acoustic model generation apparatus 100 only in that the gain adjustment unit 10 is replaced with an (automatic) gain adjustment unit 201.

The gain adjustment unit 201 measures the average amplitude of the input interference signal and automatically adjusts the amplitude of that signal to a target amplitude. With the gain adjustment unit 201, the amplitude of the pseudo non-recognition target signal can be kept within a predetermined range.

Because the gain of the gain adjustment unit 10 described above is a fixed value, an input interference signal with an excessively large amplitude becomes indistinguishable from the speech signal to be recognized. Conversely, if the amplitude of the interference signal is excessively small, the amplitude of the pseudo non-recognition target signal may become too small for the non-speech acoustic model learning unit 30 to perform training. And if the amplitude of the interference signal fluctuates, the amplitude of the pseudo non-recognition target signal becomes unstable.

The gain adjustment unit 201 resolves these problems. It obtains a gain G from the average amplitude N [dB] of the interference signal and the target amplitude L [dB] (for example, −60 dB), multiplies the interference signal by G, and outputs the result as the pseudo non-recognition target signal. The gain G can be calculated by equation (2). The value of the target amplitude L [dB] may be set in the gain adjustment unit 201 in advance as a constant, or may be supplied externally as a numerical value.
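Equation (2) is not reproduced in this text. If N and L are amplitude levels in dB under the usual 20·log10 convention, a gain that moves the level from N to L would be G = 10^((L − N)/20). The sketch below uses that form as an assumption; it also assumes that the average amplitude is measured as the RMS value relative to a full scale of 1.0, which the text does not specify. All function names are hypothetical.

```python
import numpy as np


def average_level_db(signal, eps: float = 1e-12) -> float:
    """Average amplitude N in dB, measured here as RMS re full scale 1.0 (assumption)."""
    x = np.asarray(signal, dtype=np.float64)
    rms = np.sqrt(np.mean(x ** 2))
    return 20.0 * np.log10(max(rms, eps))  # eps avoids log10(0) for silent input


def auto_gain(signal, target_db: float = -60.0):
    """Return (G, scaled signal) so that the scaled signal sits at target_db.

    Uses the assumed form G = 10 ** ((L - N) / 20).
    """
    n_db = average_level_db(signal)
    gain = 10.0 ** ((target_db - n_db) / 20.0)
    return gain, gain * np.asarray(signal, dtype=np.float64)


if __name__ == "__main__":
    interference = 0.1 * np.ones(100)  # level of about -20 dB
    g, pseudo = auto_gain(interference, target_db=-60.0)
    print(g, average_level_db(pseudo))
```

This sketch does not enforce G < 1; an input already quieter than the target would receive a gain above 1, so a real implementation might clamp or reject such inputs.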

Although the target amplitude L [dB] has been described as a single fixed value, pseudo non-recognition target signals at several target amplitudes L [dB] may be prepared. Preparing a plurality of pseudo non-recognition target signals can be expected to increase robustness against noise at different volume levels.

FIG. 5 shows an example functional configuration of the acoustic model generation apparatus 300 of the present invention. The acoustic model generation apparatus 300 differs from the acoustic model generation apparatus 100 in that it includes a signal division unit 301.

The signal division unit 301 divides the pseudo non-recognition target signal output by the gain adjustment unit 10 into segments of a predetermined duration and outputs the resulting divided pseudo non-recognition target signals to the non-speech acoustic model learning unit 30. The non-speech model trained by the non-speech acoustic model learning unit 30 is often modeled as a plurality of states, each represented by a mixture distribution.

FIG. 6 shows a conceptual diagram of a non-speech model composed of three states. The example in FIG. 6 is a so-called left-to-right HMM in which three states j_1 (first state), j_2 (second state), and j_3 (third state) are arranged in sequence. Its chain of state transition probabilities consists of the self-transitions a_11, a_22, and a_33 and the transitions to the next state a_12, a_23, and a_34. Each state is a phoneme model and is composed of a Gaussian mixture distribution, M_1, M_2, and M_3, respectively.
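The left-to-right topology can be made concrete with a small transition matrix. The following is a sketch only: the function name and the self-transition probabilities are hypothetical, and a fourth column is added to represent the exit reached through a_34.

```python
import numpy as np


def left_to_right_transitions(self_loop_probs) -> np.ndarray:
    """Build the transition matrix of a left-to-right HMM.

    Row i holds the self-transition a_ii and the forward transition a_i(i+1);
    every other entry is zero. The last column is the exit state.
    """
    n = len(self_loop_probs)
    a = np.zeros((n, n + 1))
    for i, p in enumerate(self_loop_probs):
        if not 0.0 <= p < 1.0:
            raise ValueError("self-transition probability must be in [0, 1)")
        a[i, i] = p            # a_11, a_22, a_33
        a[i, i + 1] = 1.0 - p  # a_12, a_23, a_34
    return a


if __name__ == "__main__":
    # Illustrative values, not taken from the patent.
    print(left_to_right_transitions([0.6, 0.7, 0.8]))
```

Because no backward or skip transitions exist, the first and last states always align with the start and end of each training signal, which is what makes the per-state statistics sensitive to signal length.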

Depending on the length of the pseudo non-recognition target signal, the features assigned to each state of this non-speech model may become biased. Specifically, the first and last states of the non-speech model may be trained on features that appear only at the beginning and end of the pseudo non-recognition target signal. In other words, the features assigned to the first through third states change with the length of the pseudo non-recognition target signal.

Therefore, the pseudo non-recognition target signal is divided into segments of a predetermined short duration (for example, 1 second), and training the non-speech model on these divided pseudo non-recognition target signals increases the number of training passes, making the features assigned to each state less prone to bias.

Training the non-speech model on divided pseudo non-recognition target signals in this way reduces the likelihood of generating a biased non-speech model that is influenced by position within the pseudo non-recognition target signal. It also makes it possible to generate a non-speech model that is robust to fluctuating noise.

Divided pseudo non-recognition target signals shorter than a predetermined length may also be discarded. Discarding them removes signals that are too short from the training data, which can further improve the training accuracy of the non-speech model.
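The division and discard steps can be sketched as below. The function name and the sample-rate parameter are assumptions for illustration, and the minimum length is taken to equal the segment length, so only the trailing remainder is dropped.

```python
import numpy as np


def split_fixed_length(signal, sample_rate: int, segment_sec: float = 1.0):
    """Split a waveform into consecutive segments of segment_sec seconds.

    A trailing remainder shorter than one full segment is discarded.
    """
    x = np.asarray(signal)
    seg_len = int(round(segment_sec * sample_rate))
    if seg_len <= 0:
        raise ValueError("segment length must be at least one sample")
    n_full = len(x) // seg_len
    return [x[i * seg_len:(i + 1) * seg_len] for i in range(n_full)]


if __name__ == "__main__":
    pseudo = np.arange(35)  # 3.5 seconds at a toy rate of 10 samples per second
    segments = split_fixed_length(pseudo, sample_rate=10, segment_sec=1.0)
    print(len(segments), [len(s) for s in segments])
```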

FIG. 7 shows an example functional configuration of the acoustic model generation apparatus 400 of the present invention. The acoustic model generation apparatus 400 differs from the acoustic model generation apparatus 100 in that it includes a non-speech label estimation unit 401.

The non-speech label estimation unit 401 performs grammar-based speech recognition on the pseudo non-recognition target signal output by the gain adjustment unit 10, using a non-speech loop grammar that permits repetition of non-speech labels. It estimates the non-speech label that fits the pseudo non-recognition target signal, attaches that label to the signal, and outputs it.

FIG. 8 shows an example of the non-speech loop grammar and non-speech labels. The non-speech loop grammar consists of self-transitions of a pause. The number of self-transitions that maximizes the cumulative likelihood of the state transitions from start to end is output as the speech recognition result. For example, the non-speech label "pause 1" is attached to the pseudo non-recognition target signal if it matches one pause self-transition, "pause 2" if it matches two, and "pause 3" if it matches three. As the ellipsis (...) in FIG. 8 indicates, the non-speech loop grammar is not limited to one to three self-transitions and may cover more.
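The label selection alone can be sketched as follows, leaving the recognizer itself out of scope. The sketch assumes that some grammar-based recognizer (not shown and not specified by the text) has already produced a cumulative log-likelihood for each candidate number of pause repetitions; the function name, the input dictionary, and the label spelling are hypothetical.

```python
def estimate_pause_label(cumulative_loglik: dict) -> str:
    """Pick the pause repetition count with the highest cumulative log-likelihood.

    cumulative_loglik maps a repetition count (1, 2, 3, ...) to the cumulative
    log-likelihood of the path from start to end of the loop grammar.
    """
    if not cumulative_loglik:
        raise ValueError("no hypotheses given")
    best_count = max(cumulative_loglik, key=cumulative_loglik.get)
    return f"pause{best_count}"


if __name__ == "__main__":
    # Illustrative scores, not taken from the patent.
    print(estimate_pause_label({1: -120.0, 2: -95.5, 3: -101.2}))
```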

The non-speech acoustic model learning unit 30 trains the non-speech model for that non-speech label with that pseudo non-recognition target signal. This improves the training accuracy for the non-speech labels.

FIG. 9 shows an example functional configuration of the acoustic model generation apparatus 500 of the present invention. The acoustic model generation apparatus 500 differs from the acoustic model generation apparatus 100 in that it includes a learning selection unit 501.

The learning selection unit 501 performs speech recognition with the base acoustic model on the pseudo non-recognition target signal output by the gain adjustment unit 10, treating it as a non-speech signal, and selects and outputs the pseudo non-recognition target signals whose score is at or above a predetermined value. In other words, the learning selection unit 501 outputs to the non-speech acoustic model learning unit 30 those pseudo non-recognition target signals that fit the non-speech model of the base acoustic model.

Unlike actual silence, the pseudo non-recognition target signal is a signal that does not inherently fit the non-speech model. Training with it may therefore produce an abnormal non-speech model, or may fail because the signal does not fit the non-speech model.

The learning selection unit 501 therefore performs speech recognition on the pseudo non-recognition target signal using the non-speech model of the base acoustic model, and outputs the pseudo non-recognition target signals that obtain a score at or above the predetermined value to the non-speech acoustic model learning unit 30 as the non-speech signals to be learned. Providing the learning selection unit 501 prevents the generation of abnormal non-speech models and reduces training failures.

The predetermined score value is set from the mean μ and variance of the scores obtained by speech recognition of the pseudo non-recognition target signals. For example, using the standard deviation σ, pseudo non-recognition target signals with scores of μ − 2σ or lower may be excluded from training.
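The μ − 2σ rule can be sketched as a simple filter. The scores are assumed to come from recognition with the non-speech model of the base acoustic model, which is not shown; the function name and the use of the population standard deviation are assumptions.

```python
import numpy as np


def select_for_training(signals, scores, k: float = 2.0):
    """Keep the signals whose score is above mean - k * std of all scores.

    Signals scoring at or below the threshold are excluded from training.
    """
    s = np.asarray(scores, dtype=np.float64)
    threshold = s.mean() - k * s.std()  # population std (ddof=0), an assumption
    kept = [sig for sig, sc in zip(signals, s) if sc > threshold]
    return kept, threshold


if __name__ == "__main__":
    names = [f"file{i}" for i in range(10)]
    scores = [10.0] * 9 + [-100.0]  # one badly fitting signal
    kept, thr = select_for_training(names, scores)
    print(len(kept), thr)
```

With these illustrative scores the mean is −1 and the standard deviation is 33, so the threshold is −67 and only the signal scoring −100 is dropped.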

FIG. 10 shows an example functional configuration of the acoustic model generation apparatus 600 of the present invention. The acoustic model generation apparatus 600 differs from the acoustic model generation apparatus 100 in that it includes a gain setting unit 601 and a language model 602.

The language model 602 stores word chains of two or more words obtained from transcription text, together with their occurrence probabilities. It holds data that models the characteristics of the language by statistical methods and, during continuous speech recognition, assigns a linguistic likelihood to each speech recognition result candidate.

The gain setting unit 601 performs speech recognition on the speech of a development set using the language model 602 and acoustic models trained with the gain of the gain adjustment unit 10 set to each of a plurality of predetermined values, and sets, as the gain of the gain adjustment unit 10, the value among them that yields the highest likelihood or recognition rate. The development set is speech data accompanied by transcription text.

That is, the gain setting unit 601 sets the gain of the gain adjustment unit 10 to each of the predetermined values, recognizes the development set using the language model 602 and the acoustic model 60 generated by any of the acoustic model generation apparatuses 100 to 500 described above, and sets in the gain adjustment unit 10 the gain that yields the highest likelihood or recognition rate. The acoustic model generation apparatus 600 can thus optimize the gain of the gain adjustment unit 10.
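The gain search amounts to a small grid search, sketched below. The two callables are hypothetical stand-ins: `train_model` represents training an acoustic model at a given gain, and `evaluate` represents recognizing the development set with that model and the language model and returning a likelihood or recognition rate. Neither is defined by the text.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Model = TypeVar("Model")


def choose_gain(
    candidate_gains: Iterable[float],
    train_model: Callable[[float], Model],
    evaluate: Callable[[Model], float],
) -> Tuple[float, float]:
    """Return (best gain, best score) over the candidate gains.

    Higher scores are better (likelihood or recognition rate).
    """
    best_gain, best_score = None, float("-inf")
    for gain in candidate_gains:
        model = train_model(gain)  # hypothetical training step
        score = evaluate(model)    # hypothetical development-set evaluation
        if score > best_score:
            best_gain, best_score = gain, score
    if best_gain is None:
        raise ValueError("no candidate gains given")
    return best_gain, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "model" is just the gain, and the score peaks at 0.01.
    gain, score = choose_gain(
        [0.1, 0.01, 0.001],
        train_model=lambda g: g,
        evaluate=lambda m: -(m - 0.01) ** 2,
    )
    print(gain, score)
```

Each candidate requires a full training run, so the cost grows linearly with the number of candidate gains.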

FIG. 11 shows an example functional configuration of the acoustic model generation apparatus 700 of the present invention. The acoustic model generation apparatus 700 differs from the acoustic model generation apparatus 100 in that it includes base learning speech 730 and in its non-speech acoustic model learning unit 740.

The base training speech 730 is the training speech used to create the base acoustic model 20; it records the many utterances used for that purpose. The base training speech 730 may be held inside the acoustic model generation apparatus 700 as a store of recorded utterances or, as indicated by the broken line, supplied from outside.

The non-speech acoustic model learning unit 740 trains the non-speech model of the base acoustic model 20 using the pseudo non-recognition target signal as a non-speech signal, and also trains that non-speech model using the non-speech signals of the base training speech 730.

Because the acoustic model generation apparatus 700 also trains on the base training speech 730, that is, on clean non-speech data, it reduces the degree to which the non-speech model is contaminated when trained on the pseudo non-recognition target signal alone, and it can maintain accuracy on clean speech.
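The effect of pooling both data sources can be illustrated with a toy example. The sketch below assumes a single one-dimensional Gaussian as a stand-in for the non-speech model, which the text does not specify; the feature values and the name `fit_gaussian` are hypothetical and chosen only for illustration.

```python
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def fit_gaussian(frames: Sequence[float]) -> Tuple[float, float]:
    """Estimate the mean and variance of 1-D feature values.

    A single Gaussian is a toy stand-in for the non-speech model;
    it is an assumption of this sketch, not the model in the text.
    """
    return fmean(frames), pvariance(frames)


# Hypothetical feature values from the pseudo non-recognition target signal.
pseudo_frames = [0.30, 0.34, 0.32, 0.36]
# Hypothetical feature values from non-speech sections of the base training speech.
clean_nonspeech_frames = [0.02, 0.04, 0.03, 0.03]

# Training on the pseudo signal alone shifts the model entirely toward the noise.
noise_only_model = fit_gaussian(pseudo_frames)
# Pooling both sources keeps the model covering clean non-speech as well.
pooled_model = fit_gaussian(pseudo_frames + clean_nonspeech_frames)

print(noise_only_model, pooled_model)
```

The pooled estimate has a mean between the two sources and a larger variance, so clean non-speech frames remain plausible under the model.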

Speech on which the pseudo non-recognition target signal has been superimposed may be used as the training speech of the base training speech 730 of the acoustic model generation apparatus 700. That is, an acoustic model generation apparatus 800 may be configured with base training speech 830 in which the pseudo non-recognition target signal is superimposed on the base training speech. The functional configuration of the acoustic model generation apparatus 800 is the same as that of the acoustic model generation apparatus 700 (FIG. 11).

The acoustic model generation apparatus 800 improves robustness against noise superimposed on speech. Because the pseudo non-recognition target signal is superimposed on the training data for the speech model as well as for the non-speech model, the robustness of the speech model also improves.

The acoustic model may also be trained using both speech on which the pseudo non-recognition target signal is superimposed and speech on which it is not. Doing so yields an acoustic model that handles both clean speech and noise-superimposed speech, rather than noise-superimposed speech alone.
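Building a training set that contains both clean and noise-superimposed speech can be sketched as follows. This is a minimal illustration under stated assumptions: signals are plain lists of samples, the noise is looped when shorter than the utterance, and the names `superimpose` and `build_training_set` are hypothetical.

```python
from typing import List, Sequence


def superimpose(speech: Sequence[float], noise: Sequence[float], gain: float) -> List[float]:
    """Add gain-scaled noise to speech sample by sample.

    The noise is repeated cyclically if it is shorter than the speech
    (an assumption of this sketch).
    """
    if not noise:
        raise ValueError("noise must not be empty")
    if gain >= 1.0:
        # The gain is specified as a value smaller than 1.
        raise ValueError("gain must be smaller than 1")
    return [s + gain * noise[i % len(noise)] for i, s in enumerate(speech)]


def build_training_set(
    utterances: Sequence[Sequence[float]], noise: Sequence[float], gain: float
) -> List[List[float]]:
    """Return the clean utterances followed by their noise-superimposed copies."""
    clean = [list(u) for u in utterances]
    noisy = [superimpose(u, noise, gain) for u in utterances]
    return clean + noisy


utterances = [[0.5, -0.5, 0.25], [1.0, 0.0]]
noise = [0.5, -0.5]
training_set = build_training_set(utterances, noise, gain=0.5)
print(len(training_set))
```

Keeping the clean copies alongside the noisy ones is what lets a single model cover both conditions; dropping them would correspond to training on noise-superimposed speech alone.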

The embodiments described above can be used in combination. FIG. 12 shows an example configuration of an acoustic model generation apparatus 860 of the present invention that includes all of the functional components described in the embodiments.

In FIG. 12, the gain adjustment unit 10 and the non-speech acoustic model learning unit 30, drawn with solid lines, constitute the acoustic model generation apparatus 100. Adding the signal dividing unit 301 to the configuration of the acoustic model generation apparatus 100 gives the acoustic model generation apparatus 200. In the same way, all of the configurations described in the embodiments above can be combined. The apparatus that includes all the functional components of the acoustic model generation apparatuses 100 to 700 is the acoustic model generation apparatus 860. Combining the functional units provides the effect of each.

When the processing means of the above apparatuses are implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. Executing this program on the computer implements the processing means of each apparatus on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. Any computer-readable recording medium may be used, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory. Specifically, for example, a hard disk drive, flexible disk, or magnetic tape can be used as the magnetic recording device; a DVD (Digital Versatile Disc), DVD-RAM (Random Access Memory), CD-ROM (Compact Disc Read Only Memory), or CD-R (Recordable)/RW (ReWritable) as the optical disc; an MO (Magneto Optical disc) as the magneto-optical recording medium; and an EEP-ROM (Electronically Erasable and Programmable-Read Only Memory) as the semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

An acoustic model generation apparatus comprising:
a gain adjustment unit that generates a pseudo non-recognition target signal whose volume level is reduced by multiplying an input interference signal by a gain smaller than 1; and
a non-speech acoustic model learning unit that trains a non-speech model of a base acoustic model using the pseudo non-recognition target signal as a non-speech signal.

The acoustic model generation apparatus according to claim 1,
wherein the gain adjustment unit measures the average amplitude of the interference signal and automatically adjusts the amplitude of the interference signal to a target amplitude.

The acoustic model generation apparatus according to claim 1 or 2,
further comprising a signal dividing unit that generates divided pseudo non-recognition target signals by dividing the pseudo non-recognition target signal into segments of a predetermined time length, and outputs the divided pseudo non-recognition target signals to the non-speech acoustic model learning unit.

The acoustic model generation apparatus according to any one of claims 1 to 3,
further comprising a non-speech label estimation unit that performs grammar-based speech recognition on the pseudo non-recognition target signal using a non-speech loop grammar that permits repetition of non-speech labels, estimates the non-speech labels that match the pseudo non-recognition target signal, and outputs the non-speech labels and the pseudo non-recognition target signal to the non-speech acoustic model learning unit.

The acoustic model generation apparatus according to any one of claims 1 to 4,
further comprising a learning selection unit that performs speech recognition on the pseudo non-recognition target signal, treated as a non-speech signal, using the non-speech model of the base acoustic model, selects pseudo non-recognition target signals whose speech recognition result score is equal to or greater than a predetermined value, and outputs them to the non-speech acoustic model learning unit.

The acoustic model generation apparatus according to any one of claims 1 to 5,
further comprising:
a language model; and
a gain setting unit that recognizes speech of a development set using the language model and acoustic models trained with the gain of the gain adjustment unit set to each of a plurality of predetermined values, and sets, as the gain of the gain adjustment unit, the gain among the plurality of predetermined values that yielded the highest likelihood or recognition rate.

The acoustic model generation apparatus according to any one of claims 1 to 6,
further comprising a base training speech database that records the base training speech used to create the base acoustic model,
wherein the non-speech acoustic model learning unit trains the non-speech model of the base acoustic model using the pseudo non-recognition target signal as a non-speech signal, and also trains the non-speech model of the base acoustic model using non-speech signals of the base training speech.

The acoustic model generation apparatus according to claim 7,
wherein the pseudo non-recognition target signal is superimposed on the base training speech recorded in the base training speech database.

An acoustic model generation method comprising:
a gain adjustment step of generating a pseudo non-recognition target signal whose volume level is reduced by multiplying an input interference signal by a gain smaller than 1; and
a non-speech acoustic model learning step of training a non-speech model of a base acoustic model using the pseudo non-recognition target signal as a non-speech signal.

A program for causing a computer to function as the acoustic model generation apparatus according to any one of claims 1 to 8.