JP3331297B2

JP3331297B2 - Background sound / speech classification method and apparatus, and speech coding method and apparatus

Info

Publication number: JP3331297B2
Application number: JP01032697A
Authority: JP
Inventors: 正浩押切; 政巳赤嶺
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1997-01-23
Filing date: 1997-01-23
Publication date: 2002-10-07
Anticipated expiration: 2017-01-23
Also published as: JPH10207491A; JPH11117213A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a background sound/speech classification method and apparatus that determine whether an input signal belongs to a background sound section or a speech section, and to a speech coding method and apparatus.

[0002]

[Prior Art] High-efficiency, low-bit-rate coding of speech signals is an important technique for increasing channel capacity and reducing communication costs in mobile and intra-company communications. A speech signal can be divided into background sound sections, in which no speech is present, and speech sections, in which speech is present. Only the speech sections carry meaning in voice communication, so the bit rate of the background sound sections may be lowered as long as the result does not sound unnatural. Lowering the bit rate of the background sound sections lowers the overall bit rate, which further increases channel capacity and reduces communication costs.

[0003] If the background sound/speech classification fails in this scheme and, for example, a speech section is classified as a background sound section, that speech section is coded at a low bit rate and the speech is seriously degraded. Conversely, if a background sound section is classified as a speech section, the overall bit rate rises and coding efficiency falls. Establishing an accurate background sound/speech classification technique is therefore important.

[0004] Conventional background sound/speech classification methods monitor changes in the power of the signal to separate background sound sections from speech sections. For example, in J. F. Lynch Jr. et al., "Speech/Silence Segmentation for Real-time Coding via Rule Based Adaptive Endpoint Detection," Proc. ICASSP '87, pp. 31.7.1-31.7.4 (Reference 1), background sound/speech classification is performed using a speech metric and a background sound metric computed from the frame power of the input signal.

[0005] A method that classifies background sound sections and speech sections using only signal power, as above, causes no particular problem in quiet conditions where the background sound is barely audible. In that case the signal power of the speech sections is sufficiently larger than that of the background sound sections, so the speech sections are easy to identify. In practice, however, there are situations with loud background noise, and in those situations accurate background sound/speech classification cannot be achieved. Moreover, background noise is not necessarily white: some background noise, such as the sound of a moving car or train or other people's voices, has a non-flat spectrum, and conventional background sound/speech classification methods find it very difficult to classify correctly under such noise.

[0006] Meanwhile, the speech sections of a speech signal can be further divided into voiced sections, which correspond to vowels and are strongly periodic, and unvoiced sections, which correspond to consonants and are noise-like with weak periodicity. Because the signal characteristics of voiced and unvoiced sections clearly differ, choosing a coding method and bit rate suited to each makes higher quality and lower bit rates possible.

[0007] If the voiced/unvoiced classification fails in this scheme, so that a voiced section is classified as unvoiced or, conversely, an unvoiced section is classified as voiced, the sound quality is seriously degraded or the bit rate increases unnecessarily. Establishing an accurate voiced/unvoiced classification method is therefore important.

[0008] One conventional voiced/unvoiced classification method is J. P. Campbell et al., "Voiced/Unvoiced Classification of Speech with Applications to the U.S. Government LPC-10E Algorithm," Proc. ICASSP '86, vol. 1, pp. 473-476 (Reference 2). In Reference 2, several kinds of acoustic parameters of the speech are calculated, their weighted average is taken, and this value is compared with a preset threshold to perform voiced/unvoiced classification.

[0009] However, the balance between the weights applied to the acoustic parameters in the weighted average and the threshold clearly has a large effect on voiced/unvoiced classification performance, and determining the optimal weights and threshold is difficult.

[0010] Next, a conventional background sound decoding method is described. As noted above, background sound sections are coded at a very low bit rate to reduce the overall bit rate. For example, in E. Paksoy et al., "Variable Rate Speech Coding with Phonetic Segmentation," Proc. ICASSP '93, pp. II-155-158 (Reference 3), background sound is coded at a rate of only 1.0 kbps. The decoder decodes the background sound using decoding parameters represented at this low bit rate.

[0011] In such a decoding method for background sound sections, the decoding parameters are represented at a very low bit rate, so the update period of each parameter becomes long. If the update period of the gain decoding parameter becomes long, the decoder can no longer follow changes in the gain of the background sound section correctly, and the gain becomes discontinuous. When the background sound is decoded with such a gain, the gain discontinuities are harsh on the ear and the subjective quality drops sharply.

[0012]

[Problems to Be Solved by the Invention] As described above, the conventional background sound/speech classification method that uses only the power of the signal cannot classify background sound and speech accurately when loud background noise is present. In addition, appropriate classification is extremely difficult in the presence of background noise whose spectrum is not white, such as the sound of a moving car or train or other people's voices.

[0013]

[0014]

[0015] An object of the present invention is to provide a background sound/speech classification method and apparatus that can classify background sound sections and speech sections appropriately regardless of the level and nature of the background noise, and a speech coding method and apparatus using the same.

[0016]

[0017]

[0018]

[Means for Solving the Problems] To solve the above problems, the background sound/speech classification method according to the present invention calculates power and spectrum information of the input signal as features, and determines whether the input signal belongs to speech or to background sound by comparing these calculated features with estimated features consisting of the estimated power and estimated spectrum information of the background sound section. This is its basic characteristic.

[0019] More specifically, the variation in power and the variation in spectrum are analyzed by comparing the calculated features with the estimated features. When the analysis results for both the power variation and the spectrum variation indicate background sound, the input signal is determined to belong to background sound; otherwise it is determined to belong to speech. The spectral information is updated using, for example, LSP coefficients.

[0020] The conventional method, which performs background sound/speech classification using power information alone, has the problem that when the background noise power is large, low-power portions of a speech section are judged to be background sound. When spectral information is used in addition to power information, as in the present invention, the speech section can be determined accurately even where its power is low, because the spectrum of a background sound section and the spectrum of a speech section clearly differ.

[0021] In this background sound/speech classification method, it is also preferable to update the estimated features by different methods depending on whether the input signal was determined to belong to background sound or to speech, and to make the update amount used when the input signal is determined to belong to speech smaller than the update amount used when it is determined to belong to background sound. In this way, even if a speech section of the input signal continues for a long time, the estimated features are hardly affected by the features of that speech section, so the background sound can still be identified easily when the input signal changes to background sound after a long speech section.

[0022] The spectrum variation can be analyzed accurately by comparing a preset threshold with the distortion (spectral distortion) between the spectral envelope obtained from the spectrum information of the input signal and the spectral envelope obtained from the estimated spectrum information of the background sound section. This makes the background sound/speech classification more accurate.

[0023] In this case, the threshold may be varied according to the estimated power: for example, the threshold is set large when the estimated power is small and small when the estimated power is large. Misjudgments caused by the spectrum variation changing with the estimated power then become less frequent, and background sound and speech can be classified still more accurately.
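The power-dependent threshold described above could be sketched as follows. This is only an illustration: the function name `spectral_threshold`, the breakpoints `p_lo` and `p_hi`, the two threshold levels, and the log-linear interpolation between them are all assumptions; the text states only that the threshold should be larger for small estimated power and smaller for large estimated power.

```python
import math


def spectral_threshold(p_e: float,
                       t_low_power: float = 0.30,
                       t_high_power: float = 0.10,
                       p_lo: float = 1e2,
                       p_hi: float = 1e6) -> float:
    """Return a spectral-distortion threshold for estimated background power p_e.

    Hypothetical mapping: a large threshold below p_lo, a small threshold
    above p_hi, and log-linear interpolation in between.
    """
    if p_e <= p_lo:
        return t_low_power
    if p_e >= p_hi:
        return t_high_power
    frac = (math.log10(p_e) - math.log10(p_lo)) / (math.log10(p_hi) - math.log10(p_lo))
    return t_low_power + frac * (t_high_power - t_low_power)
```

Any monotonically decreasing mapping would satisfy the description equally well; a two-level step function is the simplest alternative.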

[0024] Furthermore, in the present invention, when the result of determining whether the input signal belongs to speech or to background sound changes from speech to background sound, the determination result may be forcibly changed to speech for a specific period (called the hangover period). In this case, the hangover period is varied using the estimated power and estimated spectrum information of the background sound section. For example, by setting a longer hangover time when the estimated frame power is large, or when the spectral power of the formants of the spectral envelope obtained from the estimated spectrum information is large, clipping of word endings is avoided when the background sound power is large or when the background sound spectrum is not white.
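A minimal sketch of the hangover mechanism, assuming per-frame decisions coded as 0 (background sound) and 1 (speech) as in the embodiments. The names `apply_hangover` and `hangover_length`, the frame counts, and the power reference `p_ref` are hypothetical; the formant-based criterion mentioned in the text is not modeled here.

```python
from typing import List, Sequence


def hangover_length(p_e: float, p_ref: float = 1e4,
                    short: int = 2, long: int = 8) -> int:
    """Hypothetical rule: use a longer hangover when the estimated background power is large."""
    return long if p_e >= p_ref else short


def apply_hangover(decisions: Sequence[int], hangover_frames: int) -> List[int]:
    """Force the output to speech (1) for hangover_frames frames after each
    speech-to-background transition in the raw decisions."""
    out: List[int] = []
    remaining = 0
    prev_raw = 0
    for d in decisions:
        if d == 1:
            out.append(1)
        else:
            if prev_raw == 1:
                # Raw decision just changed from speech to background.
                remaining = hangover_frames
            if remaining > 0:
                out.append(1)
                remaining -= 1
            else:
                out.append(0)
        prev_raw = d
    return out
```

In a full system the hangover length would be re-evaluated from the current background estimate at each transition rather than fixed for the whole sequence.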

[0025]

[0026]

[0027]

[0028]

[0029]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings. (First Embodiment) FIG. 1 shows the configuration of a background sound/speech classification apparatus according to a first embodiment of the present invention. In the figure, a speech signal that has been picked up by, for example, a microphone and digitized is fed to input terminal 101 as the input signal, frame by frame, with each frame consisting of a plurality of samples. In this embodiment, one frame is 240 samples.

[0030] This input signal is supplied to the feature calculation unit 102, which calculates various features that characterize the input signal. This embodiment describes the case where the calculated features are the frame power p_s, which is power information, and the LSP coefficients {ω_s(i), i = 1, ..., NP}, which are spectral information.

[0031] FIG. 2 shows the configuration of the feature calculation unit 102. For the input signal s(n) from input terminal 201, the frame power calculation unit 202 calculates the frame power p_s, which is output from output terminal 205. This calculated frame power p_s is defined by the following equation.

[0032]

[Equation 1]

[0033] Here, N denotes the frame length.

[0034] The input signal s(n) is also supplied to the LPC coefficient analysis unit 203. The LPC coefficient analysis unit 203 obtains LPC coefficients using an existing technique such as the autocorrelation method. The LPC coefficients obtained in this way are passed to the LPC coefficient conversion unit 204, converted into LSP coefficients {ω_s(i), i = 1, ..., NP}, and then output from output terminal 206.
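The feature calculation can be sketched in plain Python as below. Two assumptions are made: the frame power is taken as the mean square of the frame (the exact form of Equation 1 is not reproduced in this text), and the LPC coefficients are obtained with the standard autocorrelation method and Levinson-Durbin recursion, without windowing. The conversion from LPC to LSP coefficients is omitted. All function names are illustrative.

```python
from typing import List, Sequence


def frame_power(s: Sequence[float]) -> float:
    """Mean-square power of one frame (an assumed form of Equation 1)."""
    return sum(x * x for x in s) / len(s)


def autocorrelation(s: Sequence[float], max_lag: int) -> List[float]:
    """Autocorrelation r(0..max_lag) of one frame."""
    n = len(s)
    return [sum(s[i] * s[i + k] for i in range(n - k)) for k in range(max_lag + 1)]


def levinson_durbin(r: Sequence[float], order: int) -> List[float]:
    """Return predictor coefficients a[1..order] with s(n) ~ sum_k a[k] * s(n-k)."""
    a = [0.0] * (order + 1)
    err = r[0]
    for m in range(1, order + 1):
        if err <= 0.0:
            break  # silent or degenerate frame: keep the remaining coefficients at zero
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / err  # reflection coefficient of stage m
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        err *= 1.0 - k_m * k_m
    return a[1:]
```

For a 240-sample frame `s`, `levinson_durbin(autocorrelation(s, NP), NP)` would give the NP coefficients that the conversion unit 204 then turns into LSP coefficients.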

[0035] The calculated frame power p_s and calculated LSP coefficients {ω_s(i), i = 1, ..., NP} obtained by the feature calculation unit 102 are supplied to the background sound/speech determination unit 103. At the same time, the estimated frame power p_e and estimated LSP coefficients {ω_e(i), i = 1, ..., NP} obtained by the estimated feature update unit 104 are also supplied to the background sound/speech determination unit 103. Based on this information, the background sound/speech determination unit 103 determines whether the input signal s(n) is background sound or speech, and the determination result is output to output terminal 105.

[0036] After the background sound/speech determination unit 103 has made the background sound/speech determination for a given frame in this way, the estimated feature update unit 104 prepares for the next frame by updating the estimated frame power p_e and estimated LSP coefficients {ω_e(i), i = 1, ..., NP} using the calculated frame power p_s and calculated LSP coefficients {ω_s(i), i = 1, ..., NP} obtained by the feature calculation unit 102.

[0037] The background sound/speech determination unit 103 and the estimated feature update unit 104 are described in more detail below. The function of the background sound/speech determination unit 103 can be expressed as a function that takes as input the calculated frame power p_s, the calculated LSP coefficients {ω_s(i), i = 1, ..., NP}, the estimated frame power p_e, and the estimated LSP coefficients {ω_e(i), i = 1, ..., NP}, and outputs as the determination result either the background sound determination signal "0" or the speech determination signal "1".

[0038]

$$c = F\bigl(p_s,\ \omega_s(i),\ p_e,\ \omega_e(i)\bigr) \tag{2}$$

Here, F is a function that returns "0" when the frame is judged to be background sound and "1" when it is judged to be speech.

[0039] This function F is explained with a concrete example. The function F is realized by the following procedure. First the variation in frame power is analyzed, and then the variation in the LSP coefficients is analyzed. Finally, only when the analysis results for both the frame power variation and the LSP coefficient variation indicate background sound is the frame judged to be background sound and "0" returned; otherwise the frame is judged to be speech and "1" is returned.

[0040] FIG. 3 shows the configuration of the background sound/speech determination unit. The calculated frame power p_s is input from input terminal 301, the calculated LSP coefficients {ω_s(i), i = 1, ..., NP} from input terminal 302, the estimated frame power p_e from input terminal 303, and the estimated LSP coefficients {ω_e(i), i = 1, ..., NP} from input terminal 304. The frame power variation calculation unit 305 uses the calculated frame power p_s and the estimated frame power p_e to make a background sound/speech decision based on the frame power variation.

[0041] Next, the spectrum variation calculation unit 306 uses the input calculated LSP coefficients {ω_s(i), i = 1, ..., NP} and estimated LSP coefficients {ω_e(i), i = 1, ..., NP} to make a background sound/speech decision based on the spectrum variation. The decision unit 307 then combines the result from the frame power variation calculation unit 305 with the result from the spectrum variation calculation unit 306: if both indicate background sound, it outputs background sound as the final determination result from output terminal 308; otherwise it outputs speech.

[0042] Next, the analysis of the frame power variation is described. The frame power variation is analyzed according to the following inequality. When the inequality holds, the frame is judged to be background sound as far as the power information is concerned. Conversely, when it does not hold, the frame is judged to be speech.

$$p_s - x \cdot p_e < 0 \tag{3}$$

Here, x is a predetermined positive constant. By comparing the estimated frame power p_e multiplied by x with the calculated frame power p_s of the current frame, a frame whose power is at least x times the estimated frame power is judged to be speech. This avoids misjudging a frame that is actually background sound as speech and gives a stable determination.

[0043] If x is varied adaptively according to the magnitude of the calculated frame power p_s, the determination can be made adequately even when the background sound power is large and a correct determination would otherwise be difficult. Specifically, setting x small when the calculated frame power p_s is large and setting x large when the frame power p_s is small reduces misjudgments, so x should be adapted in this way.
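The power test of equation (3), together with one possible adaptation of x, might look like the sketch below. The two-level rule in `adaptive_x` and its constants are assumptions; the text specifies only the direction of the adaptation (smaller x for large p_s, larger x for small p_s).

```python
def power_says_background(p_s: float, p_e: float, x: float) -> bool:
    """Equation (3): the frame looks like background sound if p_s - x * p_e < 0."""
    return p_s - x * p_e < 0.0


def adaptive_x(p_s: float, x_small: float = 1.5, x_large: float = 4.0,
               p_ref: float = 1e4) -> float:
    """Hypothetical two-level adaptation of x based on the calculated frame power."""
    return x_small if p_s >= p_ref else x_large


def power_test_adaptive(p_s: float, p_e: float) -> bool:
    """Equation (3) evaluated with x chosen from the current frame power."""
    return power_says_background(p_s, p_e, adaptive_x(p_s))
```

A smooth mapping from p_s to x would also fit the description; the step function is used here only to keep the sketch short.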

[0044] The variation of the LSP coefficients is defined as the Euclidean distance between the LSP coefficients and is evaluated according to the following expression. When the expression holds, the frame is judged to be background sound as far as the spectral information is concerned. Conversely, when it does not hold, the frame is judged to be speech.

[0045]

[Equation 2]

[0046] T_f is a preset threshold.
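Putting the power test and the spectral test together gives a sketch of the function F of equation (2). Because the exact form of Equation 2 is not reproduced in this text, the spectral test below assumes that the squared Euclidean distance between the LSP vectors is compared with T_f; the default values of `x` and `t_f` are placeholders, not values from the patent.

```python
from typing import Sequence


def lsp_distance_sq(w_s: Sequence[float], w_e: Sequence[float]) -> float:
    """Squared Euclidean distance between calculated and estimated LSP coefficients."""
    return sum((a - b) ** 2 for a, b in zip(w_s, w_e))


def classify(p_s: float, w_s: Sequence[float],
             p_e: float, w_e: Sequence[float],
             x: float = 2.0, t_f: float = 0.05) -> int:
    """Return 0 (background sound) only if both tests indicate background, else 1 (speech)."""
    power_bg = p_s - x * p_e < 0.0                 # equation (3)
    spectrum_bg = lsp_distance_sq(w_s, w_e) < t_f  # assumed form of the spectral test
    return 0 if (power_bg and spectrum_bg) else 1
```

The AND combination is what lets a low-power speech frame still be labeled speech: it passes the power test as background but fails the spectral test.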

[0047] The frame power variation and the LSP coefficient variation are evaluated in this way. When both variations indicate background sound, the background sound/speech determination unit 103 outputs "0", the determination signal representing background sound, as the background sound/speech determination result. In all other cases, that is, when at least one of the frame power variation and the LSP coefficient variation indicates speech, the background sound/speech determination unit 103 outputs "1", the determination signal representing speech.

[0048] The estimated feature update unit 104 updates the estimated features in preparation for the input of the next frame. Of the estimated features, the estimated frame power p_e is updated according to the following equation.

[0049]

$$p_e^{\mathrm{new}} = (1-\beta)\,p_s + \beta\,p_e \qquad (0 \le \beta \le 1) \tag{5}$$

Here, p_e^new denotes the estimated frame power used for the next frame, and β is a predetermined constant.

[0050] The estimated LSP coefficients {ω_e(i), i = 1, ..., NP} are updated in the same way, according to the following equation.

[0051]

$$\omega_e^{\mathrm{new}}(i) = (1-\gamma)\,\omega_s(i) + \gamma\,\omega_e(i) \qquad (0 \le \gamma \le 1) \tag{6}$$

Here, ω_e^new(i) denotes the estimated LSP coefficients used for the next frame, and γ is a predetermined constant.
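Equations (5) and (6) are first-order recursive averages, which can be written directly as below. The function name and the default values of `beta` and `gamma` are illustrative; the text requires only that both lie between 0 and 1.

```python
from typing import List, Sequence, Tuple


def update_estimates(p_s: float, w_s: Sequence[float],
                     p_e: float, w_e: Sequence[float],
                     beta: float = 0.9, gamma: float = 0.9) -> Tuple[float, List[float]]:
    """Return the estimated frame power and LSP coefficients for the next frame."""
    p_new = (1.0 - beta) * p_s + beta * p_e                                  # equation (5)
    w_new = [(1.0 - gamma) * ws + gamma * we for ws, we in zip(w_s, w_e)]    # equation (6)
    return p_new, w_new
```

Values of β and γ close to 1 make the estimate change slowly; a value of exactly 1 leaves it unchanged.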

[0052] Next, the flow of processing in this embodiment is described using the flowchart shown in FIG. 4. First, the signal input frame by frame is analyzed and the features are calculated (step S10). Next, the calculated features of the current frame are compared with the estimated features obtained while processing the previous frame, and it is determined whether the input signal belongs to background sound or to speech (step S11). Finally, the estimated features are updated using the calculated features obtained for the current frame, in preparation for the input of the next frame (step S12). The difference from the prior art is that, as described above, power information such as the frame power and spectral information such as the LSP coefficients are used together as the calculated and estimated features.
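The three steps of FIG. 4 form a simple per-frame loop. The sketch below passes the feature extraction, decision, and update steps in as functions so that only the control flow is shown; the toy stand-ins in the usage section use power only and are purely illustrative, not the patent's feature set.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

Features = Tuple[float, List[float]]  # (frame power, LSP coefficients)


def run_classifier(frames: Iterable[Sequence[float]],
                   extract: Callable[[Sequence[float]], Features],
                   decide: Callable[[Features, Features], int],
                   update: Callable[[Features, Features], Features],
                   initial: Features) -> List[int]:
    """Run steps S10 to S12 over a sequence of frames and return the decisions."""
    estimate = initial
    labels: List[int] = []
    for frame in frames:
        feats = extract(frame)                  # S10: calculate features
        labels.append(decide(feats, estimate))  # S11: compare with the estimate
        estimate = update(feats, estimate)      # S12: update the estimate
    return labels


# Usage with toy stand-ins (power only, empty LSP vectors).
toy_extract = lambda f: (sum(v * v for v in f) / len(f), [])
toy_decide = lambda feats, est: 0 if feats[0] < 2.0 * est[0] else 1
toy_update = lambda feats, est: (0.1 * feats[0] + 0.9 * est[0], [])
toy_labels = run_classifier([[1, 1], [1, 1], [10, 10], [1, 1]],
                            toy_extract, toy_decide, toy_update, (1.0, []))
```

Note that the decision for a frame always uses the estimate produced from earlier frames, because the update happens after the decision.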

[0053] The effect of this embodiment is explained using FIG. 5. If background sound/speech determination is performed on an input signal such as that shown in FIG. 5(a) using power information alone, low-power portions of the speech section are judged to be background sound when the background noise power is large, as shown in FIG. 5(b).

[0054] In contrast, when spectral information is used in addition to power information, as in this embodiment, the speech section can be determined accurately, as shown in FIG. 5(c), even where its power is low, because the spectrum of a background sound section and the spectrum of a speech section clearly differ.

[0055] (Second Embodiment) FIG. 6 shows the configuration of a background sound/speech classification apparatus according to a second embodiment of the present invention. In FIG. 6, components identical to those in FIG. 1 are given the same reference numerals, and their detailed description is omitted. This embodiment differs from the first embodiment in how the estimated feature update unit 104 is realized.

[0056] Specifically, in this embodiment the update method used in the estimated feature update unit 104 is switched according to the determination result of the background sound/speech determination unit 103. In this case, the estimated frame power p_e is updated according to the following equations.

$$p_e^{\mathrm{new}} = (1-\beta_0)\,p_s + \beta_0\,p_e \tag{7}$$

$$p_e^{\mathrm{new}} = (1-\beta_1)\,p_s + \beta_1\,p_e \tag{8}$$

Equation (7) is the update used when the background sound/speech determination unit 103 has determined background sound, and equation (8) is the update used when it has determined speech. Here β_0 and β_1 are assumed to satisfy 0 ≤ β_0 < β_1 ≤ 1.

[0057] Similarly, the estimated LSP coefficients {ω_e(i), i = 1, ..., NP} are updated according to the following two equations.

[0058]

$$\omega_e^{\mathrm{new}}(i) = (1-\gamma_0)\,\omega_s(i) + \gamma_0\,\omega_e(i) \tag{9}$$

$$\omega_e^{\mathrm{new}}(i) = (1-\gamma_1)\,\omega_s(i) + \gamma_1\,\omega_e(i) \tag{10}$$

Equation (9) is the update used when the background sound/speech determination unit 103 has determined background sound, and equation (10) is the update used when it has determined speech. Here γ_0 and γ_1 are assumed to satisfy 0 ≤ γ_0 < γ_1 ≤ 1.
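The decision-dependent update of equations (7) to (10) can be sketched by indexing a pair of constants with the decision value (0 for background sound, 1 for speech). The function name and the default constants are assumptions; they merely satisfy the stated constraints β_0 < β_1 and γ_0 < γ_1.

```python
from typing import List, Sequence, Tuple


def update_estimates_switched(decision: int,
                              p_s: float, w_s: Sequence[float],
                              p_e: float, w_e: Sequence[float],
                              beta: Tuple[float, float] = (0.90, 0.99),
                              gamma: Tuple[float, float] = (0.90, 0.99)
                              ) -> Tuple[float, List[float]]:
    """Update the estimates with constants selected by the decision.

    decision == 0 selects (beta_0, gamma_0), equations (7) and (9);
    decision == 1 selects (beta_1, gamma_1), equations (8) and (10).
    """
    b = beta[decision]
    g = gamma[decision]
    p_new = (1.0 - b) * p_s + b * p_e
    w_new = [(1.0 - g) * ws + g * we for ws, we in zip(w_s, w_e)]
    return p_new, w_new
```

Because β_1 and γ_1 are closer to 1, a frame judged to be speech moves the estimates less than a frame judged to be background sound.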

[0059] The above processing is summarized in the flowchart shown in FIG. 7. Steps S20 and S21 in FIG. 7 are identical to steps S10 and S11 in FIG. 4, so their description is omitted here. Step S22 receives the determination result of step S21: if background sound was determined in step S21, processing proceeds to step S23; if speech was determined, it proceeds to step S24. In step S23, the estimated features are updated using the update method for frames determined to be background sound, in preparation for the next input frame. In step S24, the estimated features are updated using the update method for frames determined to be speech, in preparation for the next input frame.

[0060] The advantage of this embodiment can be explained with reference to FIG. 8. If the same update method is always used regardless of the result of the background sound/speech determination unit 103, then when an input signal with a long speech section is given, as shown in FIG. 8(a), the estimated feature amounts are strongly influenced by the feature amounts of the speech section. As a result, as shown in FIG. 8(b), even when the input signal of FIG. 8(a) changes from the speech section to background sound, the estimated feature amounts have already become similar to those of the speech section. That is, they carry spectral information different from that of the background sound, which makes it very difficult to identify the background sound.

[0061] In contrast, in this embodiment, as shown in FIG. 8(c), the update method for the estimated feature amounts differs between the background sound section and the speech section, and the update amount in the speech section is set small, so the estimated feature amounts are hardly affected by the feature amounts of the speech section. Therefore, even when an input signal that changes to background sound after a long speech section is given, the background sound can still be identified, and a more accurate background sound/speech determination can be realized.

[0062] (Third Embodiment) A background sound/speech classification device according to a third embodiment of the present invention will be described with reference to FIG. 9. This embodiment is characterized by the way the background sound/speech determination unit 103 is realized, and it is applicable to either of the configurations shown in FIG. 1 and FIG. 6 described above.

[0063] FIG. 9 is a block diagram showing a configuration by which the spectrum variation calculation unit 306 in the background sound/speech determination unit 103 shown in FIG. 3 obtains the variation between the calculated LSP coefficients {ω_s(i), i = 1, …, NP} and the estimated LSP coefficients {ω_e(i), i = 1, …, NP}. In the preceding embodiments, the variation of the LSP coefficients is obtained as the Euclidean distance between the LSP coefficients, as defined by equation (4). In this embodiment, by contrast, the LSP coefficients are converted into spectral envelopes, the spectral distortion between the envelopes is obtained, and this spectral distortion is compared with a predetermined threshold to perform the background sound/speech determination.

[0064] The Euclidean distance between LSP coefficients as defined by equation (4) does not always correspond to the actual spectral variation between the calculated LSP coefficients {ω_s(i), i = 1, …, NP} and the estimated LSP coefficients {ω_e(i), i = 1, …, NP}. This stems from the nature of LSP coefficients: although the LSP coefficients correspond to the frequencies of the peaks of the spectral envelope, the Euclidean distance between LSP coefficients does not correspond uniquely to the amount of spectral variation, and this hinders accurate background sound/speech determination.

[0065] To improve on this point, this embodiment obtains the spectral envelopes of the calculated LSP coefficients {ω_s(i), i = 1, …, NP} and of the estimated LSP coefficients {ω_e(i), i = 1, …, NP}, and performs the background sound/speech determination based on the spectral distortion between them. This yields an accurate measure of spectral variation and enables a more accurate background sound/speech determination.

[0066] In more detail, referring to FIG. 9, the calculated LSP coefficients {ω_s(i), i = 1, …, NP} are input from the input terminal 401 and given to the LSP conversion unit 402. The LSP conversion unit 402 converts the calculated LSP coefficients into LPC coefficients {α_s(i), i = 1, …, NP} using existing techniques. Similarly, the estimated LSP coefficients {ω_e(i), i = 1, …, NP} are input from the input terminal 403 and given to the LSP conversion unit 404, which converts them into the estimated LPC coefficients {α_e(i), i = 1, …, NP}. The spectral distortion calculation unit 405 calculates the spectral distortion SD, defined as the squared error in the logarithmic domain between the spectral envelope of the synthesis filter formed from the calculated LPC coefficients and the spectral envelope of the synthesis filter formed from the estimated LPC coefficients, by the following equation.

[0067]

(Equation 3)

[0068] M denotes the resolution on the frequency axis of the spectral envelope; the larger M is set, the more accurately the spectral distortion can be obtained. Equation (11) defines the spectral distortion for frequencies sampled at equal intervals, but the step size can also be made non-uniform. For example, when spectral variation in the low band is important, the step size can be set small in the low band and large in the high band, which avoids an increase in the amount of computation while still yielding an accurate measure of spectral variation.
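Equation (11) itself is not reproduced in the text above, so the following is only a sketch of one common form of log-domain spectral distortion, not necessarily the exact expression of the patent. It assumes the LPC polynomial convention A(z) = 1 + Σ α(i) z^(-i), a uniform grid of M frequencies on [0, π), and a mean of squared log10 differences; the function names are hypothetical.

```python
import cmath
import math

def log_envelope(alpha, omega):
    """log10 of the squared magnitude of the synthesis filter 1/A(z) at omega.

    Assumes A(z) = 1 + sum_i alpha[i-1] * z**(-i); this sign convention is an
    assumption. A(z) must not have a zero exactly at the evaluated frequency.
    """
    a = 1.0 + sum(c * cmath.exp(-1j * omega * (i + 1))
                  for i, c in enumerate(alpha))
    return -math.log10(abs(a) ** 2)

def spectral_distortion(alpha_s, alpha_e, M=64):
    """Mean squared difference of the two log envelopes on M uniform points."""
    total = 0.0
    for m in range(M):
        omega = math.pi * m / M
        d = log_envelope(alpha_s, omega) - log_envelope(alpha_e, omega)
        total += d * d
    return total / M
```

A non-uniform step size, as mentioned above, would correspond to replacing the uniform grid `math.pi * m / M` with a list of frequencies that is denser in the low band.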

[0069] The spectral distortion SD obtained according to equation (11) is given to the spectral distortion determination unit 406. The spectral distortion determination unit 406 compares SD with a predetermined threshold T_sd and outputs to the output terminal 407 a determination result of background sound if the following inequality holds, and speech if it does not.

$$SD < T_{sd} \tag{12}$$

(Fourth Embodiment) A background sound/speech classification device according to a fourth embodiment of the present invention will be described with reference to FIG. 10. This embodiment is characterized by another way of realizing the background sound/speech determination unit 103. In FIG. 10, components identical to those in FIG. 9 are given the same reference numerals and their description is omitted. This embodiment differs from the third embodiment in that the magnitude of the threshold T_sd is switched adaptively depending on the estimated frame power p_e.

[0070] That is, the estimated frame power p_e is supplied from the input terminal 408 and input to the spectral distortion determination unit 406. The spectral distortion determination unit 406 selects one threshold from a plurality of thresholds prepared in advance, according to the estimated frame power p_e, and compares it with the spectral distortion SD as shown in the following expression.

$$SD < T_{sd}(j) \quad (j = 1, \dots, NT;\ j \text{ is determined by } p_e) \tag{13}$$

NT denotes the number of thresholds prepared in advance. It is effective to set a large threshold when the estimated frame power p_e is small and, conversely, a small threshold when p_e is large.
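The threshold selection of expression (13) can be sketched as a simple table lookup. The power boundaries and threshold values below are hypothetical and chosen only for illustration; the description states only that NT thresholds are prepared in advance and that smaller estimated frame power should map to a larger threshold.

```python
import bisect

# Hypothetical boundaries on the estimated frame power p_e and NT = 3 thresholds.
POWER_EDGES = [1e3, 1e5]
THRESHOLDS = [0.30, 0.15, 0.05]  # T_sd(j): large for small p_e, small for large p_e

def select_threshold(p_e):
    """Pick T_sd(j), where j is determined by the estimated frame power p_e."""
    j = bisect.bisect_right(POWER_EDGES, p_e)
    return THRESHOLDS[j]

def classify(sd, p_e):
    """Return 'background' if SD < T_sd(j) holds, otherwise 'speech'."""
    return "background" if sd < select_threshold(p_e) else "speech"
```

With these assumed values, the same distortion of 0.2 is classified as background sound at low estimated power and as speech at high estimated power.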

[0071] If a fixed threshold is always used regardless of the magnitude of the estimated frame power, the following problems arise. If the spectral distortion threshold is set to suit a small estimated frame power, that is, if the threshold is large, then when a signal with a large estimated frame power is input, the spectral variation between the speech section and the background sound section is small, so even a speech section is determined to be background sound. Conversely, if the threshold is set to suit a large estimated frame power, that is, if the threshold is small, then when a signal with a small estimated frame power is input, the spectral variation between the speech section and the background sound section is large, so even a background sound section is determined to be speech.

[0072] In contrast, in this embodiment, adaptively selecting the threshold according to the estimated frame power, as described above, avoids these problems and realizes accurate background sound/speech classification.

[0073] (Fifth Embodiment) Next, a background sound/speech classification device according to a fifth embodiment of the present invention will be described with reference to FIG. 11. In FIG. 11, components identical to those in FIG. 6 are given the same reference numerals and their description is omitted. This embodiment differs from the second embodiment in that a hangover processing unit 106 is added to the configuration of FIG. 6. The hangover processing unit 106 monitors the result determined by the background sound/speech determination unit 103 and, when the determination result changes from speech to background sound, forcibly changes the determination result from background sound to speech for a period of a predetermined number of frames (called the hangover period).

[0074] In general, background sound/speech classification is prone to misjudging a speech section as a background section at the end of a sentence (word endings). This is because the power of speech often becomes small at word endings, so that it differs little from the power of the background sound. To avoid this problem, this embodiment uses the hangover processing unit 106 to treat the several frames following the point where the determination result changes from speech to background sound as a speech section when outputting the determination result. In addition, this embodiment is characterized in that the hangover period changes adaptively according to the estimated power information and the estimated spectrum information.

[0075] The hangover processing unit 106 will now be described in detail with reference to FIG. 12. In FIG. 12, the determination result of the background sound/speech determination unit 103 is input from the terminal 501. As described above, the determination signal is "0" for background sound and "1" for speech. When the value of the counter 507 is "0", the switch 510 is connected to the terminal 508 and the determination result is output unchanged from the output terminal 511. Normally, the value of the counter 507 is "0".

[0076] The change detection unit 504 monitors the determination result input from the input terminal 501 and turns on the switch 506 when the result changes from speech to background sound (that is, "1" → "0"). At all other times the switch 506 is off. When the switch 506 is turned on, the estimated frame power p_e at that time is input from the input terminal 502, and the estimated LSP coefficients {ω_e(i), i = 1, …, NP} are input from the input terminal 503.

[0077] The hangover time calculation unit 505 calculates the hangover time using the estimated frame power p_e and the estimated LSP coefficients {ω_e(i), i = 1, …, NP}, and gives the value to the counter 507 via the switch 506 as the counter value. When the counter value is greater than "0", the counter 507 connects the switch 510 to the terminal 509 so that "1" is output from the output terminal 511, making the determination result speech. The counter 507 is decremented by one each time a determination result is input from the input terminal 501. The counter value is never allowed to fall below "0": if it becomes negative, it is replaced with "0".
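The counter-based behavior described for FIG. 12 can be sketched as a small state machine. The class name `Hangover` and the exact ordering of the load, output, and decrement operations within a frame are assumptions made for illustration; the description does not fix that ordering.

```python
class Hangover:
    """Sketch of hangover processing: 0 = background sound, 1 = speech."""

    def __init__(self):
        self.counter = 0   # corresponds to the counter 507; normally 0
        self.prev = 0      # previous raw decision, used for change detection

    def step(self, decision, hangover_frames):
        # Change detection: load the counter on a speech -> background change.
        if self.prev == 1 and decision == 0:
            self.counter = hangover_frames
        # While the counter is positive, force the output to speech ("1").
        out = 1 if self.counter > 0 else decision
        # Decrement once per input decision, never going below zero.
        self.counter = max(self.counter - 1, 0)
        self.prev = decision
        return out
```

Under these assumptions, feeding the decisions 1, 1, 0, 0, 0, 0 with a hangover time of 2 frames yields 1, 1, 1, 1, 0, 0: the first two background frames after the speech section are still reported as speech.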

[0078] The hangover time calculation unit 505 calculates the hangover time HO according to either of the following equations (14) and (15).

$$HO = HO_p + HO_{LSP} \tag{14}$$

$$HO = \mathrm{Max}(HO_p, HO_{LSP}) \tag{15}$$

Here HO_p is the hangover time calculated from the estimated frame power p_e, and HO_LSP is the hangover time calculated from the estimated LSP coefficients. Max() is a function that returns the maximum value.

[0079] HO_p can be determined by selecting one of a plurality of hangover times prepared in advance, according to the value of the estimated frame power p_e. HO_LSP is determined by selecting one of a plurality of prepared hangover times according to the magnitude of the peak of the spectral envelope represented by the estimated LSP coefficients {ω_e(i), i = 1, …, NP}. The index fd representing the magnitude of the peak of the spectral envelope is defined by the following equation.

[0080]

(Equation 4)

[0081] According to equation (16), when adjacent estimated LSP coefficients are close to each other, that is, when the peak of the spectral envelope is large, the index fd takes a large value and a correspondingly long hangover time HO_LSP is selected. Conversely, when the index fd takes a small value, a short hangover time HO_LSP is selected.
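The selection of HO_p and HO_LSP and their combination by equation (14) or (15) can be sketched as follows. The index fd is taken as an already computed input, because equation (16) is not reproduced in the text. All table boundaries and hangover values are hypothetical, as is the function name.

```python
import bisect

# Hypothetical lookup tables (hangover times in frames).
P_EDGES, HO_P_TABLE = [1e3, 1e5], [2, 4, 8]        # indexed by p_e
FD_EDGES, HO_LSP_TABLE = [5.0, 20.0], [0, 3, 6]    # indexed by fd

def hangover_time(p_e, fd, combine="sum"):
    """Select HO_p and HO_LSP from the tables and combine them."""
    ho_p = HO_P_TABLE[bisect.bisect_right(P_EDGES, p_e)]
    ho_lsp = HO_LSP_TABLE[bisect.bisect_right(FD_EDGES, fd)]
    if combine == "sum":
        return ho_p + ho_lsp       # equation (14)
    return max(ho_p, ho_lsp)       # equation (15)
```

Both tables are monotonically increasing, so a larger estimated frame power or a larger fd selects a longer hangover time, in line with the behavior described above.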

[0082] Adaptively lengthening and shortening the hangover time according to the estimated frame power and the estimated LSP coefficients, as in this embodiment, has the following advantages. As described above, power often drops at word endings. Therefore, when the power of the background sound (that is, the estimated frame power) is large, word endings are easily cut off, and the cut-off extends over a long time. Also, when another person is talking in the background, or when sounds such as a passing car or train occur, peaks arise in the spectral envelope of the background sound, and if this envelope resembles that of the actual speaker, speech may be erroneously determined to be background sound.

[0083] In such cases, that is, when the estimated frame power is large or when the peak of the spectral envelope represented by the estimated LSP coefficients is large, it is effective to set a long hangover time.

[0084] The flow of processing in this embodiment will be described using the flowchart of FIG. 13. Steps S30, S31, S34, S35, and S36 in FIG. 13 are the same as steps S20, S21, S22, S23, and S24 in FIG. 7, respectively, so their description is omitted here.

[0085] After step S31 determines whether the input signal belongs to background sound or speech, step S32 judges whether the condition for applying the hangover processing unit is satisfied. If the result in step S32 is Yes, the hangover processing unit is applied in step S33 and the process proceeds to step S34. If the result in step S32 is No, the process proceeds directly to step S34.

[0086] (Sixth Embodiment) Next, a voiced/unvoiced classification device will be described as a sixth embodiment of the present invention with reference to FIG. 14.

[0087] A signal is input from the input terminal 601 and given to the acoustic parameter calculation unit 602. The acoustic parameter calculation unit 602 calculates M (M ≥ 1) kinds of acoustic parameters, which are feature amounts of speech. Examples of the calculated acoustic parameters include signal power, signal power after division into subbands, the first-order PARCOR coefficient, LPC prediction gain, and pitch prediction gain.

[0088] The acoustic parameters obtained by the acoustic parameter calculation unit 602 are given to the unvoiced appearance probability calculation unit 603 and the voiced appearance probability calculation unit 606. The unvoiced appearance probability tables 604 and 605 and the voiced appearance probability tables 607 and 608 describe unvoiced and voiced appearance probabilities in association with speech feature amounts. Specifically, they are created in advance by manually labeling real speech data as voiced or unvoiced and using the labeling results.

[0089] The unvoiced appearance probability calculation unit 603 has M unvoiced appearance probability tables 604, 605, corresponding to the number of kinds of acoustic parameters. By looking up each table with the corresponding acoustic parameter as the key, it obtains the unvoiced probability {φ_U(m), m = 1, …, M} of each acoustic parameter.

[0090] Similarly, the voiced appearance probability calculation unit 606 has M voiced appearance probability tables 607, 608, corresponding to the number of kinds of acoustic parameters. By looking up each table with the corresponding acoustic parameter as the key, it obtains the voiced probability {φ_V(m), m = 1, …, M} of each acoustic parameter.

[0091] The voiced/unvoiced determination unit 609 uses the unvoiced probabilities {φ_U(m), m = 1, …, M} obtained by the unvoiced appearance probability calculation unit 603 and the voiced probabilities {φ_V(m), m = 1, …, M} obtained by the voiced appearance probability calculation unit 606 to determine whether the input signal is voiced or unvoiced, and outputs the result from the output terminal 610. The voiced/unvoiced determination unit 609 determines unvoiced when the following expression holds and voiced when it does not.

[0092]

(Equation 5)

[0093] Alternatively, the following condition may be used for the voiced/unvoiced determination.

$$\varphi_U(m) > \varphi_V(m) \quad (\text{for all } m) \tag{18}$$

The signal is determined to be unvoiced when this condition is satisfied and voiced when it is not. With this condition, voiced is more likely to be determined. It is thus desirable to use a determination condition suited to the field of application.
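Condition (18) can be sketched directly. The function names are hypothetical, and the probabilities are assumed to have already been looked up from the appearance probability tables, one pair per acoustic parameter.

```python
def is_unvoiced_all(phi_u, phi_v):
    """Condition (18): unvoiced only if phi_U(m) > phi_V(m) for every m."""
    return all(u > v for u, v in zip(phi_u, phi_v))

def decide(phi_u, phi_v):
    """Return 'unvoiced' when condition (18) holds, otherwise 'voiced'."""
    return "unvoiced" if is_unvoiced_all(phi_u, phi_v) else "voiced"
```

Since a single parameter with φ_U(m) ≤ φ_V(m) is enough to reject the unvoiced decision, this condition leans toward voiced, which matches the remark above.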

[0094] According to this embodiment, the most probable voice quality is judged using appearance probability tables created by manually labeling real speech data as voiced or unvoiced. This avoids the problem of conventional methods, in which classification performance depends on empirically chosen weights and thresholds, and realizes stable and accurate voiced/unvoiced determination.

[0095] (Seventh Embodiment) Next, a seventh embodiment of the present invention will be described with reference to FIG. 15. In this embodiment, the background sound/speech classification device described with reference to FIG. 11 is applied to speech coding. In FIG. 15, parts identical to those in FIG. 11 are given the same reference numerals and their description is omitted. In FIG. 15, a signal picked up by, for example, a microphone and then digitized is input to the input terminal 701 sequentially, frame by frame, with a plurality of samples forming one frame.

[0096] In this embodiment, one frame consists of 240 samples.

[0097] The input signal from the input terminal 701 is input to the background sound/speech classification device 702 shown in FIG. 11. Based on the result determined by the background sound/speech determination unit 103 in the background sound/speech classification device 702, the switch 703 is controlled and the coding method for the input signal is switched.

[0098] That is, when the determination result is background sound, the input signal is given to the background sound coding unit 704. When the determination result is speech, the input signal is given to the speech coding unit 705. The background sound coding unit 704 encodes by a method suited to background sound, and the speech coding unit 705 likewise encodes by a method suited to speech, so that information can be compressed efficiently. The coding parameters obtained in this way are output from the output terminal 708 via the multiplexer 707.

[0099] (Eighth Embodiment) Next, an eighth embodiment of the present invention will be described with reference to FIG. 16. In this embodiment, the background sound/speech classification device described with reference to FIG. 9 and the voiced/unvoiced classification device described with reference to FIG. 14 are applied to speech coding. In FIG. 16, parts identical to those in FIG. 15 are given the same reference numerals and their description is omitted.

[0100] A signal input from the input terminal 801 is first given to the background sound/speech classification device 802. As described above, the background sound/speech classification device 802 determines whether the input signal is background sound or speech. The determination result is sent to the selector 804. If the signal is determined to be background sound, the input signal is given to the background sound coding unit 806 and encoded without executing the processing of the voiced/unvoiced classification device 803. If the signal is determined to be speech, the input signal is given to the voiced/unvoiced classification device 803, which performs the voiced/unvoiced determination according to the procedure described above.

[0101] The result of the voiced/unvoiced classification device 803 is then given to the selector 804. When the signal is determined to be unvoiced, the input signal is given to the unvoiced sound coding unit 808 and encoded. Conversely, when the signal is determined to be voiced, the input signal is given to the voiced sound coding unit 809 and encoded.
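The two-stage routing of this embodiment can be sketched as follows. The function name and the use of callables for the two classifiers are assumptions for illustration; the point shown is that the voiced/unvoiced classification is skipped entirely for background sound frames.

```python
def select_encoder(frame, is_background, is_unvoiced):
    """Choose the coding unit for a frame.

    is_background and is_unvoiced are callables standing in for the
    classification devices 802 and 803 (hypothetical interfaces).
    """
    if is_background(frame):
        return "background"   # coding unit 806; device 803 is not run
    return "unvoiced" if is_unvoiced(frame) else "voiced"   # units 808 / 809
```

Passing the classifiers as callables makes the short-circuit explicit: `is_unvoiced` is only invoked for frames that the first stage has classified as speech.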

[0102] Here, the background sound coding unit 806, the unvoiced sound coding unit 808, and the voiced sound coding unit 809 are each configured as coding units suited to background sound, unvoiced sound, and voiced sound, respectively, so efficient coding can be realized. The coding parameters obtained in this way are output from the output terminal 812 via the multiplexer 811.

[0103] (Ninth Embodiment) Next, a ninth embodiment of the present invention will be described with reference to FIG. 17. This embodiment relates to the realization of a background sound decoding device. Coded data input from the input terminal 901 is decoded by the demultiplexer 902 to obtain decoding parameters. In this embodiment there are three kinds of decoding parameters: the decoding drive signal parameter, the decoding gain parameter, and the decoding synthesis filter parameter. Separately from these, a background sound/speech determination signal is output from the demultiplexer 902.

[0104] The decoding parameters are routed by the switch 903, which is switched by the background sound/speech determination signal: in background sound sections they are input to the background sound decoding unit 904, and in speech sections they are input to the speech decoding unit 905. Since the speech decoding unit 905 is not related to the gist of the present invention, only the background sound decoding unit 904 is described here.

[0105] In the background sound decoding unit 904, the decoded drive signal parameters from the multiplexer 902 are supplied to the drive signal decoding unit 906, which obtains the drive signal c(n). Similarly, the gain decoding parameter is supplied to the gain decoding unit 907, which decodes the gain g. The gain g is supplied to the gain smoothing unit 908, which produces a gain corrected (smoothed) so that it changes smoothly. The synthesis filter decoding parameter is supplied to the synthesis filter decoding unit 910, which determines the characteristics of the synthesis filter 911. The drive signal c(n) and the smoothed gain are multiplied by the multiplier 909, and the product is supplied to the synthesis filter 911. The synthesis filter 911 generates a synthesized signal e(n) by filtering, and this signal is output from the output terminal 913 via the switch 912, which is controlled by the background sound/speech determination signal. In speech sections, the synthesized signal obtained in the same manner by the speech decoding unit 905 is output from the output terminal 913 via the switch 912.
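The signal path of the background sound decoding unit (drive signal, multiplication by the smoothed gain, synthesis filtering) can be sketched as follows. This is only an illustration under stated assumptions: the text above does not give the structure of the synthesis filter, so an all-pole filter is assumed here, and the names `synthesize_background`, `drive`, `smoothed_gains`, and `lpc_coeffs` are hypothetical.

```python
def synthesize_background(drive, smoothed_gains, lpc_coeffs):
    """Scale the drive signal c(n) by the smoothed gain and filter the result.

    Assumption (not stated in the text): the synthesis filter is all-pole,
        e(n) = x(n) - sum_k a_k * e(n - k),   x(n) = gs(n) * c(n),
    with coefficients a_1..a_p given in lpc_coeffs. Filter memory from
    earlier frames is not modelled; the filter starts from zero state.
    """
    if len(drive) != len(smoothed_gains):
        raise ValueError("drive and smoothed_gains must have the same length")
    output = []
    for n, (c_n, g_n) in enumerate(zip(drive, smoothed_gains)):
        e_n = g_n * c_n  # multiplier stage: drive signal times smoothed gain
        for k, a_k in enumerate(lpc_coeffs, start=1):
            if n - k >= 0:
                e_n -= a_k * output[n - k]  # feedback from earlier output samples
        output.append(e_n)
    return output
```

With an empty coefficient list the function reduces to the multiplier stage alone, which makes the effect of the gain easy to inspect in isolation.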

[0106] Next, the gain smoothing unit 908 is described. Gain smoothing in the gain smoothing unit 908 is performed according to the following equation:

$$g_s(n) = (1 - \xi)\, g + \xi\, g_s(n-1), \qquad 0 \le \xi \le 1 \tag{19}$$

Here, g is the decoded gain, gs(n) is the gain after smoothing, and n is the sample position. ξ is a constant that controls the degree of smoothing.
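As a minimal sketch of equation (19), the recursion can be written as below. The function name `smooth_gain`, the default value of `xi`, and the initial value of the smoothed gain are assumptions made for illustration; the text does not specify them. The sketch also assumes that one decoded gain value is supplied per sample position n.

```python
def smooth_gain(decoded_gains, xi=0.9, initial=0.0):
    """Apply gs(n) = (1 - xi) * g + xi * gs(n - 1) to a sequence of decoded gains.

    xi = 0 disables smoothing (the output equals the input);
    xi = 1 freezes the output at the initial value.
    """
    if not 0.0 <= xi <= 1.0:
        raise ValueError("xi must lie in [0, 1]")
    smoothed = []
    previous = initial  # gs(n - 1); the starting value is a hypothetical choice
    for g in decoded_gains:
        previous = (1.0 - xi) * g + xi * previous
        smoothed.append(previous)
    return smoothed
```

A larger `xi` gives a smoother but slower-reacting gain, which is the trade-off that a fixed constant cannot avoid.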

[0107] Smoothing the gain in this way makes the gain change smoothly, which has the advantage of improving the subjective quality of the background sound.

[0108] (Tenth Embodiment) Next, a background sound decoding device according to a tenth embodiment of the present invention is described. Since this embodiment has the same configuration as FIG. 17, the description refers to FIG. 17. This embodiment is characterized by the processing of the gain smoothing unit 908.

[0109] In the gain smoothing described in the ninth embodiment, a fixed constant ξ is always used. In FIG. 18, the broken line shows the transition of the decoded gain g, and the solid line shows the transition of the smoothed gain gs(n). As FIG. 18 shows, the smoothed gain gs(n) clearly changes more smoothly than the decoded gain g, but even after the decoded gain g becomes small, it takes time for gs(n) to become small. As a result, there are portions (hatched areas) where the gain is unnecessarily large, which degrades the subjective quality.

[0110] In contrast, the present embodiment performs gain smoothing according to the following procedure. Specifically, for example, when the decoded gain g changes in the increasing direction, smoothing is performed so that the gain increases gradually; conversely, when the decoded gain g changes in the decreasing direction, smoothing is performed so that the gain decreases rapidly. This is expressed by the following equations.

[0111]

$$g_s(n) = (1 - \xi_{\mathrm{UP}})\, g + \xi_{\mathrm{UP}}\, g_s(n-1) \qquad \text{when } g > g_s(n-1) \tag{20}$$

$$g_s(n) = (1 - \xi_{\mathrm{DOWN}})\, g + \xi_{\mathrm{DOWN}}\, g_s(n-1) \qquad \text{when } g \le g_s(n-1) \tag{21}$$

where $0 \le \xi_{\mathrm{DOWN}} < \xi_{\mathrm{UP}} \le 1$.

The effect of the smoothing of this embodiment is described with reference to FIG. 19. As FIG. 19 shows, the decoded gain g is compared with the previous smoothed gain gs(n-1), and the smoothed gain gs(n) is determined so that the smaller of the two has the stronger influence. This eliminates the phenomenon seen in FIG. 18, in which the smoothed gain gs(n) keeps trailing a large value long after the decoded gain g has started to decrease (the hatched area is reduced). Therefore, this embodiment has the advantage that the gain changes smoothly while an unnecessary increase in gain is avoided, further improving the subjective quality.
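The direction-dependent smoothing of equations (20) and (21) can be sketched as follows. The function name, the default coefficient values, and the initial smoothed gain are hypothetical choices for illustration; only the constraint 0 ≤ ξ_DOWN < ξ_UP ≤ 1 and the two update rules come from the text.

```python
def smooth_gain_asymmetric(decoded_gains, xi_up=0.95, xi_down=0.5, initial=0.0):
    """Smooth a gain sequence slowly on the way up and quickly on the way down.

    Equation (20) is used when g > gs(n - 1), equation (21) otherwise.
    """
    if not 0.0 <= xi_down < xi_up <= 1.0:
        raise ValueError("require 0 <= xi_down < xi_up <= 1")
    smoothed = []
    previous = initial  # gs(n - 1); the starting value is an assumption
    for g in decoded_gains:
        # Pick the coefficient from the direction of change of the decoded gain.
        xi = xi_up if g > previous else xi_down
        previous = (1.0 - xi) * g + xi * previous
        smoothed.append(previous)
    return smoothed
```

Because the weight on the previous value is large when the gain rises and small when it falls, the output always leans toward the smaller of g and gs(n-1), which is what removes the lingering high gain described for FIG. 18.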

【０１１２】[0112]

[Effects of the Invention] As described above, according to the background sound/speech classification method of the present invention, information on the power and spectrum of the input signal is calculated as a feature amount, and this calculated feature amount is compared with an estimated feature amount consisting of information on the estimated power and estimated spectrum of the background sound section to determine whether the input signal belongs to speech or to background sound. Because the spectra of background sound sections and speech sections clearly differ, a speech section can be determined accurately even when the background noise power is large and the speech power is relatively small.
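The decision rule summarized above (a frame is background sound only when both the power analysis and the spectrum analysis say so) can be illustrated with a short sketch. The way each variation is measured here is an assumption: a power ratio against the estimated background power and a precomputed spectral distortion value, each compared with a hypothetical threshold. All names and numeric defaults are illustrative only.

```python
def classify_frame(power, estimated_power, spectral_distortion,
                   power_ratio_threshold=2.0, distortion_threshold=2.0):
    """Return "background" only if both power and spectrum look like background.

    power_ratio_threshold and distortion_threshold are hypothetical values.
    """
    # Power analysis: the frame power stays close to the estimated background power.
    power_is_background = power <= power_ratio_threshold * estimated_power
    # Spectrum analysis: the envelope stays close to the estimated background envelope.
    spectrum_is_background = spectral_distortion <= distortion_threshold
    if power_is_background and spectrum_is_background:
        return "background"
    return "speech"
```

The case the paragraph highlights is the third call below: quiet speech over loud noise passes the power check, yet the spectral check still labels it as speech.

```python
classify_frame(100.0, 100.0, 0.5)   # both checks agree with background
classify_frame(900.0, 100.0, 0.5)   # power differs from the background estimate
classify_frame(110.0, 100.0, 5.0)   # similar power, clearly different spectrum
```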

[0113] In this case, the estimated feature amount may be updated by different methods depending on whether the input signal is determined to belong to background sound or to speech, with the update amount used when the input signal is determined to belong to background sound made smaller than the update amount used when it is determined to belong to speech. Then, even if a speech section of the input signal continues for a long time, the estimated feature amount is hardly affected by the feature amounts of the speech section, and the background sound can be determined correctly even for an input signal that changes to background sound after a long speech section.

[0114] In addition, the amount of spectral variation is analyzed by comparing a threshold with the value of the spectral distortion between the spectral envelope obtained from the spectrum information of the input signal and the spectral envelope obtained from the estimated spectrum information of the background sound section. This enables accurate analysis and therefore more accurate background sound/speech classification. Furthermore, if the threshold is set large when the estimated power is small and set small when the estimated power is large, misjudgments caused by changes in the amount of spectral variation that accompany changes in the estimated power are reduced, and background sound/speech classification becomes still more accurate.
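A power-dependent threshold of this kind can be sketched as below. The text only states the direction of the dependence (larger threshold for small estimated power, smaller threshold for large estimated power); the two-level switch, the boundary value, and both threshold values are assumptions made for illustration.

```python
def spectral_threshold(estimated_power, power_boundary=1000.0,
                       threshold_low_power=3.0, threshold_high_power=1.5):
    """Pick the distortion threshold from the estimated background power.

    All three numeric defaults are hypothetical.
    """
    if estimated_power < power_boundary:
        return threshold_low_power   # small estimated power: large threshold
    return threshold_high_power      # large estimated power: small threshold


def spectrum_indicates_background(distortion, estimated_power):
    """True when the spectral distortion stays below the adaptive threshold."""
    return distortion < spectral_threshold(estimated_power)
```

The same distortion value can therefore be accepted as background at low estimated power and rejected at high estimated power.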

[0115] Furthermore, hangover processing is performed in which, when the result of determining whether the input signal belongs to speech or to background sound changes from speech to background sound, the determination result is forcibly changed to speech. The hangover time is set using the information on the estimated power and estimated spectrum of the background sound section: for example, it is set long when the estimated frame power is large, or when the spectral power of a formant of the spectral envelope obtained from the estimated spectrum information is large. This avoids the clipping of word endings that occurs when the power of the background sound is large or when the spectrum of the background sound is not white.
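The hangover processing can be sketched as follows, using only the estimated frame power to choose the hangover length. The two hangover lengths, the power boundary, and the function names are hypothetical; the text states only that the hangover time is made longer when the estimated power (or the formant power of the estimated envelope) is large. Decisions are represented as booleans, with `True` meaning speech.

```python
def hangover_length(estimated_power, power_boundary=1000.0,
                    short_frames=1, long_frames=3):
    """Choose a longer hangover when the estimated background power is large."""
    return long_frames if estimated_power >= power_boundary else short_frames


def apply_hangover(raw_decisions, estimated_powers):
    """Force 'speech' for a few frames after each speech-to-background change."""
    corrected = []
    remaining = 0          # hangover frames still to be forced to speech
    previous_raw = False   # assume the stream starts in background sound
    for raw, power in zip(raw_decisions, estimated_powers):
        if raw:
            remaining = 0
            corrected.append(True)
        else:
            if previous_raw:
                # Raw decision just changed from speech to background.
                remaining = hangover_length(power)
            if remaining > 0:
                corrected.append(True)   # override: keep labelling as speech
                remaining -= 1
            else:
                corrected.append(False)
        previous_raw = raw
    return corrected
```

With quiet background the override lasts one frame, and with loud background it lasts three, which protects weak word endings that would otherwise be masked by the noise.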

【０１１６】[0116]

【０１１７】[0117]

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an apparatus to which a background sound/speech classification method according to an embodiment of the present invention is applied.

FIG. 2 is a block diagram showing the configuration of the feature amount calculation unit in the embodiment.

FIG. 3 is a block diagram showing the configuration of the background sound/speech determination unit in the embodiment.

FIG. 4 is a flowchart showing an outline of the processing procedure of the embodiment.

FIG. 5 is a diagram for explaining the effects of the embodiment.

FIG. 6 is a block diagram showing the configuration of an apparatus to which a background sound/speech classification method according to another embodiment of the present invention is applied.

FIG. 7 is a flowchart showing the processing procedure of the embodiment.

FIG. 8 is a diagram for explaining the effects of the embodiment.

FIG. 9 is a block diagram showing the configuration of the spectrum variation amount calculation unit in the embodiment.

FIG. 10 is a block diagram showing another configuration of the spectrum variation amount calculation unit in the embodiment.

FIG. 11 is a block diagram showing the configuration of an apparatus to which a background sound/speech classification method according to yet another embodiment of the present invention is applied.

FIG. 12 is a block diagram showing the configuration of the hangover processing unit in the embodiment.

FIG. 13 is a flowchart showing the processing procedure of the embodiment.

FIG. 14 is a block diagram showing the configuration of an apparatus to which a voiced/unvoiced classification method according to an embodiment of the present invention is applied.

FIG. 15 is a block diagram showing the configuration of a speech encoding apparatus to which a background sound/speech classification method according to an embodiment of the present invention is applied.

FIG. 16 is a block diagram showing the configuration of a speech encoding apparatus to which a background sound/speech classification method and a voiced/unvoiced classification method according to an embodiment of the present invention are applied.

FIG. 17 is a block diagram showing the configuration of a speech decoding apparatus for explaining a background sound decoding method according to an embodiment of the present invention.

FIG. 18 is a diagram for explaining the effects of the embodiment.

FIG. 19 is a diagram for explaining another effect of the embodiment.

[Explanation of symbols]

102: feature amount calculation unit
103: background sound/voiced determination unit
104: estimated feature amount update unit
106: hangover processing unit
602: acoustic parameter calculation unit
603: unvoiced appearance probability calculation unit
604, 605: unvoiced appearance probability tables
606: voiced appearance probability calculation unit
607, 608: voiced appearance probability tables
609: voiced/unvoiced determination unit
702: background sound/voiced classification device
802: background sound/voiced classification device
803: voiced/unvoiced classification device
904: background sound decoding unit
905: speech decoding unit
906: drive signal decoding unit
907: gain decoding unit
908: gain smoothing unit
909: multiplier
910: synthesis filter decoding unit
911: synthesis filter

Continuation of the front page
(51) Int.Cl.⁷, identification symbol, FI: H03M 7/30; G10L 3/00 515D
(56) References: JP S62-238599 A; JP S59-170894 A; JP H07-181991 A; JP H09-6397 A; JP S61-25200 A; JP H01-502779 A (published Japanese translation of a PCT application)
(58) Fields searched (Int.Cl.⁷, DB name): G10L 11/02; G10L 15/04

Claims

(57) [Claims]

1. A background sound/speech classification method in which information on the power and spectrum of an input signal is calculated as a feature amount; the calculated feature amount is compared with an estimated feature amount consisting of information on the estimated power and estimated spectrum of a background sound section to analyze the amounts of variation in power and spectrum; and the input signal is determined to belong to background sound when the analysis results for both the power and the spectrum variation indicate background sound, and to belong to speech otherwise, the method being characterized in that the amount of variation of the spectrum is analyzed by comparing a preset threshold with the value of the distortion between a spectral envelope obtained from the spectrum information of the input signal and a spectral envelope obtained from the estimated spectrum information of the background sound section, and in that the threshold is changed according to the information on the estimated power.

2. The background sound/speech classification method according to claim 1, characterized in that the estimated feature amount is updated by different methods depending on whether the input signal is determined to belong to background sound or to speech, and the update amount used when the input signal is determined to belong to background sound is made smaller than the update amount used when the input signal is determined to belong to speech.

3. The background sound/speech classification method according to claim 1, characterized in that, when the result of determining whether the input signal belongs to speech or to background sound changes from speech to background sound, the determination result is forcibly changed to speech for a specific period, and the specific period is changed using the information on the estimated power and estimated spectrum of the background sound section.

4. A background sound/speech classification apparatus comprising: a feature amount calculation unit that calculates information on the power and spectrum of an input signal as a feature amount; and a determination unit that compares the calculated feature amount with an estimated feature amount consisting of information on the estimated power and estimated spectrum of a background sound section to analyze the amounts of variation in power and spectrum, and that determines that the input signal belongs to background sound when the analysis results for both the power and the spectrum variation indicate background sound, and that it belongs to speech otherwise, the apparatus being characterized in that the determination unit analyzes the amount of variation of the spectrum by comparing a preset threshold with the value of the distortion between a spectral envelope obtained from the spectrum information of the input signal and a spectral envelope obtained from the estimated spectrum information of the background sound section, and changes the threshold according to the information on the estimated power.

5. The background sound/speech classification apparatus according to claim 4, further comprising an estimated feature amount update unit that updates the estimated feature amount by different methods depending on whether the input signal is determined to belong to background sound or to speech, and that makes the update amount used when the input signal is determined to belong to background sound smaller than the update amount used when the input signal is determined to belong to speech.

6. The background sound/speech classification apparatus according to claim 4, further comprising a hangover processing unit that, when the result of determining whether the input signal belongs to speech or to background sound changes from speech to background sound, forcibly changes the determination result to speech for a specific period, and that changes the specific period using the information on the estimated power and estimated spectrum of the background sound section.

7. A speech encoding method characterized in that encoding is performed by different encoding methods, each suited to its case, depending on whether the input signal is determined to be background sound or speech by the background sound/speech classification method according to any one of claims 1 to 3.

8. A speech encoding apparatus comprising: the background sound/speech classification apparatus according to any one of claims 4 to 6; a background sound encoding unit that encodes the input signal by an encoding method suited to background sound when the background sound/speech classification apparatus determines that the input signal is background sound; and a speech encoding unit that encodes the input signal by an encoding method suited to speech when the background sound/speech classification apparatus determines that the input signal is speech.