JP2003131686A

JP2003131686A - Method and device to estimate mixture ratio of voice and music and audio device using the same

Info

Publication number: JP2003131686A
Application number: JP2001330154A
Authority: JP
Inventors: Kaoru Watanabe; 馨渡辺; Tomoyasu Komori; 智康小森
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2001-10-29
Filing date: 2001-10-29
Publication date: 2003-05-09
Anticipated expiration: 2021-10-29
Also published as: JP3933909B2

Abstract

PROBLEM TO BE SOLVED: To effectively emphasize the narration sound in an audio signal so that it is easy to hear. SOLUTION: In the voice/music mixture ratio estimation method, feature data of voice and music signals are learned in advance from pure narration voice and pure music and stored. The mixture ratio of narration voice and music in an audio signal is then estimated using the stored feature data and the feature data of the audio signal being broadcast.

Description

Detailed Description of the Invention

[0001]

Technical field of the invention: The present invention relates to a method and apparatus for estimating the mixture ratio of voice and music, and to an audio apparatus using the same. More particularly, it relates to a technique that is effective when applied to estimating the voice/music mixture ratio of an audio signal based on feature data of voice and music learned in advance.

[0002]

Prior art: Conventional broadcast audio technology performs only voice/music section discrimination based on feature data of voice signals and music signals, and does not estimate the mixture ratio. Conventional "voice/music section determination" makes only a binary (0/1) decision between voice and music: the portion "from this time to this time" is a voice section, and the portion "from this time to this time" is a music section. In other words, in the prior art a portion judged to be a voice section can be regarded as 100% voice and 0% music.

[0003]

Problem to be solved by the invention: Because the conventional technique cannot estimate the mixture ratio, it has been difficult to apply it to a device that adaptively emphasizes narration sound to make broadcast audio easier to hear. Even when a program is mixed at an appropriate ratio, hearing characteristics (sensitivity) differ from person to person; for example, many people find high-frequency sounds harder to hear as they age. As a result, background music (BGM: background sound) can make the narration voice hard to hear, and it is desirable to make narration easier to hear by adapting to the individual listener. An object of the present invention is to provide a technique that effectively emphasizes the narration sound of an audio signal and makes it easy to hear. The above and other objects and novel features of the present invention will become apparent from the description in this specification and the accompanying drawings.

[0004]

Means for solving the problem: A brief outline of representative inventions among those disclosed in the present application is as follows.

[0005] The first invention is a voice/music mixture ratio estimation method in which feature data of voice signals and music signals are learned in advance from pure narration voice and music and stored, and the mixture ratio of narration voice and music in an audio signal is estimated using the stored feature data and the feature data of the audio signal being broadcast.

[0006] The second invention is a method for making the narration sound of an audio signal easier to hear, in which the method of controlling the balance between narration sound and sound effects is switched according to the mixture ratio estimated by the mixture ratio estimation method of the first invention.

[0007] The third invention is a voice/music mixture ratio estimation apparatus comprising: learned feature data storage means for storing feature data of voice signals and music signals learned in advance from pure narration voice and music; voice/music feature data extraction means for extracting voice and music feature data from the audio signal being broadcast; and voice/music mixture ratio estimation means that receives the feature data from the learned feature data storage means and the feature data of the broadcast audio signal from the voice/music feature data extraction means, and estimates the mixture ratio of narration voice and music in the audio signal.

[0008] The fourth invention is an audio apparatus comprising mixture ratio estimation adaptive voice enhancement means that switches the method of controlling the balance between narration sound and sound effects according to the mixture ratio estimated by the voice/music mixture ratio estimation apparatus of the third invention.

[0009] That is, the key point of the present invention is as follows. To estimate the mixture ratio of the narration voice component and the BGM music component contained in an audio signal, feature data of voice signals and music signals are learned in advance from pure narration voice and music. The voice/music feature data contained in the input audio signal are extracted (calculated), and a voice estimation probability PS and a music estimation probability PA are obtained from these data and the learned feature data.

[0010] The mixture ratio of voice and music is estimated from the ratio of the voice estimation probability PS to the music estimation probability PA obtained above. By switching the method of controlling the balance between narration sound and sound effects according to this estimate, an easy-to-hear sound in which the narration sound of the audio signal is effectively emphasized can be provided.

[0011] Here, pure voice is audio that contains only a so-called speech signal, and pure music is an instrument sound signal that contains no speech signal. Because pure voice and music are used only for learning, a person can check them in advance before use. "Data" here means, for example, a collection of individual vectors stored in a memory or the like, and "vector" means an individual vector itself.

[0012] According to the present invention, the mixture ratio of narration sound and BGM music contained in an audio signal can be estimated using feature data of voice signals and music signals learned in advance and the feature data of the audio signal being broadcast.

[0013] Furthermore, by switching the method of controlling the balance between narration sound and sound effects according to the estimated mixture ratio, a system in which the narration sound of the audio signal is easy to hear can be constructed.

[0014] The present invention is described in detail below, together with embodiments (examples) of the invention, with reference to the drawings.

[0015]

Embodiments of the invention: FIG. 1 is a block diagram showing the schematic configuration of an audio apparatus using a voice/music mixture ratio estimation apparatus according to one embodiment of the present invention.

[0016] As shown in FIG. 1, the audio apparatus using the voice/music mixture ratio estimation apparatus of this embodiment comprises: learned feature data storage means 1 for storing feature data or feature vectors of voice signals and music signals learned in advance from pure narration voice and music; voice/music feature data extraction means 2 for extracting voice and music feature data or feature vectors from the audio signal being broadcast; voice/music mixture ratio estimation means 3, which receives the feature data from the learned feature data storage means 1 and the feature data of the broadcast audio signal from the voice/music feature data extraction means 2 and estimates the mixture ratio of narration voice and music (for example, background sound) in the audio signal; and a mixture ratio estimation adaptive voice enhancement circuit (means) 4, which switches the method of controlling the balance between narration sound and sound effects according to the mixture ratio estimated by the voice/music mixture ratio estimation means 3.
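The data flow among means 2, means 3, and circuit 4 can be sketched as a single per-frame function. This is only an illustration under stated assumptions: the function name `process_frame`, the frame-by-frame processing, and the callable signatures are hypothetical and do not appear in the specification.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
FeatureVector = Sequence[float]


def process_frame(
    frame: Frame,
    extract_features: Callable[[Frame], FeatureVector],   # stands in for means 2
    estimate_mixture: Callable[[FeatureVector], float],   # stands in for means 3 (with stored data of means 1)
    enhance: Callable[[Frame, float], List[float]],       # stands in for circuit 4
) -> Tuple[List[float], float]:
    """Run one audio frame through the three stages of FIG. 1.

    Returns the enhanced frame and the mixture ratio estimate m.
    All names here are hypothetical illustrations.
    """
    features = extract_features(frame)      # voice/music feature data
    m = estimate_mixture(features)          # 1 = all voice, 0 = all music
    return enhance(frame, m), m
```

Passing the stages in as callables keeps the sketch independent of any particular feature set or enhancement circuit, which the specification leaves open.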

[0017] The voice/music feature data extraction means 2 analyzes the input audio signal and obtains a plurality of indices (voice/music feature data) representing the characteristics of the voice and music contained in the audio signal.

[0018] The voice/music mixture ratio estimation means 3 receives the voice/music feature data obtained by the voice/music feature data extraction means 2 and estimates the voice/music mixture ratio using the learned voice/music feature data obtained in advance.

[0019] The mixture ratio estimation adaptive voice enhancement circuit (means) 4 receives the estimated mixture ratio and the input audio signal, switches the method of controlling the balance between narration sound and sound effects according to the estimated mixture ratio, and outputs an easy-to-hear audio signal in which the narration sound is emphasized.

[0020] Here, pure voice is audio that contains only a so-called speech signal, and pure music is an instrument sound signal that contains no speech signal. Because pure voice and music are used only for learning, a person can check them in advance before use. "Data" here means, for example, a collection of individual vectors stored in a memory or the like, and "vector" means an individual vector itself.

[0021] Next, the schematic configuration and operation of each means of the audio apparatus of this embodiment are described.

(1) Voice/music feature data extraction means 2

FIG. 2 shows a configuration example of the voice/music feature data extraction means 2. The voice/music feature data are the signal component center frequency, the signal 95% value frequency, the voice modulation degree, the zero-crossing count, the variances of these, and the signal energy. However, the voice/music feature vector may consist of only some of these features, or may include additional features. An example is shown in Table 1 below.

[0022]

[Table 1]

[0023] FIG. 2 is a block diagram showing the schematic configuration of the voice/music feature data extraction means of this embodiment. As shown in FIG. 2, one example of the voice/music feature data extraction means 2 comprises frequency analysis means 21, signal component center frequency extraction means 22, signal 95% value frequency extraction means 23, voice modulation degree extraction means 24, variance means 25 to 27, zero-crossing count extraction means 28, variance means 29, and signal energy extraction means 30. The voice/music feature data are extracted using these means.
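Several of the features named for FIG. 2 can be illustrated with a short sketch. The specification does not define the features mathematically, so the interpretations here are assumptions: the "signal component center frequency" is taken as the spectral centroid, the "signal 95% value frequency" as the frequency below which 95% of the spectral magnitude lies, and the variance as a variance over successive frames. The voice modulation degree is omitted because the source gives no definition. All function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive DFT magnitudes for bins 0..N/2 (a stand-in for frequency analysis means 21)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def centroid_frequency(mags: Sequence[float], sample_rate: float, n: int) -> float:
    """Magnitude-weighted mean frequency (assumed reading of the center frequency)."""
    total = sum(mags)
    if total == 0:
        return 0.0
    return sum(k * sample_rate / n * a for k, a in enumerate(mags)) / total


def rolloff_frequency(mags: Sequence[float], sample_rate: float, n: int,
                      fraction: float = 0.95) -> float:
    """Lowest bin frequency at which the cumulative magnitude reaches `fraction` of the total."""
    total = sum(mags)
    running = 0.0
    for k, a in enumerate(mags):
        running += a
        if running >= fraction * total:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n


def zero_crossings(frame: Sequence[float]) -> int:
    """Number of sign changes between adjacent samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of a non-empty frame."""
    return sum(x * x for x in frame) / len(frame)


def variance(values: Sequence[float]) -> float:
    """Population variance, e.g. of one feature tracked over successive frames."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)
```

A feature vector for one frame could then be assembled by concatenating these values and the variances of each over a short window of recent frames; the window length is a design choice the source does not specify.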

[0024] (2) Voice/music mixture ratio estimation means 3

FIG. 3 is a block diagram showing the schematic configuration of the voice/music mixture ratio estimation means 3 of this embodiment. As shown in FIG. 3, in one example of the voice/music mixture ratio estimation means 3, the similarity coefficient vectors bs1, bs2, ..., bsn of the learned voice feature data, the similarity coefficient vectors ba1, ba2, ..., ban of the music feature data, the similarity probability coefficients ps1, ps2, ..., psn of the learned voice feature data, and the similarity probability coefficients pa1, pa2, ..., pan of the music feature data are held in a similarity probability memory (FIG. 3 shows the case n = 3).

[0025] The voice/music feature vector obtained by the voice/music feature data extraction means 2 is input, and the feature vector is multiplied by the voice similarity coefficient vector bs1 to obtain the voice similarity for bs1. Likewise, the feature vector is multiplied by the voice similarity coefficient vector bs2 to obtain the voice similarity for bs2, and so on up to bsn. In the same way, the feature vector is multiplied by the music similarity coefficient vectors ba1, ..., ban to obtain the music similarities for ba1, ..., ban.

[0026] The voice similarity for bs1 is multiplied by the voice similarity probability coefficient ps1 to obtain the voice similarity probability Ps1 for bs1. The voice similarity probabilities Ps2, ..., Psn for bs2, ..., bsn are obtained in the same way. The voice similarity probabilities Ps1, ..., Psn are then summed by the adder 31 to obtain the voice similarity probability PS.

[0027] Similarly, the music similarity for ba1 is multiplied by the music similarity probability coefficient pa1 to obtain the music similarity probability Pa1 for ba1, and the music similarity probabilities Pa2, ..., Pan for ba2, ..., ban are obtained in the same way. The music similarity probabilities Pa1, ..., Pan are summed by the adder 31 to obtain the music similarity probability PA. Next, the divider 32 obtains the ratio Ratio S/A of the voice similarity probability PS to the music similarity probability PA.
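The computation described for FIG. 3 (multiply the feature vector by each similarity coefficient vector, weight by the probability coefficient, sum, then divide) can be sketched as follows. Two points are assumptions rather than statements from the source: the "multiplication" of two vectors is read as an inner product, and a small `eps` guard is added so the division cannot fail when PA is zero or negative. Function and parameter names are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Inner product (assumed meaning of multiplying a feature vector by a coefficient vector)."""
    return sum(a * b for a, b in zip(u, v))


def class_probability(feature: Vector,
                      coeff_vectors: Sequence[Vector],
                      prob_coeffs: Sequence[float]) -> float:
    """Sum over i of (feature . b_i) * p_i, i.e. PS for (bs, ps) or PA for (ba, pa)."""
    return sum(dot(feature, b) * p for b, p in zip(coeff_vectors, prob_coeffs))


def voice_music_ratio(feature: Vector,
                      bs: Sequence[Vector], ps: Sequence[float],
                      ba: Sequence[Vector], pa: Sequence[float],
                      eps: float = 1e-12) -> float:
    """Ratio S/A = PS / PA; eps is an added guard, not part of the source."""
    PS = class_probability(feature, bs, ps)   # adder output for voice
    PA = class_probability(feature, ba, pa)   # adder output for music
    return PS / max(PA, eps)                  # divider 32
```

For a feature vector `[1.0, 2.0]` with `bs = [[1, 0], [0, 1]]`, `ps = [0.5, 0.25]`, `ba = [[1, 1]]`, and `pa = [0.5]`, PS is 1.0 and PA is 1.5, so the ratio is 2/3.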

[0028] The mixture ratio output unit 33 calculates a mixture ratio estimation coefficient from the ratio Ratio S/A and a fitting function f(Ratio S/A), and applies a sigmoid function to calculate the voice/music mixture ratio estimate m. The estimate m is 1 when the signal is estimated to be 100% voice and 0 when it is estimated to be 100% music, and takes intermediate values between 1 and 0 in proportion to the estimated voice/music mixture ratio.

[0029] Here, the fitting function is indispensable for producing an optimal value; in practice it is determined afterwards so as to be optimal over a large amount of data. The sigmoid function converts values ranging from -∞ to ∞ into values between 0 and 1, and is needed to confine the mixture ratio estimate to the interval [0, 1].
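The mapping from the ratio to the estimate m can be sketched as below. The specification leaves the fitting function f to be tuned on data, so the form used here, `gain * ln(ratio) + offset`, is purely a hypothetical choice; it has the convenient property that with `gain = 1` and `offset = 0` the result equals PS / (PS + PA). The sigmoid itself is the standard logistic function.

```python
import math


def fitting_function(ratio: float, gain: float = 1.0, offset: float = 0.0) -> float:
    """Hypothetical f(Ratio S/A); the source says f is determined from data afterwards."""
    return gain * math.log(max(ratio, 1e-12)) + offset


def sigmoid(x: float) -> float:
    """Logistic function mapping (-inf, inf) to (0, 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mixture_estimate(ratio: float, gain: float = 1.0, offset: float = 0.0) -> float:
    """Voice/music mixture ratio estimate m: near 1 for voice, near 0 for music."""
    return sigmoid(fitting_function(ratio, gain, offset))
```

With these defaults a ratio of 1 gives m = 0.5 and a ratio of 3 gives m = 0.75; a larger `gain` would make the transition between music and voice sharper.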

[0030] (3) Mixture ratio estimation adaptive voice enhancement circuit (means) 4

FIG. 4 is a block diagram showing the schematic configuration of the mixture ratio estimation adaptive voice enhancement circuit (means) 4 of this embodiment. In FIG. 4, the symbols Ms, Msa, and Ma are threshold values for determining whether the signal is voice, music, or a mixture of voice and music. They must be set in advance. M stands for mixture ratio (mix), s for speech, and a for audio. The subscript sa denotes a signal as it actually occurs, which may contain voice, music, a mixture of the two, and so on.

[0031] As shown in FIG. 4, the mixture ratio estimation adaptive voice enhancement circuit (means) 4 comprises a first voice enhancement circuit Tas (voice enhancement means) 41, a second voice enhancement circuit Tsa (voice enhancement means) 42, and a switch (switching means) 43 that switches the method of controlling the balance between narration sound and sound effects according to the voice/music mixture ratio estimate m.

[0032] The first voice enhancement circuit Tas (voice enhancement means) 41 and the second voice enhancement circuit Tsa (voice enhancement means) 42 are each configured, for example, so that the audio is always fed into the voice enhancement circuits of circuit 4 (including the pass-through path); in other words, the enhancement circuits are always running and only the required circuit is selected. That is, a given enhancement path takes effect only when its condition is met and has no effect otherwise (when a condition is met, the sound that has passed through the corresponding enhancement path is output).

[0033] The first voice enhancement circuit Tas (voice enhancement means) 41 and the second voice enhancement circuit Tsa (voice enhancement means) 42 take the voice/music mixture ratio estimate m and the input audio signal as inputs, and produce as output an audio signal in which the narration sound is easier to hear.

[0034] When the estimate m is at or above Ms, that is, close to 1, the audio signal is judged to contain only narration voice and to need no enhancement, and it is output as is. When the estimate m is at or below Ma, that is, close to 0, the audio signal is judged to contain only BGM music with no narration voice to emphasize, and it is output as is without voice enhancement.

[0035] When m lies between Ma and Ms, voice enhancement is judged to be necessary, and an audio signal with emphasized narration sound is output while switching among a plurality of voice enhancement circuits according to the value of m. For example, the voice enhancement circuit Tsa 42 is used when Msa < m < Ms, where the narration voice is mixed at a higher level than the BGM music, and attenuates high-frequency components. The voice enhancement circuit Tas 41 is used when Ma < m < Msa, where the BGM music is mixed at a higher level than the narration voice. In addition to attenuating high-frequency components, it monitors over time the spectral characteristics of the voice component or of the non-voice components, and uses them to periodically amplify and attenuate specific low-frequency components. These are simple examples of voice enhancement circuits (means); other voice enhancement circuits may be used.
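The switching rule applied by the switch 43 can be sketched as a small selection function. The thresholds Ma, Msa, and Ms come from the description, but their numeric values are not given, and the source does not say which path applies when m equals Msa exactly; assigning that boundary to Tas here is an assumption. The function name and the string labels are hypothetical.

```python
def select_path(m: float, Ma: float, Msa: float, Ms: float) -> str:
    """Choose the processing path for mixture ratio estimate m.

    Returns "through" (no enhancement), "Tsa" (circuit 42), or "Tas" (circuit 41).
    """
    if not (Ma < Msa < Ms):
        raise ValueError("thresholds must satisfy Ma < Msa < Ms")
    if m >= Ms:
        return "through"   # narration only: nothing to enhance
    if m <= Ma:
        return "through"   # BGM only: no narration to emphasize
    if m > Msa:
        return "Tsa"       # Msa < m < Ms: narration dominant
    return "Tas"           # Ma < m <= Msa: BGM dominant (boundary case assumed)
```

For example, with the illustrative thresholds Ma = 0.1, Msa = 0.5, and Ms = 0.9, an estimate of 0.7 selects Tsa and an estimate of 0.3 selects Tas.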

[0036] The invention made by the present inventors has been described above specifically on the basis of the embodiment. Needless to say, the present invention is not limited to that embodiment, and various modifications are possible without departing from its gist.

[0037]

Effects of the invention: As described above, according to the present invention, the mixture ratio of the narration voice component and the BGM (background sound) music component contained in a broadcast audio signal can be estimated. By adaptively switching the method of controlling the balance between narration sound and sound effects according to this estimate, an easy-to-hear sound in which the narration sound of the audio signal is effectively emphasized can be provided.

Brief description of the drawings

FIG. 1 is a block diagram showing the schematic configuration of an audio apparatus using a voice/music mixture ratio estimation apparatus according to one embodiment of the present invention.

FIG. 2 is a block diagram showing the schematic configuration of the voice/music feature data extraction means of this embodiment.

FIG. 3 is a block diagram showing the schematic configuration of one example of the voice/music mixture ratio estimation means 3 of this embodiment.

FIG. 4 is a block diagram showing the schematic configuration of the mixture ratio estimation adaptive voice enhancement circuit of this embodiment.

Explanation of symbols

1: learned feature data storage means
2: voice/music feature data extraction means
3: voice/music mixture ratio estimation means
4: mixture ratio estimation adaptive voice enhancement circuit
21: frequency analysis means
22: signal component center frequency extraction means
23: signal 95% value frequency extraction means
24: voice modulation degree extraction means
25 to 27: variance means
28: zero-crossing count extraction means
29: variance means
30: signal energy extraction means
31: adder
32: divider
33: mixture ratio output unit
41: first voice enhancement circuit Tas (voice enhancement means)
42: second voice enhancement circuit Tsa (voice enhancement means)
43: switch (switching means)

Claims

[Claim 1] A voice/music mixture ratio estimation method, characterized in that feature data of voice signals and music signals are learned in advance from pure narration voice and music and stored, and the mixture ratio of narration voice and music in an audio signal is estimated using the stored feature data and the feature data of the audio signal being broadcast.

[Claim 2] A method for making the narration sound of an audio signal easier to hear, characterized in that the method of controlling the balance between narration sound and sound effects is switched according to the mixture ratio estimated by the mixture ratio estimation method according to claim 1.

[Claim 3] A voice/music mixture ratio estimation apparatus, characterized by comprising: learned feature data storage means for storing feature data of voice signals and music signals learned in advance from pure narration voice and music; voice/music feature data extraction means for extracting voice and music feature data from the audio signal being broadcast; and voice/music mixture ratio estimation means that receives the feature data from the learned feature data storage means and the feature data of the broadcast audio signal from the voice/music feature data extraction means, and estimates the mixture ratio of narration voice and music in the audio signal.

[Claim 4] An audio apparatus, characterized by comprising mixture ratio estimation adaptive voice enhancement means that switches the method of controlling the balance between narration sound and sound effects according to the mixture ratio estimated by the voice/music mixture ratio estimation apparatus according to claim 3.