JPH06236198A - Tone quality subjective evaluation prediction system

Info

Publication number: JPH06236198A
Application number: JP5020916A
Authority: JP
Inventors: Keiko Nagano; 敬子永野; Shigeru Ono; 茂小野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1993-02-09
Filing date: 1993-02-09
Publication date: 1994-08-23
Anticipated expiration: 2014-09-27
Also published as: JP2953238B2

Abstract

PURPOSE: To provide a sound quality subjective evaluation prediction system that improves prediction accuracy by evaluating speech with a weighted distance measure that takes human auditory characteristics into account. CONSTITUTION: An input speech signal feature extraction unit 4 extracts a feature parameter from the input speech signal, and a reproduced speech signal feature extraction unit 5 extracts a feature parameter from the reproduced speech signal. An input speech signal dynamic feature extraction unit 6 extracts a dynamic feature parameter from the feature parameter of the input speech signal. A first weighting coefficient extraction unit 7 extracts a first weighting coefficient from the feature parameter of the input speech signal, and a second weighting coefficient extraction unit 8 extracts a second weighting coefficient from the dynamic feature parameter. An objective evaluation unit 9 computes the distance between the feature parameters of the input and reproduced speech signals, taking the first and second weighting coefficients into account, and outputs an objective evaluation value. A subjective evaluation prediction unit 10 predicts a subjective evaluation value from the objective evaluation value.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a sound quality subjective evaluation prediction system, and in particular to a system that predicts the subjective evaluation value of a reproduced speech signal by applying a weighted distance measure to the feature parameters of the input speech signal and those of the reproduced speech signal.

[0002]

[Prior Art] Conventionally, the sound quality of speech has been evaluated either subjectively, by rating the speech under test in listening experiments, or objectively, by extracting feature parameters from the original speech and from the speech under test and computing a ratio or distance between them.

[0003] The measures used for this objective evaluation include the SN ratio (Japanese Patent Laid-Open No. 63-273895) and the segmental SN ratio, as well as spectral distortion measures such as the cepstrum distance and the BSD distance (Proc. ICASSP 91, IEEE Speech Processing, vol. 1, pp. 493-496, 1991).

[0004] More recently, models that predict subjective evaluation values from the speech feature parameters used in these objective measures have also been studied (IEICE Transactions A, Vol. J73-A, No. 6, pp. 1039-1047, June 1990).

[0005] Meanwhile, in the field of speech coding, weighted distances based on adaptive filters are widely used to obtain high-quality reproduced speech (Japanese Patent Laid-Open No. 4-84200). Determining the coding parameters so as to minimize this weighted distance is said to yield subjectively good reproduced speech.

[0006] However, the evaluation measure used there is intended for relative evaluation of reproduced speech, and it cannot be applied to determining or predicting the subjective evaluation value of reproduced speech.

[0007]

[Problem to Be Solved by the Invention] Conventional sound quality subjective evaluation prediction systems, which predict subjective evaluation values from objective evaluations, are said to be unsuitable for evaluating low-bit-rate coding schemes (IEEE Transactions on Selected Areas in Communications, vol. SAC-6, pp. 242-248, Feb. 1988; IEEE Trans. Comm., vol. COM-30, pp. 642-654, Apr. 1982).

[0008] Therefore, to improve the accuracy of subjective evaluation prediction for low-bit-rate coded speech, the objective evaluation must take into account the auditory characteristics that humans rely on when judging speech, such as the masking effect.

[0009] To take the masking effect into account, it is considered effective not to evaluate the speech signal with the same weight from beginning to end, but to apply weighting that reflects which portions are perceptually important and which are not.

[0010] An object of the present invention is to provide a sound quality subjective evaluation prediction system that predicts subjective evaluation values with a model close to the way humans evaluate speech signals. It does so by using a weighted distance measure in which portions where the feature parameter value of the input speech signal is large, and portions where the dynamic feature parameter value is small, count more heavily than other portions, thereby markedly improving prediction accuracy over conventional systems.

[0011]

[Means for Solving the Problem] The sound quality subjective evaluation prediction system of the first invention comprises: an encoding/decoding unit that encodes and decodes an input speech signal to produce a reproduced speech signal; an input speech signal feature extraction unit that extracts at least one input speech signal feature parameter from the input speech signal; a reproduced speech signal feature extraction unit that extracts at least one reproduced speech signal feature parameter from the reproduced speech signal; an input speech signal dynamic feature extraction unit that extracts at least one dynamic feature parameter using the input speech signal feature parameter; a first weighting coefficient extraction unit that extracts at least one first weighting coefficient using the input speech signal feature parameter; a second weighting coefficient extraction unit that extracts at least one second weighting coefficient using the dynamic feature parameter; an objective evaluation unit that, when computing the distance between the input speech signal feature parameter and the reproduced speech signal feature parameter, uses at least one of the first weighting coefficient and the second weighting coefficient and outputs an objective evaluation value; and a subjective evaluation prediction unit that predicts a subjective evaluation value using the objective evaluation value.

[0012] In the sound quality subjective evaluation prediction system of the second invention, the input speech signal feature extraction unit of the first invention is replaced by two units: a weighting-coefficient-extraction input speech signal feature extraction unit, which extracts the feature parameters of the input speech signal used by the first weighting coefficient extraction unit and the dynamic feature extraction unit, and an evaluation input speech signal feature extraction unit, which extracts the feature parameters of the input speech signal used by the objective evaluation unit.

[0013]

[Embodiments] The present invention will now be described with reference to the drawings.

[0014] FIG. 1 is a block diagram showing a first embodiment of the sound quality subjective evaluation prediction system of the first invention.

[0015] In FIG. 1, a speech signal is input from input terminal 1 and sent to the encoding/decoding unit 3 and the input speech signal feature extraction unit 4.

[0016] The encoding/decoding unit 3 encodes and decodes the original signal to produce a reproduced speech signal. CELP (Proc. Int. Conf. Acoust., Speech, Signal Processing, pp. 200-203, 1989), for example, may be used for the encoding/decoding. The reproduced speech signal produced by the encoding/decoding unit 3 is sent to the reproduced speech signal feature extraction unit 5.

[0017] The input speech signal feature extraction unit 4 computes a feature parameter xpar from the input speech signal S_x at fixed time intervals (frames). The feature parameter xpar is sent to the input speech signal dynamic feature extraction unit 6 and the first weighting coefficient extraction unit 7. Well-known parameters such as rms, Bark spectrum, pitch, or cepstrum can be used as the feature parameter xpar.

[0018] As examples of obtaining the feature parameter xpar from the input speech signal {S_x(0), ..., S_x(L-1)}, the expressions for the rms and the Bark spectrum are given below.

[0019] First, equation (1) gives the rms of the k-th frame of the input speech signal.

[0020]
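Equation (1) itself is not reproduced in this text. The following is a minimal sketch of a per-frame rms computation, assuming the standard root-mean-square definition over the samples of each frame. The function name `frame_rms`, the parameter `frame_len`, and the choice to drop a trailing partial frame are illustrative assumptions, not taken from the patent.

```python
import math


def frame_rms(signal, frame_len):
    """Return the rms value of each complete frame of `signal`.

    Assumption: standard rms, sqrt(mean(s^2)), over `frame_len` samples.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    n_frames = len(signal) // frame_len  # a trailing partial frame is dropped
    rms = []
    for k in range(n_frames):
        frame = signal[k * frame_len:(k + 1) * frame_len]
        rms.append(math.sqrt(sum(s * s for s in frame) / frame_len))
    return rms
```

Applied to the input and reproduced signals with the same `frame_len`, this would give frame-aligned sequences that play the roles of xrms and yrms.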

[0021] Next, the procedure for obtaining the Bark spectrum of the k-th frame from the input speech signal is described.
1. An FFT is applied to the input speech signal S_x^(k) to obtain the power spectrum X^(k)(f).
2. The power spectrum X^(k)(f) is converted to the Bark scale, giving Y^(k)(x). The conversion between x and f uses relation (2) below.

[0022]
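Relation (2) is not reproduced in this text. As a hedged illustration, the sketch below assumes the approximation f = 600 sinh(x / 6) between frequency f in Hz and Bark value x, which is commonly used in the Bark spectral distortion literature; whether the patent uses exactly this form is an assumption. The function names are hypothetical.

```python
import math


def bark_to_hz(x):
    """Map a Bark-scale value x to frequency in Hz (assumed form)."""
    return 600.0 * math.sinh(x / 6.0)


def hz_to_bark(f):
    """Inverse mapping: frequency in Hz to Bark-scale value."""
    return 6.0 * math.asinh(f / 600.0)
```

With such a mapping, each Bark-scale point x can be assigned the power-spectrum value at the corresponding frequency f.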

[0023] 3. The critical-band filter F^(k)(x) is applied to the Bark-scaled power spectrum Y^(k)(x) to obtain the excitation pattern D^(k)(x) by equation (4). The critical-band filter F^(k)(x) is given by equation (3) below.

[0024]

[0025] Here, α = 0.215.

[0026]
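Equations (3) and (4) are not reproduced in this text. The sketch below assumes the critical-band filter shape commonly quoted in the Bark spectral distortion literature, 10 log10 F(x) = 7 - 7.5 (x - α) - 17.5 sqrt(0.196 + (x - α)^2), which is consistent with the stated α = 0.215. Treat the formula and the function names as assumptions rather than the patent's own definition.

```python
import math

ALPHA = 0.215  # value stated in the text


def critical_band_filter_db(x, alpha=ALPHA):
    """Assumed critical-band filter response in dB at Bark offset x."""
    d = x - alpha
    return 7.0 - 7.5 * d - 17.5 * math.sqrt(0.196 + d * d)


def critical_band_filter(x, alpha=ALPHA):
    """The same response as a linear power gain."""
    return 10.0 ** (critical_band_filter_db(x, alpha) / 10.0)
```

Under this assumed form the gain is close to 0 dB at x = 0 and falls off on both sides, so smoothing Y^(k)(x) with it spreads each component's energy into neighbouring Bark bands.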

[0027] 4. Perceptual weighting is applied to the excitation pattern D^(k)(x) by equation (6).

[0028] The perceptual weighting H(f) for 1800 to 3400 Hz can be obtained from equation (5) below.

[0029]

[0030] Below 1800 Hz, H(f) = 1; above 3400 Hz, H(f) takes the same value as at 3400 Hz.

[0031]

[0032] 5. The Bark spectrum {xB^(k)[i], i = 1, ..., T} of the i-th dimension and k-th frame is obtained by equation (7) below.

[0033]

[0034] Here, the channel number i and the Bark scale value x are related by x = 1.0 * i.

[0035] The feature parameters of the input signal obtained by the above procedure, such as the rms and the Bark spectrum, are output to the first weighting coefficient extraction unit 7 and the input speech signal dynamic feature extraction unit 6.

[0036] The first weighting coefficient extraction unit 7 extracts the weighting coefficient ω using the feature parameter xpar of the input speech signal.

[0037] When people listen to speech, differences are easier to notice where the sound is loud than where it is quiet, and an anomalous sound in a stationary portion is more objectionable than one in a changing portion. Taking these auditory characteristics into account, the present system uses a weighted distance measure in which portions where the parameter value of the input speech signal is large count more than portions where it is small. As one example of obtaining the weighting coefficient, the weighting coefficient ω₁^(k) of the k-th frame is given by equation (8) below.

[0038]
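Equation (8) is not reproduced in this text, so its exact form is unknown. As one hypothetical form that satisfies the description (larger feature values receive larger weights), the sketch below normalizes each frame's feature value by the sum over all frames. The function name `first_weights` and the normalization are assumptions.

```python
def first_weights(xpar):
    """Per-frame weights proportional to the (non-negative) feature value.

    Assumed form: w[k] = xpar[k] / sum(xpar). Not the patent's equation (8).
    """
    if any(x < 0 for x in xpar):
        raise ValueError("feature values must be non-negative")
    total = sum(xpar)
    if total <= 0:
        raise ValueError("at least one feature value must be positive")
    return [x / total for x in xpar]
```

For a non-negative feature such as rms, loud frames then dominate the weighted distance while silent frames contribute nothing.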

[0039] Weighting by rms likewise gives more weight to portions where the rms of the input speech signal is large than to portions where it is small. Letting xrms denote the rms feature parameter of the input speech signal and ω_rms the weighting coefficient, the rms weighting coefficient ω_rms^(k) of the k-th frame is given by equation (9) below.

[0040]

[0041] When the feature parameter is multidimensional, each dimension is weighted in the same way as in the one-dimensional case. The weighting coefficient ω₂^(k)[i] of the i-th dimension is obtained by equation (10) below.

[0042]

[0043] Next, the weighting of the Bark spectrum is described.

[0044] By weighting so that values evaluated in high-energy portions of the Bark spectrum count more, emphasis is placed on evaluating stationary vowel portions, which affect the clarity and naturalness of the speech signal.

[0045] Letting xB denote the Bark spectrum feature parameter of the input speech signal and ω_B the weighting coefficient, the weighting coefficient ω_B^(k)[i] for the i-th dimension of the Bark spectrum feature parameter {xB^(k)[i], i = 1, ..., T} extracted in the k-th frame is given by equation (11) below.

[0046]

[0047] The weighting coefficients obtained in this way from the feature parameters of the input speech signal are output to the objective evaluation unit 9.

[0048] Next, the input speech signal dynamic feature extraction unit 6 is described. It converts the feature parameter xpar of the input speech signal S_x, obtained by the input speech signal feature extraction unit 4, into the dynamic feature parameter δxpar and sends it to the second weighting coefficient extraction unit 8. There are several ways to convert the feature parameter xpar into the dynamic feature parameter δxpar; one example is given here.

[0049] Let xpar₁ denote the feature parameter of the input speech signal and δxpar₁ the dynamic feature parameter converted from xpar₁. The dynamic feature parameter δxpar₁^(k) of the feature parameter xpar₁^(k) extracted in the k-th frame is obtained, as in equation (12), from the difference between the feature parameters xpar₁^(k+s) and xpar₁^(k-s), which lie s frames on either side of frame k.

[0050]
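Equation (12) is not reproduced in this text. The sketch below illustrates the described difference between the frames s positions on either side of frame k. The sign convention (later frame minus earlier frame), the clamping of indices at the sequence boundaries, and the function name are assumptions.

```python
def delta_features(xpar, s=1):
    """Dynamic feature: xpar[k + s] - xpar[k - s] for each frame k.

    Assumptions: later-minus-earlier sign, indices clamped at the edges.
    """
    if s < 1:
        raise ValueError("s must be at least 1")
    n = len(xpar)
    delta = []
    for k in range(n):
        later = xpar[min(k + s, n - 1)]   # clamp at the last frame
        earlier = xpar[max(k - s, 0)]     # clamp at the first frame
        delta.append(later - earlier)
    return delta
```

A stationary stretch of the signal yields values near zero, which is what the second weighting coefficient later rewards.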

[0051] When rms is used as the feature parameter of the input speech signal, the dynamic feature parameter is obtained by equation (13) below.

[0052]

[0053] Next, the case where the feature parameter xpar₂ of the input speech signal is multidimensional is described, with δxpar₂ denoting the dynamic feature parameter converted from xpar₂. The dynamic feature parameter δxpar₂^(k)[i] of the i-th dimension of the feature parameter {xpar₂^(k)[i], i = 1, ..., T} extracted in the k-th frame is obtained from equation (14) below.

[0054]

[0055] Letting xB denote the Bark spectrum feature parameter of the input speech signal and δxB its dynamic feature parameter, the dynamic feature parameter δxB^(k)[i] of the i-th dimension of the Bark spectrum feature parameter {xB^(k)[i], i = 1, ..., T} extracted in the k-th frame is obtained from equation (15) below.

[0056]

[0057] Next, another method of converting the feature parameter of the input speech signal into a dynamic feature parameter is described, using the feature parameter xpar₃ and the dynamic feature parameter δxpar₃. The dynamic feature parameter δxpar₃^(k) of the k-th frame is obtained, as in equation (17), from the difference between the feature parameter xpar₃^(k) and the average feature parameter avgxpar of xpar₃ (equation (16)).

[0058]

[0059]
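Equations (16) and (17) are not reproduced in this text. A minimal sketch of the described idea follows, assuming avgxpar is the arithmetic mean over all frames and that the difference is taken as frame value minus mean. The function name is hypothetical.

```python
def deviation_from_mean(xpar):
    """Dynamic feature as the deviation of each frame from the mean.

    Assumptions: avgxpar is the arithmetic mean over all frames, and the
    difference is xpar[k] - avgxpar.
    """
    if not xpar:
        raise ValueError("xpar must not be empty")
    avg = sum(xpar) / len(xpar)
    return [x - avg for x in xpar]
```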

[0060] As a further method of converting the feature parameter xpar of the input speech signal into the dynamic feature parameter δxpar, equation (18) below gives the dynamic feature parameter δxpar₄^(k) of the k-th frame as the ratio of the feature parameter xpar₄^(k) of the input speech signal to the predicted feature parameter xpar₄'^(k).

[0061]

[0062] The second weighting coefficient extraction unit 8 uses the dynamic feature parameter δxpar of the input speech signal, obtained by the input speech signal dynamic feature extraction unit 6, to extract a weighting coefficient δω that weights heavily those frames in which the absolute value of the dynamic feature parameter is small.

[0063] Conventional objective evaluation does not consider temporal changes in the feature parameters. In low-bit-rate coded speech signals, however, sound quality is sometimes degraded by parameter fluctuations, such as fluctuations in pitch or in the height of the voice.

[0064] In the present system, weighting derived from the dynamic feature parameters of the input speech signal is added to the objective evaluation measure, so that temporal variation of the parameters is also taken into account. Portions of the input speech signal where the dynamic feature parameter is small fluctuate little and have good sound quality, so differences from the reproduced speech signal are easily noticed there.

[0065] Weighting is therefore applied so that the evaluation of portions where the dynamic feature parameter is small is emphasized. As one example, equation (19) for the weighting coefficient δω₁^(k) of the k-th frame is given below.

[0066]
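Equation (19) is not reproduced in this text, so its exact form is unknown. The sketch below shows one hypothetical weighting that matches the description (frames with a small absolute dynamic feature receive large weights): a reciprocal of 1 + |δ|, normalized to sum to one. Both the functional form and the function name are assumptions.

```python
def second_weights(delta):
    """Per-frame weights that grow as |delta[k]| shrinks.

    Assumed form: w[k] proportional to 1 / (1 + |delta[k]|), normalized
    to sum to 1. Not the patent's equation (19).
    """
    if not delta:
        raise ValueError("delta must not be empty")
    raw = [1.0 / (1.0 + abs(d)) for d in delta]
    total = sum(raw)
    return [r / total for r in raw]
```

Any monotonically decreasing function of |δ| would serve the same qualitative purpose; the reciprocal is chosen here only because it stays finite when δ is zero.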

[0067] The following shows how to obtain the weighting coefficient when the dynamic feature parameter δxpar₁ of the input speech signal is derived from rms. Because differences in sound quality are easier to notice where loudness is not changing than where it is, the weighting coefficient for rms is likewise chosen so that portions where the dynamic feature parameter is small have a large influence on the evaluation. Letting δxrms denote the dynamic rms feature parameter of the input speech signal and δω_rms the weighting coefficient, the rms weighting coefficient δω_rms^(k) of the k-th frame is obtained by equation (20) below.

[0068]

[0069] When the dynamic feature parameter is multidimensional, the weighting coefficient can be obtained as follows. The weighting coefficient δω₂^(k)[i] for the i-th dimension of the dynamic feature parameter {δxpar₂^(k)[i], i = 1, ..., T} extracted in the k-th frame is obtained from equation (21) below.

[0070]

[0071] Letting δxB denote the dynamic Bark spectrum of the input speech signal and δω_B the weighting coefficient derived from δxB, the weighting coefficient δω_B^(k)[i] for the i-th dimension of the dynamic Bark spectrum {δxB^(k)[i], i = 1, ..., T} extracted in the k-th frame is given by equation (22) below.

[0072]

[0073] The weighting coefficients obtained in this way from the dynamic feature parameters of the input speech signal are output to the objective evaluation unit 9.

[0074] The reproduced speech signal feature extraction unit 5 computes a feature parameter ypar from the reproduced speech signal S_y at fixed time intervals (frames). The feature parameter ypar is sent to the objective evaluation unit 9.

[0075] Examples of the feature parameter ypar obtained by the reproduced speech signal feature extraction unit 5 include rms, Bark spectrum, pitch, and cepstrum.

[0076] The feature parameter ypar is obtained from the reproduced speech signal S_y in the same way that the input speech signal feature extraction unit 4 obtains the feature parameter of the input speech signal from S_x, so the description is omitted here.

[0077] The objective evaluation unit 9 computes the objective evaluation value ωAVG by applying the weighting coefficients obtained by the first weighting coefficient extraction unit 7 and the second weighting coefficient extraction unit 8 to the distance between the feature parameters xpar and ypar obtained by the input speech signal feature extraction unit 4 and the reproduced speech signal feature extraction unit 5, and sends it to the subjective evaluation prediction unit 10.

[0078] Equation (23) below gives the weighted objective evaluation value when the feature parameter is one-dimensional. The weighting coefficient obtained from the feature parameter of the input speech signal is denoted ω₁^(k), and the weighting coefficient obtained from the dynamic feature parameter of the input speech signal is denoted δω₁^(k).

[0079]
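Equation (23) is not reproduced in this text. The sketch below shows one plausible reading of a weighted objective evaluation value for a one-dimensional feature: a weighted average of squared frame-wise differences, with the two weights combined by multiplication. The squared-difference distance, the product of weights, the normalization, and the function name are all assumptions. Because the claims allow either or both weighting coefficients to be used, both weight arguments are optional here.

```python
def weighted_objective_value(xpar, ypar, w1=None, w2=None):
    """Weighted mean of squared frame-wise differences (assumed form).

    w1: weights from the feature parameter (omitted -> all ones)
    w2: weights from the dynamic feature parameter (omitted -> all ones)
    """
    n = len(xpar)
    if n == 0 or len(ypar) != n:
        raise ValueError("xpar and ypar must be non-empty and equal length")
    w1 = [1.0] * n if w1 is None else w1
    w2 = [1.0] * n if w2 is None else w2
    if len(w1) != n or len(w2) != n:
        raise ValueError("weights must have one entry per frame")
    weights = [a * b for a, b in zip(w1, w2)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("combined weights must have a positive sum")
    num = sum(w * (x - y) ** 2 for w, x, y in zip(weights, xpar, ypar))
    return num / total
```

With both weights omitted this reduces to an unweighted mean squared distance, which corresponds to the conventional, unweighted kind of measure that the text contrasts against.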

[0080] Letting xrms be the feature parameter xpar₁ of the input speech signal and yrms the feature parameter ypar₁ of the reproduced speech signal, the weighted objective evaluation value ωAVG_rms is obtained by equation (24) below.

[0081]

[0082] When the feature parameter is multidimensional, the weighted objective evaluation value is obtained by equation (25) below. Here ω₂ denotes the weighting coefficient obtained from the feature parameter of the input speech signal, δω₂ the weighting coefficient obtained from the dynamic feature parameter of the input speech signal, and ωAVG₂ the weighted objective evaluation value.

[0083]

[0084] Taking the feature parameter xpar₂ of the input speech signal as xB, the feature parameter ypar₂ as yB, and the weighted objective evaluation value ωAVG₂ as ωAVG_BSD, the BSD is obtained by equation (26) below.

[0085]

[0086] The weighted objective evaluation value obtained by the objective evaluation unit 9 in this way is sent to the subjective evaluation prediction unit 10.

[0087] The subjective evaluation prediction unit 10 predicts the subjective evaluation value from at least one weighted objective evaluation value and at least two prediction coefficients, and outputs the result from output terminal 2. The prediction coefficients are determined in advance so as to minimize the error between the predicted evaluation values and subjective evaluation values collected using a large amount of speech data. The relationship between the prediction coefficient a and the subjective evaluation value is shown below.

[0088]

[0089]
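Equations (27) and (28) are not reproduced in this text. As a hedged illustration of determining prediction coefficients so that the error is minimized, the sketch below fits the simplest case the text allows (one objective evaluation value, two coefficients) by ordinary least squares. The linear model MOS' = a + b * ωAVG, the squared-error criterion, and the function name `fit_line` are assumptions.

```python
def fit_line(objective, mos):
    """Least-squares fit of mos ~ a + b * objective; returns (a, b).

    Assumption: linear model and squared-error criterion.
    """
    n = len(objective)
    if n < 2 or len(mos) != n:
        raise ValueError("need at least two paired observations")
    mean_x = sum(objective) / n
    mean_y = sum(mos) / n
    sxx = sum((x - mean_x) ** 2 for x in objective)
    if sxx == 0:
        raise ValueError("objective values must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(objective, mos))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

In practice `objective` would hold the weighted objective evaluation values for a training set of coded speech samples and `mos` the corresponding listening-test scores.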

[0090] Equation (29) below gives the predicted subjective evaluation value from the prediction coefficients and the objective evaluation values obtained by the objective evaluation unit 9. Here MOS' is the predicted subjective evaluation value, a, b, and c are the prediction coefficients, and ωAVG_p and ωAVG_q are the weighted objective evaluation values of the feature parameters obtained by the objective evaluation unit 9.

[0091]
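Equation (29) is not reproduced in this text. Given three coefficients and two objective evaluation values, a linear combination with an intercept is one natural reading, and the sketch below assumes it. Which of a, b, and c is the intercept, and the function name `predict_mos`, are assumptions.

```python
def predict_mos(avg_p, avg_q, a, b, c):
    """Predicted subjective score MOS' from two objective values.

    Assumed form: MOS' = a + b * avg_p + c * avg_q.
    """
    return a + b * avg_p + c * avg_q
```

Negative values of b and c would express that a larger weighted distance between input and reproduced speech lowers the predicted score.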

[0092] The predicted subjective evaluation value obtained in this way is output from output terminal 2.

[0093] FIGS. 2 and 3 are block diagrams showing other embodiments of the first invention. FIG. 2 shows a second embodiment, which weights the objective evaluation value using only the weights extracted from the feature parameters of the input speech. FIG. 3 shows a third embodiment, which uses only the weights extracted from the dynamic feature parameters of the input speech.

[0094] FIG. 4 is a block diagram showing an embodiment of the sound quality subjective evaluation prediction system of the second invention. In FIG. 4, components bearing the same numbers as in FIG. 1 operate in the same way, so their description is omitted.

【００９５】第２の発明では、入力端子１より入力され
た音声信号が、第１の発明の入力音声信号特徴抽出部４
のかわりに、重み係数抽出用入力音声信号特徴抽出部１
１と評価用入力音声信号抽出部１２に送られる。重み係
数抽出用入力音声信号特徴抽出部１１では、入力音声信
号の特徴パラメータを抽出し、第１の重み付け係数抽出
部７と入力音声信号動的特徴抽出部６とに送る。In the second invention, the voice signal input from the input terminal 1 is the input voice signal feature extraction unit 4 of the first invention.
Instead of the input speech signal feature extraction unit 1 for weighting factor extraction
1 and the input voice signal extraction unit 12 for evaluation. The weighting coefficient extraction input voice signal feature extraction unit 11 extracts the feature parameter of the input voice signal and sends it to the first weighting coefficient extraction unit 7 and the input voice signal dynamic feature extraction unit 6.

[0096] The evaluation input voice signal feature extraction unit 12 extracts the feature parameters of the input voice signal that are sent to the objective evaluation unit 9.

[0097] The weighting coefficient extraction input voice signal feature extraction unit 11 and the evaluation input voice signal feature extraction unit 12 can extract feature parameters from the input voice by the same method as the input voice signal feature extraction unit 4 of the first invention, so the description is omitted here. The two units may extract different feature parameters.
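The split into units 11 and 12 can be sketched as two independent per-frame extractors applied to the same input signal. The two features chosen here (frame energy for the weighting path, mean absolute amplitude for the evaluation path), the frame length, and all function names are hypothetical; the description only says that the two units may extract different feature parameters.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]


def split_frames(signal: Sequence[float], frame_len: int) -> List[Frame]:
    """Cut the signal into non-overlapping frames, dropping any remainder."""
    n = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(n)]


def frame_energy(frame: Frame) -> float:
    """Hypothetical feature for unit 11 (weighting coefficient extraction)."""
    return sum(x * x for x in frame) / len(frame)


def frame_mean_abs(frame: Frame) -> float:
    """Hypothetical feature for unit 12 (evaluation)."""
    return sum(abs(x) for x in frame) / len(frame)


def extract_both(
    signal: Sequence[float],
    frame_len: int,
    weight_feature: Callable[[Frame], float] = frame_energy,
    eval_feature: Callable[[Frame], float] = frame_mean_abs,
) -> Tuple[List[float], List[float]]:
    """Return (features for units 7 and 6, features for unit 9)."""
    frames = split_frames(signal, frame_len)
    weight_feats = [weight_feature(f) for f in frames]  # path of unit 11
    eval_feats = [eval_feature(f) for f in frames]      # path of unit 12
    return weight_feats, eval_feats


weight_feats, eval_feats = extract_both([1.0, -1.0, 2.0, -2.0], frame_len=2)
print(weight_feats, eval_feats)
```

Passing the extractors as parameters keeps the two paths decoupled, so the feature used to derive the weights can be changed without touching the feature used in the distance calculation.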

[0098] The following shows the results of predicting the subjective evaluation of voice signals without and with a post filter, using the conventional method (BSD) and the present method (ωBSD). The correlation coefficients in the table represent the correlation between the predicted subjective evaluation values and the actual subjective evaluation values.

[0099]

[Table 1]

[0100] The experimental results show that, with or without the post filter, ωBSD correlates more highly with the subjective values than BSD does. This shows that the present method is effective in improving the accuracy with which subjective evaluation values are predicted from the feature parameters of a voice signal.
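The correlation between predicted and actual subjective evaluation values can be computed, for example, as a Pearson correlation coefficient. The sketch below assumes that this is the coefficient meant; the function name `pearson` and the sample scores are hypothetical and are not the experimental data.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical predicted and measured scores, for illustration only.
predicted = [2.0, 3.0, 4.0]
measured = [1.0, 2.0, 3.0]
r = pearson(predicted, measured)
print(r)
```

A value of r closer to 1 means the predicted scores track the measured scores more closely, which is the sense in which ωBSD is compared with BSD.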

[0101]

[Effects of the Invention] As described above, the sound quality subjective evaluation prediction method according to the present invention aims to predict subjective evaluation values with a model close to the way a human evaluates a voice signal. To that end, it uses a weighted distance measure in which portions where the feature parameter value of the input voice signal is large, and portions where the dynamic feature parameter value is small, count more heavily than other portions. As a result, it achieves higher prediction accuracy than the conventional sound quality subjective evaluation prediction method.
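The weighting idea summarized here can be sketched as a per-frame weighted squared distance. The specific weight formulas (the absolute feature value as the first weight, 1 / (1 + |Δ|) as the second weight), the frame-to-frame difference used as the dynamic feature, and all names are assumptions chosen for illustration; they are not the formulas of the specification.

```python
from typing import List, Sequence


def dynamic_feature(x: Sequence[float]) -> List[float]:
    """Frame-to-frame difference of the input feature (first frame gets 0)."""
    return [0.0] + [x[t] - x[t - 1] for t in range(1, len(x))]


def weighted_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Weighted mean squared distance between input (x) and reproduced (y) features.

    Assumed weights (hypothetical forms):
      w1[t] = |x[t]|              larger feature value  -> larger weight
      w2[t] = 1 / (1 + |dx[t]|)   smaller dynamic value -> larger weight
    """
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    dx = dynamic_feature(x)
    w1 = [abs(v) for v in x]
    w2 = [1.0 / (1.0 + abs(d)) for d in dx]
    w = [a * b for a, b in zip(w1, w2)]
    total = sum(w)
    if total == 0.0:
        raise ValueError("all weights are zero")
    return sum(wt * (a - b) ** 2 for wt, a, b in zip(w, x, y)) / total


# Hypothetical feature tracks: the same error placed in a steady frame
# and in a rapidly changing frame of the input.
x = [2.0, 2.0, 4.0]
d_steady = weighted_distance(x, [2.0, 3.0, 4.0])      # error in frame 2
d_transition = weighted_distance(x, [2.0, 2.0, 5.0])  # error in frame 3
print(d_steady, d_transition)
```

With these assumed weights, the same error contributes more when it falls in the steady frame than in the frame where the input feature changes quickly, because the second weight shrinks as the dynamic feature grows.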

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing a first embodiment of the sound quality subjective evaluation prediction method of the first invention.

FIG. 2 is a block diagram showing a second embodiment of the sound quality subjective evaluation prediction method of the first invention.

FIG. 3 is a block diagram showing a third embodiment of the sound quality subjective evaluation prediction method of the first invention.

FIG. 4 is a block diagram showing an embodiment of the sound quality subjective evaluation prediction method of the second invention.

[Explanation of Reference Numerals]

1 Input terminal
2 Output terminal
3 Encoding/decoding unit
4 Input voice signal feature extraction unit
5 Reproduced voice signal feature extraction unit
6 Input voice signal dynamic feature extraction unit
7 First weighting coefficient extraction unit
8 Second weighting coefficient extraction unit
9 Objective evaluation unit
10 Subjective evaluation prediction unit
11 Weighting coefficient extraction input voice signal feature extraction unit
12 Evaluation input voice signal feature extraction unit

Claims

[Claims]

1. A sound quality subjective evaluation prediction method comprising: an encoding/decoding unit that encodes and decodes an input voice signal to produce a reproduced voice signal; an input voice signal feature extraction unit that extracts at least one input voice signal feature parameter from the input voice signal; a reproduced voice signal feature extraction unit that extracts at least one reproduced voice signal feature parameter from the reproduced voice signal; an input voice signal dynamic feature extraction unit that extracts at least one dynamic feature parameter using the input voice signal feature parameter; a first weighting coefficient extraction unit that extracts at least one first weighting coefficient using the input voice signal feature parameter; a second weighting coefficient extraction unit that extracts at least one second weighting coefficient using the dynamic feature parameter; an objective evaluation unit that, when obtaining the distance between the input voice signal feature parameter and the reproduced voice signal feature parameter, performs the calculation using at least one of the first weighting coefficient and the second weighting coefficient and outputs an objective evaluation value; and a subjective evaluation prediction unit that predicts a subjective evaluation value using the objective evaluation value.

2. The sound quality subjective evaluation prediction method according to claim 1, comprising, in place of the input voice signal feature extraction unit: a weighting coefficient extraction input voice signal feature extraction unit that extracts the feature parameters of the input voice signal used by the first weighting coefficient extraction unit and the dynamic feature extraction unit; and an evaluation input voice signal feature extraction unit that extracts the feature parameters of the input voice signal used by the objective evaluation unit.