JP2009500952A

JP2009500952A - Voice quality evaluation method and voice quality evaluation system

Info

Publication number: JP2009500952A
Application number: JP2008520343A
Authority: JP
Inventors: キム，ドー−スク
Original assignee: ルーセントテクノロジーズインコーポレーテッド
Priority date: 2005-07-05
Filing date: 2006-06-30
Publication date: 2009-01-08
Also published as: US7856355B2; KR20080028384A; US20070011006A1; WO2007005875A1; CN101218627A; EP1899961A1

Abstract

In one embodiment, distortion in a received speech signal is estimated using at least one model trained on subjective quality assessment data. A speech quality assessment for the received speech signal is then determined based on the estimated distortion.

Description

The present invention relates to speech quality assessment.

As modern telecommunications networks become more complex and evolve from circuit-switched networks to packet-based networks such as VoIP (Voice over Internet Protocol), they face new types of distortion that affect perceived speech quality. Maintaining and improving the QoS (quality of service) of in-service networks therefore remains an important issue. With current technology, subjective speech quality assessment is the most reliable and most commonly accepted way to evaluate speech quality. In subjective speech quality assessment, human listeners rate the quality of processed speech, where processed speech is a transmitted speech signal that has been processed, for example decoded, at the receiver. The technique is subjective because it is based on the perception of individual humans. However, subjective speech quality assessment is expensive and time-consuming, because a sufficiently large number of speech samples and listeners is needed to obtain statistically reliable results. The subjective results, for example ratings of speech quality on a scale of 1 to 5, are averaged to obtain a MOS (mean opinion score).

Objective speech quality assessment is another technique for evaluating speech quality. Unlike subjective speech quality assessment, it is not based on the perception of individual humans. Objective speech quality assessment can be one of two types. The first type is based on known source speech and is often referred to as intrusive assessment. In this first type, a mobile station, for example, transmits a speech signal derived from, for example encoded from, known source speech. The transmitted speech signal is received, processed, and then recorded. The recorded processed speech signal is compared with the known source speech using a well-known speech evaluation technique, such as PESQ (Perceptual Evaluation of Speech Quality), to determine speech quality. If the source speech signal is not known, or if the transmitted speech signal was not derived from known source speech, this first type of objective speech quality assessment cannot be used.

The second type of objective speech quality assessment is not based on known source speech and is referred to as non-intrusive, single-ended, or output-based assessment. Most embodiments of this second type involve estimating the source speech from the processed speech and then comparing the estimated source speech with the processed speech using well-known speech evaluation techniques. Non-intrusive methods have great potential in real-world applications, for example monitoring the speech quality of an in-service network where the source speech signal is not available. Several attempts have been made to build non-intrusive measurement systems, either by measuring the deviation of feature vectors of the degraded speech signal from codewords derived from a database of undegraded source speech, or by parameterizing a vocal tract model that is sensitive to telecommunications network distortions. Recently, a standardization activity called P.SEAM (Single-Ended Assessment Model) was created in ITU-T to standardize an algorithm for non-intrusive estimation of speech quality. Several models were proposed, and one of them was adopted as standard Recommendation P.563. However, the ITU-T P.563 model shows very limited performance even on the known MOS data used in its development: an average correlation of about 0.88 between subjective and objective scores over 24 MOS tests.

The present invention provides objective speech quality assessment.

In one embodiment, distortion in a received speech signal is estimated using at least one model trained on subjective quality assessment data. A speech quality assessment for the received speech signal is then determined based on the estimated distortion.
For example, the estimating step can include estimating speech distortion in the received speech signal using a first model trained on subjective quality assessment data. The estimating step can further include estimating background noise distortion in the received speech signal using the first model trained on the subjective quality assessment data.

The first model can model the subjective determination of distortion in a speech signal.
The estimating step may further include estimating distortion caused by mutes in the received speech signal using a second model trained on subjective quality assessment data.

In another embodiment of the present invention, an apparatus for speech quality assessment includes at least one estimator that estimates distortion in a received speech signal using at least one model trained on subjective quality assessment data, and a mapping unit that maps the estimated distortion to a speech quality metric.

Yet another embodiment of the present invention provides a method of estimating frame distortion. In this embodiment, speech distortion in a received signal is estimated using a model trained on subjective quality assessment data, and background noise distortion in the received signal is estimated using the model trained on the subjective quality assessment data. The estimated speech distortion and the estimated background noise distortion are combined to obtain a frame distortion estimate.

A further embodiment of the present invention provides a method of estimating mute distortion. In this embodiment, mutes in a received speech signal are detected, and the distortion caused by the detected mutes is estimated using a model trained on subjective quality assessment data.

The present invention further includes a method of training a quality assessment system. In one embodiment, the method includes training a first distortion estimation path of the system using first subjective quality assessment data while excluding the influence of a second distortion estimation path of the system. The first subjective quality assessment data includes first speech signals and first associated subjective quality metrics, and the first speech signals lack mute distortion. Next, the second distortion estimation path of the system is trained using second subjective quality assessment data. The second subjective quality assessment data includes second speech signals and second associated subjective quality metrics, and the second speech signals include mute distortion. The first distortion estimation path is then retrained using the first and second quality assessment data while including the influence of the second distortion estimation path.

The present invention will be more fully understood from the detailed description given below and from the accompanying drawings, which are given by way of example only. In the drawings, like reference numerals designate corresponding parts in the various figures.

FIG. 1 shows a block diagram of a speech quality assessment system according to an embodiment of the present invention. As shown, a filter 10 performs level normalization and modified RX-IRS (receive intermediate reference system) filtering on a speech signal x(n). The filter 10 normalizes the speech signal x(n) to -26 dBov using the well-known P.56 speech voltmeter. The filter 10 then applies the well-known RX-IRS filtering, which reflects the characteristics of the handsets used in subjective listening tests. Because both the normalization and the RX-IRS filtering are well known, these operations are not described in detail.

The normalized and filtered speech signal undergoes articulation analysis by an articulation analysis unit 12. The articulation analysis unit 12 generates feature vectors, each of which includes an average articulation power component, reflecting signal components related to natural human speech, and an average non-articulation power component, reflecting perceptually annoying distortion produced at rates beyond the speed of the human articulatory system. Next, a frame distortion estimator 14 estimates the speech distortion and the background noise distortion for each frame m based on the feature vector for that frame. The frame distortion estimator 14 accumulates the speech distortion and background noise distortion over a number of frames and normalizes the accumulated distortions to produce a frame distortion. The operation of the frame distortion estimator 14 is described in detail below.

The filtered speech signal from the filter 10 is also supplied to a mute detection unit 16. The mute detection unit 16 detects mutes, which are unexpected and unwanted pauses caused by, for example, packet loss. More specifically, the mute detection unit 16 detects the location of each mute in time and the length (also called the depth) of each mute. The operation of the mute detection unit 16 is described in detail below.

A mute distortion estimator 18 receives the information from the mute detection unit 16 and estimates the perceptual distortion caused by the mutes (hereinafter, "mute distortion"). The operation of the mute distortion estimator 18 is described in detail below.

A combiner 20 combines the frame distortion estimate and the mute distortion estimate to produce an objective distortion estimate. A mapping unit 22 maps the objective distortion estimate to a corresponding subjective speech quality figure of merit, such as MOS. For example, the mapping unit 22 can store a look-up table for converting objective distortion estimates to MOS. For values that lie between distortion estimate points in the look-up table, interpolation can be performed to obtain the MOS. FIG. 2 shows a curve of estimated objective distortion versus MOS as represented by the look-up table. Alternatively, the mapping unit 22 may store an equation that characterizes the curve in FIG. 2 and apply the estimated objective distortion as an input to the equation to obtain the resulting MOS. For example, with respect to FIG. 2, the MOS value Q_x may equal (-3.5 × objective distortion estimate + 4.5), limited so that the maximum MOS is 4.5 and the minimum MOS is 1.0.
Next, the operations of the articulation analysis unit 12, the frame distortion estimator 14, the mute detection unit 16, and the mute distortion estimator 18 are described.
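The two mapping options described for the mapping unit 22 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the sample table are hypothetical, and only the linear formula and its 1.0 to 4.5 limits come from the text.

```python
from bisect import bisect_right

def mos_from_formula(distortion, slope=-3.5, offset=4.5, mos_min=1.0, mos_max=4.5):
    """Linear map from objective distortion to MOS, clipped to [mos_min, mos_max]."""
    return max(mos_min, min(mos_max, slope * distortion + offset))

def mos_from_table(distortion, table):
    """Linearly interpolate MOS from (distortion, MOS) points sorted by distortion."""
    xs = [x for x, _ in table]
    if distortion <= xs[0]:
        return table[0][1]
    if distortion >= xs[-1]:
        return table[-1][1]
    i = bisect_right(xs, distortion)          # first table point to the right
    (x0, y0), (x1, y1) = table[i - 1], table[i]
    return y0 + (y1 - y0) * (distortion - x0) / (x1 - x0)

# Hypothetical look-up table, sampled from the same straight line for illustration.
EXAMPLE_TABLE = [(0.0, 4.5), (0.5, 2.75), (1.0, 1.0)]
```

A real look-up table would hold points read from the curve of FIG. 2, which need not be a straight line.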

(Articulation analysis unit)
FIG. 3 shows a speech quality assessment arrangement used in the articulation analysis unit of FIG. 1 according to an embodiment of the present invention. The arrangement comprises a cochlear filter bank 2, an envelope analysis module 4, and an articulation analysis module 6. The normalized and RX-IRS-filtered speech signal s(t) is provided as input to the cochlear filter bank 2. The cochlear filter bank 2 includes a plurality of cochlear filters h_i(t) for processing the speech signal s(t) in accordance with the first stage of the peripheral auditory system, where i = 1, 2, ..., N_C denotes a particular cochlear filter channel and N_C denotes the total number of cochlear filter channels. Specifically, the cochlear filter bank 2 filters the speech signal s(t) to produce a plurality of critical band signals s_i(t), where each critical band signal s_i(t) is equal to the convolution s(t) * h_i(t).
The critical band signals s_i(t) are provided as input to the envelope analysis module 4, which processes them to obtain a plurality of envelopes a_i(t), where

$$a_i(t) = \sqrt{s_i^2(t) + \hat{s}_i^2(t)}$$

and $\hat{s}_i(t)$ is the Hilbert transform of s_i(t).
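The envelope step can be illustrated with a short sketch. It assumes a discrete-time critical band signal and computes the Hilbert envelope as the magnitude of the analytic signal, obtained here with a naive O(n^2) DFT so that the example needs only the standard library. The function names are hypothetical, and a practical system would use an FFT-based routine.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive discrete Fourier transform (adequate for short illustrative frames)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def hilbert_envelope(band_signal):
    """Return |s(t) + j*s_hat(t)|, where s_hat is the Hilbert transform of s."""
    n = len(band_signal)
    spectrum = dft(band_signal)
    # Analytic signal: keep DC (and Nyquist when n is even), double the
    # positive frequencies, and zero the negative frequencies.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        gain[k] = 2.0
    analytic = dft([g * c for g, c in zip(gain, spectrum)], inverse=True)
    # abs(z) equals sqrt(real^2 + imag^2), i.e. sqrt(s^2 + s_hat^2).
    return [abs(z) for z in analytic]
```

For an amplitude-modulated tone whose carrier frequency is well above the modulation frequency, the returned envelope follows the modulating waveform.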

Next, the envelopes a_i(t) are provided as input to the articulation analysis module 6, which processes them to obtain a speech quality assessment for the speech signal s(t). Specifically, the articulation analysis module 6 generates a feature vector based on the power associated with signals generated by the human articulatory system (hereinafter, "articulation power P_A(m, i)") together with the power associated with signals not generated by the human articulatory system (hereinafter, "non-articulation power P_NA(m, i)").

FIG. 4 shows a flowchart 200 for processing the envelopes a_i(t) in the articulation analysis module 6 according to one embodiment of the present invention. In step 210, a Fourier transform is performed on frame m of each of the envelopes a_i(t) to produce a modulation spectrum A_i(m, f), where f is frequency.

FIG. 5 shows an example of the modulation spectrum A_i(m, f) in terms of power versus frequency. As shown, the articulation power P_A(m, i) is the power associated with frequencies of 2 to 30 Hz, and the non-articulation power P_NA(m, i) is the power associated with frequencies above 30 Hz. The power P_No(m, i) associated with frequencies below 2 Hz is the DC component of frame m of the critical band signal s_i(t). In this example, the articulation power P_A(m, i) is chosen as the power associated with frequencies of 2 to 30 Hz because the speed of human articulation is 2 to 30 Hz, and the frequency ranges associated with the articulation power P_A(m, i) and the non-articulation power P_NA(m, i) (hereinafter the "articulation frequency range" and the "non-articulation frequency range", respectively) are adjacent, non-overlapping frequency ranges. For the purposes of this specification, it should be understood that the term "articulation power P_A(m, i)" is not limited to the frequency range of human articulation or to the 2 to 30 Hz range mentioned above. Likewise, the term "non-articulation power P_NA(m, i)" is not limited to frequency ranges above the range associated with the articulation power P_A(m, i). The non-articulation frequency range may or may not overlap with, and may or may not be adjacent to, the articulation frequency range. The non-articulation frequency range may also include frequencies below the lowest frequency of the articulation frequency range, such as those associated with the DC component of frame m of the critical band signal s_i(t).
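The split of a modulation spectrum into DC, articulation, and non-articulation power can be sketched as below for the example band edges of 2 Hz and 30 Hz. The function name, the envelope sampling rate `fs`, and the power normalization (|A|^2 / n^2 over one-sided bins) are assumptions made for illustration; the text does not specify them.

```python
import cmath
import math

def modulation_band_powers(envelope_frame, fs, lo=2.0, hi=30.0):
    """Return (P_dc, P_A, P_NA) for one envelope frame sampled at fs Hz.

    P_dc : power below `lo` Hz (treated as the DC component)
    P_A  : power from `lo` to `hi` Hz (articulation range)
    P_NA : power above `hi` Hz (non-articulation range)
    """
    n = len(envelope_frame)
    p_dc = p_a = p_na = 0.0
    for k in range(n // 2 + 1):               # one-sided spectrum
        freq = k * fs / n
        coeff = sum(envelope_frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        power = abs(coeff) ** 2 / n ** 2       # assumed normalization
        if freq < lo:
            p_dc += power
        elif freq <= hi:
            p_a += power
        else:
            p_na += power
    return p_dc, p_a, p_na
```

With an envelope that carries an 8 Hz modulation and a weaker 60 Hz modulation, the 8 Hz energy lands in P_A and the 60 Hz energy in P_NA.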

Next, the feature vector ζ_k(m) is defined by equation (1). Its components are the average articulation power of equation (2), which reflects signal components related to natural human speech, and the average non-articulation power Ψ_{k,N}(m) of equation (3), which represents perceptually annoying distortion produced at rates beyond the speed of the human articulatory system. To cover the 2 to 30 Hz frequency range that corresponds to the speed of movement of the human articulatory system, L_A in equations (2) and (3) is set to, for example, 4. For the computation of the average non-articulation power Ψ_{k,N}(m), the modulation band powers from the (L_A + 1)-th band to the L_N(k)-th band are considered, as can be seen in equation (3). This means that the highest modulation frequency used in estimating the non-articulation power is chosen differently for different critical bands (note that L_N is a function of k). The reason is based on a study by Ghitza of the upper cutoff frequency of the critical band envelope detector. Ghitza's psychological experiments showed that, in a given auditory channel, the minimum bandwidth of envelope information required to preserve speech quality is roughly one half of the critical bandwidth of that channel. This implies that only modulation frequency components up to half the critical bandwidth are relevant to the perception of speech quality. L_N(k) is therefore determined such that the modulation filter channels considered in computing Ψ_{k,N}(m) cover up to approximately half of the critical bandwidth.
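The averaging described for the two feature components can be sketched as follows. Because equations (1) to (3) are not reproduced in this text, the plain arithmetic mean over bands 1 to L_A and over bands L_A + 1 to L_N(k) is an assumption, as are the function and parameter names.

```python
def feature_vector(band_powers, l_a=4, l_n=None):
    """Return (average articulation power, average non-articulation power).

    band_powers[i - 1] is the power in modulation band i for one critical
    band k and one frame m. Bands 1..l_a are treated as articulation bands,
    bands l_a+1..l_n as non-articulation bands (assumed averaging form).
    """
    if l_n is None:
        l_n = len(band_powers)
    if not 0 < l_a < l_n <= len(band_powers):
        raise ValueError("require 0 < l_a < l_n <= number of modulation bands")
    articulation = sum(band_powers[:l_a]) / l_a
    non_articulation = sum(band_powers[l_a:l_n]) / (l_n - l_a)
    return articulation, non_articulation
```

Passing a smaller `l_n` for narrow critical bands mirrors the statement that L_N depends on k: modulation bands beyond about half the critical bandwidth are left out of the non-articulation average.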

(Frame distortion estimator)
The frame distortion estimator 14 receives the feature vector ζ_k(m) for each frame m from the articulation analysis unit 12. By using the feature vector for each frame as the input to a neural network, for example a multilayer perceptron that forms part of the frame distortion estimator 14, the objective distortion of each frame is estimated by the multilayer perceptron. FIG. 6 shows an example of a multilayer perceptron as used in the frame distortion estimator 14. The output O(m) of the multilayer perceptron for the input vector ζ_k(m) at the m-th frame is expressed by equation (4), where w_jk and W_j are the connection weights of the input layer and the hidden layer, respectively, and g(x) is a nonlinear sigmoid function. The m-th frame distortion for speech, D_s(m), is accumulated over time and then normalized by the total number of speech frames T_s to give the speech distortion D_s. Because background noise also affects the perceived quality of speech, the frame distortion D_v is modeled, as expressed by equation (5), as the sum of the speech distortion D_s and the background noise distortion D_b, which is likewise accumulated and normalized by the total number T_b of background noise frames, that is, non-articulation frames. In equation (5), P_b(m) is the energy of the signal in the m-th frame, P_th is the threshold for audible background noise, and T_s and T_b are the number of speech frames and the number of background noise frames, respectively.
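A minimal sketch of the computation described above is given here, assuming a single hidden layer without bias terms and a logistic sigmoid for g(x). Equations (4) and (5) are not reproduced in this text, so the exact network form is an assumption, and the role of the audibility threshold P_th is deliberately not modeled. All names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, used here as the nonlinear function g(x)."""
    return 1.0 / (1.0 + math.exp(-x))

def mlp_output(features, w_in, w_out):
    """One-hidden-layer perceptron output O(m) for one frame.

    w_in[j][k] plays the role of w_jk (input-to-hidden weights) and
    w_out[j] the role of W_j (hidden-to-output weights). Bias terms are
    omitted, which is an assumption.
    """
    hidden = [sigmoid(sum(w * x for w, x in zip(row, features))) for row in w_in]
    return sigmoid(sum(big_w * h for big_w, h in zip(w_out, hidden)))

def frame_distortion(speech_frame_distortions, noise_frame_distortions):
    """D_v = D_s + D_b, each term being a per-frame average.

    The inputs are per-frame distortion values for frames already labeled
    as speech or background noise; gating by P_th is not modeled here.
    """
    def mean(values):
        return sum(values) / len(values) if values else 0.0
    return mean(speech_frame_distortions) + mean(noise_frame_distortions)
```

In a trained system the weights would come from the training procedure that the description covers later, not from hand-chosen values.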

As will be appreciated, because the frame distortion estimator 14 is a neural network, a multilayer perceptron in this embodiment, the neural network is trained to produce meaningful output. The training of the frame distortion estimator 14 is described in detail below.

(Mute detection unit)
Modern telecommunications networks are becoming increasingly complex. In addition to the existing conventional PSTN (public switched telephone network), various types of networks, such as GSM (Global System for Mobile Communications), CDMA (code division multiple access), UMTS (Universal Mobile Telecommunications System), and VoIP (Voice over Internet Protocol), are widely used in daily life or are about to be deployed around the world. As telephone networks evolve from circuit-switched networks to packet networks (especially for VoIP), packet loss and delay jitter are significant types of distortion that degrade transmitted speech quality. These types of distortion often result in unwanted mutes in the speech signal.

In the mute detection unit 16, the frame log power e(l) is computed every 4 ms according to equation (6), where s(l; n) is the l-th frame signal of s(n), the output of the filter 10, and h_w(n) is a Hamming window of length 64 (8 ms at a sampling rate of 8 kHz). The time derivative Δe(l) of e(l) is obtained according to equation (7).
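The frame log power computation can be sketched as follows for 8 kHz input, a 64-sample Hamming window, and a 32-sample (4 ms) hop. Equations (6) and (7) are not reproduced in this text, so the base-10 logarithm, the small `eps` floor, and the first-difference form of the derivative are all assumptions. The scale on which the thresholds of the detection rules apply is therefore not captured by this sketch.

```python
import math

def hamming(length):
    """Hamming window h_w(n) of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]

def frame_log_power(signal, win_len=64, hop=32, eps=1e-12):
    """Log power of Hamming-windowed frames (64 samples every 32 samples).

    log10 and the eps floor (which avoids log(0) on digital silence) are
    assumptions, not taken from the source.
    """
    window = hamming(win_len)
    powers = []
    for start in range(0, len(signal) - win_len + 1, hop):
        energy = sum((window[i] * signal[start + i]) ** 2 for i in range(win_len))
        powers.append(math.log10(energy + eps))
    return powers

def delta_log_power(e):
    """First difference of e(l), used as a stand-in for its time derivative."""
    return [0.0] + [e[l] - e[l - 1] for l in range(1, len(e))]
```

A signal that stops abruptly produces one strongly negative value of the difference at the first fully silent frame, which is the kind of event the abnormal stop test looks for.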

A voice activity profile is generated using the values of e(l). FIG. 7 shows an illustrative example of a voice activity profile over time. As shown, V_i is the duration of a voice activity, and G_{i-1,i} is the gap between two adjacent voice activities V_{i-1} and V_i.
A frame l_M located at the end of V_i is identified as an abnormal abrupt stop if the following conditions are met:
Δe(l_M) < Δe_stop = -7
L_stop(Z_stop(l_M)) ≥ L_{th,stop} = 0.55    (8)
where L_stop(Z_stop(l_M)) is the output of a neural network detector for abnormal stops, whose input feature vector Z_stop(l_M) is extracted at two time instances: l_M and 15 ms before l_M. For each time instance, 12th-order MFCCs (mel-frequency cepstral coefficients) and a voicing factor are obtained with an analysis length of 30 ms, so that the two 13-element feature sets together give the input feature vector Z_stop(l_M) a dimension of 26. The voicing factor indicates how much periodic content a segment of speech contains and is defined by equation (9) as a normalized autocorrelation within the pitch range of 50 to 400 Hz (corresponding to lags of 20 to 160 time samples).
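A voicing measure of the kind described, a normalized autocorrelation searched over lags of 20 to 160 samples, can be sketched as follows. Equation (9) is not reproduced in this text, so taking the maximum over lags and normalizing by the energies of the two overlapping segments are assumptions, as is the function name.

```python
import math

def voicing_factor(segment, min_lag=20, max_lag=160):
    """Peak normalized autocorrelation over the pitch lag range.

    Lags of 20 to 160 samples correspond to 400 Hz down to 50 Hz at an
    8 kHz sampling rate. Returns a value near 1 for strongly periodic
    input and 0.0 when no positive correlation is found.
    """
    n = len(segment)
    best = 0.0
    for lag in range(min_lag, min(max_lag, n - 1) + 1):
        head = segment[:n - lag]
        tail = segment[lag:]
        num = sum(x * y for x, y in zip(head, tail))
        den = math.sqrt(sum(x * x for x in head) * sum(y * y for y in tail))
        if den > 0.0:
            best = max(best, num / den)
    return best
```

For a 30 ms analysis segment at 8 kHz (240 samples), a 100 Hz tone has a period of 80 samples and yields a value close to 1, whereas noise yields a clearly lower value.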

The neural network detector is a multilayer perceptron trained on a training database, as described in detail below.
A frame l_M located at the beginning of V_i is identified as an abnormal abrupt start if the following conditions are met:
Δe(l_M) > Δe_start = 13
L_start(Z_start(l_M)) ≥ L_{th,start} = 0.55    (10)
where L_start(Z_start(l_M)) is the output of a neural network detector for abnormal starts, whose input feature vector Z_start(l_M) is extracted at two time instances: l_M and 15 ms after l_M. For each time instance, 12th-order MFCCs, the voicing factor defined in (9), and the spectral centroid are obtained with an analysis length of 30 ms. The spectral centroid is defined by equation (11), where |X(k)| is the FFT magnitude of the speech segment. The neural network detector for abnormal stops and the neural network detector for abnormal starts may each be a multilayer perceptron. For example, the neural network for abrupt stops can have 26 input neurons, at least one hidden layer, and one output neuron. This network is trained to output "1" when an abrupt stop is present and "0" otherwise. Any well-known training algorithm can be used, such as the conventional error back-propagation algorithm, which uses the gradient of an error cost function. The neural network for abrupt starts can be constructed and trained in the same way to output "1" for an abrupt start and "0" otherwise.
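The two-part decision rules of conditions (8) and (10), together with a spectral centroid feature, can be sketched as below. The thresholds -7, 13, and 0.55 come from the text. The magnitude-weighted mean bin index used for the centroid is the common textbook definition and is an assumption here, since equation (11) is not reproduced in this text. Function names are hypothetical, and the detector outputs are taken as given rather than computed by a trained network.

```python
import cmath
import math

DELTA_E_STOP = -7.0     # threshold on the log-power derivative, condition (8)
DELTA_E_START = 13.0    # threshold on the log-power derivative, condition (10)
DETECTOR_THRESHOLD = 0.55

def is_abrupt_stop(delta_e, detector_output):
    """Both parts of condition (8) must hold."""
    return delta_e < DELTA_E_STOP and detector_output >= DETECTOR_THRESHOLD

def is_abrupt_start(delta_e, detector_output):
    """Both parts of condition (10) must hold."""
    return delta_e > DELTA_E_START and detector_output >= DETECTOR_THRESHOLD

def spectral_centroid(segment):
    """Magnitude-weighted mean DFT bin index over the one-sided spectrum."""
    n = len(segment)
    mags = [abs(sum(segment[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2 + 1)]
    total = sum(mags)
    if total == 0.0:
        return 0.0
    return sum(k * m for k, m in enumerate(mags)) / total
```

The energy test is cheap and screens candidates, while the neural network output confirms that the surrounding spectrum looks like an abnormal cut rather than a natural pause.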

(Mute distortion estimator)
Recent experiments have also shown that humans may assess speech quality continuously over time and that there is some recency effect in perceived overall quality. That is, the more recent a distortion (for example, an unwanted mute), the greater its impact on speech quality. This is related to biological short-term memory and means that recent events play a larger role than past events. Although no mechanism for this is known, the model according to this embodiment of the present invention models the effect of a mute as a combination of an instantaneous distortion followed by a decay that simulates the short-term memory effect. Accordingly, as shown below, in addition to taking the mutes and their durations into account, the mute distortion estimator 18 also takes the recency effect into account when estimating mute distortion.

Assume that the speech signal contains M mutes and that t_i, i = 1, 2, ..., M, is the time instance at which each mute ends. The objective distortion caused by the mutes is modeled by the mute distortion estimator 18 as follows:
$$D_M = \sum_{i=1}^{M} h_i \exp\!\left(-\frac{t - t_i}{\tau}\right) u(t - t_i)\,\Bigg|_{t=T} \quad (12)$$
where u(x) is the unit step function, equal to 1 for x ≥ 0 and 0 for x < 0; h_i is the instantaneous distortion of the i-th mute at time t_i; T is the duration of the speech signal; and τ is the time constant for the decay of a mute event's impact over time. For each mute, the perceived distortion is increased by the amount h_i at the end of the mute event and then decays with time constant τ. That is, as equation (12) shows, the more recent a mute event, the greater its impact on the estimated mute distortion D_M.
The instantaneous distortion h_i of the i-th mute is estimated by:
h_i = p_1 log(L_i) + p_2 (13)
where L_i is the length of the i-th mute, and p_1 and p_2 are constants determined from training data, as will be described in detail later.
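The mute impact model of (12) and (13) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the logarithm is assumed to be the natural logarithm (the text says only "log"), and any values passed for p_1, p_2, and τ are placeholders, since the text determines them from training data.

```python
import math


def instantaneous_distortion(mute_length, p1, p2):
    """h_i = p1 * log(L_i) + p2, as in (13). Natural log is an assumption."""
    return p1 * math.log(mute_length) + p2


def mute_distortion(mutes, signal_duration, tau, p1, p2):
    """D_M as in (12), evaluated at t = T.

    mutes: list of (end_time, length) pairs, one per mute event.
    signal_duration: T, the duration of the speech signal.
    tau: decay time constant modeling the short-term memory effect.
    """
    total = 0.0
    for end_time, length in mutes:
        elapsed = signal_duration - end_time
        if elapsed >= 0:  # unit step u(T - t_i)
            h = instantaneous_distortion(length, p1, p2)
            total += h * math.exp(-elapsed / tau)
    return total
```

With positive h_i, a mute that ends near the end of the signal contributes more than an identical mute near the start, which is the recency effect the model is meant to capture.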

(Training the frame distortion estimator and the mute distortion estimator)
FIGS. 8A-8C illustrate the training of the frame distortion estimator 14 and the mute distortion estimator 18. FIG. 8A shows the first step of the training process, in which the frame distortion estimator 14 is initially trained. Accordingly, the elements of FIG. 1 that are not involved in this training step are omitted for clarity (e.g., the mute detection unit 16, the mute distortion estimator 18, the combiner 20, and the mapping unit 22). As shown, a database 24 is provided. The database 24 includes a plurality of speech signals and associated subjective MOSs determined in a well-known manner. The speech signals in the database 24 may include distortion, but they do not include temporal discontinuity (e.g., mute) distortion.

Each speech signal in the database (or at least a subset of the speech signals) is supplied to the filter 10. The corresponding subjective MOS is supplied to the inverse mapping unit 30, which converts the MOS into a subjective distortion. The conversion performed by the inverse mapping unit 30 is the inverse of the conversion performed by the mapping unit 22. Accordingly, the inverse mapping unit 30 performs the conversion using a look-up table, an equation, or the like. For example, referring to FIG. 2, the inverse mapping equation may set the subjective distortion equal to −(min(MOS^(p), 4.5) − 4.5)/3.5, where MOS^(p) is the MOS of the p-th speech signal in the database.
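The example inverse mapping equation can be written directly in code. The sketch below is illustrative only; the function names are hypothetical. The forward mapping `distortion_to_mos` is an assumption obtained by algebraically inverting the example equation for MOS values at or below 4.5; the text says only that the mapping unit 22 performs the inverse conversion, possibly through a look-up table.

```python
def mos_to_distortion(mos):
    """Inverse mapping example from the text: -(min(MOS, 4.5) - 4.5) / 3.5."""
    return -(min(mos, 4.5) - 4.5) / 3.5


def distortion_to_mos(distortion):
    """Assumed forward mapping: algebraic inverse of the formula above (valid for MOS <= 4.5)."""
    return 4.5 - 3.5 * distortion
```

Under this formula a MOS of 4.5 or higher maps to zero distortion, and a MOS of 1 maps to a distortion of 1.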

The filter 10, the articulation analysis unit 12, and the frame distortion estimator 14 operate as described above with respect to FIG. 1, except that the weights W_j and W_jk of the frame distortion estimator 14 are initialized to very small random numbers. The error generator 32 receives the frame distortion estimate and the subjective distortion from the inverse mapping unit 30 and generates an error signal. More specifically, in one embodiment, the error generator 32 subtracts the frame distortion estimate from the subjective distortion to generate the error signal.
The frame distortion estimator 14 is trained by minimizing, over all training samples, the accumulated squared difference between the subjective distortion D_sbj and the objective frame distortion estimate D_v^(p) produced by the distortion model λ_V, such that
$$\hat{\lambda}_V = \arg\min_{\lambda_V} \sum_p \left[ D_v^{(p)}\big|_{\lambda_V} - D_{sbj}^{(p)} \right]^2 \quad (14)$$
where the superscript (p) denotes the p-th speech signal. The cost function to be minimized in (14) can be expressed using (5) as follows:
[Equation (15)]
Applying the gradient descent rule gives the following update rule at the t-th step:
[Equation]
and the weights are updated as follows:
[Equation]
where
c^(p)(m) = α(P^(p)(m) − P_th) + β.
The weights are updated until the cost function (15) reaches a predefined error bound.

This is a mixture of supervised and unsupervised training: an overall target is given for each speech file, which consists of a sequence of input frames, but no individual target is given for each frame. With a sufficiently large number of consistent speech samples, the training process can derive a frame distortion model that learns the inherent rules relating frame feature vectors to frame quality.
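The idea of training a per-frame model against a single per-file target can be illustrated with a deliberately simplified Python sketch. It is not the patented method: a linear per-frame model stands in for the multilayer perceptron, a plain mean over frames stands in for the weighted combination of equation (5), and the weights start at zero rather than small random numbers so that the example is deterministic. All names are hypothetical.

```python
def utterance_distortion(frames, weights, bias):
    """Average the per-frame outputs into one file-level distortion estimate."""
    per_frame = [sum(w * x for w, x in zip(weights, f)) + bias for f in frames]
    return sum(per_frame) / len(per_frame)


def cost(dataset, weights, bias):
    """Half the accumulated squared error between estimates and subjective targets."""
    return 0.5 * sum(
        (utterance_distortion(frames, weights, bias) - target) ** 2
        for frames, target in dataset
    )


def train(dataset, n_features, learning_rate=0.1, steps=200):
    """Batch gradient descent; only file-level targets are used, never per-frame ones."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for frames, target in dataset:
            err = utterance_distortion(frames, weights, bias) - target
            # The file-level error is shared by every frame of the file.
            for j in range(n_features):
                mean_feature = sum(f[j] for f in frames) / len(frames)
                grad_w[j] += err * mean_feature
            grad_b += err
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
    return weights, bias
```

Each item of `dataset` pairs a list of frame feature vectors with one subjective distortion value, mirroring the arrangement in which the error generator compares a file-level estimate with the inverse-mapped MOS.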

After the initial training of the frame distortion estimator 14, the mute impact model of the mute distortion estimator 18 is trained. FIG. 8B illustrates this step of the training process. Elements of FIG. 1 that are not involved in this training step are omitted for clarity (e.g., the mapping unit 22). As shown, a database 26 is provided. The database 26 includes a plurality of speech signals and associated subjective MOSs determined in a well-known manner. The speech signals in the database 26 may include distortion that does include temporal discontinuity (e.g., mute) distortion.

Each speech signal in the database (or at least a subset of the speech signals) is supplied to the filter 10. The corresponding subjective MOS is supplied to the inverse mapping unit 30, which converts the MOS into a distortion. The filter 10, the articulation analysis unit 12, and the frame distortion estimator 14 operate as described above with respect to FIG. 1, except that the weights W_j and W_jk of the frame distortion estimator 14 are those obtained in the first step of the training process. The mute detection unit 16 and the mute distortion estimator 18 also operate as described above with respect to FIG. 1. In this training step, the combiner 20 is included in the training loop and supplies the combination of the frame distortion estimate and the mute distortion estimate to the error generator 32. The error generator 32 receives the overall distortion estimate from the combiner 20 and the subjective distortion from the inverse mapping unit 30, and generates an error signal. More specifically, in one embodiment, the error generator 32 subtracts the overall distortion from the subjective distortion to generate the error signal.
The training finds, by regression, the optimal parameter set λ_M in (13), i.e., p_1 and p_2, such that
$$\hat{\lambda}_M = \arg\min_{\lambda_M} \sum_p \left[ D^{(p)}\big|_{\hat{\lambda}_V,\,\lambda_M} - D_{sbj}^{(p)} \right]^2$$
where D^(p) is the overall distortion estimate supplied by the combiner 20 for the p-th speech signal, and the previously trained frame distortion model \hat{\lambda}_V is used.
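Because h_i in (13) is linear in p_1 and p_2, the regression for the mute impact model reduces to a two-parameter linear least-squares problem once the frame distortion estimates are held fixed. The Python sketch below illustrates this under explicit assumptions that are not confirmed by this passage: the combiner is assumed to add the frame and mute distortion estimates, the logarithm is assumed to be natural, τ is assumed to be known in advance, and all function names are hypothetical.

```python
import math


def mute_basis(mutes, signal_duration, tau):
    """Return (a, b) such that D_M = p1 * a + p2 * b for one speech signal.

    mutes: list of (end_time, length) pairs.
    """
    a = 0.0
    b = 0.0
    for end_time, length in mutes:
        if end_time <= signal_duration:
            decay = math.exp(-(signal_duration - end_time) / tau)
            a += decay * math.log(length)
            b += decay
    return a, b


def fit_mute_parameters(samples, tau):
    """Least-squares fit of (p1, p2) by solving the 2x2 normal equations.

    samples: list of (mutes, signal_duration, frame_distortion, subjective_distortion).
    Assumes overall distortion = frame distortion + mute distortion.
    """
    saa = sab = sbb = sar = sbr = 0.0
    for mutes, duration, d_frame, d_subjective in samples:
        a, b = mute_basis(mutes, duration, tau)
        residual = d_subjective - d_frame  # what the mute model must explain
        saa += a * a
        sab += a * b
        sbb += b * b
        sar += a * residual
        sbr += b * residual
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("training data do not determine p1 and p2 uniquely")
    p1 = (sar * sbb - sab * sbr) / det
    p2 = (saa * sbr - sab * sar) / det
    return p1, p2
```

If τ also had to be learned, the problem would no longer be linear, and an outer search over τ around this closed-form fit would be one possible approach.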

The third and final step of the training process is to retrain the frame distortion estimator 14. FIG. 8C illustrates this final training step. As shown, a database 28, which includes the database 24 and the database 26, supplies the speech signals and subjective MOSs. The error signal from the error generator 32 is supplied to the frame distortion estimator 14. This retraining step allows the frame distortion model to compensate for the residual error of the mute impact model. The retrained distortion model is
$$\lambda_V^{*} = \arg\min_{\lambda_V} \sum_p \left[ D^{(p)}\big|_{\lambda_V,\,\hat{\lambda}_M} - D_{sbj}^{(p)} \right]^2$$
where D^(p) is the overall distortion estimate for the p-th speech signal and \hat{\lambda}_M is the trained mute impact model. This training can be performed in the same way as training step 1, with the previously trained frame distortion model \hat{\lambda}_V as the initial parameters of the model.

As will be understood from the foregoing embodiments, the voice quality estimation system can be implemented as software executed on a computer, as a hard-wired circuit, as a digital signal processor, or the like.

The invention having been thus described, it will be apparent that it may be varied in many ways. Such variations are not to be regarded as a departure from the invention, and all such modifications are intended to be included within the scope of the invention.

FIG. 1 is a block diagram showing a voice quality assessment system according to an embodiment of the present invention.
FIG. 2 is a diagram showing a curve of estimated objective distortion versus MOS, as represented by a look-up table.
FIG. 3 is a diagram showing a voice quality assessment arrangement used in the articulation analysis unit according to an embodiment of the present invention.
FIG. 4 is a flowchart for processing a plurality of envelopes a_i(t) in the articulation analysis module of FIG. 3 according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example of a modulation spectrum A_i(m, f) in terms of power versus frequency.
FIG. 6 is a diagram showing an example of a multilayer perceptron as used in the frame distortion estimator of FIG. 1.
FIG. 7 is a diagram showing an example of a profile of voice activity over time.
FIG. 8A is a diagram showing the training of the frame distortion estimator and the mute distortion estimator of FIG. 1.
FIG. 8B is a diagram showing the training of the frame distortion estimator and the mute distortion estimator of FIG. 1.
FIG. 8C is a diagram showing the training of the frame distortion estimator and the mute distortion estimator of FIG. 1.

Claims

A voice quality assessment method comprising:
estimating distortion in a received speech signal using at least one model trained based on subjective quality assessment data; and
determining a voice quality assessment for the received speech signal based on the estimated distortion.

The method of claim 1, wherein the estimating step includes estimating speech distortion in the received speech signal using a first model trained based on the subjective quality assessment data.

The method of claim 2, wherein the estimating step includes estimating background noise distortion in the received speech signal using the first model trained based on the subjective quality assessment data.

The method of claim 3, further comprising calculating an average articulation power and an average non-articulation power from the received speech signal, wherein
the step of estimating the speech distortion estimates the speech distortion using the calculated average articulation power, the calculated average non-articulation power, and the first model, and
the step of estimating the background noise distortion estimates the background noise distortion using the calculated average articulation power, the calculated average non-articulation power, and the first model.

The method of claim 3, wherein the estimating step includes estimating distortion caused by mutes in the received speech signal using a second model trained based on the subjective quality assessment data.

The method of claim 5, wherein the step of estimating the distortion caused by the detected mutes estimates the mute distortion such that a later mute in the received speech signal has a greater impact than an earlier mute in the received speech signal.

The method of claim 1, wherein the estimating step includes estimating mute distortion in the received speech signal using a model trained based on subjective quality assessment data.

The method of claim 1, wherein the determining step maps the estimated distortion to a subjective quality assessment metric.

An apparatus for voice quality assessment, comprising:
at least one estimator that estimates distortion in a received speech signal using at least one model trained based on subjective quality assessment data; and
a mapping unit that maps the estimated distortion to a voice quality metric.

A method of training a quality assessment system, comprising:
training a first distortion estimation path of the system using first subjective quality assessment data while excluding the influence of a second distortion estimation path of the system, the first subjective quality assessment data including first speech signals and first associated subjective quality metrics, the first speech signals lacking mute distortion;
training the second distortion estimation path of the system using second subjective quality assessment data, the second subjective quality assessment data including second speech signals and second associated subjective quality metrics, the second speech signals including mute distortion; and
retraining the first distortion path using the first and second quality assessment data while including the influence of the second distortion path.