JPH07219577A - Phoneme segmentation method - Google Patents

Phoneme segmentation method

Info

Publication number
JPH07219577A
JPH07219577A JP6007507A JP750794A
Authority
JP
Japan
Prior art keywords
section
projection
phoneme
moving average
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP6007507A
Other languages
Japanese (ja)
Inventor
Chiharu Yamano
千晴 山野
Yumi Takizawa
由美 滝沢
Keisuke Oda
啓介 小田
Atsushi Fukazawa
敦司 深澤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oki Electric Industry Co Ltd filed Critical Oki Electric Industry Co Ltd
Priority to JP6007507A priority Critical patent/JPH07219577A/en
Publication of JPH07219577A publication Critical patent/JPH07219577A/en
Pending legal-status Critical Current

Abstract

PURPOSE: To easily perform good segmentation by determining the variation sections and stationary sections of phonemes from the moving average of the variation energy. CONSTITUTION: Speech data S(n) input from an input terminal 1 is analyzed by a vocal tract analysis part 2 using linear prediction with a maximum order of 10, and its prediction error is analyzed by a vocal cord analysis part 3, also using linear prediction with a maximum order of 10, to extract a time series of speech feature vectors X(n). A projection arithmetic part 4 then generates two-dimensional projection vectors Y(n). The projection vectors Y(n) are supplied to a variation energy calculation part 5, which calculates the variation energy e(n) and finds its moving average E(n). This moving average E(n) is input to a threshold circuit, which determines sections where E(n) does not exceed a threshold Th to be stationary phoneme sections and sections where the threshold is exceeded to be non-stationary phoneme sections.

Description

【発明の詳細な説明】Detailed Description of the Invention

【0001】[0001]

【産業上の利用分野】本発明は連続して発音した音声信
号の音韻セグメンテーション方法に関する。
FIELD OF THE INVENTION: The present invention relates to a phoneme segmentation method for continuously uttered speech signals.

【0002】[0002]

【従来の技術】音韻あるいは音節単位での認識方法とし
て、例えば次記文献のように、入力音声信号の分析を多
次元で行い、それを低次元へ射影して認識処理する方法
が知られている。 文献名:電子情報通信学会、技術研究報告、DSP92
−56、1992年7月24日発行。 この音声認識方法は、入力音声から音韻の特徴を表す6
次元の音声特徴パラメータを抽出し、それらを射影して
2次元空間上の位置ベクトル(射影ベクトル)としてと
らえ、その時間的な動特性をとらえることにより音韻を
対応させるものである。そこで提案されている音韻のセ
グメンテーション方法は、射影ベクトルの第1成分であ
る一つの射影パラメータとその移動平均との差を監視
し、0レベルを越えた区間を急区間(子音に対応する区
間)とし、それ以外であれば緩区間(母音に対応する区
間)と決定する。
2. Description of the Related Art: As a method of recognition in units of phonemes or syllables, a method is known in which an input speech signal is analyzed in many dimensions and projected into a low-dimensional space for recognition processing, as described, for example, in the following document. Reference: IEICE Technical Report DSP92-56, published 24 July 1992. This speech recognition method extracts six-dimensional speech feature parameters representing phoneme characteristics from the input speech, projects them to obtain position vectors (projection vectors) in a two-dimensional space, and associates phonemes by capturing the temporal dynamics of those vectors. The phoneme segmentation method proposed there monitors the difference between one projection parameter, the first component of the projection vector, and its moving average: a section in which the difference exceeds the zero level is determined to be a sudden section (corresponding to a consonant), and any other section a slow section (corresponding to a vowel).
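As an illustrative sketch only (the cited report's actual implementation and its roughly 200 ms window are not reproduced here), the prior-art rule can be written as follows; the function name, toy trajectory, and the short 4-frame window are all assumptions for demonstration:

```python
import numpy as np

def prior_art_segment(y1, win):
    """Prior-art rule: a frame whose first projection parameter y1 exceeds
    its own moving average is in a 'sudden' section (consonant); all other
    frames are in a 'slow' section (vowel).  `win` is the moving-average
    length in frames (about 200 ms worth of frames in the cited report;
    a short toy window is used below)."""
    kernel = np.ones(win) / win
    avg = np.convolve(y1, kernel, mode="same")  # centred moving average of y1
    return (y1 - avg) > 0.0                     # difference above the 0 level

# Toy trajectory: a flat vowel-like stretch with a consonant-like jump.
y1 = np.array([1.0] * 10 + [5.0] * 3 + [1.0] * 10)
sudden = prior_art_segment(y1, win=4)
```

The point the patent criticizes is visible here: the window length `win` must match the phoneme durations, which the prior method assumes to be roughly uniform.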

【0003】[0003]

【発明が解決しようとする課題】この方法は、母音と子
音が交互に現れしかもその音韻の継続時間がほぼ均等で
あるような音声の性質に着目したものであり、特に日本
語の場合は移動平均をとる時間長を200ms程度に設
定することにより、母音部分と子音部分の境界を簡単な
方法により検出することが可能である。しかしながら、
この方法を用いて、移動平均の時間長を種々変更し、音
韻の継続時間長にばらつきのあるような音声をセグメン
テーションしようと試みたところ、ふさわしくない区間
検出がみられた。従って、本発明は、定常的な音韻また
は非定常的な音韻の継続時間にばらつきのあるような特
徴をもつ連続音声が入力された場合でも、音韻の定常区
間及び変動区間の良好な検出を簡単な方法で可能とする
ことを目的とするものであり、これを移動平均の時間長
を短くすることによって達成したものである。
This method exploits a property of speech in which vowels and consonants appear alternately and the durations of the phonemes are roughly equal; for Japanese in particular, setting the moving-average length to about 200 ms makes it possible to detect the boundary between vowel and consonant parts by a simple method. However, when this method was applied, with the moving-average length varied in several ways, to speech whose phoneme durations vary, unsuitable sections were detected. The object of the present invention is therefore to enable good detection of the stationary and variation sections of phonemes by a simple method even when the input is continuous speech whose stationary or non-stationary phoneme durations vary; this is achieved by shortening the time length of the moving average.

【0004】[0004]

【課題を解決するための手段】本発明は、入力音声信号
を分析して、1個以上の音源特徴と2個以上の声道特徴
とを含む多次元の音声特徴ベクトルの時系列を抽出する
処理ステップと、複数の射影パラメータが互いに相関の
ないように予め決めてある射影演算子を加重として用
い、音声特徴ベクトルの要素を加算することにより、音
声特徴ベクトルを2次元又は3次元の射影パラメータへ
次元縮小した射影ベクトルの時系列を得る処理ステップ
を有する。また、射影ベクトルの変動エネルギーの時系
列上の移動平均を得る処理ステップを有する。また、そ
の変動エネルギー移動平均を監視し、それが予め決めら
れたしきい値を越えた区間を音韻の変動区間と決定し、
そのしきい値を越えない区間を音韻の定常区間と決定す
る処理ステップを有する。
SUMMARY OF THE INVENTION: The present invention has a processing step of analyzing an input speech signal to extract a time series of multidimensional speech feature vectors containing one or more sound-source features and two or more vocal-tract features, and a processing step of obtaining a time series of projection vectors in which each speech feature vector is reduced to two or three projection parameters by summing the elements of the feature vector weighted by a projection operator chosen in advance so that the projection parameters are uncorrelated with one another. It further has a processing step of obtaining a moving average, along the time series, of the variation energy of the projection vectors, and a processing step of monitoring that variation-energy moving average, determining sections in which it exceeds a predetermined threshold to be phoneme variation sections and sections in which it does not to be phoneme stationary sections.

【0005】[0005]

【作用】入力音声信号を多次元で分析し、2次または3
次に次元縮小する。射影演算子を適切に設定しておくこ
とにより、射影ベクトルのパラメータは互いに無相関と
なる。射影パラメータも、分析で得た音声特徴と同様
に、音韻の準定常部分では、音響的な性質がほとんど変
化しないため、微少な変動はあるもののぼほ一定値とな
り、それに対し、音韻の変化している部分では、その音
響的な性質が短時間で大きく変動しているため、大きく
揺れ動き、よって、局所的な非定常性を表すパラメータ
とみることができる。射影ベクトルの変動エネルギー
は、各パラメータ毎に隣接サンプルとの差を計算し、そ
の2乗和を計算することによって求めることができ、適
当な時間窓を用いて移動平均をとることによって、変動
エネルギーを平滑化したものを得ることができる。変動
エネルギー(レベル)が予め決めたしきい値より小さい
ときは定常区間と判定し、それよりも大きいときは変動
区間であると判定することによって、音韻セグメンテー
ションを行う。
OPERATION: The input speech signal is analyzed in many dimensions and then reduced to two or three dimensions. By setting the projection operator appropriately, the parameters of the projection vector become mutually uncorrelated. Like the speech features obtained by the analysis, the projection parameters remain almost constant, with only slight fluctuation, in the quasi-stationary part of a phoneme, where the acoustic properties hardly change; in the parts where the phoneme is changing, the acoustic properties vary greatly within a short time, so the parameters swing widely. The projection parameters can therefore be regarded as parameters representing local non-stationarity. The variation energy of the projection vector can be obtained by computing, for each parameter, the difference from the adjacent sample and taking the sum of squares, and a smoothed version can be obtained by taking a moving average with a suitable time window. Phoneme segmentation is performed by judging a frame to belong to a stationary section when the variation energy (level) is smaller than a predetermined threshold, and to a variation section when it is larger.

【0006】入力音声信号を6次元程度で分析した場
合、1次元まで、2次元まで、3次元までの寄与率は、
大雑把に、50%程度、70%程度、90%程度であ
る。射影パラメータは互いに無相関または直交している
ものと見なせるため、射影ベクトルの変動エネルギーす
なわち射影パラメータの変化量の2乗和は、射影ベクト
ルの大きさの変化量と対応し、また、変動エネルギー
は、分析で得た情報量の70%程度を反映した値をとる
ことになる。音声特徴ベクトルの20%の情報を担う第
2成分は短時間の変動を有効に反映するものと推察で
き、数10msの移動平均時間長を設定することによ
り、子音と母音間、子音と子音間などの音韻間、並びに
無音と音韻間において、比較的確実に変動区間を検出す
ることができる。
When the input speech signal is analyzed in about six dimensions, the cumulative contribution ratios up to one, two, and three dimensions are, roughly, about 50%, 70%, and 90%. Since the projection parameters can be regarded as mutually uncorrelated or orthogonal, the variation energy of the projection vector, that is, the sum of squares of the changes in the projection parameters, corresponds to the change in the magnitude of the projection vector, and the variation energy takes a value reflecting about 70% of the information obtained by the analysis. The second component, which carries about 20% of the information in the speech feature vector, can be inferred to reflect short-term variation effectively; by setting a moving-average length of several tens of milliseconds, variation sections can be detected fairly reliably between phonemes, such as between consonants and vowels or between consonants, as well as between silence and phonemes.

【0007】[0007]

【実施例】図1は本発明の一実施例を示すブロック図で
ある。以下、本発明の実施例を図に基づいて説明する。
図1において、入力端子1から入力された音声データS
(n)は声道分析部2にあたえられ、そこで、最大予測
次数10次の線形予測によって分析し、その予測誤差を
声帯分析部3で、最大予測次数10次の線形予測によっ
て分析し、音声特徴ベクトルX(n)の時系列が抽出さ
れる。ここでは、音声データS(n)のサンプリング周
波数を8KHz、分析フレーム長を24ms、フレーム
周期(分析間隔)を4msとし、音声平均パワーx1
(n)、声道特徴パラメータx2(n)〜x4(n)、
及び音源特徴パラメータx5(n)〜x6(n)とを音
声特徴パラメータ(ベクトル要素)として持つ音声特徴
ベクトルX(n)を抽出する。ただし、nは分析時刻で
ある。 x1 : 音声平均パワー x2 : 予測次数10次の声道エントロピー x3 : 予測次数2次の予測係数に基づく声道周波数 x4 : 同上の強度 x5 : 音源平均強度 x6 : 予測次数10次の音源エントロピー なお、要素x3(n)、x4(n)は、予測次数2次の
予測係数から1組の複素共役根を算出し、その絶対値を
一方の要素x4(n)とし、Z平面で偏角が0〜πであ
る根に対応した周波数を他方の要素x3(n)としたも
のである。
EMBODIMENT: FIG. 1 is a block diagram showing an embodiment of the present invention. An embodiment of the present invention will now be described with reference to the drawings. In FIG. 1, speech data S(n) input from an input terminal 1 is given to a vocal tract analysis unit 2, where it is analyzed by linear prediction with a maximum order of 10; its prediction error is analyzed by a vocal cord analysis unit 3, also by linear prediction with a maximum order of 10, and a time series of speech feature vectors X(n) is extracted. Here the sampling frequency of the speech data S(n) is 8 kHz, the analysis frame length is 24 ms, and the frame period (analysis interval) is 4 ms; the extracted speech feature vector X(n) has the average speech power x1(n), vocal-tract feature parameters x2(n) to x4(n), and sound-source feature parameters x5(n) and x6(n) as its speech feature parameters (vector elements), where n is the analysis time:
x1: average speech power
x2: vocal-tract entropy at prediction order 10
x3: vocal-tract frequency based on the prediction coefficients at prediction order 2
x4: intensity of the above
x5: average sound-source intensity
x6: sound-source entropy at prediction order 10
The elements x3(n) and x4(n) are obtained by computing one pair of complex conjugate roots from the second-order prediction coefficients; the absolute value of the root is taken as one element, x4(n), and the frequency corresponding to the root whose argument lies between 0 and π in the z-plane is taken as the other element, x3(n).

【0008】次に、この音声特徴ベクトルX(n)の時
系列は射影演算部4へ与えられる。ここでは、式(1)
に示すように、射影行列Lを加重とした加算演算を実行
し、6次元特徴ベクトルX(n)から互いに相関のない
成分(射影パラメータ)を持つ2次元射影ベクトルY
(n)を作成する。 Y(n) = L*X(n) 式(1) ただし、上式は行列演算を表し、射影行列Lとしては式
(2)に示すものを用いた。また、この射影行列Lは、
音韻サンプルの主成分分析によって、予め決められたも
のである。
Next, the time series of speech feature vectors X(n) is given to a projection calculation unit 4. Here, as shown in equation (1), an addition operation weighted by a projection matrix L is executed to create, from each six-dimensional feature vector X(n), a two-dimensional projection vector Y(n) whose components (projection parameters) are uncorrelated with one another:
Y(n) = L * X(n)   (1)
where the above expression denotes a matrix operation and the projection matrix L shown in equation (2) was used. This projection matrix L is determined in advance by principal component analysis of phoneme samples.
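Equation (1) is an ordinary matrix product. The patent's actual 2x6 matrix from equation (2) is given only as an image and is not reproduced here, so the values of `L` below are purely hypothetical placeholders:

```python
import numpy as np

# Hypothetical 2x6 projection matrix; the patent's real L (equation (2))
# comes from principal component analysis of phoneme samples.
L = np.array([[0.5, 0.3, 0.2, 0.1, 0.4, 0.2],
              [0.2, -0.4, 0.5, 0.3, -0.1, 0.6]])

def project(X):
    """Y(n) = L * X(n): reduce each 6-dim feature vector to 2 dims.
    X has shape (N, 6); the result has shape (N, 2)."""
    return X @ L.T

X = np.ones((5, 6))        # dummy time series of feature vectors
Y = project(X)             # each row is a 2-dim projection vector Y(n)
```

With a matrix obtained by PCA, the two output components are uncorrelated over the training data, which is what lets the later step treat their squared changes as additive energy.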

【0009】[0009]

【数1】 [Equation 1]

【0010】この射影ベクトルY(n)は変動エネルギ
ー計算部5に与えられ、そこで、まず、式(3)に示す
ように、変動エネルギーe(n)を計算する。
This projection vector Y(n) is given to a variation energy calculation unit 5, where the variation energy e(n) is first calculated as shown in equation (3).

【0011】[0011]

【数2】 [Equation 2]

【0012】次に、式(4)に示すように、この変動エ
ネルギーe(n)の移動平均E(n)を求めることによ
り、E(n)を算出する。
Next, as shown in equation (4), the moving average E(n) of the variation energy e(n) is obtained.

【0013】[0013]

【数3】 [Equation 3]

【0014】ここで、hiは移動平均をとる際の重みで
あり、hiの形としては矩形窓を用い、単純にhI=1
とした。また、pは移動平均の区間長の半分であり、p
=44msとした。
Here, h_i is the weight used in taking the moving average; a rectangular window was used for h_i, simply setting h_i = 1. Also, p is half the section length of the moving average, and p = 44 ms was used.
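Equation (4) itself is an image not reproduced in this text; the sketch below assumes the standard reading of a centred rectangular-window average over frames n-p to n+p with h_i = 1. At the embodiment's 4 ms frame period, p = 44 ms corresponds to 11 frames (our conversion, not stated explicitly in the patent):

```python
import numpy as np

def moving_average(e, p):
    """E(n): rectangular-window moving average of the variation energy
    e(n) over frames n-p .. n+p (h_i = 1).  Edges are handled here by
    padding with the edge values."""
    win = 2 * p + 1
    padded = np.pad(e, p, mode="edge")
    return np.convolve(padded, np.ones(win) / win, mode="valid")

p_frames = 11                  # 44 ms / 4 ms frame period
e = np.zeros(50)
e[25] = 23.0                   # a single variation spike
E = moving_average(e, p_frames)
```

The single spike is smeared over the 23-frame window, which is what turns a momentary phoneme transition into a detectable plateau in E(n).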

【0015】次に、上記のパラメータE(n)はしきい
値回路に入力される。このE(n)がある所定のしきい
値Thを越えない区間、すなわち E(n)<Th 式(5) を満たす区間を定常音韻区間と決定し、越えた区間、す
なわち E(n)≧Th 式(6) を満たす区間を非定常音韻区間と決定する。ここで、T
h=0.3とした。なお、しきい値は、移動平均の区間
長に関係があり、入力音声の音韻の継続時間特性にあわ
せてに適切に決定すべきものである。図2は、入力音声
「ジャズ」に対し上記の方法を用いセグメンテーション
を行った結果を示した図である。この図2を見てわかる
ようにそれぞれの音韻区間とそのわたりの区間が検出さ
れていることがわかる。すなわち、図2は、入力音声
「ジャズ」の前半部分について、入力音声信号S(n)
と変動エネルギー移動平均E(n)とを示したものであ
り、無音と子音の間、子音と母音の間で変動区間が検出
されることわかる。
Next, the parameter E(n) is input to a threshold circuit. A section in which E(n) does not exceed a predetermined threshold Th, that is, a section satisfying E(n) < Th (5), is determined to be a stationary phoneme section, and a section in which it does, that is, a section satisfying E(n) ≥ Th (6), is determined to be a non-stationary phoneme section. Here Th = 0.3 was used. The threshold is related to the section length of the moving average and should be determined appropriately according to the phoneme duration characteristics of the input speech. FIG. 2 shows the result of segmenting the input speech "jazz" by the above method: for the first half of the utterance it plots the input speech signal S(n) and the variation-energy moving average E(n), and it can be seen that each phoneme section and the transition sections between them are detected, with variation sections detected between silence and a consonant and between a consonant and a vowel.
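The decision rule of equations (5) and (6) reduces to a single comparison per frame. A minimal sketch, using the embodiment's Th = 0.3 and a made-up E(n) trace:

```python
import numpy as np

def segment(E, th=0.3):
    """Label each frame per equations (5)/(6): True means a non-stationary
    (variation) section where E(n) >= th; False means a stationary phoneme
    section where E(n) < th.  th = 0.3 follows the embodiment but, as the
    text notes, should be retuned to the moving-average length and the
    phoneme durations of the material."""
    return E >= th

E = np.array([0.05, 0.1, 0.8, 0.9, 0.2, 0.05])   # hypothetical E(n) values
labels = segment(E)            # [False, False, True, True, False, False]
```

Runs of True then mark phoneme transitions (silence-to-consonant, consonant-to-vowel), and the stationary runs in between are the phoneme segments themselves.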

【0016】[0016]

【発明の効果】多次元空間での短区間の変動のエネルギ
ーによって音韻の非定常の度合いをパラメータ化するこ
とができ、これによって比較的簡単な方法で短い音韻に
対しても長い音韻に対しても良好なセグメンテーション
が可能となった。
EFFECT OF THE INVENTION: The degree of non-stationarity of a phoneme can be parameterized by the energy of short-interval variation in a multidimensional space, which makes good segmentation possible by a comparatively simple method for short and long phonemes alike.

【図面の簡単な説明】[Brief description of drawings]

【図1】本発明の一実施例のブロック図FIG. 1 is a block diagram of an embodiment of the present invention.

【図2】本発明の一実施例による動作説明のための図FIG. 2 is a diagram for explaining an operation according to an embodiment of the present invention.

【符号の説明】[Explanation of symbols]

1 input terminal; 2 vocal tract analysis unit; 3 vocal cord analysis unit; 4 projection calculation unit; 5 variation energy calculation unit; S(n) speech signal; x1(n) to x6(n) six-dimensional speech feature vector; y1(n), y2(n) two-dimensional speech feature vector (projection vector); E(n) variation-energy moving average

フロントページの続き (72)発明者 深澤 敦司 東京都港区虎ノ門1丁目7番12号 沖電気 工業株式会社内Front page continuation (72) Inventor Atsushi Fukasawa 1-7-12 Toranomon, Minato-ku, Tokyo Oki Electric Industry Co., Ltd.

Claims (1)

【特許請求の範囲】[Claims] 【請求項1】 入力音声信号を分析して、1個以上の音
源特徴と2個以上の声道特徴とを音声特徴パラメータと
して含む音声特徴ベクトルの時系列を抽出する処理ステ
ップと、 複数の射影パラメータが互いに相関のないように予め決
めてある射影演算子を加重として用い、前記音声特徴ベ
クトルの各音声特徴パラメータを加算することにより、
前記音声特徴ベクトルを2次元又は3次元の射影パラメ
ータへ次元縮小した射影ベクトルの時系列を得る処理ス
テップと、 当該射影ベクトルの変動エネルギーの時系列上の移動平
均を得る処理ステップと、 当該変動エネルギー移動平均を監視し、それが予め決め
られたしきい値を越えた区間を音韻の変動区間と決定
し、当該しきい値を越えない区間を音韻の定常区間と決
定する処理ステップと、を備えたことを特徴とする音韻
セグメンテーション方法
1. A phoneme segmentation method comprising: a processing step of analyzing an input speech signal to extract a time series of speech feature vectors containing one or more sound-source features and two or more vocal-tract features as speech feature parameters; a processing step of obtaining a time series of projection vectors in which each speech feature vector is reduced to two or three projection parameters by summing the speech feature parameters of the vector weighted by a projection operator chosen in advance so that the projection parameters are uncorrelated with one another; a processing step of obtaining a moving average, along the time series, of the variation energy of the projection vectors; and a processing step of monitoring the variation-energy moving average, determining sections in which it exceeds a predetermined threshold to be phoneme variation sections and sections in which it does not exceed the threshold to be phoneme stationary sections.
JP6007507A 1994-01-27 1994-01-27 Phoneme segmentation method Pending JPH07219577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP6007507A JPH07219577A (en) 1994-01-27 1994-01-27 Phoneme segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP6007507A JPH07219577A (en) 1994-01-27 1994-01-27 Phoneme segmentation method

Publications (1)

Publication Number Publication Date
JPH07219577A true JPH07219577A (en) 1995-08-18

Family

ID=11667708

Family Applications (1)

Application Number Title Priority Date Filing Date
JP6007507A Pending JPH07219577A (en) 1994-01-27 1994-01-27 Phoneme segmentation method

Country Status (1)

Country Link
JP (1) JPH07219577A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334238A (en) * 1996-11-20 2004-11-25 Yamaha Corp Sound signal analyzing device and method
JP2008139747A (en) * 2006-12-05 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> Sound model parameter update processing method, sound model parameter update processor, program, and recording medium


Similar Documents

Publication Publication Date Title
Drugman et al. A comparative study of glottal source estimation techniques
US8655656B2 (en) Method and system for assessing intelligibility of speech represented by a speech signal
US8326610B2 (en) Producing phonitos based on feature vectors
EP0838805B1 (en) Speech recognition apparatus using pitch intensity information
US9454976B2 (en) Efficient discrimination of voiced and unvoiced sounds
WO2014153800A1 (en) Voice recognition system
JP3451146B2 (en) Denoising system and method using spectral subtraction
Subhashree et al. Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling
JP5282523B2 (en) Basic frequency extraction method, basic frequency extraction device, and program
JP4666129B2 (en) Speech recognition system using speech normalization analysis
JPH07219577A (en) Phoneme segmentation method
JP2000163099A (en) Noise eliminating device, speech recognition device, and storage medium
JP2019035935A (en) Voice recognition apparatus
Faycal et al. Comparative performance study of several features for voiced/non-voiced classification
JP3034279B2 (en) Sound detection device and sound detection method
KR19990049148A (en) Compression method of speech waveform by similarity of FO / F1 ratio by pitch interval
WO2009055701A1 (en) Processing of a signal representing speech
JPH01255000A (en) Apparatus and method for selectively adding noise to template to be used in voice recognition system
JP2001083978A (en) Speech recognition device
JPH0222399B2 (en)
JP2018180482A (en) Speech detection apparatus and speech detection program
JPH0114599B2 (en)
Smith A neurally motivated technique for voicing detection and F0 estimation for speech
Manjutha et al. Statistical Model-Based Tamil Stuttered Speech Segmentation Using Voice Activity Detection
JPS60168198A (en) Formant extractor