JPH07219577A - Phoneme segmentation method - Google Patents

Phoneme segmentation method

Info

Publication number
JPH07219577A
JPH07219577A JP6007507A JP750794A
Authority
JP
Japan
Prior art keywords
section
projection
phoneme
moving average
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP6007507A
Other languages
Japanese (ja)
Inventor
Chiharu Yamano
千晴 山野
Yumi Takizawa
由美 滝沢
Keisuke Oda
啓介 小田
Atsushi Fukazawa
敦司 深澤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oki Electric Industry Co Ltd
Original Assignee
Oki Electric Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Oki Electric Industry Co Ltd filed Critical Oki Electric Industry Co Ltd
Priority to JP6007507A priority Critical patent/JPH07219577A/en
Publication of JPH07219577A publication Critical patent/JPH07219577A/en
Pending legal-status Critical Current

Abstract

PURPOSE: To easily perform good segmentation by determining the variation sections and stationary sections of phonemes from the moving average of the variation energy. CONSTITUTION: Speech data S(n) input from an input terminal 1 is analyzed by a vocal tract analysis part 2 using linear prediction with a maximum order of 10, and its prediction error is analyzed by a vocal cord analysis part 3, also using linear prediction with a maximum order of 10, to extract a time series of speech feature vectors X(n). A projection arithmetic part 4 then generates two-dimensional projection vectors Y(n). The projection vectors Y(n) are supplied to a variation energy calculation part 5, which calculates the variation energy e(n) and finds its moving average E(n). This moving average E(n) is input to a threshold circuit, which determines sections where E(n) does not exceed a threshold Th to be stationary phoneme sections and sections where the threshold is exceeded to be non-stationary phoneme sections.

Description

【発明の詳細な説明】Detailed Description of the Invention

【0001】[0001]

【産業上の利用分野】本発明は連続して発音した音声信
号の音韻セグメンテーション方法に関する。
FIELD OF THE INVENTION: The present invention relates to a phoneme segmentation method for continuously uttered speech signals.

【0002】[0002]

【従来の技術】音韻あるいは音節単位での認識方法とし
て、例えば次記文献のように、入力音声信号の分析を多
次元で行い、それを低次元へ射影して認識処理する方法
が知られている。 文献名:電子情報通信学会、技術研究報告、DSP92
−56、1992年7月24日発行。 この音声認識方法は、入力音声から音韻の特徴を表す6
次元の音声特徴パラメータを抽出し、それらを射影して
2次元空間上の位置ベクトル(射影ベクトル)としてと
らえ、その時間的な動特性をとらえることにより音韻を
対応させるものである。そこで提案されている音韻のセ
グメンテーション方法は、射影ベクトルの第1成分であ
る一つの射影パラメータとその移動平均との差を監視
し、0レベルを越えた区間を急区間(子音に対応する区
間)とし、それ以外であれば緩区間(母音に対応する区
間)と決定する。
2. Description of the Related Art: As a method of recognition in units of phonemes or syllables, a method is known in which an input speech signal is analyzed in many dimensions and projected into a low-dimensional space for recognition processing, as described, for example, in the following document. Reference: IEICE Technical Report DSP92-56, published 24 July 1992. This speech recognition method extracts six-dimensional speech feature parameters representing phoneme characteristics from the input speech, projects them to obtain position vectors (projection vectors) in a two-dimensional space, and associates phonemes by capturing the temporal dynamics of those vectors. The phoneme segmentation method proposed there monitors the difference between one projection parameter, the first component of the projection vector, and its moving average: a section in which the difference exceeds the zero level is determined to be a sudden section (corresponding to a consonant), and any other section a slow section (corresponding to a vowel).
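As an illustrative sketch only (the cited report's actual implementation and its roughly 200 ms window are not reproduced here), the prior-art rule can be written as follows; the function name, toy trajectory, and the short 4-frame window are all assumptions for demonstration:

```python
import numpy as np

def prior_art_segment(y1, win):
    """Prior-art rule: a frame whose first projection parameter y1 exceeds
    its own moving average is in a 'sudden' section (consonant); all other
    frames are in a 'slow' section (vowel).  `win` is the moving-average
    length in frames (about 200 ms worth of frames in the cited report;
    a short toy window is used below)."""
    kernel = np.ones(win) / win
    avg = np.convolve(y1, kernel, mode="same")  # centred moving average of y1
    return (y1 - avg) > 0.0                     # difference above the 0 level

# Toy trajectory: a flat vowel-like stretch with a consonant-like jump.
y1 = np.array([1.0] * 10 + [5.0] * 3 + [1.0] * 10)
sudden = prior_art_segment(y1, win=4)
```

The point the patent criticizes is visible here: the window length `win` must match the phoneme durations, which the prior method assumes to be roughly uniform.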

【0003】[0003]

【発明が解決しようとする課題】この方法は、母音と子
音が交互に現れしかもその音韻の継続時間がほぼ均等で
あるような音声の性質に着目したものであり、特に日本
語の場合は移動平均をとる時間長を200ms程度に設
定することにより、母音部分と子音部分の境界を簡単な
方法により検出することが可能である。しかしながら、
この方法を用いて、移動平均の時間長を種々変更し、音
韻の継続時間長にばらつきのあるような音声をセグメン
テーションしようと試みたところ、ふさわしくない区間
検出がみられた。従って、本発明は、定常的な音韻また
は非定常的な音韻の継続時間にばらつきのあるような特
徴をもつ連続音声が入力された場合でも、音韻の定常区
間及び変動区間の良好な検出を簡単な方法で可能とする
ことを目的とするものであり、これを移動平均の時間長
を短くすることによって達成したものである。
This method exploits a property of speech in which vowels and consonants appear alternately and the durations of the phonemes are roughly equal; for Japanese in particular, setting the moving-average length to about 200 ms makes it possible to detect the boundary between vowel and consonant parts by a simple method. However, when this method was applied, with the moving-average length varied in several ways, to speech whose phoneme durations vary, unsuitable sections were detected. The object of the present invention is therefore to enable good detection of the stationary and variation sections of phonemes by a simple method even when the input is continuous speech whose stationary or non-stationary phoneme durations vary; this is achieved by shortening the time length of the moving average.

【0004】[0004]

【課題を解決するための手段】本発明は、入力音声信号
を分析して、1個以上の音源特徴と2個以上の声道特徴
とを含む多次元の音声特徴ベクトルの時系列を抽出する
処理ステップと、複数の射影パラメータが互いに相関の
ないように予め決めてある射影演算子を加重として用
い、音声特徴ベクトルの要素を加算することにより、音
声特徴ベクトルを2次元又は3次元の射影パラメータへ
次元縮小した射影ベクトルの時系列を得る処理ステップ
を有する。また、射影ベクトルの変動エネルギーの時系
列上の移動平均を得る処理ステップを有する。また、そ
の変動エネルギー移動平均を監視し、それが予め決めら
れたしきい値を越えた区間を音韻の変動区間と決定し、
そのしきい値を越えない区間を音韻の定常区間と決定す
る処理ステップを有する。
SUMMARY OF THE INVENTION: The present invention has a processing step of analyzing an input speech signal to extract a time series of multidimensional speech feature vectors containing one or more sound-source features and two or more vocal-tract features, and a processing step of obtaining a time series of projection vectors in which each speech feature vector is reduced to two or three projection parameters by summing the elements of the feature vector weighted by a projection operator chosen in advance so that the projection parameters are uncorrelated with one another. It further has a processing step of obtaining a moving average, along the time series, of the variation energy of the projection vectors, and a processing step of monitoring that variation-energy moving average, determining sections in which it exceeds a predetermined threshold to be phoneme variation sections and sections in which it does not to be phoneme stationary sections.

【0005】[0005]

【作用】入力音声信号を多次元で分析し、2次または3
次に次元縮小する。射影演算子を適切に設定しておくこ
とにより、射影ベクトルのパラメータは互いに無相関と
なる。射影パラメータも、分析で得た音声特徴と同様
に、音韻の準定常部分では、音響的な性質がほとんど変
化しないため、微少な変動はあるもののぼほ一定値とな
り、それに対し、音韻の変化している部分では、その音
響的な性質が短時間で大きく変動しているため、大きく
揺れ動き、よって、局所的な非定常性を表すパラメータ
とみることができる。射影ベクトルの変動エネルギー
は、各パラメータ毎に隣接サンプルとの差を計算し、そ
の2乗和を計算することによって求めることができ、適
当な時間窓を用いて移動平均をとることによって、変動
エネルギーを平滑化したものを得ることができる。変動
エネルギー(レベル)が予め決めたしきい値より小さい
ときは定常区間と判定し、それよりも大きいときは変動
区間であると判定することによって、音韻セグメンテー
ションを行う。
OPERATION: The input speech signal is analyzed in many dimensions and then reduced to two or three dimensions. By setting the projection operator appropriately, the parameters of the projection vector become mutually uncorrelated. Like the speech features obtained by the analysis, the projection parameters remain almost constant, with only slight fluctuation, in the quasi-stationary part of a phoneme, where the acoustic properties hardly change; in the parts where the phoneme is changing, the acoustic properties vary greatly within a short time, so the parameters swing widely. The projection parameters can therefore be regarded as parameters representing local non-stationarity. The variation energy of the projection vector can be obtained by computing, for each parameter, the difference from the adjacent sample and taking the sum of squares, and a smoothed version can be obtained by taking a moving average with a suitable time window. Phoneme segmentation is performed by judging a frame to belong to a stationary section when the variation energy (level) is smaller than a predetermined threshold, and to a variation section when it is larger.

【0006】入力音声信号を6次元程度で分析した場
合、1次元まで、2次元まで、3次元までの寄与率は、
大雑把に、50%程度、70%程度、90%程度であ
る。射影パラメータは互いに無相関または直交している
ものと見なせるため、射影ベクトルの変動エネルギーす
なわち射影パラメータの変化量の2乗和は、射影ベクト
ルの大きさの変化量と対応し、また、変動エネルギー
は、分析で得た情報量の70%程度を反映した値をとる
ことになる。音声特徴ベクトルの20%の情報を担う第
2成分は短時間の変動を有効に反映するものと推察で
き、数10msの移動平均時間長を設定することによ
り、子音と母音間、子音と子音間などの音韻間、並びに
無音と音韻間において、比較的確実に変動区間を検出す
ることができる。
When the input speech signal is analyzed in about six dimensions, the cumulative contribution ratios up to one, two, and three dimensions are, roughly, about 50%, 70%, and 90%. Since the projection parameters can be regarded as mutually uncorrelated or orthogonal, the variation energy of the projection vector, that is, the sum of squares of the changes in the projection parameters, corresponds to the change in the magnitude of the projection vector, and the variation energy takes a value reflecting about 70% of the information obtained by the analysis. The second component, which carries about 20% of the information in the speech feature vector, can be inferred to reflect short-term variation effectively; by setting a moving-average length of several tens of milliseconds, variation sections can be detected fairly reliably between phonemes, such as between consonants and vowels or between consonants, as well as between silence and phonemes.

【0007】[0007]

【実施例】図1は本発明の一実施例を示すブロック図で
ある。以下、本発明の実施例を図に基づいて説明する。
図1において、入力端子1から入力された音声データS
(n)は声道分析部2にあたえられ、そこで、最大予測
次数10次の線形予測によって分析し、その予測誤差を
声帯分析部3で、最大予測次数10次の線形予測によっ
て分析し、音声特徴ベクトルX(n)の時系列が抽出さ
れる。ここでは、音声データS(n)のサンプリング周
波数を8KHz、分析フレーム長を24ms、フレーム
周期(分析間隔)を4msとし、音声平均パワーx1
(n)、声道特徴パラメータx2(n)〜x4(n)、
及び音源特徴パラメータx5(n)〜x6(n)とを音
声特徴パラメータ(ベクトル要素)として持つ音声特徴
ベクトルX(n)を抽出する。ただし、nは分析時刻で
ある。 x1 : 音声平均パワー x2 : 予測次数10次の声道エントロピー x3 : 予測次数2次の予測係数に基づく声道周波数 x4 : 同上の強度 x5 : 音源平均強度 x6 : 予測次数10次の音源エントロピー なお、要素x3(n)、x4(n)は、予測次数2次の
予測係数から1組の複素共役根を算出し、その絶対値を
一方の要素x4(n)とし、Z平面で偏角が0〜πであ
る根に対応した周波数を他方の要素x3(n)としたも
のである。
EMBODIMENT: FIG. 1 is a block diagram showing an embodiment of the present invention. An embodiment of the present invention will now be described with reference to the drawings. In FIG. 1, speech data S(n) input from an input terminal 1 is given to a vocal tract analysis unit 2, where it is analyzed by linear prediction with a maximum order of 10; its prediction error is analyzed by a vocal cord analysis unit 3, also by linear prediction with a maximum order of 10, and a time series of speech feature vectors X(n) is extracted. Here the sampling frequency of the speech data S(n) is 8 kHz, the analysis frame length is 24 ms, and the frame period (analysis interval) is 4 ms; the extracted speech feature vector X(n) has the average speech power x1(n), vocal-tract feature parameters x2(n) to x4(n), and sound-source feature parameters x5(n) and x6(n) as its speech feature parameters (vector elements), where n is the analysis time:
x1: average speech power
x2: vocal-tract entropy at prediction order 10
x3: vocal-tract frequency based on the prediction coefficients at prediction order 2
x4: intensity of the above
x5: average sound-source intensity
x6: sound-source entropy at prediction order 10
The elements x3(n) and x4(n) are obtained by computing one pair of complex conjugate roots from the second-order prediction coefficients; the absolute value of the root is taken as one element, x4(n), and the frequency corresponding to the root whose argument lies between 0 and π in the z-plane is taken as the other element, x3(n).

【0008】次に、この音声特徴ベクトルX(n)の時
系列は射影演算部4へ与えられる。ここでは、式(1)
に示すように、射影行列Lを加重とした加算演算を実行
し、6次元特徴ベクトルX(n)から互いに相関のない
成分(射影パラメータ)を持つ2次元射影ベクトルY
(n)を作成する。 Y(n) = L*X(n) 式(1) ただし、上式は行列演算を表し、射影行列Lとしては式
(2)に示すものを用いた。また、この射影行列Lは、
音韻サンプルの主成分分析によって、予め決められたも
のである。
Next, the time series of speech feature vectors X(n) is given to a projection calculation unit 4. Here, as shown in equation (1), an addition operation weighted by a projection matrix L is executed to create, from each six-dimensional feature vector X(n), a two-dimensional projection vector Y(n) whose components (projection parameters) are uncorrelated with one another:
Y(n) = L * X(n)   (1)
where the above expression denotes a matrix operation and the projection matrix L shown in equation (2) was used. This projection matrix L is determined in advance by principal component analysis of phoneme samples.
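Equation (1) is an ordinary matrix product. The patent's actual 2x6 matrix from equation (2) is given only as an image and is not reproduced here, so the values of `L` below are purely hypothetical placeholders:

```python
import numpy as np

# Hypothetical 2x6 projection matrix; the patent's real L (equation (2))
# comes from principal component analysis of phoneme samples.
L = np.array([[0.5, 0.3, 0.2, 0.1, 0.4, 0.2],
              [0.2, -0.4, 0.5, 0.3, -0.1, 0.6]])

def project(X):
    """Y(n) = L * X(n): reduce each 6-dim feature vector to 2 dims.
    X has shape (N, 6); the result has shape (N, 2)."""
    return X @ L.T

X = np.ones((5, 6))        # dummy time series of feature vectors
Y = project(X)             # each row is a 2-dim projection vector Y(n)
```

With a matrix obtained by PCA, the two output components are uncorrelated over the training data, which is what lets the later step treat their squared changes as additive energy.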

【0009】[0009]

【数1】 [Equation 1]

【0010】この射影ベクトルY(n)は変動エネルギ
ー計算部5に与えられ、そこで、まず、式(3)に示す
ように、変動エネルギーe(n)を計算する。
This projection vector Y(n) is given to a variation energy calculation unit 5, where the variation energy e(n) is first calculated as shown in equation (3).

【0011】[0011]

【数2】 [Equation 2]

【0012】次に、式(4)に示すように、この変動エ
ネルギーe(n)の移動平均E(n)を求めることによ
り、E(n)を算出する。
Next, as shown in equation (4), the moving average E(n) of the variation energy e(n) is obtained.

【0013】[0013]

【数3】 [Equation 3]

【0014】ここで、hiは移動平均をとる際の重みで
あり、hiの形としては矩形窓を用い、単純にhI=1
とした。また、pは移動平均の区間長の半分であり、p
=44msとした。
Here, h_i is the weight used in taking the moving average; a rectangular window was used for h_i, simply setting h_i = 1. Also, p is half the section length of the moving average, and p = 44 ms was used.
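Equation (4) itself is an image not reproduced in this text; the sketch below assumes the standard reading of a centred rectangular-window average over frames n-p to n+p with h_i = 1. At the embodiment's 4 ms frame period, p = 44 ms corresponds to 11 frames (our conversion, not stated explicitly in the patent):

```python
import numpy as np

def moving_average(e, p):
    """E(n): rectangular-window moving average of the variation energy
    e(n) over frames n-p .. n+p (h_i = 1).  Edges are handled here by
    padding with the edge values."""
    win = 2 * p + 1
    padded = np.pad(e, p, mode="edge")
    return np.convolve(padded, np.ones(win) / win, mode="valid")

p_frames = 11                  # 44 ms / 4 ms frame period
e = np.zeros(50)
e[25] = 23.0                   # a single variation spike
E = moving_average(e, p_frames)
```

The single spike is smeared over the 23-frame window, which is what turns a momentary phoneme transition into a detectable plateau in E(n).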

【0015】次に、上記のパラメータE(n)はしきい
値回路に入力される。このE(n)がある所定のしきい
値Thを越えない区間、すなわち E(n)<Th 式(5) を満たす区間を定常音韻区間と決定し、越えた区間、す
なわち E(n)≧Th 式(6) を満たす区間を非定常音韻区間と決定する。ここで、T
h=0.3とした。なお、しきい値は、移動平均の区間
長に関係があり、入力音声の音韻の継続時間特性にあわ
せてに適切に決定すべきものである。図2は、入力音声
「ジャズ」に対し上記の方法を用いセグメンテーション
を行った結果を示した図である。この図2を見てわかる
ようにそれぞれの音韻区間とそのわたりの区間が検出さ
れていることがわかる。すなわち、図2は、入力音声
「ジャズ」の前半部分について、入力音声信号S(n)
と変動エネルギー移動平均E(n)とを示したものであ
り、無音と子音の間、子音と母音の間で変動区間が検出
されることわかる。
Next, the parameter E(n) is input to a threshold circuit. A section in which E(n) does not exceed a predetermined threshold Th, that is, a section satisfying E(n) < Th (5), is determined to be a stationary phoneme section, and a section in which it does, that is, a section satisfying E(n) ≥ Th (6), is determined to be a non-stationary phoneme section. Here Th = 0.3 was used. The threshold is related to the section length of the moving average and should be determined appropriately according to the phoneme duration characteristics of the input speech. FIG. 2 shows the result of segmenting the input speech "jazz" by the above method: for the first half of the utterance it plots the input speech signal S(n) and the variation-energy moving average E(n), and it can be seen that each phoneme section and the transition sections between them are detected, with variation sections detected between silence and a consonant and between a consonant and a vowel.
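The decision rule of equations (5) and (6) reduces to a single comparison per frame. A minimal sketch, using the embodiment's Th = 0.3 and a made-up E(n) trace:

```python
import numpy as np

def segment(E, th=0.3):
    """Label each frame per equations (5)/(6): True means a non-stationary
    (variation) section where E(n) >= th; False means a stationary phoneme
    section where E(n) < th.  th = 0.3 follows the embodiment but, as the
    text notes, should be retuned to the moving-average length and the
    phoneme durations of the material."""
    return E >= th

E = np.array([0.05, 0.1, 0.8, 0.9, 0.2, 0.05])   # hypothetical E(n) values
labels = segment(E)            # [False, False, True, True, False, False]
```

Runs of True then mark phoneme transitions (silence-to-consonant, consonant-to-vowel), and the stationary runs in between are the phoneme segments themselves.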

【0016】[0016]

【発明の効果】多次元空間での短区間の変動のエネルギ
ーによって音韻の非定常の度合いをパラメータ化するこ
とができ、これによって比較的簡単な方法で短い音韻に
対しても長い音韻に対しても良好なセグメンテーション
が可能となった。
EFFECT OF THE INVENTION: The degree of non-stationarity of a phoneme can be parameterized by the energy of short-interval variation in a multidimensional space, which makes good segmentation possible by a comparatively simple method for short and long phonemes alike.

【図面の簡単な説明】[Brief description of drawings]

【図1】本発明の一実施例のブロック図FIG. 1 is a block diagram of an embodiment of the present invention.

【図2】本発明の一実施例による動作説明のための図FIG. 2 is a diagram for explaining an operation according to an embodiment of the present invention.

【符号の説明】[Explanation of symbols]

1 input terminal; 2 vocal tract analysis unit; 3 vocal cord analysis unit; 4 projection calculation unit; 5 variation energy calculation unit; S(n) speech signal; x1(n) to x6(n) six-dimensional speech feature vector; y1(n), y2(n) two-dimensional speech feature vector (projection vector); E(n) variation-energy moving average

フロントページの続き (72)発明者 深澤 敦司 東京都港区虎ノ門1丁目7番12号 沖電気 工業株式会社内Front page continuation (72) Inventor Atsushi Fukasawa 1-7-12 Toranomon, Minato-ku, Tokyo Oki Electric Industry Co., Ltd.

Claims (1)

【特許請求の範囲】[Claims] 【請求項1】 入力音声信号を分析して、1個以上の音
源特徴と2個以上の声道特徴とを音声特徴パラメータと
して含む音声特徴ベクトルの時系列を抽出する処理ステ
ップと、 複数の射影パラメータが互いに相関のないように予め決
めてある射影演算子を加重として用い、前記音声特徴ベ
クトルの各音声特徴パラメータを加算することにより、
前記音声特徴ベクトルを2次元又は3次元の射影パラメ
ータへ次元縮小した射影ベクトルの時系列を得る処理ス
テップと、 当該射影ベクトルの変動エネルギーの時系列上の移動平
均を得る処理ステップと、 当該変動エネルギー移動平均を監視し、それが予め決め
られたしきい値を越えた区間を音韻の変動区間と決定
し、当該しきい値を越えない区間を音韻の定常区間と決
定する処理ステップと、を備えたことを特徴とする音韻
セグメンテーション方法
1. A phoneme segmentation method comprising: a processing step of analyzing an input speech signal to extract a time series of speech feature vectors containing one or more sound-source features and two or more vocal-tract features as speech feature parameters; a processing step of obtaining a time series of projection vectors in which each speech feature vector is reduced to two or three projection parameters by summing the speech feature parameters of the vector weighted by a projection operator chosen in advance so that the projection parameters are uncorrelated with one another; a processing step of obtaining a moving average, along the time series, of the variation energy of the projection vectors; and a processing step of monitoring the variation-energy moving average, determining sections in which it exceeds a predetermined threshold to be phoneme variation sections and sections in which it does not exceed the threshold to be phoneme stationary sections.
JP6007507A 1994-01-27 1994-01-27 Phoneme segmentation method Pending JPH07219577A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP6007507A JPH07219577A (en) 1994-01-27 1994-01-27 Phoneme segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP6007507A JPH07219577A (en) 1994-01-27 1994-01-27 Phoneme segmentation method

Publications (1)

Publication Number Publication Date
JPH07219577A true JPH07219577A (en) 1995-08-18

Family

ID=11667708

Family Applications (1)

Application Number Title Priority Date Filing Date
JP6007507A Pending JPH07219577A (en) 1994-01-27 1994-01-27 Phoneme segmentation method

Country Status (1)

Country Link
JP (1) JPH07219577A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334238A (en) * 1996-11-20 2004-11-25 Yamaha Corp Sound signal analyzing device and method
JP2008139747A (en) * 2006-12-05 2008-06-19 Nippon Telegr & Teleph Corp <Ntt> Sound model parameter update processing method, sound model parameter update processor, program, and recording medium


Similar Documents

Publication Publication Date Title
Drugman et al. A comparative study of glottal source estimation techniques
US8655656B2 (en) Method and system for assessing intelligibility of speech represented by a speech signal
US8326610B2 (en) Producing phonitos based on feature vectors
EP0838805B1 (en) Speech recognition apparatus using pitch intensity information
US9454976B2 (en) Efficient discrimination of voiced and unvoiced sounds
WO2014153800A1 (en) Voice recognition system
JP3451146B2 (en) Denoising system and method using spectral subtraction
Subhashree et al. Speech Emotion Recognition: Performance Analysis based on fused algorithms and GMM modelling
JP5282523B2 (en) Basic frequency extraction method, basic frequency extraction device, and program
JP4666129B2 (en) Speech recognition system using speech normalization analysis
JPH07219577A (en) Phoneme segmentation method
JP2000163099A (en) Noise eliminating device, speech recognition device, and storage medium
JP2019035935A (en) Voice recognition apparatus
Faycal et al. Comparative performance study of several features for voiced/non-voiced classification
JP3034279B2 (en) Sound detection device and sound detection method
KR19990049148A (en) Compression method of speech waveform by similarity of FO / F1 ratio by pitch interval
WO2009055701A1 (en) Processing of a signal representing speech
JPH01255000A (en) Apparatus and method for selectively adding noise to template to be used in voice recognition system
JP2001083978A (en) Speech recognition device
JPH0222399B2 (en)
JP2018180482A (en) Speech detection apparatus and speech detection program
JPH0114599B2 (en)
Smith A neurally motivated technique for voicing detection and F0 estimation for speech
Manjutha et al. Statistical Model-Based Tamil Stuttered Speech Segmentation Using Voice Activity Detection
JPS60168198A (en) Formant extractor