JPS63306498A

JPS63306498A - Voice section detecting system

Info

Publication number: JPS63306498A
Application number: JP62143666A
Authority: JP
Inventors: 章次栗木
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1987-06-08
Filing date: 1987-06-08
Publication date: 1988-12-14

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a speech section detection method for a speech recognition device.

Prior Art

In conventional speech recognition technology, the end of the speech section at the end of a word has been taken as the point where the speech power falls below a certain value.

With this method, however, even the breath sound that a speaker emits after uttering a word is detected as part of the speech section. Some speakers habitually emit such breath sounds, and in those cases the section is detected incorrectly.

Object

The present invention has been made in view of the circumstances described above. In particular, its object is to detect the speech section correctly when a breath sound follows the end of a word, by removing that portion.

Configuration

To achieve the above object, the present invention provides a speech section detection method having means for extracting features from input speech at a certain sampling period, means for detecting local peaks, means for detecting a speech section using speech power, and means for detecting silent and voiced sections. The method is characterized in that the speech section detected using speech power is searched from its end toward its start; if, until a silent section is detected, there is always only one local peak and its frequency is lower than a predetermined value in the range of 300 to 500 Hz, that section is regarded as a breath sound, and the end of the voiced section found next is taken as the end of the speech section.

The present invention is described below on the basis of an embodiment.

FIG. 1 is a diagram for explaining one embodiment of the speech section detection method according to the present invention. It shows an example in which a breath sound follows the word "stop" (ストップ). In the figure, T1, T, and T are voiced sections, T and T are silent sections, and To is the speech section (most of the subscripts are lost in the source text). Some speakers exhale strongly after uttering a word, and this breath strikes the microphone and becomes a breath sound. Because a breath sound is like wind hitting the microphone, it normally contains only low-frequency components. In general, the end of a word is determined when the speech power stays below a certain threshold for a fixed time, the end being the last point at which the power was at or above the threshold. Consequently, a breath sound whose power exceeds the threshold is included in the speech section.
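The conventional end-point rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `detect_speech_section`, the parameter `hangover` (the number of consecutive below-threshold frames that ends the word), and the frame-level power values are all assumptions introduced here.

```python
from typing import List, Optional, Tuple


def detect_speech_section(
    power: List[float], threshold: float, hangover: int
) -> Optional[Tuple[int, int]]:
    """Return (start, end) frame indices of the speech section, or None.

    start is the first frame at or above the threshold. end is the last
    frame at or above the threshold before `hangover` consecutive frames
    fall below it (or before the input runs out).
    """
    start: Optional[int] = None
    end: Optional[int] = None
    below = 0  # consecutive below-threshold frames seen after the start
    for i, p in enumerate(power):
        if p >= threshold:
            if start is None:
                start = i
            end = i
            below = 0
        elif start is not None:
            below += 1
            if below >= hangover:
                break  # power stayed low long enough: the word has ended
    if start is None or end is None:
        return None
    return start, end


# Frame 5 stands in for a breath sound that rises above the threshold
# shortly after the word (frames 1-2), so it ends up inside the section.
section = detect_speech_section(
    [0, 5, 6, 0, 0, 4, 0, 0, 0, 0, 7], threshold=3, hangover=3
)
```

With these made-up values the returned section spans frames 1 to 5, which shows how a loud enough breath sound extends the detected section.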

In FIG. 1(a), the power of the breath sound is above the threshold ■, so the speech section includes the breath sound, as shown in FIG. 1(b). This breath sound needs to be removed. To do so, a search is made from the end point ■ of the speech section toward its start. If, up to the point ◎ where a silent section occurs, there is only one local peak and its frequency is below a certain predetermined frequency between 300 and 500 Hz, this voiced section is judged to be a breath sound. As an example, FIG. 2 shows the breath sound pattern in the BTSP method.

In FIG. 2, the frequency of the third channel falls within 300 to 500 Hz, so the breath sound pattern is one in which no 1 appears in channel 4 or above and 1s are set only in channel 3 or below.
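The channel test described above can be sketched as a small predicate over binary frames. This is an illustrative sketch under stated assumptions: each frame is taken to be a list of 0/1 values, one per channel, with channel 1 first and a 1 marking a local peak. The names `is_breath_frame`, `is_breath_section`, and `BREATH_MAX_CHANNEL` are hypothetical, and combining "exactly one local peak" with "channel 3 or below" in one frame-level check is one reading of the text.

```python
from typing import Sequence

# Assumption: channel 3 is the channel whose frequency lies in 300-500 Hz.
BREATH_MAX_CHANNEL = 3


def is_breath_frame(
    frame: Sequence[int], max_channel: int = BREATH_MAX_CHANNEL
) -> bool:
    """True if the frame has exactly one local peak, at or below max_channel."""
    peaks = [ch for ch, bit in enumerate(frame, start=1) if bit == 1]
    return len(peaks) == 1 and peaks[0] <= max_channel


def is_breath_section(frames: Sequence[Sequence[int]]) -> bool:
    """True if every frame of a non-empty voiced section looks like breath."""
    return len(frames) > 0 and all(is_breath_frame(f) for f in frames)


breath_like = is_breath_section([[0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0]])
speech_like = is_breath_section([[0, 1, 0, 0, 1, 0], [0, 1, 0, 0, 0, 0]])
```

Requiring every frame of the section to pass mirrors the condition that there is "always" only one local peak until the silent section is reached.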

A voiced section judged to be a breath sound in this way is discarded, and the end point O of the next voiced section is taken as the new end point of the speech section. Even if this operation is applied to a word that does not end with a breath sound, speech always has formants in channel 4 or above, so a 1 is set there and the section is not judged to be a breath sound. The speech section is therefore never shortened by mistake.

The syllabic nasal "n" (ん) shows a pattern relatively close to that of a breath sound, but this causes no problem, because no word forms a speech section from "n" alone. However, a monosyllabic "n" forms a single voiced section by itself, so the above operation should simply be skipped when the target speech section contains only one voiced section.

Conversely, two or more breath sounds may follow a word.

In this case, the above operation is performed once and a new speech section is set. Repeating the operation once more removes the next breath sound. By repeating the operation in this way until a voiced section other than a breath sound is detected, the breath sounds can be removed completely.
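The repeated trimming can be sketched as a loop over the voiced sections, walking backward from the last one. This is a hedged sketch: `trim_trailing_breath`, the `Section` tuple type, and the `is_breath` callback are hypothetical names, and applying the "only one voiced section" guard on every pass (rather than only on the first) is an assumption about how the rule interacts with the repetition.

```python
from typing import Callable, List, Tuple

Section = Tuple[int, int]  # (start_frame, end_frame) of one voiced section


def trim_trailing_breath(
    voiced: List[Section], is_breath: Callable[[Section], bool]
) -> Section:
    """Return the speech section after dropping trailing breath sections.

    voiced lists the voiced sections inside the power-based speech section,
    in time order. The first section is never dropped, so a section that
    consists of a single voiced section is returned unchanged.
    """
    if not voiced:
        raise ValueError("at least one voiced section is required")
    last = len(voiced) - 1
    # Stop at the first non-breath section or when only one section is left.
    while last > 0 and is_breath(voiced[last]):
        last -= 1
    return voiced[0][0], voiced[last][1]


# Made-up example: a word in two voiced sections followed by two breaths.
sections = [(10, 30), (40, 55), (70, 80), (90, 95)]
breaths = {(70, 80), (90, 95)}
trimmed = trim_trailing_breath(sections, lambda s: s in breaths)
```

Passing the breath test as a callback keeps the loop independent of how a section is classified, so any frame-level criterion can be plugged in.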

FIG. 3 is a block diagram showing one embodiment for implementing the speech section detection method according to the present invention. In the figure, 1 is a microphone, 2 is an amplifier, 3 is a feature extraction unit, 4 is a local peak detection unit, 5 is a speech power detection unit, 6 is a voiced/silent section detection unit, 7 is a memory, 8 is a speech section determination unit, 9 is a voiced section detection unit, 10 is a final voiced section detection unit, and 11 is a breath sound detection unit.

Speech input from the microphone 1 is supplied to the feature extraction unit 3, the local peak detection unit 4, the speech power detection unit 5, and the voiced/silent section detection unit 6, and the data output by each unit is stored in the memory 7. A speech section is determined using a fixed threshold, and the number of voiced sections in that target section is detected by the voiced section count detection unit. If there is only one, the speech section is output as it is. If there are two or more, the breath sound detection unit 11 judges whether the voiced section found by the final voiced section detection unit is a breath sound; if it is not, the speech section is output as it is. If it is a breath sound, the final voiced section is discarded, the end of the speech section is moved to the end point of the next voiced section, and the above operation is performed again on the new final voiced section. In this way, a speech section with the breath sounds removed is output.
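Counting the voiced sections and picking the final one both require the frame-level voiced/silent decisions to be grouped into sections. A minimal sketch of that grouping follows, assuming the voiced/silent detector yields one boolean per frame; the function name `voiced_sections` and the flag representation are assumptions, not details given in the source.

```python
from typing import List, Optional, Tuple


def voiced_sections(flags: List[bool]) -> List[Tuple[int, int]]:
    """Group per-frame voiced flags into (start, end) voiced sections."""
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start of the voiced run being scanned
    for i, voiced in enumerate(flags):
        if voiced and start is None:
            start = i  # a voiced run begins
        elif not voiced and start is not None:
            sections.append((start, i - 1))  # the run ended one frame ago
            start = None
    if start is not None:
        sections.append((start, len(flags) - 1))  # run reaches the last frame
    return sections


flags = [False, True, True, False, False, True, False, True, True]
found = voiced_sections(flags)
count = len(found)         # number of voiced sections
final_section = found[-1]  # the final voiced section, tested for breath first
```

Here `count` plays the role of the voiced section count, and `final_section` is the section that would be handed to the breath sound check.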

Effects

As is clear from the above description, the present invention makes it possible to detect the speech section correctly even when a breath sound follows the end of a word.

[Brief Description of the Drawings]

FIG. 1 is a time chart for explaining one embodiment of the present invention, FIG. 2 is a diagram showing an example of a breath sound pattern, and FIG. 3 is a block diagram showing one embodiment for implementing the present invention.

1: microphone, 2: amplifier, 3: feature extraction unit, 4: local peak detection unit, 5: speech power detection unit, 6: voiced/silent section detection unit, 7: memory, 8: speech section determination unit, 9: voiced section detection unit, 10: final voiced section detection unit, 11: breath sound detection unit.

Claims

[Claims]

(1) A speech section detection method having means for extracting features from input speech at a certain sampling period, means for detecting local peaks, means for detecting a speech section using speech power, and means for detecting silent and voiced sections, characterized in that the speech section detected using speech power is searched from its end toward its start, and if, until a silent section is detected, there is always only one local peak and its frequency is lower than a predetermined value in the range of 300 to 500 Hz, that section is regarded as a breath sound and the end of the voiced section found next is taken as the end of the speech section.

(2) The speech section detection method according to claim (1), characterized in that the processing is repeated so that two or more breath sound sections can be corrected.

(3) The speech section detection method according to claim (1) or (2), characterized in that the processing is not performed when there is no silent section within the target speech section.