JPH032319B2 - Google Patents

Info

Publication number
JPH032319B2
Authority
JP
Japan
Prior art keywords
pause
sentences
conversational
understanding
fundamental frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired
Application number
JP60173274A
Other languages
Japanese (ja)
Other versions
JPS6234200A (en)
Inventor
Eiji Oohira
Akio Komatsu
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Institute of Advanced Industrial Science and Technology AIST
Original Assignee
Agency of Industrial Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Agency of Industrial Science and Technology
Priority to JP60173274A
Publication of JPS6234200A
Publication of JPH032319B2
Granted

Description

[Detailed Description of the Invention]

[Field of Application of the Invention]

The present invention relates to a conversational speech understanding system that understands naturally uttered conversational sentences and responds according to the result of that understanding, and in particular to a method of dividing conversational sentences into units that represent semantic groups.

[Background of the Invention]

Conventionally, systems that use speech as an input means have targeted isolated word speech or continuous speech uttered in a reading style. However, conversational sentences uttered naturally rather than in a reading style (hereinafter simply called conversational sentences) are produced while the speaker assembles his or her thoughts, so grammatically ill-formed sentences arise from misstatements, elliptical expressions, and the like, and multiple sentences are input in succession. Moreover, these sentences are not separated by punctuation. In understanding conversational sentences, therefore, the sentences must first be divided into units that represent semantic groups so that language processing becomes possible. As for methods of dividing input speech, there are proposals such as Japanese Patent Application Laid-Open No. 48-30302, but these concern methods of dividing a limited set of words into phonemes or of dividing grammatically well-formed reading-style sentences into phrases ("bunsetsu"), and they give no consideration to the division of natural conversational sentences.

[Object of the Invention]

An object of the present invention is to provide a conversational speech understanding method that achieves highly reliable understanding with a small amount of processing by dividing conversational sentences, in which grammatically ill-formed sentences exist and multiple sentences are input in succession, into semantically coherent units.

[Summary of the Invention]

To achieve this object, the present invention is characterized in that conversational sentences are divided into semantic groups using prosodic information (speech power, fundamental frequency, and the like) such as the intonation and stress of the speech. Prosodic information arises rationally and naturally from the content of an utterance. Intonation in particular is universal information independent of language: when the content of an utterance is interrogative, the pitch of the voice at the end of the sentence is raised in every country.

[Embodiments of the Invention]

An embodiment of the present invention is shown in Fig. 1. Fig. 1 is a block diagram of a conversational speech understanding system that applies a system for understanding written sentences input as kana characters from a keyboard or the like. In written-sentence understanding, a kana character string is input to the morphological analysis unit 6. The morphological analysis unit 6 detects phrases using the dictionary memory 7 and outputs phrase candidates. The syntactic analysis unit 8 then detects chains of phrase candidates that satisfy the syntax, the semantic analysis unit 10 further detects chains that are semantically natural, and the most certain chain is output as the solution. In a conversational speech understanding system the input means is speech, so the speech must be converted into kana characters. For this purpose the system is provided with a feature extraction unit 1, which obtains the phonemic and prosodic information of the speech, and a speech recognition unit 4, which converts the input speech into kana characters by matching against standard patterns 5.

In written-sentence understanding, the unit of processing is the sentence delimited by periods, and understanding proceeds according to syntactic information and the like based on that unit. In conversational sentences, however, grammatically ill-formed sentences exist and multiple sentences may be input in succession, so understanding them as they are would require preparing syntactic information 9 and the like covering many variant forms. The amount of processing would therefore increase and the reliability of understanding would fall. For this reason conversational sentences must be divided into semantic groups. As a general marker for dividing text into semantic groups, the period of a written sentence can be cited. One characteristic of the locations in a conversational sentence that correspond to period positions is that a pause occurs as the speaker takes a breath. Accordingly, period positions can be detected by treating as a pause any silent interval (an interval in which the speech power is at or below the noise level Pθ) whose length is at least a threshold Pλ (for example, 300 ms). In conversational sentences, however, the speaker talks while thinking, so long pauses also occur after misstatements and slips of thought.
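Purely as an illustration of the thresholded silence detection just described, the following Python sketch finds pauses in a per-frame power contour. Only the 300 ms value for Pλ comes from the text; the frame step, the dB value chosen for the noise level Pθ, and all names are assumptions made here.

```python
import numpy as np

FRAME_MS = 10        # assumed analysis frame step (not specified in the text)
P_THETA_DB = -50.0   # noise level Pθ; this concrete value is an assumption
P_LAMBDA_MS = 300    # pause threshold Pλ; example value quoted in the text

def detect_pauses(power_db: np.ndarray) -> list[tuple[int, int]]:
    """Return (start, end) frame-index pairs of every silent run whose
    duration is at least Pλ, i.e. the pauses described above."""
    min_frames = P_LAMBDA_MS // FRAME_MS
    pauses, run_start = [], None
    for i, p in enumerate(power_db):
        if p <= P_THETA_DB:          # frame counts as silent (power <= Pθ)
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_frames:
                pauses.append((run_start, i))
            run_start = None
    if run_start is not None and len(power_db) - run_start >= min_frames:
        pauses.append((run_start, len(power_db)))
    return pauses
```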

Another piece of prosodic information that characterizes period positions is intonation, the rise and fall of the pitch of the voice. Intonation rises rapidly at the beginning of a sentence and then falls gradually toward the end of the sentence, where it approaches the speaker's lowest fundamental frequency. At locations where a pause arises from a misstatement or a slip of thought, however, the utterance breaks off with the fundamental frequency still high, the fundamental frequency after the pause starts from roughly the same height as before the pause, and the speaker tends to continue the sentence.
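The division method presupposes a per-frame fundamental frequency contour from the feature extraction unit 1. The patent does not say how F0 is obtained; purely for illustration, the following sketch uses a conventional autocorrelation estimator, with the sampling rate, search range, and voicing threshold all assumed.

```python
import numpy as np

def estimate_f0(frame: np.ndarray, sr: int = 8000,
                fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Crude autocorrelation F0 estimate for one windowed frame;
    returns 0.0 when the frame looks unvoiced."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return 0.0
    lo = int(sr / fmax)                      # shortest plausible pitch period
    hi = min(int(sr / fmin), len(ac) - 1)    # longest plausible pitch period
    if lo >= hi:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < 0.3 * ac[0]:                # weak periodicity: call it unvoiced
        return 0.0
    return sr / lag
```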

The conversational sentence division unit 2 divides conversational sentences into semantic groups using the above characteristics of the positions that correspond to periods. The division method is explained concretely below with reference to Figs. 2 and 3. Fig. 2 shows examples of the shape of the prosodic information near a pause, and Fig. 3 shows a flowchart of the method.

(1) First, in order to detect pauses, locations where a silent interval continues for Pλ or longer are detected and taken as division candidates.

(2) Of the locations where a pause is detected, only those at which the fundamental frequency Fe of the speech immediately before the pause is at or below the speaker's lower-limit frequency are kept as candidates; the rest are judged to be sentence-internal.

(3) Further, if ΔF, the difference between Fe and the fundamental frequency Fs that takes the maximum value after the pause, is at least a threshold (for example, 40 to 50 Hz for male speakers), that position is taken as a division point.

Here, the speaker's lower-limit frequency is the lowest frequency that the speaker currently using the system can produce, multiplied by a constant (for example, 1.1 to 1.2). This information is extracted by the speaker information learning unit 3 and registered in advance as speaker information. The speaker information is obtained from a declarative sentence of several dozen syllables (for example, a greeting).
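Combining steps (1) to (3) with the speaker information learning just described, the following Python sketch shows one possible realization of the division procedure; it assumes detect_pauses from the earlier sketch and a NumPy F0 contour with 0 marking unvoiced frames. The frame step, the search window for Fs, and the concrete constants are illustrative choices within the ranges quoted in the text (ΔF of 40 to 50 Hz, a multiplier of 1.1 to 1.2).

```python
import numpy as np

FRAME_MS = 10               # assumed frame step, as in the earlier sketch
DELTA_F_HZ = 45.0           # ΔF threshold; 40-50 Hz is quoted for male speakers
LOWER_LIMIT_FACTOR = 1.15   # constant multiplier; 1.1-1.2 is quoted in the text
WINDOW_MS = 500             # assumed search window for Fs after the pause

def speaker_lower_limit(f0_declarative: np.ndarray) -> float:
    """Speaker information learning (unit 3): derive the lower-limit
    frequency from a pre-registered declarative utterance."""
    voiced = f0_declarative[f0_declarative > 0]
    return float(voiced.min()) * LOWER_LIMIT_FACTOR

def division_points(f0: np.ndarray, pauses: list[tuple[int, int]],
                    f_lower: float) -> list[int]:
    """Apply steps (2) and (3) to the step (1) candidates in `pauses`."""
    window = WINDOW_MS // FRAME_MS
    points = []
    for start, end in pauses:
        before = f0[:start]
        before = before[before > 0]
        if before.size == 0:
            continue
        fe = before[-1]              # Fe: F0 immediately before the pause
        if fe > f_lower:             # step (2): sentence-internal pause
            continue
        after = f0[end:end + window]
        after = after[after > 0]
        if after.size and after.max() - fe >= DELTA_F_HZ:
            points.append(end)       # step (3): ΔF = Fs - Fe >= threshold
    return points
```

Under these assumptions, division_points(f0, detect_pauses(power_db), speaker_lower_limit(f0_reg)) would yield the frame positions at which the input conversational speech is split into semantically coherent units.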

[Effects of the Invention]

According to the present invention, conversational sentences that contain grammatically ill-formed sentences and in which multiple sentences are input in succession can be divided into semantically coherent units. The subsequent understanding processing is thereby simplified, the amount of processing is reduced, and the reliability of understanding is also improved.

[Brief Description of the Drawings]

Fig. 1 is a block diagram of a conversational speech understanding system, Fig. 2 is a diagram for explaining the prosodic information near a pause, and Fig. 3 is a flowchart of the present method. Explanation of reference numerals: 1, feature extraction unit; 2, conversational sentence division unit; 3, speaker information learning unit.

Claims (1)

[Claims]

1. In a conversational speech understanding method for understanding conversational sentences uttered by voice, a conversational speech understanding method utilizing prosodic information, characterized in that prosodic information is extracted from the input conversational speech; a pause is detected from the length of a silent interval using the extracted prosodic information; candidates for division points are selected by comparing the fundamental frequency immediately before the detected pause with a speaker's lower-limit frequency registered in advance; and the candidate division points are made division points according to the difference between the fundamental frequency immediately before the pause and the fundamental frequency that takes the maximum value immediately after the pause, whereby the input conversational speech is divided into semantically coherent units.
JP60173274A 1985-08-08 1985-08-08 Conversation voice understanding system utilizing meter information Granted JPS6234200A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP60173274A JPS6234200A (en) 1985-08-08 1985-08-08 Conversation voice understanding system utilizing meter information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP60173274A JPS6234200A (en) 1985-08-08 1985-08-08 Conversation voice understanding system utilizing meter information

Publications (2)

Publication Number Publication Date
JPS6234200A JPS6234200A (en) 1987-02-14
JPH032319B2 1991-01-14

Family

ID=15957406

Family Applications (1)

Application Number Title Priority Date Filing Date
JP60173274A Granted JPS6234200A (en) 1985-08-08 1985-08-08 Conversation voice understanding system utilizing meter information

Country Status (1)

Country Link
JP (1) JPS6234200A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS62141731U (en) * 1986-02-28 1987-09-07
JP2007032373A (en) * 2005-07-25 2007-02-08 Ebara Corp Structure of casing of horizontal shaft pump for pump gate, horizontal shaft pump for pump gate, and pump gate facility
JP5141695B2 (en) * 2008-02-13 2013-02-13 日本電気株式会社 Symbol insertion device and symbol insertion method
JPWO2019087811A1 (en) * 2017-11-02 2020-09-24 ソニー株式会社 Information processing device and information processing method

Also Published As

Publication number Publication date
JPS6234200A (en) 1987-02-14


Legal Events

Date Code Title Description
EXPY Cancellation because of completion of term