JP2003015686A

JP2003015686A - Device and method for voice interaction and voice interaction processing program

Info

Publication number: JP2003015686A
Application number: JP2001197645A
Authority: JP
Inventors: Takehide Yano; 武秀屋野; Munehiko Sasajima; 宗彦笹島; Hiroshi Shimomori; 大志下森; Tatsuya Uehara; 龍也上原
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2001-06-29
Filing date: 2001-06-29
Publication date: 2003-01-17
Anticipated expiration: 2021-06-29
Also published as: JP3795350B2

Abstract

PROBLEM TO BE SOLVED: To provide a voice interaction device, a voice interaction method, and a voice interaction processing program that can interpret speech while taking ambiguous utterance boundaries into account, without preparing new grammars or rules. SOLUTION: A speech analysis result obtained by acoustic analysis of an uttered voice is input, and at least one speech analysis result, including at least the result for the most recent utterance, is stored. When the speech analysis result of a new utterance is input, concatenation processing generates a concatenated speech analysis result by joining the stored result for the previous utterance and the result for the new utterance while preserving their temporal order.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a voice interaction device, a voice interaction method, and a voice interaction processing program that accept and interpret a user's voice input and can interpret speech whose utterance boundaries are ambiguous.

[0002]

[Prior Art] In recent years, there has been active research on interfaces that accept text input by voice or natural language. Many expert systems and similar applications that use such interfaces have been developed, and devices that accept voice or text input are now available to the general public. Voice input in particular is regarded as easy for users, and voice interaction devices that accept voice input using speech recognition technology have been developed.

[0003] Conventionally, such a voice interaction device defines a grammar describing the sentences that may be input, treats pauses in speech as delimiters, and adopts as a valid input sentence any speech recognition result that parsing finds to conform to the grammar. However, spoken-language input is peculiar in that utterance boundaries are ambiguous. For example, in a navigation system that uses a voice interaction device, an utterance may be interrupted in the middle of a sentence, as in "near the destination (pause) restaurant". That is, the user's intent may be expressed by multiple utterances separated by a pause ("near the destination" and "restaurant"). In such a case, the input does not match the grammar, so no valid input sentence is obtained.

[0004] One speech interpretation technique that addresses this problem is disclosed in Japanese Patent Laid-Open No. Hei 11-237894. In that technique, for example, the group of interpretation results for "near the destination" is retained, and when "restaurant" is input it is appended to the end of the retained group. In this way, each time a word is input, it is connected in time to the interpretation results obtained so far. Consequently, even if an utterance is cut off mid-sentence, the next utterance is interpreted as a continuation, and an appropriate sentence spanning multiple utterances can be found. To select an appropriate candidate from the group of interpretation results, language understanding rules are created in addition to the parsing grammar, and the candidate given the highest priority by those rules is adopted as the interpretation result.

[0005] However, this technique requires language understanding rules to be created in addition to the grammar. It therefore cannot be carried out with only the grammar of an existing system, and the rules to be newly written become harder to create as the grammar they apply to grows more complex.

[0006] In addition, with voice input the user speaks without polishing the input beforehand, so utterances that rephrase or add conditions may occur. In such cases, a later stage such as dialogue processing must combine the meaning of the interpretation corresponding to the immediately preceding answer with the interpretation of the new input alone (hereinafter "semantic synthesis"). With the above technique, it is unclear whether an interpretation result comes from the new utterance alone or from concatenation with the preceding interpretation, so it is difficult to decide whether semantic synthesis should be performed. Alternatively, if semantic synthesis is not used, a grammar and language understanding rules that handle rephrasing and added conditions must be created separately.

[0007] Furthermore, there are cases where speech separated by a pause should not be concatenated in time during interpretation, for example when the voice interaction device has presented a question. Conversely, when the user speaks without listening to the device's response, interpretation is likely to have started in the middle of the user's utterance, so temporal concatenation should take priority. The above technique, however, makes this decision by referring to linguistic rules without using such feedback, so inappropriate interpretations can result.

[0008]

[Problems to Be Solved by the Invention] As described above, for a conventional voice interaction device to handle voice input with ambiguous utterance boundaries by concatenating the input speech information in time and interpreting it, language understanding rules must be created in addition to the grammar. This cannot be done with only the grammar of an existing system, and the new rules become harder to write as the grammar they apply to grows more complex.

[0009] In addition, when the user rephrases or adds a condition, semantic synthesis must be performed in the later dialogue processing stage. However, because it is unclear whether an interpretation result comes from the new utterance alone or from concatenation with the preceding interpretation, it is difficult to decide whether semantic synthesis should be performed.

[0010] Alternatively, if semantic synthesis is not performed in this case, a grammar and language rules for rephrasing and added conditions must be created separately.

[0011] Furthermore, because the decision is made by referring to linguistic rules without using feedback from the voice interaction device, the decision on whether to concatenate speech information can lead to inappropriate interpretations.

[0012] The present invention has been made in view of these circumstances. Its object is to provide a voice interaction device, a voice interaction method, and a voice interaction processing program that can interpret speech while taking ambiguous utterance boundaries into account without creating new grammars or rules; or that can decide, at interpretation time, whether to perform semantic synthesis for interpreting rephrasing and added conditions; or that can judge the temporal concatenation condition more appropriately by using feedback from the dialogue processing unit.

[0013]

[Means for Solving the Problems] To achieve the above object, the voice interaction device, voice interaction method, and voice interaction processing program according to the present invention are characterized by the following. The speech recognition result of an uttered voice is input, and the speech recognition result for the previous utterance is stored as a concatenation candidate. When the speech recognition result of a new utterance is input, concatenation processing joins it to the stored speech recognition result while preserving temporal order. Parsing then takes the concatenated speech recognition result, refers to a predefined grammar, and judges whether the result conforms to that grammar. Finally, dialogue processing acts on the parsing result: if the speech recognition result matches the grammar, response information corresponding to it is generated; if it does not match, it is stored anew as a concatenation candidate.

[0014] This produces a speech analysis result that spans utterance boundaries, so parsing becomes possible even when an existing parsing method and existing rules are used.

[0015] In the voice interaction device and voice interaction method according to the present invention, the dialogue processing has a function of semantically combining the dialogue history information used for responses so far with new response information derived from the parsing result for a new utterance, and is restricted so that this function is not applied to new response information generated from a concatenated speech analysis result.

[0016] As a result, there is no confusion between interpretation results that span an utterance boundary and those that do not, so it is easy to decide whether semantic synthesis should be performed.

[0017] In the voice interaction device, voice interaction method, and voice interaction processing program according to the present invention, the dialogue processing has a function of reporting the current dialogue state, and the concatenation processing decides whether to concatenate based on the reported dialogue state.

[0018] Because the decision is based on feedback from the dialogue processing, inappropriate interpretations no longer arise in deciding whether to concatenate speech information.

[0019] In the voice interaction device, voice interaction method, and voice interaction processing program according to the present invention, the dialogue processing may have a function of reporting interpretation-end information when generation of the response information finishes, and the concatenation processing may decide whether to concatenate based on the presence or absence of the interpretation-end information.

[0020] In the voice interaction device, voice interaction method, and voice interaction processing program according to the present invention, the dialogue processing may have a function of reporting parse-success information when it accepts a parsing result, and the concatenation processing may decide whether to concatenate based on the parse-success information.

[0021] In the voice interaction device, voice interaction method, and voice interaction processing program according to the present invention, the dialogue processing may have a function of judging the ambiguity of the input handled when the response information was generated and reporting that ambiguity information, and the concatenation processing may decide whether to concatenate based on the ambiguity information of the input.

[0022] In the voice interaction device, voice interaction method, and voice interaction processing program according to the present invention, the concatenation processing may decide whether to concatenate based on the difference between the time at which response generation ended and the time at which the next utterance was input.
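The time-difference criterion of [0022] can be illustrated with a minimal Python sketch. The text states only that the decision depends on the difference between the response generation end time and the next utterance input time. The direction of the rule (a short or negative gap favors concatenation, in line with the remark in [0007] about users who speak without listening to the response), the function name, and the 1.0-second threshold are all assumptions made for illustration.

```python
def should_concatenate_by_gap(response_end_time: float,
                              next_utterance_time: float,
                              max_gap_seconds: float = 1.0) -> bool:
    """Decide whether to concatenate from timing alone.

    Assumption: a small (or negative) gap means the user kept talking
    without waiting for the response, so the new utterance is treated as
    a continuation. The threshold value is hypothetical.
    """
    gap = next_utterance_time - response_end_time
    return gap <= max_gap_seconds


# The user resumes 0.4 s after the response ended: treat as a continuation.
print(should_concatenate_by_gap(10.0, 10.4))
# The user waits 3 s: treat as a fresh utterance.
print(should_concatenate_by_gap(10.0, 13.0))
```

In a real system this signal would presumably be combined with the other feedback described in [0017] to [0021] rather than used on its own.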

[0023]

[Embodiments of the Invention] The present invention provides a voice interaction device, and a voice interaction method for it, that accept and interpret a user's voice input and can interpret speech whose utterance boundaries are ambiguous. The invention is described in detail below with reference to the drawings.

[0024] FIG. 1 is a schematic configuration diagram of a voice interaction device 100 according to an embodiment of the present invention. As shown in FIG. 1, the voice interaction device 100 comprises a concatenation processing unit 101, a syntax analysis unit 102, a dialogue processing unit 103, and a response output unit 104.

[0025] The voice analysis device 110 acoustically analyzes the voice input uttered by the user and outputs a speech recognition result. The voice interaction device 100 interprets the speech recognition result supplied by the voice analysis device 110 and returns a response according to that interpretation.

[0026] The concatenation processing unit 101 is connected to the voice analysis device 110, the syntax analysis unit 102, and the dialogue processing unit 103. It contains a storage unit 101-1 that holds speech recognition results, and it determines whether to concatenate in time the current speech recognition result reported by the voice analysis device 110 with the earlier speech recognition result held in the storage unit 101-1.

[0027] This determination is made on the basis of information received from the dialogue processing unit 103. The information used and the determination method are described in detail later. If the determination is "concatenate", the earlier speech recognition result and the current one are concatenated in time. In that case, a concatenation flag indicating the concatenation is attached to the concatenated recognition result, and both the concatenated recognition result and the current recognition result are passed to the syntax analysis unit 102. If no concatenation is performed, only the current recognition result is passed to the syntax analysis unit 102.
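The hand-off described in [0027] can be sketched in Python as follows. This is a simplified illustration, not the patented implementation: a recognition result is reduced to a flat keyword list instead of a graph, the class and method names are hypothetical, and the policy of always storing the latest result is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RecognitionResult:
    keywords: List[str]          # recognized keywords in time order (simplified)
    concatenated: bool = False   # the concatenation flag


class ConcatenationUnit:
    """Hypothetical stand-in for concatenation processing unit 101."""

    def __init__(self) -> None:
        # Plays the role of storage unit 101-1.
        self.stored: Optional[RecognitionResult] = None

    def process(self, current: RecognitionResult,
                concatenate: bool) -> List[RecognitionResult]:
        """Return the results that would be passed to the syntax analysis unit."""
        outputs: List[RecognitionResult] = []
        if concatenate and self.stored is not None:
            # Join earlier and current results in time order and flag the result.
            outputs.append(RecognitionResult(
                self.stored.keywords + current.keywords, concatenated=True))
        # The current result alone is always passed on as well.
        outputs.append(current)
        # Assumption: keep the latest result as the next concatenation candidate.
        self.stored = current
        return outputs


unit = ConcatenationUnit()
unit.process(RecognitionResult(["destination", "near"]), concatenate=False)
for result in unit.process(RecognitionResult(["restaurant"]), concatenate=True):
    print(result.keywords, result.concatenated)
```

The `concatenate` argument stands in for the determination that the unit derives from the dialogue processing unit's feedback.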

[0028] The syntax analysis unit 102 is connected to the concatenation processing unit 101 and the dialogue processing unit 103. It parses the speech recognition results passed from the concatenation processing unit 101 and reports the group of candidates that the grammar can accept (sentence candidates) to the dialogue processing unit 103. Each sentence candidate obtained from a speech recognition result carrying the concatenation flag also carries the concatenation flag information. An existing parsing method can be used.

[0029] The dialogue processing unit 103 is connected to the syntax analysis unit 102, the concatenation processing unit 101, and the response output unit 104. It interprets the group of sentence candidates reported by the syntax analysis unit 102, determines the sentence candidate to be answered (the answer target candidate), generates a response, and passes it to the response output unit 104. When the reported group contains both concatenated and non-concatenated sentence candidates, those with the concatenation flag take priority. The unit also stores the selected answer target candidate and interprets the next input by performing semantic synthesis with the new information.

[0030] However, semantic synthesis is not performed for sentence candidates that carry the concatenation flag. At the same time, the dialogue processing unit 103 sends the concatenation processing unit 101 the information needed to decide whether to concatenate recognition results. Semantic synthesis and the information needed for that decision are described in detail later.

[0031] The response output unit 104 is connected to the dialogue processing unit 103 and outputs a response based on the response information reported by the dialogue processing unit 103.

[0032] Next, an embodiment of the present invention is described using the specific examples shown in FIGS. 2 to 4.

[0033] A map search task in a navigation system is used here as an example for a more detailed explanation.

[0034] The map search task accepts voice input that specifies a place to search for and searches the map database accordingly. As the voice input, consider the utterance "near the destination (pause) restaurant", in which the utterance is interrupted mid-sentence at the (pause).

[0035] Speech recognition is assumed to use keyword spotting, and the recognition result is a graph like that in FIG. 2, in which each detected keyword is placed at the time it was detected. The "()" inside each node ("destination", "near", "home", etc.) contains the identifier of the cluster to which the word belongs. As cluster information, words denoting places such as "destination", "restaurant", and "Chinese restaurant" are registered under "(place)", for example. This registration information is outside the gist of the present invention and is omitted.

[0036] The target grammar is in the template format shown in FIG. 3. Each line expresses one grammar: the left side of the ":" (Place1, Place2, Place3) is the grammar identifier, and the right side is the template information. Each part of the template information enclosed in "()" is the identifier of a vocabulary cluster, and the template accepts a sequence in which these appear in order from the left. For example, the grammar Place1 accepts the sentence "destination near restaurant" (keywords shown in their original Japanese order). This grammar set does not define, for example, the grammar "(place) (positional relation)", which is a subset of Place1, so it can be said not to allow for an utterance being interrupted partway through a grammar.
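To make the template format concrete, here is a minimal Python sketch of cluster-based template matching. FIG. 3 is not reproduced in this text, so the grammar table is an assumption: the `Place1` template follows the "(place) (positional relation) (place)" pattern named in the description, while `PlaceOnly` is a hypothetical stand-in for whichever grammar accepts a lone "(place)". The cluster dictionary and function name are likewise illustrative.

```python
from typing import Dict, List, Optional, Tuple

# Word-to-cluster registration (illustrative subset).
CLUSTER_OF: Dict[str, str] = {
    "destination": "place",
    "restaurant": "place",
    "Chinese restaurant": "place",
    "near": "positional relation",
}

# Template grammars: identifier -> sequence of cluster identifiers.
GRAMMAR: Dict[str, Tuple[str, ...]] = {
    "Place1": ("place", "positional relation", "place"),
    "PlaceOnly": ("place",),  # hypothetical entry, not taken from FIG. 3
}


def match_grammar(words: List[str]) -> Optional[str]:
    """Return the identifier of the first template the word sequence fits."""
    clusters = tuple(CLUSTER_OF.get(word) for word in words)
    for name, template in GRAMMAR.items():
        if clusters == template:
            return name
    return None


print(match_grammar(["destination", "near", "restaurant"]))
print(match_grammar(["destination", "near"]))
```

The second call returns `None` because no template covers "(place) (positional relation)" on its own, which is exactly the interrupted-utterance case the concatenation step is meant to rescue.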

[0037] In FIG. 2, 201, 202, and 203 are each one speech recognition result, drawn as a graph. Each node corresponds to a recognized word. The horizontal direction is the time axis (left to right), and the vertical direction shows multiple candidates that appeared at the same time because of ambiguity in speech recognition. Edges of the graph connect nodes according to temporal conditions (overlap or separation).

[0038] First, when "near the destination (pause) restaurant" is input to the voice analysis device 110, the device sends the concatenation processing unit 101 the speech recognition result for the voice section corresponding to "near the destination", the part before the (pause).

[0039] The concatenation processing unit 101 determines whether this speech recognition result needs to be concatenated. Because this is the initial state and no recognition result exists to concatenate with, it passes the speech recognition result to the syntax analysis unit 102 as is and temporarily saves speech recognition result 201 in the storage unit 101-1.

[0040] The syntax analysis unit 102 parses speech recognition result 201. In this embodiment, the syntax analysis unit 102 searches the Start-to-End span of the graph of a given speech recognition result for node sequences that match the target grammar; an existing parsing method can be used. Such a node sequence is hereinafter called a sentence candidate.
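The search over the Start-to-End span can be pictured as path enumeration over a small acyclic graph. The Python sketch below is a simplification under stated assumptions: the graph is an adjacency list keyed by word (so each word appears at most once), timing information is dropped, and the example graph is invented rather than copied from FIG. 2.

```python
from typing import Dict, List


def enumerate_paths(edges: Dict[str, List[str]],
                    node: str = "Start") -> List[List[str]]:
    """Return every keyword sequence on a path from `node` to "End".

    Assumes the graph is acyclic and uses "Start" and "End" as sentinels.
    """
    if node == "End":
        return [[]]
    paths: List[List[str]] = []
    for successor in edges.get(node, []):
        for tail in enumerate_paths(edges, successor):
            head = [] if node == "Start" else [node]
            paths.append(head + tail)
    return paths


# Hypothetical recognition graph with two competing candidates after "destination".
graph = {
    "Start": ["destination"],
    "destination": ["near", "home"],
    "near": ["End"],
    "home": ["End"],
}
print(enumerate_paths(graph))
```

Each enumerated sequence would then be checked against the grammar, and only the sequences the grammar accepts become sentence candidates.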

[0041] In this case, speech recognition result 201 can yield only candidates of the form "(place) (positional relation)", which do not fit the grammar of FIG. 3, so the syntax analysis unit 102 notifies the dialogue processing unit 103 that parsing has failed.

[0042] The dialogue processing unit 103 normally interprets the sentence candidates reported by the syntax analysis unit 102, but in this case it has been notified of the parsing failure, so it reports that fact to the concatenation processing unit 101.

[0043] Next, the voice analysis device 110 sends the concatenation processing unit 101 speech recognition result 202 for the voice section corresponding to "restaurant", the part after the (pause).

[0044] The concatenation processing unit 101 determines whether this speech recognition result needs to be concatenated. Because the immediately preceding parse failed, the determination is "concatenate" (the determination is described in detail later). The speech recognition results 201 and 202 of FIG. 2 are then concatenated, and the concatenated recognition result 203 and the newly input recognition result 202 are passed to the syntax analysis unit 102.

[0045] Concatenation joins the temporarily saved speech recognition result and the newly input speech recognition result in time (FIG. 2).

[0046] That is, the newly input speech recognition result is placed to the right of the saved recognition result, and edges are created between their nodes so that the two form a single speech recognition result. Start and End nodes are removed as needed so that the remaining ones mark the beginning and end of the whole voice section.

[0047] A concatenation flag that explicitly marks the concatenation is attached to the concatenated recognition result.

[0048] In this example, speech recognition result 202 is placed to the right of the "near" node at the tail of speech recognition result 201, and edges are created according to the time conditions, joining the two recognition results (203). Speech recognition result 203 is equivalent to a recognition result obtained for the whole voice section corresponding to "a restaurant near the destination". The concatenation flag is also attached to speech recognition result 203.
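The graph-level concatenation of results 201 and 202 into 203 can be sketched in Python as below. This is an illustration under assumptions, not the patent's algorithm: graphs are adjacency lists with "Start" and "End" sentinels, node names are assumed not to collide between the two results, and every final node of the earlier graph is linked to every initial node of the later one, whereas the description creates edges according to time conditions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Lattice:
    edges: Dict[str, List[str]]   # adjacency list with "Start"/"End" sentinels
    concatenated: bool = False    # the concatenation flag


def concatenate(earlier: Lattice, later: Lattice) -> Lattice:
    """Place `later` to the right of `earlier` and join them into one graph."""
    first_nodes_of_later = later.edges.get("Start", [])
    merged: Dict[str, List[str]] = {}
    for node, successors in earlier.edges.items():
        rewired: List[str] = []
        for successor in successors:
            if successor == "End":
                # Drop the inner End node and splice in the later graph.
                rewired.extend(first_nodes_of_later)
            else:
                rewired.append(successor)
        merged[node] = rewired
    for node, successors in later.edges.items():
        if node != "Start":       # drop the inner Start node
            merged[node] = list(successors)
    return Lattice(merged, concatenated=True)


result_201 = Lattice({"Start": ["destination"], "destination": ["near"], "near": ["End"]})
result_202 = Lattice({"Start": ["restaurant"], "restaurant": ["End"]})
result_203 = concatenate(result_201, result_202)
print(result_203.edges, result_203.concatenated)
```

Only the outer Start and End survive, so the merged graph again describes a single voice section, and the input graphs are left unmodified so that 202 can still be parsed on its own.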

[0049] The syntax analysis unit 102 parses speech recognition results 203 and 202. Result 203 can yield candidates of the form "(place) (positional relation) (place)", and sentence candidates such as "destination near restaurant" are found. Result 202 can yield candidates of the form "(place)", and sentence candidates such as "restaurant" are found. All of these sentence candidates are reported to the dialogue processing unit 103.

[0050] Because the dialogue processing unit 103 gives priority to sentence candidates with the concatenation flag when selecting the answer target, it adopts the sentence candidate "destination near restaurant". Since this candidate carries the concatenation flag, semantic synthesis is not performed.
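The selection rule in [0050] amounts to a small priority filter, sketched here in Python. The type and function names are hypothetical, and picking the first candidate within the preferred group is an assumption, since the text does not say how ties are broken.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SentenceCandidate:
    words: List[str]
    concatenated: bool   # concatenation flag inherited from the recognition result


def select_answer_target(
        candidates: List[SentenceCandidate]) -> Optional[SentenceCandidate]:
    """Prefer candidates carrying the concatenation flag."""
    if not candidates:
        return None
    flagged = [c for c in candidates if c.concatenated]
    pool = flagged or candidates
    return pool[0]   # assumption: first candidate wins within the pool


def needs_semantic_synthesis(candidate: SentenceCandidate) -> bool:
    """Semantic synthesis is skipped for concatenated candidates."""
    return not candidate.concatenated


candidates = [
    SentenceCandidate(["restaurant"], concatenated=False),
    SentenceCandidate(["destination", "near", "restaurant"], concatenated=True),
]
chosen = select_answer_target(candidates)
print(chosen.words, needs_semantic_synthesis(chosen))
```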

[0051] The dialogue processing unit 103 generates response information on this basis and passes it to the response output unit 104, which outputs a response to the user based on that information.

[0052] Next, the semantic synthesis performed by the dialogue processing unit 103 is described with reference to FIG. 4. As an example, suppose the user first says (1) "a restaurant near the destination" and, after the system answers, says (2) "on second thought, a Chinese restaurant".

[0053] In this case, assume that when the voice analysis device 110 sends the speech recognition result for input (2) to the concatenation processing unit 101, the concatenation processing unit 101 determines "do not concatenate" and passes the recognition result for the voice section "on second thought, a Chinese restaurant" to the syntax analysis unit 102.

【００５４】構文解析部102ではこの入力から文章候補
「中華料理屋」を生成し、対話処理部103に通知し、対
話処理部103では、文章候補「中華料理屋」には連結フ
ラグが付与されていないため、意味合成が実行される。The syntax analysis unit 102 generates the sentence candidate "Chinese restaurant" from this input and notifies the dialogue processing unit 103. Because no concatenation flag is attached to the sentence candidate "Chinese restaurant", the dialogue processing unit 103 executes semantic synthesis.

【００５５】図４において401が入力（１）に対してシ
ステムが回答を行った際の回答対象候補であり、文章候
補「目的地近くレストラン」に相当する。402が入
力（２）に対する文章候補「中華料理屋」を解釈したも
のとなっている。この解釈結果は場所を表す“Place”
という式で表されており、Placeには３種類のスロット
が存在する。各スロットにおいて“：”の左には属性、
右側には属性値が与えられる。In FIG. 4, 401 is the answer target candidate from when the system answered input (1), and corresponds to the sentence candidate "restaurant near the destination". 402 is the interpretation of the sentence candidate "Chinese restaurant" for input (2). Each interpretation result is expressed as a "Place" expression that denotes a place, and Place has three kinds of slots. In each slot, the attribute is given to the left of ":" and the attribute value to the right.

【００５６】401を例に取ると、「基準点が『目的
地』、基準点に関する位置関係が『近く』、対象となる
クラスが『レストラン』である」という情報に入力文が
解釈されていることを表している。意味合成処理は対話
処理部103に保存されている回答対象候補のスロット毎
に新規入力に対応する文章候補のスロットの属性値を上
書きする処理である。その結果が403であり、新規入力
である402のクラスの値「中華料理屋」が401のクラスス
ロットに上書きし、「目的地近く中華料理屋」となる
候補が得られる。Taking 401 as an example, the input sentence has been interpreted as the information "the reference point is 'destination', the positional relationship to the reference point is 'near', and the target class is 'restaurant'". Semantic synthesis overwrites, slot by slot, the answer target candidate stored in the dialogue processing unit 103 with the slot attribute values of the sentence candidate for the new input. The result is 403: the class value "Chinese restaurant" of the new input 402 overwrites the class slot of 401, yielding the candidate "Chinese restaurant near the destination".
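The slot-wise overwrite described for 401, 402 and 403 can be sketched as follows. This is a minimal illustration, not the patented implementation: the slot names `reference_point`, `relation` and `class`, and the use of `None` for a slot that the new input leaves unfilled, are assumptions made for the example.

```python
from typing import Dict, Optional

# Hypothetical slot names; the text only states that Place has three kinds of slots.
Place = Dict[str, Optional[str]]


def semantic_synthesis(previous: Place, new_input: Place) -> Place:
    """Overwrite each slot of the previous candidate with the new input's value.

    Assumption: a slot the new input does not fill is None and leaves the
    previous value untouched.
    """
    merged = dict(previous)  # copy so the stored candidate is not mutated
    for attribute, value in new_input.items():
        if value is not None:
            merged[attribute] = value
    return merged


candidate_401 = {"reference_point": "destination", "relation": "near", "class": "restaurant"}
candidate_402 = {"reference_point": None, "relation": None, "class": "Chinese restaurant"}
candidate_403 = semantic_synthesis(candidate_401, candidate_402)
print(candidate_403)
```

Only the class slot changes here, which matches the 403 result described above.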

【００５７】図５は本発明の一実施形態に係る音声対話
装置における認識結果の連結処理の概略構成である。以
下、図１及び図５を参照して認識結果の連結処理につい
て説明する。FIG. 5 is a schematic configuration of a process of connecting recognition results in the voice interactive apparatus according to the embodiment of the present invention. Hereinafter, the process of connecting the recognition results will be described with reference to FIGS. 1 and 5.

【００５８】連結処理部101は対話処理部103から(1)対
話状態(S501)、(2)解析終了フラグ(S502)、(3)構文解析
成功フラグ(S503)、(4)条件不足フラグ(S504)、(5)時間
条件(S505)の情報を受け取り、それらに基づいて連結の
判定をしその結果に基づき認識結果の連結処理を実行す
る。なお、初期値として(1)対話状態情報＝"追加"、(2)
解釈終了フラグ＝"φ"、(3)構文解析成功フラグ＝"
φ"、(4)条件不足フラグ＝"φ"、(5)応答終了時刻＝"
φ"が与えられている。なお、"φ"は非通知状態である
ことを意味する。The concatenation processing unit 101 receives the following information from the dialogue processing unit 103: (1) the dialogue state (S501), (2) the analysis end flag (S502), (3) the syntax analysis success flag (S503), (4) the condition shortage flag (S504), and (5) the time condition (S505). It decides on concatenation from this information and, based on that decision, executes the concatenation of recognition results. The initial values are (1) dialogue state information = "addition", (2) interpretation end flag = "φ", (3) syntax analysis success flag = "φ", (4) condition shortage flag = "φ", and (5) response end time = "φ", where "φ" means that nothing has been notified.

【００５９】はじめに、対話処理部103の対話状態を確
認する（S501）。ここでは対話状態として音声解析装置
110からの認識結果の連結が不要であることを明示する
『更新』とそれ以外の『追加』の2種類を定義する。こ
の情報は対話処理部103で変更が検出されるたびに対話
処理部103から連結処理部101に通知する。例えば「更
新」状態は、対話処理部103で問い返しを決定した場合
など、特定の語彙を待つ状態になる時に検出される。こ
の時は次に入力される特定の語彙だけが必要であるため
に以前の認識結果は不要となる。更新状態が通知されて
いる場合には「連結なし」と判断しS507へ、そうでなけ
れば解析終了フラグの有無の判断であるS502へ進む。First, the dialogue state of the dialogue processing unit 103 is checked (S501). Two dialogue states are defined here: "update", which explicitly indicates that recognition results from the voice analysis device 110 need not be concatenated, and "addition", which covers everything else. The dialogue processing unit 103 notifies the concatenation processing unit 101 of this information each time it detects a change. For example, the "update" state is detected when the system enters a state of waiting for specific vocabulary, such as when the dialogue processing unit 103 decides to ask the user a question in return. In that case only the specific vocabulary input next is needed, so the previous recognition result is unnecessary. If the update state has been notified, the result is "no concatenation" and processing proceeds to S507; otherwise it proceeds to S502, which checks for the analysis end flag.

【００６０】S502では、新規入力が行われた時が解釈中
・応答生成中かどうかの判定を行う。この情報は、例え
ば対話処理部103が解釈終了あるいは応答生成部104の応
答生成終了をモニタすることで検出することが可能であ
る。この検出時に解釈終了したことを意味する解釈終了
フラグを対話処理部103から連結処理部101に通知する。S502 determines whether the new input arrived while interpretation or response generation was still in progress. This can be detected, for example, by having the dialogue processing unit 103 monitor the end of interpretation or the end of response generation by the response generation unit 104. On detecting it, the dialogue processing unit 103 notifies the concatenation processing unit 101 of an interpretation end flag, which means that interpretation has finished.

【００６１】連結処理部101では、解釈終了フラグを受
け取っていない時に次の認識結果を受け取った場合は、
利用者の入力が終了する前の認識結果に対して解釈を始
めたものと判断し、連結が必要という判定をする。連結
処理部101が解釈終了フラグを受け取っていない場合はS
506「連結あり」と判断しS506へ、受け取っている場合
は構文解析成功フラグの有無の判断であるS503へ進む。If the concatenation processing unit 101 receives the next recognition result before it has received the interpretation end flag, it judges that interpretation began on a recognition result obtained before the user had finished speaking, and decides that concatenation is necessary. If the concatenation processing unit 101 has not received the interpretation end flag, the result is "concatenate" and processing proceeds to S506; if it has, processing proceeds to S503, which checks for the syntax analysis success flag.

【００６２】S503では、直前の認識結果に対する構文解
析が成功したかどうかによる判定を行う。この情報は、
例えば対話処理部103が構文解析部102から有意な構文解
析結果が通知されたか否かを検出することで得ることが
可能である。有意な構文解析結果が通知された時に、構
文解析に成功したことを意味する構文解析成功フラグを
対話処理部103から連結処理部101に通知する。S503 makes a determination based on whether syntax analysis of the immediately preceding recognition result succeeded. This information can be obtained, for example, by having the dialogue processing unit 103 detect whether the syntax analysis unit 102 has reported a meaningful syntax analysis result. When a meaningful result is reported, the dialogue processing unit 103 notifies the concatenation processing unit 101 of a syntax analysis success flag, which means that syntax analysis succeeded.

【００６３】連結処理部101では、構文解析成功フラグ
を受け取っていない時に次の認識結果を受け取った場合
は、直前の認識結果が、それに対する構文解析に成功し
ていないために発話の途中であるという可能性があると
判断し、認識結果の連結が必要と判定する。連結処理部
101が構文解析成功フラグを受け取っていない場合には
「連結あり」と判断しS506へ、そうでなければ条件不足
フラグの有無の判断であるS504に進む。If the concatenation processing unit 101 receives the next recognition result before it has received the syntax analysis success flag, it judges that the immediately preceding recognition result may be the middle of an utterance, because syntax analysis of it did not succeed, and decides that the recognition results must be concatenated. If the concatenation processing unit 101 has not received the syntax analysis success flag, the result is "concatenate" and processing proceeds to S506; otherwise it proceeds to S504, which checks for the condition shortage flag.

【００６４】S504では、直前の解釈結果があいまいかど
うかの判定を行う。この情報は、対話処理部103で構文
解析結果を解釈し回答対象候補を生成する度にあいまい
か否かを判定したものを利用する。判定基準は、例えば
データベース検索をタスクとした場合には対話処理部10
3における解釈時に回答対象候補の検索結果が所定の個
数を上回った場合にあいまいであると判定することや、
あるいは回答対象候補の元となる入力条件に与える条件
の数・種類などに基づき判定することも可能である。S504 determines whether the immediately preceding interpretation result is ambiguous. This uses a judgment that the dialogue processing unit 103 makes each time it interprets a syntax analysis result and generates an answer target candidate. As for the criterion, when the task is a database search, for example, the result can be judged ambiguous if the number of search results for the answer target candidate exceeds a predetermined number at interpretation time in the dialogue processing unit 103; alternatively, the judgment can be based on the number and kinds of conditions supplied in the input that the answer target candidate derives from.

【００６５】あいまいであると判断された場合には、そ
れを意味する条件不足フラグを対話処理部103から連結
処理部101に通知する。連結処理部101では、条件不足フ
ラグ受け取った時に次の認識結果を受け取った場合は今
後続けて条件指定が見込まれるため連結が必要であると
判定する。連結処理部101が条件不足フラグを受け取っ
ている場合には「連結あり」と判断しS506へ、そうでな
ければ時間条件の判断であるS505へ進む。When the result is judged ambiguous, the dialogue processing unit 103 notifies the concatenation processing unit 101 of a condition shortage flag to that effect. If the concatenation processing unit 101 receives the next recognition result after having received the condition shortage flag, it decides that concatenation is necessary, because further conditions are expected to follow. If the concatenation processing unit 101 has received the condition shortage flag, the result is "concatenate" and processing proceeds to S506; otherwise it proceeds to S505, which checks the time condition.

【００６６】S505では、時間的な連結条件を判定する。
この判定は応答生成の終了時刻と、次発話開始時刻との
差分である発話間隔をしきい値処理することで行うこと
ができる。応答生成の終了時刻は、例えば対話処理部10
3が応答生成部104の応答生成終了時刻をモニタすること
で取得することが可能である。次発話開始時刻は、例え
ば連結処理部101に認識結果が通知される時刻などで取
得することが可能である。対話処理部103から連結処理
部101へは応答生成終了時刻を通知する。S505 evaluates the temporal concatenation condition. This can be done by thresholding the utterance interval, that is, the difference between the end time of response generation and the start time of the next utterance. The end time of response generation can be obtained, for example, by having the dialogue processing unit 103 monitor when the response generation unit 104 finishes generating the response. The start time of the next utterance can be obtained, for example, from the time at which the recognition result is notified to the concatenation processing unit 101. The dialogue processing unit 103 notifies the concatenation processing unit 101 of the response generation end time.

【００６７】連結処理部101では発話間隔を計算し、そ
れが所定の値よりも大きければ発話の関連性がないもの
とし、連結処理は不要であると判定することができる。
発話間隔が所定の値よりも大きければ「連結なし」と判
断しS507へ、そうでなければ「連結あり」と判断しS506
へ進む。The concatenation processing unit 101 calculates the utterance interval; if it is larger than a predetermined value, the utterances can be regarded as unrelated and concatenation judged unnecessary. If the utterance interval is larger than the predetermined value, the result is "no concatenation" and processing proceeds to S507; otherwise the result is "concatenate" and processing proceeds to S506.
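The judgment sequence S501 to S505 can be summarized as a short decision function. This is only a sketch under stated assumptions: the names `JudgmentInfo` and `should_concatenate` are hypothetical, `None` stands for the not-notified state "φ", the 3-second threshold is an arbitrary placeholder for the "predetermined value", and treating a missing response end time as "concatenate" is an assumption the text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgmentInfo:
    """Information notified by the dialogue processing unit; None stands for "φ"."""
    dialogue_state: str = "addition"            # "addition" or "update"
    interpretation_ended: Optional[bool] = None
    parse_succeeded: Optional[bool] = None
    condition_shortage: Optional[bool] = None
    response_end_time: Optional[float] = None   # seconds


def should_concatenate(info: JudgmentInfo, utterance_start: float,
                       max_interval: float = 3.0) -> bool:
    """Return True for "concatenate" (go to S506), False for "no concatenation" (S507)."""
    if info.dialogue_state == "update":   # S501: waiting for specific vocabulary
        return False
    if not info.interpretation_ended:     # S502: input arrived before interpretation ended
        return True
    if not info.parse_succeeded:          # S503: previous result could not be parsed
        return True
    if info.condition_shortage:           # S504: previous interpretation was ambiguous
        return True
    if info.response_end_time is None:    # assumption: response not finished yet
        return True
    # S505: threshold the gap between response end and the next utterance start.
    return (utterance_start - info.response_end_time) <= max_interval
```

The checks are ordered as in the flow above, so the dialogue state takes precedence over every other condition.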

【００６８】S506では、以前の認識結果と通知された新
規の認識結果とを連結する。連結の手続きは前述(図２)
のように実行する。その後、構文解析部102へ連結した
認識結果と新規の認識結果をそれぞれ通知する。S508へ
進む。In S506, the previous recognition result and the newly notified recognition result are concatenated. The concatenation procedure is carried out as described earlier (FIG. 2). The concatenated recognition result and the new recognition result are then each notified to the syntax analysis unit 102, and processing proceeds to S508.

【００６９】S507では、連結する必要がないと判断され
たため、構文解析部102へ新規の認識結果のみを通知
し、S508へ進む。In S507, since it is determined that the connection is not necessary, the syntactic analysis unit 102 is notified of only the new recognition result, and the process proceeds to S508.

【００７０】S508では、連結処理部101の初期化処理を
行う。初期化処理は対話処理部103から通知される各情
報を初期化する処理と、記憶する認識結果の更新処理か
らなる。例えば、対話処理部103から通知される情報
を、対話状態情報＝"追加"、解釈終了フラグ＝"φ"、構
文解析成功フラグ＝"φ"、条件不足フラグ＝"φ"、応答
終了時刻＝"φ"と初期化する。尚"φ"は非通知状態であ
ることを意味する。更に、これまでに保存されていた認
識結果を破棄し、今回入力された新規の認識結果を連結
処理部101で保存しておく。In S508, the concatenation processing unit 101 is initialized. Initialization consists of resetting each piece of information notified from the dialogue processing unit 103 and updating the stored recognition result. For example, the information notified from the dialogue processing unit 103 is reset to dialogue state information = "addition", interpretation end flag = "φ", syntax analysis success flag = "φ", condition shortage flag = "φ", and response end time = "φ", where "φ" means that nothing has been notified. In addition, the recognition result stored so far is discarded, and the new recognition result input this time is stored in the concatenation processing unit 101.
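Steps S506 to S508 can be sketched as one method that forwards results and then replaces the stored one. The representation is an assumption made for illustration: a recognition result is modeled as a time-ordered list of word slots, each holding alternative keyword candidates, and the concatenation flag is carried alongside each forwarded result. The class name `ConcatenationUnit` is hypothetical, and the reset of the notified flags in S508 is left out.

```python
from typing import List, Optional, Tuple

# Assumption: a recognition result is a time-ordered list of word slots,
# each slot holding alternative keyword candidates.
Recognition = List[List[str]]


class ConcatenationUnit:
    def __init__(self) -> None:
        self.previous: Optional[Recognition] = None  # nothing stored initially

    def process(self, new: Recognition, concatenate: bool) -> List[Tuple[Recognition, bool]]:
        """Return (recognition result, concatenation flag) pairs for the parser."""
        outputs: List[Tuple[Recognition, bool]] = []
        if concatenate and self.previous is not None:
            # S506: earlier utterance first, so temporal order is preserved.
            outputs.append((self.previous + new, True))
        # S506 and S507 both forward the new result on its own.
        outputs.append((new, False))
        # S508: discard the old stored result and keep the new one.
        self.previous = new
        return outputs


unit = ConcatenationUnit()
first = [["destination", "home"], ["near"]]
second = [["restaurant", "supermarket"]]
out_first = unit.process(first, concatenate=True)    # nothing stored yet
out_second = unit.process(second, concatenate=True)  # joined with `first`
```

With nothing stored, a "concatenate" decision still forwards only the new result, as in the first step of [Dialogue 1].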

【００７１】以上が本発明に係る音声対話装置装置の構
成とその機能、及びその音声対話方式である。The above is the configuration and the function of the voice dialogue apparatus and the voice dialogue system according to the present invention.

【００７２】さらに、本実施形態の音声認識装置をナビ
ゲーションシステムにおける地図検索タスクに適用し
て、その動作について［対話1］乃至［対話３］を例に
して詳細に説明する。［対話1］ User1：「目的地の近くの(ポーズ) レストラン」前述した利用者が音声認識装置に対して発した「目的地
の近くの(ポーズ) レストラン」について図１乃至図２
を用いてさらに説明する。The operation of the voice recognition apparatus of this embodiment, applied to a map search task in a navigation system, is now described in detail using [Dialogue 1] to [Dialogue 3] as examples. [Dialogue 1] User1: "Near the destination (pause) restaurant". The utterance "Near the destination (pause) restaurant", spoken by the user to the voice recognition apparatus as mentioned earlier, is explained further with reference to FIGS. 1 and 2.

【００７３】まず、(ポーズ)の前の「目的地の近くの」
の音声が検出され、音声解析装置110によって音声認識
が行われる。この結果が図２の201であり、これが連結
処理部101に通知される。このとき連結処理部101は初期
状態であり、上述したように対話状態＝"追加"、解釈終
了フラグ＝"φ"、構文解析成功フラグ＝"φ"、条件不足
フラグ＝"φ"、応答生成終了時刻＝"φ"、保存されてい
る認識結果なしとなっている。First, the speech before the (pause), "near the destination", is detected and recognized by the voice analysis device 110. The result is 201 in FIG. 2, which is notified to the concatenation processing unit 101. At this point the concatenation processing unit 101 is in its initial state: as described above, dialogue state = "addition", interpretation end flag = "φ", syntax analysis success flag = "φ", condition shortage flag = "φ", response generation end time = "φ", and no recognition result is stored.

【００７４】この場合、解釈終了フラグがないためにS5
02で『連結あり』と判定しS506に進むが、以前の認識結
果が記憶されていないために新規の認識結果である図２
の201を構文解析部102に通知する。In this case, because there is no interpretation end flag, S502 gives "concatenate" and processing proceeds to S506; however, since no previous recognition result is stored, only the new recognition result, 201 in FIG. 2, is notified to the syntax analysis unit 102.

【００７５】構文解析部102では、図２の201を対象にし
て図３の文法を適用して構文解析すると、"(位置関係)"
で終了する文法は定義されていないために構文解析でき
ず失敗し、構文解析に失敗したことを対話処理部103に
通知する。When the syntax analysis unit 102 applies the grammar of FIG. 3 to 201 in FIG. 2, syntax analysis fails because no grammar ending in "(positional relation)" is defined, and the unit notifies the dialogue processing unit 103 of the failure.

【００７６】対話処理部103では、構文解析に失敗した
ことを受理したことで解釈終了したと判定し、解釈終了
フラグを連結処理部101に通知する。その後解釈失敗し
た旨を通知する応答を生成するように応答生成部104に
通知する。The dialogue processing unit 103 determines that the interpretation is completed by accepting that the syntactic analysis has failed, and notifies the concatenation processing unit 101 of the interpretation completion flag. After that, the response generation unit 104 is notified so as to generate a response indicating that the interpretation has failed.

【００７７】続いて、「レストラン」の部分が音声解析
装置110で音声認識され、図２の202が認識結果として対
話管理部101に通知される。このとき連結処理部101の判
定情報の状態は、対話状態＝"追加"、解釈終了フラグ
＝"終了"、構文解析成功フラグ＝"φ"、条件不足フラグ
＝"φ"、応答生成終了時刻＝"φ"となっている。ここ
で、連結判定を実行すると構文解析成功フラグが通知さ
れていないのでS503で連結と判定され、S506に進む。S5
06では直前の認識結果である図２の201と新規の認識結
果である図２の202を連結し、新たに認識結果、図２の2
01を生成し図２の201と202を構文解析部102に通知す
る。Next, the "restaurant" portion is recognized by the voice analysis device 110, and 202 in FIG. 2 is notified to the dialogue management unit 101 as the recognition result. At this point the judgment information held by the concatenation processing unit 101 is: dialogue state = "addition", interpretation end flag = "end", syntax analysis success flag = "φ", condition shortage flag = "φ", and response generation end time = "φ". When the concatenation judgment is run, the syntax analysis success flag has not been notified, so S503 gives "concatenate" and processing proceeds to S506. In S506, the immediately preceding recognition result, 201 in FIG. 2, and the new recognition result, 202 in FIG. 2, are concatenated to generate a new recognition result, 201 in FIG. 2, and 201 and 202 in FIG. 2 are notified to the syntax analysis unit 102.

【００７８】構文解析部102で構文解析を実行する。音
声認識方式はキーワードスポッティングを用いると図２
の201からは音響スコアの順に「目的地近くレストラ
ン」、「目的地近くスーパー」、「うち近くレスト
ラン」、「うち近くスーパー」が出力される。また、
図２の202からは音響スコア順に「レストラン」、「ス
ーパー」が出力される。これを対話処理部103に通知す
る。The syntax analysis unit 102 executes syntax analysis. With keyword spotting as the speech recognition method, 201 in FIG. 2 yields, in order of acoustic score, "restaurant near destination", "supermarket near destination", "restaurant near home", and "supermarket near home". 202 in FIG. 2 yields, in order of acoustic score, "restaurant" and "supermarket". These are notified to the dialogue processing unit 103.

【００７９】対話処理部103では、まず構文解析が成功
したことを連結処理部101に通知する。連結された認識
結果を優先的に取り扱うため、図２の201から得られた
候補を優先する。このとき、直前の入力に対する解釈は
失敗しているため全ての候補について意味合成は行われ
ない。その結果「目的地近くレストラン」が解釈結果
として適当であると決定し、これを回答対象候補とす
る。回答のためのデータベース検索を実行し、応答情報
を生成する。ここでこの検索結果が少ない数で収まった
とする。このときに解釈終了フラグを連結処理部101に
通知する。その後、応答生成部104で「目的地近くレ
ストラン」に関する応答文が利用者に提示される。The dialogue processing unit 103 first notifies the concatenation processing unit 101 that syntax analysis succeeded. Because concatenated recognition results are handled preferentially, the candidates obtained from 201 in FIG. 2 take priority. Since interpretation of the immediately preceding input failed, semantic synthesis is not performed for any candidate. As a result, "restaurant near destination" is judged appropriate as the interpretation result and becomes the answer target candidate. A database search for the answer is executed and response information is generated. Assume here that the search returns only a small number of results. The interpretation end flag is then notified to the concatenation processing unit 101, after which the response generation unit 104 presents the user with a response sentence about "restaurant near destination".

【００８０】以上のように、音声認識結果を連結すると
いう処理を追加することによって、文章の途中で発話が
中断されることを想定していない文法でも発話単位のあ
いまい性に対応することが可能である。As described above, by adding the process of concatenating the speech recognition results, it is possible to deal with the ambiguity of the utterance unit even in a grammar that does not assume that the utterance is interrupted in the middle of a sentence. Is.

【００８１】なお、「目的地近く」を受理可能な文法
を作成している場合でも、「目的地の近くの場所」とい
うあいまいな条件を指定することになるため、あいまい
であると判断することが可能である。このときは連結処
理部101が条件不足フラグを受け取るため、S504で『連
結あり』と判定され、最終的には「目的地近くレスト
ラン」が選択される。Even when a grammar that accepts "near destination" has been created, the input specifies only the vague condition "a place near the destination", so it can be judged ambiguous. In that case the concatenation processing unit 101 receives the condition shortage flag, S504 gives "concatenate", and "restaurant near destination" is ultimately selected.

【００８２】また、User1'：「目的地の（ポーズ）近く
のレストラン」という発話の場合、「目的地」は構文解
析可能(図３のPlace2に該当)となるが、このようなポー
ズは一般的に短く、解釈中あるいは応答文生成中に「近
くのレストラン」が入力されることになる。このため、
S502あるいはS505で連結と判定され、最終的には「目的
地近くレストラン」が選択される。［対話2］ User2：「目的地の近くのレストラン」 System2：「目的地の近くにはファミリーレストランが
あります」 User3：「やっぱり中華料理屋」利用者が発した「目的地の近くのレストラン」の音声
について本実施形態の音声対話装置が音声認識した後、
応答出力部104から「目的地の近くにはファミリーレス
トランがあります」が出力され、さらにこの応答として
本実施形態の音声認識装置に対して「やっぱり中華料理
屋」と発した場合について図１乃至図５を用いてさらに
説明する。For the utterance User1': "Destination (pause) nearby restaurant", "destination" can be parsed on its own (it corresponds to Place2 in FIG. 3), but a pause of this kind is generally short, so "nearby restaurant" arrives while interpretation or response sentence generation is still in progress. Consequently S502 or S505 gives "concatenate", and "restaurant near destination" is ultimately selected. [Dialogue 2] User2: "Restaurant near the destination" System2: "There is a family restaurant near the destination" User3: "After all, Chinese restaurant". The following case is now explained further with reference to FIGS. 1 to 5: the voice dialogue apparatus of this embodiment recognizes the user's utterance "Restaurant near the destination", the response output unit 104 outputs "There is a family restaurant near the destination", and the user then says "After all, Chinese restaurant" to the voice recognition apparatus of this embodiment in reply.

【００８３】まず、「目的地の近くのレストラン」の音
声が検出され、音声解析装置110によって音声認識が行
われその結果が連結処理部101に通知される。このとき
の連結処理部101は初期状態である。［対話1］の時と同
様に以前の認識結果が記憶されていないために新規入力
の認識結果のみを構文解析部102に通知する。First, the voice of "restaurant near the destination" is detected, the voice analysis device 110 performs voice recognition, and the result is notified to the connection processing unit 101. The connection processing unit 101 at this time is in the initial state. As in the case of [Dialogue 1], since the previous recognition result is not stored, only the recognition result of the new input is notified to the syntax analysis unit 102.

【００８４】構文解析部102では、構文解析を実行す
る。その結果、図６の601の「目的地近くレストラン」
を得る。これを対話処理部103に通知する。The syntax analysis unit 102 executes syntax analysis and obtains "restaurant near destination", 601 in FIG. 6. This is notified to the dialogue processing unit 103.

【００８５】対話処理部103では、まず構文解析が成功
したことを連結処理部101に通知する。初期状態では意
味合成をするための直前の回答対象候補が存在しないた
め、「目的地近くレストラン」は解釈結果として適当
であると決定し、これを回答対象候補とする。The dialogue processing unit 103 first notifies the concatenation processing unit 101 that the syntax analysis has succeeded. In the initial state, there is no previous answer target candidate for semantic synthesis, so it is determined that “restaurant near the destination” is appropriate as the interpretation result, and this is set as the answer target candidate.

【００８６】回答のためのデータベース検索を実行し、
応答情報を生成する。ここでこの検索結果が少ない数で
収まったとする。このときに解釈終了フラグを連結処理
部101に通知する。その後、応答生成部104で「目的地
近くレストラン」に関する応答文「目的地の近くには
ファミリーレストランがあります」が利用者に提示され
る。A database search for the answer is executed and response information is generated. Assume here that the search returns only a small number of results. The interpretation end flag is then notified to the concatenation processing unit 101. After that, the response generation unit 104 presents the user with the response sentence for "restaurant near destination", namely "There is a family restaurant near the destination".

【００８７】続いて、この応答文を受けて、「やっぱり
中華料理屋」の部分が音声解析装置110で音声認識さ
れ、図６の602が認識結果として連結処理部101に通知さ
れる。「目的地の近くにはファミリーレストランがあり
ます」と「やっぱり中華料理屋」との発話間隔は短いも
のとする。Next, in reply to this response sentence, the utterance "After all, Chinese restaurant" is recognized by the voice analysis device 110, and 602 in FIG. 6 is notified to the concatenation processing unit 101 as the recognition result. The interval between "There is a family restaurant near the destination" and "After all, Chinese restaurant" is assumed to be short.

【００８８】このとき連結処理部101の判定情報の状態
は、対話状態＝"追加"、解釈終了フラグ＝"終了"、構文
解析成功フラグ＝"成功"、条件不足フラグ＝"φ"、応答
生成終了時刻＝T2となっている。ここで、連結判定を実
行すると、「目的地の近くにはファミリーレストランが
あります」と「やっぱり中華料理屋」の発話間隔は短い
ためにS505で連結と判定され、S506に進む。S506では直
前の認識結果と新規の認識結果601を連結し、新たに認
識結果、図５の603が得られたとする。この図５の603と
602を構文解析部102に通知する。At this point the judgment information held by the concatenation processing unit 101 is: dialogue state = "addition", interpretation end flag = "end", syntax analysis success flag = "success", condition shortage flag = "φ", and response generation end time = T2. When the concatenation judgment is run, S505 gives "concatenate" because the interval between "There is a family restaurant near the destination" and "After all, Chinese restaurant" is short, and processing proceeds to S506. In S506, the immediately preceding recognition result and the new recognition result 601 are concatenated; assume this yields a new recognition result, 603 in FIG. 5. This 603 in FIG. 5 and 602 are notified to the syntax analysis unit 102.

【００８９】構文解析部102で構文解析を実行する。図
６の603からは音響スコアの順に「目的地近く中華料
理屋」「目的地近くレストラン」「うち近く中華料
理屋」「うち近くレストラン」を、図６の602からは
「中華料理屋」を得る。これを対話処理部103に通知す
る。The syntactic analysis unit 102 executes syntactic analysis. From 603 in Fig. 6, in order of acoustic score, "Chinese restaurant near destination", "Restaurant near destination", "Chinese restaurant near home", "Restaurant near home", and "Chinese restaurant" from 602 in Fig. 6 obtain. This is notified to the dialogue processing unit 103.

【００９０】対話処理部103では、まず構文解析が成功
したことを連結処理部101に通知する。連結された認識
結果を優先的に取り扱うため、図６の603から得られた
候補を優先する。このとき、図６の603から得られた文
章候補には連結フラグが付与されているため、直前の回
答対象候補である「目的地近くレストラン」との意味
合成を実行しないように動作し、そのまま「目的地近
く中華料理屋」が解釈結果であると決定し、これを回
答対象候補とする。回答のためのデータベース検索を実
行し、応答情報を生成する。The dialogue processing unit 103 first notifies the concatenation processing unit 101 that syntax analysis succeeded. Because concatenated recognition results are handled preferentially, the candidates obtained from 603 in FIG. 6 take priority. Since the sentence candidates obtained from 603 in FIG. 6 carry the concatenation flag, the unit does not perform semantic synthesis with the immediately preceding answer target candidate "restaurant near destination"; it takes "Chinese restaurant near destination" as the interpretation result as is, and makes it the answer target candidate. A database search for the answer is executed and response information is generated.

【００９１】ここでこの検索結果が少ない数で収まった
とする。このときに解釈終了フラグを連結処理部101に
通知する。その後、応答生成部104で「目的地近く中
華料理屋」に関する応答文が利用者に提示される。Here, it is assumed that the number of the search results is small. At this time, the concatenation processing unit 101 is notified of the interpretation end flag. After that, the response generation unit 104 presents the user with a response sentence regarding “Chinese restaurant near destination”.

【００９２】なお、構文解析部102で図６の603の構文解
析に失敗した場合は、602の構文解析結果である「中華
料理屋」が対話処理部103に通知される。この候補は連
結フラグが付与されていないため、直前の回答対象候補
「目的地近くレストラン」と「中華料理屋」との意味
合成が実行される(図４)。最終的には「目的地近く中
華料理屋」が回答対象候補として選択されることにな
る。When the syntactic analysis unit 102 fails in the syntactic analysis of 603 in FIG. 6, the interactive processing unit 103 is notified of the syntactic analysis result of 602, “Chinese restaurant”. Since the concatenation flag is not added to this candidate, meaning synthesis of the immediately preceding answer target candidate “restaurant near destination” and “Chinese restaurant” is executed (FIG. 4). Eventually, “Chinese restaurant near destination” will be selected as the answer candidate.
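The two paths through [Dialogue 2], a flagged candidate used as is and an unflagged candidate merged with the previous answer target, can be sketched as a single selection function. The names `SentenceCandidate` and `choose_answer_target` and the slot names are hypothetical, and candidates are assumed to arrive already ordered by the parser (for example by acoustic score).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SentenceCandidate:
    slots: Dict[str, str]   # filled slots only; slot names are hypothetical
    concatenated: bool      # the concatenation flag


def choose_answer_target(candidates: List[SentenceCandidate],
                         previous: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Prefer flagged candidates; apply semantic synthesis only to unflagged ones."""
    if not candidates:
        raise ValueError("no sentence candidates to choose from")
    # sorted() is stable, so the parser's order is kept within each group.
    best = sorted(candidates, key=lambda c: not c.concatenated)[0]
    if best.concatenated or previous is None:
        return dict(best.slots)           # use the candidate as is
    return {**previous, **best.slots}     # overwrite previous slots with new values


previous_target = {"reference_point": "destination", "relation": "near", "class": "restaurant"}
flagged = SentenceCandidate(
    {"reference_point": "destination", "relation": "near", "class": "Chinese restaurant"}, True)
plain = SentenceCandidate({"class": "Chinese restaurant"}, False)

via_flag = choose_answer_target([plain, flagged], previous_target)
via_merge = choose_answer_target([plain], previous_target)
```

In this example both paths arrive at the same answer target, "Chinese restaurant near destination", as the text describes.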

【００９３】以上のように、連結したことを明示する連
結フラグを各文章候補に付与することによって、誤って
意味合成を実行しないように制御することが可能とな
る。［対話3］ System4:「近くのコンビニについてお尋ねですか？」 User4:「はい」システムがS4のような問い返し発話をする時には次の語
彙セットを「はい」／「いいえ」に限定するものとす
る。このときに対話処理部103は対話状態＝"更新"を連
結処理部101に通知する。As described above, attaching to each sentence candidate a concatenation flag that explicitly marks it as concatenated makes it possible to keep semantic synthesis from being executed by mistake. [Dialogue 3] System4: "Are you asking about a nearby convenience store?" User4: "Yes". When the system asks a question in return, such as S4, the next vocabulary set is restricted to "Yes" / "No". At that time the dialogue processing unit 103 notifies the concatenation processing unit 101 of dialogue state = "update".

【００９４】続いて「はい」が入力され音声認識結果が
連結処理部101に通知される。このとき連結処理部101の
判定情報の状態は、対話状態＝"更新"、解釈終了フラグ
＝"終了"、構文解析成功フラグ＝"成功"、条件不足フラ
グ＝"φ"、応答生成終了時刻＝T3となっている。ここで
連結判定を実行すると対話状態＝"更新"であるためにS5
01で連結なしと判定され、S507に進む。認識結果として
連結したものを作成せずに、新規入力の音声認識結果の
みを構文解析部102に通知し、システムは「はい」に対
応した処理を実行することが可能となる。Next, "Yes" is input and the voice recognition result is notified to the concatenation processing unit 101. At this point the judgment information held by the concatenation processing unit 101 is: dialogue state = "update", interpretation end flag = "end", syntax analysis success flag = "success", condition shortage flag = "φ", and response generation end time = T3. When the concatenation judgment is run, S501 gives "no concatenation" because dialogue state = "update", and processing proceeds to S507. No concatenated recognition result is created; only the voice recognition result of the new input is notified to the syntax analysis unit 102, and the system can execute the processing that corresponds to "Yes".

【００９５】以上のように、システムが決定した対話状
況などにより連結の判定をすることによって、意図しな
い認識結果の連結を実行しないように制御することが可
能となる。As described above, it is possible to perform control so that unintended connection of recognition results is not executed by determining the connection based on the dialog situation determined by the system.

【００９６】かくしてこのように構成された音声対話装
置及びその音声対話方法によれば、新規に文法やルール
を作成せずに発話の区切りのあいまいさを考慮して解釈
することができ、その解釈実行時に言い直し条件追加の
解釈のための意味的意味合成の実行可否判定ができ、対
話処理部のフィードバックを利用することで時間的な連
結条件をより適切に判定することが可能となる。Thus, with the voice dialogue apparatus and voice dialogue method configured in this way, utterances can be interpreted with the ambiguity of utterance boundaries taken into account without creating new grammar or rules; when that interpretation is executed, it can be decided whether to run the semantic synthesis used to interpret rephrasings and added conditions; and by using feedback from the dialogue processing unit, the temporal concatenation condition can be judged more appropriately.

【００９７】なお、上述の例では、ナビゲーションシス
テムのデータベース検索の形式で実現しているように記
述しているが、これらの音声対話方法については上述の
実現形態に限定されるものではない。以下のような実現
形態も本発明の趣旨の範囲内である。In the above example, it is described that the navigation system is realized in the form of database search. However, these voice interaction methods are not limited to the above-described modes of realization. The following implementations are also within the scope of the present invention.

【００９８】また、簡単の為に音声認識方式をキーワー
ドスポッティング方式とし、テンプレート形式の文法、
テンプレートとのマッチングによって構文解析を行うよ
うに記述しているが、例えば音声認識方式を連続単語認
識、文法を文脈自由文法などの書き換え規則などにし、
構文解析方式をチャートパーザやLRパーザなどの既存の
手法にしても良い。音声認識結果を構文解析できる組み
合わせであれば任意の組み合わせが可能である。For simplicity, the description uses keyword spotting as the speech recognition method, a template-style grammar, and syntax analysis by matching against templates. However, the speech recognition method may instead be continuous word recognition, the grammar may be rewriting rules such as a context-free grammar, and the syntax analysis method may be an existing technique such as a chart parser or an LR parser. Any combination is possible as long as it can parse the speech recognition result.

【００９９】また、簡単の為に連結処理部101で管理す
る「以前の認識結果」は直前のものだけであると記述し
たが、直前の複数回の認識結果を記憶しても良い。この
場合は記憶する認識結果を全て連結した形式で一つの結
果として保存する手法や、発話ごとの認識結果に時間情
報を付与して段階的に破棄する手法などが考えられる。Further, for the sake of simplicity, it has been described that the “previous recognition result” managed by the connection processing unit 101 is only the previous recognition result, but the recognition results of the previous plural times may be stored. In this case, a method of saving all the recognition results to be stored as one result in a concatenated form, a method of adding time information to the recognition result of each utterance and discarding the recognition results step by step, and the like can be considered.
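The second variation, attaching time information to each utterance's recognition result and discarding results in stages, might look like the sketch below. The class name `RecognitionHistory`, the 10-second retention period, and the word-list representation of a recognition result are all assumptions for illustration; results are assumed to be added in increasing time order.

```python
from collections import deque
from typing import Deque, List, Tuple

Recognition = List[str]  # assumption: one recognition result as a word list


class RecognitionHistory:
    """Keeps recent recognition results and discards those older than max_age seconds."""

    def __init__(self, max_age: float = 10.0) -> None:
        self.max_age = max_age
        self._items: Deque[Tuple[float, Recognition]] = deque()

    def add(self, timestamp: float, result: Recognition) -> None:
        """Store a result; timestamps are assumed to be non-decreasing."""
        self._items.append((timestamp, result))

    def concatenated(self, now: float) -> Recognition:
        """Drop expired results, then join the remaining ones in temporal order."""
        while self._items and now - self._items[0][0] > self.max_age:
            self._items.popleft()
        joined: Recognition = []
        for _, result in self._items:
            joined.extend(result)
        return joined


history = RecognitionHistory(max_age=10.0)
history.add(0.0, ["destination"])
history.add(4.0, ["near"])
history.add(8.0, ["restaurant"])
```

Older fragments drop out one at a time as the clock advances, so a long silence eventually leaves nothing to concatenate.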

【０１００】対話処理部103における意味合成実行条件
について、連結フラグがないものには必ず意味合成を実
行するように記述しているが、本発明の実現形態はこれ
に限定されるものではなく、例えば、対話処理部103で
検出される連結判定情報を利用して独自に意味合成実行
判定おこなうようにすることで実現することも可能であ
る。Regarding the semantic synthesis execution condition in the dialogue processing unit 103, it is described that the semantic synthesis is executed without fail in the case where there is no concatenation flag, but the embodiment of the present invention is not limited to this. For example, it is also possible to implement the meaning synthesis execution determination independently using the connection determination information detected by the dialogue processing unit 103.

[0101] For example, the dialogue processing unit 103 may also evaluate the time condition on the utterance interval, so that semantic synthesis is not performed when the utterance interval is large, or it may be controlled so that semantic synthesis is not performed when the dialogue state = "update". Alternatively, the concatenation processing unit 101 may make the same determination and notify the dialogue processing unit 103 of it. As for the format of the interpretation information and the method of semantic synthesis, any existing method can be applied as long as it combines the previous interpretation result with the values of the new input.
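The conditions in this variation can be sketched in Python as follows. This is only an illustration: the function names, the `max_gap` threshold and its value, the dictionary form of the interpretation information, and the state name "query" used in the usage note are assumptions, while the dialogue state "update" and the rule about the concatenation flag come from the description.

```python
def should_synthesize(has_concat_flag, utterance_gap, dialogue_state, max_gap=3.0):
    """Decide whether the dialogue processing unit performs semantic synthesis.

    max_gap (seconds) is a hypothetical threshold for a "large" utterance interval.
    """
    if has_concat_flag:
        # Results built from concatenated recognition results are not synthesized.
        return False
    if utterance_gap > max_gap:
        # A long pause suggests the new utterance starts a fresh request.
        return False
    if dialogue_state == "update":
        return False
    return True


def synthesize(previous, new):
    """Assumed slot-style synthesis: values from the new input override earlier ones."""
    merged = dict(previous)
    merged.update(new)
    return merged
```

For instance, `should_synthesize(False, 1.0, "query")` returns `True`, after which `synthesize` would merge the new values into the earlier interpretation; the same call with a gap of 10 seconds returns `False`.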

[0102] Regarding the candidate selection condition in the dialogue processing unit 103, the description above states that when candidates with and without a concatenation flag are mixed, candidates with the flag are given priority. However, the candidates may instead be compared by acoustic score alone and the significant candidate selected.
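The two selection policies can be contrasted in a short Python sketch. The `Candidate` type, the `prefer_concatenated` switch, and the convention that a higher acoustic score is better are assumptions for illustration; the specification does not define how a "significant" candidate is measured, so the sketch simply takes the highest score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    words: List[str]
    acoustic_score: float  # assumed convention: higher is better
    concatenated: bool     # True if the candidate carries a concatenation flag


def select_candidate(candidates, prefer_concatenated=True):
    """Pick one candidate under either selection policy."""
    if not candidates:
        raise ValueError("no candidates to select from")
    if prefer_concatenated:
        # Flagged candidates win first; the acoustic score breaks ties.
        return max(candidates, key=lambda c: (c.concatenated, c.acoustic_score))
    # Variation: compare by acoustic score alone.
    return max(candidates, key=lambda c: c.acoustic_score)
```

With one flagged candidate scoring 0.6 and one unflagged candidate scoring 0.9, the default policy returns the flagged candidate, while `prefer_concatenated=False` returns the unflagged one.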

[0103] As described above, the examples given above can be modified in various ways, and such modifications also fall within the scope of the embodiments of the present invention as long as they do not depart from its spirit.

[0104]

[Effects of the Invention] As described above, the voice interaction device, voice interaction method, and voice interaction processing program of the present invention make it possible to interpret utterances while taking the ambiguity of utterance boundaries into account without newly creating grammars or rules; or to determine, when performing that interpretation, whether to execute the semantic synthesis used to interpret restatements and added conditions; or to determine the temporal concatenation condition more appropriately by using feedback from the dialogue processing unit.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing a voice interaction device according to a first embodiment of the present invention.

FIG. 2 is a diagram showing concatenation of speech recognition results by the voice interaction device according to the first embodiment of the present invention.

FIG. 3 is a diagram showing an example of a grammar of the voice interaction device according to the first embodiment of the present invention.

FIG. 4 is a diagram showing an example of semantic synthesis in the voice interaction device according to the first embodiment of the present invention.

FIG. 5 is a flowchart showing the processing of the concatenation processing unit of the voice interaction device according to the first embodiment of the present invention.

FIG. 6 is a diagram showing another example of concatenation of speech recognition results by the voice interaction device according to the first embodiment of the present invention.

[Explanation of Reference Numerals]

100 ... Voice interaction device
101 ... Concatenation processing unit
102 ... Syntax analysis unit
103 ... Dialogue processing unit
104 ... Response processing unit
110 ... Speech analysis device

Continuation of front page

(51) Int.Cl.⁷, identification code, FI, theme code (reference): G10L 15/22 G10L 3/00 551Q 15/28 R

(72) Inventor: 下森大志 (Taishi Shimomori), c/o Toshiba Corporation Research and Development Center, 1 Komukai Toshiba-cho, Saiwai-ku, Kawasaki-shi, Kanagawa

(72) Inventor: 上原龍也 (Tatsuya Uehara), c/o Toshiba Corporation Research and Development Center, 1 Komukai Toshiba-cho, Saiwai-ku, Kawasaki-shi, Kanagawa

F-terms (reference): 5B091 AA11 BA03 CA07 CA12 CB12 CB32 DA03 5D015 AA05 HH00 KK02 KK04 LL10 5D045 AB17

Claims

[Claims]

1. A voice interaction device comprising: a concatenation processing unit that receives the speech recognition result of an uttered voice, has a storage unit for storing the speech recognition result of the previous utterance, and, when the speech recognition result of a new utterance is input, concatenates the speech recognition result of the new utterance with the stored speech recognition result while maintaining their temporal order; a syntax analysis unit that receives the concatenated speech recognition result from the concatenation processing unit, refers to a grammar defined for the speech recognition result, and determines consistency with the defined grammar; and a dialogue processing unit that, based on the analysis result of the syntax analysis unit, generates response information corresponding to the speech recognition result when the speech recognition result matches the grammar, and, when the speech recognition result does not match, supplies the concatenated speech recognition result to the concatenation processing unit so that it is newly held in the storage unit of the concatenation processing unit.

2. The voice interaction device according to claim 1, wherein the dialogue processing unit has a function of semantically synthesizing dialogue history information used for the responses so far with new response information derived from the syntax analysis result for a new utterance, and restricts this function so that it is not used for new response information generated from a concatenated speech analysis result.

3. The voice interaction device according to claim 1, wherein the dialogue processing unit has a function of notifying the concatenation processing unit of the current dialogue state, and the concatenation processing unit determines, based on the notified dialogue state, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

4. The voice interaction device according to claim 1, wherein the dialogue processing unit has a function of notifying the concatenation processing unit of interpretation end information when generation of the response information is finished, and the concatenation processing unit determines, based on the presence or absence of the interpretation end information, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

5. The voice interaction device according to claim 1, wherein the dialogue processing unit has a function of notifying the concatenation processing unit of syntax analysis success information when it accepts a syntax analysis result, and the concatenation processing unit determines, based on the syntax analysis success information, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

6. The voice interaction device according to claim 1, wherein the dialogue processing unit has a function of determining, when generating the response information, the ambiguity of the input being processed and notifying the concatenation processing unit of ambiguity information on the input, and the concatenation processing unit determines, based on the ambiguity information on the input, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

7. The voice interaction device according to claim 1, wherein the concatenation processing unit determines, based on the difference between the response generation end time and the input time of the next utterance, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

8. A voice interaction method comprising: a concatenation processing step of receiving the speech recognition result of an uttered voice, storing the speech recognition result of the previous utterance as a concatenation processing candidate, and, when the speech recognition result of a new utterance is input, concatenating the speech recognition result of the new utterance with the stored speech recognition result while maintaining their temporal order; a syntax analysis step of receiving the concatenated speech recognition result, referring to a grammar defined for the speech recognition result, and determining consistency with the defined grammar; and a dialogue processing step of, based on the analysis result of the syntax analysis step, generating response information corresponding to the speech recognition result when the speech recognition result matches the grammar, and newly storing the result as the concatenation processing candidate when the speech recognition result does not match.

9. The voice interaction method according to claim 8, wherein the dialogue processing step includes a step of semantically synthesizing dialogue history information used for the responses so far with new response information derived from the syntax analysis result for a new utterance, and restricts this step so that it is not used for new response information generated from a concatenated speech analysis result.

10. The voice interaction method according to claim 8, wherein the dialogue processing step includes a step of notifying the concatenation processing step of the current dialogue state, and the concatenation processing step includes a step of determining, based on the notified dialogue state, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

11. The voice interaction method according to claim 8, wherein the dialogue processing step includes a step of notifying the concatenation processing step of interpretation end information when generation of the response information is finished, and the concatenation processing step includes a step of determining, based on the presence or absence of the interpretation end information, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

12. The voice interaction method according to claim 8, wherein the dialogue processing step includes a step of notifying the concatenation processing step of syntax analysis success information when a syntax analysis result is accepted, and the concatenation processing step includes a step of determining, based on the syntax analysis success information, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

13. The voice interaction method according to claim 8, wherein the dialogue processing step includes a step of determining, when generating the response information, the ambiguity of the input being processed and notifying the concatenation processing step of ambiguity information on the input, and the concatenation processing step includes a step of determining, based on the ambiguity information on the input, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

14. The voice interaction method according to claim 8, wherein the concatenation processing step includes a step of determining, based on the difference between the response generation end time and the input time of the next utterance, whether to concatenate the speech recognition result of the new utterance with the stored speech recognition result.

15. A voice interaction processing program comprising: concatenation processing of receiving the speech recognition result of an uttered voice, storing the speech recognition result of the previous utterance as a concatenation processing candidate, and, when the speech recognition result of a new utterance is input, concatenating the speech recognition result of the new utterance with the stored speech recognition result while maintaining their temporal order; syntax analysis processing of receiving the concatenated speech recognition result, referring to a grammar defined for the speech recognition result, and determining consistency with the defined grammar; and dialogue processing of, based on the analysis result of the syntax analysis processing, generating response information corresponding to the speech recognition result when the speech recognition result matches the grammar, and newly storing the result as the concatenation processing candidate when the speech recognition result does not match.