JP2005196134A - System, method, and program for voice interaction

Info

Publication number: JP2005196134A
Application number: JP2004319327A
Authority: JP
Inventors: Ryoko Tokuhisa; Ryuta Terajima; Toshihiro Wakita
Original assignee: Toyota Central R&D Labs Inc
Current assignee: Toyota Central R&D Labs Inc
Priority date: 2003-12-12
Filing date: 2004-11-02
Publication date: 2005-07-21
Anticipated expiration: 2024-11-02
Also published as: JP4729902B2

Abstract

PROBLEM TO BE SOLVED: To advance a dialogue with the user naturally while making the system's internal processing more appropriate. SOLUTION: The system includes a storage device 26 that stores a response unit model obtained by statistically modeling the timing at which one speaker in a dialogue responds, and a storage device 28 that stores a semantic processing unit model obtained by statistically modeling units of semantic processing. A speech recognition unit 20 recognizes speech uttered by the user. Image information of the speaking user, acoustic features of the uttered speech, and linguistic features of the uttered speech are extracted. A processing unit determination unit 16 determines the semantic processing timing and the response timing based on the image information, the acoustic features, the semantic processing unit model, the response unit model, the speech recognition result, and the linguistic features. When a point is determined to be both a semantic processing timing and a response timing, the system responds by voice, reflecting the content obtained by semantic processing. COPYRIGHT: (C)2005, JPO&NCIPI

Description

The present invention relates to a voice dialogue system, a voice dialogue method, and a voice dialogue program, and in particular to a voice dialogue system, method, and program that can advance a dialogue with a user naturally while making the internal processing in the system more appropriate.

A spoken language processing unit conversion apparatus is conventionally known that recognizes input speech uttered freely in arbitrary input units and converts the character string obtained by the speech recognition into language processing units, each being a sentence or a unit equivalent to a sentence, for output (see, for example, Patent Document 1). This apparatus takes as input a character string that speech recognition has divided into morphemes. Using a statistical model trained in advance on the likelihood of a language processing unit, together with empirical rules built from knowledge extracted in advance, it outputs the string with a comma at each pause point (silent interval) and a period at the end of each language processing unit.

This conventional technique is a system intended for translation: it treats a group of meaning appropriate for translation as a language processing unit (semantic processing unit) and determines only that semantic processing unit.
Patent Document 1: Japanese Patent Laid-Open No. 11-12609

However, a semantic processing unit for translation, search, or the like differs from a response unit, in which utterances are exchanged with the user. When the above conventional technique is applied to a voice dialogue system, as shown in the application result for the dialogue in FIG. 3(2) (negative example), the point after "a grown-up atmosphere <P> a quiet bar, please <P>" (where <P> denotes a pause) is judged to be the boundary S3 of a language processing unit (semantic processing unit), and the system is expected to respond "A quiet bar with a grown-up atmosphere, right?" at that point. In other words, when the conventional technique is applied to a voice dialogue system, the system cannot respond at an appropriate response timing.

The present invention has been made to solve the above conventional problem, and its object is to provide a voice dialogue system, method, and program that can advance a dialogue with a user naturally while making the internal processing in the system more appropriate.

To achieve the above object, a voice dialogue system of the present invention includes: speech recognition means for recognizing speech uttered by a user; determination means for determining a semantic processing timing and a response timing at which to respond to the user in the dialogue; processing means for performing semantic processing on a semantic processing unit at the semantic processing timing, based on the recognition result of the speech recognition means; and response means for responding by voice, reflecting the content obtained by the semantic processing, when a point is determined to be both a semantic processing timing and a response timing.

A semantic processing unit in the present invention is the utterance segment from one timing at which internal system processing is performed, such as semantic processing for understanding a search request for a database search, to the timing at which the next such processing is performed.

A response unit is the utterance segment from the timing at which one speaker in a dialogue responds to the other speaker to the timing of the next response.

The determination means determines the semantic processing timing and the response timing at which to respond to the user in the dialogue; the speech recognition means recognizes the speech uttered by the user; and the processing means performs semantic processing on a semantic processing unit at the semantic processing timing, based on the speech recognition result. The response means then responds by voice, reflecting the content obtained by the semantic processing, when a point is determined to be both a semantic processing timing and a response timing.

Because the present invention responds by voice, reflecting the content obtained by the semantic processing, when a point is determined to be both a semantic processing timing and a response timing, the dialogue with the user can advance naturally while internal processing such as translation or search is made more appropriate.

In the present invention, when a point is determined to be a response timing but not a semantic processing timing, the system can respond by voice without reflecting the content of semantic processing. Since that content is not reflected when only the response timing applies, the system can give a response suited to the response timing alone.

Also, in the present invention, when a point is determined to be a semantic processing timing but not a response timing, the voice response can be withheld. This prevents the system from responding by voice at an unnecessary timing.
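The three cases described above can be sketched as a small decision function. This is only an illustrative sketch: the names `Action` and `choose_action` are hypothetical, and the fourth case (neither timing applies, so the system keeps listening) is an assumption that the text does not spell out.

```python
from enum import Enum


class Action(Enum):
    # Both timings apply: respond by voice, reflecting the semantic processing result.
    RESPOND_WITH_CONTENT = 1
    # Response timing only: respond by voice without semantic processing content.
    RESPOND_WITHOUT_CONTENT = 2
    # Semantic processing timing only: process internally, withhold the voice response.
    PROCESS_SILENTLY = 3
    # Neither timing applies (assumed behavior): keep listening.
    WAIT = 4


def choose_action(is_semantic_timing: bool, is_response_timing: bool) -> Action:
    """Map the two timing judgments to the system's behavior."""
    if is_semantic_timing and is_response_timing:
        return Action.RESPOND_WITH_CONTENT
    if is_response_timing:
        return Action.RESPOND_WITHOUT_CONTENT
    if is_semantic_timing:
        return Action.PROCESS_SILENTLY
    return Action.WAIT
```

Keeping the two judgments as separate inputs mirrors the idea that semantic processing units and response units are determined independently.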

The present invention may further include semantic processing unit model storage means storing a semantic processing unit model obtained by statistically modeling the units in which semantic processing is performed, and the determination means may determine the semantic processing timing based on this model. The semantic processing unit model can be built by learning from dialogue or document data annotated with semantic processing units, based on at least one of the following obtained from the data: linguistic features such as part-of-speech information of the words uttered before and after a semantic processing timing, acoustic features such as the power and pitch of the uttered words, and image information such as the user's nods and line of sight.

To determine the semantic processing timing with the semantic processing unit model learned in this way, the determination can use, in addition to the model, the physical quantities used in learning, that is, at least one of the linguistic features, the acoustic features, and the image information.

It is also effective to further include response unit model storage means storing a response unit model obtained by statistically modeling the timing at which one speaker in a dialogue responds, and to have the determination means determine the response timing at which to respond to the user based on the response unit model and the speech recognition result. Like the semantic processing unit model, the response unit model can be built by learning from dialogue or document data annotated with response units, based on at least one of the following obtained from the data: linguistic features such as part-of-speech information of the words uttered before and after a response timing, acoustic features such as the power and pitch of the uttered words, and image information such as the user's nods and line of sight.

To determine the response timing with the learned response unit model, the determination can use, in addition to the model, the physical quantities used in learning, that is, at least one of the linguistic features, the acoustic features, and the image information.

Using the semantic processing unit model, the response unit model, or both thus allows the semantic processing timing, the response timing, or both to be determined more accurately.

The present invention may also include: extraction means for extracting image information of the user while speaking; extraction means for extracting acoustic features of the speech uttered by the user; semantic processing unit model storage means storing a semantic processing unit model obtained by statistically modeling the units in which semantic processing is performed; response unit model storage means storing a response unit model obtained by statistically modeling the timing at which one speaker in a dialogue responds; speech recognition means for recognizing the speech uttered by the user; extraction means for extracting linguistic features of the uttered speech based on the speech recognition result; determination means for determining the semantic processing timing and the response timing based on the image information, the acoustic features, the semantic processing unit model, the response unit model, the speech recognition result, and the linguistic features; processing means for performing semantic processing on a semantic processing unit at the semantic processing timing, based on the recognition result; and response means for responding by voice, reflecting the content obtained by the semantic processing, when a point is determined to be both a semantic processing timing and a response timing.

By modeling response units and semantic processing units separately and determining the semantic processing timing and the response timing separately in this way, the system can carry the dialogue with the user forward smoothly while executing semantic processing such as translation or search internally.

The present invention can also be realized as the following voice dialogue method and voice dialogue program.

That is, a voice dialogue method according to the present invention recognizes speech uttered by a user; determines, based on the speech recognition result, a semantic processing timing indicating the point at which the uttered speech forms a semantic processing unit and a response timing indicating the point at which to respond to the user; performs semantic processing on the semantic processing unit based on the speech recognition result when the semantic processing timing is determined; and responds, reflecting the content obtained by the semantic processing, when both the semantic processing timing and the response timing are determined.

A voice dialogue program according to the present invention causes a computer to execute processing that recognizes speech uttered by a user; determines, based on the speech recognition result, a semantic processing timing indicating the point at which the uttered speech forms a semantic processing unit and a response timing indicating the point at which to respond to the user; performs semantic processing on the semantic processing unit based on the speech recognition result when the semantic processing timing is determined; and responds, reflecting the content obtained by the semantic processing, when both the semantic processing timing and the response timing are determined.

A voice dialogue system according to the present invention includes: speech recognition means for recognizing speech uttered by a user; response timing determination means for determining, based on the recognition result of the speech recognition means, a response timing at which to respond to the user; and response means for responding to the semantic content based on the recognition result when the response timing determination means has determined a response timing, and for responding with content other than the semantic content when no response timing has been determined.

The response timing is the optimal timing for taking the dialogue initiative from the user and making the user the listener; it is the timing at which a response other than a backchannel should be given. The response timing is determined based on the recognition result of the speech recognition means. The response timing determination means may judge, for each silent interval in the user's utterance, whether it is a response timing.

When a response timing has been determined, the response means responds to the semantic content based on the recognition result of the speech recognition means. A response here is not a mere backchannel but, for example, a reply to the semantic content the user has uttered up to the point at which the response timing was determined. When no response timing has been determined, the response means responds with content other than the semantic content, for example a backchannel or a keyword from the user's utterance. The response means may output the response content verbally, or may control an interface robot so that it expresses the response content.

Accordingly, the voice dialogue system according to the present invention determines, based on the recognition result of the speech recognition means, the response timing at which to respond to the user, responds to the semantic content based on the recognition result when a response timing has been determined, and responds with content other than the semantic content when no response timing has been determined. The system can thus let the user finish saying what the user intends to convey, without interrupting the user's utterance.

The voice dialogue system may further include imaging means for imaging the user and image feature extraction means for extracting an image feature quantity from the image of the user captured by the imaging means. In that case, the response timing determination means additionally uses the extracted image feature quantity to determine the response timing at which to respond to the user.

Furthermore, in the above voice dialogue system, the speech recognition means may extract a plurality of feature quantities from the speech uttered by the user, and the response timing determination means may include: feature model storage means storing a plurality of feature models, each corresponding to one of the feature quantities and modeling the response timing; a plurality of first determination means, each judging whether a point is a response timing based on one extracted feature quantity and its feature model; and second determination means for judging comprehensively whether the point is a response timing, based on the results of the first determination means and the reliability of each first determination means.
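One possible reading of the second determination means is a reliability-weighted vote over the first determination results. The text does not state how the results and reliabilities are combined, so the weighted sum, the threshold, and the name `combine_decisions` below are all assumptions made for illustration.

```python
def combine_decisions(decisions, reliabilities, threshold=0.0):
    """Return True if the weighted vote says "response timing".

    decisions:     +1 (response timing) or -1 (not) from each first determiner
    reliabilities: non-negative weight for each first determiner (assumed form)
    threshold:     hypothetical decision threshold on the weighted sum
    """
    if len(decisions) != len(reliabilities):
        raise ValueError("one reliability is needed per decision")
    score = sum(d * r for d, r in zip(decisions, reliabilities))
    return score > threshold


# Example: three per-feature judgments with different reliabilities.
# 0.9 - 0.5 + 0.2 = 0.6, which is above the threshold of 0.0.
is_response_timing = combine_decisions([+1, -1, +1], [0.9, 0.5, 0.2])
```

With this scheme a single highly reliable determiner can outweigh several unreliable ones, which is one way to interpret judging "comprehensively".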

That is, a voice dialogue method according to the present invention recognizes speech uttered by a user; determines, based on the speech recognition result, a response timing at which to respond to the user; responds to the semantic content based on the speech recognition result when a response timing has been determined; and responds with content other than the semantic content when no response timing has been determined.

A voice dialogue program according to the present invention causes a computer to execute processing that recognizes speech uttered by a user; determines, based on the speech recognition result, a response timing at which to respond to the user; responds to the semantic content based on the speech recognition result when a response timing has been determined; and responds with content other than the semantic content when no response timing has been determined.

As described above, according to the present invention, the system responds by voice, reflecting the semantic processing result, only when a point is determined to be both a semantic processing unit and a response unit, so that it can give voice responses that do not feel unnatural.

Also according to the present invention, the response timing at which to respond to the user is determined based on the recognition result of the speech recognition means; the system responds to the semantic content based on the recognition result when a response timing has been determined, and responds with content other than the semantic content when no response timing has been determined. The user can thus finish saying what the user intends to convey, without being interrupted.

[First Embodiment]
A first embodiment of the present invention is described in detail below with reference to the drawings. As shown in FIG. 1, this embodiment is provided with a camera 10 that captures the face of the user who is the speaker, and a microphone 12 for inputting the speaker's voice.

The camera 10 is connected to a processing unit determination unit 16 via a line-of-sight extraction unit 14, which extracts the time-series change in the speaker's line of sight from the image signal output by the camera 10 and outputs the line-of-sight time-series information as image information.

The microphone 12 is connected to the processing unit determination unit 16 via a pitch extraction unit 18, which extracts the pitch of the utterance, an acoustic feature of the uttered words, from the audio data output by the microphone 12 and outputs pitch time-series information. The microphone 12 is also connected to the processing unit determination unit 16 via a speech recognition unit 20, which performs speech recognition on the audio data using a storage device 22 holding a recognition dictionary and outputs the recognition result as a character string, and a dependency analysis unit 24, which performs dependency analysis on the speech recognition result. The dependency analysis unit 24 outputs dependency information, which is a linguistic feature.

The processing unit determination unit 16 is connected to a storage device 26 storing the response unit model generated by learning and a storage device 28 storing the semantic processing unit model generated by learning. As described later, the processing unit determination unit 16 determines semantic processing units and response units based on the information obtained from the speaker's speech, such as the line-of-sight time-series information, the pitch time-series information, the speech recognition result, and the dependency information, and on the semantic processing unit model and the response unit model.

The processing unit determination unit 16 is connected to an operation control unit 30, which performs database search processing as the semantic processing and generates responses, based on the semantic processing unit and response unit determination results from the processing unit determination unit 16. Connected to the operation control unit 30 are a response generation unit 32, to which a storage device 34 storing a response corpus is connected; a search unit 36, to which a search database 38 is connected; and a speaker 40 and a display 42 for presenting search and response results.

Next, the method of building the response unit model and the semantic processing unit model is described. As shown in FIG. 2, a learner 44 and a learner 46 are connected to the storage device 26 and the storage device 28, respectively, and the results obtained by training the learners 44 and 46 are stored as models in the storage devices 26 and 28, respectively.

The modeling method is described using, as an example, modeling whether a given point is a response unit or a semantic processing unit for the utterance shown as the positive example in FIG. 3(1): "a grown-up atmosphere <P> a quiet bar, please <P> and also... <P> a place with stylish snacks <P> would be nice <P>".
(1) Step 1
First, in step 1, as shown in FIG. 4, for an utterance annotated with response unit information and semantic processing unit information, a window covering the N words immediately before and the n words immediately after the point to be modeled is set, and the word sequence, segmented into morphemes, is extracted. FIG. 4 shows the case of modeling whether the point immediately after "a grown-up atmosphere" is a response unit or a semantic processing unit. The window width can be chosen freely; in this embodiment the window holds the four words immediately before and the one word immediately after the point.
(2) Step 2
In step 2, the features in each range windowed in step 1 are converted into vector data. This embodiment uses linguistic features such as morpheme information and dependency information, acoustic features such as pitch time-series information, and image information such as line-of-sight time-series information, and assigns a unique vector value, serving as the feature quantity, to every one of these features.

When modeling the point immediately after "a grown-up atmosphere" as shown in FIG. 4, each feature inside the broken line is given a unique vector value, for example noun = -1, particle = -2, ..., modifies = 9.
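The windowing of step 1 and the value assignment of step 2 can be sketched as follows, using part-of-speech tags only. The tag sequence, the padding token, and the values for "adjective" and padding are hypothetical; only noun = -1 and particle = -2 come from the example above, and the 4-before, 1-after window follows this embodiment.

```python
PAD = "<pad>"  # hypothetical filler when the window runs past the utterance


def window_features(tags, boundary, n_before=4, n_after=1):
    """Return the tags in a window around a candidate point.

    `boundary` is the index of the first morpheme after the candidate point.
    """
    before = tags[max(0, boundary - n_before):boundary]
    before = [PAD] * (n_before - len(before)) + before
    after = tags[boundary:boundary + n_after]
    after = after + [PAD] * (n_after - len(after))
    return before + after


def encode(window, value_table):
    """Replace each feature in the window by its unique numeric value."""
    return [value_table[tag] for tag in window]


# Hypothetical tagging of: grown-up / of / atmosphere / of | quiet / bar / (object marker)
tags = ["noun", "particle", "noun", "particle", "adjective", "noun", "particle"]
value_table = {"noun": -1, "particle": -2, "adjective": -3, PAD: 0}

# Candidate point right after the fourth morpheme ("a grown-up atmosphere").
vec = encode(window_features(tags, boundary=4), value_table)
```

A real feature vector would also carry the dependency, pitch, and line-of-sight features named in step 2; they are left out here to keep the sketch short.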

Next, the linear combination of these vector values is assigned the vector data +1 if the given correct-answer data for the response unit or semantic processing unit marks a boundary, and -1 if it marks a continuation. Here, a boundary means that the point is a response unit or semantic processing unit, and a continuation means that it is not.

That is, when building the response unit model, the linear combination is given +1 if the point is a response unit and -1 if it is not. Likewise, when building the semantic processing unit model, it is given +1 if the point is a semantic processing unit and -1 if it is not.
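The labeling rule can be sketched as below: the same utterance yields two separate label sequences, one per model. The function name `boundary_labels` and the annotated positions are hypothetical and do not reproduce the annotation in the figures.

```python
def boundary_labels(n_morphemes, unit_end_indices):
    """Label each morpheme position +1 if a unit ends right after it, else -1."""
    ends = set(unit_end_indices)
    return [1 if i in ends else -1 for i in range(n_morphemes)]


# Hypothetical annotation of a 7-morpheme utterance:
# a response unit ends only after the last morpheme,
# while semantic processing units end after morphemes 3 and 6.
response_labels = boundary_labels(7, [6])
semantic_labels = boundary_labels(7, [3, 6])
```

Pairing each label with the window vector for the same position gives one training example for the corresponding model.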

Only the example of modeling the point immediately after "a grown-up atmosphere" is shown here, but vector data is created for every morpheme by shifting the window to the right in the figure, and this is used as the learning data for modeling.
(3) Step 3
In step 3, the learners 44 and 46 build models from all the learning data. That is, a response unit model and a semantic processing unit model are each built from the learning data created in step 2.

Because this is a binary (+ or -) classification problem, a support vector machine (SVM), for example, can be used as the learning and classification method. An SVM is a pattern classification method: a learner that can determine an appropriate decision surface in a given feature vector space. Given l items of learning data, with y_i the correct value and x_i the feature vector of the i-th item, the learner maximizes f(α) in equation (1) below under the constraints below.

Here, K is an arbitrary kernel function. The model is obtained as the nonzero α, the x and y corresponding to those α, and b given by equation (2) below.
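Equation (2) is likewise not reproduced in this text. In the standard formulation, the offset b is recovered from any support vector x_s (a sample with α_s ≠ 0) and its label y_s, as sketched below. The sign convention for b is an assumption and may differ from the one used in the patent.

```latex
b = y_s - \sum_{i=1}^{l} \alpha_i \, y_i \, K(x_i, x_s)
\qquad \text{for any } s \text{ with } \alpha_s \neq 0
```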

Details of SVMs are described in the literature, for example Eisaku Maeda, 「痛快！サポートベクトルマシン」 ("Exciting! Support Vector Machines"), Information Processing (Joho Shori), Vol. 42, No. 7, pp. 676-683, July 2001.

Next, the operation logic of the entire voice dialogue system, which uses the voice dialogue system of this embodiment to search the restaurant data stored in the database 38 for a desired restaurant, is described with reference to the flowchart shown in FIG. 5. This voice dialogue system can be mounted on a vehicle.

When processing of the operation logic shown in FIG. 5 starts, initialization is executed in step 100. In step 102, the speaker 40 presents the voice prompt "Please state your search conditions," and at the same time the initial screen shown in FIG. 6 is displayed on the display 42. The initial screen shows an operation instruction to the user ("Please speak your search conditions into the microphone."), a window displaying the user's search request, and a window displaying the search results. As processing proceeds, these windows show details of the operation, such as operation instructions to the user, the user's search request, and the database search results.

When voice data is input from the microphone 12, in step 104 the speech recognition unit 20 uses the recognition dictionary stored in the storage device 22 to recognize the speech uttered by the user in response to the voice prompt from the speaker. In addition, the pitch extractor 18 extracts the pitch of the utterance as a time series from the voice data, and the line-of-sight extractor 14 extracts the line of sight of the user who is speaking as a time series from the camera image data. The character string data obtained from the speech recognition result is input to the dependency analyzer 24, which obtains dependency information for each character.

The speech recognition result, the pitch time-series information, the line-of-sight time-series information, and the dependency information are input to the processing unit determination unit 16. In step 106, the processing unit determination unit 16 uses each model to determine the processing units, namely the semantic processing unit for understanding the search request (the search request unit) and the response unit. In other words, it determines the semantic processing timing and the response timing, as follows.
(1) Step 1
In step 1, a window is placed around the point to be judged, and the words before and after that point are extracted. The window used here extracts the same number of words as the window used for modeling.
(2) Step 2
In step 2, the value for the given linear vector is calculated by the SVM. That is, the data in the range obtained in step 1 is converted into feature vector data x in the same way as during training, and C is obtained from equation (3) below using the parameters obtained during training.
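Equation (3) is not reproduced in this text. In the standard SVM formulation, the decision value for a new feature vector x is computed from the trained parameters α, the corresponding training vectors x_i and labels y_i, and the offset b, as sketched below. Reading the patent's C as this decision value, and the sign convention for b, are assumptions.

```latex
C = \sum_{i \,:\, \alpha_i \neq 0} \alpha_i \, y_i \, K(x_i, x) + b
```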

(3) Step 3
In step 3, the calculated value of C is used to determine whether the point is a boundary or a continuation. That is, if the value of C calculated in step 2 is positive, the point is judged to be a boundary of a response unit or semantic processing unit; if it is negative, the point is judged to be a continuation.
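Steps 2 and 3 of the determination can be sketched as follows. This is an illustrative sketch rather than the patent's implementation: the function names, the Gaussian kernel, and its `gamma` value are assumptions (the description only says K is an arbitrary kernel function), and the decision value is taken to be the standard SVM form. The description does not say how C = 0 is handled; this sketch treats it as a continuation.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; the choice of kernel is an assumption."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Compute C = sum_i alpha_i * y_i * K(x_i, x) + b over the support vectors."""
    total = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels))
    return total + b


def classify_point(x, support_vectors, alphas, labels, b):
    """Return 'boundary' if C is positive, otherwise 'continuation'."""
    c = decision_value(x, support_vectors, alphas, labels, b)
    return "boundary" if c > 0 else "continuation"
```

The same function would be called twice per candidate point, once with the parameters of the response unit model and once with those of the semantic processing unit model.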

In step 110, it is determined whether the point was judged to be both a semantic processing unit and a response unit, that is, whether it is both a semantic processing timing and a response timing. If so, in step 112 the search key obtained as a result of semantic processing is presented to the user by voice and image.

In the next step 114, the operation control unit 30 executes the database search and response generation in accordance with the result from the processing unit determination unit 16. The database search is performed on the restaurant data stored in the database 38, and the response is generated by the response generation unit 32 from the response corpus stored in advance in the storage device 34.

When the database search and response generation are complete, the search result and response are announced by voice from the speaker 40 and displayed as an image on the display 42.

Suppose the user says 「大人、の、雰囲気、の、バー、が、いい、かな、あと、〈ポーズ〉、できれば、駅の、近く、〈ポーズ〉、」 ("A bar with a grown-up atmosphere would be nice, I guess, and, <pause>, if possible, near the station, <pause>"), and step 110 judges this input to be both a semantic processing unit and a response unit. The system then announces from the speaker 40, "Searching with the conditions 'a bar with a grown-up atmosphere' and 'near the station if possible'," and, as shown in FIG. 7, displays the search keywords "a bar with a grown-up atmosphere" and "near the station if possible" in the unconfirmed field of the search request window on the display.

The database search result is announced from the speaker as "Two results were found: Bar A and Pub B," and, as shown in FIG. 8, is displayed in the display window in the form "Number of results: 2. Details: 1st, Bar A; 2nd, Pub B."

Therefore, when the point is judged to be both a semantic processing unit and a response unit, the system responds by voice, reflecting the result of the search request obtained through semantic processing.

On the other hand, if the determination in step 110 is negative, that is, the point is not both a semantic processing unit and a response unit, step 118 determines whether it is a semantic processing unit (search request unit). If it is, the search keyword is displayed on the display in step 122, the database is searched in step 124, and the search result is displayed on the display in step 126. In other words, when the point is judged to be a semantic processing unit but not a response unit, the search key and the search result are presented to the user by display only, without voice output, and the system waits for the next input. Specifically, suppose the user says 「大人、の、雰囲気、の、バー、が、いい、かな、あと、〈ポーズ〉、できれば、駅の、近く、が、いい、です。」 ("A bar with a grown-up atmosphere would be nice, I guess, and, <pause>, if possible, near the station would be good."), and the input is judged to be a semantic processing unit but not a response unit. The system then gives no voice response from the speaker, presents the search keywords by image only, and searches the database. The search result is displayed on the display as shown in FIG. 8.

If step 118 determines that the point is not a search request unit and step 120 determines that it is a response unit, that is, if the point is a response unit but not a semantic processing unit, an appropriate response is returned to the user by voice from the speaker in step 128, and the system waits for the next input. Specifically, if the user says 「大人、の、雰囲気、の、バー、が、いい、かな、あと、〈ポーズ〉、」 ("A bar with a grown-up atmosphere would be nice, I guess, and, <pause>") and the result is a response unit but not a semantic processing unit, the system gives a voice response equivalent to a backchannel, such as 「はい」 ("yes").

If step 120 determines that the point is not a response unit, that is, if it is neither a semantic processing unit nor a response unit, the system performs no search or response and waits for the next input. Specifically, if the user says 「大人、の、雰囲気、の、」 ("with a grown-up atmosphere") and the result is neither a semantic processing unit nor a response unit, processing returns to the speech recognition of step 104 and waits for the next input.
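The four cases handled in steps 110 to 128 can be summarized as a small dispatch function. This is a sketch for illustration only; the function name and the action labels are hypothetical and stand in for the operations of the operation control unit 30, the speaker 40, and the display 42.

```python
def decide_actions(is_semantic_unit, is_response_unit):
    """Map the two independent judgments to the system actions of steps 110-128."""
    if is_semantic_unit and is_response_unit:
        # Steps 112-114: present the search key by voice and image,
        # search the database, and report the result by voice and image.
        return ["speak_search_key", "show_search_key",
                "search_database", "speak_result", "show_result"]
    if is_semantic_unit:
        # Steps 122-126: display only, no voice output.
        return ["show_search_key", "search_database", "show_result"]
    if is_response_unit:
        # Step 128: voice backchannel only.
        return ["speak_backchannel"]
    # Neither unit: return to speech recognition and wait for the next input.
    return []
```

Keeping the two judgments as separate inputs mirrors the point of the embodiment: internal search processing can proceed silently while spoken responses are reserved for response timings.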

As described above, in this embodiment the system gives a voice response reflecting the search result only when the point is judged to be both a semantic processing unit and a response unit, so the voice responses do not feel unnatural to the user.

FIG. 3 compares an example dialogue in which this embodiment is used with one in which it is not. Utterances in brackets are the user's utterances, rounded balloons are system responses, and square balloons show the semantic-unit processing state. <P> in the dialogue denotes a silent section (pause).

First, when this embodiment is used, suppose the system judges U1 and U2 to be response units and S1 and S2 to be semantic processing units. At U1 and U2, which are judged to be response units only, the system returns a response such as a backchannel or a confirmation of the search content. At S1, which is judged to be a semantic processing unit only, the system proceeds with internal processing such as creating a search key for the database search. At S2, which is judged to be both a response unit and a semantic processing unit, the system responds with the database search result reflected.

Determining the units at which to respond in this way yields a very smooth exchange of utterances. Determining the semantic processing units yields a dialogue that properly reflects the user's requests. Furthermore, determining the two independently allows the system to execute search processing appropriately inside the system while maintaining a natural dialogue with the user.

In the example dialogue without this embodiment, by contrast, the response unit and the semantic processing unit are not determined separately; the unit at which semantic processing is performed is simply treated as the response unit. As a result, the system runs the search and reports the result before the user has uttered the request the user actually wanted to convey, "the snacks are stylish," so the exchange of utterances becomes unnatural. The search also fails to reflect the user's request properly.

As stated above, determining the response unit makes a more natural exchange of utterances possible, and determining the semantic processing unit makes it possible to run a search that reflects the user's request more accurately. Combining the two makes it possible to carry on a natural dialogue with the user while executing an appropriate search inside the system.

In this embodiment, the processing units for response and for search are each determined and used to control system operation, which allows the dialogue with the user to proceed naturally while enabling more appropriate internal processing.

The above describes an example in which the response timing is determined using the speech recognition result, the pitch time-series information, the line-of-sight time-series information, the dependency information, and the response unit model, but the response timing may instead be determined using any one of these. Similarly, the above describes an example in which the semantic processing timing is determined using the speech recognition result, the pitch time-series information, the line-of-sight time-series information, the dependency information, and the semantic processing unit model, but the semantic processing timing may instead be determined using any one of these. Time-series information on the power of the uttered words and time-series information on the user's nodding may additionally be used to determine the response timing or the semantic processing timing.

[Second Embodiment]
Next, a second embodiment of the present invention is described. Parts identical to those of the first embodiment are given the same reference numerals, and their detailed description is omitted.

FIG. 9 shows the configuration of the voice dialogue system according to the second embodiment. The voice dialogue system includes a microphone 12 that receives the voice of the user who is speaking; a speech recognition unit 20A that performs speech recognition on the voice data from the microphone 12 using a recognition dictionary and outputs the recognition result as a character string; a recognition dictionary storage device 22 that stores the recognition dictionary; an utterance generation unit 50 that generates the utterance to be output; a speaker 40 for audio output; and a display 42 for image output.

The speech recognition unit 20A has the functions of the pitch extractor 18, the speech recognition unit 20, and the dependency analyzer 24 of the first embodiment. That is, the speech recognition unit 20A extracts feature quantities such as morpheme information, dependency, and pitch from the voice data, generates a character string using the recognition dictionary, and supplies these to the utterance generation unit 50.

The utterance generation unit 50 includes a language processing unit 51 that performs language processing (for example, generating morpheme information) on the recognition result from the speech recognition unit 20A; an end utterance determination unit 52 that determines whether the utterance has ended; a response unit determination unit 53 that determines whether a point is a response unit; a response unit model storage device 26 that stores the response unit model; a response generation unit 54 that generates response content; and a backchannel generation unit 55 that generates backchannels.

The end utterance determination unit 52 stores an end utterance dictionary in advance and uses it to determine whether the character string processed by the language processing unit 51 is an end utterance.

FIG. 10 shows the end utterance dictionary stored in the end utterance determination unit 52. An end utterance is an utterance used to close a conversation. As shown in FIG. 10, examples include 「またね」 ("see you"), 「ばいばい」 ("bye-bye"), 「さようなら」 ("goodbye"), 「じゃあね」 ("see you later"), 「おやすみ」 ("good night"), and 「終了」 ("end"). The end utterance dictionary contains these various end utterances.
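A dictionary lookup of this kind can be sketched in a few lines. The entries below are the examples listed for FIG. 10; the function name and the exact-match rule after trimming whitespace are assumptions, since the description does not specify how the recognized string is matched against the dictionary.

```python
# Example entries taken from the end utterance dictionary of FIG. 10.
END_UTTERANCES = {"またね", "ばいばい", "さようなら", "じゃあね", "おやすみ", "終了"}


def is_end_utterance(text):
    """Return True if the recognized string matches a dictionary entry exactly."""
    return text.strip() in END_UTTERANCES
```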

The response unit determination unit 53 uses the response unit model stored in the response unit model storage device 26 to determine whether a silent section in the utterance is a response timing (response unit). A response timing is the best moment to take the dialogue initiative from the user and make the user the listener; it is the moment at which a response other than a backchannel should be given. A response unit is the unit between response timings and carries the minimum semantic information needed to judge whether the initiative may be taken from the user. The configuration of the response unit determination unit 53 is described in detail later.

The response unit model storage device 26 stores a response unit model representing response units. As shown in FIG. 4, the response unit model models the feature quantities of the N morphemes immediately before and the n morphemes immediately after a response timing. It is composed of linguistic features, such as part-of-speech information of the uttered words (in this embodiment, for example, a morpheme information unit model and a dependency unit model), and acoustic features, such as power and pitch (in this embodiment, for example, a pitch unit model).

The first embodiment uses N = 3 and n = 1, whereas the second embodiment uses N = 3 and n = 0. The second embodiment does not use line-of-sight features, but, as in the first embodiment, if a camera that captures the user's face is available, image features such as nodding and line of sight may be used in the response unit model.

The response generation unit 54 generates a response other than a backchannel when the response unit determination unit 53 determines that a silent section is a response timing. The response generation unit 54 also stores an end response dictionary and gives a predetermined response when the end utterance determination unit 52 detects an end utterance.

FIG. 11 shows the end response dictionary. The end response dictionary lists phrases spoken to the user when the dialogue ends, for example 「またね」 ("see you"), 「ばいばい」 ("bye-bye"), 「お話を聞かせてくれてありがとう。またね。」 ("Thank you for talking with me. See you."), and 「お話を聞かせてくれてありがとう。また話そうね。」 ("Thank you for talking with me. Let's talk again.").

The backchannel generation unit 55 generates a backchannel such as 「うん」 ("uh-huh") or 「はい」 ("yes") when the response unit determination unit 53 determines that a silent section is not a response timing. The responses and backchannels generated in this way are output as audio through the speaker 40 or as images through the display 42.

When voice is input, the voice dialogue system configured as described above executes the following processing.

FIG. 12 is a flowchart showing the voice dialogue processing routine according to the second embodiment. When a switch (not shown) is pressed, the voice dialogue system performs initialization (step 200) and waits until the user speaks. Here, the user is assumed to say 「すごく行列だったけど（無音区間）そんなに待たなくて（無音区間）よかったよ（無音区間）」 ("There was a really long line, but (silent section) I didn't have to wait that long, (silent section) so it was fine (silent section)").

The speech recognition unit 20A recognizes the user's utterance (step 202). The character string recognized by the speech recognition unit 20A is then processed by the language processing unit 51 of the utterance generation unit 50.

The end utterance determination unit 52 checks the character string obtained by speech recognition against the end utterance dictionary (step 204) and determines whether the character string is an end utterance (step 206). If it is an end utterance, processing proceeds to step 220; if not, processing proceeds to step 208.

The response unit determination unit 53 performs response unit determination (step 208) using the feature quantities extracted by the speech recognition unit 20A, the morpheme information obtained by the language processing unit 51, and the response unit models stored in the response unit model storage device 26, and determines whether the silent section is a response timing (step 210). When the user says 「すごく行列だったけど（無音区間）そんなに待たなくて（無音区間）よかったよ（無音区間）」 as above, the response unit determination unit 53 performs the following processing.

FIG. 13 illustrates an example of response unit determination: (A) shows the case where the first silent section is the determination point t1, (B) the case where the second silent section is the determination point t2, and (C) the case where the last silent section is the determination point t3.

The response unit determination unit 53 treats the three silent sections as determination points t1, t2, and t3 and determines whether each is a response unit. It starts with determination point t1, comparing an n-gram model of, for example, the three morphemes immediately before t1 with the response unit model to decide whether t1 is a response unit.
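Collecting the morphemes that precede each determination point can be sketched as below. This is an illustration under assumptions: the function name, the `<pause>` token, and the rule that silent sections themselves are skipped when counting back are hypothetical, and the morpheme segmentation in the usage note is only an example. The default of three preceding morphemes follows N = 3 in the description.

```python
def windows_before_pauses(tokens, n_before=3, pause="<pause>"):
    """For each pause, return the n_before morphemes immediately preceding it."""
    windows = []
    for i, token in enumerate(tokens):
        if token == pause:
            # Count back over spoken morphemes only, skipping earlier pauses.
            spoken = [t for t in tokens[:i] if t != pause]
            windows.append(spoken[-n_before:])
    return windows
```

For a segmentation such as すごく / 行列 / だっ / た / けど / <pause> / そんなに / 待た / なくて / <pause> / よかっ / た / よ / <pause>, the function returns one window per silent section, corresponding to determination points t1, t2, and t3.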

If the point is not a response unit, processing proceeds to step 216; if it is, processing proceeds to step 212. When processing returns to step 210, the response unit determination unit 53 next evaluates determination point t2, and on the following return to step 210 it finally evaluates determination point t3. In this embodiment, determination point t3 is assumed to be judged a response unit.

When a silent section is judged to be a response unit, the response generation unit 54 generates a response utterance according to the semantic content of the speech recognition result (step 212) and outputs it as audio through the speaker 40 (step 214). Examples of response utterances are 「いいな」 ("that's nice") and 「すごい」 ("wow") when the user says something positive, and 「残念だね」 ("that's too bad") and 「がんばって」 ("hang in there") when the user says something negative. The response content is not particularly limited as long as it is something other than a backchannel and reflects the semantic content of the speech recognition result. When step 214 ends, processing returns to step 202.

When a silent section is judged not to be a response unit, the backchannel generation unit 55 generates a backchannel utterance (step 216) and outputs it as audio through the speaker 40 (step 218). As a result, a backchannel such as 「はい」 ("yes") or 「うん」 ("uh-huh") is output at determination points t1 and t2, for example. Instead of a backchannel, the backchannel generation unit 55 may output a keyword contained in the user's utterance.

When step 218 ends, processing returns to step 202. In this way, the processing from step 202 to step 218 is repeated until the user produces an end utterance.

When the utterance is determined to be an end utterance (affirmative determination in step 206), the response generation unit 54 refers to the end response dictionary shown in FIG. 11, selects a phrase from it at random, and outputs the selected phrase (for example, 「ばいばい」) as audio through the speaker 40.
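The routine of FIG. 12 can be sketched as a simple loop over the user's segments, each ending in a silent section. This is an illustrative sketch, not the patent's implementation: the function names are hypothetical, the response unit judgment and the response text are supplied by the caller as stand-ins for units 53 and 54, and the backchannel is fixed to 「うん」 for brevity. The dictionary entries are the examples given for FIG. 10 and FIG. 11.

```python
import random

END_UTTERANCES = {"またね", "ばいばい", "さようなら", "じゃあね", "おやすみ", "終了"}
END_RESPONSES = ["またね", "ばいばい",
                 "お話を聞かせてくれてありがとう。またね。",
                 "お話を聞かせてくれてありがとう。また話そうね。"]


def respond(segment, is_response_unit, make_response):
    """Return (system_utterance, finished) for one user segment."""
    if segment in END_UTTERANCES:
        # Step 206 affirmative: pick an end response at random.
        return random.choice(END_RESPONSES), True
    if is_response_unit(segment):
        # Steps 212-214: a response reflecting the content of the utterance.
        return make_response(segment), False
    # Steps 216-218: a backchannel.
    return "うん", False


def run_dialogue(segments, is_response_unit, make_response):
    """Process segments in order until an end utterance is encountered."""
    log = []
    for segment in segments:
        reply, finished = respond(segment, is_response_unit, make_response)
        log.append((segment, reply))
        if finished:
            break
    return log
```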

Conventionally, when a user said to a system, for example, 「すごく行列だったけど（無音区間）そんなに待たなくて（無音区間）よかったよ（無音区間）」, the dialogue between the user and the system went as follows:
User: "There was a really long line, but (silent section)"
System: "Oh, that's awful."
User: "I didn't have to wait that long, (silent section)"
System: "Nice."
User: "so it was fine. (silent section)"
System: "Oh, that's good."
The result was a poorly paced dialogue.

In contrast, when the user says the same thing to the voice dialogue system according to this embodiment, the dialogue between the user and the system goes as follows:
User: "There was a really long line, but (silent section)"
System: "Uh-huh."
User: "I didn't have to wait that long, (silent section)"
System: "Uh-huh."
User: "so it was fine. (silent section)"
System: "Oh, that's good."
The result is a well-paced dialogue.

As described above, the voice dialogue system according to the second embodiment determines the response timing (response unit) from among the multiple silent sections contained in the user's utterance and responds at the response timing. This lets the user finish saying what the user intends to convey, without the utterance being interrupted.

The voice dialogue system also gives backchannels when the timing is not a response timing, so backchannels come at appropriate moments. This produces a well-paced exchange of utterances and a natural dialogue. In addition, even in mid-utterance, hearing a backchannel from the voice dialogue system leads the user to believe that the system understands the utterance, so the user can continue speaking with confidence.

[Third Embodiment]
Next, a third embodiment of the present invention is described. Parts identical to those of the embodiments described above are given the same reference numerals, and their detailed description is omitted.

FIG. 14 is a diagram showing the configuration of the voice dialogue system according to the third embodiment. The voice dialogue system includes a microphone 12 that inputs the voice of the user who is the speaker; a speech recognition unit 20B that performs speech recognition on the voice data from the microphone 12 using a recognition dictionary and an emotion word dictionary, and outputs the recognition result as a character string; a recognition dictionary storage device 22 that stores the recognition dictionary; an emotion word dictionary storage device 23 that stores the emotion word dictionary; an utterance generation unit 50A that generates the utterance to be output; a non-verbal response generation unit 60 that generates non-verbal responses; and an interface robot 70.

The speech recognition unit 20B can perform the functions described in the second embodiment and, in addition, can refer to the emotion word dictionary to determine what kind of emotion word a recognized character string is.

FIG. 15 is a diagram showing the emotion word dictionary stored in the emotion word dictionary storage device 23. The emotion word dictionary is used to determine whether a recognized character string is a positive, negative, or neutral emotion word. For example, "delicious", "bright", "authentic taste", "no wait", "fast turnover", and "good" are positive emotion words. "Crowded", "wait", and "line" are negative emotion words. Everything else is a neutral emotion word. These terms are examples, and the emotion word dictionary is not limited to the configuration shown in FIG. 14.
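The dictionary lookup described above can be sketched as a simple mapping with a neutral default. This is only an illustration: the English glosses stand in for the Japanese entries, and the names `EMOTION_WORDS` and `classify_emotion_word` are hypothetical, not taken from the patent.

```python
# Sketch of the emotion word dictionary of FIG. 15, using English glosses
# of the example entries. The names here are hypothetical.
EMOTION_WORDS = {
    "delicious": "positive",
    "bright": "positive",
    "authentic taste": "positive",
    "no wait": "positive",
    "fast turnover": "positive",
    "good": "positive",
    "crowded": "negative",
    "wait": "negative",
    "line": "negative",
}


def classify_emotion_word(word: str) -> str:
    """Return 'positive', 'negative', or 'neutral' for a recognized word."""
    # Any word that is not listed in the dictionary is treated as neutral.
    return EMOTION_WORDS.get(word, "neutral")
```

A word such as "table", which has no entry, falls through to the neutral class.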

The utterance generation unit 50A includes a language processing unit 51 that performs language processing on the speech recognition result from the speech recognition unit 20B; an end utterance determination unit 52 that determines whether the utterance has ended; a response unit determination unit 53 that determines whether a response unit has been reached; a response unit model storage device 26 that stores the response unit models; a response generation unit 54A that generates response content carrying emotion; a backchannel generation unit 55A that generates backchannels carrying emotion; and an emotion processing unit 56 that performs emotion processing.

The emotion processing unit 56 calculates one emotion expression per response unit using the emotion words identified by the speech recognition unit 20B. The rules for calculating the emotion expression are as follows.
(1) If one response unit contains multiple contradictory emotion words, the emotion word in the later clause takes priority.
(2) If emotion words within the same clause contradict each other, the emotion word of the predicate takes priority.
(3) If the positive/negative determination is still contradictory after applying (1) and (2), the result is neutral.
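The three rules can be sketched as follows. This is one possible reading, not the patent's implementation: it assumes a response unit is given as a list of clauses, each clause as a list of emotion words flagged as predicate or not, and it treats a clause that stays contradictory after rule (2) as the case covered by rule (3). All names are hypothetical.

```python
from dataclasses import dataclass

POSITIVE, NEGATIVE, NEUTRAL = "positive", "negative", "neutral"


@dataclass
class EmotionWord:
    polarity: str               # POSITIVE, NEGATIVE, or NEUTRAL
    is_predicate: bool = False  # True if the word is the clause's predicate


def clause_polarity(clause):
    """Polarity of one clause, or None if it remains contradictory."""
    polarities = {w.polarity for w in clause if w.polarity != NEUTRAL}
    if not polarities:
        return NEUTRAL
    if len(polarities) == 1:
        return next(iter(polarities))
    # Rule (2): within a clause, the predicate's emotion word takes priority.
    predicate = {w.polarity for w in clause
                 if w.is_predicate and w.polarity != NEUTRAL}
    if len(predicate) == 1:
        return next(iter(predicate))
    return None


def unit_emotion(clauses):
    """One emotion expression for a response unit (a list of clauses)."""
    # Rule (1): scan from the last clause backwards so later clauses win.
    for clause in reversed(clauses):
        polarity = clause_polarity(clause)
        if polarity is None:
            return NEUTRAL  # Rule (3): still contradictory
        if polarity != NEUTRAL:
            return polarity
    return NEUTRAL
```

For a unit whose clauses carry negative, positive, and positive emotion words in that order, `unit_emotion` returns the polarity of the last clause, which is positive.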

When the response unit determination unit 53 determines that a silent interval is the response timing, the response generation unit 54A refers to the emotion expression-response correspondence table described next and generates the response corresponding to the emotion expression calculated by the emotion processing unit 56.

FIG. 16 is a diagram showing the emotion expression-response correspondence table stored in the response generation unit 54A. In this table, the responses "Wow, amazing", "That's great", and "Lucky you" are associated with positive emotion. The responses "Ugh, that's the worst", "That's terrible", and "What? That's awful" are associated with negative emotion. The response "And then, and then?" is associated with neutral emotion.

When the response unit determination unit 53 determines that a silent interval is not the response timing, the backchannel generation unit 55A refers to the emotion expression-backchannel correspondence table described next and generates the backchannel corresponding to the emotion expression calculated by the emotion processing unit 56.

FIG. 17 is a diagram showing the emotion expression-backchannel correspondence table stored in the backchannel generation unit 55. In this table, the backchannels "Oh, oh" and "Uh-huh, uh-huh" are associated with positive emotion. The backchannels "Ugh", "Oh dear", and "Whoa" are associated with negative emotion. The backchannel "Uh-huh" is associated with neutral emotion.
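The two correspondence tables and the choice between them can be sketched as below. The table entries are English glosses of the examples in FIGS. 16 and 17. The text does not say how one phrase is picked when several are listed for an emotion, so the random choice here is an assumption, and the names are hypothetical.

```python
import random

# English glosses of the entries described for FIG. 16 (responses)
# and FIG. 17 (backchannels).
RESPONSES = {
    "positive": ["Wow, amazing", "That's great", "Lucky you"],
    "negative": ["Ugh, that's the worst", "That's terrible", "What? That's awful"],
    "neutral": ["And then, and then?"],
}
BACKCHANNELS = {
    "positive": ["Oh, oh", "Uh-huh, uh-huh"],
    "negative": ["Ugh", "Oh dear", "Whoa"],
    "neutral": ["Uh-huh"],
}


def generate_reply(emotion, is_response_timing, rng=random):
    """Pick a response at the response timing, otherwise a backchannel."""
    table = RESPONSES if is_response_timing else BACKCHANNELS
    # Assumption: choose uniformly at random among the listed phrases.
    return rng.choice(table[emotion])
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible when testing.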

The responses and backchannels generated in this way are output as speech through a speaker (not shown) in the interface robot 70.

The non-verbal response generation unit 60 controls the interface robot 70 so that it performs a response or backchannel corresponding to the emotion expression calculated by the emotion processing unit 56. Specifically, the non-verbal response generation unit 60 refers to the situation-action ID correspondence table to select the action ID corresponding to the emotion expression, and then refers to the action ID-action correspondence table and controls the interface robot 70 to perform the action indicated by that action ID.

FIG. 18 is a diagram showing the situation-action ID correspondence table stored in the non-verbal response generation unit 60. The table lists the action IDs corresponding to initialization, positive response, negative response, neutral response, positive backchannel, negative backchannel, neutral backchannel, and end response.

Except for initialization and end response, the "situation" in the situation-action ID correspondence table is determined by the results from the response unit determination unit 53 and the emotion processing unit 56. For example, a positive response is the situation in which the emotion expression is positive and the silent interval is determined to be the response timing. A negative backchannel is the situation in which the emotion expression is negative and the silent interval is determined not to be the response timing.

In the situation-action ID correspondence table, when multiple action IDs are joined by "or", one of them is selected at random. For example, at initialization the non-verbal response generation unit 60 randomly selects one of action IDs 1, 2, and 3. For a positive response, it randomly selects either the combination of action IDs 1 and 4 or the combination of action IDs 1 and 5.
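The "or" selection can be sketched as a table whose entries are lists of alternatives, each alternative being a tuple of action IDs performed together. Only the entries stated in the text are filled in (initialization, positive response, and the end response with action ID 10). The remaining rows of FIG. 18 are not reproduced here, and all names are hypothetical.

```python
import random

# Each situation maps to a list of alternatives joined by "or" in FIG. 18.
# Only the rows stated in the text are included; other rows are omitted.
SITUATION_TO_ACTIONS = {
    "initialization": [(1,), (2,), (3,)],
    "positive_response": [(1, 4), (1, 5)],
    "end_response": [(10,)],
}


def situation(emotion, is_response_timing):
    """Name the situation from the emotion and the timing determination."""
    kind = "response" if is_response_timing else "backchannel"
    return f"{emotion}_{kind}"


def select_action_ids(situation_name, rng=random):
    """Randomly pick one alternative (a tuple of action IDs)."""
    return rng.choice(SITUATION_TO_ACTIONS[situation_name])
```

For example, `select_action_ids(situation("positive", True))` yields either `(1, 4)` or `(1, 5)`.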

FIG. 19 is a diagram showing the action ID-action correspondence table stored in the non-verbal response generation unit 60. In this table, an action of the interface robot 70 is associated with each of action IDs 1 to 10.

FIG. 20 is a flowchart showing the voice dialogue processing routine according to the third embodiment. When a switch (not shown) is pressed, the voice dialogue system performs initialization (step 300) and waits until the user speaks. Here, assume that the user says, "There was a really long line, but (silent interval) I didn't have to wait that long, and (silent interval) that was good (silent interval)".

The speech recognition unit 20B recognizes the user's utterance (step 302). The character string recognized by the speech recognition unit 20B undergoes language processing in the language processing unit 51 of the utterance generation unit 50A.

The end utterance determination unit 52 performs end utterance determination on the language-processed character string using the end utterance dictionary (step 304) and determines whether the character string is an end utterance (step 306). If it is an end utterance, the process proceeds to step 314; otherwise, the process proceeds to step 308.

The response unit determination unit 53 determines the response unit using the feature values extracted by the speech recognition unit 20A, the morpheme information obtained by the language processing unit 51, and the response unit models stored in the response unit model storage device 26, and determines whether the silent interval is the response timing. The emotion processing unit 56 calculates the emotion expression for one response unit using the emotion words identified by the speech recognition unit 20B (step 308).

FIG. 21 is a diagram showing the response unit and emotion words obtained from the user's utterance. When the user makes the utterance above, the response unit determination unit 53 obtains "There was a really long line, but I didn't have to wait that long, and that was good" as the response unit. In this response unit, the emotion word for "line" is negative, the emotion word for "not wait" is positive, and the emotion word for "good" is positive.

The emotion processing unit 56 then calculates the emotion expression for this response unit according to the calculation rules described above. Here, rule (1) applies, and "positive" is calculated as the emotion expression of the response unit.

The non-verbal response generation unit 60 generates a non-verbal response corresponding to the emotion expression calculated by the emotion processing unit 56 (step 310). Specifically, the non-verbal response generation unit 60 determines the "situation" from the results of the response unit determination unit 53 and the emotion processing unit 56, refers to the situation-action ID correspondence table shown in FIG. 18, and selects the action ID corresponding to the current situation. At this point, the response generation unit 54A or the backchannel generation unit 55A may also generate a response or backchannel for a verbal reply.

For example, when the silent interval is determined to be the response timing, the response generation unit 54A may refer to the emotion expression-response correspondence table and generate the response corresponding to the emotion expression calculated by the emotion processing unit 56. When the silent interval is determined not to be the response timing, the backchannel generation unit 55A may refer to the emotion expression-backchannel correspondence table and generate the backchannel corresponding to that emotion expression.

The non-verbal response generation unit 60 controls the interface robot 70 to perform the action corresponding to the selected action ID, thereby causing the interface robot 70 to give a non-verbal response or non-verbal backchannel (step 312). The speaker (not shown) provided in the interface robot 70 may also output the speech of the response generated by the response generation unit 54A or of the backchannel generated by the backchannel generation unit 55A. The process then returns from step 312 to step 302.

On the other hand, if the determination in step 306 is affirmative, the non-verbal response generation unit 60 selects action ID 10 and controls the interface robot 70 to bow, thereby giving the end response.
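The control flow of FIG. 20 can be sketched as a loop over recognized speech segments. This is a simplified illustration under several assumptions: each segment ends at a silent interval, the three determinations are supplied as plain functions, and the emotion is computed per segment rather than per accumulated response unit. The function and parameter names are hypothetical.

```python
def run_dialogue(segments, is_end_utterance, is_response_timing, emotion_of):
    """Return the sequence of situations the system would act on."""
    events = ["initialization"]              # step 300
    for text in segments:                    # step 302: recognized utterance
        if is_end_utterance(text):           # steps 304 and 306
            events.append("end_response")    # end response (bow, action ID 10)
            break
        # Step 308: timing determination and emotion expression.
        kind = "response" if is_response_timing(text) else "backchannel"
        # Steps 310 and 312: select and perform the matching action,
        # then return to step 302 for the next segment.
        events.append(f"{emotion_of(text)}_{kind}")
    return events
```

Each returned situation name would then be mapped to action IDs through the situation-action ID correspondence table.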

As described above, the voice dialogue system according to the third embodiment determines the response timing (response unit) from the multiple silent intervals contained in the user's utterance and has the interface robot 70 perform a response action at the response timing. This lets the user finish saying what they intend to convey without the system interrupting the utterance.

In addition, the voice dialogue system has the interface robot 70 perform a backchannel action when a silent interval is not the response timing, so that backchannels occur at appropriate moments. This produces an exchange of utterances with a good tempo and realizes a natural dialogue. Furthermore, even in the middle of an utterance, the user sees the backchannel action of the interface robot 70 and takes it to mean that the voice dialogue system understands the utterance, so the user can continue speaking with confidence.

In particular, by having the interface robot 70 respond and give backchannels using non-verbal information, the voice dialogue system can carry on the dialogue without interrupting the user's utterance.

In this embodiment, the voice dialogue system gives responses and backchannels using both verbal and non-verbal means, but it may use only one of the two.

[Configuration of Response Unit Determination Unit 53]
The detailed configuration of the response unit determination unit 53 will now be described. The response unit determination unit 53 determines whether a silent interval is the response timing based on morpheme information, dependency, and pitch.

FIG. 22 is a block diagram showing the configuration of the response unit determination unit 53. The response unit determination unit 53 includes determiners 71, 72, and 73, each of which determines from a different feature value whether the silent interval is the response timing, and a determiner 74 that makes an overall determination of whether it is the response timing from the results of determiners 71, 72, and 73.

The response unit model storage device 26 stores a morpheme information unit model, a dependency unit model, and a pitch unit model as the response unit models. These models are generated by a learner 80 that learns from training data (correct values of morpheme information, dependency, and pitch). The method of generating each model is as described in the first embodiment.

The determiner 71 determines whether the silent interval is the response timing based on the morpheme information and the morpheme model. It outputs the result α₁ = "1" when it is the response timing and α₁ = "−1" when it is not.

The determiner 72 determines whether the silent interval is the response timing based on the dependency model and the dependency information supplied as a feature value from the speech recognition unit 20B. It outputs the result α₂ = "1" when it is the response timing and α₂ = "−1" when it is not.

The determiner 73 determines whether the silent interval is the response timing based on the pitch model and the pitch information supplied as a feature value from the speech recognition unit 20B. It outputs the result α₃ = "1" when it is the response timing and α₃ = "−1" when it is not.

The determiner 74 makes an overall determination of whether the silent interval is the response timing based on the results of determiners 71, 72, and 73 and the reliabilities C₁, C₂, and C₃ of those results. Each reliability C_i (i = 1, 2, 3) is between 0 and 1 inclusive. Specifically, the determiner 74 evaluates the following expression (4).

The determiner 74 examines the sign of the result of expression (4). If the sign is positive, it determines that the silent interval is the response timing; if the sign is negative, it determines that it is not. As in the first embodiment, the determiner 74 may instead evaluate expression (3) in place of expression (4), determining that the silent interval is the response timing if the resulting value of C is positive and that it is not the response timing (a continuation) if the value is negative.

The description above uses the case where i runs from 1 to 3 as an example, but the configuration of the response unit determination unit 53 is not limited to this. If there are x feature values, i runs from 1 to x, and x determiners are provided, one for each feature value.
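Expression (4) itself is not reproduced in this text. One form consistent with the description (outputs α_i of +1 or −1, reliabilities C_i between 0 and 1, and a decision by sign) is a reliability-weighted sum of the individual results. The sketch below assumes that form and works for any number x of determiners. The function names are hypothetical, and a score of exactly zero, which the text does not address, is treated here as "not the response timing".

```python
def combined_score(alphas, confidences):
    """Assumed form of the combination: sum of C_i * alpha_i over all determiners."""
    if len(alphas) != len(confidences):
        raise ValueError("one confidence is required per determiner output")
    return sum(c * a for a, c in zip(alphas, confidences))


def is_response_timing(alphas, confidences):
    """Positive score: response timing. Otherwise: not the response timing."""
    # Assumption: a score of exactly zero counts as not the response timing.
    return combined_score(alphas, confidences) > 0
```

With outputs (1, −1, 1) and reliabilities (0.9, 0.8, 0.3), the two positive votes outweigh the negative one, so the silent interval is judged to be the response timing under this assumed form.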

[Model Learning Method]
In the first embodiment, the learner learns the correct values of feature value sequences such as morpheme information, dependency, and pitch, and as a result generates the corresponding models. In other words, the training data used by the learner consists of correct values of the feature value sequences.

The response unit determination unit 53, which acts as the determiner, determines whether it is the response timing based on feature values such as morpheme information, dependency, and pitch extracted by a feature extractor, together with the models described above. However, the feature values extracted by the feature extractor contain extraction errors, whereas the models were built without such errors. This mismatch can lower the accuracy of the response timing determination.

FIG. 23 is a diagram showing the model learning method in the second and third embodiments. The training data used by the learner 80 consists of the morpheme information, dependency, pitch, and so on extracted by the feature extractor 81, and therefore contains extraction errors. Because the learner 80 learns from this training data, it generates models that reflect the extraction errors.

The response unit determination unit 53, which acts as the determiner, then determines whether it is the response timing based on the feature value sequences such as morpheme information, dependency, and pitch extracted by the feature extractor, together with these models, and outputs the result. Because the models and the input features contain the same kind of extraction errors, an accurate determination result can be obtained.

The present invention is not limited to the embodiments described above and is also applicable to designs modified within the scope of the matters set forth in the claims.

For example, when a voice dialogue program is installed on a computer, that computer may execute the processing described in the first to third embodiments. The voice dialogue program may be recorded on a recording medium such as an optical disc or magnetic disk, or may be transmitted over a line such as a LAN or the Internet.

The second and third embodiments use feature values obtained from voice data together with the response unit models, but, as in the first embodiment, image feature values obtained from image data may additionally be used with the response unit models.

FIG. 1 is a block diagram of the voice dialogue system according to an embodiment of the present invention.
FIG. 2 is a block diagram for explaining how the response unit model and the semantic processing unit model are learned.
FIG. 3 is an explanatory diagram comparing the response timing of the system of this embodiment with conventional response timing.
FIG. 4 is an explanatory diagram for explaining the modeling of the response unit model and the semantic processing unit model.
FIG. 5 is a flowchart showing the overall processing of the voice dialogue system according to the embodiment of the present invention.
FIG. 6 is a plan view showing the presentation screen in the initial state.
FIG. 7 is a plan view showing the presentation screen for a search request unit that is not a response request unit.
FIG. 8 is a plan view showing the presentation screen for a search request unit that is also a response request unit.
FIG. 9 is a diagram showing the configuration of the voice dialogue system according to the second embodiment.
FIG. 10 is a diagram showing the end utterance dictionary stored in the end utterance determination unit.
FIG. 11 is a diagram showing the end response dictionary.
FIG. 12 is a flowchart showing the voice dialogue processing routine according to the second embodiment.
FIG. 13 is a diagram explaining an example of response unit determination, where (A) shows the case in which the first silent interval is determination point t1, (B) the case in which the second silent interval is determination point t2, and (C) the case in which the last silent interval is determination point t3.
FIG. 14 is a diagram showing the configuration of the voice dialogue system according to the third embodiment.
FIG. 15 is a diagram showing the emotion word dictionary stored in the emotion word dictionary storage device.
FIG. 16 is a diagram showing the emotion expression-response correspondence table stored in the response generation unit.
FIG. 17 is a diagram showing the emotion expression-backchannel correspondence table stored in the backchannel generation unit.
FIG. 18 is a diagram showing the situation-action ID correspondence table stored in the non-verbal response generation unit.
FIG. 19 is a diagram showing the action ID-action correspondence table stored in the non-verbal response generation unit.
FIG. 20 is a flowchart showing the voice dialogue processing routine according to the third embodiment.
FIG. 21 is a diagram showing the response unit and emotion words obtained from the user's utterance.
FIG. 22 is a block diagram showing the configuration of the response unit determination unit.
FIG. 23 is a diagram showing the model learning method in the second and third embodiments.

Explanation of symbols

10 Camera
12 Microphone
16 Processing unit determination unit
26 Storage device storing the response unit model
28 Storage device storing the semantic processing unit model
50, 50A Utterance generation unit
53 Response unit determination unit
54, 54A Response generation unit
55, 55A Backchannel generation unit
60 Non-verbal response generation unit
70 Interface robot

Claims

A voice dialogue system comprising:
speech recognition means for recognizing speech uttered by a user;
determination means for determining a semantic processing timing and a response timing at which to respond to the user during the dialogue;
processing means for performing semantic processing of a semantic processing unit at the semantic processing timing, based on the recognition result of the speech recognition means; and
response means for responding by voice, reflecting the content of the semantic processing, when it is determined to be both the semantic processing timing and the response timing.

The voice dialogue system according to claim 1, wherein the response means responds by voice without reflecting the content of the semantic processing when it is determined to be the response timing and not the semantic processing timing.

The voice dialogue system according to claim 1 or claim 2, wherein the response means stops responding by voice when it is determined to be the semantic processing timing and not the response timing.

The voice dialogue system according to any one of claims 1 to 3, further comprising semantic processing unit model storage means storing a semantic processing unit model that statistically models the units in which semantic processing is performed, wherein the determination means determines the semantic processing timing based on the semantic processing unit model.

The voice dialogue system according to any one of claims 1 to 4, further comprising response unit model storage means storing a response unit model that statistically models the response timing at which one speaker in a dialogue responds, wherein the determination means determines the response timing based on the response unit model.

A voice dialogue system comprising:
extraction means for extracting image information of a user who is speaking;
extraction means for extracting acoustic features of the speech uttered by the user;
semantic processing unit model storage means storing a semantic processing unit model that statistically models the units in which semantic processing is performed;
response unit model storage means storing a response unit model that statistically models the response timing at which one speaker in a dialogue responds;
speech recognition means for recognizing the speech uttered by the user;
extraction means for extracting linguistic features of the speech uttered by the user, based on the speech recognition result of the speech recognition means;
determination means for determining a semantic processing timing and a response timing based on the image information, the acoustic features, the semantic processing unit model, the response unit model, the speech recognition result of the speech recognition means, and the linguistic features;
processing means for performing semantic processing of a semantic processing unit at the semantic processing timing, based on the recognition result of the speech recognition means; and
response means for responding by voice, reflecting the content of the semantic processing, when it is determined to be both the semantic processing timing and the response timing.

7. A spoken dialogue method including:
recognizing speech uttered by a user;
determining, based on the speech recognition result, a semantic processing timing indicating the timing at which the user's speech has formed a semantic processing unit, and a response timing indicating the timing at which to respond to the user;
performing semantic processing of the semantic processing unit, based on the speech recognition result, when the semantic processing timing is determined; and
responding in a manner that reflects the contents of the semantic processing when both the semantic processing timing and the response timing are determined.

8. A spoken dialogue program that causes a computer to execute processing including:
recognizing speech uttered by a user;
determining, based on the speech recognition result, a semantic processing timing indicating the timing at which the user's speech has formed a semantic processing unit, and a response timing indicating the timing at which to respond to the user;
performing semantic processing of the semantic processing unit, based on the speech recognition result, when the semantic processing timing is determined; and
responding in a manner that reflects the contents of the semantic processing when both the semantic processing timing and the response timing are determined.

9. A spoken dialogue system including:
speech recognition means for recognizing speech uttered by a user;
response timing determination means for determining, based on the recognition result of the speech recognition means, a response timing at which to respond to the user; and
response means for responding to the semantic content based on the recognition result of the speech recognition means when the response timing is determined by the response timing determination means, and for responding with content other than the semantic content when the response timing is not determined.

10. The spoken dialogue system according to claim 9, wherein the response means responds, as the content other than the semantic content, with a backchannel (aizuchi) or a keyword from the user's utterance.

11. The spoken dialogue system according to claim 9 or claim 10, wherein the response timing determination means determines, for a silent section in the user's utterance, whether or not it is a response timing.
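
Claims 9 to 11 together describe what the system says at each silent section of the user's utterance. The Python sketch below is a hedged illustration, not the patented implementation: `SilentSection`, `respond`, the default backchannel string, and the rule of preferring the most recent keyword over a backchannel are all assumptions introduced here. Claim 10 only states that either a backchannel or a keyword may be used.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SilentSection:
    """One pause in the user's utterance (hypothetical container)."""
    is_response_timing: bool                 # result of the response timing determination
    semantic_content: str = ""               # meaning derived from the recognition result
    keywords: List[str] = field(default_factory=list)  # keywords heard so far


def respond(section: SilentSection, backchannel: str = "uh-huh") -> str:
    """Choose the system output for a silent section, following claims 9 and 10."""
    if section.is_response_timing:
        # Claim 9: respond to the semantic content.
        return f"[reply to] {section.semantic_content}"
    if section.keywords:
        # Claim 10: echo a keyword from the user's utterance.
        # Picking the most recent keyword is an assumption of this sketch.
        return section.keywords[-1]
    # Claim 10: otherwise fall back to a backchannel.
    return backchannel
```

A pause that is not a response timing therefore produces a short acknowledgement rather than a full reply.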

12. The spoken dialogue system according to any one of claims 9 to 11, wherein the response means outputs the response content in language, or controls an interface robot so as to express the response content.

13. The spoken dialogue system according to any one of claims 9 to 12, further including:
imaging means for imaging the user; and
image feature quantity extraction means for extracting an image feature quantity based on the image of the user captured by the imaging means,
wherein the response timing determination means determines the response timing at which to respond to the user by further using the image feature quantity extracted by the image feature quantity extraction means.

14. The spoken dialogue system according to any one of claims 9 to 12, wherein
the speech recognition means extracts a plurality of feature quantities from the speech uttered by the user, and
the response timing determination means includes: feature quantity model storage means storing a plurality of feature quantity models, each corresponding to one of the plurality of feature quantities and modeling the response timing; a plurality of first determination means, each determining whether it is a response timing based on a feature quantity extracted by the speech recognition means and the corresponding feature quantity model; and second determination means for comprehensively determining whether it is a response timing based on the determination results of the plurality of first determination means and the reliability of each first determination means.
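
Claim 14 does not specify how the second determination means combines the individual results with their reliabilities. One plausible reading, shown here purely as an assumption, is a reliability-weighted vote. The function name `combine_judgments` and the `threshold` parameter are hypothetical.

```python
from typing import Sequence


def combine_judgments(
    judgments: Sequence[bool],
    reliabilities: Sequence[float],
    threshold: float = 0.5,
) -> bool:
    """Reliability-weighted vote over the first determination means.

    judgments[i] is True when the i-th first determination means decided
    "response timing"; reliabilities[i] is that means' non-negative weight.
    The weighted-vote rule and the threshold are assumptions of this sketch.
    """
    if len(judgments) != len(reliabilities):
        raise ValueError("judgments and reliabilities must have the same length")
    total = sum(reliabilities)
    if total <= 0:
        raise ValueError("at least one reliability must be positive")
    # Fraction of the total reliability that voted for "response timing".
    score = sum(r for j, r in zip(judgments, reliabilities) if j) / total
    return score >= threshold
```

With this rule a single highly reliable feature can outvote several unreliable ones, which is one way to interpret "comprehensively determining" in the claim.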

15. A spoken dialogue method including:
recognizing speech uttered by a user;
determining, based on the speech recognition result, a response timing at which to respond to the user; and
responding to the semantic content based on the speech recognition result when the response timing is determined, and responding with content other than the semantic content when the response timing is not determined.

16. A spoken dialogue program that causes a computer to execute processing including:
recognizing speech uttered by a user;
determining, based on the speech recognition result, a response timing at which to respond to the user; and
responding to the semantic content based on the speech recognition result when the response timing is determined, and responding with content other than the semantic content when the response timing is not determined.