JP2017009825A

JP2017009825A - Conversation state analyzing device and conversation state analyzing method

Info

Publication number: JP2017009825A
Application number: JP2015125631A
Authority: JP
Inventors: Junichi Ito (伊藤 純一); Tokuji Ikeno (池野 篤司); Tateo Aihara (相原 健郎); Susumu Kono (河野 進)
Original assignee: Research Organization of Information and Systems; Toyota Motor Corp
Current assignee: Research Organization of Information and Systems; Toyota Motor Corp
Priority date: 2015-06-23
Filing date: 2015-06-23
Publication date: 2017-01-12

Abstract

PROBLEM TO BE SOLVED: To analyze the relationships among utterances by a plurality of speakers. SOLUTION: A conversation state analyzing device that analyzes the state of a conversation among a plurality of speakers comprises: acquisition means for acquiring conversational speech by the plurality of speakers; separation means for separating the conversational speech into a plurality of utterances, one for each speaker and each utterance section; recognition means for recognizing the content of each of the plurality of utterances using speech recognition processing; and analyzing means for analyzing the relationships among the utterances and the speakers based on the utterance content. The analyzing means identifies, as one utterance group, the utterances that are estimated from their content to belong to a conversation on the same theme. SELECTED DRAWING: Figure 9

Description

The present invention relates to a technique for analyzing the state of a conversation among a plurality of speakers.

In recent years, research and development has been under way on techniques by which a computer makes various interventions toward humans, such as proposals and assistance. For a computer to intervene appropriately while several people are talking, it must grasp the current situation from the conversational speech of the multiple speakers.

Patent Document 1 discloses a technique for analyzing and comparing the voice characteristics of a plurality of speakers. In Patent Document 1, classification information for a first person and a second person is obtained from the voice characteristics of each, and the compatibility of the two persons is analyzed based on the combination of the two pieces of classification information.

Patent Documents 2 and 3 disclose techniques for identifying, in a telephone conversation, a section in which a speaker expresses a specific emotion (dissatisfaction, satisfaction, apology, etc.). In Patent Documents 2 and 3, the emotion of each speaker is detected individually for each section, and the section expressing the specific emotion is identified according to the pattern of emotional change between the speakers.

Patent Document 1: Japanese Patent No. 3280825
Patent Document 2: International Publication No. 2014/069076
Patent Document 3: International Publication No. 2014/069120

The method of Patent Document 1 analyzes the compatibility of speakers from their voice characteristics, but it does not analyze the state of the conversation. The methods of Patent Documents 2 and 3 only identify sections in which a speaker shows a specific emotion, and do not identify how the individual utterances in the conversation relate to one another. None of these methods can analyze the relationships between individual utterances in a conversation or the utterances as a whole.

In view of the above problems, an object of the present invention is to provide a technique capable of analyzing the relationships among utterances by a plurality of speakers.

To achieve the above object, a first aspect of the present invention is a conversation state analysis device that analyzes the state of a conversation among a plurality of speakers, comprising: acquisition means for acquiring conversational speech by the plurality of speakers; separation means for separating the conversational speech into a plurality of utterances, one for each speaker and each utterance section; recognition means for recognizing the content of each of the plurality of utterances using speech recognition processing; and analysis means for analyzing the relationships between utterances based on the utterance content, the analysis means estimating the conversation theme of each utterance from its content and identifying the utterances estimated to share the same conversation theme as one utterance group.

In this way, the group of utterances belonging to the same conversation theme can be identified. Moreover, even when the speakers have split into separate groups that are talking about different themes, each utterance group can be identified appropriately.

Here, the content of an utterance means the text of the utterance. The analysis means therefore estimates, from the utterance texts, whether utterances share the same conversation theme. Note that an utterance group need not be identified from the utterance content alone; other information, such as the timing of the utterances, may also be used. For example, when the conversation theme cannot be estimated from the content of an utterance alone, the utterance may be given the same theme as the conversation to which the immediately preceding or following utterance belongs, or the theme of the conversation in which its speaker most recently spoke.
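The fallback described above can be sketched roughly as follows. This is only an illustration: the `Utterance` type, the function name, and the order of preference (the speaker's own most recent theme first, then the immediately preceding utterance) are assumptions, since the text leaves the choice open and also allows using the following utterance.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    speaker: str
    text: str
    theme: Optional[str] = None  # None when the content alone gives no theme


def fill_missing_themes(utterances: list[Utterance]) -> list[Optional[str]]:
    """Return one theme per utterance, borrowing from context when needed."""
    last_theme_by_speaker: dict[str, str] = {}
    previous_theme: Optional[str] = None
    resolved: list[Optional[str]] = []
    for utt in utterances:
        theme = utt.theme
        if theme is None:
            # Assumed preference: the speaker's own latest theme,
            # otherwise the theme of the immediately preceding utterance.
            theme = last_theme_by_speaker.get(utt.speaker, previous_theme)
        if theme is not None:
            last_theme_by_speaker[utt.speaker] = theme
            previous_theme = theme
        resolved.append(theme)
    return resolved
```

An utterance that still has no theme after the fallback (for example, the very first utterance) is left as `None` rather than guessed.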

In the present invention, the recognition means may recognize the utterance content by matching the utterance text obtained by speech recognition processing against a dictionary, and the analysis means may obtain the intention and topic of an utterance by matching the text of the utterance content recognized by the recognition means against a dictionary, and estimate the conversation theme of the utterance from that intention and topic. Examples of utterance intentions include raising a topic, making a proposal, agreeing or disagreeing with a proposal, and consolidating opinions. The topic of an utterance includes its genre and the place or thing being talked about. Examples of genres include food and drink, travel, music, and weather. Examples of places and things being talked about include place names, landmarks, and store or facility names. Taking the intention and topic of each utterance into account on the basis of its content (text) in this way allows the conversation theme to be estimated more appropriately.
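As a rough illustration of dictionary matching, the sketch below looks up a genre and an intention by simple substring search. The dictionary entries, the function name `analyze_text`, and the use of substring matching are all assumptions made for the example; the source does not specify how the dictionaries are structured or searched.

```python
from typing import Optional

# Hypothetical dictionaries: keyword -> genre, phrase -> intention.
GENRE_DICT = {
    "lunch": "food and drink",
    "restaurant": "food and drink",
    "concert": "music",
    "sunny": "weather",
}
INTENT_DICT = {
    "how about": "proposal",
    "let's": "proposal",
    "sounds good": "agreement",
    "i'd rather not": "disagreement",
}


def analyze_text(text: str) -> dict[str, Optional[str]]:
    """Return the first matching genre and intention, or None for each."""
    lowered = text.lower()
    genre = next((g for key, g in GENRE_DICT.items() if key in lowered), None)
    intent = next((i for key, i in INTENT_DICT.items() if key in lowered), None)
    return {"genre": genre, "intent": intent}
```

Substring search is crude (a keyword can match inside a longer word), so a real system would more likely match on tokens produced by morphological analysis.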

It is also preferable that the conversation state analysis device according to the present invention further comprises feature calculation means for calculating a speech feature for each of the plurality of utterances, and that the analysis means estimates, for each speaker, the speaker's emotion at the time of each utterance based on changes in the speech feature, and estimates the intention of the utterance taking that emotion into account as well. Considering the speaker's emotion allows the intention of an utterance to be estimated more accurately or in more detail. For example, even if the content of an utterance agrees with a proposal, the utterance can be estimated to express reluctant agreement when the emotion indicates dissatisfaction or irritation.
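The reluctant-agreement example can be written as a small rule. The label strings and the function name are illustrative assumptions; the source gives only this one example and does not define a rule set.

```python
# Emotions treated as negative in this sketch (taken from the example above).
NEGATIVE_EMOTIONS = {"dissatisfaction", "irritation"}


def refine_intent(intent: str, emotion: str) -> str:
    """Refine a text-based intention using the estimated emotion."""
    if intent == "agreement" and emotion in NEGATIVE_EMOTIONS:
        return "reluctant agreement"
    # Any other combination keeps the text-based intention unchanged.
    return intent
```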

It is also preferable in the present invention that the analysis means obtains the correspondence between utterances and the relationships between speakers within the utterance group, based on the intention of each utterance, the utterance features, the speaker's emotion at the time of utterance, and the like. The correspondence between utterances and the relationships between speakers indicate, for example, which utterance by which speaker a given utterance responds to, and which speakers a given speaker is connected to in the conversation and in what way. Because the intention of each utterance is obtained as described above, the correspondence between utterances and the conversational relationships between speakers can be obtained accurately. Note that these correspondences need not be determined from the utterance intentions alone; they may also be determined from other information such as the topic of the utterance, the timing of the utterance, and changes in the features of each utterance.
For example, when the correspondence cannot be determined with certainty from the intention of an utterance, or when the intention cannot be obtained, the utterance may be associated with the immediately preceding or following utterance in the same conversation. As another example, when an utterance in which one speaker makes a proposal is immediately followed by a short utterance from another speaker whose intention cannot be extracted, the features of that short utterance can be analyzed so that it is associated with the preceding utterance as a backchannel response (agreement) or a lament (disagreement). In this way, it is possible to obtain the correspondence between utterances (which utterance is connected to which, and in what relationship) and the conversational relationships between speakers (what kinds of utterances occur between given speakers and how often, and what hierarchy or degree of intimacy between them can be inferred).
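A minimal sketch of this linking step follows. The `Utterance` fields, the intention labels, and the specific rule (answers link to the most recent proposal by another speaker, everything else falls back to the immediately preceding utterance) are assumptions for illustration; the source also allows using topic, timing, and speech features, which are omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Utterance:
    uid: int
    speaker: str
    intent: Optional[str]  # "proposal", "agreement", "disagreement", or None


def link_responses(group: list[Utterance]) -> dict[int, Optional[int]]:
    """Map each utterance id to the id of the utterance it is linked to."""
    links: dict[int, Optional[int]] = {}
    last_proposal: Optional[Utterance] = None
    previous: Optional[Utterance] = None
    for utt in group:
        is_answer = utt.intent in ("agreement", "disagreement")
        if is_answer and last_proposal and last_proposal.speaker != utt.speaker:
            # An answer is linked to the latest proposal by someone else.
            links[utt.uid] = last_proposal.uid
        elif previous is not None:
            # Fallback (also used when the intention is unknown):
            # link to the immediately preceding utterance.
            links[utt.uid] = previous.uid
        else:
            links[utt.uid] = None  # first utterance in the group
        if utt.intent == "proposal":
            last_proposal = utt
        previous = utt
    return links
```

For a proposal by A, an unclear utterance by B, and an agreement by C, the agreement is linked back to A's proposal rather than to B's utterance.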

It is also preferable that the conversation state analysis device according to the present invention further comprises imaging means for capturing images of the speakers, and that the analysis means obtains the correspondence between utterances in the utterance group taking into account the orientation of each speaker's body, face, or line of sight in the captured images. In conversation, people turn their body, face, or gaze toward the person they are addressing, so obtaining these orientations from the images allows the correspondence between utterances to be obtained more accurately.

It is also preferable that the conversation state analysis device according to the present invention further comprises imaging means for capturing images of the speakers, and that the analysis means estimates a speaker's emotion from changes in facial features calculated from the speaker's face image in the captured images, and analyzes the relationships between utterances taking that emotion into account as well. Because a speaker's emotion also shows in facial expression, capturing the speaker's face image, estimating the emotion, and using the estimated emotion to analyze the relationships between utterances enables a more accurate analysis.

It is also preferable in the present invention that the analysis means obtains the relationships between speakers based on the relationships between utterances and on at least one of the content of the utterances, the utterance features, and the speakers' emotions at the time of utterance. The relationships between speakers include, for example, intimacy, hierarchy, and parent-child relationships. The analysis means can obtain such relationships from, for the utterances exchanged between the speakers (related utterances), the utterance content (such as the degree of politeness or familiarity indicated by the wording), the utterance features (number, duration, and overlap of utterances), and the speakers' emotions.

It is also preferable that the conversation state analysis device according to the present invention further comprises output means for outputting conversation state data, which is data on the utterance group. The conversation state data can include, for example, at least one of: the speaker of each utterance, the correspondence between utterances, the meaning and intention of each utterance, the speaker's emotion at the time of each utterance, the utterance frequency of each speaker within the utterance group, the speech features of each utterance, and the relationships between speakers.
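One possible in-memory shape for such conversation state data is sketched below. The class and field names are hypothetical; the source lists only the kinds of information that may be included, not a data format, and relationships between speakers are left out of this sketch.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class UtteranceRecord:
    utterance_id: int
    speaker: str
    text: str
    intent: Optional[str] = None       # e.g. "proposal"
    emotion: Optional[str] = None      # e.g. "satisfaction"
    responds_to: Optional[int] = None  # id of the utterance this one answers
    features: dict[str, float] = field(default_factory=dict)  # speech features


@dataclass
class ConversationStateData:
    theme: str
    utterances: list[UtteranceRecord]

    def utterance_counts(self) -> Counter:
        """Number of utterances per speaker within this utterance group."""
        return Counter(u.speaker for u in self.utterances)
```

Keeping `responds_to` on each record makes the utterance correspondence recoverable without a separate table, and per-speaker frequency can be derived on demand.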

A second aspect of the present invention is a support device that intervenes in and supports a conversation among a plurality of speakers. The support device according to this aspect comprises: the conversation state analysis device described above; group state determination means for determining, based on the conversation state data output from the conversation state analysis device, the state of the group consisting of the speakers participating in an utterance group; and intervention means for determining the content of an intervention in the conversation based on the state of the group and intervening in the conversation. The state of the group includes the type of the group, the relationships between the speakers in the group, and changes in the state of the group. Determining the group state from an accurate analysis of the multi-speaker conversation and of the relationships between speakers, and intervening according to the determined group state, enables more appropriate support. The intervention in the conversation may be made by any method, such as audio output, text output, or image output, and its form is not particularly limited.

The present invention can be understood as a conversation state analysis device or a support device comprising at least some of the above means. The present invention can also be understood as a conversation state analysis method or a support method that executes at least part of the processing performed by the above means. The present invention can further be understood as a computer program for causing a computer to execute these methods, or as a non-transitory computer-readable storage medium storing the computer program. The above means and processes can be combined with one another wherever possible to constitute the present invention.

According to the present invention, the relationships among utterances by a plurality of speakers can be analyzed.

FIG. 1 is a diagram showing a configuration example of the conversation intervention support system according to the first embodiment.
FIG. 2 is a diagram showing an example of the functional blocks of the conversation intervention support system according to the first embodiment.
FIG. 3 is a flowchart showing an example of the overall processing flow of the conversation intervention support method performed by the conversation intervention support system according to the first embodiment.
FIG. 4 is a flowchart showing an example of the flow of the conversation state analysis process (S303) in the conversation intervention support method.
FIG. 5 is a diagram showing an example of utterances separated by speaker and by utterance section.
FIG. 6 is a diagram showing an example of the genre, topic place, and intention extracted for each utterance.
FIG. 7 is a diagram showing an example of an utterance group sharing the same conversation theme.
FIG. 8 is a diagram showing an example of conversation state data.
FIG. 9 is a diagram explaining examples of (A) the correspondence between utterances, and the conversation theme, utterance intention, and speaker emotion of each utterance, included in the conversation state data, and (B) the occurrence of utterances between speakers in the conversation and the relationships between the speakers.
FIG. 10 is a flowchart showing an example of the flow of the group state determination process (S304) in the conversation intervention support method.
FIG. 11 is a diagram showing examples of (A) group types and (B) conditions for estimating the group type.
FIG. 12 is a flowchart showing an example of the flow of the intervention content determination process (S305) in the conversation intervention support method.
FIG. 13 is a diagram explaining examples of (A) intervention policies according to the group type and (B) intervention methods according to changes in the group state.

(First embodiment)
<System configuration>
The present embodiment is a conversation intervention support system that intervenes in a conversation among multiple people in a vehicle to provide information and support decision making. It is configured so that it can intervene appropriately even in conversations among several people, in particular three or more.

FIG. 1 is a diagram showing an example of the configuration of the conversation intervention support system according to the present embodiment. The occupants' conversational speech, acquired by the navigation device 111 via a microphone, is sent to the server device 120 via the communication device 114. The server device 120 analyzes the conversational speech transmitted from the vehicle 110 and intervenes as the situation requires, for example by providing appropriate information or supporting decision making. The server device 120 analyzes the conversational speech to decide the policy under which to intervene, and acquires information in line with that policy from the recommendation system 121, the store advertisement information DB 122, and the related information web site 130. The server device 120 transmits an intervention instruction to the vehicle 110, and the vehicle 110 plays back audio or displays text or images through the speaker or display of the navigation device 111. The vehicle 110 also includes a GPS device 112 that acquires the current position and a camera 113 that captures the faces and bodies of the occupants (speakers).

FIG. 2 is a functional block diagram of the conversation intervention support system according to the present embodiment. The conversation intervention support system includes a microphone (audio input unit) 201, a noise removal unit 202, a sound source separation unit (speaker separation unit) 203, a conversation state analysis unit 204, a speech recognition corpus/dictionary 205, a vocabulary and intention understanding corpus/dictionary 206, a group state determination unit 207, a group model definition storage unit 208, an intervention/arbitration unit 209, an intervention policy definition storage unit 210, a related information DB 211, an output control unit 212, a speaker (audio output unit) 213, and a display (image display unit) 214. The processing performed by each of these functional units is described in detail below together with the flowcharts.

In the present embodiment, of the functions shown in FIG. 2, audio input by the microphone 201 and output of the intervention content by the output control unit 212, the speaker 213, and the display 214 are performed in the vehicle 110. The other functions are performed by the server device 120. However, how these functions are divided between the vehicle 110 and the server device 120 is not particularly limited. For example, the vehicle 110 may perform noise removal and sound source separation, and may go on to perform speech recognition as well. Alternatively, the server device 120 may go only as far as determining the intervention policy, and the vehicle 110 may decide what information to present in accordance with the determined policy. Furthermore, all of the functions may be implemented within the vehicle 110.

The navigation device 111 and the server device 120 are each a computer that includes an arithmetic unit such as a CPU, storage devices such as RAM and ROM, an input device, an output device, a communication interface, and the like, and each of the above functions is realized by the arithmetic unit executing a program stored in the storage device. However, some or all of the above functions may be realized by dedicated hardware. The server device 120 need not be a single device; it may consist of a plurality of devices (computers) connected via communication lines, with the functions divided among them.

<Overall processing>
FIG. 3 is a flowchart showing the overall flow of the conversation intervention support method performed by the conversation intervention support system according to the present embodiment. The method as a whole is described below with reference to FIG. 3.

In step S301, the navigation device 111 acquires, via the microphone 201, the conversational speech of the occupants in the vehicle 110. In the present embodiment, subsequent processing of the acquired speech is performed in the server device 120, so the navigation device 111 transmits the acquired conversational speech to the server device 120 via the communication device 114. The number and arrangement of the microphones are not particularly limited, but it is preferable to use a plurality of microphones or microphone arrays.

In step S302, the server device 120 uses the noise removal unit 202 and the sound source separation unit 203 to extract the utterances of each speaker from the conversational speech. Here, an "utterance" means both the act of producing language as speech and the speech so produced. The processing in this step includes noise removal by the noise removal unit 202 and sound source separation (speaker separation) by the sound source separation unit 203. The noise removal unit 202 identifies and removes noise from, for example, the difference between the sound obtained from a microphone placed near a noise source and the sound obtained from the other microphones. The noise removal unit 202 also removes noise by using the correlation between the utterances input to the plurality of microphones. The sound source separation unit 203 identifies each speaker by detecting the speaker's direction and distance relative to the microphones from the time differences with which the sound arrives at the plurality of microphones.
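To illustrate how an arrival-time difference relates to direction, the sketch below uses the standard far-field model for a single pair of microphones, sin(theta) = c * delay / spacing. This is a simplification and an assumption, not the method of the source: the source does not state the localization algorithm, and estimating distance as well as direction would require more microphones than this two-microphone example.

```python
import math

SPEED_OF_SOUND_M_PER_S = 343.0  # approximate value for air at about 20 °C


def arrival_angle_deg(delay_s: float, mic_spacing_m: float) -> float:
    """Angle of the source from the broadside of a two-microphone pair.

    delay_s is the arrival-time difference between the two microphones;
    0 degrees means the source is equidistant from both microphones.
    """
    ratio = SPEED_OF_SOUND_M_PER_S * delay_s / mic_spacing_m
    # Clamp against measurement noise so that asin stays defined.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))
```

With a spacing of 0.2 m, a zero delay gives 0 degrees, and a delay equal to the full travel time across the pair (0.2 / 343 s) gives 90 degrees, i.e. a source on the line through the two microphones.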

In step S303, the conversation state analysis unit 204 analyzes the state of the conversation among the multiple people. To analyze a conversation among several people, in particular three or more, it is necessary to recognize whether the utterances of the respective speakers are related and, if so, how. The conversation state analysis unit 204 therefore extracts the utterances on the same conversation theme as one utterance group, determines the relationships between the utterances within that group, and analyzes the state of the conversation and the relationships between speakers with those relationships taken into account. The specific processing performed by the conversation state analysis unit 204 is described later.

In step S304, the group state determination unit 207 determines, based on the conversation state data from the conversation state analysis unit 204, what kind of group the speakers participating in the same conversation form and what state that group is in. Examples of groups include "a group with a flat relationship and high intimacy, whose members can state their opinions to one another without reserve", "a group with a hierarchy but high intimacy, in which a specific member leads the group's decision making", and "a group with a hierarchy and low intimacy, in which a specific member leads the group's decision making". Examples of changes in group state include a drop in the utterance frequency of a specific member, a drop in the utterance frequency of the group as a whole, a change in the emotion of a specific member, and a change in who leads the group. The specific processing performed by the group state determination unit 207 is described later.

In step S305, the intervention/arbitration unit 209 determines an intervention policy according to the group state determined by the group state determination unit 207, and determines the specific timing and content of the intervention according to the intervention policy and the content of the current conversation. For example, for a group with a flat relationship and high intimacy whose members can state their opinions without reserve, a conceivable policy is to present detailed reference information to all members roughly equally so as to encourage lively discussion. As another example, when the utterance frequency of a specific speaker or of the whole group has dropped, a conceivable policy is to steer the conversation toward a topic that will liven it up. Once the intervention policy has been determined, the intervention/arbitration unit 209 acquires the information to be presented from the recommendation system 121, the store advertisement information DB 122, or the related information web site 130 according to the current topic, and issues an intervention instruction. The specific processing performed by the intervention/arbitration unit 209 is described later.

In step S306, the output control unit 212 generates the synthesized speech or text to be output in accordance with the intervention instruction from the intervention/arbitration unit 209, and plays or displays it on the speaker 213 or the display 214.

In this way, the system can intervene in a conversation among a plurality of speakers in the vehicle 110. The processing shown in the flowchart of FIG. 3 is executed repeatedly. The conversation intervention support system acquires conversation speech continuously, keeps monitoring the conversation state, the relationships between speakers, and the group state, and intervenes when it determines that intervention is necessary.

<Conversation state analysis processing>
Next, the details of the conversation state analysis processing in step S303 will be described. FIG. 4 is a flowchart showing the flow of this processing. The steps of the flowchart in FIG. 4 need not be performed in the order shown, and some of them may be omitted.

In step S401, the conversation state analysis unit 204 detects utterance sections in the source-separated speech data and attaches a section ID and a time stamp to each utterance section. An utterance section is one continuous section in which speech is being uttered. An utterance section ends, for example, immediately before a silence of 1500 milliseconds or longer. Through this processing, the conversation speech can be separated into a plurality of pieces of speech data, one per speaker and per utterance section. Hereinafter, the speech in one utterance section is also referred to simply as an utterance. FIG. 5 shows the utterances separated in step S401.
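
The section detection in step S401 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: it assumes that voice activity has already been detected per speaker as (start, end) intervals in milliseconds, and the names `Section` and `detect_sections` are hypothetical. Only the 1500 ms silence threshold comes from the description.

```python
from dataclasses import dataclass

SILENCE_MS = 1500  # example threshold given in the description


@dataclass
class Section:
    section_id: int
    speaker: str
    start_ms: int  # used here as the time stamp of the section
    end_ms: int


def detect_sections(voiced, silence_ms=SILENCE_MS):
    """Merge voiced intervals into utterance sections.

    voiced: dict mapping a speaker label to a list of (start_ms, end_ms)
    voiced intervals (assumed input format). Two intervals of the same
    speaker belong to one section when the silence between them is
    shorter than silence_ms.
    """
    merged = []
    for speaker, intervals in voiced.items():
        current = None
        for start, end in sorted(intervals):
            if current is not None and start - current[1] < silence_ms:
                current[1] = max(current[1], end)  # extend the open section
            else:
                if current is not None:
                    merged.append((current[0], current[1], speaker))
                current = [start, end]
        if current is not None:
            merged.append((current[0], current[1], speaker))
    merged.sort()  # assign section IDs in order of start time
    return [Section(i + 1, spk, s, e) for i, (s, e, spk) in enumerate(merged)]


voiced = {"A": [(0, 800), (1200, 2000), (4000, 4500)], "B": [(2100, 3000)]}
sections = detect_sections(voiced)
```

With this input, speaker A's first two intervals are joined because the gap between them is 400 ms, while the 2000 ms gap before A's last interval starts a new section.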

In step S402, the conversation state analysis unit 204 calculates utterance feature amounts (speech feature amounts) for each utterance. Examples of utterance feature amounts are utterance volume, pitch, tone, duration, and utterance speed (average mora length). The utterance volume is the sound pressure level of the utterance. The tone is the highness or lowness of the sound, or the sound itself; the highness or lowness of a sound is specified by the number of vibrations of the sound wave per second (frequency). The pitch is the perceived height of the sound and is specified by the physical height of the sound (fundamental frequency). The average mora length is calculated as the length (time) of the utterance per mora, where a mora is a rhythmic beat unit. For utterance volume, pitch, tone, and utterance speed, it is advisable to obtain the average, maximum, minimum, range of variation, standard deviation, and so on within the utterance section. The present embodiment calculates these utterance feature amounts, but it need not calculate all of those listed here, and it may calculate utterance feature amounts other than those listed here.
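
As an illustration of the per-section statistics mentioned above, the following sketch summarizes a series of frame-level measurements and computes the average mora length. The function names, the input format (a list of per-frame values), and the use of the population standard deviation are assumptions made for this example; the description does not specify them.

```python
import statistics


def summarize(values):
    """Return the statistics listed in the description for one feature.

    values: per-frame measurements of one feature (for example, sound
    pressure level in dB) within a single utterance section.
    """
    return {
        "mean": statistics.fmean(values),
        "max": max(values),
        "min": min(values),
        "range": max(values) - min(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


def average_mora_length(duration_s, mora_count):
    """Utterance length (seconds) per mora; a smaller value means faster speech."""
    return duration_s / mora_count


volume = summarize([60.0, 62.0, 64.0])
speed = average_mora_length(1.8, 12)
```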

In step S403, the conversation state analysis unit 204 obtains the speaker's emotion for each utterance from changes in the utterance feature amounts. Examples of the emotions obtained include satisfaction, dissatisfaction, excitement, anger, sadness, expectation, relief, and anxiety. The emotion can be obtained, for example, from how the volume, pitch, and tone of the utterance differ from the speaker's normal state. Each speaker's normal utterance feature amounts may be derived from the utterance feature amounts obtained so far, or the information stored in the user information/usage history DB 123 may be used. The speaker's emotion need not be determined from the utterance (speech data) alone. It can also be obtained from the content (text) of the utterance. In addition, it can be obtained, for example, by calculating facial feature amounts from the speaker's face image captured by the camera 113 and examining changes in those facial feature amounts.

In step S404, the conversation state analysis unit 204 applies speech recognition processing using the speech recognition corpus/dictionary 205 to each utterance and converts the utterance content into text. Existing techniques may be used for the speech recognition processing. The utterance content (text) shown in FIG. 5 is obtained by the processing of step S404.

In step S405, the conversation state analysis unit 204 estimates the intention and topic of each utterance from its content (text) by referring to the vocabulary intention understanding corpus/dictionary 206. The intention of an utterance includes, for example, opening a topic, making a proposal, agreeing or disagreeing with a proposal, and consolidating opinions. The topic of an utterance includes, for example, its genre, place, and object. The genre includes, for example, food and drink, travel, music, and weather. The place under discussion includes, for example, place names, landmarks, store names, and facility names. The vocabulary intention understanding corpus/dictionary 206 contains dictionaries of the vocabulary used when opening a topic, proposing, asking a question, agreeing, disagreeing, or consolidating matters; of vocabulary relating to food and drink, travel, music, weather, and the like for identifying the genre of an utterance; and of vocabulary relating to place names, landmarks, store names, facility names, and the like for identifying the place under discussion. When estimating the utterance intention, it is also preferable to consider the speaker's emotion in addition to the text. For example, when the utterance content (text) indicates agreement with a proposal, taking the speaker's emotion into account makes it possible to estimate the intention in more detail, such as whether the speaker agrees gladly or reluctantly.

As a result of the processing in step S405, the speaker's intention, such as "what the speaker wants to do and how", and the genre under discussion can be estimated for each utterance. For example, for the text of utterance ID 2 in FIG. 5, 「北鎌倉のイタリアンはどぉー」 ("How about Italian in Kita-Kamakura?"), matching against the dictionary makes it possible to estimate that the genre is "food and drink" from the word 「イタリアン」 ("Italian"), that the place under discussion is "Kamakura" from the word 「北鎌倉」 ("Kita-Kamakura"), and that the intention of the utterance is "proposal" from the word 「どぉー」 ("how about").
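
The dictionary matching described for utterance ID 2 can be sketched as a simple substring lookup. This is only an illustration under stated assumptions: the three small tables below are hypothetical stand-ins for the vocabulary intention understanding corpus/dictionary 206, whose actual structure is not given, and `lookup` and `estimate` are names invented for this example.

```python
# Hypothetical dictionary entries; the keywords are taken from the
# examples in the description, the table layout is an assumption.
GENRE_WORDS = {"イタリアン": "food and drink", "魚": "food and drink"}
PLACE_WORDS = {"北鎌倉": "Kamakura", "海": "sea"}
INTENT_WORDS = {"どぉー": "proposal", "決めよう": "topic opening", "決まり": "consolidation"}


def lookup(text, table):
    """Return the label of the first keyword found in text, or None."""
    for word, label in table.items():
        if word in text:
            return label
    return None


def estimate(text):
    """Return the (genre, place, intention) tuple for one utterance text."""
    return (
        lookup(text, GENRE_WORDS),
        lookup(text, PLACE_WORDS),
        lookup(text, INTENT_WORDS),
    )


result = estimate("北鎌倉のイタリアンはどぉー")
```

For this text the sketch yields the genre "food and drink", the place "Kamakura", and the intention "proposal", which corresponds to the estimate described for utterance ID 2. An utterance containing none of the keywords yields None for every element and would need the contextual handling described later.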

FIG. 6 shows, for each utterance in FIG. 5, the extracted genre under discussion, place under discussion, and utterance intention. In the present embodiment, an "utterance n (S)" for which the intention and other attributes have been estimated is expressed, for example, by the following formula.

Utterance n (S) = (G_n, P_n, I_n)

Here, n is the utterance ID (1 to k), assigned in the order in which the utterances occur. S is the speaker (A, B, C, ...), and G_n, P_n, and I_n denote the estimated genre of the utterance, the place under discussion, and the intention of the utterance, respectively.

For example, when utterance 1 of speaker A is checked against the vocabulary intention understanding corpus/dictionary 206 and matches "G_1: food and drink", "P_1: Kamakura", and "I_1: topic opening", it is expressed as follows.

Utterance 1 (A) = ("food and drink", "Kamakura", "topic opening")

It is also preferable to obtain the genre under discussion, the place under discussion, and the utterance intention of each utterance by taking into account information other than the content (text) of the utterance. In particular, it is preferable to obtain the utterance intention by considering the speaker's emotion derived from the utterance feature amounts. Even when the utterance content expresses agreement with a proposal, the utterance feature amounts make it possible to distinguish whether the speaker agrees gladly or reluctantly. For some utterances, the above information cannot be extracted from the utterance content (text). In such a case, the conversation state analysis unit 204 may estimate the intention of the utterance by considering the extracted intentions or the content (text) of the utterances that precede and follow it in time.

In step S406, the conversation state analysis unit 204 extracts utterances estimated to share the same theme, taking into account the genre of each utterance obtained in step S405 and the chronological order of the utterances, and identifies the resulting utterance group as the group of utterances belonging to one series of conversation. This processing makes it possible to identify the utterances contained in one conversation from its start to its end.

In determining whether conversation themes are the same, the similarity of utterance genres and places under discussion is considered. For example, utterance ID 5 is determined to have the genre "food and drink" from the extracted word "fish" and the place under discussion "sea" from the extracted word "sea"; because the genre in each case is "food and drink", the utterance can be determined to share the same conversation theme. Some utterances contain a word indicating "topic opening" ("let's decide"), as in utterance ID 1, or a word indicating "consolidation" ("it's settled"), as in utterance ID 9, and such utterances can be estimated to be the utterance at the start or at the end of a conversation on the same theme. The temporal relationship between utterances may also be considered: even if the genre and the place under discussion are the same, utterances may be judged to belong to different conversation themes when the time interval between them is too long. Some utterances contain no vocabulary from which an intention or genre can be extracted. In such a case, it is advisable to consider the chronological flow of utterances and to regard utterances by the same speaker that occur between the start and end of a conversation as belonging to that conversation.

FIG. 7 shows the result of identifying series of utterance groups from the genre, place under discussion, and utterance intention of each utterance shown in FIG. 6. Here, three conversations are extracted. Conversation 1 concerns "food and drink (lunch)", "food and drink (cooking)", and "Kamakura", and contains utterance IDs 1, 2, 3, 5, 7, and 9. Conversation 2 concerns "weather" and "sports (athletic meet)", and contains utterance IDs 4, 6, and 8. Although "weather" and "sports (athletic meet)" are different genres, when utterances about "sports (athletic meet)" occur in succession immediately after an utterance about "weather", those utterances are judged to belong to the conversation about "weather". Conversation 3 concerns "music" and contains utterance IDs 10 and 11.
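
The grouping in step S406 can be illustrated with the following simplified sketch. It is an assumption-laden reduction of the description, not the embodiment itself: it uses only the genre, the speaker, and the time of each utterance, the 60-second gap limit is a hypothetical value, and the rule that joins "sports" utterances to a preceding "weather" conversation is deliberately left out. The name `group_utterances` and the tuple format are invented for this example.

```python
def group_utterances(utterances, max_gap_s=60.0):
    """Group utterances into conversations.

    utterances: list of (utterance_id, speaker, time_s, genre) in time
    order; genre is None when no genre could be extracted.
    An utterance joins a conversation when the genre matches, or, if it
    has no genre, when its speaker already takes part in that
    conversation. A conversation idle for more than max_gap_s is closed.
    """
    conversations = []
    for utt_id, speaker, time_s, genre in utterances:
        target = None
        for conv in conversations:
            if time_s - conv["last_time"] > max_gap_s:
                continue  # too long since the last utterance: different theme
            same_genre = genre is not None and conv["genre"] == genre
            same_speaker = genre is None and speaker in conv["speakers"]
            if same_genre or same_speaker:
                target = conv
                break
        if target is None:
            target = {"genre": genre, "ids": [], "speakers": set(), "last_time": time_s}
            conversations.append(target)
        target["ids"].append(utt_id)
        target["speakers"].add(speaker)
        target["last_time"] = time_s
    return conversations


utterances = [
    (1, "A", 0.0, "food and drink"),
    (2, "B", 3.0, "food and drink"),
    (3, "C", 6.0, "food and drink"),
    (4, "D", 7.0, "weather"),
    (5, "A", 10.0, None),            # no genre: follows speaker A's conversation
    (6, "E", 12.0, "weather"),
    (7, "B", 200.0, "food and drink"),  # same genre, but after a long pause
]
groups = [conv["ids"] for conv in group_utterances(utterances)]
```

Because matching is done per conversation rather than per adjacent utterance, interleaved utterances from two conversations that proceed at the same time are still separated correctly.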

The utterances shown in FIG. 5 are made by five speakers in total, A to E, but not all of them take part in the same conversation. Here, the three speakers A to C are holding conversation 1 about food and drink, and speakers D and E are holding conversation 2 about the weather. Because the conversation state analysis unit 204 in the present embodiment focuses on the genre of each utterance, the place (or thing) under discussion, and the utterance intention, it can correctly identify the utterance group belonging to each series of conversation even when several conversations proceed at the same time.

In the present embodiment, a series of "conversation m" identified in this way is expressed, for example, by the following formula.

Conversation m (S_A, S_B, S_C, ...)
= {Utterance 1 (S_A), Utterance 2 (S_B), Utterance 3 (S_C), ...}
= T_m {(G_A, P_A, I_A), (G_B, P_B, I_B), (G_C, P_C, I_C), ...}

Here, m is the conversation ID (1 to k), assigned in the order in which the conversations occur. S_A, S_B, S_C, ... are the speakers (A, B, C, ...), and T_m, G_n, P_n, and I_n denote the estimated theme of the conversation, the genre of the utterance, the place under discussion in the utterance, and the intention of the utterance, respectively.

For example, when the utterance group of speakers A, B, and C on the theme "food and drink" is identified as conversation 1, it is expressed as follows.

Conversation 1 (A, B, C)
= T_"meal" {("food and drink (lunch)", "Kamakura", "topic opening"),
("food and drink (cooking)", "Kamakura", "proposal"),
("food and drink (cooking)", "na", "rejection/proposal"), ...}

In step S407, the conversation state analysis unit 204 generates and outputs conversation state data that integrates the above analysis results. For example, the conversation state data contains the information shown in FIG. 8 for the utterances in the same conversation during the most recent predetermined period (for example, 3 minutes). A speaker with many utterances is a speaker for whom both the number of utterances and the utterance time within the period are at or above predetermined values (for example, one utterance and 10 seconds). A speaker with few utterances is a speaker for whom both the number of utterances and the utterance time within the period are below the predetermined values. The average utterance interval or overlap between speakers is, for each speaker pair, the duration of the silent period between their utterance sections or the duration for which their utterance sections overlap. Utterance volume, tone, pitch, and utterance speed are obtained per speaker and for all speakers. Each is expressed by one or more of the average, maximum, minimum, range of variation, and standard deviation within the period, and, particularly when a marked variation is measured, it is shown in association with information such as the corresponding utterance content. The conversation state data also contains, for each utterance within the period, the text of the utterance content, the conversation theme, the estimated speaker name, the utterance intention, the topic of the utterance (genre, place, object, and so on), and the speaker's emotion. The conversation state data further contains the correspondence between utterances and the relationships between speakers.
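
The classification of speakers with many or few utterances can be sketched as follows. The thresholds of one utterance and 10 seconds are the example values given above; the function name, the input format, and the third label "neither" for speakers who meet only one of the two conditions are assumptions added for this illustration.

```python
def classify_speakers(utterances, speakers, min_count=1, min_seconds=10.0):
    """Label each speaker as "many", "few", or "neither".

    utterances: list of (speaker, duration_s) within the analysis period.
    speakers: all speakers in the group, including silent ones.
    """
    count = {s: 0 for s in speakers}
    total = {s: 0.0 for s in speakers}
    for speaker, duration_s in utterances:
        count[speaker] += 1
        total[speaker] += duration_s

    labels = {}
    for s in speakers:
        if count[s] >= min_count and total[s] >= min_seconds:
            labels[s] = "many"     # both values at or above the thresholds
        elif count[s] < min_count and total[s] < min_seconds:
            labels[s] = "few"      # both values below the thresholds
        else:
            labels[s] = "neither"  # not covered by either definition
    return labels


labels = classify_speakers(
    [("A", 6.0), ("A", 5.5), ("B", 3.0)],
    speakers=["A", "B", "C"],
)
```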

FIG. 9(A) shows an example that displays the correspondence between utterances together with the conversation theme, utterance intention, and speaker emotion of each utterance. In FIG. 9(A), the utterance sections of speakers A to E are shown in time series, and the correspondence between utterances is indicated by arrows. For each utterance, the utterance intention and the speaker's emotion are shown where available. For example, it can be seen that speaker B makes a proposal (utterance ID 2) in response to speaker A opening a topic (utterance ID 1), and that speaker C, in response to both utterances, objects to the proposal and makes a counter-proposal (utterance ID 3). The correspondence between utterances need not be determined from the utterances (speech data) alone. For example, whether an utterance is directed at a specific member may be determined from the speaker's line of sight or the orientation of the face or body obtained from the camera 113, and the correspondence between utterances may be obtained from that determination.

FIG. 9(B) shows what kinds of utterances occur, and how often, in the conversation among speakers A to E, and how the hierarchical relationship and intimacy between the speakers are estimated. For the utterances between any two speakers, the intimacy and the type of relationship between them (flat or hierarchical) can be obtained from the utterance intentions, the utterance feature amounts (number of utterances, utterance time, overlap of utterances, tension level), and the wording (degree of politeness). Although not shown in FIG. 9(B), when there is a hierarchical relationship between speakers, it is also possible to determine which speaker is the superior and which is the subordinate.

The conversation state analysis unit 204 outputs the conversation state data described above to the group state determination unit 207. Using the conversation state data makes it possible to link the flow of the conversation with the changes in the feature amounts of each utterance, so that the state of the group holding the conversation can be estimated accurately.

<Group state determination processing>
Next, the details of the group state determination processing in step S304 of FIG. 3 will be described. FIG. 10 is a flowchart showing the flow of the group state determination processing.

In step S1001, the group state determination unit 207 acquires the conversation state data output by the conversation state analysis unit 204. Through the following processing based on this conversation state data, the group state determination unit 207 analyzes the group state, which includes the group type, the role of each member (relationships), and changes in the state of the group.

In step S1002, the group state determination unit 207 determines the connections between speakers in the conversation. The conversation state data contains the speaker of each utterance, the connections between utterances, and the utterance intentions (proposal, agreement, disagreement, and so on). From the conversation state data it is therefore possible to grasp the frequency of conversation between a speaker pair (for example, "speaker A and speaker B frequently talk to each other directly" or "there is no direct conversation between speaker A and speaker B") and how many proposals, agreements, and disagreements are exchanged between a speaker pair (for example, "speaker A has made X proposals, Y agreements, and Z disagreements to speaker B"). The group state determination unit 207 obtains this information for each speaker pair in the group.
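
The per-pair counts described for step S1002 can be sketched as follows, assuming that each utterance record carries the ID of the utterance it responds to. The record fields (`id`, `speaker`, `intent`, `reply_to`) and the function name are hypothetical; the description only states that the conversation state data contains speakers, connections between utterances, and intentions.

```python
from collections import Counter


def pair_intent_counts(utterances):
    """Count utterances per (from_speaker, to_speaker, intent).

    utterances: list of dicts with the keys "id", "speaker", "intent",
    and "reply_to" (ID of the utterance being answered, or None).
    """
    by_id = {u["id"]: u for u in utterances}
    counts = Counter()
    for u in utterances:
        target = by_id.get(u["reply_to"])
        if target is None or target["speaker"] == u["speaker"]:
            continue  # no addressee, or a follow-up to one's own utterance
        counts[(u["speaker"], target["speaker"], u["intent"])] += 1
    return counts


utterances = [
    {"id": 1, "speaker": "A", "intent": "topic opening", "reply_to": None},
    {"id": 2, "speaker": "B", "intent": "proposal", "reply_to": 1},
    {"id": 3, "speaker": "C", "intent": "disagreement", "reply_to": 2},
    {"id": 5, "speaker": "B", "intent": "proposal", "reply_to": 1},
]
counts = pair_intent_counts(utterances)
```

Summing the counts over all intentions for a given pair gives the frequency of direct conversation between the two speakers; a pair that never appears has had no direct exchange in the period.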

In step S1003, the group state determination unit 207 determines the state of the exchange of opinions among the members. This state includes information such as how lively the exchange of opinions within the group is, the ratio of agreement to disagreement with proposals, and whether there is a leader in decision-making. The liveliness of the exchange can be evaluated, for example, by the number of utterances, or the number of agreeing or disagreeing opinions, between a proposal and the final decision. Whether there is a leader in decision-making can be evaluated, for example, by whether a specific speaker's proposals meet few objections and mostly consent or agreement, and by whether a specific speaker's proposals or opinions are adopted as the final opinion at a high rate. Because the conversation state data contains the speaker of each utterance, the connections between utterances, the utterance intentions, the utterance content, and so on, the group state determination unit 207 can determine this state from the conversation state data.

In step S1004, the group state determination unit 207 estimates the group type (group model) based on the utterance feature amounts and the wording of the utterance content contained in the conversation state data, the connections between speakers obtained in step S1002, and the state of the exchange of opinions obtained in step S1003. The group types are defined in advance. Examples, as shown in FIG. 11(A), are group type A: "a group with a flat relationship and high intimacy, whose members can freely express opinions to one another"; group type B: "a group with a hierarchical relationship but high intimacy, in which a specific member leads the group's decision-making"; and group type C: "a group with a hierarchical relationship and low intimacy, in which a specific member leads the group's decision-making". Group type A assumes a group in which all members are connected on equal terms, such as close friends. A group of type A may or may not include a leader (a member with particular influence on decision-making). Group type B assumes a group with strong ties between members and a hierarchical relationship, such as a family. A group of type B has a leader (for example, a parent). Group type C assumes a group with relatively businesslike ties and a hierarchical relationship, such as a boss and subordinates at a workplace. A group of type C has a leader (the highest-ranking member). Only three types are given here as examples, but any number of group types may be defined.

The group model definition storage unit 208 stores the determination criteria for each group type. It stores a plurality of determination criteria based on utterance feature amounts, the wording of utterance content, the connections between speakers, the state of the exchange of opinions, and so on. FIG. 11(B) shows examples of determination criteria based on utterance feature amounts. Because group type A is "a group with a flat relationship and high intimacy, whose members can freely express opinions to one another", it often has characteristics such as "all speakers speak actively", "utterances tend to overlap", "the tone and pitch of each utterance vary widely", "the utterance volume varies widely", and "a certain number of objections are raised against proposals". The group model definition storage unit 208 therefore contains, as determination criteria for group type A based on utterance feature amounts, criteria such as "speakers who speak at least 3 times in 3 minutes, or for at least 20 seconds in total, account for at least 60% of all speakers", "utterances overlap at least 3 times in 3 minutes or for at least 5 seconds in total", and "the range of variation of each speaker's tone, pitch, or sound pressure level is at or above a predetermined threshold". The group state determination unit 207 evaluates how far the current group satisfies these determination criteria and obtains an evaluation value indicating the likelihood that the current group is of group type A. Evaluation values are obtained in the same way for the other group types B and C.
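
The evaluation against the group type A criteria can be sketched as follows. The three criteria and their numeric limits (3 utterances or 20 seconds, 60% of speakers, 3 overlaps or 5 seconds) are the examples quoted above. The scoring rule, which takes the fraction of criteria satisfied as the evaluation value, is an assumption made for this illustration, as are the input format, the single `variation` value per speaker, and its threshold; the description does not define how the evaluation value is computed.

```python
def type_a_score(stats, variation_threshold):
    """Return the fraction of group type A criteria that are satisfied.

    stats: {"speakers": {name: {"count", "seconds", "variation"}},
            "overlap_count": int, "overlap_seconds": float},
    all measured over a 3-minute window (assumed input format).
    """
    speakers = stats["speakers"].values()
    active = [s for s in speakers if s["count"] >= 3 or s["seconds"] >= 20.0]
    criteria = [
        # active speakers account for at least 60% of all speakers
        len(active) / len(stats["speakers"]) >= 0.6,
        # utterances overlap at least 3 times or for at least 5 seconds
        stats["overlap_count"] >= 3 or stats["overlap_seconds"] >= 5.0,
        # every speaker's variation range reaches the threshold
        all(s["variation"] >= variation_threshold for s in speakers),
    ]
    return sum(criteria) / len(criteria)


stats = {
    "speakers": {
        "A": {"count": 4, "seconds": 12.0, "variation": 8.0},
        "B": {"count": 1, "seconds": 25.0, "variation": 9.0},
        "C": {"count": 1, "seconds": 4.0, "variation": 2.0},
    },
    "overlap_count": 2,
    "overlap_seconds": 6.0,
}
score = type_a_score(stats, variation_threshold=5.0)
```

In this example the first two criteria hold and the third fails because of speaker C, so the evaluation value is two thirds. Analogous functions with different criteria would be needed for group types B and C.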

The group state determination unit 207 may determine the group type using only the evaluation values obtained here, that is, based only on the utterance feature amounts. To improve the accuracy of the determination, however, it also considers other factors when determining the group type.

For example, the group state determination unit 207 analyzes the utterance content (text) of the conversation and obtains how frequently each speaker's utterances contain imperative words; honorific, polite, and humble expressions; casual words (words used between people on close terms); words used by children; words used toward children; and so on. This reveals each speaker's wording in the conversation. The group state determination unit 207 estimates the group type with this wording taken into account. For example, when "someone in the group speaks in an imperative tone and someone answers in honorific, polite, or humble language", the group can be judged likely to be of group type C. When "someone in the group speaks in an imperative tone but someone answers in casual language", the group can be judged likely to be of group type A. When "most speakers in the group use many casual words", the group can be judged likely to be of group type A or B. When "the group contains a person who speaks in the wording a parent (adult) uses toward a child and a person who speaks in the wording a child uses", the group can be judged likely to be of group type B. These are only examples; as long as the association between group types and wording is defined in advance, the group state determination unit 207 can determine which group type the current group most likely corresponds to.

The group state determination unit 207 can also determine the group type from the state of opinion exchange in the conversation. For example, when opinions are exchanged actively within the group, or when proposals are rejected or opposed relatively often, it can be determined that group type A or B is likely. When opinions are not exchanged actively within the group, or when a leader exists in the group, it can be determined that group type C is likely. These are merely examples; as long as the relationship between group types and the state of opinion exchange is defined in advance, the group state determination unit 207 can determine which group type the current group most likely corresponds to.

As described above, the group state determination unit 207 integrates the group types estimated from the utterance feature amounts, the wording, the state of opinion exchange, and the connections between speakers, and determines the best-matching group type as the type of the current group.
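One simple way to integrate the per-cue estimates is a vote, sketched below. The description does not state how the integration is performed, so the voting scheme, the equal weighting of cues, the splitting of a vote among tied candidates, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter


def integrate_group_type(candidates_by_cue):
    """Pick the group type best supported across cues.

    candidates_by_cue maps a cue name (hypothetical keys such as
    "features", "wording", "opinion", "links") to the set of group
    types that cue considers likely. A cue that names several types
    splits its single vote evenly among them.
    """
    votes = Counter()
    for types in candidates_by_cue.values():
        for group_type in types:
            votes[group_type] += 1.0 / len(types)
    if not votes:
        return None  # no cue produced any candidate
    # Highest vote wins; ties are broken alphabetically for determinism.
    return min(votes, key=lambda t: (-votes[t], t))
```

For example, `integrate_group_type({"features": {"A", "B"}, "wording": {"A"}, "opinion": {"A", "B"}, "links": {"A"}})` selects type A, because A collects three votes against one for B.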

In step S1005, the group state determination unit 207 estimates the role of each member in the group using the analysis results of steps S1002 and S1003 and other conversation status data. Roles in the group include the leader in decision making and followers of the leader. Roles such as superior, subordinate, parent, and child may also be estimated. In estimating member roles, it is also preferable to take into account the group type determined in step S1004.

In step S1006, the group state determination unit 207 estimates changes in the state of the group. The state of the group includes the frequency of utterances, the participants in the conversation, and who is leading the conversation. Examples of the state changes estimated in step S1006 include a drop in the utterance frequency of a specific speaker, a drop in the overall utterance frequency, a split of the conversation group, and a change of leader.
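The state changes listed for step S1006 could be detected by comparing two consecutive observation windows, as in the following sketch. The window representation (`counts`, `leader`, `num_groups`), the change labels, and the 50% drop threshold are hypothetical; the description does not define how a drop or a split is measured.

```python
def detect_state_changes(prev, curr, drop_ratio=0.5):
    """Compare two observation windows and list the detected state changes.

    prev and curr are dicts with the (assumed) keys:
      "counts"     - utterance count per speaker in the window
      "leader"     - identifier of the speaker leading the conversation
      "num_groups" - number of separate conversation groups
    drop_ratio is an assumed threshold: a count below
    drop_ratio * previous count is treated as a drop.
    """
    changes = []
    for speaker, before in prev["counts"].items():
        after = curr["counts"].get(speaker, 0)
        if before > 0 and after < before * drop_ratio:
            changes.append(("speaker_utterance_drop", speaker))
    total_before = sum(prev["counts"].values())
    total_after = sum(curr["counts"].values())
    if total_before > 0 and total_after < total_before * drop_ratio:
        changes.append(("overall_utterance_drop", None))
    if curr["num_groups"] > prev["num_groups"]:
        changes.append(("group_split", None))
    if curr["leader"] != prev["leader"]:
        changes.append(("leader_change", curr["leader"]))
    return changes
```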

In step S1007, the group state determination unit 207 combines the group type estimated in step S1004, the role of each member estimated in step S1005, and the group state change estimated in step S1006 into group state data and outputs it to the intervention/arbitration unit 209. By referring to the group state data, the intervention/arbitration unit 209 can grasp what state the group in conversation is in and can intervene appropriately.

<Intervention / arbitration processing>
Next, the intervention content determination process in step S305 of FIG. 3 is described in detail. FIG. 12 is a flowchart showing the flow of the intervention content determination process.

In step S1201, the intervention/arbitration unit 209 acquires the conversation status data output by the conversation state analysis unit 204 and the group state data output by the group state determination unit 207. Based on these data, the intervention/arbitration unit 209 determines, through the following processing, the content of the information to present when intervening or arbitrating.

In step S1202, the intervention/arbitration unit 209 acquires, from the intervention policy definition storage unit 210, an intervention policy corresponding to the group type and the group state change included in the group state data. An intervention policy is information indicating, according to the group state, which members of the group to support preferentially and how to support them. Examples of intervention policies defined in the intervention policy definition storage unit 210 are shown in FIGS. 13(A) and 13(B).

FIG. 13(A) shows examples of intervention policies according to group type. For example, for group type A, in which the relationship is flat, intimacy is high, and members can freely voice their opinions to one another, an example policy intended to encourage the members to decide through discussion is: "present information on the options (for example, candidate restaurants when deciding where to eat) to all members." For group type B, in which there is a hierarchical relationship but intimacy is high and a specific member leads the group's decision making, an example policy intended to draw out the opinions of members in a weak position who cannot voice them, and to encourage their adoption, is: "present to a facilitator-like member information on which member's opinion should preferably be drawn out, together with information on the options, and support drawing out that member's opinion and having it adopted." For group type C, in which there is a hierarchical relationship, intimacy is low, and a specific member leads the group's decision making, an example policy intended to prevent only one specific member's opinions from being adopted is: "for the first decision item, give priority to the opinion of the higher-ranking member; from the second item onward, present to a facilitator-like member information on which member's opinion should preferably be drawn out, together with information on the options, and support drawing out opinions from those members in turn and having them adopted." A facilitator-like member in these policies means a person who can stay close to members in a weak position, particularly those who cannot voice their opinions, draw out their opinions, and help those opinions be adopted. Although FIG. 13(A) shows one intervention policy per group type, a plurality of intervention policies may be defined for each group type.

FIG. 13(B) shows examples of intervention policies according to group state changes. For example, when a specific speaker's utterances have stalled (the speaker's utterance frequency has dropped) and the stall coincided with a change of topic, information related to the topic before the stall is presented. When utterances have stalled overall, information related to the topic before the stall is presented. When the group has split into two subgroups holding different conversations, information related to the topic of one subgroup is presented in a way that also interests the members of the other subgroup. When the leader has changed, information is provided so that the new leader can lead the topic. Although FIG. 13(B) shows one intervention policy per group state change, a plurality of intervention policies may be defined for each state change.

An intervention policy as described above can be regarded as information that defines, according to the group type or the group state change, the priority of intervention for each member of the group and what kind of intervention to perform. Here, intervention priority is set not so much for individual members as for members who hold a particular role in the group (such as the leader) or members who satisfy a particular condition (such as a drop in utterance frequency). However, not every intervention policy needs to include an intervention priority.
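The policy definitions described above can be pictured as two lookup tables, one keyed by group type and one keyed by state change. The sketch below is a hypothetical data layout, not the format of the intervention policy definition storage unit 210; the class name, field names, role labels, and abbreviated action texts are assumptions that paraphrase the examples of FIGS. 13(A) and 13(B).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InterventionPolicy:
    # Role or condition that receives priority; None means the policy
    # carries no priority (for example, information goes to all members).
    target_role: Optional[str]
    action: str


# Assumed table keyed by group type (paraphrasing FIG. 13(A)).
POLICIES_BY_GROUP_TYPE = {
    "A": [InterventionPolicy(None, "present option information to all members")],
    "B": [InterventionPolicy("facilitator", "indicate whose opinion to draw out, with option information")],
    "C": [InterventionPolicy("facilitator", "prioritize the senior member first, then draw out others in turn")],
}

# Assumed table keyed by state change (paraphrasing FIG. 13(B)).
POLICIES_BY_STATE_CHANGE = {
    "speaker_utterance_drop": [InterventionPolicy("stalled_speaker", "present information on the topic before the stall")],
    "overall_utterance_drop": [InterventionPolicy(None, "present information on the topic before the stall")],
    "group_split": [InterventionPolicy(None, "present one subgroup's topic so the other subgroup is interested")],
    "leader_change": [InterventionPolicy("new_leader", "provide information that lets the new leader lead the topic")],
}


def lookup_policies(group_type, state_changes):
    """Collect the policies that apply to a group type and its state changes."""
    policies = list(POLICIES_BY_GROUP_TYPE.get(group_type, []))
    for change in state_changes:
        policies.extend(POLICIES_BY_STATE_CHANGE.get(change, []))
    return policies
```

Storing a list per key reflects the statement that a plurality of policies may be defined for each group type or state change.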

In step S1203, the intervention/arbitration unit 209 determines the member to be targeted and the intervention method based on the intervention policy acquired in step S1202. For example, the intervention/arbitration unit 209 decides to provide the leader with information that matches the preferences of the other members, or to provide information related to topics favored by a speaker whose utterances have stalled. In step S1203, it may also be decided not to intervene at the present time. The decision in step S1203 need not be based on the intervention policy alone; it is also preferable to base it on other information such as the conversation status data. For example, when it is judged from the utterance intentions included in the conversation status data that opinions are being exchanged within the group in order to reach a decision, the intervention target and intervention method may be determined based on an intervention policy that supports decision making.

In step S1204, the intervention/arbitration unit 209 generates or acquires presentation information according to the intervention target member and the intervention method. For example, when providing the leader with information that matches the preferences of the other members, the preferences of the other members are first determined, either from the conversation themes so far and the emotions (such as degree of excitement) of those members, or from the user information DB 123. If, during a conversation about where to have lunch, such a member prefers Italian food, information on Italian restaurants is acquired from the related information website 130 or the like. At this time, the restaurants to present may be narrowed down by also taking into account the position information obtained from the GPS device 112 of the vehicle 110.
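Narrowing down candidate restaurants by the vehicle's position could look like the sketch below. The description does not say how position is used, so filtering by great-circle (haversine) distance, the radius parameter, and the restaurant record fields (`name`, `cuisine`, `lat`, `lon`) are assumptions made for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearby_shops(shops, cuisine, lat, lon, max_km):
    """Return names of shops of the given cuisine within max_km, nearest first.

    shops is a list of dicts with the assumed keys
    "name", "cuisine", "lat", and "lon".
    """
    hits = [
        (haversine_km(lat, lon, s["lat"], s["lon"]), s["name"])
        for s in shops
        if s["cuisine"] == cuisine
    ]
    return [name for dist, name in sorted(hits) if dist <= max_km]
```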

In step S1205, the intervention/arbitration unit 209 generates and outputs intervention instruction data including the presentation information generated or acquired in step S1204. In the present embodiment, the intervention instruction data is transmitted from the server device 120 to the navigation device 111 of the vehicle 110. Based on the intervention instruction data, the output control unit 212 of the navigation device 111 generates synthesized speech and display text and presents the information through the speaker 213 and the display 214 (S306).

The series of conversation intervention support processes described above (FIG. 3) is executed repeatedly. The repetition interval is preferably short so that interventions can be made at appropriate times in response to utterances. However, not every process needs to be performed on every iteration. For example, the conversation state analysis S303 and the group state determination S304 may be performed at certain intervals (for example, every three minutes). Within the group state determination as well, the determination of the group type and the determination of group state changes may be performed at different intervals.
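Running different steps at different intervals inside one repeated loop could be arranged as in the sketch below. Only the three-minute example for S303 and S304 comes from the description; the function, the time representation in seconds, and the 5-second interval used for S305 in the usage note are hypothetical.

```python
def due_steps(now, last_run, intervals):
    """Return the steps whose interval has elapsed since their last run.

    now       - current time in seconds
    last_run  - dict mapping step name to the time it last ran
    intervals - dict mapping step name to its minimum interval in seconds
    A step that has never run is always due.
    """
    due = []
    for step, interval in intervals.items():
        if now - last_run.get(step, float("-inf")) >= interval:
            due.append(step)
    return due
```

With `intervals = {"S303": 180, "S304": 180, "S305": 5}`, a loop that calls `due_steps` on each iteration would run S305 frequently while running S303 and S304 only once every three minutes.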

<Advantageous effects of the present embodiment>
In the present embodiment, the conversation state analysis unit 204 can identify, in a conversation among a plurality of speakers, groups of utterances that share the same conversation theme, and can further grasp whether the utterances are related to one another and, if so, what the relationship is. In addition, the conversation status can be estimated from the intervals between utterances by speakers in the same conversation and the degree to which those utterances overlap. With the conversation status analysis technique of the present embodiment, the status of each conversation can be grasped even when many speakers are divided into different groups and are conversing at the same time.

Also in the present embodiment, the group state determination unit 207 can grasp, based on the conversation status data and the like, the type and state changes of the group holding the conversation, as well as the role of each speaker in the group and the relationships among the speakers. This makes it possible, when the system intervenes in a conversation, to decide which speaker to support preferentially and to intervene appropriately according to the state of the group.

<Modifications>
In the above description, the conversation intervention support system is configured as a telematics service in which a vehicle and a server device cooperate, but the specific form of the system is not limited to this. For example, it can be configured as a system that acquires a conversation in a room such as a conference room and intervenes in that conversation.

201: Microphone
202: Noise removal unit
203: Sound source separation unit
204: Conversation state analysis unit
205: Speech recognition corpus/dictionary
206: Vocabulary and intention understanding corpus/dictionary
207: Group state determination unit
208: Group model definition storage unit
209: Intervention/arbitration unit
210: Intervention policy definition storage unit
211: Related information database
212: Output control unit
213: Speaker
214: Display

Claims

A conversation state analysis device that analyzes the status of a conversation among a plurality of speakers, comprising:
acquisition means for acquiring conversation speech of the plurality of speakers;
separation means for separating the conversation speech into a plurality of utterances, one for each speaker and each utterance section;
recognition means for recognizing the utterance content of each of the plurality of utterances using speech recognition processing; and
analysis means for analyzing the relationships between utterances based on the utterance content, the analysis means identifying, based on the content of each utterance, utterances estimated to share the same conversation theme as a series of utterance groups.

The conversation state analysis device according to claim 1, wherein
the recognition means recognizes the utterance content by matching the text of the utterance obtained by the speech recognition processing against a dictionary, and
the analysis means obtains the intention and topic of the utterance by matching the text of the utterance content recognized by the recognition means against a dictionary, and estimates the conversation theme of the utterance based on the intention and topic of the utterance.

The conversation state analysis device according to claim 2, further comprising feature amount calculation means for calculating a speech feature amount for each of the plurality of utterances, wherein
the analysis means estimates, for each speaker, the emotion of the speaker at the time of each utterance based on changes in the speech feature amount, and estimates the intention of the utterance taking that emotion into account as well.

The conversation state analysis device according to claim 2 or 3, wherein
the analysis means acquires the correspondence between utterances in the series of utterance groups based on the intentions of the utterances.

The conversation state analysis device according to claim 4, further comprising imaging means for capturing images of the speakers, wherein
the analysis means acquires the correspondence between utterances in the series of utterance groups taking into account the orientation of a speaker's body, face, or line of sight in the images captured by the imaging means.

The conversation state analysis device according to any one of claims 1 to 5, further comprising imaging means for capturing images of the speakers, wherein
the analysis means estimates a speaker's emotion according to changes in a facial feature amount calculated from the speaker's face image in the images captured by the imaging means, and analyzes the relationships between utterances taking that emotion into account as well.

The conversation state analysis device according to any one of claims 1 to 6, wherein
the analysis means obtains the relationships between speakers based on the relationships between utterances and at least one of the content of the utterances, the feature amounts of the utterances, and the emotions of the speakers at the time of utterance.

The conversation state analysis device according to any one of claims 1 to 7, further comprising output means for outputting conversation status data, which is data relating to the series of utterance groups.

The conversation state analysis device according to claim 8, wherein
the conversation status data includes at least one of: the speaker of each utterance, the correspondence between utterances, the meaning and intention of each utterance, the emotion of the speaker at the time of each utterance, the utterance frequency of each speaker in the utterance group, the speech feature amount of each utterance, and the relationships between speakers.

A support device that intervenes in and supports a conversation among a plurality of speakers, comprising:
the conversation state analysis device according to claim 8 or 9;
group state determination means for determining, based on the conversation status data output from the conversation state analysis device, the state of a group consisting of the plurality of speakers participating in a series of utterance groups; and
intervention means for determining the content of an intervention in the conversation based on the state of the group and intervening in the conversation.

A conversation state analysis method for analyzing the status of a conversation among a plurality of speakers, wherein a computer executes:
an acquisition step of acquiring conversation speech of the plurality of speakers;
a separation step of separating the conversation speech into a plurality of utterances, one for each speaker and each utterance section;
a recognition step of recognizing the utterance content of each of the plurality of utterances using speech recognition processing; and
an analysis step of analyzing the relationships between utterances based on the utterance content, the analysis step identifying, based on the content of each utterance, utterances estimated to share the same conversation theme as a series of utterance groups.