JP7423490B2 - Dialogue program, device, and method for expressing a character's listening feeling according to the user's emotions

Info

Publication number: JP7423490B2
Application number: JP2020161450A
Authority: JP
Inventors: 俊一田原; 元服部; 一則松本
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2020-09-25
Filing date: 2020-09-25
Publication date: 2024-01-29
Anticipated expiration: 2040-09-25
Also published as: JP2022054326A

Description

The present invention relates to technology for dialogue agents that interact with a user.

"Dialogue agent" technology, in which an agent converses naturally with a user through a smartphone or tablet, has become widespread. With this technology, a computer-graphics character (avatar) shown on a display converses with the user by voice or text. From the user's point of view, the character acting as a dialogue agent can be recognized as a person who can be talked to, and it has its own profile (attributes such as age, gender, and place of origin). The character may of course be a virtual one that does not actually exist. Because the character conducts the dialogue in accordance with the user's situation and tastes, the user tends to feel an affinity with it.

Some dialogue agents advance the dialogue on the basis of a "dialogue scenario" or a "machine learning model." Even when no appropriate response sentence can be estimated for the user's utterance from the dialogue scenario or machine learning model, a dialogue agent can still reply with an attentive-listening response such as a backchannel (aizuchi) or parroting (repeating the user's words). Such responses easily give the user the impression that the character understands what the user is saying.

It has long been understood that, in dialogue between humans, not only responding attentively with backchannels but also imitating the other person's gestures gives the other person a more positive impression and makes rapport easier to form than when no imitation takes place (see, for example, Non-Patent Document 1).
Accordingly, when the character acting as a dialogue agent imitates the user's facial expressions and gestures, the user can be expected to feel that the character understands what the user is saying. In counseling, for example, this makes it easier to draw out statements from users about their own worries.

There is also a technique in which the user's facial expressions and gestures are detected and a CG (computer graphics) character imitates them (see, for example, Non-Patent Document 2). With this technique, the user's facial expression (joy, sadness, and so on) is detected by analyzing camera video in which the user's face appears. For gestures, OpenPose (registered trademark), which detects human joint points in a still image, is used (see, for example, Non-Patent Document 3).

Furthermore, there is a technique that builds a learning model trained on utterances that appear in dialogue and the backchannels given in response to them, so that when an arbitrary user utterance is input to the model, it predicts and outputs an appropriate backchannel (see, for example, Non-Patent Document 4). When the character responds to the user's remark with a backchannel such as "uh-huh," the user feels listened to, which can raise the user's favorable impression of the dialogue.

Non-Patent Document 1: Hiroaki Aoyagi, "Experimental study on nonverbal skills in psychological clinical settings," Japanese Journal of Counseling Science, 2013, [online], [retrieved September 20, 2020], Internet <URL: https://www.jstage.jst.go.jp/article/cou/46/2/46_83/_article/-char/ja/>
Non-Patent Document 2: Hitachi, Ltd., "Developing technology to imitate emotional expressions to realize human-friendly AI," 2018, [online], [retrieved September 20, 2020], Internet <URL: https://www.hitachi.co.jp/rd/news/topics/2018/1106.html>
Non-Patent Document 3: Gui Liang-Yan, et al., "Teaching robots to predict human motion," 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2018, [online], [retrieved September 20, 2020], Internet <URL: https://www.researchgate.net/publication/330580980_Teaching_Robots_to_Predict_Human_Motion>
Non-Patent Document 4: Kawahara, Tatsuya, et al., "Prediction and Generation of Backchannel Form for Attentive Listening Systems," Interspeech, 2016, [online], [retrieved September 20, 2020], Internet <URL: http://sap.ist.i.kyoto-u.ac.jp/EN/bib/intl/KAW-INTERSP16.pdf>
Non-Patent Document 5: AI and machine learning products, "Sentiment analysis," [online], [retrieved September 3, 2020], Internet <https://cloud.***.com/natural-language/docs/analyzing-sentiment?hl=ja>
Non-Patent Document 6: Hiroko Ichikawa, "Synchronous facial expressions in two-party face-to-face communication," Diss., University of Tsukuba, 2008, [online], [retrieved September 20, 2020], Internet <URL: https://tsukuba.repo.nii.ac.jp/?action=repository_action_common_download&item_id=21183&item_no=1&attribute_id=17&file_no=2>
Non-Patent Document 7: N2, KDDI Research, [online], [retrieved September 3, 2020], Internet <https://www.kddi-research.jp/products/n2/spec.html>
Non-Patent Document 8: Kawahara, Tatsuya, et al., "Prediction and Generation of Backchannel Form for Attentive Listening Systems," Interspeech, 2016, [online], [retrieved September 3, 2020], Internet <https:https://www.researchgate.net/publication/307889355_Prediction_and_Generation_of_Backchannel_Form_for_Attentive_Listening_Systems>
Non-Patent Document 9: Jianming Wu et al., "Effects of objective feedback of facial expression recognition during video support chat," MUM '17: Proceedings of the 16th International Conference on Mobile and Ubiquitous Multimedia, [online], [retrieved September 3, 2020], Internet <https://dl.acm.org/doi/10.1145/3152832.3152848>
Non-Patent Document 10: Wang, Yanan, et al., "Multi-Attention Fusion Network for Video-based Emotion Recognition," 2019 International Conference on Multimodal Interaction, 2019, [online], [retrieved September 3, 2020], Internet <https:https://www.researchgate.net/publication/336632156_Multi-Attention_Fusion_Network_for_Video-based_Emotion_Recognition>
Non-Patent Document 11: J. Xu, K. Tasaka, and H. Yanagihara, "Beyond Two-stream: Skeleton-based Three-stream Networks for Action Recognition in Videos," The 24th International Conference on Pattern Recognition (ICPR2018).

With the techniques described in Non-Patent Documents 2 and 3, the character cannot imitate the user unless the user's facial expression or gestures change during the dialogue. As a result, from the user's point of view, these techniques do little to raise the user's favorable impression of the character.
With the technique described in Non-Patent Document 4, if the character repeatedly gives monotonous backchannels during a long dialogue, the user finds this unnatural and the user's favorable impression of the character declines.

Accordingly, an object of the present invention is to provide a dialogue program, device, and method that express a character's attentive listening in accordance with the user's emotions.

According to the present invention, there is provided a program that causes a computer to function as a dialogue agent responsive to a user's emotions, the program causing the computer to function as:
a voice analysis means that converts the user's uttered voice into an uttered sentence;
an uttered sentence emotion determination means that determines an emotional polarity from the user's uttered sentence;
a dialogue control means that generates a scenario-based response sentence when the user's uttered sentence follows a dialogue scenario and, when it does not, generates an attentive-listening response sentence if the uttered sentence emotion determination means has obtained a positive or neutral emotional polarity;
a user video analysis means that receives video showing the user while speaking and determines the user's emotional polarity;
a visual expression determination means that outputs the emotional polarity determined by the uttered sentence emotion determination means when the user video analysis means could not determine an emotional polarity, and outputs the emotional polarity determined by the user video analysis means when it could;
a character video generation means that stores character video in advance for each emotional polarity and generates character video corresponding to the emotional polarity output from the visual expression determination means;
a voice synthesis means that converts the response sentence into response voice and outputs the response voice from a speaker; and
a video display control means that displays the character video on a display in synchronization with the response voice.
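
The fallback rule of the visual expression determination means can be illustrated with a minimal Python sketch. The names `Polarity`, `select_visual_polarity`, and `CHARACTER_VIDEOS`, and the file names, are hypothetical and do not appear in the specification. The sketch also assumes that an undeterminable video result is represented as `None`.

```python
from enum import Enum
from typing import Optional


class Polarity(Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


# Hypothetical table of pre-stored character videos, one per polarity.
CHARACTER_VIDEOS = {
    Polarity.POSITIVE: "character_positive.mp4",
    Polarity.NEUTRAL: "character_neutral.mp4",
    Polarity.NEGATIVE: "character_negative.mp4",
}


def select_visual_polarity(video_polarity: Optional[Polarity],
                           text_polarity: Polarity) -> Polarity:
    """Use the video-based polarity when it was determined;
    otherwise fall back to the polarity of the uttered sentence."""
    if video_polarity is not None:
        return video_polarity
    return text_polarity


# Video analysis failed (None), so the text-based polarity is used.
chosen = select_visual_polarity(None, Polarity.POSITIVE)
video_file = CHARACTER_VIDEOS[chosen]
```

In this sketch the video-based result always takes priority when available, mirroring the order of the two conditions stated above.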

According to another embodiment of the program of the present invention, it is also preferable that
the voice analysis means further detects speech tone data from the user's uttered voice, and
the voice synthesis means generates the response voice in synchronization with the speech tone data.

According to another embodiment of the program of the present invention, it is also preferable that
the speech tone data is the number of characters per unit time and/or the volume level.

According to another embodiment of the program of the present invention, it is also preferable that
the character video generation means stores in advance, for each emotional polarity, character video with a different facial expression, and generates character video with the facial expression corresponding to the determined emotional polarity.

According to another embodiment of the program of the present invention, it is also preferable that
the character video generation means stores in advance, for each emotional polarity, time-series coordinate displacements of a plurality of skeleton points, and generates character video in which the character's skeleton points are displaced over time by the coordinate displacements of the corresponding skeleton points for the determined emotional polarity.
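
As a rough illustration of how stored skeleton-point displacements could drive a character, the following Python sketch applies a per-polarity displacement sequence to a base pose. The names `MOTIONS`, `BASE_POSE`, and `animate`, the joint names, and all coordinate values are hypothetical. The sketch further assumes that each stored displacement is an offset from the base pose rather than from the previous frame, which the specification does not state.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Pose = Dict[str, Point]

# Hypothetical base pose of the character (joint name -> 2D coordinate).
BASE_POSE: Pose = {"head": (0.0, 1.0), "right_hand": (0.5, 0.0)}

# Hypothetical pre-stored displacement sequences, one per emotional polarity.
# Each list element is one frame; joints not listed are left unmoved.
MOTIONS: Dict[str, List[Dict[str, Point]]] = {
    "positive": [{"head": (0.0, 0.25)}, {"head": (0.0, 0.5), "right_hand": (0.25, 0.5)}],
    "neutral": [{}, {}],
}


def animate(base_pose: Pose, displacements: List[Dict[str, Point]]) -> List[Pose]:
    """Return one pose per frame by adding each frame's displacement
    to the matching skeleton point of the base pose."""
    frames: List[Pose] = []
    for frame_disp in displacements:
        frame: Pose = {}
        for joint, (x, y) in base_pose.items():
            dx, dy = frame_disp.get(joint, (0.0, 0.0))
            frame[joint] = (x + dx, y + dy)
        frames.append(frame)
    return frames


frames = animate(BASE_POSE, MOTIONS["positive"])
```

A renderer would then draw the character once per returned frame; that part is outside the scope of this sketch.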

According to another embodiment of the program of the present invention, it is also preferable that
the user video analysis means uses a learning engine trained on training data that associates the facial expressions of a plurality of users with emotional polarities, and estimates the user's emotional polarity from the facial image in the video showing the user while speaking.

According to another embodiment of the program of the present invention, it is also preferable that
the user video analysis means uses a learning engine trained on training data that associates, for a plurality of users, time-series coordinate displacements of a plurality of skeleton points with emotional polarities, and estimates the user's emotional polarity from the time-series coordinate displacements of the plurality of skeleton points in the video showing the user while speaking.

According to another embodiment of the program of the present invention, it is also preferable that
the attentive-listening response sentence generated by the dialogue control means is a backchannel or parroting.

According to the present invention, there is provided a dialogue device that functions as a dialogue agent responsive to a user's emotions, the device comprising:
a voice analysis means that converts the user's uttered voice into an uttered sentence;
an uttered sentence emotion determination means that determines an emotional polarity from the user's uttered sentence;
a dialogue control means that generates a scenario-based response sentence when the user's uttered sentence follows a dialogue scenario and, when it does not, generates an attentive-listening response sentence if the uttered sentence emotion determination means has obtained a positive or neutral emotional polarity;
a user video analysis means that receives video showing the user while speaking and determines the user's emotional polarity;
a visual expression determination means that outputs the emotional polarity determined by the uttered sentence emotion determination means when the user video analysis means could not determine an emotional polarity, and outputs the emotional polarity determined by the user video analysis means when it could;
a character video generation means that stores character video in advance for each emotional polarity and generates character video corresponding to the emotional polarity output from the visual expression determination means;
a voice synthesis means that converts the response sentence into response voice and outputs the response voice from a speaker; and
a video display control means that displays the character video on a display in synchronization with the response voice.

According to the present invention, there is provided a dialogue method for a device that functions as a dialogue agent responsive to a user's emotions, in which the device executes:
a first step of converting the user's uttered voice into an uttered sentence;
a second step of determining an emotional polarity from the user's uttered sentence;
a third step of generating a scenario-based response sentence when the user's uttered sentence follows a dialogue scenario and, when it does not, generating an attentive-listening response sentence if the uttered sentence emotion determination means has obtained a positive or neutral emotional polarity;
a fourth step of receiving video showing the user while speaking and determining the user's emotional polarity;
a fifth step of outputting the emotional polarity determined by the third step when the fourth step could not determine an emotional polarity, and outputting the emotional polarity determined by the fourth step when it could;
a sixth step of generating, from character video stored in advance for each emotional polarity, character video corresponding to the emotional polarity output by the fifth step;
a seventh step of converting the response sentence into response voice and outputting the response voice from a speaker; and
an eighth step of displaying the character video on a display in synchronization with the response voice.

According to the dialogue program, device, and method of the present invention, a character's attentive listening can be expressed in accordance with the user's emotions.

FIG. 1 is a configuration diagram showing an embodiment of the dialogue device.
FIG. 2 is a basic functional configuration diagram of the dialogue device according to the present invention.
FIG. 3 is an explanatory diagram showing the processing of the voice acquisition unit and the voice analysis unit.
FIG. 4 is an explanatory diagram showing the processing of the uttered sentence emotion determination unit.
FIG. 5 is an explanatory diagram showing the gesture generation function of the character video generation unit.
FIG. 6 is an explanatory diagram showing the processing of the voice synthesis unit and the video display control unit.
FIG. 7 is a functional configuration diagram of the dialogue device further including a user video analysis unit and a visual expression determination unit.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a configuration diagram showing an embodiment of the dialogue device.

In FIG. 1(a), the user converses with a dialogue device 1 equipped with a display, such as a smartphone or tablet. The dialogue device 1 is implemented with the dialogue program of the present invention installed in advance.
A character (avatar) acting as the dialogue agent is shown on the display of the dialogue device 1. The user's uttered voice is picked up by the microphone of the dialogue device 1 and input to the dialogue program. The dialogue program generates a response sentence in accordance with the user's uttered sentence and responds by voice from the speaker of the dialogue device 1.

In FIG. 1(b), the dialogue device 1 is implemented as a server connected to a network. A terminal 2 communicates with the dialogue device 1 over the network and functions as a user interface equipped with a display, a microphone, and a speaker.
The terminal 2 controls how the character is shown on the display in accordance with instructions from the dialogue device 1. The terminal 2 also transmits the voice information picked up by the microphone to the dialogue device 1 and outputs the response sentence returned from the dialogue device 1 through the speaker.

FIG. 2 is a basic functional configuration diagram of the dialogue device according to the present invention.

The dialogue device 1 of the present invention operates as a dialogue program that expresses a character's attentive listening in accordance with the user's emotions. As a result, the user feels that the character shown on the display of the dialogue device 1 understands what the user is saying, and the user's favorable impression of the character can be raised.
As shown in FIG. 2, the dialogue device 1 of the present invention has a dialogue control unit 10, a voice acquisition unit 101, a voice analysis unit 11, an uttered sentence emotion determination unit 12, a character video generation unit 13, a voice synthesis unit 14, and a video display control unit 15. These functional components are realized by executing a program that causes a computer mounted on the device to function. The processing flow of these functional components can also be understood as a dialogue method.

[Voice acquisition unit 101]
The voice acquisition unit 101 receives the user's uttered voice from the microphone and outputs only the voice signal of the user's utterance period to the voice analysis unit 11. The user's uttered voice may of course be audio extracted from video.

FIG. 3 is an explanatory diagram showing the processing of the voice acquisition unit and the voice analysis unit.
As shown in FIG. 3, the voice acquisition unit 101 detects only the voice signal of the user's utterance period.
(Character: "Good morning!") -> User: "Good morning"
(Character: "Where are you going?") -> User: "I'm going to see a movie"

[Voice analysis unit 11]
The voice analysis unit 11 converts the voice signal of the user's utterance received from the voice acquisition unit 101 into an "uttered sentence (text)." Specifically, technologies such as Cloud Speech-to-Text (registered trademark) from Google (registered trademark) or Speech to Text (registered trademark) from Microsoft (registered trademark) can be applied.

The voice analysis unit 11 can also detect "speech tone data" from the voice signal of the user's utterance. Speech tone data is the number of characters per unit time (speech rate) and/or the volume level (loudness, for example in decibels).
Here, the voice analysis unit 11 may classify the user's uttered voice by speech rate into, for example, the three classes "slow," "normal," and "fast." As the simplest criterion, a speech rate threshold may be defined in advance for each class.
Speech rate    Criterion
Slow           2 or fewer characters per second on average
Normal         3 to 5 characters per second on average
Fast           6 or more characters per second on average
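
The predefined-threshold criterion above can be sketched in Python as follows. The function names `speech_rate` and `classify_rate` are hypothetical. Because the stated criteria leave the ranges between 2 and 3 and between 5 and 6 characters per second unassigned, this sketch assumes that any rate strictly between the "slow" and "fast" limits counts as "normal."

```python
def speech_rate(text: str, duration_sec: float) -> float:
    """Average number of characters per second for one utterance."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return len(text) / duration_sec


def classify_rate(chars_per_sec: float,
                  slow_max: float = 2.0,
                  fast_min: float = 6.0) -> str:
    """Classify a speech rate with fixed thresholds.

    Assumption: rates between slow_max and fast_min (exclusive)
    are treated as "normal", including the gaps 2-3 and 5-6.
    """
    if chars_per_sec <= slow_max:
        return "slow"
    if chars_per_sec >= fast_min:
        return "fast"
    return "normal"


# A 12-character utterance spoken over 3 seconds: 4 characters per second.
label = classify_rate(speech_rate("abcdefghijkl", 3.0))
```

The thresholds are exposed as parameters so that values derived by another method can be substituted without changing the function.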

As another embodiment, as shown in FIG. 3, a number of test utterances with different numbers of characters per second may be played to human listeners, who label each one with a subjective impression of "slow," "normal," or "fast."
In the simplest method, the subjective impression for each character rate is decided by majority vote.
Alternatively, an average value may be calculated for each impression label as follows.
Impression "slow": average of No. 1 and No. 6, 1.589 characters = (1.428 + 1.750) / 2
Impression "normal": average of No. 3 and No. 4, 2.612 characters = (2.140 + 3.083) / 2
Impression "fast": average of No. 2 and No. 5, 6.850 characters = (5.900 + 7.800) / 2
The midpoint between the average for "slow" and the average for "normal" is then used as the criterion for "slow," and the midpoint between the averages for "normal" and "fast" as the criterion for "normal."
Impression "slow": 2.101 characters = (1.589 + 2.612) / 2
Impression "normal": 4.730 characters = (2.612 + 6.850) / 2
For example, a user speech rate of 3.429 characters per second is judged to be "normal."
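
The averaging method can be reproduced with a short Python sketch. The names `LABELED_RATES`, `slow_upper`, `normal_upper`, and `classify` are hypothetical, and the sketch assumes that a rate below a midpoint belongs to the slower class. It computes the midpoints from unrounded averages, so the results can differ from the figures quoted above in the third decimal place.

```python
from statistics import mean

# Speech rates (characters per second) of the six test utterances,
# grouped by the subjective impression label given in the example.
LABELED_RATES = {
    "slow": [1.428, 1.750],    # No. 1 and No. 6
    "normal": [2.140, 3.083],  # No. 3 and No. 4
    "fast": [5.900, 7.800],    # No. 2 and No. 5
}

averages = {label: mean(rates) for label, rates in LABELED_RATES.items()}

# Midpoints between adjacent class averages serve as decision boundaries.
slow_upper = (averages["slow"] + averages["normal"]) / 2
normal_upper = (averages["normal"] + averages["fast"]) / 2


def classify(chars_per_sec: float) -> str:
    """Classify a rate using the boundaries derived from labeled data."""
    if chars_per_sec < slow_upper:
        return "slow"
    if chars_per_sec < normal_upper:
        return "normal"
    return "fast"


result = classify(3.429)  # the example rate from the text
```

With these boundaries (roughly 2.10 and 4.73 characters per second), the example rate of 3.429 falls in the "normal" class, in line with the judgment stated in the text.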

The voice analysis unit 11 then outputs the detected uttered sentence to the uttered sentence emotion determination unit 12 and the dialogue control unit 10, and outputs the speech tone data to the voice synthesis unit 14.

[Uttered sentence emotion determination unit 12]
The uttered sentence emotion determination unit 12 determines an "emotional polarity" from the user's uttered sentence.
The emotional polarity may be determined as, for example, one of the three classes positive, negative, and neutral. Specifically, a classifier such as a Support Vector Machine may be used, or the Cloud Natural Language API provided by Google (registered trademark) may be used (see, for example, Non-Patent Document 5).

The uttered sentence emotion determination unit 12 may determine an "emotional polarity" for each simple sentence contained in the user's uttered sentence. From these results it may determine the emotional polarity of the uttered sentence as a whole, or it may judge the whole uttered sentence to be positive if it contains even one positive simple sentence.
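
The two aggregation options can be sketched in Python as follows. The function name `aggregate_polarity` and its `any_positive_wins` flag are hypothetical. The specification does not say how the overall judgment is made, so this sketch assumes a majority vote over the per-sentence polarities, with ties and empty input resolved to "neutral."

```python
from collections import Counter
from typing import List


def aggregate_polarity(sentence_polarities: List[str],
                       any_positive_wins: bool = False) -> str:
    """Combine per-sentence polarities into one polarity for the utterance.

    any_positive_wins=True: one positive sentence makes the whole
    utterance positive.
    any_positive_wins=False: majority vote (an assumption of this
    sketch); ties and empty input yield "neutral".
    """
    if not sentence_polarities:
        return "neutral"
    if any_positive_wins and "positive" in sentence_polarities:
        return "positive"
    counts = Counter(sentence_polarities)
    top_count = max(counts.values())
    winners = [p for p, n in counts.items() if n == top_count]
    return winners[0] if len(winners) == 1 else "neutral"


per_sentence = ["negative", "negative", "positive"]
overall = aggregate_polarity(per_sentence)
lenient = aggregate_polarity(per_sentence, any_positive_wins=True)
```

The same input gives "negative" under the majority vote and "positive" under the any-positive rule, which shows how strongly the choice of rule affects whether an attentive-listening response is later permitted.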

FIG. 4 is an explanatory diagram showing the processing of the uttered sentence emotion determination unit.
As shown in FIG. 4, the uttered sentence emotion determination unit 12 determines an emotional polarity for each of the user's uttered sentences as follows.
Uttered sentence "Good morning" -> emotional polarity "neutral"
Uttered sentence "I'm going to see a movie" -> emotional polarity "positive"
The uttered sentence emotion determination unit 12 then outputs the determined emotional polarity to the dialogue control unit 10 and the character video generation unit 13.

Note that the uttered sentence emotion determination unit 12 is not limited to positive, negative, and neutral; it may, for example, be able to determine seven emotional polarities (joy, sadness, anger, contempt, disgust, fear, and surprise). The dialogue control unit 10 and the character video generation unit 13 are then operated in accordance with each emotional polarity.

[Dialogue control unit 10]
The dialogue control unit 10 generates a scenario-based or listening-type response sentence from the user's uttered sentence input from the voice analysis unit 11.

The dialogue control unit 10 is a general dialogue agent, and advances the dialogue with the user based on a "dialogue scenario." A dialogue scenario consists of an alternating sequence of uttered sentences and response sentences. An uttered sentence is text that the user is expected to utter, and a response sentence is the text with which the dialogue agent replies to that uttered sentence.
If the user's uttered sentence follows the dialogue scenario, the dialogue control unit 10 can reply with a "scenario-based response sentence." Otherwise, it can reply to the user's uttered sentence with a listening-type response sentence and wait for the user's next uttered sentence. A listening-type response sentence may be a backchannel response (aizuchi) or a parroted repetition of the user's words.

It is also preferable that the dialogue control unit 10 generate a listening-type response sentence only when the uttered sentence emotion determination unit 12 obtains a positive or neutral emotional polarity. This is because replying with a backchannel response or parroting when a negative emotional polarity is obtained can feel as if the agent is affirming the user's negative uttered sentence.
Of course, the response sentence generated by the dialogue control unit 10 is not limited to a scenario-based or listening-type response sentence, and may be one estimated using a machine learning engine.
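The branching described for the dialogue control unit 10 can be sketched roughly as below. Everything named here is hypothetical: the scenario is reduced to an exact-match dictionary, the single backchannel string stands in for a larger set, and returning no response for a negative polarity is only one possible handling (the description leaves that case to other mechanisms such as a machine learning engine).

```python
BACKCHANNEL = "Uh-huh"  # hypothetical stand-in for a backchannel response


def respond(utterance, polarity, scenario, parrot=False):
    """Return (kind, text) for one user utterance.

    scenario: dict mapping an expected user utterance to the agent's reply
              (exact matching is a simplifying assumption).
    polarity: "positive", "negative", or "neutral" from the text classifier.
    """
    if utterance in scenario:
        return ("scenario", scenario[utterance])
    if polarity in ("positive", "neutral"):
        # Listening-type response: parrot the user's words or backchannel.
        return ("listening", utterance if parrot else BACKCHANNEL)
    # Negative polarity: do not affirm; leave the reply to another strategy.
    return ("none", None)
```

Keeping the polarity check after the scenario lookup mirrors the order in the text: the listening-type fallback is considered only once the utterance has failed to match the scenario.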

[Character video generation unit 13]
The character video generation unit 13 stores a character video for each emotional polarity in advance, and generates the character video corresponding to the determined emotional polarity. The emotional polarity can be expressed in the following ways.
<Generation of character's facial expressions>
<Generation of character gestures>

<Generation of character's facial expressions>
The character video generation unit 13 stores in advance character videos with a different facial expression for each emotional polarity, and generates the character video whose facial expression corresponds to the determined emotional polarity.
Emotional polarity "Positive" -> Character video with a smiling face
Emotional polarity "Negative" -> Character video with a sad face
As a result, a character video with a facial expression that imitates the user's emotional polarity is displayed.

As another embodiment of facial expression generation, when the uttered sentence emotion determination unit 12 determines "anger" among the seven emotional polarities (joy, sadness, anger, contempt, disgust, fear, surprise), it is also preferable that the character video generation unit 13 deliberately select a facial expression of "joy" rather than imitate the facial expression of "anger."
In a dialogue between humans, a person who responds to the other party's angry expression with an angry expression of their own tends not to be liked by the other party (for example, see Non-Patent Document 6). That document notes that showing a "smile" is advisable in this case. In light of this finding, it is also preferable that the character video generation unit 13 generate an adaptively changed facial expression, rather than directly imitating the emotional polarity output by the uttered sentence emotion determination unit 12.
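A small sketch of this adaptive selection is shown below. The override table and function name are hypothetical; only the anger-to-joy substitution comes from the description, and all other emotions are assumed to be imitated unchanged.

```python
# Only the "anger" -> "joy" override is stated in the description;
# every other detected emotion is mirrored as-is (assumption).
ADAPTIVE_OVERRIDES = {"anger": "joy"}


def select_expression(detected, mimic_only=False):
    """Choose the character's facial expression for a detected emotion."""
    if mimic_only:
        return detected
    return ADAPTIVE_OVERRIDES.get(detected, detected)
```

A lookup table keeps the imitation rule as the default and makes each exception explicit, so further overrides could be added without changing the selection logic.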

<Generation of character gestures>
The character video generation unit 13 stores in advance, for each emotional polarity, the time-series coordinate displacements of a plurality of skeletal points. It generates a character video in which the character's corresponding skeletal points are displaced in time series according to the coordinate displacements stored for the determined emotional polarity.
Emotional polarity "Positive" -> Character video with a gesture of raising the hands
Emotional polarity "Negative" -> Character video with a gesture of folding the hands
As a result, a character video with a gesture that imitates the user's emotional polarity is displayed.

FIG. 5 is an explanatory diagram showing the gesture generation function of the character video generation unit.
According to FIG. 5, a plurality of skeletal points are associated with the character, and the skeletal points are connected to one another by lines. The coordinates of the plurality of skeletal points, taken together as one set, are defined as the frame skeletal point coordinates.
A predetermined gesture is generated by displacing these skeletal points in time series. One example of video generation software for animating a character is Live2D (registered trademark).
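The idea of displacing skeletal points over time can be illustrated with the sketch below. It is not tied to Live2D or any real animation API; the point names, coordinates, and the convention that each stored displacement is relative to the base pose (rather than to the previous frame) are all assumptions made for illustration.

```python
# Hypothetical stored gesture data: per polarity, a list of frames, each
# mapping a skeletal point name to a (dx, dy) offset from the base pose.
GESTURES = {
    "positive": [{"right_wrist": (0, -5)}, {"right_wrist": (0, -10)}],
    "negative": [{"right_wrist": (-3, 0)}, {"right_wrist": (-6, 0)}],
}


def animate(base_pose, displacements):
    """Return one set of frame skeletal point coordinates per time step.

    base_pose: dict of point name -> (x, y).
    Points without a stored displacement stay at their base position.
    """
    frames = []
    for frame_disp in displacements:
        frame = {}
        for name, (x, y) in base_pose.items():
            dx, dy = frame_disp.get(name, (0, 0))
            frame[name] = (x + dx, y + dy)
        frames.append(frame)
    return frames


base = {"right_shoulder": (8, 10), "right_wrist": (10, 20)}
raise_hand = animate(base, GESTURES["positive"])
```

Each element of `raise_hand` corresponds to one set of frame skeletal point coordinates; a renderer would then draw the character from those coordinates.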

[Speech synthesis unit 14]
The speech synthesis unit 14 converts the response sentence input from the dialogue control unit 10 into a response voice, and outputs the response voice from the speaker.
The speech synthesis unit 14 can also control the response voice according to the "speech tone data" output from the voice analysis unit 11.

FIG. 6 is an explanatory diagram showing the processing of the speech synthesis unit and the video display control unit.
According to FIG. 6, the speech synthesis unit 14 receives the response sentence "A movie, how nice" from the dialogue control unit 10 and the speech tone data "fast" from the voice analysis unit 11. The speech synthesis unit 14 then synthesizes the response sentence "A movie, how nice" into a voice signal in a "fast" tone. Specifically, the speech synthesis unit 14 can apply N2 to generate a response voice that matches the speech tone data (for example, see Non-Patent Document 7).
The speech synthesis unit 14 also instructs the video display control unit 15 to synchronize with the synthesized voice signal.

As another embodiment, when there are only a few patterns of backchannel response sentences, response voices based on different speech tone data may be registered in advance for each backchannel response. Specifically, for a single backchannel response such as "un-un" ("uh-huh"), a plurality of response voices with different speech tone data are prepared (for example, see Non-Patent Document 8). When replying with the backchannel response "un-un," the response voice that matches the "speech tone data" output from the voice analysis unit 11 can be selected.
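Selecting a pre-registered backchannel voice by tone could look like the sketch below. The tone labels other than "fast," the file names, and the fallback to a default tone are hypothetical; the description only states that several voices with different speech tone data are prepared per backchannel response.

```python
# Hypothetical registry: (backchannel text, tone label) -> recorded audio file.
BACKCHANNEL_VOICES = {
    ("un-un", "fast"): "unun_fast.wav",
    ("un-un", "normal"): "unun_normal.wav",
    ("un-un", "slow"): "unun_slow.wav",
}


def select_backchannel_voice(text, tone, default_tone="normal"):
    """Pick the recording matching the user's tone, else the default tone.

    Returns None when the backchannel text has no registered recording.
    """
    voice = BACKCHANNEL_VOICES.get((text, tone))
    if voice is None:
        voice = BACKCHANNEL_VOICES.get((text, default_tone))
    return voice
```

Because the set of backchannel responses is small, a lookup of prepared recordings avoids synthesizing the same short response repeatedly.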

[Video display control unit 15]
The video display control unit 15 displays the character video on the display in synchronization with the response voice output from the speech synthesis unit 14.
Specifically, it speeds up or slows down the playback speed so that the mouth portion of the character video changes in synchronization with the response voice. Synchronizing the character video with the response voice keeps the user from feeling that the dialogue with the character is unnatural.

FIG. 7 is a functional configuration diagram of a dialogue device that further includes a user video analysis unit and a visual expression determination unit.

[Video acquisition unit 102]
The dialogue device 1 further includes a camera, which captures video in which the user appears. The video is input to the video acquisition unit 102.
The video acquisition unit 102 may, for example, synchronize with the audio acquisition unit 101 and cut out only the video for the user's utterance periods. In the case of FIG. 3 described above, only the video for the user's utterance periods of 2 to 7 seconds and 14 to 18 seconds is extracted.
The extracted video is output to the user video analysis unit 16.
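Cutting the video down to the utterance periods can be sketched as a filter over frame timestamps. The function name, the one-frame-per-second timestamps, and the inclusive interval endpoints are assumptions for illustration; only the 2 to 7 second and 14 to 18 second periods come from the description.

```python
def frames_in_utterances(frame_times, intervals):
    """Keep only frame timestamps that fall inside an utterance period.

    frame_times: iterable of timestamps in seconds.
    intervals:   list of (start, end) utterance periods, endpoints inclusive.
    """
    return [
        t for t in frame_times
        if any(start <= t <= end for start, end in intervals)
    ]


# Utterance periods from the FIG. 3 example, with one frame per second.
kept = frames_in_utterances(range(0, 20), [(2, 7), (14, 18)])
```

In practice the intervals would come from the voice activity detected on the audio side, which is why the two acquisition units need a shared time base.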

[User video analysis unit 16]
The user video analysis unit 16 receives as input a video in which the user appears, and determines the emotional polarity of that user.
The user video analysis unit 16 estimates the emotional polarity from the video in the following ways.
<Estimation of emotional polarity from user's facial expression>
<Estimation of emotional polarity from user's gestures>

<Estimation of emotional polarity from user's facial expression>
The user video analysis unit 16 uses a learning engine trained on data that associates the facial expressions of a plurality of users with emotional polarities, and estimates the user's emotional polarity from the face in the video in which the speaking user appears.
There is an existing technology that determines three emotional polarities (positive/negative/neutral) from a user's facial image (for example, see Non-Patent Document 9). There is also an existing technology that determines seven emotional polarities (joy, sadness, anger, contempt, disgust, fear, surprise) from a user's facial image (for example, see Non-Patent Document 10). A video consists of a plurality of frames (still images); the emotional polarity may be determined from the user's facial expression in each frame, and the emotional polarity that appears most frequently may be adopted.
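The per-frame vote can be sketched as follows. The frame classifier is passed in as a function because the description relies on existing technology for it; the function name and the handling of an empty video (returning `None`, meaning no polarity was determined) are assumptions.

```python
from collections import Counter


def video_polarity(frames, classify_frame):
    """Adopt the emotional polarity that appears most often across frames.

    classify_frame: callable mapping one frame to a polarity label
                    (stands in for a trained facial-expression model).
    Returns None if there are no frames to classify.
    """
    labels = [classify_frame(frame) for frame in frames]
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]
```

Voting over frames smooths out momentary expressions, at the cost of ignoring how the expression changes during the utterance.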

<Estimation of emotional polarity from user's gestures>
The user video analysis unit 16 uses a learning engine trained on data that associates, for a plurality of users, the time-series coordinate displacements of a plurality of skeletal points with emotional polarities. It estimates the user's emotional polarity from the time-series coordinate displacements of the skeletal points in the video in which the speaking user appears. Each frame of the video may be input to the learning engine, and the emotional polarity that appears most frequently may be adopted.
For gesture determination, there is an existing technology that captures the user's rough skeleton in each frame and estimates the coordinates (x, y) of a plurality of skeletal points such as the right shoulder, right wrist, and right knee (for example, see Non-Patent Document 11).

[Visual expression determination unit 17]
The visual expression determination unit 17 makes the following determination depending on whether the user video analysis unit 16 was able to determine an emotional polarity.
(1) When the user video analysis unit 16 could not determine the emotional polarity
The visual expression determination unit 17 causes the character video generation unit 13 to generate a character video for "the emotional polarity determined by the uttered sentence emotion determination unit 12."
(2) When the user video analysis unit 16 could determine the emotional polarity
The visual expression determination unit 17 causes the character video generation unit 13 to generate a character video for "the emotional polarity determined by the user video analysis unit 16."
Note that when the emotional polarity has three categories, "could not determine the emotional polarity" means that the result was neutral, and "could determine the emotional polarity" means that the result was positive or negative.
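The selection rule for the three-category case can be written compactly as below. The function name is hypothetical, and treating a missing video result (`None`) the same as neutral is an added assumption.

```python
def choose_visual_polarity(video_polarity, text_polarity):
    """Pick the polarity that drives the character video.

    The video-based result wins when it is positive or negative;
    a neutral (or missing) video result falls back to the text-based one.
    """
    if video_polarity is None or video_polarity == "neutral":
        return text_polarity
    return video_polarity
```

For example, a neutral face combined with the positive utterance "I'm going to see a movie" still yields a positive character video, because the text-based polarity is used as the fallback.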

As described above in detail, the dialogue program, device, and method of the present invention make it possible to express a character's attentive listening in accordance with the user's emotions.
According to the present invention, the visual expression of the character itself, as a dialogue agent, can be changed according to the user's emotional polarity estimated from the user's uttered voice (or video). Specifically, the visual expression of the character changes to imitate the emotional polarity estimated from the user's uttered sentence. Because the character imitates not only the emotional polarity of the user's uttered sentence but also the tone of the user's uttered voice, the user can feel a sense of attunement with the character. Furthermore, even when the emotional polarity cannot be estimated from the facial expressions or gestures in the user's video, the emotional polarity is estimated from the user's uttered sentence, so a character video corresponding to that emotional polarity can still be played.
By interacting with a character that is shown listening attentively with the same emotional polarity as the user's own, the user can come to like that character more.

With respect to the various embodiments of the present invention described above, those skilled in the art can readily make various changes, modifications, and omissions within the scope of the technical idea and viewpoint of the present invention. The above description is merely an example and is not intended to be limiting in any way. The invention is limited only by the claims and their equivalents.

1 Dialogue device
10 Dialogue control unit
101 Audio acquisition unit
102 Video acquisition unit
11 Voice analysis unit
12 Uttered sentence emotion determination unit
13 Character video generation unit
14 Speech synthesis unit
15 Video display control unit
16 User video analysis unit
17 Visual expression determination unit
2 Terminal

Claims

A program that causes a computer to function as a dialogue agent responsive to a user's emotions, the program causing the computer to function as:
voice analysis means for converting the user's uttered voice into an uttered sentence;
uttered sentence emotion determination means for determining an emotional polarity from the user's uttered sentence;
dialogue control means for generating a scenario-based response sentence when the user's uttered sentence follows a dialogue scenario, and, when it does not, generating a listening-type response sentence when a positive or neutral emotional polarity is obtained by the uttered sentence emotion determination means;
user video analysis means for receiving as input a video in which the speaking user appears and determining the emotional polarity of the user;
visual expression determination means for outputting the emotional polarity determined by the uttered sentence emotion determination means when the emotional polarity could not be determined by the user video analysis means, and conversely outputting the emotional polarity determined by the user video analysis means when it could be determined;
character video generation means for storing a character video for each emotional polarity in advance and generating the character video corresponding to the emotional polarity output from the visual expression determination means;
speech synthesis means for converting the response sentence into a response voice and outputting the response voice from a speaker; and
video display control means for displaying the character video on a display in synchronization with the response voice.

The program according to claim 1, wherein the program causes the computer to function such that
the voice analysis means further detects speech tone data from the user's uttered voice, and
the speech synthesis means generates the response voice in synchronization with the speech tone data.

The program according to claim 2, wherein the speech tone data is the number of characters per unit time and/or the volume level.

The program according to any one of claims 1 to 3, wherein the character video generation means stores in advance character videos with a different facial expression for each emotional polarity, and generates the character video whose facial expression corresponds to the determined emotional polarity.

The program according to any one of claims 1 to 3, wherein the character video generation means stores in advance, for each emotional polarity, the time-series coordinate displacements of a plurality of skeletal points, and generates a character video in which the character's corresponding skeletal points are displaced in time series according to the coordinate displacements for the determined emotional polarity.

The program according to claim 5, wherein the user video analysis means uses a learning engine trained on data that associates the facial expressions of a plurality of users with emotional polarities, and estimates the user's emotional polarity from the face in the video in which the speaking user appears.

The program according to any one of claims 1 to 6, wherein the user video analysis means uses a learning engine trained on data that associates, for a plurality of users, the time-series coordinate displacements of a plurality of skeletal points with emotional polarities, and estimates the user's emotional polarity from the time-series coordinate displacements of the skeletal points in the video in which the speaking user appears.

The program according to any one of claims 1 to 7, wherein the listening-type response sentence generated by the dialogue control means is a backchannel response or a parroted repetition.

A dialogue device that functions as a dialogue agent responsive to a user's emotions, the dialogue device comprising:
voice analysis means for converting the user's uttered voice into an uttered sentence;
uttered sentence emotion determination means for determining an emotional polarity from the user's uttered sentence;
dialogue control means for generating a scenario-based response sentence when the user's uttered sentence follows a dialogue scenario, and, when it does not, generating a listening-type response sentence when a positive or neutral emotional polarity is obtained by the uttered sentence emotion determination means;
user video analysis means for receiving as input a video in which the speaking user appears and determining the emotional polarity of the user;
visual expression determination means for outputting the emotional polarity determined by the uttered sentence emotion determination means when the emotional polarity could not be determined by the user video analysis means, and conversely outputting the emotional polarity determined by the user video analysis means when it could be determined;
character video generation means for storing a character video for each emotional polarity in advance and generating the character video corresponding to the emotional polarity output from the visual expression determination means;
speech synthesis means for converting the response sentence into a response voice and outputting the response voice from a speaker; and
video display control means for displaying the character video on a display in synchronization with the response voice.

A dialogue method for a device that functions as a dialogue agent responsive to a user's emotions, wherein the device executes:
a first step of converting the user's uttered voice into an uttered sentence;
a second step of determining an emotional polarity from the user's uttered sentence;
a third step of generating a scenario-based response sentence when the user's uttered sentence follows a dialogue scenario, and, when it does not, generating a listening-type response sentence when a positive or neutral emotional polarity is obtained by the uttered sentence emotion determination means;
a fourth step of receiving as input a video in which the speaking user appears and determining the emotional polarity of the user;
a fifth step of outputting the emotional polarity determined in the third step when the emotional polarity could not be determined in the fourth step, and conversely outputting the emotional polarity determined in the fourth step when it could be determined;
a sixth step of generating, from character videos stored in advance for each emotional polarity, the character video corresponding to the emotional polarity output in the fifth step;
a seventh step of converting the response sentence into a response voice and outputting the response voice from a speaker; and
an eighth step of displaying the character video on a display in synchronization with the response voice.