JP2004258290A

JP2004258290A - Apparatus and method for speech processing, recording medium, and program

Info

Publication number: JP2004258290A
Application number: JP2003048559A
Authority: JP
Inventors: Hideki Shimomura (下村 秀樹)
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2003-02-26
Filing date: 2003-02-26
Publication date: 2004-09-16

Abstract

PROBLEM TO BE SOLVED: To guide a speaking user toward an utterance speed at which speech recognition accuracy is high. SOLUTION: A speech recognition part 223 recognizes the user's speech. An interaction management part 231 of an action determining mechanism part 203 determines the utterance contents of a robot according to the speech recognition result. An utterance speed detection part 206 finds the utterance speed of the user from the speech recognition result. An utterance speed correction part 207 sets the utterance speed of the robot to a value smaller than the user's utterance speed when the user's utterance speed is larger than the speed at which speech recognition accuracy is high, and to a value larger than the user's utterance speed when the user's utterance speed is smaller than that speed. A speech synthesis part 208 synthesizes speech from the utterance contents determined by the interaction management part 231 at the utterance speed set by the utterance speed correction part 207, and outputs it from a speaker 72. The invention is applicable to robots. COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、音声処理装置および方法、記録媒体、並びにプログラムに関し、特に、ユーザの発話速度を、音声認識の精度が良い範囲に、自然に導くようにした音声処理装置および方法、記録媒体、並びにプログラムに関する。
【０００２】
【従来の技術】
コンピュータが、ユーザにより発せられた音声を音声認識し、この音声認識結果に基づいて合成音を出力する対話システムにおいては、より自然な対話を可能にするために、スピーカから出力する合成音の発話速度をユーザの発話速度に同調させるようにすることが知られている（例えば、特許文献１参照）。
【０００３】
【特許文献１】
特開平５−２１６６１８号公報（番号０１０３の段落）
【０００４】
【発明が解決しようとする課題】
しかしながら、一般的に、音声認識処理においては、認識精度が最も良くなるユーザの発話速度が存在するため、ユーザの発話速度が、速すぎたり遅すぎたりすると、誤った認識結果を出力してしまうという課題があった。
【０００５】
ユーザの発話速度が、速すぎたり遅すぎたりした場合に、ユーザに、例えば「もう少し、ゆっくり喋って下さい」のように、発話速度を適切な速度に変更するように要求することも考えられるが、頻繁に要求した場合、ユーザに不快感を催させる可能性がある。
【０００６】
本発明はこのような状況に鑑みてなされたものであり、ユーザの発話速度を、音声認識の精度が良い範囲に、自然に導くことができるようにするものである。
【０００７】
【課題を解決するための手段】
本発明の音声処理装置は、ユーザにより発話された音声を認識する音声認識手段と、音声認識手段により認識され、生成された単語列に基づいて、ユーザの発話速度を算出する算出手段と、算出手段により算出されたユーザの発話速度を、精度良く音声認識することができる発話速度と比較して、音声処理装置から出力される合成音の発話速度を設定する設定手段と、出力される合成音の発話内容を決定する決定手段と、決定手段により決定された発話内容に基づいて、設定手段により設定された発話速度の合成音を出力する出力手段とを備えることを特徴とする。
【０００８】
前記設定手段には、前記ユーザの前記発話速度が、精度良く音声認識することができる前記発話速度より大きい場合、出力される前記合成音の前記発話速度を前記ユーザの前記発話速度より小さく設定し、前記ユーザの前記発話速度が、精度良く音声認識することができる前記発話速度より小さい場合、出力される前記合成音の前記発話速度を前記ユーザの前記発話速度より大きく設定するようにさせることができる。
【０００９】
前記設定手段には、前記ユーザの前記発話速度が、精度良く音声認識することができる前記発話速度の範囲内にある場合、出力される前記合成音の前記発話速度を前記ユーザの前記発話速度と同一の値に設定し、前記ユーザの前記発話速度が、精度良く音声認識することができる前記発話速度の範囲の上限値より大きい場合、出力される前記合成音の前記発話速度を前記ユーザの前記発話速度より小さく設定し、前記ユーザの前記発話速度が、精度良く音声認識することができる前記発話速度の範囲の下限値より小さい場合、出力される前記合成音の前記発話速度を前記ユーザの前記発話速度より大きく設定するようにさせることができる。
【００１０】
本発明の音声処理方法は、ユーザにより発話された音声を認識する音声認識ステップと、音声認識ステップの処理により認識され、生成された単語列に基づいて、ユーザの発話速度を算出する算出ステップと、算出ステップの処理により算出されたユーザの発話速度を、精度良く音声認識することができる発話速度と比較して、音声処理装置から出力される合成音の発話速度を設定する設定ステップと、出力される合成音の発話内容を決定する決定ステップと、決定ステップの処理により決定された発話内容に基づいて、設定ステップの処理により設定された発話速度の合成音を出力する出力ステップとを含むことを特徴とする。
【００１１】
本発明の記録媒体のプログラムは、ユーザにより発話された音声を認識する音声認識ステップと、音声認識ステップの処理により認識され、生成された単語列に基づいて、ユーザの発話速度を算出する算出ステップと、算出ステップの処理により算出されたユーザの発話速度を、精度良く音声認識することができる発話速度と比較して、音声処理装置から出力される合成音の発話速度を設定する設定ステップと、設定ステップの処理により設定された発話速度で出力される合成音の発話内容を決定する決定ステップとを含むことを特徴とする。
【００１２】
本発明のプログラムは、ユーザと対話する音声処理装置を制御するコンピュータに、ユーザにより発話された音声を認識する音声認識ステップと、音声認識ステップの処理により認識され、生成された単語列に基づいて、ユーザの発話速度を算出する算出ステップと、算出ステップの処理により算出されたユーザの発話速度を、精度良く音声認識することができる発話速度と比較して、出力される合成音の発話速度を設定する設定ステップと、音声処理装置から、設定ステップの処理により設定された発話速度で出力される合成音の発話内容を決定する決定ステップとを実行させることを特徴とする。
【００１３】
本発明の音声処理装置および方法、記録媒体、並びにプログラムにおいては、ユーザにより発話された音声が認識され、認識され、生成された単語列に基づいて、ユーザの発話速度が算出され、算出されたユーザの発話速度が、精度良く音声認識することができる発話速度と比較され、音声処理装置から出力される合成音の発話速度が設定され、出力される合成音の発話内容が決定され、決定された発話内容に基づいて、設定された発話速度の合成音が出力される。
【００１４】
本発明は、例えばロボットに適用することができる。
【００１５】
【発明の実施の形態】
以下、図を参照して、本発明の実施の形態について説明する。
【００１６】
図１は、本発明を適用した２足歩行型のロボット１の正面方向の斜視図であり、図２は、ロボット１の背面方向からの斜視図である。また、図３は、ロボット１の軸構成について説明するための図である。
【００１７】
ロボット１は、胴体部ユニット１１、胴体部ユニット１１の上部に配設された頭部ユニット１２、胴体部ユニット１１の上部左右の所定位置に取り付けられた腕部ユニット１３Ａおよび腕部ユニット１３Ｂ、並びに胴体部ユニット１１の下部左右の所定位置に取り付けられた脚部ユニット１４Ａおよび脚部ユニット１４Ｂにより構成されている。腕部ユニット１３Ａおよび腕部ユニット１３Ｂは、同様の構成とされる。また、脚部ユニット１４Ａおよび脚部ユニット１４Ｂも、同様の構成とされる。頭部ユニット１２には、頭部センサ５１が設けられている。
【００１８】
胴体部ユニット１１は、体幹上部を形成するフレーム２１および体幹下部を形成する腰ベース２２が腰関節機構２３を介して連結することにより構成されている。胴体部ユニット１１は、体幹下部の腰ベース２２に固定された腰関節機構２３のアクチュエータＡ１、および、アクチュエータＡ２をそれぞれ駆動することによって、体幹上部を、図３に示す直交するロール軸２４およびピッチ軸２５の回りに、それぞれ独立に回転させることができるようになされている。
【００１９】
頭部ユニット１２は、フレーム２１の上端に固定された肩ベース２６の上面中央部に首関節機構２７を介して取り付けられており、首関節機構２７のアクチュエータＡ３、およびアクチュエータＡ４をそれぞれ駆動することによって、図３に示す直交するピッチ軸２８およびヨー軸２９の回りに、それぞれ独立に回転させることができるようになされている。
【００２０】
腕部ユニット１３Ａ、および腕部ユニット１３Ｂは、肩関節機構３０を介して肩ベース２６の左右にそれぞれ取り付けられており、対応する肩関節機構３０のアクチュエータＡ５、および、アクチュエータＡ６をそれぞれ駆動することによって、図３に示す、直交するピッチ軸３１およびロール軸３２の回りに、それぞれを独立に回転させることができるようになされている。
【００２１】
この場合、腕部ユニット１３Ａ、および腕部ユニット１３Ｂは、上腕部を形成するアクチュエータＡ７の出力軸に、肘関節機構３３を介して、前腕部を形成するアクチュエータＡ８が連結され、前腕部の先端に手部３４が取り付けられることにより構成されている。
【００２２】
そして腕部ユニット１３Ａ、および腕部ユニット１３Ｂでは、アクチュエータＡ７を駆動することによって、前腕部を図３に示すヨー軸３５に対して回転させることができ、アクチュエータＡ８を駆動することによって、前腕部を図３に示すピッチ軸３６に対して回転させることができるようになされている。
【００２３】
脚部ユニット１４Ａ、および、脚部ユニット１４Ｂは、股関節機構３７を介して、体幹下部の腰ベース２２にそれぞれ取り付けられており、対応する股関節機構３７のアクチュエータＡ９乃至Ａ１１をそれぞれ駆動することによって、図３に示す、互いに直交するヨー軸３８、ロール軸３９、およびピッチ軸４０に対して、それぞれ独立に回転させることができるようになされている。
【００２４】
脚部ユニット１４Ａ、および、脚部ユニット１４Ｂにおいては、大腿部を形成するフレーム４１の下端が、膝関節機構４２を介して、下腿部を形成するフレーム４３に連結されるとともに、フレーム４３の下端が、足首関節機構４４を介して、足部４５に連結されている。
【００２５】
これにより脚部ユニット１４Ａ、および、脚部ユニット１４Ｂにおいては、膝関節機構４２を形成するアクチュエータＡ１２を駆動することによって、図３に示すピッチ軸４６に対して、下腿部を回転させることができ、また足首関節機構４４のアクチュエータＡ１３、および、アクチュエータＡ１４をそれぞれ駆動することによって、図３に示す直交するピッチ軸４７およびロール軸４８に対して、足部４５をそれぞれ独立に回転させることができるようになされている。
【００２６】
脚部ユニット１４Ａ、および脚部ユニット１４Ｂの、足部４５の足底面（床と接する面）には、それぞれ足底センサ９１（図５）が配設されており、足底センサ９１のオン・オフに基づいて、足部４５が床に接地しているか否かが判別される。
【００２７】
また、胴体部ユニット１１の体幹下部を形成する腰ベース２２の背面側には、後述するメイン制御部６１（図４）などを内蔵したボックスである、制御ユニット５２が配設されている。
【００２８】
図４は、ロボット１のアクチュエータとその制御系等について説明する図である。
【００２９】
制御ユニット５２には、ロボット１全体の動作制御をつかさどるメイン制御部６１、並びに、後述するＤ／Ａ変換部１０１、Ａ／Ｄ変換部１０２、バッテリ１０３、バッテリセンサ１３１、加速度センサ１３２、通信部１０５、および外部メモリ１０６（いずれも図５）等を含む周辺回路６２が収納されている。
【００３０】
そしてこの制御ユニット５２は、各構成ユニット（胴体部ユニット１１、頭部ユニット１２、腕部ユニット１３Ａおよび腕部ユニット１３Ｂ、並びに、脚部ユニット１４Ａおよび脚部ユニット１４Ｂ）内にそれぞれ配設されたサブ制御部６３Ａ乃至６３Ｄと接続されており、サブ制御部６３Ａ乃至６３Ｄに対して必要な電源電圧を供給したり、サブ制御部６３Ａ乃至６３Ｄと通信を行う。
【００３１】
また、サブ制御部６３Ａ乃至６３Ｄは、対応する構成ユニット内のアクチュエータＡ１乃至Ａ１４と、それぞれ接続されており、メイン制御部６１から供給された各種制御コマンドに基づいて、構成ユニット内のアクチュエータＡ１乃至Ａ１４を、指定された状態に駆動させるように制御する。
【００３２】
図５は、ロボット１の内部構成を示すブロック図である。
【００３３】
頭部ユニット１２には、このロボット１の「目」として機能するＣＣＤ（Ｃｈａｒｇｅ Ｃｏｕｐｌｅｄ Ｄｅｖｉｃｅ）カメラ８１、「耳」として機能するマイクロホン８２、頭部センサ５１などからなる外部センサ部７１、および、「口」として機能するスピーカ７２などがそれぞれ所定位置に配設され、制御ユニット５２内には、バッテリセンサ１３１および加速度センサ１３２などからなる内部センサ部１０４が配設されている。また、脚部ユニット１４Ａ、および脚部ユニット１４Ｂの足部４５の足底面には、このロボット１の「体性感覚」の１つとして機能する足底センサ９１が配設されている。
【００３４】
そして、外部センサ部７１のＣＣＤカメラ８１は、周囲の状況を撮像し、得られた画像信号を、Ａ／Ｄ変換部１０２を介して、メイン制御部６１に送出する。マイクロホン８２は、ユーザから音声入力として与えられる「歩け」、「とまれ」または「右手を挙げろ」等の各種命令音声を集音し、得られた音声信号を、Ａ／Ｄ変換部１０２を介して、メイン制御部６１に送出する。
【００３５】
また、頭部センサ５１は、例えば、図１および図２に示されるように頭部ユニット１２の上部に設けられており、ユーザからの「撫でる」や「叩く」といった物理的な働きかけにより受けた圧力を検出し、検出結果としての圧力検出信号を、Ａ／Ｄ変換部１０２を介して、メイン制御部６１に送出する。
【００３６】
足底センサ９１は、足部４５の足底面に配設されており、足部４５が床に接地している場合、接地信号を、Ａ／Ｄ変換部１０２を介して、メイン制御部６１に送出する。メイン制御部６１は、接地信号に基づいて、足部４５が床に接地しているか否かを判定する。足底センサ９１は、脚部ユニット１４Ａ、および脚部ユニット１４Ｂの両方の足部４５に配設されているため、メイン制御部６１は、接地信号に基づいて、ロボット１の両足が床に接地しているか、片足が床に接地しているか、両足とも床に接地していないかを判定することができる。
【００３７】
制御ユニット５２は、メイン制御部６１、Ｄ／Ａ変換部１０１、Ａ／Ｄ変換部１０２、バッテリ１０３、内部センサ部１０４、通信部１０５、および外部メモリ１０６等により構成される。
【００３８】
Ｄ／Ａ（Ｄｉｇｉｔａｌ／Ａｎａｌｏｇ）変換部１０１は、メイン制御部６１から供給されるデジタル信号をＤ／Ａ変換することによりアナログ信号とし、スピーカ７２に供給する。Ａ／Ｄ（Ａｎａｌｏｇ／Ｄｉｇｉｔａｌ）変換部１０２は、ＣＣＤカメラ８１、マイクロフォン８２、頭部センサ５１、および足底センサ９１が出力するアナログ信号をＡ／Ｄ変換することによりデジタル信号とし、メイン制御部６１に供給する。
【００３９】
内部センサ部１０４のバッテリセンサ１３１は、バッテリ１０３のエネルギ残量を所定の周期で検出し、検出結果をバッテリ残量検出信号として、メイン制御部６１に送出する。加速度センサ１３２は、ロボット１の移動について、３軸方向（ｘ軸、ｙ軸、およびｚ軸）の加速度を、所定の周期で検出し、検出結果を、加速度検出信号として、メイン制御部６１に送出する。
【００４０】
メイン制御部６１は、メイン制御部６１全体の動作を制御するＣＰＵ１１１と、ＣＰＵ１１１が各部を制御するために実行するＯＳ（Ｏｐｅｒａｔｉｎｇ Ｓｙｓｔｅｍ）１２１、アプリケーションプログラム１２２、その他の必要なデータ等が記憶されている内部メモリ１１２等を内蔵している。
【００４１】
メイン制御部６１は、外部センサ部７１のＣＣＤカメラ８１、マイクロホン８２および頭部センサ５１からそれぞれ供給される、画像信号、音声信号および圧力検出信号、並びに足底センサ９１から供給される接地信号（以下、これらをまとめて外部センサ信号Ｓ１と称する）と、内部センサ部１０４のバッテリセンサ１３１および加速度センサ１３２等からそれぞれ供給される、バッテリ残量検出信号および加速度検出信号（以下、これらをまとめて内部センサ信号Ｓ２と称する）に基づいて、ロボット１の周囲および内部の状況や、ユーザからの指令、または、ユーザからの働きかけの有無などを判断する。
【００４２】
そして、メイン制御部６１は、ロボット１の周囲および内部の状況や、ユーザからの指令、または、ユーザからの働きかけの有無の判断結果と、内部メモリ１１２に予め格納されている制御プログラム、あるいは、そのとき装填されている外部メモリ１０６に格納されている各種制御パラメータなどに基づいて、ロボット１の行動を決定し、決定結果に基づく制御コマンドＣＯＭを生成して、対応するサブ制御部６３Ａ乃至６３Ｄに送出する。サブ制御部６３Ａ乃至６３Ｄは、供給された制御コマンドＣＯＭに基づいて、アクチュエータＡ１乃至Ａ１４のうち、対応するものの駆動を制御するので、ロボット１は、例えば、頭部ユニット１２を上下左右に揺動させたり、腕部ユニット１３Ａ、あるいは、腕部ユニット１３Ｂを上に挙げたり、脚部ユニット１４Ａおよび脚部ユニット１４Ｂを交互に駆動させて、歩行するなどの機械的動作を行うことが可能となる。
【００４３】
また、メイン制御部６１は、必要に応じて、所定の音声信号をスピーカ７２に与えることにより、音声信号に基づく音声を外部に出力させる。更に、メイン制御部６１は、外見上の「目」として機能する、頭部ユニット１２の所定位置に設けられた、図示しないＬＥＤ（Ｌｉｇｈｔ Ｅｍｉｔｔｉｎｇ Ｄｉｏｄｅ）に対して駆動信号を出力することにより、ＬＥＤを点灯、消灯、または点滅させる。
【００４４】
このようにして、ロボット１は、周囲および内部の状況や、ユーザからの指令および働きかけの有無などに基づいて、自律的に行動することができるようになされている。
【００４５】
通信部１０５は、外部と無線または有線で通信するときの通信制御を行う。これにより、ＯＳ１２１やアプリケーションプログラム１２２がバージョンアップされたときに、通信部１０５を介して、そのバージョンアップされたＯＳやアプリケーションプログラムをダウンロードして、内部メモリ１１２に記憶させたり、また、所定のコマンドを、通信部１０５で受信し、ＣＰＵ１１１に与えることができるようになっている。
【００４６】
外部メモリ１０６は、例えば、ＥＥＰＲＯＭ（Ｅｌｅｃｔｒｉｃａｌｌｙ Ｅｒａｓａｂｌｅ Ｐｒｏｇｒａｍｍａｂｌｅ Ｒｅａｄ−ｏｎｌｙ Ｍｅｍｏｒｙ）等で構成され、胴体部ユニット１１に設けられた図示せぬスロットに対して、着脱可能になっている。外部メモリ１０６には、例えば、後述するような感情モデル等が記憶される。
【００４７】
次に、図６は、図５のメイン制御部６１の機能的構成例を示している。なお、図６に示す機能的構成は、メイン制御部６１が、内部メモリ１１２に記憶されたＯＳ１２１およびアプリケーションプログラム１２２を実行することで実現されるようになっている。また、図６では、Ｄ／Ａ変換部１０１およびＡ／Ｄ変換部１０２の図示を省略してある。
【００４８】
メイン制御部６１のセンサ入力処理部２０１は、頭部センサ５１、足底センサ９１、加速度センサ１３２、マイクロフォン８２、およびＣＣＤカメラ８１からそれぞれ与えられる圧力検出信号、接地信号、加速度検出信号、音声信号、画像信号等に基づいて、特定の外部状態や、ユーザからの特定の働きかけ、ユーザからの指示等を認識し、その認識結果を表す状態認識情報を、モデル記憶部２０２および行動決定機構部２０３に通知する。
【００４９】
すなわち、センサ入力処理部２０１は、圧力処理部２２１、加速度処理部２２２、音声認識部２２３、および画像認識部２２４を有している。
【００５０】
圧力処理部２２１は、頭部センサ５１から与えられる圧力検出信号を処理する。そして、圧力処理部２２１は、例えば、その処理の結果、所定の閾値以上で、かつ短時間の圧力を検出したときには、「叩かれた（しかられた）」と認識し、所定の閾値未満で、かつ長時間の圧力を検出したときには、「なでられた（ほめられた）」と認識して、その認識結果を、状態認識情報として、モデル記憶部２０２および行動決定機構部２０３に通知する。
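The classification rule in this paragraph can be sketched in Python as follows. This is only an illustration: the source speaks of "a predetermined threshold" and of "short" and "long" contact without giving values, so `PRESSURE_THRESHOLD`, `SHORT_PRESS_SEC`, and the function name `classify_touch` are hypothetical.

```python
from typing import Optional

# Hypothetical values: the source gives no numbers for the threshold or
# for the boundary between a short and a long contact.
PRESSURE_THRESHOLD = 0.5   # normalized sensor reading (assumed)
SHORT_PRESS_SEC = 0.3      # boundary between short and long contact (assumed)


def classify_touch(pressure: float, duration_sec: float) -> Optional[str]:
    """Map one head-sensor contact to a state-recognition label."""
    if pressure >= PRESSURE_THRESHOLD and duration_sec < SHORT_PRESS_SEC:
        return "hit (scolded)"        # strong and short
    if pressure < PRESSURE_THRESHOLD and duration_sec >= SHORT_PRESS_SEC:
        return "stroked (praised)"    # weak and long
    # The source does not say how the remaining combinations are treated.
    return None
```

Returning `None` for the combinations the text leaves open keeps the sketch from implying behaviour the source does not describe.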
【００５１】
また、圧力処理部２２１は、足底センサ９１から与えられる接地信号を処理する。そして、圧力処理部２２１は、例えば、その処理の結果、脚部ユニット１４Ａの足部４５に配設された足底センサ９１から接地信号が与えられている場合、脚部ユニット１４Ａの足部４５が床（地面）に接地していると認識し、足底センサ９１から接地信号が与えられていない場合、脚部ユニット１４Ａの足部４５が床（地面）に接地していないと認識する。脚部ユニット１４Ｂについても、同様にして、足底センサ９１からの接地信号に基づいて、脚部ユニット１４Ｂの足部４５が床（地面）に接地しているか否かを認識する。そして、圧力処理部２２１は、その認識結果を、状態認識情報として、モデル記憶部２０２および行動決定機構部２０３に通知する。
【００５２】
加速度処理部２２２は、加速度センサ１３２から与えられる加速度検出信号に基づいて、胴体部ユニット１１の加速度の方向および大きさを、状態認識情報として、モデル記憶部２０２および行動決定機構部２０３に通知する。
【００５３】
音声認識部２２３は、マイクロフォン８２から与えられる音声信号を対象とした音声認識を行う。そして、音声認識部２２３は、その音声認識結果としての、例えば、「歩け」、「伏せ」、「ボールを追いかけろ」等の単語列を、状態認識情報として、モデル記憶部２０２および行動決定機構部２０３に通知する。また、音声認識部２２３は、音声認識結果を、発話速度検出部２０６にも供給する。
【００５４】
画像認識部２２４は、ＣＣＤカメラ８１から与えられる画像信号を用いて、画像認識処理を行う。そして、画像認識部２２４は、その処理の結果、例えば、「赤い丸いもの」や、「地面に対して垂直なかつ所定高さ以上の平面」等を検出したときには、「ボールがある」や、「壁がある」等の画像認識結果を、状態認識情報として、モデル記憶部２０２および行動決定機構部２０３に通知する。
【００５５】
モデル記憶部２０２は、ロボット１の感情、本能、成長の状態を表現する感情モデル、本能モデル、成長モデルをそれぞれ記憶し、管理している。
【００５６】
ここで、感情モデルは、例えば、「うれしさ」、「悲しさ」、「怒り」、「楽しさ」等の感情の状態（度合い）を、所定の範囲（例えば、−１．０乃至１．０等）の値によってそれぞれ表し、センサ入力処理部２０１からの状態認識情報や時間経過等に基づいて、その値を変化させる。
【００５７】
本能モデルは、例えば、「食欲」、「睡眠欲」、「運動欲」等の本能による欲求の状態（度合い）を、所定の範囲の値によってそれぞれ表し、センサ入力処理部２０１からの状態認識情報や時間経過等に基づいて、その値を変化させる。
【００５８】
成長モデルは、例えば、「幼年期」、「青年期」、「熟年期」、「老年期」等の成長の状態（度合い）を、所定の範囲の値によってそれぞれ表し、センサ入力処理部２０１からの状態認識情報や時間経過等に基づいて、その値を変化させる。
【００５９】
モデル記憶部２０２は、上述のようにして感情モデル、本能モデル、成長モデルの値で表される感情、本能、成長の状態を、状態情報として、行動決定機構部２０３に送出する。
【００６０】
なお、モデル記憶部２０２には、センサ入力処理部２０１から状態認識情報が供給される他に、行動決定機構部２０３から、ロボット１の現在または過去の行動、具体的には、例えば、「長時間歩いた」などの行動の内容を示す行動情報が供給されるようになっており、モデル記憶部２０２は、同一の状態認識情報が与えられても、行動情報が示すロボット１の行動に応じて、異なる状態情報を生成するようになっている。
【００６１】
例えば、ロボット１が、ユーザに挨拶をし、ユーザに頭を撫でられた場合には、ユーザに挨拶をしたという行動情報と、頭を撫でられたという状態認識情報とが、モデル記憶部２０２に与えられ、この場合、モデル記憶部２０２では、「うれしさ」を表す感情モデルの値が増加される。
【００６２】
行動決定機構部２０３は、センサ入力処理部２０１からの状態認識情報や、モデル記憶部２０２からの状態情報、時間経過等に基づいて、次の行動を決定し、決定された行動の内容を、行動指令情報として、姿勢遷移機構部２０４に出力する。
【００６３】
すなわち、行動決定機構部２０３は、ロボット１がとり得る行動をステート（状態）（ｓｔａｔｅ）に対応させた有限オートマトンを、ロボット１の行動を規定する行動モデルとして管理している。そして、行動決定機構部２０３は、この行動モデルとしての有限オートマトンにおけるステートを、センサ入力処理部２０１からの状態認識情報や、モデル記憶部２０２における感情モデル、本能モデル、または成長モデルの値、時間経過等に基づいて遷移させ、遷移後のステートに対応する行動を、次にとるべき行動として決定する。
【００６４】
ここで、行動決定機構部２０３は、所定のトリガ（ｔｒｉｇｇｅｒ）があったことを検出すると、ステートを遷移させる。すなわち、行動決定機構部２０３は、例えば、現在のステートに対応する行動を実行している時間が所定時間に達したときや、特定の状態認識情報を受信したとき、モデル記憶部２０２から供給される状態情報が示す感情や、本能、成長の状態の値が所定の閾値以下または以上になったとき等に、ステートを遷移させる。
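As a rough illustration of the finite-automaton behaviour model and its triggers, the following Python sketch keeps a transition table keyed by (state, trigger). The state names, trigger names, and threshold values are all invented for the example; the source only names the kinds of trigger (elapsed time, specific recognition information, model values crossing a threshold).

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorModel:
    """Finite automaton whose states stand for robot behaviours."""
    state: str = "idle"
    # (current state, trigger) -> next state; entries are illustrative only
    transitions: dict = field(default_factory=lambda: {
        ("idle", "heard:walk"): "walking",
        ("walking", "timeout"): "idle",
        ("walking", "heard:stop"): "idle",
        ("idle", "joy_high"): "dancing",
        ("dancing", "timeout"): "idle",
    })

    def on_trigger(self, trigger: str) -> str:
        """Move to the next state if the trigger is defined for this state."""
        self.state = self.transitions.get((self.state, trigger), self.state)
        return self.state


def detect_triggers(recognized, elapsed_sec, joy,
                    max_sec=10.0, joy_threshold=0.8):
    """Collect triggers from recognition input, elapsed time, and emotion value."""
    triggers = []
    if recognized is not None:
        triggers.append(f"heard:{recognized}")
    if elapsed_sec >= max_sec:          # behaviour has run long enough
        triggers.append("timeout")
    if joy >= joy_threshold:            # emotion value crossed a threshold
        triggers.append("joy_high")
    return triggers
```

Because the table is keyed by the current state, the same trigger can lead to different behaviours, and extending `detect_triggers` with model values changes the destination for identical recognition input.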
【００６５】
なお、行動決定機構部２０３は、上述したように、センサ入力処理部２０１からの状態認識情報だけでなく、モデル記憶部２０２における感情モデルや、本能モデル、成長モデルの値等にも基づいて、行動モデルにおけるステートを遷移させることから、同一の状態認識情報が入力されても、感情モデルや、本能モデル、成長モデルの値（状態情報）によっては、ステートの遷移先は異なるものとなる。
【００６６】
なお、行動決定機構部２０３には、モデル記憶部２０２から供給される状態情報が示す感情や、本能、成長の状態に基づいて、遷移先のステートに対応する行動のパラメータとしての、例えば、歩行の速度や、手足を動かす際の動きの大きさおよび速度などを決定させることができ、この場合、それらのパラメータを含む行動指令情報が、姿勢遷移機構部２０４に送出される。
【００６７】
また、行動決定機構部２０３は対話管理部２３１を含んでおり、対話管理部２３１は、ロボット１に発話を行わせる行動指令情報（以下、ロボット１に発話を行わせる行動指令情報を発話指令情報とも称する）を、必要に応じて生成する。発話指令情報は、ロボット１が出力する発話内容のデータを含んでおり、音声合成部２０８に供給されるようになっている。音声合成部２０８は、発話指令情報を受信すると、その発話指令情報に従って音声合成を行い、得られた合成音を、発話速度修正部２０７から指令された発話速度で、スピーカ７２から出力させる。
【００６８】
姿勢遷移機構部２０４は、行動決定機構部２０３から供給される行動指令情報に基づいて、ロボット１の姿勢を、現在の姿勢から次の姿勢に遷移させるための姿勢遷移情報を生成し、これを制御機構部２０５に送出する。
【００６９】
ここで、現在の姿勢から次に遷移可能な姿勢は、例えば、胴体部ユニット１１、頭部ユニット１２、腕部ユニット１３Ａおよび１３Ｂ、並びに脚部ユニット１４Ａおよび１４Ｂの形状、重さ、各部の結合状態のようなロボット１の物理的形状と、関節が曲がる方向や角度のようなアクチュエータの機構とによって決定される。
【００７０】
また、次の姿勢としては、現在の姿勢から直接遷移可能な姿勢と、直接には遷移できない姿勢とがある。例えば、ロボット１は、手足を大きく投げ出して仰向けに寝転んでいる状態から、うつ伏せ状態へ直接遷移することはできるが、仰向けの状態から、起立状態へ直接遷移することはできず、一旦、手足を胴体近くに引き寄せてしゃがみこんだ姿勢になり、それから立ち上がるという２段階の動作が必要である。また、安全に実行できない姿勢も存在する。
【００７１】
このため、姿勢遷移機構部２０４は、直接遷移可能な姿勢をあらかじめ登録しておき、行動決定機構部２０３から供給される行動指令情報が、直接遷移可能な姿勢を示す場合には、その行動指令情報を制御機構部２０５に送出する。
【００７２】
一方、行動指令情報が、直接遷移不可能な姿勢を示す場合には、姿勢遷移機構部２０４は、遷移可能な他の姿勢に一旦遷移した後に、目的の姿勢まで遷移させるような姿勢遷移情報を生成し、制御機構部２０５に送出する。これによりロボットが、遷移不可能な姿勢を無理に実行しようとする事態や、転倒するような事態を回避することができるようになっている。
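One way to picture the routing through intermediate postures is a shortest-path search over a graph of directly reachable postures. The Python sketch below uses breadth-first search; the graph contents and the name `plan_posture_path` are assumptions made for illustration, and the source does not state which search method the posture transition mechanism part 204 uses.

```python
from collections import deque

# Directly reachable postures (illustrative graph, not taken from the source)
DIRECT = {
    "lying_on_back": ["lying_face_down", "crouching"],
    "lying_face_down": ["lying_on_back", "crouching"],
    "crouching": ["standing", "lying_on_back"],
    "standing": ["crouching", "walking"],
    "walking": ["standing"],
}


def plan_posture_path(current, target):
    """Breadth-first search for the shortest chain of direct transitions."""
    queue = deque([[current]])
    seen = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in DIRECT.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # target cannot be reached from the current posture
```

With this graph, a request to stand up from lying on the back is expanded into the two-stage sequence described above (crouch first, then stand).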
【００７３】
制御機構部２０５は、姿勢遷移機構部２０４からの姿勢遷移情報にしたがって、アクチュエータＡ１乃至Ａ１４を駆動するための制御信号を生成し、これを、サブ制御部６３Ａ乃至６３Ｄに送出する。サブ制御部６３Ａ乃至６３Ｄは、この制御信号に基づいて、適宜のアクチュエータを駆動し、ロボット１に種々の動作を実行させる。
【００７４】
発話速度検出部２０６は、センサ入力処理部２０１内の音声認識部２２３より、音声認識結果を供給され、供給された音声認識結果に基づいて、ユーザの発話速度を算出し、算出したユーザの発話速度を発話速度修正部２０７に通知する。
【００７５】
発話速度修正部２０７は、発話速度検出部２０６から通知された、ユーザの発話速度に基づいて、ロボット１が出力する合成音の発話速度を設定し、その設定情報を音声合成部２０８に通知する。
【００７６】
音声合成部２０８は、行動決定機構部２０３内の対話管理部２３１から発話指令情報を受信し、その発話指令情報にしたがって、例えば、規則音声合成を行う。ここで、音声合成部２０８は、発話速度修正部２０７で設定された発話速度になるように合成音を生成し、合成音をスピーカ７２に供給して出力させる。
【００７７】
図７は、センサ入力処理部２０１の音声認識部２２３の機能を示す機能ブロック図である。
【００７８】
図５のマイクロフォン８２およびＡ／Ｄ変換部１０２を介して、音声認識部２２３に入力される音声データは、時刻情報取得部２５１に供給される。
【００７９】
時刻情報取得部２５１は、音声がマイクロフォン８２により集音された時点の現在時刻を、内蔵する内部時計より取得し、マイクロフォン８２により集音され、Ａ／Ｄ変換部１０２によりＡ／Ｄ変換された音声データに付加する。これにより、音声データには、その音声データが集音された時刻を示す時刻情報が付加されることになる。時刻情報取得部２５１は、時刻情報が付加された音声データを特徴抽出部２５２に供給する。
【００８０】
特徴抽出部２５２は、時刻情報取得部２５１からの音声データについて、適当なフレームごとに音響分析処理を施し、これにより、例えば、ＭＦＣＣ（Ｍｅｌ Ｆｒｅｑｕｅｎｃｙ Ｃｅｐｓｔｒｕｍ Ｃｏｅｆｆｉｃｉｅｎｔ）等の特徴量としての特徴ベクトルを抽出する。なお、特徴抽出部２５２では、その他、例えば、スペクトルや、線形予測係数、ケプストラム係数、線スペクトル対等の特徴ベクトル（特徴パラメータ）を抽出することが可能である。
【００８１】
特徴抽出部２５２においてフレームごとに得られる特徴ベクトルは、特徴ベクトルバッファ２５３に順次供給されて記憶される。従って、特徴ベクトルバッファ２５３では、フレームごとの特徴ベクトルが時系列に記憶されていく。
【００８２】
なお、特徴ベクトルバッファ２５３は、例えば、ある発話の開始から終了まで（音声区間）に得られる時系列の特徴ベクトルを記憶する。
【００８３】
マッチング部２５４は、特徴ベクトルバッファ２５３に記憶された特徴ベクトルを用いて、音響モデルデータベース２５５、辞書データベース２５６、および文法データベース２５７を必要に応じて参照しながら、マイクロフォン８２に入力された音声（入力音声）を、例えば、連続分布ＨＭＭ法等に基づいて音声認識する。
【００８４】
即ち、音響モデルデータベース２５５は、音声認識する音声の言語における個々の音素や音節などの所定の単位（ＰＬＵ（Ｐｈｏｎｅｔｉｃ−Ｌｉｎｇｕｉｓｔｉｃ−Ｕｎｉｔｓ））ごとの音響的な特徴を表す音響モデルのセットを記憶している。ここでは、連続分布ＨＭＭ法に基づいて音声認識を行うので、音響モデルとしては、例えば、ガウス分布等の確率密度関数を用いたＨＭＭ（Ｈｉｄｄｅｎ Ｍａｒｋｏｖ Ｍｏｄｅｌ）が用いられる。辞書データベース２５６は、認識対象の各単語（語彙）について、その発音に関する情報（音韻情報）が記述された単語辞書を記憶している。文法データベース２５７は、辞書データベース２５６の単語辞書に登録されている各単語が、どのように連鎖する（つながる）かを記述した文法規則（言語モデル）を記憶している。ここで、文法規則としては、例えば、文脈自由文法（ＣＦＧ）や、正規文法（ＲＧ）、統計的な単語連鎖確率（Ｎ−ｇｒａｍ）などに基づく規則を用いることができる。
【００８５】
マッチング部２５４は、辞書データベース２５６の単語辞書を参照することにより、音響モデルデータベース２５５に記憶されている音響モデルを接続することで、単語の音響モデル（単語モデル）を構成する。さらに、マッチング部２５４は、幾つかの単語モデルを、文法データベース２５７に記憶された文法規則を参照することにより接続し、そのようにして接続された単語モデルを用いて、時系列の特徴ベクトルとのマッチングを、連続分布ＨＭＭ法によって行い、マイクロフォン８２に入力された音声を認識する。即ち、マッチング部２５４は、上述したようにして構成された各単語モデルの系列から、特徴ベクトルバッファ２５３に記憶された時系列の特徴ベクトルが観測される尤度を表すスコアを計算する。そして、マッチング部２５４は、例えば、そのスコアが最も高い単語モデルの系列を検出し、その単語モデルの系列に対応する単語列を、音声の認識結果として出力する。
【００８６】
なお、ここでは、ＨＭＭ法により音声認識が行われるため、マッチング部２５４は、音響的には、接続された単語モデルに対応する単語列について、各特徴ベクトルの出現確率を累積し、その累積値をスコアとする。
【００８７】
即ち、マッチング部２５４におけるスコア計算は、音響モデルデータベース２５５に記憶された音響モデルによって与えられる音響的なスコア（以下、適宜、音響スコアという）と、文法データベース２５７に記憶された文法規則によって与えられる言語的なスコア（以下、適宜、言語スコアという）とを総合評価することで行われる。
【００８８】
具体的には、音響スコアは、例えば、ＨＭＭ法による場合には、単語モデルを構成する音響モデルから、特徴抽出部２５２が出力する特徴ベクトルの系列が観測される確率（出現する確率）に基づいて、単語ごとに計算される。また、言語スコアは、例えば、バイグラムによる場合には、注目している単語と、その単語の直前の単語とが連鎖（連接）する確率に基づいて求められる。そして、各単語についての音響スコアと言語スコアとを総合評価して得られる最終的なスコア（以下、適宜、最終スコアという）に基づいて、音声認識結果が確定される。
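The combination of acoustic and language scores can be illustrated with a small Python sketch. The source only says the two scores are "evaluated together", so the additive log-score form, the `lm_weight` parameter, the floor probability for unseen bigrams, and the function names are assumptions for illustration; the per-word acoustic scores are given as precomputed numbers rather than derived from an HMM.

```python
import math


def final_score(words, acoustic_logprob, bigram_prob, lm_weight=1.0):
    """Sum per-word acoustic log-scores and bigram log-probabilities."""
    score = 0.0
    prev = "<s>"  # sentence-start symbol (assumed convention)
    for w in words:
        score += acoustic_logprob[w]
        # Unseen bigrams get a small floor probability (assumed smoothing).
        score += lm_weight * math.log(bigram_prob.get((prev, w), 1e-6))
        prev = w
    return score


def best_hypothesis(hypotheses, acoustic_logprob, bigram_prob):
    """Return the word sequence with the highest final score."""
    return max(hypotheses,
               key=lambda ws: final_score(ws, acoustic_logprob, bigram_prob))
```

In this form a hypothesis that is acoustically slightly better can still lose if its word chain is improbable under the bigram model.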
【００８９】
ここで、音声認識部２２３は、文法データベース２５７を設けずに構成することも可能である。但し、文法データベース２５７に記憶された規則によれば、接続する単語モデルが制限され、その結果、マッチング部２５４における音響スコアの計算の対象とする単語数が限定されるので、マッチング部２５４の計算量を低減し、処理速度を向上させることができる。
【００９０】
時刻情報付加部２５８は、マッチング部２５４から出力された、音声認識結果としての単語列に、その単語列が発話された時刻を付加する。すなわち、時刻情報付加部２５８は、時刻情報取得部２５１により取得された、音声データの発話時刻に基づいて、単語列に対応する音声の発話開始時刻および発話終了時刻を特定し、この発話開始時刻および発話終了時刻を時刻情報として単語列に付加する。そして、時刻情報付加部２５８は、時刻情報が付加された単語列を状態認識情報として、モデル記憶部２０２、行動決定機構部２０３、および発話速度検出部２０６に供給する。
【００９１】
次に、図８のフローチャートを参照して、ロボット１の対話処理、すなわちロボット１がユーザと対話する処理について説明する。
【００９２】
ステップＳ１において、Ａ／Ｄ変換部１０２は、マイクロフォン８２を介して、ユーザから音声入力があったか否かを判定し、ユーザから音声入力があるまで、ステップＳ１の処理をくり返して待機する。そして、ユーザから音声入力があった場合、処理はステップＳ２に進む。
【００９３】
ステップＳ２において、ロボット１は、音声認識処理を実行する。ここで、ステップＳ２の音声認識処理について、図９のフローチャートを参照して、詳細に説明する。
【００９４】
図９のステップＳ５１において、Ａ／Ｄ変換部１０２は、マイクロフォン８２から入力された音声信号をＡ／Ｄ変換して、センサ入力処理部２０１の音声認識部２２３に供給する。
【００９５】
ステップＳ５２において、時刻情報取得部２５１は、音声が入力された時点の現在時刻を音声入力時刻として取得し、Ａ／Ｄ変換部１０２から供給された音声データに付加して、特徴抽出部２５２に供給する。
【００９６】
ステップＳ５３において、特徴抽出部２５２は、時刻情報取得部２５１から供給された音声データについて、適当な時間間隔で音響分析処理を施し、音声の音響的特徴を表すパラメータ（特徴ベクトル）に変換し、特徴量として抽出する。なお、特徴抽出部２５２は、抽出した特徴ベクトルに、その特徴ベクトルの抽出元となる音声の発話時刻を付加する。抽出された特徴ベクトルは、特徴ベクトルバッファ２５３に順次供給され、記憶される。
【００９７】
ステップＳ５４において、マッチング部２５４は、特徴ベクトルバッファ２５３に記憶された時系列の特徴ベクトルを読み出し、音響モデルデータベース２５５に記憶された音響モデル、辞書データベース２５６に記憶された、音韻情報が記述された単語辞書、および文法データベース２５７に記憶された言語モデルを利用して、時系列の特徴ベクトルに対応する単語列を生成し、これを時刻情報付加部２５８に出力する。なお、マッチング部２５４は、特徴ベクトルに付加されていた、発話時刻を示す時刻情報を単語列に付加して、時刻情報付加部２５８に供給する。
【００９８】
ステップＳ５５において、時刻情報付加部２５８は、マッチング部２５４から供給された単語列、および発話時刻を示す時刻情報に基づいて、供給された単語列の元となる音声がマイクロフォン８２により集音された時刻、より具体的には、その単語列の発話開始時刻および発話終了時刻を特定する。そして、時刻情報付加部２５８は、単語列に、発話開始時刻および発話終了時刻を示す時刻情報を付加して、これを音声認識結果として、モデル記憶部２０２、行動決定機構部２０３、および発話速度検出部２０６に供給する。
【００９９】
以上のようにして、音声認識処理が実行される。
【０１００】
図８に戻り、ステップＳ２の音声認識処理の後、処理はステップＳ３に進む。
【０１０１】
ステップＳ３において、行動決定機構部２０３の対話管理部２３１は、音声認識部２２３から供給された音声認識結果の単語列に基づいて、ユーザの発話内容を解析し、ロボット１の発話内容のデータを生成し、このデータを含む発話指令情報を音声合成部２０８に供給する。
【０１０２】
ステップＳ４において、ロボット１は、発話速度検出処理を実行し、ユーザの発話速度（文字数／秒）を検出する。
【０１０３】
ここで、図１０のフローチャートを参照して、図８のステップＳ４の発話速度検出処理について、詳細に説明する。
【０１０４】
発話速度検出部２０６は、音声認識部２２３から単語列が供給されたとき、図１０のステップＳ７１において、供給された単語列に付加されている時刻情報に基づいて、供給された単語列の発話終了時刻からの経過時間の計測を開始する。すなわち、メイン制御部６１は、図示せぬ内部時計を有しており、発話速度検出部２０６は、この内部時計より現在時刻を取得し、この現在時刻と単語列の発話終了時刻との差をとって、発話終了時刻から経過時間を求める処理を開始する。
【０１０５】
ステップＳ７２において、発話速度検出部２０６は、ステップＳ７１で計測を開始した経過時間に基づいて、単語列の発話終了時刻から、予め設定された所定の時間が経過したか否かを判定し、単語列の発話終了時刻から、予め設定された所定の時間が経過していない場合、処理はステップＳ７３に進む。
【０１０６】
ステップＳ７３において、発話速度検出部２０６は、音声認識部２２３から次の単語列が入力されたか否かを判定し、音声認識部２２３から次の単語列が入力されていない場合、処理はステップＳ７２に戻り、ステップＳ７２以降の処理をくり返す。
【０１０７】
ステップＳ７３において、発話速度検出部２０６が、音声認識部２２３から次の単語列が入力されたと判定した場合、処理はステップＳ７４に進む。
【０１０８】
ステップＳ７４において、発話速度検出部２０６は、ステップＳ７１で計測を開始していた経過時間を０にリセットする。その後、処理は、ステップＳ７１に戻り、ステップＳ７１以降の処理をくり返す。
【０１０９】
以上のようにして、音声認識部２２３より単語列が供給されてから、所定の時間以内に次の単語列が供給された場合、ステップＳ７１乃至ステップＳ７４の処理がくり返される。これにより、ロボット１は、ユーザの発話が途切れるまで、ユーザの話を聞き続けるようにできる。
【０１１０】
そして、ステップＳ７２において、発話速度検出部２０６が、単語列の発話終了時刻から、所定の時間が経過したと判定した場合、処理はステップＳ７５に進む。
【０１１１】
ステップＳ７５において、発話速度検出部２０６は、単語列に付加されている時刻情報に基づいて、最初に入力された単語列の発話開始時刻から最後に入力された単語列の発話終了時刻までに要した発話時間を求める。すなわち、上記したように、ユーザが発話を開始してから途切れるまで、ステップＳ７１乃至ステップＳ７４の処理がくり返されるが、ユーザの発話が途切れたとき、ステップＳ７５で、発話速度検出部２０６は、ユーザが発話を開始してから途切れるまでの時間を求める。具体的には、発話速度検出部２０６は、ユーザの発話が開始されてから最初に入力された単語列の発話開始時刻、およびユーザの発話が途切れる直前に、最後に入力された単語列の発話終了時刻の差をとって、ユーザが発話を開始してから途切れるまでの時間を求める。
【０１１２】
ステップＳ７６において、発話速度検出部２０６は、ユーザが発話を開始してから途切れるまでに、音声認識部２２３から供給された全単語列の、合計文字数を求める。なお、合計文字数は、漢字カナ混じりの文字数ではなく、全てをカナにした場合の文字数である。また、文字数を求める代わりに、モーラ数を求めるようにしても良い。
【０１１３】
ステップＳ７７において、発話速度検出部２０６は、ステップＳ７６で求められた合計文字数を、ステップＳ７５で求められた発話時間で割り算して、発話速度（１秒当りにユーザが発話した文字数）を算出する。
【０１１４】
ステップＳ７８において、発話速度検出部２０６は、ステップＳ７７で算出された発話速度を、発話速度修正部２０７に通知する。
【０１１５】
以上のようにして、発話速度検出処理が実行される。
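Steps S71 to S77 can be sketched in Python as below. The sketch works offline on a list of time-stamped word strings rather than polling a clock, and it treats a gap between one word string's end time and the next one's start time as the pause; both simplifications, along with the names `WordString`, `split_utterances`, `speech_rate`, and the 1.0 s pause value, are assumptions and not part of the source.

```python
from dataclasses import dataclass


@dataclass
class WordString:
    kana: str      # recognition result written entirely in kana
    start: float   # utterance start time in seconds
    end: float     # utterance end time in seconds


def split_utterances(word_strings, pause_sec=1.0):
    """Group word strings; a gap longer than pause_sec ends the utterance
    (offline counterpart of the S71 to S74 loop)."""
    groups, current = [], []
    for w in word_strings:
        if current and w.start - current[-1].end > pause_sec:
            groups.append(current)
            current = []
        current.append(w)
    if current:
        groups.append(current)
    return groups


def speech_rate(utterance):
    """Characters per second over one uninterrupted utterance (S75 to S77)."""
    duration = utterance[-1].end - utterance[0].start      # S75
    if duration <= 0:
        raise ValueError("utterance must have positive duration")
    total_chars = sum(len(w.kana) for w in utterance)      # S76
    return total_chars / duration                          # S77
```

Counting morae instead of kana characters, which the text mentions as an alternative, would only change how `total_chars` is computed.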
【０１１６】
処理は図８に戻り、ステップＳ４の発話速度検出処理の後、処理はステップＳ５に進む。
【０１１７】
ステップＳ５において、ロボット１は、発話速度検出部２０６から通知された発話速度に基づいて発話速度修正処理を実行し、ロボット１が出力する合成音の発話速度を設定する。
【０１１８】
ここで、図１１のフローチャートを参照して、図８のステップＳ５の発話速度修正処理について詳細に説明する。
【０１１９】
図１１のステップＳ１０１において、発話速度修正部２０７は、発話速度検出部２０６から通知されたユーザの発話速度が、音声認識に最適な発話速度（以下、音声認識に最適な発話速度を最適速度とも称する）より大きい値であるか否かを判定し、ユーザの発話速度が最適速度より大きい場合、処理はステップＳ１０２に進む。
【０１２０】
ステップＳ１０２において、発話速度修正部２０７は、スピーカ７２から出力する合成音の発話速度を、ユーザの発話速度より所定の値だけ小さい値に設定する。設定する値の詳細な説明は、後述する。その後、処理はステップＳ１０４に進む。
【０１２１】
ステップＳ１０１において、発話速度修正部２０７が、ユーザの発話速度は、最適速度より大きくない（ユーザの発話速度は最適速度以下である）と判定した場合、処理はステップＳ１０３に進む。
【０１２２】
ステップＳ１０３において、発話速度修正部２０７は、スピーカ７２から出力する合成音の発話速度を、ユーザの発話速度より所定の値だけ大きい値に設定する。設定する値の詳細な説明は、後述する。その後、処理はステップＳ１０４に進む。
【０１２３】
ステップＳ１０４において、発話速度修正部２０７は、設定されたロボット１の発話速度の値を、音声合成部２０８に通知する。
【０１２４】
以上のようにして、発話速度修正処理が実行される。
【０１２５】
ここで、ステップＳ１０２およびステップＳ１０３におけるロボット１の発話速度の値の設定方法について、図１２乃至図１４を参照して説明する。
【０１２６】
図１２は、ユーザの発話速度およびロボット１の発話速度の経時変化を表すグラフである。
【０１２７】
図１２のグラフにおいて、縦軸は発話速度を表し、横軸は時刻を表している。また、ユーザの発話速度Ｖを実線で表し、ロボット１の発話速度Ｖ’を１点鎖線で表している。
【０１２８】
図１２の例においては、ロボット１の発話速度Ｖ’は、式（１）を満たすように設定される。
【０１２９】
（Ｖ’−ｖ）／（Ｖ−ｖ）＝ｋ　・・・（１）
【０１３０】
式（１）において、ｋは０＜ｋ＜１の定数である。
【０１３１】
すなわち、ロボット１の発話速度Ｖ’は、ロボット１の発話速度の最適速度ｖからの差（Ｖ’−ｖ）と、ユーザの発話速度Ｖの最適速度ｖからの差（Ｖ−ｖ）との比が一定になり、かつ、ロボット１の発話速度Ｖ’がユーザの発話速度Ｖより、最適速度ｖに近くなるように設定される。
【０１３２】
人は、対話中、自らの発話速度を、相手の発話速度に合わせようとする傾向があるといわれている。そこで、このように、ロボット１の発話速度Ｖ’を、ユーザの発話速度Ｖより最適速度ｖに近い値に設定することにより、ユーザは、無意識のうちに自らの発話速度Ｖを、ロボット１の発話速度Ｖ’に合わせてゆき、結果的に、知らず知らずのうちに、自らの発話速度を、最適速度ｖに合わせるようになる。図１２においては、時刻ｔ０から時刻ｔ１にかけて、徐々にロボット１の発話速度Ｖ’が落ちるのにつられて、ユーザの発話速度Ｖが落ちてゆき、時刻ｔ１において、ユーザの発話速度Ｖは、最適速度ｖに一致している。
【０１３３】
ロボット１は、このようにして、ユーザの発話速度を、精度良く音声認識を行なえる発話速度まで誘導することができる。
【０１３４】
なお、ｋが１に近づけば近づくほど、ロボット１の発話速度Ｖ’はユーザの発話速度Ｖに近づいた値に設定され、ｋが０に近づけば近づくほど、ロボット１の発話速度Ｖ’は最適速度ｖに近づいた値に設定される。
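Solving equation (1) for the robot's speed gives V' = v + k(V − v) with 0 < k < 1, so V' always lies between the optimal speed v and the user's speed V. A minimal Python sketch follows; the function and parameter names are hypothetical, and the units (characters per second) follow the detection step.

```python
def robot_rate_proportional(user_rate, optimal_rate, k=0.5):
    """Robot utterance speed V' from equation (1): (V' - v) / (V - v) = k."""
    if not 0.0 < k < 1.0:
        raise ValueError("k must satisfy 0 < k < 1")
    # k near 1 keeps V' close to the user's speed; k near 0 pulls it to v.
    return optimal_rate + k * (user_rate - optimal_rate)
```

The same expression covers both cases: when the user is faster than v the result is slower than the user, and when the user is slower than v the result is faster than the user.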
【０１３５】
図１２は、ユーザの発話速度Ｖが最適速度ｖより大きい場合の例を示しているが、ユーザの発話速度Ｖが最適速度ｖより小さい場合の例を図１３に示す。
【０１３６】
図１３のグラフは、図１２と同様、縦軸が発話速度を表し、横軸が時刻を表し、グラフ中の実線がユーザの発話速度Ｖを表し、１点鎖線が、ロボットの発話速度Ｖ’を表している。図１３においても、上記した式（１）に基づいて、ロボット１の発話速度Ｖ’が設定される。すなわち、ロボット１の発話速度Ｖ’は、ユーザの発話速度Ｖより、最適速度ｖに近い値に設定される。これにより、ユーザの発話速度Ｖを、徐々に最適速度ｖに誘導することができ、時刻ｔ１において、ユーザの発話速度Ｖは、最適速度ｖに一致している。
【０１３７】
以上のようにして、ステップＳ１０２およびステップＳ１０３において、発話速度修正部２０７は、ロボット１の発話速度Ｖ’を設定する。
【０１３８】
次に、図１４のグラフを参照して、式（１）とは異なる方法で、ロボット１の発話速度を設定する場合の例を説明する。
【０１３９】
図１４においても、図１２および図１３と同様に、縦軸が発話速度を表し、横軸が時刻を表し、グラフ中の実線がユーザの発話速度Ｖを表し、１点鎖線が、ロボットの発話速度Ｖ’を表している。また、（Ｖ−ｖ）は、ユーザの発話速度と最適速度との差を表し、（ｖ−Ｖ’）は、ロボット１の発話速度と最適速度との差を表している。
【０１４０】
図１４に示される例においては、ロボット１の発話速度は、式（２）に基づいて設定される。
【０１４１】
（ｖ−Ｖ’）＝（Ｖ−ｖ）　・・・（２）
【０１４２】
すなわち、図１４の例においては、ロボット１の発話速度Ｖ’は、最適速度ｖとロボット１の発話速度Ｖ’との差（ｖ−Ｖ’）が、ユーザの発話速度Ｖと最適速度ｖとの差（Ｖ−ｖ）と等しくなるように設定される。なお、ロボット１の発話速度Ｖ’には、予め最低値Ｖｍｉｎが設定されており、式（２）により、この最低値Ｖｍｉｎより低い値Ｖ’が算出された場合、ロボット１は、式（２）により算出された値をキャンセルし、発話速度Ｖ’を、最低値Ｖｍｉｎに設定する。
【０１４３】
図１４においては、時刻ｔ０からｔ１まで、徐々に、ユーザの発話速度Ｖが、最適速度ｖに近づいてゆき、時刻ｔ１において、最適速度ｖと一致している。
【０１４４】
図１１のステップＳ１０２およびステップＳ１０３において、以上のようにして、ロボット１の発話速度を設定しても良い。
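Equation (2) can be rearranged to V' = 2v − V, that is, the robot's speed mirrors the user's speed about the optimal speed, with the minimum value Vmin applied as a floor. The Python sketch below shows this under hypothetical names; the source mentions only a lower limit, so no upper limit is applied here.

```python
def robot_rate_mirrored(user_rate, optimal_rate, min_rate):
    """Robot utterance speed V' from equation (2): (v - V') = (V - v)."""
    candidate = 2.0 * optimal_rate - user_rate
    # If equation (2) yields less than Vmin, discard it and use Vmin.
    return max(candidate, min_rate)
```

Compared with equation (1), this overshoots to the other side of the optimal speed, which is why a floor is needed for very fast speakers.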
【０１４５】
図８のフローチャートに戻り、ステップＳ５の発話速度修正処理の後、処理はステップＳ６に進む。
【０１４６】
ステップＳ６において、音声合成部２０８は、ステップＳ３で決定された、ロボット１の発話内容を、ステップＳ５で設定された発話速度で発話すべく、合成音を生成し、生成した合成音をスピーカ７２から出力させる。すなわち、音声合成部２０８には、ステップＳ３で、対話管理部２３１より、発話内容のデータを含む発話指令情報が供給され、ステップＳ５で、発話速度修正部２０７より、発話速度の設定値が供給されている。そこで、音声合成部２０８は、供給された発話内容のデータに基づいて合成音を生成するが、このとき、発話速度修正部２０７から供給された設定値に基づいた発話速度になるように、合成音を生成する。そして、音声合成部２０８は、生成した合成音をスピーカ７２から出力させる。
【０１４７】
以上のようにして、ロボット１の対話制御処理が実行される。
【０１４８】
以上に説明したように、本発明のロボット１は、ユーザが無意識のうちに、ユーザの発話速度を、音声認識に最適な速度に誘導する。したがって、音声認識の精度を良好な状態に保つことが可能となる。また、本発明のロボット１は、ユーザの音声を聞き取れない（的確に音声認識できない）場合に、聞き返す（再度、発言するように要求する）ようにしても良いが、このようにした場合にも、的確に音声認識できないこと自体が減少するので、聞き返す回数を減少させることができる。
【０１４９】
なお、以上の説明における、ステップＳ３の処理、並びにステップＳ４およびステップＳ５の処理は、並列に実行される。すなわち、ステップＳ５の処理は、ステップＳ４の処理の後に実行されるが、ステップＳ３の処理と、ステップＳ４およびステップＳ５の処理は同時並行に、実行される。
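The ordering constraint described here (S5 after S4, with S3 running alongside both) can be sketched with Python's standard `concurrent.futures` module. The three step functions are placeholders invented for the example, as are the default values for the optimal speed and k; only the scheduling pattern is the point.

```python
from concurrent.futures import ThreadPoolExecutor


def decide_reply(word_string):                     # stand-in for step S3
    return f"You said: {word_string}"


def detect_rate(char_count, duration_sec):         # stand-in for step S4
    return char_count / duration_sec


def correct_rate(user_rate, optimal=6.0, k=0.5):   # stand-in for step S5
    return optimal + k * (user_rate - optimal)


def dialogue_turn(word_string, char_count, duration_sec):
    """Run S3 in parallel with the S4 -> S5 chain, then join for S6."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        reply = pool.submit(decide_reply, word_string)
        rate = pool.submit(
            lambda: correct_rate(detect_rate(char_count, duration_sec)))
        # Step S6 (speech synthesis) needs both results.
        return reply.result(), rate.result()
```

Both futures are joined before returning because speech synthesis in step S6 needs the utterance content and the speed setting together.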
【０１５０】
ところで、図１１に示された発話速度修正処理においては、音声認識の精度を良好に保てる、唯１つの最適速度の値を基準として、ロボット１の発話速度を設定していたが、音声認識の精度を良好に保てる範囲を設定し、その範囲内と範囲外で、ロボット１の発話速度の設定方法を変えるようにしても良い。次に、このようにした場合の発話速度修正処理（図８のステップＳ５の処理）について、図１５のフローチャートを参照して説明する。
【０１５１】
発話速度修正部２０７は、最適速度の他に、音声認識の精度を良好に保てる発話速度の範囲の上限値および下限値を予め記憶している。そこで、図１５のステップＳ１２１において、発話速度修正部２０７は、発話速度検出部２０６から通知されたユーザの発話速度が、音声認識の精度を良好に保てる範囲内にあるか否かを判定し、ユーザの発話速度が、音声認識の精度を良好に保てる範囲内にない場合、処理はステップＳ１２２に進む。
【０１５２】
ステップＳ１２２において、発話速度修正部２０７は、ユーザの発話速度が、音声認識の精度を良好に保てる範囲の上限値より大きいか否かを判定し、ユーザの発話速度が、音声認識の精度を良好に保てる範囲の上限値より大きい場合、処理はステップＳ１２３に進む。
【０１５３】
ステップＳ１２３において、発話速度修正部２０７は、ロボット１の発話速度を、ユーザの発話速度より所定の値だけ小さい値に設定する。発話速度修正部２０７は、例えば、式（１）または式（２）を利用して、ロボット１の発話速度を設定する。その後、処理はステップＳ１２６に進む。
【０１５４】
ステップＳ１２２において、発話速度修正部２０７が、ユーザの発話速度は、音声認識の精度を良好に保てる範囲の上限値より大きくない（ユーザの発話速度は音声認識の精度を良好に保てる範囲の下限値より小さい）と判定した場合、処理はステップＳ１２４に進む。
【０１５５】
ステップＳ１２４において、発話速度修正部２０７は、ロボット１の発話速度を、ユーザの発話速度より所定の値だけ大きい値に設定する。発話速度修正部２０７は、例えば、式（１）または式（２）を利用して、ロボット１の発話速度を設定する。その後、処理はステップＳ１２６に進む。
【０１５６】
ステップＳ１２１において、発話速度修正部２０７が、ユーザの発話速度は、精度良く音声認識することができる範囲内の速度であると判定した場合、処理はステップＳ１２５に進む。
【０１５７】
ステップＳ１２５において、発話速度修正部２０７は、ロボット１の発話速度を、ユーザの発話速度と同じ値に設定する。その後、処理はステップＳ１２６に進む。
【０１５８】
ステップＳ１２６において、発話速度修正部２０７は、ステップＳ１２３、ステップＳ１２４、またはステップＳ１２５で設定されたロボット１の発話速度の設定値を、音声合成部２０８に通知する。
【０１５９】
以上のように、ユーザの発話速度が、音声認識の精度を保てる範囲内にあった場合、ロボット１の発話速度をユーザの発話速度に合わせるようにしても良い。これにより、ユーザに、より自然で心地よく、ロボット１と対話させることが可能となる。
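The branch structure of steps S121 to S125 can be sketched in Python as follows. Equation (1) is used for the out-of-range cases, which the text allows as one option; the function name, the parameter names, and the default k are hypothetical, and the sketch assumes the optimal speed lies between the lower and upper limits.

```python
def set_robot_rate(user_rate, lower, upper, optimal, k=0.5):
    """Choose the robot's utterance speed from the user's speed."""
    if lower <= user_rate <= upper:
        # S121 yes -> S125: within the good-accuracy range, match the user.
        return user_rate
    if user_rate > upper:
        # S122 yes -> S123: result is slower than the user.
        return optimal + k * (user_rate - optimal)
    # S122 no -> S124: user is below the lower limit, result is faster.
    return optimal + k * (user_rate - optimal)
```

The two out-of-range branches share the same expression here because equation (1) already moves the speed in the right direction on either side; they are kept separate only to mirror the flowchart.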
【０１６０】
図１５に示された発話速度修正処理により設定されたロボット１の発話速度の例について、図１６のグラフを参照して説明する。
【０１６１】
図１６のグラフにおいて、縦軸は発話速度を表し、横軸は時刻を表している。また、縦軸中に記載された「上限速度」は、精度良く音声認識することができるユーザの発話速度の範囲の上限値を表し、「下限速度」は、精度良く音声認識することができるユーザの発話速度の範囲の下限値を表し、この上限速度と下限速度の間に、ユーザの発話速度がある場合、精度良く音声認識することが可能となる。また、縦軸中に記載された「最適速度」は、最も精度良く音声認識することができるユーザの発話速度を表している。また、図１６のグラフにおいては、ユーザの発話速度Ｖを実線で表し、ロボット１の発話速度Ｖ’を１点鎖線で表している。
【０１６２】
図１６において、時刻ｔ０からｔ１までの区間においては、ユーザの発話速度Ｖが、上限速度より大きい。したがって、ロボット１の発話速度は、例えば、式（１）に基づいて設定される。図１６において、時刻ｔ１からｔ２までの区間においては、ユーザの発話速度Ｖが、下限速度と上限速度の間にある。したがって、このときロボット１は、自らの発話速度をユーザの発話速度と同一の値に設定する。これにより、ユーザは、より自然で心地よくロボット１と会話することができる。時刻ｔ２において、再度、ユーザの発話速度が上限速度を超えている。このとき、ロボット１の発話速度は、再度、例えば式（１）に基づいて設定するように変更される。そして、時刻ｔ３において、ユーザの発話速度Ｖが上限速度と下限速度の間になると、ロボット１は、再度、自らの発話速度をユーザの発話速度と同一の値に設定する。
【０１６３】
以上のようにしても良い。
【０１６４】
なお、以上の説明においては、本発明を人型ロボットに適用した場合を例にして説明したが、本発明は人型ロボット以外のロボット（例えば犬型のロボットなど）に適用したり、産業用ロボットに適用したりすることも可能である。さらにまた、本発明は、例えばカーナビゲーションシステムなどのように、ユーザと対話する機能を有する、その他の装置に適用することも可能である。
【０１６５】
上述した一連の処理は、ハードウェアにより実行させることもできるし、上述したようにソフトウェアにより実行させることもできる。一連の処理をソフトウェアにより実行させる場合には、そのソフトウェアを構成するプログラムが専用のハードウェアに組み込まれているコンピュータ、または、各種のプログラムをインストールすることで、各種の機能を実行することが可能な、例えば汎用のパーソナルコンピュータなどに、記録媒体等からインストールされる。
【０１６６】
図１７は、このような処理を実行するパーソナルコンピュータ３０１の内部構成例を示す図である。パーソナルコンピュータのＣＰＵ（Ｃｅｎｔｒａｌ Ｐｒｏｃｅｓｓｉｎｇ Ｕｎｉｔ）３１１は、ＲＯＭ（Ｒｅａｄ Ｏｎｌｙ Ｍｅｍｏｒｙ）３１２に記憶されているプログラムに従って各種の処理を実行する。ＲＡＭ（Ｒａｎｄｏｍ Ａｃｃｅｓｓ Ｍｅｍｏｒｙ）３１３には、ＣＰＵ３１１が各種の処理を実行する上において必要なデータやプログラムなどが適宜記憶される。入出力インタフェース３１５には、ディスプレイ、スピーカ、およびＤＡ変換器などから構成される出力部３１６が接続されている。また、入出力インタフェース３１５には、マウス、キーボード、マイクロフォン、ＡＤ変換器などから構成される入力部３１７が接続され、入力部３１７に入力された信号をＣＰＵ３１１に出力する。
【０１６７】
さらに、入出力インタフェース３１５には、ハードディスクなどから構成される記憶部３１８、および、インターネットなどのネットワークを介して他の装置とデータの通信を行う通信部３１９も接続されている。ドライブ３２０は、磁気ディスク３３１、光ディスク３３２、光磁気ディスク３３３、半導体メモリ３３４などの記録媒体からデータを読み出したり、データを書き込んだりするときに用いられる。
【０１６８】
記録媒体は、図１７に示されるように、パーソナルコンピュータとは別に、ユーザにプログラムを提供するために配布される、プログラムが記録されている磁気ディスク３３１（フレキシブルディスクを含む）、光ディスク３３２（ＣＤ−ＲＯＭ（Ｃｏｍｐａｃｔ Ｄｉｓｃ−Ｒｅａｄ Ｏｎｌｙ Ｍｅｍｏｒｙ），ＤＶＤ（Ｄｉｇｉｔａｌ Ｖｅｒｓａｔｉｌｅ Ｄｉｓｃ）を含む）、光磁気ディスク３３３（ＭＤ（Ｍｉｎｉ−Ｄｉｓｃ）（登録商標）を含む）、若しくは半導体メモリ３３４などよりなるパッケージメディアにより構成されるだけでなく、コンピュータに予め組み込まれた状態でユーザに提供される、プログラムが記憶されているＲＯＭ３１２や、記憶部３１８に含まれるハードディスクなどで構成される。
【０１６９】
なお、本明細書において、媒体により提供されるプログラムを記述するステップは、記載された順序に従って、時系列的に行われる処理は勿論、必ずしも時系列的に処理されなくとも、並列的あるいは個別に実行される処理をも含むものである。
【０１７０】
また、本明細書において、システムとは、複数の装置により構成される装置全体を表すものである。
【０１７１】
【発明の効果】
このように、本発明によれば、音声を認識することができる。特に、ユーザの発話速度を、精度良く音声認識することができる発話速度に誘導することが可能となり、結果的に、誤った音声認識をする確率を下げることが可能となる。
【図面の簡単な説明】
【図１】本発明を適用したロボットの外観構成を示す斜視図である。
【図２】図１のロボットの外観構成を示す、背後側の斜視図である。
【図３】図１のロボットについて説明するための略線図である。
【図４】図１のロボットの内部構成を示すブロック図である。
【図５】図１のロボットの制御に関する部分を主に説明するためのブロック図である。
【図６】図５のメイン制御部の構成を示すブロック図である。
【図７】図６の音声認識部の構成を示すブロック図である。
【図８】ロボットの対話処理を説明するフローチャートである。
【図９】図８のステップＳ２の処理を詳細に説明するフローチャートである。
【図１０】図８のステップＳ４の処理を詳細に説明するフローチャートである。
【図１１】図８のステップＳ５の処理を詳細に説明するフローチャートである。
【図１２】図１１の発話速度修正処理により設定されるロボットの発話速度の例について説明する図である。
【図１３】図１１の発話速度修正処理により設定されるロボットの発話速度の例について説明する他の図である。
【図１４】図１１の発話速度修正処理により設定されるロボットの発話速度の例について説明する、さらに他の図である。
【図１５】図８のステップＳ５の処理を詳細に説明する他のフローチャートである。
【図１６】図１５の発話速度修正処理により設定されるロボットの発話速度の例について説明する図である。
【図１７】本発明を適用したコンピュータの構成を示すブロック図である。
【符号の説明】
１ロボット，６１メイン制御部，６３サブ制御部，７２スピーカ，８２マイクロホン，１０１Ｄ／Ａ変換部，１０２Ａ／Ｄ変換部，１１１ＣＰＵ，１１２内部メモリ，１２１ＯＳ，１２２アプリケーションプログラム，２０１センサ入力処理部，２０２モデル記憶部，２０３行動決定機構部，２０４姿勢遷移機構部，２０５制御機構部，２０６発話速度検出部，２０７発話速度修正部，２０８音声合成部，２２３音声認識部，２３１対話管理部，２５１時刻情報取得部，２５２特徴抽出部，２５３特徴ベクトルバッファ，２５４マッチング部，２５５音響モデルデータベース，２５６辞書データベース，２５７文法データベース，２５８時刻情報付加部
[0001]
[Technical field of the invention]
The present invention relates to a speech processing apparatus and method, a recording medium, and a program, and in particular to a speech processing apparatus and method, a recording medium, and a program that naturally guide a user's utterance speed into a range in which speech recognition accuracy is good.
[0002]
[Prior art]
In a dialogue system in which a computer recognizes speech uttered by a user and outputs synthesized speech based on the recognition result, it is known to match the utterance speed of the synthesized speech output from the speaker to the user's utterance speed in order to make the dialogue more natural (see, for example, Patent Document 1).
[0003]
[Patent Document 1]
JP-A-5-216618 (paragraph number 0103)
[0004]
[Problems to be solved by the invention]
In general, however, speech recognition has an utterance speed at which recognition accuracy is highest, so there has been a problem that an incorrect recognition result is output when the user speaks too fast or too slowly.
[0005]
When the user speaks too fast or too slowly, the system could ask the user to change to an appropriate utterance speed, for example by saying "Please speak a little more slowly." Frequent requests of this kind, however, may annoy the user.
[0006]
The present invention has been made in view of such a situation, and aims to naturally guide the utterance speed of a user to a range where the accuracy of voice recognition is good.
[0007]
[Means for Solving the Problems]
The speech processing apparatus of the present invention comprises: speech recognition means for recognizing speech uttered by a user; calculation means for calculating the user's utterance speed based on a word string recognized and generated by the speech recognition means; setting means for setting the utterance speed of synthesized speech output from the speech processing apparatus by comparing the user's utterance speed calculated by the calculation means with an utterance speed at which speech can be recognized accurately; determination means for determining the utterance content of the synthesized speech to be output; and output means for outputting, based on the utterance content determined by the determination means, synthesized speech at the utterance speed set by the setting means.
[0008]
The setting means may set the utterance speed of the synthesized speech to be output lower than the user's utterance speed when the user's utterance speed is higher than the utterance speed at which speech can be recognized accurately, and higher than the user's utterance speed when the user's utterance speed is lower than the utterance speed at which speech can be recognized accurately.
[0009]
The setting means may set the utterance speed of the synthesized speech to be output to the same value as the user's utterance speed when the user's utterance speed is within the range of utterance speeds at which speech can be recognized accurately, lower than the user's utterance speed when the user's utterance speed is above the upper limit of that range, and higher than the user's utterance speed when the user's utterance speed is below the lower limit of that range.
[0010]
The voice processing method of the present invention is characterized by including: a voice recognition step of recognizing a voice uttered by a user; a calculation step of calculating the user's utterance speed based on a word string generated as the recognition result of the voice recognition step; a setting step of comparing the user's utterance speed calculated in the calculation step with an utterance speed at which speech can be accurately recognized, and setting the utterance speed of a synthesized sound output from the voice processing device; a determining step of determining the utterance content of the synthesized sound to be output; and an output step of outputting, based on the utterance content determined in the determining step, a synthesized sound having the utterance speed set in the setting step.
[0011]
The program recorded on the recording medium of the present invention is characterized by including: a voice recognition step of recognizing a voice uttered by a user; a calculation step of calculating the user's utterance speed based on a word string generated as the recognition result of the voice recognition step; a setting step of comparing the user's utterance speed calculated in the calculation step with an utterance speed at which speech can be accurately recognized, and setting the utterance speed of a synthesized sound output from the voice processing device; and a determining step of determining the utterance content of the synthesized sound to be output at the utterance speed set in the setting step.
[0012]
The program of the present invention is characterized by causing a computer that controls a voice processing device that interacts with a user to execute: a voice recognition step of recognizing a voice uttered by the user; a calculation step of calculating the user's utterance speed based on a word string generated as the recognition result of the voice recognition step; a setting step of comparing the user's utterance speed calculated in the calculation step with an utterance speed at which speech can be accurately recognized, and setting the utterance speed of a synthesized sound; and a determining step of determining the utterance content of the synthesized sound to be output from the voice processing device at the utterance speed set in the setting step.
[0013]
In the voice processing device and method, the recording medium, and the program of the present invention, a voice uttered by a user is recognized, the user's utterance speed is calculated based on the word string generated as the recognition result, the calculated utterance speed is compared with an utterance speed at which speech can be accurately recognized, and the utterance speed of the synthesized sound output from the voice processing device is set. The utterance content of the synthesized sound to be output is then determined, and a synthesized sound having the set utterance speed is output based on the determined utterance content.
[0014]
The present invention can be applied to, for example, a robot.
[0015]
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0016]
FIG. 1 is a front perspective view of a bipedal walking robot 1 to which the present invention is applied, and FIG. 2 is a perspective view of the robot 1 as viewed from the rear. FIG. 3 is a diagram for explaining the axis configuration of the robot 1.
[0017]
The robot 1 includes a body unit 11, a head unit 12 disposed above the body unit 11, an arm unit 13A and an arm unit 13B attached to predetermined positions on the upper left and right of the body unit 11, and a leg unit 14A and a leg unit 14B attached to predetermined positions on the lower left and right of the body unit 11. The arm unit 13A and the arm unit 13B have the same configuration. The leg unit 14A and the leg unit 14B have the same configuration. The head unit 12 is provided with a head sensor 51.
[0018]
The body unit 11 is configured by connecting a frame 21 forming the upper trunk and a waist base 22 forming the lower trunk via a waist joint mechanism 23. By driving the actuator A1 and the actuator A2 of the waist joint mechanism 23, which is fixed to the waist base 22 of the lower trunk, the upper trunk can be rotated independently about the mutually orthogonal roll axis 24 and pitch axis 25 shown in FIG. 3.
[0019]
The head unit 12 is attached, via a neck joint mechanism 27, to the center of the upper surface of a shoulder base 26 fixed to the upper end of the frame 21. By driving the actuator A3 and the actuator A4 of the neck joint mechanism 27, the head unit 12 can be rotated independently about the mutually orthogonal pitch axis 28 and yaw axis 29 shown in FIG. 3.
[0020]
The arm unit 13A and the arm unit 13B are attached to the left and right of the shoulder base 26, respectively, via shoulder joint mechanisms 30. By driving the actuator A5 and the actuator A6 of the corresponding shoulder joint mechanism 30, each arm unit can be rotated independently about the mutually orthogonal pitch axis 31 and roll axis 32 shown in FIG. 3.
[0021]
In this case, each of the arm unit 13A and the arm unit 13B is configured by connecting the actuator A8, which forms the forearm, to the output shaft of the actuator A7, which forms the upper arm, via the elbow joint mechanism 44, and by attaching a hand portion 34 to the forearm.
[0022]
In the arm unit 13A and the arm unit 13B, the forearm can be rotated about the yaw axis 35 shown in FIG. 3 by driving the actuator A7, and about the pitch axis 36 shown in FIG. 3 by driving the actuator A8.
[0023]
The leg unit 14A and the leg unit 14B are attached to the waist base 22 of the lower trunk, respectively, via hip joint mechanisms 37. By driving the actuators A9 to A11 of the corresponding hip joint mechanism 37, each leg unit can be rotated independently about the mutually orthogonal yaw axis 38, roll axis 39, and pitch axis 40.
[0024]
In the leg unit 14A and the leg unit 14B, the lower end of the frame 41 forming the thigh is connected to the frame 43 forming the lower leg via the knee joint mechanism 42, and the lower end of the frame 43 is connected to a foot 45 via an ankle joint mechanism 44.
[0025]
Thus, in the leg unit 14A and the leg unit 14B, the lower leg can be rotated about the pitch axis 46 shown in FIG. 3 by driving the actuator A12 forming the knee joint mechanism 42, and the foot 45 can be rotated independently about the mutually orthogonal pitch axis 47 and roll axis 48 shown in FIG. 3 by driving the actuator A13 and the actuator A14 of the ankle joint mechanism 44.
[0026]
A sole sensor 91 (FIG. 5) is disposed on the sole surface (the surface in contact with the floor) of the foot 45 of each of the leg unit 14A and the leg unit 14B, and whether the foot 45 is in contact with the floor is determined based on the on/off state of the sole sensor 91.
[0027]
On the back side of the waist base 22, which forms the lower part of the trunk of the body unit 11, a control unit 52, which is a box containing a main control unit 61 (FIG. 4) described later, is provided.
[0028]
FIG. 4 is a diagram illustrating an actuator of the robot 1 and a control system thereof.
[0029]
The control unit 52 houses a main control unit 61 that controls the operation of the entire robot 1, and a peripheral circuit 62 including a D/A conversion unit 101, an A/D conversion unit 102, a battery 103, a battery sensor 131, an acceleration sensor 132, a communication unit 105, and an external memory 106 (all shown in FIG. 5), which will be described later.
[0030]
The control unit 52 is connected to the sub-control units 63A to 63D disposed in the respective constituent units (the body unit 11, the head unit 12, the arm unit 13A and the arm unit 13B, and the leg unit 14A and the leg unit 14B), supplies the necessary power supply voltages to the sub-control units 63A to 63D, and communicates with the sub-control units 63A to 63D.
[0031]
The sub-control units 63A to 63D are connected to the actuators A1 to A14 in the corresponding constituent units, and drive the actuators A1 to A14 in those constituent units to the designated states based on the various control commands supplied from the main control unit 61.
[0032]
FIG. 5 is a block diagram showing the internal configuration of the robot 1.
[0033]
In the head unit 12, an external sensor unit 71, which includes a CCD (Charge Coupled Device) camera 81 functioning as the “eyes” of the robot 1, a microphone 82 functioning as its “ears”, a head sensor 51, and the like, and a speaker 72 functioning as its “mouth” are disposed at predetermined positions. An internal sensor unit 104 including a battery sensor 131 and an acceleration sensor 132 is disposed in the control unit 52. Further, a sole sensor 91 that functions as one of the “somatic senses” of the robot 1 is provided on the sole of the foot 45 of each of the leg units 14A and 14B.
[0034]
The CCD camera 81 of the external sensor unit 71 captures an image of the surroundings and sends the obtained image signal to the main control unit 61 via the A/D conversion unit 102. The microphone 82 collects various command voices given as voice input from the user, such as “walk”, “stop”, or “raise your right hand”, and sends the obtained voice signal to the main control unit 61 via the A/D conversion unit 102.
[0035]
The head sensor 51 is provided on the head unit 12, for example, as shown in FIGS. 1 and 2. It detects the pressure received through a physical action from the user, such as “stroking” or “hitting”, and sends a pressure detection signal as the detection result to the main control unit 61 via the A/D conversion unit 102.
[0036]
The sole sensor 91 is provided on the sole of the foot 45 and, when the foot 45 is in contact with the floor, sends a ground signal to the main control unit 61 via the A/D conversion unit 102. The main control unit 61 determines whether the foot 45 is in contact with the floor based on the ground signal. Since a sole sensor 91 is disposed on the foot 45 of each of the leg unit 14A and the leg unit 14B, the main control unit 61 can determine, based on the ground signals, whether both feet of the robot 1 are on the floor, only one foot is on the floor, or neither foot is on the floor.
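Since each foot has its own sole sensor, the two ground signals together distinguish three stance cases. A minimal sketch follows, in which the function name `stance` and the returned labels are hypothetical and each ground signal is modeled as a boolean:

```python
def stance(left_grounded: bool, right_grounded: bool) -> str:
    """Classify the robot's stance from the two sole-sensor ground signals."""
    # Count how many feet currently report contact with the floor.
    grounded = int(left_grounded) + int(right_grounded)
    labels = ("neither foot grounded", "one foot grounded", "both feet grounded")
    return labels[grounded]
```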
[0037]
The control unit 52 includes a main control unit 61, a D/A conversion unit 101, an A/D conversion unit 102, a battery 103, an internal sensor unit 104, a communication unit 105, an external memory 106, and the like.
[0038]
The D/A (Digital/Analog) conversion unit 101 converts the digital signal supplied from the main control unit 61 into an analog signal and supplies the analog signal to the speaker 72. The A/D (Analog/Digital) conversion unit 102 converts the analog signals output from the CCD camera 81, the microphone 82, the head sensor 51, and the sole sensor 91 into digital signals and supplies the digital signals to the main control unit 61.
[0039]
The battery sensor 131 of the internal sensor unit 104 detects the remaining energy of the battery 103 at a predetermined cycle and sends the detection result to the main control unit 61 as a remaining battery detection signal. The acceleration sensor 132 detects the acceleration of the movement of the robot 1 in three axis directions (x-axis, y-axis, and z-axis) at a predetermined cycle and sends the detection result to the main control unit 61 as an acceleration detection signal.
[0040]
The main control unit 61 incorporates a CPU 111 that controls the overall operation of the main control unit 61, and an internal memory 112 that stores an OS (Operating System) 121 executed by the CPU 111 to control each unit, an application program 122, and other necessary data.
[0041]
The main control unit 61 determines the states around and inside the robot 1, commands from the user, the presence or absence of an action from the user, and the like. It does so based on the image signal, the audio signal, and the pressure detection signal supplied from the CCD camera 81, the microphone 82, and the head sensor 51 of the external sensor unit 71, respectively, and the ground signal supplied from the sole sensor 91 (hereinafter collectively referred to as the external sensor signal S1), and on the remaining battery detection signal and the acceleration detection signal supplied from the battery sensor 131 and the acceleration sensor 132 of the internal sensor unit 104, respectively (hereinafter collectively referred to as the internal sensor signal S2).
[0042]
The main control unit 61 then determines the action of the robot 1 based on the determined states around and inside the robot 1, the commands from the user, and the presence or absence of an action from the user, together with a control program stored in advance in the internal memory 112 or various control parameters stored in the external memory 106 loaded at that time. It generates a control command COM based on the determination result and sends it to the corresponding sub-control units 63A to 63D. The sub-control units 63A to 63D control the driving of the corresponding actuators among the actuators A1 to A14 based on the supplied control command COM. This allows the robot 1 to perform mechanical operations such as swinging the head unit 12 up, down, left, and right, raising the arm unit 13A or the arm unit 13B, and walking by alternately driving the leg unit 14A and the leg unit 14B.
[0043]
Further, the main control unit 61 supplies a predetermined sound signal to the speaker 72 as necessary, thereby outputting a sound based on that signal to the outside. The main control unit 61 also outputs a drive signal to an LED (Light Emitting Diode), not shown, that is provided at a predetermined position of the head unit 12 and functions as an apparent “eye”, thereby turning the LED on or off or making it blink.
[0044]
In this way, the robot 1 is capable of acting autonomously based on the surrounding and internal conditions, the presence / absence of a command from the user, and the presence or absence of an action.
[0045]
The communication unit 105 performs communication control when communicating with the outside wirelessly or by wire. Accordingly, when the OS 121 or the application program 122 is upgraded, the upgraded OS or application program can be downloaded via the communication unit 105 and stored in the internal memory 112, and a predetermined command can be received by the communication unit 105 and given to the CPU 111.
[0046]
The external memory 106 is configured by, for example, an electrically erasable programmable read-only memory (EEPROM) or the like, and is detachable from a slot (not shown) provided in the body unit 11. The external memory 106 stores, for example, an emotion model described later.
[0047]
Next, FIG. 6 shows an example of a functional configuration of the main control unit 61 of FIG. 5. Note that the functional configuration illustrated in FIG. 6 is realized by the main control unit 61 executing the OS 121 and the application program 122 stored in the internal memory 112. In FIG. 6, the illustration of the D/A conversion unit 101 and the A/D conversion unit 102 is omitted.
[0048]
The sensor input processing unit 201 of the main control unit 61 recognizes a specific external state, a specific action from the user, an instruction from the user, and the like based on the pressure detection signal, the ground signal, the acceleration detection signal, the voice signal, and the image signal supplied from the head sensor 51, the sole sensor 91, the acceleration sensor 132, the microphone 82, and the CCD camera 81, respectively, and notifies the model storage unit 202 and the action determination mechanism unit 203 of state recognition information representing the recognition result.
[0049]
That is, the sensor input processing unit 201 includes a pressure processing unit 221, an acceleration processing unit 222, a voice recognition unit 223, and an image recognition unit 224.
[0050]
The pressure processing unit 221 processes the pressure detection signal provided from the head sensor 51. For example, when the processing detects a pressure that is equal to or more than a predetermined threshold and lasts a short time, the pressure processing unit 221 recognizes that the robot has been “hit”; when it detects a pressure that is less than the predetermined threshold and lasts a long time, it recognizes that the robot has been “stroked”. The pressure processing unit 221 notifies the model storage unit 202 and the action determination mechanism unit 203 of the recognition result as state recognition information.
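The distinction drawn by the pressure processing unit 221 can be sketched as a two-condition classifier. The function name `classify_touch`, the threshold values, and the boundary between a "short" and a "long" touch are all assumptions introduced for illustration. The text speaks only of a predetermined threshold and of short or long durations.

```python
from typing import Optional


def classify_touch(pressure: float, duration_s: float,
                   pressure_threshold: float = 0.5,
                   short_touch_s: float = 0.3) -> Optional[str]:
    """Classify a head-sensor touch as "hit" or "stroked".

    pressure_threshold and short_touch_s are hypothetical values.
    Returns None when neither pattern described in the text applies.
    """
    is_strong = pressure >= pressure_threshold
    is_short = duration_s < short_touch_s
    if is_strong and is_short:
        return "hit"        # strong pressure for a short time
    if not is_strong and not is_short:
        return "stroked"    # weak pressure for a long time
    return None             # combination not covered by the description
```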
[0051]
Further, the pressure processing unit 221 processes the ground signal provided from the sole sensor 91. For example, when a ground signal is given from the sole sensor 91 disposed on the foot 45 of the leg unit 14A, the pressure processing unit 221 recognizes that the foot 45 of the leg unit 14A is in contact with the floor (ground), and when no ground signal is given from that sole sensor 91, it recognizes that the foot 45 of the leg unit 14A is not in contact with the floor (ground). Similarly, for the leg unit 14B, it recognizes whether the foot 45 of the leg unit 14B is in contact with the floor (ground) based on the ground signal from the sole sensor 91. The pressure processing unit 221 then notifies the model storage unit 202 and the action determination mechanism unit 203 of the recognition result as state recognition information.
[0052]
The acceleration processing unit 222 notifies the model storage unit 202 and the action determination mechanism unit 203 of the direction and magnitude of the acceleration of the body unit 11 as state recognition information, based on the acceleration detection signal given from the acceleration sensor 132.
[0053]
The voice recognition unit 223 performs voice recognition on the voice signal provided from the microphone 82. The voice recognition unit 223 then notifies the model storage unit 202 and the action determination mechanism unit 203 of the word string obtained as the voice recognition result, such as “walk”, “down”, or “chase the ball”, as state recognition information. The voice recognition unit 223 also supplies the voice recognition result to the utterance speed detection unit 206.
[0054]
The image recognition unit 224 performs image recognition processing using the image signal given from the CCD camera 81. When the processing detects, for example, “a red round object” or “a plane that is perpendicular to the ground and is equal to or more than a predetermined height”, the image recognition unit 224 notifies the model storage unit 202 and the action determination mechanism unit 203 of an image recognition result such as “there is a ball” or “there is a wall” as state recognition information.
[0055]
The model storage unit 202 stores and manages an emotion model, an instinct model, and a growth model expressing the emotion, instinct, and growth state of the robot 1, respectively.
[0056]
Here, the emotion model represents, for example, the state (degree) of emotions such as “joy”, “sadness”, “anger”, and “fun” by values in a predetermined range (for example, −1.0 to 1.0), and the values are changed based on the state recognition information from the sensor input processing unit 201, the passage of time, and the like.
[0057]
The instinct model represents, for example, the state (degree) of instinctive desires such as “appetite”, “desire for sleep”, and “desire for exercise” by values in a predetermined range, and the values are changed based on the state recognition information from the sensor input processing unit 201, the passage of time, and the like.
[0058]
The growth model represents, for example, the state (degree) of growth such as “childhood”, “adolescence”, “maturity”, and “old age” by values in a predetermined range, and the values are changed based on the state recognition information from the sensor input processing unit 201, the passage of time, and the like.
[0059]
The model storage unit 202 sends the emotion, instinct, and growth state represented by the values of the emotion model, instinct model, and growth model as described above to the action determination mechanism unit 203 as state information.
[0060]
In addition to the state recognition information supplied from the sensor input processing unit 201, the model storage unit 202 is supplied with behavior information indicating the current or past behavior of the robot 1, specifically, the content of an action such as “walked for a long time”. Even when the same state recognition information is given, the model storage unit 202 generates different state information depending on the behavior of the robot 1 indicated by the behavior information.
[0061]
For example, when the robot 1 greets the user and is stroked by the user, the behavior information indicating that the robot 1 greets the user and state recognition information indicating that the robot has been stroked are stored in the model storage unit 202. In this case, in the model storage unit 202, the value of the emotion model representing “joy” is increased.
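The dependence of the emotion update on both the behavior information and the state recognition information can be sketched as a lookup keyed on the pair. Everything concrete here is an assumption made for illustration: the rule table `RULES`, the increments, the behavior labels, and the function name `update_emotions`. Only the value range of −1.0 to 1.0 and the "greet then stroked raises joy" example come from the text, and the time-based change is omitted.

```python
EMOTION_MIN, EMOTION_MAX = -1.0, 1.0

# Hypothetical rules: (behavior, recognition) -> (emotion, increment).
RULES = {
    ("greeted the user", "stroked"): ("joy", 0.2),
    ("walked for a long time", "stroked"): ("joy", 0.05),
    ("greeted the user", "hit"): ("anger", 0.2),
}


def update_emotions(emotions: dict, behavior: str, recognition: str) -> dict:
    """Return a copy of `emotions` updated for one (behavior, recognition) pair."""
    updated = dict(emotions)
    rule = RULES.get((behavior, recognition))
    if rule is None:
        return updated  # no rule: emotion values stay as they are
    name, delta = rule
    # Keep every value inside the predetermined range.
    updated[name] = max(EMOTION_MIN, min(EMOTION_MAX, updated[name] + delta))
    return updated
```

The same recognition result "stroked" raises "joy" by a different amount depending on what the robot was doing, which is the behavior the model storage unit 202 is described as having.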
[0062]
The action determination mechanism unit 203 determines the next action based on the state recognition information from the sensor input processing unit 201, the state information from the model storage unit 202, the passage of time, and the like, and outputs the content of the determined action to the posture transition mechanism unit 204 as action command information.
[0063]
That is, the action determination mechanism unit 203 manages, as an action model that defines the actions of the robot 1, a finite state automaton in which the actions the robot 1 can take correspond to states. The action determination mechanism unit 203 transitions the state in the finite state automaton serving as the action model based on the state recognition information from the sensor input processing unit 201, the values of the emotion model, the instinct model, or the growth model in the model storage unit 202, the passage of time, and the like, and determines the action corresponding to the state after the transition as the action to be taken next.
[0064]
Here, the action determination mechanism unit 203 transitions the state upon detecting that a predetermined trigger has occurred. That is, the action determination mechanism unit 203 transitions the state when, for example, the time for which the action corresponding to the current state has been executed reaches a predetermined time, when specific state recognition information is received, or when the value of the emotion, instinct, or growth state indicated by the state information supplied from the model storage unit 202 becomes equal to or less than a predetermined threshold.
[0065]
Note that, as described above, the action determination mechanism unit 203 transitions the state in the action model based not only on the state recognition information from the sensor input processing unit 201 but also on the values of the emotion model, the instinct model, the growth model, and the like in the model storage unit 202. Therefore, even when the same state recognition information is input, the transition destination differs depending on the values (state information) of the emotion model, the instinct model, and the growth model.
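An action model of this kind can be sketched as a small finite state automaton whose transition depends on both the recognition result and an emotion value. The states, the recognition strings used as triggers, the threshold of 0.0, and the function name `next_state` are hypothetical and serve only to show how one input can lead to different destinations.

```python
def next_state(state: str, recognition: str, emotions: dict) -> str:
    """Return the next action state of a toy action model.

    The transition uses the state recognition information together with
    the current emotion values (state information).
    """
    if state == "idle" and recognition == "there is a ball":
        # Same recognition result, different destination depending on "joy".
        return "chasing ball" if emotions["joy"] >= 0.0 else "ignoring ball"
    if state == "chasing ball" and recognition == "there is a wall":
        return "idle"
    # No trigger matched: remain in the current state.
    return state
```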
[0066]
In addition, based on the emotion, instinct, and growth state indicated by the state information supplied from the model storage unit 202, the action determination mechanism unit 203 determines parameters of the action corresponding to the transition destination state, such as a walking parameter and the magnitude and speed of the movement when a limb is moved. In this case, action command information including those parameters is sent to the posture transition mechanism unit 204.
[0067]
Further, the action determination mechanism unit 203 includes a dialogue management unit 231. The dialogue management unit 231 also generates, as needed, action command information for causing the robot 1 to speak (hereinafter, action command information for causing the robot 1 to speak is referred to as utterance command information). The utterance command information includes data on the utterance content to be output by the robot 1 and is supplied to the speech synthesis unit 208. Upon receiving the utterance command information, the speech synthesis unit 208 performs speech synthesis according to the utterance command information and outputs the resulting synthesized sound from the speaker 72 at the utterance speed commanded by the utterance speed correction unit 207.
[0068]
The posture transition mechanism unit 204 generates posture transition information for transitioning the posture of the robot 1 from the current posture to the next posture based on the action command information supplied from the action determination mechanism unit 203, and sends it to the control mechanism unit 205.
[0069]
Here, the postures to which the robot can transition next from the current posture are determined by, for example, the physical shape of the robot 1, such as the shape, weight, and connection state of the body unit 11, the head unit 12, the arm units 13A and 13B, and the leg units 14A and 14B, and by the mechanism of the actuators, such as the direction and angle in which each joint bends.
[0070]
The next posture may be one to which the robot can transition directly from the current posture or one to which it cannot. For example, the robot 1 can transition directly from lying on its back with its limbs stretched out to lying prone, but it cannot transition directly from lying on its back to standing. It must perform a two-stage operation in which it first draws its limbs toward its torso into a squatting posture and then stands up. There are also postures that cannot be executed safely.
[0071]
For this reason, the posture transition mechanism unit 204 registers in advance the postures to which a direct transition is possible, and when the action command information supplied from the action determination mechanism unit 203 indicates a posture to which a direct transition is possible, it sends that action command information to the control mechanism unit 205.
[0072]
On the other hand, when the action command information indicates a posture to which a direct transition is not possible, the posture transition mechanism unit 204 generates posture transition information that first moves the robot to another posture to which a transition is possible and then to the target posture, and sends it to the control mechanism unit 205. This avoids situations in which the robot forcibly attempts a posture it cannot transition to, or in which the robot falls over.
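Routing through intermediate postures can be sketched as a shortest-path search over a graph of pre-registered direct transitions. The breadth-first search and the posture graph `DIRECT` are assumptions made for illustration. The text says only that directly reachable postures are registered in advance and that an intermediate posture is used when no direct transition exists.

```python
from collections import deque
from typing import Optional

# Hypothetical table of postures reachable directly from each posture.
DIRECT = {
    "lying on back": ["lying prone", "squatting"],
    "lying prone": ["lying on back", "squatting"],
    "squatting": ["standing", "lying prone"],
    "standing": ["squatting"],
}


def plan_transition(current: str, target: str) -> Optional[list]:
    """Return the shortest posture sequence from current to target, or None."""
    queue = deque([[current]])
    visited = {current}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in DIRECT.get(path[-1], []):
            if nxt not in visited:  # never revisit a posture
                visited.add(nxt)
                queue.append(path + [nxt])
    return None  # target cannot be reached through registered transitions
```

For the example in the text, `plan_transition("lying on back", "standing")` routes through "squatting", while `plan_transition("lying on back", "lying prone")` is a single direct step.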
[0073]
The control mechanism unit 205 generates a control signal for driving the actuators A1 to A14 according to the posture transition information from the posture transition mechanism unit 204, and sends the control signal to the sub-control units 63A to 63D. The sub-control units 63A to 63D drive appropriate actuators based on the control signals to cause the robot 1 to execute various operations.
[0074]
The utterance speed detection unit 206 is supplied with the voice recognition result from the voice recognition unit 223 in the sensor input processing unit 201, calculates the user's utterance speed based on the supplied voice recognition result, and notifies the utterance speed correction unit 207 of the calculated utterance speed.
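One way to picture this calculation is to divide the amount of recognized speech by the time it took to utter it. The sketch below rests on assumptions not stated in this passage: that each recognized word carries a syllable (mora) count and start and end times, and that speed is expressed in syllables per second. `RecognizedWord` and `utterance_rate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RecognizedWord:
    text: str
    syllables: int  # assumed to come from the pronunciation in the word dictionary
    start_s: float  # assumed start time of the word, in seconds
    end_s: float    # assumed end time of the word, in seconds


def utterance_rate(words: Sequence[RecognizedWord]) -> float:
    """Return the utterance speed in syllables per second."""
    if not words:
        raise ValueError("empty recognition result")
    duration = words[-1].end_s - words[0].start_s
    if duration <= 0:
        raise ValueError("non-positive utterance duration")
    return sum(w.syllables for w in words) / duration
```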
[0075]
The utterance speed correction unit 207 sets the utterance speed of the synthesized sound output by the robot 1 based on the user's utterance speed notified from the utterance speed detection unit 206, and notifies the speech synthesis unit 208 of the setting information.
[0076]
The speech synthesis unit 208 receives the utterance command information from the dialogue management unit 231 in the action determination mechanism unit 203 and performs, for example, rule-based speech synthesis according to the utterance command information. Here, the speech synthesis unit 208 generates the synthesized sound so that it has the utterance speed set by the utterance speed correction unit 207, and supplies the synthesized sound to the speaker 72 for output.
[0077]
FIG. 7 is a functional block diagram illustrating functions of the voice recognition unit 223 of the sensor input processing unit 201.
[0078]
The voice data input to the voice recognition unit 223 via the microphone 82 and the A/D conversion unit 102 in FIG. 5 is supplied to the time information acquisition unit 251.
[0079]
The time information acquisition unit 251 acquires, from an internal clock, the current time at which the sound was collected by the microphone 82, and appends it to the voice data that was collected by the microphone 82 and A/D-converted by the A/D conversion unit 102. As a result, time information indicating the time at which the sound was collected is added to the voice data. The time information acquisition unit 251 supplies the voice data to which the time information has been added to the feature extraction unit 252.
[0080]
The feature extraction unit 252 performs acoustic analysis on the voice data from the time information acquisition unit 251 for each appropriate frame, thereby extracting a feature vector as a feature amount such as MFCC (Mel Frequency Cepstrum Coefficient). The feature extraction unit 252 can also extract other feature vectors (feature parameters) such as a spectrum, linear prediction coefficients, cepstrum coefficients, and line spectrum pairs.
[0081]
The feature vectors obtained for each frame in the feature extraction unit 252 are sequentially supplied to and stored in the feature vector buffer 253. Therefore, in the feature vector buffer 253, the feature vectors for each frame are stored in time series.
[0082]
The feature vector buffer 253 stores, for example, time-series feature vectors obtained from the start to the end of a certain utterance (voice section).
[0083]
The matching unit 254 uses the feature vectors stored in the feature vector buffer 253 and, referring as needed to the acoustic model database 255, the dictionary database 256, and the grammar database 257, recognizes the voice input to the microphone 82 (the input speech) based on, for example, the continuous distribution HMM method.
[0084]
That is, the acoustic model database 255 stores a set of acoustic models representing the acoustic characteristics of each predetermined unit (PLU (Phonetic-Linguistic-Units)), such as individual phonemes or syllables, in the language of the speech to be recognized. Here, since speech recognition is performed based on the continuous distribution HMM method, an HMM (Hidden Markov Model) using a probability density function such as a Gaussian distribution is used as the acoustic model, for example. The dictionary database 256 stores a word dictionary in which information on pronunciation (phonological information) is described for each word (vocabulary item) to be recognized. The grammar database 257 stores grammar rules (language models) that describe how the words registered in the word dictionary of the dictionary database 256 are chained (connected). Here, as the grammar rules, rules based on, for example, a context-free grammar (CFG), a regular grammar (RG), or statistical word chain probabilities (N-gram) can be used.
[0085]
The matching unit 254 refers to the word dictionary in the dictionary database 256 and connects the acoustic models stored in the acoustic model database 255 to form a word acoustic model (word model). Further, the matching unit 254 connects some word models by referring to the grammar rules stored in the grammar database 257, and uses the word models connected in this way to create a time-series feature vector and Is performed by the continuous distribution HMM method, and the voice input to the microphone 82 is recognized. That is, the matching unit 254 calculates a score representing the likelihood that the time-series feature vector stored in the feature vector buffer 253 is observed from the series of the respective word models configured as described above. Then, the matching unit 254 detects, for example, a sequence of the word model having the highest score, and outputs a word string corresponding to the sequence of the word model as a speech recognition result.
[0086]
Here, since the speech recognition is performed by the HMM method, the matching unit 254 accumulates the appearance probabilities of the individual feature vectors for the word string corresponding to the connected word models, and uses the accumulated value as the score.
[0087]
That is, the score calculation in the matching unit 254 is performed by comprehensively evaluating an acoustic score given by the acoustic models stored in the acoustic model database 255 (hereinafter referred to as the acoustic score, as appropriate) and a linguistic score given by the grammar rules stored in the grammar database 257 (hereinafter referred to as the language score, as appropriate).
[0088]
Specifically, when the HMM method is used, for example, the acoustic score is calculated for each word based on the probability (appearance probability) that the sequence of feature vectors output by the feature extraction unit 252 is observed from the acoustic models forming the word model. When a bigram is used, for example, the language score is obtained based on the probability that the word of interest is chained (connected) to the word immediately preceding it. The speech recognition result is then determined based on a final score obtained by comprehensively evaluating the acoustic score and the language score of each word (hereinafter referred to as the final score, as appropriate).
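The combination of the acoustic score and the language score can be illustrated with a minimal sketch. The specification does not state how the two scores are "comprehensively evaluated"; the sketch below assumes, purely for illustration, that both are log probabilities and that the final score is a weighted sum. The function names, the `lm_weight` parameter, the `"<s>"` start marker, and the probability floor are hypothetical.

```python
import math


def language_score(words, bigram_prob, floor=1e-6):
    """Sum of log bigram probabilities P(word | previous word).

    '<s>' is an assumed sentence-start marker; unseen bigrams fall back
    to a small floor probability (an assumption, not from the source).
    """
    score = 0.0
    prev = "<s>"
    for word in words:
        score += math.log(bigram_prob.get((prev, word), floor))
        prev = word
    return score


def final_score(words, acoustic_log_scores, bigram_prob, lm_weight=1.0):
    """Assumed combination: per-word acoustic log scores plus weighted language score."""
    return sum(acoustic_log_scores) + lm_weight * language_score(words, bigram_prob)


def best_hypothesis(hypotheses, bigram_prob, lm_weight=1.0):
    """Return the word string whose final score is highest.

    hypotheses: list of (words, per-word acoustic log scores) pairs.
    """
    best = max(
        hypotheses,
        key=lambda h: final_score(h[0], h[1], bigram_prob, lm_weight),
    )
    return best[0]
```

In this sketch a hypothesis with a slightly better acoustic score can still lose if its word sequence is improbable under the bigram table, which is the effect the language score is meant to have.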
[0089]
Here, the voice recognition unit 223 can be configured without the grammar database 257. However, the rules stored in the grammar database 257 restrict which word models can be connected, which limits the number of words for which the matching unit 254 calculates an acoustic score. As a result, the amount of calculation in the matching unit 254 can be reduced and the processing speed can be improved.
[0090]
The time information adding unit 258 adds the time at which the word string was uttered to the word string output from the matching unit 254 as the speech recognition result. That is, the time information adding unit 258 specifies the utterance start time and the utterance end time of the voice corresponding to the word string based on the utterance time of the voice data acquired by the time information acquisition unit 251, and adds the utterance start time and the utterance end time to the word string as time information. Then, the time information adding unit 258 supplies the word string to which the time information has been added to the model storage unit 202, the action determination mechanism unit 203, and the utterance speed detection unit 206 as state recognition information.
[0091]
Next, an interaction process of the robot 1, that is, a process in which the robot 1 interacts with a user, will be described with reference to the flowchart of FIG. 8.
[0092]
In step S1, the A / D conversion unit 102 determines whether or not there has been a voice input from the user via the microphone 82, and repeats the processing of step S1 to wait until there is a voice input from the user. Then, when there is a voice input from the user, the process proceeds to step S2.
[0093]
In step S2, the robot 1 performs a voice recognition process. The voice recognition process in step S2 will now be described in detail with reference to the flowchart in FIG. 9.
[0094]
In step S51 of FIG. 9, the A/D conversion unit 102 performs A/D conversion on the audio signal input from the microphone 82 and supplies the resulting audio data to the voice recognition unit 223 of the sensor input processing unit 201.
[0095]
In step S52, the time information acquisition unit 251 acquires the current time at which the voice is input as the voice input time, adds it to the audio data supplied from the A/D conversion unit 102, and supplies the audio data to the feature extraction unit 252.
[0096]
In step S53, the feature extraction unit 252 performs acoustic analysis on the audio data supplied from the time information acquisition unit 251 at appropriate time intervals, and extracts parameters (feature vectors) representing the acoustic characteristics of the voice as feature amounts. The feature extraction unit 252 adds to each extracted feature vector the utterance time of the voice from which it was extracted. The extracted feature vectors are sequentially supplied to and stored in the feature vector buffer 253.
[0097]
In step S54, the matching unit 254 reads out the time-series feature vectors stored in the feature vector buffer 253 and, using the acoustic models stored in the acoustic model database 255, the word dictionary describing phonological information stored in the dictionary database 256, and the language model stored in the grammar database 257, generates a word string corresponding to the time-series feature vectors. The matching unit 254 adds to the word string the time information indicating the utterance time that was added to the feature vectors, and supplies the word string to the time information adding unit 258.
[0098]
In step S55, based on the word string supplied from the matching unit 254 and the time information indicating the utterance time, the time information adding unit 258 specifies the time at which the voice underlying the supplied word string was picked up by the microphone 82, more specifically, the utterance start time and the utterance end time of the word string. The time information adding unit 258 then adds time information indicating the utterance start time and the utterance end time to the word string, and supplies it as the speech recognition result to the model storage unit 202, the action determination mechanism unit 203, and the utterance speed detection unit 206.
[0099]
As described above, the voice recognition processing is executed.
[0100]
Returning to FIG. 8, after the voice recognition processing in step S2, the processing proceeds to step S3.
[0101]
In step S3, the dialog management unit 231 of the action determination mechanism unit 203 analyzes the utterance content of the user based on the word string of the voice recognition result supplied from the voice recognition unit 223, and generates utterance content data for the robot 1. It then supplies utterance command information including this data to the speech synthesis unit 208.
[0102]
In step S4, the robot 1 executes an utterance speed detection process to detect the utterance speed (number of characters / second) of the user.
[0103]
Here, the utterance speed detection processing in step S4 in FIG. 8 will be described in detail with reference to the flowchart in FIG. 10.
[0104]
When a word string is supplied from the voice recognition unit 223, the utterance speed detection unit 206, in step S71 of FIG. 10, starts measuring the elapsed time from the utterance end time of the supplied word string, based on the time information added to that word string. That is, the main control unit 61 has an internal clock (not shown), and the utterance speed detection unit 206 acquires the current time from the internal clock and takes the difference between the current time and the utterance end time of the word string, thereby starting the process of obtaining the elapsed time from the utterance end time.
[0105]
In step S72, the utterance speed detection unit 206 determines, based on the elapsed time whose measurement was started in step S71, whether or not a preset predetermined time has elapsed since the utterance end time of the word string. If the predetermined time has not elapsed since the utterance end time of the word string, the process proceeds to step S73.
[0106]
In step S73, the utterance speed detection unit 206 determines whether or not the next word string has been input from the voice recognition unit 223. If the next word string has not been input, the process returns to step S72, and the processing from step S72 is repeated.
[0107]
If the utterance speed detection unit 206 determines in step S73 that the next word string has been input from the speech recognition unit 223, the process proceeds to step S74.
[0108]
In step S74, the utterance speed detection unit 206 resets the elapsed time at which the measurement was started in step S71 to zero. Thereafter, the process returns to step S71, and repeats the processes from step S71.
[0109]
As described above, when the next word string is supplied within a predetermined time after the supply of the word string from the voice recognition unit 223, the processing of steps S71 to S74 is repeated. Thereby, the robot 1 can keep listening to the user's story until the user's speech is interrupted.
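The loop of steps S71 to S74 can be sketched offline as a grouping of time-stamped word strings into utterances. This is only an illustration: the specification describes a real-time loop driven by an internal clock, whereas the sketch below works on an already recorded list and approximates "the next word string is input" by the next word string's utterance start time. The `TimedWordString` type, the function name, and the `timeout` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimedWordString:
    text: str     # recognized word string
    start: float  # utterance start time in seconds
    end: float    # utterance end time in seconds


def split_utterances(word_strings, timeout):
    """Group word strings into utterances.

    A new utterance begins when the gap between the end of the previous
    word string and the start of the next one reaches `timeout` seconds,
    mirroring the "predetermined time has elapsed" check of step S72.
    """
    utterances = []
    current = []
    for ws in word_strings:
        if current and ws.start - current[-1].end >= timeout:
            utterances.append(current)  # the pause was long enough: close the utterance
            current = []
        current.append(ws)
    if current:
        utterances.append(current)
    return utterances
```

Each resulting group corresponds to one stretch of speech over which the utterance speed is subsequently measured.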
[0110]
Then, in step S72, if the utterance speed detection unit 206 determines that a predetermined time has elapsed from the ending time of the utterance of the word string, the process proceeds to step S75.
[0111]
In step S75, based on the time information added to the word strings, the utterance speed detection unit 206 obtains the utterance time, that is, the time from the utterance start time of the first input word string to the utterance end time of the last input word string. As described above, the processing from step S71 to step S74 is repeated from when the user starts speaking until the utterance pauses. When the utterance pauses, in step S75, the utterance speed detection unit 206 calculates the time from when the user started speaking to when the user stopped. Specifically, the utterance speed detection unit 206 takes the difference between the utterance start time of the word string input first after the user began speaking and the utterance end time of the word string input last before the utterance paused.
[0112]
In step S76, the utterance speed detection unit 206 obtains the total number of characters in all the word strings supplied from the voice recognition unit 223 from when the user started speaking until the user stopped. Note that the total number of characters is not the count in mixed kanji and kana notation, but the count when everything is written in kana. Instead of the number of characters, the number of morae may be obtained.
[0113]
In step S77, the utterance speed detection unit 206 calculates the utterance speed (the number of characters uttered by the user per second) by dividing the total number of characters obtained in step S76 by the utterance time obtained in step S75.
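The calculation in steps S75 to S77 can be sketched as follows. The sketch assumes that each word string has already been converted to kana and is given as a `(kana_text, start_time, end_time)` tuple with times in seconds; this tuple layout and the function name are hypothetical.

```python
def utterance_speed(word_strings):
    """Return the utterance speed in characters per second.

    word_strings: non-empty list of (kana_text, start_time_s, end_time_s)
    tuples in the order in which they were uttered.
    """
    # Step S75: time from the start of the first word string
    # to the end of the last word string.
    duration = word_strings[-1][2] - word_strings[0][1]
    if duration <= 0:
        raise ValueError("utterance time must be positive")
    # Step S76: total number of kana characters.
    total_chars = sum(len(text) for text, _, _ in word_strings)
    # Step S77: characters per second.
    return total_chars / duration
```

For example, the two word strings "こんにちは" (0.0 s to 1.0 s) and "げんきですか" (1.5 s to 2.75 s) contain 11 kana characters over 2.75 seconds, giving 4.0 characters per second. Note that the pause between the two word strings is included in the utterance time, as in step S75.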
[0114]
In step S78, the speech speed detection unit 206 notifies the speech speed correction unit 207 of the speech speed calculated in step S77.
[0115]
As described above, the speech speed detection processing is executed.
[0116]
The process returns to FIG. 8, and after the speech speed detection process in step S4, the process proceeds to step S5.
[0117]
In step S5, the robot 1 executes an utterance speed correction process based on the utterance speed notified from the utterance speed detection unit 206, and sets the utterance speed of the synthetic sound output by the robot 1.
[0118]
Here, the utterance speed correction processing in step S5 in FIG. 8 will be described in detail with reference to the flowchart in FIG. 11.
[0119]
In step S101 in FIG. 11, the utterance speed correction unit 207 determines whether or not the utterance speed of the user notified from the utterance speed detection unit 206 is higher than the utterance speed that is optimum for speech recognition (hereinafter referred to as the optimum speed). If the utterance speed of the user is higher than the optimum speed, the process proceeds to step S102.
[0120]
In step S102, the utterance speed correction unit 207 sets the utterance speed of the synthesized sound output from the speaker 72 to a value lower than the utterance speed of the user by a predetermined value. The value to be set is described in detail later. Thereafter, the process proceeds to step S104.
[0121]
In step S101, when the utterance speed correction unit 207 determines that the utterance speed of the user is not higher than the optimum speed (the utterance speed of the user is equal to or lower than the optimum speed), the process proceeds to step S103.
[0122]
In step S103, the utterance speed correction unit 207 sets the utterance speed of the synthesized sound output from the speaker 72 to a value higher than the utterance speed of the user by a predetermined value. The value to be set is described in detail later. Thereafter, the process proceeds to step S104.
[0123]
In step S104, the speech speed correction unit 207 notifies the speech synthesis unit 208 of the set value of the speech speed of the robot 1.
[0124]
As described above, the speech speed correction process is executed.
[0125]
Here, a method of setting the value of the utterance speed of the robot 1 in steps S102 and S103 will be described with reference to FIGS.
[0126]
FIG. 12 is a graph showing a temporal change in the utterance speed of the user and the utterance speed of the robot 1.
[0127]
In the graph of FIG. 12, the vertical axis represents the utterance speed, and the horizontal axis represents time. Further, the speech speed V of the user is represented by a solid line, and the speech speed V ′ of the robot 1 is represented by a chain line.
[0128]
In the example of FIG. 12, the utterance speed V ′ of the robot 1 is set so as to satisfy Expression (1).
[0129]
(V′−v) / (V−v) = k (1)
[0130]
In Expression (1), k is a constant satisfying 0 < k < 1.
[0131]
That is, the utterance speed V′ of the robot 1 is set so that the ratio of the difference (V′ − v) between the utterance speed of the robot 1 and the optimum speed v to the difference (V − v) between the utterance speed V of the user and the optimum speed v is constant. The utterance speed V′ of the robot 1 is thus set closer to the optimum speed v than the utterance speed V of the user.
[0132]
It is said that people tend to adjust their own utterance speed during a conversation to that of the other party. Thus, by setting the utterance speed V′ of the robot 1 to a value closer to the optimum speed v than the utterance speed V of the user, the user unconsciously adjusts his or her own utterance speed V toward the utterance speed V′ of the robot 1. As a result, the user's utterance speed is brought to the optimum speed v without the user being aware of it. In FIG. 12, as the utterance speed V′ of the robot 1 gradually decreases from time t0 to time t1, the user's utterance speed V also decreases, and at time t1 the user's utterance speed V coincides with the optimum speed v.
[0133]
In this way, the robot 1 can guide the user's utterance speed to an utterance speed at which voice recognition can be performed with high accuracy.
[0134]
Note that the closer k is to 1, the closer the utterance speed V′ of the robot 1 is set to the utterance speed V of the user, and the closer k is to 0, the closer the utterance speed V′ of the robot 1 is set to the optimum speed v.
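Solving Expression (1) for the robot's utterance speed gives V′ = v + k(V − v). The following minimal sketch evaluates this solved form; the function name and the default value of k are hypothetical, and the speeds are assumed to be in characters per second.

```python
def robot_speed_eq1(user_speed, optimum_speed, k=0.5):
    """Robot utterance speed V' from Expression (1): (V' - v) / (V - v) = k.

    Solved for V': V' = v + k * (V - v), with 0 < k < 1.
    """
    if not 0.0 < k < 1.0:
        raise ValueError("k must satisfy 0 < k < 1")
    return optimum_speed + k * (user_speed - optimum_speed)
```

With an optimum speed of 6.0 and k = 0.5, a user speaking at 10.0 yields a robot speed of 8.0 (slower than the user), and a user speaking at 4.0 yields 5.0 (faster than the user). The solved form therefore covers both step S102 and step S103 with a single formula. When the user already speaks at the optimum speed, the ratio in Expression (1) is undefined, but the solved form simply returns the optimum speed.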
[0135]
FIG. 12 shows an example in which the user's utterance speed V is higher than the optimum speed v. FIG. 13 shows an example in which the user's utterance speed V is lower than the optimum speed v.
[0136]
In the graph of FIG. 13, as in FIG. 12, the vertical axis represents the utterance speed, the horizontal axis represents time, the solid line represents the utterance speed V of the user, and the one-dot chain line represents the utterance speed V′ of the robot 1. Also in FIG. 13, the utterance speed V′ of the robot 1 is set based on Expression (1) above. That is, the utterance speed V′ of the robot 1 is set to a value closer to the optimum speed v than the utterance speed V of the user. As a result, the user's utterance speed V can be gradually guided to the optimum speed v. At time t1, the user's utterance speed V matches the optimum speed v.
[0137]
As described above, in steps S102 and S103, the speech speed correction unit 207 sets the speech speed V ′ of the robot 1.
[0138]
Next, an example in which the utterance speed of the robot 1 is set by a method different from Expression (1) will be described with reference to the graph of FIG. 14.
[0139]
In FIG. 14, as in FIGS. 12 and 13, the vertical axis represents the utterance speed, the horizontal axis represents time, the solid line represents the user's utterance speed V, and the one-dot chain line represents the utterance speed V′ of the robot 1. The figure also indicates the difference between the utterance speed of the user and the optimum speed, and the difference between the utterance speed of the robot 1 and the optimum speed.
[0140]
In the example shown in FIG. 14, the utterance speed of the robot 1 is set based on Expression (2).
[0141]
(v − V′) = (V − v) (2)
[0142]
That is, in the example of FIG. 14, the difference (v − V′) between the optimum speed v and the utterance speed V′ of the robot 1 is set to be equal to the difference (V − v) between the user's utterance speed V and the optimum speed v. Note that a minimum value Vmin is set in advance for the utterance speed V′ of the robot 1. When Expression (2) yields a value of V′ lower than the minimum value Vmin, the robot 1 stops applying Expression (2) and sets the utterance speed V′ to the minimum value Vmin.
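As described in the preceding paragraph, the difference (v − V′) is set equal to the difference (V − v), which gives V′ = 2v − V when solved for the robot's utterance speed: the robot deviates from the optimum speed by the same amount as the user, in the opposite direction. A minimal sketch of this rule with the minimum value Vmin follows. The function and parameter names are hypothetical; the specification mentions only a minimum value, so no upper bound is applied here.

```python
def robot_speed_eq2(user_speed, optimum_speed, min_speed):
    """Robot utterance speed V' with (v - V') = (V - v), floored at Vmin.

    Solved for V': V' = 2 * v - V. If that value falls below min_speed,
    min_speed is used instead.
    """
    mirrored = 2.0 * optimum_speed - user_speed
    return max(mirrored, min_speed)
```

With an optimum speed of 6.0 and a minimum of 3.0, a user speed of 8.0 gives a robot speed of 4.0, while a user speed of 11.0 would give 1.0 and is therefore replaced by the minimum value 3.0.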
[0143]
In FIG. 14, the user's utterance speed V gradually approaches the optimum speed v from time t0 to time t1, and coincides with the optimum speed v at time t1.
[0144]
In steps S102 and S103 of FIG. 11, the speech speed of the robot 1 may be set as described above.
[0145]
Returning to the flowchart of FIG. 8, after the speech speed correction processing in step S5, the processing proceeds to step S6.
[0146]
In step S6, the speech synthesis unit 208 generates a synthesized sound that utters the utterance content of the robot 1 determined in step S3 at the utterance speed set in step S5, and outputs the generated synthesized sound from the speaker 72. That is, the speech synthesis unit 208 has been supplied with the utterance command information including the utterance content data from the dialog management unit 231 in step S3, and with the set value of the utterance speed from the utterance speed correction unit 207 in step S5. The speech synthesis unit 208 therefore generates a synthesized sound based on the supplied utterance content data, synthesizing it so that its utterance speed matches the set value supplied from the utterance speed correction unit 207. The speech synthesis unit 208 then outputs the generated synthesized sound from the speaker 72.
[0147]
As described above, the dialog control processing of the robot 1 is executed.
[0148]
As described above, the robot 1 of the present invention guides the user's utterance speed to a speed optimal for speech recognition without the user being aware of it. The accuracy of voice recognition can therefore be kept good. Further, even if the robot 1 is configured to ask the user to repeat an utterance (to request that the user speak again) when it cannot catch the user's voice (cannot recognize the voice accurately), failures to recognize the voice accurately themselves become less frequent, so the number of such requests can be reduced.
[0149]
In the above description, the processing of step S3 and the processing of steps S4 and S5 are executed in parallel. That is, the process of step S5 is performed after the process of step S4, but the process of step S3 and the processes of steps S4 and S5 are performed simultaneously and in parallel.
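The parallel execution of step S3 with steps S4 and S5 can be sketched with a thread pool. This is only one possible arrangement; the specification does not say how the parallelism is realized. The two worker functions are hypothetical stand-ins: `decide_reply` merely echoes the recognized text in place of the dialog management unit 231, and `detect_and_correct_speed` assumes the setting rule of Expression (1).

```python
from concurrent.futures import ThreadPoolExecutor


def decide_reply(word_strings):
    """Stand-in for step S3: decide the utterance content of the robot."""
    return "reply to: " + " ".join(text for text, _, _ in word_strings)


def detect_and_correct_speed(word_strings, optimum_speed, k):
    """Stand-in for step S4 followed by step S5 (these two run in sequence)."""
    duration = word_strings[-1][2] - word_strings[0][1]
    user_speed = sum(len(text) for text, _, _ in word_strings) / duration
    return optimum_speed + k * (user_speed - optimum_speed)


def respond(word_strings, optimum_speed=6.0, k=0.5):
    """Run step S3 in parallel with steps S4 and S5, then hand both results to step S6."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        content = pool.submit(decide_reply, word_strings)
        speed = pool.submit(detect_and_correct_speed, word_strings, optimum_speed, k)
        # Step S6 needs both results, so wait for each future here.
        return content.result(), speed.result()
```

The point of the sketch is the dependency structure: step S5 depends on step S4, step S6 depends on both branches, and step S3 is independent of steps S4 and S5.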
[0150]
In the utterance speed correction process shown in FIG. 11, the utterance speed of the robot 1 is set on the basis of a single optimum speed at which the accuracy of voice recognition can be kept good. Alternatively, a range of utterance speeds in which the accuracy can be kept good may be set, and the method of setting the utterance speed of the robot 1 may be changed depending on whether the user's utterance speed is inside or outside that range. The utterance speed correction process (the process of step S5 in FIG. 8) in such a case will now be described with reference to the flowchart in FIG. 15.
[0151]
The utterance speed correction unit 207 stores, in addition to the optimum speed, the upper limit value and the lower limit value of the utterance speed range in which the accuracy of voice recognition can be kept good. Therefore, in step S121 in FIG. 15, the speech speed correction unit 207 determines whether or not the speech speed of the user notified from the speech speed detection unit 206 is within a range in which the accuracy of voice recognition can be kept good. If the utterance speed of the user is not within the range where the accuracy of the voice recognition can be kept good, the process proceeds to step S122.
[0152]
In step S122, the utterance speed correction unit 207 determines whether or not the utterance speed of the user is higher than the upper limit of the range in which the accuracy of voice recognition can be kept good. If the utterance speed of the user is higher than that upper limit, the process proceeds to step S123.
[0153]
In step S123, the speech speed correction unit 207 sets the speech speed of the robot 1 to a value smaller than the speech speed of the user by a predetermined value. The utterance speed correction unit 207 sets the utterance speed of the robot 1 using, for example, Expression (1) or Expression (2). Thereafter, the process proceeds to step S126.
[0154]
If the utterance speed correction unit 207 determines in step S122 that the utterance speed of the user is not higher than the upper limit of the range in which the accuracy of voice recognition can be kept good (that is, the utterance speed of the user is lower than the lower limit of that range), the process proceeds to step S124.
[0155]
In step S124, the utterance speed correction unit 207 sets the utterance speed of the robot 1 to a value higher than the utterance speed of the user by a predetermined value. The utterance speed correction unit 207 sets the utterance speed of the robot 1 using, for example, Expression (1) or Expression (2). Thereafter, the process proceeds to step S126.
[0156]
In step S121, when the utterance speed correction unit 207 determines that the utterance speed of the user is within a range in which the voice can be accurately recognized, the process proceeds to step S125.
[0157]
In step S125, the speech speed correction unit 207 sets the speech speed of the robot 1 to the same value as the speech speed of the user. Thereafter, the process proceeds to step S126.
[0158]
In step S126, the speech speed correction unit 207 notifies the speech synthesis unit 208 of the set value of the speech speed of the robot 1 set in step S123, step S124, or step S125.
[0159]
As described above, when the user's utterance speed is within the range in which the accuracy of voice recognition can be maintained, the utterance speed of the robot 1 may be adjusted to the utterance speed of the user. This allows the user to interact with the robot 1 more naturally and comfortably.
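The range-based correction of steps S121 to S126 can be sketched as follows. Outside the range the sketch uses the solved form of Expression (1), which is one of the two options the specification names for steps S123 and S124. Treating the boundary values themselves as inside the range is an assumption, and the function and parameter names are hypothetical.

```python
def corrected_speed(user_speed, lower, optimum, upper, k=0.5):
    """Return the robot utterance speed for the process of FIG. 15.

    lower, upper: bounds of the range in which recognition accuracy stays good.
    optimum: optimum speed v, assumed to lie within [lower, upper].
    """
    if not lower <= optimum <= upper:
        raise ValueError("optimum speed must lie within the range")
    if lower <= user_speed <= upper:
        # Steps S121 and S125: inside the range, match the user's speed.
        return user_speed
    # Steps S123 and S124: outside the range, move toward the optimum speed.
    # Above the upper limit this gives a value lower than the user's speed,
    # below the lower limit a value higher than the user's speed.
    return optimum + k * (user_speed - optimum)
```

With a range of 4.0 to 8.0 and an optimum speed of 6.0, a user speed of 7.0 is returned unchanged, a user speed of 10.0 gives 8.0, and a user speed of 2.0 gives 4.0.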
[0160]
An example of the utterance speed of the robot 1 set by the utterance speed correction process shown in FIG. 15 will be described with reference to the graph of FIG. 16.
[0161]
In the graph of FIG. 16, the vertical axis represents the utterance speed, and the horizontal axis represents time. The "upper limit speed" on the vertical axis represents the upper limit of the range of user utterance speeds at which the voice can be recognized accurately, and the "lower limit speed" represents the lower limit of that range. If the utterance speed of the user is between the upper limit speed and the lower limit speed, the voice can be recognized with high accuracy. The "optimum speed" on the vertical axis indicates the user utterance speed at which the voice can be recognized with the highest accuracy. In the graph of FIG. 16, the utterance speed V of the user is represented by a solid line, and the utterance speed V′ of the robot 1 is represented by a chain line.
[0162]
In FIG. 16, in a section from time t0 to t1, the user's utterance speed V is higher than the upper limit speed. Therefore, the speech speed of the robot 1 is set based on, for example, Expression (1). In FIG. 16, in the section from time t1 to t2, the user's utterance speed V is between the lower limit speed and the upper limit speed. Therefore, at this time, the robot 1 sets its own utterance speed to the same value as the user's utterance speed. Thereby, the user can talk with the robot 1 more naturally and comfortably. At time t2, the user's utterance speed again exceeds the upper limit speed. At this time, the utterance speed of the robot 1 is changed so as to be set again based on, for example, Expression (1). Then, at time t3, when the user's utterance speed V falls between the upper limit speed and the lower limit speed, the robot 1 again sets its own utterance speed to the same value as the user's utterance speed.
[0163]
The above may be performed.
[0164]
In the above description, the case where the present invention is applied to a humanoid robot has been described as an example. However, the present invention can also be applied to robots other than humanoid robots (for example, a dog-type robot) and to industrial robots. Furthermore, the present invention can be applied to other devices having a function of interacting with a user, such as a car navigation system.
[0165]
The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the programs constituting the software are installed from a recording medium or the like onto a computer built into dedicated hardware, or onto, for example, a general-purpose personal computer that can execute various functions when various programs are installed.
[0166]
FIG. 17 is a diagram illustrating an example of the internal configuration of the personal computer 301 that executes such processing. A CPU (Central Processing Unit) 311 of the personal computer executes various processes according to a program stored in a ROM (Read Only Memory) 312. A RAM (Random Access Memory) 313 stores data and programs necessary for the CPU 311 to execute various processes as appropriate. The input / output interface 315 is also connected to an output unit 316 including a display, a speaker, a DA converter, and the like. An input unit 317 including a mouse, a keyboard, a microphone, an AD converter, and the like is connected to the input / output interface 315, and outputs a signal input to the input unit 317 to the CPU 311.
[0167]
Further, the input / output interface 315 is also connected to a storage unit 318 configured from a hard disk or the like, and a communication unit 319 that performs data communication with another device via a network such as the Internet. The drive 320 is used when reading data from or writing data to a recording medium such as a magnetic disk 331, an optical disk 332, a magneto-optical disk 333, and a semiconductor memory 334.
[0168]
As shown in FIG. 17, the recording medium is constituted by a package medium on which the program is recorded and which is distributed separately from the personal computer in order to provide the program to the user, such as the magnetic disk 331 (including a flexible disk), the optical disk 332 (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), the magneto-optical disk 333 (including an MD (Mini-Disc) (registered trademark)), or the semiconductor memory 334. It may also be constituted by the ROM 312 or the hard disk included in the storage unit 318, on which the program is stored and which is provided to the user already built into the computer.
[0169]
In this specification, the steps describing the program provided by the medium include not only processing performed in chronological order in the order described, but also processing that is not necessarily performed chronologically and is executed in parallel or individually.
[0170]
Also, in this specification, a system refers to an entire device including a plurality of devices.
[0171]
[Effect of the invention]
Thus, according to the present invention, speech can be recognized. In particular, it is possible to guide the user's utterance speed to an utterance speed at which speech recognition can be performed with high accuracy, and as a result, it is possible to reduce the probability of erroneous speech recognition.
[Brief description of the drawings]
FIG. 1 is a perspective view showing an external configuration of a robot to which the present invention is applied.
FIG. 2 is a rear perspective view showing the external configuration of the robot shown in FIG. 1;
FIG. 3 is a schematic diagram for explaining the robot of FIG. 1;
FIG. 4 is a block diagram showing an internal configuration of the robot shown in FIG. 1;
FIG. 5 is a block diagram for mainly explaining a portion related to control of the robot in FIG. 1;
FIG. 6 is a block diagram illustrating a configuration of a main control unit in FIG. 5;
FIG. 7 is a block diagram illustrating a configuration of a voice recognition unit in FIG. 6;
FIG. 8 is a flowchart illustrating an interactive process of the robot.
FIG. 9 is a flowchart illustrating the process of step S2 in FIG. 8 in detail.
FIG. 10 is a flowchart illustrating the process of step S4 in FIG. 8 in detail.
FIG. 11 is a flowchart illustrating the process of step S5 in FIG. 8 in detail.
FIG. 12 is a diagram illustrating an example of an utterance speed of a robot set by the utterance speed correction process of FIG. 11;
FIG. 13 is another diagram illustrating an example of the robot utterance speed set by the utterance speed correction process of FIG. 11;
FIG. 14 is yet another diagram illustrating an example of the robot utterance speed set by the utterance speed correction process of FIG. 11;
FIG. 15 is another flowchart illustrating in detail the process of step S5 in FIG. 8;
FIG. 16 is a diagram illustrating an example of a robot utterance speed set by the utterance speed correction process of FIG. 15;
FIG. 17 is a block diagram illustrating a configuration of a computer to which the present invention has been applied.
[Explanation of symbols]
1 robot, 61 main control unit, 63 sub control unit, 72 speaker, 82 microphone, 101 D/A conversion unit, 102 A/D conversion unit, 111 CPU, 112 internal memory, 121 OS, 122 application program, 201 sensor input processing section, 202 model storage section, 203 action determination mechanism section, 204 attitude transition mechanism section, 205 control mechanism section, 206 speech rate detection section, 207 speech rate correction section, 208 speech synthesis section, 223 speech recognition section, 231 dialog management section, 251 time information acquisition unit, 252 feature extraction unit, 253 feature vector buffer, 254 matching unit, 255 acoustic model database, 256 dictionary database, 257 grammar database, 258 time information addition unit

Claims

A speech processing device that interacts with a user, comprising:
speech recognition means for recognizing speech uttered by the user;
calculation means for calculating the utterance speed of the user based on a word string recognized and generated by the speech recognition means;
setting means for comparing the utterance speed of the user calculated by the calculation means with an utterance speed at which speech can be recognized accurately, and setting the utterance speed of a synthesized sound output from the speech processing device;
determination means for determining the utterance content of the synthesized sound to be output; and
output means for outputting, based on the utterance content determined by the determination means, the synthesized sound at the utterance speed set by the setting means.
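
The calculation of the user's utterance speed from the recognized word string can be sketched as follows. This is a minimal illustration and not the implementation of the embodiment: the RecognizedWord type, its field names, and the choice of characters per second as the unit are assumptions introduced here. The claim states only that the speed is calculated from the word string produced by speech recognition.

```python
from dataclasses import dataclass


@dataclass
class RecognizedWord:
    """One word of the recognition result (hypothetical structure)."""
    text: str       # recognized word
    start_s: float  # assumed start time of the word, in seconds
    end_s: float    # assumed end time of the word, in seconds


def utterance_speed(words: list[RecognizedWord]) -> float:
    """Return the utterance speed in characters per second (assumed unit)."""
    if not words:
        raise ValueError("the word string is empty")
    # Duration spans from the start of the first word to the end of the last.
    duration = words[-1].end_s - words[0].start_s
    if duration <= 0:
        raise ValueError("the word string has no positive duration")
    units = sum(len(w.text) for w in words)
    return units / duration
```

For example, two five-character words spoken over two seconds give a speed of 5.0 characters per second under these assumptions.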

The speech processing device according to claim 1, wherein the setting means:
sets the utterance speed of the synthesized sound to be output lower than the utterance speed of the user when the utterance speed of the user is higher than the utterance speed at which speech can be recognized accurately; and
sets the utterance speed of the synthesized sound to be output higher than the utterance speed of the user when the utterance speed of the user is lower than the utterance speed at which speech can be recognized accurately.

The speech processing device according to claim 1, wherein the setting means:
sets the utterance speed of the synthesized sound to be output to the same value as the utterance speed of the user when the utterance speed of the user is within the range of utterance speeds at which speech can be recognized accurately;
sets the utterance speed of the synthesized sound to be output lower than the utterance speed of the user when the utterance speed of the user is higher than the upper limit of that range; and
sets the utterance speed of the synthesized sound to be output higher than the utterance speed of the user when the utterance speed of the user is lower than the lower limit of that range.
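
The three-way rule of this claim can be sketched as follows. The claim fixes only the direction of the correction (equal, lower, or higher than the user's speed), not its size, so the gain parameter and the linear step toward the nearest limit are assumptions of this sketch, as is the function name set_robot_speed.

```python
def set_robot_speed(user_speed: float, lower: float, upper: float,
                    gain: float = 0.5) -> float:
    """Return the utterance speed for the synthesized sound.

    lower and upper bound the range in which recognition is accurate.
    gain (hypothetical, 0 < gain <= 1) is the fraction of the distance
    to the nearest limit by which the robot's speed is shifted.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if not 0 < gain <= 1:
        raise ValueError("gain must be in (0, 1]")
    if user_speed > upper:
        # User is too fast: speak more slowly than the user.
        return user_speed - gain * (user_speed - upper)
    if user_speed < lower:
        # User is too slow: speak faster than the user.
        return user_speed + gain * (lower - user_speed)
    # User is within the accurate range: match the user's speed.
    return user_speed
```

Setting lower equal to upper reduces the range to a single target speed, which gives the two-way comparison of the preceding claim.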

A speech processing method for a speech processing device that interacts with a user, the method comprising:
a speech recognition step of recognizing speech uttered by the user;
a calculation step of calculating the utterance speed of the user based on a word string recognized and generated by the processing of the speech recognition step;
a setting step of comparing the utterance speed of the user calculated by the processing of the calculation step with an utterance speed at which speech can be recognized accurately, and setting the utterance speed of a synthesized sound output from the speech processing device;
a determination step of determining the utterance content of the synthesized sound to be output; and
an output step of outputting, based on the utterance content determined by the processing of the determination step, the synthesized sound at the utterance speed set by the processing of the setting step.
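
The order of the steps in this method can be sketched as a single dialogue turn. The function name dialogue_turn and the five callables passed to it are hypothetical placeholders for the recognition, calculation, setting, determination, and output steps; the sketch shows only how data flows between them and assumes that the utterance content is determined from the recognition result.

```python
from typing import Any, Callable, Sequence


def dialogue_turn(
    audio: Any,
    recognize: Callable[[Any], Sequence[Any]],
    calc_speed: Callable[[Sequence[Any]], float],
    set_speed: Callable[[float], float],
    decide_content: Callable[[Sequence[Any]], str],
    synthesize: Callable[[str, float], Any],
) -> Any:
    """Run one turn: recognize, measure speed, set speed, decide, output."""
    words = recognize(audio)           # speech recognition step
    user_speed = calc_speed(words)     # calculation step
    robot_speed = set_speed(user_speed)  # setting step
    content = decide_content(words)    # determination step
    return synthesize(content, robot_speed)  # output step
```

Passing the steps as callables keeps the sketch independent of any particular recognizer or synthesizer.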

A recording medium on which a computer-readable program is recorded, the program causing a computer to execute processing for controlling a speech processing device that interacts with a user, the program comprising:
a speech recognition step of recognizing speech uttered by the user;
a calculation step of calculating the utterance speed of the user based on a word string recognized and generated by the processing of the speech recognition step;
a setting step of comparing the utterance speed of the user calculated by the processing of the calculation step with an utterance speed at which speech can be recognized accurately, and setting the utterance speed of a synthesized sound output from the speech processing device; and
a determination step of determining the utterance content of the synthesized sound to be output at the utterance speed set by the processing of the setting step.

A program causing a computer that controls a speech processing device that interacts with a user to execute:
a speech recognition step of recognizing speech uttered by the user;
a calculation step of calculating the utterance speed of the user based on a word string recognized and generated by the processing of the speech recognition step;
a setting step of comparing the utterance speed of the user calculated by the processing of the calculation step with an utterance speed at which speech can be recognized accurately, and setting the utterance speed of a synthesized sound output from the speech processing device; and
a determination step of determining the utterance content of the synthesized sound to be output at the utterance speed set by the processing of the setting step.