JP2020177106A - Voice interaction control method, voice interaction control device and program

Info

Publication number: JP2020177106A
Application number: JP2019078726A
Authority: JP
Inventors: 浩貴山田; Hirotaka Yamada; 兼人小川; Kaneto Ogawa; 勇次國武; Yuji Kunitake
Original assignee: Panasonic Intellectual Property Corp of America
Current assignee: Panasonic Intellectual Property Corp of America
Priority date: 2019-04-17
Filing date: 2019-04-17
Publication date: 2020-10-29

Abstract

PROBLEM TO BE SOLVED: To make a user who has difficulty understanding the meaning of an ask-back response sentence, such as an infant, intuitively comprehend that the cause of a misrecognition lies in speaking speed.

SOLUTION: A voice interaction control device performs voice recognition processing on voice data converted from a first speech by a user input to a voice input device; when the reliability of the voice recognition processing is equal to or less than a threshold, measures, based on the voice data, a first speaking speed that is the speed of the user's speech; determines, based on the first speaking speed, a second speaking speed that is the speed at which a response sentence urging the user to speak again is output by voice; and outputs instruction information for outputting the voice of the response sentence at the second speaking speed.

SELECTED DRAWING: Figure 2

Description

The present disclosure relates to a technique for interacting with a user by voice.

In a conventional voice dialogue device, when the voice recognition of a user's voice input is presumed to be erroneous, an ask-back message prompting the user to repeat the utterance is output. However, merely outputting an ask-back message is not enough to make the user aware of what caused the misrecognition. As a result, the cause is not corrected when the user speaks again, and the voice dialogue device asks back repeatedly.

Techniques for making the user aware of the cause of a voice recognition error include, for example, those of Patent Document 1, Patent Document 2, and Patent Document 3.

Patent Document 1 discloses that, when a voice recognition result is presumed to contain a misrecognition, the cause of the misrecognition is analyzed, and a response containing the voice recognition result, to which a vocal conversion emphasizing that cause has been applied, is synthesized and output. This lets the user intuitively recognize which part of the input voice was misrecognized and why. For example, if a misrecognition occurs because only part of the input voice was too quiet, the voice recognition result is output as a response with the volume of the part estimated to be quiet intentionally lowered. The user can thereby recognize the misrecognized part and its cause.

Patent Document 2 discloses that, when a voice recognition result is presumed to contain a misrecognition, the cause of the misrecognition is analyzed and the user is notified by voice of how to remedy it. This makes the user aware of how to improve the utterance when repeating it. For example, if the utterance speed of the input voice was inappropriate, the voice dialogue device responds with "Please speak more slowly," so the user learns how to improve the voice input.

Patent Document 3 discloses a voice dialogue device that exchanges information with a user by voice, and that raises the volume of the music it outputs when the user's utterance volume is lower than a predetermined volume and lowers that volume when the user's utterance volume is higher than the predetermined volume.

Patent Document 1: Japanese Unexamined Patent Publication No. 2006-251061
Patent Document 2: Japanese Unexamined Patent Publication No. 2006-113439
Patent Document 3: Japanese Patent No. 4765394

However, further improvement is needed to make a user who has difficulty understanding the meaning of an ask-back response sentence, such as an infant, intuitively understand that the cause of a misrecognition lies in utterance speed.

An object of the present disclosure is to provide a technique that makes even a user who has difficulty understanding the meaning of an ask-back response sentence, such as an infant, intuitively understand that the cause of a misrecognition lies in utterance speed, and that thereby suppresses repeated asking back.

One aspect of the present disclosure is a voice dialogue control method performed by a voice dialogue control device, the method comprising: performing voice recognition processing on voice data converted from a first utterance by a user input to a voice input device; calculating a reliability of the voice recognition processing; determining whether the reliability is equal to or less than a predetermined threshold; when the reliability is equal to or less than the threshold, measuring, based on the voice data, a first utterance speed that is the speed of the user's utterance; determining, based on the first utterance speed, a second utterance speed that is the speed at which a response sentence prompting the user to speak again is output by voice; and outputting instruction information for outputting the voice of the response sentence at the second utterance speed.
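The control flow of this aspect can be sketched in Python as follows. This is only an illustrative sketch, not the implementation described in the publication: the names `Instruction` and `control_dialogue`, the default threshold of 0.5, and the response wording are all assumptions, and the speed measurement and the rule for choosing the second utterance speed are passed in as functions because this aspect does not fix them.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Instruction:
    """Instruction information: what to say and how fast to say it."""
    response_text: str
    second_rate: float


def control_dialogue(
    reliability: float,
    measure_first_rate: Callable[[], float],
    decide_second_rate: Callable[[float], float],
    threshold: float = 0.5,                         # assumed value
    response_text: str = "Please say that again.",  # assumed wording
) -> Optional[Instruction]:
    """Return an ask-back instruction, or None when recognition is trusted."""
    if reliability > threshold:
        return None  # reliability above the threshold: no ask-back needed
    # Reliability is equal to or less than the threshold: measure the user's
    # utterance speed only now, then derive the response speed from it.
    first_rate = measure_first_rate()
    return Instruction(response_text, decide_second_rate(first_rate))
```

Passing the measurement as a callable keeps the speed measurement conditional on low reliability, which mirrors the order of the steps in the aspect above.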

According to the present disclosure, even a user who has difficulty understanding the meaning of an ask-back response sentence, such as an infant, can be made to intuitively understand that the cause of a misrecognition lies in utterance speed, and repeated asking back can be suppressed.

FIG. 1 is a diagram showing an overall picture of the voice dialogue system according to Embodiment 1 of the present disclosure.
FIG. 2 is a diagram showing an example of the configuration of the voice dialogue system according to Embodiment 1 of the present disclosure.
FIG. 3 is a diagram showing an example of a scene in which the user is prompted to remedy a voice recognition error.
FIG. 4 is a diagram showing an example of voice recognition candidates and their reliabilities.
FIG. 5 is a diagram showing an example of utterance time and mora count.
FIG. 6 is a diagram showing an example of the data structure of the first database.
FIG. 7 is a flowchart showing an example of the processing of the voice dialogue system according to Embodiment 1 of the present disclosure.
FIG. 8 is a block diagram showing an example of the configuration of a voice dialogue system according to a modification of Embodiment 1 of the present disclosure.
FIG. 9 is a diagram showing an example of the data structure of the second database.
FIG. 10 is a flowchart showing an example of the processing of the voice dialogue system according to Embodiment 1 of the present disclosure.
FIG. 11 is a diagram showing an example of a scene in which the voice dialogue system prompted the user to improve the utterance speed of the first utterance, that is, the user's initial utterance, but the speed did not improve.
FIG. 12 is a diagram showing an example of the configuration of the voice dialogue system according to Embodiment 2 of the present disclosure.
FIG. 13 is a diagram showing an example of the data structure of the third database.
FIG. 14 is a diagram showing an example of a scene in which the final response sentence is determined in Embodiment 2.
FIG. 15 is a flowchart showing an example of the processing of the voice dialogue system according to Embodiment 2 of the present disclosure.

(Findings underlying the present invention)
Techniques are being studied for voice dialogue devices that recognize a user's utterance and return a natural response sentence based on the voice recognition result, in order to achieve natural dialogue with the user or to provide the user with services such as device control and information provision. On receiving a user's input utterance, a voice dialogue device performs voice recognition. In doing so, it determines a plurality of candidates for the voice recognition result, probabilistically calculates for each candidate a reliability representing how likely that candidate is to be correct, and selects the candidate with the highest reliability as the voice recognition result.

Here, if the reliability of every candidate is equal to or less than a certain threshold, the voice recognition result is judged to contain a misrecognition, and an ask-back response sentence such as "Please speak again" is output to the user. The user then repeats the utterance, which makes correct voice recognition possible. On the other hand, an ask-back response sentence alone does not tell the user how to improve the utterance. Consequently, if the user repeats a similar utterance, the reliability of the voice recognition result does not improve and the ask-back response sentence is repeated.
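The candidate selection and the ask-back decision described above can be illustrated with a short Python sketch. The function name `select_recognition` and the representation of candidates as a mapping from text to reliability are assumptions made for this example only.

```python
from typing import Mapping, Optional


def select_recognition(
    candidates: Mapping[str, float], threshold: float
) -> Optional[str]:
    """Return the most reliable candidate, or None if an ask-back is needed."""
    if not candidates:
        return None
    best_text = max(candidates, key=candidates.__getitem__)
    # If even the best candidate is at or below the threshold, every
    # candidate is, so the result is treated as a misrecognition.
    if candidates[best_text] <= threshold:
        return None
    return best_text
```

A return value of `None` corresponds to the case in which the device outputs an ask-back response sentence instead of acting on a recognition result.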

To address this problem, Patent Document 1 discloses, as described above, a technique in which, when a misrecognition occurs because part of the input voice was too quiet, the voice of the voice recognition result is output as a response sentence with the volume of the misrecognized part intentionally lowered.

However, because Patent Document 1 merely adjusts the volume, it cannot make the user intuitively understand that the user's utterance speed is the cause of a misrecognition when that is the case.

Patent Document 2 discloses, as described above, a technique that outputs the voice of a response sentence such as "Please speak more slowly" when the utterance speed of the input voice is inappropriate.

However, in Patent Document 2 the cause of the misrecognition is conveyed by the response sentence itself, so a user whose language comprehension is still developing, such as an infant, may not understand the meaning of "more" and may fail to bring the utterance speed to an appropriate speed. The approach of Patent Document 2, which conveys the cause of a misrecognition through the meaning of the response sentence, is therefore insufficient for bringing such a user's utterance speed to an appropriate speed.

Patent Document 3 discloses, as described above, a voice dialogue device that raises the volume of the music it outputs when the user's utterance volume is lower than a predetermined volume and lowers that volume when the user's utterance volume is higher than the predetermined volume. This exploits the Lombard effect, in which a person changes utterance volume to match the ambient volume.

However, because Patent Document 3 merely adjusts the volume, it cannot make the user intuitively understand that the user's utterance speed is the cause of a misrecognition when that is the case.

The present inventors therefore found that, when the cause of a misrecognition is the user's utterance speed, adjusting the utterance speed of the ask-back voice makes even a user who has difficulty understanding the meaning of a response sentence, such as an infant, intuitively understand that utterance speed is the cause. For example, if the user's utterance speed is slower than the appropriate speed, the ask-back voice is made faster, and if the user's utterance speed is faster than the appropriate speed, the ask-back voice is made slower. Based on this finding, the inventors arrived at the aspects described below.

According to this configuration, when the reliability of the voice recognition processing is equal to or less than the threshold, the second utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice, is determined based on the first utterance speed, which is the user's utterance speed, and the voice of the response sentence can be output at the determined second utterance speed. In this aspect, the voice of the response sentence prompting another utterance, that is, the ask-back response sentence, is thus output at a speed set according to the user's utterance speed. Therefore, even a user who has difficulty understanding the meaning of an ask-back response sentence, such as an infant, can be made to intuitively understand that the cause of the misrecognition lies in utterance speed, and repeated asking back can be suppressed.

In the above aspect, when the first utterance speed is less than a first threshold, the second utterance speed may be made higher than a predetermined third utterance speed.

According to this configuration, when the first utterance speed is less than the first threshold, the second utterance speed is made higher than the predetermined third utterance speed, so the user can more intuitively understand that the misrecognition was caused by speaking too slowly.

In the above aspect, when the first utterance speed is equal to or greater than a second threshold, the second utterance speed may be made lower than the predetermined third utterance speed.

According to this configuration, when the first utterance speed is equal to or greater than the second threshold, the second utterance speed is made lower than the predetermined third utterance speed, so the user can more intuitively understand that the misrecognition was caused by speaking too quickly.
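The two threshold rules above can be combined into one small function. The following Python sketch is illustrative only: the function name, the unit (morae per second), the threshold values, the third utterance speed of 7.0, and the fixed offset of 2.0 are all assumed values chosen for the example and do not come from the publication.

```python
def second_rate_by_threshold(
    first_rate: float,
    first_threshold: float = 5.0,   # assumed: below this the user is too slow
    second_threshold: float = 9.0,  # assumed: at or above this, too fast
    third_rate: float = 7.0,        # assumed predetermined (standard) speed
    offset: float = 2.0,            # assumed size of the adjustment
) -> float:
    """Pick the ask-back speed (second utterance speed) from the user's speed."""
    if first_rate < first_threshold:
        return third_rate + offset  # user spoke slowly: answer faster
    if first_rate >= second_threshold:
        return third_rate - offset  # user spoke quickly: answer slower
    return third_rate               # speed is acceptable: standard speed
```

With these assumed values, a first utterance speed of 4.0 yields 9.0, a speed of 9.0 yields 5.0, and anything from 5.0 up to but not including 9.0 yields the third utterance speed of 7.0.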

In the above aspect, the method may further include: performing voice recognition processing on voice data converted from a second utterance made by the user in response to the response sentence output at the second utterance speed; calculating a reliability of that voice recognition processing; determining whether the reliability is equal to or less than the threshold; when the reliability is equal to or less than the threshold, measuring, based on the voice data, a fourth utterance speed that is the speed of the second utterance; determining, based on the fourth utterance speed, a fifth utterance speed that is the speed at which a response sentence prompting the user to speak again is output by voice; and outputting instruction information for outputting the voice of the response sentence at the fifth utterance speed.

According to this configuration, when the reliability of the user's second utterance, made in response to the ask-back, is equal to or less than the threshold, the fifth utterance speed is determined based on the fourth utterance speed, which is the speed of the second utterance, and the voice of a response sentence prompting another utterance is output at the fifth utterance speed. Thus, if the first ask-back does not improve the utterance speed, asking back again makes the user understand more reliably that the cause of the misrecognition is utterance speed.

In the above aspect, when the reliability is less than the threshold, a first utterance volume that is the volume of the first utterance may further be measured; at least one of the second utterance speed, which is the speed at which the response sentence prompting the user to speak again is output by voice, and a second utterance volume, which is the volume at which the response sentence is output by voice, may be determined based on at least one of the first utterance speed and the first utterance volume; and instruction information for outputting the response sentence at at least one of the second utterance speed and the second utterance volume may be output.

According to this configuration, when the reliability of the voice recognition is less than the threshold, at least one of the second utterance speed and the second utterance volume is determined based on at least one of the first utterance speed and the first utterance volume of the first utterance, and the voice of a response sentence prompting another utterance is output at at least one of the second utterance speed and the second utterance volume. Therefore, when the cause of the misrecognition is either utterance speed or utterance volume, the voice of a response sentence adjusted so that the one at fault improves can be output.

In the above aspect, the second utterance speed may be set increasingly higher than the third utterance speed as the first utterance speed falls further below the first threshold.

According to this configuration, the further the first utterance speed falls below the first threshold, the higher the second utterance speed is made relative to the third utterance speed, so the user can understand even more intuitively that the misrecognition was caused by speaking too slowly.

In the above aspect, the second utterance speed may be set increasingly lower than the third utterance speed as the first utterance speed rises further above the second threshold.

According to this configuration, the further the first utterance speed rises above the second threshold, the lower the second utterance speed is made relative to the third utterance speed, so the user can understand even more intuitively that the misrecognition was caused by speaking too quickly.
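One way to realize this graded behavior is to scale the adjustment by the distance from the relevant threshold. The sketch below assumes a linear relationship with a gain of 0.5 and clamps the result to a range of 4.0 to 10.0. The text only requires that the adjustment grow with the deviation, so the linear form, the clamp, and every numeric value here are assumptions.

```python
def graded_second_rate(
    first_rate: float,
    first_threshold: float = 5.0,   # assumed
    second_threshold: float = 9.0,  # assumed
    third_rate: float = 7.0,        # assumed predetermined (standard) speed
    gain: float = 0.5,              # assumed slope of the adjustment
    min_rate: float = 4.0,          # assumed lower limit of the output speed
    max_rate: float = 10.0,         # assumed upper limit of the output speed
) -> float:
    """Second utterance speed that deviates more as the user's speed does."""
    if first_rate < first_threshold:
        # The slower the user, the faster the ask-back.
        rate = third_rate + gain * (first_threshold - first_rate)
    elif first_rate > second_threshold:
        # The faster the user, the slower the ask-back.
        rate = third_rate - gain * (first_rate - second_threshold)
    else:
        rate = third_rate
    return max(min_rate, min(max_rate, rate))
```

The clamp is a practical guard so that an extreme measurement cannot push the synthesized voice to an unintelligible speed; it is not something the publication specifies.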

In the above aspect, the first utterance speed may be a value obtained by dividing the number of morae in the first utterance by the utterance time.

According to this configuration, the first utterance speed can be measured accurately.
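As a worked illustration of this definition, the Python sketch below counts morae in a kana string and divides by the utterance time. The counting rule used here is the usual one for Japanese (each kana is one mora, except that small ya, yu, yo and small vowels merge with the preceding kana). The publication does not say how the mora count is obtained, so the function names and the assumption that a kana-only reading of the recognized utterance is available are hypothetical.

```python
# Small kana that combine with the preceding kana into a single mora.
# The small tsu, the long-vowel mark, and the moraic n each count as one
# mora, so they are deliberately not listed here.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a kana-only string (no kanji, spaces, or punctuation)."""
    return sum(1 for ch in kana if ch not in SMALL_KANA)


def utterance_rate(kana: str, duration_s: float) -> float:
    """Utterance speed in morae per second: mora count / utterance time."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    return count_morae(kana) / duration_s
```

For example, the reading "きょうのてんきは" has 7 morae, so an utterance time of 2.0 seconds gives a speed of 3.5 morae per second.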

In the above aspect, in determining the second utterance speed, a first database in which a response sentence and a second utterance speed are registered for each of a plurality of first utterance speeds may be referenced to determine the response sentence and the second utterance speed corresponding to the measured first utterance speed. In the first database, the second utterance speed may be set increasingly higher than a predetermined third utterance speed as the first utterance speed falls further below a first threshold, and increasingly lower than the third utterance speed as the first utterance speed rises further above a second threshold (> first threshold).

According to this configuration, the response sentence and the second utterance speed are determined by referencing the first database, in which a response sentence and a second utterance speed are registered for each of a plurality of first utterance speeds, so an appropriate response sentence and second utterance speed can be determined accurately and quickly.
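A table lookup of this kind might look like the following Python sketch. The rows of `FIRST_DATABASE`, including the speed ranges, the response wording, and the second utterance speeds, are invented for illustration and are not the contents of the first database in the publication; only the overall shape (faster responses for slow speakers, slower responses for fast speakers) follows the text.

```python
from bisect import bisect_right
from typing import List, Tuple

# Each row: (lower bound of the first utterance speed range,
#            response sentence, second utterance speed).
# All values are hypothetical; speeds are in morae per second.
FIRST_DATABASE: List[Tuple[float, str, float]] = [
    (0.0, "Please say that again.", 11.0),   # far below the first threshold
    (3.0, "Please say that again.", 9.0),    # below the first threshold (5.0)
    (5.0, "Please say that again.", 7.0),    # acceptable: third speed
    (9.0, "Please say that again.", 5.0),    # at or above second threshold
    (11.0, "Please say that again.", 3.0),   # far above the second threshold
]
_LOWER_BOUNDS = [row[0] for row in FIRST_DATABASE]


def look_up_response(first_rate: float) -> Tuple[str, float]:
    """Return (response sentence, second utterance speed) for a measured speed."""
    if first_rate < 0:
        raise ValueError("first_rate must be non-negative")
    # Find the last row whose lower bound does not exceed the measured speed.
    index = bisect_right(_LOWER_BOUNDS, first_rate) - 1
    _, sentence, second_rate = FIRST_DATABASE[index]
    return sentence, second_rate
```

Keeping the mapping in a table rather than in code means the response sentences and speeds can be tuned without changing the lookup logic.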

In the above aspect, in determining the fifth utterance speed, a second database in which a response sentence and a fifth utterance speed are registered for each of a plurality of fourth utterance speeds may be referenced to determine the response sentence and the fifth utterance speed corresponding to the measured fourth utterance speed. In the second database, the response sentence may include words emphasizing an increase in utterance speed when the fourth utterance speed is lower than a first threshold, and words emphasizing a decrease in utterance speed when the fourth utterance speed is higher than a second threshold (> first threshold). Also in the second database, the fifth utterance speed may be set higher than a predetermined third utterance speed when the fourth utterance speed is lower than the first threshold, and lower than the third utterance speed when the fourth utterance speed is higher than the second threshold.

According to this configuration, the response sentence and the fifth utterance speed are determined by referencing the second database, in which a response sentence and a fifth utterance speed are registered for each of a plurality of fourth utterance speeds, so an appropriate response sentence and fifth utterance speed can be determined accurately and quickly.

In the above aspect, in determining the second utterance volume, a third database in which a response sentence and a second utterance volume are registered for each of a plurality of first utterance volumes may be referenced to determine the response sentence and the second utterance volume corresponding to the measured first utterance volume. In the third database, the second utterance volume may be set increasingly higher than a predetermined third utterance volume as the first utterance volume falls further below a third threshold, and increasingly lower than the third utterance volume as the first utterance volume rises further above a fourth threshold.

According to this configuration, the response sentence and the second utterance volume are determined by referencing the third database, in which a response sentence and a second utterance volume are registered for each of a plurality of first utterance volumes, so an appropriate response sentence and second utterance volume can be determined accurately and quickly.

The present disclosure can also be realized as a program that causes a computer to execute each characteristic step included in such a voice dialogue control method, or as a voice dialogue control device that operates according to this program. Needless to say, such a computer program can be distributed via a computer-readable non-transitory recording medium such as a CD-ROM, or via a communication network such as the Internet.

Each of the embodiments described below shows a specific example of the present disclosure. The numerical values, shapes, components, steps, order of steps, and the like shown in the following embodiments are examples and are not intended to limit the present disclosure. Among the components in the following embodiments, those not recited in the independent claims representing the broadest concept are described as optional components. The contents of the embodiments may also be combined with one another.

(Embodiment 1)
FIG. 1 is a diagram showing an overall picture of the voice dialogue system according to Embodiment 1 of the present disclosure. The voice dialogue system shown in FIG. 1 includes a voice input/output device 100 and a voice dialogue control device 200. The voice input/output device 100 and the voice dialogue control device 200 are connected via a network 300 so that they can communicate with each other. The voice input/output device 100 converts the user's utterance into voice data and transmits it as input voice data 20 to the voice dialogue control device 200 via the network 300. The voice input/output device 100 also receives response voice data 30 from the voice dialogue control device 200 via the network 300 and outputs a response voice 40.

For example, the voice input/output device 100 (voice input device) transmits the input voice data 20 of a user utterance 10, "What's the weather today?", to the voice dialogue control device 200 via the network 300. The voice input/output device 100 then receives, via the network 300, the response voice data 30 created by the voice dialogue control device 200 and outputs the response voice 40, "It's sunny."

The voice dialogue control device 200 generates the response voice data 30 as a response to the received input voice data 20. In the example of FIG. 1, for the input voice data 20 received from the voice input/output device 100 via the network 300, the voice dialogue control device 200 creates response voice data 30 representing the response voice 40 "It's sunny" and transmits it to the voice input/output device 100 via the network 300.

The network 300 connects the voice input/output device 100 and the voice dialogue control device 200 and carries the input voice data 20 and the response voice data 30. The network 300 may be realized by any network such as optical fiber, wireless, or a public telephone line. It may be built as an environment closed within a home local network, or it may be built over the Internet or the like.

FIG. 2 is a diagram showing an example of the configuration of the voice dialogue system according to the first embodiment of the present disclosure. The voice dialogue system shown in FIG. 2 includes the voice input/output device 100 and the voice dialogue control device 200. The voice input/output device 100 includes a communication unit 101, a microphone 102, and a speaker 103. The voice input/output device 100 is a device having at least a sound collecting function, a voice output function, and a communication function, for example a smart speaker, a smartphone, or a tablet terminal. The voice input/output device 100 may also be mounted on a robot.

The communication unit 101 communicates with the communication unit 201 of the voice dialogue control device 200 via the network 300 shown in FIG. 1. The communication unit 101 transmits the input voice data acquired by the microphone 102 and receives the response voice data from the voice dialogue control device 200. The communication unit 101 is, for example, a communication interface.

The microphone 102 collects the user's utterance and acquires the input voice data. The microphone 102 is, for example, one built into a robot, a smartphone, or a smart speaker. This is only an example, however, and the microphone 102 may be any sound collecting device, such as a handheld microphone or a pin microphone, provided outside the main body of the voice input/output device 100.

The speaker 103 reproduces the response voice to the user's utterance. The speaker 103 is, for example, one built into a robot, a smartphone, or a smart speaker. This is only an example, however, and the speaker 103 may be any audio reproduction device provided outside the main body of the voice input/output device 100.

The voice dialogue control device 200 is, for example, a server device including a communication unit 201, a processor 210, and a memory 205. The processor 210 is an electric circuit such as a CPU or an ASIC. The processor 210 includes a voice recognition unit 202, an intention understanding unit 203, an utterance speed measurement unit 204, a response sentence generation unit 206, and a voice synthesis unit 207. The memory 205 is a rewritable non-volatile storage device such as an EEPROM. The memory 205 stores a first database T1.

All or some of the elements constituting the voice dialogue control device 200 may be implemented in the voice input/output device 100, and some of the elements constituting the voice dialogue control device 200 (for example, the voice recognition unit 202) may be implemented on another server. In the latter case, the voice dialogue control device 200 is composed of a plurality of servers.

The communication unit 201 communicates with the communication unit 101 of the voice input/output device 100 via the network 300 shown in FIG. 1. It receives the input voice data from the voice input/output device 100 and transmits the response voice data created by the voice synthesis unit 207.

The voice recognition unit 202 performs voice recognition processing on the input voice data received from the communication unit 201, thereby converting the input voice data into text data, which is the voice recognition result. Open source software such as Julius can be used for the voice recognition processing.

At this time, the voice recognition unit 202 analyzes the input voice data and acquires a plurality of voice recognition candidates, each consisting of a combination of one or more words, together with a reliability for each candidate. The voice recognition unit 202 then adopts the candidate with the highest reliability as the voice recognition result. The reliability is a probability indicating the certainty of the voice recognition.

For example, suppose that the user utters "今日はいい天気だね" ("It's nice weather today") as shown in FIG. 3, and that the voice recognition unit 202 acquires the voice recognition candidates and reliabilities shown in FIG. 4. In the example of FIG. 4, three voice recognition candidates are acquired, in descending order of reliability: "今日はいい天気だね" ("It's nice weather today"), "今日はいい敵だね" ("Today is a good enemy"), and "こうはいい天気でね" (a similar-sounding phrase with no coherent meaning). In this case, the voice recognition unit 202 determines the candidate "今日はいい天気だね", which has the highest reliability of 0.8, as the voice recognition result.

On the other hand, when the reliability of every voice recognition candidate is less than a certain threshold, the voice recognition result is likely to contain a misrecognition, so the voice recognition unit 202 does not output the voice recognition result and determines that a misrecognition has occurred. For example, if the threshold is set to 0.9, the reliability of every voice recognition candidate shown in FIG. 4 is less than the threshold, so the voice recognition unit 202 determines that a misrecognition has occurred. When the voice recognition unit 202 determines that the reliability is less than the threshold, it inputs the voice recognition result to the utterance speed measurement unit 204. When it determines that the reliability is equal to or higher than the threshold, it inputs the voice recognition result to the intention understanding unit 203.
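The selection and routing just described can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `pick_result` and `route` are hypothetical, and only the top reliability of 0.8 and the threshold of 0.9 come from the description. The reliabilities of the other two candidates are placeholder values, since the contents of FIG. 4 are not reproduced in the text.

```python
from typing import List, Tuple

# (recognized text, reliability in [0, 1])
Candidate = Tuple[str, float]


def pick_result(candidates: List[Candidate]) -> Candidate:
    """Adopt the candidate with the highest reliability."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: c[1])


def route(candidates: List[Candidate], threshold: float) -> Tuple[str, str]:
    """Return the name of the next processing unit and the adopted text.

    Reliability >= threshold goes to intention understanding (unit 203);
    otherwise the result goes to utterance speed measurement (unit 204).
    """
    text, reliability = pick_result(candidates)
    if reliability >= threshold:
        return ("intention_understanding", text)
    return ("utterance_speed_measurement", text)


# 0.8 is taken from the description; 0.6 and 0.4 are assumed placeholders.
candidates = [
    ("今日はいい天気だね", 0.8),
    ("今日はいい敵だね", 0.6),
    ("こうはいい天気でね", 0.4),
]
next_unit, adopted = route(candidates, threshold=0.9)
```

With the threshold of 0.9 from the description, every candidate falls below it, so the adopted text is handed to the utterance speed measurement path.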

Further, in the input voice data received from the communication unit 201, the voice recognition unit 202 extracts the section from the start of the utterance to the end of the utterance as the utterance section, and calculates the user's utterance time by multiplying the number of sample points in the extracted utterance section by a preset sampling period. The voice recognition unit 202 includes the calculated utterance time in the voice recognition result and inputs it to the utterance speed measurement unit 204.

Returning to FIG. 2, the intention understanding unit 203 performs intention understanding processing on the voice recognition result input from the voice recognition unit 202 and analyzes the intention of the user's utterance. The intention understanding processing can use, for example, a rule-based method that interprets the utterance content according to predetermined rules, or a method that selects a single intention from a plurality of assumed intentions using a machine learning technique such as deep learning or AdaBoost. The intention understanding unit 203 inputs the intention understanding result to the response sentence generation unit 206. The intention understanding result includes, for example, genre information that roughly classifies the utterance content into genres such as weather forecast, route search, and meals, and intention information indicating the understood intention.

When the reliability is less than the threshold, the utterance speed measurement unit 204 measures the speed of the user's utterance using the text data and the utterance time included in the voice recognition result input from the voice recognition unit 202. The utterance speed is defined, for example, as the number of mora per unit time, as shown in Equation 1. A mora is a unit of length corresponding to one kana character in Japanese, except that a contracted sound (yoon, such as "きょ") counts as a single unit. The utterance speed measurement unit 204 therefore determines the number of mora from the text data included in the voice recognition result and divides it by the utterance time included in the voice recognition result to obtain the utterance speed.

An example of a method of measuring the speed of the user's utterance will be described with reference to FIG. 5. When the user utters "今日はいい天気だね" ("It's nice weather today"), the utterance is divided into 10 units, "きょ/う/は/い/い/て/ん/き/だ/ね", as illustrated in the upper part of FIG. 5. The utterance therefore has 10 mora. If the utterance time is 2 seconds, the user's utterance speed is 5 mora/sec. As illustrated in the lower part of FIG. 5, if "今日はいい天気だね" is misrecognized as "きょうはいい敵だね" ("Today is a good enemy"), the utterance is divided into "きょ/う/は/い/い/て/き/だ/ね" and has 9 mora, so the utterance speed for the input utterance is 4.5 mora/sec. Even with a misrecognition, the number of mora per unit time is close to the value obtained with correct recognition. The same measurement method can therefore be applied when a misrecognition occurs as when the utterance is recognized correctly. This measurement method is only an example, and other methods of measuring the utterance speed may be adopted.
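The calculation above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the recognition result is assumed to be available as a kana reading (converting kanji to kana is outside the sketch), the function names are hypothetical, and the 16 kHz sampling rate in the usage lines is an assumed value that does not appear in the description. Small kana that form a contracted sound are merged into the preceding kana, which is how "きょ" counts as one mora.

```python
# Small kana that combine with the preceding kana into a single mora (yoon etc.).
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_mora(kana: str) -> int:
    """Count mora in a kana string; small combining kana are not counted."""
    return sum(1 for ch in kana if ch not in SMALL_KANA and not ch.isspace())


def utterance_seconds(num_samples: int, sampling_period_s: float) -> float:
    """Utterance time = number of sample points x sampling period."""
    return num_samples * sampling_period_s


def utterance_speed(kana: str, seconds: float) -> float:
    """Utterance speed in mora per second."""
    if seconds <= 0:
        raise ValueError("utterance time must be positive")
    return count_mora(kana) / seconds


# Worked example from FIG. 5: a 2-second utterance.
duration = utterance_seconds(32000, 1 / 16000)  # assumed 16 kHz sampling
correct = utterance_speed("きょうはいいてんきだね", 2.0)
misrecognized = utterance_speed("きょうはいいてきだね", 2.0)
```

The two readings give 10 and 9 mora respectively, which reproduces the 5 mora/sec and 4.5 mora/sec figures in the text.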

Returning to FIG. 2, the first database T1 stores response sentences that prompt the user to improve the utterance speed, according to the utterance speed measured by the utterance speed measurement unit 204. FIG. 6 is a diagram showing an example of the data structure of the first database T1. Specifically, the first database T1 stores an input utterance speed, a response sentence, and an output utterance speed in association with one another. The input utterance speed indicates the speed condition to which the utterance speed measured by the utterance speed measurement unit 204 belongs. The response sentence is the response sentence corresponding to the measured utterance speed. The output utterance speed is the utterance speed at which the voice of the response sentence is output.

For example, the record in the first row indicates that, when the utterance speed measured at the time of a misrecognition is 6 mora/sec or more, "Please speak more slowly" is determined as the response sentence and this response sentence is output at an utterance speed of 2.5 mora/sec.

In the example of FIG. 6, the first database T1 includes five records corresponding to five speed conditions. In the record in the third row from the top, where the input utterance speed is "4 mora/sec or more and less than 5 mora/sec", the utterance speed is at the appropriate speed, so the response sentence is "Please speak again" and does not prompt the user to improve the utterance speed. Also, because the utterance speed in the third row is at the appropriate speed and does not need to be improved, the output utterance speed is set to 4.5 mora/sec, the midpoint of the input utterance speed range.

Here, the appropriate speed is an utterance speed suitable for the voice recognition processing performed by the voice recognition unit 202 and is set in advance according to the capability of the voice recognition unit 202. For example, if the voice recognition unit 202 performs voice recognition processing using a learning model that was trained largely on voice data spoken at 4.5 mora/sec, an utterance speed of about 4.5 mora/sec is the appropriate speed.

On the other hand, the response sentences in the second and first rows, where the input utterance speed is faster than the appropriate speed, prompt the user to slow down: "Please speak slowly" and "Please speak more slowly". The output utterance speed in the second row is set to 3.5 mora/sec, which is slower than the appropriate speed, and the output utterance speed in the first row is set to 2.5 mora/sec, which is slower still. Thus, in the example of FIG. 6, the faster the input utterance speed is relative to the appropriate speed, the slower the output utterance speed is set relative to the appropriate speed. As a result, when a misrecognition occurs because the utterance speed is faster than the appropriate speed, the user can grasp this more intuitively.

Further, the response sentence in the first row, where the input utterance speed is even faster than in the second row, adds the emphasizing word "more" to the response sentence of the second row. In other words, the faster the input utterance speed is relative to the appropriate speed, the more strongly the response sentence emphasizes the degree to which the utterance speed should be reduced. This lets the user know how much to slow down.

On the other hand, the response sentences in the fourth and fifth rows, where the input utterance speed is slower than the appropriate speed, prompt the user to speed up: "Please speak quickly" and "Please speak more quickly". The output utterance speed in the fourth row is set to 5.5 mora/sec, which is faster than the appropriate speed, and the output utterance speed in the fifth row is set to 6.5 mora/sec, which is faster still. Thus, in the example of FIG. 6, the slower the input utterance speed is relative to the appropriate speed, the faster the output utterance speed is set relative to the appropriate speed. As a result, when a misrecognition occurs because the utterance speed is slower than the appropriate speed, the user can grasp this more intuitively.

Further, the response sentence in the fifth row, where the input utterance speed is even slower than in the fourth row, adds the emphasizing word "more" to the response sentence of the fourth row. In other words, the slower the input utterance speed is relative to the appropriate speed, the more strongly the response sentence emphasizes the degree to which the utterance speed should be increased. This lets the user know how much to speed up.
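The five records of FIG. 6 described above can be represented as a small lookup table. The sketch below is illustrative only: the `Row` type and `lookup` function are hypothetical names, the response sentences are English renderings of the Japanese originals, and the range of the fifth row (below 3 mora/sec) is inferred from the lower bound of the fourth row rather than stated outright in the text.

```python
import math
from typing import NamedTuple


class Row(NamedTuple):
    lower: float        # inclusive lower bound of input speed (mora/sec)
    upper: float        # exclusive upper bound of input speed (mora/sec)
    response: str       # response sentence
    output_rate: float  # output utterance speed (mora/sec)


# Rows are listed in the same order as in FIG. 6 (fastest input first).
T1 = (
    Row(6.0, math.inf, "Please speak more slowly", 2.5),
    Row(5.0, 6.0, "Please speak slowly", 3.5),
    Row(4.0, 5.0, "Please speak again", 4.5),
    Row(3.0, 4.0, "Please speak quickly", 5.5),
    Row(0.0, 3.0, "Please speak more quickly", 6.5),  # range inferred
)


def lookup(measured_speed: float) -> Row:
    """Return the record whose speed condition contains the measured speed."""
    for row in T1:
        if row.lower <= measured_speed < row.upper:
            return row
    raise ValueError(f"no record for speed {measured_speed!r}")


row = lookup(3.0)
```

Keeping the response sentence and the output speed in the same record means one lookup serves both the response sentence generation unit 206 and the voice synthesis unit 207.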

Returning to FIG. 2, when the voice recognition unit 202 determines that the reliability is equal to or higher than the threshold, the response sentence generation unit 206 generates an appropriate response sentence for the intention understanding result input from the intention understanding unit 203. For example, from among response sentence tables (not shown) created in advance for each of a plurality of genres, the response sentence generation unit 206 acquires from the memory 205 the response sentence table corresponding to the genre information included in the intention understanding result, generates an appropriate response sentence for the intention understanding result based on the acquired table, the intention information, and the like, and outputs it to the voice synthesis unit 207.

On the other hand, when the voice recognition unit 202 determines that the reliability is less than the threshold, the response sentence generation unit 206 generates a response sentence by acquiring from the first database T1 the response sentence corresponding to the utterance speed input from the utterance speed measurement unit 204, and outputs it to the voice synthesis unit 207.

The voice synthesis unit 207 determines the output utterance speed by acquiring from the first database T1 the output utterance speed corresponding to the utterance speed measured by the utterance speed measurement unit 204. The voice synthesis unit 207 then performs voice synthesis processing on the response sentence input from the response sentence generation unit 206 so that it is output at the determined output utterance speed, and generates the response voice data. The generated response voice data is transmitted to the voice input/output device 100 via the communication unit 201 and is output from the speaker 103 as the response voice.

For example, when the utterance speed measured by the utterance speed measurement unit 204 is 3 mora/sec, it matches the input utterance speed condition "3 mora/sec or more and less than 4 mora/sec" shown in FIG. 6. In this case, the response sentence generation unit 206 generates "Please speak quickly", which corresponds to the matched input utterance speed, as the response sentence and inputs it to the voice synthesis unit 207. The voice synthesis unit 207, in turn, synthesizes the voice so that the response sentence is output at 5.5 mora/sec, the output utterance speed corresponding to the matched input utterance speed, and generates the response voice data. As a result, the response voice "Please speak quickly" is output from the speaker 103 as illustrated in FIG. 3.

In FIG. 2, the voice recognition unit 202 corresponds to an example of a voice recognition unit that performs voice recognition processing on voice data converted from a first utterance by the user input to a voice input device. The voice recognition unit 202 also corresponds to an example of a calculation unit that calculates the reliability of the voice recognition processing. Further, the voice recognition unit 202 corresponds to an example of a determination unit that determines whether or not the reliability is equal to or less than a predetermined threshold.

The utterance speed measurement unit 204 corresponds to an example of a measurement unit that, when the reliability is equal to or less than the threshold, measures a first utterance speed, which is the speed of the user's utterance, based on the voice data. The response sentence generation unit 206 and the voice synthesis unit 207 correspond to an example of a response sentence generation unit that determines, based on the first utterance speed, a second utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice, and that outputs instruction information for outputting the response sentence at the second utterance speed. The response voice data corresponds to an example of the instruction information. The utterance speed measured by the utterance speed measurement unit 204 corresponds to an example of the first utterance speed. The output utterance speed determined by the voice synthesis unit 207 according to the utterance speed corresponds to an example of the second utterance speed.

Further, in FIG. 6, in the input utterance speed "4 mora/sec or more and less than 5 mora/sec" of the third row, "4 mora/sec" is an example of a first threshold and "5 mora/sec" is an example of a second threshold. "4.5 mora/sec" in the third row is an example of a predetermined third utterance speed.

FIG. 7 is a flowchart showing an example of the processing of the voice dialogue system according to the first embodiment of the present disclosure. The processing by which the voice dialogue system outputs a response voice to the voice uttered by the user will be described with reference to FIG. 7.

First, the communication unit 101 of the voice input/output device 100 transmits the input voice data representing the user's utterance acquired by the microphone 102 to the communication unit 201 of the voice dialogue control device 200 (step S101). Next, the voice recognition unit 202 performs voice recognition processing on the input voice data received by the communication unit 201 (step S102).

Next, the voice recognition unit 202 determines whether or not the reliability of the voice recognition result is equal to or higher than the threshold (step S103). If the reliability is equal to or higher than the threshold (step S103: YES), the voice recognition unit 202 inputs the voice recognition result to the intention understanding unit 203. The intention understanding unit 203 then performs intention understanding processing on the input voice recognition result (step S104).

Next, the response sentence generation unit 206 generates a response sentence based on the intention understanding result input from the intention understanding unit 203 (step S105).

Next, the voice synthesis unit 207 performs voice synthesis processing on the response sentence input from the response sentence generation unit 206 and generates the response voice data (step S109). The response voice data is transmitted to the voice input/output device 100 via the communication unit 201 and is output from the speaker 103 (step S110).

On the other hand, if the reliability of the voice recognition result is less than the threshold (step S103: NO), the utterance speed measurement unit 204 measures the speed of the user's utterance using Equation 1 (step S106). Next, the response sentence generation unit 206 generates a response sentence by acquiring from the first database T1 the response sentence corresponding to the measured utterance speed (step S107).

Next, the voice synthesis unit 207 determines the output utterance speed by acquiring from the first database T1 the output utterance speed corresponding to the measured utterance speed (step S108). As shown in FIG. 6, when the measured utterance speed is, for example, 3 mora/sec, the response sentence is "Please speak quickly" and the output utterance speed is 5.5 mora/sec. When the measured utterance speed is, for example, 4.2 mora/sec, so that the cause of the misrecognition is not the utterance speed, the response sentence is "Please speak again" and the output utterance speed is 4.5 mora/sec.

Next, the voice synthesis unit 207 performs voice synthesis processing on the response sentence determined by the response sentence generation unit 206 so that the output utterance speed is satisfied, and generates the response voice data (step S109). The voice synthesis unit 207 then transmits the response voice data to the voice input/output device 100 via the communication unit 201 and causes the speaker 103 to output the response voice (step S110).

According to the voice dialogue system of the first embodiment described above, when a misrecognition occurs in voice recognition because of the utterance speed, the voice of the re-prompt response sentence is output at a speed set according to the user's utterance speed. Therefore, even a user who has difficulty understanding the meaning of a re-prompt response sentence, such as an infant, can intuitively understand that the cause of the misrecognition is the utterance speed, and repeated re-prompts can be suppressed.

(Modified Example of Embodiment 1)
In the modified example of the first embodiment, when the user's utterance following the first re-prompt shows no improvement, a second re-prompt is issued whose utterance content and output utterance speed are changed so as to prompt a greater improvement in the utterance speed than the first re-prompt did.

FIG. 8 is a block diagram showing an example of the configuration of the voice dialogue system according to the modified example of the first embodiment of the present disclosure. FIG. 8 differs from FIG. 2 in that the memory 205 stores a second database T2 instead of the first database T1. FIG. 9 is a diagram showing an example of the data structure of the second database T2.

The second database T2 shown in FIG. 9 has, in addition to the columns of the first database T1, a column "response sentence when not improved". In the second database T2, the input utterance speed, the response sentence, and the output utterance speed are the same as in the first database T1. The "response sentence when not improved" column stores the response sentence that is selected for a further re-prompt when the previous re-prompt did not improve the utterance speed.

Referring again to FIG. 8, suppose the user, having been asked back after the initial utterance (first utterance), speaks again (second utterance), and the voice recognition unit 202 determines that the reliability of the recognition result for the second utterance is below the threshold. If the response sentence generation unit 206 then determines that the ask-back produced no improvement, it refers to the "response sentence when not improved" column of the second database T2 and selects the response sentence corresponding to the speech speed measured by the speech speed measurement unit 204.

Here, the ask-back is regarded as having produced no improvement when, for example, the speech speeds of the first and second utterances are both slower than the appropriate speed, or both faster than the appropriate speed. Specifically, in the example of FIG. 9, this applies when the first utterance is at 5 mora/sec or more and the second utterance is also at 5 mora/sec or more, or when the first utterance is below 4 mora/sec and the second utterance is also below 4 mora/sec.
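The "no improvement" condition above can be sketched as a small predicate. This is a minimal illustration, not the disclosed implementation; the function and constant names are hypothetical, and only the 4 and 5 mora/sec band limits come from the FIG. 9 example.

```python
# Band limits taken from the FIG. 9 example; all names here are illustrative.
APPROPRIATE_MIN = 4.0  # mora/sec, lower bound of the appropriate speed (inclusive)
APPROPRIATE_MAX = 5.0  # mora/sec, upper bound of the appropriate speed (exclusive)


def speed_class(rate: float) -> str:
    """Classify a speech speed as 'slow', 'ok', or 'fast'."""
    if rate < APPROPRIATE_MIN:
        return "slow"
    if rate >= APPROPRIATE_MAX:
        return "fast"
    return "ok"


def not_improved(first_rate: float, second_rate: float) -> bool:
    """True when both utterances miss the appropriate band on the same side."""
    first, second = speed_class(first_rate), speed_class(second_rate)
    return first == second and first != "ok"
```

With the values used in the FIG. 11 scenario, `not_improved(3.0, 3.5)` holds (both too slow), while `not_improved(3.0, 5.2)` does not, because the second utterance overshoots to the fast side.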

In the example of FIG. 9, as shown in the records in the first and second rows, when the input speech speed is 5 mora/sec or more, the "response sentence when not improved" column stores "Please speak more slowly". The response sentence generation unit 206 therefore selects "Please speak more slowly" when the speech speed of the second utterance is 5 mora/sec or more. If the speech speed of the second utterance is 5 mora/sec or more and less than 6 mora/sec, the voice synthesis unit 207 generates response voice data that outputs "Please speak more slowly" at 3.5 mora/sec; if the speech speed of the second utterance is 6 mora/sec or more, it generates response voice data that outputs "Please speak more slowly" at 2.5 mora/sec.

On the other hand, as shown in the records in the fourth and fifth rows, when the input speech speed is less than 4 mora/sec, the "response sentence when not improved" column stores "Please speak more quickly". The response sentence generation unit 206 therefore selects "Please speak more quickly" when the speech speed of the second utterance is less than 4 mora/sec. If the speech speed of the second utterance is 3 mora/sec or more and less than 4 mora/sec, the voice synthesis unit 207 generates response voice data that outputs "Please speak more quickly" at 5.5 mora/sec; if the speech speed of the second utterance is less than 3 mora/sec, it generates response voice data that outputs "Please speak more quickly" at 6.5 mora/sec.

If the speech speed of the second utterance is 4 mora/sec or more and less than 5 mora/sec, the cause of the recognition error lies in something other than the speech speed, so the response sentence generation unit 206 selects the response sentence "Please speak again", which is output at 4.5 mora/sec.

In the example of FIG. 9, the "response sentence when not improved" column stores one response sentence for input speech speeds faster than the appropriate speed (4 mora/sec or more and less than 5 mora/sec) and one response sentence for input speech speeds slower than the appropriate speed. This is only an example, and a plurality of response sentences may be stored for each case. Likewise, the response sentence "Please speak more slowly" is associated with an output speech speed of 2.5 mora/sec or 3.5 mora/sec depending on the speech speed of the second utterance, but this is also only an example, and only one of the two output speech speeds may be associated with it.
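The lookup described for the "response sentence when not improved" column can be sketched as a range table. This is a hedged illustration that uses only the values stated in the text for FIG. 9; the record type, the table name `T2_RETRY`, and the function name are hypothetical, and the ordinary "response sentence" column is omitted for brevity. The third row reuses "Please speak again", following the description of the appropriate-speed case.

```python
import math
from typing import NamedTuple


class RetryRecord(NamedTuple):
    min_rate: float      # inclusive lower bound of input speech speed, mora/sec
    max_rate: float      # exclusive upper bound of input speech speed, mora/sec
    retry_response: str  # sentence used when the previous ask-back did not help
    output_rate: float   # speed at which the sentence is synthesized, mora/sec


# Rows follow the FIG. 9 example as described in the text.
T2_RETRY = [
    RetryRecord(6.0, math.inf, "Please speak more slowly", 2.5),
    RetryRecord(5.0, 6.0, "Please speak more slowly", 3.5),
    RetryRecord(4.0, 5.0, "Please speak again", 4.5),
    RetryRecord(3.0, 4.0, "Please speak more quickly", 5.5),
    RetryRecord(-math.inf, 3.0, "Please speak more quickly", 6.5),
]


def lookup_retry(rate: float) -> RetryRecord:
    """Return the record whose input range contains the measured speech speed."""
    for record in T2_RETRY:
        if record.min_rate <= rate < record.max_rate:
            return record
    raise ValueError(f"no record for speech speed {rate!r}")
```

For a second utterance measured at 3.5 mora/sec, `lookup_retry(3.5)` yields "Please speak more quickly" with an output speed of 5.5 mora/sec, matching the second ask-back in the FIG. 11 scenario.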

FIG. 10 is a flowchart showing an example of the processing of the voice dialogue system according to the modification of Embodiment 1 of the present disclosure. In FIG. 10, steps that are the same as in FIG. 7 carry the same reference signs, and their description is omitted. In step S2001, which follows step S106, the response sentence generation unit 206 determines, based on the speech speed of the second utterance measured in step S106 (an example of a fourth speech speed), whether the second utterance improved the speech speed. If the speech speed of the second utterance has not improved (step S2001: YES), the response sentence generation unit 206 obtains the response sentence corresponding to the speech speed of the second utterance from the "response sentence when not improved" column of the second database T2, thereby generating the response sentence for the not-improved case (step S2002). If the speech speed of the second utterance has improved, or if the utterance acquired in step S101 is a first utterance (step S2001: NO), the response sentence generation unit 206 generates the response sentence by obtaining, from the "response sentence" column of the second database T2, the response sentence corresponding to the speech speed of the second utterance or the first utterance (step S2003).

In step S2004, the voice synthesis unit 207 determines, from the second database T2, the output speech speed (an example of a fifth speech speed) corresponding to the speech speed measured in step S106. Consequently, in step S109, response voice data is generated for outputting the response sentence generated in step S2002 or step S2003 at the output speech speed determined in step S2004.

FIG. 11 is a diagram showing an example of a scene in which the voice dialogue system prompted the user to improve the speech speed after the user's initial utterance (first utterance), but the speech speed did not improve.

In the example of FIG. 11, the user makes the first utterance, "It's nice weather today, isn't it?", at a speech speed of 3.0 mora/sec (step S1001). Because the reliability of this first utterance is below the threshold and a recognition error is determined, the first response sentence "Please speak quickly", which corresponds to this speech speed, is selected from the second database T2 and output from the speaker 103 at 5.5 mora/sec as the first ask-back (step S1002).

In response to this first ask-back, the user repeats the first utterance as the second utterance, "It's nice weather today, isn't it?" (step S1003). However, the speech speed of the second utterance is 3.5 mora/sec, which is still slower than the appropriate speed, so the speech speed has not improved. The second database T2 is therefore referenced, and the response sentence "Please speak more quickly" in the "response sentence when not improved" column corresponding to 3.5 mora/sec is selected as the second response sentence. This second response sentence is output from the speaker 103 at 5.5 mora/sec as the second ask-back (step S1004).

In contrast, suppose the second utterance, "It's nice weather today, isn't it?", made by the user after the first ask-back has a speech speed of 5.2 mora/sec, which is faster than the appropriate speed. In this case, the second database T2 is referenced, the response sentence "Please speak slowly" in the "response sentence" column corresponding to 5.2 mora/sec is selected, and the second ask-back is performed at 3.5 mora/sec. That is, a second ask-back is performed because the second utterance is faster than the appropriate speed, but because the speech speed did change in the requested direction, the response sentence is selected from the "response sentence" column, as for a first utterance, rather than from the "response sentence when not improved" column.

According to the modification of the voice dialogue system of Embodiment 1 described above, when an ask-back does not improve the speech speed, a further ask-back makes the user understand more reliably that the speech speed is the cause of the recognition error. In addition, the ask-back response sentences used when the speech speed has not improved, "Please speak more slowly" and "Please speak more quickly", contain the emphasis word "more" regardless of the speech speed of the second utterance, which tells the user that the degree of improvement was insufficient. Furthermore, the ask-back response sentence is output from the speaker 103 at an output speech speed that depends on the speech speed of the second utterance, which conveys intuitively that the improvement in speech speed was insufficient. As a result, repeated recognition errors in the user's re-utterances can be prevented.

(Embodiment 2)
FIG. 12 is a diagram showing an example of the configuration of the voice dialogue system according to Embodiment 2 of the present disclosure. FIG. 12 differs from FIG. 2 in that an utterance volume measurement unit 1208 is added and the memory, response sentence generation unit, and voice synthesis unit are replaced by a memory 1205, a response sentence generation unit 1206, and a voice synthesis unit 1207. In Embodiment 2, when a recognition error occurs, the system determines whether the cause lies in the speech speed or the utterance volume and performs an ask-back according to the result.

When the reliability is below the threshold, the utterance volume measurement unit 1208 measures the utterance volume (an example of a first utterance volume) based on the voice recognition result input from the voice recognition unit 202. The utterance volume is the average volume over the utterance section extracted from the input voice data. The volume of the utterance section is a parameter that represents the loudness or intensity of the voice, such as voice power or sound pressure. Although the average volume over the utterance section is used here, the integrated value, maximum value, or minimum value of the volume over the utterance section may be used instead. In the present embodiment, the voice recognition unit 202 may include the input voice data in the voice recognition result that it passes to the utterance volume measurement unit 1208.
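One way to picture the averaged-volume measurement is the following minimal sketch, which takes the mean power of the samples in the utterance section and expresses it in decibels. The function name, the `ref` and `floor` parameters, and the use of normalized floating-point samples are assumptions for illustration; mapping the result onto the absolute dB values used in the tables would require a calibration that the text does not describe.

```python
import math
from typing import Sequence


def utterance_volume_db(samples: Sequence[float], ref: float = 1.0,
                        floor: float = 1e-12) -> float:
    """Mean power of the utterance section, in dB relative to `ref` (assumed)."""
    if not samples:
        raise ValueError("utterance section is empty")
    mean_power = sum(s * s for s in samples) / len(samples)
    # `floor` avoids log10(0) for an all-zero (silent) section.
    return 10.0 * math.log10(max(mean_power, floor) / (ref * ref))
```

Replacing the mean with a sum, `max`, or `min` over per-frame powers would give the integrated, maximum, or minimum variants mentioned above.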

The memory 1205 stores a third database T3 in addition to the first database T1. The third database T3 stores the information needed to determine a response sentence that prompts an appropriate correction of the utterance volume. FIG. 13 is a diagram showing an example of the data structure of the third database T3. The third database T3 stores an input volume, a response sentence, and an output volume in association with one another. The input volume indicates the volume condition to which the utterance volume measured by the utterance volume measurement unit 1208 belongs. The response sentence is the response sentence for the measured utterance volume. The output volume is the volume at which the response sentence for the measured utterance volume is output as speech. For example, the record in the first row indicates that when the utterance volume at the time of the recognition error is 50 dB or more, "Please speak in a quieter voice" is determined as the response sentence and is output at a volume of 15 dB.

In the example of FIG. 13, the third database T3 contains five records corresponding to five volume conditions. In the record in the third row, whose input volume is "30 dB or more and less than 40 dB", the utterance volume is at the appropriate volume, so the response sentence is "Please speak again" and does not prompt a change in utterance volume. Because the utterance volume needs no correction, the output volume in this record is set to 35 dB, the midpoint of the input volume range.

On the other hand, the response sentences in the second and first rows, where the input volume is louder than the appropriate volume, are "Please speak in a quiet voice" and "Please speak in a quieter voice", which prompt the user to lower the utterance volume. The output volume in the second row is set to 25 dB, which is lower than the appropriate volume, and the output volume in the first row is 15 dB, lower still. Thus, in the example of FIG. 13, the louder the input volume is relative to the appropriate volume, the lower the output volume is set relative to the appropriate volume. When a recognition error occurs because the utterance volume is louder than the appropriate volume, the user can therefore understand this intuitively.

The response sentence in the first row, where the input volume is louder than in the second row, adds the emphasis word "more" (rendered here as the comparative "quieter") to the response sentence in the second row. That is, the louder the input volume is relative to the appropriate volume, the more strongly the response sentence emphasizes how far the utterance volume should be lowered. This tells the user how much quieter than before they should speak.

On the other hand, the response sentences in the fourth and fifth rows, where the input volume is lower than the appropriate volume, are "Please speak in a loud voice" and "Please speak in a louder voice", which prompt the user to raise the utterance volume. The output volume in the fourth row is set to 45 dB, which is higher than the appropriate volume, and the output volume in the fifth row is set to 55 dB, higher still. Thus, in the example of FIG. 13, the lower the input volume is relative to the appropriate volume, the higher the output volume is set relative to the appropriate volume. When a recognition error occurs because the utterance volume is lower than the appropriate volume, the user can therefore understand this intuitively.

The response sentence in the fifth row, where the input volume is lower than in the fourth row, adds the emphasis word "more" (rendered here as the comparative "louder") to the response sentence in the fourth row. That is, the lower the input volume is relative to the appropriate volume, the more strongly the response sentence emphasizes how far the utterance volume should be raised. This tells the user how much louder than before they should speak.

In FIG. 13, the "30 dB" in the third-row input volume "30 dB or more and less than 40 dB" is an example of a third threshold, the "40 dB" is an example of a fourth threshold, and the third-row output volume "35 dB" is an example of a predetermined third utterance volume.
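The third database T3 can be sketched in the same range-table form. The values below follow the description of FIG. 13, with one labeled assumption: the text does not state the boundary between the fourth and fifth rows, so 20 dB is assumed here by symmetry with the upper rows. The type, table, and function names are hypothetical.

```python
import math
from typing import NamedTuple


class VolumeRecord(NamedTuple):
    min_db: float     # inclusive lower bound of the input volume, dB
    max_db: float     # exclusive upper bound of the input volume, dB
    response: str     # response sentence for this volume condition
    output_db: float  # volume at which the response is output, dB


T3 = [
    VolumeRecord(50.0, math.inf, "Please speak in a quieter voice", 15.0),
    VolumeRecord(40.0, 50.0, "Please speak in a quiet voice", 25.0),
    VolumeRecord(30.0, 40.0, "Please speak again", 35.0),
    # ASSUMPTION: the 20 dB boundary between rows 4 and 5 is not given in the text.
    VolumeRecord(20.0, 30.0, "Please speak in a loud voice", 45.0),
    VolumeRecord(-math.inf, 20.0, "Please speak in a louder voice", 55.0),
]


def lookup_volume(volume_db: float) -> VolumeRecord:
    """Return the record whose input range contains the measured volume."""
    for record in T3:
        if record.min_db <= volume_db < record.max_db:
            return record
    raise ValueError(f"no record for volume {volume_db!r}")
```

A measured volume of 45 dB falls in the second row (quiet voice, 25 dB output), and 25 dB falls in the fourth row (loud voice, 45 dB output), which are the two cases used in the FIG. 14 examples.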

Referring again to FIG. 12, the response sentence generation unit 1206, as in Embodiment 1, generates an appropriate response sentence for the intention understanding result input from the intention understanding unit 203. The response sentence generation unit 1206 also obtains from the first database T1 the response sentence corresponding to the speech speed measured by the speech speed measurement unit 204, obtains from the third database T3 the response sentence corresponding to the utterance volume measured by the utterance volume measurement unit 1208, generates a final response sentence from the obtained response sentences, and inputs it to the voice synthesis unit 1207.

The voice synthesis unit 1207 obtains from the first database T1 the output speech speed corresponding to the speech speed measured by the speech speed measurement unit 204, and obtains from the third database T3 the output volume (an example of a second utterance volume) corresponding to the utterance volume measured by the utterance volume measurement unit 1208. The voice synthesis unit 1207 then performs voice synthesis on the final response sentence input from the response sentence generation unit 1206 so that it is output at the obtained output speech speed and output volume, thereby generating response voice data. The generated response voice data is transmitted to the voice input/output device 100 via the communication unit 201 and output from the speaker 103 as response voice.

FIG. 14 is a diagram showing an example of scenes in which the final response sentence is determined in Embodiment 2. In the first example of FIG. 14, the user says "It's nice weather today" slowly and loudly, at 3 mora/sec and 45 dB (step S1401). Because the reliability of this utterance is below the threshold and a recognition error is determined, the response sentence generation unit 1206 obtains the response sentence "Please speak quickly", which corresponds to the speech speed of 3 mora/sec, from the first database T1, and obtains the response sentence "Please speak in a quiet voice", which corresponds to the utterance volume of 45 dB, from the third database T3. The response sentence generation unit 1206 then combines "Please speak quickly" and "Please speak in a quiet voice" to generate the final response sentence "Please speak quickly in a quiet voice". Specifically, the response sentence generation unit 1206 generates the final response sentence by supplying the words that each of the two obtained response sentences lacks relative to the other. For example, compared with "Please speak in a quiet voice", the response sentence "Please speak quickly" lacks "in a quiet voice". The response sentence generation unit 1206 therefore adds "in a quiet voice" to "Please speak quickly" to generate the final response sentence "Please speak quickly in a quiet voice".

The voice synthesis unit 1207 obtains from the first database T1 the output speech speed of 5.5 mora/sec, which corresponds to the speech speed of 3 mora/sec, and obtains from the third database T3 the output volume of 25 dB, which corresponds to the utterance volume of 45 dB. The voice synthesis unit 1207 then generates response voice data for outputting the final response sentence "Please speak quickly in a quiet voice" at a speech speed of 5.5 mora/sec and a volume of 25 dB. The speaker 103 thus outputs the final response sentence "Please speak quickly in a quiet voice" at a speech speed of 5.5 mora/sec and an output volume of 25 dB (step S1402).

In the second example of FIG. 14, the user says "It's nice weather today" at an appropriate speech speed (4.5 mora/sec) but at a low volume (25 dB) (step S1501). Because the reliability of this utterance is below the threshold and a recognition error is determined, the response sentence generation unit 1206 obtains the response sentence "Please speak again", which corresponds to the speech speed of 4.5 mora/sec, from the first database T1, and obtains the response sentence "Please speak in a loud voice", which corresponds to the utterance volume of 25 dB, from the third database T3.

Because the speech speed needs no correction, the response sentence generation unit 1206 generates the final response sentence from "Please speak in a loud voice" alone, without using "Please speak again". Alternatively, the response sentence generation unit 1206 may combine "Please speak again" and "Please speak in a loud voice" to generate the final response sentence "Please speak again in a loud voice".
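The two FIG. 14 cases can be reproduced with a simplified composition sketch. Rather than comparing the two retrieved sentences word by word as the text describes, this illustration assumes each database lookup has already been reduced to an optional modifier phrase (for example "quickly" or "in a quiet voice"), with `None` meaning that no correction is needed. The function name and this representation are assumptions, not the disclosed method.

```python
from typing import Optional


def final_response(speed_modifier: Optional[str],
                   volume_modifier: Optional[str]) -> str:
    """Combine the speed and volume corrections into one ask-back sentence."""
    parts = ["Please speak"]
    if speed_modifier:
        parts.append(speed_modifier)
    if volume_modifier:
        parts.append(volume_modifier)
    if len(parts) == 1:
        # Neither speed nor volume needs correction: plain ask-back.
        parts.append("again")
    return " ".join(parts)
```

For the first example, `final_response("quickly", "in a quiet voice")` gives "Please speak quickly in a quiet voice"; for the second, `final_response(None, "in a loud voice")` gives "Please speak in a loud voice".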

The voice synthesis unit 1207 obtains from the first database T1 the output speech speed of 4.5 mora/sec, which corresponds to the speech speed of 4.5 mora/sec, and obtains from the third database T3 the output volume of 45 dB, which corresponds to the utterance volume of 25 dB. The voice synthesis unit 1207 then generates response voice data for outputting the final response sentence "Please speak in a loud voice" at an output speech speed of 4.5 mora/sec and an output volume of 45 dB. The speaker 103 thus outputs the final response sentence "Please speak in a loud voice" at an output speech speed of 4.5 mora/sec and an output volume of 45 dB (step S1502).

FIG. 15 is a flowchart showing an example of the processing of the voice dialogue system according to Embodiment 2 of the present disclosure. Steps S101 to S106 are the same as steps S101 to S106 shown in FIG. 7.

In step S207, which follows step S106, the utterance volume measurement unit 1208 measures the volume of the user's utterance using the method described above. Next, the response sentence generation unit 1206 refers to the first database T1 and the third database T3, obtains the response sentences corresponding to the speech speed measured in step S106 and the utterance volume measured in step S207, and generates the final response sentence (step S208).

Next, the voice synthesis unit 1207 refers to the first database T1 and the third database T3 and determines the output speech speed and the output volume corresponding to the speech speed measured in step S106 and the utterance volume measured in step S207 (step S209). The voice synthesis unit 1207 then generates response voice data for outputting the final response sentence generated in step S208 at the output speech speed and output volume determined in step S209 (step S210). If a response sentence based on the intention understanding result was generated in step S105, the voice synthesis unit 1207 instead generates, in step S210, response voice data for the response sentence generated in step S105. Finally, the voice synthesis unit 1207 transmits the response voice data to the voice input/output device 100 via the communication unit 201 and causes the speaker 103 to output the response voice.

According to the voice dialogue system of Embodiment 2 described above, when a recognition error occurs because of the speech speed, the utterance volume, or both, the ask-back speech is output with its speech speed, its volume, or both adjusted accordingly. That is, when the cause of the recognition error lies in one or both of the speech speed and the utterance volume, the response sentence is output as speech whose output speech speed and/or output volume has been adjusted so as to prompt the corresponding correction. The user can thus recognize intuitively both the cause and the degree of correction required, and repeated recognition errors can be prevented.

本開示は下記の変形例が採用できる。 The following modifications can be adopted in the present disclosure.

（１）図６の例では、第１データベースＴ１は５つのレコードにより構成されているが、これは一例であり、６つ以上、４つ以下のレコードで構成されていてもよい。また、図６の例では、５モーラ／秒以上の入力発話速度に対して、１行目及び２行目の２つのレコードが設けられており、応答文及び出力発話速度は２段階で調整可能に構成されているが、これは一例であり、応答文及び出力発話速度は１段階又は３段階以上で調整可能に構成されてもよい。また、４モーラ／秒未満の入力発話速度に対して、４行目及び５行目の２つのレコードが設けられており、応答文及び出力発話速度は２段階で調整可能に構成されているが、これは一例であり、応答文及び出力発話速度は１段階又は３段階以上で調整可能に構成されてもよい。これらのことは、図８に示す第２データベースＴ２及び第３データベースＴ３についても同じである。さらに、出力発話速度及び出力音量に関しては連続的に調整可能に構成されてもよい。 (1) In the example of FIG. 6, the first database T1 consists of five records, but this is only an example, and it may consist of six or more records or of four or fewer records. Also, in the example of FIG. 6, two records (the first and second rows) are provided for input utterance speeds of 5 mora/sec or more, so that the response sentence and the output utterance speed can be adjusted in two steps; this is only an example, and they may be adjustable in one step or in three or more steps. Likewise, two records (the fourth and fifth rows) are provided for input utterance speeds of less than 4 mora/sec, so that the response sentence and the output utterance speed can be adjusted in two steps; this too is only an example, and they may be adjustable in one step or in three or more steps. The same applies to the second database T2 and the third database T3 shown in FIG. 8. Furthermore, the output utterance speed and the output volume may be configured to be continuously adjustable.
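The continuously adjustable variant mentioned at the end of modification (1) could look like the following Python sketch. The description does not specify a formula, so the linear mapping, the reference point of 4.5 mora/sec, the gain, and the clamping limits are all assumptions chosen for illustration.

```python
def continuous_output_speed(input_speed, reference=4.5, gain=1.0,
                            lowest=3.0, highest=6.0):
    """Map the input utterance speed to an output speed without discrete steps.

    The faster the user spoke relative to `reference`, the slower the
    response is spoken, and vice versa. All default values are hypothetical.
    """
    # Mirror the deviation from the reference speed, scaled by `gain`.
    raw = reference - gain * (input_speed - reference)
    # Clamp so that the synthesized voice stays within an intelligible range.
    return max(lowest, min(highest, raw))
```

With the default parameters, an input at the reference speed is answered at the reference speed, and inputs far from it saturate at the clamping limits instead of producing extreme output speeds. The same shape of function could be applied to the output volume.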

（２）第１データベースＴ１において発話速度を規定する各種数値及び応答文は一例に過ぎず、他の数値及び応答文が採用されてもよい。このことは、第２データベースＴ２及び第３データベースＴ３も同じである。 (2) In the first database T1, various numerical values and response sentences that define the utterance speed are merely examples, and other numerical values and response sentences may be adopted. This also applies to the second database T2 and the third database T3.

（３）実施の形態１の変形例では、第１データベースＴ１に代えて第２データベースＴ２が用いられているが本開示はこれに限定されない。例えば、第２データベースＴ２は、図９に示す「応答文」の列が省かれて構成されてもよい。この場合、応答文生成部２０６は、第１発話に対しては第１データベースＴ１を参照して発話速度に応じた応答文及び出力発話速度を決定し、第２発話に対しては「応答文」の列が省かれた第２データベースＴ２を参照して改善されなかった場合の応答文及び出力発話速度を決定すればよい。 (3) In the modification of the first embodiment, the second database T2 is used instead of the first database T1, but the present disclosure is not limited to this. For example, the second database T2 may be configured without the "response sentence" column shown in FIG. 9. In this case, the response sentence generation unit 206 may determine, for the first utterance, the response sentence and the output utterance speed corresponding to the utterance speed by referring to the first database T1, and may determine, for the second utterance, the response sentence for the case of no improvement and the output utterance speed by referring to the second database T2 from which the "response sentence" column has been omitted.

（４）実施の形態２では、図１２に示すようにメモリ１２０５は、第１データベースＴ１を含んでいるが、本開示はこれに限定されず、第１データベースＴ１に代えて図９に示す第２データベースＴ２を含んでいてもよい。この場合、応答文生成部１２０６は、聞き返しにより実施された第２発話において、発話速度の改善がみられない場合、第２データベースＴ２の「改善されなかった場合の応答文」の列を参照して発話速度に応じた応答文を取得すればよい。そして、この場合、応答文生成部１２０６は、第２データベースＴ２から取得した応答文と第３データベースＴ３から取得した発話音量に応じた応答文とを組み合わせて最終応答文を生成すればよい。 (4) In the second embodiment, the memory 1205 includes the first database T1 as shown in FIG. 12, but the present disclosure is not limited to this, and the memory 1205 may include the second database T2 shown in FIG. 9 instead of the first database T1. In this case, when no improvement in utterance speed is seen in the second utterance made in response to the ask-back, the response sentence generation unit 1206 may refer to the "response sentence when not improved" column of the second database T2 and acquire the response sentence corresponding to the utterance speed. The response sentence generation unit 1206 may then generate the final response sentence by combining the response sentence acquired from the second database T2 with the response sentence corresponding to the utterance volume acquired from the third database T3.

（５）第１データベースＴ１において、３行目の出力発話速度「４．５モーラ／秒」は、３行目の入力発話速度「４モーラ／秒以上、５モーラ／秒未満」の中間値が採用されているが、本開示はこれに限定されない。例えば、３行目の出力発話速度は３行目の入力発話速度の範囲外の所定の値が採用されてもよい。この場合、１行目、２行目の出力発話速度は、３行目の出力発話速度の値を基準として相対的に遅い値が採用され、４行目、５行目の出力発話速度は、３行目の出力発話速度の値を基準として相対的に速い値が採用されてもよい。これは、応答文の音声の発話速度を変更することで発話速度の改善を促すに際して、出力発話速度は基準となる出力発話速度に対して相対的に速い又は遅ければそれで足り、必ずしも入力発話速度を基準に設定しなくてもよいという考えに基づいている。この考えは第２データベースＴ２及び第３データベースＴ３についても適用可能である。 (5) In the first database T1, the output utterance speed "4.5 mora/sec" in the third row is the midpoint of the third-row input utterance speed range "4 mora/sec or more and less than 5 mora/sec", but the present disclosure is not limited to this. For example, a predetermined value outside the third-row input utterance speed range may be adopted as the third-row output utterance speed. In this case, values slower than the third-row output utterance speed may be adopted as the output utterance speeds of the first and second rows, and values faster than the third-row output utterance speed may be adopted as the output utterance speeds of the fourth and fifth rows. This is based on the idea that, when encouraging the user to improve the utterance speed by changing the utterance speed of the response voice, it suffices for the output utterance speed to be faster or slower relative to a reference output utterance speed, and the output utterance speed need not be set with the input utterance speed as the reference. This idea is also applicable to the second database T2 and the third database T3.
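Modification (5) can be pictured as deriving all five output utterance speeds from one reference value instead of from the input ranges. The following Python sketch shows this under assumed offsets; the offsets, the reference value used in the usage note, and the function name are hypothetical.

```python
def build_output_speeds(reference, offsets=(-1.5, -1.0, 0.0, 1.0, 1.5)):
    """Return the output utterance speeds for rows 1 to 5 of the table.

    `reference` is the third-row output speed (mora/sec). Rows 1 and 2 are
    slower than it and rows 4 and 5 are faster, regardless of where the
    reference lies relative to the input speed ranges.
    """
    if list(offsets) != sorted(offsets) or 0.0 not in offsets:
        raise ValueError("offsets must be ascending and contain 0.0")
    return [reference + offset for offset in offsets]
```

Calling `build_output_speeds(6.0)` uses a reference outside the third-row input range of 4 to 5 mora/sec, yet the resulting rows still run from relatively slow to relatively fast, which is all the modification requires.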

本開示は、幼児のような聞き返しの応答文の意味を理解するのが困難なユーザに対しても誤認識の原因が発話速度にあることを直感的に理解させることができるため、このようなユーザと音声によりインタラクションを図るうえで有用である。 The present disclosure allows even a user who has difficulty understanding the meaning of an ask-back response sentence, such as an infant, to intuitively understand that the cause of misrecognition lies in the utterance speed, and is therefore useful for voice interaction with such users.

１００：音声入出力装置
１０１：通信部
１０２：マイク
１０３：スピーカ
２００：音声対話制御装置
２０１：通信部
２０２：音声認識部
２０３：意図理解部
２０４：発話速度計測部
２０５：メモリ
２０６：応答文生成部
２０７：音声合成部
２１０：プロセッサ
１２０５：メモリ
１２０６：応答文生成部
１２０７：音声合成部
１２０８：発話音量計測部
Ｔ１：第１データベース
Ｔ２：第２データベース
Ｔ３：第３データベース

100: Voice input/output device
101: Communication unit
102: Microphone
103: Speaker
200: Voice dialogue control device
201: Communication unit
202: Voice recognition unit
203: Intention understanding unit
204: Utterance speed measurement unit
205: Memory
206: Response sentence generation unit
207: Voice synthesis unit
210: Processor
1205: Memory
1206: Response sentence generation unit
1207: Voice synthesis unit
1208: Utterance volume measurement unit
T1: First database
T2: Second database
T3: Third database

Claims

音声対話制御装置が行う音声対話制御方法であって、
音声入力装置に入力されたユーザによる第１発話から変換された音声データに対して音声認識処理を行い、
前記音声認識処理の信頼度を算出し、
前記信頼度が所定の閾値以下であるか否かを判定し、
前記信頼度が前記閾値以下である場合、
前記音声データに基づき前記ユーザによる発話の速度である第１発話速度を計測し、
前記第１発話速度に基づき、前記ユーザに対して再度の発話を促す旨の応答文を音声により出力する際の速度である第２発話速度を決定し、
前記第２発話速度で前記応答文の音声を出力する旨の指示情報を出力する、
音声対話制御方法。 A voice dialogue control method performed by a voice dialogue control device, the method comprising:
performing voice recognition processing on voice data converted from a first utterance by a user input to a voice input device;
calculating a reliability of the voice recognition processing;
determining whether or not the reliability is equal to or less than a predetermined threshold value; and
when the reliability is equal to or less than the threshold value:
measuring, based on the voice data, a first utterance speed, which is the speed of the utterance by the user;
determining, based on the first utterance speed, a second utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice; and
outputting instruction information for outputting the voice of the response sentence at the second utterance speed.
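The sequence of steps in this claim can be sketched in Python as follows. This is an illustrative outline only: the `Instruction` type, the callables passed in for measurement and speed selection, and the default response sentence are hypothetical names introduced here, and the recognizer itself is outside the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Instruction:
    """Instruction information handed to the voice synthesis stage."""
    response_sentence: str
    output_speed: float  # second utterance speed, in mora/sec


def control_step(reliability: float,
                 threshold: float,
                 measure_speed: Callable[[], float],
                 decide_speed: Callable[[float], float],
                 response_sentence: str = "Please say that again.",
                 ) -> Optional[Instruction]:
    """Return ask-back instruction information, or None if none is needed."""
    # Reliability above the threshold: recognition is trusted, no ask-back.
    if reliability > threshold:
        return None
    # The utterance speed is measured only when the reliability is low.
    first_speed = measure_speed()
    second_speed = decide_speed(first_speed)
    return Instruction(response_sentence, second_speed)
```

Passing the measurement and the speed selection as callables keeps the sketch independent of any particular recognizer or table layout, and makes it visible that the measurement is skipped when the reliability exceeds the threshold.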

前記第１発話速度が第１閾値未満である場合、第２発話速度を、あらかじめ定められた第３発話速度より大きくする、
請求項１記載の音声対話制御方法。 The voice dialogue control method according to claim 1, wherein, when the first utterance speed is less than a first threshold, the second utterance speed is made higher than a predetermined third utterance speed.

前記第１発話速度が第２閾値以上である場合、第２発話速度を、あらかじめ定められた第３発話速度より小さくする、
請求項１記載の音声対話制御方法。 The voice dialogue control method according to claim 1, wherein, when the first utterance speed is equal to or higher than a second threshold, the second utterance speed is made lower than a predetermined third utterance speed.

前記第２発話速度で出力された応答文に対する前記ユーザの第２発話から変換された音声データに対して音声認識処理を行い、
前記音声認識処理の信頼度を算出し、
前記信頼度が前記閾値以下であるか否かを判定し、
前記信頼度が前記閾値以下である場合、
前記音声データに基づき前記第２発話の速度である第４発話速度を計測し、
前記第４発話速度に基づき、前記ユーザに対して再度の発話を促す旨の応答文を音声により出力する際の速度である第５発話速度を決定し、
前記第５発話速度で前記応答文の音声を出力する旨の指示情報を出力する、
請求項１記載の音声対話制御方法。 The voice dialogue control method according to claim 1, further comprising:
performing voice recognition processing on voice data converted from a second utterance made by the user in response to the response sentence output at the second utterance speed;
calculating a reliability of the voice recognition processing;
determining whether or not the reliability is equal to or less than the threshold value; and
when the reliability is equal to or less than the threshold value:
measuring, based on the voice data, a fourth utterance speed, which is the speed of the second utterance;
determining, based on the fourth utterance speed, a fifth utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice; and
outputting instruction information for outputting the voice of the response sentence at the fifth utterance speed.

前記信頼度が前記閾値未満である場合、
さらに前記第１発話の音量である第１発話音量を計測し、
さらに前記第１発話速度および前記第１発話音量の少なくとも一方に基づき、前記ユーザに対して再度の発話を促す旨の応答文を音声により出力する際の速度である第２発話速度および前記応答文を音声により出力する際の音量である第２発話音量の少なくとも一方を決定し、
さらに前記第２発話速度および前記第２発話音量の少なくとも一方で前記応答文を出力する旨の指示情報を出力する、
請求項１記載の音声対話制御方法。 The voice dialogue control method according to claim 1, further comprising, when the reliability is less than the threshold value:
measuring a first utterance volume, which is the volume of the first utterance;
determining, based on at least one of the first utterance speed and the first utterance volume, at least one of the second utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice, and a second utterance volume, which is the volume at which the response sentence is output by voice; and
outputting instruction information for outputting the response sentence with at least one of the second utterance speed and the second utterance volume.

前記第２発話速度は、前記第１発話速度が前記第１閾値よりも小さくなるにつれて、前記第３発話速度に対してより大きく設定される、
請求項２記載の音声対話制御方法。 The voice dialogue control method according to claim 2, wherein the second utterance speed is set higher relative to the third utterance speed the further the first utterance speed falls below the first threshold.

前記第２発話速度は、前記第１発話速度が前記第２閾値よりも大きくなるにつれて、前記第３発話速度に対してより小さく設定される、
請求項３記載の音声対話制御方法。 The voice dialogue control method according to claim 3, wherein the second utterance speed is set lower relative to the third utterance speed the further the first utterance speed rises above the second threshold.

前記第１発話速度は、第１発話のモーラ数を発話時間で割った値である、
請求項１〜７のいずれかに記載の音声対話制御方法。 The voice dialogue control method according to any one of claims 1 to 7, wherein the first utterance speed is a value obtained by dividing the number of morae in the first utterance by the utterance time.
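The mora-based definition of utterance speed can be made concrete with a short Python sketch. It assumes that the recognized utterance is available as a kana-only string and counts morae with a common simplification: every kana counts as one mora except the small kana that merge with the preceding character. The function names and this counting rule are assumptions, not taken from the claims.

```python
# Small kana that combine with the preceding kana into a single mora.
# The sokuon, the long-vowel mark, and the moraic nasal each count as one
# mora, so they are deliberately not listed here.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana: str) -> int:
    """Count morae in a kana-only string (whitespace is ignored)."""
    return sum(1 for ch in kana
               if not ch.isspace() and ch not in SMALL_KANA)


def utterance_speed(kana: str, duration_sec: float) -> float:
    """Utterance speed in mora/sec: number of morae divided by utterance time."""
    if duration_sec <= 0:
        raise ValueError("duration_sec must be positive")
    return count_morae(kana) / duration_sec
```

For instance, "きょうはいいてんき" has nine characters but eight morae, because "きょ" is a single mora; spoken in 2 seconds it gives 4.0 mora/sec.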

前記第２発話速度の決定では、複数の前記第１発話速度のそれぞれに対応する応答文及び前記第２発話速度が登録された第１データベースを参照し、計測された前記第１発話速度に対応する応答文及び前記第２発話速度を決定し、
前記第１データベースにおいて、前記第２発話速度は、前記第１発話速度が第１閾値よりも小さくなるにつれてあらかじめ定められた第３発話速度よりも大きく設定され、前記第１発話速度が第２閾値（＞第１閾値）よりも大きくなるにつれて前記第３発話速度よりも小さく設定されている、
請求項１〜８のいずれかに記載の音声対話制御方法。 The voice dialogue control method according to any one of claims 1 to 8, wherein determining the second utterance speed includes referring to a first database, in which a response sentence and a second utterance speed are registered for each of a plurality of first utterance speeds, and determining the response sentence and the second utterance speed corresponding to the measured first utterance speed, and
wherein, in the first database, the second utterance speed is set higher than a predetermined third utterance speed the further the first utterance speed falls below a first threshold, and is set lower than the third utterance speed the further the first utterance speed rises above a second threshold (> first threshold).

前記第５発話速度の決定では、複数の前記第４発話速度のそれぞれに対応する応答文及び前記第５発話速度が登録された第２データベースを参照し、計測された前記第４発話速度に対応する応答文及び前記第５発話速度を決定し、
前記第２データベースにおいて、前記応答文は、前記第４発話速度が第１閾値より小さい場合、発話速度の上昇を強調する言葉を含み、前記第４発話速度が第２閾値（＞第１閾値）より大きい場合、前記発話速度の低下を強調する言葉を含み、
前記第２データベースにおいて、前記第５発話速度は、前記第４発話速度が前記第１閾値よりも小さい場合、あらかじめ定められた第３発話速度よりも大きく設定され、前記第４発話速度が前記第２閾値よりも大きい場合、前記第３発話速度よりも小さく設定されている、
請求項４記載の音声対話制御方法。 The voice dialogue control method according to claim 4, wherein determining the fifth utterance speed includes referring to a second database, in which a response sentence and a fifth utterance speed are registered for each of a plurality of fourth utterance speeds, and determining the response sentence and the fifth utterance speed corresponding to the measured fourth utterance speed,
wherein, in the second database, the response sentence includes words emphasizing an increase in utterance speed when the fourth utterance speed is lower than a first threshold, and includes words emphasizing a decrease in utterance speed when the fourth utterance speed is higher than a second threshold (> first threshold), and
wherein, in the second database, the fifth utterance speed is set higher than a predetermined third utterance speed when the fourth utterance speed is lower than the first threshold, and is set lower than the third utterance speed when the fourth utterance speed is higher than the second threshold.
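A minimal Python sketch of such a second database follows. The thresholds of 4 and 5 mora/sec and the reference speed of 4.5 mora/sec are borrowed from the example values in the description and are used here as assumptions; the response wording, the speed offsets, and the behavior between the two thresholds are hypothetical.

```python
FIRST_THRESHOLD = 4.0   # mora/sec, assumed
SECOND_THRESHOLD = 5.0  # mora/sec, assumed (> FIRST_THRESHOLD)
THIRD_SPEED = 4.5       # reference output speed in mora/sec, assumed


def reask_after_second_utterance(fourth_speed):
    """Return (response sentence, fifth utterance speed) for a repeated ask-back."""
    if fourth_speed < FIRST_THRESHOLD:
        # Still too slow: emphasize speeding up and speak faster than reference.
        return "Please speak much faster.", THIRD_SPEED + 1.5
    if fourth_speed > SECOND_THRESHOLD:
        # Still too fast: emphasize slowing down and speak slower than reference.
        return "Please speak much more slowly.", THIRD_SPEED - 1.5
    # Between the thresholds the claim is silent; this fallback is an assumption.
    return "Please say that again.", THIRD_SPEED
```

The emphasis lives in the sentence text, while the direction of the speed change mirrors the first database: slow input is answered quickly and fast input is answered slowly.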

前記第２発話音量の決定では、複数の前記第１発話音量のそれぞれに対応する応答文及び前記第２発話音量が登録された第３データベースを参照し、計測された前記第１発話音量に対応する応答文及び前記第２発話音量を決定し、
前記第３データベースにおいて、前記第２発話音量は、前記第１発話音量が第３閾値よりも小さくなるにつれてあらかじめ定められた第３発話音量よりも大きく設定され、前記第１発話音量が第４閾値よりも大きくなるにつれて前記第３発話音量よりも小さく設定されている、
請求項５記載の音声対話制御方法。 The voice dialogue control method according to claim 5, wherein determining the second utterance volume includes referring to a third database, in which a response sentence and a second utterance volume are registered for each of a plurality of first utterance volumes, and determining the response sentence and the second utterance volume corresponding to the measured first utterance volume, and
wherein, in the third database, the second utterance volume is set higher than a predetermined third utterance volume the further the first utterance volume falls below a third threshold, and is set lower than the third utterance volume the further the first utterance volume rises above a fourth threshold.

音声対話制御装置であって、
音声入力装置に入力されたユーザによる第１発話から変換された音声データに対して音声認識処理を行う音声認識部と、
前記音声認識処理の信頼度を算出する算出部と、
前記信頼度が所定の閾値以下であるか否かを判定する判定部と、
前記信頼度が前記閾値以下である場合、前記音声データに基づき前記ユーザによる発話の速度である第１発話速度を計測する計測部と、
前記第１発話速度に基づき、前記ユーザに対して再度の発話を促す旨の応答文を音声により出力する際の速度である第２発話速度を決定し、前記第２発話速度で前記応答文を出力する旨の指示情報を出力する応答文生成部と、
を備える音声対話制御装置。 A voice dialogue control device comprising:
a voice recognition unit that performs voice recognition processing on voice data converted from a first utterance by a user input to a voice input device;
a calculation unit that calculates a reliability of the voice recognition processing;
a determination unit that determines whether or not the reliability is equal to or less than a predetermined threshold value;
a measurement unit that, when the reliability is equal to or less than the threshold value, measures, based on the voice data, a first utterance speed, which is the speed of the utterance by the user; and
a response sentence generation unit that determines, based on the first utterance speed, a second utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice, and outputs instruction information for outputting the response sentence at the second utterance speed.

コンピュータを、音声対話制御装置として機能させるプログラムであって、
前記コンピュータに、
音声入力装置に入力されたユーザによる第１発話から変換された音声データに対して音声認識処理を行い、
前記音声認識処理の信頼度を算出し、
前記信頼度が所定の閾値以下であるか否かを判定し、
前記信頼度が前記閾値以下である場合、
前記音声データに基づき前記ユーザによる発話の速度である第１発話速度を計測し、
前記第１発話速度に基づき、前記ユーザに対して再度の発話を促す旨の応答文を音声により出力する際の速度である第２発話速度を決定し、
前記第２発話速度で前記応答文を出力する旨の指示情報を出力する、
ことを実行させるプログラム。 A program that causes a computer to function as a voice dialogue control device, the program causing the computer to execute:
performing voice recognition processing on voice data converted from a first utterance by a user input to a voice input device;
calculating a reliability of the voice recognition processing;
determining whether or not the reliability is equal to or less than a predetermined threshold value; and
when the reliability is equal to or less than the threshold value:
measuring, based on the voice data, a first utterance speed, which is the speed of the utterance by the user;
determining, based on the first utterance speed, a second utterance speed, which is the speed at which a response sentence prompting the user to speak again is output by voice; and
outputting instruction information for outputting the response sentence at the second utterance speed.