JP2021051172A

JP2021051172A - Dialogue system and program

Info

Publication number: JP2021051172A
Application number: JP2019173551A
Authority: JP
Inventors: Tetsunori Kobayashi (小林哲則); Shinya Fujie (藤江真也)
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 2019-09-24
Filing date: 2019-09-24
Publication date: 2021-04-01
Anticipated expiration: 2039-09-24
Also published as: JP7274210B2

Abstract

To provide a dialogue system that improves system responsiveness and avoids or suppresses collisions while also avoiding or suppressing unnecessarily long system turn-taking latency. SOLUTION: A dialogue system 10 comprises: system utterance timing detecting means 22, which detects the start timing of a system utterance by identifying, asynchronously with voice recognition processing means 41, whether the user's utterance right is maintained or terminated, using acoustic features extracted from the voice signal of the user's utterance; next utterance preparation means 43, which, before that detection, prepares the system's next utterance using subject data stored in subject data storage means 51 or the like, dialogue history information, and the voice recognition result for the part of the ongoing user utterance recognized so far; and utterance generation means 25, which, after the start timing of the system utterance is detected, reproduces the system utterance using the next utterance prepared by the next utterance preparation means 43. SELECTED DRAWING: Figure 1

Description

The present invention relates to a dialogue system and a program implemented by a computer that executes processing for voice dialogue with a user. It can be used for, for example: a news dialogue system that conveys the content of articles to a user by using scenario data generated from article data covering topics such as news, columns, and history; a guidance dialogue system that explains how to use equipment or guides a user around a facility; a questionnaire dialogue system that surveys user trends such as election outlook or consumer preferences; an information search dialogue system with which a user searches for information such as stores, products, travel destinations, or songs to listen to; an operation dialogue system with which a user operates devices and equipment such as home appliances and cars; an education dialogue system for educating users such as children, students, and new employees; and an information identification dialogue system with which the system identifies information such as user attributes.

In general, a voice dialogue system carries out a desired task and achieves its purpose by exchanging linguistic information, mainly over the voice channel, between a human user and the system itself, which is a computer system. Examples of such purposes include conveying the content of news and other articles to the user, giving guidance to the user, administering a questionnaire to the user, information search by the user, operation of devices by the user, education of the user, and identification of information by the system.

More specifically, a conventional voice dialogue system proceeds as follows. It first acquires the voice signal of the user utterance (voice signal acquisition). It then detects the utterance section, cutting an utterance-sized voice signal out of the continuously acquired signal by using a break in the user's speech as the cue (utterance section detection). Next, it performs voice recognition processing that converts the voice signal of the detected utterance section into linguistic information, thereby estimating the meaning of the user utterance (voice recognition). It then determines the next utterance according to the estimated meaning, that is, it generates system utterance content suited to the user's linguistic information (utterance content generation). It performs voice synthesis processing that converts that content into a voice signal (voice synthesis), and finally reproduces the generated voice signal to convey the system utterance to the user (voice signal reproduction). Because a conventional voice dialogue system performs this series of processes sequentially in principle, the delays of the individual processes accumulate, and a long delay arises between the moment the user finishes speaking and the moment the system responds.
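The accumulation of delay in a sequential pipeline can be illustrated with a minimal Python sketch. The stage names follow the paragraph above, but the delay values are hypothetical placeholders chosen only for illustration; the source gives no per-stage figures.

```python
# Hypothetical per-stage delays in seconds (illustrative assumptions, not from the source).
STAGE_DELAYS = {
    "utterance_section_detection": 0.5,
    "voice_recognition": 0.3,
    "utterance_content_generation": 0.2,
    "voice_synthesis": 0.3,
}


def sequential_response_delay(stage_delays):
    """Delay from the end of the user utterance to the start of playback
    when every stage waits for the previous one to finish."""
    return sum(stage_delays.values())


total_delay = sequential_response_delay(STAGE_DELAYS)
```

With stages run strictly one after another, the response delay is the sum of all stage delays, which is why shortening any single stage does not remove the gap.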

The length of the gap between the utterances of two parties in a voice dialogue is called the turn-taking latency. In smooth human-to-human dialogue, the turn-taking latency averages about 0.6 seconds and is about 1 second at the longest. A speaker also often starts speaking before the other party has finished, which is called a collision. By contrast, in dialogue between a person and a dialogue system such as the smart speakers that have become widespread in recent years, the gap from the end of the user's utterance to the start of the system's utterance is often one to several seconds. (Hereinafter, the turn-taking latency in the user-to-system direction is called the system's turn-taking latency.) Previous research indicates that one party's turn-taking latency affects the other's, so when the system's turn-taking latency becomes unnecessarily long, the user's gap (the time from the end of the system's utterance to the start of the user's response) also grows longer. The time required for the whole dialogue then becomes unnecessarily long, which is undesirable for both task efficiency and user experience.

It is therefore desirable to improve system responsiveness so that the unnecessarily long silence that has conventionally arisen between a user utterance and a system utterance is shortened or avoided altogether. Achieving this requires detecting the start timing of the system utterance appropriately. Simply adopting a method that starts the system utterance earlier is not enough: if the detection process advances the start timing unduly in order to shorten the system's turn-taking latency, collisions become more likely.

More specifically, conventional voice dialogue systems have regarded the end of the user utterance as the start timing of the system utterance. This is a very natural idea for one-to-one dialogue, but what counts as the end of a user utterance was never clearly defined. For example, a voice section delimited by pauses of a specific length (for example, 100 milliseconds or more) is called an Inter-Pausal Unit (IPU) and is widely used as the unit of speech in speech analysis and conversation analysis. However, silent sections of about 100 milliseconds occur frequently within a single speaker's utterance, so a speaker change does not necessarily occur before or after one. If short silent sections in the user's voice signal are used to detect the start timing of the system utterance, the system utterance that is generated and played back may overlap with the continuing user utterance, producing a collision that can disrupt the dialogue. Delimiting by longer silent sections prevents such overlap (collision), but the start of the system utterance is then delayed by the length of that silent section, so the silence between the user utterance and the system utterance cannot be shortened.
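The trade-off between short and long pause thresholds can be made concrete with a small Python sketch that segments a sequence of per-frame speech/non-speech flags into IPUs. The function name, the 10 ms frame length, and the boolean input format are assumptions made for this illustration only.

```python
def split_into_ipus(frame_is_speech, frame_ms=10, min_pause_ms=100):
    """Return (start_ms, end_ms) spans of speech separated by pauses of at
    least min_pause_ms. Shorter pauses are kept inside the same span."""
    ipus = []
    start = None        # first frame of the span currently being built
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(frame_is_speech):
        if not is_speech:
            continue
        if start is None:
            start = i
        elif (i - last_speech - 1) * frame_ms >= min_pause_ms:
            # The pause since the last speech frame is long enough: close the span.
            ipus.append((start * frame_ms, (last_speech + 1) * frame_ms))
            start = i
        last_speech = i
    if start is not None:
        ipus.append((start * frame_ms, (last_speech + 1) * frame_ms))
    return ipus


# 200 ms of speech, a 50 ms pause, 100 ms of speech, a 300 ms pause, 50 ms of speech.
frames = [True] * 20 + [False] * 5 + [True] * 10 + [False] * 30 + [True] * 5
ipus_100ms = split_into_ipus(frames, min_pause_ms=100)
ipus_50ms = split_into_ipus(frames, min_pause_ms=50)
```

In this toy input, the 100 ms threshold yields two units while the 50 ms threshold yields three. A system that treated every unit boundary as a turn end would respond sooner with the shorter threshold, but would also interrupt the speaker at the 50 ms pause.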

In addition, conventional voice recognition performs a process called Voice Activity Detection (VAD) to determine the voice section to be recognized. Various methods have been studied, ranging from simple ones that threshold the amplitude or zero-crossing count of the voice signal to models that decide probabilistically, from features obtained from the voice signal, whether speech is present. However, no method intended to determine the start timing of the system utterance early has been proposed.
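As a rough illustration of the simple threshold-based end of that spectrum, the following Python sketch classifies one frame of samples using mean absolute amplitude and zero-crossing count. The thresholds and the way the two cues are combined are assumptions for this example; the source does not specify any particular rule.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def mean_abs_amplitude(frame):
    """Average magnitude of the samples in the frame."""
    return sum(abs(s) for s in frame) / len(frame)


def is_speech_frame(frame, amp_high=0.1, amp_low=0.02, zc_min=10):
    """Treat a frame as speech if it is clearly loud, or if it is moderately
    loud and crosses zero often (a crude cue for unvoiced consonants).
    All thresholds are illustrative assumptions."""
    amp = mean_abs_amplitude(frame)
    if amp >= amp_high:
        return True
    return amp >= amp_low and zero_crossings(frame) >= zc_min


silence = [0.0] * 100
loud = [0.5, -0.5] * 50
weak_hiss = [0.03, -0.03] * 50
weak_offset = [0.03] * 100
```

A detector of this kind only labels frames as speech or non-speech. Deciding that an utterance has ended still requires waiting for a run of non-speech frames, which is the source of the delay discussed in the surrounding text.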

Furthermore, the present inventors have also studied techniques that, in order to determine the start timing of the system utterance, detect whether the user utterance continues or has ended, or what action the system should take next (not only utterances but also backchannel responses and the like). Except for the technique for generating system backchannels and repetitions in the middle of a user utterance, however, all of these presuppose VAD, just as voice recognition does, and cannot eliminate the delay caused by VAD processing.

In contrast to these conventional techniques, the present inventors have developed a technique that processes the voice signal sequentially, extracts acoustic features from it at a short period (for example, 10 to 100 milliseconds), and uses the extracted features to identify whether the system should speak. In other words, the technique identifies whether the user's utterance right, which indicates that the user holds the position or standing to speak, is maintained or terminated, where termination includes both transfer and abandonment (see Non-Patent Documents 1 and 2). This makes it possible to determine the start timing of the system utterance without the delay caused by voice section detection processing (VAD processing).
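The control flow of such sequential identification might look like the following Python sketch: at every short period a feature is extracted from the newest frame, and a classifier is asked whether the user has released the utterance right. Everything here is a stand-in. The log-energy feature, the function names, and especially `toy_end_probability` are hypothetical placeholders so that the loop runs; the source relies on a trained pattern recognizer, whose details are not given in this section.

```python
import math


def log_energy(frame):
    """Stand-in acoustic feature: log of the mean squared amplitude."""
    energy = sum(s * s for s in frame) / len(frame)
    return math.log(energy + 1e-10)


def first_turn_end_frame(frames, end_probability, threshold=0.8):
    """Process frames one by one and return the index of the first frame at
    which the estimated probability of turn end reaches the threshold,
    or None if it never does."""
    features = []
    for i, frame in enumerate(frames):
        features.append(log_energy(frame))
        if end_probability(features) >= threshold:
            return i
    return None


def toy_end_probability(features, quiet=-10.0, window=3):
    """Toy placeholder for a trained model (NOT the inventors' classifier):
    the fraction of the last `window` frames whose log energy is low."""
    recent = features[-window:]
    return sum(1 for f in recent if f < quiet) / window


loud_frame = [0.5, -0.5] * 5
quiet_frame = [0.0] * 10
detected_at = first_turn_end_frame(
    [loud_frame] * 4 + [quiet_frame] * 5, toy_end_probability
)
```

The point of the sketch is the shape of the loop: a decision is available after every frame, so the system does not have to wait for a separate utterance section detector to close a segment before it can act.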

In the present invention, when a plurality of next utterance candidates have been prepared, a process of selecting the next utterance from among them is performed. As a technique for generating the information needed for this selection, a technique developed by the present inventors that estimates the user's utterance intention through prosodic analysis is known (see Non-Patent Document 3).

The present invention can be applied to various dialogue systems, such as a news dialogue system, a guidance dialogue system, a questionnaire dialogue system, an information search dialogue system, an operation dialogue system, and an education dialogue system. As a dialogue system capable of conveying information to the user efficiently, a news dialogue system developed by the present inventors is known, which conveys the content of news and other articles to the user using scenario data consisting of a main plan and subsidiary plans (see Non-Patent Document 4).

Non-Patent Document 1: Shinya Fujie, Katsuya Yokoyama, Tetsunori Kobayashi, "Sequential Prediction of User Utterance End Timing for Voice Dialogue Systems", Proceedings of the Acoustical Society of Japan, 2018
Non-Patent Document 2: Shinya Fujie, Katsuya Yokoyama, Tetsunori Kobayashi, "Sequential Estimation of the User's Utterance Right Maintenance State for Voice Dialogue Systems", Annual Conference of the Japanese Society for Artificial Intelligence, 2N1-03, June 2018
Non-Patent Document 3: Hiroaki Takatsu, Katsuya Yokoyama, Hiroshi Honda, Shinya Fujie, Tetsunori Kobayashi, "Understanding Utterance Intention in Consideration of the Context of System Utterances", Proceedings of the 25th Annual Meeting of the Association for Natural Language Processing, pp. 320-323, 2019
Non-Patent Document 4: Hiroaki Takatsu, Ishin Fukuoka, Shinya Fujie, Yoshihiko Hayashi, Tetsunori Kobayashi, "Voice Dialogue System that Enables Diverse Information Behaviors with Different Degrees of Intentionality", Transactions of the Japanese Society for Artificial Intelligence, vol. 22, no. 1, p. DSH-C_1-24, 2018

As described above, a conventional voice dialogue system performs the series of processes (voice signal acquisition, utterance section detection, voice recognition, utterance content generation, voice synthesis, and voice signal reproduction) sequentially, so the delays of the individual processes accumulate.

With the techniques described in Non-Patent Documents 1 and 2, pattern recognition processing that identifies whether the user's utterance right is maintained or terminated is performed using acoustic features extracted from the voice signal at a short period (for example, 10 to 100 milliseconds), so the start timing of the system utterance can be determined without the delay caused by voice section detection processing (VAD processing).

However, although the techniques described in Non-Patent Documents 1 and 2 allow the start timing of the system utterance to be determined early, without VAD delay, and appropriately, while avoiding or suppressing collisions, a problem remains: if the subsequent processes in the series described above, namely voice recognition, utterance content generation, voice synthesis, and voice signal reproduction, are performed sequentially as before, delay arises there.

Accordingly, it is desirable to develop a technique that uses the techniques described in Non-Patent Documents 1 and 2 to determine the start timing of the system utterance early and appropriately, while shortening or avoiding the unnecessarily long silence between the user utterance and the system utterance.

An object of the present invention is to provide a dialogue system and a program that can improve system responsiveness and that can avoid or suppress unnecessarily long system turn-taking latency while avoiding or suppressing collisions.

<Basic configuration of the present invention>

The present invention is a dialogue system implemented by a computer that executes processing for voice dialogue with a user, the dialogue system comprising:
voice signal acquisition means for acquiring a voice signal of a user utterance;
voice recognition processing means for executing voice recognition processing on the voice signal of the user utterance acquired by the voice signal acquisition means;
system utterance timing detecting means for extracting acoustic features from the voice signal of the user utterance acquired by the voice signal acquisition means; repeatedly executing, at a period that does not depend on the execution timing of the voice recognition processing by the voice recognition processing means, pattern recognition processing that identifies whether the user's utterance right, which indicates that the user holds the position or standing to speak, is maintained or terminated, using either the extracted acoustic features alone or those acoustic features together with linguistic features extracted from the linguistic information of the user utterance obtained as a result of the voice recognition processing; and detecting the start timing of a system utterance using the result of the pattern recognition processing;
next utterance preparation means for executing, at a timing that does not depend on the period of the pattern recognition processing by the system utterance timing detecting means and before the system utterance timing detecting means detects the start timing of the system utterance, preparation processing that acquires or generates content data of the system's next utterance, using subject data stored in subject data storage means or in an external system connected via a network, together with at least part of the dialogue history information between the user and the system and/or the result of the voice recognition processing, by the voice recognition processing means, of the ongoing user utterance up to its current point; and
utterance generation means for executing, after the system utterance timing detecting means detects the start timing of the system utterance, system utterance generation processing that includes reproducing the voice signal of the system utterance, using the content data of the next utterance obtained by the preparation processing by the next utterance preparation means.
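One way to picture how these means run side by side is the following minimal Python `asyncio` sketch. It is an illustration under stated assumptions, not the patented implementation: the function names, the pre-filled queue of partial recognition results, the list of turn-end scores, and the 0.8 threshold are all hypothetical, and the "content" is just a string built from the latest partial result rather than data drawn from subject data storage.

```python
import asyncio


async def prepare_next_utterance(partial_results, prepared):
    """Next utterance preparation: consumes partial recognition results while
    the user is still speaking and keeps the prepared content up to date."""
    while True:
        partial = await partial_results.get()
        if partial is None:  # sentinel: no more partial results
            return
        # Hypothetical stand-in for looking up subject data and dialogue history.
        prepared["text"] = f"Reply prepared from partial input: {partial!r}"


async def detect_system_start(turn_end_scores, start_detected, threshold=0.8, period=0.01):
    """Timing detection: periodic classification that runs on its own clock,
    independent of when recognition results arrive."""
    for score in turn_end_scores:
        await asyncio.sleep(period)
        if score >= threshold:
            start_detected.set()
            return


async def generate_utterance(start_detected, prepared, spoken):
    """Utterance generation: waits for the detected start timing, then uses
    whatever content has already been prepared."""
    await start_detected.wait()
    spoken.append(prepared.get("text", "(nothing prepared)"))


async def run_dialogue_turn(partials, turn_end_scores):
    partial_results = asyncio.Queue()
    for p in partials:
        partial_results.put_nowait(p)
    partial_results.put_nowait(None)
    prepared, spoken = {}, []
    start_detected = asyncio.Event()
    generator = asyncio.create_task(generate_utterance(start_detected, prepared, spoken))
    await asyncio.gather(
        prepare_next_utterance(partial_results, prepared),
        detect_system_start(turn_end_scores, start_detected),
    )
    if not start_detected.is_set():
        generator.cancel()  # the user kept the utterance right; say nothing
        return spoken
    await generator
    return spoken


spoken = asyncio.run(
    run_dialogue_turn(["tell me", "tell me about the weather"], [0.1, 0.2, 0.9])
)
```

Because the content is already in `prepared` when the event fires, the generation step has nothing left to compute at that moment, which is the property the configuration above is aiming for.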

Here, "termination" of the user's utterance right includes both abandonment and transfer. Abandonment is the case in which the speaker merely ends their own utterance; transfer is the case in which the speaker ends their own utterance and also prompts the other party to start speaking, as with a question to the other party.

The "subject data stored in the subject data storage means or in an external system connected via a network" used by the "next utterance preparation means" includes, for example, scenario data in which topics such as news have been turned into scenarios, topic data that has not been turned into scenarios, dictionary data, encyclopedia data, guidance data on how to use equipment or on facilities, questionnaire survey data, data for assisting the operation of devices and equipment, and educational data.

Furthermore, "acquiring" in "acquiring or generating content data of the system's next utterance" by the "next utterance preparation means" includes selectively acquiring the necessary subject data (subject data that will or may be used) from among the plural items of subject data stored in the subject data storage means or in the external system connected via a network, and selectively acquiring the necessary components (components that will or may be used) from among the components of any one item of subject data stored there.

"Generating" in the "next utterance preparation means" includes processing the acquired linguistic information (the subject data or the text data that is a component of it), such as partial conversion and adjustment of sentence endings and joining. It is preferable, however, that the subject data has already been processed in this way by the time it is stored in the subject data storage means or the external system. "Generating" also includes converting text data into voice data (voice synthesis). If voice data (for example, a wav file) is already available in the subject data storage means or the external system as subject data or a component of it, the "next utterance preparation means" need not perform voice synthesis.

The "content data of the system's next utterance" prepared by the "next utterance preparation means" may be text data together with the corresponding voice data, or text data alone. From the viewpoint of reducing the processing load on the "utterance generation means" and preventing delay, however, it is preferable that the "next utterance preparation means" also prepares the voice data. When video (moving images) or still images are reproduced as supplementary information during the dialogue, the content data may be accompanied by video data or image data; when music is played during the dialogue, the content data may include music data.

Regarding the phrase "using at least part of the dialogue history information between the user and the system and/or the result of the voice recognition processing of the ongoing user utterance up to its current point" in the "next utterance preparation means": using "at least part of the dialogue history information" includes using the entire dialogue history (system utterances and user utterances), using only the immediately preceding system utterance, using earlier system or user utterances without the immediately preceding system utterance (for example, when responding to a user request such as "I'd like to hear a bit more about the XXX you mentioned earlier..."), and using only the immediately preceding user utterance. Using "the result of the voice recognition processing of the ongoing user utterance up to its current point" means using the partial content of the ongoing user utterance (not the content of the whole utterance section, but the content up to partway through it). When a partial voice recognition result is obtained, it may not be known at that moment whether it is the final part of the utterance section of the user utterance. If it turns out to have been the final part, it constitutes the immediately preceding user utterance and therefore falls under "at least part of the dialogue history information".

The "system utterance generation processing that includes reproducing the voice signal of the system utterance" in the "utterance generation means" includes voice synthesis processing when the text data of the next utterance has not yet been synthesized. As noted above, from the viewpoint of reducing the processing load on the "utterance generation means" and preventing delay, it is preferable that the voice data (for example, a wav file) is prepared in the preparation processing by the next utterance preparation means. Further, when the "content data of the system's next utterance" prepared by the "next utterance preparation means" is accompanied by video data or image data, the system utterance generation processing also includes reproducing the video or images; when the content data includes music data, it also includes playing the music.

<Operation and effects of the basic configuration of the present invention>

In the dialogue system of the present invention, the system utterance timing detecting means sequentially estimates, by pattern recognition processing, whether the user is maintaining their utterance right or has terminated it by transfer or abandonment. Meanwhile, the next utterance preparation means prepares the content data of the system's next utterance in response to the user utterance, asynchronously with that pattern recognition processing and before the system utterance timing detecting means detects the start timing of the system utterance. That is, the content data of the system's next utterance in response to a user utterance is prepared while that user utterance is in progress, or even earlier, before that user utterance starts.

Consequently, when the user, as the dialogue partner, terminates their utterance right by transferring or abandoning it, and the system utterance timing detecting means captures this termination and detects the start timing of the system utterance, the utterance generation means can start the system utterance promptly, immediately after the detection. System responsiveness can thus be improved.

In addition, the system utterance timing detecting means repeatedly executes the pattern recognition processing that identifies whether the user's utterance right is maintained or terminated, asynchronously with the voice recognition processing by the voice recognition processing means. Processing that does not presuppose voice section detection processing (VAD processing) can therefore be realized, so the start timing of the system utterance can be determined early, without the delay of VAD processing, and collisions between user utterances and system utterances can also be avoided or suppressed.

As described above, in the present invention the next utterance preparation means fixes the content the system should utter at an early stage, and the utterance generation means then produces the system response once the system utterance timing detecting means has estimated the end of the user's utterance right and detected the start timing of the system utterance. A long gap between the end of the user utterance and the start of the system utterance can thus be avoided, and collisions between the two utterances can also be avoided or suppressed, whereby the above object is achieved.

Also, in the present invention, the next utterance preparation means may execute a preparation process that acquires or generates content data for a plurality of next utterance candidates, and a process may then be executed that selects, from the content data of the plurality of next utterance candidates obtained by this preparation process, the content data of the next utterance to be used by the utterance generation means. This makes it possible to handle various types of dialogue. Specifically, the following configurations can be adopted.

<Configuration in which the content data of the next utterance is selected from the content data of a plurality of next utterance candidates using linguistic information obtained as a result of speech recognition processing>

That is, the dialogue system described above may adopt a configuration in which
the next utterance preparation means is configured to execute a preparation process that acquires or generates content data for a plurality of next utterance candidates,
and the system includes next utterance selection means that, after the system utterance timing detection means detects the start timing of the system utterance, uses the linguistic information obtained as a result of the speech recognition processing by the speech recognition processing means to select, from the content data of the plurality of next utterance candidates obtained by the preparation process of the next utterance preparation means, the content data of the next utterance to be used by the utterance generation means.

When the content data of the next utterance is selected from the content data of a plurality of next utterance candidates using the linguistic information obtained as a result of the speech recognition processing in this way, the next utterance selection means can select the content data of the system's next utterance according to the content of the user utterance.

Thus, for example, the next utterance preparation means can prepare, based on the content of the immediately preceding system utterance or on the content of the dialogue history so far (system utterances and user utterances), the content data of a plurality of next utterance candidates expected to be used in the system's next utterance, and the content data of the next utterance can then be selected from the prepared candidates according to the content of the user utterance.

Alternatively, for example, the next utterance preparation means can prepare the content data of a plurality of next utterance candidates according to the partial content of the user utterance in progress (the content from the start of the user utterance to an intermediate point, or from one intermediate point to another), and the content data of the next utterance can then be selected from the prepared candidates according to the content of the entire utterance section of the user utterance, including the subsequent content (the content after that intermediate point, or after the other intermediate point).
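Selection from prepared candidates using the recognized text could look like the following minimal sketch. The source does not specify how candidates are matched against the linguistic information, so the keyword-overlap scoring, the `Candidate` type, and the function name are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    keywords: frozenset  # hypothetical trigger words for this candidate
    text: str            # content data of the candidate utterance


def select_next_utterance(candidates, recognized_text, default):
    """Pick the candidate whose keywords overlap most with the user's words.

    Falls back to `default` when no candidate shares a word with the
    recognized text.
    """
    words = set(recognized_text.lower().split())
    best, best_score = default, 0
    for candidate in candidates:
        score = len(candidate.keywords & words)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

The candidates are built before the user finishes speaking, so only this lightweight comparison remains to be done at the moment the system utterance starts.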

<Configuration in which the content data of the next utterance is selected from the content data of a plurality of next utterance candidates using the user utterance intention estimated by prosodic analysis>

Also, the dialogue system described above may adopt a configuration in which
the next utterance preparation means is configured to execute a preparation process that acquires or generates content data for a plurality of next utterance candidates,
and the system includes:
next-utterance-selection information generation means that repeatedly executes a pattern recognition process identifying the user utterance intention (question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or another user utterance intention), using the prosodic information obtained from the speech signal of the user utterance acquired by the speech signal acquisition means; or using, in addition to this prosodic information, the linguistic information of the user utterance obtained as a result of the speech recognition processing by the speech recognition processing means; or using, in addition to the prosodic information and the linguistic information of the user utterance, the linguistic information of the immediately preceding system utterance in the dialogue history information between the user and the system; and
next utterance selection means that, after the system utterance timing detection means detects the start timing of the system utterance, uses the identification result of the user utterance intention obtained by the next-utterance-selection information generation means to select, from the content data of the plurality of next utterance candidates obtained by the preparation process of the next utterance preparation means, the content data of the next utterance to be used by the utterance generation means.

Here, the question, response, backchannel, and so on in the phrase "question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or another user utterance intention" are illustrative examples of user utterance intentions, and "other" user utterance intentions not listed here may be provided. Because they are only examples, none of them is essential, and user utterance intentions with different definitions may be provided. The same applies to the other inventions.

When the content data of the next utterance is selected from the content data of a plurality of next utterance candidates using the user utterance intention estimated by prosodic analysis in this way, the same operations and effects are obtained as in the configuration described above that makes the selection using the linguistic information obtained as a result of the speech recognition processing. In addition, because the user utterance intention is used, the content data of the next utterance can be selected without obtaining the result of the speech recognition processing, which improves the responsiveness of the system.

Note that the classifier for the user utterance intention in the next-utterance-selection information generation means and the classifier for maintenance or termination of the user utterance right in the system utterance timing detection means may be integrated by implementing them as a single multitask classifier.

Also, the analysis process for obtaining the prosodic information used by the next-utterance-selection information generation means may be shared with the analysis process for extracting the acoustic features used by the system utterance timing detection means.
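The intent-based selection in this configuration can be sketched as a lookup keyed by the identified intention. The intent labels, the likelihood dictionary, and the rule of falling through to the next most likely intent when no candidate was prepared for the top one are assumptions for illustration only.

```python
def select_by_intent(candidates, intent_likelihoods):
    """Select prepared content using only the estimated user intention.

    candidates: dict mapping an intent label to prepared content data.
    intent_likelihoods: dict mapping an intent label to its likelihood,
    as produced by a (hypothetical) prosody-based classifier.
    Returns (intent, content), or (None, None) if nothing matches.
    """
    ranked = sorted(intent_likelihoods, key=intent_likelihoods.get, reverse=True)
    for intent in ranked:
        # Assumption: skip intents for which no candidate was prepared.
        if intent in candidates:
            return intent, candidates[intent]
    return None, None
```

No recognized text is consulted here, which reflects why this configuration can respond without waiting for the speech recognition result.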

<Configuration in which the content data of the next utterance is selected from the content data of a plurality of next utterance candidates using the user utterance intention estimated by prosodic analysis in combination with the linguistic information obtained as a result of speech recognition processing>

Furthermore, the dialogue system described above may adopt a configuration in which
the next utterance preparation means is configured to execute a preparation process that acquires or generates content data for a plurality of next utterance candidates,
and the system includes:
next-utterance-selection information generation means that repeatedly executes a pattern recognition process identifying the user utterance intention (question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or another user utterance intention), using the prosodic information obtained from the speech signal of the user utterance acquired by the speech signal acquisition means; or using, in addition to this prosodic information, the linguistic information of the user utterance obtained as a result of the speech recognition processing by the speech recognition processing means; or using, in addition to the prosodic information and the linguistic information of the user utterance, the linguistic information of the immediately preceding system utterance in the dialogue history information between the user and the system; and
next utterance selection means that, after the system utterance timing detection means detects the start timing of the system utterance, uses the identification result of the user utterance intention obtained by the next-utterance-selection information generation means in combination with the linguistic information obtained as a result of the speech recognition processing by the speech recognition processing means to select, from the content data of the plurality of next utterance candidates obtained by the preparation process of the next utterance preparation means, the content data of the next utterance to be used by the utterance generation means.

When the user utterance intention estimated by prosodic analysis and the linguistic information obtained as a result of the speech recognition processing are used in combination in this way to select the content data of the next utterance from the content data of a plurality of next utterance candidates, the selection can be made even in cases that cannot be handled with the user utterance intention alone, or with the speech recognition result alone, so all types of voice dialogue can be supported.
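One way to combine the two information sources is to narrow the candidates by intention first and then rank the remainder by the recognized words. This ordering, the dictionary layout of a candidate, and the keyword-overlap score are assumptions; the source only states that the two are used in combination.

```python
def select_combined(candidates, intent, recognized_text):
    """Select a candidate using both the intention and the recognized text.

    candidates: non-empty list of dicts with keys "intent", "keywords"
    (a set of words) and "text".
    """
    words = set(recognized_text.lower().split())
    # Narrow by intention; if no candidate matches it, fall back to all
    # candidates so that the linguistic information alone decides.
    matching = [c for c in candidates if c["intent"] == intent] or candidates
    # Rank the remaining candidates by keyword overlap with the user's words.
    return max(matching, key=lambda c: len(c["keywords"] & words))
```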

<Configuration in which the system utterance timing detection means also identifies the user utterance intention>

Also, the dialogue system described above may adopt a configuration in which
the next utterance preparation means is configured to execute a preparation process that acquires or generates content data for a plurality of next utterance candidates,
the system utterance timing detection means is configured so that, when executing the pattern recognition process that identifies maintenance or termination of the user utterance right, it identifies, for termination, with which user utterance intention the user utterance right ends (question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or another user utterance intention),
and the system includes next utterance selection means that, after the system utterance timing detection means detects the start timing of the system utterance, uses the identification result of the user utterance intention obtained by the system utterance timing detection means to select, from the content data of the plurality of next utterance candidates obtained by the preparation process of the next utterance preparation means, the content data of the next utterance to be used by the utterance generation means.

When the system utterance timing detection means also identifies the user utterance intention in this way, the identification result of the user utterance intention is already available at the moment the system utterance timing detection means detects the start timing of the system utterance, which improves the responsiveness of the system.
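A joint decision of this kind can be sketched as a single classifier output with one "hold" class (utterance right maintained) and one class per terminating intention. Summing the terminating classes into an overall end-of-turn likelihood, the label names, and the threshold of 0.5 are assumptions made for this sketch.

```python
def detect_turn_end(probs, threshold=0.5):
    """Return (ended, intent) from a joint hold/end-with-intent output.

    probs: dict mapping "hold" and each terminating intent label to a
    likelihood. The end-of-turn likelihood is taken as the sum over
    all terminating classes.
    """
    end_probs = {label: p for label, p in probs.items() if label != "hold"}
    if sum(end_probs.values()) < threshold:
        return False, None
    # The timing decision and the intention come from the same output.
    return True, max(end_probs, key=end_probs.get)
```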

<Configuration in which the system utterance timing detection means refers to whether preparation is complete or in progress, which is information indicating the system state>

Also, the dialogue system described above may include system state storage means that stores information indicating the system state, including the state of the preparation process by the next utterance preparation means,
and the system utterance timing detection means may be configured so that, when executing the process of detecting the start timing of the system utterance using the result of the pattern recognition process that identifies maintenance or termination of the user utterance right and the information indicating the system state stored in the system state storage means, it:
determines that it is not the start timing of the system utterance if the result of the pattern recognition process indicates maintenance of the user utterance right;
determines that it is the start timing of the system utterance if the result of the pattern recognition process indicates termination of the user utterance right and the information indicating the system state indicates that preparation is complete; and,
if the result of the pattern recognition process indicates termination of the user utterance right and the information indicating the system state indicates that preparation is in progress, acts according to the processing being prepared by the next utterance preparation means: when the processing in preparation is classified in advance as processing that completes immediately, it waits until preparation is complete and then determines that it is the start timing of the system utterance; when the processing in preparation is classified in advance as processing that does not complete immediately, it determines that it is the start timing of the system utterance and also outputs information indicating that it is the timing for inserting a filler.

When the system utterance timing detection means refers to whether preparation is complete or in progress, which is information indicating the system state, in this way, a more appropriate start timing of the system utterance can be detected with the system state taken into account.
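The branching described in this configuration can be written as a small decision function. The enum members, the function name, and the representation of the pre-classified "completes immediately" tasks as a set of task names are hypothetical; the branching itself follows the description.

```python
from enum import Enum


class Decision(Enum):
    NOT_YET = "not the start timing"
    START = "start timing"
    WAIT_THEN_START = "wait until ready, then start"
    START_WITH_FILLER = "start timing, insert filler"


def decide(turn_ended, ready, task_in_preparation=None, quick_tasks=frozenset()):
    """Map the pattern recognition result and system state to a decision.

    turn_ended: True if the user utterance right was judged terminated.
    ready: True if next utterance preparation is complete.
    task_in_preparation: name of the processing still being prepared.
    quick_tasks: names classified in advance as completing immediately.
    """
    if not turn_ended:
        return Decision.NOT_YET
    if ready:
        return Decision.START
    if task_in_preparation in quick_tasks:
        return Decision.WAIT_THEN_START
    # Slow preparation: start speaking now and signal that a filler
    # should be inserted while preparation finishes.
    return Decision.START_WITH_FILLER
```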

<Configuration in which the system utterance timing detection means uses the user utterance duration, which is information indicating the user state, to adjust the threshold for determining the end of the user utterance right or to determine whether it is the start timing of the system utterance>

Furthermore, the dialogue system described above may include user state storage means that stores information indicating the user state, including the user utterance duration,
and the system utterance timing detection means may be configured to execute a process of detecting the start timing of the system utterance using the result of the pattern recognition process that identifies maintenance or termination of the user utterance right and the information indicating the user state stored in the user state storage means, executing any one of the following processes in doing so:
(1) a process that sets the threshold for determining the end of the user utterance right, which is applied to the likelihood obtained as a result of the pattern recognition process, higher than the standard value when the user utterance duration stored in the user state storage means is at or below (or below) a predetermined short-duration determination threshold, and lower than the standard value when the duration is at or above (or above) a predetermined long-duration determination threshold;
(2) a process that sets the threshold for determining the end of the user utterance right, which is applied to the likelihood obtained as a result of the pattern recognition process, by a predetermined function of the user utterance duration stored in the user state storage means, such that the threshold becomes higher when the user utterance duration is short and lower when it is long; and
(3) a process that determines that it is not the start timing of the system utterance, even if the result of the pattern recognition process indicates termination of the user utterance right, when the user utterance duration stored in the user state storage means is at or below (or below) a predetermined short-duration determination threshold, and determines that it is the start timing of the system utterance, even if the result of the pattern recognition process indicates maintenance of the user utterance right, when the duration is at or above (or above) a predetermined long-duration determination threshold.

Here, when a separate threshold adjustment for a different purpose, based on different information, is also performed, the "standard value" of the "threshold for determining the end of the user utterance right" refers to the value after that separate adjustment.

When the system utterance timing detection means uses the user utterance duration, which is information indicating the user state, to adjust the threshold for determining the end of the user utterance right ((1) and (2) above) or to determine whether it is the start timing of the system utterance ((3) above) in this way, the start timing of the system utterance can be adjusted according to the length of the user utterance duration.

In (1) and (2) above, the threshold for determining the end of the user utterance right can be set so that it becomes higher when the user utterance duration is short and lower when it is long. Consequently, immediately after the start of a user utterance, the setting makes an identification result that the user utterance right has ended less likely, and once a relatively long time has passed since the start of the user utterance, the setting makes such a result more likely.

In (3) above, the user utterance duration is not reflected in the threshold for determining the end of the user utterance right; instead, it is reflected in the determination of the start timing of the system utterance that is made after the maintenance/termination identification result has been produced using that threshold. This yields the same operations and effects as (1) and (2) above.
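Processes (1) to (3) can be sketched as three small functions. All numeric values (a standard threshold of 0.5, an offset of 0.2, and duration limits of 0.5 and 5.0 seconds) and the linear shape used for process (2) are assumptions; the source only requires a predetermined function that decreases as the user utterance duration grows.

```python
def step_threshold(duration_s, base=0.5, delta=0.2, short_s=0.5, long_s=5.0):
    """Process (1): raise or lower the threshold in steps."""
    if duration_s <= short_s:
        return base + delta   # early in the utterance: harder to judge "ended"
    if duration_s >= long_s:
        return base - delta   # long utterance: easier to judge "ended"
    return base


def smooth_threshold(duration_s, high=0.7, low=0.3, short_s=0.5, long_s=5.0):
    """Process (2): threshold as a clamped linear function of duration."""
    position = (duration_s - short_s) / (long_s - short_s)
    position = min(1.0, max(0.0, position))
    return high - (high - low) * position


def override_decision(duration_s, classifier_says_end, short_s=0.5, long_s=5.0):
    """Process (3): keep the threshold, override the final decision."""
    if duration_s <= short_s:
        return False          # never start, even if "ended" was reported
    if duration_s >= long_s:
        return True           # always start, even if "maintained" was reported
    return classifier_says_end
```

With processes (1) and (2), the classifier's likelihood is compared against the returned threshold; with process (3), the comparison uses the unchanged threshold and only the resulting decision is adjusted.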

<Configuration in which the system utterance timing detection means dynamically adjusts the threshold for determining the end of the user utterance right using the system utterance motivation level, which is information indicating the system state>

Also, in the configuration described above in which the next utterance selection means selects the content data of the next utterance from the content data of a plurality of next utterance candidates prepared by the next utterance preparation means,
the system may include system state storage means that stores information indicating the system state, including, as index values of a system utterance motivation level that indicates how strongly the system is required to start speaking, the remaining number of items of target data that can become the content data of the system's final next utterance candidates for achieving the purpose of the dialogue and/or the importance of the content data of the next utterance candidates obtained by the preparation process of the next utterance preparation means,
and the system utterance timing detection means may be configured to execute a process that sets the threshold for determining the end of the user utterance right, which is applied to the likelihood obtained as a result of the pattern recognition process, by a predetermined function of the system utterance motivation level determined by the remaining number of items of target data and/or the importance stored in the system state storage means, such that the threshold becomes lower when the system utterance motivation level is high and higher when it is low.

When the system utterance timing detection means dynamically adjusts the threshold for determining the end of the user utterance right using the system utterance motivation level, which is information indicating the system state, in this way, the setting can make an identification result that the user utterance right has ended more likely when the system utterance motivation level is high, and less likely when it is low.
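The threshold adjustment by motivation level can be sketched as follows. How the remaining number of target data items and the importance are combined into a single level is not specified in the source, so the weighted average, the 0-to-1 scaling, and the threshold range of 0.3 to 0.7 are assumptions.

```python
def motivation_level(remaining, total, importance, weight=0.5):
    """Combine the two index values into a level between 0 and 1.

    remaining / total: share of target data not yet conveyed.
    importance: importance of the prepared candidates, scaled to 0..1.
    """
    ratio = remaining / total if total > 0 else 0.0
    value = weight * ratio + (1.0 - weight) * importance
    return min(1.0, max(0.0, value))


def end_threshold(level, low=0.3, high=0.7):
    """Higher motivation lowers the threshold, so "ended" is judged sooner."""
    return high - (high - low) * level
```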

<Configuration in which, when a new speech recognition result is output, the next utterance candidates can be replaced using that result>

Furthermore, in the configuration described above in which the next utterance selection means selects the content data of the next utterance from the content data of a plurality of next utterance candidates prepared by the next utterance preparation means,
the next utterance preparation means may be configured so that, when a new result of the speech recognition processing of the user utterance is output by the speech recognition processing means, it uses the newly output result to determine whether to replace at least part of the content data of the plurality of next utterance candidates, and, if it determines that replacement is needed, executes a preparation process that acquires or generates the content data of a different plurality of next utterance candidates.

When the next utterance candidates can be replaced using a newly output speech recognition result in this way, the content data of the plurality of next utterance candidates that have already been prepared can be replaced according to the content of the user utterance in progress, so appropriate content data for the next utterance candidates can be prepared according to the content of the user utterance.

<Configuration in which, when a new speech recognition result is output, a topic of interest to the user is determined using the high-importance words contained in that result, and the next utterance candidates are replaced according to the determined topic>

Also, in the configuration described above in which the next utterance candidates can be replaced using a newly output speech recognition result,
the next utterance preparation means may be configured to determine a topic of interest to the user using, among the words contained in the newly output speech recognition result, words predetermined as having high importance; to select, from the subject data stored in the subject data storage means or the subject data stored in an external system, the subject data stored in association with the determined topic; and to execute a preparation process that acquires or generates the content data of a different plurality of next utterance candidates.

When a topic of interest to the user is determined using the high-importance words contained in a newly output speech recognition result and the next utterance candidates are replaced according to the determined topic in this way, the content data of the plurality of next utterance candidates that have already been prepared can be replaced according to the content of the user utterance in progress, and the topic presented by the next utterance can be changed.
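Topic-driven replacement of candidates can be sketched as follows. The dictionary layouts, the rule of taking the single most important matching word, and both function names are assumptions for illustration; the source only states that predetermined high-importance words determine the topic and that subject data associated with that topic is then selected.

```python
def determine_topic(recognized_text, important_words):
    """Return the topic of the most important known word, or None.

    important_words: dict mapping a word to a (topic, importance) pair.
    """
    best_topic, best_importance = None, 0.0
    for word in recognized_text.lower().split():
        if word in important_words:
            topic, importance = important_words[word]
            if importance > best_importance:
                best_topic, best_importance = topic, importance
    return best_topic


def refresh_candidates(current_topic, current_candidates, recognized_text,
                       important_words, subject_data):
    """Replace the prepared candidates when the user's words point elsewhere.

    subject_data: dict mapping a topic to a list of candidate content data.
    Returns the (possibly unchanged) topic and candidate list.
    """
    topic = determine_topic(recognized_text, important_words)
    if topic is None or topic == current_topic or topic not in subject_data:
        return current_topic, current_candidates
    return topic, list(subject_data[topic])
```

Calling `refresh_candidates` each time a partial recognition result arrives keeps the prepared candidates aligned with what the user is currently talking about.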

＜システム発話タイミング検出手段により、衝突の発生情報やシステムの交替潜時を用いて、ユーザ発話権終了判定用閾値を調整する構成＞ <A configuration in which the threshold value for determining the end of the user's utterance right is adjusted by using the collision occurrence information and the system's alternate latency by the system utterance timing detecting means>

Further, in the dialogue system described above, the following configuration may be adopted.
The utterance generation means
also executes processing that detects a collision between the voice signal of the user utterance acquired by the voice signal acquisition means and the voice signal of the system utterance being played back, stores the detected collision occurrence information in the user information storage means in association with user identification information, measures the shift latency from the end of the user utterance to the start of the system utterance, and stores the measured shift latency in the user information storage means in association with the user identification information.
The system utterance timing detecting means
acquires the collision occurrence information for the user who is the voice dialogue partner from the user information storage means and calculates the occurrence frequency or cumulative number of collisions with that user. When the calculated frequency or cumulative number is equal to or greater than (or exceeds) an upward adjustment threshold, it sets the threshold for determining the end of the user's utterance right, which is applied to the likelihood obtained from the pattern recognition process that identifies maintenance or termination of the user's utterance right, higher than the standard value or the previously adjusted value.
It also acquires, from the user information storage means, a plurality of shift latencies from the end of a user utterance to the start of a system utterance for that user, and calculates an average or other index value indicating whether those shift latencies tend to be long or short. When the calculated index value is equal to or greater than (or exceeds) a downward adjustment threshold, it sets the threshold for determining the end of the user's utterance right lower than the standard value or the previously adjusted value.

Here, when a separate threshold adjustment for a different purpose, based on different information, is also being performed, the "standard value or previously adjusted value" in the "system utterance timing detecting means" refers to the value after that separate adjustment.

The "collision" here is the collision that occurs when the system starts its utterance on the identification result that the user's utterance right has ended, although the user in fact still holds the utterance right, so that the two utterances overlap. It therefore does not cover the case in which the user's utterance right did end but the system was late to start speaking, the user began speaking again, and the two utterances started almost simultaneously and overlapped.

Further, the "shift latency" here is the pause between the end of the user utterance and the start of the system utterance, that is, the shift latency of the system. Although the text says "shift latency for the user," this means the system's shift latency when conducting a voice dialogue with that user, not the pause between the end of the system utterance and the start of the user utterance.

When the system utterance timing detecting means is configured in this way to adjust the threshold for determining the end of the user's utterance right using the collision occurrence information and the system's shift latency, the system can be put, for each user, in a setting state in which the identification result that the user's utterance right has ended is harder to obtain when collisions tend to occur, and in a setting state in which that result is easier to obtain when the system's shift latency tends to be long. The threshold for determining the end of the user's utterance right can thus be adjusted according to user attributes.
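The two-sided adjustment described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the fixed step size, the numeric limits, the use of collisions per turn as the "occurrence frequency," and the use of the arithmetic mean as the latency index are all hypothetical choices, not values given in the description.

```python
from statistics import mean
from typing import Sequence


def adjust_end_threshold(
    threshold: float,
    collisions: int,
    turns: int,
    latencies: Sequence[float],
    upward_limit: float = 0.2,    # hypothetical upward adjustment threshold
    downward_limit: float = 1.0,  # hypothetical downward adjustment threshold (s)
    step: float = 0.05,           # hypothetical adjustment step
) -> float:
    """Return the adjusted end-of-utterance-right threshold for one user."""
    # Frequent collisions: raise the threshold so that "utterance right
    # ended" becomes harder to conclude.
    collision_rate = collisions / turns if turns else 0.0
    if collision_rate >= upward_limit:
        threshold = min(1.0, threshold + step)

    # Long system shift latency: lower the threshold so that "utterance
    # right ended" becomes easier to conclude.
    if latencies and mean(latencies) >= downward_limit:
        threshold = max(0.0, threshold - step)

    return threshold
```

Note that in this sketch the two adjustments are independent, so a user with both frequent collisions and long latencies ends up with a roughly unchanged threshold; a real system might prioritize one signal over the other.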

＜ユーザ発話権終了判定用閾値を下方調整することを決めるための下方調整用閾値を、ユーザの発話速度の関数とする構成＞ <A configuration in which the downward adjustment threshold for determining the downward adjustment of the user's utterance right end determination threshold is a function of the user's utterance speed>

Further, where the system utterance timing detecting means described above is configured to adjust the threshold for determining the end of the user's utterance right using the collision occurrence information and the system's shift latency, the following configuration may be adopted.
The utterance generation means
also executes processing that calculates the utterance speed using the linguistic information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, and stores the calculated utterance speed in the user information storage means in association with user identification information.
The system utterance timing detecting means,
when executing the processing that acquires a plurality of shift latencies from the end of a user utterance to the start of a system utterance for the user who is the voice dialogue partner, calculates an average or other index value indicating whether those shift latencies tend to be long or short, and sets the threshold for determining the end of the user's utterance right lower than the standard value or the previously adjusted value when that index value is equal to or greater than (or exceeds) the downward adjustment threshold,
acquires a plurality of utterance speeds of the voice dialogue partner stored in the user information storage means, calculates an average or other index value indicating the tendency of that user's utterance speed, and sets the downward adjustment threshold by a predetermined function of the calculated index value such that the downward adjustment threshold becomes smaller when the utterance speed index value is large and larger when the utterance speed index value is small.

When the downward adjustment threshold, which decides whether the threshold for determining the end of the user's utterance right is adjusted downward, is made a function of the user's utterance speed in this way, the setting of the downward adjustment threshold can be changed according to each user's utterance speed tendency, so the downward adjustment of the threshold for determining the end of the user's utterance right can follow user attributes. That is, when the system's shift latency tends to be long, lowering the threshold for determining the end of the user's utterance right makes the identification result that the utterance right has ended easier to obtain and shortens the system's shift latency. Whether the system's shift latency counts as long, however, differs from user to user and is related to each user's utterance speed tendency. Making the downward adjustment threshold a function of the user's utterance speed therefore allows the decision on whether to lower the threshold for determining the end of the user's utterance right to be made according to user attributes.
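One possible form of such a function is a clamped linear decrease, sketched below. The description only requires that the downward adjustment threshold shrink as the utterance speed index grows; the linear shape, the reference rate, the slope, the floor, and the unit (for example morae per second) are all assumptions introduced here for illustration.

```python
from statistics import mean
from typing import Sequence


def downward_adjustment_limit(
    rate_index: float,
    base_limit: float = 1.0,  # hypothetical limit (s) at the reference rate
    ref_rate: float = 7.0,    # hypothetical reference utterance speed
    slope: float = 0.1,       # hypothetical seconds per unit of speed
    floor: float = 0.3,       # hypothetical lower bound on the limit
) -> float:
    """Monotonically non-increasing in rate_index: faster speakers get a
    smaller limit, so a shorter latency already counts as 'long' for them."""
    return max(floor, base_limit - slope * (rate_index - ref_rate))


def should_lower_threshold(
    latencies: Sequence[float], utterance_speeds: Sequence[float]
) -> bool:
    """Decide whether to lower the end-of-utterance-right threshold."""
    limit = downward_adjustment_limit(mean(utterance_speeds))
    return mean(latencies) >= limit
```

With these illustrative numbers, a mean latency of 0.9 s triggers downward adjustment for a fast speaker (limit 0.8 s) but not for a slow speaker (limit 1.2 s).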

＜ユーザ発話の音声信号から抽出した音響特徴量、およびリアルタイムのユーザの発話速度を用いて、ユーザ発話権の維持または終了を識別するパターン認識処理を行う構成＞ <A configuration that performs pattern recognition processing that identifies the maintenance or termination of the user's utterance right using the acoustic features extracted from the utterance signal of the user and the real-time user's utterance speed>

Further, the dialogue system described above
may include a user state storage means that stores information indicating the user state, including the user's real-time utterance speed, and may be configured as follows.
The utterance generation means
also executes processing that calculates the real-time utterance speed using the linguistic information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, and stores the calculated real-time utterance speed in the user state storage means.
The system utterance timing detecting means
extracts acoustic features from the voice signal of the user utterance acquired by the voice signal acquisition means. Using either the extracted acoustic features and the real-time utterance speed stored in the user state storage means, or those two together with linguistic features extracted from the linguistic information of the user utterance obtained as a result of the voice recognition processing, it repeatedly executes the pattern recognition process that identifies maintenance or termination of the user's utterance right, at a cycle that does not depend on the execution timing of the voice recognition processing, and detects the start timing of the system utterance using the result of this pattern recognition process.

ここで、「リアルタイムの発話速度」における「リアルタイムの」という意味は、事後的に計算するのではなく、その場で計算されるという意味であり、「逐次得られる最新の」という意味である。発話速度の計算には、音声認識処理の結果が用いられるが、音声認識処理の結果は、若干の時間遅れで得られるため、厳密に言えば、ここでいう「リアルタイム」には、「略リアルタイム」が含まれる。 Here, the meaning of "real-time" in "real-time utterance speed" means that it is calculated on the spot rather than after the fact, and it means "the latest that can be obtained sequentially". The result of the voice recognition process is used to calculate the utterance speed, but since the result of the voice recognition process is obtained with a slight time delay, strictly speaking, "real time" here means "nearly real time". Is included.

When the pattern recognition process that identifies maintenance or termination of the user's utterance right is configured in this way to use the acoustic features extracted from the voice signal of the user utterance together with the user's real-time utterance speed, the classifier of the system utterance timing detecting means is built by training on the utterance speed at each point in the user utterance (meaning the instantaneous utterance speed, not the utterance speed tendency obtained as a user attribute from accumulated utterance speeds). An identification result that takes the user's utterance speed at that moment into account can therefore be obtained. This removes the need to adjust the threshold for determining the end of the user's utterance right according to the utterance speed tendency that differs from user to user (a user attribute obtained from accumulated utterance speeds). The configuration may also be combined with threshold adjustment, in which case the amount of threshold adjustment becomes smaller.
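As a rough illustration of how the real-time utterance speed could enter the classifier input, the sketch below computes a speed from a partial recognition result and appends it to the acoustic feature vector. Measuring speed as characters per second and simply concatenating the features are assumptions made here; the description does not fix the unit of utterance speed or the structure of the classifier input.

```python
from typing import List, Optional, Sequence


def realtime_utterance_speed(partial_text: str, elapsed_sec: float) -> float:
    """Characters per second of the utterance so far (hypothetical unit)."""
    if elapsed_sec <= 0:
        return 0.0
    return len(partial_text.replace(" ", "")) / elapsed_sec


def build_features(
    acoustic: Sequence[float],
    utterance_speed: float,
    linguistic: Optional[Sequence[float]] = None,
) -> List[float]:
    """Concatenate acoustic features, the latest utterance speed, and,
    when available, linguistic features into one classifier input."""
    features = list(acoustic) + [utterance_speed]
    if linguistic is not None:
        features += list(linguistic)
    return features
```

Because the speed is read from storage rather than computed inside the classifier loop, the loop can keep running at its own cycle and simply use the most recent value, even if the recognition result that produced it arrived slightly earlier.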

＜プログラムの発明＞ <Invention of the program>

本発明のプログラムは、以上に述べた対話システムとして、コンピュータを機能させるためのものである。 The program of the present invention is for operating a computer as the above-described dialogue system.

The above program, or a part of it, can be recorded on a recording medium such as a magneto-optical disk (MO), a compact disc (CD), a digital versatile disc (DVD), a flexible disk (FD), magnetic tape, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory, random access memory (RAM), a hard disk drive (HDD), a solid state drive (SSD), or a flash disk, and can be stored, distributed, and so on in that form. It can also be transmitted over a transmission medium such as a wired network, for example a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), the Internet, an intranet, or an extranet, or a wireless communication network, or a combination of these, and it can also be carried on a carrier wave. Furthermore, the above program may be a part of another program, or may be recorded on a recording medium together with a separate program.

As described above, according to the present invention, the system utterance timing detecting means successively estimates, by pattern recognition processing, whether the user is maintaining the utterance right or has ended it by yielding or abandoning it, while the next utterance preparation means prepares the content data of the system's next utterance in response to the user utterance asynchronously with that pattern recognition processing and before the system utterance timing detecting means detects the start timing of the system utterance. This improves the responsiveness of the system and makes it possible to avoid or suppress collisions while also avoiding or suppressing unnecessarily long system shift latency.
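The asynchronous relationship between preparation and timing detection can be illustrated with a small `asyncio` sketch. It is a toy model under stated assumptions: the coroutine names, the shared `store` dictionary, the per-frame likelihood list, and the threshold of 0.8 are hypothetical, and real audio capture, recognition, and classification are replaced by pre-supplied lists.

```python
import asyncio
from typing import List, Optional, Tuple


async def prepare_next_utterance(store: dict, partial_results: List[str]) -> None:
    """Update the prepared next utterance as partial recognition results arrive."""
    for text in partial_results:
        await asyncio.sleep(0)  # yield so detection can run in between
        store["next"] = f"Reply to: {text}"


async def detect_start_timing(
    likelihoods: List[float], threshold: float
) -> Optional[int]:
    """Return the first frame whose end-of-utterance-right likelihood
    reaches the threshold, or None if it never does."""
    for frame, likelihood in enumerate(likelihoods):
        await asyncio.sleep(0)  # runs on its own cycle, independent of ASR
        if likelihood >= threshold:
            return frame
    return None


async def run_turn(
    partial_results: List[str], likelihoods: List[float], threshold: float = 0.8
) -> Tuple[Optional[int], Optional[str]]:
    store = {"next": None}
    prep = asyncio.create_task(prepare_next_utterance(store, partial_results))
    frame = await detect_start_timing(likelihoods, threshold)
    prep.cancel()  # no effect if preparation has already finished
    # Whatever was prepared by the time of detection is played back at once.
    return frame, store["next"]
```

For example, `asyncio.run(run_turn(["a", "b", "c"], [0.1, 0.2, 0.3, 0.9]))` detects the start timing at frame 3 and returns an utterance that was already prepared before detection, so no preparation time is added to the shift latency.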

FIG. 1 is an overall block diagram of the dialogue system of one embodiment of the present invention.
FIG. 2 is a detailed block diagram of the system utterance timing detecting means of the embodiment.
FIG. 3 is a detailed block diagram of the next utterance selection information generating means of the embodiment.
FIG. 4 is a detailed block diagram of the next utterance preparation means of the embodiment.
FIG. 5 is a flowchart showing the flow of processing at a speaker change from the user to the system in the embodiment.
FIG. 6 is an explanatory diagram showing the temporal order of the system utterance, the user utterance, and each process in the embodiment.
FIG. 7 is an explanatory diagram showing the relationship between the timing of each process and its result in the embodiment.
FIG. 8 is a block diagram showing the logic by which the system utterance timing detecting means of the embodiment determines the start timing of the system utterance.
FIG. 9 is an explanatory diagram of the real-time adjustment (part 1) of the threshold for determining the end of the user's utterance right by the system utterance timing detecting means of the embodiment.
FIG. 10 is an explanatory diagram of the real-time adjustment (part 2) of the threshold for determining the end of the user's utterance right by the system utterance timing detecting means of the embodiment.
FIG. 11 is an explanatory diagram of the pre-adjustment (part 1) of the threshold for determining the end of the user's utterance right by the system utterance timing detecting means of the embodiment.
FIG. 12 is an explanatory diagram of the pre-adjustment (part 2) of the threshold for determining the end of the user's utterance right by the system utterance timing detecting means of the embodiment.
FIG. 13 is a diagram showing a specific example of the data structure of a scenario in the embodiment.
FIG. 14 is an explanatory diagram showing the relationship between scenario playback (system utterance) and the user's reaction (user utterance) in the embodiment.
FIG. 15 is a diagram showing specific example (1) of the preparation process for next utterance candidates in the embodiment.
FIG. 16 is a diagram showing specific example (2) of the preparation process for next utterance candidates in the embodiment.
FIG. 17 is a diagram showing specific example (3) of the preparation process for next utterance candidates in the embodiment.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 shows the overall configuration of the dialogue system 10 of this embodiment. FIG. 2 shows the detailed configuration of the system utterance timing detecting means 22, FIG. 3 shows the detailed configuration of the next utterance selection information generating means 23, and FIG. 4 shows the detailed configuration of the next utterance preparation means 43. FIG. 5 is a flowchart of the processing at a speaker change from the user to the system, FIG. 6 shows the temporal order of the system utterance, the user utterance, and each process, FIG. 7 shows the relationship between the timing of each process and its result, and FIG. 8 shows the logic by which the system utterance timing detecting means 22 determines the start timing of the system utterance. FIGS. 9 to 12 are explanatory diagrams of the adjustment of the threshold for determining the end of the user's utterance right by the system utterance timing detecting means 22. FIG. 13 shows a specific example of the data structure of a scenario, FIG. 14 shows a specific example of the relationship between scenario playback (system utterance) and the user's reaction (user utterance), and FIGS. 15 to 17 show specific examples of the preparation process for next utterance candidates.

＜対話システム１０の全体構成＞ <Overall configuration of dialogue system 10>

In FIG. 1, the dialogue system 10 is a system that conducts voice dialogue with a user and is composed of one or more computers. In this embodiment, as an example, a playback device 20 and a dialogue server 40 are connected via a network 1. A subject data providing system 60, which is an external system, is also connected to the network 1.

Here, the network 1 is mainly an external network such as the Internet, but it may also be a combination of such a network with an internal network such as an intranet or a LAN, and it may be wired, wireless, or a mixture of the two. The network 1 may also be an internal network such as an intranet or a LAN confined to, for example, a company, a factory, an office, a corporate group, a campus, a hospital, an apartment building, a building, a facility such as a park, amusement park, zoo, museum, art gallery, or exhibition hall, or a predetermined area.

The playback device 20 is composed of one or more computers and includes a voice signal acquisition means 21, a system utterance timing detecting means 22, a next utterance selection information generating means 23, a next utterance selection means 24, an utterance generation means 25, a next utterance candidate storage means 30, a system state storage means 31, and a user state storage means 32. The playback device 20 may be, for example, a portable device such as a smartphone, a tablet, or a mobile PC (personal computer). It may also be a dedicated device for voice dialogue rather than a general-purpose device.

Of these, the voice signal acquisition means 21 (excluding the microphone), the system utterance timing detecting means 22 (excluding the user utterance right end determination model storage means 22E (see FIG. 2)), the next utterance selection information generating means 23 (excluding the first and second utterance intention identification model storage means 23D and 23G (see FIG. 3)), the next utterance selection means 24, and the utterance generation means 25 (excluding the speaker and display) are realized by a central processing unit (CPU) provided inside the main body of the computer constituting the playback device 20 and by one or more programs that define the operating procedure of the CPU. For the next utterance candidate storage means 30, the system state storage means 31, the user state storage means 32, the user utterance right end determination model storage means 22E (see FIG. 2) that forms part of the system utterance timing detecting means 22, and the first and second utterance intention identification model storage means 23D and 23G (see FIG. 3) that form part of the next utterance selection information generating means 23, a hard disk drive (HDD), a solid state drive (SSD), or the like can be adopted, for example.

対話サーバ４０は、１台または複数台のコンピュータにより構成され、音声認識処理手段４１と、対話状態管理手段４２と、次発話準備手段４３と、対話履歴記憶手段５０と、題材データ記憶手段５１と、ユーザ情報記憶手段５２とを備えている。 The dialogue server 40 is composed of one or a plurality of computers, and includes a voice recognition processing means 41, a dialogue state management means 42, a next utterance preparation means 43, a dialogue history storage means 50, and a subject data storage means 51. The user information storage means 52 is provided.

Of these, the voice recognition processing means 41, the dialogue state management means 42, and the next utterance preparation means 43 (excluding the preceding next utterance candidate information storage means 43D (see FIG. 4)) are realized by a central processing unit (CPU) provided inside the main body of the computer constituting the dialogue server 40 and by one or more programs that define the operating procedure of the CPU. For the dialogue history storage means 50, the subject data storage means 51, the user information storage means 52, and the preceding next utterance candidate information storage means 43D that forms part of the next utterance preparation means 43, a hard disk drive (HDD), a solid state drive (SSD), or the like can be adopted, for example. The preceding next utterance candidate information storage means 43D (see FIG. 4) may instead be volatile memory such as main memory.

題材データ提供システム６０は、外部システムであり、１台または複数台のコンピュータにより構成され、対話サーバ４０を構成する題材データ記憶手段５１に相当する外部題材データ記憶手段（不図示）を備えている。 The subject data providing system 60 is an external system, which is composed of one or a plurality of computers and includes an external subject data storage means (not shown) corresponding to the subject data storage means 51 constituting the dialogue server 40. ..

なお、本実施形態では、図１に示すように、対話システム１０は、再生装置２０と、対話サーバ４０とをネットワーク１で接続した構成とされているが、本発明の対話システムは、スタンドアローンのシステムとしてもよい。また、図１に示したネットワーク構成は、一例に過ぎないので、ネットワーク構成とする場合でも、各機能の分散形態として、図１の状態とは異なる様々な形態を採用することができる。 In the present embodiment, as shown in FIG. 1, the dialogue system 10 has a configuration in which the playback device 20 and the dialogue server 40 are connected by a network 1, but the dialogue system of the present invention is stand-alone. It may be a system of. Further, since the network configuration shown in FIG. 1 is only an example, various forms different from the state of FIG. 1 can be adopted as the distributed form of each function even in the case of the network configuration.

For example, because the playback device 20 exchanges speech with the user who is the voice dialogue partner, it must be placed near the user (within earshot). It can therefore be split into a main body and a terminal that communicate wirelessly or by wire, with the terminal placed near the user and the main body placed relatively far from the user (possibly out of earshot). In this case, the terminal can be composed of, for example, the voice signal acquisition means 21 of the playback device 20, or the microphone that is part of it, and the utterance generation means 25 of the playback device 20, or the speaker that is part of it (including the display when video or still images are also played back). The main body may be, for example, a fixed installation and the terminal a mobile device, but either may be fixed or mobile. The main body and the terminals may be in a one-to-one or a one-to-many relationship. Furthermore, when different main bodies are installed for different dialogue types (news dialogue, guidance dialogue, questionnaire dialogue, information retrieval dialogue, operation dialogue, educational dialogue, and so on), when old and new types of main body are used together, or when main bodies with different functions are used selectively, the relationship may be many-to-one or many-to-many; in that case, any one terminal is connected to one main body selected from the plurality according to the purpose. As for users, one terminal can be used by several users in turn, provided they do not use it at the same time. A main body that can connect to several terminals simultaneously can support simultaneous use by several users, but it may also be configured not to allow simultaneous use.

また、再生装置２０を構成するシステム発話タイミング検出手段２２と、次発話選択用情報生成手段２３と、次発話選択手段２４と、次発話候補記憶手段３０と、システム状態記憶手段３１と、ユーザ状態記憶手段３２とは、それぞれ別々のコンピュータに設けてもよく、適宜組み合わせて同じコンピュータに設けてもよい。 Further, the system utterance timing detecting means 22 constituting the playback device 20, the next utterance selection information generating means 23, the next utterance selection means 24, the next utterance candidate storage means 30, the system state storage means 31, and the user state. The storage means 32 may be provided in separate computers, or may be appropriately combined and provided in the same computer.

さらに、対話サーバ４０も同様であり、対話サーバ４０を構成する各機能の部分は、それぞれ別々のコンピュータに設けてもよく、適宜組み合わせて同じコンピュータに設けてもよい。また、再生装置２０を構成する１つまたは複数の機能の部分と、対話サーバ４０を構成する１つまたは複数の機能の部分とを適宜組み合わせて同じコンピュータに設けてもよい。 Further, the same applies to the dialogue server 40, and the parts of the functions constituting the dialogue server 40 may be provided in separate computers, or may be appropriately combined and provided in the same computer. Further, the part of one or more functions constituting the reproduction device 20 and the part of one or more functions constituting the dialogue server 40 may be appropriately combined and provided in the same computer.

＜再生装置２０／音声信号取得手段２１の構成＞ <Structure of playback device 20 / audio signal acquisition means 21>

音声信号取得手段２１は、ユーザ発話の音声信号を取得するものであり、音（ここでは、音声）をアナログの電気信号に変換する機器であるマイクロフォン、Ａ／Ｄ変換手段、Ａ／Ｄ変換で得られたデジタルの音声信号を保持する音声信号記憶手段、音声信号を各所に送信する送信手段等を含んで構成されている。 The voice signal acquisition means 21 acquires a voice signal spoken by a user, and is a device that converts sound (here, voice) into an analog electric signal by a microphone, an A / D conversion means, and an A / D conversion. It is configured to include a voice signal storage means for holding the obtained digital voice signal, a transmission means for transmitting the voice signal to various places, and the like.

＜再生装置２０／システム発話タイミング検出手段２２の構成＞ <Structure of playback device 20 / system utterance timing detection means 22>

図２において、システム発話タイミング検出手段２２は、音響特徴量抽出手段２２Ａと、言語特徴量抽出手段２２Ｂと、ユーザ発話権終了判定用パターン認識器２２Ｃと、システム発話開始タイミング判断手段２２Ｆと、ユーザ発話権終了判定用閾値調整手段２２Ｇとを含んで構成されている。このうち、音響特徴量抽出手段２２Ａと、言語特徴量抽出手段２２Ｂと、ユーザ発話権終了判定用パターン認識器２２Ｃとについては、例えば、前述した非特許文献１，２に記載された技術を採用することができる。 In FIG. 2, the system utterance timing detection means 22 includes an acoustic feature amount extraction means 22A, a language feature amount extraction means 22B, a user speech right end determination pattern recognizer 22C, a system utterance start timing determination means 22F, and a user. It is configured to include a threshold adjusting means 22G for determining the end of speech right. Of these, the acoustic feature amount extracting means 22A, the language feature amount extracting means 22B, and the user speech right termination determination pattern recognizer 22C employ, for example, the techniques described in Non-Patent Documents 1 and 2 described above. can do.

このシステム発話タイミング検出手段２２による処理は、音声認識処理手段４１による音声認識処理の実行タイミングに依拠しない周期で、すなわち音声認識処理とは非同期で実行される。具体的には、例えば、１０ｍｓ（ミリ秒）〜１００ｍｓという短い周期で実行される。図６の最下部に示した処理の周期Ｑ（時間間隔）である。なお、音声区間（ＩＰＵ）を形成する際のポーズは、通常は１００ｍｓ以上であるから、周期Ｑは、そのＩＰＵ形成用の閾値よりも短いか、同等の周期ということになる。 The processing by the system utterance timing detecting means 22 is executed at a cycle that does not depend on the execution timing of the voice recognition processing by the voice recognition processing means 41, that is, asynchronously with the voice recognition processing. Specifically, for example, it is executed in a short cycle of 10 ms (milliseconds) to 100 ms. It is a process cycle Q (time interval) shown at the bottom of FIG. Since the pause when forming the voice section (IPU) is usually 100 ms or more, the cycle Q is shorter than or equivalent to the threshold value for forming the IPU.

＜再生装置２０／システム発話タイミング検出手段２２／音響特徴量抽出手段２２Ａの構成＞ <Structure of playback device 20 / system utterance timing detecting means 22 / acoustic feature amount extracting means 22A>

音響特徴量抽出手段２２Ａは、音声信号取得手段２１により取得したユーザ発話の音声信号から音響特徴量を抽出する処理を実行するものである。前述した非特許文献１，２に記載された技術を採用する場合には、狭帯域スペクトログラムを符号化、復号化する自己符号化器（オートエンコーダ）をニューラルネットワーク（ＣＮＮ）により構築し、その中間層の出力を音響特徴量（ボトルネック特徴量）とする。具体的には、周波数分析により例えば１０ｍｓ毎に得られる２５６点のパワースペクトルを１０本分並べたものを入力とし、中間層の出力２５６次元を特徴量とする。すなわち、ＣＮＮオートエンコーダの入力は、例えば、フレームサイズ＝８００サンプル（５０ｍｓ）、フレームシフト＝１６０サンプル（１０ｍｓ）で切り出した音声信号から生成したスペクトログラムを時系列に１０本分（図６の下部に示したＲ本分）並べたものとし、そのサイズを１０×２５６次元とする。そして、この入力サイズを２５６次元に圧縮し、音響特徴量とする。 The acoustic feature amount extraction means 22A executes processing to extract an acoustic feature amount from the voice signal of the user utterance acquired by the voice signal acquisition means 21. When the techniques described in Non-Patent Documents 1 and 2 above are adopted, an autoencoder that encodes and decodes a narrowband spectrogram is built as a neural network (CNN), and the output of its intermediate layer is used as the acoustic feature amount (bottleneck feature amount). Specifically, the input is, for example, ten 256-point power spectra obtained every 10 ms by frequency analysis, and the 256-dimensional output of the intermediate layer is the feature amount. That is, the input to the CNN autoencoder is, for example, 10 spectra arranged in time order (the R spectra shown in the lower part of FIG. 6), generated from the voice signal cut out with a frame size of 800 samples (50 ms) and a frame shift of 160 samples (10 ms), giving an input of size 10 × 256. This input is compressed to 256 dimensions to obtain the acoustic feature amount.

このようにして音響特徴量を抽出する場合、２５６点のパワースペクトルを１０本分並べた入力データを作成する際に、２５６点のパワースペクトルを１本ずつずらしていけば、図６の最下部に示した処理の周期Ｑ（時間間隔）は、周波数分析のフレームシフト＝１０ｍｓと同じになり、２本ずつずらしていけば、２倍の２０ｍｓとなり、同様に１０本ずつずらしていけば、１０倍の１００ｍｓとなる。従って、２５６点のパワースペクトルを用いる際に、ずらす本数を変えることにより、処理の周期Ｑを変更することができる。なお、ずらす本数を多くすることにより、使用する音声信号の区間に重なりがないようにしてもよいが、使用されない音声信号の区間が生じることは避ける必要がある。なお、図７では、周波数分析のタイミングは、各フレームの終点を指し、フレーム毎に図示されているので、時間間隔はフレームシフトを示しているのに対し、システム発話タイミング検出手段２２による処理の周期Ｑは、それよりも広い時間間隔で図示されているので、パワースペクトルを数本ずつ（例えば３本ずつ）ずらして用いていることを示している。また、使用するパワースペクトルの本数は、Ｒ本＝１０本である必要はなく、任意である。さらに、フレームサイズ＝５０ｍｓ、フレームシフト＝１０ｍｓという数値も一例に過ぎず、これらの数値に限定されるものではない。 When acoustic feature amounts are extracted in this way, if the input data of ten 256-point power spectra is created by advancing one spectrum at a time, the processing cycle Q (time interval) shown at the bottom of FIG. 6 equals the frame shift of the frequency analysis, 10 ms; advancing two spectra at a time doubles it to 20 ms, and likewise advancing 10 spectra at a time makes it ten times longer, 100 ms. The processing cycle Q can therefore be changed by changing the number of spectra by which the input is advanced. Advancing by a larger number may leave no overlap between the sections of the voice signal that are used, but sections of the voice signal that are never used must be avoided. In FIG. 7, the frequency analysis timings mark the end point of each frame and are drawn for every frame, so their spacing represents the frame shift, whereas the processing cycle Q of the system utterance timing detection means 22 is drawn at a wider spacing, which indicates that the power spectra are advanced several at a time (for example, three at a time). The number of power spectra used need not be R = 10 and is arbitrary. Likewise, the frame size of 50 ms and the frame shift of 10 ms are merely examples, and the invention is not limited to these values.
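The relationship between the number of spectra by which the input window is advanced and the processing cycle Q can be illustrated with a minimal Python sketch. The function names, the `hop` parameter name, and the check that `hop` does not exceed R (so that no spectrum goes unused) are assumptions made for illustration; the values 10 ms and R = 10 are the example values given in the text.

```python
from typing import List, Tuple

FRAME_SHIFT_MS = 10  # frame shift of the frequency analysis (example value from the text)
R = 10               # number of power spectra per input window (example value from the text)


def processing_period_ms(hop: int, frame_shift_ms: int = FRAME_SHIFT_MS) -> int:
    """Processing cycle Q when the window advances by `hop` spectra each time."""
    return hop * frame_shift_ms


def window_ranges(num_spectra: int, hop: int, r: int = R) -> List[Tuple[int, int]]:
    """Half-open index ranges [start, end) of the spectra fed to the feature extractor.

    Assumption: hop must not exceed r, because a larger hop would leave
    spectra (and hence sections of the voice signal) that no window uses.
    """
    if not 1 <= hop <= r:
        raise ValueError("hop must be between 1 and r")
    return [(start, start + r) for start in range(0, num_spectra - r + 1, hop)]
```

With `hop = 1` consecutive windows overlap by nine spectra and Q is 10 ms; with `hop = 10` the windows no longer overlap and Q is 100 ms.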

また、音声信号からの音響特徴量の抽出処理は、上述した非特許文献１，２に記載された技術による抽出処理に限定されるものではなく、ユーザ発話権終了判定用パターン認識器２２Ｃの入力に用いる音響特徴量は、ユーザ発話の音声信号から得られる音響特徴量であれば、いずれの特徴量でもよい。 Further, the extraction process of the acoustic feature amount from the audio signal is not limited to the extraction process by the technique described in Non-Patent Documents 1 and 2 described above, and the input of the user speech right termination determination pattern recognizer 22C. The acoustic feature amount used for is any feature amount as long as it is an acoustic feature amount obtained from the voice signal uttered by the user.

例えば、音響特徴量は、基本周波数（Ｆ０）や、メル周波数ケプストラム係数（ＭＦＣＣ）等でもよい。但し、特徴量を計算すること自体に遅延が生じないことが好ましい。なお、ＭＦＣＣ等の音響特徴量を用いると、処理遅れは無くなるが、韻律情報が失われるという欠点がある。 For example, the acoustic feature amount may be the fundamental frequency (F0), mel-frequency cepstral coefficients (MFCC), or the like. It is preferable, however, that computing the feature amount does not itself introduce a delay. Using an acoustic feature amount such as MFCC eliminates the processing delay, but has the drawback that prosodic information is lost.

＜再生装置２０／システム発話タイミング検出手段２２／言語特徴量抽出手段２２Ｂの構成＞ <Structure of playback device 20 / system utterance timing detecting means 22 / language feature amount extracting means 22B>

言語特徴量抽出手段２２Ｂは、音声認識処理手段４１による音声認識処理の結果として得られたユーザ発話の言語情報から言語特徴量を抽出する処理を実行するものである。この言語特徴量抽出手段２２Ｂの設置は省略してもよい。 The language feature amount extracting means 22B executes a process of extracting a language feature amount from the language information of the user's utterance obtained as a result of the voice recognition processing by the voice recognition processing means 41. The installation of the language feature amount extracting means 22B may be omitted.

具体的には、例えば、ＬＳＴＭ言語モデルの中間出力（５１２次元）を言語特徴量とすることができる（非特許文献２参照）。なお、ＬＳＴＭ（Long short-term memory）は、リカレントニューラルネットワーク（ＲＮＮ）の一種である。 Specifically, for example, the intermediate output (512 dimensions) of an LSTM language model can be used as the language feature amount (see Non-Patent Document 2). LSTM (long short-term memory) is a type of recurrent neural network (RNN).

また、音声認識処理手段４１による音声認識処理は、上述した音響特徴量抽出手段２２Ａの処理と非同期で実行されるため、音響特徴量抽出手段２２Ａにより音響特徴量が抽出されたときに、この言語特徴量抽出手段２２Ｂによる言語特徴量の抽出が行われていない場合があるので、その場合には、言語特徴量はゼロベクトルとする。 Because the voice recognition processing by the voice recognition processing means 41 is executed asynchronously with the processing of the acoustic feature amount extraction means 22A described above, the language feature amount extraction means 22B may not yet have extracted a language feature amount at the moment the acoustic feature amount extraction means 22A extracts an acoustic feature amount. In that case, the language feature amount is set to a zero vector.
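The zero-vector fallback can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and combining the two feature amounts by simple concatenation is an assumption (the text states only that both are input to the recognizer). The dimensions 256 and 512 are the example values given in the text.

```python
from typing import List, Optional

ACOUSTIC_DIM = 256  # bottleneck feature size given in the text
LANGUAGE_DIM = 512  # LSTM language-model intermediate output size given in the text


def build_recognizer_input(acoustic: List[float],
                           language: Optional[List[float]]) -> List[float]:
    """Combine acoustic and language feature amounts into one input vector.

    `language` is None when voice recognition has not yet produced a result
    for the current cycle; a zero vector is substituted in that case.
    Concatenation is an assumption made for this sketch.
    """
    if len(acoustic) != ACOUSTIC_DIM:
        raise ValueError("unexpected acoustic feature size")
    if language is None:
        language = [0.0] * LANGUAGE_DIM
    elif len(language) != LANGUAGE_DIM:
        raise ValueError("unexpected language feature size")
    return list(acoustic) + list(language)
```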

＜再生装置２０／システム発話タイミング検出手段２２／ユーザ発話権終了判定用パターン認識器２２Ｃの構成＞ <Configuration of playback device 20 / system utterance timing detecting means 22 / pattern recognizer 22C for determining user utterance right end>

ユーザ発話権終了判定用パターン認識器２２Ｃは、音響特徴量抽出手段２２Ａにより抽出した音響特徴量を入力とするか（非特許文献１参照）、あるいは、この音響特徴量および言語特徴量抽出手段２２Ｂにより抽出した言語特徴量を入力とし（非特許文献２参照）、ユーザが発話する地位または立場を有していることを示すユーザ発話権の維持または終了（終了には、譲渡、放棄が含まれる。）を識別するパターン認識処理を繰り返し実行するものである。 The user utterance right end determination pattern recognizer 22C takes as input either the acoustic feature amount extracted by the acoustic feature amount extraction means 22A (see Non-Patent Document 1), or this acoustic feature amount together with the language feature amount extracted by the language feature amount extraction means 22B (see Non-Patent Document 2), and repeatedly executes pattern recognition processing that identifies whether the user utterance right, which indicates that the user holds the position or standing to speak, is maintained or has ended (ending includes handing over and abandoning the right).

このユーザ発話権終了判定用パターン認識器２２Ｃは、識別アルゴリズムによるパターン認識処理を実行するユーザ発話権終了判定用パターン認識処理手段２２Ｄと、このパターン認識処理で用いるモデル（パラメータ）を記憶するユーザ発話権終了判定モデル記憶手段２２Ｅとにより構成されている。 The user utterance right end determination pattern recognizer 22C consists of user utterance right end determination pattern recognition processing means 22D, which executes pattern recognition processing by a classification algorithm, and user utterance right end determination model storage means 22E, which stores the model (parameters) used in this pattern recognition processing.

具体的には、例えば、音響特徴量（２５６次元）および言語特徴量（５１２次元）を入力とし、ユーザ発話権の維持または終了を逐次推定するモデルをニューラルネットワークにより構築し、ユーザ発話権終了判定用パターン認識器２２Ｃとすることができる（非特許文献２参照）。この際、ニューラルネットワークには、時系列情報を考慮するため、ＬＳＴＭ（ＲＮＮの一種）を用いることができる（非特許文献１参照）。 Specifically, for example, by inputting an acoustic feature amount (256 dimensions) and a language feature amount (512 dimensions), a model for sequentially estimating the maintenance or termination of the user speaking right is constructed by a neural network, and the user speaking right termination determination is made. The pattern recognizer 22C can be used (see Non-Patent Document 2). At this time, LSTM (a type of RNN) can be used for the neural network in order to consider the time series information (see Non-Patent Document 1).

このユーザ発話権終了判定用パターン認識器２２Ｃは、ユーザ発話権が終了したことの確からしさを示す尤度を出力するので、その尤度が、予め定められたユーザ発話権終了判定用閾値（但し、この閾値は、ユーザ発話権終了判定用閾値調整手段２２Ｇにより、事前に、またはリアルタイムで動的に調整されることがある。）以上であるか、またはこの閾値を超えているかを判定する閾値処理を行い（図７の最下部を参照）、ユーザ発話権終了判定用閾値以上または超過と判定した場合には、ユーザ発話権が終了したという識別結果を出力し、ユーザ発話権終了判定用閾値未満または以下と判定した場合には、ユーザ発話権が維持されているという識別結果を出力する。 The user utterance right end determination pattern recognizer 22C outputs a likelihood indicating how certain it is that the user utterance right has ended. Threshold processing (see the bottom of FIG. 7) determines whether this likelihood is equal to or greater than, or alternatively strictly greater than, a predetermined user utterance right end determination threshold (this threshold may be adjusted in advance, or dynamically in real time, by the user utterance right end determination threshold adjustment means 22G). If the likelihood is at or above the threshold (or, under the strict comparison, above it), the recognizer outputs the identification result that the user utterance right has ended; otherwise, it outputs the identification result that the user utterance right is maintained.

この際、尤度は、ユーザ発話権が終了したことの確からしさを示す尤度としているので、尤度の値が大きい程（１に近い程）、ユーザ発話権の終了の状態に近く、尤度の値が小さい程（０に近い程）、ユーザ発話権の維持の状態に近い（図７の最下部を参照）。従って、尤度がユーザ発話権終了判定用閾値以上になるか、超えれば、ユーザ発話権が終了したという識別結果が出力されることになるので、ユーザ発話権終了判定用閾値の上方調整というのは、ユーザ発話権が終了したという識別結果が出にくくなる方向への調整であり、下方調整というのは、ユーザ発話権が終了したという識別結果が出やすくなる方向への調整である。本願の請求項は、この場合の記載とされている。 Here the likelihood indicates how certain it is that the user utterance right has ended, so a larger likelihood value (closer to 1) is closer to the ended state and a smaller value (closer to 0) is closer to the maintained state (see the bottom of FIG. 7). Since the identification result that the user utterance right has ended is output when the likelihood reaches or exceeds the user utterance right end determination threshold, an upward adjustment of this threshold makes the ended result harder to obtain, and a downward adjustment makes it easier to obtain. The claims of the present application are worded for this case.

一方、尤度は、ユーザ発話権が維持されていることの確からしさを示す尤度としてもよく、この場合には、尤度の値が大きい程（１に近い程）、ユーザ発話権の維持の状態に近く、尤度の値が小さい程（０に近い程）、ユーザ発話権の終了の状態に近い。従って、尤度がユーザ発話権終了判定用閾値以下になるか、未満になれば、ユーザ発話権が終了したという識別結果が出力されることになるので、ユーザ発話権終了判定用閾値の上方調整というのは、ユーザ発話権が終了したという識別結果が出やすくなる方向への調整であり、下方調整というのは、ユーザ発話権が終了したという識別結果が出にくくなる方向への調整である。このため、本願の請求項は、この場合とは逆の記載とされているが（上方、下方が逆の表現となっているが）、両者は等価なことであり、また、１から尤度の値を減じれば、逆の意味の尤度になるので、本願の請求項は、いずれの場合も含むものである。 Alternatively, the likelihood may indicate how certain it is that the user utterance right is maintained. In this case, a larger likelihood value (closer to 1) is closer to the maintained state and a smaller value (closer to 0) is closer to the ended state. The identification result that the user utterance right has ended is then output when the likelihood falls to or below (or strictly below) the user utterance right end determination threshold, so an upward adjustment of this threshold makes the ended result easier to obtain, and a downward adjustment makes it harder to obtain. The claims of the present application are therefore worded in the opposite way to this case ("upward" and "downward" are reversed), but the two formulations are equivalent, and subtracting the likelihood value from 1 yields the likelihood with the opposite meaning. The claims of the present application thus cover both cases.

また、閾値処理を行う際には、フィルタをかけた後の尤度（出力される直近の幾つかの尤度を用いて平準化した後の尤度）を用いてもよい。 Further, when performing the threshold processing, the likelihood after filtering (the likelihood after leveling using some of the most recent likelihoods output) may be used.
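The threshold processing with an optional smoothing filter can be sketched in Python as below. This is a hedged illustration only: the class name, the choice of a simple moving average as the filter, and the `window` and `inclusive` parameters are assumptions, since the text does not specify the filter type or which comparison (at-or-above versus strictly above) is used.

```python
from collections import deque


class EndOfTurnDetector:
    """Thresholds the (optionally smoothed) likelihood that the user utterance right ended."""

    def __init__(self, threshold: float, window: int = 1, inclusive: bool = True):
        self.threshold = threshold          # user utterance right end determination threshold
        self.inclusive = inclusive          # True: ">= threshold", False: "> threshold"
        self._recent = deque(maxlen=window) # most recent likelihoods used for smoothing

    def update(self, end_likelihood: float) -> bool:
        """Feed one likelihood per processing cycle; return True when 'ended' is identified."""
        self._recent.append(end_likelihood)
        smoothed = sum(self._recent) / len(self._recent)  # moving average (assumed filter)
        if self.inclusive:
            return smoothed >= self.threshold
        return smoothed > self.threshold


def end_likelihood_from_keep(keep_likelihood: float) -> float:
    """Convert a 'maintained' likelihood into the equivalent 'ended' likelihood."""
    return 1.0 - keep_likelihood
```

With `window=3` and a threshold of 0.6, the sequence 0.2, 0.9, 0.9, 0.9 yields "ended" only from the third cycle, one cycle later than without smoothing, which shows how filtering trades responsiveness for robustness against a single spurious peak.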

さらに、ユーザ発話権終了判定用パターン認識器２２Ｃは、ユーザ発話権の維持または終了を識別するパターン認識処理を実行する際に、終了については、質問、応答、相槌、補足要求、反復要求、理解、不理解、無関心、若しくはその他のユーザ発話意図のうちのいずれのユーザ発話意図で終了するのかを識別するパターン認識処理を実行してもよい。この場合、識別器の学習段階で、終了については、質問、応答、相槌等のユーザ発話意図を含むラベル（タグ）を付すことになる。すなわち、「維持」、「質問で終了」、「応答で終了」、「相槌で終了」等のタグ付けを行った学習用データをそれぞれ多数用意して学習を行い、３クラス識別以上のユーザ発話権終了判定モデルを構築する。そして、運用段階では、ユーザ発話権が「維持」されていることの確からしさを示す尤度、「質問で終了」したことの確からしさを示す尤度、「応答で終了」したことの確からしさを示す尤度、「相槌で終了」したことの確からしさを示す尤度等のように、いずれのユーザ発話意図で終了したのかを示す情報が出力される。例えば、質問で終了の尤度＝０．９０、応答で終了の尤度＝０．０４、相槌で終了の尤度＝０．０３等のように出力される。従って、ユーザ発話権終了判定用閾値は、質問、応答、相槌等のユーザ発話意図毎に設定し、ユーザ発話意図毎に閾値処理を行う。但し、ユーザ発話意図毎に設定したユーザ発話権終了判定用閾値が、同じ値となってもよい。このようにした場合、次発話選択手段２４は、ユーザ発話権終了判定用パターン認識器２２Ｃによるユーザ発話意図の識別結果（いずれのユーザ発話意図で終了したのかという情報）を用いて、次発話準備手段４３による準備処理で得られた複数の次発話候補の内容データの中から、発話生成手段２５で用いる次発話の内容データを選択することができる。そして、以上のように質問、応答、相槌等のユーザ発話意図を含むタグ付けをした学習で得られたユーザ発話権終了判定用パターン認識器２２Ｃは、次発話選択用情報生成手段２３のユーザ発話意図の識別器（図３に示す第１、第２の発話意図識別器２３Ｂ，２３Ｅ）と、いずれのユーザ発話意図で終了したのかというタグ付けをしないで単純に維持・終了を識別するための学習で得られた識別器とを、マルチタスクでまとめて一体化させたユーザ発話権終了判定用パターン認識器２２Ｃとは、異なるものである。 Furthermore, when executing the pattern recognition processing that identifies whether the user utterance right is maintained or has ended, the user utterance right end determination pattern recognizer 22C may also identify, for the ended case, which user utterance intention the utterance ended with: question, response, backchannel (aizuchi), request for supplementation, request for repetition, understanding, non-understanding, indifference, or some other user utterance intention. In this case, at the training stage of the classifier, the ended cases are given labels (tags) that include the user utterance intention, such as question, response, or backchannel. That is, a large amount of training data is prepared for each of the tags "maintained", "ended with a question", "ended with a response", "ended with a backchannel", and so on, and training on this data builds a user utterance right end determination model that discriminates three or more classes. At the operation stage, information indicating which user utterance intention the utterance ended with is then output, in the form of a likelihood that the user utterance right is "maintained", a likelihood that it "ended with a question", a likelihood that it "ended with a response", a likelihood that it "ended with a backchannel", and so on. For example, the output may be likelihood of ending with a question = 0.90, likelihood of ending with a response = 0.04, likelihood of ending with a backchannel = 0.03, and so on. Accordingly, a user utterance right end determination threshold is set for each user utterance intention, such as question, response, and backchannel, and threshold processing is performed for each user utterance intention. The thresholds set for the individual user utterance intentions may, however, have the same value. In this configuration, the next utterance selection means 24 can use the user utterance intention identified by the user utterance right end determination pattern recognizer 22C (the information on which user utterance intention the utterance ended with) to select the content data of the next utterance used by the utterance generation means 25 from the content data of the plurality of next utterance candidates obtained in the preparation processing by the next utterance preparation means 43. Note that the user utterance right end determination pattern recognizer 22C obtained by training with tags that include the user utterance intention, as described above, differs from a user utterance right end determination pattern recognizer 22C in which the user utterance intention classifiers of the next utterance selection information generation means 23 (the first and second utterance intention classifiers 23B and 23E shown in FIG. 3) and a classifier trained simply to identify maintained or ended, without tags for the user utterance intention, are merged into one by multitask learning.
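The per-intention threshold processing for the multi-class variant can be sketched as follows. The function name, the label strings ("maintain", "question", "response", "backchannel" for aizuchi), and the rule of picking the highest likelihood when several intentions clear their thresholds are assumptions for illustration; the text specifies only that a threshold is set and applied per user utterance intention.

```python
from typing import Dict, Optional


def classify_turn_end(likelihoods: Dict[str, float],
                      thresholds: Dict[str, float]) -> Optional[str]:
    """Return the intention with which the user utterance right ended, or None if maintained.

    `likelihoods` maps each class label to the recognizer output for the
    current cycle; `thresholds` maps each "ended with ..." intention to its
    own user utterance right end determination threshold.
    """
    ended = {
        intent: p
        for intent, p in likelihoods.items()
        if intent != "maintain" and p >= thresholds[intent]
    }
    if not ended:
        return None  # no intention cleared its threshold: the right is maintained
    # Assumed tie-break: take the intention with the highest likelihood.
    return max(ended, key=ended.get)
```

Using the example values from the text (question 0.90, response 0.04, backchannel 0.03) with a common threshold of 0.5, the call returns `"question"`, which the next utterance selection means 24 could then use to pick among the prepared candidates.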

なお、ユーザ発話の音声信号を逐次処理して短い周期で識別を行う技術として、スマートスピーカのウェイクワードのスポッティングが挙げられるが、ユーザ発話権終了判定用パターン認識器２２Ｃは、特定の語のスポッティングではなく、ユーザ発話権の維持または終了を、その発話内容に依らずに検出することを目的とする点で異なる。 As a technique for sequentially processing the voice signal of the user's utterance and identifying it in a short cycle, spotting of the wake word of the smart speaker can be mentioned, and the pattern recognizer 22C for determining the end of the user's utterance right is spotting a specific word. Instead, it differs in that it aims to detect the maintenance or termination of the user's utterance right regardless of the content of the utterance.

＜再生装置２０／システム発話タイミング検出手段２２／システム発話開始タイミング判断手段２２Ｆの構成＞ <Structure of playback device 20 / system utterance timing detecting means 22 / system utterance start timing determining means 22F>

システム発話開始タイミング判断手段２２Ｆは、ユーザ発話権終了判定用パターン認識器２２Ｃによるパターン認識処理の結果（維持または終了の識別結果）を用いるか、またはこのパターン認識処理の結果に加え、システム状態記憶手段３１に記憶されているシステム状態を示す情報（準備完了・準備中の別）や、ユーザ状態記憶手段３２に記憶されているユーザ状態を示す情報（ユーザ発話継続時間）を用いて、システム発話の開始タイミングを検出する処理を実行するものである。 The system utterance start timing determination means 22F executes processing to detect the start timing of the system utterance, using either the result of the pattern recognition processing by the user utterance right end determination pattern recognizer 22C (the identification result of maintained or ended) alone, or this result together with the information indicating the system state stored in the system state storage means 31 (whether preparation is complete or in progress) and the information indicating the user state stored in the user state storage means 32 (the user utterance duration).

具体的には、図８に示すように、システム発話開始タイミング判断手段２２Ｆは、先ず、ユーザ発話権終了判定用パターン認識器２２Ｃによるパターン認識処理の結果が、ユーザ発話権の維持を示している場合（Ｐ１）と、終了を示している場合（Ｐ２）とに判断分岐する。 Specifically, as shown in FIG. 8, in the system utterance start timing determination means 22F, first, the result of the pattern recognition process by the user utterance right end determination pattern recognizer 22C indicates that the user utterance right is maintained. Judgment branches between the case (P1) and the case (P2) indicating the end.

次に、維持を示している場合（Ｐ１）には、システム発話の開始タイミングではないと判断する（Ｐ７）。 Next, when maintenance is indicated (P1), it is determined that it is not the start timing of the system utterance (P7).

但し、図８中および図２中の二点鎖線で示すように、ユーザ状態記憶手段３２に記憶されているユーザ発話継続時間が、予め定められた長時間判定用閾値以上または超過の場合には、パターン認識処理の結果がユーザ発話権の維持を示していても（Ｐ１）、システム発話の開始タイミングであると判断する処理（Ｐ８）を行ってもよい。 However, as indicated by the two-dot chain lines in FIG. 8 and FIG. 2, when the user utterance duration stored in the user state storage means 32 is at or above (or strictly above) a predetermined long-duration determination threshold, the process of determining that it is the start timing of the system utterance (P8) may be performed even if the result of the pattern recognition processing indicates that the user utterance right is maintained (P1).

一方、終了を示している場合（Ｐ２）において、システム状態記憶手段３１に記憶されているシステム状態を示す情報が準備完了（ステータス＝「準備完了」）を示している場合（Ｐ３）には、システム発話の開始タイミングであると判断する（Ｐ８）。 On the other hand, when the result indicates the end (P2) and the information indicating the system state stored in the system state storage means 31 indicates that preparation is complete (status = "ready") (P3), it is determined that it is the start timing of the system utterance (P8).

但し、図８中および図２中の二点鎖線で示すように、ユーザ状態記憶手段３２に記憶されているユーザ発話継続時間が、予め定められた短時間判定用閾値以下または未満の場合には、パターン認識処理の結果がユーザ発話権の終了を示していても（Ｐ２）、システム発話の開始タイミングではないと判断する処理（Ｐ７）を行ってもよい。 However, as indicated by the two-dot chain lines in FIG. 8 and FIG. 2, when the user utterance duration stored in the user state storage means 32 is at or below (or strictly below) a predetermined short-duration determination threshold, the process of determining that it is not the start timing of the system utterance (P7) may be performed even if the result of the pattern recognition processing indicates that the user utterance right has ended (P2).

また、終了を示している場合（Ｐ２）において、システム状態記憶手段３１に記憶されているシステム状態を示す情報が準備中を示している場合（Ｐ４）には、その準備中を示すステータスに応じ（次発話準備手段４３による準備中の処理内容）に応じ、判断を分岐させる。 When the result indicates the end (P2) and the information indicating the system state stored in the system state storage means 31 indicates that preparation is in progress (P4), the decision branches according to the status indicating the preparation in progress (that is, according to what the next utterance preparation means 43 is currently processing).

そして、準備中を示している場合（Ｐ４）において、その準備中の処理内容が、直ぐに完了する処理内容として予め分類されている処理の準備中である場合（Ｐ５）には、準備完了になるまで待ってシステム発話の開始タイミングであると判断するために（但し、結果的に、直ぐに準備完了にならない場合もある。）、その時点では、システム発話の開始タイミングではないと判断する（Ｐ９）。直ぐに完了する処理内容として予め分類されている処理の準備中とは、例えば、ステータス＝「自サーバ検索中」等である。 When preparation is in progress (P4) and the processing being prepared is one classified in advance as completing quickly (P5), it is determined at that point that it is not the start timing of the system utterance (P9), so that the start timing can be determined after waiting until preparation is complete (although, as it turns out, preparation may not complete quickly). An example of preparation classified in advance as completing quickly is status = "searching own server".

一方、準備中を示している場合（Ｐ４）において、その準備中の処理内容が、直ぐに完了しない処理内容として予め分類されている処理の準備中である場合（Ｐ６）には、システム発話の開始タイミングであると判断するとともに、フィラーの挿入タイミングである旨の情報を出力する（Ｐ１０）。直ぐに完了しない処理内容として予め分類されている処理の準備中とは、例えば、ステータス＝「外部システムアクセス中」、「音声合成処理中」等である。フィラーの挿入タイミングである旨の情報には、どのような種別のフィラーを挿入するかの情報を含めてもよく、この場合、準備中のステータスの種別と、フィラーの種別との対応関係を、予め定めておけばよい。例えば、直ぐに完了しない処理内容にも、その程度があるので、かなり長時間の準備を要する場合には、「ちょっと待ってね、今、調べてるから。」、「少々お待ちください、処理中です。」等のフィラーを挿入することができ、そこまで長時間を要しない場合には、「えー。」、「あのね。」等のフィラーを挿入することができる。 On the other hand, when preparation is in progress (P4) and the processing being prepared is one classified in advance as not completing quickly (P6), it is determined that it is the start timing of the system utterance, and information indicating that it is the timing to insert a filler is output (P10). Examples of preparation classified in advance as not completing quickly are status = "accessing external system" and status = "synthesizing speech". The information indicating the filler insertion timing may include information on which type of filler to insert; in this case, the correspondence between the type of in-preparation status and the type of filler is defined in advance. For example, processing that does not complete quickly varies in how long it takes, so when preparation takes quite a long time, a filler such as "Hang on a moment, I'm looking it up now." or "Please wait a moment, processing is in progress." can be inserted, and when it does not take that long, a filler such as "Um." or "Well, you see." can be inserted.

なお、準備中を示すステータスのうち、どのようなステータスが、直ぐに完了する処理内容なのか、直ぐに完了しない処理内容なのかは、システムの構築、運用、管理を行う者が適宜設計すればよく、対話の種別（ニュース対話、アンケート対話、情報検索対話、操作対話、教育対話等の別）に応じて定めてもよい。 It should be noted that, among the statuses indicating preparations, what kind of status is the processing content that is completed immediately or the processing content that is not completed immediately may be appropriately designed by the person who constructs, operates, and manages the system. It may be determined according to the type of dialogue (news dialogue, questionnaire dialogue, information retrieval dialogue, operation dialogue, educational dialogue, etc.).

また、図８において、Ｐ９の下流部分で点線で示されているように、Ｐ９の判断を行って待った結果、システム状態が変化することもあるので、次回以降の判断時の状態に従って、判断分岐が行われることになる。 Further, in FIG. 8, as shown by the dotted line in the downstream part of P9, the system state may change as a result of making the judgment of P9 and waiting. Will be done.

すなわち、待った結果、直ぐに準備処理が完了した場合には、Ｐ２→Ｐ３→Ｐ８という流れとなり、一方、直ぐに完了しない別の準備処理に移行した場合には、Ｐ２→Ｐ４→Ｐ６→Ｐ１０という流れとなる。直ぐに完了しない別の準備処理に移行した場合とは、例えば、ステータス＝「自サーバ検索中」であったが、自サーバ内で目的の情報が得られなかったため、外部システムにアクセスし、ステータス＝「外部システムアクセス中」となった場合等である。 That is, if the preparation processing completes quickly after waiting, the flow is P2 → P3 → P8, whereas if the processing moves on to another preparation step that does not complete quickly, the flow is P2 → P4 → P6 → P10. An example of moving on to another preparation step that does not complete quickly is when the status was "searching own server" but the desired information could not be found on the own server, so an external system is accessed and the status becomes "accessing external system".

さらに、図８において、Ｐ１０の判断に基づきフィラーの再生を開始した後、フィラーの再生を行っている間に準備が完了すれば、Ｐ２→Ｐ３→Ｐ８という流れとなり、フィラーの再生を中断するか、またはフィラーの再生終了後に、準備が完了した複数の次発話候補の中からの次発話の選択が行われ、選択された次発話の再生が行われることになる。一方、フィラーの挿入（Ｐ１０）を行っても、未だ準備が続いていた場合には、直ぐに完了しない準備処理が続いていることになるので、Ｐ２→Ｐ４→Ｐ６→Ｐ１０という流れとなり、再び、フィラーの挿入（Ｐ１０）が行われる。なお、フィラーの再生中に、Ｐ１０の判断が再びなされた場合には、再生中のフィラーを優先させて再生を続ける。新たなＰ１０の判断を優先させると、例えば「ちょっと待っ」「ちょっと待っ」「ちょっと待っ」のような繰り返しをする不自然な発話になってしまうからである。 Furthermore, in FIG. 8, if preparation completes while a filler started on the basis of the P10 decision is being played, the flow is P2 → P3 → P8: either the filler playback is interrupted, or after the filler playback finishes, the next utterance is selected from the plurality of prepared next utterance candidates and the selected next utterance is played. On the other hand, if preparation is still in progress even after a filler has been inserted (P10), preparation processing that does not complete quickly is continuing, so the flow is P2 → P4 → P6 → P10 and a filler is inserted again (P10). If the P10 decision is made again while a filler is being played, the filler already being played takes priority and its playback continues. Giving priority to the new P10 decision would produce an unnatural utterance that keeps restarting, such as "Hang on a", "Hang on a", "Hang on a".
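The decision flow of FIG. 8 (P1 to P10) described above can be summarized in a short Python sketch. The enum and function names, the English status strings, and the order in which the optional duration overrides are checked are assumptions for illustration; which statuses count as quick or slow is, as the text notes, left to the system designer.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    NOT_START = "P7: not the start timing"
    WAIT = "P9: not the start timing yet, wait for preparation"
    START = "P8: start timing"
    START_WITH_FILLER = "P10: start timing, insert a filler"


# Hypothetical status classification; the designer decides these sets.
QUICK_STATUSES = {"searching own server"}
SLOW_STATUSES = {"accessing external system", "synthesizing speech"}


def decide(turn_ended: bool, status: str,
           utterance_s: Optional[float] = None,
           long_s: Optional[float] = None,
           short_s: Optional[float] = None) -> Decision:
    """One pass through the FIG. 8 flow for the current processing cycle."""
    if not turn_ended:  # P1: user utterance right maintained
        if long_s is not None and utterance_s is not None and utterance_s >= long_s:
            return Decision.START  # optional override for very long user utterances
        return Decision.NOT_START
    # P2: user utterance right ended
    if short_s is not None and utterance_s is not None and utterance_s <= short_s:
        return Decision.NOT_START  # optional override for very short user utterances
    if status == "ready":  # P3
        return Decision.START
    if status in QUICK_STATUSES:  # P4 -> P5
        return Decision.WAIT
    if status in SLOW_STATUSES:  # P4 -> P6
        return Decision.START_WITH_FILLER
    raise ValueError(f"unclassified status: {status!r}")
```

Because the function is called once per processing cycle, a `WAIT` result is simply re-evaluated on the next cycle with the then-current status, which reproduces the P2 → P3 → P8 and P2 → P4 → P6 → P10 transitions. The rule that a filler already being played is not restarted would be enforced by the caller, not by this function.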

＜再生装置２０／システム発話タイミング検出手段２２／ユーザ発話権終了判定用閾値調整手段２２Ｇの構成＞ <Structure of playback device 20 / system utterance timing detecting means 22 / threshold adjusting means 22G for determining user utterance right end>

ユーザ発話権終了判定用閾値調整手段２２Ｇは、ユーザ発話権終了判定用パターン認識器２２Ｃによる維持・終了の識別処理で用いるユーザ発話権終了判定用閾値の事前調整、ユーザ発話権終了判定用閾値の下方調整を行うことを決めるための下方調整用閾値の事前調整、およびユーザ発話権終了判定用閾値のリアルタイム調整の各処理を実行するものである。 The user utterance right end determination threshold value adjusting means 22G pre-adjusts the user utterance right end determination threshold value used in the maintenance / termination identification process by the user utterance right end determination pattern recognizer 22C, and sets the user utterance right end determination threshold value. Each process of pre-adjusting the downward adjustment threshold value for deciding to perform downward adjustment and real-time adjustment of the user speaking right termination determination threshold value is executed.

Here, pre-adjustment is performed before a dialogue with the user starts (a dialogue on that day or at that time, or the dialogues in a predetermined period such as that week, month, season, or year). It uses the user's attribute information stored in the user information storage means 52, which is accumulated information obtained through multiple dialogues with the user rather than temporary information from the current dialogue. Real-time adjustment, by contrast, is performed after the dialogue with the user has started (in particular, while a user utterance is in progress). It uses the information indicating the user state stored in the user state storage means 32 and the information indicating the system state stored in the system state storage means 31, both of which are temporary information within the dialogue.

Specifically, as shown in FIG. 11, the user utterance right end determination threshold adjusting means 22G uses the user identification information of the user who is the dialogue partner to acquire that user's collision occurrence information (accumulated information) stored in the user information storage means 52, and calculates the occurrence frequency or cumulative number of collisions with that user. The occurrence frequency may be, for example, the number of collisions in a period of predetermined length such as one day, one week, or one month, the number of collisions relative to the total number of dialogues, or the number of collisions relative to the total number of shifts from a user utterance to a system utterance; the unit of the frequency is arbitrary. If the calculated occurrence frequency or cumulative count is equal to or greater than (or exceeds) a predetermined upward adjustment threshold, an upward adjustment is executed that sets the user utterance right end determination threshold higher than the standard value or the previously adjusted value. As a result, as shown in FIG. 11, the start timing of the system utterance is adjusted in the direction of being delayed, that is, in the collision avoidance direction.
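The upward adjustment described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the choice of "collisions per turn shift" as the frequency unit, the trigger value, the step size, and the upper bound are all assumptions introduced here.

```python
def pre_adjust_upward(current_threshold, collision_count, turn_shift_count,
                      upward_trigger=0.1, step=0.05, max_threshold=0.95):
    """Raise the end-of-turn threshold when past collisions were frequent.

    All numeric defaults are hypothetical; the source only requires that the
    threshold be set higher than the standard or previously adjusted value.
    """
    if turn_shift_count <= 0:
        # No history for this user yet: leave the threshold unchanged.
        return current_threshold
    # Frequency expressed as collisions per user-to-system turn shift
    # (one of the units the text allows).
    frequency = collision_count / turn_shift_count
    if frequency >= upward_trigger:
        # A higher threshold delays the system utterance (collision avoidance).
        return min(current_threshold + step, max_threshold)
    return current_threshold


if __name__ == "__main__":
    print(pre_adjust_upward(0.5, collision_count=3, turn_shift_count=20))
```

A cumulative count could be compared against the trigger instead of a ratio; the text leaves that choice open.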

Also, as shown in FIG. 12, the user utterance right end determination threshold adjusting means 22G uses the user identification information of the user who is the dialogue partner to acquire a plurality of shift latencies for that user stored in the user information storage means 52, each measured from the end of a user utterance to the start of the system utterance (accumulated information on the system's shift latency). It then calculates an average or other index value indicating whether the system's shift latency tends to be long or short when that user is the dialogue partner. Any index value that summarizes the plurality of shift latencies may be used, for example the mean, median, or mode. When the median or mode is used, each shift latency can be assigned to one of several bins, and the representative value of one of the bins can be taken as the median or mode. If the calculated index value is equal to or greater than (or exceeds) a predetermined downward adjustment threshold, a downward adjustment is executed that sets the user utterance right end determination threshold lower than the standard value or the previously adjusted value. As a result, as shown in FIG. 12, the start timing of the system utterance is adjusted in the direction of being earlier, that is, in the direction of a shorter shift latency.
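As a rough sketch of the downward adjustment just described, the following uses the median as the index value. The function name, the one-second trigger, the step size, and the lower bound are illustrative assumptions; the source does not give concrete values.

```python
import statistics


def pre_adjust_downward(current_threshold, shift_latencies_sec,
                        downward_trigger_sec=1.0, step=0.05,
                        min_threshold=0.05):
    """Lower the end-of-turn threshold when past system responses were slow.

    shift_latencies_sec: accumulated gaps (seconds) between the end of a user
    utterance and the start of the system utterance for one user.
    All numeric defaults are hypothetical.
    """
    if not shift_latencies_sec:
        # No accumulated latencies: nothing to adjust.
        return current_threshold
    # The median is one of the index values the text allows (mean and mode
    # are equally admissible).
    index_value = statistics.median(shift_latencies_sec)
    if index_value >= downward_trigger_sec:
        # A lower threshold makes the system start speaking earlier.
        return max(current_threshold - step, min_threshold)
    return current_threshold


if __name__ == "__main__":
    print(pre_adjust_downward(0.5, [0.8, 1.2, 1.5]))
```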

Furthermore, as shown in FIG. 12, the user utterance right end determination threshold adjusting means 22G uses the user identification information of the user who is the dialogue partner to acquire a plurality of speech rates of that user (accumulated information) stored in the user information storage means 52, and calculates an average or other index value indicating the tendency of the user's speech rate. The unit of speech rate is, for example, morae per second. Any index value that summarizes the plurality of speech rates may be used, for example the mean, median, or mode. The downward adjustment threshold is then set from the calculated speech rate index value by a predetermined function such that the downward adjustment threshold becomes smaller when the index value is large (fast speech) and larger when the index value is small (slow speech). Any function that satisfies this condition may be used. In the example of FIG. 12 it is a linear function, but it may be, for example, a function of second or higher order, or a single-step or multi-step step function. As a result, for a fast-talking user the downward adjustment threshold becomes small, so the user utterance right end determination threshold can be adjusted downward even at a relatively short shift latency (the condition for downward adjustment is met more easily), and the start timing of the system utterance can be adjusted toward a shorter shift latency. Conversely, for a user who speaks slowly the downward adjustment threshold becomes large, so the user utterance right end determination threshold cannot be adjusted downward unless the shift latency is relatively long (the condition for downward adjustment is not met), and the start timing of the system utterance is not adjusted toward a shorter shift latency.
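A linear mapping from speech rate to the downward adjustment threshold, in the spirit of the FIG. 12 example, might look like the sketch below. The anchor points (5 and 10 morae per second mapped to 1.5 s and 0.5 s) and the clamping outside that range are assumptions made for illustration only.

```python
def downward_trigger_from_speech_rate(rate_mora_per_sec,
                                      slow_rate=5.0, fast_rate=10.0,
                                      trigger_at_slow=1.5,
                                      trigger_at_fast=0.5):
    """Map a user's speech-rate index to a downward adjustment threshold (s).

    Faster speech gives a smaller threshold, slower speech a larger one.
    The anchor values are hypothetical; only the monotonic relation comes
    from the text.
    """
    # Clamp the rate so the result stays within the two anchor thresholds.
    rate = min(max(rate_mora_per_sec, slow_rate), fast_rate)
    fraction = (rate - slow_rate) / (fast_rate - slow_rate)
    # Linear interpolation between the slow and fast anchor points.
    return trigger_at_slow + fraction * (trigger_at_fast - trigger_at_slow)


if __name__ == "__main__":
    for rate in (4.0, 7.5, 12.0):
        print(rate, downward_trigger_from_speech_rate(rate))
```

A step function would satisfy the same condition; the linear form is used here only because the text names it as the figure's example.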

Also, as shown by the solid line in FIG. 9, the user utterance right end determination threshold adjusting means 22G successively acquires the user utterance duration (real-time information) of the dialogue partner stored in the user state storage means 32. When the acquired user utterance duration is equal to or less than (or less than) a predetermined short-duration determination threshold, it sets the user utterance right end determination threshold higher than the standard value; when the duration is equal to or greater than (or exceeds) a predetermined long-duration determination threshold, it sets the user utterance right end determination threshold lower than the standard value. This process is executed successively. As a result, immediately after a user utterance starts, the identification result that the user's utterance right has ended is less likely to be produced, and as the time elapsed since the start of the user utterance grows, that identification result becomes more likely.

Alternatively, as shown by the two-dot chain line in FIG. 9, the user utterance right end determination threshold adjusting means 22G may successively acquire the user utterance duration (real-time information) of the dialogue partner stored in the user state storage means 32 and successively set the user utterance right end determination threshold from the acquired duration by a predetermined function such that the threshold becomes higher when the user utterance duration is short and lower when it is long. This function is not limited to the staircase function shown by the solid line in FIG. 9; various other functions may be used. Any function that satisfies the above condition is acceptable: for example, a linear function, a function of second or higher order, a single-step step function, or a multi-step (three or more steps) step function other than the two-step function shown by the solid line in FIG. 9.
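The two real-time variants for FIG. 9 (the two-step staircase and a smooth alternative) can be sketched as below. The threshold levels, the 1 s and 5 s boundaries, and the use of a linear ramp for the alternative are assumptions chosen for illustration; the source specifies only the direction of the relationship.

```python
def threshold_two_step(duration_sec, standard=0.5, high=0.7, low=0.3,
                       short_limit_sec=1.0, long_limit_sec=5.0):
    """Two-step staircase: high early in the utterance, low once it is long.

    All numeric values are hypothetical placeholders.
    """
    if duration_sec <= short_limit_sec:
        return high       # right after utterance start: hard to judge "ended"
    if duration_sec >= long_limit_sec:
        return low        # long utterance: easy to judge "ended"
    return standard


def threshold_linear(duration_sec, high=0.7, low=0.3, ramp_sec=5.0):
    """Linear alternative: falls from `high` to `low` over `ramp_sec`."""
    fraction = min(max(duration_sec / ramp_sec, 0.0), 1.0)
    return high + fraction * (low - high)


if __name__ == "__main__":
    for t in (0.5, 3.0, 6.0):
        print(t, threshold_two_step(t), threshold_linear(t))
```

Both functions would be called repeatedly while the user is speaking, with the current utterance duration as input.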

Furthermore, as shown in FIG. 10, the user utterance right end determination threshold adjusting means 22G successively acquires, for the dialogue partner, the remaining number of target data items and/or the importance of the next utterance candidates stored in the system state storage means 31. Target data is subject data that can become the content data of the system's final next utterance candidate for achieving the purpose of the dialogue. The importance is the value attached to each of the content data items of the plurality of next utterance candidates obtained in the preparation process by the next utterance preparation means 43. The remaining number of target data items and the importance of the next utterance candidates are index values of the system utterance motivation, which indicates how strongly the system needs to start speaking.

Regarding the remaining number of target data items, the setting can be such that, for example, the system utterance motivation is strong when the remaining number is 1 and weak when it is 2 or more. For example, in an information search dialogue, the user presents conditions as the user utterance progresses, and the system utterance motivation can be set strong at the point where the remaining number of target data items becomes 1 as a result of those conditions. As a concrete example, when searching for a restaurant, the user answers the system's "Where do you want to eat?" with "I want to find a place near Tokyo Station.", answers "What do you want to eat?" with "I want Chinese food.", and answers "What kind of place would you like?" with "Somewhere with a reputation for good food, and also...". As conditions accumulate in this way, the target data (data on the restaurants to be presented) may be narrowed down to a single item. In that case the user no longer needs to present further conditions (that is, whatever follows "and also..." is unnecessary), and it is better to reproduce the single remaining target data item (restaurant data) promptly, so the system utterance motivation becomes strong. If the user, after saying "Somewhere with a reputation for good food.", then makes a correction such as "Wait, actually I don't care about reputation, a cheap place would be better.", and the remaining number of target data items (restaurant data) returns to 2 or more as a result, the system utterance motivation becomes weak. The method of quantifying the system utterance motivation is arbitrary. For example, it can be a 10-level scale from 1 to 10 (the number of levels is arbitrary), a continuous value from 0 to 1, or a continuous value from 0 to 100%. For example, when the remaining number of target data items is 1, the system utterance motivation can be set to 10 on the 10-level scale; when the remaining number is 2 or 3, to 7; and when the remaining number is 4 or more, to 2. This correspondence may be determined in advance. In this example some of the 10 levels are never used; this is because the levels are aligned with the system utterance motivation determined from the importance of the next utterance candidates, described below.
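The example correspondence between the remaining number of target data items and the 10-level system utterance motivation can be written as a small lookup. The mapping values (10, 7, 2) come from the example above; the function name and the handling of a remaining count below 1, which the text does not discuss, are assumptions.

```python
def motivation_from_remaining(remaining_count):
    """Map the remaining number of target data items to a 1-10 motivation.

    Mapping taken from the example in the text:
      1 item -> 10, 2 or 3 items -> 7, 4 or more items -> 2.
    A count below 1 is rejected here; that behaviour is an assumption.
    """
    if remaining_count < 1:
        raise ValueError("remaining_count must be at least 1")
    if remaining_count == 1:
        return 10   # a single result is left: present it promptly
    if remaining_count <= 3:
        return 7
    return 2        # many results left: let the user keep narrowing down


if __name__ == "__main__":
    # Restaurant search narrowing from 12 candidates to 1, then back to 5
    # after the user retracts a condition.
    for remaining in (12, 3, 1, 5):
        print(remaining, motivation_from_remaining(remaining))
```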

Regarding the importance of the next utterance candidates, the higher the importance, the stronger the system utterance motivation, and the lower the importance, the weaker the system utterance motivation. This importance may be the same as the importance of each constituent sentence of the original article data that is used when summarizing the article data (original text data of articles describing various topics such as news, columns, and history) to generate scenario data. In the present embodiment, however, the importance additionally takes into account factors such as the urgency of disaster-prevention information and the magnitude of the impact on daily life. For example, in a news dialogue, importance may be set on a 10-level scale from 1 to 10, such as "A large earthquake has occurred in XXX. Those in the YYY coastal area should evacuate to high ground immediately." = 10, "A heavy rain and flood warning has been issued for the XXX region." = 8, "The consumption tax will be 10% from tomorrow." = 6, and "Hanako Waseda has set a new Japanese record in the women's 100 m." = 4. In that case these values can be used directly as the values indicating the system utterance motivation. Alternatively, an importance of 8 or more can be mapped to a system utterance motivation of 3 on a 3-level scale, an importance of 5 to 7 to 2, and an importance of 4 or less to 1. This correspondence may be determined in advance. When the content data of a plurality of next utterance candidates is stored in the next utterance candidate storage means 30 and the importance of each of those content data items is stored in the system state storage means 31, the mean, median, mode, or the like of the importance values may be used as the representative importance, or the largest or the smallest importance may be used as the representative importance.
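The 3-level mapping and the choice of a representative importance across several candidates can be sketched as follows. The level boundaries are those given in the text; the function names and the `how` parameter are illustrative assumptions.

```python
import statistics


def motivation_from_importance(importance):
    """Map a 1-10 importance to the 3-level motivation from the text:
    8 or more -> 3, 5 to 7 -> 2, 4 or less -> 1."""
    if importance >= 8:
        return 3
    if importance >= 5:
        return 2
    return 1


def representative_importance(importances, how="max"):
    """Reduce the importances of several next utterance candidates to one.

    `how` is a hypothetical selector covering the options named in the text.
    """
    reducers = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
        "max": max,
        "min": min,
    }
    return reducers[how](importances)


if __name__ == "__main__":
    candidate_importances = [10, 8, 6, 4]   # the four news examples
    rep = representative_importance(candidate_importances, how="max")
    print(rep, motivation_from_importance(rep))
```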

Note that, instead of storing the remaining number of target data items and the importance of each of the content data items of the plurality of next utterance candidates in the system state storage means 31, the system utterance motivation obtained by the next utterance preparation means 43 by converting these values may be stored in the system state storage means 31. Likewise, instead of storing the importance of each of the content data items of the plurality of next utterance candidates, the representative importance obtained by the next utterance preparation means 43 may be stored in the system state storage means 31. Furthermore, the system utterance motivation determined from the remaining number of target data items and the system utterance motivation determined from the importance of the next utterance candidates may be used selectively according to the type of dialogue, or they may be integrated by taking their average, weighted average, or the like.

Then, as shown in FIG. 10, the user utterance right end determination threshold adjusting means 22G obtains the system utterance motivation from the acquired remaining number of target data items and/or the importance of each of the content data items of the plurality of next utterance candidates, and successively sets the user utterance right end determination threshold from the obtained system utterance motivation by a predetermined function such that the threshold becomes lower when the system utterance motivation is strong and higher when it is weak. As a result, when the system utterance motivation is strong, the identification result that the user's utterance right has ended is more likely to be produced, and when the system utterance motivation is weak, that identification result is less likely to be produced.
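One possible form of the predetermined function for FIG. 10 is a linear map from the 10-level system utterance motivation to the threshold, sketched below. The source does not state the shape of the function; the linear form, the threshold range 0.3 to 0.7, and the clamping are assumptions.

```python
def threshold_from_motivation(motivation, scale_min=1, scale_max=10,
                              high=0.7, low=0.3):
    """Map system utterance motivation to the end-of-turn threshold.

    Strong motivation gives a low threshold (system takes the turn sooner);
    weak motivation gives a high threshold. Numeric values are hypothetical.
    """
    # Clamp to the scale so out-of-range inputs cannot leave [low, high].
    m = min(max(motivation, scale_min), scale_max)
    fraction = (m - scale_min) / (scale_max - scale_min)
    return high + fraction * (low - high)


if __name__ == "__main__":
    for motivation in (2, 7, 10):
        print(motivation, round(threshold_from_motivation(motivation), 3))
```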

<Configuration of playback device 20 / next utterance selection information generating means 23>

In FIG. 3, the next utterance selection information generating means 23 includes a prosodic feature extracting means 23A, a first utterance intention classifier 23B, and a second utterance intention classifier 23E. The technique described in Non-Patent Document 3 mentioned above can be adopted, for example, for the next utterance selection information generating means 23.

The prosodic feature extracting means 23A extracts prosodic features (prosodic information) from the voice signal of the user utterance acquired by the voice signal acquisition means 21. The prosodic feature extracting means 23A can adopt the same configuration as the acoustic feature extracting means 22A of the system utterance timing detecting means 22. That is, the acoustic features extracted by the acoustic feature extracting means 22A can be used as the prosodic features (prosodic information), so the prosodic feature extracting means 23A and the acoustic feature extracting means 22A of the system utterance timing detecting means 22 can be implemented as a single shared component. Accordingly, the prosodic features (prosodic information) obtained by the prosodic feature extracting means 23A can be, for example, 256-dimensional bottleneck features taken from the intermediate layer of a CNN autoencoder.

The first utterance intention classifier 23B is composed of a first utterance intention identification processing means 23C and a first utterance intention identification model storage means 23D. The first utterance intention identification processing means 23C uses the prosodic features (prosodic information) extracted by the prosodic feature extracting means 23A to identify the user's utterance intention: question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or other. The first utterance intention identification model storage means 23D stores the model (parameters) used in this identification process. The first utterance intention classifier 23B can be built with, for example, an LSTM (a type of RNN).

Specifically, the first utterance intention classifier 23B can be configured, for example, to successively receive the 256-dimensional prosodic features (prosodic information) taken from the intermediate layer of the CNN autoencoder, perform pattern recognition with the LSTM to identify the utterance intention, and output the identification result (see Non-Patent Document 3). The utterance intention output from the first utterance intention classifier 23B may be sent to the next utterance selection means 24.

The second utterance intention classifier 23E is composed of a second utterance intention identification processing means 23F and a second utterance intention identification model storage means 23G. The second utterance intention identification processing means 23F identifies the user's utterance intention (question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or other) using three inputs: the prosodic information obtained by the first utterance intention classifier 23B (for example, the values of the LSTM hidden layer); the linguistic information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means 41; and the linguistic information of the immediately preceding system utterance, taken from the dialogue history information between the user and the system stored in the dialogue history storage means 50. The second utterance intention identification model storage means 23G stores the model (parameters) used in this identification process. The second utterance intention classifier 23E can be built with, for example, BERT (see Non-Patent Document 3). BERT is a natural language processing model: a bidirectional Transformer model whose unit is the encoder part of the Transformer. The utterance intention output from the second utterance intention classifier 23E is sent to the next utterance selection means 24.

The next utterance selection information generating means 23 may also acquire face images and gesture images (images of body and hand movements) of the user, analyze the facial expression and the content of the gestures, and send the analysis results (the identified facial expression and the identified intention of the body and hand movements) to the next utterance selection means 24 as next utterance selection information.

<Configuration of playback device 20 / next utterance selection means 24>

The next utterance selection means 24 operates after the system utterance timing detecting means 22 has detected the start timing of the system utterance, that is, when it receives from the system utterance timing detecting means 22 the determination result that it is time to start the system utterance. It combines the identification result of the user's utterance intention obtained by the next utterance selection information generating means 23 with the linguistic information (character string) obtained as a result of the voice recognition processing by the voice recognition processing means 41, and uses them to select the content data of the next utterance to be used by the utterance generation means 25 from among the content data of the plurality of next utterance candidates (there may be only one) that were obtained in the preparation process by the next utterance preparation means 43 and are stored in the next utterance candidate storage means 30. It then sends the selected content data of the next utterance to the utterance generation means 25, and transmits the selected content data or its identification information (for example, a scenario ID or utterance clause ID) to the dialogue state management means 42 via the network 1.

Note that, if the content data of the next utterance can be selected using only one of the identification result of the user's utterance intention (for example, whether it is a question or a backchannel) and the linguistic information (character string) obtained as a result of the voice recognition processing, the two need not be used in combination. Also, when the identification result of the user's utterance intention is received not from the next utterance selection information generating means 23 but from the system utterance timing detecting means 22, together with the determination result that it is time to start the system utterance (that is, when the system utterance timing detecting means 22 has identified with what intention the user utterance ended), that identification result may be used to select the content data of the next utterance. Furthermore, when analysis results of the user's facial expression or gestures are received from the next utterance selection information generating means 23, the content data of the next utterance may be selected using those analysis results alone or in combination with other information.

In addition, when the next utterance selection means 24 receives from the system utterance timing detecting means 22, together with the determination result that it is time to start the system utterance, information indicating that it is time to insert a filler (including information on which type of filler to insert), it sends the content data (including audio data) of the filler of the specified type to the utterance generation means 25. At this time, the content data of the filler to be inserted, or the identification information of the filler type, may or may not be transmitted to the dialogue state management means 42 via the network 1. Even when it is transmitted, the dialogue state management means 42 treats the filler insertion in the dialogue history not as a system utterance but as a bridging utterance for preparing the system utterance. This point is described in detail in the later description of the dialogue state management means 42 and is not elaborated here. The content data (including audio data) of the various fillers may be stored, in association with the identification information of the filler type, in a filler storage means (not shown) provided in the playback device 20. Alternatively, on the view that a filler can always be a next utterance candidate, it may be stored in the next utterance candidate storage means 30. In the latter case, the content data (including audio data) of the various fillers does not need to be received from the next utterance preparation means 43 via the network 1 each time; it can simply be kept as fixed data in the next utterance candidate storage means 30. Even when fillers are regarded as next utterance candidates and stored in the next utterance candidate storage means 30 in this way, the dialogue state management means 42 does not, as described above, treat a filler insertion as a system utterance in the dialogue history. The fillers held as fixed data in the next utterance candidate storage means 30 are therefore distinct from the content data of the other next utterance candidates (data that is updated frequently); they are merely stored in the same next utterance candidate storage means 30.

Specifically, when the next utterance selection means 24 selects the next utterance using the result (character string) of the voice recognition processing by the voice recognition processing means 41, it can perform keyword matching between the words contained in the voice recognition result and the words contained in each of the content data of the plurality of next utterance candidates stored in the next utterance candidate storage means 30, and select the content data of the matched next utterance candidate as the content data of the next utterance. More complex matching that combines language processing and machine learning may also be performed. Furthermore, the similarity between the character string obtained as a result of the voice recognition processing and each of the content data (character strings) of the plurality of next utterance candidates may be computed by doc2vec or the like, and the content data of the next utterance candidate with the highest similarity may be selected as the content data of the next utterance.
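The keyword matching described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function name `select_by_keyword`, the candidate IDs, and the assumption that both the recognition result and the candidates are already split into words (Japanese text would first need morphological analysis) are all hypothetical.

```python
def select_by_keyword(asr_words, candidates):
    """Return the ID of the candidate sharing the most words with the
    recognition result, or None when no candidate shares any word.

    asr_words  -- list of words from the voice recognition result (assumed tokenized)
    candidates -- dict mapping a candidate ID to its list of words (hypothetical layout)
    """
    asr_set = set(asr_words)
    best_id, best_overlap = None, 0
    for cand_id, cand_words in candidates.items():
        overlap = len(asr_set & set(cand_words))
        if overlap > best_overlap:  # ties keep the earlier candidate
            best_id, best_overlap = cand_id, overlap
    return best_id


# Hypothetical next utterance candidates, already tokenized.
CANDIDATES = {
    "def_player": ["Taro", "Waseda", "is", "a", "skater"],
    "def_flip": ["quadruple", "flip", "is", "a", "jump"],
}
```

A plain word-overlap count is used here only because it is the simplest form of keyword matching; weighting by word importance, or the doc2vec similarity mentioned above, would replace the `overlap` computation.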

Also, when the system asks the user questions and the user answers them, as in spoken dialogue for automated telephone answering, the content of the user utterance is limited, for example to the options given by the system, and can therefore be predicted. In this case, if each of the content data of the plurality of next utterance candidates stored in the next utterance candidate storage means 30 is accompanied by prediction data of the corresponding user utterance, then by determining which of the accompanying prediction data matches the character string obtained as a result of the voice recognition processing, the content data of the next utterance candidate corresponding to the matched prediction data can be selected as the content data of the next utterance.

For example, suppose that the system utterance S(N) is "Which do you support, the XXX party or the YYY party?", and that, as two candidates for the system utterance S(N+1), the next utterance preparation means 43 has prepared and stored in the next utterance candidate storage means 30 the content data "Which politician of the XXX party do you think will become prime minister?" with its accompanying prediction data "XXX party" for the user utterance U(N), and the content data "Which politician of the YYY party is suited to be party leader?" with its accompanying prediction data "YYY party" for the user utterance U(N). If the voice recognition result for the user utterance U(N) is "XXX party", it matches the prediction data "XXX party", so the corresponding "Which politician of the XXX party do you think will become prime minister?" is selected as the system utterance S(N+1) and reproduced by the utterance generation means 25.
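The selection by predicted user utterance in this example could look like the sketch below. The `Candidate` class, its field names, and the exact-match comparison after trimming whitespace are assumptions made for illustration; the description only requires determining which prediction data "matches" the recognized string.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    content: str                   # content data of the next utterance candidate
    predicted_user_utterance: str  # accompanying prediction data for U(N)


def select_by_prediction(asr_text, candidates):
    """Return the candidate whose predicted user utterance equals the
    recognized text (after trimming whitespace), or None if none matches."""
    recognized = asr_text.strip()
    for cand in candidates:
        if cand.predicted_user_utterance == recognized:
            return cand
    return None


# The two candidates for S(N+1) from the example above.
PARTY_CANDIDATES = [
    Candidate("Which politician of the XXX party do you think will become prime minister?",
              "XXX party"),
    Candidate("Which politician of the YYY party is suited to be party leader?",
              "YYY party"),
]
```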

Furthermore, when the next utterance selection means 24 selects the next utterance using the identification result of the user utterance intention (question, backchannel, etc.), a system utterance type corresponding to a user utterance intention is defined for each of the content data of the plurality of next utterance candidates stored in the next utterance candidate storage means 30. The content data of the next utterance candidate whose system utterance type corresponds to the obtained identification result can therefore be selected as the content data of the next utterance.

For example, the correspondence between user utterance intentions and system utterance types may be determined in advance as follows: if the user utterance intention is "backchannel", "understanding", or the like, the content data of the next utterance candidate whose system utterance type is the main plan is selected; if the user utterance intention is "definition-type question" (a question asking the meaning of a term), the content data of the next utterance candidate whose system utterance type is the sub-plan (definition) is selected; if the user utterance intention is "repetition request" or "non-understanding", the content data of the next utterance candidate whose system utterance type is the main plan for repetition is selected; and if the user utterance intention is "supplement request" or the like, the content data of the next utterance candidate that is the sub-plan for supplementary explanation (trivia, etc.) is selected. This correspondence may be written in the program constituting the next utterance selection means 24, or may be stored in an utterance intention / system utterance type correspondence storage means (not shown) provided in the playback device 20. Accordingly, whether a component of the scenario data is a main plan element or a sub-plan element is also part of the system utterance type. The main plan and the sub-plan are described in detail later with reference to FIGS. 13 and 14.
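A predetermined correspondence of this kind is naturally expressed as a lookup table. The sketch below assumes hypothetical English labels for the intentions and system utterance types, and a candidate layout of plain dictionaries with a `"type"` key; none of these names come from the embodiment.

```python
# Hypothetical labels: user utterance intention -> system utterance type.
INTENT_TO_TYPE = {
    "backchannel": "main_plan",
    "understanding": "main_plan",
    "definition_question": "sub_plan_definition",
    "repetition_request": "main_plan_repeat",
    "non_understanding": "main_plan_repeat",
    "supplement_request": "sub_plan_supplement",
}


def select_by_intent(intent, candidates):
    """Return every stored candidate whose system utterance type
    corresponds to the identified user utterance intention."""
    wanted = INTENT_TO_TYPE.get(intent)
    return [c for c in candidates if c["type"] == wanted]


# Hypothetical contents of the next utterance candidate storage.
STORED = [
    {"type": "main_plan", "content": "next main plan element"},
    {"type": "main_plan_repeat", "content": "previous main plan element again"},
    {"type": "sub_plan_definition", "content": "definition of a term"},
]
```

The function returns a list rather than a single item because more than one candidate can share a system utterance type, in which case the intention alone does not decide the next utterance.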

If the user utterance intention is "indifference" or "already known", the content data of the next utterance candidate that is the shorter of two main plans prepared in long and short versions can be selected to reduce the amount of information conveyed. With these user utterance intentions, however, the replacement preparation means 43C of the next utterance preparation means 43 may be in the middle of preparation processing to replace the content data of the plurality of next utterance candidates (status = preparing), or that preparation may already be complete, with the next utterance candidate storage means 30 storing the leading component (main plan element) of scenario data on another topic, or another component (main plan element) of the same scenario data. In that case, that main plan element may be selected.

The next utterance selection means 24 can also select the next utterance by using the identification result of the user utterance intention (question, backchannel, etc.) in combination with the language information (character string) obtained as a result of the voice recognition processing, as follows.

For example, suppose that the system utterance S(N) is "Taro Waseda landed a quadruple flip.", and that, as candidates for the system utterance S(N+1), the next utterance preparation means 43 has prepared and stored in the next utterance candidate storage means 30 the following: "He reportedly landed it at the Grand Prix series event in Canada." (main plan element); "Taro Waseda is ...", data describing the person Taro Waseda (sub-plan element, definition); "A quadruple flip is ...", data describing the quadruple flip technique (sub-plan element, definition); and "Taro Waseda landed a quadruple flip." for repetition (main plan element). If the user utterance intention of U(N) is "backchannel" or "understanding", "He reportedly landed it at the Grand Prix series event in Canada." (main plan element) may be selected as the next utterance S(N+1), and if the user utterance intention is "repetition request", "Taro Waseda landed a quadruple flip." for repetition (main plan element) may be selected.

If, however, the user utterance intention of U(N) is "question", the content data of the plurality of next utterance candidates stored in the next utterance candidate storage means 30 contain two system responses to a definition-type question (sub-plan elements, definition). The system cannot respond until it is known whether the user is asking about Taro Waseda or about the quadruple flip, and this cannot be determined from the user utterance intention alone. The language information (character string) obtained as a result of the voice recognition processing is therefore used to determine which question is being asked and which system response (sub-plan element, definition) to select. Conversely, when the content data of the plurality of next utterance candidates stored in the next utterance candidate storage means 30 contain only one system response to a definition-type question (sub-plan element, definition), that one system response can be selected without using the voice recognition result (that is, from the user utterance intention alone).
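The two-stage decision in this example (narrow by intention first, consult the recognized text only when more than one candidate remains) might be sketched as follows. The dictionary layout, the `"keywords"` field, the substring test, and the English labels are all hypothetical choices for the illustration.

```python
def select_next(intent, asr_text, candidates, intent_to_type):
    """Pick the next utterance from the intention alone when that is
    unambiguous; otherwise disambiguate with the recognized text."""
    wanted = intent_to_type.get(intent)
    matches = [c for c in candidates if c["type"] == wanted]
    if not matches:
        return None
    if len(matches) == 1:
        return matches[0]  # the intention alone is sufficient
    for cand in matches:   # several candidates share the type
        if any(keyword in asr_text for keyword in cand["keywords"]):
            return cand
    return None            # still ambiguous


# Hypothetical labels and candidates modelled on the example above.
TYPE_FOR_INTENT = {
    "backchannel": "main",
    "repetition_request": "main_repeat",
    "question": "definition",
}
FLIP_CANDIDATES = [
    {"type": "main", "keywords": [],
     "content": "He reportedly landed it at the Grand Prix series event in Canada."},
    {"type": "definition", "keywords": ["Taro Waseda"], "content": "Taro Waseda is ..."},
    {"type": "definition", "keywords": ["quadruple flip"], "content": "A quadruple flip is ..."},
    {"type": "main_repeat", "keywords": [],
     "content": "Taro Waseda landed a quadruple flip."},
]
```

Returning `None` in the ambiguous case leaves the fallback (for instance a clarifying question) to the caller; the description does not specify what happens then.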

Conversely, there are cases where the next utterance cannot be selected from the voice recognition result alone. For example, the recognized text of the user utterance "Eh?" does not by itself reveal whether it expresses surprise, a question, or a request for repetition, so the user utterance intention is used to select the next utterance.

There are periods in which no content data of next utterance candidates is stored in the next utterance candidate storage means 30, but this does not affect the processing by the next utterance selection means 24. Such a period is one in which the preparation processing by the next utterance preparation means 43 has not been completed (a period of preparation). In the flow P5 → P9 of FIG. 8 (when preparation will be completed shortly), it is determined that it is not the start timing of a system utterance, so processing does not proceed to the next utterance selection means 24. In the flow P6 → P10 of FIG. 8 (when preparation will not be completed shortly), processing does proceed to the next utterance selection means 24, but a filler is inserted, so the content data of a next utterance candidate is not needed.

<Configuration of playback device 20 / utterance generation means 25>

After the system utterance timing detecting means 22 has detected the start timing of a system utterance, the utterance generation means 25 executes system utterance generation processing, including reproduction of the audio signal of the system utterance, using the content data of the next utterance selected by the next utterance selection means 24 (the content data of the next utterance selected from among the content data of the plurality of next utterance candidates obtained in the preparation processing by the next utterance preparation means 43). The utterance generation means 25 also includes a speaker and a display.

At this time, if the content data of the next utterance received from the next utterance selection means 24 does not include audio data (for example, a wav file), the utterance generation means 25 also executes speech synthesis processing that generates audio data from the text data received from the next utterance selection means 24. From the viewpoint of preventing delay in the system response, however, it is preferable that the speech synthesis processing be executed by the next utterance preparation means 43, or that the audio data be prepared in advance as subject data.

In addition to reproducing the audio signal of the system utterance, the utterance generation means 25 also executes reproduction of video or still images on the display, or reproduction of music, when the content data of the next utterance received from the next utterance selection means 24 is accompanied by video data, still image data, or music data. For example, if the immediately preceding system utterance was "Taro Waseda landed a quadruple flip." and the user responded with the question "What kind of move is a quadruple flip?", a video explaining the quadruple flip can be played, and if the user responded with the question "What kind of skater is Taro Waseda?", a face image of Taro Waseda can be shown. Likewise, if the immediately preceding system utterance was "The Ninth was performed at XXX Hall." and the user responded with the question "What kind of piece is the Ninth?", the music data of the Ninth can be played. During a system utterance, text showing the content of the system utterance may be displayed on the display in synchronization, or substantially in synchronization, with the reproduction of the audio signal of the system utterance.

Furthermore, the utterance generation means 25 detects the occurrence of a collision between the audio signal of a user utterance acquired by the voice signal acquisition means 21 and the audio signal of the system utterance being reproduced, transmits the detected collision occurrence information to the dialogue server 40 via the network 1, and executes processing to store it in the user information storage means 52 in association with the user identification information. There are two types of collision. The collision detected here is type [1]: the user utterance right was identified as having ended and a system utterance was started, but the user utterance right was in fact maintained, so that the two utterances overlapped. It is not type [2], in which the user utterance right did end but the start of the system utterance was delayed, so that the user began speaking again and both utterances started almost simultaneously and overlapped. Processing to exclude type [2] collisions is therefore executed.

For example, when the length of the silent section immediately before the collision (the time interval between the end of the voice section of the user utterance immediately preceding the collision and the start of the voice section of the user utterance that caused the collision) is equal to or greater than (or exceeds) a predetermined collision type determination threshold, the collision can be judged to be of type [2] and excluded. Alternatively, all the related data from before and after the collision may be saved, and whether the collision is of type [2] may be determined after the fact, with type [2] collisions then being excluded. The related data are, for example, the end time of the voice section of the user utterance immediately before the collision, the time at which the system utterance timing detecting means 22 detected the start timing of the system utterance, the time at which the utterance generation means 25 started reproducing the audio signal of the system utterance, the start time of the voice section of the user utterance that caused the collision, the language information obtained as voice recognition results for the user utterances both before and after the collision, and the content data (text data) of the system utterance involved in the collision.

Furthermore, a classifier that distinguishes type [1] from type [2] collisions may be trained on these related data, and type [2] collisions may be excluded according to the classifier's output, either after the fact or in real time or substantially real time. Since the present invention prevents delay in the system response, for example by preparing next utterance candidates in advance, type [2] collisions hardly ever occur.
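The threshold-based exclusion of type [2] collisions described above can be sketched as follows. The function names, the record layout, and the default threshold of 1.0 second are assumptions for illustration; the description only states that the threshold is predetermined.

```python
def classify_collision(prev_user_end, colliding_user_start, threshold_s=1.0):
    """Classify a collision from the silence that preceded it.

    prev_user_end        -- end time (s) of the user voice section just before the collision
    colliding_user_start -- start time (s) of the user voice section that collided
    threshold_s          -- collision type determination threshold (assumed value)

    Returns 2 when the silence is at least the threshold (the system started
    late and the user resumed), otherwise 1 (the user had kept the turn).
    """
    silence = colliding_user_start - prev_user_end
    return 2 if silence >= threshold_s else 1


def keep_type1(collisions, threshold_s=1.0):
    """Drop type [2] collisions, keeping only type [1] records."""
    return [
        c for c in collisions
        if classify_collision(c["prev_user_end"], c["colliding_user_start"], threshold_s) == 1
    ]
```

The comparison here uses "equal to or greater than"; the description also allows a strict "exceeds" comparison, which would change `>=` to `>`.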

The utterance generation means 25 also measures the turn-taking latency from the end of the user utterance to the start of the system utterance, by detecting the end time of the user utterance from the audio signal of the user utterance acquired by the voice signal acquisition means 21 and detecting the reproduction start time of the audio signal of the system utterance. It transmits the measured turn-taking latency of the system to the dialogue server 40 via the network 1 and executes processing to store it in the user information storage means 52 in association with the user identification information.

Furthermore, the utterance generation means 25 detects the start time of the user utterance from the audio signal of the user utterance acquired by the voice signal acquisition means 21, successively measures the user utterance duration as the difference between the detected start time and the current time, and executes processing to successively store the measured user utterance duration in the user state storage means 32.

The utterance generation means 25 also successively acquires, via the network 1, the language information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means 41, calculates the speech rate in real time using the acquired language information and its acquisition time (or time information acquired together with the language information), and executes processing to successively store the calculated real-time speech rate in the user state storage means 32. In addition, the utterance generation means 25 calculates the speech rate of the user utterances over the entire dialogue, transmits the calculated speech rate to the dialogue server 40 via the network 1, and executes processing to store it in the user information storage means 52 in association with the user identification information.
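One way to compute a running speech rate from successively arriving recognition results is sketched below. The description does not define the unit of speech rate, so characters per second is an assumption, as are the class name `SpeechRateTracker` and the premise that each update delivers only newly recognized text rather than the cumulative transcript.

```python
class SpeechRateTracker:
    """Running speech rate (characters per second, an assumed unit)
    for the user utterance currently in progress."""

    def __init__(self, utterance_start):
        self.utterance_start = utterance_start  # start time of the utterance (s)
        self.char_count = 0                     # characters recognized so far

    def update(self, new_text, timestamp):
        """Add newly recognized text and return the rate up to `timestamp`."""
        self.char_count += len(new_text)
        elapsed = timestamp - self.utterance_start
        if elapsed <= 0:
            return 0.0  # avoid division by zero at the very start
        return self.char_count / elapsed
```

Each returned value would be written to the user state storage as it is produced; a rate over the whole dialogue can be obtained the same way by summing characters and speaking time across utterances.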

<Configuration of playback device 20 / next utterance candidate storage means 30, system state storage means 31, and user state storage means 32>

The next utterance candidate storage means 30 stores the content data of the plurality of next utterance candidates transmitted from the next utterance preparation means 43 via the network 1, in association with the identification information of those data (for example, scenario ID, utterance clause ID, etc.). The stored content data of the next utterance candidates include audio data (for example, wav files) in addition to text data, and may further be accompanied by video data, still image data, or music data. When no filler storage means (not shown) is provided in the playback device 20, the content data (including audio data) of the various fillers may, as already described, be stored in the next utterance candidate storage means 30 as data prepared in a fixed manner, in association with the identification information of the filler type.

Each of the content data of the plurality of next utterance candidates is stored in association with a system utterance type indicating an attribute of the data (for example, whether it belongs to the main plan or a sub-plan in the scenario data). The system utterance type has already been described in detail in the description of the next utterance selection means 24, so a detailed explanation is omitted here.

The system state storage means 31 stores, as information indicating the system state, the state of the preparation processing by the next utterance preparation means 43 (a status indicating whether preparation is complete or which kind of preparation is in progress), the remaining number of goal data (goal data being subject data that can become the content data of the system's final next utterance candidates for achieving the dialogue goal), and the importance of the next utterance candidates (the importance of each of the content data of the plurality of next utterance candidates obtained in the preparation processing by the next utterance preparation means 43). This system state is temporary information obtained during the dialogue currently in progress with the user (real-time information that is successively updated). Of these, the remaining number of goal data and the importance of the next utterance candidates are index values of the system's willingness to speak. The system's willingness to speak has already been described in detail in the description of the user utterance right end determination threshold adjustment means 22G of the system utterance timing detecting means 22 (see FIG. 10), so a detailed explanation is omitted here. The status indicating whether preparation is complete or which kind of preparation is in progress, the remaining number of goal data, and the importance of the next utterance candidates are all transmitted from the next utterance preparation means 43 via the network 1, stored in the system state storage means 31, and successively updated.

The user state storage means 32 stores, as information indicating the user state, the user utterance duration since the start of the user utterance in progress, and the speech rate of the user utterance in progress. This user state is temporary information obtained during the dialogue currently in progress with the user (real-time information that is successively updated), and therefore differs from the user attribute information stored in the user information storage means 52 (information accumulated over multiple dialogues). In the present embodiment, both the user utterance duration and the speech rate are measured by the utterance generation means 25, stored in the user state storage means 32, and successively updated.
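The system state and user state described above are small records that are overwritten as the dialogue proceeds. A minimal sketch of how they might be modelled is given below; the class names, field names, status strings, and units are hypothetical and are not taken from the embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class SystemState:
    """Real-time system state, as sent by the next utterance preparation side."""
    preparation_status: str = "preparing"  # e.g. "ready" or a kind of "preparing"
    remaining_goal_data: int = 0           # goal data not yet conveyed
    candidate_importance: dict = field(default_factory=dict)  # candidate ID -> importance


@dataclass
class UserState:
    """Real-time user state for the user utterance in progress."""
    utterance_duration_s: float = 0.0  # time since the utterance started
    speech_rate: float = 0.0           # unit left open (assumed chars/s)

    def update(self, utterance_start, now, speech_rate):
        """Overwrite the state with the latest measurements."""
        self.utterance_duration_s = now - utterance_start
        self.speech_rate = speech_rate
```

Keeping these records separate from the accumulated user attribute information mirrors the distinction drawn above between temporary state and information gathered over multiple dialogues.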

<Configuration of dialogue server 40 / voice recognition processing means 41>

The voice recognition processing means 41 successively acquires, via the network 1, the audio signal of the user utterance acquired by the voice signal acquisition means 21, executes voice recognition processing on the acquired audio signal, and successively outputs the language information obtained as a result. It successively sends the output language information to the next utterance preparation means 43 via the dialogue state management means 42, and also successively transmits it via the network 1 to the system utterance timing detecting means 22, the next utterance selection information generation means 23, the next utterance selection means 24, and the utterance generation means 25.

The voice recognition processing by the voice recognition processing means 41 is executed asynchronously with the pattern recognition processing by which the system utterance timing detecting means 22 identifies whether the user utterance right is maintained or has ended.

Specifically, as shown in FIGS. 6 and 7, the voice recognition processing means 41 uses a function called short pause segmentation to divide the audio signal of the user utterance acquired by the voice signal acquisition means 21 into fine segments each time a short silent section appears, successively fixing the sections to be subjected to voice recognition. This allows voice recognition processing to be executed incrementally while a long voice input is segmented automatically. A section of the audio signal subjected to voice recognition under short pause segmentation (the duration of the corresponding voice recognition processing is shown by dotted lines in FIGS. 6 and 7) is shorter than a voice section determined by ordinary voice activity detection (VAD) (the duration of the corresponding voice recognition processing is shown by solid lines in FIGS. 6 and 7).
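The idea of cutting the input at short pauses can be illustrated on frame-level speech/non-speech flags. This is a simplified sketch, not the segmentation used by any particular recognizer: the function name, the frame-flag input, and the `min_pause_frames` parameter are assumptions.

```python
def short_pause_segments(is_speech, min_pause_frames=3):
    """Split frame-level speech flags into (start, end) segments, closing a
    segment once `min_pause_frames` consecutive non-speech frames are seen.
    `end` is exclusive and points just past the last speech frame."""
    segments = []
    start = None   # index of the first frame of the open segment
    silence = 0    # consecutive non-speech frames inside the open segment
    for i, flag in enumerate(is_speech):
        if flag:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_pause_frames:
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:  # input ended while a segment was open
        segments.append((start, len(is_speech) - silence))
    return segments


# 1 = speech frame, 0 = non-speech frame (hypothetical input).
FRAMES = [1, 1, 0, 1, 1, 0, 0, 0, 1, 0]
```

With a long pause requirement (`min_pause_frames=3`, closer to ordinary voice section detection) `FRAMES` yields two segments, while a short one (`min_pause_frames=1`) yields three shorter segments, which is the sense in which short pause segmentation produces recognition units finer than VAD sections.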

このショートポーズセグメンテーションの機能は、対話サーバ４０の音声認識処理手段４１の中に設けてもよく、あるいは、音声認識処理手段４１により、図示されない外部サーバにアクセスしてストリーミング音声認識を行うようにしてもよい。後者の場合は、例えば、グーグル・クラウド・スピーチＡＰＩ（ｈｔｔｐｓ：／／ｃｌｏｕｄ．ｇｏｏｇｌｅ．ｃｏｍ／ｓｐｅｅｃｈ／）のストリーミング音声認識等を用いることができる（非特許文献２参照）。 This short pause segmentation function may be provided within the voice recognition processing means 41 of the dialogue server 40, or the voice recognition processing means 41 may instead access an external server (not shown) to perform streaming voice recognition. In the latter case, for example, the streaming speech recognition of the Google Cloud Speech API (https://cloud.google.com/speech/) can be used (see Non-Patent Document 2).

＜対話サーバ４０／対話状態管理手段４２の構成＞ <Configuration of dialogue server 40 / dialogue state management means 42>

対話状態管理手段４２は、進行中のユーザとシステムとの間の対話状態を管理する処理を実行するものである。ここで、システム発話とユーザ発話との時間的な前後関係の説明を容易にするため、図６の最上部に示すように、最初のシステム発話をＳ（１）、最初のユーザ発話をＵ（１）とし、以降、Ｓ（２）、Ｕ（２）、Ｓ（３）、Ｕ（３）、…、Ｓ（Ｎ−１）、Ｕ（Ｎ−１）と対話が進み、直前のシステム発話をＳ（Ｎ）、進行中のユーザ発話をＵ（Ｎ）とし、さらにＵ（Ｎ）を音声区間Ｕ（Ｎ，１）、Ｕ（Ｎ，２）、Ｕ（Ｎ，３）、…に分割し、Ｕ（Ｎ，Ｋ）まで進んでいるものとする。そして、未来の新たな音声区間をＵ（Ｎ，Ｋ＋１）とする。なお、実際にはショートポーズセグメンテーションにより処理が進行するので、Ｕ（Ｎ，Ｋ＋１）よりも細かい区間で、新たな出力が得られる。 The dialogue state management means 42 executes a process of managing the state of the ongoing dialogue between the user and the system. To make the temporal order of system utterances and user utterances easier to describe, the following notation is used, as shown at the top of FIG. 6. The first system utterance is S (1) and the first user utterance is U (1); the dialogue then proceeds through S (2), U (2), S (3), U (3), ..., S (N-1), U (N-1). The immediately preceding system utterance is S (N) and the ongoing user utterance is U (N). U (N) is further divided into voice sections U (N, 1), U (N, 2), U (N, 3), ..., and is assumed to have progressed as far as U (N, K). The next, not yet observed voice section is U (N, K + 1). In practice, because processing proceeds by short pause segmentation, new output is obtained at a granularity finer than U (N, K + 1).

具体的には、図４に示すように、対話状態管理手段４２は、次発話選択手段２４からネットワーク１を介して送信されてくる選択結果を受信する処理を実行する。この選択結果は、次発話選択手段２４により選択された次発話Ｓ（Ｎ＋１）の内容データまたはその識別情報（例えば、シナリオＩＤ、発話節ＩＤ等）である。 Specifically, as shown in FIG. 4, the dialogue state management means 42 executes a process of receiving the selection result transmitted from the next utterance selection means 24 via the network 1. This selection result is the content data of the next utterance S (N + 1) selected by the next utterance selection means 24 or its identification information (for example, scenario ID, utterance clause ID, etc.).

そして、対話状態管理手段４２は、次発話選択手段２４からの選択結果の受信により、システム発話の開始タイミングが検出されたこと、すなわちユーザ発話権が終了したことを把握することができるので、その時点まで対話状態管理手段４２のメモリ（主メモリでよい）で保持していた言語情報、すなわちユーザ発話権が終了したユーザ発話Ｕ（Ｎ）の発話区間全体の内容データを、対話履歴記憶手段５０に記憶させる処理を実行する。この時点の前までには、図６の最上部に示すように、対話履歴記憶手段５０には直前のシステム発話Ｓ（Ｎ）までが保存されているので、これにユーザ発話Ｕ（Ｎ）が追加されることになる。 Then, the dialogue state management means 42 can grasp that the start timing of the system utterance is detected by receiving the selection result from the next utterance selection means 24, that is, that the user utterance right has ended. The language information held in the memory of the dialogue state management means 42 (which may be the main memory) until a time point, that is, the content data of the entire utterance section of the user utterance U (N) whose utterance right has expired is stored in the dialogue history storage means 50. Executes the process of storing in. Before this point, as shown at the top of FIG. 6, the dialogue history storage means 50 stores up to the immediately preceding system utterance S (N), so that the user utterance U (N) is stored in this. Will be added.

また、対話状態管理手段４２は、次発話選択手段２４からの選択結果の受信により、システム発話の開始タイミングが検出されたこと、すなわち次発話Ｓ（Ｎ＋１）の再生が開始されることを把握することができるので、選択結果として受信した次発話Ｓ（Ｎ＋１）の内容データを、対話履歴記憶手段５０に記憶させる処理を実行する。これにより、直前のシステム発話Ｓ（Ｎ）まで保存されていた対話履歴記憶手段５０には、ユーザ発話Ｕ（Ｎ）およびシステム発話Ｓ（Ｎ＋１）が追加されることになる。 Further, the dialogue state management means 42 grasps that the start timing of the system utterance is detected by receiving the selection result from the next utterance selection means 24, that is, the reproduction of the next utterance S (N + 1) is started. Therefore, the process of storing the content data of the next utterance S (N + 1) received as the selection result in the dialogue history storage means 50 is executed. As a result, the user utterance U (N) and the system utterance S (N + 1) are added to the dialogue history storage means 50 that has been stored up to the immediately preceding system utterance S (N).

さらに、対話状態管理手段４２は、次発話選択手段２４からの選択結果の受信により、システム発話の開始タイミングが検出されたこと、すなわち次発話Ｓ（Ｎ＋１）の再生が開始されることを把握することができるので、さらに次の次発話候補の準備処理を開始させるための準備開始指示情報を、次発話準備手段４３に送る処理を実行する。これにより、次発話準備手段４３によるシステム発話Ｓ（Ｎ＋２）についての複数の候補の準備処理が開始されることになる。 Further, the dialogue state management means 42 grasps that the start timing of the system utterance is detected by receiving the selection result from the next utterance selection means 24, that is, the reproduction of the next utterance S (N + 1) is started. Therefore, the process of sending the preparation start instruction information for starting the preparation process of the next utterance candidate to the next utterance preparation means 43 is executed. As a result, the preparation process of a plurality of candidates for the system utterance S (N + 2) by the next utterance preparation means 43 is started.
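
The three actions that the dialogue state management means 42 takes on receiving a selection result (storing U (N), storing S (N + 1), and requesting preparation of S (N + 2)) can be sketched as follows. This is an illustrative toy model only: the class, method, and field names are hypothetical, and in-memory lists stand in for the dialogue history storage means 50 and for the instructions sent to the next utterance preparation means 43.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class DialogueStateManager:
    """Toy stand-in for the dialogue state management means 42 (names are assumptions)."""
    history: List[Tuple[str, str]] = field(default_factory=list)  # plays the role of storage means 50
    pending_user_text: List[str] = field(default_factory=list)    # partial results of U(N) held in memory
    prep_requests: List[int] = field(default_factory=list)        # k for each "prepare S(k)" instruction
    selected_count: int = 0                                       # system utterances selected so far

    def start_dialogue(self) -> None:
        # At the start of the dialogue, preparation of S(1) is requested.
        self.prep_requests.append(1)

    def on_partial_recognition(self, text: str) -> None:
        # Recognition results arrive piece by piece while U(N) is in progress.
        self.pending_user_text.append(text)

    def on_selection_result(self, system_utterance: str) -> None:
        # Receiving a selection result implies the user's turn has ended.
        if self.pending_user_text:
            self.history.append(("U", " ".join(self.pending_user_text)))
            self.pending_user_text.clear()
        self.history.append(("S", system_utterance))
        self.selected_count += 1
        # Ask for candidates of the utterance after the one just selected.
        self.prep_requests.append(self.selected_count + 1)


manager = DialogueStateManager()
manager.start_dialogue()
manager.on_selection_result("Here is today's news.")   # S(1): no user utterance yet
manager.on_partial_recognition("tell me")
manager.on_partial_recognition("about sports")
manager.on_selection_result("In sports, ...")          # S(2), after U(1) has ended
```

In this sketch the history ends up holding S (1), U (1), S (2) in order, and preparation has been requested for S (1), S (2), and S (3).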

また、対話状態管理手段４２は、ユーザ発話Ｕ（Ｎ）の進行中には、音声認識処理手段４１から逐次出力される音声認識処理の結果としてのユーザ発話の言語情報を逐次受け取り、受け取った言語情報を、次発話準備手段４３に入替要否判断のために逐次送るとともに、対話状態管理手段４２のメモリ（主メモリでよい）に保持する。この際、音声認識処理手段４１から逐次受け取る音声認識処理の結果は、ショートポーズセグメンテーションによる短い区間についての音声認識処理の分であるから、Ｕ（Ｎ，１）、Ｕ（Ｎ，２）、Ｕ（Ｎ，３）、…よりも細かい区間についての音声認識処理の結果である（図６参照）。なお、音声認識処理手段４１から受け取った言語情報を、次発話準備手段４３に逐次送る際には、受け取った言語情報だけを送ってもよく、対話状態管理手段４２のメモリに保持しているその時点までのユーザ発話Ｕ（Ｎ）の全部の言語情報を送ってもよい。 While the user utterance U (N) is in progress, the dialogue state management means 42 also sequentially receives the language information of the user utterance that the voice recognition processing means 41 outputs as recognition results, sequentially sends the received language information to the next utterance preparation means 43 for the replacement necessity determination, and holds it in the memory (which may be the main memory) of the dialogue state management means 42. Because each result received from the voice recognition processing means 41 covers a short section produced by short pause segmentation, it is a recognition result for a section finer than U (N, 1), U (N, 2), U (N, 3), ... (see FIG. 6). When forwarding language information to the next utterance preparation means 43, the dialogue state management means 42 may send only the newly received language information, or it may send all the language information of the user utterance U (N) held in its memory up to that point.

また、対話状態管理手段４２は、上述したように、ユーザ発話Ｕ（Ｎ）の進行中には、音声認識処理手段４１から逐次受け取った言語情報を、次発話準備手段４３に入替要否判断のために逐次送るが、結果的に、それがユーザ発話Ｕ（Ｎ）の発話区間全体における最後の部分（または発話区間全体）であったとしても、次発話準備手段４３に送る。そして、次発話準備手段４３において、ユーザ発話Ｕ（Ｎ）の最後の部分を含めて入替要否判断を行い、システム発話Ｓ（Ｎ＋１）の複数の候補を、新しい複数の候補に入れ替えると判定し、入替の準備を行った場合には、それらの新しい複数の次発話候補（システム発話Ｓ（Ｎ＋１）の複数の候補）の内容データが、ネットワーク１を介して再生装置２０へ送信され、複数の次発話候補として次発話候補記憶手段３０に既に記憶されているシステム発話Ｓ（Ｎ＋１）の複数の候補が、新しい複数の候補に更新される。その後、システム発話タイミング検出手段２２によりシステム発話の開始タイミングが検出されたときには、次発話選択手段２４によりシステム発話Ｓ（Ｎ＋１）の新しい複数の候補のうちの１つが選択され、その選択結果が、ネットワーク１を介して対話状態管理手段４２に送信されてくるので、対話状態管理手段４２のメモリに保持されているユーザ発話Ｕ（Ｎ）の発話区間全体の内容データ、および選択結果として受信したシステム発話Ｓ（Ｎ＋１）の内容データを、対話履歴記憶手段５０に記憶させるとともに、システム発話Ｓ（Ｎ＋２）の複数の候補を準備するための準備開始指示情報を、次発話準備手段４３に送る。 As described above, while the user utterance U (N) is in progress, the dialogue state management means 42 sequentially sends the language information received from the voice recognition processing means 41 to the next utterance preparation means 43 for the replacement necessity determination. It does so even when that information later turns out to be the last part of the utterance section of U (N) (or the entire utterance section). If the next utterance preparation means 43, taking the last part of U (N) into account, determines that the plurality of candidates for the system utterance S (N + 1) should be replaced with a new plurality of candidates and prepares the replacement, the content data of the new next utterance candidates (candidates for S (N + 1)) is transmitted to the playback device 20 via the network 1, and the candidates for S (N + 1) already stored in the next utterance candidate storage means 30 are updated to the new candidates. When the system utterance timing detecting means 22 subsequently detects the start timing of the system utterance, the next utterance selection means 24 selects one of the new candidates for S (N + 1) and the selection result is transmitted to the dialogue state management means 42 via the network 1. The dialogue state management means 42 then stores, in the dialogue history storage means 50, the content data of the entire utterance section of U (N) held in its memory and the content data of S (N + 1) received as the selection result, and sends the next utterance preparation means 43 the preparation start instruction information for preparing a plurality of candidates for the system utterance S (N + 2).

なお、上記において、仮に、システム発話Ｓ（Ｎ＋１）の複数の候補を、新しい複数の候補に入れ替える準備処理に多少時間がかかった場合でも、システム状態記憶手段３１に記憶されている準備状態を示すステータスが準備中になるので、システム発話の開始タイミングは検出されないことから（図８のＰ９参照）、次発話候補記憶手段３０に記憶されているシステム発話Ｓ（Ｎ＋１）の複数の候補は更新されないまま保たれ（但し、入替の準備を開始した時点でクリアしてもよい。）、新しい複数の候補への入替を待つことになる。一方、準備処理にかなりの時間がかかった場合には、フィラーが挿入されるが（図８のＰ１０参照）、この場合も次発話候補記憶手段３０に記憶されているシステム発話Ｓ（Ｎ＋１）の複数の候補は更新されないまま保たれ（但し、入替の準備を開始した時点でクリアしてもよい。）、新しい複数の候補への入替を待つことになる。この場合、フィラーの挿入情報は、次発話選択手段２４から対話状態管理手段４２へ送信してもよく、送信しなくてもよいが、送信した場合でも、フィラーの挿入情報を受信した対話状態管理手段４２は、フィラーの挿入を、選択されたシステム発話Ｓ（Ｎ＋１）として取り扱うわけではないので、システム発話Ｓ（Ｎ＋２）の準備のための準備開始指示情報を次発話準備手段４３に送る処理は行わない。新しいシステム発話Ｓ（Ｎ＋１）の準備処理に時間がかかっているので、フィラーを挿入したのに、そのフィラーの挿入をもって、さらに次のシステム発話Ｓ（Ｎ＋２）の準備を開始するための処理を行うのは不合理だからである。但し、挿入したフィラーの情報を、システム発話Ｓ（Ｎ＋１）として取り扱うのではなく、システム発話Ｓ（Ｎ＋１）の準備用繋ぎ発話Ｓ（Ｎ＋１：準備）として対話履歴記憶手段５０に記憶させてもよい。 In the above, even if the preparation process for replacing the candidates for the system utterance S (N + 1) with new candidates takes some time, the status indicating the preparation state stored in the system state storage means 31 becomes "preparing", so the start timing of the system utterance is not detected (see P9 in FIG. 8). The candidates for S (N + 1) stored in the next utterance candidate storage means 30 are therefore kept without being updated (although they may be cleared when preparation for the replacement starts) and await replacement with the new candidates. If the preparation process takes a considerable amount of time, a filler is inserted (see P10 in FIG. 8); in this case too, the candidates for S (N + 1) stored in the next utterance candidate storage means 30 are kept without being updated (although they may be cleared when preparation for the replacement starts) and await replacement with the new candidates. The filler insertion information may or may not be transmitted from the next utterance selection means 24 to the dialogue state management means 42. Even when it is transmitted, the dialogue state management means 42 does not treat the filler insertion as the selected system utterance S (N + 1), and therefore does not send the next utterance preparation means 43 the preparation start instruction information for S (N + 2). The filler was inserted precisely because preparation of the new S (N + 1) is taking time, so it would be unreasonable to take that insertion as the trigger for starting preparation of the following system utterance S (N + 2). The inserted filler may, however, be stored in the dialogue history storage means 50, not as the system utterance S (N + 1) but as a bridging utterance S (N + 1: preparation) that covers the preparation of S (N + 1).
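
The handling of the "preparing" status and of fillers described above can be sketched as below. This is a simplified illustration under stated assumptions: the threshold `FILLER_AFTER_SEC`, the status strings, and the function names are hypothetical, and the actual conditions P9 and P10 of FIG. 8 may involve further inputs that this sketch ignores.

```python
from typing import List, Optional, Tuple

# Assumed threshold; the specification only says "a considerable amount of time".
FILLER_AFTER_SEC = 1.0


def decide_action(status: str, user_turn_ended: bool, waited_sec: float) -> str:
    """Return 'speak', 'filler', or 'wait' for the playback side (hypothetical logic)."""
    if not user_turn_ended:
        return "wait"
    if status == "preparing":
        # While candidates are being replaced, the start timing is not detected;
        # only a long wait leads to a filler.
        return "filler" if waited_sec >= FILLER_AFTER_SEC else "wait"
    return "speak"


def record(history: List[Tuple[str, str]], action: str, text: str, n: int) -> Optional[int]:
    """Store the event; return the index of the next utterance to prepare, or None."""
    if action == "speak":
        history.append((f"S({n})", text))
        return n + 1  # a real system utterance triggers preparation of the next one
    if action == "filler":
        # Optional: keep the filler as a bridging utterance, not as S(n) itself.
        history.append((f"S({n}:prep)", text))
    return None       # a filler (or waiting) never triggers preparation


history: List[Tuple[str, str]] = []
after_filler = record(history, decide_action("preparing", True, 1.5), "Well, let me see...", 3)
after_speech = record(history, decide_action("ready", True, 0.0), "Here is the answer.", 3)
```

Here the filler is logged but returns no preparation request, whereas the real utterance S (3) returns 4, the index of the next system utterance to prepare.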

また、対話状態管理手段４２は、対話の開始時には、最初のシステム発話Ｓ（１）についての準備開始指示情報を次発話準備手段４３に送るが、Ｓ（１）については、複数の候補を準備する必要はないので、次発話準備手段４３は、Ｓ（１）の準備開始指示情報を受け取った場合には、１つの次発話（システム発話Ｓ（１））の内容データを、ネットワーク１を介して再生装置２０へ送信し、次発話候補記憶手段３０にＳ（１）の内容データを記憶させればよい。Ｓ（１）の内容データを記憶させた時点で、ユーザは発話していないので、すぐにシステム発話タイミング検出手段２２によりシステム発話の開始タイミングが検出され、次発話選択手段２４によりＳ（１）が選択され、発話生成手段２５によりＳ（１）の再生が開始されることになる。また、次発話選択手段２４によりＳ（１）が選択されると、その選択結果がネットワーク１を介して対話状態管理手段４２へ送信されるので、対話状態管理手段４２は、選択結果として受信したシステム発話Ｓ（１）を対話履歴記憶手段５０に記憶させるとともに、システム発話Ｓ（２）の複数の候補の準備のための準備開始指示情報を次発話準備手段４３へ送る。なお、この時点で、ユーザは未だ発話していないので、対話履歴記憶手段５０へのユーザ発話の保存はない。 At the start of the dialogue, the dialogue state management means 42 sends the next utterance preparation means 43 the preparation start instruction information for the first system utterance S (1). Since there is no need to prepare a plurality of candidates for S (1), the next utterance preparation means 43, on receiving this instruction, need only transmit the content data of a single next utterance (the system utterance S (1)) to the playback device 20 via the network 1 and have it stored in the next utterance candidate storage means 30. Because the user is not speaking when the content data of S (1) is stored, the system utterance timing detecting means 22 immediately detects the start timing of the system utterance, the next utterance selection means 24 selects S (1), and the utterance generation means 25 starts reproducing S (1). When S (1) is selected, the selection result is transmitted to the dialogue state management means 42 via the network 1, so the dialogue state management means 42 stores the system utterance S (1) received as the selection result in the dialogue history storage means 50 and sends the next utterance preparation means 43 the preparation start instruction information for preparing a plurality of candidates for the system utterance S (2). At this point the user has not yet spoken, so no user utterance is saved to the dialogue history storage means 50.

＜対話サーバ４０／次発話準備手段４３の構成＞ <Structure of dialogue server 40 / next utterance preparation means 43>

次発話準備手段４３は、システム発話タイミング検出手段２２によるパターン認識処理の周期に依拠しないタイミングで、かつ、システム発話タイミング検出手段２２によりシステム発話の開始タイミングが検出される前に、題材データ記憶手段５１に記憶された題材データまたはネットワーク１を介して接続された外部システムである題材データ提供システム６０に記憶された題材データを用いるとともに、ユーザとシステムとの間の対話履歴情報の少なくとも一部および／または音声認識処理手段４１による進行中のユーザ発話についての途中までの音声認識処理の結果を用いて、システムの次発話の内容データ（本実施形態では、複数の次発話候補の内容データ）を取得または生成する準備処理を実行するものである。 The next utterance preparation means 43 executes a preparation process that acquires or generates the content data of the system's next utterance (in this embodiment, the content data of a plurality of next utterance candidates). It does so at a timing that does not depend on the cycle of the pattern recognition process performed by the system utterance timing detecting means 22, and before the system utterance timing detecting means 22 detects the start timing of the system utterance. The preparation process uses the subject data stored in the subject data storage means 51 or in the subject data providing system 60, an external system connected via the network 1, together with at least a part of the dialogue history information between the user and the system and/or the partial result of the voice recognition process performed so far by the voice recognition processing means 41 on the ongoing user utterance.

より詳細には、図４に示すように、次発話準備手段４３は、次発話候補初期準備手段４３Ａと、入替要否判断手段４３Ｂと、入替準備手段４３Ｃと、先行次発話候補情報記憶手段４３Ｄとを含んで構成されている。 More specifically, as shown in FIG. 4, the next utterance preparation means 43 comprises the next utterance candidate initial preparation means 43A, the replacement necessity determination means 43B, the replacement preparation means 43C, and the preceding next utterance candidate information storage means 43D.

次発話候補初期準備手段４３Ａは、対話状態管理手段４２からの準備開始指示情報を受け取ったときに、システムの複数の次発話候補の内容データを取得または生成する準備処理を実行するものである。 The next utterance candidate initial preparation means 43A executes a preparatory process for acquiring or generating content data of a plurality of next utterance candidates in the system when receiving preparation start instruction information from the dialogue state management means 42.

入替要否判断手段４３Ｂは、次発話候補初期準備手段４３Ａにより準備した複数の次発話候補の内容データ、または入替要否判断手段４３Ｂ自身により前回準備した複数の次発話候補の内容データを、別の複数の次発話候補の内容データに入れ替えるか否かを判断する処理を実行するものである。 The replacement necessity determination means 43B executes a process of determining whether the content data of the plurality of next utterance candidates prepared by the next utterance candidate initial preparation means 43A, or the content data of the plurality of next utterance candidates prepared the previous time by the replacement necessity determination means 43B itself, should be replaced with the content data of a different plurality of next utterance candidates.

入替準備手段４３Ｃは、入替要否判断手段４３Ｂにより入替が必要であると判断した場合に、現在、次発話候補記憶手段３０に記憶されている最新の複数の次発話候補の内容データとは別の複数の次発話候補の内容データを取得または生成する準備処理を実行するものである。 The replacement preparation means 43C is different from the latest content data of the plurality of latest utterance candidates currently stored in the next utterance candidate storage means 30 when the replacement necessity determination means 43B determines that the replacement is necessary. It executes a preparatory process for acquiring or generating content data of a plurality of next utterance candidates.

先行次発話候補情報記憶手段４３Ｄは、次発話候補初期準備手段４３Ａや入替準備手段４３Ｃによる準備処理を行って得られた複数の次発話候補の内容データ、すなわちネットワーク１を介して再生装置２０へ送信し、現在、次発話候補記憶手段３０に記憶されている最新の複数の次発話候補の内容データについての情報（先行情報）を記憶するものである。この先行情報は、入替要否判断手段４３Ｂによる判断処理を行う際に、先行する複数の次発話候補の内容を把握するために利用される。 The preceding utterance candidate information storage means 43D transfers the content data of a plurality of next utterance candidates obtained by performing the preparatory processing by the next utterance candidate initial preparation means 43A and the replacement preparation means 43C, that is, to the playback device 20 via the network 1. Information (preceding information) about the latest content data of a plurality of latest utterance candidates that has been transmitted and is currently stored in the next utterance candidate storage means 30 is stored. This prior information is used to grasp the contents of the plurality of preceding next utterance candidates when performing the determination process by the replacement necessity determination means 43B.

具体的には、次発話候補初期準備手段４３Ａは、システム発話Ｓ（Ｎ＋１）の準備開始指示情報を受け取ったときに、対話履歴記憶手段５０に記憶されているそれまでの対話履歴情報（システム発話Ｓ（Ｎ）までの対話履歴情報）の少なくとも一部、すなわちＳ（１）、Ｕ（１）、Ｓ（２）、Ｕ（２）、…、Ｓ（Ｎ）（図６の最上部を参照）の少なくとも一部を用いて、題材データ記憶手段５１または題材データ提供システム６０に記憶されている題材データやその構成要素の中から、システムの複数の次発話候補Ｓ（Ｎ＋１）の内容データを選択取得する処理を実行する。但し、最初のシステム発話Ｓ（１）の準備開始指示情報を受け取ったときには、選択取得するＳ（１）は、１つだけでよい。また、常に複数の次発話候補を選択取得しなければならないわけではなく、選択取得した次発話候補が、結果的に１つになる場合があってもよい。 Specifically, when the next speech candidate initial preparation means 43A receives the preparation start instruction information of the system speech S (N + 1), the dialogue history information (system speech) up to that point stored in the dialogue history storage means 50 is stored. At least a part of the dialogue history information up to S (N), that is, S (1), U (1), S (2), U (2), ..., S (N) (see top of FIG. 6). ) Is used to obtain the content data of a plurality of next speech candidates S (N + 1) of the system from the subject data and its components stored in the subject data storage means 51 or the subject data providing system 60. Execute the process of selective acquisition. However, when the preparation start instruction information of the first system utterance S (1) is received, only one S (1) may be selectively acquired. Further, it is not always necessary to selectively acquire a plurality of next utterance candidates, and the selected and acquired next utterance candidates may be one as a result.

一方、入替準備手段４３Ｃは、次発話候補記憶手段３０に複数の次発話候補Ｓ（Ｎ＋１）の内容データが既に記憶されている状態において、入替要否判断手段４３Ｂにより入替が必要であると判断した場合に、対話状態管理手段４２から逐次送られてくる音声認識処理手段４１による現在進行中のユーザ発話Ｕ（Ｎ）の音声認識処理の結果である言語情報を用いて、題材データ記憶手段５１または題材データ提供システム６０に記憶されている題材データやその構成要素の中から、システムの別の複数の次発話候補Ｓ（Ｎ＋１）の内容データを選択取得する処理を実行する。但し、常に複数の次発話候補を選択取得しなければならないわけではなく、選択取得した次発話候補が、結果的に１つになる場合があってもよいのは、上述した次発話候補初期準備手段４３Ａの場合と同様である。また、上記の説明では、現在進行中のユーザ発話Ｕ（Ｎ）の音声認識処理の結果を用いるとしているが、準備処理を行う時点では、対話状態管理手段４２から逐次送られてくる音声認識処理の結果がユーザ発話Ｕ（Ｎ）の発話区間全体における最後の部分（または発話区間全体）であるか否かは判らない場合があるので（あるいは、最後の部分であるか否かの区別をする必要はないので）、対話状態管理手段４２の説明で既に詳述したように、対話状態管理手段４２から受け取った音声認識処理の結果が、結果的に、ユーザ発話Ｕ（Ｎ）の発話区間全体における最後の部分（または発話区間全体）であった場合でも、この入替準備手段４３Ｃによる準備処理は実行される。この場合、結果的には、進行中のユーザ発話Ｕ（Ｎ）ではなく、対話履歴情報として対話履歴記憶手段５０に記憶されることになる発話終了後のユーザ発話Ｕ（Ｎ）の情報を用いていることになる。なお、入替準備手段４３Ｃは、ユーザ発話Ｕ（Ｎ）だけではなく、次発話候補初期準備手段４３Ａの場合と同様に、対話履歴情報Ｓ（１）、Ｕ（１）、…、Ｓ（Ｎ）を用いてもよい。 The replacement preparation means 43C operates when the content data of a plurality of next utterance candidates S (N + 1) is already stored in the next utterance candidate storage means 30 and the replacement necessity determination means 43B has determined that replacement is necessary. Using the language information sequentially sent from the dialogue state management means 42, that is, the results of the voice recognition process performed by the voice recognition processing means 41 on the ongoing user utterance U (N), it selects and acquires the content data of a different plurality of next utterance candidates S (N + 1) from the subject data and its components stored in the subject data storage means 51 or the subject data providing system 60. As with the next utterance candidate initial preparation means 43A, a plurality of candidates need not always be acquired; the selection may end up yielding a single candidate. The description above refers to the recognition results of the ongoing user utterance U (N). At the time of the preparation process, however, it may not be known whether the recognition result received from the dialogue state management means 42 is the last part of the utterance section of U (N) (or the entire utterance section), and the distinction need not be made. As already detailed in the description of the dialogue state management means 42, the preparation process by the replacement preparation means 43C is therefore executed even when the received result turns out to be the last part of U (N) (or the entire utterance section). In that case, the process ends up using not an ongoing user utterance but the completed user utterance U (N) that will be stored in the dialogue history storage means 50 as dialogue history information. Like the next utterance candidate initial preparation means 43A, the replacement preparation means 43C may use the dialogue history information S (1), U (1), ..., S (N) in addition to the user utterance U (N).
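
The interplay of the initial preparation (43A), the replacement necessity determination (43B), the replacement preparation (43C), and the stored preceding information (43D) can be sketched roughly as follows. The specification does not state the determination criterion in this passage, so the keyword-based topic matching used here is purely an assumption for illustration, as are the class name, the method names, and the sample subject data.

```python
from typing import Dict, List, Optional

# Hypothetical subject data keyed by topic (not from the specification).
SUBJECT_DATA: Dict[str, List[str]] = {
    "soccer": ["The match ended 2-1.", "The winning goal came late."],
    "weather": ["Rain is expected tomorrow.", "Temperatures will drop."],
}


class NextUtterancePreparer:
    """Toy model of the next utterance preparation means 43."""

    def __init__(self) -> None:
        self.preceding_topic: Optional[str] = None  # plays the role of 43D
        self.sent_candidates: List[str] = []        # what the playback side currently holds

    def _topic_of(self, text: str) -> Optional[str]:
        # Assumed criterion: the first known topic word found in the text.
        lowered = text.lower()
        return next((t for t in SUBJECT_DATA if t in lowered), None)

    def _send(self, topic: str) -> List[str]:
        # Remember what was sent so that later determinations can compare against it.
        self.preceding_topic = topic
        self.sent_candidates = list(SUBJECT_DATA[topic])
        return self.sent_candidates

    def initial_prepare(self, topic: str) -> List[str]:
        """Role of 43A: prepare candidates when the start instruction arrives."""
        return self._send(topic)

    def on_partial_result(self, partial_text: str) -> List[str]:
        """Roles of 43B and 43C: decide on replacement, then prepare if needed."""
        topic = self._topic_of(partial_text)
        if topic is not None and topic != self.preceding_topic:
            return self._send(topic)
        return self.sent_candidates


preparer = NextUtterancePreparer()
first = preparer.initial_prepare("soccer")
unchanged = preparer.on_partial_result("well, um")
replaced = preparer.on_partial_result("well, um, what about the weather")
```

Each partial recognition result is checked as it arrives, so the candidates can already have been swapped by the time the user finishes speaking.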

また、用意されている題材データには、様々な状態のものがあり、例えば、テキストデータだけの場合、テキストデータおよびそれに対応する音声データがある場合、それらのテキストデータや音声データに、映像データや静止画データ、あるいは楽曲データが付随している場合、付随させる映像データや静止画データ、あるいは楽曲データだけの場合等があり、更にはテキストデータにも様々な語調のものがある。このため、次発話候補初期準備手段４３Ａおよび入替準備手段４３Ｃは、必要な場合には、テキストデータの加工調整（例えば、語尾の調整、一部削除、結合・分割・組替・その他の編集等）、テキストデータから音声データ（例えばｗａｖファイル等）を生成する音声合成、動画や静止画の画質・サイズ調整といった各種の生成処理も行う。但し、システムの応答性を向上させる観点からは、次発話候補の準備処理自体に時間がかかることを避ける必要がある。準備処理は、ユーザ発話中に行い、原則として、ユーザ発話が終了する前に準備が完了していることが好ましいからである。従って、テキストデータの語調等の加工調整、音声合成処理、動画や静止画の画質・サイズ調整等は、予め実行しておき、それらの処理を実行済の題材データを、題材データ記憶手段５１または題材データ提供システム６０に用意しておくことが好ましい。 The prepared subject data comes in various states: text data only; text data with corresponding audio data; text or audio data accompanied by video data, still image data, or music data; or the accompanying video, still image, or music data alone. Text data also varies in tone. For this reason, the next utterance candidate initial preparation means 43A and the replacement preparation means 43C also perform, where necessary, various generation processes such as adjusting the text data (for example, adjusting sentence endings, partial deletion, merging, splitting, reordering, and other editing), speech synthesis that generates audio data (for example, wav files) from text data, and adjusting the image quality and size of video and still images. From the viewpoint of system responsiveness, however, the preparation of next utterance candidates must not itself take long, because the preparation is performed during the user utterance and should, in principle, be complete before the user utterance ends. It is therefore preferable to perform the tone adjustment of text data, speech synthesis, and image quality and size adjustment of video and still images in advance, and to keep the processed subject data ready in the subject data storage means 51 or the subject data providing system 60.

そして、次発話候補初期準備手段４３Ａによるシステム発話Ｓ（Ｎ＋１）の複数の候補の選択取得では、通常は、直前のシステム発話Ｓ（Ｎ）の内容が最も重要な選択用判断材料となるが、それよりも前のＳ（１）、Ｕ（１）、…、Ｓ（Ｎ−１）、Ｕ（Ｎ−１）も使用されることがある。例えば、シナリオデータ内における各構成要素（本実施形態では、主計画要素、副計画要素がある。）を予め定めた順序で再生していくときに、ユーザ発話の内容に応じて各構成要素の再生順序を変更する場合がある。この場合、例えば、１回再生した構成要素については、２度目の再生は行わないというルールがあれば、それまでにいずれの構成要素が再生されたのかを把握する必要があるので、それまでの対話履歴情報の全部を使用する必要がある。 Then, in the selection acquisition of a plurality of candidates of the system utterance S (N + 1) by the next utterance candidate initial preparation means 43A, the content of the immediately preceding system utterance S (N) is usually the most important selection judgment material. Earlier S (1), U (1), ..., S (N-1), U (N-1) may also be used. For example, when each component in the scenario data (in this embodiment, there are a main plan element and a sub plan element) is reproduced in a predetermined order, each component is played according to the content of the user's utterance. The playback order may be changed. In this case, for example, if there is a rule that a component that has been played once is not played a second time, it is necessary to know which component has been played by then. All dialogue history information should be used.
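
The rule mentioned above, that a component reproduced once is not reproduced again, shows why the whole dialogue history may be needed. A minimal sketch follows, assuming that each system entry in the history records the ID of the component that was reproduced; the component IDs, the history layout, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def played_components(history: List[Tuple[str, str]]) -> Set[str]:
    """Collect the IDs of all components the system has already reproduced."""
    return {component_id for speaker, component_id in history if speaker == "S"}


def select_candidates(order: List[str], history: List[Tuple[str, str]],
                      max_candidates: int = 2) -> List[str]:
    """Pick the next candidates in scenario order, skipping anything already played."""
    played = played_components(history)
    return [c for c in order if c not in played][:max_candidates]


# Hypothetical scenario with main and sub plan elements in their default order.
scenario_order = ["main-1", "sub-1a", "main-2", "sub-2a", "main-3"]
# The playback order was changed by the user, so "sub-1a" was skipped earlier.
history = [("S", "main-1"), ("U", "tell me the next point"), ("S", "main-2")]

candidates = select_candidates(scenario_order, history)
```

Looking only at the immediately preceding system utterance would not reveal that "main-1" has already been reproduced; scanning the full history does.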

また、入替準備手段４３Ｃによるシステム発話Ｓ（Ｎ＋１）の複数の候補の選択取得では、ユーザ発話Ｕ（Ｎ）が、例えば「さっき言っていたＸＸＸ選手について、別の情報が知りたいな。」、「その話は知っているから、さっきのＹＹＹ事件の話を詳しく聞きたいな。」等であれば、選択用判断材料として、そのユーザ発話Ｕ（Ｎ）の情報を使用することは勿論であるが、直前のシステム発話Ｓ（Ｎ）を使用せず、それよりも前の情報Ｓ（１）、Ｕ（１）、…、Ｓ（Ｎ−１）、Ｕ（Ｎ−１）を使用する場合もある。つまり、少し前（例えば数分前等）の対話履歴情報に基づき、ＸＸＸ選手やＹＹＹ事件について、どこまで話していたのかを把握し、それとは別の情報を、題材データ記憶手段５１または題材データ提供システム６０から選択取得する場合等がある。 Further, in the selection acquisition of a plurality of candidates of the system utterance S (N + 1) by the replacement preparation means 43C, the user utterance U (N), for example, "I want to know another information about the XXX player mentioned earlier." If, for example, "I know the story, I would like to hear the story of the YYY case in detail.", It goes without saying that the information of the user utterance U (N) is used as a judgment material for selection. However, when the previous system utterance S (N) is not used and the previous information S (1), U (1), ..., S (N-1), U (N-1) is used. There is also. In other words, based on the dialogue history information a little while ago (for example, a few minutes ago), it is possible to grasp how far the XXX player and the YYY incident have been talked about, and provide other information to the subject data storage means 51 or the subject data. In some cases, it is selectively acquired from the system 60.

また、題材データ記憶手段５１または題材データ提供システム６０に記憶されている題材データには、様々な種類のデータがあり、情報量の多少も異なっている。例えば、題材データがシナリオデータであれば、複数の構成要素により構成され、一方、題材データの中には、シナリオデータ内の１つの構成要素に相当するような比較的短い題材データも存在する。従って、次発話候補初期準備手段４３Ａや入替準備手段４３Ｃにより複数の次発話候補を「選択」することには、複数（多数）のシナリオデータの中から１つのシナリオデータを選択し、かつ、選択した１つのシナリオデータの中から１つ（Ｓ（１）の場合）または複数の構成要素を選択すること、既に選択されている１つのシナリオデータの中から複数の構成要素を選択すること、複数（多数）の比較的短い題材データの中から１つ（Ｓ（１）の場合）または複数の題材データを選択すること等が含まれる。 The subject data stored in the subject data storage means 51 or the subject data providing system 60 includes various types of data that also differ in the amount of information they carry. For example, subject data that is scenario data is composed of a plurality of components, whereas some subject data is relatively short and corresponds to a single component of scenario data. Accordingly, "selecting" a plurality of next utterance candidates by the next utterance candidate initial preparation means 43A or the replacement preparation means 43C includes: selecting one scenario data item from among many and then selecting one component (in the case of S (1)) or a plurality of components from it; selecting a plurality of components from a scenario data item that has already been selected; and selecting one item (in the case of S (1)) or a plurality of items from among many relatively short subject data items.

また、次発話候補初期準備手段４３Ａおよび入替準備手段４３Ｃは、準備処理で得られた複数（結果的に１つの場合もあり、また、Ｓ（１）の場合は１つである。）の次発話候補の内容データまたはそれらに加えてそれらの識別情報（例えば、シナリオＩＤ、発話節ＩＤ等）を、ネットワーク１を介して再生装置２０へ送信し、次発話候補記憶手段３０に記憶させる処理も実行する。２回目以降は、更新処理である。この更新により、次発話候補記憶手段３０に記憶されている次発話候補の内容データの数は、ユーザ発話の進行に伴って、例えば、図７中の中央部に示すように、Ｎ１，Ｎ２，Ｎ３のように変化する。また、入替要否判断手段４３Ｂにより入替が必要であるという判断結果が出た場合に、入替準備手段４３Ｃによる準備処理が開始されるが、この準備期間中は、図７中の中央部に示すように、次発話候補記憶手段３０に記憶されている複数の次発話候補の内容データを削除し、次発話候補の内容データの数をゼロにクリアしてもよく、あるいは、削除せずに維持し、ゼロにクリアしない処理を行ってもよい。 The next utterance candidate initial preparation means 43A and the replacement preparation means 43C also transmit the content data of the next utterance candidates obtained in the preparation process (a plurality, although the result may be a single candidate, and it is one in the case of S (1)), optionally together with their identification information (for example, scenario ID, utterance clause ID, etc.), to the playback device 20 via the network 1 and have it stored in the next utterance candidate storage means 30. From the second time onward, this is an update process. Through these updates, the number of next utterance candidate content data items stored in the next utterance candidate storage means 30 changes as the user utterance progresses, for example as N1, N2, N3, as shown in the central part of FIG. 7. When the replacement necessity determination means 43B determines that replacement is necessary, the preparation process by the replacement preparation means 43C starts. During this preparation period, the content data of the next utterance candidates stored in the next utterance candidate storage means 30 may be deleted so that the number of candidates is cleared to zero, as shown in the central part of FIG. 7, or it may be retained without being cleared.

さらに、次発話候補初期準備手段４３Ａおよび入替準備手段４３Ｃは、準備状態を示すステータス、目的データの残数および次発話候補の重要度（準備処理で得られた複数の次発話候補の内容データの各々の重要度）を、ネットワーク１を介して再生装置２０へ送信し、システム情報記憶手段３１に記憶させる処理も実行する。２回目以降は、更新処理である。 Further, the next utterance candidate initial preparation means 43A and the replacement preparation means 43C also execute a process of transmitting a status indicating the preparation state, the remaining number of target data, and the importance of the next utterance candidates (the importance of each piece of content data of the plurality of next utterance candidates obtained in the preparation process) to the playback device 20 via the network 1 and storing them in the system information storage means 31. The second and subsequent times are update processes.

なお、目的データの残数は、対話目的を達成するためのシステムの最終の次発話候補の内容データとなり得る目的データの残数であるが、次発話候補記憶手段３０に記憶させる次発話候補の内容データの数とは異なる。例えば、情報検索対話で、ユーザが自分の利用する飲食店を探すときには、飲食店のデータが目的データとなる。しかし、条件提示による絞り込みが進んでいない段階では、目的データ（例えば、飲食店のデータ等）は多数存在し、それらの目的データの全部を、次発話候補の内容データとして次発話候補記憶手段３０に記憶させるわけではなく、情報検索対話の初期の段階や途中の段階では、次発話候補記憶手段３０には、「何を食べたいですか？」、「費用はどれぐらいですか？」等が記憶されるだけである。そして、絞り込みが進んだ段階や絞り込みが完了した最終段階で、目的データ（例えば、飲食店のデータ等）は、次発話候補の内容データとして次発話候補記憶手段３０に記憶されることになる。従って、目的データの残数は、潜在的な次発話候補の内容データの数である。 The remaining number of target data is the remaining number of target data that can be the content data of the final next utterance candidate of the system for achieving the dialogue purpose, but the remaining number of the next utterance candidates stored in the next utterance candidate storage means 30. It is different from the number of content data. For example, when a user searches for a restaurant to be used in an information retrieval dialogue, the restaurant data becomes the target data. However, at the stage where the narrowing down by presenting the conditions has not progressed, there are many target data (for example, restaurant data, etc.), and all of the target data are used as the content data of the next utterance candidate, and the next utterance candidate storage means 30. In the initial stage or the middle stage of the information retrieval dialogue, the next utterance candidate storage means 30 includes "what do you want to eat?", "How much does it cost?", Etc. It is only remembered. Then, at the stage where the narrowing down is advanced or the final stage where the narrowing down is completed, the target data (for example, data of a restaurant or the like) is stored in the next utterance candidate storage means 30 as the content data of the next utterance candidate. Therefore, the remaining number of target data is the number of potential content data of the next utterance candidate.

先行次発話候補情報記憶手段４３Ｄには、例えば、次発話候補初期準備手段４３Ａおよび入替準備手段４３Ｃによる準備処理で得られた複数の次発話候補の内容データ（現在、次発話候補記憶手段３０に記憶されている最新の複数の次発話候補の内容データ）、それらの内容データについての各分野（例えば、ＩＴ・科学、テニス、野球等）、分野以外の属性（例えば、男性向け、１０代〜３０代向け等）、それらの内容データに含まれる１つまたは複数の重要度の高い単語等が記憶されている。 The preceding next utterance candidate information storage means 43D stores, for example, the content data of the plurality of next utterance candidates obtained in the preparation process by the next utterance candidate initial preparation means 43A and the replacement preparation means 43C (the latest content data of the plurality of next utterance candidates currently stored in the next utterance candidate storage means 30), the field of each piece of that content data (for example, IT / science, tennis, baseball, etc.), attributes other than the field (for example, for men, for people in their teens to thirties, etc.), and one or more words of high importance contained in that content data.

入替要否判断手段４３Ｂは、対話状態管理手段４２から逐次送られてくる音声認識処理手段４１によるユーザ発話Ｕ（Ｎ）の音声認識処理の結果である言語情報を受け取り、受け取った言語情報と、先行次発話候補情報記憶手段４３Ｄに記憶されている再生装置２０へ送信済の複数の次発話候補の内容データ（現在、次発話候補記憶手段３０に記憶されている最新の複数の次発話候補の内容データ）についての情報（先行情報）とを用いて、次発話の候補となる複数の次発話候補の内容データの少なくとも一部を入れ替えるか否かを逐次判定し、入れ替えると判定した場合には、その結果を入替準備手段４３Ｃに送る処理を実行する。 The replacement necessity determination means 43B receives the language information that is the result of the voice recognition processing of the user utterance U (N) by the voice recognition processing means 41 sequentially sent from the dialogue state management means 42, and receives the received language information and the received language information. Content data of a plurality of next utterance candidates transmitted to the playback device 20 stored in the preceding next utterance candidate information storage means 43D (currently, the latest plurality of next utterance candidates stored in the next utterance candidate storage means 30) Using the information (preceding information) about the content data), it is sequentially determined whether or not at least a part of the content data of the plurality of next utterance candidates that are candidates for the next utterance is to be replaced, and when it is determined to be replaced. , The process of sending the result to the replacement preparation means 43C is executed.

具体的には、入替要否判断手段４３Ｂは、現在までに、図６に示すＵ（Ｎ，Ｋ）までの音声認識処理の結果を用いた入替要否判断処理およびそれに伴う入替準備処理が行われていたとすると、例えば、新たに出力されたＵ（Ｎ，Ｋ＋１）（但し、ショートセグメンテーションであるから、正確には、その一部）の音声認識処理の結果である言語情報の中に、重要度の高い単語が含まれているか否かを判断する。ここで、単語の重要度としては、例えば、ＴＦ（Term Frequency：文書における単語の出現頻度）およびＩＤＦ（Inverse Document Frequency：逆文書頻度）によるＴＦ−ＩＤＦ、Ｏｋａｐｉ−ＢＭ２５等を採用することができ、予め算出して単語重要度記憶手段（不図示）に記憶しておけばよい。 Specifically, suppose that the replacement necessity determination process using the speech recognition results up to U (N, K) shown in FIG. 6, and the accompanying replacement preparation process, have been performed so far. The replacement necessity determination means 43B then determines, for example, whether the language information resulting from the speech recognition of the newly output U (N, K + 1) (strictly, a part of it, because short segmentation is used) contains a word of high importance. Here, the importance of a word can be, for example, TF-IDF, based on TF (Term Frequency: how often the word appears in a document) and IDF (Inverse Document Frequency), or Okapi BM25; it may be calculated in advance and stored in the word importance storage means (not shown).

そして、例えば、Ｕ（Ｎ，Ｋ＋１）（正確には、その一部）の中に、単語α，β，γが含まれていたとすると、これらの単語α，β，γの全ての重要度が、予め定めた重要度判定用閾値以下または未満であった場合（単語α，β，γがいずれも重要度の高い単語ではなかった場合）には、入替は不要であると判断すること等ができる。一方、単語α，β，γの中に重要度の高い単語が含まれていた場合には、その重要度の高い単語（例えば単語α）が、先行次発話候補情報記憶手段４３Ｄに記憶されている重要度の高い単語の中に含まれているか否かを判断し、含まれていれば、入替は不要であると判断すること等ができる。あるいは、単語α，β，γの中に重要度の高い単語が含まれていた場合には、その重要度の高い単語（例えば単語α）と、先行次発話候補情報記憶手段４３Ｄに記憶されている重要度の高い単語の各々との類似度を、例えばｗｏｒｄ２ｖｅｃやＧｌｏＶｅ等により求め、求めた各類似度のうちのいずれかが類似度判定用閾値以上または超過であった場合（類似する重度語の高い単語があった場合）には、入替は不要であると判断すること等ができる。また、上記のように単語α，β，γの中の重要度の高い単語（例えば単語α）と、先行次発話候補情報記憶手段４３Ｄに記憶されている１つまたは複数の重要度の高い単語とを用いて判断するのではなく、単語α，β，γの中の重要度の高い単語（例えば単語α）と、先行次発話候補情報記憶手段４３Ｄに記憶されている複数の次発話候補の内容データの全体（それらに含まれる全ての単語）とを用いて判断してもよい。 And, for example, suppose that U (N, K + 1) (strictly, a part of it) contains the words α, β, and γ. If the importance of every one of these words is at or below (or below) the predetermined importance determination threshold (that is, none of α, β, γ is a high-importance word), it can be determined that replacement is unnecessary. On the other hand, if α, β, and γ include a high-importance word, it is determined whether that word (for example, word α) is among the high-importance words stored in the preceding next utterance candidate information storage means 43D, and if it is, it can be determined that replacement is unnecessary. Alternatively, if α, β, and γ include a high-importance word, the similarity between that word (for example, word α) and each of the high-importance words stored in the preceding next utterance candidate information storage means 43D is obtained by, for example, word2vec or GloVe, and if any of the obtained similarities is at or above (or above) the similarity determination threshold (that is, a similar high-importance word exists), it can be determined that replacement is unnecessary. 
Further, instead of making the determination using a high-importance word (for example, word α) among the words α, β, and γ and the one or more high-importance words stored in the preceding next utterance candidate information storage means 43D as described above, the determination may be made using the high-importance word (for example, word α) and the entire content data of the plurality of next utterance candidates stored in the preceding next utterance candidate information storage means 43D (all the words contained in them).
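The determination described above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function name `needs_replacement`, the default threshold values, and the `similarity` callback (standing in for a word2vec or GloVe similarity) are all hypothetical. The text also leaves open how several high-importance words are combined, so the sketch assumes that a single uncovered high-importance word is enough to require replacement.

```python
from typing import Callable, Dict, Iterable, Optional, Set


def needs_replacement(
    new_words: Iterable[str],
    importance: Dict[str, float],
    prior_important_words: Set[str],
    importance_threshold: float = 0.5,
    similarity: Optional[Callable[[str, str], float]] = None,
    similarity_threshold: float = 0.7,
) -> bool:
    """Decide whether the stored next-utterance candidates should be replaced."""
    # Keep only words whose precomputed importance (e.g. TF-IDF) exceeds the threshold.
    important = [w for w in new_words if importance.get(w, 0.0) > importance_threshold]
    if not important:
        return False  # No high-importance word: replacement is unnecessary.

    def covered(word: str) -> bool:
        # Exact match against the high-importance words of the candidates already sent.
        if word in prior_important_words:
            return True
        # Optional similarity check (hypothetical callback, e.g. embedding cosine similarity).
        if similarity is not None:
            return any(
                similarity(word, prior) >= similarity_threshold
                for prior in prior_important_words
            )
        return False

    # Assumption: one uncovered high-importance word is enough to trigger replacement.
    return not all(covered(w) for w in important)
```

For instance, with stored high-importance words `{"skating", "flip"}`, a newly recognized high-importance word "baseball" would trigger replacement, whereas "skating" would not.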

また、上記の例の単語α，β，γの中の重要度の高い単語（例えば単語α）が、いずれの分野に属する単語であるかを判断し、判断した分野が（複数の分野でもよく、その場合には、いずれかの分野が）、先行次発話候補情報記憶手段４３Ｄに記憶されている１つまたは複数の分野（通常は１つの分野であることが多い。）の中に含まれていれば、入替は不要であると判断すること等ができる。なお、各単語（重要度の高い単語）と各分野（例えば、ＩＴ・科学、テニス、ゴルフ、エンタメ、政治経済、国際等）との対応関係は、予め定めて単語帰属分野記憶手段（不図示）に記憶しておけばよく、１つの単語が複数の分野に帰属していてもよい。この対応関係は、例えば、各分野の文書における各単語の出現頻度や、累積出現回数等により定めることができる。 Further, it is determined which field the highly important word (for example, word α) among the words α, β, γ in the above example belongs to, and the determined field (may be a plurality of fields). , In that case, any field) is included in one or more fields (usually often one field) stored in the preceding and next utterance candidate information storage means 43D. If so, it can be determined that replacement is unnecessary. The correspondence between each word (highly important word) and each field (for example, IT / science, tennis, golf, entertainment, political economy, international, etc.) is determined in advance and the word attribution field memory means (not shown). ), And one word may belong to a plurality of fields. This correspondence can be determined, for example, by the frequency of appearance of each word in a document in each field, the cumulative number of occurrences, and the like.

なお、題材データまたはその構成要素には、分野の識別情報が関連付けられている。分野の粒度は、システム設計者が適宜定めればよく、例えば、テニス、ゴルフ、野球等を別々の分野とするか、スポーツで１つの分野にまとめるか、あるいは、政治、経済を別々の分野とするか、１つにまとめるか等は任意である。１つの題材データまたはその構成要素は、複数の分野に帰属していてもよい。また、題材データまたはその構成要素が、女子プロゴルフの話題のみである場合に、例えば、女子プロゴルフ＜ゴルフ＜スポーツのように、包含関係にある分野の識別情報を全て関連付けるようにしてもよい。 Field identification information is associated with the subject data or its components. The granularity of the fields may be determined by the system designer as appropriate; for example, whether tennis, golf, baseball, etc. are treated as separate fields or combined into a single sports field, and whether politics and economy are separate fields or combined into one, is arbitrary. One piece of subject data or one of its components may belong to a plurality of fields. Further, when the subject data or its component covers only women's professional golf, the identification information of all fields in an inclusion relationship, for example women's professional golf < golf < sports, may be associated with it.
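The field-based check and the inclusion relationship between fields might be represented as below. This is only a sketch under assumptions: the `FIELD_PARENT` table, the function names, and the choice of a simple parent-link structure are hypothetical and are not taken from the specification.

```python
from typing import Dict, Set

# Hypothetical parent links for fields that are in an inclusion relationship.
FIELD_PARENT: Dict[str, str] = {
    "women's professional golf": "golf",
    "golf": "sports",
}


def expand_fields(field: str) -> Set[str]:
    """Return the field together with every broader field that contains it."""
    fields = {field}
    while field in FIELD_PARENT:
        field = FIELD_PARENT[field]
        fields.add(field)
    return fields


def replacement_unnecessary_by_field(
    word: str,
    word_fields: Dict[str, Set[str]],
    prior_fields: Set[str],
) -> bool:
    """True if any field the word belongs to is among the stored candidate fields."""
    return bool(word_fields.get(word, set()) & prior_fields)
```

Tagging subject data with the expanded set lets a query on the broad field "sports" still find an item that was originally labeled only with "women's professional golf".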

また、音声対話には、各種の目的の対話（例えば、ニュース対話、アンケート対話、ガイダンス対話、情報検索対話、操作対話、教育対話、情報特定対話等）があり、対話の進行も各種のタイプのものがある。対話の進行のタイプとの関係では、次のようになる。 In addition, voice dialogue includes dialogue for various purposes (for example, news dialogue, questionnaire dialogue, guidance dialogue, information retrieval dialogue, operation dialogue, educational dialogue, information specific dialogue, etc.), and the progress of the dialogue is also various types. There is something. In relation to the type of dialogue progression:

次発話候補初期準備手段４３Ａは、シナリオデータ（主計画および副計画を有する複雑な分岐を行うシナリオに限らず、より単純なシナリオも含む。）があり、シナリオとして予め定められた順序に従って対話を進めていく場合には、そのシナリオの順序に従って、複数の次発話候補を選択していく。この場合、入替準備手段４３Ｃにより、予め定められた順序が変更された場合には、その変更を反映させ、例えば、１回再生したシナリオ構成要素については、２度目の再生は行わないというルールがあれば、そのルールに従いつつ、当初の順序をなるべく維持した順序で複数の次発話候補を選択していく。 The next utterance candidate initial preparation means 43A has scenario data (not only a scenario in which a complicated branch having a main plan and a sub-plan is performed, but also a simpler scenario), and the dialogue is performed according to a predetermined order as a scenario. When proceeding, a plurality of next utterance candidates are selected according to the order of the scenarios. In this case, if the replacement preparation means 43C changes the predetermined order, the change is reflected. For example, there is a rule that the scenario component that has been played once is not played a second time. If there is, select multiple next utterance candidates in the order that maintains the original order as much as possible while following the rule.

また、シナリオデータがなく、対話の進行や分岐のパターンが予め定まっているわけではないが、システム発話の内容については、予定外の情報を外部システムから取得しなければならない場合を除き、予め用意されていて、毎回のユーザ発話の内容に従って、その都度、次のシステム発話の内容を定める場合がある。このような場合には、直前のシステム発話Ｓ（Ｎ）で、次のシステム発話Ｓ（Ｎ＋１）の複数の候補が定まることは少ない。なぜなら、Ｓ（Ｎ）でＳ（Ｎ＋１）の候補が定まるということは、結局、広い意味で、または部分的にシナリオが形成されていると考えることができるので、シナリオがない場合に該当しないからである。 In addition, there is no scenario data, and the pattern of dialogue progress and branching is not predetermined, but the content of system utterances is prepared in advance unless unscheduled information must be obtained from an external system. In some cases, the content of the next system utterance may be determined according to the content of each user utterance. In such a case, it is rare that a plurality of candidates for the next system utterance S (N + 1) are determined by the immediately preceding system utterance S (N). This is because the fact that the candidate for S (N + 1) is determined by S (N) does not apply when there is no scenario because it can be considered that the scenario is formed in a broad sense or partially. Is.

例えば、自動車のカーナビ操作のための操作対話において、システム発話Ｓ（Ｎ）＝「住所で目的地を設定しますか？」であった場合、次発話候補初期準備手段４３Ａにより、システム発話Ｓ（Ｎ＋１）＝「最初に都道府県を教えてください。」、「市町村を教えてください。」、「何丁目何番地ですか？」等を準備して次発話候補記憶手段３０に記憶させておく。そして、ユーザ発話Ｕ（Ｎ）が「はい。」であれば、次発話選択手段２４により「最初に都道府県を教えてください。」を選択して発話生成手段２５によりそれを再生し、「はい、東京都です。」であれば、「市町村を教えてください。」を選択して再生し、「はい、東京都新宿区です。」であれば、「何丁目何番地ですか？」を選択して再生する。この際、ユーザ発話Ｕ（Ｎ）が「はい、東京都新宿区です。」の途中の「はい、東京都…」まで進行した段階で、入替準備手段４３Ｃにより、システム発話Ｓ（Ｎ＋１）＝「市町村を教えてください。」、「何丁目何番地ですか？」等への入替が行われる場合（「最初に都道府県を教えてください。」が次発話候補から除かれている場合）もある。このような場合、部分的にシナリオが形成されていると考えることができ、そのシナリオに従って次発話候補初期準備手段４３Ａによる準備処理が行われているが、入替準備手段４３Ｃによる役割も大きい。 For example, in the operation dialogue for car navigation operation of a car, when system utterance S (N) = "Do you want to set the destination by address?", System utterance S ( N + 1) = "Please tell me the prefecture first.", "Please tell me the city, town, and village.", "What is the address?", Etc. are prepared and stored in the next utterance candidate storage means 30. Then, if the user utterance U (N) is "Yes.", The next utterance selection means 24 selects "Please tell me the prefecture first." And the utterance generation means 25 reproduces it, and "Yes." If ", Tokyo.", Select "Please tell me the city, town, and village." To play, and if "Yes, Shinjuku-ku, Tokyo.", Select "What chome and what address?" And play. At this time, when the user utterance U (N) has progressed to "Yes, Tokyo ..." in the middle of "Yes, Shinjuku-ku, Tokyo.", The system utterance S (N + 1) = "" by the replacement preparation means 43C. In some cases, the system may be replaced with "Please tell me the city, town, or village.", "What is the address of what chome?", Etc. ("Please tell me the prefecture first." Is excluded from the candidates for the next utterance). .. 
In such a case, it can be considered that a scenario is partially formed, and the preparation process is performed by the next utterance candidate initial preparation means 43A according to the scenario, but the replacement preparation means 43C also plays a large role.
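As a rough illustration of the car navigation example, the candidate set can be thought of as an ordered list of prompts, one per address part, from which the prompts for already supplied parts are dropped. The slot names, the English prompt wording, and the assumption that some upstream step has already extracted the filled slots from the partial recognition result are all hypothetical.

```python
from typing import List, Optional, Set, Tuple

# Hypothetical ordered prompts, one per address slot.
PROMPTS: List[Tuple[str, str]] = [
    ("prefecture", "Please tell me the prefecture first."),
    ("municipality", "Please tell me the city, town, or village."),
    ("street", "What is the block and house number?"),
]


def remaining_candidates(filled_slots: Set[str]) -> List[str]:
    """Candidates left after removing prompts for slots the user already supplied."""
    return [text for slot, text in PROMPTS if slot not in filled_slots]


def select_next_utterance(filled_slots: Set[str]) -> Optional[str]:
    """Pick the first remaining prompt, or None when every slot is filled."""
    candidates = remaining_candidates(filled_slots)
    return candidates[0] if candidates else None
```

Calling `remaining_candidates({"prefecture"})` while the user is still speaking corresponds to the replacement step in which the prefecture prompt is removed from the stored candidates.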

一方、シナリオデータがない場合は、次発話候補初期準備手段４３Ａによる準備処理よりも、入替準備手段４３Ｃによる準備処理が中心となる。最初はシナリオがあり、その後、フリートークに近い状態になる場合の後半の処理も同様である。そして、シナリオデータがない場合、次発話候補初期準備手段４３Ａは、対話状態管理手段４２からのＳ（Ｎ＋１）の準備開始指示情報を受け取ったときに、システム発話Ｓ（Ｎ＋１）の複数の候補を定めることができなければ、準備中のステータス（ステータス＝次発話候補検討中）を、ネットワーク１を介して再生装置２０へ送信してシステム状態記憶手段３１に記憶させ、入替準備手段４３Ｃに準備処理を任せることができる。このようにした場合は、入替準備手段４３Ｃによる準備処理は、入替の準備というより初期データの準備となるので、入替準備手段４３Ｃは、入替要否判断手段４３Ｂからの入替が必要であるという判断結果を受け取って準備処理を開始するのではなく、次発話候補初期準備手段４３Ａから転送されてくる準備開始指示情報を受け取って準備処理を開始することになる。この場合、入替要否判断手段４３Ｂによる判断処理は行われないので、入替準備手段４３Ｃは、対話状態管理手段４２から逐次送られてくる音声認識処理の結果を受け取り、入替要否判断手段４３Ｂによる重要度の高い単語の抽出処理に相当する処理を実行するが、この際の重要度判定用閾値は低く設定してもよい。なお、重要度判定用閾値を低く設定しても、重要度の高い単語が抽出されない場合には、進行中のユーザ発話Ｕ（Ｎ）の中に、未だシステム発話Ｓ（Ｎ＋１）の複数の候補の決定をするのに十分な情報（単語）が現れていないことになるので、ユーザ発話Ｕ（Ｎ）の進行を待つことになる。入替準備手段４３Ｃは、以上のような次発話候補初期準備手段４３Ａから転送されてくる準備開始指示情報を受け取った場合の初期データの準備処理を行い、複数の次発話候補の内容データを次発話候補記憶手段３０に記憶させた後には、通常通りの入替の準備処理（重要度判定用閾値も通常の設定とする。）を実行する。 On the other hand, when there is no scenario data, the preparation process by the replacement preparation means 43C is more central than the preparation process by the next utterance candidate initial preparation means 43A. The same applies to the latter half of the process when there is a scenario at first and then the state is close to free talk. Then, when there is no scenario data, the next utterance candidate initial preparation means 43A selects a plurality of candidates for the system utterance S (N + 1) when receiving the preparation start instruction information of S (N + 1) from the dialogue state management means 42. If it cannot be determined, the status under preparation (status = under consideration for next utterance candidate) is transmitted to the playback device 20 via the network 1 and stored in the system state storage means 31, and the replacement preparation means 43C prepares. Can be entrusted to you. In this case, the preparation process by the replacement preparation means 43C is not the preparation for the replacement but the preparation of the initial data. 
Therefore, the replacement preparation means 43C starts the preparation process not upon receiving a determination result from the replacement necessity determination means 43B that replacement is necessary, but upon receiving the preparation start instruction information transferred from the next utterance candidate initial preparation means 43A. In this case, the determination process by the replacement necessity determination means 43B is not performed, so the replacement preparation means 43C receives the speech recognition results sequentially sent from the dialogue state management means 42 and itself executes a process corresponding to the extraction of high-importance words by the replacement necessity determination means 43B; the importance determination threshold used here may be set low. If no high-importance word is extracted even with a low threshold, the ongoing user utterance U (N) does not yet contain enough information (words) to determine a plurality of candidates for the system utterance S (N + 1), so the replacement preparation means 43C waits for the user utterance U (N) to progress. After performing this initial data preparation in response to the preparation start instruction information transferred from the next utterance candidate initial preparation means 43A and storing the content data of the plurality of next utterance candidates in the next utterance candidate storage means 30, the replacement preparation means 43C executes the usual replacement preparation process (with the importance determination threshold also returned to its usual setting).

また、シナリオデータがない場合は、フリートークの状態に近いと考え、次発話候補初期準備手段４３Ａは、とりあえず、題材データ記憶手段５１や題材データ提供システム６０に記憶されている題材データの中からランダムに選択取得した複数の次発話候補の内容データを、ネットワーク１を介して再生装置２０へ送信して次発話候補記憶手段３０に記憶させてもよい。ランダムな選択取得を行っても、その後、ユーザ発話Ｕ（Ｎ）が進行すると、入替準備手段４３Ｃによる準備処理が行われ、複数の次発話候補の内容データが適切なものに入れ替えられる。仮に、ユーザ発話Ｕ（Ｎ）が進行しても、ランダムに選択取得した複数の次発話候補の内容データがそのまま維持されていたとすると、そのランダムな選択取得が適切であったということになる。なお、ランダムに選択取得する題材データが存在する（選択取得する範囲が定まっている）ということは、完全なフリートークではなく、システム発話の内容は、想定範囲外の情報を外部システムから取得しなければならない場合を除き、予め用意されていることになる。 If there is no scenario data, it is considered that the state is close to the free talk state, and the next utterance candidate initial preparation means 43A is randomly selected from the subject data stored in the subject data storage means 51 and the subject data providing system 60 for the time being. The content data of the plurality of next utterance candidates selected and acquired may be transmitted to the playback device 20 via the network 1 and stored in the next utterance candidate storage means 30. Even if random selection acquisition is performed, when the user utterance U (N) progresses thereafter, the preparation process is performed by the replacement preparation means 43C, and the content data of the plurality of next utterance candidates is replaced with appropriate ones. Even if the user utterance U (N) progresses, if the content data of the plurality of next utterance candidates randomly selected and acquired is maintained as it is, it means that the random selection acquisition is appropriate. In addition, the existence of subject data to be randomly selected and acquired (the range to be selectively acquired is fixed) is not a complete free talk, and the content of the system utterance must acquire information outside the expected range from the external system. Unless otherwise required, it will be prepared in advance.

前述したように、入替要否判断手段４３Ｂは、対話状態管理手段４２から逐次送られてくる新たな音声認識処理の結果を受け取り、この結果に含まれる単語のうち予め定められた重要度の高い単語を抽出する処理を実行するので、入替準備手段４３Ｃは、入替要否判断手段４３Ｂから、抽出された重要度の高い単語を受け取る。そして、入替準備手段４３Ｃは、この重要度の高い単語を用いて、予め定められた各単語と各分野との対応関係（単語帰属分野記憶手段（不図示）に記憶されている情報）から、ユーザの関心のある話題（分野）を決定し、題材データ記憶手段５１または題材データ提供システム６０に記憶されている題材データの中から、決定した話題（分野）に関連付けられて記憶されている題材データを選択し、次発話の候補となる別の複数の次発話候補の内容データを取得または生成する準備処理を実行することができる。 As described above, the replacement necessity determination means 43B receives the result of the new voice recognition process sequentially sent from the dialogue state management means 42, and among the words included in the result, the predetermined importance is high. Since the process of extracting words is executed, the replacement preparation means 43C receives the extracted words of high importance from the replacement necessity determination means 43B. Then, the replacement preparation means 43C uses the words of high importance from the correspondence relationship between each predetermined word and each field (information stored in the word attribution field storage means (not shown)). A topic (field) of interest to the user is determined, and the subject is stored in association with the determined topic (field) from the subject data stored in the subject data storage means 51 or the subject data providing system 60. It is possible to select data and execute a preparatory process for acquiring or generating content data of a plurality of other next speech candidates that are candidates for the next speech.
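One way to picture this preparation step is the sketch below. The specification only says that the topic is decided from the word-to-field correspondence; the majority-vote rule, the dictionary layout of the subject data (`{"id": ..., "fields": ...}`), the `limit` parameter, and the function names are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def decide_fields(
    important_words: Iterable[str],
    word_fields: Dict[str, Set[str]],
) -> Set[str]:
    """Pick the field(s) that the extracted high-importance words point to most often."""
    votes = Counter(
        field for word in important_words for field in word_fields.get(word, set())
    )
    if not votes:
        return set()
    best = max(votes.values())
    return {field for field, count in votes.items() if count == best}


def select_subject_data(
    subject_data: List[dict],
    fields: Set[str],
    limit: int = 3,
) -> List[dict]:
    """Select up to `limit` subject data items tagged with any of the decided fields."""
    return [item for item in subject_data if item["fields"] & fields][:limit]
```

Items returned by `select_subject_data` would then serve as the new next utterance candidates to be sent to the playback device.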

例えば、システム発話Ｓ（Ｎ）が「早稲田太郎選手が４回転フリップを成功させたよ。」であり、システム発話Ｓ（Ｎ＋１）の複数の候補として、「グランプリシリーズのカナダ大会で跳んだそうだ。」（主計画要素）、「早稲田太郎選手は、…」という早稲田太郎の人物の説明データ（副計画要素）、「４回転フリップっていうのは、…」という４回転フリップの技の説明データ（副計画要素）が、次発話候補記憶手段３０に記憶されているとする。このとき、ユーザ発話Ｕ（Ｎ）が「フィギュアスケートは興味がないので、野球の話が聞きたいんだけど…」、「つまらない、野球の方がおもしろいから…」であった場合には、入替準備手段４３Ｃは、再生中のシナリオデータ（分野＝アイススケート、または、分野＝スポーツ、アイススケート）の中に野球の話は全くないので、シナリオデータ自体を別の分野（この例では、野球の分野）に入れ替え、その入替後のシナリオデータ内の先頭の構成要素を、次発話候補とすることができる。なお、この場合は、Ｓ（Ｎ＋１）の候補ではあるが、シナリオデータの先頭からの再生となるので、Ｓ（１）と同等であるから、次発話候補は１つでよい。また、この場合、「興味がない」、「つまらない」、「もう飽きた」、「くだらない」、「話題を変えてほしい」、「その話はもういい」、「その話はやめて」、「ところで」、「話は変わるけど」、「そういえば」等の話題転換要求を伴っているため、シナリオデータ自体を入れ替える話題転換処理を行っている。また、明確な話題転換要求が無くても、例えば、ユーザ発話Ｕ（Ｎ）が「来週、日米野球があるけど、高校野球の…」のように、再生中のシナリオデータ内にない単語が繰り返される場合にシナリオデータ自体を入れ替える話題転換処理を行ってもよい。 For example, the system utterance S (N) is "Taro Waseda succeeded in a quadruple flip." And as multiple candidates for the system utterance S (N + 1), "It seems that he jumped at the Grand Prix series Canada tournament." (Main planning element), explanation data of Waseda Taro's person "Waseda Taro is ..." (secondary planning element), explanation data of quadruple flip technique "4 rotation flip is ..." (secondary plan element) It is assumed that the planning element) is stored in the next utterance candidate storage means 30. At this time, if the user's speech U (N) is "I'm not interested in figure skating, so I'd like to hear about baseball ..." or "It's boring, baseball is more interesting ...", prepare for replacement. Means 43C does not talk about baseball in the scenario data being played (field = ice skating, or field = sports, ice skating), so the scenario data itself is another field (in this example, the field of baseball). ), And the first component in the scenario data after the replacement can be the next speech candidate. 
In this case, although it is a candidate for S (N + 1), playback starts from the beginning of the scenario data, which is equivalent to S (1), so a single next utterance candidate is sufficient. Also, in this case, the utterance is accompanied by a topic change request such as "I'm not interested", "Boring", "I'm tired of it", "Silly", "I want you to change the topic", "Enough of that story", "Stop that story", "By the way", "To change the subject", or "Speaking of which", so a topic change process that replaces the scenario data itself is performed. Even without a clear topic change request, the topic change process that replaces the scenario data itself may be performed when words that are not in the scenario data being played are repeated, as in a user utterance U (N) such as "There's a Japan-US baseball series next week, but high school baseball ...".

一方、上記の例において、ユーザ発話Ｕ（Ｎ）が「早稲田次郎選手の方が好きなんだけど、早稲田次郎選手の成績は…」であった場合には、入替準備手段４３Ｃは、再生中のシナリオデータ内に早稲田次郎選手についての構成要素（主計画要素）も含まれているので、シナリオデータの入替は行わずに、同じシナリオデータ内での再生順序の変更を行う。例えば、「早稲田次郎」によるキーワードマッチングで「早稲田次郎選手は、４回転アクセルに挑戦したけど失敗したんだ。」（主計画要素）を選択するとともに、「早稲田次郎選手は、…」という早稲田次郎の人物の説明データ（副計画要素）、「４回転アクセルっていうのは、…」という４回転アクセルの技の説明データ（副計画要素）を選択し、次発話候補とする。 On the other hand, in the above example, when the user utterance U (N) is "I like Jiro Waseda better, but Jiro Waseda's results ...", the scenario data being played also contains a component (main planning element) about Jiro Waseda, so the replacement preparation means 43C does not replace the scenario data but changes the playback order within the same scenario data. For example, keyword matching on "Jiro Waseda" selects "Jiro Waseda attempted a quadruple Axel but failed." (main planning element), together with the description data of the person Jiro Waseda, "Jiro Waseda is ..." (sub-planning element), and the description data of the quadruple Axel technique, "A quadruple Axel is ..." (sub-planning element), and these become the next utterance candidates.

また、次発話候補初期準備手段４３Ａや入替準備手段４３Ｃによる準備処理において、位置データや時刻データを反映させてもよい。例えば、博物館や遺跡等の案内を行うガイダンス対話では、ユーザの位置データ（例えば、再生装置２０に設置されたＧＰＳ受信機や、再生装置２０が本体と端末とに分割されている場合の端末に設置されたＧＰＳ受信機で得られる位置データ等）を用いて、複数の次発話候補の内容データが定まるようにしてもよい。例えば、博物館のガイダンス対話において、予め登録されて対話サーバ４０のメモリに記憶されている展示物Ｘの位置データと、再生装置２０からネットワーク１を介して送信されてきたユーザの位置データとを用いて、ユーザが展示物Ｘのそばに近づいたことを検出し、さらに時刻データを用いて１２時近くであることを検出した時点で、「そろそろ展示物Ｘが見えてきます。」というシステム発話Ｓ（Ｎ）を行い、次発話候補Ｓ（Ｎ＋１）として「展示物Ｘは、・・・」、「食堂をご案内しましょうか。」等を用意して次発話候補記憶手段３０に記憶させておく。そして、ユーザ発話Ｕ（Ｎ）が「展示物Ｘの説明を聞きたいな。」であった場合には、「展示物Ｘは、・・・」を選択して再生し、「お腹すいた。」であった場合には、「食堂をご案内しましょうか。」を選択して再生する。また、時刻データを用いて、「外が暗くなってきたから、そろそろ帰り支度を始めましょう。」、「閉館時間が迫っているから、素早く回ろうね。」等を準備して次発話候補記憶手段３０に記憶させておくこともできる。 Further, the position data and the time data may be reflected in the preparation process by the next utterance candidate initial preparation means 43A and the replacement preparation means 43C. For example, in a guidance dialogue that guides museums, ruins, etc., the user's position data (for example, a GPS receiver installed in the playback device 20 or a terminal when the playback device 20 is divided into a main body and a terminal) The content data of a plurality of next speech candidates may be determined by using the position data obtained from the installed GPS receiver, etc.). For example, in the guidance dialogue of the museum, the position data of the exhibit X registered in advance and stored in the memory of the dialogue server 40 and the position data of the user transmitted from the playback device 20 via the network 1 are used. Then, when the user detects that he / she is approaching the exhibit X and further detects that it is near 12 o'clock using the time data, the system utterance S saying "It is about time to see the exhibit X." Perform (N), prepare "Exhibit X is ...", "Shall we guide you to the cafeteria?", Etc. as the next speech candidate S (N + 1), and store them in the next speech candidate storage means 30. deep. 
Then, when the user utterance U (N) is "I'd like to hear about exhibit X.", "Exhibit X is ..." is selected and played, and when it is "I'm hungry.", "Shall we guide you to the cafeteria?" is selected and played. Also, using the time data, utterances such as "It's getting dark outside, so let's start getting ready to leave." and "Closing time is approaching, so let's go around quickly." can be prepared and stored in the next utterance candidate storage means 30.
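A simplified sketch of how position and time data could feed candidate preparation is shown below. It assumes planar coordinates in meters rather than raw GPS latitude and longitude, and the 10 m radius, the 11:30 to 12:30 lunch window, and the function names are hypothetical values chosen only for illustration. It also evaluates the two conditions independently, which is a simplification of the example in the text.

```python
import math
from datetime import time
from typing import List, Tuple

Point = Tuple[float, float]  # Assumed planar (x, y) position in meters.


def is_near(user: Point, exhibit: Point, radius_m: float = 10.0) -> bool:
    """True when the user is within radius_m of the exhibit."""
    return math.hypot(user[0] - exhibit[0], user[1] - exhibit[1]) <= radius_m


def prepare_candidates(user: Point, exhibit: Point, now: time) -> List[str]:
    """Build next utterance candidates from position and time data."""
    candidates: List[str] = []
    if is_near(user, exhibit):
        candidates.append("Exhibit X is ...")
    # Hypothetical lunch window around 12 o'clock.
    if time(11, 30) <= now <= time(12, 30):
        candidates.append("Shall we guide you to the cafeteria?")
    return candidates
```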

さらに、次発話候補初期準備手段４３Ａや入替準備手段４３Ｃによる準備処理において、位置データや時刻データ以外の状態データ（変化の速度の大小の相違はあるが、原則として、時々刻々と変化するデータ）、例えば、温度データ、湿度データ、天候データ、高度データ等を用いて次発話候補の内容データを準備してもよい。例えば、「今日は暑いね。」、「今日は蒸すね。」、「今日は天気がいいね。」、「空気が薄くなってきたけど、大丈夫？」等の次発話候補を準備し、次発話候補記憶手段３０に記憶させておくことができる。なお、上記の例の天候データは、選択用判断材料としての天候データ（ユーザが操作する再生装置２０の所在地における晴・雨・曇り等のデータ）であるから、題材データとして用意されている「台風２８号が沖縄地方に接近しています。」、「ＸＸ地方に大雨洪水警報が出ていますので、ＹＹ川の氾濫に注意してください。」等の警報データとは異なる。つまり、例えば、雨という天候データに基づき、「今日は雨だけど、東京ドームは屋根があるから、野球の観戦はできるよ。」等の題材データが選択取得され、次発話候補記憶手段３０に記憶されることになる。 Further, in the preparation process by the next utterance candidate initial preparation means 43A and the replacement preparation means 43C, state data other than position data and time data (data that changes from moment to moment, although there is a difference in the speed of change). For example, the content data of the next utterance candidate may be prepared using temperature data, humidity data, weather data, altitude data, and the like. For example, prepare the next utterance candidates such as "It's hot today.", "It's steaming today.", "The weather is nice today.", "The air is getting thinner, is it okay?" It can be stored in the utterance candidate storage means 30. Since the weather data in the above example is weather data (data such as fine weather, rain, cloudiness, etc. at the location of the playback device 20 operated by the user) as a selection judgment material, it is prepared as subject data. It is different from the warning data such as "Typhoon No. 28 is approaching the Okinawa region." And "Be careful of the flooding of the YY river because a heavy rain flood warning has been issued in the XX region." That is, for example, based on the weather data of rain, subject data such as "It is raining today, but the Tokyo Dome has a roof, so you can watch baseball games." Is selected and acquired, and stored in the next speech candidate storage means 30. Will be done.

また、入替準備手段４３Ｃによる準備処理は、必ずしも複数の次発話候補の全部を入れ替える必要はなく、少なくとも一部の入替が行われればよい。例えば、最初の複数の次発話候補の内容データ（初期データ）または前回の入替後のデータが、次発話候補Ａ，Ｂ，Ｃであったとすると、入替後の次発話候補は、Ｄ，Ｅ，Ｆのように全部が入れ替わっていてもよく、Ａ，Ｂ，Ｄのように一部が入れ替わっていてもよい。また、入替後の次発話候補は、Ａ，Ｂ，Ｃ，Ｄ，Ｅのように候補が追加されて増えた状態となっていてもよく、Ａ，Ｂのように一部削除された状態となっていてもよい。 Further, in the preparation process by the replacement preparation means 43C, it is not always necessary to replace all of the plurality of next utterance candidates, and at least a part of the replacement may be performed. For example, if the content data (initial data) of the first plurality of next utterance candidates or the data after the previous replacement are the next utterance candidates A, B, and C, the next utterance candidates after the replacement are D, E, All of them may be replaced as in F, and some may be replaced as in A, B, and D. Further, the next utterance candidate after the replacement may be in a state in which candidates are added and increased, such as A, B, C, D, and E, and a state in which a part is deleted such as A and B. It may be.
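The full, partial, additive, and subtractive replacement cases listed above can all be expressed as a difference between the old and new candidate lists. The following is a small sketch; the function name `diff_candidates` and the returned dictionary layout are hypothetical.

```python
from typing import Dict, List


def diff_candidates(old: List[str], new: List[str]) -> Dict[str, List[str]]:
    """Describe a replacement as kept, added, and removed candidates."""
    old_set, new_set = set(old), set(new)
    return {
        "kept": [c for c in new if c in old_set],
        "added": [c for c in new if c not in old_set],
        "removed": [c for c in old if c not in new_set],
    }
```

With old candidates A, B, C and new candidates A, B, D, only D needs to be transmitted and only C needs to be deleted, which is one possible way to keep the update small.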

さらに、対話状態管理手段４２は、ユーザ発話Ｕ（Ｎ）の進行中において、各時点において、それまでのＵ（Ｎ）の全部を保持しているので、対話状態管理手段４２から入替要否判断手段４３Ｂに送る入替要否判断に用いるための音声認識処理の結果の長さは、自在に調整することができる。従って、ショートポーズセグメンテーションの単位の最新の音声認識結果だけとしてもよく、最新の音声認識結果を含めたある程度の時間長の音声認識結果としてもよく、対話状態管理手段４２に保持されているＵ（Ｎ）の音声認識結果の全部としてもよい。 Further, since the dialogue state management means 42 holds all of the U (N) up to that point at each time point while the user speech U (N) is in progress, the dialogue state management means 42 determines whether or not replacement is necessary. The length of the result of the voice recognition process for use in determining the necessity of replacement sent to the means 43B can be freely adjusted. Therefore, it may be only the latest voice recognition result of the unit of short pause segmentation, or it may be the voice recognition result of a certain time length including the latest voice recognition result, and the U (U) held in the dialogue state management means 42. It may be the whole voice recognition result of N).
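The three options for how much of the recognition result to pass on can be sketched as a simple window over the segments received so far. Here the "certain time length" is approximated by a count of recent segments, which is an assumption; the mode names and the function name are likewise hypothetical.

```python
from typing import List


def recognition_window(segments: List[str], mode: str = "latest", recent: int = 3) -> List[str]:
    """Select which recognized segments of U(N) are passed to the necessity check."""
    if mode == "latest":
        return segments[-1:]  # Only the newest segmentation unit.
    if mode == "recent":
        if recent < 1:
            raise ValueError("recent must be at least 1")
        return segments[-recent:]  # The newest unit plus some preceding context.
    if mode == "all":
        return list(segments)  # Everything recognized so far in U(N).
    raise ValueError(f"unknown mode: {mode}")
```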

<Configuration of dialogue server 40 / dialogue history storage means 50>

The dialogue history storage means 50 stores dialogue history information between the system and the user. Specifically, as shown at the top of FIG. 6, it stores the content data (text data) of system utterance S(1), the content data (text data) of user utterance U(1), and likewise the content data (text data) of S(2), U(2), S(3), U(3), ..., in the order of the dialogue. The dialogue may also begin with a user utterance. In the present embodiment, the user utterance U(N) in progress is held in the memory of the dialogue state management means 42 (main memory suffices), and after the utterance ends, the entire utterance section is stored in the dialogue history storage means 50.
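A minimal sketch of this split between the in-progress utterance and the stored history is shown below. The class and method names are hypothetical, and joining segments with a space is an assumption suited to the English examples here.

```python
class DialogueSession:
    """Keeps the ongoing user utterance apart from the stored dialogue history."""

    def __init__(self):
        self.history = []      # stands in for dialogue history storage means 50
        self.in_progress = []  # stands in for the memory of dialogue state management means 42

    def add_system_utterance(self, text):
        self.history.append(("S", text))

    def add_user_segment(self, text):
        # Partial recognition results accumulate while U(N) is still in progress.
        self.in_progress.append(text)

    def end_user_utterance(self):
        # Only after the utterance ends is the whole utterance written to the history.
        if self.in_progress:
            self.history.append(("U", " ".join(self.in_progress)))
            self.in_progress = []


session = DialogueSession()
session.add_system_utterance("S(1) text")
session.add_user_segment("I'm looking")
session.add_user_segment("forward to it.")
session.end_user_utterance()
print(session.history)
```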

<Configuration of dialogue server 40 / subject data storage means 51>

The subject data storage means 51 stores subject data. Subject data includes, for example: scenario data (not limited to scenarios with complex branching that have a main plan and sub-plans; simpler scenarios are also included); collections of various topic data gathering recent topics for which no scenario has been formed (each item of topic data being an independent, relatively short item of subject data); dictionary data; encyclopedia data; guidance data on how to use equipment, on facilities, and the like; questionnaire survey data; operation assistance data for equipment and devices; and educational data. Identification information of a field (for example, IT/science, politics/economy, international, entertainment, sumo, golf) is associated with each item of subject data or with its components. Subject data or components with no assigned field may be mixed in, in which case the necessary information is selected and acquired by keyword matching. The same applies to the subject data providing system 60.

<Configuration of dialogue server 40 / user information storage means 52>

The user information storage means 52 stores information on the occurrence of collisions between user utterances and system utterances, the system's turn-taking latency, and the user's utterance speed, in association with user identification information. Because the information stored here is per-user information accumulated over multiple dialogues with each user, it is user attribute information. It is not temporary information arising during a dialogue with the user, and thus differs from the information stored in the user state storage means 32. The collision occurrence information, the system's turn-taking latency, and the user's utterance speed are all obtained and recorded by the utterance generation means 25.

<Processing flow at speaker change from the user to the system: FIG. 5>

In the present embodiment configured as above, the speaker change from the user to the system is performed as follows.

In FIG. 5, first, before the dialogue starts, the user utterance right end determination threshold adjusting means 22G (see FIG. 2) of the system utterance timing detecting means 22 uses the user identification information (user ID) of the dialogue partner to acquire, from the user information storage means 52, the collision occurrence information (accumulated information), the turn-taking latency (accumulated information), and the utterance speed (accumulated information). It then performs a pre-adjustment of the user utterance right end determination threshold based on the collision occurrence information (see FIG. 11), a pre-adjustment of the downward adjustment threshold based on the utterance speed, and a pre-adjustment of the user utterance right end determination threshold based on the turn-taking latency using that downward adjustment threshold (see FIG. 12) (step S1).

Next, after the dialogue has started, the acoustic feature amount extracting means 22A (see FIG. 2) of the system utterance timing detecting means 22 performs frequency analysis and the like on the user's voice signal acquired by the voice signal acquisition means 21 and extracts acoustic feature amounts (step S2). Where necessary, the language feature amount extracting means 22B (see FIG. 2) also extracts language feature amounts from the language information obtained as the result of speech recognition by the voice recognition processing means 41.

Subsequently, the user utterance right end determination threshold adjusting means 22G (see FIG. 2) of the system utterance timing detecting means 22 acquires the user utterance duration from the user state storage means 32 and performs a real-time adjustment of the user utterance right end determination threshold (see FIG. 9). It also acquires the index value of the system's utterance motivation (the number of remaining target data items and/or the importance of the next utterance candidates) from the system state storage means 31 and performs a further real-time adjustment of the threshold (see FIG. 10) (step S3). Either one or both of these two real-time adjustments may be performed, and when both are performed, they may be performed in either order.

Then, the user utterance right end determination pattern recognizer 22C (see FIG. 2) of the system utterance timing detecting means 22 executes pattern recognition processing that identifies whether the user utterance right is maintained or ended, using the acoustic feature amounts extracted in step S2, or these together with the language feature amounts. It applies the user utterance right end determination threshold to the likelihood obtained by this pattern recognition processing and outputs the identification result, maintenance or end (step S4).
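The threshold determination of step S4 can be sketched as a single comparison. The text does not state which class the likelihood refers to or the direction of the comparison, so this sketch assumes the likelihood is that of the "end" class and that reaching the threshold yields "end"; the function name is hypothetical.

```python
def identify_utterance_right(end_likelihood, threshold):
    """Return "end" or "maintain" for the user utterance right.

    Assumption: end_likelihood is the recognizer's likelihood that the user
    has released the right to speak, and values at or above the threshold
    count as "end".
    """
    return "end" if end_likelihood >= threshold else "maintain"


print(identify_utterance_right(0.82, threshold=0.7))
print(identify_utterance_right(0.40, threshold=0.7))
```

Under this assumption, raising the threshold makes the system slower to take the turn, and lowering it makes the system quicker, which is the lever that the adjustments in steps S1 and S3 act on.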

In parallel with this, the next utterance selection information generation means 23 performs prosodic analysis using the acoustic feature amounts obtained in step S2 and identifies the user's utterance intention (step S5). As indicated by the two-dot chain line in FIG. 5, prosodic feature amounts may instead be extracted separately from step S2, from the user's voice signal acquired by the voice signal acquisition means 21, and the prosodic analysis and the identification of the user's utterance intention may be performed using those prosodic feature amounts.

Next, it is determined whether the identification result of step S4 is maintenance or end (step S6). If it is maintenance, the process returns to step S2. If it is end, the system utterance start timing determination means 22F (see FIG. 2) of the system utterance timing detecting means 22 acquires from the system state storage means 31 the status indicating the state of preparation (preparation complete, or one of the various in-preparation states) and, following the flow shown in FIG. 8, outputs a determination of whether or not it is the start timing of the system utterance (step S7 in FIG. 5).

Processing then branches on whether the output determination indicates the start timing of the system utterance (step S8). If it does not, the process returns to step S2. If it does (that is, the timing has been detected), the next utterance selection means 24 selects the content data of the next utterance from among the content data of the plural (though possibly only one) next utterance candidates stored in the next utterance candidate storage means 30, using the identification result of the user's utterance intention (question, backchannel, etc.) obtained by the next utterance selection information generation means 23 in step S5 and/or the language information (character string) resulting from speech recognition by the voice recognition processing means 41. It also transmits the selected content data of the next utterance, or its identification information (scenario ID, utterance clause ID, etc.), via the network 1 to the dialogue state management means 42 of the dialogue server 40 (step S9).

When step S9 is performed, the next utterance candidate storage means 30 already holds the content data of the plural (though possibly only one) next utterance candidates, prepared ahead of step S9 of the next utterance selection means 24 by the preparation process of the next utterance preparation means 43, which runs asynchronously with the flow of FIG. 5. If preparation is still in progress, a filler is inserted (see P10 in FIG. 8).

When the selection result of the next utterance selection means 24 is transmitted via the network 1 to the dialogue state management means 42, the dialogue history information stored in the dialogue history storage means 50 is updated, preparation start instruction information for the candidates for the utterance after next is issued to the next utterance preparation means 43, and the preparation process for those candidates, using the information in the dialogue history storage means 50, begins. In this sense, the flow of FIG. 5 and the processing of the next utterance preparation means 43 appear to form a single sequence. However, as shown in FIG. 4, the next utterance preparation means 43 executes its preparation process using the result (language information) of speech recognition by the voice recognition processing means 41, which runs asynchronously with the processing of the system utterance timing detecting means 22 that is central to FIG. 5. The processing of the next utterance preparation means 43 therefore cannot be drawn within the series of steps of FIG. 5.

Thereafter, the utterance generation means 25 reproduces the voice signal of the system utterance using the content data of the next utterance selected by the next utterance selection means 24 (step S10 in FIG. 5). Any accompanying video data, still image data, or music data is also reproduced. Speech synthesis is performed at this point if necessary, but the content data of the next utterance candidates stored in the next utterance candidate storage means 30 preferably already includes the voice data obtained by speech synthesis.

Then, it is determined whether or not the dialogue has ended (step S11); if it has not, the process returns to step S2.
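The branching of steps S2 through S11 can be traced with a short simulation. This is a sketch of the control flow only: the decisions of steps S6 and S8 are supplied as scripted inputs rather than computed, steps S4 and S5 (which run in parallel) are shown as one entry, and the function name is hypothetical.

```python
def trace_until_system_utterance(decisions):
    """Trace the steps of FIG. 5 visited until one system utterance is reproduced.

    decisions: iterable of (right_state, start_timing_ok) pairs, one per pass
    through the loop; right_state is "maintain" or "end".
    """
    trace = []
    for right_state, start_timing_ok in decisions:
        trace += ["S2", "S3", "S4/S5", "S6"]
        if right_state == "maintain":
            continue  # S6: utterance right maintained, back to S2
        trace += ["S7", "S8"]
        if not start_timing_ok:
            continue  # S8: not yet the start timing, back to S2
        trace += ["S9", "S10", "S11"]  # select, reproduce, check for end of dialogue
        break
    return trace


print(trace_until_system_utterance([("maintain", False), ("end", False), ("end", True)]))
```

Step S1 is omitted because it runs once before the dialogue starts, and the asynchronous preparation by the next utterance preparation means 43 is omitted because, as noted above, it lies outside the flow of FIG. 5.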

<Scenario data structure, scenario reproduction, and flow of the next utterance candidate preparation process: FIGS. 13 to 17>

<Scenario data structure: FIG. 13>

FIG. 13 shows a specific example of the data structure of a scenario consisting of a main plan and sub-plans. Such scenario data is similar to the scenario data described in Non-Patent Document 4 and is stored in the subject data storage means 51 as one kind of subject data usable by the dialogue system 10.

More specifically, this scenario data is generated from article data (document data) describing various topics such as news, columns, and history. The elements constituting the original document data are divided into main plan elements, which form a summary of the content of the original document data; sub-plan elements, which complement the main plan elements; and the remaining elements (omitted elements). The scenario data is configured to include, of these three kinds of elements, the main plan elements and the sub-plan elements, together with utterance plan information (information defining the reproduction order of the main plan elements and the branching to the sub-plan elements). When a scenario is generated from the original document data, it is acceptable for no omitted elements to result; that is, all elements other than the main plan elements may be assigned to sub-plan elements.

The main plan elements are generated by summarizing the original document data and converting it to colloquial form. The document is summarized through important sentence extraction, ordering, and sentence compression. First, important sentence extraction roughly extracts, sentence by sentence, the information forming the main points of the document. Next, ordering determines the order in which the extracted important sentences are presented. Sentence compression then shortens the sentences themselves. Finally, colloquialization rewrites the written language into conversational expressions. Note that, as already detailed in the description of the user utterance right end determination threshold adjusting means 22G of the system utterance timing detecting means 22, the sentence importance considered in important sentence extraction during scenario generation differs from the importance of the content data of the next utterance candidates stored in the system state storage means 31; it is not an importance that takes into account the urgency of disaster-prevention information, the magnitude of the impact on daily life, and the like.

The sub-plan elements are plan elements of system utterances that supplement the information of the main plan elements. They include supplementary explanation data based on content omitted from the main plan elements and answer data for anticipated questions. A sub-plan element is reproduced according to the content of the user utterance. Sentence compression and colloquialization are applied to the sub-plan elements as well.

In FIG. 13, the data (columns) constituting the scenario include: the document ID of the original article (document); the paragraph ID of each paragraph of the original document; the importance of each sentence of the original document (the sentence importance considered at scenario generation, adjusted for the urgency of disaster-prevention information, the magnitude of the impact on daily life, and the like; this is the importance stored in the system state storage means 31); information on whether or not the content of each sentence of the original document has been conveyed (not yet conveyed or already conveyed); the sentence ID of each sentence within a paragraph; the phrase ID of each phrase within a sentence; information indicating whether or not the phrase was selected as a component of the scenario (selected phrase); the presentation order of the phrases within the sentence; the utterance clause ID used for scenario reproduction; the name of the linked synthesized speech file (wav file, etc.) of the utterance clause; the colloquial expression; the pause between phrases; the content of the original phrase; the definition text used to answer the user's definition-type questions; the name of the linked synthesized speech file (wav file, etc.) of the definition; the trivia text; and so on. Although not illustrated in FIG. 13, the name of the linked synthesized speech file (wav file, etc.) of the trivia is also included. In addition, the colloquial expression is prepared at multiple levels of expression differing in tone (for example, "standard", which uses hearsay and assertive tones as appropriate; "hearsay", which uses only the hearsay tone; "assertive", which uses only the assertive tone; and "polite", which uses only the desu/masu style).
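One row of such scenario data might be modelled as follows. This is a sketch under stated assumptions: the field names and types are invented for illustration, only a subset of the columns is shown, and the file-naming pattern (document, paragraph, sentence, and utterance clause IDs joined by hyphens) is inferred from the file names quoted in the description of FIG. 14 rather than stated as a rule.

```python
from dataclasses import dataclass


@dataclass
class ScenarioRow:
    """One phrase-level row of scenario data (hypothetical field names)."""
    document_id: int
    paragraph_id: int
    sentence_id: int
    phrase_id: int
    importance: float          # importance of the sentence the phrase belongs to
    conveyed: bool             # whether the sentence content has been conveyed
    selected: bool             # whether the phrase was selected for the scenario
    presentation_order: int    # order of the phrase within the sentence
    utterance_clause_id: int
    colloquial_text: str
    original_phrase: str
    pause_sec: float = 0.0     # pause between phrases
    definition_text: str = ""  # answer to a definition-type question
    trivia_text: str = ""

    def clause_audio_file(self):
        # Assumed pattern, inferred from examples such as 1-1-1-1.wav.
        return (f"{self.document_id}-{self.paragraph_id}-"
                f"{self.sentence_id}-{self.utterance_clause_id}.wav")


row = ScenarioRow(document_id=1, paragraph_id=1, sentence_id=2, phrase_id=3,
                  importance=0.5, conveyed=False, selected=True,
                  presentation_order=1, utterance_clause_id=2,
                  colloquial_text="(colloquial text)", original_phrase="(original phrase)")
print(row.clause_audio_file())
```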

As described above, the name of the synthesized speech file is part of the data constituting the scenario. In addition, the synthesized speech file itself (meaning the file in which the voice data is recorded, not the text giving the file name) may be generated in advance and included in the scenario data. Doing so removes the need for speech synthesis during preparation by the next utterance preparation means 43 or during reproduction by the utterance generation means 25, which improves the responsiveness of the system.

In the above example, the importance (the importance stored in the system state storage means 31) is assigned per sentence of the original document, but it may instead be set at a finer granularity, per utterance clause.

Furthermore, the information on whether the content of each sentence of the original document has been conveyed (not yet conveyed or already conveyed) is updated in synchronization with the dialogue history information stored in the dialogue history storage means 50 (S(1), S(2), S(3), ... at the top of FIG. 6); this information, too, may be held per utterance clause. Because this conveyed/not-conveyed status is updated successively as the dialogue progresses, the scenario data stored in the subject data storage means 51 is not rewritten directly. Instead, the copy of the scenario data held in the memory of the dialogue state management means 42 (main memory suffices, though non-volatile memory may also be used) is rewritten. This is because the scenario data stored in the subject data storage means 51 may be in use in a dialogue with another user during the same period. The same applies to other types of scenario data that have no main plan and sub-plans.
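The reason for rewriting a copy rather than the stored scenario can be shown with a per-session deep copy. The store layout, the scenario ID, and the function names are assumptions made for illustration.

```python
import copy

# Stands in for the shared scenario data in subject data storage means 51.
SCENARIO_STORE = {
    "scenario-1": [
        {"sentence_id": 1, "conveyed": False},
        {"sentence_id": 2, "conveyed": False},
    ]
}


def open_session_copy(scenario_id):
    """Give each dialogue its own copy, as held by dialogue state management means 42."""
    return copy.deepcopy(SCENARIO_STORE[scenario_id])


def mark_conveyed(session_rows, sentence_id):
    """Update the conveyed flag in the session copy only."""
    for row in session_rows:
        if row["sentence_id"] == sentence_id:
            row["conveyed"] = True


user_a = open_session_copy("scenario-1")
user_b = open_session_copy("scenario-1")
mark_conveyed(user_a, 1)
print(user_a[0]["conveyed"], user_b[0]["conveyed"], SCENARIO_STORE["scenario-1"][0]["conveyed"])
```

A deep copy matters here because the rows are mutable; a shallow copy of the list would still share the row objects between users.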

<Outline of the progress of a voice dialogue using scenario data: FIG. 14>

FIG. 14 outlines the progress of a voice dialogue between the system and the user using the scenario data of FIG. 13. Details such as the preparation of next utterance candidates by the next utterance preparation means 43 are described later with reference to FIGS. 15 and 16, so the progress of the dialogue is explained here only in terms of the utterances that appear on the surface.

First, the utterance clause that is the first main plan element, "I hear Company α is developing game software for the 3DS that links with Suica and the like" (utterance clause ID = 1 in document ID = 1, paragraph ID = 1, sentence ID = 1; synthesized speech file = 1-1-1-1.wav), is reproduced. If, partway through this utterance clause (for example, immediately after "Company α" is reproduced), the user interrupts with the definition-type question "What kind of company is Company α?", the definition-type question-answer sub-plan element "Company α is ...", prepared in advance for the original phrase "Company α", is reproduced. The same sub-plan element is reproduced if the interruption comes at a different position within the utterance clause (for example, immediately after "game software" is reproduced). After the user's interruption, reproduction of the system utterance may resume from the position where the interruption occurred, as shown by the solid line in FIG. 14, or, as shown by the dotted line in FIG. 14, from a few phrases earlier or from the beginning of the utterance clause. If the user asks the definition-type question "What kind of company is Company α?" (an interruption) immediately after reproduction of the utterance clause has ended, the same sub-plan element "Company α is ..." is likewise reproduced, so the system response is the same whether the question comes partway through or immediately after the end.
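The three resume positions mentioned above (the interruption position, a few phrases earlier, or the start of the utterance clause) can be sketched as an index calculation over the phrases of the clause. The policy names and the default of two phrases are assumptions made for illustration.

```python
def resume_index(interrupted_at, policy, phrases_back=2):
    """Return the phrase index at which to resume after answering an interruption.

    interrupted_at: zero-based index of the phrase where the user interrupted.
    """
    if policy == "interrupt_position":
        return interrupted_at
    if policy == "phrases_back":
        # Never go back past the beginning of the utterance clause.
        return max(0, interrupted_at - phrases_back)
    if policy == "clause_start":
        return 0
    raise ValueError(f"unknown policy: {policy}")


for policy in ("interrupt_position", "phrases_back", "clause_start"):
    print(policy, resume_index(3, policy))
```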

Next, suppose the user utterance is "I'm looking forward to it." The utterance clause that is the second main plan element, "Based on the ride history read from the IC card," (utterance clause ID = 1 in document ID = 1, paragraph ID = 1, sentence ID = 2; synthesized speech file = 1-1-2-1.wav), is then reproduced. If, immediately after that reproduction ends, the user asks the definition-type question "What is an IC card?", the definition-type question-answer sub-plan element "An IC card is ...", prepared in advance for the original phrase "from the IC card", is reproduced.

Subsequently, the utterance clause that is the third main plan element, "you can apparently get points to use in the game" (utterance clause ID = 2 in document ID = 1, paragraph ID = 1, sentence ID = 2; synthesized speech file = 1-1-2-2.wav), is reproduced. If, immediately after that reproduction ends, the user responds "Oh, is that so.", the fourth main plan element (not illustrated in FIG. 13) is reproduced.

Furthermore, if the user interrupts while a sub-plan element is being reproduced, another sub-plan element is reproduced, so the reproduction of sub-plan elements can become hierarchical depending on the user's reactions. Depending on the content of the user utterance, the system may also give an unplanned answer that has not been prepared in the scenario as a sub-plan element.

The above is an outline of the progress of the dialogue. To realize such a dialogue, the dialogue system 10 specifically executes, for example, the processes shown in FIGS. 15 to 17. Note that FIG. 17 uses not the scenario data of FIG. 13 but scenario data of the same type, consisting of a main plan and sub-plans.

<Specific example of the next utterance candidate preparation process (1): FIG. 15>

In FIG. 15, the next utterance preparation means 43 receives preparation start instruction information for system utterance S(1) from the dialogue state management means 42 and prepares the next utterance candidate (here, the first utterance). Because this is the start of the dialogue, it does not select and acquire a plurality of next utterance candidates but selects and acquires the first main plan element from the scenario data. The next utterance preparation means 43 therefore selects and acquires S(1) = "I hear Company α is developing game software for the 3DS that links with Suica and the like." from the scenario data of FIG. 13 and stores it in the next utterance candidate storage means 30.

Next, since there is no user utterance yet, the system utterance timing detecting means 22 immediately detects the start timing of the system utterance. The next utterance selection means 24 selects S(1) = "I hear Company α is developing game software for the 3DS that links with Suica and the like.", stored in the next utterance candidate storage means 30, and the selection result is transmitted via the network 1 to the dialogue state management means 42. The dialogue state management means 42 saves the received S(1) in the dialogue history storage means 50 and sends preparation start instruction information for S(2) to the next utterance preparation means 43.

Then, the utterance generation means 25 starts reproducing S(1) = "I hear Company α is developing game software for the 3DS that links with Suica and the like.", selected by the next utterance selection means 24. In parallel, the next utterance preparation means 43 proceeds with the preparation of S(2). The next utterance candidates prepared for S(2) are: the second main plan element, "Based on the ride history read from the IC card,"; the definition-type question-answer sub-plan elements prepared to answer the user's definition-type questions about the first main plan element currently being reproduced, namely "Company α is ...", "Company β is ...", "The Company β 3DS is ...", "A transportation IC card is ...", "Suica is ...", "Linking is ...", "Game software is ...", "Development is ...", and "Announcement is ..."; and the supplementary-explanation sub-plan elements (trivia) prepared to respond to the user's requests for supplementary information about the first main plan element currently being reproduced, such as "The name Suica ..." and "The development originally ...". These are stored in the next utterance candidate storage means 30. The definition-type question-answer sub-plan element "Company β is ..." is information prepared for the original phrase "Company β's", which was not selected as part of the utterance clause of the first main plan element, but it is included as a next utterance candidate here because it lies within the range of what the user might associate.

その後、ユーザ発話Ｕ（１）＝「楽しみだね。」であったとすると、この場合は、次発話選択用情報生成手段２３により、ユーザ発話意図として、例えば「理解」等の識別結果が得られるので、次発話選択手段２４により選択される次発話は、Ｓ（２）＝２番目の主計画要素である「ＩＣカードから読み取った乗車履歴を基に、」となり、この選択結果がネットワーク１を介して対話状態管理手段４２に送信される。対話状態管理手段４２は、選択結果としてＳ（２）を受け取ると、自身のメモリに保持しているＵ（１）（音声認識処理手段４１による音声認識処理の結果として受け取り、保持している文字列）と、受け取ったＳ（２）とを対話履歴記憶手段５０に保存するとともに、Ｓ（３）の準備開始指示情報を次発話準備手段４３に送る。 After that, suppose the user utterance is U (1) = "I'm looking forward to it." In this case, the next utterance selection information generation means 23 yields an identification result such as "understanding" as the user utterance intention, so the next utterance selected by the next utterance selection means 24 is S (2) = the second main planning element, "Based on the boarding history read from the IC card,", and this selection result is transmitted to the dialogue state management means 42 via the network 1. On receiving S (2) as the selection result, the dialogue state management means 42 stores U (1) held in its own memory (the character string received and held as the result of the voice recognition processing by the voice recognition processing means 41) and the received S (2) in the dialogue history storage means 50, and sends the preparation start instruction information for S (3) to the next utterance preparation means 43.

それから、発話生成手段２５により、次発話選択手段２４により選択されたＳ（２）＝２番目の主計画要素である「ＩＣカードから読み取った乗車履歴を基に、」の再生が開始される。また、これと並行して、次発話準備手段４３により、Ｓ（３）の準備処理が進む。ここで準備されるＳ（３）の次発話候補は、３番目の主計画要素である「ゲーム内で使えるポイントが手に入るんだって。」と、現在再生中の２番目の主計画要素に対するユーザの定義型質問に答えるために用意する定義型質問応答の副計画要素である「ＩＣカードっていうのは、…」、「基にっていうのは、…」と、現在再生中の２番目の主計画要素に対するユーザの補足要求に応えるために用意する補足説明用の副計画要素（トリビア）である「ＩＣカードは、国際…」等であり、これらが次発話候補記憶手段３０に記憶される。 Then, the utterance generation means 25 starts reproducing S (2) = the second main planning element, "Based on the boarding history read from the IC card,", selected by the next utterance selection means 24. In parallel with this, the next utterance preparation means 43 proceeds with the preparation process for S (3). The next utterance candidates for S (3) prepared here are: the third main planning element, "Apparently you can get points that can be used in the game."; the sub-planning elements for definition-type question answering, prepared to answer the user's definition-type questions about the second main planning element currently being reproduced, namely "An IC card is ..." and "'Based on' means ..."; and the sub-planning elements for supplementary explanation (trivia), prepared to respond to the user's requests for supplementary information about the second main planning element currently being reproduced, such as "The IC card is international ...". These are stored in the next utterance candidate storage means 30.
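
The candidate preparation described in this example can be sketched as follows. This is only an illustrative sketch: the class `MainPlanElement`, its fields, and the function `prepare_candidates` are hypothetical names, and the scenario data layout is an assumption rather than the format actually used by the next utterance preparation means 43.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MainPlanElement:
    """One main planning element of a scenario (hypothetical layout)."""
    text: str
    # term -> sub-planning element for definition-type question answering
    definitions: dict[str, str] = field(default_factory=dict)
    # sub-planning elements for supplementary explanation (trivia)
    trivia: list[str] = field(default_factory=list)


def prepare_candidates(scenario: list[MainPlanElement], playing_index: int) -> list[str]:
    """Collect next utterance candidates while scenario[playing_index] is being reproduced."""
    candidates: list[str] = []
    # 1. The following main planning element, if one remains.
    if playing_index + 1 < len(scenario):
        candidates.append(scenario[playing_index + 1].text)
    current = scenario[playing_index]
    # 2. Definition-type answers for terms in the element being reproduced.
    candidates.extend(current.definitions.values())
    # 3. Trivia attached to the element being reproduced.
    candidates.extend(current.trivia)
    return candidates


scenario = [
    MainPlanElement(
        "It seems that company α is developing game software ...",
        definitions={"company α": "Company α is ...", "Suica": "Suica is ..."},
        trivia=["The name of Suica is ..."],
    ),
    MainPlanElement("Based on the boarding history read from the IC card,"),
]
candidates_for_s2 = prepare_candidates(scenario, playing_index=0)
```

Keeping the next main planning element first in the list mirrors the order in which the candidates are enumerated in the text; the order itself carries no meaning for selection in this sketch.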

＜次発話候補の準備処理の具体例（２）：図１６＞ <Specific example of preparation process for next utterance candidate (2): Fig. 16>

図１６には、ユーザ発話Ｕ（１）が定義型質問となり、システム発話Ｓ（２）として定義型質問応答の副計画要素が再生される場合の具体例が示されている。システム発話Ｓ（１）の再生、Ｓ（２）の複数の候補の準備までは、前述した図１５の場合と同様である。 FIG. 16 shows a specific example in which the user utterance U (1) becomes a definition type question and the subplanning element of the definition type question response is reproduced as the system utterance S (2). Reproduction of the system utterance S (1) and preparation of a plurality of candidates of S (2) are the same as in the case of FIG. 15 described above.

図１６において、ユーザ発話Ｕ（１）＝「α社って、どんな会社なの？」（割込でもよい）であったとすると、この場合は、次発話選択用情報生成手段２３により、ユーザ発話意図として、例えば「質問」等の識別結果が得られるが、このユーザ発話意図だけでは、いずれの質問なのか判らないので、次発話選択手段２４は、音声認識処理手段４１による音声認識処理の結果（言語情報）を用いたキーワードマッチング等により、いずれの質問なのか判別し、α社についての質問であることを把握する。従って、次発話選択手段２４により選択される次発話は、Ｓ（２）＝定義型質問応答の副計画要素である「α社は、…」となり、この選択結果がネットワーク１を介して対話状態管理手段４２に送信される。対話状態管理手段４２は、選択結果としてＳ（２）＝「α社は、…」を受け取ると、自身のメモリに保持しているＵ（１）＝「α社って、どんな会社なの？」（音声認識処理手段４１による音声認識処理の結果として受け取り、保持している文字列）と、受け取ったＳ（２）＝「α社は、…」とを対話履歴記憶手段５０に保存するとともに、Ｓ（３）の準備開始指示情報を次発話準備手段４３に送る。 In FIG. 16, suppose the user utterance is U (1) = "What kind of company is company α?" (this may be an interruption). In this case, the next utterance selection information generation means 23 yields an identification result such as "question" as the user utterance intention, but this intention alone does not reveal which question is being asked. The next utterance selection means 24 therefore determines which question it is by keyword matching or the like using the result (linguistic information) of the voice recognition processing by the voice recognition processing means 41, and recognizes that the question is about company α. Accordingly, the next utterance selected by the next utterance selection means 24 is S (2) = the sub-planning element for definition-type question answering "Company α is ...", and this selection result is transmitted to the dialogue state management means 42 via the network 1. On receiving S (2) = "Company α is ..." as the selection result, the dialogue state management means 42 stores U (1) = "What kind of company is company α?" held in its own memory (the character string received and held as the result of the voice recognition processing by the voice recognition processing means 41) and the received S (2) = "Company α is ..." in the dialogue history storage means 50, and sends the preparation start instruction information for S (3) to the next utterance preparation means 43.
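
The two-stage selection in this example (utterance intention first, then keyword matching on the recognition result) might look like the sketch below. The function name, the intention labels, and the fallback to the main planning element when no keyword matches are assumptions made for illustration; the text does not specify what happens in the no-match case.

```python
from __future__ import annotations


def select_next_utterance(
    intent: str,
    recognized_text: str,
    main_candidate: str,
    definition_candidates: dict[str, str],
) -> str:
    """Pick the next utterance from prepared candidates.

    definition_candidates maps a term to its definition-type answer.
    """
    if intent == "question":
        # The intention alone does not say which term is asked about,
        # so look for a prepared term inside the recognized text.
        for term, answer in definition_candidates.items():
            if term in recognized_text:
                return answer
    # Assumption: any other intention (or an unmatched question) moves on
    # to the next main planning element.
    return main_candidate


definitions = {"company α": "Company α is ...", "Suica": "Suica is ..."}
reply = select_next_utterance(
    "question",
    "What kind of company is company α?",
    "Based on the boarding history read from the IC card,",
    definitions,
)
```

Simple substring matching is used here only because the text mentions "keyword matching or the like"; a real system would likely normalize the recognized text first.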

それから、発話生成手段２５により、次発話選択手段２４により選択されたＳ（２）＝定義型質問応答の副計画要素である「α社は、…」の再生が開始される。また、これと並行して、次発話準備手段４３により、Ｓ（３）の準備処理が進む。この準備処理では、次発話準備手段４３は、対話履歴記憶手段５０を参照し、未だ２番目の主計画要素である「ＩＣカードから読み取った乗車履歴を基に、」が再生されていないことを確認することができる。従って、ここで準備されるＳ（３）の次発話候補は、２番目の主計画要素である「ＩＣカードから読み取った乗車履歴を基に、」と、再生を終えている１番目の主計画要素に対するユーザの定義型質問に答えるために用意する定義型質問応答の副計画要素である「α社は、…」、「β社は、…」、「β社３ＤＳっていうのは、…」、「交通系ＩＣカードっていうのは、…」、「Ｓｕｉｃａっていうのは、…」、「連携っていうのは、…」、「ゲームソフトっていうのは、…」、「開発っていうのは、…」、「発表っていうのは、…」と、再生を終えている１番目の主計画要素に対するユーザの補足要求に応えるために用意する補足説明用の副計画要素（トリビア）である「Ｓｕｉｃａの名称は…」、「開発は、もともと…」等であり、これらが次発話候補記憶手段３０に記憶される。従って、結果的に、副計画を再生した場合は、次発話候補を維持することになる。 Then, the utterance generation means 25 starts reproducing S (2) = the sub-planning element for definition-type question answering "Company α is ...", selected by the next utterance selection means 24. In parallel with this, the next utterance preparation means 43 proceeds with the preparation process for S (3). In this preparation process, the next utterance preparation means 43 can refer to the dialogue history storage means 50 and confirm that the second main planning element, "Based on the boarding history read from the IC card,", has not yet been reproduced. Accordingly, the next utterance candidates for S (3) prepared here are: the second main planning element, "Based on the boarding history read from the IC card,"; the sub-planning elements for definition-type question answering, prepared to answer the user's definition-type questions about the first main planning element, whose reproduction has finished, namely "Company α is ...", "Company β is ...", "The company β 3DS is ...", "A transportation IC card is ...", "Suica is ...", "Linking is ...", "Game software is ...", "Development is ...", and "An announcement is ..."; and the sub-planning elements for supplementary explanation (trivia), prepared to respond to the user's requests for supplementary information about the first main planning element, whose reproduction has finished, such as "The name of Suica is ..." and "Development was originally ...". These are stored in the next utterance candidate storage means 30. As a result, when a sub-plan element has been reproduced, the next utterance candidates are simply maintained.

この際、再生を終えている１番目の主計画要素に対するユーザの質問を想定した準備を行うのは、Ｕ（１）＝「α社って、どんな会社なの？」というユーザの定義型質問に対し、システムが、Ｓ（２）＝定義型質問応答の副計画要素である「α社は、…」を再生した後に、さらにユーザが、「Ｓｕｉｃａって、何？」という定義型質問をする場合があるからである。また、上述したように、結果的に次発話候補を維持するだけでもよいが、Ｓ（２）＝定義型質問応答の副計画要素である「α社は、…」を再生すると、その後、ユーザから「α社は、…」の中の用語について、さらに定義型質問が行われる場合があるので、シナリオデータ内に、「α社は、…」という定義型質問応答の副計画要素の中の用語について、更なる定義型質問応答の副計画要素が用意されていれば、それをＳ（３）の次発話候補に含めて準備してもよい。 Here, preparation that anticipates user questions about the first main planning element, whose reproduction has finished, is performed because, after the system reproduces S (2) = the sub-planning element for definition-type question answering "Company α is ..." in response to the user's definition-type question U (1) = "What kind of company is company α?", the user may go on to ask another definition-type question such as "What is Suica?". As described above, it is sufficient simply to maintain the next utterance candidates. However, once S (2) = "Company α is ..." has been reproduced, the user may then ask further definition-type questions about terms within "Company α is ...". Therefore, if the scenario data contains further sub-planning elements for definition-type question answering about terms in the sub-planning element "Company α is ...", they may be included among the next utterance candidates for S (3).
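
The history check used in this example, in which the preparation step looks up which main planning element has not yet been reproduced, can be sketched as below. The function name and the representation of the dialogue history as a flat list of utterance strings are assumptions for illustration only.

```python
from __future__ import annotations


def next_unplayed_main_element(main_plan: list[str], history: list[str]) -> str | None:
    """Return the first main planning element that does not appear in the history."""
    played = set(history)
    for element in main_plan:
        if element not in played:
            return element
    return None  # every main planning element has been reproduced


main_plan = [
    "It seems that company α is developing game software ...",
    "Based on the boarding history read from the IC card,",
    "Apparently you can get points that can be used in the game.",
]
# After S (1), U (1) and the sub-planning element S (2) have been stored:
history = [
    "It seems that company α is developing game software ...",
    "What kind of company is company α?",
    "Company α is ...",
]
upcoming = next_unplayed_main_element(main_plan, history)
```

Because the sub-planning element "Company α is ..." is not part of the main plan, reproducing it does not advance the main plan, which is why the candidates for S (3) end up the same as those for S (2).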

続いて、ユーザ発話Ｕ（２）＝「なるほど。」であったとすると、この場合は、次発話選択用情報生成手段２３により、ユーザ発話意図として、例えば「理解」、「相槌」等の識別結果が得られるので、次発話選択手段２４により選択される次発話は、Ｓ（３）＝２番目の主計画要素である「ＩＣカードから読み取った乗車履歴を基に、」となり、この選択結果がネットワーク１を介して対話状態管理手段４２に送信される。対話状態管理手段４２は、選択結果としてＳ（３）を受け取ると、自身のメモリに保持しているＵ（２）＝「なるほど。」（音声認識処理手段４１による音声認識処理の結果として受け取り、保持している文字列）と、受け取ったＳ（３）とを対話履歴記憶手段５０に保存するとともに、Ｓ（４）の準備開始指示情報を次発話準備手段４３に送る。 Subsequently, suppose the user utterance is U (2) = "I see." In this case, the next utterance selection information generation means 23 yields an identification result such as "understanding" or "backchannel" as the user utterance intention, so the next utterance selected by the next utterance selection means 24 is S (3) = the second main planning element, "Based on the boarding history read from the IC card,", and this selection result is transmitted to the dialogue state management means 42 via the network 1. On receiving S (3) as the selection result, the dialogue state management means 42 stores U (2) = "I see." held in its own memory (the character string received and held as the result of the voice recognition processing by the voice recognition processing means 41) and the received S (3) in the dialogue history storage means 50, and sends the preparation start instruction information for S (4) to the next utterance preparation means 43.

それから、発話生成手段２５により、次発話選択手段２４により選択されたＳ（３）＝２番目の主計画要素である「ＩＣカードから読み取った乗車履歴を基に、」の再生が開始される。また、これと並行して、次発話準備手段４３により、Ｓ（４）の準備処理が進む。ここで次発話準備手段４３により準備されるＳ（４）の次発話候補は、３番目の主計画要素である「ゲーム内で使えるポイントが手に入るんだって。」と、現在再生中の２番目の主計画要素に対するユーザの定義型質問に答えるために用意する定義型質問応答の副計画要素である「ＩＣカードっていうのは、…」、「基にっていうのは、…」と、現在再生中の２番目の主計画要素に対するユーザの補足要求に応えるために用意する補足説明用の副計画要素（トリビア）である「ＩＣカードは、国際…」等であり、これらが次発話候補記憶手段３０に記憶される。 Then, the utterance generation means 25 starts the reproduction of S (3) = the second main planning element selected by the next utterance selection means 24, "based on the boarding history read from the IC card". In parallel with this, the preparation process of S (4) proceeds by the next utterance preparation means 43. Here, the next utterance candidate of S (4) prepared by the next utterance preparation means 43 is the third main planning element, "You can get points that can be used in the game." "The IC card is ...", "The basis is ...", which are the sub-planning elements of the defined question response prepared to answer the user's defined question for the second main planning element. "IC card is international ..." which is a sub-planning element (trivia) for supplementary explanation prepared to respond to the user's supplementary request for the second main planning element currently being played, and these are candidates for the next utterance. It is stored in the storage means 30.

＜次発話候補の準備処理の具体例（３）：図１７＞ <Specific example of preparation process for next utterance candidate (3): Fig. 17>

図１７には、次発話候補の入替が行われる具体例が示されている。但し、図１７は、図１３のシナリオデータではなく、同型の別のシナリオデータ（不図示）を用いている。 FIG. 17 shows a specific example in which the next utterance candidate is replaced. However, FIG. 17 uses another scenario data (not shown) of the same type instead of the scenario data of FIG.

図１７において、次発話準備手段４３は、対話状態管理手段４２からのシステム発話Ｓ（１）の準備開始指示情報を受け取り、次発話候補（但し、ここでは最初の発話）の準備処理を行う。対話開始時であるから、複数の次発話候補を選択取得するのではなく、シナリオデータ内から１番目の主計画要素を選択取得する。ここでは、次発話準備手段４３は、シナリオデータから、Ｓ（１）＝「早稲田太郎選手が１００ｍ平泳ぎで優勝したんだ。」を選択取得し、これを次発話候補記憶手段３０に記憶させる。 In FIG. 17, the next utterance preparation means 43 receives the preparation start instruction information of the system utterance S (1) from the dialogue state management means 42, and performs the preparation process for the next utterance candidate (however, here, the first utterance). Since it is the start of the dialogue, the first main planning element is selected and acquired from the scenario data, instead of selectively acquiring a plurality of next utterance candidates. Here, the next utterance preparation means 43 selectively acquires S (1) = "Taro Waseda won the 100m breaststroke" from the scenario data, and stores this in the next utterance candidate storage means 30.

続いて、ユーザ発話は未だ無い状態なので、システム発話タイミング検出手段２２により、直ぐにシステム発話の開始タイミングが検出され、次発話選択手段２４により、次発話候補記憶手段３０に記憶されているＳ（１）＝「早稲田太郎選手が１００ｍ平泳ぎで優勝したんだ。」が選択されるとともに、その選択結果がネットワーク１を介して対話状態管理手段４２に送信される。対話状態管理手段４２は、受け取ったＳ（１）を対話履歴記憶手段５０に保存するとともに、Ｓ（２）の準備開始指示情報を次発話準備手段４３に送る。 Subsequently, since there is no user utterance yet, the system utterance timing detection means 22 immediately detects the start timing of the system utterance, and the next utterance selection means 24 selects S (1) = "Taro Waseda won the 100m breaststroke.", which is stored in the next utterance candidate storage means 30, and transmits the selection result to the dialogue state management means 42 via the network 1. The dialogue state management means 42 stores the received S (1) in the dialogue history storage means 50, and sends the preparation start instruction information for S (2) to the next utterance preparation means 43.

それから、発話生成手段２５により、次発話選択手段２４により選択されたＳ（１）＝「早稲田太郎選手が１００ｍ平泳ぎで優勝したんだ。」の再生が開始される。また、これと並行して、次発話準備手段４３により、Ｓ（２）の準備処理が進む。ここで準備されるＳ（２）の次発話候補は、２番目の主計画要素である「オーストラリアで開催された国際大会での快挙なんだ。」と、現在再生中の１番目の主計画要素に対するユーザの定義型質問に答えるために用意する定義型質問応答の副計画要素である「早稲田太郎選手は、…」と、現在再生中の１番目の主計画要素に対するユーザの補足要求に応じるために用意する補足説明用の副計画要素（トリビア）である「去年の優勝者は、…」等であり、これらが次発話候補記憶手段３０に記憶される。 Then, the utterance generation means 25 starts reproducing S (1) = "Taro Waseda won the 100m breaststroke." selected by the next utterance selection means 24. In parallel with this, the next utterance preparation means 43 proceeds with the preparation process for S (2). The next utterance candidates for S (2) prepared here are: the second main planning element, "It was a feat at an international competition held in Australia."; the sub-planning element for definition-type question answering, prepared to answer the user's definition-type questions about the first main planning element currently being reproduced, "Taro Waseda is ..."; and the sub-planning elements for supplementary explanation (trivia), prepared to respond to the user's requests for supplementary information about the first main planning element currently being reproduced, such as "Last year's winner was ...". These are stored in the next utterance candidate storage means 30.

その後、ユーザ発話Ｕ（１）＝「僕は、平泳ぎよりも、バタフライの方が興味あるんだよ。バタフライの・・・」であったとすると、次発話準備手段４３は、例えば、Ｕ（１）＝「僕は、平泳ぎよりも、バタフライの方が興味あるんだよ。バタフライの」という途中までの情報（音声認識処理手段４１による音声認識処理の結果である文字列）を、対話状態管理手段４２から得ることになる。従って、次発話準備手段４３は、このようなＵ（１）の途中までの音声認識処理の結果に基づき、平泳ぎからバタフライへの話題の転換要求（２回のバタフライという単語の出現、あるいは、「…よりもバタフライの方が興味ある」）を捉え、次発話候補の入替が必要であると判断し、次発話候補の入替のための準備処理を実行する。そして、次発話準備手段４３は、使用中のシナリオデータ内にバタフライのデータ（構成要素であり、主計画要素でも副計画要素でもよい）が存在する場合には、それを選択取得する。また、使用中のシナリオデータ内にバタフライのデータが存在しない場合には、題材データ記憶手段５１に記憶されている別のシナリオデータや、シナリオになっていない別の題材データの中からバタフライのデータを探し、それでも該当データが見つからない場合には、ネットワーク１を介して外部システムである題材データ提供システム６０にアクセスし、該当データを探す。その間は、ステータス＝準備中となる。この際、次発話準備手段４３は、題材データ記憶手段５１および題材データ提供システム６０のいずれを検索する場合でも、分野が関係付けられている題材データについては、先ず分野（例えば、スポーツおよび／または水泳）を用いた絞り込み検索を行うことができ、分野が関係付けられていない題材データについては、バタフライの語によるキーワード検索を行う。 After that, suppose the user utterance is U (1) = "I'm more interested in the butterfly than the breaststroke. The butterfly ...". The next utterance preparation means 43 then obtains from the dialogue state management means 42 partial information such as U (1) = "I'm more interested in the butterfly than the breaststroke. The butterfly" (a character string that is the result of the voice recognition processing by the voice recognition processing means 41). Based on this partial recognition result for U (1), the next utterance preparation means 43 detects a request to change the topic from the breaststroke to the butterfly (from the word "butterfly" appearing twice, or from "more interested in the butterfly than ..."), determines that the next utterance candidates need to be replaced, and executes the preparation process for replacing them. If butterfly data (a component, which may be a main planning element or a sub-planning element) exists in the scenario data in use, the next utterance preparation means 43 selects and acquires it. If no butterfly data exists in the scenario data in use, it searches for butterfly data in other scenario data stored in the subject data storage means 51 and in other subject data that has not been made into a scenario; if the relevant data is still not found, it accesses the subject data providing system 60, an external system, via the network 1 and searches there. During this time, the status is "preparing". Whether searching the subject data storage means 51 or the subject data providing system 60, the next utterance preparation means 43 can first perform a narrowed search by field (for example, sports and/or swimming) for subject data associated with a field, and performs a keyword search for the word "butterfly" for subject data not associated with a field.
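
The topic-change detection and the ordered search (scenario in use, then the subject data storage means 51, then the external subject data providing system 60) can be sketched as follows. All function names, the cue phrase, the dictionary layout of a subject data item, and the sample items are hypothetical; the sketch also combines field narrowing and keyword matching in one pass for brevity.

```python
from __future__ import annotations


def detect_topic_shift(partial_text: str, current_topic: str, known_topics: list[str]) -> str | None:
    """Return a new topic if the partial recognition result requests one, else None."""
    for topic in known_topics:
        if topic == current_topic:
            continue
        # Two cues taken from the example: the word appears twice, or an
        # explicit "more interested in ..." phrase (hypothetical wording).
        if partial_text.count(topic) >= 2 or f"more interested in the {topic}" in partial_text:
            return topic
    return None


def search_source(source: list[dict], topic: str, fields: set[str]) -> list[str]:
    """Search one source; field-tagged items are first narrowed by field."""
    hits = []
    for item in source:
        if "field" in item and item["field"] not in fields:
            continue  # tagged with a field outside the requested ones
        if topic in item["text"]:
            hits.append(item["text"])
    return hits


def find_topic_data(topic: str, sources: list[list[dict]], fields: set[str]) -> list[str]:
    """Try each source in order and stop at the first one that yields data."""
    for source in sources:
        hits = search_source(source, topic, fields)
        if hits:
            return hits
    return []


scenario_in_use = [{"text": "Taro Waseda won the 100m breaststroke.", "field": "swimming"}]
stored_subject_data = [
    {"text": "Jiro Waseda placed sixth in the 100m butterfly.", "field": "swimming"},
    {"text": "A butterfly garden has opened.", "field": "nature"},
]
external_system = [{"text": "Other butterfly results ..."}]

partial = "I'm more interested in the butterfly than the breaststroke. The butterfly"
new_topic = detect_topic_shift(partial, "breaststroke", ["breaststroke", "butterfly"])
found = find_topic_data(
    "butterfly", [scenario_in_use, stored_subject_data, external_system], {"sports", "swimming"}
)
```

Stopping at the first source that returns data reflects the order given in the text, where the external system is consulted only when local data is not found.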

そして、バタフライの結果が見つかった場合には、次発話準備手段４３により準備されるＳ（２）の次発話候補は、Ｓ（２）＝「１００ｍバタフライでは、早稲田次郎選手が６位入賞だったんだ。」、「２００ｍバタフライでは、早稲田三郎選手が残念ながら予選落ちしたんだ。」等となり、これらが次発話候補記憶手段３０に記憶される。一方、見つからなかった場合には、例えば、Ｓ（２）＝「ごめんね。バタフライの結果は知らないんだ。」、「バタフライじゃなくて、背泳ぎの結果なら知ってるよ。」等を準備し、次発話候補記憶手段３０に記憶させる。 Then, when the result of the butterfly is found, the next utterance candidate of S (2) prepared by the next utterance preparation means 43 is S (2) = "In the 100m butterfly, Waseda Jiro won the 6th place. "," Saburo Waseda unfortunately failed in the qualifying in the 200m butterfly. ", Etc., and these are stored in the next utterance candidate storage means 30. On the other hand, if you cannot find it, prepare, for example, S (2) = "I'm sorry. I don't know the result of the butterfly stroke.", "I know the result of the backstroke, not the butterfly stroke." It is stored in the utterance candidate storage means 30.

最終的に、ユーザ発話Ｕ（１）が終了し、Ｕ（１）＝「僕は、平泳ぎよりも、バタフライの方が興味あるんだよ。バタフライの結果を教えてくれないかな。早稲田次郎選手はどうだったの？」であった場合には、システム発話タイミング検出手段２２によりシステム発話の開始タイミングが検出された後、次発話選択手段２４は、次発話候補記憶手段３０に記憶されているＳ（２）の複数の候補の中からの選択処理を行う。例えば、Ｕ（１）に含まれる単語である「早稲田次郎」によるキーワードマッチングを行う。従って、ここで次発話選択手段２４により選択される次発話は、Ｓ（２）＝「１００ｍバタフライでは、早稲田次郎選手が６位入賞だったんだ。」となる。そして、次発話選択手段２４により、その選択結果がネットワーク１を介して対話状態管理手段４２に送信される。対話状態管理手段４２は、選択結果としてＳ（２）を受け取ると、自身のメモリに保持しているＵ（１）＝「僕は、平泳ぎよりも、バタフライの方が興味あるんだよ。バタフライの結果を教えてくれないかな。早稲田次郎選手はどうだったの？」（音声認識処理手段４１による音声認識処理の結果として受け取り、保持している文字列）と、受け取ったＳ（２）とを対話履歴記憶手段５０に保存するとともに、Ｓ（３）の準備開始指示情報を次発話準備手段４３に送る。 Eventually, the user utterance U (1) was finished, and U (1) = "I'm more interested in butterfly stroke than breaststroke. Can you tell me the result of butterfly stroke? Waseda Jiro In the case of "How was it?", The next utterance selection means 24 is stored in the next utterance candidate storage means 30 after the system utterance timing detection means 22 detects the start timing of the system utterance. The selection process is performed from the plurality of candidates of S (2). For example, keyword matching is performed by the word "Waseda Jiro" included in U (1). Therefore, the next utterance selected by the next utterance selection means 24 is S (2) = "In the 100m butterfly, Waseda Jiro won the 6th place." Then, the next utterance selection means 24 transmits the selection result to the dialogue state management means 42 via the network 1. When the dialogue state management means 42 receives S (2) as a selection result, it holds U (1) in its own memory = "I am more interested in butterfly stroke than breaststroke. Butterfly How was Jiro Waseda? ”(Character string received and held as a result of voice recognition processing by voice recognition processing means 41) and S (2) received. Is stored in the dialogue history storage means 50, and the preparation start instruction information of S (3) is sent to the next speech preparation means 43.

＜本実施形態の効果＞ <Effect of this embodiment>

このような本実施形態によれば、次のような効果がある。すなわち、対話システム１０は、システム発話タイミング検出手段２２を備えているので、ユーザが自己の発話権を維持しているか、または、譲渡若しくは放棄により終了させたかをパターン認識処理により逐次推定することができる。また、次発話準備手段４３を備えているので、システム発話タイミング検出手段２２によるパターン認識処理とは非同期で、かつ、システム発話タイミング検出手段２２によりシステム発話の開始タイミングが検出される前に（すなわち、ユーザ発話の進行中に、または、それよりも前の段階であるユーザ発話の開始前に）、ユーザ発話に対するシステムの次発話の内容データを準備することができる。 According to this embodiment, the following effects are obtained. Because the dialogue system 10 includes the system utterance timing detection means 22, it can sequentially estimate, by pattern recognition processing, whether the user is maintaining his or her utterance right or has ended it by transferring or abandoning it. Furthermore, because the dialogue system 10 includes the next utterance preparation means 43, the content data of the system's next utterance in response to the user utterance can be prepared asynchronously with the pattern recognition processing by the system utterance timing detection means 22, and before the system utterance timing detection means 22 detects the start timing of the system utterance (that is, while the user utterance is in progress, or at an even earlier stage, before the user utterance starts).

このため、対話相手であるユーザが自己の発話権を譲渡若しくは放棄することによりユーザ発話権が終了し、システム発話タイミング検出手段２２により、このユーザ発話権の終了が捉えられ、システム発話の開始タイミングが検出された場合には、その検出直後に、発話生成手段２５により、タイミングよくシステム発話を開始させることができるので、システムの応答性を向上させることができる。 Therefore, the user who is the conversation partner transfers or abandons his / her own utterance right, and the user utterance right is terminated. The system utterance timing detecting means 22 catches the termination of the user utterance right, and the system utterance start timing. When is detected, the utterance generation means 25 can start the system utterance at the right time immediately after the detection, so that the responsiveness of the system can be improved.

また、システム発話タイミング検出手段２２は、音声認識処理手段４１による音声認識処理とは非同期で、ユーザ発話権の維持または終了を識別するパターン認識処理を繰り返し実行する構成とされているので、音声区間検出処理（ＶＡＤ処理）を前提としない処理を実現することができるため、ＶＡＤ処理による遅延なしに早期に、システム発話の開始タイミングを決定することができるとともに、ユーザ発話とシステム発話との衝突も回避または抑制することができる。 Further, since the system utterance timing detection means 22 is configured to repeatedly execute the pattern recognition process for identifying the maintenance or termination of the user's utterance right, which is asynchronous with the voice recognition process by the voice recognition processing means 41, the voice section. Since it is possible to realize processing that does not presuppose detection processing (VAD processing), it is possible to determine the start timing of system utterance at an early stage without delay due to VAD processing, and there is also a collision between user utterance and system utterance. Can be avoided or suppressed.

以上より、対話システム１０では、次発話準備手段４３により、システムが発話すべき内容（本実施形態では、複数の次発話候補の内容）を早期に確定したうえで、システム発話タイミング検出手段２２により、ユーザ発話権の終了が推定され、システム発話の開始タイミングが検出されるのを待って、発話生成手段２５により、システム応答を行うことができる。このため、ユーザ発話の終了後、システム発話の開始までに、長い間（ま）が空くことを避けることができるうえ、両者の発話の衝突の発生も回避または抑制することができる。 From the above, in the dialogue system 10, the content to be spoken by the system (in the present embodiment, the content of a plurality of next utterance candidates) is determined at an early stage by the next utterance preparation means 43, and then the system utterance timing detecting means 22 is used. After the end of the user's utterance right is estimated and the start timing of the system utterance is detected, the utterance generation means 25 can make a system response. Therefore, it is possible to avoid a long period of time between the end of the user's utterance and the start of the system utterance, and it is also possible to avoid or suppress the occurrence of a collision between the two utterances.

また、対話システム１０は、次発話準備手段４３による準備処理で取得または生成した複数の次発話候補の内容データ中から、次発話選択手段２４が、発話生成手段２５で用いる次発話の内容データを選択する構成とされているので、様々な種別の対話に柔軟に対応することができる。すなわち、各種の対話の中には、ユーザ発話の内容が確定する前に、そのユーザ発話に対するシステムの次発話の内容が１つに定まらない種別の対話も多いが、そのような場合でも、システムの応答性の向上を図ることができる。 Further, the dialogue system 10 is configured so that the next utterance selection means 24 selects the content data of the next utterance to be used by the utterance generation means 25 from among the content data of the plurality of next utterance candidates acquired or generated in the preparation process by the next utterance preparation means 43, so it can flexibly handle various types of dialogue. That is, in many types of dialogue the content of the system's next utterance cannot be narrowed down to one before the content of the user utterance is determined, but even in such cases the responsiveness of the system can be improved.

そして、次発話選択手段２４は、異なる処理で得られた複数の種類の情報を用いて、次発話の選択処理を行うことができるで、この点でも、様々な種別の対話に柔軟に対応することができる。 Then, the next utterance selection means 24 can perform the next utterance selection process using a plurality of types of information obtained by different processes, and in this respect as well, it flexibly responds to various types of dialogue. be able to.

具体的には、次発話選択手段２４は、音声認識処理手段４１による音声認識処理の結果として得られた言語情報（文字列）と、次発話選択用情報生成手段２３により得られた次発話選択用情報（主としてユーザ発話意図の識別結果であるが、ユーザの顔画像から得られた表情の識別結果や、ユーザのジェスチャー画像から得られた身振り・手振りの意図の識別情報を加えてもよい。）とのうちのいずれか一方の情報を用いて、次発話の選択処理を行うことができ、また、これらの情報を組み合わせて用いて、次発話の選択処理を行うこともできる。さらに、システム発話タイミング検出手段２２で得られたユーザ発話意図の識別結果を用いることができる場合もある。従って、様々な種別の対話において、ユーザ発話の内容（必ずしも言語情報に限らず、ユーザ発話意図等も含めた内容）に応じて、システムの次発話の内容データを選択することができる。 Specifically, the next utterance selection means 24 can perform the next utterance selection process using either the linguistic information (character string) obtained as the result of the voice recognition processing by the voice recognition processing means 41, or the next utterance selection information obtained by the next utterance selection information generation means 23 (mainly the identification result of the user utterance intention, to which the identification result of a facial expression obtained from the user's face image, or identification information on the intention of a gesture obtained from the user's gesture image, may be added), and can also perform the selection process using these kinds of information in combination. In some cases, the identification result of the user utterance intention obtained by the system utterance timing detection means 22 can also be used. Therefore, in various types of dialogue, the content data of the system's next utterance can be selected according to the content of the user utterance (not necessarily limited to linguistic information, and including the user utterance intention and the like).

また、上記において、次発話選択手段２４が、韻律分析で推定したユーザ発話意図だけを用いて、複数の次発話候補の内容データの中から、次発話の内容データを選択することができる場合は、音声認識処理の結果を得る必要はないので、システムの応答性を、より一層向上させることができる。 Further, in the above, when the next utterance selection means 24 can select the content data of the next utterance from among the content data of the plurality of next utterance candidates using only the user utterance intention estimated by prosodic analysis, there is no need to obtain the result of the voice recognition processing, so the responsiveness of the system can be improved even further.

また、システム発話タイミング検出手段２２は、システム状態記憶手段３１に記憶されている準備完了・準備中の別を示すステータスを参照する構成とされているので（図８、図２参照）、システム状態を考慮し、より適切なシステム発話の開始タイミングを検出することができる。 Further, the system utterance timing detection means 22 is configured to refer to the status, stored in the system state storage means 31, that indicates whether preparation is complete or in progress (see FIGS. 8 and 2), so it can take the system state into account and detect a more appropriate start timing for the system utterance.
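
The interplay between early preparation, repeated end-of-utterance-right estimation, and the ready/preparing status can be illustrated with the frame-by-frame simulation below. It is a single-threaded stand-in for processes that the text describes as asynchronous, and the function name, the frame tuple layout, and the fixed threshold are assumptions.

```python
from __future__ import annotations


def simulate_turn(
    frames: list[tuple[float, str | None]], threshold: float = 0.5
) -> tuple[int | None, str | None]:
    """Return (frame index at which the system starts speaking, prepared utterance).

    Each frame is (estimated probability that the user's utterance right
    has ended, partial recognition result that arrived in this frame or None).
    """
    prepared = None
    for index, (p_end, partial) in enumerate(frames):
        # Preparation advances whenever a partial recognition result arrives,
        # independently of the timing decision made below.
        if partial is not None:
            prepared = f"reply prepared from: {partial}"
        status = "ready" if prepared is not None else "preparing"
        # The timing decision uses only the estimate and the status;
        # it does not wait for the final recognition result.
        if status == "ready" and p_end >= threshold:
            return index, prepared
    return None, prepared


start, utterance = simulate_turn([(0.1, "I'm more interested"), (0.2, None), (0.9, None)])
```

In this run the utterance is already prepared in the first frame, so the system can start as soon as the estimate crosses the threshold in the third frame.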

さらに、システム発話タイミング検出手段２２は、ユーザ状態記憶手段３２に記憶されているユーザ発話継続時間を用いて、ユーザ発話権終了判定用閾値の調整を行うことができるので（図９参照）、ユーザ発話継続時間の長短に応じ、システム発話の開始タイミングを調整することができる。 Further, the system utterance timing detecting means 22 can adjust the threshold value for determining the end of the user utterance right by using the user utterance duration stored in the user state storage means 32 (see FIG. 9). The start timing of system utterance can be adjusted according to the length of the utterance duration.

また、システム発話タイミング検出手段２２は、システム状態記憶手段３１に記憶されているシステム発話意欲度の指標値（目的データの残数、次発話候補の重要度）を用いてユーザ発話権終了判定用閾値を動的に調整することができるので（図１０参照）、システム発話意欲度が強いときには、ユーザ発話権が終了したという識別結果が出やすくなる設定状態とし、システム発話意欲度が弱いときには、ユーザ発話権が終了したという識別結果が出にくい設定状態とすることができる。 Further, the system utterance timing detecting means 22 is for determining the end of the user utterance right by using the index value (the remaining number of target data, the importance of the next utterance candidate) of the system utterance motivation stored in the system state storage means 31. Since the threshold value can be adjusted dynamically (see FIG. 10), when the system utterance motivation is strong, the setting state is set so that the identification result that the user utterance right has ended is easily obtained, and when the system utterance motivation is weak, the setting state is set. It is possible to set a setting state in which it is difficult to obtain an identification result that the user's utterance right has ended.

さらに、システム発話タイミング検出手段２２は、ユーザ情報記憶手段５２に記憶されている衝突の発生情報（蓄積情報）やシステムの交替潜時（蓄積情報）を用いて、ユーザ発話権終了判定用閾値を事前調整することができるので（図１１、図１２参照）、各ユーザについて、衝突の発生が起きる傾向にあるときには、ユーザ発話権が終了したという識別結果が出にくい設定状態とし、システムの交替潜時が長い傾向にあるときには、ユーザ発話権が終了したという識別結果が出やすくなる設定状態とすることができる。このため、ユーザ属性に応じたユーザ発話権終了判定用閾値の調整を実現することができる。 Further, the system utterance timing detecting means 22 uses the collision occurrence information (accumulated information) stored in the user information storage means 52 and the system's alternate latency (accumulated information) to set the user utterance right end determination threshold. Since it can be pre-adjusted (see FIGS. 11 and 12), when a collision tends to occur for each user, the setting state is set so that it is difficult to obtain an identification result that the user's utterance right has ended, and the system is replaced. When the time tends to be long, it is possible to set the setting state so that the identification result that the user's utterance right has ended is likely to be obtained. Therefore, it is possible to adjust the threshold value for determining the end of the user's utterance right according to the user attribute.
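
The directions of the threshold adjustments described above can be summarized in a small sketch. It assumes that the end of the user's utterance right is declared when an estimated probability is at or above the threshold, so that lowering the threshold makes that result easier to obtain; the function name, the parameter names, the step size, and the additive form are all hypothetical.

```python
def adjust_end_threshold(
    base: float,
    willingness: float = 0.0,
    collision_prone: bool = False,
    latency_long: bool = False,
    step: float = 0.1,
) -> float:
    """Return an adjusted threshold for judging the end of the utterance right, in [0, 1]."""
    threshold = base
    # willingness in [-1, 1]: positive means the system is eager to speak,
    # which should make the "ended" result easier to obtain.
    threshold -= step * willingness
    if collision_prone:
        threshold += step  # collisions tend to occur: make "ended" harder
    if latency_long:
        threshold -= step  # switch latency tends to be long: make "ended" easier
    return min(1.0, max(0.0, threshold))
```

For example, `adjust_end_threshold(0.5, collision_prone=True)` yields a value above 0.5, while `adjust_end_threshold(0.5, willingness=1.0)` yields a value below it.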

この際、システム発話タイミング検出手段２２は、ユーザ情報記憶手段５２に記憶されているユーザの発話速度（蓄積情報）を用いて、ユーザ発話権終了判定用閾値を下方調整することを決めるための下方調整用閾値を、ユーザの発話速度の関数として設定することができるので（図１２参照）、各ユーザの発話速度の傾向に応じ、下方調整用閾値の設定を変更することができる。このため、ユーザ属性に応じたユーザ発話権終了判定用閾値の下方調整を実現することができる。すなわち、システムの交替潜時が長い傾向にあるときには、ユーザ発話権終了判定用閾値を下方調整することにより、ユーザ発話権が終了したという識別結果が出やすくなる設定状態とし、システムの交替潜時が短くなるようにすることができるが、この際、システムの交替潜時が長い傾向にあるか否かは、ユーザ毎に異なり、各ユーザの発話速度の傾向と関係するので、下方調整用閾値をユーザの発話速度の関数とすることで、ユーザ属性に応じてユーザ発話権終了判定用閾値の下方調整を行うか否かを決めることができる。 At this time, the system utterance timing detecting means 22 uses the user's utterance speed (accumulated information) stored in the user information storage means 52 to lower the threshold value for determining the end of the user's utterance right. Since the adjustment threshold value can be set as a function of the user's utterance speed (see FIG. 12), the downward adjustment threshold value setting can be changed according to the tendency of each user's utterance speed. Therefore, it is possible to realize downward adjustment of the threshold value for determining the end of the user's utterance right according to the user attribute. That is, when the system shift latency tends to be long, the threshold for determining the end of the user utterance right is adjusted downward so that the identification result that the user utterance right has ended can be easily obtained, and the shift latency of the system is set. However, at this time, whether or not the system shift latency tends to be long differs for each user and is related to the tendency of each user's utterance speed. By making the function of the user's utterance speed, it is possible to determine whether or not to downwardly adjust the threshold for determining the end of the user's utterance right according to the user attribute.
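
Making the downward adjustment threshold a function of the user's speech rate could take a form like the sketch below. The linear form, its coefficients, the units, and even the direction of the dependence are assumptions chosen only for illustration; the actual function is the one shown in FIG. 12, which is not reproduced here.

```python
def downward_adjustment_threshold(
    speech_rate: float, intercept: float = 1.2, slope: float = -0.1
) -> float:
    """Switch latency (seconds) above which the end threshold is lowered.

    Hypothetical linear function of the user's speech rate (units per second).
    """
    return max(0.0, intercept + slope * speech_rate)


def should_lower_end_threshold(mean_switch_latency: float, speech_rate: float) -> bool:
    """Decide per user whether the system's switch latency counts as 'long'."""
    return mean_switch_latency > downward_adjustment_threshold(speech_rate)
```

With these illustrative coefficients, the same accumulated switch latency can count as long for a fast speaker and not for a slow one, which is the per-user behavior the paragraph describes.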

In addition, because the next utterance preparation means 43 includes the replacement necessity determination means 43B and the replacement preparation means 43C (see FIG. 4), it can successively reflect the content of the user utterance in progress and replace the content data of the plurality of next utterance candidates that have already been prepared. Content data of next utterance candidates appropriate to the content of the user utterance can therefore be prepared.

For example, the next utterance preparation means 43 can determine a topic of interest to the user from the high-importance words contained in the successively obtained results of the voice recognition processing, select, from the subject data stored in the subject data storage means 51 or in the subject data providing system 60, the subject data stored in association with the determined topic, and execute preparation processing that acquires or generates content data of a different plurality of next utterance candidates. The topic presented by the next utterance can thus be changed.
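The following sketch illustrates one way such topic-driven replacement could look. The word-to-topic table, the subject data, and the majority-vote rule are all hypothetical; the text does not specify how importance is assigned or how a topic is chosen when several important words appear.

```python
from collections import Counter
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical table of predetermined high-importance words and their topics.
IMPORTANT_WORDS: Dict[str, str] = {
    "soccer": "sports", "goal": "sports",
    "election": "politics", "vote": "politics",
}

# Hypothetical subject data stored in association with each topic.
SUBJECT_DATA: Dict[str, List[str]] = {
    "sports": ["The national team won last night.", "The league resumes next week."],
    "politics": ["Turnout rose in the last election.", "A new bill was submitted."],
}


def decide_topic(words: Sequence[str]) -> Optional[str]:
    """Pick the topic supported by the most high-importance words, if any."""
    counts = Counter(IMPORTANT_WORDS[w] for w in words if w in IMPORTANT_WORDS)
    return counts.most_common(1)[0][0] if counts else None


def maybe_replace_candidates(current_topic: str, candidates: List[str],
                             partial_words: Sequence[str]) -> Tuple[str, List[str]]:
    """Replace the prepared candidates when the partial recognition result points to a new topic."""
    topic = decide_topic(partial_words)
    if topic is None or topic == current_topic:
        return current_topic, candidates          # no replacement needed
    return topic, list(SUBJECT_DATA.get(topic, []))  # prepare a different candidate set
```

The function is meant to be called each time a new partial recognition result arrives, so the candidate set follows the user utterance while it is still in progress.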

<Modifications>

The present invention is not limited to the above embodiment; modifications and the like within the range in which the object of the present invention can be achieved are included in the present invention.

For example, in the dialogue system 10 of the above embodiment, the next utterance preparation means 43 prepares content data of a plurality of next utterance candidates and stores it in the next utterance candidate storage means 30, and the next utterance selection means 24 selects the content data of the next utterance from among those candidates. The dialogue system of the present invention is not limited to this configuration, however: the next utterance preparation means 43 may prepare only one item of content data for the next utterance, and the next utterance selection means 24 may be omitted. Nevertheless, from the viewpoint of handling various types of dialogue, it is preferable, as in the above embodiment, to provide the next utterance selection means 24 and to have the next utterance preparation means 43 prepare content data of a plurality of next utterance candidates.

Also, in the above embodiment, the system utterance timing detecting means 22 uses the user's speech rate (accumulated information) stored in the user information storage means 52 to set, as a function of that speech rate, the downward-adjustment threshold that decides whether the threshold for determining the end of the user utterance right is to be lowered (see FIG. 12). Instead of using the user attribute (speech-rate tendency) obtained from the accumulated speech rates, however, the pattern recognition processing that identifies maintenance or termination of the user utterance right may be performed using the user's real-time speech rate (the speech rate at each moment) stored in the user state storage means 32.

In such a configuration, the classifier that identifies maintenance or termination of the user utterance right is trained on three inputs: the acoustic feature amounts extracted from the voice signal of the user utterance, the language feature amounts extracted from the linguistic information obtained as a result of the voice recognition processing by the voice recognition processing means 41 (these may be omitted), and the user's speech rate at each corresponding point in the user utterance. In operation, the acoustic feature amounts, the language feature amounts (which may be omitted), and the successively obtained real-time speech rate are input to the classifier. An identification result that takes the user's real-time speech rate into account is thereby obtained, so the threshold for determining the end of the user utterance right no longer needs to be adjusted according to each user's speech-rate tendency (the user attribute obtained from the accumulated speech rates). This approach may also be combined with threshold adjustment, in which case the amount of threshold adjustment becomes smaller.
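The point of this modification is that the same feature layout, with the speech rate appended, is used both when training the classifier and when running it. The sketch below shows only that feature assembly plus a stand-in logistic scorer; the function names, the logistic form, and the weights are assumptions, and training itself is omitted. The text does not state which kind of classifier is used.

```python
import math
from typing import List, Optional, Sequence


def build_features(acoustic: Sequence[float],
                   speech_rate: float,
                   linguistic: Optional[Sequence[float]] = None) -> List[float]:
    """Assemble one classifier input: acoustic, optional language features, then speech rate."""
    return list(acoustic) + list(linguistic or []) + [speech_rate]


def end_likelihood(features: Sequence[float],
                   weights: Sequence[float],
                   bias: float = 0.0) -> float:
    """Stand-in scorer returning a likelihood that the user utterance right has ended.

    A trained model would replace this; the weights here are hypothetical.
    """
    if len(features) != len(weights):
        raise ValueError("feature and weight vectors must have the same length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

At run time, `build_features` would be called once per recognition cycle with the latest acoustic frame and the most recent speech-rate value, and the resulting likelihood compared against the threshold for determining the end of the user utterance right.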

As described above, the dialogue system and program of the present invention are suitable for use in, for example: a news dialogue system that conveys the content of articles to a user using scenario data generated from article data describing various topics such as news, columns, and history; a guidance dialogue system that explains to a user how to use equipment or guides the user around facilities; a questionnaire dialogue system that surveys user trends such as election outlook and consumer preferences; an information retrieval dialogue system with which a user searches for information on stores, products, travel destinations, songs to listen to, and the like; an operation dialogue system with which a user operates devices and equipment such as home appliances and cars; an educational dialogue system for educating users such as children, students, and new employees; and an information identification dialogue system with which the system identifies information such as user attributes.

1 Network
10 Dialogue system
21 Voice signal acquisition means
22 System utterance timing detecting means
23 Next utterance selection information generation means
24 Next utterance selection means
25 Utterance generation means
30 Next utterance candidate storage means
31 System state storage means
32 User state storage means
41 Voice recognition processing means
42 Dialogue state management means
43 Next utterance preparation means
50 Dialogue history storage means
51 Subject data storage means
52 User information storage means
60 Subject data providing system (an external system)

Claims

A dialogue system configured by a computer that executes processing for voice dialogue with a user, comprising:
voice signal acquisition means for acquiring a voice signal of a user utterance;
voice recognition processing means for executing voice recognition processing on the voice signal of the user utterance acquired by the voice signal acquisition means;
system utterance timing detecting means for extracting acoustic feature amounts from the voice signal of the user utterance acquired by the voice signal acquisition means, repeatedly executing, at a cycle that does not depend on the execution timing of the voice recognition processing by the voice recognition processing means, pattern recognition processing that identifies maintenance or termination of a user utterance right, which indicates that the user holds the position or standing to speak, using the extracted acoustic feature amounts or using, in addition to the acoustic feature amounts, language feature amounts extracted from linguistic information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, and executing processing that detects a start timing of a system utterance using the result of the pattern recognition processing;
next utterance preparation means for executing, at a timing that does not depend on the cycle of the pattern recognition processing by the system utterance timing detecting means and before the start timing of the system utterance is detected by the system utterance timing detecting means, preparation processing that acquires or generates content data of a next utterance of the system, using subject data stored in subject data storage means or subject data stored in an external system connected via a network, together with at least part of dialogue history information between the user and the system and/or a partial result of the voice recognition processing by the voice recognition processing means on the user utterance in progress; and
utterance generation means for executing, after the start timing of the system utterance is detected by the system utterance timing detecting means, system utterance generation processing including playback of a voice signal of the system utterance, using the content data of the next utterance obtained by the preparation processing of the next utterance preparation means.

The dialogue system according to claim 1, wherein
the next utterance preparation means is configured to execute preparation processing that acquires or generates content data of a plurality of next utterance candidates that are candidates for the next utterance, and
the dialogue system further comprises next utterance selection means for executing, after the start timing of the system utterance is detected by the system utterance timing detecting means, processing that selects the content data of the next utterance to be used by the utterance generation means from among the content data of the plurality of next utterance candidates obtained by the preparation processing of the next utterance preparation means, using the linguistic information obtained as a result of the voice recognition processing by the voice recognition processing means.

The dialogue system according to claim 1, wherein
the next utterance preparation means is configured to execute preparation processing that acquires or generates content data of a plurality of next utterance candidates that are candidates for the next utterance, and
the dialogue system further comprises:
next utterance selection information generation means for repeatedly executing pattern recognition processing that identifies a user utterance intention, namely a question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or other user utterance intention, using prosodic information obtained from the voice signal of the user utterance acquired by the voice signal acquisition means, or using, in addition to the prosodic information, the linguistic information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, or using, in addition to the prosodic information and the linguistic information of the user utterance, the linguistic information of the immediately preceding system utterance in the dialogue history information between the user and the system; and
next utterance selection means for executing, after the start timing of the system utterance is detected by the system utterance timing detecting means, processing that selects the content data of the next utterance to be used by the utterance generation means from among the content data of the plurality of next utterance candidates obtained by the preparation processing of the next utterance preparation means, using the identification result of the user utterance intention obtained by the processing of the next utterance selection information generation means.

The dialogue system according to claim 1, wherein
the next utterance preparation means is configured to execute preparation processing that acquires or generates content data of a plurality of next utterance candidates that are candidates for the next utterance, and
the dialogue system further comprises:
next utterance selection information generation means for repeatedly executing pattern recognition processing that identifies a user utterance intention, namely a question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or other user utterance intention, using prosodic information obtained from the voice signal of the user utterance acquired by the voice signal acquisition means, or using, in addition to the prosodic information, the linguistic information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, or using, in addition to the prosodic information and the linguistic information of the user utterance, the linguistic information of the immediately preceding system utterance in the dialogue history information between the user and the system; and
next utterance selection means for executing, after the start timing of the system utterance is detected by the system utterance timing detecting means, processing that selects the content data of the next utterance to be used by the utterance generation means from among the content data of the plurality of next utterance candidates obtained by the preparation processing of the next utterance preparation means, using in combination the identification result of the user utterance intention obtained by the processing of the next utterance selection information generation means and the linguistic information obtained as a result of the voice recognition processing by the voice recognition processing means.

The dialogue system according to claim 1, wherein
the next utterance preparation means is configured to execute preparation processing that acquires or generates content data of a plurality of next utterance candidates that are candidates for the next utterance,
the system utterance timing detecting means is configured such that, when executing the pattern recognition processing that identifies maintenance or termination of the user utterance right, it executes, with respect to termination, pattern recognition processing that identifies with which user utterance intention the utterance right ends, among a question, response, backchannel, request for supplementation, request for repetition, understanding, non-understanding, indifference, or other user utterance intention, and
the dialogue system further comprises next utterance selection means for executing, after the start timing of the system utterance is detected by the system utterance timing detecting means, processing that selects the content data of the next utterance to be used by the utterance generation means from among the content data of the plurality of next utterance candidates obtained by the preparation processing of the next utterance preparation means, using the identification result of the user utterance intention obtained by the processing of the system utterance timing detecting means.

The dialogue system according to any one of claims 1 to 5, further comprising system state storage means for storing information indicating a system state including the state of the preparation processing by the next utterance preparation means, wherein
the system utterance timing detecting means is configured such that, when executing the processing that detects the start timing of the system utterance using the result of the pattern recognition processing that identifies maintenance or termination of the user utterance right and the information indicating the system state stored in the system state storage means:
when the result of the pattern recognition processing indicates maintenance of the user utterance right, it determines that it is not the start timing of the system utterance;
when the result of the pattern recognition processing indicates termination of the user utterance right and the information indicating the system state indicates that preparation is complete, it determines that it is the start timing of the system utterance; and
when the result of the pattern recognition processing indicates termination of the user utterance right and the information indicating the system state indicates that preparation is in progress, then, depending on the processing being prepared by the next utterance preparation means, if processing classified in advance as completing quickly is being prepared, it waits until preparation is complete and then determines that it is the start timing of the system utterance, and if processing classified in advance as not completing quickly is being prepared, it determines that it is the start timing of the system utterance and outputs information indicating that it is the timing to insert a filler.
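For readers who prefer a procedural view, the case analysis in the claim above can be written as a small decision function. This is an explanatory sketch only and forms no part of the claims; the enum names `PrepState` and `TimingDecision` and the function `decide_timing` are hypothetical labels introduced here.

```python
from enum import Enum, auto


class PrepState(Enum):
    READY = auto()            # preparation complete
    PREPARING_QUICK = auto()  # preparing; pre-classified as completing quickly
    PREPARING_SLOW = auto()   # preparing; pre-classified as not completing quickly


class TimingDecision(Enum):
    NOT_START = auto()             # not the start timing of the system utterance
    WAIT_THEN_START = auto()       # wait until preparation completes, then start
    START = auto()                 # start the system utterance now
    START_WITH_FILLER = auto()     # start now and signal filler insertion


def decide_timing(utterance_right_ended: bool, prep: PrepState) -> TimingDecision:
    """Combine the pattern recognition result with the stored system state."""
    if not utterance_right_ended:
        return TimingDecision.NOT_START
    if prep is PrepState.READY:
        return TimingDecision.START
    if prep is PrepState.PREPARING_QUICK:
        return TimingDecision.WAIT_THEN_START
    return TimingDecision.START_WITH_FILLER
```

The filler branch lets the system take the turn on time even when the content is not yet ready, which avoids an unnaturally long silence.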

The dialogue system according to any one of claims 1 to 6, further comprising user state storage means for storing information indicating a user state including a user utterance duration, wherein
the system utterance timing detecting means is configured to execute the processing that detects the start timing of the system utterance using the result of the pattern recognition processing that identifies maintenance or termination of the user utterance right and the information indicating the user state stored in the user state storage means, and, as that processing, to execute any one of:
(1) processing that, when the user utterance duration stored in the user state storage means is less than or equal to, or less than, a predetermined short-duration determination threshold, sets the threshold for determining the end of the user utterance right, which is applied to the likelihood obtained as a result of the pattern recognition processing, higher than a standard value, and, when the user utterance duration is greater than or equal to, or greater than, a predetermined long-duration determination threshold, sets the threshold for determining the end of the user utterance right lower than the standard value;
(2) processing that uses the user utterance duration stored in the user state storage means to set the threshold for determining the end of the user utterance right, applied to the likelihood obtained as a result of the pattern recognition processing, by a predetermined function such that the threshold becomes higher when the user utterance duration is short and lower when the user utterance duration is long; and
(3) processing that, when the user utterance duration stored in the user state storage means is less than or equal to, or less than, a predetermined short-duration determination threshold, determines that it is not the start timing of the system utterance even if the result of the pattern recognition processing indicates termination of the user utterance right, and, when the user utterance duration is greater than or equal to, or greater than, a predetermined long-duration determination threshold, determines that it is the start timing of the system utterance even if the result of the pattern recognition processing indicates maintenance of the user utterance right.
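The three alternatives (1) to (3) in the claim above can be illustrated side by side. This is an explanatory sketch and forms no part of the claims: the function names, all numeric constants, the choice of "less than or equal to" and "greater than or equal to" for the comparisons, and the exponential form used for alternative (2) are assumptions, since the claim only requires some predetermined decreasing function.

```python
import math


def stepwise_threshold(duration: float, standard: float = 0.5,
                       short_limit: float = 1.0, long_limit: float = 10.0,
                       delta: float = 0.2) -> float:
    """Alternative (1): raise the end threshold for short utterances, lower it for long ones."""
    if duration <= short_limit:
        return standard + delta
    if duration >= long_limit:
        return standard - delta
    return standard


def continuous_threshold(duration: float, high: float = 0.8,
                         low: float = 0.3, tau: float = 5.0) -> float:
    """Alternative (2): a predetermined function that decreases with utterance duration."""
    return low + (high - low) * math.exp(-duration / tau)


def override_decision(duration: float, utterance_right_ended: bool,
                      short_limit: float = 1.0, long_limit: float = 10.0) -> bool:
    """Alternative (3): let the duration override the pattern recognition result.

    Returns True when it is the start timing of the system utterance.
    """
    if duration <= short_limit:
        return False   # too short: never start, even if "ended" was identified
    if duration >= long_limit:
        return True    # very long: start, even if "maintained" was identified
    return utterance_right_ended
```

Alternatives (1) and (2) bias the likelihood comparison, whereas alternative (3) bypasses it outright at the two extremes.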

The dialogue system according to any one of claims 2 to 5, further comprising system state storage means for storing information indicating a system state that includes, as an index value of a system utterance motivation degree indicating how strongly the system needs to start speaking, the remaining number of items of objective data that can become content data of the system's final next utterance candidates for achieving the purpose of the dialogue and/or the importance of the content data of the next utterance candidates obtained by the preparation processing of the next utterance preparation means, wherein
the system utterance timing detecting means is configured to execute processing that sets the threshold for determining the end of the user utterance right, applied to the likelihood obtained as a result of the pattern recognition processing, by a predetermined function of the system utterance motivation degree determined by the remaining number of items of objective data and/or the importance stored in the system state storage means, such that the threshold becomes lower when the system utterance motivation degree is strong and higher when the system utterance motivation degree is weak.

The dialogue system according to any one of claims 2 to 5, wherein
the next utterance preparation means is configured such that, when a result of the voice recognition processing of the user utterance is newly output by the voice recognition processing means, it uses the newly output result to determine whether to replace at least part of the content data of the plurality of next utterance candidates, and, when it determines that replacement is to be performed, it executes preparation processing that acquires or generates content data of a different plurality of next utterance candidates.

The dialogue system according to claim 9, wherein
the next utterance preparation means is configured to determine a topic of interest to the user using, among the words contained in the newly output result of the voice recognition processing, words predetermined as being of high importance, to select, from the subject data stored in the subject data storage means or the subject data stored in the external system, the subject data stored in association with the determined topic, and to execute preparation processing that acquires or generates content data of a different plurality of next utterance candidates.

The dialogue system according to any one of claims 1 to 10, wherein
the utterance generation means is configured to also execute processing that detects the occurrence of a collision between the voice signal of the user utterance acquired by the voice signal acquisition means and the voice signal of the system utterance being played back, stores the detected collision occurrence information in user information storage means in association with user identification information, measures the shift latency from the end of the user utterance to the start of the system utterance, and stores the measured shift latency in the user information storage means in association with the user identification information, and
the system utterance timing detecting means is configured to also execute processing that:
acquires the collision occurrence information for the user who is the voice dialogue partner from the user information storage means, calculates the frequency or cumulative number of collisions with that user, and, when the calculated frequency or cumulative number is greater than or equal to, or greater than, an upward-adjustment threshold, sets the threshold for determining the end of the user utterance right, applied to the likelihood obtained as a result of the pattern recognition processing that identifies maintenance or termination of the user utterance right, higher than a standard value or the previously adjusted value; and
acquires a plurality of shift latencies, each from the end of a user utterance to the start of a system utterance, for the user who is the voice dialogue partner from the user information storage means, calculates an average or other index value indicating whether that user's shift latency tends to be long or short, and, when the calculated index value is greater than or equal to, or greater than, a downward-adjustment threshold, sets the threshold for determining the end of the user utterance right lower than the standard value or the previously adjusted value.

The utterance generation means
is configured to also execute a process of calculating an utterance speed using the language information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, and storing the calculated utterance speed in the user information storage means in association with user identification information.
The system utterance timing detecting means,
when executing the process of acquiring a plurality of alternate latencies, each measured from the end of a user utterance to the start of the system utterance and stored in the user information storage means for the user who is the voice dialogue partner, calculating an average value or other index value indicating the tendency of the length of the alternate latency for that user, and setting the user speech right end determination threshold value lower than the standard value or the previous adjustment value when the calculated index value of the alternate latency is equal to or greater than, or exceeds, the downward adjustment threshold value,
is configured to execute a process of acquiring a plurality of utterance speeds of the voice dialogue partner stored in the user information storage means, calculating an average value or other index value indicating the tendency of the utterance speed of that user, and setting the downward adjustment threshold value from the calculated index value of the utterance speed by a predetermined function such that the downward adjustment threshold value becomes smaller when the index value of the utterance speed is large and becomes larger when the index value of the utterance speed is small.
The dialogue system according to claim 11, characterized by this configuration.
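The threshold adjustment recited above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the linear form of the function, the units (seconds for alternate latency, morae per second for utterance speed), the step size, and all names and constants are assumptions introduced only for illustration. The claim requires only that the function be predetermined and decreasing in the utterance speed index value.

```python
from statistics import mean


def downward_adjustment_threshold(rate_index, base=0.8, slope=0.05, floor=0.2):
    """Hypothetical decreasing function of the utterance speed index.

    A larger utterance speed index yields a smaller threshold, and a
    smaller index yields a larger threshold. The linear shape and the
    constants are assumptions, not values taken from the claims.
    """
    return max(floor, base - slope * rate_index)


def adjust_end_threshold(latencies, utterance_speeds, current, step=0.05, minimum=0.1):
    """Return the (possibly lowered) user speech right end determination threshold.

    latencies        -- stored alternate latencies for this user (seconds, assumed)
    utterance_speeds -- stored utterance speeds for this user (morae/s, assumed)
    current          -- standard value or previous adjustment value
    """
    latency_index = mean(latencies)      # index value: average alternate latency
    rate_index = mean(utterance_speeds)  # index value: average utterance speed
    if latency_index >= downward_adjustment_threshold(rate_index):
        # Latency is long relative to this user's pace: lower the threshold
        # so that the end of the user speech right is detected sooner.
        return max(minimum, current - step)
    return current


# The same 0.5 s average latency triggers a downward adjustment for a fast
# speaker (8 morae/s) but not for a slow speaker (4 morae/s).
fast = adjust_end_threshold([0.5, 0.5], [8.0, 8.0], current=0.5)
slow = adjust_end_threshold([0.5, 0.5], [4.0, 4.0], current=0.5)
```

Under these assumptions, a given alternate latency is treated as "too long" more readily for a fast speaker than for a slow one, which is the relationship the predetermined function is meant to express.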

User state storage means for storing information indicating a user state, including the user's real-time utterance speed, is provided.
The utterance generation means
is configured to also execute a process of calculating the real-time utterance speed using the language information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means, and storing the calculated real-time utterance speed in the user state storage means.
The system utterance timing detecting means
is configured to execute a process of extracting acoustic feature amounts from the voice signal of the user utterance acquired by the voice signal acquisition means; repeatedly executing, at a cycle that does not depend on the execution timing of the voice recognition processing by the voice recognition processing means, a pattern recognition process that identifies maintenance or termination of the user speech right, using either the extracted acoustic feature amounts and the real-time utterance speed stored in the user state storage means, or, in addition to these acoustic feature amounts and the real-time utterance speed, linguistic feature amounts extracted from the language information of the user utterance obtained as a result of the voice recognition processing by the voice recognition processing means; and detecting the start timing of the system utterance using the result of this pattern recognition process.
The dialogue system according to any one of claims 1 to 12, characterized by this configuration.
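The periodic identification recited above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and function names, the single "energy" feature, and the toy scoring rule are hypothetical stand-ins for the pattern recognition process, which the claims do not specify at this level. The point shown is structural: the detector runs once per fixed-period frame and only reads the most recently stored real-time utterance speed, so its cycle does not depend on when the voice recognition side happens to write a new value.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional, Sequence


@dataclass
class UserState:
    # Written by the voice recognition / utterance generation side whenever
    # a new partial recognition result arrives (assumed unit: morae/s).
    realtime_utterance_speed: float = 0.0


@dataclass
class TimingDetector:
    # classify(features, speed) returns an assumed probability that the
    # user speech right has ended.
    classify: Callable[[Sequence[float], float], float]
    end_threshold: float = 0.5
    state: UserState = field(default_factory=UserState)

    def step(self, acoustic_features: Sequence[float]) -> bool:
        """One pattern recognition cycle; True means 'start system utterance'."""
        p_end = self.classify(acoustic_features, self.state.realtime_utterance_speed)
        return p_end >= self.end_threshold


def toy_classify(features: Sequence[float], speed: float) -> float:
    """Hypothetical scorer: low frame energy suggests the turn has ended,
    and the score is scaled down for slow speakers, who pause longer."""
    energy = features[0]
    return (1.0 - energy) * min(1.0, speed / 8.0)


def first_detection(detector: TimingDetector, frames) -> Optional[int]:
    """Run one cycle per fixed-period frame; return the first detecting frame index."""
    for i, frame in enumerate(frames):
        if detector.step(frame):
            return i
    return None


frames = [[0.9], [0.8], [0.1]]            # per-frame energy, speech then silence
detector = TimingDetector(classify=toy_classify)
detector.state.realtime_utterance_speed = 8.0   # value last stored by the ASR side
fast_result = first_detection(detector, frames)
detector.state.realtime_utterance_speed = 4.0
slow_result = first_detection(detector, frames)
```

In a real system the frame loop and the voice recognition processing would run concurrently; here the asynchronous write is simulated by assigning to the stored state between runs.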

A program for causing a computer to function as the dialogue system according to any one of claims 1 to 13.