JP5357321B1

JP5357321B1 - Speech recognition system and method for controlling speech recognition system

Info

Publication number: JP5357321B1
Application number: JP2012271713A
Authority: JP
Inventors: 正樹渋谷; 仁秋田; 岳史小山
Original assignee: Fuji Soft Inc
Current assignee: Fuji Soft Inc
Priority date: 2012-12-12
Filing date: 2012-12-12
Publication date: 2013-12-04
Anticipated expiration: 2032-12-12
Also published as: JP2014115594A

Abstract

Problem: To correctly recognize a user's voice even when the user speaks while the system is outputting a response.
Solution: A speech recognition system 1 includes correction word dictionaries 22 and 23 that store predetermined correction instruction words; a general dictionary 21 that stores ordinary words and phrases; a speech recognition unit 14 that recognizes the user's voice input via a voice input unit by using either the correction word dictionaries or the general dictionary; and a response unit 31 that outputs, from a voice output unit 38, a response including the recognition result of the speech recognition unit. The second correction word dictionary 23 is created to take into account the case where a predetermined response and a predetermined correction instruction word overlap when input from the voice input unit 11, so that the user's voice input while the response unit is outputting the predetermined response can be recognized.
Selected drawing: FIG. 1

Description

The present invention relates to a speech recognition system and a method for controlling a speech recognition system.

Interactive speech recognition systems, which recognize a user's voice and respond according to the recognition result, are becoming widespread. In such a system, the system's own response can be superimposed on the user's voice as it is input. The sound that reaches the system then differs from what the user actually said, and the system may misrecognize it.

To recognize the user's voice correctly even when it overlaps with the system's response, a first conventional technique uses echo cancellation to remove the robot's own utterance from the audio signal that the microphone feeds to the speech recognition system, so that only the user's voice is supplied to the recognizer and misrecognition is prevented (Patent Document 1). A second conventional technique creates a dictionary from voice data in which the user's voice is superimposed on predetermined noise, and prevents misrecognition by recognizing the user's voice with that dictionary (Patent Document 2).

A third conventional technique is also known. To determine whether the user's voice was recognized correctly, the system replies to the user based on the recognition result. If the user utters a request for correction during that reply, the system judges the first recognition result to be a misrecognition and changes the content of its reply based on the recognition result of the user's corrective utterance (Patent Document 3).

Patent Document 1: JP 2007-155986 A
Patent Document 2: JP H05-73088 A
Patent Document 3: JP 2003-208196 A

However, the first conventional technique needs a certain amount of time before echo cancellation stabilizes, so it has difficulty recognizing short utterances. With the second conventional technique, when the noise around the microphone changes, the user's voice cannot be recognized correctly unless a dictionary matched to that noise is used. The third conventional technique only describes giving the user an opportunity to correct a recognition error; it does not describe preventing recognition errors when the user and the system speak at the same time. If the speech recognition system leaves a misrecognition uncorrected, the user's frustration grows and usability suffers.

The present invention has been made in view of the above problems. Its object is to provide a speech recognition system, and a method for controlling a speech recognition system, that can correct speech recognition errors even when the user's voice overlaps with a response from the system.

A speech recognition system according to one aspect of the present invention recognizes a user's voice and responds to it. The system comprises: a voice input unit for inputting the user's voice; a correction word dictionary database that stores predetermined correction instruction words that the user may use to correct a speech recognition result; a general dictionary database that stores ordinary words and phrases; a speech recognition unit that recognizes the user's voice input via the voice input unit by using either the correction word dictionary database or the general dictionary database; and a response unit that outputs, from a voice output unit, a response including the recognition result of the speech recognition unit.

Switching between the correction word dictionary database and the general dictionary database increases the likelihood that a correction instruction word uttered by the user is recognized correctly.

The correction word dictionary database can be created to take into account the case where a predetermined response and a predetermined correction instruction word overlap when input from the voice input unit, so that the user's voice input while the response unit is outputting the predetermined response can be recognized.

The correction word dictionary database can store both the normal reading of a predetermined correction instruction word and a modified reading, which is a variation of the normal reading.

A modified reading may be formed by replacing the sound at a predetermined position in the normal reading of a predetermined correction instruction word with another sound. The sound at the predetermined position may be a phoneme or syllable within a predetermined range from the beginning of the normal reading.

FIG. 1 is a block diagram of the speech recognition system.
FIG. 2 is an explanatory diagram showing a configuration example of the general dictionary and the correction word dictionaries.
FIG. 3 is an explanatory diagram showing a method for creating a correction word dictionary.
FIG. 4 is a flowchart showing the overall operation.
FIG. 5 is a flowchart showing the processing that follows FIG. 4.
FIG. 6 is a time chart for the case where the user's instruction is recognized correctly.
FIG. 7 is a time chart for the case where the user's instruction is misrecognized.
FIG. 8 is another time chart for the case where the user's instruction is misrecognized.
FIG. 9 is a time chart of speech recognition according to the second example.
FIG. 10 is a time chart of speech recognition according to the third example.
FIG. 11 is an explanatory diagram showing a method for creating a correction word dictionary according to the fourth example.
FIG. 12 is a time chart of speech recognition according to the fifth example.

As described in detail below, this embodiment switches between a correction word dictionary database, which stores predetermined correction instruction words that the user may use to correct a speech recognition result, and a general dictionary database, which stores ordinary words and phrases. As a result, even when a response from the system overlaps with the user's voice, a correction instruction from the user is more likely to be recognized correctly, which improves the user's satisfaction, sense of security, and ease of use.

FIG. 1 is a block diagram showing the overall configuration of the speech recognition system 1 of this example. The speech recognition system 1 understands the user's instructions through dialogue with the user and executes the predetermined operation that the user has instructed. Such an interactive speech recognition system can be applied widely, for example to a robot 2, a portable information terminal 3 (including mobile phones, smartphones, music players, digital cameras, and personal computers), and various vehicles 4 such as passenger cars, trucks, and construction machines. It can also be applied to other devices and systems. This example describes an interactive robot that operates through dialogue with the user.

The speech recognition system can be divided into a speech recognition unit and an operation control unit. The speech recognition unit can include a voice input unit 11, an A/D (analog/digital) conversion unit 12, a feature extraction unit 13, a matching unit 14, an acoustic model database 15, a grammar database 16, a dictionary selection unit 17, an action determination unit 18, a general dictionary database 21, a first correction word dictionary database 22, and a second correction word dictionary database 23.

The voice input unit 11 is a device for inputting voice to the speech recognition system 1. For example, a microphone may be used as the voice input unit 11. The system may also be configured so that voice data stored in a memory device or the like can be input to the speech recognition system 1. In that case, the voice input unit 11 includes an interface circuit for receiving data from the memory device.

The A/D conversion unit 12 converts the voice signal, input as an analog signal, into voice data in digital form. The feature extraction unit 13 extracts features at a plurality of preset locations in the voice data. The matching unit 14 recognizes the input voice data using the features of the voice data, the acoustic model database 15, the grammar database 16, and whichever of the dictionary databases 21 to 23 the dictionary selection unit 17 has selected.

The acoustic model database 15 stores text (readings) in association with the waveforms produced when the text is pronounced, and defines which waveform is recognized as which word. The grammar database 16 stores rules such as how words are ordered (grammar).

The dictionary selection unit 17 selects the general dictionary database 21, the first correction word dictionary database 22, or the second correction word dictionary database 23 at predetermined timings. The matching unit 14 recognizes voice data using the dictionary database selected by the dictionary selection unit 17.

Based on the speech recognition result of the matching unit 14, the action determination unit 18 determines the action of the speech recognition system 1 (more precisely, the action of the speech recognition system 1 and/or the operation of the device or system in which the speech recognition system 1 is installed).

Refer to FIG. 2. The general dictionary database 21 stores ordinary words in association with their normal readings; it may or may not include the correction instruction words described later. The first correction word dictionary database 22 corresponds to the "other correction word dictionary database". It records word data for correction instruction words that the user is likely to utter immediately after the speech recognition system 1 has spoken, and stores each correction instruction word in association with its normal reading. The second correction word dictionary database 23 is for recognizing the user's utterance correctly even when the user speaks while the speech recognition system 1 is itself speaking. In addition to each correction instruction word and its normal reading, it stores modified readings in which a predetermined part of the normal reading is replaced with another sound. As noted above, the general dictionary database 21 may include the correction instruction word data. Alternatively, the general dictionary database 21 may exclude that data, and the system may be configured to use both the general dictionary database 21 and the first correction word dictionary database 22 for normal recognition. Hereinafter, a dictionary database is simply called a "dictionary", and the description assumes a general dictionary database 21 that does not contain correction instruction word data.

A predetermined correction instruction word is a word that the user may use to cancel a speech recognition result. Examples include "chigau" (違う, "wrong"), "sō ja nai" (そうじゃない, "that's not it"), "machigatteru" (間違ってる, "that's a mistake"), "nō" (ノー, "no"), "yamenasai" (止めなさい, "stop it"), "teishi" (停止, "halt"), and "yarinaoshi" (やり直し, "redo").

As described above, the general dictionary 21 of this example stores the words registered in an ordinary dictionary, minus the predetermined correction instruction words, together with their readings. The first correction word dictionary 22, in contrast, stores only the predetermined correction instruction words and their normal readings.

The normal readings can include not only the basic reading of a correction instruction word but also readings with inflected endings. For example, the basic normal reading of the correction instruction word 違う ("wrong") is "chigau" (ちがう), but the normal readings may also include forms whose endings vary naturally, such as "chigau yo" (ちがうよ), "chigaimasu" (ちがいます), and "chigau tte" (ちがうって), as well as other readings with the same meaning, such as the dialect form "chau" (ちゃう).

The second correction word dictionary 23 stores not only the normal readings of the predetermined correction instruction words but also modified readings, in which a predetermined part of the normal reading is replaced with another sound. As described later with reference to FIG. 3, a modified reading is obtained by replacing the sounds in a predetermined range from the beginning of the normal reading with other sounds.

Modified readings differ from the ending variations of normal readings in two respects: whether the changed part of the sound is mainly at the beginning or at the end of the word, and whether the reading is in common use or is an unnatural substitution that is rarely used. For the correction instruction word 違う, whose normal reading is "chigau" (ちがう), the modified readings include unnatural, rarely used readings such as "jigau" (じがう) and "kigau" (きがう), whereas the ending variations are naturally used forms such as "chigau yo" (ちがうよ) and "chigaimasu" (ちがいます).
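The difference between the first and second correction word dictionaries can be illustrated with a small data-structure sketch. This is only one possible representation, not the layout used in the patent: the `CorrectionEntry` class, the `lookup` function, and the list-based storage are all assumptions made for illustration; only the example readings come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class CorrectionEntry:
    word: str              # the correction instruction word itself
    normal_readings: list  # basic reading plus natural ending variations
    modified_readings: list = field(default_factory=list)  # head-sound substitutions


# First correction word dictionary 22: normal readings only.
FIRST_DICT = [CorrectionEntry("違う", ["ちがう", "ちがうよ", "ちがいます"])]

# Second correction word dictionary 23: normal readings plus modified readings.
SECOND_DICT = [
    CorrectionEntry("違う", ["ちがう", "ちがうよ", "ちがいます"], ["じがう", "きがう"])
]


def lookup(reading, dictionary):
    """Return the correction instruction word a recognized reading maps to, or None."""
    for entry in dictionary:
        if reading in entry.normal_readings or reading in entry.modified_readings:
            return entry.word
    return None
```

With this layout, a distorted reading such as "きがう" resolves to 違う only through the second dictionary, while natural forms such as "ちがうよ" resolve through either one.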

Returning to FIG. 1, the configuration of the operation control unit of the speech recognition system 1 is described. The operation control unit can include, for example, a system control unit 31, a display control unit 32, an utterance control unit 33, a mechanism control unit 34, a display unit 35, a speech synthesis unit 36, a D/A (digital/analog) conversion unit 37, a voice output unit 38, and an actuator 39.

Based on the action determined from the speech recognition result, the system control unit 31 controls the operation of the speech recognition system 1, or of the device or system in which the speech recognition system 1 is installed. The system control unit 31 is an example of the "response unit". It is implemented, for example, as a computer program that runs on a computer system having a microprocessor, memory, interfaces, and the like. In the following, the operation of the speech recognition system 1 and the operation of the device or system in which it is installed are not distinguished; both are described as the operation of the speech recognition system 1.

The display control unit 32 controls the operation of the display unit 35. It outputs signals to the display unit 35 to realize the display content instructed by the system control unit 31. Examples of the display unit 35 include display devices such as liquid crystal displays, plasma displays, and organic EL (electroluminescence) displays, and LED (light-emitting diode) lamps. A printer or a pin display for visually impaired users may also be used as the display unit 35.

The utterance control unit 33 controls the voice (response) output from the speech recognition system 1. To convey to the user the response message instructed by the system control unit 31, it gives instructions to the speech synthesis unit 36. The speech synthesis unit 36 synthesizes the voice (response) by combining waveform data corresponding to the input response message. The synthesized voice is converted into an analog signal by the D/A conversion unit 37 and output from the voice output unit 38. The voice output unit 38 is configured, for example, as a speaker.

The mechanism control unit 34 outputs control signals to the actuator 39 to realize the operation instructed by the system control unit 31. The actuator 39 differs depending on the type of device or system in which the speech recognition system 1 is installed. In the robot 2, for example, the actuator 39 is an electric motor, a solenoid, or the like for moving the head, limbs, and so on. In the portable information terminal 3, the actuator 39 is, for example, a vibration generator that vibrates the terminal. In the vehicle 4, the actuator 39 can be, for example, an air conditioner, a light, a radio, a navigation device, or the engine.

An example of a method for creating the second correction word dictionary 23 is described with reference to FIG. 3. The method can be divided into two stages. The first is a stage of creating a recognition candidate word dictionary that stores words serving as recognition candidates (S10). The second is a stage of extracting predetermined readings (modified readings) based on the results of recognizing audio in which the user's voice and the response from the voice output unit 38 are superimposed (S20).

The first stage (S10) is described first. In this stage, recognition candidate words are generated exhaustively by replacing the first sound of each correction instruction word that the user may utter with every other sound of the Japanese syllabary (the gojūon), and are registered in the recognition candidate word dictionary 24.

The first sound is replaced for the following reason. In speech uttered by a user, the first sound generally tends to be quiet. When the response of the speech recognition system 1 overlaps with the user's voice, the first sound of the user's voice is therefore easily recognized as a different sound. For this reason, this example generates recognition candidate words by replacing the first sound of each predetermined correction instruction word that the user may utter with another sound.

Taking the correction instruction word "sō ja nai" (そうじゃない) as an example, replacing its first sound "so" (そ) with other sounds in order yields "au ja nai" (あうじゃない), "iu ja nai" (いうじゃない), "uu ja nai" (ううじゃない), "eu ja nai" (えうじゃない), "ou ja nai" (おうじゃない), "kau ja nai" (かうじゃない), "kiu ja nai" (きうじゃない), "kuu ja nai" (くうじゃない), and so on.

Similarly, for the correction instruction word "chigau" (違う), replacing its first sound "chi" (ち) with other sounds yields "agau" (あがう), "igau" (いがう), "ugau" (うがう), "egau" (えがう), "ogau" (おがう), "kagau" (かがう), "kigau" (きがう), "kugau" (くがう), "kegau" (けがう), "kogau" (こがう), and so on.

In this way, the readings of candidate words as which a correction instruction word might be recognized are generated, automatically or manually, and registered in the recognition candidate word dictionary 24. For all the other correction instruction words as well, the readings of recognition candidate words in which the first sound is replaced with another sound are registered in the recognition candidate word dictionary 24.
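The automatic generation in stage S10 can be sketched in a few lines. The following is a minimal illustration, not the patent's implementation: the `GOJUON` table (the 46 basic hiragana, without voiced or contracted sounds) and the function name `head_candidates` are assumptions, and the function treats the first character of the reading as the first sound.

```python
# Basic hiragana in gojuon order (46 characters). Voiced sounds such as "じ"
# are omitted here for brevity; a fuller table would be needed to produce them.
GOJUON = (
    "あいうえお" "かきくけこ" "さしすせそ" "たちつてと" "なにぬねの"
    "はひふへほ" "まみむめも" "やゆよ" "らりるれろ" "わをん"
)


def head_candidates(reading):
    """Replace the first kana of a reading with every other kana in GOJUON."""
    head, rest = reading[0], reading[1:]
    return [kana + rest for kana in GOJUON if kana != head]


candidates = head_candidates("ちがう")
print(candidates[:5])  # ['あがう', 'いがう', 'うがう', 'えがう', 'おがう']
```

Applied to "そうじゃない", the same function produces "あうじゃない", "いうじゃない", and so on, matching the order of the examples in the text.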

Recognition candidate words may also be generated by replacing several sounds from the beginning with other sounds, rather than only the first sound. For "sō ja nai" (そうじゃない), for example, this yields candidates such as "aa ja nai" (ああじゃない) and "aau ja nai" (ああうじゃない).

Recognition candidate words may also be generated by replacing part of a sound (a phoneme) with other phonemes in alphabetical order. Take 違う (ti ga u) as an example. The phonemes can be replaced on the assumption of cases such as the following: the consonant "t" of the first sound is not recognized; the consonant "t" is not recognized and the vowel "i" is also picked up only weakly, so that the word ends up recognized as, for example, "hi ga u"; or several sounds mix into the first sound, so that "chi" (ち) changes into a sound with a broad spectral distribution such as "ki" (き), "ke" (け), "ta" (た), "te" (て), or "to" (と), and 違う is recognized as, for example, "ki ga u". For "ti ga u", recognition candidate words can be generated as "ai ga u", "bi ga u", "ci ga u", "di ga u", and so on. The substitution need not be limited to the first sound: candidates can also be generated by replacing phonemes in the middle of the word, on the assumption that a sound shift occurs there.
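The phoneme-level variant can be sketched in the same spirit. This is an illustrative assumption rather than the patent's procedure: the reading is taken to be a space-separated romanization such as "ti ga u", and the hypothetical function `substitute_phoneme` swaps the leading letter of one syllable for every other letter of the alphabet. Many of the resulting strings are not valid Japanese syllables; in the method described here, such candidates would simply never be selected in the second stage.

```python
import string


def substitute_phoneme(reading, syllable_index=0):
    """Replace the leading phoneme of one syllable with every other letter a-z."""
    syllables = reading.split()
    target = syllables[syllable_index]
    candidates = []
    for letter in string.ascii_lowercase:
        if letter == target[0]:
            continue  # skip the original phoneme
        variant = syllables.copy()
        variant[syllable_index] = letter + target[1:]
        candidates.append(" ".join(variant))
    return candidates


print(substitute_phoneme("ti ga u")[:4])  # ['ai ga u', 'bi ga u', 'ci ga u', 'di ga u']
```

Passing a non-zero `syllable_index` covers the case where the sound shift is assumed to occur in the middle of the word.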

Furthermore, recognition candidate words may be generated by replacing one or more sounds at the beginning of the normal reading of a correction instruction word, and one or more sounds at positions other than the beginning, with other sounds in a predetermined order such as gojūon order or alphabetical order.

Next, the second stage (S20) is described. In this stage, voice data of a correction instruction word uttered by the user (that is, data of the correction instruction word spoken with its normal reading) is recorded in a memory device (S21). The recorded user voice data is then played back and superimposed on a predetermined response output from the voice output unit 38, with the timing shifted by a predetermined amount each time, and the combined sound is input to the speech recognition system 1 through the voice input unit 11 (S22).
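The timing sweep in S22 can be pictured with a small sketch. Note the assumption: in the text the two sounds are played back and picked up together through the voice input unit 11, whereas the sketch below adds sample values digitally. The function names `overlay` and `shifted_mixes`, and the use of plain Python lists as sample buffers, are hypothetical.

```python
def overlay(response, user, offset):
    """Add the user samples onto the response samples, starting at `offset`."""
    length = max(len(response), offset + len(user))
    mixed = [0.0] * length
    for i, sample in enumerate(response):
        mixed[i] += sample
    for i, sample in enumerate(user):
        mixed[offset + i] += sample
    return mixed


def shifted_mixes(response, user, step):
    """Produce one mix per start offset, shifting the user voice by `step` samples."""
    return [(offset, overlay(response, user, offset))
            for offset in range(0, len(response), step)]


# Toy buffers: a 4-sample response and a 2-sample user utterance.
for offset, mix in shifted_mixes([1, 1, 1, 1], [2, 2], step=2):
    print(offset, mix)
```

Each mix would then be passed through the recognition steps S23 to S25, so that every overlap timing contributes recognition results.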

The A/D conversion unit 12 of the speech recognition system 1 converts the input combined sound (audio in which the user's voice and the response from the speech recognition system 1 overlap at a predetermined timing) into a digital signal (S23). The feature extraction unit 13 extracts predetermined features from the digitized combined sound data (S24).

The matching unit 14 recognizes the combined sound based on the extracted features, the acoustic model database 15, the grammar database 16, and the recognition candidate word dictionary 24 (S25). Among the recognition results for the combined sound, those whose degree of match (likelihood) with the original correction instruction word is at or above a predetermined value are selected as modified readings of that correction instruction word (S26). Finally, the selected modified readings are registered in the second correction word dictionary 23 (S27).
In other words, from the many recognition candidate words generated in the first stage (S10), those that are likely to be recognized when the speech output of the speech recognition system overlaps with the user's voice are identified and registered in the second correction word dictionary 23.
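The selection in S26 amounts to a threshold filter over the recognition results. A minimal sketch follows, assuming each result is a `(reading, likelihood)` pair; the function name `select_variants`, the threshold value, and the likelihood numbers are made up purely for illustration and are not taken from the patent.

```python
def select_variants(results, threshold):
    """Keep readings whose best likelihood is at or above the threshold."""
    best = {}
    for reading, likelihood in results:
        if likelihood >= threshold:
            # The same reading may appear at several overlap timings; keep its best score.
            best[reading] = max(likelihood, best.get(reading, likelihood))
    return sorted(best, key=lambda reading: -best[reading])


# Illustrative scores only.
results = [("きがう", 0.82), ("じがう", 0.74), ("あがう", 0.31), ("きがう", 0.90)]
second_dictionary_readings = select_variants(results, threshold=0.7)
print(second_dictionary_readings)  # ['きがう', 'じがう']
```

The returned readings correspond to what would be registered in the second correction word dictionary 23 in S27.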

The overall operation of the speech recognition system 1 is described with reference to FIG. 4. In the following, the system that carries out the operation is abbreviated as "system 1". Using the general dictionary 21 (S30), the system 1 waits for the user's voice to be input from the voice input unit 11 (S31). When the user's voice is input (S31: YES), the system 1 executes recognition processing on it (S32) and outputs a preset predetermined response from the voice output unit 38 (S33). The response output in step S33 is an example of a "first response" and includes content that informs the user of the result of recognizing the user's voice in step S32.

After finishing output of the first response (S34: YES), the system 1 switches the dictionary in use from the general dictionary 21 to the first correction word dictionary 22 (S35). The system 1 then waits for the user to input a correction instruction word, but only for a predetermined time after the end of the first response (the first correctable period) (S36 to S39).

即ち、システム１は、第１の訂正語辞書２２に切り替えた後、ユーザ音声が音声入力部１１から入力されたか確認し（Ｓ３６）、ユーザ音声が入力された場合（Ｓ３６：ＹＥＳ）、そのユーザ音声を第１の訂正語辞書２２を用いて認識する（Ｓ３７）。システム１は、訂正指示語のみ登録された第１の訂正語辞書２２を用いて音声を認識するため、訂正指示語を速やかに認識できる。システム１は、ユーザからの訂正指示語の入力を待つ第１の訂正可能期間において、訂正指示語以外の他の単語は認識することができない。 That is, after switching to the first correction word dictionary 22, the system 1 checks whether user speech has been input from the voice input unit 11 (S36). If it has (S36: YES), the system 1 recognizes that speech using the first correction word dictionary 22 (S37). Because the first correction word dictionary 22 registers only correction instruction words, the system 1 can recognize a correction instruction word quickly. During the first correctable period, in which it waits for a correction instruction word from the user, the system 1 cannot recognize any word other than a correction instruction word.

システム１は、第１の訂正可能期間に入力したユーザ音声の認識結果が訂正指示語であるか判定し（Ｓ３８）、訂正指示語の場合（Ｓ３８：ＹＥＳ）、図５で後述する訂正処理を実行する。ユーザ音声の認識結果が訂正指示語ではない場合（Ｓ３８：ＮＯ）、システム１は所定時間が経過したか判定し（Ｓ３９）、所定時間が経過するまでの間（第１の訂正可能期間）、ステップＳ３５に戻ってユーザからの音声入力を待つ。 The system 1 determines whether the recognition result for the user speech input during the first correctable period is a correction instruction word (S38). If it is (S38: YES), the system 1 executes the correction process described later with reference to FIG. 5. If it is not (S38: NO), the system 1 determines whether the predetermined time has elapsed (S39) and, until it has elapsed (that is, during the first correctable period), returns to step S35 and waits for voice input from the user.

第１の訂正可能期間の始期は、第１応答の出力終了時（Ｓ３４）である。第１応答の終了時と第１の訂正語辞書２２の使用開始時とは実質的に同時であるため、第１の訂正可能期間の始期を第１の訂正語辞書の使用開始時として定義することもできる。 The first correctable period starts when the output of the first response ends (S34). Because the end of the first response and the start of use of the first correction word dictionary 22 are substantially simultaneous, the start of the first correctable period can also be defined as the time at which use of the first correction word dictionary begins.

このようにシステム１に入力したユーザの最初の音声の認識結果（Ｓ３２）を第１応答として出力し（Ｓ３３）、ユーザがそれを確認した後、その認識結果が間違っている場合には、システム１に対して直ちに訂正指示語を発声する（Ｓ３６）。この際、システム１はユーザの訂正語の入力を待ち受けて認識し、入力した訂正語に応じて認識結果を取り消すことができる。 In this way, the recognition result (S32) for the user's first utterance input to the system 1 is output as the first response (S33). The user checks it and, if the recognition result is wrong, immediately utters a correction instruction word to the system 1 (S36). The system 1 waits for and recognizes the user's correction word, and can cancel the recognition result in accordance with that word.

第１の訂正可能期間内にユーザが訂正指示語を発声しなかった場合（Ｓ３９：ＹＥＳ）、システム１は、使用する辞書を第１の訂正語辞書２２から第２の訂正語辞書２３に切り替える（Ｓ４０）。辞書の切替と同時にシステム１は、予め用意されている所定の応答（第２応答）を音声出力部３８から出力する（Ｓ４１）。第２応答は、例えば、システム１の認識結果（Ｓ３２）に基づいて行動を決定する旨の通知（例えば、「指示を了解しました」、「わかりました」など）を含むようにして構成することができる。 If the user does not utter a correction instruction word within the first correctable period (S39: YES), the system 1 switches the dictionary in use from the first correction word dictionary 22 to the second correction word dictionary 23 (S40). At the same time as it switches dictionaries, the system 1 outputs a predetermined response prepared in advance (the second response) from the voice output unit 38 (S41). The second response can be configured to include, for example, a notification that an action will be decided based on the recognition result (S32) of the system 1 (for example, “Instruction understood” or “Understood”).

第２応答を音声出力部３８から出力している期間が第２の訂正可能期間である。システム１は、第２応答を出力している間に音声入力部１１からユーザ音声（詳しくはユーザ音声と第２応答の重なった音声）が入力されたか検出する（Ｓ４２〜Ｓ４５）。 The period during which the second response is output from the audio output unit 38 is the second correctable period. The system 1 detects whether a user voice (specifically, a voice in which the user voice and the second response overlap) is input from the voice input unit 11 while outputting the second response (S42 to S45).

即ち、システム１は、第２応答の出力中に、音声入力部１１にシステム１の応答出力以外の音声が入力されたか判定し（Ｓ４２）、音声が入力された場合は第２の訂正語辞書２３を用いてその音声を認識し（Ｓ４３）、認識結果が訂正指示語であるか判定する（Ｓ４４）。訂正指示語である場合（Ｓ４４：ＹＥＳ）、図５で述べる訂正処理を実行する。認識結果が訂正指示語ではなく（Ｓ４４：ＮＯ）、システム１が応答を終了していない場合（Ｓ４５：ＮＯ）、システム１はステップＳ４２に戻る。 That is, the system 1 determines whether a voice other than the response output of the system 1 is input to the voice input unit 11 during the output of the second response (S42), and if the voice is input, the second correction word dictionary. 23 is used to recognize the voice (S43), and it is determined whether the recognition result is a correction instruction word (S44). If it is a correction instruction word (S44: YES), the correction processing described in FIG. 5 is executed. When the recognition result is not the correction instruction word (S44: NO) and the system 1 has not finished the response (S45: NO), the system 1 returns to step S42.

システム１は、第１および第２の訂正可能期間中に訂正処理が行われなかった場合、つまり最初のユーザ音声の認識（Ｓ３２）がユーザにより取り消されなかった場合、その認識結果に応じた動作（行動）を決定し（Ｓ４６）、実行する（Ｓ４７）。 When the correction process is not performed during the first and second correctable periods, that is, when the first user speech recognition (S32) is not canceled by the user, the system 1 operates according to the recognition result. (Action) is determined (S46) and executed (S47).
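The flow of FIG. 4 described above can be viewed as a small state machine. The following is a minimal sketch under stated assumptions: the phase names, event names, and dictionary labels are hypothetical and chosen only for illustration, and timers, audio input, and the recognizer itself are left out.

```python
from enum import Enum, auto


class Phase(Enum):
    WAIT_COMMAND = auto()    # S30-S32: waiting with the general dictionary 21
    FIRST_RESPONSE = auto()  # S33-S34: first response is being output
    FIRST_WINDOW = auto()    # S35-S39: first correctable period
    SECOND_WINDOW = auto()   # S40-S45: second correctable period (second response playing)
    EXECUTE = auto()         # S46-S47: act on the recognition result
    CORRECTION = auto()      # FIG. 5: correction process


# Dictionary in use per phase (labels are hypothetical); phases not listed use none.
DICTIONARY = {
    Phase.WAIT_COMMAND: "general_21",
    Phase.FIRST_WINDOW: "correction_22",
    Phase.SECOND_WINDOW: "correction_23",
}

TRANSITIONS = {
    (Phase.WAIT_COMMAND, "command_recognized"): Phase.FIRST_RESPONSE,
    (Phase.FIRST_RESPONSE, "first_response_done"): Phase.FIRST_WINDOW,
    (Phase.FIRST_WINDOW, "correction_word"): Phase.CORRECTION,     # S38: YES
    (Phase.FIRST_WINDOW, "timeout"): Phase.SECOND_WINDOW,          # S39: YES
    (Phase.SECOND_WINDOW, "correction_word"): Phase.CORRECTION,    # S44: YES
    (Phase.SECOND_WINDOW, "second_response_done"): Phase.EXECUTE,  # S45: YES
    (Phase.CORRECTION, "correction_done"): Phase.WAIT_COMMAND,     # back to S30
}


def step(phase, event):
    """Return the next phase; events with no transition leave the phase unchanged."""
    return TRANSITIONS.get((phase, event), phase)


def run(events, phase=Phase.WAIT_COMMAND):
    """Feed a sequence of events through the state machine and return the final phase."""
    for event in events:
        phase = step(phase, event)
    return phase
```

Speech that is not a correction instruction word (for example an event named "other_speech") has no entry in the table, so the phase does not change, which matches the behavior in which the system simply keeps waiting.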

図５を用いて訂正処理（指示を取り消す処理）を説明する。第１訂正可能期間または第２訂正可能期間のいずれかにおいて、ユーザから訂正指示語が発声されたと認識した場合、システム１は、聞き間違えたことをユーザに視覚的に通知するための聞き間違えマークを表示部４５に表示する（Ｓ５０）。さらに、システム１は、聞き間違えたことをユーザに音声で通知するための聞き間違え確認応答を音声出力部３８から出力する（Ｓ５１）。その後、システム１は、図４のステップＳ３０に戻って、ユーザからの音声による指示を待つ。 The correction process (the process of canceling an instruction) will be described with reference to FIG. 5. When the system 1 recognizes that the user has uttered a correction instruction word in either the first or the second correctable period, it displays a mishearing mark on the display unit 45 to notify the user visually that it misheard (S50). The system 1 also outputs a mishearing confirmation response from the voice output unit 38 to notify the user by voice that it misheard (S51). The system 1 then returns to step S30 in FIG. 4 and waits for a spoken instruction from the user.

音声認識の誤りを確認したことを視覚的に通知するための表示は、テキストメッセージの表示または印刷、ＬＥＤランプの点滅、アクチュエータ３９の動作（例えばロボット２の手足を所定のパターンで動かす）のようにして実現できる。 The visual notification that a speech recognition error has been acknowledged can be realized by, for example, displaying or printing a text message, blinking an LED lamp, or operating the actuator 39 (for example, moving the limbs of the robot 2 in a predetermined pattern).

音声認識の誤りを確認したことを音で知らせるための確認応答は、例えば「聞き間違えたかな」、「ごめんなさい。間違えました」などのように、音声認識の誤りを確認したことのみ示す情報を含んでもよい。または、「聞き間違えました。もう一度言って下さい」などのように、ユーザの再度の指示を促すための情報を含んでもよい。ロボット２が指示待ち状態にあることをＬＥＤランプ等でユーザに知らせる構成でもよい。 The confirmation response that tells the user by sound that a speech recognition error has been acknowledged may contain only information acknowledging the error, such as “Did I mishear?” or “Sorry, I got that wrong.” Alternatively, it may contain information prompting the user to give the instruction again, such as “I misheard. Please say it again.” The robot 2 may also be configured to indicate with an LED lamp or the like that it is waiting for an instruction.

図６〜図８を用いてシステム１の動作の例を説明する。図６は、システム１がユーザの最初の指示を正しく認識した場合を示す。時刻Ｔ０において、システム１は音声認識可能な状態で待機している。図中、音声認識可能な状態を白い矩形で示し、そのうち音声認識処理中の状態を斜線部で示す。但し、音声認識処理の実行中であることを示す斜線部は、理解のための例示であって、処理のタイミングを厳密に示しているわけではない。 An example of the operation of the system 1 will be described with reference to FIGS. FIG. 6 shows a case where the system 1 correctly recognizes the user's first instruction. At time T0, the system 1 stands by in a state where voice recognition is possible. In the figure, the state where speech recognition is possible is indicated by a white rectangle, and the state during speech recognition processing is indicated by the hatched portion. However, the shaded portion indicating that the voice recognition process is being executed is an example for understanding, and does not strictly indicate the processing timing.

システム１は、図４のステップＳ３０、Ｓ３１で述べたように一般辞書２１を選択して、ユーザからの音声入力を待っている。ユーザは、時刻Ｔ１において、所望の音声ＵＭ１を発する。例えば、ユーザは、ロボット２にクイズの出題を促すべく、「クイズ出してよ」という音声ＵＭ１を発したものとする。ここでロボット２は、クイズの出題、ダンスの披露などの所定の機能を実現できるようになっているものとする。 The system 1 selects the general dictionary 21 as described in steps S30 and S31 of FIG. 4 and waits for a voice input from the user. The user utters a desired voice UM1 at time T1. For example, it is assumed that the user utters a voice UM1 “Please quiz” to prompt the robot 2 to give a quiz question. Here, it is assumed that the robot 2 can realize predetermined functions such as quiz questions and dance performances.

システム１は、図４のステップＳ３２で述べたように、ユーザの指示を伝える音声ＵＭ１を認識すると、時刻Ｔ２において、認識結果を示す第１応答ＳＭ１（例えば「クイズですね」）を出力する（図４のＳ３３）。ユーザは、システム１からの第１応答ＳＭ１を聞いて、自分の指示が正しく認識されたことを確認する。 As described for step S32 in FIG. 4, when the system 1 recognizes the voice UM1 conveying the user's instruction, it outputs at time T2 a first response SM1 indicating the recognition result (for example, “A quiz, right?”) (S33 in FIG. 4). The user listens to the first response SM1 from the system 1 and confirms that the instruction was recognized correctly.

システム１は、第１応答ＳＭ１の出力終了時刻Ｔ３において、一般辞書２１から第１訂正辞書２２に切り替える（図４のＳ３５）。時刻Ｔ３から時刻Ｔ５までの間が、システム１の誤認識を訂正するための第１の訂正可能期間となる。 The system 1 switches from the general dictionary 21 to the first correction dictionary 22 at the output end time T3 of the first response SM1 (S35 in FIG. 4). A period from time T3 to time T5 is a first correctable period for correcting erroneous recognition of the system 1.

ユーザの最初の指示ＵＭ１はシステム１により正しく認識されているため、ユーザは、無言のままで待つこともできるし、例えば時刻Ｔ４において何らかの言葉ＵＭ２（例えば「うん」）を発することもできる。 Since the user's first instruction UM1 is correctly recognized by the system 1, the user can wait without saying anything, or can say some word UM2 (eg, “Yes”) at time T4.

システム１は、ユーザの音声ＵＭ２を検出すると、第１の訂正辞書２２を用いて音声認識を試みる。しかし、第１の訂正辞書２２には訂正指示語のみ登録されているため、システム１は、訂正指示語以外の言葉を認識することはできない。従って、システム１は特に何もせずにそのまま待機する。 When the system 1 detects the user's voice UM2, the system 1 tries to recognize the voice using the first correction dictionary 22. However, since only the correction instruction word is registered in the first correction dictionary 22, the system 1 cannot recognize words other than the correction instruction word. Therefore, the system 1 stands by without doing anything.

時刻Ｔ５において第１の訂正可能期間が終了すると同時に、システム１は第２の訂正語辞書２３に切り替える（図４のＳ４０）と共に、所定の了解動作ＲＡ１の少なくとも一部として、第２の応答ＳＭ２を出力する（図４のＳ４１）。 At time T5, as soon as the first correctable period ends, the system 1 switches to the second correction word dictionary 23 (S40 in FIG. 4) and outputs the second response SM2 as at least part of a predetermined acknowledgment operation RA1 (S41 in FIG. 4).

所定の了解動作ＲＡ１とは、ユーザの指示を了解した旨を通知するための動作であり、音声出力に限らず、例えば表示部３５を介した表示出力、アクチュエータ３９の動作などを併用してもよい。なお、図４では、第２の訂正語辞書２３に切り替えた後で、第２応答ＳＭ２を出力するかのように示すが、実際には辞書の切替と第２応答ＳＭ２の出力は同時に実行される。 The predetermined acknowledgment operation RA1 is an operation for notifying the user that the instruction has been understood. It is not limited to voice output and may be combined with, for example, display output via the display unit 35 or operation of the actuator 39. Although FIG. 4 appears to show the second response SM2 being output after the switch to the second correction word dictionary 23, the dictionary switch and the output of the second response SM2 are in fact performed simultaneously.

ここで、第２応答が出力される期間である時刻Ｔ５から時刻Ｔ７までの間が、第２の訂正可能期間となる。第２の訂正可能期間において、ユーザは黙って待っていることもできるし、何らかの言葉ＵＭ３（例えば「楽しみだ」）を発することもできる。ユーザの音声ＵＭ３は、システム１の第２応答ＳＭ２と重なって音声入力部１１に入力される。システム１は、ユーザ音声ＵＭ３と第２応答ＳＭ２とが重なった音声を検出すると、第２の訂正語辞書２３を用いて認識を試みる。しかし、上述の通り、第２の訂正語辞書２３は、訂正指示語の通常の読みと所定の変形読みだけを記憶しているため、システム１は、ユーザ音声ＵＭ３を正しく認識することができない。従って、システム１は、特に何もせずにそのまま待機する。 Here, a period from time T5 to time T7, which is a period during which the second response is output, is the second correctable period. In the second correctable period, the user can either wait silently or speak some word UM3 (eg, “I am looking forward”). The user's voice UM3 is input to the voice input unit 11 so as to overlap the second response SM2 of the system 1. When the system 1 detects a voice in which the user voice UM3 and the second response SM2 overlap, the system 1 tries to recognize using the second correction word dictionary 23. However, as described above, since the second correction word dictionary 23 stores only the normal reading of the correction instruction word and the predetermined modified reading, the system 1 cannot correctly recognize the user voice UM3. Therefore, the system 1 stands by without doing anything.

時刻Ｔ７において第２応答の出力が終了すると（図４のステップＳ４５）、システム１は、最初のユーザ音声ＵＭ１の認識結果から決定された所定の動作ＲＡ２（ここではクイズ出題）を開始する。システム１は、所定の動作の一部としての第３応答ＳＭ３（例えば「第１問・・・」）を出力する。 When the output of the second response is completed at time T7 (step S45 in FIG. 4), the system 1 starts a predetermined operation RA2 (here, a quiz question) determined from the recognition result of the first user voice UM1. The system 1 outputs a third response SM3 (for example, “first question...”) As a part of the predetermined operation.

第３応答ＳＭ３を出力している期間（Ｔ７−Ｔ８）、システム１は音声認識処理を停止することができる。第３応答ＳＭ３の出力終了後の時刻Ｔ８において、システム１は一般辞書２１を選択し、ユーザからの音声入力を待つ。第３応答ＳＭ３の出力期間中に、システム１は一般辞書２１を選択して、ユーザの音声を認識できる構成としてもよい。 During the period when the third response SM3 is output (T7-T8), the system 1 can stop the voice recognition process. At time T8 after the end of the output of the third response SM3, the system 1 selects the general dictionary 21 and waits for voice input from the user. During the output period of the third response SM3, the system 1 may select the general dictionary 21 and recognize the user's voice.

図７は、システム１がユーザの最初の指示を誤認識し、ユーザが誤認識に気づいて第１の訂正可能期間（Ｔ３−Ｔ５）に訂正を要求する場合を示す。 FIG. 7 shows a case where the system 1 misrecognizes the user's first instruction, and the user notices the misrecognition and requests correction during the first correctable period (T3-T5).

システム１は時刻Ｔ０において一般辞書２１を選択しており、ユーザからの音声が入力されるのを待っている。時刻Ｔ１において、ユーザから最初の音声ＵＭ１Ａ（例えば「ダンス踊ってよ」）が入力されると（図４のＳ３１）、システム１はその音声を一般辞書２１を用いて認識する（図４のＳ３２）。ここで、システム１はユーザ音声ＵＭ１Ａを誤って認識したとする（例えばクイズ出題を指示されたと認識）。 The system 1 has selected the general dictionary 21 at time T0, and is waiting for voice input from the user. When the user inputs the first voice UM1A (for example, “Dance dance”) at time T1 (S31 in FIG. 4), the system 1 recognizes the voice using the general dictionary 21 (S32 in FIG. 4). ). Here, it is assumed that the system 1 erroneously recognizes the user voice UM1A (for example, recognizes that a quiz question has been instructed).

時刻Ｔ２において、システム１は、音声認識結果をユーザに伝えて、もしも認識結果に誤りがある場合は訂正指示の機会を与えるべく、音声認識結果を含む第１応答ＳＭ１Ａを出力する（図４のＳ３３）。ここでは、システム１は、「クイズですね？」と応答するものとする。第１応答ＳＭ１Ａの出力開始時Ｔ２から出力終了時Ｔ３までの第１応答出力期間（Ｔ２−Ｔ３）では、システム１は音声認識処理を停止する。 At time T2, the system 1 transmits the speech recognition result to the user, and if there is an error in the recognition result, outputs a first response SM1A including the speech recognition result to give an opportunity for a correction instruction (FIG. 4). S33). Here, it is assumed that the system 1 responds “Is it a quiz?”. In the first response output period (T2-T3) from the output start time T2 of the first response SM1A to the output end time T3, the system 1 stops the speech recognition process.

第１応答ＳＭ１Ａの出力終了時Ｔ３に、システム１は、第１の訂正語辞書２２を選択する（図４のＳ３５）。システム１は、第１の訂正可能期間（Ｔ３−Ｔ５）において、音声認識可能な状態になり、ユーザからの音声入力を待つ（図４のＳ３６）。 At the end of output of the first response SM1A T3, the system 1 selects the first correction word dictionary 22 (S35 in FIG. 4). In the first correctable period (T3-T5), the system 1 enters a state where voice recognition is possible, and waits for voice input from the user (S36 in FIG. 4).

ここでは、システム１からの第１応答ＳＭ１Ａを聞いたユーザがシステム１の音声認識の誤りに直ちに気づいて、時刻Ｔ４において訂正指示ＵＭ２Ａを発したとする。 Here, it is assumed that the user who hears the first response SM1A from the system 1 immediately notices the error in the speech recognition of the system 1 and issues the correction instruction UM2A at time T4.

システム１は、ユーザの訂正指示ＵＭ２Ａを第１の訂正辞書２２を用いて認識し、訂正指示が要求されたことを知る（図４のＳ３８でＹＥＳ）。第１訂正可能期間では、システム１からの応答は出力されないため、第１訂正可能期間において音声入力部１１に入力される音のうちユーザ音声が占める比は高い（Ｓ／Ｎ比が大きい）。また、第１の訂正語辞書２２は訂正指示語のみ登録しているため、システム１は第１の訂正語辞書２２を用いて、ユーザの訂正指示ＵＭ２Ａが短い場合であっても、その訂正指示ＵＭ２Ａを速やかに正しく認識することができる。 The system 1 recognizes the user's correction instruction UM2A using the first correction dictionary 22 and thereby learns that a correction has been requested (YES in S38 of FIG. 4). Because the system 1 outputs no response during the first correctable period, the user's speech accounts for a high proportion of the sound input to the voice input unit 11 in that period (the S/N ratio is large). In addition, because the first correction word dictionary 22 registers only correction instruction words, the system 1 can use it to recognize the correction instruction UM2A quickly and correctly even when that instruction is short.

ユーザから訂正指示が発せられたことを知ったシステム１は、時刻Ｔ５において、予め登録されている所定の聞き間違え動作ＲＡ１Ａを開始する。システム１は、聞き間違え動作ＲＡ１Ａとして、例えば表示部３５などに聞き間違えマークを表示したり（図５のＳ５０）、予め登録されている所定の第２応答ＳＭ２Ａを出力する（図５のＳ５１）。第２応答ＳＭ２Ａの出力期間中、システム１は音声認識処理を停止できる。 The system 1 that knows that a correction instruction has been issued by the user starts a predetermined mistaken operation RA1A registered in advance at time T5. The system 1 displays, for example, a misunderstanding mark on the display unit 35 or the like as the mishearing operation RA1A (S50 in FIG. 5) or outputs a predetermined second response SM2A registered in advance (S51 in FIG. 5). . During the output period of the second response SM2A, the system 1 can stop the voice recognition process.

聞き間違え動作ＲＡ１Ａの終了時Ｔ７（第２応答ＳＭ２Ａの出力終了時でもある）に、システム１は一般辞書２１を選択し、ユーザ音声の入力を待つ（図４のＳ３０）。つまり、時刻Ｔ０の段階に戻る。 At the end of the misinterpretation operation RA1A T7 (also at the end of the output of the second response SM2A), the system 1 selects the general dictionary 21 and waits for the input of the user voice (S30 in FIG. 4). That is, the process returns to the time T0 stage.

システム１の聞き間違え動作を確認したユーザは、時刻Ｔ７において、正しい指示ＵＭ３Ａを発することができる。システム１は、そのユーザ音声ＵＭ３Ａを一般辞書２１を用いて音声認識し（図４のＳ３２）、音声認識結果を含む新たな第１応答ＳＭ３Ａを音声出力部３８から出力する。その後、図６で述べたように、システム１はユーザ指示に応じた所定の動作を実行する。 The user who has confirmed the mistaken operation of the system 1 can issue the correct instruction UM3A at time T7. The system 1 recognizes the user voice UM3A using the general dictionary 21 (S32 in FIG. 4), and outputs a new first response SM3A including the voice recognition result from the voice output unit 38. Thereafter, as described with reference to FIG. 6, the system 1 performs a predetermined operation in accordance with the user instruction.

図８は、システム１がユーザの最初の指示を誤認識し、ユーザが誤認識に気づいて第２の訂正可能期間（Ｔ５−Ｔ７）に訂正を要求する場合を示す。 FIG. 8 shows a case where the system 1 misrecognizes the user's first instruction, and the user notices the misrecognition and requests correction during the second correctable period (T5-T7).

時刻Ｔ１において、ユーザから最初の音声ＵＭ１Ｂが入力されると、システム１はその音声を一般辞書２１を用いて認識するが、図７で述べたと同様に誤認識したとする。 When the first voice UM1B is input from the user at time T1, the system 1 recognizes the voice using the general dictionary 21, but it is assumed that the voice is mistakenly recognized as described in FIG.

時刻Ｔ２において、システム１は第１応答ＳＭ１Ｂを出力する。第１応答ＳＭ１Ｂの出力終了時Ｔ３に、システム１は、第１の訂正語辞書２２を選択する。システム１は、第１の訂正可能期間（Ｔ３−Ｔ５）において、音声認識可能な状態になり、ユーザからの音声入力を待つ。 At time T2, the system 1 outputs a first response SM1B. At the end of output of the first response SM1B, the system 1 selects the first correction word dictionary 22. In the first correctable period (T3-T5), the system 1 is in a state where voice recognition is possible and waits for voice input from the user.

ここでは、ユーザはシステム１の音声認識の誤りに気づいたものの、それに対応するための反応が遅れ、第１の訂正可能期間中に訂正を指示できなかったものとする。第１の訂正可能期間の終了時Ｔ５に、システム１は、ユーザ指示（誤認識した指示）に基づいた所定の動作を開始する旨を通知すべく、所定の了解動作ＲＡ１Ｂの少なくとも一部として、第２応答ＳＭ２Ｂを出力する。 Here, it is assumed that the user noticed the speech recognition error of the system 1 but was slow to react and could not instruct a correction during the first correctable period. At the end T5 of the first correctable period, the system 1 outputs the second response SM2B as at least part of a predetermined acknowledgment operation RA1B, in order to announce that it will start the predetermined operation based on the user instruction (the misrecognized instruction).

第２応答ＳＭ２Ｂの出力期間中（Ｔ５−Ｔ７）、つまり第２の訂正可能期間に、ユーザから訂正を求める音声ＵＭ２Ｂが入力されたとする。音声入力部１１には、第２応答ＳＭ２Ｂとユーザ音声ＵＭ２Ｂとが重なって入力される。 Assume that the user inputs a voice UM2B for correction during the output period of the second response SM2B (T5-T7), that is, in the second correctable period. The second response SM2B and the user voice UM2B are input to the voice input unit 11 in an overlapping manner.

システム１は、第２の訂正可能期間（Ｔ５−Ｔ７）において、第２の訂正語辞書２３を用いた音声認識が可能な状態になっている。上述の通り、第２の訂正語辞書２３には、訂正指示語の通常の読みと所定の変形読みだけが登録されている。所定の変形読みは、図３で説明した通り、ユーザの発した訂正指示語をタイミングをずらしながら第２応答に重ねてシステム１に入力した場合の音声認識結果のうち、所定値以上の尤度を有する読みである。従って、第２の訂正可能期間に第２応答ＳＭ２Ｂとユーザ音声ＵＭ２Ｂが重なってシステム１に入力された場合でも、システム１は、ユーザ音声ＵＭ２Ｂが何を言わんとしているのか正確に判別できる。 During the second correctable period (T5-T7), the system 1 is able to perform speech recognition using the second correction word dictionary 23. As described above, the second correction word dictionary 23 registers only the normal readings of the correction instruction words and the predetermined modified readings. As explained with reference to FIG. 3, a predetermined modified reading is a reading that has a likelihood equal to or greater than a predetermined value among the speech recognition results obtained when a correction instruction word uttered by a user is superimposed on the second response at shifted timings and input to the system 1. Therefore, even when the second response SM2B and the user voice UM2B overlap and are input to the system 1 during the second correctable period, the system 1 can accurately determine what the user voice UM2B is meant to say.

システム１は、ユーザの訂正指示ＵＭ２Ｂを理解した場合、時刻Ｔ７において、予め登録されている所定の聞き間違え動作ＲＡ２Ｂを開始する。システム１は、聞き間違え動作ＲＡ２Ｂとして、例えば表示部３５などに聞き間違えマークを表示したり、予め登録されている所定の第３応答ＳＭ３Ｂを出力する。 When the system 1 understands the correction instruction UM2B of the user, the system 1 starts a predetermined mistaken operation RA2B registered in advance at time T7. For example, the system 1 displays a mistaken mark on the display unit 35 or outputs a predetermined third response SM3B registered in advance as the mistaken operation RA2B.

その後、システム１は、音声入力を待つアイドリング状態に戻り（Ｔ８）、ユーザからの音声ＵＭ３Ｂが入力されるのを待つ。システム１は、そのユーザ音声ＵＭ３Ｂを一般辞書２１を用いて音声認識する。その後、システム１は、音声認識結果を含む新たな第１応答を出力し、指示された通りの所定の動作を実行する（図示省略）。 Thereafter, the system 1 returns to the idling state where the voice input is waited (T8), and waits for the voice UM3B input from the user. The system 1 recognizes the user voice UM3B using the general dictionary 21. Thereafter, the system 1 outputs a new first response including the voice recognition result, and executes a predetermined operation as instructed (not shown).

このように構成される本実施例によれば、ユーザが音声認識結果を訂正するために使用する可能性のある訂正指示語を記憶する訂正語辞書２２、２３と、通常の語句を記憶する一般辞書２１を切り替えて使用するため、ユーザの訂正指示を正しく認識できる可能性が高まり、誤認識でコマンドが起動した場合でもそれを速やかに取り消すことができ、使い勝手が向上する。 According to this embodiment configured as described above, the system switches between the correction word dictionaries 22 and 23, which store correction instruction words that the user may use to correct a speech recognition result, and the general dictionary 21, which stores ordinary words and phrases. This raises the likelihood that the user's correction instruction is recognized correctly, so that even when a command is started as a result of misrecognition it can be canceled promptly, which improves usability.

本実施例では、ユーザの最初の指示（音声）をシステム１の第１応答として復唱させるため、ユーザはシステム１が正しく認識したか否かを判断できる。そして、本実施例では、ユーザがシステム１の認識を訂正する期間（Ｔ３−Ｔ７）を、ユーザ音声とシステム１の応答とが重ならない第１の訂正可能期間（Ｔ３−Ｔ５）と、ユーザ音声とシステム１の応答が重なる可能性のある第２の訂正可能期間（Ｔ５−Ｔ７）とに分ける。さらに本実施例では、第１の訂正可能期間では、訂正指示語の通常の読みだけを登録した第１の訂正語辞書２２を使用し、第２の訂正可能期間では、訂正指示語の通常の読みおよび所定の変形読みだけを登録した第２の訂正語辞書２３を使用する。従って、本実施例によれば、ユーザ音声だけが入力される場合も、ユーザ音声とシステム１の応答とが重なって入力される場合のいずれの場合も、ユーザの音声による訂正指示を正しく認識することができる。これにより、システム１の信頼性、使い勝手が向上する。 In this embodiment, the system 1 repeats the user's first instruction (speech) back as the first response, so the user can judge whether the system 1 recognized it correctly. The period in which the user can correct the recognition of the system 1 (T3-T7) is divided into a first correctable period (T3-T5), in which the user's speech and the response of the system 1 do not overlap, and a second correctable period (T5-T7), in which they may overlap. Furthermore, the first correctable period uses the first correction word dictionary 22, which registers only the normal readings of the correction instruction words, and the second correctable period uses the second correction word dictionary 23, which registers only the normal readings and the predetermined modified readings of the correction instruction words. Therefore, according to this embodiment, a spoken correction instruction from the user can be recognized correctly both when the user's speech alone is input and when it is input overlapped with the response of the system 1. This improves the reliability and usability of the system 1.
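The division of the timeline into periods, each with its own dictionary, can be illustrated by a simple lookup. This is only a sketch for the case of FIG. 6 (no correction issued), assuming hypothetical dictionary labels and numeric times; None marks the periods in which the first embodiment may stop speech recognition.

```python
from bisect import bisect_right

# Dictionary in use in each interval of the FIG. 6 timeline (labels are hypothetical):
#   before T2: general dictionary 21
#   T2 to T3 : first response is output, recognition stopped
#   T3 to T5 : first correctable period, first correction word dictionary 22
#   T5 to T7 : second correctable period, second correction word dictionary 23
#   T7 to T8 : third response is output, recognition may be stopped
#   from T8  : general dictionary 21 again
LABELS = ["general_21", None, "correction_22", "correction_23", None, "general_21"]


def dictionary_at(t, boundaries, labels=LABELS):
    """Return the dictionary label in use at time t.

    boundaries is the ascending list [T2, T3, T5, T7, T8]. A boundary time
    belongs to the interval that starts at it, so the switch takes effect
    exactly at the boundary.
    """
    if len(labels) != len(boundaries) + 1:
        raise ValueError("need exactly one more label than boundaries")
    return labels[bisect_right(boundaries, t)]


# Example with made-up times in seconds for T2, T3, T5, T7, T8.
timeline = [2.0, 3.0, 5.0, 7.0, 8.0]
in_first_window = dictionary_at(4.0, timeline)
in_second_window = dictionary_at(6.0, timeline)
```

Treating each boundary as belonging to the following interval reflects the statement that the dictionary switch and the start of the next period are substantially simultaneous.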

本実施例では、ユーザが訂正指示を出す可能性の高い期間（Ｔ３−Ｔ７）において、訂正指示語のみ登録した訂正語辞書２２、２３を使用する。従って、システム１は、比較的小サイズの訂正語辞書２２、２３を用いて、訂正指示語が発せられたかを直ちに判別することができる。 In the present embodiment, the correction word dictionaries 22 and 23 in which only correction instruction words are registered are used during a period (T3-T7) in which the user is likely to issue correction instructions. Therefore, the system 1 can immediately determine whether the correction instruction word has been issued using the correction word dictionaries 22 and 23 having a relatively small size.

本実施例では、ユーザの発した訂正指示語をタイミングをずらしながら第２応答に重ねてシステム１に入力した場合の音声認識結果のうち、所定値以上の尤度を有する読みを所定の変形読みとして第２の訂正語辞書２３に登録する。従って、システム１の応答（第２応答）とユーザの訂正指示の音声とが重なる可能性のある期間に第２の訂正語辞書２３を使用することで、ユーザの音声を正しく認識できる可能性が高まる。 In this embodiment, among the speech recognition results obtained when a correction instruction word uttered by a user is superimposed on the second response at shifted timings and input to the system 1, readings with a likelihood equal to or greater than a predetermined value are registered in the second correction word dictionary 23 as predetermined modified readings. Therefore, using the second correction word dictionary 23 during the period in which the response of the system 1 (the second response) and the user's spoken correction instruction may overlap raises the likelihood that the user's speech is recognized correctly.

より詳しくは、本実施例では、ユーザ音声の先頭の音を他の音に置き換えた認識候補語を生成し、ユーザの訂正指示語をタイミングをずらしながら第２応答に重ねてシステム１に入力し、認識された候補語のうち所定値以上の尤度を有する候補語の読みを所定の変形読みとして使用する。従って、比較的簡易な構成で第２の訂正語辞書２３を作成することができ、その第２の訂正語辞書２３を使用することで、Ｓ／Ｎ比の小さい状況下でもユーザ音声を正しく認識できる確率を高めることができる。 More specifically, in this embodiment, recognition candidate words are generated by replacing the first sound of the user's utterance with other sounds, the user's correction instruction word is superimposed on the second response at shifted timings and input to the system 1, and the readings of those recognized candidate words that have a likelihood equal to or greater than a predetermined value are used as the predetermined modified readings. The second correction word dictionary 23 can therefore be created with a relatively simple configuration, and using it increases the probability that the user's speech is recognized correctly even when the S/N ratio is small.
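The candidate generation step, in which the first sound of a word is replaced with other sounds, can be sketched as follows. This is a minimal illustration that assumes a reading is represented as a list of romanized sounds; the example word and the replacement inventory are hypothetical and not taken from the specification.

```python
def generate_candidates(reading, replacements):
    """Return candidate readings with the first sound replaced.

    reading      : list of sounds, e.g. ["chi", "ga", "u"] (hypothetical representation)
    replacements : sounds to try in the first position
    The original first sound is skipped so that only modified readings are returned.
    """
    if not reading:
        raise ValueError("reading must contain at least one sound")
    head, tail = reading[0], reading[1:]
    return [[sound] + tail for sound in replacements if sound != head]


# Example with a made-up instruction word and a small made-up replacement inventory.
candidates = generate_candidates(["chi", "ga", "u"], ["chi", "shi", "ki", "i"])
```

Each candidate would then be checked against recognition results for overlapped input, and only candidates whose likelihood reaches the predetermined value would be kept as modified readings.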

図９を用いて第２実施例を説明する。本実施例を含む以下の各実施例は第１実施例の変形例に該当するため、第１実施例との相違を中心に説明する。本実施例では、第１応答の出力期間中に、第２の一般辞書２５を使用する。 A second embodiment will be described with reference to FIG. Each of the following embodiments including the present embodiment corresponds to a modification of the first embodiment, and therefore, description will be made focusing on differences from the first embodiment. In the present embodiment, the second general dictionary 25 is used during the output period of the first response.

本実施例の音声認識システム１は、一般辞書２１、第１の訂正語辞書２２、第２の訂正語辞書２３に加えて、第２の一般辞書２５を備える。辞書選択部１７は、システム制御部３１からの指示に基づいて、それら辞書２１、２２、２３、２５のうちいずれか一つを選択する。 The speech recognition system 1 according to the present embodiment includes a second general dictionary 25 in addition to the general dictionary 21, the first correction word dictionary 22, and the second correction word dictionary 23. The dictionary selection unit 17 selects any one of the dictionaries 21, 22, 23, and 25 based on an instruction from the system control unit 31.

本実施例では、システム１がユーザの最初の指示を復唱するための第１応答ＳＭ１を出力している間に音声認識処理を実行可能となっている。システム１は、第１応答ＳＭ１を出力している期間に、第２の一般辞書２５を用いて音声認識を行うことができる。 In the present embodiment, the voice recognition process can be executed while the system 1 outputs the first response SM1 for repeating the user's first instruction. The system 1 can perform speech recognition using the second general dictionary 25 during the period when the first response SM1 is being output.

第２の一般辞書２５は、図３で述べた第２の訂正語辞書２３と同様の作成方法に従って作成することができる。即ち、一般の言葉のそれぞれについて、所定箇所（例えば先頭の１音か２音）の音を他の音に置き換えることで、一般用認識候補語辞書を生成する。そして、システム１が出力する可能性のある全ての第１応答と一般の言葉との全ての組合せについて、重ねるタイミングをずらしながらシステム１に入力する。システム１が認識した一般用の候補語のうち、所定値以上の尤度を有する候補語を所定の一般用変形読みとして、第２の一般辞書２５に登録する。第２の一般辞書２５は、訂正指示語を含まない一般の言葉の通常の読みと所定の変形読みとを対応付けて記憶する。 The second general dictionary 25 can be created according to the same creation method as the second correction word dictionary 23 described in FIG. That is, for each general word, a general recognition candidate word dictionary is generated by replacing the sound at a predetermined location (for example, the first or second sound) with another sound. Then, all combinations of all the first responses and general words that can be output by the system 1 are input to the system 1 while shifting the overlapping timing. Of the candidate words for general use recognized by the system 1, candidate words having a likelihood equal to or higher than a predetermined value are registered in the second general dictionary 25 as predetermined general modified readings. The second general dictionary 25 stores a normal reading of a general word that does not include a correction instruction word and a predetermined modified reading in association with each other.
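Building the second general dictionary requires trying every combination of first response, general word, and overlap timing. The following is only a sketch of that enumeration and of one possible way to overlay two signals. It assumes that signals are plain lists of float samples and that overlapping is modeled as sample-wise addition; both are illustrative assumptions, and the function names are hypothetical.

```python
from itertools import product


def overlap_trials(first_responses, words, offsets):
    """Enumerate every (first response, word, offset) combination to be tested."""
    return list(product(first_responses, words, offsets))


def mix(response, word, offset):
    """Overlay word samples onto response samples, starting at sample index offset.

    Overlap is modeled here as simple addition of samples (an assumption).
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    length = max(len(response), offset + len(word))
    mixed = [0.0] * length
    for i, sample in enumerate(response):
        mixed[i] += sample
    for i, sample in enumerate(word):
        mixed[offset + i] += sample
    return mixed


# Example with made-up labels: 2 responses x 3 words x 2 offsets gives 12 trials.
trials = overlap_trials(["resp_a", "resp_b"], ["word_x", "word_y", "word_z"], [0, 1])
```

Because the number of trials is the product of the three list sizes, the cost of building the second general dictionary grows quickly with the vocabulary, which is one reason to restrict the replaced position to the first one or two sounds.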

図９のタイムチャートを説明する。時刻Ｔ１で、システム１は、ユーザ音声ＵＭ１Ｃを一般辞書２１（第１の一般辞書）を用いて認識する。時刻Ｔ２で、システム１は、第１応答ＳＭ１を出力することでユーザ音声ＵＭ１Ｃの認識結果を復唱する。第１応答ＳＭ１の出力中に、ユーザが音声ＵＭ２Ｃを発した場合、第１応答ＳＭ１とユーザ音声ＵＭ２Ｃとが重なってシステム１に入力される。システム１は、第２の一般辞書２５を使用してユーザ音声ＵＭ２Ｃを認識する。システム１は、ユーザ音声ＵＭ２Ｃを認識できたことを示すために、所定の受領動作ＲＡ１Ｃを実行することができる。受領動作ＲＡ１Ｃとして、システム１は、例えば、表示部３５にメッセージを表示したり、ＬＥＤランプなどを点滅させたりすることができる。 The time chart of FIG. 9 will be described. At time T1, the system 1 recognizes the user voice UM1C using the general dictionary 21 (first general dictionary). At time T2, the system 1 repeats the recognition result of the user voice UM1C by outputting the first response SM1. If the user utters the voice UM2C during the output of the first response SM1, the first response SM1 and the user voice UM2C are input to the system 1 in an overlapping manner. The system 1 recognizes the user voice UM2C using the second general dictionary 25. The system 1 can perform a predetermined receiving operation RA1C to indicate that the user voice UM2C has been recognized. As the receiving operation RA1C, the system 1 can display a message on the display unit 35 or blink an LED lamp or the like, for example.

その後、第１の訂正可能期間でユーザが訂正指示語以外の音声ＵＭ３Ｃを発しても、その音声は訂正指示ではないため、システム１は特に反応しない。なお、訂正指示語以外の言葉であると認識した場合に、ＬＥＤランプを点滅させる等の動作を行ってもよい。 After that, even if the user utters a voice UM3C other than a correction instruction word during the first correctable period, the system 1 does not react, because that voice is not a correction instruction. When the system 1 recognizes that the utterance is something other than a correction instruction word, it may perform an operation such as blinking an LED lamp.

In the second correctable period after the first response is output, the system 1 executes the acknowledgment operation RA2C and outputs the second response SM2. When a voice UM4C other than a correction instruction word is uttered in the second correctable period, the user voice UM4C is input to the system 1 overlapping the second response SM2. The system 1 attempts to recognize the voice UM4C using the second correction word dictionary 23 but does not particularly react, because the voice is not a correction instruction word. As above, the system 1 may nevertheless show some reaction.

Thereafter, the system 1 performs a predetermined operation RA3C in accordance with the first user instruction UM1C and outputs the third response SM3. After the output of the third response SM3 ends, the system 1 switches to the first general dictionary 21 and waits for voice input from the user.

This embodiment, configured as described above, achieves the same operational effects as the first embodiment. In addition, this embodiment uses the second general dictionary 25, which makes it possible to recognize the user's voice even when a general word spoken by the user overlaps a response of the system 1. User speech uttered during the output period of the first response can therefore be recognized with high accuracy.

A third embodiment will be described with reference to FIG. 10. In this embodiment, both the correction word dictionaries 22 and 23 and the general dictionary 21 are used during the period in which the user may request correction of a misrecognition by the system 1.

As shown in the time chart of FIG. 10, the system 1 uses the first correction word dictionary 22 and the general dictionary 21 in the first correctable period (T3 to T5), and uses the second correction word dictionary 23 and the general dictionary 21 in the subsequent second correctable period (T5 to T7).
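The period-dependent choice of dictionaries in FIG. 10 can be sketched as a small selection function. This is only an illustration under stated assumptions: the function name `active_dictionaries`, the string labels, the numeric time values, and the half-open treatment of the boundaries at T5 and T7 are all hypothetical. The fallback to the general dictionary alone outside both correctable periods follows the behavior recited in the claims.

```python
from typing import FrozenSet

# Labels are illustrative; the numerals follow the reference signs.
GENERAL = "general dictionary 21"
CORRECTION_1 = "first correction word dictionary 22"
CORRECTION_2 = "second correction word dictionary 23"


def active_dictionaries(t: float, t3: float, t5: float, t7: float) -> FrozenSet[str]:
    """Return the dictionaries used at time t in the third embodiment."""
    if t3 <= t < t5:
        # First correctable period: normal readings plus general words.
        return frozenset({CORRECTION_1, GENERAL})
    if t5 <= t < t7:
        # Second correctable period: overlap-aware readings plus general words.
        return frozenset({CORRECTION_2, GENERAL})
    # Outside both correctable periods only the general dictionary is used.
    return frozenset({GENERAL})
```

For example, with T3 = 3, T5 = 5, and T7 = 7, a call at t = 4 selects dictionaries 22 and 21, and a call at t = 6 selects dictionaries 23 and 21.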

This embodiment, configured as described above, also obtains the same operational effects as the first embodiment. In addition, because both the correction word dictionaries 22 and 23 and the general dictionary 21 are used during the period in which the user may utter a correction instruction word, the system can recognize the utterance even when the user speaks a general word other than a correction instruction word.

A fourth embodiment will be described with reference to FIG. 11. In this embodiment, the second correction word dictionary 23A is created in consideration of how the user's voice sounds different when both a response of the system 1 and an operating sound are superimposed on the correction instruction word uttered by the user.

FIG. 11 is an explanatory diagram showing a method of generating the second correction word dictionary 23A according to this embodiment. In this embodiment, the correction instruction word uttered by the user and the response of the system 1 are not only overlapped while the timing is shifted (S21); the operating sound of the system in which the speech recognition system 1 is mounted (for example, the robot 2) is also superimposed before the speech is recognized. Among the candidate words in the recognition result, the readings of candidate words having a likelihood equal to or higher than a predetermined value are registered in the second correction word dictionary 23A. As the operating sound of the system in which the speech recognition system 1 is mounted, it is possible to use, for example, a sound that is likely to be emitted from the robot 2 while the first response is being output (such as the sound of an electric motor). The operating sound can also be called an environmental sound.
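The superposition of the correction instruction word, the system response, and the operating sound can be pictured as simple sample-wise mixing. The following is a rough sketch rather than the method of the specification: the function names `overlay` and `training_inputs` are hypothetical, signals are assumed to be plain lists of samples at a common sampling rate, and no gain control or clipping is modeled.

```python
from typing import List, Sequence


def overlay(base: Sequence[float], other: Sequence[float], offset: int = 0) -> List[float]:
    """Add `other` onto `base`, starting `offset` samples into `base`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    length = max(len(base), offset + len(other))
    mixed = [0.0] * length
    for i, sample in enumerate(base):
        mixed[i] += sample
    for i, sample in enumerate(other):
        mixed[offset + i] += sample
    return mixed


def training_inputs(
    word: Sequence[float],
    response: Sequence[float],
    operating_sound: Sequence[float],
    offsets: Sequence[int],
) -> List[List[float]]:
    """Mix the response with the operating sound, then overlay the word at each offset."""
    background = overlay(response, operating_sound)
    return [overlay(background, word, offset) for offset in offsets]
```

Each mixed signal returned by `training_inputs` would then be passed to the recognizer, and readings whose likelihood reaches the predetermined value would be registered as described above.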

This embodiment, configured as described above, achieves the same operational effects as the first embodiment. In addition, because the second correction word dictionary 23 is created in consideration of not only the response of the speech recognition system 1 but also the sound emitted from the system in which the speech recognition system 1 is mounted, the user's correction instruction can be recognized more accurately.

A fifth embodiment will be described with reference to FIG. 12. In this embodiment, speech recognition processing using the second correction word dictionary 23 remains available for a predetermined time after the output of the second response ends. As shown in the time chart of FIG. 12, the system 1 continues to use the second correction word dictionary 23 for a predetermined period (T7 to T8) even after outputting the second response SM2.

In this embodiment, the period during which the second correction word dictionary 23 is used is extended. It is therefore preferable to create the second correction word dictionary 23 in consideration of the response SM3 that the system 1 may output by voice during the extended period. A second correction word dictionary 23 suitable for this embodiment can be created simply by extending the method described with reference to FIG. 3 to cover the other response SM3.

This embodiment, configured as described above, achieves the same operational effects as the first embodiment. In addition, even when the user utters a correction instruction word after the second response has ended, that correction instruction word can be recognized.

The present invention is not limited to the embodiments described above. A person skilled in the art can make various additions and modifications within the scope of the present invention.

1: speech recognition system, 2: robot, 3: portable information terminal, 4: vehicle, 11: voice input unit, 14: matching unit, 17: dictionary selection unit, 18: action determination unit, 21: general dictionary, 22: first correction word dictionary, 23, 23A: second correction word dictionary, 24: recognition candidate word dictionary, 25: second general dictionary, 31: system control unit, 35: display unit, 38: voice output unit, 39: actuator

Claims

A speech recognition system that recognizes and responds to a user's voice, comprising:
a voice input unit for inputting the user's voice;
a first correction word dictionary database and a second correction word dictionary database, different from each other, each storing a predetermined correction instruction word that the user may use to correct a speech recognition result;
a general dictionary database that stores ordinary words and phrases;
a speech recognition unit that recognizes the user's voice input via the voice input unit by using one of the first and second correction word dictionary databases and the general dictionary database; and
a response unit that outputs, from a voice output unit, a response including the speech recognition result of the speech recognition unit,
wherein the responses output from the response unit include a first response for notifying the user of the result recognized by the speech recognition unit and a second response output after a predetermined time has elapsed from the first response,
the second correction word dictionary database is created, in order to recognize the user's voice input from the voice input unit while the response unit is outputting a predetermined response, by storing, together with the normal reading, as predetermined modified readings of the predetermined correction instruction word, those readings whose likelihood is equal to or higher than a predetermined value among the speech recognition results obtained by the speech recognition unit when the predetermined correction instruction word uttered by the user is input to the voice input unit superimposed on the second response while the timing is shifted,
the first correction word dictionary database, which differs from the second correction word dictionary database, stores only the normal reading of the predetermined correction instruction word,
in a first correctable period from after the output of the first response until the second response is output, the speech recognition unit recognizes speech using the first correction word dictionary database, and
in a second correctable period, which is the period during which the second response is output, the speech recognition unit recognizes speech using the second correction word dictionary database.

The speech recognition system according to claim 1, wherein, in a period that is neither the first correctable period nor the second correctable period, the speech recognition unit recognizes speech using the general dictionary database.

A method for controlling a speech recognition system that recognizes and responds to a user's voice,
the speech recognition system comprising:
a second correction word dictionary database created, in order to recognize the user's voice input from a voice input unit while a predetermined response is being output from a voice output unit, by storing, together with the normal reading, as predetermined modified readings of a predetermined correction instruction word, those readings whose likelihood is equal to or higher than a predetermined value among the speech recognition results obtained when the predetermined correction instruction word uttered by the user is input to the voice input unit superimposed on the predetermined response while the timing is shifted; and
a first correction word dictionary database that differs from the second correction word dictionary database and stores only the normal reading of the predetermined correction instruction word,
the method comprising executing:
a voice input step of inputting the user's voice from the voice input unit;
a dictionary selection step of selecting one of the first correction word dictionary database, the second correction word dictionary database, and a general dictionary database that stores ordinary words and phrases;
a speech recognition step of recognizing the user's voice by using the selected dictionary database; and
a response step of outputting, from the voice output unit, a response including the speech recognition result,
wherein the first correction word dictionary database is used while the response step is not outputting a response, and the second correction word dictionary database is used while the response step is outputting a response.

A computer program for causing a computer to function as a speech recognition system, the computer program causing the computer to realize:
a second correction word dictionary database created, in order to recognize the user's voice input from a voice input unit while a predetermined response is being output from a voice output unit, by storing, together with the normal reading, as predetermined modified readings of a predetermined correction instruction word, those readings whose likelihood is equal to or higher than a predetermined value among the speech recognition results obtained when the predetermined correction instruction word uttered by the user is input to the voice input unit superimposed on the predetermined response while the timing is shifted; and
a first correction word dictionary database that differs from the second correction word dictionary database and stores only the normal reading of the predetermined correction instruction word,
and to execute:
a voice input step of inputting the user's voice from the voice input unit;
a dictionary selection step of selecting one of the first correction word dictionary database, the second correction word dictionary database, and a general dictionary database that stores ordinary words and phrases;
a speech recognition step of recognizing the user's voice by using the selected dictionary database; and
a response step of outputting, from the voice output unit, a response including the speech recognition result,
wherein the computer program causes the first correction word dictionary database to be used while the response step is not outputting a response, and the second correction word dictionary database to be used while the response step is outputting a response.