JP2008298851A

JP2008298851A - Voice input processing apparatus and voice input processing method

Info

Publication number: JP2008298851A
Application number: JP2007141976A
Authority: JP
Inventors: Osamu Iwata; 收岩田
Original assignee: Denso Ten Ltd
Current assignee: Denso Ten Ltd
Priority date: 2007-05-29
Filing date: 2007-05-29
Publication date: 2008-12-11

Abstract

PROBLEM TO BE SOLVED: To determine, from voice data, whom or what a user's utterance is addressed to.

SOLUTION: A voice input processing unit 2 analyzes voice data collected by a microphone 31 with an analysis processing unit 12a and extracts waveforms that each correspond to a single phoneme. A phoneme matching unit 12b matches each of these waveforms against both a command phoneme dictionary 14, which stores the phonemes that occur when the user performs voice input, and a conversation phoneme dictionary 15, which stores the phonemes that occur when the user converses with fellow passengers. A command determination unit 12c compares the degree of fit to the phonemes in the two dictionaries and thereby determines whether the voice data is a command spoken to an in-vehicle device 1.

COPYRIGHT: (C)2009, JPO&INPIT

Description

The present invention relates to a voice input processing device and a voice input processing method that perform input processing using speech recognition, and in particular to a voice input processing device and a voice input processing method that determine, in an in-vehicle system, whether a user's utterance is a command input addressed to the system.

In recent years, various schemes have been devised for recognizing a user's voice. If the user's voice can be recognized, the user can operate various devices by voice. For in-vehicle devices in particular, manual operation by the driver may interfere with driving, so practical voice operation technology is eagerly awaited.

In conventional voice input, acquisition of voice data and speech recognition start when the user operates a voice input switch that signals the start of voice input. In recent years, however, there has been demand for the system itself to determine whether what the user said is a command, without using a voice input switch.

For example, Patent Document 1 discloses a technique that measures the silent intervals between utterances and, based on their length, determines whether an utterance is an operation command to the system or part of a conversation with another person.

Patent Document 1: JP 2003-308079 A

However, the conventional technique described above cannot determine the utterance target unless there is silence between utterances. Consequently, if, for example, a conversation between passengers overlaps a voice input to the system, the voice input can no longer be recognized.

In addition, vehicles are generally equipped with an in-vehicle audio device, so the case where its audio output is superimposed on the voice input must also be considered. In that case the probability that a silent interval exists is low.

That is, the conventional technique cannot adequately determine whether the target of the user's utterance is the system, and the voice input switch therefore cannot be eliminated.

It has therefore been an important goal to realize a voice input processing device that determines the utterance target more accurately so that the voice input switch can be eliminated, that makes voice input easier to operate, and that contributes to safe driving by letting the driver concentrate on driving.

The present invention has been made to solve the above problems in the prior art, and its object is to provide a voice input processing device and a voice input processing method that can determine the target of a user's utterance with high accuracy.

To solve the above problems and achieve the object, the voice input processing device and voice input processing method according to the present invention provide first phoneme dictionary storage means for storing phonemes that occur in speech when a user performs voice input, and second phoneme dictionary storage means for storing phonemes that can occur other than in voice input. Input phonemes obtained by analyzing input voice data are matched against both the first phoneme dictionary and the second phoneme dictionary to determine whether the input voice data is voice input by the user.

According to the present invention, the voice input processing device and voice input processing method provide first phoneme dictionary storage means for storing phonemes that occur in speech when a user performs voice input, and second phoneme dictionary storage means for storing phonemes that can occur other than in voice input, and match input phonemes obtained by analyzing input voice data against both dictionaries to determine whether the input voice data is voice input by the user. The invention thereby yields a voice input processing device and a voice input processing method that can determine the target of a user's utterance with high accuracy.

Preferred embodiments of the voice input processing device and voice input processing method according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 is a schematic configuration diagram of an in-vehicle device 1 that is an embodiment of the present invention. As shown in the figure, the in-vehicle device 1 includes a voice input processing unit 2, an input/output processing unit 30, a microphone 31, a switch 32, a touch panel display 33, a speaker 34, an audio unit 41, and a navigation unit 42.

The switch 32 is input receiving means that accepts manual operation by the user. The touch panel display 33 is input/output means that integrates a display for display output with a touch panel that accepts manual operation by the user. The speaker 34 is output means that outputs audio to the user.

The audio unit 41 receives radio and television broadcasts and plays back music data and video data stored on recording media such as CD, DVD, and HD. The navigation unit 42 uses the position information of the host vehicle and map information to provide guidance on nearby facilities and roads and route guidance to a destination.

Based on input from the various input means, the input/output processing unit 30 controls the operation of the audio unit 41 and the navigation unit 42, and also controls display output on the touch panel display 33 and audio output from the speaker 34.

Furthermore, in the in-vehicle device 1, the voice input processing unit 2 implements voice input using the voice data collected by the microphone 31. Specifically, when the microphone 31 picks up the user's voice, the speech recognition engine 10 inside the voice input processing unit 2 converts the voice data into the best-matching words (text data).

The voice input unit 20, also inside the voice input processing unit 2, then passes this text data to the input/output processing unit 30 as input from the user.

That is, in this embodiment the voice input processing unit 2 has the configuration of the voice input processing device recited in the claims and operates as a voice input processing device that executes the voice input processing method.

The speech recognition engine 10 includes a noise processing unit 11, an acoustic processing unit 12, a word matching unit 13, a command phoneme dictionary 14, a conversation phoneme dictionary 15, and a word dictionary 16.

The noise processing unit 11 removes noise from the cabin voice data acquired by the microphone 31 and outputs the result to the acoustic processing unit 12.

The acoustic processing unit 12 refers to phoneme dictionaries to convert voice data into text data. The main feature of the present invention is that a plurality of phoneme dictionaries are provided for the acoustic processing unit 12 to refer to, and whether the voice data is voice input, that is, a command the user spoke to the in-vehicle device 1, is determined according to which dictionary's registered phonemes the data fits better.

In this embodiment, the acoustic processing unit 12 refers to two phoneme dictionaries, the command phoneme dictionary 14 and the conversation phoneme dictionary 15, which are data recorded in storage means. The difference between the command phoneme dictionary and the conversation phoneme dictionary is described with reference to FIGS. 2 and 3. Any recording means, such as a memory, can be used as the storage means that stores these dictionaries.

A phoneme dictionary associates the waveform shapes of voice data with phonemes. It holds waveform data for each phoneme, for example "a", "i", "u", "e", "o", "y", "w", and "n".

As shown in FIG. 2, the command phoneme dictionary 14 holds the phoneme waveforms produced when the user speaks a command to operate the in-vehicle device 1, and the conversation phoneme dictionary 15 holds the phoneme waveforms produced when the user converses with a fellow passenger. Both are stored in the storage means.

This reflects the fact that the user's tone of voice differs between operating the in-vehicle device 1 by voice and talking with a fellow passenger. The command phoneme dictionary 14 and the conversation phoneme dictionary 15 therefore each hold their own waveform data for every phoneme.

For example, FIG. 3 shows the phoneme "a" as spoken in a command and the phoneme "a" as spoken in conversation; the two sets of waveform data differ in position and shape.

Thus, the device holds the command phoneme dictionary 14, which contains the waveform of each phoneme as spoken in a command, and the conversation phoneme dictionary 15, which contains the waveform of each phoneme as spoken in conversation. By comparing which of the two dictionaries the voice data acquired in the cabin fits better, it can determine whether the target of the utterance is the in-vehicle device 1.
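The two-dictionary comparison can be illustrated with a minimal Python sketch. The description does not say how waveform data is represented or how fit is measured, so the short feature vectors, the Euclidean distance, and the names COMMAND_DICT, CONVERSATION_DICT, best_match, and classify_phoneme are all assumptions made for illustration.

```python
import math

# Hypothetical templates: one short feature vector per phoneme and dictionary.
# The real dictionaries store waveform data in an unspecified form.
COMMAND_DICT = {"a": (1.0, 0.2), "n": (0.1, 0.9)}
CONVERSATION_DICT = {"a": (0.7, 0.5), "n": (0.4, 0.6)}


def best_match(features, dictionary):
    """Return (phoneme, distance) of the closest template in one dictionary."""
    return min(
        ((phoneme, math.dist(features, template))
         for phoneme, template in dictionary.items()),
        key=lambda item: item[1],
    )


def classify_phoneme(features):
    """Return (label, phoneme) for whichever dictionary fits the input better."""
    cmd_phoneme, cmd_distance = best_match(features, COMMAND_DICT)
    conv_phoneme, conv_distance = best_match(features, CONVERSATION_DICT)
    # Assumed tie rule: equal distances count as "command".
    if cmd_distance <= conv_distance:
        return "command", cmd_phoneme
    return "conversation", conv_phoneme
```

Under these assumptions, an input near the command template for "a" is labeled a command phoneme, and an input near the conversation template for "n" is labeled a conversation phoneme.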

Specifically, the acoustic processing unit 12 includes an analysis processing unit 12a, a phoneme matching unit 12b, and a command determination unit 12c.

The analysis processing unit 12a analyzes the voice data from which the noise processing unit 11 has removed noise and decomposes it into waveform data corresponding to phonemes. For each input phoneme obtained by the analysis processing unit 12a (that is, waveform data corresponding to a single phoneme), the phoneme matching unit 12b refers to the command phoneme dictionary 14 and the conversation phoneme dictionary 15 to identify the phoneme.

Through this matching by the phoneme matching unit 12b, the waveform data is converted into a collection of phonemes.

The command determination unit 12c determines whether the target of the utterance is the in-vehicle device 1, that is, whether the voice data is a command, according to which dictionary's registered phonemes the phonemes appearing in the waveform data fit better.

When the voice data is determined to be a command, the word matching unit 13 converts the collection of phonemes into text data and outputs it. This conversion is performed with reference to the word dictionary 16, a dictionary that stores text data in association with collections of phonemes.
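As a rough illustration of this step, the word dictionary can be pictured as a lookup from phoneme collections to text that is consulted only for utterances judged to be commands. The sketch below assumes exact-match lookup and invented entries; WORD_DICT and phonemes_to_text are hypothetical names, and an actual recognizer would likely tolerate imperfect matches.

```python
# Hypothetical word dictionary: text data keyed by a tuple of phonemes.
WORD_DICT = {
    ("n", "o", "z", "a", "w", "a"): "Nozawa",
    ("o", "n", "s", "e", "n"): "onsen",
}


def phonemes_to_text(phonemes, is_command):
    """Return the text for a phoneme collection.

    Returns None when the utterance was not judged to be a command or when
    the phonemes are not in the dictionary.
    """
    if not is_command:
        return None  # conversation is never forwarded as input
    return WORD_DICT.get(tuple(phonemes))
```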

The voice input unit 20 inputs the text data output by the speech recognition engine 10 into the input/output processing unit 30. It also provides the interface through which the user performs voice input. Specifically, the voice input unit 20 includes a character display unit 21 and a dialogue processing unit 22.

The character display unit 21 displays a character such as a robot on the touch panel display 33 to prompt the user to perform voice input as if giving orders to the robot. When the user speaks in the tone of someone commanding a robot, the voice takes on characteristics different from those of conversation with a fellow passenger (a person).

That is, having the user perform voice input in the tone used to command something that is not a person nudges the phoneme waveforms during voice input toward the phonemes in the command phoneme dictionary.

Accordingly, when recording phonemes to create the command phoneme dictionary 14, it is desirable not to have the user simply read text aloud but to prepare a device that displays the character, as in the real environment, and record with it.

Similarly, in the dialogue processing that returns audio output in response to the user's voice input, the dialogue processing unit 22 makes the audio output to the user a robot-like response (at least a response with characteristics different from human conversation). This nudges the user's speech toward a mechanical waveform (the waveforms registered in the command phoneme dictionary) and sharpens the difference from the waveforms produced in conversation with a fellow passenger.

In this case, therefore, the conversation phoneme dictionary 15 registers phonemes generated from fluent speech, and the command phoneme dictionary 14 registers phonemes generated from the clear speech produced when the system requests input.

Next, determination of the utterance target based on phoneme matching is described further with a concrete example. Suppose the words "Nozawa Onsen" are spoken in the vehicle. The example determines whether "Nozawa Onsen" was spoken as a command to the in-vehicle device 1, such as specifying a destination for the navigation system, or came up in conversation with a fellow passenger, as in "I'd like to go to Nozawa Onsen. When would be a good time?"

As shown in FIG. 4, the phoneme matching unit 12b searches the command phoneme dictionary 14 and the conversation phoneme dictionary 15 for the phoneme that best fits the first input phoneme (waveform data corresponding to a single phoneme).

In the example in the figure, "n" in the command phoneme dictionary 14 fits best, followed by "n" in the conversation phoneme dictionary 15. Here, "n" from the command phoneme dictionary 14, which has the highest degree of fit, is adopted, and matching proceeds to the next input phoneme.

For the second input phoneme, "o" from the conversation phoneme dictionary 15 is adopted out of the two dictionaries. Likewise, "z" from the command phoneme dictionary 14 is adopted for the third phoneme, "a" from the command phoneme dictionary 14 for the fourth, "w" from the conversation phoneme dictionary 15 for the fifth, and "a" from the command phoneme dictionary 14 for the sixth.

In this way, the best-fitting phoneme is found for every phoneme of the input voice data, and the share of adopted phonemes coming from each dictionary is calculated. In the example of FIG. 4, of the 18 phonemes, 12 were adopted from the command phoneme dictionary 14 and 6 from the conversation phoneme dictionary 15. Because the share adopted from the command phoneme dictionary 14 is higher, the input voice is determined to be a command.

If the shares are equal or differ only slightly, the device may ask the user to repeat the utterance or decide based on the target of the preceding utterance.
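The share-based decision, including the case where the shares are too close to call, might look like the following sketch. The description gives no numeric definition of a "small" difference, so the margin parameter and its default value are hypothetical, as are the function name judge_utterance and the label strings.

```python
from collections import Counter


def judge_utterance(labels, margin=0.1):
    """Decide the utterance target from per-phoneme labels.

    labels: iterable of "command" / "conversation", one per input phoneme.
    margin: hypothetical tolerance around an even split; inside it the
            result is "undecided" and a fallback (asking the user again,
            or using the preceding utterance's target) would apply.
    """
    counts = Counter(labels)
    total = counts["command"] + counts["conversation"]
    if total == 0:
        return "undecided"
    command_share = counts["command"] / total
    if abs(command_share - 0.5) <= margin:
        return "undecided"
    return "command" if command_share > 0.5 else "conversation"
```

With the FIG. 4 counts (12 command phonemes and 6 conversation phonemes out of 18), the command share is 12/18, about 0.67, which lies outside the assumed margin, so the utterance is judged a command.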

As shown in FIG. 5, the determination can also be weighted by the certainty of each phoneme match. In the example in the figure, the certainty of a match is expressed as the distance between the input phoneme and the dictionary phoneme; the smaller the distance, the better the input matches the dictionary phoneme.

In FIG. 5, for the first input phoneme, the distance to "n" in the command phoneme dictionary 14 is 10 and the distance to "n" in the conversation phoneme dictionary 15 is 40. Because the difference between the two distances is large, the phoneme is judged to be a command phoneme.

For the second input phoneme, the distance to "o" in the command phoneme dictionary 14 is 40 and the distance to "o" in the conversation phoneme dictionary 15 is 45. Because the difference between the two distances is small, the judgment of command phoneme versus conversation phoneme is deferred and made using the distances of the next (third) phoneme.

For the third input phoneme, the distance to "z" in the command phoneme dictionary 14 is 20 and the distance to "z" in the conversation phoneme dictionary 15 is 55. Because the difference between the two distances is large, the phoneme is judged to be a command phoneme.

In this example, the phonemes before and after the deferred second phoneme (the first and third phonemes) were judged to be command phonemes, so the second phoneme is also resolved as a command phoneme.
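The FIG. 5 example can be traced with a short sketch that defers ambiguous phonemes and then resolves them from their neighbors. The distances 10/40, 40/45, and 20/55 come from the description; the gap_threshold value of 10 and the rule that a deferred phoneme is resolved only when both neighbors agree are assumptions, since the text gives no threshold and shows only this one case.

```python
def classify_with_deferral(distances, gap_threshold=10):
    """Label phonemes from (command_distance, conversation_distance) pairs.

    A phoneme whose two distances differ by no more than gap_threshold
    (a hypothetical value) is deferred, then takes the label shared by
    its two neighbors. Unresolved phonemes stay None.
    """
    labels = []
    for cmd_distance, conv_distance in distances:
        if abs(cmd_distance - conv_distance) <= gap_threshold:
            labels.append(None)  # too close to call: defer
        elif cmd_distance < conv_distance:
            labels.append("command")
        else:
            labels.append("conversation")

    for i, label in enumerate(labels):
        if label is None:
            before = labels[i - 1] if i > 0 else None
            after = labels[i + 1] if i + 1 < len(labels) else None
            if before is not None and before == after:
                labels[i] = before
    return labels


# Distances for the first three phonemes "n", "o", "z" in FIG. 5.
fig5 = [(10, 40), (40, 45), (20, 55)]
print(classify_with_deferral(fig5))
```

For the FIG. 5 values, the second phoneme is deferred (its difference is 5) and then resolved as a command phoneme because the first and third phonemes are both command phonemes.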

Furthermore, as shown in FIG. 6, the judgment can also be based on the cumulative difference in distance between the input phonemes and the dictionary phonemes. In the process shown in the figure, the first input phoneme is selected and judged as a command phoneme or a conversation phoneme. If the difference |d(x)| between the distances to the two dictionary phonemes is at or below a threshold L1, it is considered difficult to judge whether the phoneme is a command phoneme or a conversation phoneme (step S101, Yes).

If the preceding phoneme p(x-1) is not awaiting judgment (step S102, No), the phoneme p(x) is assigned the same class as the preceding phoneme p(x-1), and x is reset to 0 (step S103).

If, on the other hand, the preceding phoneme p(x-1) is awaiting judgment (step S102, Yes), the cumulative total Sd of the distance differences over the hard-to-judge phonemes is calculated. If the absolute value of the cumulative distance difference Sd exceeds a threshold L2 (step S105, Yes), the phoneme is assigned to whichever of command phoneme or conversation phoneme is closer (step S106). This is because a growing cumulative distance difference Sd is taken to mean that more of the phonemes have matched the same class.

If the absolute value of the cumulative distance difference Sd is at or below the threshold L2 (step S105, No), the judgment is deferred until the next phoneme has been processed (step S107).

If, in step S101, the difference |d(x)| between the distances to the two dictionary phonemes is greater than the threshold L1 (step S101, No), the process checks whether the preceding phoneme p(x-1) is awaiting judgment (step S108) and, if so, assigns the waiting preceding phoneme to the same class as the current phoneme (step S109).

The current phoneme is then assigned to whichever of command phoneme or conversation phoneme is closer (step S110), and x is reset to 0.

After the current phoneme has been judged or deferred in step S103, S106, S107, or S110, step S111 checks whether there is a next input phoneme. If there is (step S111, Yes), the process repeats from step S101 with the next input phoneme as the target.

Once all input phonemes have been judged (step S111, No), the process determines whether the input voice as a whole is a command input spoken to the in-vehicle device 1 from the ratio of the results, that is, from whether more phonemes were classified as close to command phonemes (phonemes in the command phoneme dictionary 14) or to conversation phonemes (phonemes in the conversation phoneme dictionary 15) (step S112), and then ends. If the ratios are equal or close, the determination can be based on the respective distances.
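One possible reading of the FIG. 6 flow, steps S101 to S110, is sketched below. Several points are left open by the description and are filled in here as assumptions: d(x) is taken as the conversation distance minus the command distance, Sd is summed over the current run of waiting phonemes, an ambiguous phoneme with no predecessor is treated as waiting, step S106 resolves the whole waiting run by the sign of Sd, and the threshold values l1 and l2 are invented. The final ratio test of step S112 is left to the caller.

```python
def classify_sequence(distances, l1=10, l2=15):
    """Per-phoneme labels following one reading of FIG. 6 (S101 to S110).

    distances: list of (command_distance, conversation_distance) pairs.
    l1, l2: hypothetical values for the thresholds L1 and L2.
    Returns "command" / "conversation" labels; phonemes still waiting at
    the end of the input stay None.
    """
    labels = []
    pending = []  # indices of phonemes awaiting judgment
    sd = 0        # cumulative signed distance difference over the pending run
    for i, (cmd_distance, conv_distance) in enumerate(distances):
        d = conv_distance - cmd_distance  # assumed sign: > 0 is nearer command
        nearer = "command" if d > 0 else "conversation"
        if abs(d) <= l1:                      # S101 Yes: hard to judge
            if labels and not pending:        # S102 No: predecessor is decided
                labels.append(labels[-1])     # S103: copy predecessor's class
            else:                             # S102 Yes, or no predecessor
                pending.append(i)
                sd += d
                labels.append(None)
                if abs(sd) > l2:              # S105 Yes
                    label = "command" if sd > 0 else "conversation"
                    for j in pending:         # S106 (assumed: whole run)
                        labels[j] = label
                    pending, sd = [], 0
                # S105 No -> S107: keep waiting for the next phoneme
        else:                                 # S101 No: clear difference
            for j in pending:                 # S108/S109: resolve waiting ones
                labels[j] = nearer
            pending, sd = [], 0
            labels.append(nearer)             # S110
    return labels
```

Under these assumptions, three consecutive ambiguous phonemes with differences 5, 8, and 7 accumulate to Sd = 20, which exceeds the assumed L2 of 15, so all three are resolved as command phonemes at the third one.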

As described above, the voice input processing unit 2 according to this embodiment extracts waveforms corresponding to single phonemes from voice data and, when identifying each phoneme by comparison with a phoneme dictionary, refers to both the command phoneme dictionary 14, which holds data on the phonemes that occur when the user performs voice input, and the conversation phoneme dictionary 15, which holds data on the phonemes that occur when the user converses with a fellow passenger, both stored in the storage means. By comparing the degree of fit to the phonemes in the two dictionaries, it can determine with high accuracy whether the voice data is voice input spoken to the in-vehicle device 1.

Because the voice input processing unit 2 determines the utterance target without using silent intervals in the voice data, it can identify voice input to the in-vehicle device 1 even when a conversation with a fellow passenger overlaps the voice input or when the audio unit or another source is producing sound, which improves recognition accuracy.

As a result, neither a user operation to tell the in-vehicle device 1 that voice input is starting nor the switch itself is needed, and command input can be picked out automatically from all the speech in the vehicle. This improves operability and contributes to safe driving.

Fellow passengers can continue their conversation even while the driver is operating the in-vehicle device 1 by voice, and the driver can likewise operate the in-vehicle device 1 by voice even while passengers are talking.

In addition, combining control of the displayed character with control of the dialogue audio nudges the user's voice input toward the command phoneme dictionary, which can further improve recognition accuracy.

The configuration and operation shown in this embodiment are only an example and do not limit the present invention. The present invention can be modified as appropriate within the scope of the technical idea recited in the claims.

For example, this embodiment described a case with the command phoneme dictionary 14 and the conversation phoneme dictionary 15, in order to distinguish voice operation of the in-vehicle device 1 from conversation with fellow passengers. However, any phoneme dictionary can be created for whatever is to be distinguished, such as a noise phoneme dictionary for noise arising in the vehicle or a music phoneme dictionary for sound produced during music playback.

The present invention can also be implemented in a configuration that identifies voice input by matching against three or more phoneme dictionaries, or that switches the phoneme dictionaries used for matching according to the situation.
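Extending the comparison to three or more dictionaries, with the set of active dictionaries switched by situation, could be sketched as follows. The dictionary names, the idea of passing precomputed best distances per dictionary, and the classify_multi function are assumptions for illustration only.

```python
def classify_multi(distance_by_dictionary, active=None):
    """Pick the dictionary whose best match is closest to the input phoneme.

    distance_by_dictionary: mapping from a dictionary name to the distance
        of its best-matching phoneme (smaller means a better fit).
    active: optional set of dictionary names to consider, standing in for
        switching dictionaries by situation; None means all of them.
    Raises ValueError if no dictionary is left to compare.
    """
    candidates = {
        name: distance
        for name, distance in distance_by_dictionary.items()
        if active is None or name in active
    }
    return min(candidates, key=candidates.get)


# Hypothetical distances for one input phoneme while music is playing.
sample = {"command": 12, "conversation": 30, "music": 8}
print(classify_multi(sample))                                       # music
print(classify_multi(sample, active={"command", "conversation"}))   # command
```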

As described above, the voice input processing device and voice input processing method according to the present invention are useful for voice input technology and are particularly suited to automatically determining the utterance target.

An explanatory diagram illustrating the schematic configuration of an in-vehicle device that is an embodiment of the present invention.
An explanatory diagram illustrating the command phoneme dictionary and the conversation phoneme dictionary shown in FIG. 1.
An explanatory diagram illustrating the difference in phonemes between command input and conversation.
An explanatory diagram illustrating collation with the phoneme dictionaries and determination of the utterance target.
An explanatory diagram illustrating determination of the utterance target using weighting.
A flowchart illustrating determination of the utterance target by calculation of a cumulative distance.

Explanation of symbols

1 In-vehicle device
2 Voice input processing unit
10 Voice recognition engine
11 Noise processing unit
12 Acoustic processing unit
12a Analysis processing unit
12b Phoneme collation unit
12c Command determination unit
13 Word collation unit
14 Command phoneme dictionary
15 Conversation phoneme dictionary
16 Word dictionary
20 Voice input unit
21 Character display unit
22 Dialogue processing unit
30 Input/output processing unit
31 Microphone
32 Switch
33 Touch panel display
34 Speaker
41 Audio unit
42 Navigation unit

Claims

A voice input processing device comprising:
first phoneme dictionary storage means for storing phonemes that occur in speech when a user performs voice input;
second phoneme dictionary storage means for storing phonemes that can occur other than in the voice input;
collating means for collating input phonemes, obtained by analyzing input voice data, with each of the first phoneme dictionary and the second phoneme dictionary stored in the storage means; and
command determination means for determining, based on a collation result of the collating means, whether or not the input voice data is a voice input by the user.

A voice input processing device further comprising voice input means that constantly collects the input voice data from the vehicle interior and, when the command determination means determines that the input voice data is a voice input, identifies the input word from the phoneme collation result and performs input processing on the word as a command input to an in-vehicle device.

The voice input processing device according to claim 1 or 2, wherein the second phoneme dictionary includes phonemes that occur in speech when the user converses with another person.

The voice input processing device according to any one of claims 1 to 3, further comprising command voice inducing means for inducing, in display output and/or voice output to the user, the user to speak using phonemes included in the first phoneme dictionary when the user performs voice input.

A voice input processing method comprising:
an analysis processing step of analyzing input voice data to obtain input phonemes;
a collation step of collating the input phonemes with each of a first phoneme dictionary that stores phonemes that occur in speech when a user performs voice input and a second phoneme dictionary that stores phonemes that can occur other than in the voice input; and
a command determination step of determining, based on a collation result of the collation step, whether or not the input voice data is a voice input by the user.