JP3582069B2

JP3582069B2 - Voice interactive navigation device

Info

Publication number: JP3582069B2
Application number: JP18498694A
Authority: JP
Inventors: 八州男香川; 敬嗣金野
Original assignee: Mazda Motor Corp
Current assignee: Mazda Motor Corp
Priority date: 1994-08-05
Filing date: 1994-08-05
Publication date: 2004-10-27
Anticipated expiration: 2019-10-27
Also published as: JPH0850698A

Description

【０００１】
【産業上の利用分野】
本発明は音声を用いた対話型のナビゲーション装置に関し、特に、音声入力とポインティング装置からの入力とを併用した音声対話型ナビゲーション装置に関する。
【０００２】
【従来の技術】
現在普及しつつあるナビゲーションシステムでは、タッチパネルやリモコンスイッチ等の操作スイッチとメニュー表示を用いて情報検索あるいは機器操作を行なうようになっている。例えば、特開平５−７２９７３号ではタッチパネルを用いている。このため、このような従来のナビゲーションシステムでは、所望の情報を検索するまでの操作が複雑で、運転中の運転者にこのような複雑な操作を強いることは好ましくない。
【０００３】
一方、音声処理技術の進歩と共に、ナビゲーションシステムに音声認識を導入することが提案されている。例えば、特開平５−９９６７９号や特開平３−２１８１７号等。前述のタッチパネルなどは運転者の手動作を必要とするのに対し、音声入力は手動作を不要とし、その点で操作性の向上が期待される。
【０００４】
【発明が解決しようとする課題】
しかしながら、ナビゲーションシステムに適用され得る音声認識技術は孤立単語認識処理技術により行なっているために、例えば、「駐車場！」等と区切って発話する必要があるなど、ユーザの発話は制約されている。また、システムが会話の複雑な構造を理解することは困難であり、そのために、目的の情報を得るためには何段階にも分けて発話する必要があり、煩わしい。さらに、システムの持っている会話認識の手法は必ずしも人間の会話の順序とは異なるために、ユーザはどのような順序で、何を喋ってよいのか分からず、このようなシステムは結局運転者にとって使いにくいものとなっている。
【０００５】
運転者の自然な会話を認識するためには、連続音声認識技術の適用が考えられるが、高度の技術と高性能なハードウエアが必要であり、コストが高く一般用途には現実的でない。
【０００６】
【課題を解決するための手段】
そこで、本発明の目的は、音声入力と操作装置からの入力とを併用し、両入力を互いに補完させ合うことにより、低いコストで高い認識精度を得ることのできる音声対話型ナビゲーション装置を提案する。上記課題を達成するための本発明の構成は、地図を可視出力する表示手段と、少なくとも前記地図を構成する複数のオブジェクト毎の表示図形情報と話題情報とを含む地図情報を登録する地図データベースと、操作者により行なわれた操作を介して前記可視出力された地図上のオブジェクトの選択を受け付ける前記表示手段上に設けられたオブジェクト選択受付手段と、前記選択されたオブジェクト情報に対応する前記話題情報を前記地図データベースから抽出する第１の抽出手段と、前記抽出された話題情報に応じた音声辞書を利用して、前記操作者から入力される音声情報から前記音声辞書に登録されたキーワードを複数抽出する第２の抽出手段と、前記抽出されたキーワードから、前記操作者の意図を表すキーワード列を抽出する第３の抽出手段と、前記地図データベースから前記抽出された話題情報と前記抽出されたキーワード列とに対応する前記地図情報を検索する検索手段と、を備え、前記表示手段は前記検索された地図情報を表示することを特徴とする。
【０００７】
本発明の他の目的は、前記オブジェクト選択受付手段よりオブジェクトの選択を受け付けた後、所定時間の計時を行う計時手段を更に備え、前記第１の抽出手段は、前記計時手段により前記所定時間が計時される以前に入力された音声情報から前記抽出を行うことを特徴とする。
【０００８】
本発明の他の目的は、前記第２の抽出手段は、前記抽出された複数のキーワードを所定の存在確率に応じて並び替えるソート手段と、前記並び替えられた順に、前記各キーワードの品詞に基づく連続性を判定する判定手段とを備え、前記判定手段において連続性を有すると判定されたキーワードを、前記キーワード列として出力することを特徴とする。
【０００９】
【実施例】
以下、本発明の実施例について添付図面を参照しながら詳細に説明する。
図１は実施例の経路誘導装置のシステム構成を示す。このシステムは、ＧＰＳや車速センサからの信号を入力して自車位置を検出すると、その周辺の地図情報を表示し、合わせて音声認識により運転者の意図を判断して、最適な地図情報を表示出力するというものである。このために、図１のシステムは、運転者の音声を入力するためのマイク１と、このマイク１から入力された音声信号を処理して単語を抽出する音声認識処理部３と、地図や推奨経路などを表示する表示装置６と、この表示装置６の表示画面の上に設けられ指などによって目的地を入力するためのタッチパネル７と、大量の地図情報を含む地図データベース５と、ＧＰＳや車速センサからの信号を入力して自車位置を検出すると共に、音声認識処理部３が抽出した単語からキーワードを抽出して、運転者の位置を推論するナビゲーション演算処理部４と、渋滞情報や道路工事情報などの一過性の情報を入力するための交通情報受信装置２と、音声を発して運転者に注意等を喚起する音声出力装置８などを具備する。
【００１０】
地図データベース５には、例えば、交差点や大きな建物、有名な地点に対してノードが設定されており、地図情報としては、ノードの位置、１つのノードにリンクされている他のノードの識別子、ノード間の距離等を含む。図２は、表示装置６上における表示画面の例を示し、この画面に基づく運転者の操作を説明することにより、実施例の原理的な動作を説明する。図２において、１００は自車の現在位置を表し、ナビゲーション演算処理部４がＧＰＳ信号や車速センサ等に基づいて演算したものである。また、１０、１１、１２は自車位置１００の周辺にある３箇所の駐車場を示す。今、運転者は、これら３箇所の駐車場の中で最も安い駐車場を探しているものとする。図３は、図２に示された表示で、運転者（図３において"Ｕ"で示す）と交わす会話である。
【００１１】
先ず、図２のように、３つの駐車場のマーカが表示されている状態で、運転者は駐車場の例えばマーカ１１にタッチしたとする。タッチした位置はタッチパネル装置７を介して処理部４に入力されるであろう。そして運転者はマイク１を介して「ここの値段はいくら？」とシステムに聞く。すると、システムは、駐車場の１１の料金を、例えば「３０分で３００円です」と答える。処理部４は、タッチパネル７を介して入力された運転者のタッチ位置に対応する地図情報が「駐車場」であることは容易に判断することができる。そして、「ここの値段はいくら？」と言う入力会話の中から「値段」というキーワードを抽出する。そして、「駐車場」と言う操作情報と「値段」と言うキーワードとを連結して、処理部４は、運転者が要求しているのはタッチした「駐車場１１の値段である」と推論する。通常の言語認識では、「ここ」という単語を解析し、前後の会話の中から「ここ」は「どこどこの駐車場であろう」と推論する。しかしながらこのような従来の自然言語認識では高度の情報処理を有するためにシステム規模が膨大になる。ところがこの実施例では、タッチ入力された「操作情報」と会話から抽出された「キーワード」とは互いに連関している筈のものであると前提とし「操作情報」と「キーワード」とから演繹される推論を出力するものである。この実施例の操作（マン−マシーン・インタフェース）においては、表示された地図オブジェクトをポインティングするという操作が含まれているために、暗黙のうちにユーザの発話をガイドすることになり、始めてのユーザでも容易に使うことができる。例えば、駐車場アイコンを操作している運転者は、駐車場情報の何らかの属性データを検索しようという意識を暗に持っているため、何を喋ってよいか迷わないからである。換言すれば、処理部４は、このような意識を暗黙の前提とするために、属性データの検索が容易になるのである。
【００１２】
さらに、ポインティング操作と抽出されたキーワードをキーとして認識辞書を動的に切り換えることにより、認識すべき言語の探索範囲を大幅に限定でき、低コストのシステムでも認識率が大幅に向上することができる。
図２，図３の例で、さらに運転者が「一番安い駐車場は？」と発声したとする。すると、処理部４は、「一番」と「安い」と「駐車場」というキーワードを会話から抽出して、運転者が「一番安い駐車場」を探している推論する。もしもマーカ１２の駐車場が一番安いのであれば、マーカ１２がブリンクされる。
【００１３】
さらに、実施例の装置の構成および動作を詳細に説明する。
図４は、音声認識処理部３における処理を示す。マイク１からは連続発話された音声信号が入力される。処理部３は、先ず、この音声信号に対して、音声区間検出の前処理を行ない、次にＬＰＣ−メルケプストラムによる音響特徴分析を行ない、さらに特徴量をベクトル量子化する。抽出されたベクトルは、連続ＤＰマッチングの手法によりテンプレートに登録されているキーワードと照合される。キーワードが抽出されると、タッチパネル操作情報と加味されて意図の理解が行なわれる。この理解は後述するように格構造の分析に基づいてなされる。
【００１４】
図５は、タブレットから入力された操作情報とマイク１から入力された音声情報とに基づいて、運転者の意図が理解され出力されるまでを表すブロック図である。図６は、実施例のナビゲーションシステムの全体制御手順を示すフローチャートである。
図６において、タブレットにおいて運転者が操作するのを待つ。換言すれば、図６の制御手順はタブレット操作があって始めて起動される。タブレット操作があると、ステップＳ４でその操作された位置座標（操作情報）が検出される。ステップＳ６では、地図データベース５から、操作位置に対応するオブジェクトを検索する。
【００１５】
図７は地図データベース５の構造を説明する。地図データベースは、多くのオブジェクトレコードからなる。オブジェクトとは、交差点、駐車場、大きな建物、道路等の地図データである。１つのオブジェクトレコードは、そのオブジェクトに与えられた識別子としての「オブジェクト名」と、そのオブジェクトに与えられた「タッチ領域」（この領域内のアドレスにタッチするとそのオブジェクトが選択されたことを意味する）と、このオブジェクトの表示図形の形状を示す「図形データ」（オブジェクトが駐車場であれば、Ｐの表示図形）と、このオブジェクトに与えられた「話題」（オブジェクトが駐車場であれば、「駐車場」、「駐車料金」、「空き」など）等からなる。これらのデータは、オブジェクト毎に、前もってデータベースに与えられているものである。地図情報には、一時的に外部から交通情報受信機２を介して与えられるもの（以下、「交通情報」と呼ぶ）がある。このような交通情報には、図形として出力すべき情報（例えば、渋滞の程度としての渋滞の長さなど）や、音声として出力すべきもの（駐車場の「料金」や「待ち時間」）がある。この実施例では、これら交通情報には、図形として表示すべき（Ｒとマーク）、音声として出力すべき（Ｓとマーク）の種別のほかに、その交通情報をここのオブジェクトに付属するものとして、受信機から入力する毎に地図データベースに入力してある。
【００１６】
従って、ステップＳ６では、ステップＳ４で入力した座標地と、オブジェクトファイル中の「タッチ領域」とを比較することにより、運転者がどの表示されているオブジェクトにタッチしたかを検出する。ステップＳ８では、検索したオブジェクトに付属している「話題」を抽出する。図２，図３の例では、駐車場のオブジェクトにタッチしたので、そのオブジェクトの話題として、例えば、「駐車場」「料金」「空き」などが抽出されるであろう。図５において、地図データベースから「話題」が抽出されて出力されていることが示されている。
【００１７】
ステップＳ１０ではタイマ（不図示）を起動する。ステップＳ１２では、このタイマがタイムアウト（例えば、５秒）する前に音声入力があったことを確認する。タイマを設けたのは、タブレット操作と音声入力とを関連づけるためである。
音声入力があれば、ステップＳ１４でその音声信号中から「キーワード」を単語毎に抽出する。キーワードの抽出は、テンプレートとの比較によるＤＰマッチングに拠る。図８に、テンプレートの例を示す。テンプレートは、「話題」毎に辞書化され、ベクトル量子化に拠る「コード」と「キーワード」との組み合わせからなる。話題毎に分類したのは、認識対象を限定して探索効率を上げるためである。
【００１８】
ステップＳ１６では、認識されて得たこれらのキーワードを、存在確率の高い順にソートして並べる。図９に、音声信号から抽出されたキーワードが確率順に並べられて示されている。図９の例では、一番高い確率が“一番”に、２番目の確率が“料金”に、３番目の確率が“近い”に、４番目の確率が“最も”に与えられている。尚、これらのキーワードは、音声信号から音声認識処理部３が「そのような単語が発声されたであろう」として認識したものであって、必ずしも発声したものとは限らない。また、存在確率が高い順にならべられたキーワード列は、その順で発声されたとは限らない。
【００１９】
ステップＳ１８では、ステップＳ８で検出された操作情報の「話題」に関連する「意図理解管理テーブル」を検索する。「駐車場」という「話題」に関連する「意図理解管理テーブル」の例を図１０に示す。図１０のテーブル中、Ｓ，Ｃ，Ａ等は「格」を示し、「駐車場」という「話題」に関連する格の例を図１１に示す。
【００２０】
図１１に示した「格−単語テーブル」において、格Ｓは副詞を示し、「一番」や「最も」などがその例である。格Ｃは形容詞を示し、「高い」や「近い」や「安い」などがその例である。格Ａは名詞を示し、「値段」や「距離」などがその例である。格Ｖは動詞を示し、「見せて」や「教えて」などがその例である。
このようにして得た「キーワード列」から、運転者の意図を、「格−単語テーブル」「意図理解管理テーブル」を利用しながら抽出する（ステップＳ２０）。ここで、音声信号から図９のようにえられたキーワード列に基づいて「意図理解管理テーブル」が具体的にどのように利用されるかを説明する。図１０のテーブルの中で、「現在受諾している格構造」と「受諾可能な格」でもって２つの連続するキーワードが形成される。「現在受諾している格構造」が先のキーワードであり、「受諾可能な格」がそれに続く格である。Ｓ列〜Ｖ列中の、×印はそのような組み合わせが存在しない場合を示し、○印はそのような組み合わせが存在する場合を示す。例えば、Ｓ行において、格Ｓに続いて、格Ｓや格Ａや格Ｖが存在することはあり得ないことを示し、格Ｃや格Ｖの存在は許されることを示している。図１１の例では、「一番値段」（ＳＡの組み合わせ）や「一番見せて」（ＳＶの組み合わせ）というキーワード列は存在せず、「一番高い」というキーワード列は存在し得ることを意味する。「出力可」のフラグは、キーワード列が存在しえても、意味出力を行なうには足りないかいなかを示すものである。図１０の例において、ＳＣという列は存在しえても、意味としての出力を行なうことはできないことを示す。また、ＡＶという列（例えば、「値段見せて」）は組み合わせが存在し、「意図」として意味のあるものとして出力可能であるということである。
【００２１】
このようなルールの下で、図９のキーワード列が入力したとする。先ず、最初の最も確率の高い“一番”は格Ｓであるから、図１０のテーブルに従えば、格Ｓに続くことが許される格はＣのみであるので、格Ａである“料金”は刎ねられる。“料金”に続いて確率の高いキーワードは“近い”であり、この格はＣであるので、ＳＣという列は存在を許される。そこで、現在受諾している格構造はＳＣと更新される。このＳＣのキーワード列に続いて受諾可能な格はＶのみであるので、図９の入力列の中で、Ｓ格である“最も”は刎ねられ、次の入力キーワードである“教えて”が検査される。“教えて”の格はＶであり、しかもＳＣＶと言うキーワード列は「出力可」とマークされているので、結局、“一番近い教えて”が意図として出力される。
【００２２】
ステップＳ２２では、ステップＳ２０で出力された「意図」に対応する「出力ルール」を得る。ステップＳ２４では、ステップＳ２０で得られた「意図」に基づいて、同じ「話題」を有するオブジェクトが探索され、探索されたオブジェクトの中から「交通情報」を抽出する。そして、ステップＳ２６では、ステップＳ２２で得た「出力ルール」に基づいてナビゲーション情報を出力する。
【００２３】
以上説明した実施例のナビゲーション装置によれば、
▲１▼：タッチパネルから得られた「操作情報」に従って「話題」（ユーザの関心範囲）を抽出し、この「話題」と、他方、音声入力によって得られたキーワードとを用いて、地図情報を検索するようにしているので、音声認識処理に高度な認識技術を用いなくとも、目的の地図情報を探索することができる。
▲２▼：「話題」（ユーザの関心範囲）に応じたテンプレートファイルから発声されたキーワードを抽出しているので、音声認識処理の効率が上がる。
▲３▼：「話題」に応じた「格−単語テーブル」、「意図理解管理テーブル」を用いているので、意図を推論するときの速度が向上する。
▲４▼：意図理解は、タッチパネル操作があって始めて起動されるようにしているので、無意識のうちの会話によってシステムが誤動作することが防止される。
▲４▼：また、タッチパネル操作から所定時間以内の音声入力のみを受け付けるので、同じく誤動作を防止できる。
【００２４】
本発明はその主旨を逸脱しない範囲で種々変形が可能である。
上記実施例は、主に駐車場を探す場合におけるナビゲーションの例であった。本発明は、例えば、交差点におけるナビゲーションにも適用可能である。
例えば、図１２に示すように、表示装置上に、４方向の交差点が前方にあることを示す表示がなされているとする。運転者が交差点のオブジェクトをタッチすると、システムは、「話題」として「交差点」を認識する。この時点で、「交差点名は？」と音声入力すると、キーワードとして、「交差点」と「名」とが得られ、システムは運転者の意図として、「交差点」という「話題」において、交差点の名前を知りたがっていることを推論する。さらに、「後何メートル？」という入力があれば、この音声信号から「メートル」というキーワードを抽出する。そして、「交差点」という話題と「メートル」というキーワードとから、運転者の意図が、現在地からその交差点までの距離にあると推論する。
【００２５】
図１３の例では、運転者が「品川」というオブジェクトをタッチした場合を示している。この「品川」というオブジェクトには、話題として、「目的地」「経由地」が設定されている。そこで、運転者が「混んでる」と発声すれば、この音声信号の中から、「混ん」をキーワードとして認識する。そして、「目的地」「経由地」という話題と、「混ん」というキーワードとから、システムは品川方向の混雑度を運転者が知りたがっていると推論する。
【００２６】
上記実施例では、操作情報はタッチパネルを介して入力しているが、別途メニュー表示から入力するようにしてもよい。
【００２７】
【発明の効果】
以上説明した本発明の音声対話型ナビゲーション装置は、音声入力と操作装置からの入力とを併用し、両入力を互いに補完させ合うことにより、低いコストで高い認識精度を得ることができる。
【図面の簡単な説明】
【図１】実施例のナビゲーションシステムのブロック図。
【図２】実施例の動作を駐車場を例にして示すときの表示画面の一例を示す図。
【図３】図２の画面に対応した対話の一例を示す図。
【図４】実施例のシステムにおける音声認識所の手順を説明するブロック図。
【図５】実施例における、意図解析の手順を説明する図。
【図６】実施例の制御手順を示すフローチャート。
【図７】実施例に用いられている地図データベースの構造を説明する図。
【図８】実施例に用いられている音声辞書の構成を示す図。
【図９】認識されたキーワード列の例を示す図。
【図１０】意図理解管理テーブルの一例を示す図。
【図１１】格−単語テーブルの一例を示す図。
【図１２】交差点におけるナビゲーションでの表示例を示す図。
【図１３】交差点におけるナビゲーションでの表示例を示す図。
[0001]
[Industrial Field of Application]
The present invention relates to an interactive navigation device using voice, and more particularly to a voice interactive navigation device using both voice input and input from a pointing device.
[0002]
[Prior art]
In navigation systems now coming into widespread use, information retrieval and device operation are performed using operation switches, such as a touch panel or remote-control switches, together with menu displays. For example, JP-A-5-72973 uses a touch panel. In such conventional navigation systems, the operations needed to find the desired information are complicated, and it is undesirable to force a driver to perform such complicated operations while driving.
[0003]
Meanwhile, with the progress of speech processing technology, it has been proposed to introduce speech recognition into navigation systems; examples include JP-A-5-99679 and JP-A-3-21817. Whereas the touch panel and similar devices described above require manual operation by the driver, voice input does not, and in that respect an improvement in operability is expected.
[0004]
[Problems to be solved by the invention]
However, because the speech recognition technology that can be applied to navigation systems relies on isolated-word recognition, the user's utterances are constrained; for example, the user must speak in separated words such as "Parking lot!". In addition, it is difficult for the system to understand the complex structure of a conversation, so the user must speak in several stages to obtain the desired information, which is cumbersome. Furthermore, because the conversation recognition method used by the system does not necessarily follow the order of human conversation, the user does not know what to say or in what order, and such a system ends up being difficult for the driver to use.
[0005]
Continuous speech recognition could be applied in order to recognize a driver's natural conversation, but it requires advanced technology and high-performance hardware, and its high cost makes it impractical for general use.
[0006]
[Means for Solving the Problems]
Accordingly, an object of the present invention is to propose a voice interactive navigation device that uses voice input together with input from an operating device and lets the two inputs complement each other, thereby obtaining high recognition accuracy at low cost. To achieve this object, the configuration of the present invention comprises: display means for visually outputting a map; a map database for registering map information including at least display graphic information and topic information for each of a plurality of objects constituting the map; object selection receiving means, provided on the display means, for receiving a selection of an object on the visually output map through an operation performed by an operator; first extracting means for extracting, from the map database, the topic information corresponding to the selected object information; second extracting means for extracting a plurality of keywords registered in a voice dictionary from voice information input by the operator, using the voice dictionary corresponding to the extracted topic information; third extracting means for extracting, from the extracted keywords, a keyword string representing the operator's intention; and search means for searching the map database for the map information corresponding to the extracted topic information and the extracted keyword string, wherein the display means displays the retrieved map information.
[0007]
In another aspect of the present invention, the device further comprises timing means for measuring a predetermined time after the object selection receiving means receives the selection of the object, and the first extracting means performs the extraction from voice information input before the predetermined time has been measured by the timing means.
[0008]
In another aspect of the present invention, the second extracting means comprises sorting means for rearranging the plurality of extracted keywords according to a predetermined existence probability, and determining means for determining, in the rearranged order, continuity based on the part of speech of each keyword, and the keywords determined by the determining means to have continuity are output as the keyword string.
[0009]
[Embodiment]
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 shows the system configuration of the route guidance device of the embodiment. This system detects the vehicle position from GPS and vehicle speed sensor signals, displays map information for the surrounding area, and in addition determines the driver's intention by speech recognition and displays the most suitable map information. To this end, the system of FIG. 1 comprises: a microphone 1 for inputting the driver's voice; a speech recognition processing unit 3 that processes the voice signal input from the microphone 1 and extracts words; a display device 6 that displays maps, recommended routes, and the like; a touch panel 7 provided on the display screen of the display device 6 for entering a destination with a finger or the like; a map database 5 containing a large amount of map information; a navigation arithmetic processing unit 4 that detects the vehicle position from the GPS and vehicle speed sensor signals, extracts keywords from the words extracted by the speech recognition processing unit 3, and infers the position of the driver; a traffic information receiving device 2 for receiving transient information such as congestion information and road construction information; and a voice output device 8 that emits speech to alert the driver.
[0010]
In the map database 5, nodes are set for, for example, intersections, large buildings, and well-known points. The map information includes the position of each node, the identifiers of the other nodes linked to a node, the distances between nodes, and so on. FIG. 2 shows an example of a display screen on the display device 6; the basic operation of the embodiment is explained by describing the driver's operations on this screen. In FIG. 2, reference numeral 100 denotes the current position of the vehicle, calculated by the navigation arithmetic processing unit 4 from the GPS signal, the vehicle speed sensor, and the like. Reference numerals 10, 11, and 12 denote three parking lots around the vehicle position 100. Assume that the driver is now looking for the cheapest of these three parking lots. FIG. 3 shows the conversation exchanged with the driver (indicated by "U" in FIG. 3) on the display shown in FIG. 2.
[0011]
First, suppose that, with three parking lot markers displayed as in FIG. 2, the driver touches one of them, for example marker 11. The touched position is input to the processing unit 4 via the touch panel device 7. The driver then asks the system via the microphone 1, "What is the price here?" The system answers with the fee of parking lot 11, for example, "300 yen for 30 minutes." The processing unit 4 can easily determine that the map information corresponding to the driver's touch position, input via the touch panel 7, is "parking lot." It then extracts the keyword "price" from the input utterance "What is the price here?" By linking the operation information "parking lot" with the keyword "price," the processing unit 4 infers that what the driver is requesting is "the price of the touched parking lot 11." Ordinary language recognition would analyze the word "here" and infer from the surrounding conversation that "here" refers to such-and-such a parking lot. However, such conventional natural language recognition requires advanced information processing, which makes the system enormous in scale. This embodiment instead presupposes that the touch-input "operation information" and the "keyword" extracted from the conversation must be related to each other, and outputs the inference deduced from the two. Because the operation (man-machine interface) of this embodiment includes pointing at a displayed map object, the user's utterance is implicitly guided, and even a first-time user can use the device easily. For example, a driver operating a parking lot icon implicitly intends to look up some attribute data of the parking lot information and therefore does not hesitate over what to say. In other words, because the processing unit 4 takes this intention as an implicit premise, the search for attribute data becomes easy.
[0012]
Furthermore, by dynamically switching the recognition dictionary using the pointing operation and the extracted keywords as keys, the search range of the language to be recognized can be greatly narrowed, and the recognition rate can be greatly improved even in a low-cost system.
In the example of FIGS. 2 and 3, suppose that the driver further says, "Which is the cheapest parking lot?" The processing unit 4 then extracts the keywords "most" (一番), "cheap," and "parking lot" from the utterance and infers that the driver is looking for "the cheapest parking lot." If the parking lot of marker 12 is the cheapest, marker 12 is made to blink.
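The "cheapest parking lot" query described above can be pictured as a minimum search over the displayed parking lot objects. The following is only an illustrative sketch, not the embodiment's implementation: the `ParkingLot` record, the `fee_per_30min` field, and every fee other than the 300 yen quoted for marker 11 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ParkingLot:
    marker: int         # marker number on the display (10, 11, 12 in FIG. 2)
    fee_per_30min: int  # yen; hypothetical field name


def cheapest(lots):
    """Return the lot with the lowest fee, or None if no lots are displayed."""
    return min(lots, key=lambda lot: lot.fee_per_30min, default=None)


# Fees for markers 10 and 12 are invented for illustration only.
lots = [ParkingLot(10, 400), ParkingLot(11, 300), ParkingLot(12, 200)]
to_blink = cheapest(lots)
print(to_blink.marker)  # the marker that would be made to blink
```

With these assumed fees the search selects marker 12, matching the case described in the text where marker 12 blinks.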
[0013]
Further, the configuration and operation of the device of the embodiment will be described in detail.
FIG. 4 shows the processing in the speech recognition processing unit 3. A continuously uttered voice signal is input from the microphone 1. The processing unit 3 first preprocesses this voice signal to detect the voice sections, then performs acoustic feature analysis using the LPC mel-cepstrum, and then vector-quantizes the feature values. The extracted vectors are matched against the keywords registered in the templates by continuous DP matching. Once keywords have been extracted, they are combined with the touch panel operation information to understand the intention. As described later, this understanding is based on an analysis of the case structure.
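The template matching step can be illustrated with a small dynamic programming (DP) alignment over vector-quantization code sequences. This is a simplified sketch under stated assumptions: it uses plain dynamic time warping on whole, isolated sequences with a 0/1 local cost, whereas the embodiment uses continuous DP matching on continuous speech. The romanized keywords and the code values in `TEMPLATES` are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two code sequences.

    Local cost is 0 when two codes are equal and 1 otherwise (an assumption).
    """
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def best_keyword(observed, templates):
    """Return the keyword whose template is closest to the observed codes."""
    return min(templates, key=lambda kw: dtw_distance(observed, templates[kw]))


# Hypothetical templates: keyword -> vector-quantization code sequence.
TEMPLATES = {"ryokin": [3, 7, 7, 2], "chikai": [5, 1, 4]}
print(best_keyword([3, 7, 2], TEMPLATES))  # prints ryokin
```

The warping lets an utterance that is spoken faster or slower than the template still align with it at zero cost, which is why the three-code input matches the four-code "ryokin" template.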
[0014]
FIG. 5 is a block diagram showing a process until the driver's intention is understood and output based on the operation information input from the tablet and the voice information input from the microphone 1. FIG. 6 is a flowchart illustrating the overall control procedure of the navigation system according to the embodiment.
In FIG. 6, the system first waits for the driver to operate the tablet. In other words, the control procedure of FIG. 6 is started only when a tablet operation occurs. When a tablet operation occurs, the coordinates of the operated position (the operation information) are detected in step S4. In step S6, the map database 5 is searched for the object corresponding to the operated position.
[0015]
FIG. 7 illustrates the structure of the map database 5. The map database consists of many object records. An object is a piece of map data such as an intersection, a parking lot, a large building, or a road. One object record consists of an "object name" serving as the identifier of the object; a "touch area" assigned to the object (touching an address within this area means that the object has been selected); "graphic data" indicating the shape of the object's display graphic (a "P" graphic if the object is a parking lot); "topics" assigned to the object (for a parking lot, "parking lot," "parking fee," "vacancy," and so on); and the like. These data are stored in the database in advance for each object. Some map information is supplied temporarily from outside via the traffic information receiver 2 (hereinafter called "traffic information"). Such traffic information includes information to be output as a graphic (for example, the length of a traffic jam as a measure of its severity) and information to be output as voice (the "fee" or "waiting time" of a parking lot). In this embodiment, each item of traffic information is tagged by type, either to be displayed as a graphic (marked R) or to be output as voice (marked S), and is entered into the map database as belonging to an individual object each time it is received from the receiver.
[0016]
Accordingly, in step S6, the coordinate values input in step S4 are compared with the "touch area" in the object file to detect which displayed object the driver touched. In step S8, the "topics" attached to the object found are extracted. In the example of FIGS. 2 and 3, the driver touched a parking lot object, so topics such as "parking lot," "fee," and "vacancy" are extracted for that object. FIG. 5 shows the "topics" being extracted from the map database and output.
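Steps S4 to S8 amount to a hit test of the touch coordinates against each object's touch area, followed by a lookup of that object's topics. A minimal sketch follows, assuming rectangular touch areas; the `MapObject` class, its field names, and the coordinate values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MapObject:
    name: str                 # "object name" identifier
    touch_area: tuple         # (x_min, y_min, x_max, y_max); rectangle is an assumption
    graphic: str              # display graphic, e.g. "P" for a parking lot
    topics: list = field(default_factory=list)


def find_touched_object(objects, x, y):
    """Step S6: return the first object whose touch area contains (x, y)."""
    for obj in objects:
        x_min, y_min, x_max, y_max = obj.touch_area
        if x_min <= x <= x_max and y_min <= y <= y_max:
            return obj
    return None


def topics_for_touch(objects, x, y):
    """Step S8: return the topics attached to the touched object, if any."""
    obj = find_touched_object(objects, x, y)
    return obj.topics if obj else []


OBJECTS = [
    MapObject("parking_11", (100, 80, 120, 100), "P", ["parking lot", "fee", "vacancy"]),
    MapObject("intersection_1", (200, 40, 230, 70), "+", ["intersection"]),
]
print(topics_for_touch(OBJECTS, 110, 90))
```

A touch outside every touch area yields an empty topic list, in which case no topic dictionary would be selected.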
[0017]
In step S10, a timer (not shown) is started. In step S12, it is checked whether a voice input occurs before this timer times out (for example, after 5 seconds). The timer is provided in order to associate the tablet operation with the voice input.
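The association between the tablet operation and the voice input can be sketched as a simple time-window check. The function name and the use of timestamps in seconds are assumptions; only the 5-second example value comes from the text.

```python
TIMEOUT_SECONDS = 5.0  # example timeout given in the text


def voice_is_associated(touch_time, voice_time, timeout=TIMEOUT_SECONDS):
    """Return True if the voice input started after the touch and before timeout."""
    elapsed = voice_time - touch_time
    return 0.0 <= elapsed < timeout


print(voice_is_associated(10.0, 12.5))  # voice 2.5 s after the touch
print(voice_is_associated(10.0, 15.0))  # voice exactly at the timeout
```

Voice that arrives before any touch, or after the window has closed, is simply not treated as belonging to that operation.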
If there is a voice input, "keywords" are extracted word by word from the voice signal in step S14. Keywords are extracted by DP matching against templates. FIG. 8 shows an example of the templates. The templates are organized into a dictionary for each "topic," and each consists of a combination of a vector-quantized "code" and a "keyword." They are classified by topic in order to limit the recognition targets and thereby raise search efficiency.
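Restricting recognition to the dictionaries of the extracted topics can be sketched as follows. The topic names follow the text, but the romanized keywords, the code sequences, and the `TEMPLATES_BY_TOPIC` layout are hypothetical; FIG. 8 itself is not reproduced here.

```python
# Hypothetical per-topic dictionaries: topic -> {keyword: code sequence}.
TEMPLATES_BY_TOPIC = {
    "parking lot": {"ichiban": [1, 4], "chikai": [5, 1, 4]},
    "fee": {"ryokin": [3, 7, 7, 2], "yasui": [6, 6, 2]},
    "intersection": {"namae": [8, 3], "metoru": [9, 9, 1]},
}


def active_templates(topics):
    """Merge the dictionaries of the given topics into one search space."""
    active = {}
    for topic in topics:
        active.update(TEMPLATES_BY_TOPIC.get(topic, {}))
    return active


vocabulary = active_templates(["parking lot", "fee"])
print(sorted(vocabulary))
```

Only the keywords of the touched object's topics are matched, so intersection-related words such as "metoru" are never candidates while a parking lot is selected.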
[0018]
In step S16, the recognized keywords are sorted in descending order of existence probability. FIG. 9 shows the keywords extracted from the voice signal arranged in order of probability. In the example of FIG. 9, the highest probability is given to "most" (一番, ichiban), the second to "fee" (料金, ryokin), the third to "near" (近い, chikai), and the fourth to "most" (最も, mottomo). Note that these keywords are ones that the speech recognition processing unit 3 recognized from the voice signal as "probably uttered"; they were not necessarily actually uttered. Likewise, the keyword string arranged in order of existence probability was not necessarily uttered in that order.
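The sorting in step S16 can be sketched as a descending sort on the existence probability of each candidate. The romanized keywords correspond to the FIG. 9 example; the numeric probabilities are invented for illustration, since the text gives only the ranking.

```python
def sort_by_existence_probability(candidates):
    """Return the keywords ordered from most to least probable."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [keyword for keyword, _ in ranked]


# (keyword, probability) pairs; the probability values are hypothetical.
candidates = [
    ("chikai", 0.61),   # "near"
    ("ichiban", 0.92),  # "most"
    ("mottomo", 0.55),  # "most"
    ("ryokin", 0.78),   # "fee"
    ("oshiete", 0.40),  # "tell me"
]
print(sort_by_existence_probability(candidates))
```

The resulting order reflects recognition confidence only, not the order in which the words were spoken.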
[0019]
In step S18, the "intention understanding management table" associated with the "topic" of the operation information detected in step S8 is looked up. FIG. 10 shows an example of the "intention understanding management table" associated with the topic "parking lot." In the table of FIG. 10, S, C, A, and so on denote "cases"; FIG. 11 shows an example of the cases associated with the topic "parking lot."
[0020]
In the "case-word table" shown in FIG. 11, case S denotes an adverb, for example "most" (一番) or "most" (最も). Case C denotes an adjective, for example "expensive" (高い), "near" (近い), or "cheap" (安い). Case A denotes a noun, for example "price" (値段) or "distance" (距離). Case V denotes a verb, for example "show me" (見せて) or "tell me" (教えて).
The driver's intention is extracted from the "keyword string" obtained in this way, using the "case-word table" and the "intention understanding management table" (step S20). The following describes how the "intention understanding management table" is actually used on a keyword string obtained from the voice signal as in FIG. 9. In the table of FIG. 10, the "currently accepted case structure" and an "acceptable case" together form two consecutive keywords: the "currently accepted case structure" is the preceding keyword, and the "acceptable case" is the case that follows it. In columns S to V, a cross (×) indicates that the combination does not exist and a circle (○) indicates that it does. For example, row S shows that case S cannot be followed by case S, case A, or case V, and that case C is allowed. Using the words of FIG. 11, the keyword strings "most price" (一番値段, the combination SA) and "most show me" (一番見せて, the combination SV) do not exist, whereas the keyword string "most expensive" (一番高い, the combination SC) can exist. The "output allowed" flag indicates whether a keyword string that can exist is sufficient for outputting a meaning. In the example of FIG. 10, the string SC can exist but cannot be output as a meaning, whereas the string AV (for example, "show me the price") both exists and can be output as a meaningful "intention."
[0021]
Suppose that the keyword string of FIG. 9 is input under these rules. The first and most probable keyword, "most" (一番), is case S. According to the table of FIG. 10, the only case allowed to follow case S is C, so "fee" (料金), which is case A, is rejected. The next most probable keyword after "fee" is "near" (近い); its case is C, so the string SC is allowed, and the currently accepted case structure is updated to SC. The only case acceptable after the keyword string SC is V, so "most" (最も), which is case S, is rejected from the input string of FIG. 9, and the next input keyword, "tell me" (教えて), is examined. The case of "tell me" is V, and the keyword string SCV is marked "output allowed," so "most near, tell me" (一番近い教えて), that is, "tell me the nearest," is finally output as the intention.
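The walk through the intention understanding management table described above can be sketched as a small state machine over case strings. This is an illustrative sketch, not the table of FIG. 10 itself: only the transitions and "output allowed" entries mentioned in the text (S to C, SC to V, A to V, with AV and SCV outputtable) are taken from the description, while the empty start state and the romanized keyword spellings are assumptions.

```python
# Case of each keyword, following the case-word table described for FIG. 11.
CASE_OF = {
    "ichiban": "S", "mottomo": "S",              # adverbs: "most"
    "takai": "C", "chikai": "C", "yasui": "C",   # adjectives
    "nedan": "A", "kyori": "A", "ryokin": "A",   # nouns
    "misete": "V", "oshiete": "V",               # verbs
}

# currently accepted case structure -> cases allowed to follow it.
ACCEPTABLE = {
    "": {"S", "C", "A"},  # start state: an assumption, not stated in the text
    "S": {"C"},
    "SC": {"V"},
    "A": {"V"},
}

OUTPUT_ALLOWED = {"AV", "SCV"}


def extract_intent(sorted_keywords):
    """Scan keywords in probability order; return the accepted keyword string
    once its case structure is marked "output allowed", else None."""
    structure = ""
    accepted = []
    for word in sorted_keywords:
        case = CASE_OF.get(word)
        if case is None or case not in ACCEPTABLE.get(structure, set()):
            continue  # keyword is rejected
        structure += case
        accepted.append(word)
        if structure in OUTPUT_ALLOWED:
            return accepted
    return None


print(extract_intent(["ichiban", "ryokin", "chikai", "mottomo", "oshiete"]))
```

For the FIG. 9 order, "ryokin" and "mottomo" are rejected and the accepted string is ichiban, chikai, oshiete, which corresponds to the SCV result in the text. A string that stops at SC returns None because SC is not marked "output allowed."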
[0022]
In step S22, the "output rule" corresponding to the "intention" output in step S20 is obtained. In step S24, objects having the same "topic" are searched for on the basis of the "intention" obtained in step S20, and "traffic information" is extracted from the objects found. Then, in step S26, navigation information is output according to the "output rule" obtained in step S22.
[0023]
According to the navigation device of the embodiment described above,
(1) A "topic" (the user's range of interest) is extracted according to the "operation information" obtained from the touch panel, and map information is searched using this topic together with the keywords obtained by voice input, so the target map information can be found without using advanced recognition technology for the speech recognition processing.
(2) The uttered keywords are extracted from the template file corresponding to the "topic" (the user's range of interest), so the efficiency of the speech recognition processing increases.
(3) The "case-word table" and the "intention understanding management table" corresponding to the "topic" are used, so the intention is inferred faster.
(4) Intention understanding is started only when a touch panel operation occurs, so the system is prevented from malfunctioning because of unintended conversation.
(5) In addition, only voice input within a predetermined time after the touch panel operation is accepted, which likewise prevents malfunction.
[0024]
The present invention can be variously modified without departing from the gist thereof.
The above embodiment is an example of navigation mainly in the case of searching for a parking lot. The present invention is also applicable to navigation at an intersection, for example.
For example, suppose that, as shown in FIG. 12, the display device shows that a four-way intersection lies ahead. When the driver touches the intersection object, the system recognizes "intersection" as the "topic." If the driver then says "What is the intersection's name?", the keywords "intersection" and "name" are obtained, and the system infers that the driver's intention, within the topic "intersection," is to learn the name of the intersection. If the driver further says "How many more meters?", the keyword "meters" is extracted from the voice signal. From the topic "intersection" and the keyword "meters," the system infers that the driver's intention concerns the distance from the current position to the intersection.
[0025]
The example of FIG. 13 shows the case where the driver touches the object "Shinagawa." The topics "destination" and "waypoint" are set for this object. If the driver says "Is it crowded?", "crowded" is recognized as a keyword in the voice signal. From the topics "destination" and "waypoint" and the keyword "crowded," the system infers that the driver wants to know the degree of congestion in the direction of Shinagawa.
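The intersection and Shinagawa examples both combine a topic from the touch operation with a keyword from the utterance to reach an intention. One way to picture this is a lookup keyed on (topic, keyword) pairs. The rule table below is a hypothetical illustration built from the examples in the text; the embodiment itself derives intentions through the case-structure tables.

```python
# Hypothetical rules: (topic, keyword) -> inferred intention.
INTENT_RULES = {
    ("intersection", "name"): "name of the intersection",
    ("intersection", "meters"): "distance from the current position to the intersection",
    ("destination", "crowded"): "congestion in the direction of the touched place",
    ("waypoint", "crowded"): "congestion in the direction of the touched place",
}


def infer_intent(topics, keywords):
    """Return the first intention matching any topic and keyword pair, else None."""
    for topic in topics:
        for keyword in keywords:
            intent = INTENT_RULES.get((topic, keyword))
            if intent is not None:
                return intent
    return None


print(infer_intent(["intersection"], ["meters"]))
print(infer_intent(["destination", "waypoint"], ["crowded"]))
```

The same keyword can therefore lead to different intentions, or to none, depending on which object was touched.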
[0026]
In the above embodiment, the operation information is input via the touch panel, but it may instead be input from a separately provided menu display.
[0027]
[Effects of the Invention]
The voice interactive navigation device of the present invention described above uses voice input together with input from an operating device and lets the two inputs complement each other, thereby obtaining high recognition accuracy at low cost.
[Brief description of the drawings]
FIG. 1 is a block diagram of a navigation system according to an embodiment.
FIG. 2 is a diagram showing an example of a display screen when the operation of the embodiment is shown using a parking lot as an example.
FIG. 3 is a diagram showing an example of a dialog corresponding to the screen of FIG. 2.
FIG. 4 is a block diagram illustrating the speech recognition processing procedure in the system of the embodiment.
FIG. 5 is a view for explaining the procedure of intention analysis in the embodiment.
FIG. 6 is a flowchart illustrating a control procedure according to the embodiment.
FIG. 7 is a view for explaining the structure of a map database used in the embodiment.
FIG. 8 is a diagram showing a configuration of a speech dictionary used in the embodiment.
FIG. 9 is a diagram showing an example of a recognized keyword sequence.
FIG. 10 is a diagram showing an example of an intention understanding management table.
FIG. 11 is a diagram showing an example of a case-word table.
FIG. 12 is a diagram showing a display example of navigation at an intersection.
FIG. 13 is a diagram showing a display example of navigation at an intersection.

Claims

地図を可視出力する表示手段と、
少なくとも前記地図を構成する複数のオブジェクト毎の表示図形情報と話題情報とを含む地図情報を登録する地図データベースと、
操作者により行なわれた操作を介して前記可視出力された地図上のオブジェクトの選択を受け付ける前記表示手段上に設けられたオブジェクト選択受付手段と、
前記選択されたオブジェクト情報に対応する前記話題情報を前記地図データベースから抽出する第１の抽出手段と、
前記抽出された話題情報に応じた音声辞書を利用して、前記操作者から入力される音声情報から前記音声辞書に登録されたキーワードを複数抽出する第２の抽出手段と、
前記抽出されたキーワードから、前記操作者の意図を表すキーワード列を抽出する第３の抽出手段と、
前記地図データベースから前記抽出された話題情報と前記抽出されたキーワード列とに対応する前記地図情報を検索する検索手段と、
を備え、前記表示手段は前記検索された地図情報を表示することを特徴とする音声対話型ナビゲーション装置。
Display means for visually outputting a map;
A map database for registering map information including at least display graphic information and topic information for each of a plurality of objects constituting the map;
Object selection receiving means provided on the display means for receiving selection of an object on the map that has been visually output through an operation performed by an operator;
First extraction means for extracting the topic information corresponding to the selected object information from the map database;
A second extraction unit that extracts a plurality of keywords registered in the voice dictionary from voice information input by the operator, using a voice dictionary corresponding to the extracted topic information;
Third extraction means for extracting a keyword sequence representing the intention of the operator from the extracted keywords;
Search means for searching the map information corresponding to the extracted topic information and the extracted keyword string from the map database;
A voice interactive navigation device comprising the above, wherein the display means displays the retrieved map information.

前記オブジェクト選択受付手段よりオブジェクトの選択を受け付けた後、所定時間の計時を行う計時手段を更に備え、
前記第１の抽出手段は、前記計時手段により前記所定時間が計時される以前に入力された音声情報から前記抽出を行うことを特徴とする請求項１に記載の音声対話型ナビゲーション装置。
The voice interactive navigation device according to claim 1, further comprising timing means for measuring a predetermined time after the object selection receiving means receives the selection of the object,
wherein the first extracting means performs the extraction from voice information input before the predetermined time has been measured by the timing means.

前記第２の抽出手段は、
前記抽出された複数のキーワードを所定の存在確率に応じて並び替えるソート手段と、
前記並び替えられた順に、前記各キーワードの品詞に基づく連続性を判定する判定手段とを備え、
前記判定手段において連続性を有すると判定されたキーワードを、前記キーワード列として出力することを特徴とする請求項１又は２に記載の音声対話型ナビゲーション装置。
The voice interactive navigation device according to claim 1 or 2, wherein the second extracting means comprises:
sorting means for rearranging the plurality of extracted keywords according to a predetermined existence probability; and
determining means for determining, in the rearranged order, continuity based on the part of speech of each of the keywords,
and wherein the keywords determined by the determining means to have continuity are output as the keyword string.