JP2007041089A - Information terminal and speech recognition program

Info

Publication number: JP2007041089A
Application number: JP2005222326A
Authority: JP
Inventors: Tomoya Ozaki; 友哉尾崎; Hideki Nakamura; 秀樹中村
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-08-01
Filing date: 2005-08-01
Publication date: 2007-02-15

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition technology that enables user-friendly speech recognition processing by solving the problem that distributed speech recognition (DSR) takes time even for simple recognition because it relies on communication.

SOLUTION: The information terminal is provided with a first speech recognition means that performs speech recognition within the terminal and a second speech recognition means that uses DSR. Speech recognition is performed by the first speech recognition means when the input voice is simple, and by the second speech recognition means when it is complex.

COPYRIGHT: (C) 2007, JPO & INPIT

Description

The present invention relates to speech recognition technology.

As information terminals such as mobile phones, car navigation systems, and home AV equipment gain more functions, operating them is becoming more complicated. User interfaces based on speech recognition are increasingly used to make such complicated operations easy.
Speech recognition usually imposes a heavy processing load or requires a large-scale speech recognition database, which has limited, for example, the number of words a terminal can recognize. One technique for removing such limits is distributed speech recognition (hereinafter referred to as DSR), in which speech recognition processing is performed on the server side, as described in Patent Document 1.

Patent Document 1: JP-A-2005-55606

In DSR, the information terminal executes feature point extraction on the speech and transmits the feature point data to the speech recognition server. The speech recognition server recognizes the speech using the received feature point data and transmits the result to the information terminal.
However, DSR has two problems: it cannot be used in an environment where communication is not possible (for example, when a mobile phone is out of range), and it takes time even for simple speech recognition because it relies on communication.

To solve the above problems, in the present invention the information terminal is provided with voice input means for capturing voice, feature point extraction means for extracting feature point data from the voice input through the voice input means, and complexity determination means for determining the complexity of the input voice. The information terminal is further provided with first speech recognition means that performs speech recognition inside the terminal and second speech recognition means that performs speech recognition using DSR. When the complexity determination means determines that the input is "simple", speech recognition processing is executed using the first speech recognition means; when it determines that the input is "complex", the second speech recognition means is used.

Executing simple recognition processing inside the information terminal in this way improves response. In addition, using DSR makes it possible to execute complicated speech recognition processing as well, so that an advanced user interface can be built.
Furthermore, even when the information terminal cannot communicate, simple speech recognition processing, such as recognition of terminal operations, can still be executed.

According to the present invention, user-friendly speech recognition processing can be executed.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing an outline of the hardware of an information terminal and a speech recognition server according to an embodiment of the present invention. The information terminal is assumed to be, for example, a mobile phone, a car navigation system, or home AV equipment.

In the figure, reference numeral 100 denotes an information terminal. Reference numeral 101 denotes a CPU, which controls peripheral units and executes various programs related to data processing and communication. Reference numeral 102 denotes a voice input unit, for example a microphone. Reference numeral 103 denotes an input processing unit, such as a keypad or a remote controller. Reference numeral 104 denotes a storage unit, such as a RAM or a flash ROM. Reference numeral 105 denotes a communication processing unit, for example a mobile phone communication function, Ethernet (registered trademark), or wireless LAN. The information terminal 100 exchanges data with the speech recognition server 110 via the communication processing unit 105. Reference numeral 106 denotes a display processing unit, for example an LCD (Liquid Crystal Display). Reference numeral 107 denotes a speech recognition processing unit, which executes acoustic analysis processing related to speech recognition and recognition processing using the speech recognition DB 108. Although the speech recognition processing unit 107 is depicted here as hardware, the speech recognition processing may instead be executed in software on the CPU 101.

In the figure, reference numeral 110 denotes a speech recognition server. Reference numeral 111 denotes a CPU, which controls peripheral units and executes various programs related to data processing and communication. Reference numeral 112 denotes a communication processing unit, such as Ethernet (registered trademark). The speech recognition server 110 exchanges data with the information terminal 100 via the communication processing unit 112. Reference numeral 113 denotes a storage unit, such as a RAM or a flash ROM. Reference numeral 114 denotes a speech recognition processing unit, which executes feature point extraction processing (acoustic analysis processing) related to speech recognition and recognition processing using the speech recognition DB 118. Although the speech recognition processing unit 114 is depicted here as hardware, the speech recognition processing may instead be executed in software on the CPU 111.
In the figure, reference numeral 120 denotes a communication line. Examples of the communication line include a mobile phone network, a public telephone line, and an ADSL line.

Next, the processing by which the information terminal 100 recognizes voice data input by the user (voice processing) will be described with reference to FIG. 2.
In the figure, reference numeral 200 denotes a flowchart showing an outline of the voice processing in the information terminal 100.

In the voice processing 200, voice data is first captured via the voice input unit 102 (step 201). Next, the speech recognition processing unit 107 executes feature point extraction on the voice data captured in step 201 (step 202). Next, the complexity of the data input in step 201 is determined (step 203). Criteria for determining complexity include, for example, the length (duration) of the input voice data and the amount of data after feature point extraction. When the voice data length is used as the criterion, the input is determined to be "simple" if, for example, the voice data is shorter than 3 seconds, and "complex" if it is 3 seconds or longer. When the amount of data after feature point extraction is used as the criterion, the input is determined to be "simple" if, for example, the data is smaller than 1 Kbyte, and "complex" if it is 1 Kbyte or larger.
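
The two example criteria for step 203 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `Complexity`, `complexity_by_duration`, and `complexity_by_feature_size` are hypothetical, and 1 Kbyte is assumed to mean 1024 bytes.

```python
from enum import Enum


class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


# Example thresholds from the description (3 seconds, 1 Kbyte).
# Treating 1 Kbyte as 1024 bytes is an assumption of this sketch.
MAX_SIMPLE_SECONDS = 3.0
MAX_SIMPLE_FEATURE_BYTES = 1024


def complexity_by_duration(duration_seconds: float) -> Complexity:
    """Shorter than 3 s is 'simple'; 3 s or longer is 'complex'."""
    if duration_seconds < MAX_SIMPLE_SECONDS:
        return Complexity.SIMPLE
    return Complexity.COMPLEX


def complexity_by_feature_size(feature_data: bytes) -> Complexity:
    """Smaller than 1 Kbyte of feature data is 'simple'; otherwise 'complex'."""
    if len(feature_data) < MAX_SIMPLE_FEATURE_BYTES:
        return Complexity.SIMPLE
    return Complexity.COMPLEX
```

Both functions use a strict "less than" comparison so that an input exactly at the threshold is classified as "complex", matching the "3 seconds or longer" and "1 Kbyte or larger" wording.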

If the input is determined to be "simple" in step 203, speech recognition is executed by the speech recognition processing unit 107 in the information terminal 100 (step 204). If the input is determined to be "complex" in step 203, speech recognition using DSR is executed (step 205). In speech recognition using DSR, the data extracted in step 202 is transmitted to the speech recognition server 110 via the communication processing unit 105, and the speech recognition result is received.

When a speech recognition result is obtained in step 204 or step 205, processing based on the result is executed (step 206). Here, processing based on the speech recognition result means, for example, executing the processing that corresponds to a command when the input voice is a command, or handling the result as text input when the input voice is a voice memo.
As described above, executing the appropriate speech recognition processing according to the complexity of the input data improves the response to simple input while still allowing complex speech recognition to be executed.
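
The flow of steps 202 to 206 can be sketched as a single function. This is a hedged illustration only: `voice_process_200` and all of its parameters are hypothetical names, the recognizers are passed in as callables because the description does not specify their interfaces, and the 3-second duration threshold is the example value from the text.

```python
from typing import Callable, Optional


def voice_process_200(
    audio: bytes,
    duration_seconds: float,
    extract_features: Callable[[bytes], bytes],
    recognize_local: Callable[[bytes], Optional[str]],
    recognize_dsr: Callable[[bytes], Optional[str]],
    handle_result: Callable[[str], None],
    max_simple_seconds: float = 3.0,
) -> Optional[str]:
    """Route captured audio (step 201) to in-terminal or DSR recognition."""
    features = extract_features(audio)                  # step 202
    is_simple = duration_seconds < max_simple_seconds   # step 203
    if is_simple:
        result = recognize_local(features)              # step 204
    else:
        result = recognize_dsr(features)                # step 205
    if result is not None:
        handle_result(result)                           # step 206
    return result
```

Returning `None` when no result is obtained is an assumption of this sketch; the description only states what happens when a result is obtained.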

Next, another example of voice processing in the information terminal 100 will be described with reference to FIG. 3.
In the figure, reference numeral 210 denotes a flowchart showing an outline of the voice processing in the information terminal 100.

In the voice processing 210, voice data is first captured via the voice input unit 102 (step 211). Next, the speech recognition processing unit 107 executes feature point extraction on the voice data captured in step 211 (step 212). Next, speech recognition is executed by the speech recognition processing unit 107 in the information terminal 100 (step 213). It is then determined whether a speech recognition result has been obtained (step 214). If a speech recognition result has been obtained, processing based on the result is executed (step 216). If no speech recognition result has been obtained, speech recognition using DSR is executed (step 215), and step 216 is then executed.

As described above, executing speech recognition in the information terminal 100 first, and executing speech recognition by DSR only when the input cannot be recognized in the terminal, improves the response to simple input while still allowing complex speech recognition to be executed.
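
The local-first flow of steps 212 to 216 can be sketched as below. The function and parameter names are hypothetical, and a recognizer is assumed to return `None` when it obtains no result. The description does not say what happens if DSR also yields no result; this sketch simply skips step 216 in that case.

```python
from typing import Callable, Optional


def voice_process_210(
    audio: bytes,
    extract_features: Callable[[bytes], bytes],
    recognize_local: Callable[[bytes], Optional[str]],
    recognize_dsr: Callable[[bytes], Optional[str]],
    handle_result: Callable[[str], None],
) -> Optional[str]:
    """Try in-terminal recognition first; fall back to DSR only on failure."""
    features = extract_features(audio)      # step 212
    result = recognize_local(features)      # step 213
    if result is None:                      # step 214: no local result
        result = recognize_dsr(features)    # step 215
    if result is not None:
        handle_result(result)               # step 216
    return result
```

Unlike the complexity-based routing, this variant needs no threshold: the in-terminal recognizer's own failure acts as the signal that the input is too complex for the terminal.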

In both voice processing 200 and voice processing 210, speech recognition for simple input is executed within the information terminal 100. Therefore, even when communication is not possible (for example, when a mobile phone is out of range), simple voice input can still be processed, which improves convenience for the user.
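
The out-of-range behavior can be illustrated with a small guard around the DSR call. This is a sketch under stated assumptions: the function name is hypothetical, and the DSR recognizer is assumed to raise Python's built-in `ConnectionError` when the server cannot be reached, which the description does not specify.

```python
from typing import Callable, Optional


def recognize_with_offline_guard(
    features: bytes,
    is_simple: bool,
    recognize_local: Callable[[bytes], Optional[str]],
    recognize_dsr: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Simple input never touches the network, so it works while offline."""
    if is_simple:
        return recognize_local(features)
    try:
        return recognize_dsr(features)
    except ConnectionError:
        # Assumed failure mode: no connectivity means no result
        # for complex input, while simple input is unaffected.
        return None
```

With a DSR callable that always raises `ConnectionError`, a simple command still returns the in-terminal result, while a complex input returns `None`.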

FIG. 1 is a block diagram showing the configuration of the information terminal in an embodiment of the present invention.
FIG. 2 is a flowchart showing an outline of the voice processing in an embodiment of the present invention.
FIG. 3 is a flowchart showing an outline of the voice processing in another embodiment of the present invention.

Explanation of symbols

100 ... information terminal
110 ... speech recognition server
120 ... communication line

Claims

An information terminal comprising:
voice input means for capturing voice;
complexity determination means for determining the complexity of the captured voice;
first speech recognition means for performing speech recognition without communication; and
second speech recognition means for performing speech recognition by communicating with a speech recognition server,
wherein speech recognition processing is executed using the first speech recognition means when the complexity determined by the complexity determination means is lower than a certain reference, and using the second speech recognition means when it is higher.

The information terminal according to claim 1, wherein the length of the input voice is used to determine the complexity.

The information terminal according to claim 1, wherein the size of the data obtained by extracting feature points from the input voice is used as the criterion for the complexity.

A speech recognition program causing a computer to execute:
a voice capture step of capturing voice;
a complexity determination step of determining the complexity of the voice captured in the voice capture step; and
either a first speech recognition step of performing speech recognition without communication when the voice is determined to be simple in the complexity determination step, or a second speech recognition step of performing speech recognition by communicating with a speech recognition server when the voice is determined to be complex in the complexity determination step.

The speech recognition program according to claim 4, wherein, in the complexity determination step, the length of the voice captured in the voice capture step is used to determine the complexity.

The speech recognition program according to claim 4, further comprising, before the complexity determination step, a feature point extraction step of extracting features of the voice captured in the voice capture step, wherein the amount of feature point data extracted in the feature point extraction step is used as the criterion for determining the complexity.

A speech recognition program causing a computer to execute:
a voice capture step of capturing voice;
a first speech recognition step of performing speech recognition without communication; and
a second speech recognition step of performing speech recognition by communicating with a speech recognition server when no recognition result is obtained in the first speech recognition step.