JP2002041083A

JP2002041083A - Remote control system, remote control method and memory medium

Info

Publication number: JP2002041083A
Application number: JP2000219352A
Authority: JP
Inventors: Yasuko Kato; 靖子加藤
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2000-07-19
Filing date: 2000-07-19
Publication date: 2002-02-08

Abstract

PROBLEM TO BE SOLVED: To reduce the size of the voice input terminal in a remote control system. SOLUTION: Voice input through an uttered voice input means 200 of a voice input terminal 100 is analyzed by a voice analysis front-end processor 201 to obtain an amplitude spectrum, a difference calculation means 202 calculates the difference of the amplitude spectrum, and an analysis processing intermediate information data transmitting means 203 transmits the resulting difference data. The difference data are received by an analysis processing intermediate information data receiving means 204 of a voice recognition processing main unit 101 and restored to the amplitude spectrum by a received data restoration means 205. Environmental sound input through an environmental sound input means 210 is analyzed by an environmental sound analysis front-end processing means 211 to obtain its amplitude spectrum. A voice analysis back-end processor 206 removes the environmental sound contained in the voice on the basis of the amplitude spectra of both the voice and the environmental sound, a feature extraction means 207 calculates feature amounts by referring to a feature amount calculation table 212, a matching processing means 208 calculates a distance value for each vocabulary item by referring to standard patterns 213 and a recognition target vocabulary dictionary 214, and a recognition result output means 209 outputs the recognition result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a remote control system and a remote control method, and more particularly to a remote control system and a remote control method that recognize a voice uttered by a user and control a device at a remote location.

[0002]

[Description of the Related Art] An example of a conventional voice recognition apparatus is disclosed in Japanese Unexamined Patent Application Publication No. H9-062289. An outline of this conventional technique is described with reference to FIG. 6. As shown in FIG. 6, the conventional voice recognition apparatus consists of a voice input terminal 1 and a voice recognition processing main unit 2. The voice input terminal 1 takes in voice, extracts feature amounts from the input voice, and transmits data corresponding to the feature amounts to the voice recognition processing main unit 2. The voice recognition processing main unit 2 receives the data corresponding to the feature amounts of the input voice transmitted from the voice input terminal 1 and performs the main voice recognition processing.

[0003] As shown in FIG. 6, the voice input terminal 1 comprises: a voice input unit 10 that takes in voice; a voice analysis unit 11 that analyzes the voice input through the voice input unit 10; a standard voice feature data storage unit 14 that stores standard voice feature data for the recognizable words; a speaker adaptation conversion rule storage unit 15 that stores rules for adapting the feature amounts of a specific user's voice to those of a standard speaker; a feature amount conversion unit 12 that converts the feature amounts of the specific user's voice into feature amounts of the standard speaker's voice on the basis of the rules stored in the speaker adaptation conversion rule storage unit 15; and a signal transmission unit 13 that transmits a signal corresponding to the converted feature amounts output by the feature amount conversion unit 12 (a feature amount signal) to the voice recognition processing main unit 2.

[0004] As also shown in FIG. 6, the voice recognition processing main unit 2 comprises: a signal reception unit 16 that receives the feature amount signal transmitted from the signal transmission unit 13 of the voice input terminal 1; a standard voice feature data storage unit 19 that stores standard voice feature data for the recognizable words; a word detection unit 17 that recognizes and detects words in the input voice by comparing the feature amount signal received by the signal reception unit 16 with the standard voice feature data for the recognizable words stored in the standard voice feature data storage unit 19; and a result output unit 18 that outputs the recognition result of the word detection unit 17.

[0005] Next, the operation of the voice recognition apparatus shown in FIG. 6 is described. When voice uttered by the user (hereinafter referred to as uttered voice where appropriate) is input through the voice input unit 10, it is analyzed by the voice analysis unit 11, and the resulting feature amounts are input to the feature amount conversion unit 12. The feature amount conversion unit 12 converts the received feature amounts into feature amounts of the standard speaker on the basis of the conversion rules stored in the speaker adaptation conversion rule storage unit 15, and the converted feature amounts are transmitted from the signal transmission unit 13.

[0006] In the voice recognition processing main unit 2, the signal reception unit 16 receives the data indicating the feature amounts transmitted from the signal transmission unit 13 of the voice input terminal 1 and supplies it to the word detection unit 17. The standard voice feature data storage unit 19 stores standard voice feature data for the recognizable words. The word detection unit 17 recognizes and detects words in the input voice by comparing the data indicating the feature amounts received by the signal reception unit 16 (the feature amounts of the voice input by the user, converted into feature amounts of the standard speaker) with the standard voice feature data for the recognizable words stored in the standard voice feature data storage unit 19. The result output unit 18 then outputs the recognition result of the word detection unit 17.

[0007] In this way, the voice of an unspecified speaker is recognized at an extremely high recognition rate regardless of differences in characteristics due to the user's age, gender, or individual manner of speaking, which makes remote control by voice possible.

[0008] As a method of improving recognition performance in an environment with ambient noise, there is a technique that uses two inputs: the voice uttered by the user and environmental sound such as ambient noise. FIG. 7 shows an example in which an environmental sound input unit 20, a device for inputting environmental sound, is installed on the voice input terminal 1 side of the conventional voice recognition apparatus shown in FIG. 6. The voice analysis unit 11 can remove part or all of the environmental sound, such as ambient noise, contained in the voice input through the voice input unit 10, on the basis of the environmental sound input through the environmental sound input unit 20. Using a voice input terminal 1 configured in this way therefore improves recognition performance for voice input through the voice input unit 10 in an environment with ambient noise.

[0009]

[Problems to Be Solved by the Invention] However, because the conventional method of improving recognition performance in an environment with ambient noise uses two inputs, the voice uttered by the user and environmental sound such as ambient noise, it poses the following problems when a conventional voice recognition apparatus is used.

[0010] The first problem is that the ambient noise removal processing has to be performed at the voice analysis stage. With the conventional technique, the voice input terminal 1 performs everything from voice analysis through feature amount calculation, so all of this processing must be implemented on the voice input terminal 1 side. Performing the processing on the voice input terminal 1 side in real time therefore requires a high-speed CPU (central processing unit).

[0011] The second problem is that the increased amount of processing on the voice input terminal 1 side increases the required power consumption, so the hardware scale of the power supply section of the voice input terminal 1 becomes large.

[0012] The third problem is that, as shown in FIG. 7, an environmental sound input unit 20 serving as a device for inputting environmental sound must be newly installed on the voice input terminal 1 side of the conventional voice recognition apparatus shown in FIG. 6, so the hardware scale of the voice input terminal 1 becomes large and its cost rises.

[0013] The present invention has been made in view of these circumstances. It reduces the hardware scale of the voice input terminal, improves voice recognition performance in the presence of ambient noise, and enables efficient remote control of an electronic device equipped with the unit that mainly performs the voice recognition processing.

[0014]

[Means for Solving the Problems] The remote control system according to claim 1 is a remote control system in which a user gives an instruction by uttered voice using a first device to remotely control a second device located a predetermined distance away from the first device. The first device comprises: uttered voice input means for inputting the voice uttered by the user; first analysis means for analyzing the uttered voice input by the uttered voice input means; and transmission means for transmitting data corresponding to a first analysis result produced by the first analysis means to the second device. The second device comprises: reception means for receiving the data corresponding to the first analysis result transmitted from the transmission means of the first device; environmental sound input means for inputting environmental sound around the second device; second analysis means for analyzing the environmental sound input by the environmental sound input means; environmental sound removal means for removing the environmental sound contained in the uttered voice input by the uttered voice input means, on the basis of the data corresponding to the first analysis result received by the reception means and a second analysis result produced by the second analysis means; feature amount extraction means for extracting feature amounts of the uttered voice from which the environmental sound has been removed by the environmental sound removal means; recognition means for recognizing the uttered voice on the basis of the feature amounts extracted by the feature amount extraction means; and output means for outputting the recognition result of the recognition means.

The first device may further comprise difference calculation means for calculating the difference of the first analysis result of the first analysis means at predetermined time intervals, with the transmission means transmitting the difference calculated by the difference calculation means to the second device as the data corresponding to the first analysis result. In that case, the reception means of the second device receives the difference as the data corresponding to the first analysis result transmitted from the transmission means of the first device, the second device further comprises restoration means for restoring the first analysis result of the first analysis means from the difference, and the environmental sound removal means removes the environmental sound contained in the uttered voice input by the uttered voice input means on the basis of the first analysis result restored by the restoration means and the second analysis result of the second analysis means.

The first analysis means of the first device may perform the processing up to frequency analysis of the uttered voice, and the second analysis means of the second device may perform the processing of removing the environmental sound on the data after frequency analysis by the first analysis means.

The second device may further comprise sound output means for outputting sound, with the sound output by the sound output means supplied to the second analysis means. In that case, the second analysis means analyzes the environmental sound input by the environmental sound input means and the sound input from the sound output means, and the environmental sound removal means removes both the environmental sound and that output sound contained in the uttered voice input by the uttered voice input means, on the basis of the first analysis result restored by the restoration means and the second analysis result of the second analysis means.

The remote control method according to claim 5 is a remote control method for a remote control system in which a user gives an instruction by uttered voice using a first device to remotely control a second device located a predetermined distance away from the first device. In the first device, the method comprises: an uttered voice input step of inputting the voice uttered by the user; a first analysis step of analyzing the uttered voice input in the uttered voice input step; and a transmission step of transmitting data corresponding to a first analysis result of the first analysis step to the second device. In the second device, the method comprises: a reception step of receiving the data corresponding to the first analysis result transmitted in the transmission step of the first device; an environmental sound input step of inputting environmental sound around the second device; a second analysis step of analyzing the environmental sound input in the environmental sound input step; an environmental sound removal step of removing the environmental sound contained in the uttered voice input in the uttered voice input step, on the basis of the data corresponding to the first analysis result received in the reception step and a second analysis result of the second analysis step; a feature amount extraction step of extracting feature amounts of the uttered voice from which the environmental sound has been removed in the environmental sound removal step; a recognition step of recognizing the uttered voice on the basis of the feature amounts extracted in the feature amount extraction step; and an output step of outputting the recognition result of the recognition step.

The recording medium according to claim 6 has recorded on it a program capable of executing the remote control method according to claim 5.

In the remote control system and the remote control method according to the present invention, the first device inputs the voice uttered by the user, analyzes the input uttered voice, and transmits data corresponding to the first analysis result obtained by the analysis to the second device. The second device receives the data corresponding to the first analysis result transmitted by the first device, inputs environmental sound around the second device, analyzes the input environmental sound, removes the environmental sound contained in the input uttered voice on the basis of the received data corresponding to the first analysis result and the second analysis result obtained by the analysis, extracts feature amounts of the uttered voice from which the environmental sound has been removed, recognizes the uttered voice on the basis of the extracted feature amounts, and outputs the recognition result.

[0015]

[Embodiments of the Invention] The voice recognition apparatus of the present invention takes in voice and recognizes it when a device located a relatively short distance away is remotely operated by voice. The apparatus consists of a voice input terminal for inputting voice and a voice recognition processing main unit that mainly performs the voice recognition processing. It allows the hardware scale of the voice input terminal to be reduced and improves voice recognition performance in the presence of ambient noise.

[0016] The configuration and operation of one embodiment of the voice recognition apparatus of the present invention are described below. FIG. 1 is a block diagram showing a configuration example of one embodiment of the voice recognition apparatus of the present invention. As shown in the figure, the voice recognition apparatus consists of a voice input terminal 100 and a voice recognition processing main unit 101.

[0017] The voice input terminal 100 takes in the voice uttered by a speaker (user), generates analysis processing intermediate information data from the captured voice (input voice), and transmits the data to the voice recognition processing main unit 101. The voice recognition processing main unit 101 receives the analysis processing intermediate information data transmitted from the voice input terminal 100 and performs ambient noise removal processing on the basis of the original voice (input voice) restored from the received data and environmental sound, such as ambient noise, captured by the voice recognition processing main unit 101 itself.

[0018] Next, the main unit extracts the feature amounts of the input voice, performs the main voice recognition processing, and outputs the recognition result. That is, it outputs, as the recognition result, the most similar word among the recognition target words.

[0019] The voice input terminal 100 consists of an uttered voice input unit 200, a voice analysis front-end processing unit 201, a difference calculation unit 202, and an analysis processing intermediate information data transmission unit 203. The voice recognition processing main unit 101 consists of an analysis processing intermediate information data reception unit 204, an environmental sound input unit 210, a received data restoration unit 205, a voice analysis back-end processing unit 206, a feature amount extraction unit 207, a matching processing unit 208, a recognition result output unit 209, an environmental sound analysis front-end processing unit 211, a feature amount calculation table 212, standard patterns 213, and a recognition target word dictionary 214.

[0020] The uttered voice input unit 200 of the voice input terminal 100 takes in the voice uttered by the speaker. The voice analysis front-end processing unit 201 performs, on the voice input through the uttered voice input unit 200, the part of the analysis processing required for voice recognition that precedes the point at which environmental sound becomes necessary. That is, it performs framing, high-frequency emphasis, windowing, and a Fourier transform, and outputs an amplitude spectrum.
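The front-end chain in [0020] (framing, high-frequency emphasis, windowing, Fourier transform, amplitude spectrum) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the 0.97 emphasis coefficient, the Hamming window, the non-overlapping 64-sample frames, and the direct DFT are all assumptions chosen to keep the example short and dependency-free.

```python
import cmath
import math


def pre_emphasize(frame, coeff=0.97):
    """High-frequency emphasis: y[n] = x[n] - coeff * x[n-1] (coeff is assumed)."""
    return [frame[0]] + [frame[n] - coeff * frame[n - 1] for n in range(1, len(frame))]


def hamming(frame):
    """Apply a Hamming window (an assumed window choice) to taper the frame edges."""
    size = len(frame)
    return [
        x * (0.54 - 0.46 * math.cos(2 * math.pi * n / (size - 1)))
        for n, x in enumerate(frame)
    ]


def amplitude_spectrum(frame):
    """Magnitude of a direct DFT for bins 0..N/2 (O(N^2), fine for a demo)."""
    size = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * n / size) for n, x in enumerate(frame)))
        for k in range(size // 2 + 1)
    ]


def front_end(samples, frame_len=64):
    """Split samples into non-overlapping frames and return one spectrum per frame."""
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    return [amplitude_spectrum(hamming(pre_emphasize(f))) for f in frames]


# Usage: a pure tone that falls exactly on bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(128)]
spectra = front_end(tone)
peak_bin = max(range(len(spectra[0])), key=lambda k: spectra[0][k])
```

A real terminal would use an FFT and probably overlapping frames; the point here is only the order of the four stages and that the output is one amplitude spectrum per frame.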

[0021] The difference calculation unit 202 calculates the difference over time (amplitude spectrum difference data) of the data output from the voice analysis front-end processing unit 201 (amplitude spectrum data). The analysis processing intermediate information data transmission unit 203 transmits the difference calculated by the difference calculation unit 202 to the voice recognition processing main unit 101 as analysis processing intermediate information data. For the first frame, the analysis processing intermediate information data transmission unit 203 transmits the amplitude spectrum data to the voice recognition processing main unit 101 as is.
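The transmit-side rule in [0021] (first frame sent as is, every later frame sent as the difference from its predecessor) might look like the sketch below. The function name `encode_differences` and the list-of-lists packet layout are hypothetical; the patent does not specify a data format.

```python
def encode_differences(spectra):
    """Return the first spectrum unchanged, then frame-to-frame differences."""
    packets = []
    previous = None
    for spectrum in spectra:
        if previous is None:
            # First frame: send the amplitude spectrum itself.
            packets.append(list(spectrum))
        else:
            # Later frames: send only the change from the previous frame.
            packets.append([cur - prev for cur, prev in zip(spectrum, previous)])
        previous = spectrum
    return packets


# Usage with three tiny 3-bin "spectra" (made-up values).
spectra = [[1.0, 4.0, 2.0], [1.5, 3.0, 2.0], [2.0, 3.0, 1.0]]
packets = encode_differences(spectra)
```

Because neighbouring frames of speech tend to have similar spectra, the differences are typically small values, which is presumably what makes them cheaper to transmit than the raw spectra.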

[0022] The analysis processing intermediate information data reception unit 204 of the voice recognition processing main unit 101 receives the analysis processing intermediate information data (amplitude spectrum difference data) transmitted from the analysis processing intermediate information data transmission unit 203 of the voice input terminal 100 (for the first frame, it receives the amplitude spectrum data) and supplies it to the received data restoration unit 205 described below. The environmental sound input unit 210 takes in the noise around the voice recognition processing main unit 101 (environmental sound) and supplies it to the environmental sound analysis front-end processing unit 211. The environmental sound analysis front-end processing unit 211 performs the same analysis processing on the environmental sound supplied from the environmental sound input unit 210 as the voice analysis front-end processing unit 201 does. That is, it performs framing, high-frequency emphasis, windowing, and a Fourier transform, and outputs an amplitude spectrum corresponding to the environmental sound.

[0023] The received data restoration unit 205 restores the data output from the voice analysis front-end processing unit 201 of the voice input terminal 100 (amplitude spectrum data) from the analysis processing intermediate information data (amplitude spectrum difference data) supplied by the analysis processing intermediate information data reception unit 204 and the amplitude spectrum data of the first frame. The voice analysis back-end processing unit 206 performs ambient noise removal on the basis of the amplitude spectrum data corresponding to the input voice restored by the received data restoration unit 205 and the amplitude spectrum data corresponding to the environmental sound supplied from the environmental sound analysis front-end processing unit 211, detects the voice section, and extracts voice data from which the ambient noise has been removed. Here, the voice section is the span from the start to the end of an utterance; when recognition is performed word by word, one word is the voice section.
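The two receive-side steps in [0023] can be illustrated with a short sketch. Restoration is a running sum starting from the first frame. For noise removal, this passage does not name an algorithm, so the example substitutes plain magnitude spectral subtraction with a zero floor as an assumed stand-in; the function names and sample values are hypothetical.

```python
def restore_spectra(packets):
    """Rebuild amplitude spectra from a first frame plus successive differences."""
    restored = []
    for packet in packets:
        if not restored:
            # The first packet is the amplitude spectrum itself.
            restored.append(list(packet))
        else:
            # Add the difference to the previously restored frame.
            restored.append([prev + d for prev, d in zip(restored[-1], packet)])
    return restored


def subtract_noise(voice_spectrum, noise_spectrum, floor=0.0):
    """Assumed technique: per-bin spectral subtraction, clamped at `floor`."""
    return [max(v - n, floor) for v, n in zip(voice_spectrum, noise_spectrum)]


# Usage: packets as a terminal might send them, plus a locally measured noise spectrum.
packets = [[1.0, 4.0, 2.0], [0.5, -1.0, 0.0], [0.5, 0.0, -1.0]]
restored = restore_spectra(packets)
noise = [0.5, 0.5, 2.5]
cleaned = [subtract_noise(spectrum, noise) for spectrum in restored]
```

Note that the noise spectrum here comes from the main unit's own microphone, so the terminal never needs a second input; that split is the design point the description is making.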

[0024] The feature amount extraction unit 207 extracts feature amount data from the voice data analyzed by the voice analysis back-end processing unit 206, from which the ambient noise has been removed, using the feature amount calculation table 212. Calculating the feature amounts involves evaluating various functions, for example a cosine transform (cos X). Evaluating such functions directly takes a large amount of processing and a long time, so the result is computed in advance for each value of X and held as table data. At processing time, the result corresponding to the value of X is looked up in the table. The feature amount calculation table 212 is the table that stores these precomputed results for each function.
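The table lookup described in [0024] can be sketched as below for the cosine case. The table size of 1024 entries per period and the nearest-entry rounding are assumptions for illustration; the patent does not state the table's resolution or indexing scheme.

```python
import math

TABLE_SIZE = 1024  # assumed resolution: entries per full period of cos

# Precompute cos(X) once for evenly spaced X in [0, 2*pi).
COS_TABLE = [math.cos(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def table_cos(x):
    """Approximate cos(x) by looking up the nearest precomputed entry."""
    index = round(x / (2 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return COS_TABLE[index]


# Usage: compare the lookup with the directly computed value.
approx = table_cos(1.0)
exact = math.cos(1.0)
```

The trade-off is memory and a small quantization error in exchange for avoiding a function evaluation per call; with 1024 entries the error stays below about 0.004.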

[0025] On the basis of the feature amount data extracted by the feature amount extraction unit 207, the matching processing unit 208 obtains a distance value for each word by referring to the standard patterns 213 for the recognition units and the recognition target word dictionary 214. The recognition result output unit 209 outputs, as the recognition result, the word whose distance value obtained by the matching processing unit 208 is the smallest.
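The selection rule in [0025] (compute one distance per vocabulary word, output the word with the smallest) is sketched below. For brevity each word is represented by a single made-up feature vector and the distance is squared Euclidean; both are assumptions, since a real matcher would compare sequences of feature vectors against the standard patterns 213. The names `recognize` and `templates` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(features, templates):
    """Return the word with the smallest distance, plus all distances."""
    distances = {
        word: squared_distance(features, pattern)
        for word, pattern in templates.items()
    }
    best_word = min(distances, key=distances.get)
    return best_word, distances


# Usage with a three-word dictionary of illustrative templates.
templates = {
    "on": [1.0, 0.0, 0.0],
    "off": [0.0, 1.0, 0.0],
    "up": [0.0, 0.0, 1.0],
}
word, distances = recognize([0.1, 0.9, 0.2], templates)
```

Here `word` is `"off"`, because its template lies closest to the input features.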

[0026] Next, the operation of the voice input terminal 100 shown in FIG. 1 is described with reference to the flowchart of FIG. 2. First, the voice uttered by a user who wishes to use the voice input terminal 100 to remotely control a device equipped with the voice recognition processing main unit 101 is taken in by the uttered voice input unit 200 (STEP 10). The uttered voice input unit 200 converts the input voice into digital voice data and groups the data into blocks of a predetermined number of samples. To form frames, the units on which voice recognition processing is performed, it continues capturing voice until one frame of voice data is available.

【００２７】即ち、ＳＴＥＰ２０において、１フレーム
分の音声データの取り込みが終了したか否かが判定さ
れ、１フレーム分の音声データの取り込みが終了してい
ないと判定された場合、ＳＴＥＰ１０、ＳＴＥＰ２０の
処理が繰り返し実行され、１フレーム分の音声データの
取り込みが終了したと判定された場合、ＳＴＥＰ３０に
進む。ここで、１フレームとは、例えば、数十ミリ秒
（ｍｓ）乃至数百ｍｓの時間的に区切られた単位であ
る。音声をサンプリングして処理する場合、サンプリン
グされたデータをいくつかずつのかたまりとして、かた
まり毎に音声分析処理を行うのが通常である。イメージ
としては、各サンプリングデータを + で表すと、[ ]内
が１フレームという単位ということになる。ここで、左
右方向に時間軸があるとする。 [ + + + + + + + + ] [ + + + + + + + + ][ + + + + +
+ + + ] ・・・That is, in STEP 20 it is determined whether one frame of voice data has been captured. If not, STEP 10 and STEP 20 are repeated; if so, the process proceeds to STEP 30. Here, one frame is a unit of time of, for example, several tens of milliseconds (ms) to several hundred ms. When voice is sampled and processed, the sampled data are usually grouped into blocks and voice analysis is performed block by block. As an illustration, if each sample is written as +, the contents of each [ ] form one frame, with time running from left to right: [+ + + + + + + +] [+ + + + + + + +] [+ + + + + + + +] ...
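The capture loop of STEP 10 and STEP 20 can be sketched as follows. This is only an illustration: the function name `iter_frames`, the use of non-overlapping frames, and the decision to drop a trailing partial frame are assumptions, not details given in the specification.

```python
from typing import Iterator, List, Sequence


def iter_frames(samples: Sequence[int], frame_len: int) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping frames of exactly frame_len samples."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    buffer: List[int] = []
    for sample in samples:              # STEP 10: take in one more sample
        buffer.append(sample)
        if len(buffer) == frame_len:    # STEP 20: is one frame complete?
            yield buffer                # hand the frame on to STEP 30
            buffer = []
    # Samples left over that do not fill a frame are dropped in this sketch.


# 20 samples grouped into frames of 8 give two full frames.
frames = list(iter_frames(list(range(20)), frame_len=8))
```

With a real microphone the `samples` argument would be a stream, and the generator would simply block until enough samples had arrived.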

【００２８】１フレーム分だけ揃った音声データは、音
声分析前段処理部２０１において、高域強調処理や窓掛
け処理が施された後、フーリエ変換処理が行われる。こ
れらの処理により、１フレーム単位に、音声データに対
応する振幅スペクトルが計算され、結果として振幅スペ
クトルデータが得られる（ＳＴＥＰ３０）。ここで、高
域強調処理や窓掛け処理、及びフーリエ変換処理は、よ
く知られた技術であるので、ここではその詳細な説明は
省略する。One frame of voice data is subjected to high-frequency emphasis and windowing in the speech analysis pre-processing unit 201 and then to a Fourier transform. Through these processes, the amplitude spectrum of the voice data is calculated frame by frame, yielding amplitude spectrum data (STEP 30). High-frequency emphasis, windowing, and the Fourier transform are well-known techniques, so a detailed description is omitted here.
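As a rough illustration of STEP 30, the sketch below applies high-frequency emphasis, a window, and a discrete Fourier transform to one frame. The specification does not state which filter coefficient or window is used; the first-order filter with coefficient 0.97 and the Hamming window are common choices assumed here, and a naive DFT stands in for a fast Fourier transform to keep the example dependency-free.

```python
import cmath
import math
from typing import List, Sequence


def amplitude_spectrum(frame: Sequence[float], pre_emphasis: float = 0.97) -> List[float]:
    """Return the amplitude spectrum (bins 0..n/2) of one frame."""
    n = len(frame)
    if n < 2:
        raise ValueError("frame must contain at least two samples")
    # High-frequency emphasis: y[t] = x[t] - a * x[t-1] (assumed first-order filter).
    emphasized = [frame[0]] + [frame[t] - pre_emphasis * frame[t - 1] for t in range(1, n)]
    # Windowing with a Hamming window (assumed; the text only says "windowing").
    windowed = [
        emphasized[t] * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
        for t in range(n)
    ]
    # Naive DFT; only the magnitude of each bin is kept.
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc))
    return spectrum


# A 16-sample sine with exactly 4 cycles per frame; emphasis disabled for clarity.
frame = [math.sin(2 * math.pi * 4 * t / 16) for t in range(16)]
spectrum = amplitude_spectrum(frame, pre_emphasis=0.0)
```

For this test tone the largest value in `spectrum` falls in bin 4, the bin matching the tone's frequency.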

【００２９】ＳＴＥＰ３０において、音声分析前段処理
部２０１によって計算された振幅スペクトルを示す振幅
スペクトルデータは、差分計算部２０２に供給される。
差分計算部２０２は、音声分析前段処理部２０１から供
給された所定のフレームの振幅スペクトルデータと、直
前に音声分析前段処理部２０１から供給された、上記フ
レームの１つ前のフレームの振幅スペクトルデータ（前
フレーム振幅スペクトルデータ）の差分を計算し、振幅
スペクトル差分データとして出力する（ＳＴＥＰ４
０）。即ち、差分計算部２０２は、音声分析前段処理部
２０１から供給された所定のフレームの振幅スペクトル
データを前フレーム振幅スペクトルデータとして記憶し
ておく。そして、音声分析前段処理部２０１から供給さ
れた次のフレームの振幅スペクトルデータと、先に記憶
しておいた前フレーム振幅スペクトルデータとの差分を
計算する。そして、上記次のフレームの振幅スペクトル
データを前フレーム振幅スペクトルデータとして記憶す
る。The amplitude spectrum data calculated in STEP 30 by the speech analysis pre-processing unit 201 is supplied to the difference calculation unit 202. The difference calculation unit 202 calculates the difference between the amplitude spectrum data of a given frame supplied from the speech analysis pre-processing unit 201 and the amplitude spectrum data of the immediately preceding frame supplied just before it (the previous-frame amplitude spectrum data), and outputs the result as amplitude spectrum difference data (STEP 40). That is, the difference calculation unit 202 stores the amplitude spectrum data of a given frame as the previous-frame amplitude spectrum data. When the amplitude spectrum data of the next frame arrives, it calculates the difference between that data and the stored previous-frame data, and then stores the new frame's amplitude spectrum data as the previous-frame amplitude spectrum data.

【００３０】次に、差分計算部２０２によって計算され
た振幅スペクトル差分データは、分析処理中間情報デー
タ送信部２０３に供給される。そして、分析処理中間情
報データ送信部２０３は、振幅スペクトル差分データを
分析処理中間情報データとして音声認識処理本体装置１
０１に送信する（ＳＴＥＰ５０）。ただし、各フレーム
の元の振幅スペクトルデータを復元できるように、最初
のフレームについては差分ではなく、振幅スペクトルデ
ータをそのまま音声認識処理本体装置１０１に送信す
る。Next, the amplitude spectrum difference data calculated by the difference calculation unit 202 is supplied to the analysis processing intermediate information data transmission unit 203, which transmits it to the voice recognition processing main unit 101 as analysis processing intermediate information data (STEP 50). However, so that the original amplitude spectrum data of every frame can be restored, the first frame is transmitted as amplitude spectrum data itself rather than as a difference.
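STEP 40 and STEP 50 amount to a frame-to-frame delta encoding in which the first frame is sent unchanged. A minimal sketch follows; the packet layout (a `"raw"` or `"diff"` tag plus a list of values) and the name `encode_frames` are hypothetical, since the specification does not define a transmission format.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical packet format: ("raw" | "diff", values).
Packet = Tuple[str, List[float]]


def encode_frames(spectra: Iterable[List[float]]) -> Iterator[Packet]:
    """Send the first amplitude spectrum as-is and later ones as differences."""
    previous: Optional[List[float]] = None
    for current in spectra:
        if previous is None:
            yield ("raw", list(current))            # first frame: no difference
        else:
            yield ("diff", [c - p for c, p in zip(current, previous)])
        previous = list(current)                     # becomes the previous-frame data


packets = list(encode_frames([[1.0, 2.0], [1.5, 2.0], [1.0, 4.0]]))
```

Because neighbouring frames of speech tend to have similar spectra, the differences are typically small values, which is what makes this representation cheaper to transmit.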

【００３１】以上の処理を、音声入力処理が終了するま
で繰り返す（ＳＴＥＰ６０）。即ち、ＳＴＥＰ６０にお
いて、音声入力処理が終了したか否かが判定される。そ
の結果、音声入力処理が終了していないと判定された場
合、ＳＴＥＰ１０に戻り、ＳＴＥＰ１０以降の処理が繰
り返し実行される。一方、音声入力処理が終了したと判
定された場合、処理を終了する。The above processing is repeated until the voice input processing is completed (STEP 60). That is, in STEP 60, it is determined whether or not the voice input processing has been completed. As a result, when it is determined that the voice input process is not completed, the process returns to STEP 10, and the processes after STEP 10 are repeatedly executed. On the other hand, when it is determined that the voice input processing has been completed, the processing is completed.

【００３２】次に、図３に示したフローチャートを参照
して、図１に示した音声認識処理本体装置１０１の動作
について説明する。まず最初に、ＳＴＥＰ１００におい
て、分析処理中間情報データ受信部２０４により、分析
処理中間情報データが受信されたか否かが判定される。
分析処理中間情報データが受信されていないと判定され
た場合、ＳＴＥＰ１００の処理が繰り返し実行される。
一方、分析処理中間情報データが受信されたと判定され
た場合、ＳＴＥＰ１１０に進む。ＳＴＥＰ１１０におい
ては、分析処理中間情報データ受信部２０４により、音
声入力端末１００の分析処理中間情報データ送信部２０
３より送信された分析処理中間情報データ（振幅スペク
トル差分データ）が受信され、取り込まれる。ただし、
最初のフレームについては、振幅スペクトルデータがそ
のまま送信されてくるので、その振幅スペクトルデータ
を受信する。Next, the operation of the voice recognition processing main unit 101 shown in FIG. 1 will be described with reference to the flowchart of FIG. 3. First, in STEP 100, the analysis processing intermediate information data receiving unit 204 determines whether analysis processing intermediate information data has been received. If not, STEP 100 is repeated. If so, the process proceeds to STEP 110, in which the analysis processing intermediate information data receiving unit 204 receives and takes in the analysis processing intermediate information data (amplitude spectrum difference data) transmitted from the analysis processing intermediate information data transmission unit 203 of the voice input terminal 100. For the first frame, however, the amplitude spectrum data itself is transmitted, so that amplitude spectrum data is received.

【００３３】次に、受信データ復元部２０５は、最初の
フレームの振幅スペクトルデータと、それ以降のフレー
ムの各振幅スペクトル差分データを用いて入力音声の振
幅スペクトルデータを復元する（ＳＴＥＰ１２０）。ま
た、同時に、音声認識処理本体装置１０１の環境音声入
力部２１０により環境音声が入力され、ディジタルの環
境音声データに変換された後、環境音声分析前段処理部
２１１に供給される。そして、環境音声分析前段処理部
２１１により、図２のＳＴＥＰ３０における処理と同様
の処理が行われ、環境音声入力部２１０より供給された
環境音声データの振幅スペクトルを示す振幅スペクトル
データが求められる。Next, the received data restoring unit 205 restores the amplitude spectrum data of the input voice from the amplitude spectrum data of the first frame and the amplitude spectrum difference data of the subsequent frames (STEP 120). At the same time, environmental sound is input through the environmental sound input unit 210 of the voice recognition processing main unit 101, converted into digital environmental sound data, and supplied to the environmental sound analysis pre-processing unit 211. The environmental sound analysis pre-processing unit 211 performs the same processing as in STEP 30 of FIG. 2 to obtain amplitude spectrum data representing the amplitude spectrum of the environmental sound data.
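The restoration in STEP 120 is a running sum: each received difference is added to the previously restored spectrum. The sketch below assumes a hypothetical packet format in which each packet is tagged `"raw"` (a full amplitude spectrum) or `"diff"` (a difference from the previous frame); that tagging is an illustration, not something the specification prescribes.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical packet format: ("raw" | "diff", values).
Packet = Tuple[str, List[float]]


def decode_packets(packets: Iterable[Packet]) -> Iterator[List[float]]:
    """Rebuild each frame's amplitude spectrum from a raw frame plus differences."""
    previous: Optional[List[float]] = None
    for kind, values in packets:
        if kind == "raw":
            current = list(values)
        elif previous is None:
            raise ValueError("a difference packet arrived before any raw frame")
        else:
            current = [p + d for p, d in zip(previous, values)]
        yield current
        previous = current


received = [("raw", [1.0, 2.0]), ("diff", [0.5, 0.0]), ("diff", [-0.5, 2.0])]
restored = list(decode_packets(received))
```

Note that a lost packet would corrupt every later frame in such a scheme, which is one reason the first frame has to be sent in full.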

【００３４】音声分析後段処理部２０６は、ＳＴＥＰ１
２０において復元された入力音声の振幅スペクトルデー
タと、環境音声分析前段処理部２１１より供給された環
境音声に対応する振幅スペクトルデータの振幅パワーの
状態から音声区間を検出する（ＳＴＥＰ１３０）。ま
た、振幅スペクトルから定常ノイズを除去するスペクト
ルサブトラクションを振幅スペクトルデータに対して施
し、定常ノイズ成分を取り除く。The speech analysis post-processing unit 206 detects the voice section from the amplitude power of the input-voice amplitude spectrum data restored in STEP 120 and of the amplitude spectrum data for the environmental sound supplied from the environmental sound analysis pre-processing unit 211 (STEP 130). It also applies spectral subtraction, which removes stationary noise from an amplitude spectrum, to the amplitude spectrum data to remove the stationary noise component.

【００３５】音声区間の検出方法やスペクトルサブトラ
クション処理は、よく知られた技術であるので、その詳
細な説明はここでは省略する。Since the method of detecting a speech section and the spectral subtraction processing are well-known techniques, a detailed description thereof will be omitted here.
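Since the specification leaves both techniques to the reader, the following is only one simple textbook variant, not necessarily the one intended: magnitude spectral subtraction with flooring at zero, and voice section detection by comparing frame power against the environmental sound power. The threshold `ratio` and both function names are assumptions.

```python
from typing import List, Sequence


def spectral_subtract(speech: Sequence[float], noise: Sequence[float],
                      floor: float = 0.0) -> List[float]:
    """Subtract the noise amplitude spectrum bin by bin, never going below floor."""
    return [max(s - n, floor) for s, n in zip(speech, noise)]


def is_speech_frame(speech: Sequence[float], noise: Sequence[float],
                    ratio: float = 2.0) -> bool:
    """Treat a frame as speech when its power clearly exceeds the noise power."""
    speech_power = sum(s * s for s in speech)
    noise_power = sum(n * n for n in noise)
    return speech_power > ratio * noise_power


cleaned = spectral_subtract([5.0, 1.0, 3.0], [1.0, 2.0, 3.0])
loud = is_speech_frame([5.0, 5.0, 5.0], [1.0, 2.0, 3.0])
quiet = is_speech_frame([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

Here `speech` would be the restored spectrum from the terminal and `noise` the spectrum measured by the main unit's own microphone.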

【００３６】次に、特徴量抽出部２０７は、ＳＴＥＰ１
３０において検出された音声区間に対して、定常ノイズ
成分を取り除かれた振幅スペクトルデータを用いて、音
声の特徴を示す特徴量データを計算する処理を行う（Ｓ
ＴＥＰ１４０）。音声の特徴を示す特徴量データのパラ
メータとしては、一般的にケプストラムやデルタケプス
トラムなどが使用される。ここで、特徴量計算の際、複
雑な演算処理を行う必要があるが、予め用意した特徴量
計算用テーブル２１２を参照して行うことにより、演算
処理の負荷を軽減するようにしている。Next, for the voice section detected in STEP 130, the feature amount extraction unit 207 calculates feature amount data representing the features of the voice from the amplitude spectrum data from which the stationary noise component has been removed (STEP 140). Cepstrum, delta cepstrum, and the like are generally used as the parameters of such feature amount data. Calculating the feature amounts requires complicated arithmetic, but referring to the feature amount calculation table 212 prepared in advance reduces the processing load.
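The idea of replacing arithmetic with a table lookup can be illustrated with the logarithm, which cepstrum calculation requires. Which functions the feature amount calculation table 212 actually holds, and at what resolution, is not stated here, so the table size, the integer quantization, and the names `LOG_TABLE` and `table_log` are all assumptions.

```python
import math

TABLE_SIZE = 1024  # assumed: amplitudes are quantized to the integers 1..1024

# Computed once in advance; entry i holds log(i + 1).
LOG_TABLE = [math.log(x) for x in range(1, TABLE_SIZE + 1)]


def table_log(x: float) -> float:
    """Approximate log(x) by looking up the nearest tabulated integer."""
    nearest = min(max(int(round(x)), 1), TABLE_SIZE)  # clamp to the table range
    return LOG_TABLE[nearest - 1]


exact = table_log(8.0)       # 8 is tabulated, so this equals math.log(8)
clamped = table_log(5000.0)  # out of range, so the last entry is used
```

The trade-off is memory for computation, which is why it matters on which device the table is stored.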

【００３７】次に、マッチング処理部２０８は、ＳＴＥ
Ｐ１４０において抽出された音声の特徴を示す特徴量デ
ータをもとに、認識単位の標準パターン２１３と、予め
登録された認識対象単語辞書２１４とを参照することに
より、入力音声データと各認識対象単語との距離値を求
めて行く（ＳＴＥＰ１５０）。Next, based on the feature amount data extracted in STEP 140, the matching processing unit 208 obtains the distance value between the input voice data and each recognition target word by referring to the standard patterns 213 of the recognition units and the pre-registered recognition target word dictionary 214 (STEP 150).

【００３８】ＳＴＥＰ１６０においては、入力音声全デ
ータに対する処理が終了したか否かが判定され、入力音
声全データに対する処理が終了していないと判定された
場合、ＳＴＥＰ１００に戻り、ＳＴＥＰ１００以降の処
理が繰り返し実行される。一方、入力音声全データに対
する処理が終了したと判定された場合、ＳＴＥＰ１７０
に進み、認識結果を出力し、処理を終了する。In STEP 160, it is determined whether all of the input voice data has been processed. If not, the process returns to STEP 100 and the processing from STEP 100 onward is repeated. If so, the process proceeds to STEP 170, where the recognition result is output and the processing ends.

【００３９】即ち、特徴量データと認識単位の標準パタ
ーン２１３から全ての認識単位（例えば半音節）の状態
毎の距離を求めたものを認識単位距離とし（即ち、入力
された音声データの特徴量データと標準パターンの全て
の認識単位（半音節）の特徴量データとを比較すること
により、全ての認識単位（半音節）毎に認識単位距離値
を求め）、認識対象単語として登録されている全ての単
語に対して認識単位距離値を足し合わせたものを各フレ
ームでの単語距離値とする（即ち、認識単位距離値を、
認識対象単語辞書２１４の各単語毎に、各単語に含まれ
る全認識単位の認識単位距離値を累積し、各フレームで
の単語距離値とする）。ここで、半音節とは、音素とは
若干異なり、母音定常部で音節を２分割した単位とな
る。例えば、「ＮＩＨＯＮ（ニホン）」という単語の場
合、［ＮＩ］［ＩＨ］［ＨＯ］［ＯＮ］［Ｎ＞］が各半
音節となる。なお、記号「＞」は、語尾を示している。
次に、入力された全音声区間の音声を対象としてその距
離値を累積し（ＳＴＥＰ１６０）、最も類似していると
考えられる単語（即ち、単語距離値が最も小さい単語）
を認識結果として認識結果出力部２０９から出力する
（ＳＴＥＰ１７０）。That is, the distance for each state of every recognition unit (for example, a semisyllable), obtained from the feature amount data and the standard patterns 213 of the recognition units, is taken as the recognition unit distance (in other words, a recognition unit distance value is obtained for every recognition unit by comparing the feature amount data of the input voice data with the feature amount data of every recognition unit in the standard patterns). For every word registered as a recognition target word, the recognition unit distance values are summed to give the word distance value in each frame (in other words, for each word in the recognition target word dictionary 214, the recognition unit distance values of all recognition units contained in the word are accumulated to give the word distance value in each frame). Here, a semisyllable differs slightly from a phoneme: it is a unit obtained by splitting a syllable in two at the steady part of the vowel. For example, the word "NIHON" consists of the semisyllables [NI] [IH] [HO] [ON] [N>], where the symbol ">" marks the end of the word. Next, the distance values are accumulated over the voice of the entire input voice section (STEP 160), and the word considered most similar (that is, the word with the smallest word distance value) is output from the recognition result output unit 209 as the recognition result (STEP 170).

【００４０】即ち、入力された音声データの特徴量デー
タと標準パターンの全ての認識単位（半音節）の特徴量
データとを比較することにより、全ての認識単位（半音
節）毎に認識単位距離値を求め、次に、認識単位距離値
を、認識対象単語辞書２１４の各単語毎に、各単語に含
まれる全認識単位の認識単位距離値を累積し、各フレー
ムでの単語距離値とする。そして、単語距離値が最も小
さい単語を認識結果とする。In summary, a recognition unit distance value is obtained for every recognition unit (semisyllable) by comparing the feature amount data of the input voice data with the feature amount data of every recognition unit in the standard patterns. Then, for each word in the recognition target word dictionary 214, the recognition unit distance values of all recognition units contained in the word are accumulated to give the word distance value in each frame. The word with the smallest word distance value is taken as the recognition result.
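Read literally, the matching described above can be sketched as below. Several things are assumptions made for the example: the squared Euclidean distance, one template vector per recognition unit (the text speaks of per-state distances), the absence of any time alignment between frames and units, and the toy dictionary entries, which are not taken from the specification.

```python
from typing import Dict, List, Sequence


def unit_distance(feature: Sequence[float], template: Sequence[float]) -> float:
    """Squared Euclidean distance between a frame feature and a unit template."""
    return sum((f - t) ** 2 for f, t in zip(feature, template))


def recognize(frames: Sequence[Sequence[float]],
              standard_patterns: Dict[str, Sequence[float]],
              dictionary: Dict[str, List[str]]) -> str:
    """Accumulate word distance values over all frames and return the closest word."""
    totals = {word: 0.0 for word in dictionary}
    for feature in frames:
        # Recognition unit distance for every unit in the standard patterns.
        unit_dist = {u: unit_distance(feature, tpl) for u, tpl in standard_patterns.items()}
        # Word distance in this frame: sum over the units the word contains.
        for word, units in dictionary.items():
            totals[word] += sum(unit_dist[u] for u in units)
    return min(totals, key=totals.get)


patterns = {"NI": [0.0, 0.0], "HO": [1.0, 1.0], "TA": [5.0, 5.0]}   # toy templates
words = {"NIHO": ["NI", "HO"], "TATA": ["TA", "TA"]}               # toy dictionary
result = recognize([[0.0, 0.0], [1.0, 1.0]], patterns, words)
```

A practical recognizer would align frames to units, for example by dynamic programming, rather than summing every unit in every frame.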

【００４１】以上説明したように、本実施の形態におい
ては、以下に記載するような効果を奏する。第１の効果
は、音声入力端末１００側では発声音声のみの入力を用
いて分析処理の前半までを行えばよいので、処理性能の
低いＣＰＵの使用が可能であり、コストを削減すること
ができることである。As described above, the present embodiment provides the following effects. The first effect is that the voice input terminal 100 only needs to take in the uttered voice and perform the first half of the analysis processing, so a CPU with low processing performance can be used and cost can be reduced.

【００４２】第２の効果は、特徴量データ計算時にＣＰ
Ｕの負荷を軽減するために、特徴量計算用のテーブルを
用いる場合、この特徴量計算用のテーブルを音声認識処
理本体装置１０１側に設置することができるので、音声
入力端末１００側のメモリ容量が少なくて済み、コスト
を削減することができることである。The second effect is that, when a feature amount calculation table is used to reduce the CPU load during feature amount data calculation, the table can be placed in the voice recognition processing main unit 101, so the voice input terminal 100 needs less memory and cost can be reduced.

【００４３】第３の効果は、周囲雑音除去処理に必要な
環境音声用の入力装置（環境音声入力部２１０）を、音
声認識処理本体装置１０１側に設置すればよいので、そ
の分だけ音声入力端末１００の寸法を小さくすることが
でき、比較的小さなハードウェアで構成された音声入力
端末１００を用いて、周囲雑音環境下での認識処理性能
を向上させることができることである。The third effect is that the input device for environmental sound needed for ambient noise removal (the environmental sound input unit 210) only has to be installed in the voice recognition processing main unit 101. The voice input terminal 100 can be made correspondingly smaller, and recognition performance in a noisy environment can be improved using a voice input terminal 100 built from relatively small hardware.

【００４４】次に、本発明を応用した他の実施の形態に
ついて説明する。本実施の形態の構成は、図１に示した
実施の形態の場合と基本的には同様であるが、本実施の
形態においては、図１の音声入力端末１００から音声認
識処理本体装置１０１へ送信する分析処理中間情報デー
タ、即ち、振幅スペクトルデータに帯域制限処理を施す
ようにする。これにより、音声入力端末１００側から音
声認識処理本体装置１０１側に送信するデータ（分析処
理中間情報データ）のデータ量を削減することができ
る。Next, another embodiment to which the present invention is applied will be described. Its configuration is basically the same as that of the embodiment shown in FIG. 1, but in this embodiment band limiting is applied to the analysis processing intermediate information data, that is, the amplitude spectrum data, transmitted from the voice input terminal 100 of FIG. 1 to the voice recognition processing main unit 101. This reduces the amount of data (analysis processing intermediate information data) transmitted from the voice input terminal 100 to the voice recognition processing main unit 101.
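One way to picture the band limiting is to transmit only the spectrum bins that fall inside a chosen frequency band. The specification does not say which band is kept or how bins map to frequencies; the sketch assumes bins evenly spaced from 0 Hz to half the sampling rate, and the function name and parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def band_limit(spectrum: Sequence[float], sample_rate_hz: float,
               low_hz: float, high_hz: float) -> Tuple[int, List[float]]:
    """Keep only the bins between low_hz and high_hz.

    Returns the index of the first kept bin and the kept values, so the
    receiver can put them back in the right place.
    """
    n_bins = len(spectrum)
    bin_hz = (sample_rate_hz / 2) / (n_bins - 1)    # spacing between bins
    first = max(0, math.ceil(low_hz / bin_hz))
    last = min(n_bins - 1, math.floor(high_hz / bin_hz))
    return first, list(spectrum[first:last + 1])


# 9 bins at a 16 kHz sampling rate are spaced 1 kHz apart.
full = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
first_bin, kept = band_limit(full, sample_rate_hz=16000, low_hz=2000, high_hz=5000)
```

In this example four of nine values are sent, so the payload per frame is roughly halved.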

【００４５】次に、本発明を応用したさらに他の実施の
形態について説明する。図４は、本発明を応用したさら
に他の実施の形態の音声認識処理本体装置１０１の構成
例を示すブロック図である。図４に示すように、本実施
の形態においては、音声出力部２２０から出力された音
声データが、環境音声分析前段処理部２１１に直接供給
されるようになっている。Next, still another embodiment to which the present invention is applied will be described. FIG. 4 is a block diagram showing a configuration example of a speech recognition processing main unit 101 according to still another embodiment to which the present invention is applied. As shown in FIG. 4, in the present embodiment, the audio data output from the audio output unit 220 is directly supplied to the environmental sound analysis pre-processing unit 211.

【００４６】例えば、音声認識処理本体装置１０１がテ
レビジョン受像機（以下、テレビと記載する）である場
合、音声認識処理本体装置１０１自身が図示せぬスピー
カ等を介して音声を出力する。即ち、音声出力部２２０
に接続された図示せぬスピーカを介して音声が出力され
る。この場合、図４に示すように、音声出力部２２０か
らの出力を環境音声分析前段処理部２１１に直接供給す
ることにより、音声認識処理本体装置１０１において予
め取得できる音声認識処理本体装置１０１自身の出力音
声情報を、周囲雑音除去処理に効率的に反映させること
が可能となる。このように、本実施の形態においては、
周囲雑音除去処理の処理精度を向上させることが可能と
なる。For example, when the voice recognition processing main unit 101 is a television receiver (hereinafter, a television), the voice recognition processing main unit 101 itself outputs sound, through a speaker (not shown) connected to the audio output unit 220. In this case, as shown in FIG. 4, supplying the output of the audio output unit 220 directly to the environmental sound analysis pre-processing unit 211 lets the main unit's own output sound information, which the voice recognition processing main unit 101 can obtain in advance, be reflected efficiently in the ambient noise removal processing. The present embodiment can thus improve the accuracy of the ambient noise removal processing.

【００４７】このように、本発明を、テレビとリモート
コントローラ（以下、リモコンと記載する）に応用した
場合、図１の音声認識処理本体装置１０１がテレビに対
応し、音声入力端末１００がリモコンに対応する。そし
て、リモコン側で入力される音声には、環境音声（テレ
ビからの出力音声＋その他の周囲の雑音）と、ユーザが
発声した発声音声とが含まれている。また、テレビ側で
入力される音声には、環境音声（テレビからの出力音声
＋その他の周囲の雑音）が含まれている。As described above, when the present invention is applied to a television and a remote controller (hereinafter, a remote control), the voice recognition processing main unit 101 of FIG. 1 corresponds to the television and the voice input terminal 100 corresponds to the remote control. The sound input at the remote control contains environmental sound (the television's output sound plus other ambient noise) and the voice uttered by the user. The sound input at the television contains environmental sound (the television's output sound plus other ambient noise).

【００４８】このとき、テレビ自身が出力する出力音声
については、出力前に事前に取得することができるた
め、音声分析後段処理部２０６における周囲雑音除去処
理を高性能かつ効果的に行うことができる。In this case, the sound output by the television itself can be obtained before it is output, so the ambient noise removal processing in the speech analysis post-processing unit 206 can be performed with high performance and effectively.

【００４９】また、通常、リモコンはテレビから数メー
トル程度離れた場所で使用されるため、リモコン（この
例では、音声入力端末１００に対応する）によって入力
される環境音声（周囲雑音）と、テレビ（この例では、
音声認識処理本体装置１０１に対応する）によって入力
される環境音声とはほぼ同一であるとみなすことができ
る。また、リモコンにより入力される音声は、環境音声
と発声音声であり、テレビにより入力される音声は、環
境音声のみであるとみなすことができるので、リモコン
により入力された音声と、テレビにより入力された音声
の差分を取ることにより、環境音声における非定常的な
ノイズを除去することができる。In addition, a remote control is usually used only a few meters from the television, so the environmental sound (ambient noise) input at the remote control (the voice input terminal 100 in this example) and the environmental sound input at the television (the voice recognition processing main unit 101 in this example) can be regarded as almost identical. Furthermore, the sound input at the remote control can be regarded as environmental sound plus the uttered voice, and the sound input at the television as environmental sound only. Taking the difference between the two therefore removes non-stationary noise in the environmental sound.

【００５０】次に、本発明を応用したさらに他の実施の
形態について説明する。図５は、本発明を応用したさら
に他の実施の形態の音声認識処理本体装置１０１の構成
例を示すブロック図である。同図に示すように、本実施
の形態では、図１に示した実施の形態において、音声認
識処理本体装置１０１に話者適応用テーブル２３０を新
たに設けるようにしている。そして、話者適応処理を行
い、各話者に適応するための処理を行うようになってい
る。これにより、各話者の音声を認識する能力をさらに
向上させることができる。Next, still another embodiment to which the present invention is applied will be described. FIG. 5 is a block diagram showing a configuration example of a speech recognition processing main unit 101 according to still another embodiment to which the present invention is applied. As shown in the figure, in the present embodiment, a speaker adaptation table 230 is newly provided in the speech recognition processing main unit 101 in the embodiment shown in FIG. Then, speaker adaptation processing is performed, and processing for adapting to each speaker is performed. Thereby, the ability to recognize the voice of each speaker can be further improved.

【００５１】標準パターン２１３は、予め複数の人間が
発声した発声音声データを元に、その特徴量データを格
納したものであるが、認識率向上のためにある特定の話
者の発声のクセを示すデータを格納したものが話者適応
用テーブル２３０である。具体的には、例えば、標準の
特徴量データと特定話者の特徴量データの差分を示す差
分情報が格納されている。The standard patterns 213 store feature amount data derived from voice data uttered in advance by a number of people, whereas the speaker adaptation table 230 stores data representing the speaking habits of a particular speaker in order to improve the recognition rate. Specifically, it stores, for example, difference information representing the difference between the standard feature amount data and the feature amount data of the specific speaker.

【００５２】また、話者適応処理を行うためには、図５
に示すように、音声認識処理本体装置１０１側に話者適
応用テーブル２３０を追加するだけであり、音声入力端
末１００側のハードウェアを拡張する必要はない。特徴
量抽出部２０７は、特徴量計算用テーブル２１２のデー
タ、及び話者適応用テーブル２３０のデータを参照し
て、より高精度に特定の話者の音声データの特徴量の抽
出を行う。従って、音声入力端末１００を図１に示した
実施の形態の場合と同様に小さくすることができるだけ
でなく、各話者の音声を認識する能力を向上させること
ができるという効果を得ることができる。Moreover, speaker adaptation only requires adding the speaker adaptation table 230 to the voice recognition processing main unit 101, as shown in FIG. 5; the hardware of the voice input terminal 100 does not need to be extended. The feature amount extraction unit 207 refers to the data in the feature amount calculation table 212 and in the speaker adaptation table 230 to extract the feature amounts of a specific speaker's voice data more accurately. The voice input terminal 100 can therefore be kept as small as in the embodiment shown in FIG. 1 while the ability to recognize each speaker's voice is improved.
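How the difference information might be used during feature extraction can be sketched as follows. The specification says only that differences between the standard and speaker-specific feature amounts are stored; the sign convention (speaker minus standard), the per-dimension offset, and the name `normalize_feature` are assumptions made for illustration.

```python
from typing import List, Sequence


def normalize_feature(feature: Sequence[float],
                      speaker_offset: Sequence[float]) -> List[float]:
    """Remove a speaker's habitual deviation from an extracted feature vector.

    speaker_offset is assumed to hold (speaker feature - standard feature)
    for each dimension, as it might be stored in a speaker adaptation table.
    """
    return [f - d for f, d in zip(feature, speaker_offset)]


# The speaker tends to be 0.5 high in the first dimension and 1.0 low in the second.
adapted = normalize_feature([1.5, 1.0], [0.5, -1.0])
```

After this correction the feature vector can be compared against the speaker-independent standard patterns as before.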

【００５３】また、上述したような処理を実行するプロ
グラムは、ＣＤ−ＲＯＭ（ｃｏｍｐａｃｔｄｉｓｃ
ｒｅａｄｏｎｌｙｍｅｍｏｒｙ）、ＤＶＤ（ｄｉｇ
ｉｔａｌｖｅｒｓａｔｉｌｅｄｉｓｃ）、フロッピ
ー（登録商標）ディスク、メモリカード、ＲＯＭ（ｒｅ
ａｄｏｎｌｙｍｅｍｏｒｙ）等の様々な記録媒体に
記録して提供することができる。そして、そのプログラ
ムは、上記音声入力端末及び音声認識処理本体装置に内
蔵されるマイクロコンピュータ等の各部の動作を制御
し、プログラム制御されたそのマイクロコンピュータ等
の各部が上記プログラムにより指令される所定の処理を
実行する。A program for executing the processing described above can be provided recorded on various recording media such as a CD-ROM (compact disc read only memory), DVD (digital versatile disc), floppy (registered trademark) disk, memory card, or ROM (read only memory). The program controls the operation of each unit, such as a microcomputer built into the voice input terminal and the voice recognition processing main unit, and each program-controlled unit executes the predetermined processing instructed by the program.

【００５４】なお、上記実施の形態においては、本発明
をテレビとリモコンに応用する場合等について説明した
が、これに限定されるものではなく、音声により遠隔制
御を行う様々なシステムに本発明を適用することができ
る。The above embodiments describe, among other cases, applying the present invention to a television and a remote control, but the invention is not limited to this and can be applied to various systems that perform remote control by voice.

【００５５】また、上記各実施の形態においては、差分
計算部を設けて振幅スペクトルの差分を演算し、データ
量を削減するようにしたが、他の方法でデータ量を削減
するようにすることも可能である。In each of the above embodiments, a difference calculation unit is provided to compute differences of the amplitude spectrum and thereby reduce the amount of data, but the amount of data may also be reduced by other methods.

【００５６】また、上記実施の形態の構成及び動作は例
であって、本発明の趣旨を逸脱しない範囲で適宜変更す
ることができることは言うまでもない。The configuration and operation of the above embodiment are merely examples, and it goes without saying that the configuration and operation can be appropriately changed without departing from the spirit of the present invention.

【００５７】[0057]

【発明の効果】以上の如く、本発明に係る遠隔制御シス
テムおよび遠隔制御方法によれば、第１の装置は、ユー
ザが発声した発声音声を入力し、入力された発声音声を
分析し、分析によって得られた第１の分析結果に対応す
るデータを第２の装置に送信する。第２の装置は、第１
の装置において送信された第１の分析結果に対応するデ
ータを受信し、第２の装置の周囲の環境音声を入力し、
入力された環境音声を分析し、受信された第１の分析結
果に対応するデータと、分析の結果得られた第２の分析
結果とに基づいて、入力された発声音声に含まれる環境
音声を除去し、環境音声が除去された発声音声の特徴量
を抽出し、抽出された特徴量に基づいて、発声音声を認
識し、認識結果を出力するようにしたので、第１の装置
のハードウェアの規模を小さくすることができる。ま
た、周囲に雑音がある状況下で、音声認識性能を向上さ
せることができ、主に音声認識処理を行う第２の装置の
遠隔制御を効率的に行うことができる。As described above, according to the remote control system and remote control method of the present invention, the first device inputs the voice uttered by the user, analyzes the input voice, and transmits data corresponding to the resulting first analysis result to the second device. The second device receives the data corresponding to the first analysis result transmitted from the first device, inputs environmental sound around the second device, and analyzes the input environmental sound. Based on the received data corresponding to the first analysis result and the second analysis result obtained from that analysis, it removes the environmental sound contained in the input uttered voice, extracts the feature amounts of the uttered voice from which the environmental sound has been removed, recognizes the uttered voice based on the extracted feature amounts, and outputs the recognition result. The hardware of the first device can therefore be made small. In addition, voice recognition performance can be improved where there is ambient noise, and the second device, which performs most of the voice recognition processing, can be remotely controlled efficiently.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明の音声認識装置の一実施の形態の構成例
を示すブロック図である。FIG. 1 is a block diagram illustrating a configuration example of an embodiment of a speech recognition device of the present invention.

【図２】図１の音声入力端末の動作を説明するためのフ
ローチャートである。FIG. 2 is a flowchart illustrating an operation of the voice input terminal of FIG. 1;

【図３】図１の音声認識処理本体装置の動作を説明する
ためのフローチャートである。FIG. 3 is a flowchart for explaining the operation of the voice recognition processing main unit of FIG. 1.

【図４】本発明の音声認識装置の音声認識処理本体装置
の他の実施の形態の構成例を示すブロック図である。FIG. 4 is a block diagram showing a configuration example of another embodiment of the voice recognition processing main unit of the speech recognition device of the present invention.

【図５】本発明の音声認識装置の音声認識処理本体装置
のさらに他の実施の形態の構成例を示すブロック図であ
る。FIG. 5 is a block diagram showing a configuration example of still another embodiment of the voice recognition processing main unit of the speech recognition device of the present invention.

【図６】従来の音声認識処理装置の一例の構成を示すブ
ロック図である。FIG. 6 is a block diagram showing a configuration of an example of a conventional speech recognition processing device.

【図７】従来の音声認識処理装置の他の構成例を示すブ
ロック図である。FIG. 7 is a block diagram illustrating another configuration example of a conventional speech recognition processing device.

【符号の説明】[Explanation of symbols]

１，１００音声入力端末２，１０１音声認識処理本体装置１０音声入力部１１音声分析部１２特徴量変換部１３信号送信部１４標準音声特徴データ記憶部１５話者適応変換規則記憶部１６信号受信部１７単語検出部１８結果出力部１９標準音声特徴データ記憶部２０環境音声入力部２００発声音声入力部２０１音声分析前段処理部２０２差分計算部２０３分析処理中間情報データ送信部２０４分析処理中間情報データ受信部２０５受信データ復元部２０６音声分析後段処理部２０７特徴量抽出部２０８マッチング処理部２０９認識結果出力部２１０環境音声入力部２１１環境音声分析前段処理部２１２特徴量計算用テーブル２１３標準パターン２１４認識対象単語辞書２２０音声出力部２３０話者適応用テーブル
Reference Signs List
1, 100: Voice input terminal
2, 101: Voice recognition processing main unit
10: Voice input unit
11: Voice analysis unit
12: Feature amount conversion unit
13: Signal transmission unit
14: Standard voice feature data storage unit
15: Speaker adaptation conversion rule storage unit
16: Signal reception unit
17: Word detection unit
18: Result output unit
19: Standard voice feature data storage unit
20: Environmental sound input unit
200: Uttered voice input unit
201: Speech analysis pre-processing unit
202: Difference calculation unit
203: Analysis processing intermediate information data transmission unit
204: Analysis processing intermediate information data receiving unit
205: Received data restoring unit
206: Speech analysis post-processing unit
207: Feature amount extraction unit
208: Matching processing unit
209: Recognition result output unit
210: Environmental sound input unit
211: Environmental sound analysis pre-processing unit
212: Feature amount calculation table
213: Standard pattern
214: Recognition target word dictionary
220: Audio output unit
230: Speaker adaptation table

Claims

【特許請求の範囲】[Claims]

【請求項１】第１の装置を用いてユーザが発声した発
声音声による指示を行い、前記第１の装置から所定の距
離だけ離れた場所にある第２の装置を遠隔制御する遠隔
制御システムであって、前記第１の装置は、前記ユーザが発声した前記発声音声を入力する発声音声
入力手段と、前記発声音声入力手段によって入力された前記発声音声
を分析する第１の分析手段と、前記第１の分析手段による第１の分析結果に対応するデ
ータを前記第２の装置に送信する送信手段とを備え、前記第２の装置は、前記第１の装置の前記送信手段から送信されてきた前記
第１の分析結果に対応するデータを受信する受信手段
と、前記第２の装置の周囲の環境音声を入力する環境音声入
力手段と、前記環境音声入力手段によって入力された前記環境音声
を分析する第２の分析手段と、前記受信手段によって受信された前記第１の分析結果に
対応するデータと、前記第２の分析手段による第２の分
析結果とに基づいて、前記発声音声入力手段によって入
力された前記発声音声に含まれる前記環境音声を除去す
る環境音声除去手段と、前記環境音声除去手段によって前記環境音声が除去され
た前記発声音声の特徴量を抽出する特徴量抽出手段と、前記特徴量抽出手段によって抽出された前記特徴量に基
づいて、前記発声音声を認識する認識手段と、前記認識手段による認識結果を出力する出力手段とを備
えることを特徴とする遠隔制御システム。1. A remote control system in which a user gives an instruction by uttered voice using a first device to remotely control a second device located a predetermined distance away from the first device, wherein the first device comprises: uttered voice input means for inputting the voice uttered by the user; first analysis means for analyzing the uttered voice input by the uttered voice input means; and transmitting means for transmitting data corresponding to a first analysis result of the first analysis means to the second device, and wherein the second device comprises: receiving means for receiving the data corresponding to the first analysis result transmitted from the transmitting means of the first device; environmental sound input means for inputting environmental sound around the second device; second analysis means for analyzing the environmental sound input by the environmental sound input means; environmental sound removing means for removing the environmental sound contained in the uttered voice input by the uttered voice input means, based on the data corresponding to the first analysis result received by the receiving means and a second analysis result of the second analysis means; feature amount extracting means for extracting a feature amount of the uttered voice from which the environmental sound has been removed by the environmental sound removing means; recognition means for recognizing the uttered voice based on the feature amount extracted by the feature amount extracting means; and output means for outputting a recognition result of the recognition means.

【請求項２】前記第１の装置は、前記第１の分析手段
による前記第１の分析結果の所定の時間毎の差分を演算
する差分演算手段をさらに備え、前記送信手段は、前記
差分演算手段によって演算された前記差分を前記第１の
分析手段による前記第１の分析結果に対応するデータと
して前記第２の装置に送信し、前記第２の装置の受信手段は、前記第１の装置の前記送
信手段から送信されてきた前記第１の分析結果に対応す
るデータとしての前記差分を受信し、前記差分に基づい
て前記第１の分析手段による前記第１の分析結果を復元
する復元手段をさらに備え、前記環境音声除去手段は、前記復元手段によって復元さ
れた前記第１の分析結果と、前記第２の分析手段による
前記第２の分析結果とに基づいて、前記発声音声入力手
段によって入力された前記発声音声に含まれる前記環境
音声を除去することを特徴とする請求項１に記載の遠隔
制御システム。2. The remote control system according to claim 1, wherein the first device further comprises difference calculating means for calculating a difference of the first analysis result of the first analysis means at predetermined time intervals, the transmitting means transmits the difference calculated by the difference calculating means to the second device as the data corresponding to the first analysis result of the first analysis means, the receiving means of the second device receives the difference as the data corresponding to the first analysis result transmitted from the transmitting means of the first device, the second device further comprises restoring means for restoring the first analysis result of the first analysis means based on the difference, and the environmental sound removing means removes the environmental sound contained in the uttered voice input by the uttered voice input means, based on the first analysis result restored by the restoring means and the second analysis result of the second analysis means.

【請求項３】前記第１の装置の前記第１の分析手段
は、前記発声音声の周波数分析までの処理を行い、前記
第２の装置の前記第２の分析手段は、前記第１の分析手
段による前記周波数分析後のデータに対して、環境音声
を除去する処理を行うことを特徴とする請求項１または
２に記載の遠隔制御システム。3. The remote control system according to claim 1 or 2, wherein the first analysis means of the first device performs processing up to frequency analysis of the uttered voice, and the second analysis means of the second device performs processing for removing the environmental sound on the data after the frequency analysis by the first analysis means.
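
As a rough illustration of "processing up to frequency analysis" on the first device in claim 3, the sketch below computes the amplitude spectrum of one frame with a direct discrete Fourier transform. The frame length, the absence of windowing, and the function name are assumptions made for this example; the claim does not specify them.

```python
import cmath
import math


def amplitude_spectrum(frame):
    """Return the DFT magnitudes for bins 0 .. n/2 of a real-valued frame."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc))
    return spectrum


# Hypothetical 8-sample frame containing a cosine at frequency bin 2.
frame = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
spectrum = amplitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
print(peak_bin)  # 2
```

Under this division of work, only the spectrum (or its differences) leaves the first device, and the removal of environmental sound operates on that spectrum at the second device.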

【請求項４】前記第２の装置は、音声を出力する音声
出力手段をさらに備え、前記第２の分析手段に前記音声出力手段により出力され
た前記音声を供給し、前記第２の分析手段は、前記環境
音声入力手段によって入力された前記環境音声と前記音
声出力手段より入力された前記音声を分析し、前記環境音声除去手段は、前記復元手段によって復元さ
れた前記第１の分析結果と、前記第２の分析手段による
前記第２の分析結果とに基づいて、前記発声音声入力手
段によって入力された前記発声音声に含まれる前記環境
音声と前記音声とを除去することを特徴とする請求項
１，２または３に記載の遠隔制御システム。4. The remote control system according to claim 1, 2 or 3, wherein the second device further comprises sound output means for outputting sound; the sound output by the sound output means is supplied to the second analysis means; the second analysis means analyzes the environmental sound input by the environmental sound input means and the sound input from the sound output means; and the environmental sound removing means removes the environmental sound and the sound included in the uttered voice input by the uttered voice input means, based on the first analysis result restored by the restoring means and the second analysis result by the second analysis means.
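
Claim 4 adds the second device's own output sound as a further component to remove. A minimal sketch, assuming that each interfering sound is available as an amplitude spectrum and that the components are subtracted one after another in the amplitude domain (a simplification chosen for illustration, not stated in the claim), might look like this. All names and values are hypothetical.

```python
def remove_interference(voice_spectrum, interference_spectra):
    """Subtract each interfering amplitude spectrum in turn, clamping at zero."""
    cleaned = list(voice_spectrum)
    for spectrum in interference_spectra:
        cleaned = [max(v - s, 0.0) for v, s in zip(cleaned, spectrum)]
    return cleaned


voice = [6.0, 4.0, 2.0]        # hypothetical restored first analysis result
environment = [1.0, 1.0, 1.0]  # hypothetical environmental sound spectrum
own_output = [2.0, 0.5, 3.0]   # hypothetical spectrum of the device's own output

cleaned = remove_interference(voice, [environment, own_output])
print(cleaned)  # [3.0, 2.5, 0.0]
```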

【請求項５】第１の装置を用いてユーザが発声した発
声音声による指示を行い、前記第１の装置から所定の距
離だけ離れた場所にある第２の装置を遠隔制御する遠隔
制御システムにおける遠隔制御方法であって、前記第１の装置は、前記ユーザが発声した前記発声音声を入力する発声音声
入力ステップと、前記発声音声入力ステップにおいて入力された前記発声
音声を分析する第１の分析ステップと、前記第１の分析ステップにおける第１の分析結果に対応
するデータを前記第２の装置に送信する送信ステップと
を備え、前記第２の装置は、前記第１の装置の前記送信ステップにおいて送信された
前記第１の分析結果に対応するデータを受信する受信ス
テップと、前記第２の装置の周囲の環境音声を入力する環境音声入
力ステップと、前記環境音声入力ステップにおいて入力された前記環境
音声を分析する第２の分析ステップと、前記受信ステップにおいて受信された前記第１の分析結
果に対応するデータと、前記第２の分析ステップにおけ
る第２の分析結果とに基づいて、前記発声音声入力ステ
ップにおいて入力された前記発声音声に含まれる前記環
境音声を除去する環境音声除去ステップと、前記環境音声除去ステップにおいて前記環境音声が除去
された前記発声音声の特徴量を抽出する特徴量抽出ステ
ップと、前記特徴量抽出ステップにおいて抽出された前記特徴量
に基づいて、前記発声音声を認識する認識ステップと、前記認識ステップにおける認識結果を出力する出力ステ
ップとを備えることを特徴とする遠隔制御方法。5. A remote control method in a remote control system in which a user gives an instruction by uttered voice using a first device to remotely control a second device located a predetermined distance away from the first device, wherein, at the first device, the method comprises: an uttered voice input step of inputting the uttered voice uttered by the user; a first analysis step of analyzing the uttered voice input in the uttered voice input step; and a transmitting step of transmitting data corresponding to a first analysis result in the first analysis step to the second device, and wherein, at the second device, the method comprises: a receiving step of receiving the data corresponding to the first analysis result transmitted in the transmitting step of the first device; an environmental sound input step of inputting environmental sound around the second device; a second analysis step of analyzing the environmental sound input in the environmental sound input step; an environmental sound removing step of removing the environmental sound included in the uttered voice input in the uttered voice input step, based on the data corresponding to the first analysis result received in the receiving step and a second analysis result in the second analysis step; a feature amount extraction step of extracting a feature amount of the uttered voice from which the environmental sound has been removed in the environmental sound removing step; a recognition step of recognizing the uttered voice based on the feature amount extracted in the feature amount extraction step; and an output step of outputting a recognition result in the recognition step.
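
The feature amount extraction, recognition, and output steps at the end of claim 5 can be sketched with a toy nearest-pattern matcher. The abstract mentions calculating a distance value for each vocabulary entry against a standard pattern; everything else here is an assumption for illustration: the log-compressed feature, the Euclidean distance, the single-frame patterns, and the two-word vocabulary are all hypothetical.

```python
import math


def extract_features(spectrum):
    """Hypothetical feature amount: log-compressed amplitude per frequency bin."""
    return [math.log1p(a) for a in spectrum]


def recognize(features, standard_patterns):
    """Return the vocabulary word whose standard pattern is closest to the features."""
    def distance(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(standard_patterns,
               key=lambda word: distance(features, standard_patterns[word]))


# Hypothetical standard patterns for a two-word vocabulary.
standard_patterns = {
    "on": extract_features([4.0, 1.0, 0.0]),
    "off": extract_features([0.0, 1.0, 4.0]),
}

# Hypothetical spectrum after environmental sound removal.
result = recognize(extract_features([3.5, 1.2, 0.3]), standard_patterns)
print(result)  # on
```

A practical recognizer would compare sequences of frames rather than a single frame; the single-frame comparison only keeps the example short.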

【請求項６】請求項５に記載の遠隔制御方法を実行可
能なプログラムが記録されている記録媒体。6. A recording medium on which a program capable of executing the remote control method according to claim 5 is recorded.