JP2012208218A

JP2012208218A - Electronic apparatus

Info

Publication number: JP2012208218A
Application number: JP2011072349A
Authority: JP
Inventors: Noriyuki Daihashi; 紀幸大▲はし▼
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2011-03-29
Filing date: 2011-03-29
Publication date: 2012-10-25

Abstract

PROBLEM TO BE SOLVED: To prevent, with a simple configuration, false recognition of voice-input commands in an electronic apparatus capable of voice input and output. SOLUTION: An electronic apparatus having a voice input/output function performs voice recognition both on the audio signal output from a microphone and on the audio signal supplied to a speaker, using a single voice recognition processing unit under time-sharing control. This prevents false recognition of a command when a voice that coincides with a command input voice happens to be output from the speaker.

Description

The present invention relates to an electronic device that accepts commands by voice input.

Some recent electronic devices accept voice input of commands that instruct the execution of various processes. In this type of device, the user's speech is picked up by a microphone, and voice recognition processing is applied to the microphone's output signal to determine whether a predetermined command has been input. When a command is determined to have been input, processing corresponding to that command is executed. Here, voice recognition processing means converting the sound picked up by the microphone into character string data representing the utterance content.

Some electronic devices, such as audio equipment, have an audio output function. When a voice command input function is added to such a device, a mechanism is needed to prevent sound output by the device itself from being mistaken for a command input voice from the user, even when that output happens to coincide with a command input voice. Examples of techniques for preventing such misrecognition are disclosed in Patent Documents 1 to 4.

In the techniques disclosed in Patent Documents 1 and 2, a voice recognition processing unit that applies voice recognition to the audio signal supplied to the speaker is provided separately from the voice recognition processing unit used for voice command input. Even when the command-input unit detects a command, execution of that command is withheld if the speaker-side unit also detects the same command, which prevents the misrecognition described above. Patent Document 3 discloses a technique that avoids the misrecognition by subtracting the speaker output component from the audio signal output by the microphone before applying voice recognition. Patent Document 4 discloses a technique that avoids the misrecognition by lowering the volume of the speaker output, triggered by, for example, signal output from the microphone.

Patent Document 1: Japanese Utility Model Publication No. 07-23400
Patent Document 2: JP 09-69038 A
Patent Document 3: JP 2003-345387 A
Patent Document 4: JP 2006-171152 A

However, the techniques disclosed in Patent Documents 1 and 2 require an independent voice recognition processing unit for each of the speaker output and the microphone input, which raises the manufacturing cost of the electronic device and makes it unsuitable for miniaturization. With the technique disclosed in Patent Document 3, the delay, gain, frequency spectrum, and so on of the sound output from the speaker change as it propagates through the acoustic space to the microphone. Unless these changes are properly taken into account, the speaker output component cannot be correctly subtracted from the microphone output signal, and it becomes difficult to reliably prevent the misrecognition. The technique of Patent Document 3 is therefore hard to apply to devices, such as home audio equipment, in which the relative position of the speaker and the microphone can change each time the device is used. With the technique disclosed in Patent Document 4, when a command for adjusting the volume of the speaker output is input by voice, for example, the playback volume first drops and only then changes to the desired level, which can feel unnatural to the user. This is especially noticeable when the voice command raises the volume.

The present invention has been made in view of the above problems, and its object is to provide a technique that makes it possible, with a simple configuration, to prevent misrecognition of commands input by voice to an electronic device having a voice input/output function.

To solve the above problems, the present invention provides an electronic device comprising: (A) a first buffer that accumulates a sample sequence of an audio signal representing input sound picked up by a microphone; (B) a second buffer that accumulates a sample sequence of an audio signal representing output sound to be output from a speaker; (C) a voice recognition processing unit that applies voice recognition processing to a given sample sequence; and (D) a control unit that executes a first determination process, a second determination process, and a command execution process. The first determination process supplies the sample sequence accumulated in the first buffer to the voice recognition processing unit a predetermined number of samples at a time, oldest first, and determines from the result of the voice recognition processing whether the sequence represents the input sound of any of one or more predetermined commands. The second determination process determines whether that input sound was output from the speaker, based on the sample sequence written to the second buffer at the same time as, or slightly earlier than, the time at which the sample sequence judged in the first determination process to represent the input sound of one of the commands was written to the first buffer. The command execution process executes the command represented by an input sound that the first determination process judged to represent one of the commands and that the second determination process judged not to have been emitted from the speaker.

Two concrete approaches to the second determination process are conceivable. In the first, the sample sequence written to the second buffer at the same time as, or slightly earlier than, the time at which the sample sequence judged in the first determination process to represent the input sound of one of the commands was written to the first buffer is supplied to the voice recognition processing unit a predetermined number of samples at a time, and the determination is made from the result of that voice recognition processing. In the second, the correlation between the sample sequence written to the second buffer and the sample sequence judged in the first determination process to represent one of the commands is computed, and the determination is made from the strength of that correlation.
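The correlation-based approach can be illustrated with a minimal sketch. The source does not specify how the correlation is computed or what strength counts as a match, so the normalized correlation, the lag search, the function names, and the 0.8 threshold below are all assumptions made for illustration.

```python
import math

def normalized_correlation(a, b):
    """Normalized correlation (-1.0 to 1.0) over the common length of a and b."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(sum((x - mean_a) ** 2 for x in a) *
                    sum((y - mean_b) ** 2 for y in b))
    return num / den if den > 0 else 0.0

def came_from_speaker(mic_samples, playback_samples, max_lag, threshold=0.8):
    """Hypothetical second determination by correlation.

    Returns True if the playback samples correlate strongly with the
    microphone samples at some lag of 0..max_lag samples, where the lag
    models the acoustic delay from speaker to microphone.
    """
    for lag in range(max_lag + 1):
        # The microphone hears the playback `lag` samples late.
        if normalized_correlation(mic_samples[lag:], playback_samples) >= threshold:
            return True
    return False
```

Because the correlation is normalized, a change in gain between speaker and microphone does not affect the result; changes in frequency response would, which is one reason a real device might prefer the recognition-based approach.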

With such an electronic device, even when the former approach is used for the second determination process, both the first and the second determination processes are executed with a single voice recognition processing unit. Compared with a configuration that provides separate, independent voice recognition processing units for the first and second determination processes, the manufacturing cost of the device can be kept low and the device is better suited to miniaturization. Two concrete ways of executing both determination processes with one voice recognition processing unit are conceivable. In one, the second determination process is started after the first determination process has determined that the input represents one of the commands. In the other, the second determination process is started without waiting for the first determination process to finish, at the point during the first determination process when the sample sequence under determination is found to represent at least part of one of the commands. The latter shortens the delay between voice input of a command and the start of the corresponding processing. The former has the advantage of avoiding needless execution of the second determination process.

To solve the above problems, the present invention also provides an electronic device comprising: (A) a first buffer that accumulates a sample sequence of an audio signal representing input sound picked up by a microphone; (B) a second buffer that accumulates, either per speaker, sample sequences of the audio signals representing the output sound to be output from each of a plurality of speakers, or a sample sequence of a mixing signal obtained by mixing those audio signals; (C) a voice recognition processing unit that applies voice recognition processing to a given sample sequence; and (D) a control unit that executes a first determination process, a second determination process, and a command execution process. The first determination process supplies the sample sequence accumulated in the first buffer to the voice recognition processing unit a predetermined number of samples at a time, oldest first, and determines from the result of the voice recognition processing whether the sequence represents the input sound of any of one or more predetermined commands. The second determination process determines whether that input sound was emitted from any of the plurality of speakers, based on the sample sequence of the mixing signal or the per-speaker sample sequences accumulated in the second buffer at the same time as, or slightly earlier than, the time at which the sample sequence judged in the first determination process to represent the input sound of one of the commands was written to the first buffer. The command execution process executes the command represented by an input sound that the first determination process judged to represent one of the commands and that the second determination process judged not to have been emitted from the plurality of speakers.

With such an electronic device, voice command input becomes possible in devices to which a plurality of speakers are connected, such as stereo audio equipment with one left and one right speaker channel or 5.1-channel surround audio equipment, and misrecognition that would occur when sound output from any of the speakers happens to coincide with a command input voice can be prevented with a simple configuration.
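As a rough illustration of the mixing-signal option, the sketch below mixes several speaker channels into one sample sequence before buffering. The source does not say how the mixing is done; equal-weight averaging and the name `mix_channels` are assumptions.

```python
def mix_channels(channels):
    """Mix per-speaker sample lists into one list by equal-weight averaging.

    `channels` is a list of sample lists, one per speaker. The result is
    truncated to the shortest channel.
    """
    if not channels:
        return []
    n = min(len(ch) for ch in channels)
    count = len(channels)
    return [sum(ch[i] for ch in channels) / count for i in range(n)]
```

Storing only the mixed signal needs one buffer regardless of the number of speakers, whereas per-speaker storage keeps each channel available for separate checking at the cost of more memory.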

In a more preferred aspect, each of the above electronic devices includes a sampling rate conversion unit that converts the sample sequence to be written to the second buffer into a sample sequence with a lower sampling rate, and the sample sequence processed by the sampling rate conversion unit is written to the second buffer. Lowering the sampling rate of the sample sequence written to the second buffer, within a range that does not affect voice recognition, reduces the required size of the second buffer.
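The saving can be made concrete with simple arithmetic. The figures below (a 4-second buffer, 16-bit mono samples, and a reduced rate of 16 kHz) are assumptions chosen for illustration; the source states only that the rate is lowered within a range that does not affect voice recognition.

```python
def buffer_bytes(sample_rate_hz, seconds, bytes_per_sample=2):
    """Bytes needed to hold `seconds` of mono audio at the given sample rate."""
    return int(sample_rate_hz * seconds * bytes_per_sample)

# Assumed example values, not taken from the source.
full_rate = buffer_bytes(44_100, 4)     # 352800 bytes at 44.1 kHz
reduced_rate = buffer_bytes(16_000, 4)  # 128000 bytes at 16 kHz
```

Under these assumptions the second buffer shrinks to roughly 36% of its original size.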

A diagram showing a configuration example of the electronic device 1A according to the first embodiment of the present invention.
A diagram for explaining the operation of the electronic device 1A.
A diagram showing a configuration example of the electronic device according to the second embodiment of the present invention.

Embodiments of the present invention will be described below with reference to the drawings.
(A: First Embodiment)
FIG. 1 is a diagram showing a configuration example of the electronic device 1A according to the first embodiment of the present invention.
The electronic device 1A is an audio device that plays back broadcast content provided by, for example, terrestrial digital broadcasting. It supplies the audio signal contained in the broadcast content to the speaker 2, which outputs it as sound. As shown in FIG. 1, a microphone 3 is connected to the electronic device 1A in addition to the speaker 2, and the user of the electronic device 1A can input commands instructing the execution of various processes by voice through the microphone 3. In this embodiment, both the speaker 2 and the microphone 3 are external to the electronic device 1A, but either or both may of course be built into the electronic device 1A. Likewise, the signal source of the audio signal to be played back need not be external and may be built into the electronic device 1A. When the speaker 2, the microphone 3, the signal source, or the like is provided outside the electronic device 1A as an external device, the interface for exchanging signals between that external device and the electronic device 1A may be either wireless or wired.

As shown in FIG. 1, the electronic device 1A includes a sampling rate conversion unit 10, a reproduction signal buffer 20, a microphone input buffer 30, a read control unit 40, a voice recognition processing unit 50, and a control unit 60. The reproduction signal buffer 20 accumulates a sample sequence representing the waveform of the audio signal supplied to the speaker 2. It is a ring buffer allocated in the volatile memory (not shown) of the electronic device 1A. The sample sequence written to the reproduction signal buffer 20 has first undergone sampling rate conversion by the sampling rate conversion unit 10. The sampling rate conversion unit 10 takes the sample sequence representing the waveform of the audio signal supplied to the speaker 2 (for example, a sequence obtained by sampling the audio signal at a rate of 44.1 kHz to 96 kHz) and converts it to a lower sampling rate, within a range that does not affect the voice recognition processing performed downstream by the voice recognition processing unit 50. The sampling rate conversion unit 10 is provided in order to reduce the size of the reproduction signal buffer 20. It is therefore unnecessary when there is no need to reduce that buffer size. When the audio signal contained in the broadcast content is in analog form, it may be supplied to the sampling rate conversion unit 10 after A/D conversion by an A/D converter or the like.

The microphone input buffer 30 is a ring buffer, like the reproduction signal buffer 20. It accumulates a sample sequence representing the waveform of the audio signal output from the microphone 3 (that is, the audio signal representing the sound picked up by the microphone 3). In this embodiment the microphone input buffer 30 is sized to hold 4 seconds of samples, because the utterance time of a command generally falls within a range of about 1 to 4 seconds. Under the control of the control unit 60, the read control unit 40 executes two processes. In the first, it reads the sample sequence accumulated in the microphone input buffer 30 in blocks of a predetermined number of samples, oldest first, and supplies them to the voice recognition processing unit 50. In the second, in response to a read instruction from the control unit 60, it reads from the reproduction signal buffer 20 the samples in the time interval specified by that instruction, in blocks of the same predetermined number of samples, and supplies them to the voice recognition processing unit 50.
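The ring buffer and the block-wise, oldest-first reading can be sketched as follows. This is an illustrative model only: the class name, the use of absolute sample indices, and the Python list storage are assumptions, not details given in the source.

```python
class SampleRingBuffer:
    """Fixed-capacity ring buffer addressed by absolute sample index."""

    def __init__(self, capacity):
        self._data = [0] * capacity
        self._capacity = capacity
        self._written = 0  # total number of samples written so far

    def write(self, samples):
        # Once the buffer is full, new samples overwrite the oldest ones.
        for s in samples:
            self._data[self._written % self._capacity] = s
            self._written += 1

    def oldest_index(self):
        """Absolute index of the oldest sample still held."""
        return max(0, self._written - self._capacity)

    def read(self, start, count):
        """Return `count` samples beginning at absolute index `start`."""
        if start < self.oldest_index() or start + count > self._written:
            raise IndexError("requested samples are not in the buffer")
        return [self._data[(start + i) % self._capacity] for i in range(count)]

    def blocks(self, start, block_size):
        """Yield (index, block) pairs of complete blocks, oldest first."""
        pos = max(start, self.oldest_index())
        while pos + block_size <= self._written:
            yield pos, self.read(pos, block_size)
            pos += block_size
```

For a 4-second microphone buffer at an assumed 16 kHz sampling rate, the capacity would be 64,000 samples. A caller modelling the read control unit would remember the index after the last block it consumed and pass it as `start` on the next call.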

The voice recognition processing unit 50 is, for example, a DSP (Digital Signal Processor). It applies voice recognition processing to the sample sequence supplied by the read control unit 40, generates character string data representing the utterance content of the speech that the sample sequence represents, and outputs that data to the control unit 60. Any well-known voice recognition method may be used in the voice recognition processing unit 50 as appropriate.

The control unit 60 includes a CPU (Central Processing Unit), a nonvolatile memory such as a ROM (Read Only Memory), and a volatile memory such as a RAM (Random Access Memory) (none of which are shown). The nonvolatile memory stores in advance a user interface program for realizing voice command input, and a command table containing command data (for example, character string data representing a command) for each of the one or more commands that the control unit 60 can execute. The control unit 60 runs the CPU according to the user interface program and thereby executes the processing that characterizes this embodiment. The volatile memory is used as a work area when the CPU executes the user interface program.

By running the CPU according to the user interface program, the control unit 60 executes three processes: a first determination process, a second determination process, and a command execution process. In the first determination process, the control unit 60 controls the read control unit 40 so that the sample sequence accumulated in the microphone input buffer 30 is supplied to the voice recognition processing unit 50 block by block, oldest first, and determines from the result of the voice recognition processing whether the input sound represented by the sample sequence corresponds to any of the commands represented by the command data stored in the command table. The first determination process is executed periodically at intervals that are short relative to the duration held by the microphone input buffer 30 (for example, every 5 to 10 ms).

In the second determination process, the control unit 60 gives the read control unit 40 a read instruction to read, block by block, and supply to the voice recognition processing unit 50 the sample sequence that was written to the reproduction signal buffer 20 at the same time as the sample sequence judged by the first determination process to represent a command was written to the microphone input buffer 30 (or earlier than that write time by the delay required for sound output from the speaker 2 to propagate through the acoustic space and reach the microphone 3). It then determines whether the input sound was emitted from the speaker 2. To identify the time at which the sample sequence judged to represent a command was written to the microphone input buffer 30, the number of samples from the newest sample in the microphone input buffer 30 back to the first sample of that sequence is converted into time. The read position in the reproduction signal buffer 20 is identified in the same way, by converting that time into a number of samples in the reproduction signal buffer 20. In the command execution process, the control unit 60 executes, as the command the user has instructed, the command represented by an input sound that the first determination process judged to represent one of the commands stored in the command table and that the second determination process judged not to have been emitted from the speaker 2.
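The conversion from a position in the microphone input buffer to a read position in the reproduction signal buffer might look like the sketch below. The function and parameter names are hypothetical, positions are expressed as a number of samples back from the newest sample in each buffer, and both buffers are assumed to be written up to the same instant.

```python
def playback_read_range(samples_back, command_len, mic_rate_hz,
                        playback_rate_hz, acoustic_delay_s=0.0):
    """Map a command's position in the microphone buffer to the playback buffer.

    samples_back: samples from the newest microphone sample back to the
        first sample of the command.
    command_len: length of the command in microphone samples.
    Returns (start_back, length) in playback-buffer samples, where
    start_back counts back from the newest playback sample.
    """
    # Microphone sample count -> time. The playback sound was written
    # earlier by the acoustic delay, so it lies further back in time.
    age_s = samples_back / mic_rate_hz + acoustic_delay_s
    duration_s = command_len / mic_rate_hz
    # Time -> playback sample count (the two rates may differ).
    start_back = round(age_s * playback_rate_hz)
    length = round(duration_s * playback_rate_hz)
    return start_back, length
```

For example, a command that starts 32,000 samples back in a 16 kHz microphone buffer started 2 seconds ago; in an 8 kHz reproduction buffer that is 16,000 samples back. The reproduction signal buffer must hold at least this much history plus the acoustic delay for the read to succeed.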

In this embodiment, the second determination process is started when the first determination process has determined that the input represents one of the commands, but the start timing of the second determination process is not limited to this. For example, the second determination process may be started at the point during the first determination process when a voice recognition result is obtained that at least partially matches one of the commands stored in the command table (for example, the first N characters match, where N is an arbitrary natural number), after which the first and second determination processes are executed in parallel under time-sharing control. With this approach the second determination process starts earlier than in this embodiment, which shortens the delay between the moment the user of the electronic device 1A speaks a command and the start of the processing corresponding to that command.
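The partial-match trigger described above can be sketched as a prefix comparison against the command table. The command strings and the function name are hypothetical; the source specifies only that the first N characters are compared.

```python
def matches_command_prefix(partial_text, commands, n):
    """True if the first n characters of partial_text equal the first n
    characters of at least one command."""
    if len(partial_text) < n:
        return False  # not enough recognized text yet
    head = partial_text[:n]
    return any(cmd[:n] == head for cmd in commands)

# Hypothetical command table contents.
COMMANDS = ["volume up", "volume down", "power off"]
```

A control loop could call this on each partial recognition result and start the second determination process the first time it returns True.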

Various approaches are also conceivable for the end timing of the second determination process. For example, the second determination process may end at the point when the first M characters (M being a natural number) of the character string obtained by applying voice recognition processing to the sample sequence under determination are found not to match the first M characters of any command, with the sound judged by the first determination process to be a command input sound being determined not to have been output from the speaker 2. Alternatively, the second determination process may end at the point when the sample sequence is found to represent a character string different from that of the sample sequence judged in the first determination process. In the approach where the second determination process starts without waiting for the first determination process to finish, the second determination process may also end at the point when the first determination process determines that the input does not represent a command.
The above is the configuration of the electronic device 1A.

Next, the operation of the electronic device 1A will be described with reference to FIG. 2.
As shown in FIG. 2, in this embodiment sample sequences are written continuously to the reproduction signal buffer 20 and the microphone input buffer 30, and the first determination process described above is executed at predetermined time intervals. As a result, the read control unit 40 follows the writing of samples to the microphone input buffer 30, reading the written sample sequence in blocks of a predetermined number of samples, oldest first, and supplying them to the voice recognition processing unit 50.

Suppose that a sample sequence representing the input voice of one of the commands is stored across blocks M1 and M2 in FIG. 2. The control unit 60, executing the first determination process, receives from the voice recognition processing unit 50 the result data of the voice recognition processing for blocks M1 and M2. On detecting that the character string represented by this result data matches one of the commands stored in the command table, it gives the read control unit 40 a read instruction to read from the reproduction signal buffer 20 the blocks S1 and S2 that correspond to blocks M1 and M2, and starts the second determination process. The read control unit 40 reads blocks S1 and S2 from the reproduction signal buffer 20 in sequence according to the read instruction and supplies them to the voice recognition processing unit 50, which applies voice recognition processing to them and passes the result data to the control unit 60.

The control unit 60, while executing the second determination process, determines that the input sound judged in the first determination process to represent the command was not output by the speaker 2 if the result of the speech recognition processing for blocks S1 and S2 does not represent a character string, or represents a character string different from the command detected in the first determination process. In this case, the control unit 60 executes a command execution process that executes the command detected in the first determination process.

In contrast, if the result of the speech recognition processing for blocks S1 and S2 represents the same character string as the command detected in the first determination process, the control unit 60 determines that the input sound judged in the first determination process to represent the command was output by the speaker 2. In this case, the command execution process is not executed. Consequently, when a command input voice entered via the microphone 3 happens to coincide with the output sound of the speaker 2, the command entered by that voice is not executed. In that case, however, the user of the electronic device 1A can easily notice that the command input voice he or she uttered happened to coincide with the output sound of the speaker 2, and can respond by, for example, uttering the command input voice again, so no particular problem arises.

As described above, in this embodiment the speech recognition processing for the microphone input sound and the speech recognition processing for the speaker output sound are both executed under time-division control of a single speech recognition processing unit, and the speaker output sound is prevented from being mistaken for a command input voice. Compared with a configuration in which separate, independent speech recognition processing units are provided for the input system and the output system, as in the techniques disclosed in Patent Document 1 and Patent Document 2, the electronic device can therefore be made smaller and its manufacturing cost reduced.
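The two-stage decision of the first embodiment can be summarized in a short sketch. Everything named here is hypothetical: `decide`, the contents of `COMMANDS`, and in particular `fake_recognize`, which merely joins text fragments and stands in for the speech recognition processing unit 50 so that the example is runnable.

```python
COMMANDS = {"play", "stop"}  # hypothetical contents of the command table


def decide(mic_blocks, playback_blocks, recognize, commands=COMMANDS):
    """Return the command to execute, or None if no command should run."""
    # First determination: does the microphone input represent a command?
    mic_text = recognize(mic_blocks)
    if mic_text not in commands:
        return None
    # Second determination: the same recognizer is reused on the blocks of
    # the reproduction signal buffer that correspond to the microphone blocks.
    speaker_text = recognize(playback_blocks)
    if speaker_text == mic_text:
        return None  # judged to have been output by the speaker
    return mic_text  # no text, or different text: execute the command


def fake_recognize(blocks):
    """Toy recognizer: blocks are text fragments; an empty result means no speech."""
    return "".join(blocks) or None
```

For example, `decide(["pl", "ay"], ["la", "la"], fake_recognize)` yields the command, whereas identical microphone and playback blocks yield `None`. A single `recognize` callable serves both stages, mirroring the time-division use of one recognition unit.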

(B: Second embodiment)
FIG. 3A and FIG. 3B are diagrams showing configuration examples of the electronic devices 1B and 1C according to the second embodiment of the present invention. The electronic devices 1B and 1C are stereo audio devices or multi-channel surround audio devices, and differ from the electronic device 1A of the first embodiment in that a plurality of speakers 2-k (k = 1 to N) are connected to them.

The electronic device 1B shown in FIG. 3A has a mixing unit 70 that mixes the audio signals given to the speakers 2-k (k = 1 to N), and the audio signal obtained by this mixing is given to the sampling rate conversion unit 10. That is, in the electronic device 1B shown in FIG. 3A, the second determination process is performed on the mixed signal of the audio signals given to the speakers 2-k (k = 1 to N).

In contrast, as shown in FIG. 3B, the electronic device 1C has N reproduction signal buffers 20-k (k = 1 to N), one for each of the speakers 2-k (k = 1 to N). Into each reproduction signal buffer 20-k is written, after sampling rate conversion by the sampling rate conversion unit 10, a sample sequence of the same audio signal as that given to the corresponding speaker 2-k. In the electronic device 1C shown in FIG. 3B, when the first determination process yields the result that the input represents the input voice of one of the one or more predetermined commands, the second determination process is executed in turn on the sample sequences stored in each of the reproduction signal buffers 20-k (k = 1 to N). That is, in the electronic device 1C shown in FIG. 3B, the second determination process is performed on each of the audio signals given to the speakers 2-k (k = 1 to N) individually.

Comparing the electronic device 1B with the electronic device 1C, the delay from the user's utterance of a command input voice until processing corresponding to the command actually starts is shorter in the former. However, the former configuration (electronic device 1B) may suffer the following misjudgments, which do not arise in the latter (electronic device 1C). The first arises when the sound output from each individual speaker 2-k (k = 1 to N) differs from the command input voice, but their superposition (that is, the sound corresponding to the mixed signal) happens to coincide with it; the command input voice uttered by the user is then judged to have been output from the speakers 2-k. The second arises because mixing a plurality of audio signals can make a command contained in only one particular channel unrecognizable owing to interference from the other channels. This second misjudgment is a concern when the microphone 3 is located near the speaker reproducing that particular channel, because the sound reproduced by that speaker may then be recognized in the first determination process but not in the second. In the latter configuration (electronic device 1C), the delay from the user's utterance of a command input voice until processing corresponding to the command actually starts is longer than in the former (electronic device 1B), but the first and second misjudgments are reliably avoided.

Accordingly, the former electronic device (the electronic device 1B shown in FIG. 3A) is preferable when the delay from the user's utterance of a command input voice until processing corresponding to the command actually starts must be kept short, and the latter (the electronic device 1C shown in FIG. 3B) is preferable when the above misjudgments must be reliably avoided. Even in the former, the first of the two misjudgments occurs infrequently, and if it does occur, the user can simply speak the same command again; the misjudgment is unlikely to recur, and the processing corresponding to the command is then executed. The frequency of the second misjudgment can also be reduced by taking care in positioning the speakers and the microphone so that the distances between each speaker and the microphone 3 are equal.
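The trade-off between the two configurations can be made concrete with a toy sketch. The sample-wise sum in `mix` is only a simple stand-in for the mixing unit 70, and the lookup-table recognizer `TOY` is purely hypothetical; the point is that the mixed variant needs one recognition pass while the per-channel variant needs up to N.

```python
def mix(channels):
    """Sample-wise sum of equally long channels (simple stand-in for mixing)."""
    return [sum(samples) for samples in zip(*channels)]


def from_speaker_mixed(channels, command, recognize):
    # Device 1B style: a single recognition pass on the mixed signal.
    return recognize(mix(channels)) == command


def from_speaker_per_channel(channels, command, recognize):
    # Device 1C style: up to N recognition passes, one per channel.
    return any(recognize(ch) == command for ch in channels)


# Toy recognizer: only one exact sample pattern is "heard" as a command.
TOY = {(1, 0, 1, 0): "stop"}
recognize = lambda samples: TOY.get(tuple(samples))

# The command is present on the first channel only; the second channel masks it.
channels = [[1, 0, 1, 0], [0, 5, 0, 5]]
```

With these inputs the mixed check misses the command carried by the first channel (the second misjudgment described above), while the per-channel check finds it at the cost of more recognition passes.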

(C: Modifications)
The first and second embodiments of the present invention have been described above; these embodiments may of course be modified as follows.
(1) In the first and second embodiments described above, the block size used when reading the sample sequence from the microphone input buffer 30 is always constant, but the block size may of course be variable. For example, while the speech recognition processing on the sample sequence read so far shows that the sequence does not represent speech, or represents speech but not a command (for example, the first character of the character string data obtained by the speech recognition processing does not match any command), the block size is kept constant; once the speech recognition processing yields character string data whose first character matches one of the commands, the block size used for reading the subsequent sample sequence is reduced. Also, in the first and second embodiments the block size used when reading the sample sequence from the microphone input buffer 30 is the same as the block size used when reading the sample sequence from the reproduction signal buffer 20, but the two block sizes may of course differ.
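The variable block size of modification (1) might be sketched as follows. The function name `next_block_size` and the sizes 1024 and 256 are assumptions for illustration only; the source gives no concrete values.

```python
def next_block_size(partial_text, commands, normal=1024, reduced=256):
    """Choose the block size for the next read from the microphone buffer.

    partial_text is the character string recognized so far (None or an
    empty string if the samples do not represent speech).
    """
    if partial_text and any(c.startswith(partial_text[0]) for c in commands):
        # The first character matches some command: read in smaller blocks.
        return reduced
    return normal
```

For instance, with the hypothetical commands `{"play", "stop"}`, a partial result of `"p"` switches to the reduced size, while `"x"` or no speech at all keeps the normal size.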

(2) In the first and second embodiments described above, the first and second determination processes prevent sound output from the speaker 2 (or the speakers 2-k (k = 1 to N)) from being erroneously recognized as a command input voice. In addition, when the first determination process determines that the input voice represents one of the one or more commands, volume control may of course also be performed to lower the gain of the audio signal given to the speaker 2 (or the speakers 2-k (k = 1 to N)), that is, the volume of the sound output from the speaker 2 or the speakers 2-k.
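A minimal sketch of the volume control in modification (2) follows. The gain value `DUCK_GAIN` and the function names are hypothetical; the source says only that the gain is lowered once the first determination process detects a command.

```python
DUCK_GAIN = 0.25  # hypothetical reduced gain; not specified in the source


def apply_gain(samples, gain):
    """Scale every sample by the given gain."""
    return [s * gain for s in samples]


def speaker_feed(samples, command_candidate):
    """Lower the speaker signal while a command candidate is being checked."""
    return apply_gain(samples, DUCK_GAIN if command_candidate else 1.0)
```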

(3) In the first and second embodiments described above, examples of applying the present invention to audio devices have been described, but the application of the present invention is not limited to these. Any electronic device other than an audio device that has a sound output function and a voice command input function (for example, a home game console, a personal computer, or a car audio system) can avoid erroneous recognition of commands by applying the present invention.

(4) In the second determination process of the first and second embodiments described above, the sample sequence written to the reproduction signal buffer 20 at the same timing as (or at an earlier timing than) the timing at which the sample sequence judged by the first determination process to represent the input sound of one of the commands was written to the microphone input buffer 30 is read out block by block and given to the speech recognition processing unit 50, and whether the input sound was emitted from the speaker 2 is determined from the result of that speech recognition processing. Alternatively, the correlation between the sample sequence judged by the first determination process to represent the input sound of one of the commands and the sample sequence written to the reproduction signal buffer 20 at the same timing as (or at an earlier timing than) the timing at which the former was written to the microphone input buffer 30 may be obtained, and whether the input sound was emitted from the speaker 2 may be determined from the strength of that correlation. In the example shown in FIG. 2, for instance, whether the sound corresponding to block M1 was emitted by the speaker 2 is determined from the strength of the correlation between block M1 and block S1. In short, any configuration suffices as long as it determines whether the input sound was emitted from the speaker 2 on the basis of the sample sequence written to the reproduction signal buffer 20 at the same timing as (or at an earlier timing than) the timing at which the sample sequence judged by the first determination process to represent the input sound of one of the commands was written to the microphone input buffer 30.
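One possible reading of the correlation-based variant in modification (4) is sketched below using a zero-lag normalized correlation between a microphone block and the corresponding playback block. The source does not specify the correlation measure or any threshold; `normalized_correlation`, `emitted_by_speaker`, and `THRESHOLD` are assumptions, and a practical version would likely also need to search over time lags to account for the acoustic path delay.

```python
import math

THRESHOLD = 0.8  # hypothetical decision threshold; not given in the source


def normalized_correlation(a, b):
    """Zero-lag normalized correlation of two equally long blocks, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def emitted_by_speaker(mic_block, playback_block, threshold=THRESHOLD):
    """Judge the microphone block to be speaker output if strongly correlated."""
    return normalized_correlation(mic_block, playback_block) >= threshold
```

Because the measure is normalized, an attenuated copy of the playback block (for example, every sample halved) still gives a correlation of 1, whereas an unrelated block such as `[1, -1, 1, -1]` against `[1, 1, 1, 1]` gives 0.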

(5) The second embodiment described above covers two configurations: one in which all the audio signals given to the plurality of speakers are mixed before the second determination process is performed, and one in which the second determination process is performed separately for the audio signal given to each speaker. Instead of mixing all N channels of audio signals, however, the N audio signals may be divided into M groups (2 ≦ M < N), the audio signals may be mixed within each group to generate M mixed signals, each mixed signal may be written to a reproduction signal buffer provided for its group, and each of the M mixed signals may be subjected to the second determination process. For example, for a 5.1-channel signal, the three audio signals of the front left, front right, and center channels are mixed to generate a first mixed signal, the two audio signals of the surround left and surround right channels are mixed to generate a second mixed signal, and each of the first and second mixed signals is subjected to the second determination process. This configuration balances the advantages and drawbacks of mixing all the audio signals before the second determination process and of performing the second determination process for each speaker's audio signal separately; that is, it limits the increase in determination delay while maintaining determination accuracy to a reasonable degree.
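The grouped mixing of modification (5) can be sketched as follows for the 5.1-channel example. The channel labels (`FL`, `FR`, `C`, `SL`, `SR`, `LFE`), the dictionary layout, and the function `group_mixes` are hypothetical. The source mentions only the front and surround groups, so leaving the LFE channel out of both groups is an assumption of this sketch.

```python
# Hypothetical grouping for a 5.1 signal: M = 2 mixed signals.
GROUPS = {
    "front": ["FL", "FR", "C"],
    "surround": ["SL", "SR"],
}


def group_mixes(channels, groups=GROUPS):
    """Return one sample-wise mix per group; each mix gets its own buffer."""
    return {
        name: [sum(s) for s in zip(*(channels[ch] for ch in members))]
        for name, members in groups.items()
    }


channels = {
    "FL": [1, 0], "FR": [0, 1], "C": [1, 1],
    "SL": [2, 0], "SR": [0, 2],
    "LFE": [9, 9],  # not assigned to any group in this sketch
}
mixes = group_mixes(channels)
```

The second determination process would then run once per entry of `mixes`, that is, M times rather than N times.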

(6) In the first and second embodiments described above, the first determination process, the second determination process, and the command execution process, which most clearly embody the features of the present invention, are implemented in software. However, a first determination means that executes the first determination process, a second determination means that executes the second determination process, and a command execution means that executes the command execution process may each be implemented in hardware such as an electronic circuit. Also, in the embodiments described above, the speech recognition processing unit 50 that executes the speech recognition processing is implemented as hardware separate from the control unit 60, but the speech recognition processing may of course be implemented as software processing by the control unit 60. Furthermore, a program that causes a computer to execute the first determination process, the second determination process, and the command execution process (and, optionally, the speech recognition processing) may be distributed by writing it to a computer-readable recording medium such as a CD-ROM (Compact Disc Read-Only Memory), or by download over a telecommunication line such as the Internet.

1A, 1B, 1C ... electronic device, 2 ... speaker, 3 ... microphone, 10 ... sampling rate conversion unit, 20 ... reproduction signal buffer, 30 ... microphone input buffer, 40 ... read control unit, 50 ... speech recognition processing unit, 60 ... control unit.

Claims

An electronic device comprising:
(A) a first buffer that stores a sample sequence of an audio signal representing an input sound picked up by a microphone;
(B) a second buffer that stores a sample sequence of an audio signal representing an output sound to be output from a speaker;
(C) a speech recognition processing unit that performs speech recognition processing on a given sample sequence; and
(D) a control unit that executes:
a first determination process of giving the sample sequence stored in the first buffer to the speech recognition processing unit a predetermined number of samples at a time, oldest first, and determining, based on the result of the speech recognition processing by the speech recognition processing unit, whether the sample sequence represents an input sound of any of one or more predetermined commands;
a second determination process of determining whether the input sound was output from the speaker, based on a sample sequence written to the second buffer at the same timing as, or slightly earlier than, the timing at which the sample sequence determined in the first determination process to represent an input sound of any of the one or more commands was written to the first buffer; and
a command execution process of executing the command represented by an input sound that is determined in the first determination process to represent any of the one or more commands and is determined in the second determination process not to have been emitted from the speaker.

An electronic device comprising:
(A) a first buffer that stores a sample sequence of an audio signal representing an input sound picked up by a microphone;
(B) a second buffer that stores, for each of a plurality of speakers, a sample sequence of an audio signal representing an output sound to be output from that speaker, or stores a sample sequence of a mixed signal obtained by mixing the audio signals;
(C) a speech recognition processing unit that performs speech recognition processing on a given sample sequence; and
(D) a control unit that executes:
a first determination process of giving the sample sequence stored in the first buffer to the speech recognition processing unit a predetermined number of samples at a time, oldest first, and determining, based on the result of the speech recognition processing by the speech recognition processing unit, whether the sample sequence represents an input sound of any of one or more predetermined commands;
a second determination process of determining whether the input sound was emitted from any of the plurality of speakers, based on the sample sequence of the mixed signal or the per-speaker sample sequences stored in the second buffer at the same timing as, or slightly earlier than, the timing at which the sample sequence determined in the first determination process to represent an input sound of any of the one or more commands was written to the first buffer; and
a command execution process of executing the command represented by an input sound that is determined in the first determination process to represent any of the one or more commands and is determined in the second determination process not to have been emitted from the plurality of speakers.

The electronic device according to claim 1 or 2, wherein, in the course of executing the first determination process, the control unit starts the second determination process without waiting for the first determination process to finish, at the point when a determination result is obtained that the sample sequence under determination represents at least part of any of the one or more commands.

The electronic device according to any one of claims 1 to 3, further comprising a sampling rate conversion unit that converts the sample sequence to be written to the second buffer into a sample sequence with a lower sampling rate, wherein the sample sequence processed by the sampling rate conversion unit is written to the second buffer.