JP5476760B2

JP5476760B2 - Command recognition device

Info

Publication number: JP5476760B2
Application number: JP2009076789A
Authority: JP
Inventors: 裕司久湊
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2009-03-26
Filing date: 2009-03-26
Publication date: 2014-04-23
Anticipated expiration: 2029-03-26
Also published as: JP2010230852A

Description

The present invention relates to a technique for recognizing voice commands.

There is a function called voice command control, or simply command control, in which an electronic device is operated by voice. It is used, for example, to release a camera shutter by voice. Specifically, a camera has been proposed whose shutter is released in response to the utterance "Hai, chiizu" (the Japanese equivalent of "Say cheese") spoken toward the camera (see, for example, Patent Document 1).

Patent Document 1: JP 2000-59664 A

In command control of this kind, the device should react only to the utterance "Hai, chiizu" and not to any other speech. For example, when photographing a group of people, the photographer may say "Hai, shashin toru yo" ("OK, I'm taking the picture") or "Hai, atsumatte" ("OK, gather round"), and it is a problem if the shutter is released in response to such utterances.

In the conventional technique using DP matching, one person's utterance of "Hai, chiizu" is registered in advance as a template. The template is then compared with the incoming speech, and whether "Hai, chiizu" was said is decided by whether the similarity is at or above a certain threshold. Raising the threshold prevents false activation by speech other than "Hai, chiizu", but it also increases false rejections, in which the device fails to operate even though "Hai, chiizu" was said. Setting the threshold is therefore critical.
However, the appropriate threshold depends on the content of the registered utterance, and the criterion also shifts with ambient noise. It is therefore difficult to set, in advance, a preset threshold that produces the expected behavior for the utterances of a wide range of users.

When a conventional large-vocabulary speech recognition system based on HMMs (hidden Markov models) is used for this kind of command control, a large number of words and morphemes are registered in the dictionary, so "Hai, chiizu" can be distinguished from other utterances. Command control that reacts only to "Hai, chiizu" is therefore possible. However, large-vocabulary speech recognition requires a fast CPU or DSP and memory on the order of tens of megabytes or more, which makes it difficult to install in a compact device.

The problems of the conventional HMM-based method in which only a few words are registered in the dictionary are now described with reference to the drawings. FIG. 14 is a diagram showing an example of a conventional command control system. An acoustic model 221 and a dictionary (command list) 222 are supplied to this system in advance. The dictionary can also be created by the user. In the command control system shown in FIG. 14, the acoustic model is a set of HMMs (hidden Markov models), each of which represents the acoustic characteristics of a phoneme. A Japanese monophone acoustic model uses about 40 phonemes: the vowels "a", "i", "u", "e", and "o", plus consonants such as "p", "t", and "s". The dictionary describes which operation corresponds to which utterance by the user. For example, as shown in FIG. 15, the first column gives the operation and the second column gives the corresponding pronunciation (phoneme symbol string).

From the dictionary and the acoustic model, the recognition engine unit 211 internally constructs a standard model of the sound corresponding to an utterance such as "hai chiizu", that is, a command acoustic model W1. It likewise constructs an acoustic model W2 for "flash". The recognition engine unit 211 also analyzes the speech uttered by a person, converts it into a time series X of feature parameters in the same representation as the acoustic model, and calculates the probability P(X|W1) that the speech X is produced by the acoustic model W1, using the forward algorithm or a similar method. P(X|W2) can be obtained in the same way. (P(X|W) can be read as the probability that the sound X is uttered when the speaker intends to say the word W.)

To decide whether the uttered speech X is W1 or W2, the posterior probabilities P(W1|X) and P(W2|X) must be compared. A typical speech recognition system therefore applies Bayes' theorem, P(W|X) = P(W)P(X|W)/P(X). Because P(X) is constant for a given utterance, P(W|X) ∝ P(W)P(X|W), and the system compares the values of P(X|W) to make a relative judgment about which word is more probable.
With this method, however, the value being compared is not an absolute probability, so it is impossible to determine whether the utterance is a word that is not in the dictionary at all.
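The limitation described above can be illustrated with a minimal Python sketch. The function name `pick_by_likelihood` and the likelihood values are hypothetical and chosen only for illustration; they do not come from the patent.

```python
def pick_by_likelihood(likelihoods):
    """Return the dictionary entry with the largest likelihood P(X|W).

    This is a purely relative comparison: some entry always wins,
    even when every likelihood is tiny.
    """
    return max(likelihoods, key=likelihoods.get)


# Hypothetical likelihoods for an utterance unrelated to either command.
# Both values are very small, yet W1 is still selected.
unrelated = {"W1": 1e-9, "W2": 3e-10}
selected = pick_by_likelihood(unrelated)
```

Because the comparison never asks how likely the utterance is in absolute terms, an out-of-dictionary utterance is still mapped to one of the registered commands.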

An object of the present invention is to provide a technique that can reduce false activation caused by speech unrelated to the commands in a voice command control system.

To solve the above problem, the present invention provides a command recognition device comprising: storage means for storing, in association with each other, syllables and phonemes for command syllables written in a notation corresponding to one or more syllables; command phoneme string generation means for selecting from the storage means the phonemes corresponding to the syllables contained in an input command syllable string and generating a command phoneme string made up of a sequence of phonemes; dummy command phoneme string generation means for generating a dummy command phoneme string in which a plurality of vowels contained in the command phoneme string generated by the command phoneme string generation means are replaced with different vowels; voice signal input means to which a voice signal is input; and command recognition means for analyzing the voice signal input to the voice signal input means and performing command recognition processing according to the similarity between the analysis result and the command phoneme string and the similarity between the analysis result and the dummy command phoneme string.

According to the present invention, false activation caused by speech unrelated to the commands can be reduced in a voice command control system.

FIG. 1 is a block diagram showing an example of the hardware configuration of the imaging device.
FIG. 2 is a diagram showing an example of the contents of the command phoneme string table.
FIG. 3 is a block diagram showing an example of the functional configuration of the imaging device.
FIG. 4 is a diagram showing an example of the contents of the dummy commands.
FIG. 5 is a diagram showing an example of the correspondence between the feature parameter of a command and the probability P(X|W).
FIG. 6 is a diagram showing an example of the correspondence between the feature parameter of a command and the probability P(W|X).
FIG. 7 is a diagram showing an example of the correspondence between the feature parameter of a command and the probability P(X|W).
FIG. 8 is a diagram showing an example of the correspondence between the feature parameters of a command and dummy commands and the probability P(W|X).
FIG. 9 is a diagram showing an example of the correspondence between the feature parameters of a command and dummy commands and the probability P(W|X).
FIG. 10 is a diagram showing an example of the contents of the replacement table.
FIG. 11 is a diagram showing an example of dummy command phoneme strings.
FIG. 12 is a diagram showing an example of the contents of the replacement table.
FIG. 13 is a diagram showing an example of dummy command phoneme strings.
FIG. 14 is a diagram showing an example of the functional configuration of the imaging device.
FIG. 15 is a diagram showing an example of the contents of the dictionary.

<A: Configuration>
FIG. 1 is a block diagram showing an example of the hardware configuration of an imaging device 1 according to an embodiment of the present invention. The imaging device 1 is a device capable of capturing still images and moving images, for example a digital camera. In FIG. 1, the control unit 11 includes a CPU (Central Processing Unit), a ROM (Read Only Memory), and a RAM (Random Access Memory), and controls each unit of the imaging device 1 via the bus BUS by reading and executing a computer program stored in the ROM or in the storage unit 12. The storage unit 12 is storage means for the computer programs executed by the control unit 11 and the data used during their execution, and is, for example, a hard disk device. The display unit 13 includes a liquid crystal panel or the like and displays various images under the control of the control unit 11. The operation unit 14 outputs to the control unit 11 a signal corresponding to an operation by the user of the imaging device 1. The operation unit 14 has various buttons, including a cross key (not shown), a button for starting and stopping audio recording (not shown), and a button for capturing still images and for starting and stopping moving image capture (not shown). By pressing these buttons, the user of the imaging device 1 can record audio, capture images, and so on. Switching between still image capture and moving image capture is done with a changeover switch (not shown) provided on the imaging device 1. The imaging unit 18 includes an imaging lens and the like, captures images, and outputs video data representing the captured images. The video data in this embodiment includes data representing still images and data representing moving images.

The microphone 15 is sound collection means that picks up sound and outputs an analog signal representing it. The audio processing unit 16 performs A/D conversion on the analog signal output by the microphone 15 to generate digital data. Under the control of the control unit 11, the audio processing unit 16 also performs D/A conversion on digital audio data to generate an analog signal and outputs the generated audio signal to the speaker 17. The speaker 17 is sound emission means that emits sound at an intensity corresponding to the analog signal supplied from the audio processing unit 16.

This embodiment describes the case where the microphone 15 and the speaker 17 are built into the imaging device 1. Alternatively, the audio processing unit 16 may be provided with an input terminal and an output terminal, with an external microphone connected to the input terminal via an audio cable and, likewise, an external speaker connected to the output terminal via an audio cable. This embodiment also describes the case where the audio signal input from the microphone 15 to the audio processing unit 16 and the audio signal output from the audio processing unit 16 to the speaker 17 are analog signals, but digital audio data may be input and output instead. In that case the audio processing unit 16 does not need to perform A/D or D/A conversion. The same applies to the display unit 13, the operation unit 14, and the imaging unit 18: each may be built into the imaging device 1 or attached externally.

As shown in the figure, the storage unit 12 has an acoustic model database DB1 storage area 121, a command phoneme string table TBL1 storage area 122, a dummy command phoneme string table TBL2 storage area 123, and a phoneme dictionary storage area 124. The acoustic model database DB1 storage area 121 stores HMMs (hidden Markov models), which are models representing the acoustic characteristics of each phoneme. In this embodiment, the Japanese monophone acoustic model stores data representing the features of about 40 phonemes (hereinafter "phoneme feature data"): the vowels "a", "i", "u", "e", and "o", plus consonants such as "p", "t", and "s".

The command phoneme string table TBL1 storage area 122 stores data indicating which operation corresponds to which utterance by the user. FIG. 2 is a diagram showing an example of the contents of the command phoneme string table TBL1 storage area 122. As shown in the figure, this storage area stores the items "operation" and "phoneme symbol string" in association with each other. The "operation" item stores data representing an operation performed by the imaging device 1, such as "release the shutter" or "turn on the flash". The "phoneme symbol string" item stores data representing the phoneme symbol string of the voice command corresponding to each operation (hereinafter "command phoneme string").

The dummy command phoneme string table TBL2 storage area 123 stores the dummy commands that are generated when the control unit 11 of the imaging device 1 executes the dummy command generation process described later. The dummy commands stored in this area are described in detail below. The phoneme dictionary storage area 124 stores syllables and phonemes in association with each other.

Next, an example of the functional configuration of the imaging device 1 is described with reference to the drawings. FIG. 3 is a diagram showing an example of the functional configuration of the imaging device 1. In the figure, the recognition engine unit 111, the command determination unit 112, the command execution unit 113, and the dummy command generation unit 114 are realized by the control unit 11 of the imaging device 1 reading and executing a computer program stored in the ROM or the storage unit 12. The arrows in the figure schematically indicate the flow of data.

An audio signal representing the sound picked up by the microphone 15 is input to the recognition engine unit 111. The recognition engine unit 111 analyzes the input audio signal and performs command recognition processing according to the similarity between the analysis result and the command phoneme strings stored in the command phoneme string table TBL1 storage area 122 and the similarity between the analysis result and the dummy command phoneme strings stored in the dummy command phoneme string table TBL2 storage area 123. More specifically, the recognition engine unit 111 first extracts features of the speech from the input audio signal and generates data representing the extracted features (hereinafter "acoustic model"). For example, from the dictionary and the acoustic model, the recognition engine unit 111 internally constructs a standard model of the sound corresponding to the utterance "hai chiizu", that is, a command acoustic model W1. It likewise constructs an acoustic model W2 for "flash".

The recognition engine unit 111 also extracts features from the sound picked up by the microphone 15 and converts them into a time series X of feature parameters in the same representation as the acoustic model. In this embodiment, 26-dimensional MFCC (Mel-Frequency Cepstrum Coefficient) parameters are used as the feature parameters of the acoustic model. The recognition engine unit 111 then calculates the probability P(X|W1) that the speech X is produced by the acoustic model W1, using the forward algorithm or a similar method. In the same way, it calculates the probability P(X|W2) that the speech X is produced by the acoustic model W2. P(X|W) can be read as the probability that the sound X is uttered when the speaker intends to say the word W.
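As a rough illustration of how a likelihood such as P(X|W) can be computed with the forward algorithm, the following Python sketch handles a small HMM with discrete observation symbols. This is a simplification: the embodiment uses continuous 26-dimensional MFCC features, and the function name, parameter layout, and toy model values here are assumptions made only for this example.

```python
def forward_likelihood(obs, init, trans, emit):
    """Compute P(X|W) for a discrete-observation HMM with the forward algorithm.

    obs:   list of observation symbol indices (stand-in for the series X)
    init:  init[s]     = P(first state is s)
    trans: trans[s][t] = P(next state is t | current state is s)
    emit:  emit[s][o]  = P(symbol o is observed | state is s)
    """
    n = len(init)
    # alpha[s] = P(observations so far, current state is s)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[s] * trans[s][t] for s in range(n)) * emit[t][o]
            for t in range(n)
        ]
    return sum(alpha)


# Toy two-state left-to-right model (values are illustrative only).
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood([0, 1], init, trans, emit)
```

Practical implementations usually work with log probabilities or scaled values to avoid numerical underflow on long utterances; that refinement is omitted here for brevity.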

The recognition engine unit 111 obtains the posterior probability according to equation (1) below.
P(W_i|X) = P(W_i) P(X|W_i) / Σ_j P(X|W_j)   ... (1)
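The normalization in equation (1) can be sketched in Python as follows. The sketch assumes equal prior probabilities for all entries, so that the prior term drops out and each likelihood is simply divided by the sum over all registered commands and dummy commands; this assumption is consistent with the worked figures given later in the description but is not stated in the patent itself. The function name `posteriors` is hypothetical.

```python
def posteriors(likelihoods):
    """Turn likelihoods P(X|W_i) into normalized scores P(W_i|X).

    Assumes equal priors for every entry (an assumption of this sketch),
    so each score is P(X|W_i) / sum_j P(X|W_j).
    """
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("all likelihoods are zero; nothing to normalize")
    return {w: p / total for w, p in likelihoods.items()}


# Illustrative values: the scores sum to 1 regardless of the input scale.
scores = posteriors({"haichi:zu": 0.03, "hoocho:zo": 0.02})
```

Because the scores always lie between 0 and 1 and sum to 1, a single fixed threshold can be applied to them whatever the command is.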

The command determination unit 112 decides whether to accept or reject a command based on the analysis result of the recognition engine unit 111. When it decides to accept a command, the command determination unit 112 notifies the command execution unit 113 of the accepted command. The command execution unit 113 executes various processes according to the command notified by the command determination unit 112.

In this embodiment, the command determination unit 112 accepts the command if the maximum value Max(P(W_i|X)) of the P(W_i|X) values calculated by the recognition engine unit 111 is greater than a predetermined threshold and the W_i that gives the maximum is not a dummy command. Otherwise it rejects the command.

With this method P(W_i|X) takes a value between 0 and 1, so the decision can be made with a fixed threshold regardless of which command is involved. However, Σ_j P(X|W_j) here stands for P(X), the probability of observing the speech X, and an exact value cannot be obtained without summing over every word W_j that can occur in the language. That said, while it is plausible that someone trying to say "hae chiizu" produces a pronunciation close to "hai chiizu", the probability that someone trying to say "asa" ("morning"), which is nothing like the command "hai chiizu", produces a pronunciation close to "hai chiizu" should be almost zero. In this embodiment, therefore, only dummy commands that sound close to the command to be recognized are added alongside it, so as to obtain a good approximation of P(X).

The dummy command generation unit 114 generates, according to a predetermined algorithm, dummy command phoneme strings that have a predetermined degree of similarity to the command phoneme strings registered in the command phoneme string table. In this embodiment, the dummy command generation unit 114 generates dummy command phoneme strings from the commands stored in the command phoneme string table TBL1 storage area 122 according to criteria (i) to (iv) below.
(i) Replace each vowel contained in the command with one of a, i, u, e, and o.
(ii) The number of vowels replaced must be at least 2 and at most 4.
(iii) A candidate that differs from the original command by only one phoneme is not added.
(iv) Generate a dummy command phoneme string with the consonants removed.

That is, following criterion (i), the dummy command generation unit 114 generates a dummy command phoneme string by replacing vowels in the command phoneme string with predetermined vowels. In addition, following criterion (iii), when replacing vowels in the command phoneme string with predetermined vowels yields a phoneme string that differs from the command phoneme string by only one phoneme, the dummy command generation unit 114 does not adopt that string as a dummy command.

An example of the concrete contents of the dummy command phoneme strings (hereinafter simply "dummy commands") is now described with reference to FIG. 4. FIG. 4 is a diagram showing an example of the dummy commands generated by the dummy command generation unit 114. For the command "release the shutter", the dummy command generation unit 114 generates six dummy commands as shown in FIG. 4. For the command "flash", it generates five dummy commands as shown in FIG. 4. As FIG. 4 shows, the dummy command "furushu" is not generated for the command "flash" (see criterion (iii) above).

Adding dummy commands improves the accuracy of the probability P(X), and when a word similar to the original command is uttered, a value of P(X) close to the ideal is obtained. Because the probability P(W|X) is normalized to the range 0 to 1, a fixed threshold can be used to decide whether the utterance is a command. FIGS. 5 and 6 illustrate this schematically. FIG. 5 is a diagram showing an example of the correspondence between the feature parameter and the probability P(X|W) for the command W "hai chiizu" and the command W "flash". FIG. 6 is a diagram showing an example of the correspondence between the speech feature parameter and the probability P(W|X). For ease of explanation, FIGS. 5 and 6 depict the speech feature parameter as a one-dimensional parameter.

A concrete example of the probability calculation that the recognition engine unit 111 performs using the dummy commands generated by the dummy command generation unit 114 is described with reference to FIGS. 7 to 9. FIG. 7 is a diagram showing an example of the correspondence between the speech feature parameter and the probability P(X|W) for the command "hai chiizu" and the dummy command "hoo choozo", which is generated as a dummy command for "hai chiizu".

In FIG. 7, suppose the user makes an intermediate utterance that is not a command, such as "hai doozo", and that P(X|haichi:zu) = 0.03 and P(X|hoocho:zo) = 0.02, with all other probabilities small enough to ignore. Then P(haichi:zu|X) = 0.03/(0.03 + 0.02) = 0.6 and P(hoocho:zo|X) = 0.02/(0.03 + 0.02) = 0.4. If the threshold has been set to 0.8 in advance, P(haichi:zu|X) does not exceed 0.8, so the utterance is rejected rather than accepted as a command (see FIG. 7).
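The accept/reject decision and the figures in this example can be checked with a short Python sketch. The function name `decide` and the data layout are hypothetical; the likelihood values 0.03 and 0.02, the threshold 0.8, and the phoneme strings are taken from the example above, and equal priors are assumed as in that example.

```python
def decide(likelihoods, dummies, threshold=0.8):
    """Return the accepted command, or None if the utterance is rejected.

    likelihoods: P(X|W) for every command and dummy command
    dummies:     set of entries that are dummy commands
    An entry is accepted only if its normalized score exceeds the
    threshold and it is not a dummy command.
    """
    total = sum(likelihoods.values())
    if total == 0.0:
        return None
    best = max(likelihoods, key=likelihoods.get)
    score = likelihoods[best] / total
    if score > threshold and best not in dummies:
        return best
    return None


dummies = {"hoocho:zo"}
# The intermediate utterance from the example: score 0.6, below 0.8.
result = decide({"haichi:zu": 0.03, "hoocho:zo": 0.02}, dummies)
print(result)  # prints None
```

A clearer utterance, for which the command's likelihood dominates the sum, would push the score above the threshold and be accepted, whereas an utterance that best matches a dummy command is rejected whatever its score.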

In this embodiment, the choice of which dummy commands to register is important. If some dummy commands are too similar to the original command, a slightly unclear pronunciation is enough to give those dummy commands a comparable probability, and the rejection rate rises. This is very inconvenient for users of command control. Conversely, if no dummy command resembles the original command, the device reacts to similar-sounding words and operates falsely even when the user did not intend to utter a command. FIG. 8 is a diagram showing the correspondence between the speech feature parameter and the probability P(W|X) for "hoo choozo" and "haa chaaza", which are dummy commands for "hai chiizu". FIG. 9 is a diagram showing an example of the correspondence between the speech feature parameter and the probability P(W|X) when "hoo choozo" and "haa chiizu" are used as dummy commands. As FIG. 9 shows, when a dummy command that is too similar to the original command is used, the rejection rate becomes high for even slightly unclear pronunciation. In this embodiment, by contrast, the dummy command generation unit 114 does not add a candidate as a dummy command when it differs from the original command by only one phoneme. This prevents dummy commands that are too similar to the original command from being registered, and thereby reduces malfunctions.

<B: Operation>
<B-1: Command registration operation>
Next, the operation of the photographing apparatus 1 will be described, starting with command registration. The user first operates the operation unit 14 to register a command. More specifically, the user operates the operation unit 14 to input text data representing the character string of the command. The control unit 11 acquires the input text data (the command syllable) in response to a signal from the operation unit 14. That is, a command syllable composed of notation corresponding to one or more syllables is input to the control unit 11. The control unit 11 refers to the phoneme dictionary stored in the phoneme dictionary storage area 124, selects the phonemes corresponding to the input command syllable, and generates a command phoneme string composed of a sequence of phonemes. The control unit 11 stores the generated command phoneme string in the command phoneme string table TBL1 storage area 122.
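The registration step above amounts to a dictionary lookup from notation to phonemes. The following is a minimal sketch under stated assumptions: `PHONEME_DICT`, its four entries, the longest-match lookup, and the in-memory `command_table` standing in for table TBL1 are all hypothetical and are not taken from the patent.

```python
# Hypothetical phoneme dictionary: notation for a syllable -> its phonemes.
PHONEME_DICT = {
    "は": ["h", "a"],
    "い": ["i"],
    "チー": ["ch", "i:"],
    "ズ": ["z", "u"],
}

# Stand-in for the command phoneme string table TBL1.
command_table = {}


def to_phoneme_string(text, dictionary=PHONEME_DICT):
    """Convert command text to a phoneme list, preferring the longest
    dictionary key (two characters, then one) at each position."""
    phonemes = []
    i = 0
    while i < len(text):
        for length in (2, 1):
            key = text[i:i + length]
            if key in dictionary:
                phonemes.extend(dictionary[key])
                i += len(key)
                break
        else:
            # No dictionary entry starts at this position.
            raise KeyError(f"no phoneme entry for {text[i]!r}")
    return phonemes


def register_command(text):
    """Generate the command phoneme string and store it in the table."""
    command_table[text] = to_phoneme_string(text)
    return command_table[text]


phoneme_string = register_command("はいチーズ")
```

Here `phoneme_string` is the list `h, a, i, ch, i:, z, u`. A real phoneme dictionary would cover the full syllabary; the four entries are only enough for this one command.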

After registering the command in the command phoneme string table, the control unit 11 generates dummy commands from the input command and stores them in the dummy command phoneme string table TBL2 storage area 123. For example, when the command "hai chiizu" (はいチーズ, "yes, cheese") is input, six dummy commands such as those shown in FIG. 4 are generated.

<B-2: Command recognition operation>
Next, the operation by which the photographing apparatus 1 recognizes a command will be described. The control unit 11 waits until speech is input. When speech is input, the control unit 11 analyzes it and determines whether a command has been input, based on the similarity between the analysis result and the command phoneme string and the similarity between the analysis result and the dummy command phoneme strings. When it determines that a command has been input, the control unit 11 executes the processing corresponding to that command. For example, when the voice command "hai chiizu" ("yes, cheese") is input, the control unit 11 executes processing to capture a still image. Likewise, when the voice command "flash" is input, the control unit 11 executes processing to turn the flash on (or off).
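The flow above (score the utterance against commands and dummy commands, then run the matching processing) can be sketched as below. This is only one possible decision rule, assumed for illustration: the utterance is accepted when its best-scoring entry is a real command and ignored when a dummy command scores highest. The similarity scores are taken as given, and the labels, handler functions, and return strings are hypothetical.

```python
def recognize(scores, commands):
    """Return the best-matching label if it is a real command, else None.

    scores:   mapping from phoneme-string label to similarity score,
              covering both commands and dummy commands.
    commands: set of labels that are real commands.
    """
    best = max(scores, key=scores.get)
    return best if best in commands else None


def handle_utterance(scores, actions):
    """Run the action registered for the recognized command, if any."""
    recognized = recognize(scores, set(actions))
    if recognized is None:
        return "ignored"
    return actions[recognized]()


# Hypothetical command handlers.
actions = {
    "hai chi:zu": lambda: "capture still image",
    "furasshu": lambda: "toggle flash",
}

clear = {"hai chi:zu": 0.8, "furasshu": 0.1, "hoochoozo": 0.3}
unrelated = {"hai chi:zu": 0.2, "furasshu": 0.1, "hoochoozo": 0.4}

result_clear = handle_utterance(clear, actions)
result_unrelated = handle_utterance(unrelated, actions)
```

In this sketch the dummy entry absorbs the unrelated utterance, so no action runs for it, while the clear utterance triggers the still-image handler.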

<C: Effects of the embodiment>
As described above, in a voice-based command control system according to this embodiment, dummy commands similar to the phoneme string of a registered command are generated, and command recognition processing is executed using the generated dummy commands, which reduces malfunctions caused by speech unrelated to the command. By registering in the command list, as dummies, sounds that moderately resemble the command to be recognized, speech other than the original command is not recognized as a command when it is input, and malfunctions are reduced.
In addition, compared with a method that uses a large-vocabulary dictionary, this embodiment reduces both memory consumption and the load on the CPU.

Furthermore, according to this embodiment, adding dummy commands improves the accuracy of P(X), so that a nearly ideal P(X) is obtained when a word similar to the original command is uttered. Because P(W|X) is a value normalized to the range 0 to 1, a fixed threshold can be used to determine whether the utterance is a command (see FIGS. 5 and 6).
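A small numerical sketch may help show why the dummy commands matter for P(X). It assumes, as a simplification not stated in this passage, that all registered entries have equal priors and that P(X) is approximated by summing the likelihoods P(X|W') over every registered entry, commands and dummy commands alike. The likelihood values, the function names, and the threshold of 0.5 are made up for illustration.

```python
def posterior(likelihoods, target):
    """Approximate P(W|X) as P(X|W) divided by the sum of P(X|W') over
    all registered entries (equal priors assumed)."""
    total = sum(likelihoods.values())
    return likelihoods[target] / total


def is_command(likelihoods, target, threshold=0.5):
    """Accept the utterance when the normalized posterior reaches the
    fixed threshold."""
    return posterior(likelihoods, target) >= threshold


# An unrelated utterance scored against the command alone: the single
# entry takes all of the probability mass, so the posterior is 1.
without_dummies = {"hai chi:zu": 0.02}

# The same utterance with dummy commands registered: the dummies share
# the mass and the command's posterior drops.
with_dummies = {"hai chi:zu": 0.02, "hoochoozo": 0.05, "haachaaza": 0.03}

# A clear utterance of the command with the same dummies registered.
clear = {"hai chi:zu": 0.9, "hoochoozo": 0.05, "haachaaza": 0.05}

p_without = posterior(without_dummies, "hai chi:zu")
p_with = posterior(with_dummies, "hai chi:zu")
p_clear = posterior(clear, "hai chi:zu")
```

With the command as the only registered entry, the normalized value cannot separate a poor match from a good one. Once dummy commands contribute to the denominator, the value for the unrelated utterance falls well below the value for the clear utterance, which is what makes a single fixed threshold usable.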

<D: Modifications>
An embodiment of the present invention has been described above, but the present invention is not limited to that embodiment and can be implemented in various other forms. Examples are given below. The following aspects may be combined as appropriate.
(1) The above embodiment describes an example in which the command recognition device according to the present invention is applied to a photographing apparatus such as a digital camera. The devices to which it can be applied are not limited to digital cameras and may be, for example, personal computers, mobile phone terminals, or computer game machines; the command recognition device according to the present invention is applicable to a wide variety of devices. The command recognition device may also be externally connected to another device such as a photographing apparatus. In that case, the command recognized by the command recognition device is notified to the other device via an external I/F.

(2) In the above embodiment, the control unit 11 generates dummy command phoneme strings according to the generation criteria (i) to (iv) described above, but the manner of generation is not limited to this. For example, the control unit 11 may generate dummy command phoneme strings according to criteria (i), (ii), and (iv) only. In short, it suffices that the control unit 11 generates, according to a predetermined algorithm, dummy command phoneme strings that have a predetermined degree of similarity to the command phoneme string.

(3) The above embodiment describes recognition of Japanese commands, but the present invention is not limited to this and is also applicable to devices that recognize commands in other languages. A specific example for English commands is described here. English has 24 or more vowels when long vowels and diphthongs are included, so registering words with every vowel substituted as dummy commands would be wasteful. In this aspect, therefore, only acoustically close vowels are used. The examples "yes we can" and "say cheese" are described below. For "yes we can", the command phoneme string is "jeswi:k{n". The phonetic symbols in this phoneme string conform to SAMPA (Speech Assessment Methods Phonetic Alphabet).

In this example, a table listing about four to five substitute vowels for each vowel is prepared in advance, taking phonological distance into account. For example, a replacement table such as that shown in FIG. 10 is stored in advance in the storage unit 12 of the photographing apparatus 1. The control unit 11 refers to this replacement table, rewrites the vowel portions of the command phoneme string in order, and generates dummy command phoneme strings such as those shown in FIG. 11. In the example of FIG. 11, combinations of replacements are not considered, because they would make the number of dummy command phoneme strings too large.
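The table-driven rewriting can be sketched as follows. The text does not spell out exactly how the vowels are rewritten "in order", so this sketch takes one reading: the k-th dummy replaces every vowel with the k-th entry of its table row, and no cross-combinations of rows are generated. The table contents below are placeholders and are not the contents of FIG. 10; the function name and the phoneme segmentation are likewise assumptions.

```python
# Hypothetical replacement table: SAMPA vowel -> acoustically close vowels.
# The rows are placeholders, not the actual contents of FIG. 10.
REPLACEMENT_TABLE = {
    "e": ["I", "{"],
    "i:": ["I", "e"],
    "{": ["e", "V"],
}


def generate_dummies(command, table=REPLACEMENT_TABLE):
    """Build one dummy per table column: the k-th dummy swaps each vowel
    for the k-th substitute in its row. Consonants are left unchanged."""
    rows = [table[p] for p in command if p in table]
    if not rows:
        return []  # nothing to replace
    n_variants = min(len(row) for row in rows)
    return [
        [table[p][k] if p in table else p for p in command]
        for k in range(n_variants)
    ]


# "yes we can" as the phoneme string jeswi:k{n, split into phonemes.
yes_we_can = ["j", "e", "s", "w", "i:", "k", "{", "n"]
dummies = generate_dummies(yes_we_can)
```

Under this reading the number of dummies equals the number of table columns rather than the product of the row lengths, which keeps the dummy list short. Generating one-vowel-at-a-time variants instead would be a different reading of the same passage.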

Next, the "say cheese" command will be described. The command phoneme string of this command is "seIchi:z". As with the command above, a replacement table that takes phonological distance into account is prepared in advance. For example, a replacement table such as that shown in FIG. 12 is stored in advance in the storage unit 12 of the photographing apparatus 1. The control unit 11 refers to this replacement table, rewrites the vowel portions of the command phoneme string in order, and generates dummy command phoneme strings such as those shown in FIG. 13. In this case as well, as in the embodiment described above, the control unit 11 executes command recognition processing using the command phoneme string and the generated dummy command phoneme strings.

(4) In the above embodiment, text data representing the character string of a command is input to the control unit 11 as the command phoneme string, but the input is not limited to text data. For example, data representing phonetic symbols may be input as the command phoneme string. In short, it suffices that a command syllable composed of notation corresponding to one or more syllables is input to the control unit 11.

(5) The program executed by the control unit 11 of the photographing apparatus 1 in the above embodiment can be provided stored on a computer-readable recording medium such as a magnetic recording medium (magnetic tape, magnetic disk, etc.), an optical recording medium (optical disc, etc.), a magneto-optical recording medium, or a semiconductor memory. The program can also be downloaded to the photographing apparatus 1 via a network such as the Internet. Various devices other than a CPU can serve as the control means that performs this control; for example, a dedicated processor may be used.

1 ... photographing apparatus, 11 ... control unit, 12 ... storage unit, 13 ... display unit, 14 ... operation unit, 15 ... microphone, 16 ... audio processing unit, 17 ... speaker, 18 ... photographing unit, 111 ... recognition engine unit, 112 ... command determination unit, 113 ... command execution unit, 114 ... dummy command generation unit, 121 ... acoustic model database DB1 storage area, 122 ... command phoneme string table TBL1 storage area, 123 ... dummy command phoneme string table TBL2 storage area, 124 ... phoneme dictionary storage area.

Claims

A command recognition device comprising:
storage means for storing, in association with each other, the syllables and phonemes of a command syllable composed of notation corresponding to one or more syllables;
command phoneme string generation means for selecting from the storage means the phonemes corresponding to the syllables included in an input command syllable, and generating a command phoneme string composed of a sequence of phonemes;
dummy command phoneme string generation means for generating a dummy command phoneme string in which a plurality of vowels included in the command phoneme string generated by the command phoneme string generation means are replaced with different vowels;
audio signal input means to which an audio signal is input; and
command recognition means for analyzing the audio signal input to the audio signal input means and performing command recognition processing according to the similarity between the analysis result and the command phoneme string and the similarity between the analysis result and the dummy command phoneme string.