JP2003345384A

JP2003345384A - Method, device, and program for voice recognition

Info

Publication number: JP2003345384A
Application number: JP2002152646A
Authority: JP
Inventors: Hajime Kobayashi; 載小林; Soichi Toyama; 聡一外山
Original assignee: Pioneer Electronic Corp
Current assignee: Pioneer Corp
Priority date: 2002-05-27
Filing date: 2002-05-27
Publication date: 2003-12-03

Abstract

PROBLEM TO BE SOLVED: To provide a voice recognition device that reduces the amount of similarity calculation in matching processing and thereby recognizes speech quickly and accurately.

SOLUTION: A voice recognition device 100 is provided with: a keyword model database 106 storing keyword HMMs that represent the feature quantity patterns of a plurality of keywords to be recognized; a similarity calculation part 107 that calculates the similarity of the feature quantity of each frame on the basis of the feature quantity extracted from a voice signal for each frame, the keyword HMMs, and a specific voice HMM; an unnecessary word similarity setting part 108 that sets unnecessary word similarities on the basis of the similarity calculated from the specific voice HMM; a matching processing part 109 that performs matching processing on the basis of the calculated similarities and the set unnecessary word similarities; and a discrimination part 110 that discriminates a keyword included in the uttered speech on the basis of the matching processing.

COPYRIGHT: (C)2004, JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention belongs to the technical field of voice recognition using the HMM (Hidden Markov Model) method, and more specifically to the technical field of recognizing keywords in uttered speech.

[0002]

[Prior Art] Voice recognition devices that recognize speech uttered by a person have been developed. When a person utters a predetermined phrase, such a device recognizes the phrase from the input signal.

[0003] If such a voice recognition device is applied to various devices, such as a vehicle-mounted navigation device or a personal computer, various kinds of information can be input to the device without manual operation of a keyboard or switches.

[0004] Accordingly, even in a working environment in which both of a person's hands are occupied, such as using a navigation device while driving a car, the operator can input the desired information into the device.

[0005] A representative method of such voice recognition uses a probabilistic model called an HMM (Hidden Markov Model) (hereinafter, this method is simply referred to as voice recognition).

[0006] This voice recognition is performed by matching the feature quantity pattern of the uttered speech against the feature quantity patterns of speech representing prepared recognition candidate phrases that serve as keywords (hereinafter referred to as recognition target words (keywords)).

[0007] Specifically, the uttered speech (input signal) is analyzed at each predetermined time interval to extract a feature quantity. The degree to which this feature quantity matches the feature quantity data of each recognition target word, represented by an HMM stored in a database in advance, is then calculated; that is, the probability that the feature quantity of the input signal is that feature quantity data (hereinafter referred to as the similarity). The similarities are accumulated over the entire uttered speech, and the recognition target word with the highest accumulated similarity is determined as the recognition result.
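The accumulation step described in this paragraph can be illustrated with a minimal Python sketch. It assumes that the per-frame similarities for each recognition target word are already available; the function name `recognize_by_accumulated_similarity`, the dictionary layout, and the accumulation in the log domain are assumptions made for illustration and are not taken from the patent. HMM state alignment is deliberately left out.

```python
import math

def recognize_by_accumulated_similarity(frame_similarities):
    """Pick the word whose accumulated similarity is highest.

    frame_similarities: dict mapping a recognition target word to a list of
    per-frame similarities (probabilities in (0, 1]). Hypothetical layout.
    """
    totals = {}
    for word, probs in frame_similarities.items():
        # Accumulate in the log domain so long utterances do not underflow.
        totals[word] = sum(math.log(p) for p in probs)
    best_word = max(totals, key=totals.get)
    return best_word, totals
```

Summing logarithms is equivalent to multiplying the per-frame probabilities, so the ranking of the words is unchanged.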

[0008] As a result, this voice recognition can recognize a predetermined phrase from an input signal that is uttered speech.

[0009] An HMM is a statistical signal source model expressed as a set of states with transitions between them, and represents the feature quantities of speech to be recognized, such as a keyword. The HMM is generated in advance from a plurality of collected speech samples.

[0010] In such voice recognition, it is important how the keyword portion, that is, the recognition target word contained in the uttered speech, is extracted.

[0011] In addition to a keyword, uttered speech usually contains unnecessary words, which are known in advance and are not needed for recognition (words such as "er" and "desu" that are added before and after the recognition target word). In that case, the uttered speech in principle consists of unnecessary words and a keyword sandwiched between them.

[0012] Conventionally, voice recognition is often performed by a technique called word spotting, which recognizes the keywords targeted for recognition (hereinafter simply referred to as word spotting voice recognition).

[0013] In word spotting voice recognition, HMMs representing models of unnecessary words (hereinafter referred to as unnecessary word models) are prepared in addition to the HMMs representing keyword models. The uttered speech is recognized by identifying the keyword model, unnecessary word model, or combination of them whose feature quantities have the highest similarity.

[0014] That is, word spotting voice recognition identifies, on the basis of the accumulated similarities, the keyword model, unnecessary word model, or combination of them with the highest similarity, and, if the uttered speech contains a keyword, outputs that keyword as the recognition result.

[0015] In word spotting voice recognition, one way to construct the unnecessary word model is to use a probabilistic model called a filler model (hereinafter simply referred to as a filler model).

[0016] As shown in FIG. 5, a filler model represents, as a network, the connection relationships among all vowels and consonants that can be connected, so as to model all speech. To realize word spotting with this filler model, a filler model must be connected before and after each keyword model.

[0017] That is, the filler model evaluates every recognizable pattern. It calculates the match between the feature quantities of the uttered speech and the feature quantity of each phoneme to obtain the phoneme sequence of the uttered speech, and recognizes the unnecessary word from the path formed by the optimal pattern among the possible connections.
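To make the cost of this approach concrete, the following Python sketch shows a heavily simplified filler: a phoneme loop with one state per phoneme and no transition penalties, so the best path reduces to choosing the best-matching phoneme in every frame. These simplifications, the function name `best_filler_path`, and the input layout are assumptions for illustration only; an actual filler model as described in the text uses HMMs and a connection network.

```python
def best_filler_path(frame_scores):
    """Return the best phoneme per frame and the accumulated score.

    frame_scores: list (one entry per frame) of dicts mapping a phoneme
    label to its log-likelihood for that frame. Hypothetical layout.
    """
    path = []
    total = 0.0
    for scores in frame_scores:
        # Every phoneme must be scored in every frame: this is the
        # per-frame cost that grows with the size of the phoneme inventory.
        best = max(scores, key=scores.get)
        path.append(best)
        total += scores[best]
    return path, total
```

Even in this reduced form, the number of similarity evaluations is the number of frames multiplied by the number of phoneme units.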

[0018]

[Problems to be Solved by the Invention] However, to recognize unnecessary words, such a voice recognition device matches the feature quantities of the uttered speech against the data of every feature quantity, such as phonemes, that can be a constituent of an unnecessary word. The amount of calculation therefore becomes enormous, placing a heavy load on the calculation processing.

[0019] Specifically, the matching process calculates a similarity (likelihood) indicating how closely the feature quantity of the uttered speech resembles the feature quantity data that can constitute an unnecessary word, and judges the unnecessary word whose feature quantity data gives the highest similarity to be the one to recognize. For a Japanese filler model, the similarity must therefore be calculated against the feature quantity data for every connection pattern obtained when the uttered speech is expressed in subwords, that is, units that make up words, such as phonemes or syllables (for example, the a-row, ka-row, sa-row, and ta-row of the Japanese syllabary).

[0020] Accordingly, in the above voice recognition device, calculating the similarity for every item of feature quantity data makes the amount of calculation enormous and places a heavy load on the calculation processing.

[0021] In particular, because the above voice recognition device assumes a language model in which filler models for recognizing unnecessary words are connected before and after the keyword to be recognized, the feature quantities of the uttered speech are matched against each item of feature quantity data both before and after the keyword, which requires still more calculation.

[0022] The present invention has been made in view of the above problems, and its object is to provide a voice recognition device that reduces the amount of similarity calculation in the matching process and performs voice recognition quickly and accurately.

[0023]

[Means for Solving the Problems] To solve the above problems, the invention according to claim 1 is a voice recognition device for recognizing a keyword contained in uttered speech, comprising: extraction means for extracting an utterance feature quantity, which is a feature quantity of the speech components of the uttered speech, by analyzing the uttered speech; storage means for storing in advance keyword feature data indicating the feature quantities of the speech components of one or more keywords; calculation means for calculating, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and the keyword feature data stored in the storage means, a keyword probability that the speech in that speech section is the keyword; setting means for setting, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and a plurality of specific voice feature quantities indicating preset feature quantities of speech components, an unnecessary word probability indicating the probability that the speech in that speech section is an unnecessary word that does not constitute the keyword; and determination means for determining the keyword to be recognized contained in the uttered speech on the basis of the calculated keyword probability and the set unnecessary word probability.

[0024] With this configuration, the invention according to claim 1 calculates the keyword probability indicating that the utterance feature quantity is the keyword represented by each item of keyword feature data, sets the unnecessary word probability indicating the probability that the speech section is an unnecessary word on the basis of the utterance feature quantity and the plurality of preset specific voice feature quantities, and determines the keyword to be recognized contained in the uttered speech on the basis of the calculated keyword probability and the set unnecessary word probability.

[0025] Therefore, the unnecessary word probability can be set from a small amount of data, for example by using as the plurality of preset specific voice feature quantities the feature quantities of speech that make up typical unnecessary words, such as vowels, or some of the feature quantities of the plurality of items of keyword feature data, without presetting the large amount of unnecessary word feature data normally required to calculate the unnecessary word probability. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily and quickly.

[0026] The invention according to claim 2 is the voice recognition device according to claim 1, wherein the setting means comprises: specific voice probability calculation means for calculating, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and the plurality of specific voice feature quantities indicating preset feature quantities of speech components, specific voice probabilities each indicating the probability that the utterance feature quantity is the respective specific voice feature quantity; and unnecessary word probability setting means for setting the unnecessary word probability on the basis of the calculated specific voice probabilities.

[0027] With this configuration, the invention according to claim 2 calculates the specific voice probabilities, each indicating the probability that the utterance feature quantity is the respective specific voice feature quantity, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and the plurality of specific voice feature quantities indicating preset feature quantities of speech components, and sets the unnecessary word probability on the basis of the calculated specific voice probabilities.
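One way to picture the specific voice probability calculation is the Python sketch below. It assumes, purely for illustration, that each specific voice feature quantity is summarized by a diagonal-covariance Gaussian (a mean and a variance per dimension) and that the Gaussian likelihood stands in for the probability; the patent text speaks of HMMs and does not prescribe this form. The names `gaussian_density` and `specific_voice_probabilities` are hypothetical.

```python
import math

def gaussian_density(x, mean, var):
    """Likelihood of feature vector x under a diagonal-covariance Gaussian."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def specific_voice_probabilities(feature, specific_models):
    """Score one utterance feature against every specific voice model.

    specific_models: dict mapping a label (e.g. a vowel) to a (mean, var)
    pair of equal-length lists. Assumed layout, not from the patent.
    """
    return {
        label: gaussian_density(feature, mean, var)
        for label, (mean, var) in specific_models.items()
    }
```

The point of the construction is that only a handful of models are scored per frame, rather than every subword unit of a filler network.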

[0028] Therefore, the unnecessary word probability is calculated from representative speech feature quantities or keyword feature data. For example, the specific voice probabilities are calculated using, as the plurality of preset specific voice feature quantities, the feature quantities of speech that make up typical unnecessary words, such as vowels, or some of the feature quantities of the plurality of items of keyword feature data, and the average of the specific voice probabilities is taken. The unnecessary word probability can thus be set from a small amount of data without presetting the large amount of unnecessary word feature data normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily and quickly.

[0029] The invention according to claim 3 is the voice recognition device according to claim 2, wherein the unnecessary word probability setting means sets the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[0030] With this configuration, the invention according to claim 3 sets the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[0031] Therefore, for example, the specific voice probabilities are calculated using, as the plurality of preset specific voice feature quantities, the feature quantities of speech that make up typical unnecessary words, such as vowels, or some of the feature quantities of the plurality of items of keyword feature data, and the average of the specific voice probabilities is used as the unnecessary word probability. The unnecessary word probability can thus be set from a small amount of data without presetting the large amount of unnecessary word feature data normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily and quickly.
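The averaging described for claim 3 is simple enough to state directly. The Python sketch below takes the specific voice probabilities for one frame and returns their arithmetic mean as the unnecessary word probability. The function name and the dictionary input are assumptions for illustration; only the use of the average comes from the text.

```python
def set_unnecessary_word_probability(specific_voice_probs):
    """Average the specific voice probabilities for one frame.

    specific_voice_probs: dict mapping a specific voice label (for example
    a vowel) to its probability for the current frame. Hypothetical layout.
    """
    if not specific_voice_probs:
        raise ValueError("at least one specific voice probability is required")
    values = list(specific_voice_probs.values())
    return sum(values) / len(values)
```

For instance, probabilities of 0.25, 0.5, and 0.75 for three specific voices give an unnecessary word probability of 0.5.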

[0032] The invention according to claim 4 is the voice recognition device according to any one of claims 1 to 3, wherein the setting means uses at least some of the feature quantities of the keyword feature data stored in the storage means as the specific voice feature quantities.

[0033] With this configuration, the invention according to claim 4 sets the unnecessary word probability by using at least some of the feature quantities of the stored keyword feature data as the specific voice feature quantities.

[0034] Therefore, the unnecessary word probability can be set from a small amount of data without presetting and storing the large amount of unnecessary word feature data normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily and quickly.
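Reusing part of the keyword feature data as the specific voice feature quantities might look like the Python sketch below. The selection rule (take the first occurrence of each vowel label found in the keyword models) is an assumption chosen for illustration; the text only states that at least some of the keyword feature quantities are used. The function name `pick_specific_voice_features` and the data layout are hypothetical.

```python
def pick_specific_voice_features(keyword_models, wanted=("a", "i", "u", "e", "o")):
    """Collect a small set of features from already stored keyword models.

    keyword_models: dict mapping a keyword to a list of (label, feature)
    pairs, one per model unit. Assumed layout, not from the patent.
    """
    specific = {}
    for units in keyword_models.values():
        for label, feature in units:
            # Keep the first feature seen for each wanted label so that no
            # separate unnecessary word feature data has to be stored.
            if label in wanted and label not in specific:
                specific[label] = feature
    return specific
```

Because the features are taken from data that is stored anyway, no additional unnecessary word models need to be prepared.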

[0035] The invention according to claim 5 is the voice recognition device according to any one of claims 1 to 4, wherein the extraction means analyzes the uttered speech at each preset unit time to extract the utterance feature quantity, the calculation means calculates the keyword probability for each unit time, the setting means sets the unnecessary word probability for each unit time, and the determination means determines the keyword to be recognized contained in the uttered speech on the basis of the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[0036] With this configuration, the invention according to claim 5 determines the keyword to be recognized contained in the uttered speech on the basis of the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[0037] Therefore, the keyword probability and the unnecessary word probability can be calculated from the uttered speech for each speech sound, such as each phoneme or each speech unit. In addition, the unnecessary word probability can be set from a small amount of data, for example by using as the plurality of preset specific voice feature quantities the feature quantities of speech that make up typical unnecessary words, such as vowels, or some of the feature quantities of the plurality of items of keyword feature data, without presetting the large amount of unnecessary word feature data normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily, quickly, and accurately.
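The per-unit-time processing of claim 5 can be sketched in Python as follows. The sketch splits a sampled signal into fixed-length frames and evaluates a keyword probability and an unnecessary word probability for each frame. The frame length, the hop size, and the two scoring callbacks are placeholders supplied by the caller; none of these values or names are given in this section of the patent.

```python
def split_into_frames(samples, frame_len, hop):
    """Cut a sample sequence into frames of frame_len, advancing by hop."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]

def score_frames(frames, keyword_prob, unnecessary_word_prob):
    """Evaluate both probabilities once per frame (unit time)."""
    return [(keyword_prob(f), unnecessary_word_prob(f)) for f in frames]
```

The resulting list of per-frame pairs is the input that a subsequent determination step would work from.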

[0038] The invention according to claim 6 is the voice recognition device according to any one of claims 1 to 5, wherein the determination means calculates, on the basis of the calculated keyword probabilities and the unnecessary word probability in each unit time, combination probabilities each indicating the probability of a combination of one of the keywords represented by the keyword feature data stored in the first storage means and the unnecessary word, and determines the keyword to be recognized contained in the uttered speech on the basis of the combination probabilities.

[0039] With this configuration, the invention according to claim 6 calculates the combination probability of each keyword and the unnecessary word on the basis of the calculated keyword probabilities and the unnecessary word probability in each unit time, and determines the keyword to be recognized contained in the uttered speech on the basis of the combination probabilities.

[0040] Therefore, the keyword contained in the uttered speech can be determined while taking into account each combination of the unnecessary word and a keyword, so that the keyword can be recognized easily and accurately and misrecognition can be prevented.
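A minimal Python sketch of the combination idea follows. It assumes per-frame log-probabilities are given for every keyword and for the unnecessary word, and it searches every split of the utterance into "unnecessary word, keyword, unnecessary word", where either unnecessary word part may be empty. Treating each keyword as a flat per-frame score (without HMM state structure) and the exhaustive search over spans are simplifications chosen for illustration; the function name `best_combination` is hypothetical.

```python
from itertools import accumulate

def best_combination(keyword_logp, unnecessary_logp):
    """Find the keyword and span with the highest combined log-probability.

    keyword_logp: dict mapping a keyword to per-frame log-probabilities.
    unnecessary_logp: per-frame log unnecessary word probabilities.
    Returns (score, keyword, (start, end)) with end exclusive.
    """
    n = len(unnecessary_logp)
    # Prefix sums: g[i] is the sum over the first i frames.
    g = [0.0] + list(accumulate(unnecessary_logp))
    best = (float("-inf"), None, None)
    for keyword, logp in keyword_logp.items():
        k = [0.0] + list(accumulate(logp))
        for start in range(n):
            for end in range(start + 1, n + 1):
                # unnecessary word | keyword | unnecessary word
                score = g[start] + (k[end] - k[start]) + (g[n] - g[end])
                if score > best[0]:
                    best = (score, keyword, (start, end))
    return best
```

In the usage below, frames 1 and 2 match "tokyo" well while the outer frames are better explained as unnecessary words, so the sketch selects "tokyo" over the span (1, 3).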

[0041] The invention according to claim 7 is a voice recognition method for recognizing a keyword contained in uttered speech, comprising: an extraction processing step of extracting an utterance feature quantity, which is a feature quantity of the speech components of the uttered speech, by analyzing the uttered speech; an acquisition processing step of acquiring in advance keyword feature data indicating the feature quantities of the speech components of one or more keywords; a calculation processing step of calculating, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and the keyword feature data stored in the storage means, a keyword probability that the speech in that speech section is the keyword; a setting processing step of setting, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and a plurality of specific voice feature quantities indicating preset feature quantities of speech components, an unnecessary word probability indicating the probability that the speech in that speech section is an unnecessary word that does not constitute the keyword; and a determination processing step of determining the keyword to be recognized contained in the uttered speech on the basis of the calculated keyword probability and the set unnecessary word probability.

[0042] With this configuration, the invention according to claim 7 calculates the keyword probability indicating that the utterance feature quantity is the keyword represented by each item of keyword feature data, sets the unnecessary word probability indicating the probability that the speech section is an unnecessary word on the basis of the utterance feature quantity and the plurality of preset specific voice feature quantities, and determines the keyword to be recognized contained in the uttered speech on the basis of the calculated keyword probability and the set unnecessary word probability.

[0043] Therefore, the unnecessary word probability can be set from a small amount of data, for example by using as the plurality of preset specific voice feature quantities the feature quantities of speech that make up typical unnecessary words, such as vowels, or some of the feature quantities of the plurality of items of keyword feature data, without presetting the large amount of unnecessary word feature data normally required to calculate the unnecessary word probability. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily and quickly.

[0044] The invention according to claim 8 is the voice recognition method according to claim 7, wherein the setting processing step includes: a specific voice probability calculation processing step of calculating, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and the plurality of specific voice feature quantities indicating preset feature quantities of speech components, specific voice probabilities each indicating the probability that the utterance feature quantity is the respective specific voice feature quantity; and an unnecessary word probability setting processing step of setting the unnecessary word probability on the basis of the calculated specific voice probabilities.

[0045] With this configuration, the invention according to claim 8 calculates the specific voice probabilities, each indicating the probability that the utterance feature quantity is the respective specific voice feature quantity, on the basis of the utterance feature quantity extracted from at least a part of the speech section of the uttered speech and the plurality of specific voice feature quantities indicating preset feature quantities of speech components, and sets the unnecessary word probability on the basis of the calculated specific voice probabilities.

[0046] Therefore, the unnecessary word probability is calculated from representative speech feature quantities or keyword feature data. For example, the specific voice probabilities are calculated using, as the plurality of preset specific voice feature quantities, the feature quantities of speech that make up typical unnecessary words, such as vowels, or some of the feature quantities of the plurality of items of keyword feature data, and the average of the specific voice probabilities is taken. The unnecessary word probability can thus be set from a small amount of data without presetting the large amount of unnecessary word feature data normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so that keywords contained in the uttered speech can be recognized easily and quickly.

[0047] The invention according to claim 9 is the voice recognition method according to claim 8, wherein, in the unnecessary word probability setting processing step, the average of the specific voice probabilities calculated in the specific voice probability calculation processing step is set as the unnecessary word similarity.

[0048] With this configuration, the invention according to claim 9 sets the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[0049] Therefore, for example, the specific voice probabilities are calculated using, as the plurality of preset specific voice feature amounts, voice feature amounts that make up typical unnecessary words, such as vowels, or some of the feature amounts of the plurality of keyword feature amount data, and the average of the plurality of specific voice probabilities is used as the unnecessary word probability. The unnecessary word probability can thus be set from a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily and quickly.
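The averaging described above can be sketched as follows. This is a minimal illustration only: the function name `set_unnecessary_word_probability` and the example scores are hypothetical, and the sketch assumes that one score per specific voice feature amount (for example, one per vowel model) is already available for a single speech section.

```python
from statistics import fmean


def set_unnecessary_word_probability(specific_voice_probs):
    """Return the mean of the specific voice probabilities for one speech section."""
    probs = list(specific_voice_probs)
    if not probs:
        raise ValueError("at least one specific voice probability is required")
    return fmean(probs)


# Hypothetical scores of one speech section against five vowel models (a, i, u, e, o).
section_scores = [0.10, 0.30, 0.20, 0.25, 0.15]
unnecessary_word_prob = set_unnecessary_word_probability(section_scores)
```

Only as many scores as there are specific voice feature amounts are needed per speech section, which is the source of the reduced processing load described in the text.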

[0050] The invention according to claim 10 is the speech recognition method according to any one of claims 7 to 9, in which, in the setting step, at least some of the feature amounts of the keyword feature amount data acquired in the acquisition step are used as the specific voice feature amounts.

[0051] With this configuration, the invention according to claim 10 sets the unnecessary word probability using at least some of the feature amounts of the stored keyword feature amount data as the specific voice feature amounts.

[0052] Therefore, the unnecessary word probability can be set from a small amount of data, without presetting and storing the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily and quickly.

[0053] The invention according to claim 11 is the speech recognition method according to any one of claims 7 to 10, in which: in the extraction step, the uttered speech is analyzed at each preset unit time to extract the uttered-speech feature amount; in the calculation step, the keyword probability is calculated for each unit time; in the setting step, the unnecessary word probability is set for each unit time; and in the determination step, the keyword to be recognized in the uttered speech is determined based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[0054] With this configuration, the invention according to claim 11 determines the keyword to be recognized in the uttered speech based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[0055] Therefore, the keyword probability and the unnecessary word probability can be calculated for each speech sound of the uttered speech, for example in phoneme units or speech units. In addition, for example, by using, as the plurality of preset specific voice feature amounts, voice feature amounts that make up typical unnecessary words, such as vowels, or some of the feature amounts of the plurality of keyword feature amount data, the unnecessary word probability can be set from a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily, quickly, and accurately.

[0056] The invention according to claim 12 is the speech recognition method according to any one of claims 7 to 11, in which, in the determination step, a combination probability, indicating the probability of each combination of the unnecessary word with each keyword indicated by the keyword feature amount data acquired in the acquisition step, is calculated based on the calculated keyword probabilities and the unnecessary word probability for the unit time, and the keyword to be recognized in the uttered speech is determined based on the combination probability.

[0057] With this configuration, the invention according to claim 12 calculates the combination probability of each keyword and the unnecessary word based on the calculated keyword probabilities and the unnecessary word probability for the unit time, and determines the keyword to be recognized in the uttered speech based on the combination probability.

[0058] Therefore, the keyword contained in the uttered speech can be determined while taking into account each combination of the unnecessary word and each keyword, so the keyword contained in the uttered speech can be recognized easily and accurately, and misrecognition can be prevented.
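One way to picture a combination score is the simplified sketch below. It is not the procedure of the patent: it assumes a single log-domain score per unit time for each keyword (ignoring the internal HMM state alignment), a single unnecessary word log score per unit time, and an utterance of the form unnecessary word, keyword, unnecessary word. The function name `best_keyword` and all numbers are hypothetical.

```python
def best_keyword(keyword_log_probs, garbage_log_probs):
    """Search every unnecessary-word / keyword / unnecessary-word split.

    keyword_log_probs: dict mapping keyword -> per-unit-time log scores.
    garbage_log_probs: per-unit-time unnecessary word log scores.
    Returns (keyword, start, end, score); units start..end-1 belong to the keyword.
    """
    n = len(garbage_log_probs)
    total_garbage = sum(garbage_log_probs)
    best = None
    for word, scores in keyword_log_probs.items():
        if len(scores) != n:
            raise ValueError("every keyword needs one score per unit time")
        # Gain of labelling a unit as keyword instead of unnecessary word.
        gain = [k - g for k, g in zip(scores, garbage_log_probs)]
        for start in range(n):
            running = 0.0
            for end in range(start + 1, n + 1):
                running += gain[end - 1]
                score = total_garbage + running
                if best is None or score > best[3]:
                    best = (word, start, end, score)
    return best


# Hypothetical scores for a four-unit utterance.
garbage = [-1.0, -1.0, -1.0, -1.0]
keywords = {
    "tokyo": [-5.0, -0.2, -0.3, -4.0],
    "osaka": [-3.0, -2.0, -2.5, -3.0],
}
result = best_keyword(keywords, garbage)
```

Because every candidate keyword is scored together with the unnecessary word sections around it, a keyword wins only when it explains its own section better than the unnecessary word score does.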

[0059] The invention according to claim 13 is a speech recognition program for recognizing, by a computer, a keyword contained in uttered speech, the program causing the computer to function as: extraction means for extracting an uttered-speech feature amount, which is the feature amount of the speech components of the uttered speech, by analyzing the uttered speech; acquisition means for acquiring in advance keyword feature amount data indicating the feature amounts of the speech components of one or more keywords; calculation means for calculating, based on the uttered-speech feature amount extracted for at least part of the speech sections of the uttered speech and the keyword feature amount data stored in the storage means, a keyword probability that the speech in that speech section is the keyword; setting means for setting, based on the uttered-speech feature amount extracted for at least part of the speech sections of the uttered speech and a plurality of preset specific voice feature amounts indicating feature amounts of speech components, an unnecessary word probability indicating the probability that the speech in that speech section is an unnecessary word that does not form part of the keyword; and determination means for determining the keyword to be recognized in the uttered speech based on the calculated keyword probability and the set unnecessary word probability.

[0060] With this configuration, the invention according to claim 13 calculates the keyword probability, which indicates that the uttered-speech feature amount is a keyword indicated by the respective keyword feature amount data, sets the unnecessary word probability, which indicates the probability that the speech section is an unnecessary word, based on the uttered-speech feature amount and the plurality of preset specific voice feature amounts, and determines the keyword to be recognized in the uttered speech based on the calculated keyword probability and the set unnecessary word probability.

[0061] Therefore, for example, by using, as the plurality of preset specific voice feature amounts, voice feature amounts that make up typical unnecessary words, such as vowels, or some of the feature amounts of the plurality of keyword feature amount data, the unnecessary word probability can be set from a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily and quickly.

[0062] The invention according to claim 14 is the speech recognition program according to claim 13, which, when the unnecessary word probability is set, causes the computer to function as: specific voice probability calculation means for calculating, based on the uttered-speech feature amount extracted for at least part of the speech sections of the uttered speech and a plurality of preset specific voice feature amounts indicating feature amounts of speech components, specific voice probabilities indicating the probability that the uttered-speech feature amount is each of the specific voice feature amounts; and unnecessary word probability setting means for setting the unnecessary word probability based on the calculated specific voice probabilities.

[0063] With this configuration, the invention according to claim 14 calculates, from the uttered-speech feature amount extracted for at least part of the speech sections of the uttered speech and a plurality of preset specific voice feature amounts indicating feature amounts of speech components, the specific voice probability for each specific voice feature amount, and sets the unnecessary word probability based on the calculated specific voice probabilities.

[0064] Therefore, for example, the specific voice probabilities are calculated using, as the plurality of preset specific voice feature amounts, voice feature amounts that make up typical unnecessary words, such as vowels, or some of the feature amounts of the plurality of keyword feature amount data, and the unnecessary word probability is calculated from these representative voice feature amounts or keyword feature amount data, for instance by averaging the plurality of specific voice probabilities. The unnecessary word probability can thus be set from a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily and quickly.

[0065] The invention according to claim 15 is the speech recognition program according to claim 14, which causes the computer to function as unnecessary word probability setting means for setting the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[0066] With this configuration, the invention according to claim 15 sets the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[0067] Therefore, for example, the specific voice probabilities are calculated using, as the plurality of preset specific voice feature amounts, voice feature amounts that make up typical unnecessary words, such as vowels, or some of the feature amounts of the plurality of keyword feature amount data, and the average of the plurality of specific voice probabilities is used as the unnecessary word probability. The unnecessary word probability can thus be set from a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily and quickly.

[0068] The invention according to claim 16 is the speech recognition program according to any one of claims 13 to 15, which causes the computer to function as setting means for setting the unnecessary word probability using at least some of the feature amounts of the acquired keyword feature amount data as the specific voice feature amounts.

[0069] With this configuration, the invention according to claim 16 sets the unnecessary word probability using at least some of the feature amounts of the stored keyword feature amount data as the specific voice feature amounts.

[0070] Therefore, the unnecessary word probability can be set from a small amount of data, without presetting and storing the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily and quickly.

[0071] The invention according to claim 17 is the speech recognition program according to any one of claims 13 to 16, which causes the computer to function as: extraction means for analyzing the uttered speech at each preset unit time to extract the uttered-speech feature amount; calculation means for calculating the keyword probability for each unit time; setting means for setting the unnecessary word probability for each unit time; and determination means for determining the keyword to be recognized in the uttered speech based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[0072] With this configuration, the invention according to claim 17 determines the keyword to be recognized in the uttered speech based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[0073] Therefore, the keyword probability and the unnecessary word probability can be calculated for each speech sound of the uttered speech, for example in phoneme units or speech units. In addition, for example, by using, as the plurality of preset specific voice feature amounts, voice feature amounts that make up typical unnecessary words, such as vowels, or some of the feature amounts of the plurality of keyword feature amount data, the unnecessary word probability can be set from a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate it. This reduces the processing load of calculating the unnecessary word probability, so the keywords contained in the uttered speech can be recognized easily, quickly, and accurately.

[0074] The invention according to claim 18 is the speech recognition program according to any one of claims 13 to 17, which causes the computer to function as determination means for calculating, based on the calculated keyword probabilities and the unnecessary word probability for the unit time, a combination probability indicating the probability of each combination of the unnecessary word with each keyword indicated by the acquired keyword feature amount data, and for determining the keyword to be recognized in the uttered speech based on the combination probability.

[0075] With this configuration, the invention according to claim 18 calculates the combination probability of each keyword and the unnecessary word based on the calculated keyword probabilities and the unnecessary word probability for the unit time, and determines the keyword to be recognized in the uttered speech based on the combination probability.

[0076] Therefore, the keyword contained in the uttered speech can be determined while taking into account each combination of the unnecessary word and each keyword, so the keyword contained in the uttered speech can be recognized easily and accurately, and misrecognition can be prevented.

[0077]

BEST MODE FOR CARRYING OUT THE INVENTION Next, a preferred embodiment of the present invention will be described with reference to the drawings.

[0078] The embodiment described below is one to which the speech recognition device according to the present invention is applied.

[0079] First, the spoken language model using HMMs in this embodiment will be described with reference to FIG. 1.

[0080] FIG. 1 is a diagram of the spoken language model, showing the recognition network using HMMs in this embodiment.

[0081] This embodiment assumes a model representing a recognition network using HMMs as shown in FIG. 1 (hereinafter referred to as a spoken language model), that is, a spoken language model 10 that contains the keywords to be recognized.

[0082] The spoken language model 10 has a configuration in which models representing the units that make up unnecessary words, called garbage models (hereinafter referred to as unnecessary word component models) 12a and 12b, are connected before and after a keyword model 11. Keywords contained in the uttered speech are matched to the keyword model 11, and unnecessary words are matched to the unnecessary word component models 12a and 12b, so that keywords and unnecessary words are distinguished and the keywords contained in the uttered speech are recognized.

[0083] In practice, the keyword model 11 and the unnecessary word component models 12a and 12b each represent a set of states that transition at arbitrary sections of the uttered speech, and are expressed by HMMs, which are statistical signal-source models that represent a non-stationary signal source as a concatenation of stationary signals.

[0084] The HMM of the keyword model 11 (hereinafter referred to as the keyword HMM) and the HMMs of the unnecessary word component models 12a and 12b (hereinafter referred to as unnecessary word component HMMs) each have two parameters: a state transition probability, which indicates the probability of a transition from one state to another, and an output probability, which gives the probability of the vector (the feature vector of each frame) observed when a state transition occurs. These parameters represent the feature pattern of each keyword and the feature pattern of the unnecessary word components.
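The two HMM parameters named above, the state transition probability and the output probability, can be illustrated with a small data structure. This is a sketch under stated assumptions: the text does not specify the form of the output distribution, so a diagonal-covariance Gaussian per state is assumed here, and the names `State`, `Hmm`, and `vowel_a` are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class State:
    """One HMM state with an assumed diagonal-Gaussian output distribution."""
    mean: List[float]
    var: List[float]

    def log_output_prob(self, x: List[float]) -> float:
        # Log density of feature vector x under the diagonal Gaussian.
        return sum(
            -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
            for xi, m, v in zip(x, self.mean, self.var)
        )


@dataclass
class Hmm:
    states: List[State]
    trans: List[List[float]]  # trans[i][j] = P(next state j | current state i)

    def validate(self, tol: float = 1e-9) -> None:
        n = len(self.states)
        if len(self.trans) != n:
            raise ValueError("need one transition row per state")
        for row in self.trans:
            if len(row) != n or abs(sum(row) - 1.0) > tol:
                raise ValueError("each row needs one entry per state and must sum to 1")


# Hypothetical two-state, left-to-right model over two-dimensional features.
vowel_a = Hmm(
    states=[State([0.0, 0.0], [1.0, 1.0]), State([1.0, -1.0], [0.5, 0.5])],
    trans=[[0.6, 0.4], [0.0, 1.0]],
)
vowel_a.validate()
```

A keyword HMM and an unnecessary word component HMM would share this structure and differ only in their trained parameter values.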

[0085] In general, uttered speech exhibits acoustic variation from many causes even for the same word or syllable, so the speech sounds that make up an utterance vary greatly from speaker to speaker. The same speech sound, however, is characterized mainly by its spectral envelope and the way that envelope changes over time, and the stochastic nature of the time-series pattern of such variation can be expressed precisely by an HMM.

[0086] Therefore, in this embodiment, as described later, similarity calculation and matching are performed between the feature amounts of the input uttered speech and each keyword HMM and unnecessary word component HMM, whereby the keywords contained in the uttered speech are recognized.

[0087] In this embodiment, each HMM represents a probability model that holds, as the feature pattern of each keyword and the feature amounts of the unnecessary word components, either spectral envelope data indicating the power at each frequency at fixed time intervals, or cepstrum data obtained by taking the logarithm of that power spectrum and applying an inverse Fourier transform.
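The cepstrum computation mentioned above (logarithm of the power spectrum followed by an inverse Fourier transform) can be sketched with the standard library alone. The naive O(n²) transform, the function names `dft` and `cepstrum`, and the `floor` parameter that guards against taking the logarithm of zero are all choices made for this illustration; a practical front end would normally apply a window and use an FFT.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [v / n for v in out] if inverse else out


def cepstrum(frame, floor=1e-12):
    """Power cepstrum of one frame: inverse DFT of the log power spectrum."""
    power = [abs(v) ** 2 for v in dft(frame)]
    log_power = [math.log(max(p, floor)) for p in power]
    return [v.real for v in dft(log_power, inverse=True)]


# A unit impulse has a flat power spectrum, so its cepstrum is zero everywhere.
impulse_cepstrum = cepstrum([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
```

The low-order cepstral coefficients describe the smooth spectral envelope, which is why they are a common choice of per-frame feature vector.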

[0088] These HMMs are generated by acquiring in advance speech data for each phoneme uttered by a plurality of people, extracting a feature pattern for each phoneme, and training on the feature pattern data of each phoneme based on those patterns. The HMMs generated in this way are stored in advance in the respective databases.

[0089] In this embodiment, a plurality of representative unnecessary word component HMMs are used as the unnecessary word component models 12a and 12b, and the matching is performed using these unnecessary word component models 12a and 12b.

[0090] For example, HMMs of only the vowels "a", "i", "u", "e", and "o", or the keyword component HMMs described later, are used as the plurality of representative unnecessary word component HMMs, and the matching is performed against these unnecessary word component HMMs.

[0091] The unnecessary word component HMMs and the matching performed with them will be described in detail later.

[0092] When such HMMs are used to recognize keywords contained in speech such as uttered speech, the speech to be recognized is divided at predetermined fixed time intervals, and the keyword to be recognized is determined by calculating, based on matching against the prestored data of each HMM, the probability of a change from each divided state to the next state.

[0093] Specifically, the feature pattern of each HMM is compared with the feature amount of each fixed-length speech section of the uttered speech, each section representing an arbitrary state, to calculate a similarity (corresponding to the keyword probability and the unnecessary word probability of the present invention) indicating the degree of match between the HMM feature pattern and the feature amount of the speech section. The matching described later is then performed based on this calculated similarity and on a value preset as the similarity between the voice feature amount of each speech section and the feature amount of an unnecessary word, assuming that the section is an unnecessary word. A cumulative similarity indicating the probability of every possible concatenation of HMMs, that is, every concatenation of keywords and unnecessary words, is calculated, and the HMM concatenation with the highest similarity is recognized as the language of the uttered speech.
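Accumulating similarities along the best concatenation of states is commonly done with the Viterbi algorithm. The sketch below uses that algorithm as an illustration; the text above does not name it, and the three-state network (unnecessary word, keyword, unnecessary word), the log-domain scores, and the function name `viterbi` are assumptions made for this example.

```python
import math

NEG_INF = float("-inf")


def viterbi(log_init, log_trans, log_emit):
    """Return (best cumulative log score, best state path).

    log_emit[t][s] is the log similarity of frame t to state s.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        prev = score
        score, pointers = [], []
        for s in range(n_states):
            best_p = max(range(n_states), key=lambda p: prev[p] + log_trans[p][s])
            score.append(prev[best_p] + log_trans[best_p][s] + log_emit[t][s])
            pointers.append(best_p)
        back.append(pointers)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return score[last], path


# States: 0 = leading unnecessary word, 1 = keyword, 2 = trailing unnecessary word.
half = math.log(0.5)
log_init = [0.0, NEG_INF, NEG_INF]
log_trans = [
    [half, half, NEG_INF],
    [NEG_INF, half, half],
    [NEG_INF, NEG_INF, 0.0],
]
# Hypothetical per-frame log similarities for a four-frame utterance.
log_emit = [
    [-1.0, -5.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -1.0, -5.0],
    [-5.0, -5.0, -1.0],
]
best_score, best_path = viterbi(log_init, log_trans, log_emit)
```

In this toy network the recovered path labels the two middle frames as the keyword and the outer frames as unnecessary words, mirroring the structure of the spoken language model 10.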

[0094] Next, the configuration of the speech recognition device of this embodiment will be described with reference to FIG. 2.

[0095] FIG. 2 is a block diagram outlining the configuration of an embodiment of the word-spotting speech recognition device according to the present invention.

[0096] As shown in FIG. 2, the speech recognition device 100 includes: a microphone 101 that receives the uttered speech to be recognized; a low-pass filter (hereinafter referred to as the LPF) 102; an analog/digital converter (hereinafter referred to as the A/D converter) 103 that converts the speech signal output from the microphone 101 into a digital signal; an input processing unit 104 that cuts out the speech signal of the uttered-speech portion from the digitized speech signal and divides it into frames at preset time intervals; a speech analysis unit 105 that extracts the feature amount of the speech signal for each frame; an HMM model database 106 that stores in advance keyword HMMs indicating the feature patterns of the plurality of keywords to be recognized and HMMs of specific voices (hereinafter referred to as specific voice HMMs) used to calculate the unnecessary word similarity described later; a similarity calculation unit 107 that calculates the similarity of the feature amount of each frame based on the extracted per-frame feature amounts and the stored HMMs; an unnecessary word similarity setting unit 108 that sets, based on the calculated similarities to the specific voice HMMs, the unnecessary word similarity for the case where the frame corresponds to an unnecessary word; a matching processing unit 109 that performs the matching described later based on the calculated per-frame similarities; and a determination unit 110 that determines the keywords contained in the uttered speech based on the matching.
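The frame division performed by the input processing unit 104 can be pictured with the short sketch below. The text states only that the signal is divided into frames at preset time intervals, so the separate `frame_len` and `hop` parameters (which allow overlapping frames) and the function name `split_into_frames` are assumptions of this illustration; setting `hop` equal to `frame_len` gives non-overlapping frames.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sample sequence into frames of frame_len samples, one every hop samples.

    Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


# Ten hypothetical samples, four samples per frame, a new frame every two samples.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
```

Each resulting frame would then be passed to the speech analysis unit 105 for feature extraction.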

[0097] The input processing unit 104 and the speech analysis unit 105 constitute the extraction means according to the present invention, and the HMM model database 106 constitutes the storage means according to the present invention.

【００９８】[0098] The similarity calculation unit 107 constitutes the calculation means, the setting means, the specific voice probability calculation means, and the acquisition means according to the present invention, and the unnecessary word similarity setting unit 108 constitutes the setting means and the unnecessary word probability setting means according to the present invention.

【００９９】[0099] Further, the matching processing unit 109 and the determination unit 110 constitute the determining means according to the present invention.

【０１００】[0100] Uttered speech is input to the microphone 101, which generates a voice signal based on the input speech and outputs it to the LPF 102.

【０１０１】[0101] The voice signal generated by the microphone 101 is input to the LPF 102, which removes the high-frequency components from the input voice signal and outputs the resulting voice signal to the A/D conversion unit 103.

【０１０２】[0102] The voice signal from which the high-frequency components have been removed by the LPF 102 is input to the A/D conversion unit 103, which converts the input voice signal from an analog signal into a digital signal and outputs the digitized voice signal to the input processing unit 104.

【０１０３】[0103] The digitized voice signal is input to the input processing unit 104, which cuts out the voice signal representing the voice section of the uttered-speech portion of the input digital signal, divides the voice signal of the cut-out voice section into frames at preset time intervals, and outputs the frames to the voice analysis unit 105.

【０１０４】[0104] For example, the input processing unit 104 divides the signal so that each frame spans a time interval of about 10 ms to 20 ms.
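The frame division described above can be illustrated with a minimal Python sketch. It assumes non-overlapping frames and drops a trailing partial frame; neither choice is stated in the specification, and the function name `split_into_frames` and its parameters are hypothetical.

```python
def split_into_frames(samples, sample_rate_hz, frame_ms=10):
    """Split a sample sequence into consecutive, non-overlapping frames.

    Assumption: frames do not overlap and a trailing partial frame is
    discarded so that every frame has the same length.
    """
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# One second of dummy samples at 16 kHz, divided into 10 ms frames.
frames = split_into_frames(list(range(16000)), sample_rate_hz=16000, frame_ms=10)
```

With a 16 kHz sampling rate, a 10 ms frame holds 160 samples, so one second of speech yields 100 frames.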

【０１０５】[0105] The frame-divided voice signal is input to the voice analysis unit 105, which analyzes the voice signal of each input frame, extracts the feature quantity of the voice signal for each frame, and outputs it to the similarity calculation unit 107.

【０１０６】[0106] Specifically, for each frame, the voice analysis unit 105 extracts as the feature quantity either spectral envelope information indicating the power at each frequency at fixed time intervals, or cepstrum information obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform. It then vectorizes the extracted feature quantity and outputs it to the similarity calculation unit 107.
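The cepstrum extraction described above (logarithm of the power spectrum followed by an inverse Fourier transform) can be sketched as follows. This is an illustration only: it uses a naive O(n²) transform from the standard library for self-containment, and the function names, the number of retained coefficients, and the small floor that guards against log(0) are assumptions not found in the specification.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(frame, n_coeffs=12, floor=1e-10):
    """Return the first n_coeffs cepstral coefficients of one frame."""
    power = [abs(c) ** 2 for c in dft(frame)]
    log_power = [math.log(max(p, floor)) for p in power]  # floor avoids log(0)
    return [c.real for c in idft(log_power)[:n_coeffs]]


# Feature vector for a frame holding a single scaled impulse.
features = real_cepstrum([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], n_coeffs=4)
```

A production system would normally use an FFT and apply a window to each frame before the transform; both are omitted here to keep the sketch short.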

【０１０７】[0107] The HMM model database 106 stores in advance the keyword HMMs, which represent pattern data of the feature quantities of the keywords to be recognized, and the pattern data of the specific voice HMMs used to calculate the unnecessary word similarity.

【０１０８】[0108] The stored data of the plural keyword HMMs represent the feature-quantity patterns of the plural recognition target words to be recognized.

【０１０９】[0109] For example, for use in a vehicle-mounted navigation device, the HMM model database 106 stores keyword HMMs representing the feature-quantity patterns of voice signals for the names of destinations the vehicle is heading to, names of current positions, and names of facilities such as restaurants.

【０１１０】[0110] In the present embodiment, the HMM representing the feature-quantity pattern of each keyword represents, as described above, a probability model having spectral envelope data indicating the power at each frequency at fixed time intervals, or cepstrum data obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform.

【０１１１】[0111] A keyword such as "genzaichi" (current location) or "mokutekichi" (destination) normally consists of plural syllables or phonemes. In the present embodiment, therefore, one keyword HMM is made up of plural keyword component HMMs, and the similarity calculation unit 107 calculates the similarity to the feature quantity of each frame for each keyword component HMM.

【０１１２】[0112] Thus, the HMM model database 106 stores each keyword HMM of the keywords to be recognized, that is, the keyword component HMMs.

【０１１３】[0113] The HMM model database 106 also stores, as plural preset specific voice feature quantities, HMMs (hereinafter, specific voice HMMs) of the voice feature quantity data (hereinafter, specific voice feature quantity data) of the vowels that make up typical unnecessary words.

【０１１４】[0114] For example, since each syllable normally contains a vowel even in an unnecessary word, the HMM model database 106 stores specific voice HMMs representing the feature-quantity patterns of the voice signals of the vowels "a", "i", "u", "e", and "o", and the similarity calculation unit 107 performs matching against these specific voice HMMs.

【０１１５】[0115] The vectorized feature quantity of each frame is input to the similarity calculation unit 107. The similarity calculation unit 107 compares the input feature quantity of each frame with the feature quantities of the keyword HMMs and of the specific voice feature quantity data models (corresponding to the specific voice feature quantity of the present application) stored in the HMM model database 106, and calculates the probability that each frame represents a keyword HMM or a specific voice HMM stored in the HMM model database 106, that is, the similarity between each input frame and each HMM. It outputs the calculated similarities to the specific voice HMMs to the unnecessary word similarity setting unit 108, and the calculated similarities to the keyword HMMs to the matching processing unit 109.

【０１１６】[0116] Specifically, the similarity calculation unit 107 calculates the output probability that each frame represents each keyword component HMM and each specific voice HMM. It also calculates the state transition probabilities that the state transition from any frame to the next frame is a transition from a keyword component HMM to another keyword component HMM or to a specific voice HMM, or from a specific voice HMM to another specific voice HMM or to a keyword component HMM. It outputs these probabilities as similarities to the unnecessary word similarity setting unit 108 and the matching processing unit 109.

【０１１７】[0117] The state transition probabilities include those for a transition from a keyword component HMM to itself, and from a specific voice HMM to itself.

【０１１８】[0118] The similarity calculation unit 107 outputs the output probabilities and state transition probabilities calculated for each frame, as the similarities of that frame, to the unnecessary word similarity setting unit 108 and the matching processing unit 109.

【０１１９】[0119] The output probabilities and state transition probabilities calculated for each frame on the basis of the specific voice HMMs are input to the unnecessary word similarity setting unit 108, which calculates the averages of the input output probabilities and state transition probabilities and outputs these averages to the matching processing unit 109 as the unnecessary word similarity.

【０１２０】[0120] For example, when the specific voice HMMs are the HMMs of the feature quantities of the vowels "a", "i", "u", "e", and "o", the unnecessary word similarity setting unit 108 averages, for each frame, the output probabilities and the state transition probabilities over the vowel HMMs, and outputs the averaged output probability and state transition probability to the matching processing unit 109 as the unnecessary word similarity of that frame.
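The per-frame averaging over the vowel HMMs can be sketched as below. The data layout (a dictionary mapping each vowel to a pair of output probability and state transition probability) and the function name are assumptions made for illustration; the probability values are invented and carry no meaning.

```python
def unnecessary_word_similarity(vowel_scores):
    """Average the per-vowel probabilities of a single frame.

    vowel_scores maps a vowel label to (output_probability,
    state_transition_probability) as computed against that vowel's HMM.
    Returns the averaged pair, used as the frame's unnecessary word similarity.
    """
    if not vowel_scores:
        raise ValueError("at least one specific voice HMM score is required")
    n = len(vowel_scores)
    avg_output = sum(out for out, _ in vowel_scores.values()) / n
    avg_transition = sum(trans for _, trans in vowel_scores.values()) / n
    return avg_output, avg_transition


# Illustrative (made-up) probabilities for one frame.
frame_scores = {
    "a": (0.5, 0.4),
    "i": (0.25, 0.2),
    "u": (0.125, 0.2),
    "e": (0.0625, 0.1),
    "o": (0.0625, 0.1),
}
avg_out, avg_trans = unnecessary_word_similarity(frame_scores)
```

Because only five vowel models are scored per frame, the cost of obtaining the unnecessary word similarity does not grow with the number of unnecessary words a speaker might utter.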

【０１２１】[0121] The output probabilities and state transition probabilities for each frame calculated by the similarity calculation unit 107 and the unnecessary word similarity setting unit 108 are input to the matching processing unit 109. On the basis of these input probabilities, the matching processing unit 109 performs matching processing that calculates the cumulative similarity (the combination probability of the present invention) indicating the similarity of each combination of a keyword model and unnecessary words, and outputs the calculated cumulative similarities to the determination unit 110.

【０１２２】[0122] Specifically, the matching processing unit 109 uses the unnecessary word similarity output from the unnecessary word similarity setting unit 108 as the similarity, under the assumption that the frame is an unnecessary word component, between the feature quantity of the voice component of the frame and that of an unnecessary word component. By accumulating this unnecessary word similarity and the keyword similarities calculated by the similarity calculation unit 107 frame by frame, it calculates the cumulative similarity of every combination of a keyword and unnecessary words and, as described later, yields one cumulative similarity per keyword.

【０１２３】[0123] The details of the matching processing performed by the matching processing unit 109 are described later.

【０１２４】[0124] The cumulative similarity of each keyword calculated by the matching processing unit 109 is input to the determination unit 110. The determination unit 110 normalizes each input cumulative similarity by the word length of its keyword, that is, by the temporal length of the keyword to which the cumulative similarity relates, determines the keyword with the highest normalized similarity to be the keyword contained in the uttered speech, and outputs this keyword to the outside as the recognition result.

【０１２５】[0125] In doing so, the determination unit 110 also includes the cumulative similarity of the unnecessary word similarity alone among the candidates. If this unnecessary-word-only cumulative similarity is the highest of the input cumulative similarities, the determination unit 110 determines that the uttered speech contained no keyword and outputs this result to the outside.
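The decision rule of the determination unit 110 can be sketched as follows. Several points are assumptions rather than statements of the specification: cumulative similarities are treated as log-domain scores (so that dividing by a length is a meaningful normalization), the length is counted in frames, the unnecessary-word-only candidate is normalized by its own length, and the names `decide_keyword` and `GARBAGE` are hypothetical.

```python
GARBAGE = "<unnecessary-only>"  # hypothetical label for the keyword-free candidate


def decide_keyword(cumulative, lengths):
    """Pick the keyword with the highest length-normalized cumulative score.

    cumulative: candidate label -> cumulative (log-domain) similarity
    lengths:    candidate label -> length used for normalization (frames)
    Returns the winning keyword, or None when the unnecessary-word-only
    candidate scores highest, meaning no keyword was spoken.
    """
    normalized = {label: cumulative[label] / lengths[label] for label in cumulative}
    best = max(normalized, key=normalized.get)
    return None if best == GARBAGE else best


# Illustrative (made-up) scores.
scores = {"genzaichi": -30.0, "mokutekichi": -48.0, GARBAGE: -40.0}
frame_counts = {"genzaichi": 10, "mokutekichi": 12, GARBAGE: 10}
winner = decide_keyword(scores, frame_counts)
```

Normalizing by length keeps a long keyword from being penalized merely because more per-frame scores were accumulated for it.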

【０１２６】[0126] Next, the matching processing performed by the matching processing unit 109 of the present embodiment is described.

【０１２７】[0127] The matching processing of the present embodiment uses the Viterbi algorithm, which calculates the cumulative similarity of each combination of a keyword model and the preset unnecessary word similarity.

【０１２８】[0128] The Viterbi algorithm calculates the cumulative similarity on the basis of the output probability of being in each state and the transition probability of moving from each state to another state, and, after calculating the cumulative similarity, outputs the combination for which that cumulative similarity was calculated.

【０１２９】[0129] In general, the cumulative similarity is calculated by computing the Euclidean distance between the state represented by the feature quantity of each frame and the state of the feature quantity represented by the HMM, and accumulating this distance.

【０１３０】[0130] Specifically, the Viterbi algorithm computes the cumulative probability along each path representing a transition from an arbitrary state i to the next state j, and by this cumulative probability calculation extracts each path along which state transitions are possible, that is, the connections and combinations of HMMs.

【０１３１】[0131] In the present embodiment, on the basis of the output probabilities and state transition probabilities calculated by the similarity calculation unit 107 and the unnecessary word similarity setting unit 108, the output probability and state transition probability for the case where each frame is a keyword model, and those for the case where each frame is an unnecessary word component model, are applied in order from the first divided frame of the input speech to the last. The cumulative probability from the first divided frame to the last is calculated for arbitrary combinations of keyword models and unnecessary word component models, and for each keyword model the combination with the highest cumulative similarity is output, one per keyword model, to the determination unit 110.

【０１３２】[0132] For example, when the keywords to be recognized are "genzaichi" (current location) and "mokutekichi" (destination) and the input speech is "eetto, genzaichi" ("um, current location"), the matching processing of the present embodiment proceeds as follows.

【０１３３】[0133] Here it is assumed that the unnecessary word is "eetto" ("um"), that the unnecessary word similarity has been set in advance, and that the HMM model database 106 stores an HMM for each syllable of "genzaichi" and "mokutekichi".

【０１３４】[0134] It is also assumed that the output probabilities and state transition probabilities calculated by the similarity calculation unit 107 and the unnecessary word similarity setting unit 108 have already been input to the matching processing unit 109.

【０１３５】[0135] In this case, in the present embodiment, the Viterbi algorithm calculates, for each of the keywords "genzaichi" and "mokutekichi", the cumulative similarity of every combination with the unnecessary word similarity, on the basis of the unnecessary word similarity, the output probabilities, and the state transition probabilities.

【０１３６】[0136] For each keyword model, here for each of "genzaichi" and "mokutekichi", the Viterbi algorithm calculates the cumulative similarities of all combination patterns simultaneously, frame by frame, starting from the first frame of the uttered speech.

【０１３７】[0137] In the course of calculating the cumulative similarity of each combination for each keyword, the Viterbi algorithm judges, for combination patterns whose cumulative similarity is low, that the uttered speech does not match that pattern, and stops calculating their cumulative similarity partway through.

【０１３８】[0138] Specifically, for the first divided frame, either the similarity for the case where the frame represents the HMM of "ge", which is a keyword component HMM of the keyword "genzaichi" (current location), or the set unnecessary word similarity for the case where the frame represents an unnecessary word, is added. Of these, the one with the higher cumulative similarity goes on to the calculation of the cumulative similarity for the next divided frame.

【０１３９】[0139] In this example, the unnecessary word similarity is higher than the similarity of the keyword component HMM of "ge", so the subsequent calculation of the cumulative similarity for "ge", that is, for the pattern "genzaichi○○○○" (where ○ denotes the unnecessary word similarity), is terminated.

【０１４０】[0140] As a result, this matching processing yields one cumulative similarity for each of the keywords "genzaichi" and "mokutekichi".
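The matching walked through above can be sketched as a small Viterbi recursion for one keyword. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: scores are log-domain so that accumulation is a sum, state transition probabilities are omitted (treated as uniform), each keyword component is a single state, and unnecessary-word frames may occur only before and after the keyword. All names are hypothetical.

```python
NEG_INF = float("-inf")


def keyword_cumulative_score(component_logp, garbage_logp):
    """Best cumulative log score of 'unnecessary* keyword unnecessary*'.

    component_logp[t][k]: log score of frame t for keyword component k
    garbage_logp[t]:      unnecessary word log score of frame t
    """
    n_frames = len(garbage_logp)
    n_states = len(component_logp[0]) + 2  # leading filler, components, trailing filler

    def emit(t, s):
        if s == 0 or s == n_states - 1:
            return garbage_logp[t]
        return component_logp[t][s - 1]

    # A path may start in the leading filler or directly in the first component.
    score = [NEG_INF] * n_states
    score[0] = emit(0, 0)
    score[1] = emit(0, 1)
    for t in range(1, n_frames):
        new = [NEG_INF] * n_states
        for s in range(n_states):
            # Either stay in the same state or advance from the previous one;
            # keeping only the better predecessor discards the weaker path.
            prev = score[s] if s == 0 else max(score[s], score[s - 1])
            if prev > NEG_INF:
                new[s] = prev + emit(t, s)
        score = new
    # A valid path ends in the last component or in the trailing filler.
    return max(score[n_states - 2], score[n_states - 1])


# Four frames, a two-component keyword, made-up log scores:
# frame 0 and frame 3 look like unnecessary words, frames 1 and 2 like the keyword.
components = [[-5.0, -5.0], [-0.5, -5.0], [-5.0, -0.5], [-5.0, -5.0]]
garbage = [-1.0, -2.0, -2.0, -1.0]
best = keyword_cumulative_score(components, garbage)
```

Running the function once per keyword yields one cumulative score per keyword, which corresponds to the single cumulative similarity per keyword that is passed to the determination unit 110.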

【０１４１】[0141] Next, the keyword recognition processing of the present embodiment is described with reference to FIG. 3.

【０１４２】[0142] FIG. 3 is a flowchart showing the operation of the keyword recognition processing of the present embodiment.

【０１４３】[0143] First, an operation unit or control unit (not shown) instructs each unit to start the keyword recognition processing. When uttered speech is input to the microphone 101 (step S11), it is input to the input processing unit 104 via the LPF 102 and the A/D conversion unit 103. The input processing unit 104 cuts out the voice signal of the uttered-speech portion from the input voice signal (step S12), divides it into frames at preset time intervals, and outputs the voice signal to the voice analysis unit 105 frame by frame, starting from the first frame (step S13).

【０１４４】[0144] The following processing is then performed for each frame.

【０１４５】[0145] First, the control unit (not shown) determines whether the frame input to the voice analysis unit 105 is the last divided frame (step S14). If it is, processing proceeds to step S20; if it is not, the following operations are performed.

【０１４６】[0146] First, the voice analysis unit 105 extracts the feature quantity of the voice signal of the input frame and outputs the extracted feature quantity of this frame to the similarity calculation unit 107 (step S15).

【０１４７】[0147] Specifically, on the basis of the voice signal of each frame, the voice analysis unit 105 extracts as the feature quantity either spectral envelope information indicating the power at each frequency at fixed time intervals, or cepstrum information obtained by taking the logarithm of this power spectrum and applying an inverse Fourier transform. It then vectorizes the feature quantity and outputs it to the similarity calculation unit 107.

【０１４８】[0148] Next, the similarity calculation unit 107 compares the feature quantity of the input frame with the feature quantities of the keyword HMMs and with the specific voice HMMs stored in the HMM model database 106, and calculates, as described above, the output probability and state transition probability of the frame for each HMM. It outputs the output probabilities and state transition probabilities for the specific voice HMMs to the unnecessary word similarity setting unit 108, and those for the keyword HMMs to the matching processing unit 109 (step S16).

【０１４９】[0149] Next, the unnecessary word similarity setting unit 108 sets the unnecessary word similarity on the basis of the input output probabilities and state transition probabilities for the specific voice HMMs (step S17).

【０１５０】[0150] For example, when the specific voice HMMs are the HMMs of the feature quantities of the vowels "a", "i", "u", "e", and "o", the unnecessary word similarity setting unit 108 averages, for each frame, the output probabilities and the state transition probabilities calculated from the feature quantity of that frame and the vowel HMMs, and outputs the averaged output probability and state transition probability to the matching processing unit 109 as the unnecessary word similarity of that frame.

【０１５１】[0151] Next, the matching processing unit 109 performs the matching processing described above on the basis of the output probabilities and state transition probabilities calculated by the similarity calculation unit 107 and the output probability and state transition probability calculated by the unnecessary word similarity setting unit 108, and calculates the cumulative similarity for each keyword (step S18).

【０１５２】[0152] Specifically, the matching processing unit 109 adds the input similarity of each keyword HMM and the unnecessary word similarity to the cumulative similarities obtained up to the previous frame, and retains, for each keyword, only the highest cumulative similarity.

【０１５３】[0153] Next, input of the next frame is controlled in accordance with an instruction from the control unit (not shown) (step S19), and processing returns to step S14.

【０１５４】[0154] If, on the other hand, the control unit (not shown) determines that the frame is the last divided frame, the highest cumulative similarity calculated for each keyword is output to the determination unit 110, and the determination unit 110 normalizes the cumulative similarity of each keyword by its word length (step S20).

【０１５５】[0155] Finally, on the basis of the normalized similarity of each keyword, the determination unit 110 determines the keyword with the highest similarity to be the keyword contained in the uttered speech and outputs it to the outside (step S21), and the operation ends.
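The flow of steps S14 to S21 can be summarized as a per-frame loop. The sketch below only shows the control structure; the five callables stand in for the voice analysis unit 105, the similarity calculation unit 107, the unnecessary word similarity setting unit 108, the matching processing unit 109, and the determination unit 110, and their names and signatures are hypothetical. It assumes that every frame, including the last one, is processed before the final decision.

```python
def recognize(frames, extract, score_frame, set_garbage, update, decide):
    """Run the per-frame recognition loop and return the final decision."""
    state = None  # cumulative similarities carried from frame to frame
    for frame in frames:                                      # S14, S19: frame loop
        features = extract(frame)                             # S15: feature quantity
        keyword_scores, vowel_scores = score_frame(features)  # S16: similarities
        garbage = set_garbage(vowel_scores)                   # S17: unnecessary word similarity
        state = update(state, keyword_scores, garbage)        # S18: matching processing
    return decide(state)                                      # S20, S21: normalize and decide


# Toy stand-ins that only demonstrate how data flows through the loop.
result = recognize(
    [[1], [2], [3]],
    extract=lambda frame: frame[0],
    score_frame=lambda x: ({"kw": -float(x)}, {"a": -1.0}),
    set_garbage=lambda vowels: sum(vowels.values()) / len(vowels),
    update=lambda state, kw, g: (state or 0.0) + kw["kw"] + g,
    decide=lambda state: state,
)
```

Passing the processing stages in as functions keeps the loop independent of how each stage is realized, which mirrors the division of the device into separate units.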

【０１５６】[0156] As described above, according to the present embodiment, for each frame-divided voice section, the similarity between the uttered speech feature quantity and the keyword feature quantity data is calculated, the unnecessary word similarity is set on the basis of specific feature quantity data such as plural vowels, and the keyword to be recognized contained in the uttered speech is determined from these similarities. The unnecessary word similarity can therefore be set from a small amount of data, without presetting the large amount of unnecessary word feature quantity data normally needed to calculate unnecessary word probabilities, and the processing load of calculating the unnecessary word similarity is reduced.

【０１５７】[0157] Furthermore, the unnecessary word similarity and the calculated similarities are accumulated frame by frame to calculate the cumulative similarity of each combination of them, and the keyword to be recognized contained in the uttered speech is determined on the basis of these cumulative similarities. The keyword contained in the uttered speech can therefore be determined while taking each combination of the unnecessary word similarity and the calculated similarities into account.

【０１５８】[0158] As a result, the keyword contained in the uttered speech can be recognized accurately, easily, and at high speed, and misrecognition can be prevented.

【０１５９】[0159] Also, when plural keywords are to be recognized in a single utterance in the present embodiment, the keywords contained in the uttered speech can be recognized still more easily and quickly, and misrecognition can be prevented.

【０１６０】[0160] For example, when two keywords are to be recognized, assume a spoken language model 20, using HMMs, that contains the keywords to be recognized, as shown in FIG. 4. If the word-length normalization is performed on the basis of the word lengths of the keyword models to be recognized, the two keywords can be recognized simultaneously.

【０１６１】[0161] That is, if the matching processing unit 109, instead of calculating a cumulative similarity for each keyword, calculates a cumulative similarity for each combination of all the keywords stored in the HMM model database 106, and the determination unit 110 performs the normalization using the sum of the word lengths of the keywords, then plural keywords can be recognized simultaneously, the keywords contained in the uttered speech can be recognized easily and at high speed, and misrecognition can be prevented.
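The normalization for keyword combinations can be sketched as follows. As before, this assumes log-domain cumulative similarities and lengths counted in frames; the function name, the data layout, and the score values are hypothetical and serve only to show that each combination is divided by the sum of its keywords' word lengths.

```python
def best_keyword_combination(combination_scores, word_lengths):
    """Return the keyword combination with the highest normalized score.

    combination_scores: tuple of keywords -> cumulative (log-domain) similarity
    word_lengths:       keyword -> word length (frames)
    """
    normalized = {
        combo: score / sum(word_lengths[word] for word in combo)
        for combo, score in combination_scores.items()
    }
    return max(normalized, key=normalized.get)


# Illustrative (made-up) scores for two orderings of two keywords.
lengths = {"genzaichi": 10, "mokutekichi": 12}
combos = {
    ("genzaichi", "mokutekichi"): -44.0,
    ("mokutekichi", "genzaichi"): -66.0,
}
best_combo = best_keyword_combination(combos, lengths)
```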

【０１６２】[0162] In the present embodiment, only the HMMs of the vowels "a", "i", "u", "e", and "o" are used as the specific voice HMMs. However, the keyword component HMMs described above may instead be used as the specific voice HMMs, and matching may be performed against them as unnecessary word component HMMs.

[0163] In this case, when the similarity calculation unit 107 has calculated the output probability and the state transition probability of each input frame for each keyword component HMM, it outputs these probability values to the unnecessary word similarity setting unit 108. The unnecessary word similarity setting unit 108 calculates the average of the highest-ranked values (for example, the top five) of the input output probabilities and of the input state transition probabilities, and outputs the resulting average output probability and average state transition probability to the matching processing unit 109 as the unnecessary word similarity.
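The top-ranked averaging described in [0163] can be sketched as below. This is a minimal sketch under stated assumptions: the function name `unnecessary_word_similarity`, the parameter `top_n`, and the sample values are hypothetical, and the handling of fewer than `top_n` values (averaging whatever is available) is a choice made for the example, not something the specification states.

```python
import heapq


def unnecessary_word_similarity(component_values, top_n=5):
    """Average the top_n highest values computed for one frame.

    component_values: per-component probabilities for a single frame, one
        value per keyword component HMM (hypothetical input layout).
    top_n: how many of the highest values to average (five in the example
        given in the text).
    """
    if not component_values:
        raise ValueError("component_values must not be empty")
    # nlargest returns at most top_n items, so shorter inputs are averaged
    # over however many values are present (an assumption of this sketch).
    top_values = heapq.nlargest(top_n, component_values)
    return sum(top_values) / len(top_values)


# Hypothetical output probabilities of one frame against seven components.
frame_output_probabilities = [0.10, 0.40, 0.05, 0.30, 0.20, 0.25, 0.15]
print(unnecessary_word_similarity(frame_output_probabilities))
```

The same function would be called once for the output probabilities and once for the state transition probabilities of each frame, since the text describes averaging both.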

[0164] As a result, as described above, the unnecessary word probability can be set with a small amount of data, without presetting the large amount of unnecessary word feature amount data that would otherwise be needed to calculate the unnecessary word similarity. The processing load of calculating the unnecessary word probability is therefore reduced, and the keywords included in the uttered voice can be recognized easily and at high speed.

[0165] In this embodiment, the keyword recognition processing is performed by the voice recognition device described above. Alternatively, the voice recognition device may be provided with a computer and a recording medium, a program for performing the keyword recognition processing may be stored on the recording medium, and the computer may read this keyword recognition processing program to perform the same keyword recognition processing as described above.

[0166] In this case, the recording medium may be a recording medium such as a DVD or a CD, and the voice recognition device 100 may be provided with a reading device that reads the program from the recording medium.

[0167]

[Effects of the Invention] As described above, according to the present invention, the unnecessary word probability can be set with a small amount of data, without presetting the large amount of unnecessary word feature amount data that is normally required to calculate the unnecessary word probability. The processing load of calculating the unnecessary word probability is therefore reduced, and the keywords included in the uttered voice can be recognized easily and at high speed.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing a spoken language model representing a recognition network using HMMs.

FIG. 2 is a block diagram showing the schematic configuration of an embodiment of a word spotting voice recognition device according to the present invention.

FIG. 3 is a flowchart showing the operation of the keyword recognition processing in an embodiment of the word spotting voice recognition device.

FIG. 4 is a diagram showing a spoken language model representing a recognition network using HMMs for recognizing two keywords.

FIG. 5 is a diagram showing a spoken language model representing a voice recognition network of a filler model.

[Explanation of Symbols]

100 ... Voice recognition device
101 ... Microphone
102 ... LPF
103 ... A/D conversion unit
104 ... Input processing unit (extraction means)
105 ... Voice analysis unit (extraction means)
106 ... HMM model database (storage means)
107 ... Similarity calculation unit (calculation means, setting means, specific voice probability calculation means, acquisition means)
108 ... Unnecessary word similarity setting unit (setting means, unnecessary word similarity setting means)
109 ... Matching processing unit (determination means)
110 ... Determination unit (determination means)

Continuation of front page (51) Int.Cl.⁷ / Identification code / FI / Theme code (reference): G10L 15/28

Claims

[Claims]

[Claim 1] A voice recognition device for recognizing a keyword included in an uttered voice, comprising:
extraction means for extracting an uttered voice feature amount, which is a feature amount of a voice component of the uttered voice, by analyzing the uttered voice;
storage means for storing in advance keyword feature amount data indicating feature amounts of voice components of one or more of the keywords;
calculation means for calculating a keyword probability, which is the probability that the voice in at least a part of the voice sections of the uttered voice is the keyword, based on the uttered voice feature amount extracted for that voice section and the keyword feature amount data stored in the storage means;
setting means for setting an unnecessary word probability, which indicates the probability that the voice in the voice section is an unnecessary word that does not constitute the keyword, based on the uttered voice feature amount extracted for at least a part of the voice sections of the uttered voice and a plurality of preset specific voice feature amounts each indicating a feature amount of a voice component; and
determination means for determining the keyword to be recognized included in the uttered voice based on the calculated keyword probability and the set unnecessary word probability.

[Claim 2] The voice recognition device according to claim 1, wherein the setting means includes:
specific voice probability calculation means for calculating specific voice probabilities, each indicating the probability that the uttered voice feature amount is the corresponding specific voice feature amount, based on the uttered voice feature amount extracted for at least a part of the voice sections of the uttered voice and the plurality of preset specific voice feature amounts each indicating a feature amount of a voice component; and
unnecessary word probability setting means for setting the unnecessary word probability based on the calculated specific voice probabilities.

[Claim 3] The voice recognition device according to claim 2, wherein the unnecessary word probability setting means sets the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[Claim 4] The voice recognition device according to any one of claims 1 to 3, wherein the setting means uses at least some of the feature amounts of the keyword feature amount data stored in the storage means as the specific voice feature amounts.

[Claim 5] The voice recognition device according to any one of claims 1 to 4, wherein:
the extraction means analyzes the uttered voice for each preset unit time to extract the uttered voice feature amount;
the calculation means calculates the keyword probability for each unit time;
the setting means sets the unnecessary word probability for each unit time; and
the determination means determines the keyword to be recognized included in the uttered voice based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[Claim 6] The voice recognition device according to any one of claims 1 to 5, wherein the determination means calculates, based on the calculated keyword probabilities and the unnecessary word probability in the unit time, a combination probability indicating the probability of each combination of the unnecessary word with each of the keywords indicated by the keyword feature amount data stored in the first storage means, and determines the keyword to be recognized included in the uttered voice based on the combination probability.

[Claim 7] A voice recognition method for recognizing a keyword included in an uttered voice, comprising:
an extraction processing step of extracting an uttered voice feature amount, which is a feature amount of a voice component of the uttered voice, by analyzing the uttered voice;
an acquisition processing step of acquiring in advance keyword feature amount data indicating feature amounts of voice components of one or more of the keywords;
a calculation processing step of calculating a keyword probability, which is the probability that the voice in at least a part of the voice sections of the uttered voice is the keyword, based on the uttered voice feature amount extracted for that voice section and the keyword feature amount data stored in the storage means;
a setting processing step of setting an unnecessary word probability, which indicates the probability that the voice in the voice section is an unnecessary word that does not constitute the keyword, based on the uttered voice feature amount extracted for at least a part of the voice sections of the uttered voice and a plurality of preset specific voice feature amounts each indicating a feature amount of a voice component; and
a determination processing step of determining the keyword to be recognized included in the uttered voice based on the calculated keyword probability and the set unnecessary word probability.

[Claim 8] The voice recognition method according to claim 7, wherein the setting processing step includes:
a specific voice probability calculation processing step of calculating specific voice probabilities, each indicating the probability that the uttered voice feature amount is the corresponding specific voice feature amount, based on the uttered voice feature amount extracted for at least a part of the voice sections of the uttered voice and the plurality of preset specific voice feature amounts each indicating a feature amount of a voice component; and
an unnecessary word probability setting processing step of setting the unnecessary word probability based on the calculated specific voice probabilities.

[Claim 9] The voice recognition method according to claim 8, wherein, in the unnecessary word probability setting processing step, the average of the specific voice probabilities calculated in the specific voice probability calculation processing step is set as the unnecessary word similarity.

[Claim 10] The voice recognition method according to any one of claims 7 to 9, wherein, in the setting processing step, at least some of the feature amounts of the keyword feature amount data acquired in the acquisition processing step are used as the specific voice feature amounts.

[Claim 11] The voice recognition method according to any one of claims 7 to 10, wherein:
in the extraction processing step, the uttered voice is analyzed for each preset unit time to extract the uttered voice feature amount;
in the calculation processing step, the keyword probability is calculated for each unit time;
in the setting processing step, the unnecessary word probability is set for each unit time; and
in the determination processing step, the keyword to be recognized included in the uttered voice is determined based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[Claim 12] The voice recognition method according to any one of claims 7 to 11, wherein, in the determination processing step, a combination probability indicating the probability of each combination of the unnecessary word with each of the keywords indicated by the keyword feature amount data acquired in the acquisition processing step is calculated based on the calculated keyword probabilities and the unnecessary word probability in the unit time, and the keyword to be recognized included in the uttered voice is determined based on the combination probability.

[Claim 13] A voice recognition program for recognizing, by a computer, a keyword included in an uttered voice, the program causing the computer to function as:
extraction means for extracting an uttered voice feature amount, which is a feature amount of a voice component of the uttered voice, by analyzing the uttered voice;
acquisition means for acquiring in advance keyword feature amount data indicating feature amounts of voice components of one or more of the keywords;
calculation means for calculating a keyword probability, which is the probability that the voice in at least a part of the voice sections of the uttered voice is the keyword, based on the uttered voice feature amount extracted for that voice section and the keyword feature amount data stored in the storage means;
setting means for setting an unnecessary word probability, which indicates the probability that the voice in the voice section is an unnecessary word that does not constitute the keyword, based on the uttered voice feature amount extracted for at least a part of the voice sections of the uttered voice and a plurality of preset specific voice feature amounts each indicating a feature amount of a voice component; and
determination means for determining the keyword to be recognized included in the uttered voice based on the calculated keyword probability and the set unnecessary word probability.

[Claim 14] The voice recognition program according to claim 13, causing the computer, when setting the unnecessary word probability, to function as:
specific voice probability calculation means for calculating specific voice probabilities, each indicating the probability that the uttered voice feature amount is the corresponding specific voice feature amount, based on the uttered voice feature amount extracted for at least a part of the voice sections of the uttered voice and the plurality of preset specific voice feature amounts each indicating a feature amount of a voice component; and
unnecessary word probability setting means for setting the unnecessary word probability based on the calculated specific voice probabilities.

[Claim 15] The voice recognition program according to claim 14, causing the computer to function as unnecessary word probability setting means for setting the average of the specific voice probabilities calculated by the specific voice probability calculation means as the unnecessary word similarity.

[Claim 16] The voice recognition program according to any one of claims 13 to 15, causing the computer to function as setting means for setting the unnecessary word probability by using at least some of the feature amounts of the acquired keyword feature amount data as the specific voice feature amounts.

[Claim 17] The voice recognition program according to any one of claims 13 to 16, causing the computer to function as:
extraction means for analyzing the uttered voice for each preset unit time to extract the uttered voice feature amount;
calculation means for calculating the keyword probability for each unit time;
setting means for setting the unnecessary word probability for each unit time; and
determination means for determining the keyword to be recognized included in the uttered voice based on the keyword probabilities calculated for each unit time and the unnecessary word probabilities set for each unit time.

[Claim 18] The voice recognition program according to any one of claims 13 to 17, causing the computer to function as determination means for calculating, based on the calculated keyword probabilities and the unnecessary word probability in the unit time, a combination probability indicating the probability of each combination of the unnecessary word with each of the keywords indicated by the acquired keyword feature amount data, and for determining the keyword to be recognized included in the uttered voice based on the combination probability.