JP2003271185A

JP2003271185A - Device and method for preparing information for voice recognition, device and method for recognizing voice, information preparation program for voice recognition, recording medium recorded with the program, voice recognition program and recording medium recorded with the program

Info

Publication number: JP2003271185A
Application number: JP2002071260A
Authority: JP
Inventors: Yasuhiro Minami (南 泰浩); Eric McDermott (マクダーモット エリック); Atsushi Nakamura (中村 篤); Shigeru Katagiri (片桐 滋)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-03-15
Filing date: 2002-03-15
Publication date: 2003-09-25

Abstract

PROBLEM TO BE SOLVED: To achieve highly accurate speech recognition when recognizing input speech with a hidden Markov model (HMM).

SOLUTION: During training, static and dynamic features are extracted from training speech, an HMM is trained and stored in a storage device, a trajectory for the training speech is generated from the HMM of the training speech and the relation between the static and dynamic features, and the variance of the training speech around the trajectory is computed and stored in the storage device. During recognition, static and dynamic features are extracted from the input speech, the stored HMM is used to recognize the input speech and obtain a plurality of candidates, a trajectory is generated for each candidate from the candidate's HMM and the relation between the static and dynamic features, and the score between each candidate's trajectory and the input speech is recalculated with reference to the stored variance, thereby re-evaluating the candidates.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to: a speech recognition information creation apparatus and method for creating the information used in speech recognition based on hidden Markov models; a speech recognition apparatus and method that use the information so created to recognize input speech according to a hidden Markov model; a speech recognition information creation program used to implement the creation method, and a recording medium on which that program is recorded; and a speech recognition program used to implement the recognition method, and a recording medium on which that program is recorded.

[0002]

2. Description of the Related Art

A conventional speech recognition method is described with reference to FIG. 7.

[0003] As shown in the figure, in the conventional method a feature extraction unit 10 computes the features (static features) of the input speech and also computes their dynamic features, for example the first and second derivatives of the features.

[0004] During training, the static and dynamic features computed for the training speech, which consists of, for example, phonemes, are sent to an acoustic model training unit 11. Taking the case where the dynamic features are the first and second derivatives of the features, the acoustic model training unit 11 computes, as the feature pattern, the mean and variance of the features, of their first derivatives, and of their second derivatives. It stores these on the structure of the hidden Markov model (HMM) defined for each unit of training speech (for example, each phoneme), thereby building an HMM database 12.

[0005] During recognition, the static and dynamic features computed for the input speech to be recognized are sent to a speech recognition unit 13. The speech recognition unit 13 reads the recognition target entries registered in a dictionary 14 one at a time, builds the HMM of each entry by combining the HMMs (stored in the HMM database 12) associated with the phonemes of that entry, computes the score between the input speech and the entry from the means and variances stored with the HMM so built, and outputs the entries with high scores as the recognition result.
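The conventional score computation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes diagonal-covariance Gaussians, a state alignment that is already known, and no transition probabilities, and the names `diag_gauss_loglik`, `sequence_score`, `states`, and `frames` are hypothetical.

```python
import math

def diag_gauss_loglik(x, mean, var):
    """Log-density of vector x under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def sequence_score(frames, alignment, states):
    """Sum of per-frame log-likelihoods along a fixed state alignment.

    frames    : list of feature vectors (static and dynamic parts concatenated)
    alignment : list of state ids, one per frame (for example from Viterbi)
    states    : dict mapping state id -> (mean vector, variance vector)
    """
    return sum(
        diag_gauss_loglik(x, *states[s]) for x, s in zip(frames, alignment)
    )

# Two hypothetical states and a three-frame utterance.
states = {"a": ([0.0, 0.0], [1.0, 1.0]), "b": ([2.0, 0.0], [1.0, 1.0])}
frames = [[0.1, 0.0], [1.9, 0.2], [2.1, -0.1]]
good = sequence_score(frames, ["a", "b", "b"], states)
bad = sequence_score(frames, ["a", "a", "a"], states)
```

Here `good` exceeds `bad` because the second alignment forces frames near 2.0 onto a state whose mean is 0.0. Note that the reference sequence of means is piecewise constant: it jumps from 0.0 to 2.0 at the state boundary.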

[0006]

Problems to Be Solved by the Invention

First, the concept of a trajectory is explained.

[0007] A trajectory is regarded here as a time series (pattern) of representative feature values of modeled speech. In a method such as DP matching, for example, it is the time series uttered by a plurality of speakers; in an HMM, it is the time series of mean values determined by the Viterbi algorithm.

[0008] In speech recognition, the score of the input speech (a distance or a score) is computed with reference to this trajectory. An HMM uses the Viterbi algorithm to efficiently find the trajectory, which is a time series of mean values, and computes the score between that trajectory and the input speech.

[0009] In finding this trajectory, the HMM has assumed independence between the features and their first derivatives, between the features and their second derivatives, and between the first and second derivatives.

[0010] In actual speech, however, fixed relations hold between the static features and the dynamic features (for example, the first and second derivatives of the features).

[0011] Conventional HMMs do not use these relations. As a result, with the conventional technique the time series of HMM means (the trajectory) that serves as the reference for score computation is not smooth at HMM state transitions.

[0012] Consequently, the conventional technique cannot obtain accurate recognition results. Moreover, the variances needed for score computation are themselves computed around these means, which is a further reason why accurate recognition results cannot be obtained.

[0013] The main reason conventional HMMs have not actively exploited the relations between the static and dynamic features of speech is that these relations could not be introduced into the Viterbi algorithm, the recognition procedure used with HMMs.

[0014] The present invention has been made in view of these circumstances. Its object is to provide a new technique that, in a configuration that recognizes input speech according to a hidden Markov model, achieves highly accurate speech recognition by generating trajectories using the relations that hold between the static and dynamic features of speech.

[0015]

Means for Solving the Problems

To achieve this object, the speech recognition information creation apparatus of the present invention (realized, for example, by the functions that operate when the speech recognition apparatus of the present invention runs in training mode) creates the information used in speech recognition based on hidden Markov models, and comprises: means for analyzing training speech to extract static and dynamic features; means for training a hidden Markov model from the static and dynamic features and storing it in an HMM storage device; means for generating a trajectory for the training speech using the trained hidden Markov model and the relation between the static and dynamic features; and means for computing the variance of the training speech around the generated trajectory and storing it in a variance storage device.

[0016] Each processing means of the speech recognition information creation apparatus of the present invention can be implemented as a computer program, and this program can be provided recorded on a recording medium such as a semiconductor memory.

[0017] The speech recognition apparatus of the present invention, for its part, recognizes input speech according to a hidden Markov model, and comprises: means for analyzing the input speech to extract static and dynamic features; means for referring to the HMM storage device built by the speech recognition information creation apparatus of the present invention (which stores the hidden Markov models of the training speech) to obtain the hidden Markov models to be compared with the input speech, and performing speech recognition on the input speech to obtain a plurality of candidates; means for generating a trajectory for each candidate using the candidate's hidden Markov model and the relation between the static and dynamic features; and means for referring to the variance storage device built by the speech recognition information creation apparatus of the present invention (which stores the variance of the training speech around the trajectories of the training speech) to obtain the variances around the candidates' trajectories, and computing the score between each candidate's trajectory and the input speech, thereby re-evaluating the candidates.

[0018] Each processing means of the speech recognition apparatus of the present invention can be implemented as a computer program, and this program can be provided recorded on a recording medium such as a semiconductor memory.

[0019] In the speech recognition information creation apparatus so configured, training speech consisting of, for example, phonemes is analyzed to extract static and dynamic features, and the hidden Markov model of the training speech is trained from the extracted features and stored in the HMM storage device.

[0020] Next, using the trained hidden Markov model of the training speech and the relation between the static and dynamic features, and using, for example, the Gaussian distribution time series obtained by speech recognition with the hidden Markov model, a trajectory is generated for the training speech. The variance of the training speech around that trajectory (the variances of the static and dynamic features) is then computed and stored in the variance storage device.

[0021] Given the HMM storage device and the variance storage device built in this way, the speech recognition apparatus of the present invention analyzes the input speech to extract static and dynamic features, refers to the hidden Markov models stored in the HMM storage device to obtain the models against which the input speech is to be scored, and uses the extracted static and dynamic features to compute the score between each of those models and the input speech, thereby recognizing the input speech and obtaining a plurality of candidates.

[0022] Next, using the hidden Markov models of those candidates and the relation between the extracted static and dynamic features, and using, for example, the Gaussian distribution time series obtained by speech recognition with the hidden Markov models, a trajectory is generated for each candidate.

[0023] Next, by referring to the variances stored in the variance storage device, the variances around the candidates' trajectories (the variances of the static and dynamic features) are obtained. Using these variances and the extracted static and dynamic features, the score between each candidate's trajectory and the input speech is recalculated, and the candidates are re-evaluated, for example by reordering them.

[0024] Thus, according to the present invention, generating trajectories that take into account the relations between the static and dynamic features of speech turns the unnatural score function of the conventional technique, which is built on a discontinuous time series of HMM means, into a natural one, and this makes highly accurate speech recognition possible.

[0025] The operation used in the present invention, generating a trajectory from a hidden Markov model while taking into account the relations between the static and dynamic features of speech, amounts to applying a low-pass filter to the time series of HMM means. The time series of HMM means should by nature move smoothly; the conventional technique nevertheless treats it as a discontinuous time series, whereas the present invention converts it into one that moves smoothly.

[0026]

BEST MODE FOR CARRYING OUT THE INVENTION

The present invention is described in detail below according to embodiments.

[0027] First, an outline of the processing executed by the speech recognition apparatus 1 of the present invention is given with reference to FIGS. 1 and 2.

[0028] FIG. 1 shows the functions of the speech recognition apparatus 1 of the present invention when it operates in training mode, and FIG. 2 shows its functions when it operates in recognition mode.

[0029] In FIGS. 1 and 2, 100 is a feature extraction unit, 101 is an acoustic model training unit, 102 is an HMM database, 103 is a trajectory synthesis unit, 104 is a variance calculation unit, 105 is a variance database, 106 is the inter-feature relational expression, 107 is a speech recognition unit, 108 is a dictionary, 109 is a trajectory resynthesis unit, and 110 is a score recalculation unit.

[0030] When operating in the training mode realized by the functions shown in FIG. 1, the speech recognition apparatus 1 of the present invention uses the feature extraction unit 100 to compute the features (static features) of training speech consisting of, for example, phonemes, and also computes their dynamic features (for convenience, the dynamic features are assumed below to be the first and second derivatives of the features).

[0031] Given these static and dynamic features, the acoustic model training unit 101 computes, as the feature pattern, the mean and variance of the features, of their first derivatives, and of their second derivatives. It stores these on the structure of the hidden Markov model (HMM) defined for each unit of training speech (for example, each phoneme), thereby building the HMM database 102.

[0032] The processing up to this point is basically the same as in the conventional technique.

[0033] After the HMM database 102 has been built, the trajectory synthesis unit 103 computes a trajectory for the training speech from the HMM of the training speech, subject to the inter-feature relational expression 106 that holds between the static and dynamic features.

[0034] The trajectory computed here is based on the time series of HMM means (the means of the static and dynamic features), but because it takes the relation between the static and dynamic features into account, it is a smooth and natural time series rather than a discontinuous one like the time series of HMM means.

[0035] As explained for the recognition mode, the speech recognition apparatus 1 of the present invention achieves highly accurate speech recognition by computing scores with reference to this trajectory. Computing these scores requires the variance that represents the spread of input features around the trajectory (the variances of the static and dynamic features), and this variance must be learned from the training speech in advance.

[0036] Accordingly, once a trajectory has been obtained for each item of training speech, the variance calculation unit 104 computes this variance for each HMM state from the trajectories and their corresponding training speech, and stores it in the variance database 105.

[0037] In this way, when operating in training mode, the speech recognition apparatus 1 of the present invention executes the processing shown in the flow of FIG. 3 to create the HMM database 102 and the variance database 105 needed for speech recognition.

[0038] In FIG. 3, 200 is a training speech file that stores the training speech, 201 is a feature file that stores the features, and 202 is a trajectory work file used for trajectory synthesis.

[0039] When operating in the recognition mode realized by the functions shown in FIG. 2, on the other hand, the speech recognition apparatus 1 of the present invention uses the feature extraction unit 100 to compute the features (static features) of the input speech to be recognized, and also computes their dynamic features.

[0040] Given these static and dynamic features, the speech recognition unit 107 reads the recognition target entries registered in the dictionary 108 one at a time, builds the HMM of each entry by combining the HMMs (stored in the HMM database 102) associated with the phonemes of that entry, computes the score between the input speech and the entry from the means and variances stored with the HMM so built, and selects the several top-scoring candidates.

[0041] Given the selected candidates, the trajectory resynthesis unit 109 computes a trajectory for each candidate from that candidate's HMM, subject to the inter-feature relational expression 106 that holds between the static and dynamic features.

[0042] The trajectory computed here is based on the time series of HMM means (the means of the static and dynamic features), but because it takes the relation between the static and dynamic features into account, it is a smooth and natural time series rather than a discontinuous one like the time series of HMM means.

[0043] Given these trajectories, the score recalculation unit 110 computes the score between each trajectory and the input speech using the variances around the trajectory stored in the variance database 105, and reorders the candidates to produce the final result.
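The rescoring step can be sketched as follows. The Gaussian form of the score is an assumption made for illustration (the patent's own score definition is given later as Equation 6), features are scalars, and the names `trajectory_score`, `rescore`, and the candidate labels are hypothetical.

```python
import math

def trajectory_score(frames, trajectory, variances):
    """Gaussian log-likelihood of the input frames around a trajectory."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - o) ** 2 / v)
        for x, o, v in zip(frames, trajectory, variances)
    )

def rescore(frames, candidates):
    """Re-rank candidates by trajectory score, best first.

    candidates : list of (label, trajectory, per-frame variances)
    """
    scored = [
        (trajectory_score(frames, traj, var), label)
        for label, traj, var in candidates
    ]
    return [label for _, label in sorted(scored, reverse=True)]

frames = [0.0, 0.4, 0.9]
candidates = [
    ("first-pass best", [0.0, 0.0, 1.0], [0.05, 0.05, 0.05]),
    ("first-pass second", [0.0, 0.5, 1.0], [0.05, 0.05, 0.05]),
]
ranking = rescore(frames, candidates)
```

In this toy input the candidate ranked second by the first pass has a trajectory that passes closer to the middle frame, so rescoring moves it to the top.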

[0044] In this way, when operating in recognition mode, the speech recognition apparatus 1 of the present invention executes the processing shown in the flow of FIG. 4, recognizing the input speech with trajectories that are smooth, natural time series rather than discontinuous ones.

[0045] Next, the processing executed by the speech recognition apparatus 1 of the present invention is described in detail.

[0046] First, the trajectory generation processing executed by the trajectory synthesis unit 103 shown in FIG. 1 is described in detail.

[0047] Suppose that the static feature time series of the input speech, the time series of its first derivatives, and the time series of its second derivatives are given as the vector time series cepstrum C = {c₁, c₂, ..., c_T}, Δ cepstrum ΔC = {Δc₁, Δc₂, ..., Δc_T}, and Δ² cepstrum Δ²C = {Δ²c₁, Δ²c₂, ..., Δ²c_T}, respectively.

[0048] Let S = {s₁, s₂, ..., s_T} denote the Gaussian distribution time series of the HMM, and let M = {μ₁, μ₂, ..., μ_T}, ΔM = {Δμ₁, Δμ₂, ..., Δμ_T}, and Δ²M = {Δ²μ₁, Δ²μ₂, ..., Δ²μ_T} denote, respectively, the vector time series of the HMM means of the cepstrum, the Δ cepstrum, and the Δ² cepstrum along that Gaussian distribution time series.

[0049] Let Σ = {Σ₁, Σ₂, ..., Σ_T}, ΔΣ = {ΔΣ₁, ΔΣ₂, ..., ΔΣ_T}, and Δ²Σ = {Δ²Σ₁, Δ²Σ₂, ..., Δ²Σ_T} denote, respectively, the time series of the HMM covariance matrices of the cepstrum, the Δ cepstrum, and the Δ² cepstrum (a diagonal covariance matrix is assumed in each case).

[0050] Between the cepstrum, which is the static feature, and the Δ cepstrum and Δ² cepstrum, which are the two dynamic features, there are constraints of the form shown in Equations 1 and 2 below (the same result can be achieved with other constraints).

[0051]

[Equation 1]

[0052]

[Equation 2]

[0053] Here, (2L + 1) is the window size, and b₀, b₁, and b₂ are constants determined by the window size.
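As an illustration of how a window of size 2L + 1 ties dynamic features to static ones, the sketch below uses the common linear-regression delta. This form is an assumption: Equations 1 and 2 and the constants b₀, b₁, and b₂ are not reproduced in this text, so the normalization and the definition of the second-order feature used here may differ from the patent's. The function name `delta` and the edge padding are likewise hypothetical choices.

```python
def delta(seq, L=2):
    """Regression-window delta of a scalar sequence (window size 2L + 1).

    Assumed form: d[t] = sum_{k=-L..L} k * seq[t+k] / sum_{k=-L..L} k*k,
    with the sequence edge-padded by repeating its first and last values.
    """
    T = len(seq)
    norm = float(sum(k * k for k in range(-L, L + 1)))
    padded = [seq[0]] * L + list(seq) + [seq[-1]] * L
    return [
        sum(k * padded[t + L + k] for k in range(-L, L + 1)) / norm
        for t in range(T)
    ]

c = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # a linear ramp with slope 1
dc = delta(c)     # first-order dynamic feature
ddc = delta(dc)   # second-order feature, taken here as the delta of the delta
```

For the ramp, the interior first-order values equal the slope and the interior second-order values vanish. The point relevant to the patent is that `dc` and `ddc` are deterministic linear functions of `c`, so the three streams cannot vary independently.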

[0054] In speech recognition with the ordinary Viterbi algorithm, the HMM score for the input speech is computed so that Equation 3 below is maximized for the speech signal. This maximization yields the Gaussian distribution time series of the HMM.

[0055]

[Equation 3]

[0056] However, the time series of means selected by Equation 3 is not chosen so as to satisfy Equations 1 and 2. As a result, it is often an unnatural time series for speech, with, for example, discontinuities in the means between HMM states.

[0057] The conventional technique nevertheless follows Equation 3 and scores the input speech time series against this unnatural time series of means. Highly accurate speech recognition cannot be achieved this way.

[0058] The present invention therefore uses a method from speech synthesis [References 1 to 3] to transform this time series of means and generate a smooth feature time series.

[References]
[1] K. Tokuda, T. Kobayashi, and S. Imai, "Speech parameter generation from HMM using dynamic features," Proc. ICASSP, pp. 660-663, 1995.
[2] K. Tokuda, T. Masuko, T. Yamada, T. Kobayashi, and S. Imai, "An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features," Proc. Eurospeech, pp. 757-760, 1995.
[3] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, "Speech synthesis from HMMs using dynamic features," Proc. ICASSP, pp. 389-392, 1996.

Next, the method used in speech synthesis is described.

[0059] Assume now that a time series of Gaussian distributions is given. For the given Gaussian time series, the speech synthesis technique generates a feature time series by choosing the O, ΔO, and Δ²O that maximize [Equation 4] below, subject to the conditions of [Equation 2] and [Equation 3] (with C replaced by O).

[0060] This is achieved by using [Equation 2] and [Equation 3] to express ΔO and Δ²O in [Equation 4] in terms of O alone, which yields [Equation 5] below. This is the technique used in speech synthesis.

[0061]

[Equation 4]

[0062]

[Equation 5]

[0063] The time series of O, ΔO, and Δ²O obtained in this way is referred to here as a trajectory. A trajectory can be generated for any Gaussian time series, and it is a feature time series that has the naturalness of speech while preserving the statistics of the original HMM.
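The trajectory generation described above can be illustrated with a minimal sketch. It is not the patent's implementation: it assumes a single cepstral dimension, diagonal variances, and delta windows of the form Δc_t = (c_{t+1} − c_{t−1})/2 and Δ²c_t = c_{t+1} − 2c_t + c_{t−1} with clamped edges, since the actual relations in [Equation 1] and [Equation 2] are not reproduced in this text. The function names `delta_matrix` and `generate_trajectory` are hypothetical. Under these assumptions, maximizing the Gaussian likelihood over the static sequence alone reduces to a linear system, in the spirit of References 1 to 3.

```python
import numpy as np

def delta_matrix(T):
    """Build W (3T x T) mapping a static sequence c to [c; delta c; delta^2 c].

    The window coefficients are assumptions, not taken from the patent.
    """
    D1 = np.zeros((T, T))
    D2 = np.zeros((T, T))
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)  # clamp indices at the edges
        D1[t, hi] += 0.5
        D1[t, lo] -= 0.5
        D2[t, hi] += 1.0
        D2[t, lo] += 1.0
        D2[t, t] -= 2.0
    return np.vstack([np.eye(T), D1, D2])

def generate_trajectory(mean, var):
    """Return the trajectory rows (O, delta O, delta^2 O), shape (3, T).

    mean, var: arrays of shape (3, T) with the per-frame Gaussian means and
    variances of the static, delta, and delta-delta streams.
    """
    T = mean.shape[1]
    W = delta_matrix(T)
    mu = mean.reshape(-1)           # stacked [static; delta; delta-delta]
    prec = 1.0 / var.reshape(-1)    # diagonal precisions
    A = W.T @ (prec[:, None] * W)   # W^T Sigma^-1 W
    b = W.T @ (prec * mu)           # W^T Sigma^-1 mu
    c = np.linalg.solve(A, b)       # static sequence maximizing the likelihood
    return (W @ c).reshape(3, T)
```

Because the delta rows are derived from the static row through `W`, the returned trajectory satisfies the assumed static/dynamic relationship by construction, which is what removes the discontinuities at state boundaries.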

[0064] The score of the input speech with respect to this trajectory is defined as shown in [Equation 6] below.

[0065]

[Equation 6]

[0066] Here, Σ' = {Σ'₁, Σ'₂, ..., Σ'_T}, ΔΣ' = {ΔΣ'₁, ΔΣ'₂, ..., ΔΣ'_T}, and Δ²Σ' = {Δ²Σ'₁, Δ²Σ'₂, ..., Δ²Σ'_T} are covariance time series that represent the spread around the trajectory along the Gaussian time series S.
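The image for [Equation 6] is not reproduced in this text. As a hedged reading only, assuming the score is a sum of per-frame Gaussian log-likelihoods of the input features around the trajectory, with the covariances defined above, one plausible form is the following. The per-frame symbols c_t and o_t (input cepstrum and trajectory value at frame t) are notation introduced here for illustration.

```latex
\log \mathrm{prob}(C \mid S) \;=\; \sum_{t=1}^{T} \Big[
    \log \mathcal{N}\!\left(c_t;\, o_t,\, \Sigma'_t\right)
  + \log \mathcal{N}\!\left(\Delta c_t;\, \Delta o_t,\, \Delta\Sigma'_t\right)
  + \log \mathcal{N}\!\left(\Delta^2 c_t;\, \Delta^2 o_t,\, \Delta^2\Sigma'_t\right)
\Big]
```

The only structural difference from an ordinary HMM score under this reading is that the Gaussian means are replaced by the trajectory values and the covariances by the spread around the trajectory.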

[0067] The discussion so far has assumed that the time series of HMM Gaussian distributions is given when the trajectory is generated. The following describes how to obtain this Gaussian time series when input speech is given.

[0068] To find the optimal Gaussian time series, that is, the one that outputs the cepstrum C, a function such as the one shown in [Equation 7] below is needed.

[0069]

[Equation 7]

[0070] Here, prob is the score given by [Equation 6], and O(S) is the trajectory output by the HMM when the Gaussian time series S is given.

[0071] Evaluating [Equation 7], however, requires computing O for every possible Gaussian time series. In addition, an efficient search such as the Viterbi algorithm cannot be applied, so an enormous amount of computation is required.

[0072] Here, therefore, the Gaussian time series obtained by the Viterbi algorithm with [Equation 3], as used in ordinary speech recognition, is used as an approximation of this optimal Gaussian time series.

[0073] Next, the variance calculation performed by the variance calculation unit 104 shown in FIG. 1 is described in detail.

[0074] As [Equation 6] shows, introducing the trajectory requires a new variance to be computed. Here, the variance is assumed to be that of a single Gaussian distribution and to be constant regardless of time. The Viterbi training procedure below is used to obtain it.

[0075] The variance is computed by the following procedure:
(a) Perform MLE training to create an ordinary HMM.
(b) For each item of training data (training speech), use the HMM and the Viterbi algorithm to compute the Gaussian time series that maximizes the score of [Equation 1].
(c) Obtain the trajectory from the resulting Gaussian time series.
(d) Using the Viterbi result, segment each item of training data by state, splitting it into small per-segment pieces of data, and assign each piece to its corresponding state.
(e) For each state, estimate the variance from the segment data assigned to that state according to [Equation 8] below.

[0076]

[Equation 8]

[0077] In [Equation 8], n is the number of data items assigned to state s, and c^k_i is the i-th cepstrum of the k-th data item. o^k_i is the trajectory value corresponding to that cepstrum.
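Steps (d) and (e) of the procedure can be sketched as follows. This is an illustration under stated assumptions, not the patent's [Equation 8], whose image is not reproduced here: the variance of each state is taken as the mean squared deviation of the cepstra from the trajectory over all frames aligned to that state, and the normalization by total frame count is an assumption. The function name `estimate_state_variances` and its argument layout are hypothetical.

```python
import numpy as np
from collections import defaultdict

def estimate_state_variances(cepstra, trajectories, alignments):
    """Estimate one diagonal variance vector per HMM state.

    cepstra[u], trajectories[u]: (T_u, D) arrays for training utterance u.
    alignments[u]: length-T_u sequence of state ids from the Viterbi alignment.
    Returns a dict mapping state id -> (D,) variance around the trajectory.
    """
    sq_sum = defaultdict(float)   # accumulated squared deviations per state
    count = defaultdict(int)      # number of frames assigned to each state
    for c, o, states in zip(cepstra, trajectories, alignments):
        diff2 = (np.asarray(c) - np.asarray(o)) ** 2
        for t, s in enumerate(states):
            sq_sum[s] = sq_sum[s] + diff2[t]
            count[s] += 1
    # Normalizing by the frame count is an assumption of this sketch.
    return {s: sq_sum[s] / count[s] for s in count}
```

Passing Δ or Δ² features and the corresponding trajectory rows to the same function would give analogous estimates for the dynamic streams.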

[0078] ΔΣ' can be computed by the same procedure according to [Equation 9] below, and Δ²Σ' can likewise be computed by the same procedure according to [Equation 10] below.

[0079]

[Equation 9]

[0080]

[Equation 10]

[0081] Next, the speech recognition processing that the speech recognition apparatus 1 of the present invention performs in recognition mode (the mode realized by the functions shown in FIG. 2) is described in detail.

[0082] In recognition mode, the speech recognition apparatus 1 of the present invention first performs recognition with the Viterbi algorithm using an ordinary HMM and outputs the top several recognition candidates.

[0083] For these candidates, a trajectory is generated for each candidate using the HMM and the relationship between the static features and the dynamic features.

[0084] These candidates are then rescored using [Equation 6]. Here, in order to weight the scores of the dynamic features, [Equation 11] below is used as the score in place of [Equation 6].

[0085]

[Equation 11]

[0086] Here, α and β are the score weights for the Δ cepstrum and the Δ² cepstrum, respectively.
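The rescoring of N-best candidates with weighted dynamic-feature scores can be sketched as below. This is a minimal illustration, not the patent's [Equation 11]: it assumes diagonal Gaussians and assumes that α and β scale the Δ and Δ² log-likelihood terms linearly in the log domain. The names `log_gauss`, `trajectory_score`, and `rescore`, and the tuple layout of the arguments, are hypothetical.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Sum of diagonal-Gaussian log densities over all frames and dimensions."""
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var))

def trajectory_score(obs, traj, var, alpha=1.0, beta=1.0):
    """Score input features against one candidate's trajectory.

    obs, traj, var: tuples (static, delta, delta-delta) of (T, D) arrays.
    alpha, beta: weights on the delta and delta-delta terms (assumed linear).
    """
    c, dc, ddc = obs
    o, do, ddo = traj
    v, dv, ddv = var
    return (log_gauss(c, o, v)
            + alpha * log_gauss(dc, do, dv)
            + beta * log_gauss(ddc, ddo, ddv))

def rescore(obs, candidates, alpha=1.0, beta=1.0):
    """Return (label, score) pairs sorted best first.

    candidates: list of (label, traj, var) produced for each N-best hypothesis.
    """
    scored = [(label, trajectory_score(obs, traj, var, alpha, beta))
              for label, traj, var in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With α = β = 1 the sketch reduces to an unweighted sum of the three stream scores, which is how the unweighted score would look under the same assumptions.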

[0087] A recognition experiment was carried out using the technique described above. The experiment tested speaker-independent, task-independent recognition. The training data (training speech) was speaker-independent speech data for the 503 phonetically balanced sentences of the Acoustical Society of Japan. The sampling rate was 16 kHz and the frame shift was 10 ms. This data was used to train context-dependent HMMs with one Gaussian distribution per state.

[0088] The evaluation data (input speech) consisted of utterances of 100 city names by 10 male and 10 female speakers, analyzed under the same conditions as the training data. α and β were each varied over 1, 2, 3, 4, 5, and 10, and the setting with the highest recognition rate was taken as the recognition result of the present invention. For the score based on the conventional HMM, α and β were varied in the same way to maximize the recognition rate.
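The weight sweep described above amounts to an exhaustive search over the 6 × 6 grid of (α, β) pairs. A minimal sketch follows, assuming a caller-supplied `error_rate(alpha, beta)` function that runs recognition on the evaluation set and returns the proportion of incorrect recognitions; that callback and the name `best_weights` are hypothetical and not part of the source.

```python
from itertools import product

# Weight values tried for both alpha and beta in the experiment.
WEIGHTS = (1, 2, 3, 4, 5, 10)

def best_weights(error_rate):
    """Return the (alpha, beta) pair with the lowest error rate on the grid.

    error_rate: callable (alpha, beta) -> float, supplied by the caller.
    """
    return min(product(WEIGHTS, WEIGHTS), key=lambda ab: error_rate(*ab))
```

For example, `best_weights(lambda a, b: abs(a - 3) + abs(b - 10))` returns `(3, 10)`.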

[0089] In this experiment, the recognition rate, reported here as the proportion of incorrect recognitions, was 4.1% for recognition with the conventional HMM, 3.4% for recognition according to the present invention, and 4.0% for recognition with the conventional HMM with α and β varied.

[0090] These results confirm that recognition according to the present invention gives the lowest recognition rate (proportion of incorrect recognitions), which verifies the effectiveness of speech recognition according to the present invention.

[0091] Thus, whereas the prior art performs speech recognition with an unnatural score function whose reference is a discontinuous sequence of HMM mean values, as shown in FIG. 5, the present invention generates a trajectory from the relationship between the static and dynamic features. This converts the score function into a more natural one whose reference is the trajectory, as shown in FIG. 6, and speech recognition is performed with this natural score function.

[0092] Furthermore, by computing the variance, which is the spread around this trajectory, as in [Equation 8] to [Equation 11], a score function with a smaller spread can be realized, as shown in FIG. 6.

[0093] In this way, according to the present invention, high recognition accuracy can be expected when input speech is recognized according to a hidden Markov model.

[0094]

[Effects of the Invention] As described above, according to the present invention, the trajectory is created taking into account the relationship that holds between the static and dynamic features of speech. This converts the unnatural score function used in the prior art, which is built from a discontinuous time series of HMM mean values, into a natural score function, making high-accuracy speech recognition achievable.

[Brief Description of the Drawings]

FIG. 1 shows an example embodiment of the present invention.

FIG. 2 shows an example embodiment of the present invention.

FIG. 3 shows an example embodiment of a processing flow executed by the present invention.

FIG. 4 shows an example embodiment of a processing flow executed by the present invention.

FIG. 5 is an explanatory diagram of the score function used in the prior art.

FIG. 6 is an explanatory diagram of the score function used in the present invention.

FIG. 7 is an explanatory diagram of the prior art.

[Explanation of Symbols]

1 Speech recognition apparatus
100 Feature extraction unit
101 Acoustic model training unit
102 HMM database
103 Trajectory synthesis unit
104 Variance calculation unit
105 Variance database
106 Relational expression between features
107 Speech recognition unit
108 Dictionary
109 Trajectory re-synthesis unit
110 Score recalculation unit

Continuation of front page

(72) Inventor: Atsushi Nakamura, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
(72) Inventor: Shigeru Katagiri, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
F-term (reference): 5D015 FF00 HH23
(54) [Title of the Invention] Speech recognition information creation apparatus and method, speech recognition apparatus and method, speech recognition information creation program and recording medium recording the program, and speech recognition program and recording medium recording the program

Claims

[Claims]

[Claim 1] A speech recognition information creation apparatus that creates speech recognition information used in speech recognition based on a hidden Markov model, the apparatus comprising: means for performing feature analysis on training speech to extract static features and dynamic features; means for training a hidden Markov model from the static features and the dynamic features and storing it in a storage device; means for creating a trajectory for the training speech using the trained hidden Markov model and the relationship between the static features and the dynamic features; and means for calculating the variance of the training speech from the created trajectory and storing it in a storage device.

[Claim 2] The speech recognition information creation apparatus according to claim 1, wherein the means for creating the trajectory creates the trajectory for the training speech using a Gaussian distribution time series obtained by speech recognition using the hidden Markov model.

[Claim 3] A speech recognition information creation method for creating speech recognition information used in speech recognition based on a hidden Markov model, the method comprising: a step of performing feature analysis on training speech to extract static features and dynamic features; a step of training a hidden Markov model from the static features and the dynamic features and storing it in a storage device; a step of creating a trajectory for the training speech using the trained hidden Markov model and the relationship between the static features and the dynamic features; and a step of calculating the variance of the training speech from the created trajectory and storing it in a storage device.

[Claim 4] The speech recognition information creation method according to claim 3, wherein, in the step of creating the trajectory, the trajectory for the training speech is created using a Gaussian distribution time series obtained by speech recognition using the hidden Markov model.

[Claim 5] A speech recognition apparatus that recognizes input speech according to a hidden Markov model, the apparatus comprising: means for performing feature analysis on the input speech to extract static features and dynamic features; means for obtaining, by referring to a storage device that stores hidden Markov models created from training speech, the hidden Markov models to be compared with the input speech, and performing speech recognition on the input speech to obtain a plurality of candidates; means for creating a trajectory for each candidate using the hidden Markov model of the candidate and the relationship between the static features and the dynamic features; and means for re-evaluating the candidates by obtaining, by referring to a storage device that stores the variance from trajectories created from training speech, the variance from the trajectory of each candidate, and calculating a score between the trajectory of the candidate and the input speech.

[Claim 6] The speech recognition apparatus according to claim 5, wherein the means for creating the trajectory creates the trajectory for each candidate using a Gaussian distribution time series obtained by speech recognition using the hidden Markov model.

[Claim 7] A speech recognition method for recognizing input speech according to a hidden Markov model, the method comprising: a step of performing feature analysis on the input speech to extract static features and dynamic features; a step of obtaining, by referring to a storage device that stores hidden Markov models created from training speech, the hidden Markov models to be compared with the input speech, and performing speech recognition on the input speech to obtain a plurality of candidates; a step of creating a trajectory for each candidate using the hidden Markov model of the candidate and the relationship between the static features and the dynamic features; and a step of re-evaluating the candidates by obtaining, by referring to a storage device that stores the variance from trajectories created from training speech, the variance from the trajectory of each candidate, and calculating a score between the trajectory of the candidate and the input speech.

[Claim 8] The speech recognition method according to claim 7, wherein, in the step of creating the trajectory, the trajectory for each candidate is created using a Gaussian distribution time series obtained by speech recognition using the hidden Markov model.

[Claim 9] A speech recognition information creation program for causing a computer to execute the processing used to implement the speech recognition information creation method according to claim 3 or 4.

[Claim 10] A recording medium for a speech recognition information creation program, the medium recording a program for causing a computer to execute the processing used to implement the speech recognition information creation method according to claim 3 or 4.

[Claim 11] A speech recognition program for causing a computer to execute the processing used to implement the speech recognition method according to claim 7 or 8.

[Claim 12] A recording medium for a speech recognition program, the medium recording a program for causing a computer to execute the processing used to implement the speech recognition method according to claim 3 or 4.