JP2008216488A

JP2008216488A - Voice processor and voice recognition device

Info

Publication number: JP2008216488A
Application number: JP2007051837A
Authority: JP
Inventors: Katsuhiko Shirai; 克彦白井; Akira Kurematsu; 明榑松; Yotaro Kubo; 陽太郎久保
Original assignee: Waseda University
Current assignee: Waseda University
Priority date: 2007-03-01
Filing date: 2007-03-01
Publication date: 2008-09-18

Abstract

PROBLEM TO BE SOLVED: To provide a voice processor and a voice recognition device that improve voice recognition precision over conventional devices. SOLUTION: The voice recognition device 1 calculates, from a voice signal, a zeroth-order prediction residual Φ_b(n) as an instantaneous phase feature quantity and a logarithmic envelope m_b(n) as an envelope feature quantity, performs discriminant analysis of these feature quantities using trained subband neural networks 32a to 32n and a trained integrated neural network 31, and maps the analysis results to another space with a feature converting unit 52 to obtain a voice feature quantity. Using the voice feature quantity generated in this way improves voice recognition precision over conventional devices. COPYRIGHT: (C)2008,JPO&INPIT

Description

The present invention relates to a speech processing device and a speech recognition device, and is suitable for application to speech recognition devices used in systems such as navigation systems and automated telephone answering systems, in which the user performs operations by voice.

In recent years, many applications have emerged for which spoken dialogue is an appropriate human interface. However, because the accuracy of spontaneous speech recognition is far below that of dictation speech recognition, widespread adoption of dialogue applications remains difficult.

A major recent advance in spontaneous speech recognition has been the completion of the Corpus of Spontaneous Japanese (CSJ), a database for spoken-language research that collects a large amount of spontaneous Japanese speech annotated with extensive research information. As a result, a sufficient amount of training data for spontaneous speech recognition is now available. However, simply increasing the amount of training data within the current acoustic-model framework cannot be expected to yield breakthrough improvements.

Acoustic models therefore need to be built with a different approach. One candidate is a multiband speech recognition device that uses a long analysis window for feature extraction.

In practice, a known multiband speech recognition device of this kind performs recognition using so-called TRAPs (Temporal Patterns) features (see, for example, Non-Patent Document 1). A TRAPs feature is a segment of about 500 ms cut from the temporal envelope of a single critical-band signal. Combined with a classifier based on a multilayer perceptron (MLP), it enables speech recognition that does not depend on the spectral envelope.

Among speech recognition devices that use TRAPs features, the most accurate known method is HATs (Hidden-Activation TRAPs) tandem: the TRAPs features are classified separately for each band, the results are merged to obtain a score for each phoneme (the smallest linguistic unit of speech), and speech recognition is performed with a hidden Markov model (HMM) whose features are those scores mapped to another space (see, for example, Non-Patent Document 2).

It is also known that combining such a speech recognition device with a PLP (Perceptual Linear Prediction) based system yields recognition accuracy exceeding that of existing speech recognition devices (see, for example, Non-Patent Document 3).

On the other hand, AIF (Average Instantaneous Frequency) is known as a speech recognition technique that uses instantaneous frequency, and it has been reported to be effective in improving recognition accuracy in noisy environments (see, for example, Non-Patent Document 4).

Non-Patent Document 1: H. Hermansky, S. Sharma, “TRAPs - Classifiers of temporal patterns,” Proc. ICSLP '98, Sydney, Australia, November 1998
Non-Patent Document 2: B. Chen et al., “Learning long term temporal features in LVCSR using neural networks,” Proc. ICSLP 2004, pp. 612-615
Non-Patent Document 3: B. Chen et al., “Learning long term temporal features in LVCSR using neural networks,” Proc. ICSLP 2004, pp. 612-615
Non-Patent Document 4: Y. Wang et al., “Average Instantaneous Frequency (AIF) and Average Log-Envelopes (ALE) for ASR with the Aurora 2 Database,” Proc. Eurospeech, 2003

However, although speech recognition devices that use instantaneous frequency have been reported to improve recognition accuracy in noisy environments, their accuracy in noise-free environments is inferior.

Moreover, the other conventional speech recognition devices described above still have difficulty recognizing natural conversational speech. To realize a less stressful speech recognition device, improved accuracy in recognizing spontaneous utterances is desired.

The present invention has been made in view of the above, and its object is to provide a speech processing device and a speech recognition device that can further improve speech recognition accuracy compared with the prior art.

The speech processing device according to claim 1 of the present invention comprises: an instantaneous phase feature extraction unit that extracts an instantaneous phase feature from each divided band signal obtained by dividing a speech signal into a predetermined number of frequency bands with FIR (finite impulse response) filters; an envelope feature extraction unit that extracts an envelope feature from each divided band signal; and a discriminant analysis unit that generates a speech feature by cutting out each instantaneous phase feature and each envelope feature in segments of a predetermined duration and analyzing them with a multilayer perceptron neural network.

The speech processing device according to claim 2 of the present invention comprises: Hilbert transform filters that generate an imaginary-part analysis signal by applying a Hilbert transform to each divided band signal; and an adder circuit that adds each divided band signal to the corresponding imaginary-part analysis signal and supplies the sum to the instantaneous phase feature extraction unit and the envelope feature extraction unit.

In the speech processing device according to claim 3 of the present invention, the multilayer perceptron neural network of the discriminant analysis unit consists of multilayer perceptron subband neural networks that perform phoneme estimation independently for each divided band signal, and a multilayer perceptron integrated neural network that combines the phoneme estimation results for the divided band signals to generate the speech feature.

The speech recognition device according to claim 4 of the present invention performs speech recognition using the speech feature obtained by the speech processing device according to any one of claims 1 to 3.

According to the speech processing device of claim 1 and the speech recognition device of claim 4 of the present invention, speech recognition accuracy can be further improved compared with the prior art.

Embodiments of the present invention are described in detail below with reference to the drawings.

In FIG. 1, reference numeral 1 denotes the speech recognition device according to the present invention as a whole. The speech recognition device 1 consists of a recording device 3 that records, on a hard disk or the like, the speech signal obtained by picking up speech with a microphone 2; a speech processing unit 4 that applies processing such as instantaneous phase feature extraction and envelope feature extraction to the speech signal; and a speech recognition unit 5 that performs speech recognition using the speech feature generated by the speech processing unit 4.

The speech signal supplied from the recording device 3 is sent to each of a plurality of FIR (finite impulse response) filters 7a to 7n, which are band-pass filters (BPFs) provided in a filter bank 6, and is divided into specific frequency bands using, for example, 160th-order FIR filters 7a to 7n that simulate critical-frequency characteristics.

Although 160th-order FIR filters 7a to 7n are used here, the present invention is not limited to this; FIR filters of other orders, lower or higher than 160, may also be used.

FIG. 2 shows an example of the filter shapes of the filter bank 6. As shown in FIG. 2, the filter bank 6 divides the speech signal into a plurality of different frequency bands. As shown in FIG. 1, each of the FIR filters 7a to 7n supplies its band as a divided band signal to the corresponding one of the Hilbert transform filters 11a to 11n in the analysis unit 10 and to the adder circuit 12.
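
The band splitting described above can be sketched as follows. This is a minimal illustration, not the filter design used in the embodiment: the windowed-sinc method, the Hamming window, the band edges, and the names `bandpass_fir`, `filter_bank`, `gain`, and `apply_fir` are all assumptions introduced for the example. Only the filter order of 160 comes from the text.

```python
import math

def bandpass_fir(low_hz, high_hz, fs_hz, order=160):
    """Windowed-sinc band-pass FIR filter with order + 1 taps (Hamming window)."""
    mid = order / 2
    taps = []
    for k in range(order + 1):
        t = k - mid
        if t == 0:
            # Limit of the ideal band-pass impulse response at its center tap.
            ideal = 2.0 * (high_hz - low_hz) / fs_hz
        else:
            ideal = (math.sin(2 * math.pi * high_hz / fs_hz * t)
                     - math.sin(2 * math.pi * low_hz / fs_hz * t)) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * k / order)
        taps.append(ideal * window)
    return taps

def filter_bank(edges_hz, fs_hz, order=160):
    """One band-pass filter per pair of adjacent band edges (edges are hypothetical)."""
    return [bandpass_fir(lo, hi, fs_hz, order)
            for lo, hi in zip(edges_hz, edges_hz[1:])]

def gain(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at freq_hz."""
    w = 2 * math.pi * freq_hz / fs_hz
    re = sum(h * math.cos(w * k) for k, h in enumerate(taps))
    im = sum(h * math.sin(w * k) for k, h in enumerate(taps))
    return math.hypot(re, im)

def apply_fir(taps, signal):
    """Direct-form convolution; the output has the same length as the input."""
    return [sum(h * signal[n - k] for k, h in enumerate(taps) if n - k >= 0)
            for n in range(len(signal))]
```

Calling `filter_bank([500, 1000, 1500, 2000], 16000)` would give three 161-tap filters, and `apply_fir` applied with each of them yields one divided band signal per band.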

For convenience, the following description focuses on one Hilbert transform filter 11b in the analysis unit 10. As shown in FIG. 3, the Hilbert transform filter 11b applies a Hilbert transform to the b-th divided band signal x_b(n) supplied from the corresponding FIR filter 7b, thereby generating an imaginary-part analysis signal x_b′(n) in which the frequency components of x_b(n) are phase-shifted by 90 degrees, and supplies x_b′(n) to the adder circuit 12.

As shown in Equation 1, the adder circuit 12 adds the divided band signal x_b(n) (the real-part analysis component) and the imaginary-part analysis signal x_b′(n), and supplies the resulting divided band signal x_b^+(n) to the instantaneous phase feature extraction unit 15 and the envelope feature extraction unit 16.
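
As a rough illustration of the Hilbert transform and addition steps, the sketch below builds the complex signal x_b(n) + j·x_b′(n) for a short block of samples. It swaps in a DFT-based Hilbert transform (zeroing the negative-frequency bins) in place of the Hilbert transform filter of the embodiment, and it assumes that the imaginary-part signal enters the sum multiplied by the imaginary unit. The names `dft` and `analytic_signal` are hypothetical.

```python
import cmath
import math

def dft(x, inverse=False):
    """Plain O(N^2) discrete Fourier transform (adequate for short blocks)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out

def analytic_signal(x):
    """Return x(n) + j*H{x}(n) by suppressing the negative-frequency bins."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        if k == 0 or (n % 2 == 0 and k == n // 2):
            gain = 1.0          # DC and Nyquist bins are kept unchanged
        elif k < (n + 1) // 2:
            gain = 2.0          # positive frequencies are doubled
        else:
            gain = 0.0          # negative frequencies are removed
        spectrum[k] *= gain
    return dft(spectrum, inverse=True)
```

For a cosine input that completes a whole number of cycles within the block, the real part of the result reproduces the input and the imaginary part is the corresponding sine, that is, the input shifted by 90 degrees.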

In the instantaneous phase feature extraction unit 15, the divided band signal x_b^+(n) from the adder circuit 12 is supplied to an argument calculation circuit 17. The argument calculation circuit 17 computes the argument of x_b^+(n) according to Equation 2 to obtain the instantaneous phase φ_b(n), and supplies it to a phase unwrapping circuit 18.

As shown in FIG. 4, the argument calculation in the argument calculation circuit 17 yields an instantaneous phase φ_b(n) whose value is wrapped into the range -π to +π.

The phase unwrapping circuit 18 unwraps the instantaneous phase φ_b(n), which is wrapped into the range -π to +π, and supplies the resulting instantaneous phase φ_b′(n) (illustrated in FIG. 4 in the box for the phase unwrapping circuit 18) to a linear phase generation circuit 19 and a subtraction circuit 20 (FIG. 3).
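
The argument calculation and the unwrap step can be sketched as below. This is an illustrative sketch only: the function names are hypothetical, and the unwrapping rule (accumulating sample-to-sample phase differences folded into [-π, π)) is a common technique assumed here, which presumes the true phase changes by less than π between adjacent samples.

```python
import cmath
import math

def wrap(phase):
    """Fold a phase value into the interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi

def instantaneous_phase(analytic):
    """Argument of each complex sample; cmath.phase returns a wrapped value."""
    return [cmath.phase(z) for z in analytic]

def unwrap(wrapped):
    """Undo the folding by accumulating wrapped sample-to-sample differences."""
    if not wrapped:
        return []
    out = [wrapped[0]]
    for prev, cur in zip(wrapped, wrapped[1:]):
        out.append(out[-1] + wrap(cur - prev))
    return out
```

Applied to samples of a complex exponential whose phase advances by a constant step per sample, `instantaneous_phase` gives a sawtooth confined to the range -π to +π, and `unwrap` restores the straight line.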

The linear phase generation circuit 19 performs the calculation of Equation 3 to obtain a linear phase A_b(n) (illustrated in FIG. 4 in the box for the linear phase generation circuit 19), and supplies it to the subtraction circuit 20.

As shown in Equation 4, the subtraction circuit 20 subtracts the linear phase A_b(n) from the instantaneous phase φ_b′(n) to calculate the zeroth-order prediction residual Φ_b(n) (illustrated in FIG. 4 in the box for the buffer 21a), and supplies it to the buffer 21a.
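
The subtraction of the linear phase can be illustrated as follows. The definition of A_b(n) in Equation 3 is not reproduced in this text, so the sketch assumes, purely for illustration, that the linear phase is the phase of a carrier at the band's center frequency, A_b(n) = 2π·f_c·n/f_s. The names `linear_phase` and `phase_residual` and the parameters `center_hz` and `fs_hz` are hypothetical.

```python
import math

def linear_phase(center_hz, fs_hz, length):
    """Assumed linear phase: a carrier at the band center frequency."""
    return [2 * math.pi * center_hz / fs_hz * n for n in range(length)]

def phase_residual(unwrapped_phase, center_hz, fs_hz):
    """Residual after removing the linear trend from the unwrapped phase."""
    carrier = linear_phase(center_hz, fs_hz, len(unwrapped_phase))
    return [p - a for p, a in zip(unwrapped_phase, carrier)]
```

Under this assumption, a pure tone at the center frequency leaves a constant residual, and any slow phase modulation around the carrier shows up directly in the residual.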

Meanwhile, in the envelope feature extraction unit 16, as shown in FIG. 3, the divided band signal x_b^+(n) from the adder circuit 12 is supplied to an absolute value calculation circuit 25, which calculates the absolute value |x_b^+(n)| and supplies it to a logarithm calculation circuit 26.

The logarithm calculation circuit 26 takes the logarithm of the absolute value |x_b^+(n)| to calculate the logarithmic envelope m_b(n) given by Equation 5, and supplies it to the buffer 21b.
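
A minimal sketch of the envelope branch, assuming the natural logarithm (the base is not stated in this text) and adding a small floor to avoid taking the logarithm of zero. The floor and the name `log_envelope` are assumptions of the example, not part of the described circuit.

```python
import math

def log_envelope(analytic, floor=1e-10):
    """Logarithm of the magnitude of each complex sample, floored for safety."""
    return [math.log(max(abs(z), floor)) for z in analytic]
```

For example, a sample of 3 + 4j has magnitude 5, so its logarithmic envelope value is log 5.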

In this way, the analysis unit 10 calculates, for each frequency band, the zeroth-order prediction residual Φ_b(n) as the instantaneous phase feature (FIG. 5(A)) and the logarithmic envelope m_b(n) as the envelope feature (FIG. 5(B)), and supplies them to the buffers 21a and 21b, respectively.

The buffer 21a stores the instantaneous phase features calculated for each divided band signal as one frame. As shown in FIG. 6, it cuts the frame F into segments of a predetermined length to obtain a predetermined number of cut-out sections C1 to C5, and supplies these cut-out sections C1 to C5 to the discriminant analysis unit 27 as input vectors.

The length of each cut-out section C1 to C5 is a long duration chosen between about 250 ms and 1000 ms, with about 500 ms being particularly preferable. That is, the cut-out sections C1 to C5 capture the instantaneous phase feature over a long time span rather than a short one such as 10 ms. The shift width of the frame F is set to about 10 ms.

In this embodiment, one frame F is cut into five cut-out sections C1 to C5, but the number of sections is not limited to this and may be chosen freely. FIG. 7(A) shows an example of an input vector cut from the instantaneous phase feature, here for the fifth band.

Similarly, the buffer 21b stores the envelope features calculated for each divided band signal as one frame, cuts the frame into segments of a predetermined length in the same manner to obtain a predetermined number of cut-out sections, and supplies them to the discriminant analysis unit 27 as input vectors. FIG. 7(B) shows an example of an input vector cut from the envelope feature, again for the fifth band.
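
One way to read the long-window cut-out is as a sliding window over a per-band feature trajectory: a segment of about 500 ms is taken every 10 ms. The sketch below follows that reading, which is an interpretation rather than a statement of how the buffers 21a and 21b are implemented. The name `cut_windows` and the choice to work at the signal sampling rate are assumptions.

```python
def cut_windows(feature, fs_hz, window_ms=500, shift_ms=10):
    """Slide a long window over one band's feature sequence.

    feature  : list of per-sample feature values for one band
    fs_hz    : sampling rate of the feature sequence (assumed)
    Returns one list per window position; incomplete windows are dropped.
    """
    window = int(round(fs_hz * window_ms / 1000))
    shift = int(round(fs_hz * shift_ms / 1000))
    return [feature[start:start + window]
            for start in range(0, len(feature) - window + 1, shift)]
```

With a 16 kHz sequence, each window holds 8000 values and successive windows start 160 samples apart. A practical system would likely reduce the feature rate before feeding a neural network, but the text does not specify this.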

As shown in FIG. 1, the discriminant analysis unit 27 consists of a sub-neural network group 30 and an integrated neural network 31, which are trained in two stages.

The sub-neural network group 30 contains a plurality of subband neural networks 32a to 32n, each a multilayer perceptron (MLP), provided in correspondence with the FIR filters 7a to 7n.

The first-stage subband neural networks 32a to 32n all have the same structure and process their input vectors in the same way, so the following description focuses on the subband neural network 32b.

As shown in FIG. 8, the subband neural network 32b, a multilayer perceptron subband neural network, consists of an input layer 36 made up of a plurality of input units 35 that individually receive the input vectors; a hidden layer 38 made up of a plurality of intermediate units 37 that receive the output vectors of the input units 35 and perform predetermined processing; and an output layer 40 made up of a plurality of output units 39 that receive the output vectors of the intermediate units 37, perform predetermined processing, and generate the final output vector.
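
The three-layer structure can be sketched as a forward pass. The text does not state the activation functions, so the sigmoid hidden units and softmax outputs below are assumptions, as are the function names and the weight layout (one row of weights per receiving unit).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(inputs, weights, biases):
    """Weighted sum plus bias for each receiving unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(values):
    peak = max(values)                      # subtract the peak for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer.

    Returns both the hidden activations and the output scores, since the
    hidden activations are what the second-stage network consumes.
    """
    hidden = [sigmoid(a) for a in affine(x, w_hidden, b_hidden)]
    scores = softmax(affine(hidden, w_out, b_out))
    return hidden, scores
```

With all weights and biases set to zero, every hidden unit outputs 0.5 and the output scores are uniform, which is a convenient sanity check.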

In the subband neural network 32b, the connection weights (coupling coefficients) among the input layer 36, the hidden layer 38, and the output layer 40 are determined in advance by training. The pre-training of the discriminant analysis unit 27 is briefly described below.

In the pre-training mode, a training signal consisting of cut-out sections, obtained by cutting the training instantaneous phase feature and the training envelope feature to a predetermined length, is input from a database 42 through the buffers 21a and 21b to the input layer 36 of the subband neural network 32b, and the teacher signal corresponding to that training signal is input from the database 42 to the output layer 40.

Like the instantaneous phase feature and envelope feature described above, the training instantaneous phase feature and training envelope feature each consist of one frame per frequency band. The frame is cut to a predetermined length to obtain a predetermined number (five) of cut-out sections, which are supplied as input vectors to the input units 35 of the subband neural network 32b.

So that the output layer 40 yields the correct answer for the training instantaneous phase feature and training envelope feature, the subband neural network 32b undergoes training in which each coupling coefficient is changed until the final output vector produced for these training features matches the teacher signal. In the teacher signal, the unit corresponding to the correct phoneme is set to 1.0 and all other units are set to 0.
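
The teacher signal described above is a one-hot vector over the phoneme set. A minimal sketch, in which the function name and the example phoneme inventory are hypothetical:

```python
def teacher_signal(correct_phoneme, phoneme_set):
    """1.0 for the output unit of the correct phoneme, 0.0 for all others."""
    return [1.0 if p == correct_phoneme else 0.0 for p in phoneme_set]
```

For a three-phoneme inventory `["a", "r", "y"]` and the correct phoneme `"r"`, only the second unit is set to 1.0.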

In this embodiment, training uses the error backpropagation method: the error between the final output vector at the output layer 40 and the teacher signal is computed and propagated backward from the output layer 40 through the hidden layer 38 to the input layer 36; the coupling coefficients are adjusted by an amount that depends on this error; and the cut-out sections of the training instantaneous phase feature and training envelope feature are input to the input layer 36 again to recompute the error between the final output vector and the teacher signal.
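
The backpropagation loop can be sketched for a network with one hidden layer. The squared-error loss, the sigmoid activations on both layers, the omission of bias terms, and the learning rate are simplifying assumptions of this example, and all names are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def init_weights(rows, cols, rng):
    """Small random coupling coefficients (illustrative initialization)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def forward(x, w1, w2):
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]
    return hidden, output

def squared_error(output, target):
    return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

def backprop_step(x, target, w1, w2, rate=0.5):
    """One forward pass, one backward pass, one in-place weight update.

    Returns the error measured before the update.
    """
    hidden, output = forward(x, w1, w2)
    # Error terms at the output layer (derivative w.r.t. the pre-activation).
    delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
    # Error propagated back to the hidden layer through the old w2.
    delta_hid = [h * (1 - h) * sum(delta_out[k] * w2[k][j] for k in range(len(w2)))
                 for j, h in enumerate(hidden)]
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] -= rate * delta_out[k] * hidden[j]
    for j, row in enumerate(w1):
        for i in range(len(row)):
            row[i] -= rate * delta_hid[j] * x[i]
    return squared_error(output, target)
```

Calling `backprop_step` repeatedly on the same training pair corresponds to the cycle in the text: compute the error, propagate it backward, adjust the coefficients, and present the input again.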

In this way, the subband neural networks 32a to 32n are each trained for one of the frequency bands produced by the FIR filters 7a to 7n, and perform phoneme estimation independently for each frequency band.

In addition, in the discriminant analysis unit 27, the integrated neural network 31 is connected to the subband neural networks 32a to 32n and combines the per-band phoneme estimation results obtained from them.

Like the subband neural networks 32a to 32n, the second-stage integrated neural network 31 is a multilayer perceptron (MLP), and the connection weights (coupling coefficients) among its input layer 45, hidden layer 46, and output layer 47 are determined in advance by training with the error backpropagation method described above.

Specifically, the integrated neural network 31, a multilayer perceptron integrated neural network, consists of an input layer 45 made up of a plurality of input units 48 that individually receive, as input vectors, the output vectors from the hidden layers 38 of the subband neural networks 32a to 32n; a hidden layer 46 made up of a plurality of intermediate units 49 that receive the output vectors of the input units 48 and perform predetermined processing; and an output layer 47 made up of a plurality of output units 50 that receive the output vectors of the intermediate units 49, perform predetermined processing, and generate the final output vector.

In the pre-training mode, the input vector to the input layer 45 of the integrated neural network 31 is formed from the output vectors produced by the hidden layers 38 of the subband neural networks 32a to 32n for the training signal (the training instantaneous phase feature and training envelope feature), concatenated across all frequency bands. The teacher signal input to the output layer 47 is the same one used for the subband neural networks 32a to 32n.
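
The two-stage arrangement, in which hidden-layer outputs from every band are concatenated and fed to the integrated network, can be sketched as follows. The sigmoid and softmax activations, the parameter layout, and the names `two_stage_forward`, `band_params`, and `merger_params` are assumptions made for the example.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def affine(inputs, weights, biases):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def hidden_activations(x, weights, biases):
    return [sigmoid(a) for a in affine(x, weights, biases)]

def two_stage_forward(band_inputs, band_params, merger_params):
    """band_params[i] = (w_hidden, b_hidden) of the i-th subband network.

    merger_params = (w_hidden, b_hidden, w_out, b_out) of the integrated network.
    Returns the concatenated subband hidden outputs and the final scores.
    """
    merged = []
    for x, (w, b) in zip(band_inputs, band_params):
        merged.extend(hidden_activations(x, w, b))   # concatenate across bands
    w_h, b_h, w_o, b_o = merger_params
    merger_hidden = hidden_activations(merged, w_h, b_h)
    return merged, softmax(affine(merger_hidden, w_o, b_o))
```

With three bands and two hidden units per subband network, the integrated network receives a six-element input vector.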

That is, the integrated neural network 31 is trained by changing its coupling coefficients so that the final output vector at the output layer 47, obtained when the training instantaneous phase feature and training envelope feature are given to the input layers 36 of the trained subband neural networks 32a to 32n, matches the teacher signal.

FIG. 9 shows the relationship between the cut-out position and the teacher signal. As shown in FIG. 9, if the utterance is, for example, the phoneme sequence "a", "r", "a", "y", and the phoneme at the center of a frame F is "r", the coupling coefficients are changed so that this frame F is estimated as the phoneme "r".
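
Choosing the teacher label from the center of the frame can be sketched as a lookup in a time-aligned phoneme transcript. The segment timings, the tuple layout `(phoneme, start_ms, end_ms)`, and the name `center_label` are hypothetical and serve only to illustrate the rule.

```python
def center_label(segments, frame_start_ms, frame_length_ms):
    """Return the phoneme whose time span covers the center of the frame.

    segments : list of (phoneme, start_ms, end_ms), with end_ms exclusive
    Returns None if the center falls outside every segment.
    """
    center = frame_start_ms + frame_length_ms / 2
    for phoneme, start, end in segments:
        if start <= center < end:
            return phoneme
    return None
```

For the sequence "a", "r", "a", "y", a frame whose midpoint falls inside the span of "r" is labeled "r", even though the long window also covers the neighboring phonemes.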

Once the coupling coefficients of the subband neural networks 32a to 32n and the integrated neural network 31 have been determined by this pre-training, and training of the feature conversion unit 52 is also complete, the discriminant analysis unit 27 switches from the pre-training mode to the speech recognition mode.

This training procedure is summarized below with reference to the flowchart in FIG. 10. From the start step, the speech processing unit 4 proceeds to step SP1, where it begins training the subband neural networks 32a to 32n, and then proceeds to step SP2. In step SP2, the speech processing unit 4 determines whether training of the subband neural networks 32a to 32n is complete.

A negative result here means that training of the subband neural networks 32a to 32n is not yet complete; the speech processing unit 4 returns to step SP1 and repeats the above processing until a positive result is obtained in step SP2.

A positive result in step SP2 means that training of the subband neural networks 32a to 32n is complete, and the speech processing unit 4 proceeds to step SP3.

In step SP3, the speech processing unit 4 begins training the integrated neural network 31 and proceeds to step SP4.

In step SP4, the speech processing unit 4 determines whether training of the integrated neural network 31 is complete. A negative result means that it is not yet complete; the speech processing unit 4 returns to step SP3 and repeats the above processing until a positive result is obtained in step SP4.

A positive result in step SP4 means that training of the integrated neural network 31 is complete, and the speech processing unit 4 proceeds to step SP5. In step SP5, the speech processing unit 4 trains the feature conversion unit 52, then proceeds to step SP6 and ends the training procedure.
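
The control flow of steps SP1 to SP6 can be sketched as two training loops followed by a final training step. The callables passed in are placeholders for the actual training and completion checks, and the returned trace of step names is added here only to make the order of steps visible. All names are hypothetical.

```python
def run_training(train_subbands, subbands_done, train_merger, merger_done,
                 train_feature_transform):
    """Follow the SP1..SP6 flow and return the sequence of steps visited."""
    trace = []
    while True:
        train_subbands()            # SP1: train the subband networks
        trace.append("SP1")
        trace.append("SP2")         # SP2: check whether they are done
        if subbands_done():
            break
    while True:
        train_merger()              # SP3: train the integrated network
        trace.append("SP3")
        trace.append("SP4")         # SP4: check whether it is done
        if merger_done():
            break
    train_feature_transform()       # SP5: train the feature conversion unit
    trace.append("SP5")
    trace.append("SP6")             # SP6: end of the procedure
    return trace
```

The second loop starts only after the first has finished, which reflects the dependence of the integrated network on the trained subband networks.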

Thus, when speech recognition is actually performed, cut-out sections extracted from the instantaneous phase feature quantity and the envelope feature quantity of one frame F are supplied individually, as input vectors, to the input units 35 of the learned subband neural networks 32a to 32n.

As a result, each of the subband neural networks 32a to 32n obtains, in its hidden layer 38, an optimal phoneme estimation result for its frequency band based on the connection weights obtained by learning, and passes this result to the learned integrated neural network 31.

The integrated neural network 31 combines the phoneme estimation results of the individual frequency bands based on the connection weights of its units and generates the speech feature quantity.
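The two-stage forward pass described above can be sketched as follows. This is a minimal sketch under stated assumptions: the sigmoid and softmax activations, the single hidden layer in the integrated network, and the function name `estimate_phoneme_scores` are not taken from the description and are chosen only for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()


def estimate_phoneme_scores(band_inputs, subband_params, merger_params):
    """Forward pass: per-band hidden activations, then one merging network.

    band_inputs:    list of 1-D input vectors, one per frequency band
    subband_params: list of (weights, bias) pairs, one per band
    merger_params:  (w1, b1, w2, b2) of the integrated network (assumed shape)
    """
    # Each subband network sees only its own band's input vector.
    hidden = [sigmoid(w @ x + b)
              for x, (w, b) in zip(band_inputs, subband_params)]
    # The integrated network receives all band results side by side.
    merged = np.concatenate(hidden)
    w1, b1, w2, b2 = merger_params
    h = sigmoid(w1 @ merged + b1)
    return softmax(w2 @ h + b2)  # one score per phoneme class
```

With three bands, 8 inputs and 4 hidden units per band, the merged vector has 12 elements, so `w1` must have 12 columns; the number of rows of `w2` sets the number of phoneme classes.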

Although the discrimination analysis unit 27 can thus estimate, from one frame F, the phoneme corresponding to that frame F, use as part of the speech recognition device 1 ultimately requires time-series classification across multiple frames F, not just recognition within a single frame F.

For this reason, the speech processing unit 4 of the present invention performs time-series classification with a GM-HMM (Gaussian mixture hidden Markov model) whose speech feature quantity is the output score of the integrated neural network 31 mapped to another space.

Here, the feature conversion unit 52 maps the speech feature quantity, which is the output score of the integrated neural network 31, to another space using a logarithmic function and then orthogonalizes it using the Karhunen-Loève transform (KLT).
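The logarithmic mapping followed by the KLT can be sketched as follows. This is a sketch under assumptions rather than the implementation of the feature conversion unit 52: the small constant `eps`, the estimation of the KLT basis from the sample covariance of training scores, and the function names are all illustrative choices.

```python
import numpy as np


def log_map(scores, eps=1e-10):
    """Map output scores to the log domain; eps guards against log(0)."""
    return np.log(scores + eps)


def fit_klt(train_scores):
    """Estimate the mean and KLT basis from rows of training output scores."""
    logs = log_map(train_scores)
    mean = logs.mean(axis=0)
    cov = np.cov(logs - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]        # largest variance first
    return mean, eigvecs[:, order]


def convert_features(scores, mean, basis):
    """Apply the log mapping, then project onto the KLT basis."""
    return (log_map(scores) - mean) @ basis
```

After the projection, the components are mutually uncorrelated on the data used to fit the basis. Decorrelated features are a common fit for Gaussian mixture models with diagonal covariances, although the description itself does not state the covariance structure used in the GM-HMM.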

The feature conversion unit 52 thereby converts a speech feature quantity having the waveform shown in FIG. 11(A) before feature conversion into one having the waveform shown in FIG. 11(B), and supplies it to the speech recognition unit. The speech recognition unit can thus perform speech recognition based on the converted speech feature quantity.

According to the above configuration, the speech recognition device 1 calculates, from the speech signal, the zeroth-order prediction residual Φ_b(n) as the instantaneous phase feature quantity and the logarithmic envelope m_b(n) as the envelope feature quantity, discriminatively analyzes these quantities using the learned subband neural networks 32a to 32n and the learned integrated neural network 31, and obtains the speech feature quantity by mapping the analysis result to another space with the feature conversion unit 52. Using the speech feature quantity generated in this way can further improve speech recognition accuracy compared with conventional approaches.

Note that the instantaneous phase, rather than the instantaneous frequency, is used as the input to the subband neural networks 32a to 32n and the integrated neural network 31 of the discrimination analysis unit 27 because the connection weights can learn an arbitrary transformation matrix that is advantageous for discrimination.
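One way to read this rationale, offered here as an interpretation rather than as something the description states, is that a discrete-time instantaneous frequency is a linear function of the phase samples (a first difference), so any first-layer weights that would act on frequency can be folded into weights that act on phase. The sketch below assumes an unwrapped phase sequence and uses hypothetical function names.

```python
import numpy as np


def difference_matrix(n):
    """Return the (n-1) x n matrix D with (D @ x)[i] = x[i+1] - x[i]."""
    d = np.zeros((n - 1, n))
    idx = np.arange(n - 1)
    d[idx, idx] = -1.0
    d[idx, idx + 1] = 1.0
    return d


def fold_into_phase_weights(w_freq, n):
    """Turn weights acting on frequency into equivalent weights on phase."""
    return w_freq @ difference_matrix(n)
```

For a phase vector `phase` of length `n`, `w_freq @ np.diff(phase)` and `fold_into_phase_weights(w_freq, n) @ phase` give the same result, so a network fed with phase loses nothing relative to one fed with this form of frequency and remains free to learn other linear combinations.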

Next, to confirm that the proposed method of the present invention is effective for improving the accuracy of spontaneous speech recognition, a verification test was conducted in which phoneme accuracy was compared among several types of speech recognition devices on a continuous phoneme recognition task using a phoneme bigram.

In the verification test, LTHP (Long-Term Hilbert Pair) denotes the zeroth-order prediction residual Φ_b(n) and the logarithmic envelope m_b(n) described above, resampled at 10 Hz. LTHP-tandem denotes the result of analyzing the zeroth-order prediction residual Φ_b(n) and the logarithmic envelope m_b(n), resampled at 10 Hz, with the discrimination analysis unit 27, and corresponds to the speech processing of the present invention.

In this verification test, a speech recognition device (baseline) was used whose feature quantity (38 dimensions) combines MFCC (Mel Frequency Cepstral Coefficients), MFCC′, MFCC″, E′, and E″. Here, "′" denotes the first derivative with respect to time, "″" denotes the second derivative with respect to time, and "E" denotes energy.
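The 38 dimensions are consistent with 12 MFCCs (the dimensionality given for MFCC in the other test conditions), their first and second derivatives, and the two energy derivatives: 12 + 12 + 12 + 1 + 1 = 38. The sketch below shows one way to assemble such a vector. The use of `np.gradient` as the derivative estimator and the function name `baseline_features` are assumptions, since the description does not say how the derivatives were computed.

```python
import numpy as np


def baseline_features(mfcc, energy):
    """Stack MFCC, MFCC', MFCC'', E' and E'' into one vector per frame.

    mfcc:   array of shape (frames, 12)
    energy: array of shape (frames,)
    """
    d_mfcc = np.gradient(mfcc, axis=0)      # first time derivative (assumed estimator)
    dd_mfcc = np.gradient(d_mfcc, axis=0)   # second time derivative
    d_e = np.gradient(energy)
    dd_e = np.gradient(d_e)
    # The raw energy E itself is not part of the listed feature set.
    return np.hstack([mfcc, d_mfcc, dd_mfcc, d_e[:, None], dd_e[:, None]])
```

For an input of 50 frames, the result has shape `(50, 38)`.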

The verification test also used a speech recognition device using HATs (40 dimensions), one using HATs (26 dimensions) together with MFCC (12 dimensions), one using LTHP-tandem (40 dimensions), and one using LTHP-tandem (26 dimensions) together with MFCC (12 dimensions).

The data used to train the HMMs, the MLPs (multilayer perceptrons), and the phoneme bigram totaled 9 hours: 7 hours of ATR phoneme-balanced sentences and 2 hours of CSJ.

For the sake of comparison, the MLPs used in the speech recognition device using HATs and in the one using LTHP were matched so that each had a total of about 100,000 parameters. As the test set, 0.5 hours each of presentation speech and reread speech from CSJ were used. In the reread speech, the same speakers as in the presentation speech read aloud the same utterance content. The phoneme accuracy in this case (100% - substitution error rate - insertion error rate - deletion error rate) is shown in Table 1 below.
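The phoneme accuracy defined in parentheses above can be computed as in the following sketch. It assumes that each error rate is the corresponding error count divided by the number of reference phonemes, which is the usual convention but is not spelled out in the description. The counts in the usage note are invented for illustration and are not results from the verification test.

```python
def phoneme_accuracy(n_reference, substitutions, insertions, deletions):
    """Return 100% minus substitution, insertion and deletion error rates."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")

    def rate(count):
        # Error rate in percent, relative to the reference phoneme count.
        return 100.0 * count / n_reference

    return 100.0 - rate(substitutions) - rate(insertions) - rate(deletions)
```

For example, `phoneme_accuracy(200, 20, 6, 4)` evaluates to `85.0`. Because insertions are counted, this measure can become negative when a recognizer inserts many spurious phonemes.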

The phoneme recognition accuracy of the baseline was 55.4% for reread speech and 47.7% for presentation speech. In contrast, the speech recognition device using the LTHP-tandem of the present invention achieved recognition accuracy comparable to MFCC as a single stream, and combining it with MFCC was confirmed to improve phoneme recognition accuracy for both presentation speech and reread speech. This shows that, at equal dimensionality, using the speech feature quantity derived by the speech recognition device of the present invention contributes more to accuracy than using the MFCC derivatives.

Even for presentation speech, whose articulation is considered to be less careful than that of reread speech, using the LTHP-tandem of the present invention in a multi-stream configuration with MFCC achieved recognition accuracy exceeding the baseline. This indicates that the speech recognition device of the present invention is also applicable to spontaneous speech.

Thus, using the phoneme recognition accuracy of a continuous phoneme recognition task, it was shown that when the instantaneous phase feature quantity and the envelope feature quantity are discriminatively analyzed with MLPs and the analysis result is mapped to another space to obtain a speech feature quantity, that feature quantity, in combination with MFCC, is effective for improving the accuracy of a speech recognition device.

Although an embodiment of the present invention has been described above, the present invention is not limited to this embodiment, and various modifications are possible.

FIG. 1 is a block diagram showing the circuit configuration of a speech recognition device according to the present invention.
FIG. 2 is a schematic diagram showing the filter shapes of the filter bank.
FIG. 3 is a block diagram showing the detailed circuit configuration of the analysis unit.
FIG. 4 is a schematic diagram showing instantaneous phase processing in the instantaneous phase feature extraction unit.
FIG. 5 is a schematic diagram showing the signal waveforms of the instantaneous phase feature quantity and the envelope feature quantity.
FIG. 6 is a schematic diagram illustrating the cut-out section extracted from one frame of the instantaneous phase feature quantity.
FIG. 7 is a schematic diagram showing the input vectors of the instantaneous phase feature quantity and the envelope feature quantity supplied to the input layer of a subband neural network.
FIG. 8 is a block diagram showing the detailed circuit configuration of the discrimination analysis unit.
FIG. 9 is a schematic diagram showing the relationship between the cut-out position and the teacher signal.
FIG. 10 is a flowchart showing the learning procedure.
FIG. 11 is a graph showing the speech feature quantity before and after conversion in the feature conversion unit.

Explanation of symbols

1 Speech recognition device
4 Speech processing unit (speech processing device)
7a to 7n FIR filters
15 Instantaneous phase feature extraction unit
16 Envelope feature extraction unit
27 Discrimination analysis unit
32a to 32n Subband neural networks (multilayer perceptron type neural network, multilayer perceptron type subband neural network)
31 Integrated neural network (multilayer perceptron type neural network, multilayer perceptron type integrated neural network)

Claims

1. A speech processing device comprising:
an instantaneous phase feature extraction unit that extracts an instantaneous phase feature quantity from each divided band signal obtained by dividing a speech signal into a predetermined number of frequency bands with FIR (finite impulse response) filters;
an envelope feature extraction unit that extracts an envelope feature quantity from each of the divided band signals; and
a discrimination analysis unit that generates a speech feature quantity by cutting out each instantaneous phase feature quantity and each envelope feature quantity in sections of a predetermined time and analyzing them with a multilayer perceptron type neural network.

2. The speech processing device according to claim 1, further comprising:
a Hilbert transform filter that generates an imaginary-part analytic signal by applying a Hilbert transform to each of the divided band signals; and
an adder circuit that adds each divided band signal and the imaginary-part analytic signal corresponding to that divided band signal and supplies the result to the instantaneous phase feature extraction unit and the envelope feature extraction unit.

3. The speech processing device according to claim 1 or 2, wherein the multilayer perceptron type neural network of the discrimination analysis unit comprises multilayer perceptron type subband neural networks that perform independent phoneme estimation for each of the divided band signals, and a multilayer perceptron type integrated neural network that combines the phoneme estimation results for the divided band signals to generate the speech feature quantity.

4. A speech recognition device that performs speech recognition using the speech feature quantity obtained by the speech processing device according to any one of claims 1 to 3.