JPH0490600A

JPH0490600A - Voice recognition device

Info

Publication number: JPH0490600A
Application number: JP2205249A
Authority: JP
Inventors: Naoto Iwahashi; 直人岩橋
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 1990-08-03
Filing date: 1990-08-03
Publication date: 1992-03-24

Abstract

PURPOSE: To enable accurate voice recognition by obtaining pitch pattern information on a voice from information on a cepstrum time series and extracting the features of an input voice signal.

CONSTITUTION: The voice recognition device is provided with a cepstrum arithmetic function block 3, which finds a cepstrum for every predetermined number of samples of an input voice signal and outputs information on the resulting cepstrum time series, and a signal processing function block 4, which obtains the voice pitch pattern information corresponding to the input information according to, for example, the back propagation learning rule of a neural network. The pitch pattern of the voice is obtained from the information supplied by block 3, and features of the input voice signal are extracted, such as information on the accent type as the basic type of the voice, maximal points, minimal points, or other information required for voice recognition. Information on these features is output from an output terminal 5.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a speech recognition device that recognizes uttered speech.

[Summary of the Invention]

The present invention obtains cepstrum time-series information from an input speech signal, obtains speech pitch pattern information from it, and extracts the features of the input speech signal. The features of the input speech signal can thereby be determined easily and with high accuracy, and the invention provides a speech recognition device capable of accurately recognizing the input speech signal.

[Conventional technology]

Conventionally, in so-called speech recognition, in which actually uttered speech is recognized by a computer, an electronic circuit, or the like, information on the most basic features of the speech is extracted from the information contained in the speech wave, and the uttered speech is recognized on the basis of the extracted information.

For example, the extracted information on speech features includes the local maximum and local minimum points obtained from the pitch pattern of the speech (the time-series pattern of the pitch frequency). A maximum point is a peak of the pitch pattern, and a minimum point is a valley. A minimum point often indicates a phrase boundary in a continuously uttered sentence, and the intonation, accent, and so on of the speech can be determined from these maximum and minimum points.
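As a minimal sketch of what "maximum and minimum points of a pitch pattern" means in practice, the following Python function scans a per-frame pitch contour for peaks and valleys. The function name `pitch_extrema` and the sample contour are hypothetical and chosen only for illustration; the patent does not specify how these points are located.

```python
def pitch_extrema(pitch):
    """Return (maxima, minima) index lists for a pitch pattern.

    `pitch` is a sequence of pitch frequencies in Hz, one per frame.
    An interior frame counts as a maximum if it is higher than both
    neighbours and as a minimum if it is lower than both.
    Plateaus are ignored in this simplified sketch.
    """
    maxima, minima = [], []
    for i in range(1, len(pitch) - 1):
        if pitch[i - 1] < pitch[i] > pitch[i + 1]:
            maxima.append(i)
        elif pitch[i - 1] > pitch[i] < pitch[i + 1]:
            minima.append(i)
    return maxima, minima


# Hypothetical pitch pattern (Hz): two phrases separated by a valley.
contour = [120, 150, 180, 160, 130, 110, 140, 170, 150, 125]
peaks, valleys = pitch_extrema(contour)
print(peaks, valleys)  # [2, 7] [5]
```

Here the valley at frame 5 would be the candidate phrase boundary. An error in a single pitch value can move or create such points, which is the weakness the following sections discuss.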

To obtain the pitch pattern of the speech that is needed for the information on these maximum and minimum points, the pitch frequency of the input speech signal must first be determined precisely. Several algorithms have been proposed for determining the pitch frequency, for example the so-called autocorrelation method and the cepstrum method.

That is, the information on the speech features is extracted from the pitch pattern obtained by one of these algorithms.

[Problems to be Solved by the Invention]

However, it is generally difficult to determine precisely the pitch frequency used in such speech recognition, and obtaining a completely accurate pitch frequency is not easy.

Moreover, if the features of the pitch pattern are extracted without accurately determining the pitch frequency, the result is ambiguous.

That is, even if one tries to find the maximum and minimum points described above, an error in the pitch frequency prevents the correct maximum and minimum points from being obtained, and good speech recognition ultimately becomes impossible.

The present invention has been proposed in view of these circumstances. Its object is to provide a speech recognition device that can easily extract speech features without determining an accurate pitch frequency and that can perform accurate speech recognition.

[Means to solve the problem]

The speech recognition device of the present invention has been proposed to achieve the above object. It has cepstrum calculation means, which obtains a cepstrum for every predetermined number of samples of an input speech signal and outputs information on a cepstrum time series consisting of the plurality of cepstra obtained, and signal processing means, which obtains speech pitch pattern information corresponding to the input cepstrum time-series information. The device obtains the pitch pattern information of the speech on the basis of the cepstrum time-series information from the cepstrum calculation means and extracts the features of the input speech signal.

An example of the signal processing means is a so-called neural network that has undergone a learning process associating input cepstrum time-series information with information on the speech features that correspond to the pitch pattern of the speech. In the learning of this neural network, for example, the output signal is produced from input cepstrum time-series information corresponding to speech pitch pattern information for the speech features of interest, such as a predetermined accent type, the maximum and minimum points mentioned above, or other information necessary for speech recognition. The synaptic connections between neurons are then changed according to the error between this output signal and a teacher signal, so as to reduce that error.

[Operation]

According to the present invention, the process of pitch estimation based on the cepstrum is simple. In addition, because the signal processing means has been made in advance to associate input cepstrum time-series information with the speech features corresponding to the pitch pattern information of the speech, it yields information on the speech features that corresponds to the cepstrum time-series information from the cepstrum calculation means.

[Example]

An embodiment to which the present invention is applied is described below with reference to the drawings.

Fig. 1 shows the functional blocks of a speech recognition device according to an embodiment of the present invention.

The speech recognition device shown in the functional blocks of Fig. 1 has a cepstrum calculation function block 3 and a signal processing function block 4. The cepstrum calculation function block 3 obtains a cepstrum for every predetermined number of samples of the input speech signal and outputs information on a cepstrum time series (a sequence of cepstra) consisting of the plurality of cepstra obtained. The signal processing function block 4 obtains the speech pitch pattern information corresponding to the input cepstrum time-series information, for example according to the backpropagation learning rule of a so-called neural network, described later. On the basis of the cepstrum time-series information from the cepstrum calculation function block 3, the device obtains the pitch pattern information of the speech and extracts the features of the input speech signal, such as the accent-type information as the basic type of speech shown in Figs. 7 and 8 (described later), the maximum and minimum points mentioned above, or other information necessary for speech recognition. Information on these features of the input speech signal is output from output terminal 5.

The input speech signal supplied to the cepstrum calculation function block 3 is sample data of a speech waveform signal received through input terminal 1, such as that shown in Fig. 2, divided into blocks of a predetermined number of samples, for example 512 samples (taken out, for example, by applying a Hamming window).

This blocking is performed by buffer function block 2. As shown in Fig. 2, the blocks produced by buffer function block 2 are obtained by shifting the 512-sample block by m samples at a time (m1, m2, m3, ...; m is an integer of 1 or more, with m < 512), so that successive blocks contain mutually overlapping data.
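The blocking step can be sketched as follows, assuming NumPy is available. The function name `make_blocks` and the shift of 128 samples are assumptions for illustration; the text only requires 512-sample blocks, a window such as Hamming, and a shift m with 1 <= m < 512.

```python
import numpy as np


def make_blocks(samples, block_len=512, shift=128):
    """Cut `samples` into overlapping, Hamming-windowed blocks.

    block_len: samples per block (512 in the text).
    shift:     the shift m between block starts, 1 <= m < block_len.
               128 is an assumed value; the text does not fix m.
    Returns an array of shape (num_blocks, block_len).
    """
    samples = np.asarray(samples, dtype=float)
    if not 1 <= shift < block_len:
        raise ValueError("shift m must satisfy 1 <= m < block_len")
    if len(samples) < block_len:
        raise ValueError("need at least one full block of samples")
    window = np.hamming(block_len)
    starts = range(0, len(samples) - block_len + 1, shift)
    return np.stack([samples[s:s + block_len] * window for s in starts])


# Stand-in for a speech waveform: 2048 random samples.
blocks = make_blocks(np.random.default_rng(0).standard_normal(2048))
print(blocks.shape)  # (13, 512)
```

Because the shift is smaller than the block length, consecutive rows share 512 - m samples, which is the overlap described above.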

The 512 samples of each block from buffer function block 2 are sent to the cepstrum calculation function block 3, which obtains cepstrum time-series data consisting of the cepstrum of the 512 samples of each block.

The cepstrum is defined as the inverse Fourier transform of the logarithm of the short-time amplitude spectrum (power spectrum) of a waveform, and it has the property that the spectral envelope and the fine structure can be approximately separated and extracted. In this embodiment, the cepstrum is computed in order to obtain the speech pitch pattern information needed to extract information on speech features. With the cepstrum method, the pitch pattern can be obtained more easily than with other methods of calculating the speech pitch pattern (such as the autocorrelation method mentioned above or the SIFT method).

The calculation by which the cepstrum calculation function block 3 obtains the cepstrum time series is performed as follows.

First, the blocked sample data is subjected to fast Fourier transform (FFT) processing:

$$X(k) = \sum_{n=0}^{N-1} x(n) \exp(-j k n 2\pi / N) \qquad (1)$$

where $k = 0, 1, 2, \ldots, N-1$ and $0 \le n \le N-1$. Next, the logarithm of the power spectrum obtained from the FFT-processed data is taken:

$$X_L(k) = \log |X(k)| \qquad (2)$$

Applying the inverse FFT to equation (2) then gives the cepstrum $x_c(n)$:

$$x_c(n) = \frac{1}{N} \sum_{k=0}^{N-1} X_L(k) \exp(j k n 2\pi / N) \qquad (3)$$

The spectral envelope given by this cepstrum is, for example, as shown in Fig. 3, and this spectral envelope (pitch pattern) represents the features of the speech.
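The three steps just described (FFT, logarithm of the magnitude, inverse FFT) can be sketched with NumPy as below. The small constant added before the logarithm, the helper `cepstral_pitch_lag`, and the pulse-train test signal are assumptions made for this illustration and are not taken from the patent.

```python
import numpy as np


def cepstrum(block):
    """Cepstrum of one block: FFT, log magnitude, inverse FFT."""
    spectrum = np.fft.fft(block)
    # The 1e-12 offset is an assumed guard against log(0).
    log_magnitude = np.log(np.abs(spectrum) + 1e-12)
    # The log magnitude of a real signal is symmetric, so the inverse
    # FFT is real up to rounding error; keep only the real part.
    return np.fft.ifft(log_magnitude).real


def cepstral_pitch_lag(block, min_lag, max_lag):
    """Quefrency index (in samples) of the largest cepstral peak
    inside [min_lag, max_lag]."""
    c = cepstrum(block)
    return min_lag + int(np.argmax(c[min_lag:max_lag + 1]))


# Hypothetical voiced excitation: one pulse every 64 samples.
# (No window is applied here to keep the example easy to follow.)
pulses = np.zeros(512)
pulses[::64] = 1.0
lag = cepstral_pitch_lag(pulses, 20, 100)
print(lag)  # 64
```

At an assumed sampling rate of 8000 Hz, a lag of 64 samples would correspond to a pitch frequency of 125 Hz. Picking a single peak like this is exactly the step that can fail on real speech, which motivates the neural network stage described later.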

The cepstrum time series $x_c(n_1, n_2)$ is then

$$x_c(n_1, n_2) = \frac{1}{M} \sum_{k=0}^{M-1} X_L(n_1, k) \exp(j k n_2 2\pi / M) \qquad (4)$$

That is, written in the same simplified form as equations (1), (2), and (3), the cepstrum time series $x_c(n_1, n_2)$ is obtained from

$$X(n_1, k) = \sum_{n=0}^{M-1} x(n_1 + n) \exp(-j k n 2\pi / M) \qquad (5)$$

$$X_L(n_1, k) = \log |X(n_1, k)| \qquad (6)$$

In these equations, $M$ is the number of samples over which the Fourier transform (discrete Fourier transform) is taken in the cepstrum analysis. $x_c(n_1, n_2)$ is evaluated at the points $n_1 = L, 2L, 3L, \ldots$ (where $L = N/I$ and $I$ is an integer) and $n_2 = 0, 1, 2, \ldots, M-1$. The number of cepstrum values obtained in this case is $M \times I$.
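A sketch of how such a two-index cepstrum time series might be computed for a whole signal, assuming NumPy. The function name, the default values M = 512 and L = 128, the Hamming window, and the choice to start the first block at sample 0 are all assumptions for illustration.

```python
import numpy as np


def cepstrum_time_series(x, M=512, L=128):
    """Return an array c[b, n2]: one M-point cepstrum per block.

    x: 1-D signal.
    M: DFT length per block (assumed 512).
    L: shift between block starts (assumed 128).
    Row b holds the cepstrum of the block starting at sample b * L.
    """
    x = np.asarray(x, dtype=float)
    if len(x) < M:
        raise ValueError("signal shorter than one block")
    starts = np.arange(0, len(x) - M + 1, L)
    blocks = np.stack([x[s:s + M] for s in starts]) * np.hamming(M)
    X = np.fft.fft(blocks, axis=1)             # block-wise DFT
    X_L = np.log(np.abs(X) + 1e-12)            # log magnitude, guarded
    return np.fft.ifft(X_L, axis=1).real       # block-wise inverse DFT


signal = np.random.default_rng(1).standard_normal(2048)
series = cepstrum_time_series(signal)
print(series.shape)  # (13, 512)
```

Flattening this array gives one input vector for the signal processing stage, with one value per (block, quefrency) pair.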

When the cepstrum is obtained as described above, the nature of the input signal or an error in the cepstrum calculation can cause the envelope at certain times (for example t1 and t2 in the figure) to appear at a position different from the spectral envelope at other times, as in the cepstral spectral envelope shown in Fig. 4. That is, as shown in Fig. 5, among the cepstra of the time series (C1, C2, C3, ...), the cepstrum at certain times (t1, t2) may have two or more peaks because of the nature of the input signal or an error in the cepstrum calculation.

In such a case, speech feature extraction based on the cepstral spectral envelope (the pitch pattern of the speech) may not work well, and if feature extraction is inaccurate, speech recognition also becomes inaccurate. The vertical axis in Fig. 5 is a time axis, usually called quefrency.

In this embodiment, speech features can be extracted accurately even from a spectral envelope (speech pitch pattern) given by an inaccurate cepstrum of the kind described above. To this end, the output of the cepstrum calculation function block 3 is sent to the signal processing function block 4. As described earlier, the signal processing function block 4 is a neural network that has been trained in advance according to the so-called backpropagation learning rule. In this training, the synaptic connections between neurons are changed according to the error between two items: the output pattern information obtained from an input cepstrum time series that corresponds to speech pitch pattern information for extracting speech features (accent-type information, or information such as maximum and minimum points, described later), and the pattern information corresponding to that input cepstrum time-series information, which serves as the teacher signal.

Fig. 6 shows the structure of the neural network of the signal processing function block 4.

The neural network is modeled on the synaptic connections between neurons in the human brain. As shown in Fig. 6, a typical network of this kind has a multilayer (for example, three-layer) structure with intermediate layers 32a and 32b between an input layer 31 and an output layer 33. Each layer consists of a plurality of units u, which correspond to neurons. In this embodiment, the input layer 31 has M × I units, because the cepstrum time series is input to it, and the intermediate layers 32a and 32b have, for example, about 30 to 60 units. The output layer 33 has as many units as there are features to be extracted (classes to be distinguished) from the pitch pattern (the pattern of the spectral envelope given by the cepstrum). That is, it has as many units as there are accent types, as the basic types of speech, shown in Fig. 7 or Fig. 8 (described later).

The layers are connected (by synaptic connections) in the direction input layer → intermediate layer → output layer, with no connections from the output layer back toward the input layer, forming a so-called feedforward network. That is, each unit u of input layer 31 is connected to each unit u of intermediate layer 32a, each unit u of intermediate layer 32a is connected to each unit u of intermediate layer 32b, and each unit u of intermediate layer 32b is connected to each unit u of output layer 33, with no connections in the reverse direction. There are also no connections between units within a single layer.

Each unit u transforms its inputs from other units according to a fixed rule and outputs the result, and each connection to another unit carries a variable weight (link weight). The weight expresses the strength of the connection between units. For example, the connection between the i-th unit u_i in intermediate layer 32b and the j-th unit u_j in output layer 33 carries a variable weighting coefficient (coupling coefficient) W_ij. Changing the coupling coefficient W_ij changes the structure of the network (the synaptic connections change). Learning in the network means changing these coupling coefficients W_ij, which can be positive, zero, or negative; zero means no connection.

When a unit u receives inputs from multiple units, it transforms their sum with a predetermined function, such as a sigmoid function, and outputs the transformed value. That is, the values of the input signal are supplied to the units u of input layer 31 as input values, and the output value of each unit u is computed in turn from input layer 31 toward output layer 33, which yields the output value O_j of each unit u_j of output layer 33.
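The layer-by-layer computation can be sketched as a forward pass through fully connected layers with sigmoid units, assuming NumPy. The layer sizes (16 inputs, two intermediate layers of 30 units, 5 outputs), the random initial coefficients, and the absence of bias terms are assumptions for illustration; in the embodiment the input layer would have M × I units.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, weights):
    """Propagate input vector x through fully connected layers.

    weights: list of matrices; weights[l][i, j] is the coupling
    coefficient W_ij from unit i of one layer to unit j of the next.
    Returns the list of activations, input layer first.
    """
    activations = [np.asarray(x, dtype=float)]
    for W in weights:
        # Each unit sums its weighted inputs, then applies the sigmoid.
        activations.append(sigmoid(activations[-1] @ W))
    return activations


rng = np.random.default_rng(0)
sizes = [16, 30, 30, 5]  # assumed sizes: input, 32a, 32b, output
weights = [rng.normal(0.0, 0.5, (a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
outputs = forward(rng.normal(size=16), weights)[-1]
print(outputs.shape)  # (5,)
```

Each of the five outputs lies between 0 and 1, so the index of the largest one can be read as the selected class, for example an accent type.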

In the backpropagation learning algorithm, a learning process changes the coupling coefficients so as to minimize the sum of the squared errors between the actual output values O_j of the units u_j of output layer 33 when an input signal is applied and the teacher signal O_t, which represents the desired output values. As a result, output values O_j closest to the teacher signal O_t are obtained from output layer 33.
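A minimal sketch of one such learning step, assuming NumPy, plain gradient descent on half the sum of squared errors, sigmoid units, and a single intermediate layer for brevity (the embodiment uses two). The function name `train_step`, the learning rate, and the layer sizes are assumptions and are not taken from the patent.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_step(x, target, W1, W2, lr=0.5):
    """One backpropagation update; returns the error before the update.

    x:      input vector.
    target: teacher signal (desired output values).
    W1, W2: coupling coefficients input->hidden and hidden->output,
            updated in place.
    """
    h = sigmoid(x @ W1)                          # hidden-layer outputs
    o = sigmoid(h @ W2)                          # output-layer outputs O_j
    err = o - target
    delta_o = err * o * (1.0 - o)                # error signal at outputs
    delta_h = (delta_o @ W2.T) * h * (1.0 - h)   # propagated back to hidden
    W2 -= lr * np.outer(h, delta_o)
    W1 -= lr * np.outer(x, delta_h)
    return 0.5 * float(err @ err)


rng = np.random.default_rng(0)
W1 = rng.normal(0.0, 0.5, (8, 4))
W2 = rng.normal(0.0, 0.5, (4, 3))
x = rng.normal(size=8)
target = np.array([1.0, 0.0, 0.0])   # teacher signal: class 0 of 3
errors = [train_step(x, target, W1, W2) for _ in range(200)]
print(errors[0], errors[-1])
```

Repeating the step on the same pair drives the squared error down, so the last printed value is smaller than the first. A real training run would cycle over many cepstrum time series and their teacher patterns.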

That is, input layer 31 is supplied, as its input signal, with input cepstrum time-series information corresponding to pattern information for the accent types and the like that serve as the basic pattern information of speech. Learning is carried out so as to minimize the error between the output pattern information obtained from output layer 33 and the pattern information corresponding to that input cepstrum time-series information, which serves as the teacher signal O_t. When cepstrum time-series information is sent to the signal processing function block 4 trained in this way, output layer 33 yields the basic-type pattern information of the speech (pattern information such as accent type, maximum points, and minimum points) that corresponds to the pitch pattern information underlying the cepstrum time-series information.

In the embodiment described above, the basic-type pattern information of speech, such as the accent type, maximum points, minimum points, or other information necessary for speech recognition, is obtained directly from the cepstrum time-series information from the cepstrum calculation function block 3. The present invention is not limited to this, however: the device may instead output the pitch pattern information of the speech corresponding to the cepstrum time-series information. In that case, the basic-type pattern information of the speech is obtained from the resulting pitch pattern information.

Because the basic structure of the neural network described above normally has multiple inputs and multiple outputs, the cepstrum time-series data is sent to the signal processing function block 4 in parallel, and the result is converted back to serial form for output.

As an example of the information on speech features (basic patterns of speech) extracted from the pitch pattern information, Fig. 7 shows a table of accent types for nouns. In Fig. 7, nouns are classified by number of beats and by type. By number of beats, they are divided into one-beat words through five-beat words, and so on. By type, they are divided into the flat (heiban) type and the undulating (kifuku) type, and the undulating type is further divided into the tail-high (odaka), middle-high (nakadaka), and head-high (atamadaka) types.

The basic patterns of speech also include accent types defined in terms of moras, as shown in Fig. 8. For example, word accent in the Tokyo dialect has two characteristics: there is always a clear rise or fall in pitch from the first mora to the second, and the pitch falls at most once within a word. An n-mora word therefore has n + 1 accent types. For the four-mora nouns shown in Fig. 8, the accent types are called type 0, type 1, type 2, type 3, and type 4 according to the position of the pitch fall. The difference between type 0 and type 4 does not appear when the word is spoken alone, but becomes clear when a particle such as "ga" (が) is attached. These types can be discussed quantitatively in terms of fundamental frequency patterns.
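The n + 1 accent types can be enumerated with a simplified two-level (high/low) model, sketched below. The function name `accent_pattern` and the binary H/L encoding are assumptions for illustration; the patent describes the types only through the position of the pitch fall.

```python
def accent_pattern(n_mora, accent_type):
    """High/low pattern for an n-mora word plus one following particle.

    Type 0 has no fall; type k (1..n) falls right after mora k.
    Returns 'H'/'L' letters for the word, '+', then the particle.
    """
    if not 0 <= accent_type <= n_mora:
        raise ValueError("accent type must be in 0..n_mora")
    tones = []
    for mora in range(1, n_mora + 2):      # last position is the particle
        if accent_type == 1:
            high = mora == 1               # high start, then low
        else:
            fallen = accent_type != 0 and mora > accent_type
            high = mora >= 2 and not fallen  # low start, high until the fall
        tones.append("H" if high else "L")
    return "".join(tones[:-1]) + "+" + tones[-1]


for k in range(5):
    print(k, accent_pattern(4, k))
```

For four moras this prints LHHH+H, HLLL+L, LHLL+L, LHHL+L, and LHHH+L for types 0 to 4. Types 0 and 4 are identical on the word itself and differ only on the particle, which matches the remark above.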

Furthermore, the flat type in Fig. 7 corresponds to type 0 of the mora-based classification, the head-high type to type 1, the middle-high type to types 2 and 3, and the tail-high type to type 4. The undulating type corresponds to all types other than type 0.

As described above, in this embodiment the cepstrum calculation function block 3 obtains cepstrum time-series information from the blocked sample data of the input speech signal, which is easier to obtain than with, for example, the autocorrelation method or the SIFT method. This information is supplied to the signal processing function block 4, a neural network trained with pattern information on the speech features represented by accent types such as those shown in Figs. 7 and 8. As a result, the pitch frequency pattern (the pitch pattern, that is, the spectral envelope) can be identified without accurately determining the pitch frequency of the input speech signal (even if the cepstral spectral envelope is inaccurate), and the features of the input speech signal can be determined easily and with high accuracy. Accurate recognition of the input speech signal therefore becomes possible.

[Effect of the invention]

The speech recognition device of the present invention obtains cepstrum time-series information from an input speech signal, obtains speech pitch pattern information from it, and extracts the features of the input speech signal. The pitch frequency pattern (pitch pattern) can therefore be identified easily without accurately determining the pitch frequency of the input speech signal, and the features of the input speech signal can be determined easily and with high accuracy. Accurate recognition of the input speech signal therefore becomes possible.

[Brief explanation of the drawings]

Fig. 1 is a functional block diagram showing the schematic configuration of a speech recognition device according to an embodiment of the present invention. Fig. 2 is a waveform diagram showing an input speech signal waveform. Fig. 3 is a diagram showing a spectral envelope given by the cepstrum. Fig. 4 is a diagram showing a cepstral spectral envelope in which the envelope appears at a different position because of the nature of the input signal or a calculation error. Fig. 5 is a diagram showing a cepstrum time series. Fig. 6 is a diagram showing the configuration of the neural network. Fig. 7 is a diagram showing the accent types of nouns. Fig. 8 is a diagram showing the accent types of four-mora words in the Tokyo dialect.

2: buffer function block
3: cepstrum calculation function block
4: signal processing function block

Claims

[Claims]

A speech recognition device comprising: cepstrum calculation means for obtaining a cepstrum for every predetermined number of samples of an input speech signal and outputting information on a cepstrum time series consisting of the plurality of cepstra obtained; and signal processing means for obtaining speech pitch pattern information corresponding to the input cepstrum time-series information, wherein the pitch pattern information of the speech is obtained on the basis of the cepstrum time-series information from the cepstrum calculation means, and the features of the input speech signal are extracted.