JP3428309B2

JP3428309B2 - Voice recognition device

Info

Publication number: JP3428309B2
Application number: JP25109696A
Authority: JP
Inventors: 充遠藤
Original assignee: Panasonic Corp; Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Corp; Panasonic Holdings Corp
Priority date: 1996-09-24
Filing date: 1996-09-24
Publication date: 2003-07-22
Anticipated expiration: 2016-09-24
Also published as: JPH1097270A

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a speech recognition device that recognizes speech in which words are uttered continuously.

[0002]

[Description of the Related Art] In recent years, attempts have been made to improve the robustness of speech recognition devices. One such approach aims to improve the recognition rate as follows: even when the input speech contains an unknown word (a word other than the registered words), the unknown-word section is recognized as an unknown word, and the pronunciation content of the registered-word sections is correctly recognized as registered words.

[0003] Two types of speech recognition devices that handle such unknown words are known: a method using a phonetic typewriter ("Improvement Study of an Unknown Word Detection Method Using a Phonetic Typewriter," Proceedings of the Acoustical Society of Japan 1992 Autumn Meeting, 2-Q-24 (1992)) and a method using a garbage model ("A Study of Unknown Word Detection in Continuous Speech Recognition," Proceedings of the Acoustical Society of Japan 1995 Autumn Meeting, 1-Q-17 (1995)). The method using the phonetic typewriter is described below as Conventional Example 1, and the method using the garbage model as Conventional Example 2.

[0004] Conventional Example 1 is a method using a phonetic typewriter. A phonetic typewriter is a subword-based model built so that every pronunciation conceivable in Japanese can be recognized. The subwords used in Conventional Example 1 are phoneme segments.

[0005] FIG. 14 is a block diagram showing the schematic configuration of the speech recognition device of Conventional Example 1. Reference numeral 1 denotes a subword acoustic model storage unit that stores subword acoustic models created in advance; 7 denotes a typewriter acoustic model creation unit that creates a typewriter acoustic model expressing the acoustic features of the phonetic typewriter as a sequence of subword acoustic models; 4 denotes an unknown word acoustic model creation unit that creates an unknown word acoustic model expressing the acoustic features of unknown words (words other than the registered words) by means of the typewriter acoustic model and a typewriter penalty value; 8 denotes a typewriter penalty value storage unit that stores the typewriter penalty value; 5 denotes a connection rule storage unit that stores connection rules defining the acoustic model sequence corresponding to each word and the word sequences that can be output; and 6 denotes a recognition unit that connects the acoustic models (the subword acoustic models together with the unknown word acoustic model) according to the connection rules, matches them against the input speech, and outputs the resulting word sequence.

[0006] The operation of this speech recognition device is briefly described below. Before recognition is performed, the typewriter acoustic model creation unit 7 first creates the typewriter acoustic model from the subword acoustic models stored in the subword acoustic model storage unit 1. Next, the unknown word acoustic model creation unit 4 creates the unknown word acoustic model from the typewriter acoustic model created by the typewriter acoustic model creation unit 7 and the typewriter penalty value stored in the typewriter penalty value storage unit 8. During recognition, the recognition unit 6 connects the acoustic models (the subword acoustic models stored in the subword acoustic model storage unit 1 together with the unknown word acoustic model created by the unknown word acoustic model creation unit 4) according to the connection rules stored in the connection rule storage unit 5, matches them against the input speech, and outputs the resulting word sequence.

[0007] The typewriter penalty value adjusts for the difference between the conditions required of the matching score of the unknown word acoustic model and the matching score given by the typewriter acoustic model. The optimum value is found by trying various ways of applying the typewriter penalty and various values for it, and repeating evaluation experiments.

[0008] Conventional Example 2 uses a garbage model. A garbage model is a model that mixes the acoustic features of various speech sounds and represents them with a small number of classes. The subwords used in Conventional Example 2 are phonemes.

[0009] FIG. 15 is a block diagram showing the schematic configuration of the speech recognition device of Conventional Example 2. Reference numeral 1 denotes a subword acoustic model storage unit that stores subword acoustic models created in advance; 9 denotes a garbage model storage unit that stores garbage models, which represent various acoustic features with a small number of models; 4 denotes an unknown word acoustic model creation unit that creates an unknown word acoustic model expressing the acoustic features of unknown words (words other than the registered words) as a sequence of garbage models; 5 denotes a connection rule storage unit that stores connection rules defining the acoustic model sequence corresponding to each word and the word sequences that can be output; and 6 denotes a recognition unit that connects the acoustic models (the subword acoustic models together with the unknown word acoustic model) according to connection rules in which unknown word connection rules have been added to the registered word connection rules, matches them against the input speech, and outputs the resulting word sequence.

[0010] The operation of the speech recognition device of Conventional Example 2 is briefly described below. Before recognition is performed, the unknown word acoustic model creation unit 4 creates the unknown word acoustic model from the garbage models stored in the garbage model storage unit 9. During recognition, the recognition unit 6 connects the acoustic models (the subword acoustic models stored in the subword acoustic model storage unit 1 together with the unknown word acoustic model created by the unknown word acoustic model creation unit 4) according to the connection rules stored in the connection rule storage unit 5, matches them against the input speech, and outputs the resulting word sequence.

[0011] In this example, to determine which phoneme clusters should be used when creating the garbage model, various clusters are set up and evaluation experiments are repeated to find the optimum clusters.

[0012]

[Problems to Be Solved by the Invention] Speech recognition devices are required to be robust. One such requirement is that, even when the input speech contains an unknown word (a word other than the registered words), the unknown-word section be recognized as an unknown word and the pronunciation content of the registered-word sections be correctly recognized as registered words. Unknown words can be handled by the conventional methods, but in Conventional Example 1 the typewriter penalty value is set by trial and error, which requires an enormous amount of work and incurs high development cost.

[0013] Conventional Example 2 likewise requires trial and error to create the garbage model, which requires an enormous amount of work and incurs high development cost.

[0014] An object of the present invention is to realize a speech recognition device that is robust to unknown words and easy to develop, requiring neither the setting of a typewriter penalty value nor the creation of a garbage model.

[0015]

[Means for Solving the Problems] To solve these problems, the present invention comprises: a subword acoustic model storage unit that stores subword acoustic models created in advance; a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing a syllable uttered in isolation by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models; an unknown word acoustic model creation unit that creates an unknown word acoustic model expressing the acoustic features of unknown words (words other than the registered words) as a sequence of the monosyllable acoustic models; a connection rule storage unit that stores, in advance, a first connection rule defining the acoustic model sequence corresponding to each word and a second connection rule defining the word sequences that can be output; and a recognition unit that, according to the first connection rule, derives the acoustic models of the registered words from the subword acoustic models and the acoustic model of the unknown word from the unknown word acoustic model, then, according to the second connection rule, derives a plurality of outputtable word sequences from the acoustic models of the registered words and of the unknown word, matches the input speech against the plurality of word sequences to calculate a plurality of matching scores, and outputs the word sequence with a high matching score.

[0016] As a result, the acoustic model of the unknown word has a structure that naturally incurs a penalty, so no penalty value needs to be set; and because it is built from the same subword acoustic models as the acoustic models of the registered words, no garbage model needs to be created. A speech recognition device that is easy to develop and robust to unknown words can thus be realized.

[0017]

[Embodiments of the Invention] The invention according to claim 1 is a speech recognition device comprising: a subword acoustic model storage unit that stores subword acoustic models created in advance; a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing a syllable uttered in isolation by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models; an unknown word acoustic model creation unit that creates an unknown word acoustic model expressing the acoustic features of unknown words (words other than the registered words) as a sequence of the monosyllable acoustic models; a connection rule storage unit that stores, in advance, a first connection rule defining the acoustic model sequence corresponding to each word and a second connection rule defining the word sequences that can be output; and a recognition unit that, according to the first connection rule, derives the acoustic models of the registered words from the subword acoustic models and the acoustic model of the unknown word from the unknown word acoustic model, then, according to the second connection rule, derives a plurality of outputtable word sequences from the acoustic models of the registered words and of the unknown word, matches the input speech against the plurality of word sequences to calculate a plurality of matching scores, and outputs the word sequence with a high matching score. Because the unknown word acoustic model is created as a sequence of monosyllable acoustic models, each expressed by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models, the setting of penalty values and the creation of a garbage model, which in conventional devices required much time and development cost through trial and error, become unnecessary. Development cost is therefore reduced, and speech recognition that is robust to unknown words becomes possible.

[0018] The invention according to claim 2 is a speech recognition device comprising: a subword acoustic model storage unit that stores subword acoustic models created in advance; a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing a syllable uttered in isolation by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models; a monosyllable acoustic model selection unit that selects N acoustic models from the monosyllable acoustic models; an unknown word acoustic model creation unit that creates an unknown word acoustic model expressing the acoustic features of unknown words (words other than the registered words) as a sequence of the selected monosyllable acoustic models; a connection rule storage unit that stores, in advance, a first connection rule defining the acoustic model sequence corresponding to each word and a second connection rule defining the word sequences that can be output; and a recognition unit that, according to the first connection rule, derives the acoustic models of the registered words from the subword acoustic models and the acoustic model of the unknown word from the unknown word acoustic model, then, according to the second connection rule, derives a plurality of outputtable word sequences from the acoustic models of the registered words and of the unknown word, matches the input speech against the plurality of word sequences to calculate a plurality of matching scores, and outputs the word sequence with a high matching score. Because the unknown word acoustic model is created as a sequence of the N selected monosyllable acoustic models, each expressed by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models, the setting of penalty values and the creation of a garbage model, which in conventional devices required much time and development cost through trial and error, become unnecessary. Development cost is therefore reduced, and speech recognition that is robust to unknown words becomes possible.

[0019] Embodiments of the present invention are described below with reference to FIGS. 1 to 13. (Embodiment 1) Embodiment 1 of the present invention is described below with reference to FIGS. 1 to 8.

[0020] FIG. 1 is a block diagram of the speech recognition device according to Embodiment 1 of the present invention. Reference numeral 1 denotes a subword acoustic model storage unit that stores subword acoustic models created in advance; 2 denotes a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing a syllable uttered in isolation by connecting a word-initial subword acoustic model and a word-final subword acoustic model; 4 denotes an unknown word acoustic model creation unit that creates an unknown word acoustic model expressing the acoustic features of unknown words (words other than the registered words) as a sequence of monosyllable acoustic models; 5 denotes a connection rule storage unit that stores connection rules defining the acoustic model sequence corresponding to each word and the word sequences that can be output; and 6 denotes a recognition unit that connects the acoustic models (the subword acoustic models together with the unknown word acoustic model) according to the connection rules, matches them against the input speech 10, and outputs the resulting word sequence 11.

[0021] Embodiment 1 is described using cv/vc units (consonant + vowel / vowel + consonant) as the subwords. To simplify the explanation, the transitions between syllables at the junctions between words are not considered.

[0022] The operation of the speech recognition device is described in detail below with reference to FIGS. 2 to 6. The subword acoustic model storage unit 1 stores subword acoustic models created in advance from training data uttered by many speakers. A subword acoustic model represents a subword by acoustic features: a time series of feature-parameter statistics (mean vectors and covariance matrices) together with the transition probabilities between series.

[0023] FIG. 2 outlines the process of creating the subword acoustic models. Here, # is a virtual phoneme representing the start and end points of speech. In the figure, the polygons depicting the acoustic features of the speech are drawn with time along the horizontal direction and, in the vertical direction, with reference to the power of the speech. In the example of FIG. 2, the content of the training utterance is "12, 78". Rectangular sections delimited by dotted lines represent the parts of a phoneme section that are little affected by the neighboring phonemes, and trapezoidal sections containing diagonal lines represent the transitions between phonemes.

[0024] The training speech data (FIG. 2(a)), which is a sequence of feature parameters, is first labeled (FIG. 2(b)) to define the boundaries at which it is cut into subwords. As shown in FIG. 2(c), within each continuous stretch of speech (the two stretches "12" and "78"), the first subword is a word-initial subword, the last subword is a word-final subword, and the remaining subwords are word-medial subwords. A subword acoustic model is created for each subword type by computing the acoustic features from the segmented subword speech data. The procedure for creating subword acoustic models is the same as in isolated-word speech recognition devices already in practical use, and is well established.
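
The position classes described above (word-initial, word-medial, word-final) can be illustrated with a small sketch. The function name `tag_positions` and the use of plain string labels are assumptions made for illustration; the source does not specify any data format.

```python
def tag_positions(subwords):
    """Tag each subword of one continuous stretch of speech by position.

    Hypothetical helper: the first subword is word-initial, the last is
    word-final, and everything in between is word-medial.
    """
    tagged = []
    last = len(subwords) - 1
    for i, label in enumerate(subwords):
        if i == 0:
            position = "initial"
        elif i == last:
            position = "final"
        else:
            position = "medial"
        tagged.append((label, position))
    return tagged


# Example stretch written in the cv/vc notation used later in the text.
stretch = ["#si", "ig", "go", "ot", "to", "o#"]
for label, position in tag_positions(stretch):
    print(label, position)
```

Grouping the segmented training data by these (label, position) pairs is one way to arrive at a separate model per subword type.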

[0025] The monosyllable acoustic model creation unit 2 takes as input the subword acoustic models read from the subword acoustic model storage unit 1 and outputs monosyllable acoustic models. FIG. 3 shows the monosyllable acoustic models, which represent every Japanese syllable using subword acoustic models. Each syllable is formed by connecting a word-initial subword acoustic model and a word-final subword acoustic model, and thus represents a monosyllable, that is, a syllable pronounced on its own.
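
As a rough illustration of how a monosyllable is assembled from two subwords, the sketch below builds the label pair for one syllable. It assumes syllables are romanized as consonant(s) followed by a single vowel and that "#" marks the start or end of speech, as in the notation used elsewhere in this description; `monosyllable_units` is a hypothetical name.

```python
from typing import List


def monosyllable_units(syllable: str) -> List[str]:
    """Return the word-initial and word-final subword labels of a syllable.

    Assumption: the syllable ends in exactly one vowel character,
    e.g. "si" or "go".
    """
    vowel = syllable[-1]
    return ["#" + syllable, vowel + "#"]


print(monosyllable_units("si"))
print(monosyllable_units("go"))
```

For "si" this gives the pair `#si`, `i#`, which is the form the unknown-word path takes for that syllable in the worked example of FIG. 7.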

[0026] The unknown word acoustic model creation unit 4 takes as input the monosyllable acoustic models output by the monosyllable acoustic model creation unit 2 and outputs the unknown word acoustic model. FIG. 4 shows the unknown word acoustic model as a network. The unknown word acoustic model represents the acoustic features of an unknown word as a sequence of monosyllable acoustic models. Because the pronunciation of an unknown word generally cannot be predicted in advance, an unknown word is represented as an arbitrary sequence drawn from all Japanese syllables. In the figure, arrows marked φ denote null transitions, which can be traversed in zero time. The unknown word model branches from the start node (401) to each syllable, passes through the corresponding monosyllable acoustic model, and merges at the end node (402). The bottom transition in the figure (403) is a null transition returning from the end node to the start node, so the unknown word acoustic model represents a succession of monosyllables.
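
The loop structure of this network can be sketched as a small arc list. This is a simplified illustration, not the patent's implementation: the null arcs that fan out from the start node and merge into the end node are collapsed into one arc per syllable, and the names `build_unknown_word_network` and `accepts` are hypothetical.

```python
START, END = 401, 402  # node numbers borrowed from FIG. 4
NULL = None            # label of a null transition (takes zero time)


def build_unknown_word_network(syllables):
    """One arc per monosyllable model, plus the null loop back (403)."""
    arcs = [(START, END, s) for s in syllables]
    arcs.append((END, START, NULL))
    return arcs


def accepts(arcs, sequence):
    """Return True if the network can emit `sequence` from START to END."""
    node = START
    for i, syllable in enumerate(sequence):
        if i > 0:
            # Return to the start node through the null transition.
            if (END, START, NULL) not in arcs:
                return False
            node = START
        if (node, END, syllable) not in arcs:
            return False
        node = END
    return node == END


network = build_unknown_word_network(["si", "go", "to", "de", "su"])
print(accepts(network, ["si", "go", "to"]))
```

Because of the loop, any non-empty sequence of syllables from the inventory is accepted, which is what lets the model stand in for a word whose pronunciation is not known in advance.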

[0027] Because every Japanese word can be written as a syllable string, this unknown word acoustic model gives a reasonably high matching score to any word; for an unknown-word section of the input, its score can be expected to exceed the matching scores of the registered words. Conversely, because it does not represent the transitions between syllables, for a registered-word section of the input its score can be expected to be lower than the matching score of the registered word.

[0028] The connection rule storage unit 5 stores connection rules created in advance. FIG. 5 shows an example. The connection rules consist of two rules: a rule for the acoustic model sequence corresponding to each word (connection rule 1) and a rule for the word sequences that can be output (connection rule 2). As described in connection rule 1 (FIG. 5(a)), the acoustic model of a registered word is expressed as a subword sequence. For example, the word "kankō" ("sightseeing") is expressed by the subword sequence {#ka, aN, Nk, ko, oo, o#}, and the acoustic model of the registered word is obtained by connecting the acoustic models of the corresponding subwords. Because each subword acoustic model represents phonemes and the transitions between phonemes, the acoustic model of a registered word represents every phoneme and every transition between phonemes in the word, in other words every syllable and every transition between syllables. The acoustic model corresponding to an unknown word, on the other hand, is the unknown word acoustic model.
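
The subword sequence in the example above follows a regular pattern, which the sketch below reproduces. It is an illustration under assumptions: the word is given as a list of phoneme symbols that begins with a consonant followed by a vowel, and `word_to_subwords` is a hypothetical helper, not something named in the source.

```python
def word_to_subwords(phonemes):
    """Convert a phoneme list into a cv/vc subword sequence with # boundaries.

    The first unit is "#" + the first two phonemes, each following unit is a
    pair of adjacent phonemes, and the last unit is the final phoneme + "#".
    """
    if len(phonemes) < 2:
        raise ValueError("need at least a consonant and a vowel")
    units = ["#" + phonemes[0] + phonemes[1]]
    for left, right in zip(phonemes[1:], phonemes[2:]):
        units.append(left + right)
    units.append(phonemes[-1] + "#")
    return units


# "kankō" written phoneme by phoneme, with N for the moraic nasal.
print(word_to_subwords(["k", "a", "N", "k", "o", "o"]))
```

Every adjacent phoneme pair inside the word appears as a unit, which is why a registered word covers all of its internal transitions.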

[0029] As described in connection rule 2 (FIG. 5(b)), the permitted output word sequences include {"kankō" ("sightseeing"), "desu" ("is")}, {"shigoto" ("work"), "desu"}, and {unknown word, "desu"}. Together, connection rules 1 and 2 allow every outputtable word sequence to be expressed as a sequence of acoustic models, and conversely allow a sequence of acoustic models to be converted into the corresponding word sequence.
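
The interplay of the two rules can be sketched with two small tables. The table layout, the `UNKNOWN` placeholder, and the function names are assumptions for illustration; the subword sequence for "desu" is not given in the source and is derived here by the same pattern as the other words.

```python
UNKNOWN = "<unknown>"  # placeholder standing for the unknown word acoustic model

# Connection rule 1: word -> acoustic model (subword) sequence.
RULE1 = {
    "kankou": ["#ka", "aN", "Nk", "ko", "oo", "o#"],
    "shigoto": ["#si", "ig", "go", "ot", "to", "o#"],
    "desu": ["#de", "es", "su", "u#"],  # assumed; not listed in the source
    UNKNOWN: [UNKNOWN],
}

# Connection rule 2: word sequences that may be output.
RULE2 = [
    ["kankou", "desu"],
    ["shigoto", "desu"],
    [UNKNOWN, "desu"],
]


def expand(word_sequence):
    """Turn a word sequence into the corresponding acoustic model sequence."""
    models = []
    for word in word_sequence:
        models.extend(RULE1[word])
    return models


def all_model_sequences():
    """Pair every permitted word sequence with its acoustic model sequence."""
    return [(words, expand(words)) for words in RULE2]


for words, models in all_model_sequences():
    print(words, "->", models)
```

Each word keeps its own "#" boundaries, in line with the simplification that transitions at word junctions are not modeled.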

[0030] The recognition unit 6 takes the speech 10 as input and outputs the word sequence 11. The output word sequence 11 is the result obtained by connecting the acoustic models (the subword acoustic models read from the subword acoustic model storage unit 1 together with the unknown word acoustic model created by the unknown word acoustic model creation unit 4) according to the connection rules read from the connection rule storage unit 5, and matching them against the input speech 10.

[0031] Because the duration of each part of speech varies, it is necessary to determine which frame of the acoustic model each input frame corresponds to. To do so, a similarity is defined between a frame of the input speech and a frame of the acoustic model, and the correspondence that maximizes the sum of the similarities is found. This correspondence is called matching, and the maximum value of the sum of similarities is called the matching score between the input speech and the acoustic model. FIG. 6 illustrates matching. In the figure, the feature parameter sequence of the input speech is placed on the horizontal axis and the acoustic model corresponding to the word sequence {"shigoto", "desu"} on the vertical axis, and the frame correspondence is shown as a polyline. This correspondence can be found by the DP (dynamic programming) method. The other word sequences are matched in the same way, the results are ranked by matching score, and the highest-ranked word sequence is output. Matching against a part expressed as a network, such as the unknown word model, can be carried out by keeping only the candidate with the highest matching score at each merge point and continuing, and can be computed by the One Pass DP method.
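
A minimal sketch of the DP matching step is given below. The source does not state the path constraints, so this sketch assumes the simplest monotonic alignment: the path starts at the first model frame, ends at the last, and for each new input frame either stays on the same model frame or advances by one. The function name `matching_score` and the similarity-table layout are also assumptions.

```python
def matching_score(similarity):
    """Maximum total similarity over monotonic frame alignments.

    similarity[t][j] is the similarity between input frame t and model
    frame j. Returns -inf if the input has fewer frames than the model.
    """
    num_input = len(similarity)
    num_model = len(similarity[0])
    neg_inf = float("-inf")

    # best[j]: best score of any path ending at model frame j so far.
    best = [neg_inf] * num_model
    best[0] = similarity[0][0]

    for t in range(1, num_input):
        new = [neg_inf] * num_model
        for j in range(num_model):
            prev = best[j]                    # stay on model frame j
            if j > 0 and best[j - 1] > prev:  # or advance from frame j - 1
                prev = best[j - 1]
            if prev > neg_inf:
                new[j] = prev + similarity[t][j]
        best = new

    return best[num_model - 1]


# Three input frames against a two-frame model.
table = [[3, 0], [2, 1], [0, 4]]
print(matching_score(table))
```

Keeping only the better of the two predecessors at each cell is the same "keep the best candidate at a merge point" idea the text describes for network-shaped models.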

[0032] To verify the operation of a recognition device that handles unknown words, two cases must be examined: input that contains no unknown word and input that contains an unknown word. These two cases are described with reference to FIGS. 7 and 8.

[0033] FIG. 7 shows an example of an input containing no unknown word and its recognition result. The input speech is "shigoto desu." ("It is work."), consisting of the two words "shigoto" and "desu". As shown in FIG. 5, "shigoto" and "desu" are registered words and are represented by subword sequences. The word sequences {"shigoto", "desu"} and {unknown word, "desu"} are both permitted. The first-ranked result was the word sequence {"shigoto", "desu"} with a matching score of 0.9. The second-ranked result was the word sequence {unknown word (shi-go-to), "desu"} with a matching score of 0.8. Viewed as syllable strings, both results are {shi, go, to, de, su}. Viewed as subword strings, however, the parts corresponding to {shi, go, to} differ: the first-ranked result is {#si, ig, go, ot, to, o#}, whereas the second-ranked result is {#si, i#, #go, o#, #to, o#}, which is why the scores differ.
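The difference between the two subword strings can be reproduced with a short Python sketch. It is illustrative only and rests on assumptions not stated in the text: the function names are hypothetical, "#" is taken to mark a word or syllable boundary, and every syllable of a registered word after the first is assumed to be a two-letter consonant + vowel string such as "go".

```python
from typing import List


def registered_word_units(syllables: List[str]) -> List[str]:
    """cv/vc units of a registered word, including syllable-to-syllable transitions."""
    if not syllables:
        raise ValueError("a word needs at least one syllable")
    units = ["#" + syllables[0]]               # word-initial unit
    for prev, nxt in zip(syllables, syllables[1:]):
        units.append(prev[-1] + nxt[0])        # vc unit: vowel of prev + consonant of next
        units.append(nxt)                      # cv unit
    units.append(syllables[-1][-1] + "#")      # word-final unit
    return units


def unknown_word_units(syllables: List[str]) -> List[str]:
    """Units of the same syllables when each is modelled as an isolated monosyllable."""
    units: List[str] = []
    for s in syllables:
        units += ["#" + s, s[-1] + "#"]        # word-initial unit + word-final unit
    return units


print(registered_word_units(["si", "go", "to"]))
print(unknown_word_units(["si", "go", "to"]))
```

The first call yields the units of the registered word, with the transitions ig and ot; the second yields the chain of isolated monosyllables, in which those transitions are replaced by i#, #go and o#, #to.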

[0034] Within the acoustic model sequences, portions with low similarity to the input are shaded in the figure. In the first-ranked result, all transitions between syllables are represented, so no portion has particularly low similarity. In the second-ranked result, the transition between shi and go and the transition between go and to are not represented, so portions of low similarity appear periodically. This has the same effect as the penalty that a speech typewriter applies according to the transitions between syllables. As a result, the registered word was recognized correctly rather than being recognized as an unknown word.

[0035] FIG. 8 shows an example of an input containing an unknown word and its recognition result. The input speech is "shōyō desu." ("It is for commercial use."), consisting of the two words "shōyō" and "desu". As shown in FIG. 5, "desu" is a registered word represented by a subword sequence, whereas "shōyō" is assumed to be an unknown word. The word sequences {"shigoto", "desu"} and {unknown word, "desu"} are permitted. The first-ranked result was the word sequence {unknown word (sho-yo), "desu"} with a matching score of 0.8.

[0036] The second-ranked result was the word sequence {"shigoto", "desu"} with a matching score of 0.6. For the pronunciation of the unknown word "shōyō" in the input, the acoustic model for the monosyllable sequence {sho, yo}, the closest in pronunciation among the monosyllable sequences represented by the unknown word model, was acoustically more similar than the acoustic model for "shigoto", the closest in pronunciation among the registered words. The unknown word section was therefore correctly output as an unknown word.

[0037] The operation examples verified above illustrate the behavior that can be expected in principle. In actual operation, recognition sometimes succeeds and sometimes fails, so performance is probabilistic. A recognition experiment was therefore conducted using the speech recognition device according to Embodiment 1 of the present invention.

[0038] In the experiment, the feature parameters were phoneme similarity vectors, in which the similarity between the LPC cepstrum sequence and the standard pattern of each phoneme is given as a Mahalanobis distance with the covariance matrix shared across all phoneme categories. The subword acoustic models were sequences of mean vectors of the phoneme similarity vectors. The similarity between a frame of the input speech and a frame of the acoustic model was defined as the inner product of the vectors. The evaluation data were 166 sentences uttered by 12 male and 12 female speakers. The recognition targets were 400 sentences including those 166, and the registered vocabulary was 665 words. To examine the effect of handling unknown words, two conditions were evaluated. In the condition without unknown word handling, 67 words (10% of the 665) were selected at random and deleted from the connection rules before recognition. In the condition with unknown word handling, the same 67 words were deleted from the connection rules, and connection rules were registered that permit word sequences in which an unknown word is inserted in place of a deleted word.
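As a rough illustration of the feature computation, the following Python sketch builds a phoneme similarity vector from one cepstrum frame. Several points are assumptions rather than statements from the text: the function names are hypothetical, the inverse of the shared covariance matrix is taken as a given input, and the similarity to each phoneme is taken to be the negative squared Mahalanobis distance (the text only says that the similarity is given as a Mahalanobis distance with a shared covariance matrix).

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def mahalanobis_sq(x: Vector, mean: Vector, inv_cov: Matrix) -> float:
    """Squared Mahalanobis distance (x - mean)^T inv_cov (x - mean)."""
    d = [a - b for a, b in zip(x, mean)]
    return sum(d[i] * inv_cov[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))


def phoneme_similarity_vector(x: Vector, phoneme_means: List[Vector],
                              inv_cov: Matrix) -> List[float]:
    """One similarity per phoneme; the same inv_cov is shared by all phonemes.

    Assumption: similarity = negative squared distance, so a closer
    standard pattern gives a larger value.
    """
    return [-mahalanobis_sq(x, mean, inv_cov) for mean in phoneme_means]


def frame_similarity(u: Vector, v: Vector) -> float:
    """Inner product of two phoneme similarity vectors."""
    return sum(a * b for a, b in zip(u, v))
```

Under these assumptions, a subword acoustic model would be a list of mean phoneme similarity vectors, and `frame_similarity` would be evaluated between each input vector and each model vector during matching.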

[0039] The evaluation results for Embodiment 1 are shown in Table 1. The figures in the table are word detection rates: each word in the input other than an unknown word is counted as detected if it is contained in the first-ranked word sequence and as not detected otherwise, and the proportion of detected words is calculated.
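The word detection rate defined above can be computed as in the following Python sketch. It is a minimal illustration under stated assumptions: the function name and data layout are hypothetical, and "contained in the first-ranked word sequence" is interpreted as simple membership, ignoring word position and repeated words.

```python
from typing import Iterable, List, Set


def word_detection_rate(references: List[List[str]],
                        first_ranked: List[List[str]],
                        unknown_words: Set[str]) -> float:
    """Fraction of known reference words found in the first-ranked result.

    references[i]   : words actually spoken in utterance i
    first_ranked[i] : first-ranked word sequence output for utterance i
    unknown_words   : words removed from the vocabulary (excluded from counting)
    """
    detected = 0
    total = 0
    for ref, hyp in zip(references, first_ranked):
        hyp_words = set(hyp)
        for word in ref:
            if word in unknown_words:
                continue              # unknown words are not counted
            total += 1
            if word in hyp_words:
                detected += 1
    return detected / total if total else 0.0


refs = [["shigoto", "desu"], ["shoyo", "desu"]]
hyps = [["shigoto", "desu"], ["<unknown>", "desu"]]
print(word_detection_rate(refs, hyps, {"shoyo"}))
```

In this toy example all three known words are found in the first-ranked results, so the rate is 1.0; the utterances and labels are invented for illustration and are not the experimental data.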

[0040]

[Table 1]

[0041] As Table 1 shows, handling unknown words slightly lowers the word detection rate for inputs that contain no unknown word, but greatly improves it for inputs that contain unknown words. The overall word detection rate improved from 76.6% to 84.4%, confirming the effect of the present invention.

[0042] As described above, according to Embodiment 1 of the present invention, neither the setting of penalty values nor the creation of a garbage model is required, so a speech recognition device that is robust against unknown words can be realized without incurring development cost.

[0043] (Embodiment 2) Embodiment 2 of the present invention is described below with reference to FIGS. 9 to 13.

[0044] FIG. 9 is a block diagram of the speech recognition device according to Embodiment 2 of the present invention. Reference numeral 1 denotes a subword acoustic model storage unit that stores subword acoustic models created in advance; 2 denotes a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing an isolated uttered syllable by connecting a word-initial subword acoustic model and a word-final subword acoustic model; 3 denotes a monosyllable acoustic model selection unit that selects N acoustic models from the monosyllable acoustic models; 4 denotes an unknown word acoustic model creation unit that creates an unknown word acoustic model representing the acoustic features of unknown words, that is, words other than the registered words, as a sequence of the selected monosyllable acoustic models; 5 denotes a connection rule storage unit that stores connection rules defining the acoustic model sequence corresponding to each word and the word sequences that can be output; and 6 denotes a recognition unit that connects the acoustic models, comprising the subword acoustic models and the unknown word acoustic model, according to the connection rules, matches them against the input speech 10, and outputs the resulting word sequence 11.

[0045] In Embodiment 2 of the present invention, the case where cv/vc units (consonant + vowel / vowel + consonant) are used as subwords is described as an example. For simplicity, the transitions between syllables are not considered at the junctions between words.

[0046] The operation of the speech recognition device is described below with reference to FIGS. 10 and 11. The operations of the subword acoustic model storage unit 1, the monosyllable acoustic model creation unit 2, the connection rule storage unit 5, and the recognition unit 6 are the same as in Embodiment 1 of the present invention and are therefore not described again.

[0047] The monosyllable acoustic model selection unit 3 selects N monosyllable acoustic models from those created by the monosyllable acoustic model creation unit 2 and outputs them. FIG. 10 shows an example of the monosyllable acoustic models selected by the monosyllable acoustic model selection unit 3. Six monosyllabic vowels (including the moraic nasal), namely the vowels and the moraic nasal, were selected from the monosyllable acoustic models.
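The selection step can be pictured as a simple filter over the full set of monosyllable models. The Python sketch below is illustrative only: the function name, the dictionary layout, and the romanized labels (with "N" standing for the moraic nasal) are assumptions, not notation taken from the patent.

```python
from typing import Dict, Iterable, TypeVar

Model = TypeVar("Model")

# Assumed labels for the six selected units: five vowels plus the moraic nasal.
VOWEL_LIKE = ("a", "i", "u", "e", "o", "N")


def select_monosyllable_models(models: Dict[str, Model],
                               labels: Iterable[str] = VOWEL_LIKE) -> Dict[str, Model]:
    """Keep only the monosyllable models whose label is in `labels`."""
    return {label: models[label] for label in labels if label in models}
```

With this view, Embodiment 1 corresponds to passing every monosyllable model on unchanged, while Embodiment 2 passes on only the N = 6 selected models.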

[0048] The unknown word acoustic model creation unit 4 takes as input the N monosyllable acoustic models selected by the monosyllable acoustic model selection unit 3 and outputs an unknown word acoustic model. FIG. 11 shows an example of the unknown word acoustic model as a network. The unknown word acoustic model represents the acoustic features of unknown words as a sequence of the selected monosyllable acoustic models. Since the pronunciation of an unknown word generally cannot be predicted in advance, an unknown word is represented as an arbitrary sequence of all the Japanese monosyllabic vowels. In the figure, arrows marked φ denote null transitions, which can be taken in zero time. The unknown word model branches from the start node (110) to each monosyllabic vowel, passes through the corresponding monosyllable acoustic model, and merges at the end node (111). The bottom transition in the figure (112) is a null transition that returns from the end node to the start node, so this unknown word acoustic model represents a succession of monosyllabic vowels.
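Matching against a looped network of this kind can be sketched as a one-pass DP in Python. This is a simplified illustration, not the patent's implementation: the function names are hypothetical, each monosyllable model is assumed to be a plain list of frame vectors, the frame similarity is assumed to be the inner product, and within a model the path is assumed to stay on a frame or advance by one. The null transition from the end node back to the start node is modelled by letting the first frame of every model continue from the best path that has just finished any model.

```python
from typing import List, Sequence

Frame = Sequence[float]


def dot(x: Frame, m: Frame) -> float:
    """Inner product used as the frame similarity (assumption)."""
    return sum(a * b for a, b in zip(x, m))


def one_pass_loop_score(inputs: List[Frame], models: List[List[Frame]]) -> float:
    """Best total similarity of `inputs` against any chain of `models`."""
    if not inputs or not models or any(not m for m in models):
        raise ValueError("inputs and every model must be non-empty")
    neg_inf = float("-inf")
    # score[k][j]: best path that puts the current input frame on frame j of model k
    score = [[neg_inf] * len(m) for m in models]
    for k, m in enumerate(models):
        score[k][0] = dot(inputs[0], m[0])          # start node -> first frame of each model
    for x in inputs[1:]:
        # Confluence at the end node: keep only the best finished candidate,
        # which may re-enter any model through the null transition.
        loop_entry = max(s[-1] for s in score)
        new = [[neg_inf] * len(m) for m in models]
        for k, m in enumerate(models):
            for j, frame in enumerate(m):
                prev = score[k][j]                  # stay on the same model frame
                prev = max(prev, score[k][j - 1] if j > 0 else loop_entry)
                if prev > neg_inf:
                    new[k][j] = prev + dot(x, frame)
        score = new
    return max(s[-1] for s in score)                # must finish at the end of some model
```

Keeping a single `loop_entry` value per input frame is what makes the computation one pass: the number of monosyllables in the unknown word never has to be enumerated.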

[0049] Every Japanese word can be expressed as a syllable string, and every syllable contains a vowel. This unknown word acoustic model can therefore give a reasonably high matching score to any word, and for an unknown word section of the input it can be expected to score higher than the registered words. On the other hand, because it represents neither the transitions between syllables nor the consonant portions, for a registered word section of the input it can be expected to score lower than the registered word.

[0050] To verify the operation of a recognition device that handles unknown words, two cases must be examined: input that contains no unknown word and input that contains an unknown word. These two cases are described with reference to FIGS. 12 and 13.

[0051] FIG. 12 shows an example of input speech containing no unknown word and its recognition result. The input speech is "shigoto desu." ("It is work."), consisting of the two words "shigoto" and "desu". As shown in the connection rules of FIG. 5, "shigoto" and "desu" are registered words and are represented by subword sequences. The word sequences {"shigoto", "desu"} and {unknown word, "desu"} are both permitted. The recognition result is shown in FIG. 12. The first-ranked result was the word sequence {"shigoto", "desu"} with a matching score of 0.9. The second-ranked result was the word sequence {unknown word (i-o-o), "desu"} with a matching score of 0.7. In terms of subword sequences, the parts corresponding to "shigoto" in the input differ: the first-ranked result is {#si, ig, go, ot, to, o#}, whereas the second-ranked result is {#i, i#, #o, o#, #o, o#}, which is why the scores differ. Within the acoustic models, portions with low similarity to the input are shaded in the figure.

[0052] In the first-ranked result, all consonant portions and all transitions between syllables are represented, so no portion has particularly low similarity. In the second-ranked result, neither the consonant portions nor the transitions between syllables are represented, so portions of low similarity appear periodically. This has the same effect as the penalty that a speech typewriter applies according to the transitions between syllables. As a result, the registered word was recognized correctly rather than being recognized as an unknown word.

[0053] FIG. 13 shows an example of an input containing an unknown word and its recognition result. The input speech is "shōyō desu." ("It is for commercial use."), consisting of the two words "shōyō" and "desu". As shown in the connection rules of FIG. 5, "desu" is a registered word represented by a subword sequence, whereas "shōyō" is assumed to be an unknown word. The word sequences {"shigoto", "desu"} and {unknown word, "desu"} are permitted. The recognition result is shown in FIG. 13. The first-ranked result was the word sequence {unknown word (o-o), "desu"} with a matching score of 0.7. The second-ranked result was the word sequence {"shigoto", "desu"} with a matching score of 0.6.

[0054] For the pronunciation of the unknown word "shōyō" in the input, the acoustic model for the monosyllabic vowel sequence {o, o}, the closest in pronunciation among the monosyllabic vowel sequences represented by the unknown word model, was acoustically more similar than the acoustic model for "shigoto", the closest in pronunciation among the registered words. The unknown word section was therefore correctly output as an unknown word.

[0055] The operation examples verified above illustrate the behavior that can be expected in principle. In actual operation, recognition sometimes succeeds and sometimes fails, so performance is probabilistic. A recognition experiment was therefore conducted using the speech recognition device according to Embodiment 2 of the present invention.

[0056] In the experiment, the feature parameters were phoneme similarity vectors, in which the similarity between the LPC cepstrum sequence and the standard pattern of each phoneme is given as a Mahalanobis distance with the covariance matrix shared across all phoneme categories. The subword acoustic models were sequences of mean vectors of the phoneme similarity vectors. The similarity between a frame of the input and a frame of the acoustic model was defined as the inner product of the vectors. The evaluation data were 166 sentences uttered by 12 male and 12 female speakers. The recognition targets were 400 sentences including those 166, and the registered vocabulary was 665 words. To examine the effect of handling unknown words, two conditions were evaluated. In the condition without unknown word handling, 67 words (10% of the 665) were selected at random and deleted from the connection rules before recognition. In the condition with unknown word handling, the same 67 words were deleted from the connection rules, and connection rules were registered that permit word sequences in which an unknown word is inserted in place of a deleted word.

[0057] The results for Embodiment 2 are shown in Table 2. The figures in the table are word detection rates: each word in the input other than an unknown word is counted as detected if it is contained in the first-ranked word sequence and as not detected otherwise, and the proportion of detected words is calculated.

[0058]

[Table 2]

[0059] As Table 2 shows, handling unknown words causes almost no drop in the word detection rate for inputs that contain no unknown word, and greatly improves it for inputs that contain unknown words. The overall word detection rate improved from 76.6% to 93.8%, confirming the effect of the present invention.

[0060] As described above, according to Embodiment 2 of the present invention, neither the setting of penalty values nor the creation of a garbage model is required, so development cost can be reduced and a speech recognition device that is robust against unknown words can be realized.

[0061]

[Effect of the Invention] As described above, according to the present invention, the unknown word acoustic model is created as a sequence of monosyllable acoustic models, each formed by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models. This removes the need to set penalty values and to create a garbage model, both of which required much time and development cost through trial and error in conventional devices. Development cost can therefore be reduced, and an excellent speech recognition device that is robust against unknown words can be realized.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing a speech recognition device according to Embodiment 1 of the present invention.

FIG. 2 (a) is a diagram showing the training speech data used in the subword acoustic model creation process of the speech recognition device according to Embodiment 1 of the present invention. (b) is a diagram showing the labeling in the subword acoustic model creation process of the speech recognition device according to Embodiment 1 of the present invention. (c) is a diagram showing the subword acoustic models in the subword acoustic model creation process of the speech recognition device according to Embodiment 1 of the present invention.

FIG. 3 is a diagram showing an example of the monosyllable acoustic models of the speech recognition device according to Embodiment 1 of the present invention.

FIG. 4 is a diagram showing an example of the unknown word acoustic model of the speech recognition device according to Embodiment 1 of the present invention.

FIG. 5 (a) is a diagram showing an example of connection rule 1 of the speech recognition device according to Embodiment 1 of the present invention. (b) is a diagram showing an example of connection rule 2 of the speech recognition device according to Embodiment 1 of the present invention.

FIG. 6 is a diagram showing an outline of the matching process between the input and the acoustic models in the speech recognition device according to Embodiment 1 of the present invention.

FIG. 7 is a diagram showing an example of an input containing no unknown word and its recognition result for the speech recognition device according to Embodiment 1 of the present invention.

FIG. 8 is a diagram showing an example of an input containing an unknown word and its recognition result for the speech recognition device according to Embodiment 1 of the present invention.

FIG. 9 is a block diagram showing a speech recognition device according to Embodiment 2 of the present invention.

FIG. 10 is a diagram showing an example of the selected monosyllable acoustic models of the speech recognition device according to Embodiment 2 of the present invention.

FIG. 11 is a diagram showing an example of the unknown word acoustic model of the speech recognition device according to Embodiment 2 of the present invention.

FIG. 12 is a diagram showing an example of an input containing no unknown word and its recognition result for the speech recognition device according to Embodiment 2 of the present invention.

FIG. 13 is a diagram showing an example of an input containing an unknown word and its recognition result for the speech recognition device according to Embodiment 2 of the present invention.

FIG. 14 is a block diagram showing a speech recognition device according to Conventional Example 1.

FIG. 15 is a block diagram showing a speech recognition device according to Conventional Example 2.

[Explanation of Reference Numerals]

1 Subword acoustic model storage unit
2 Monosyllable acoustic model creation unit
3 Monosyllable acoustic model selection unit
4 Unknown word acoustic model creation unit
5 Connection rule storage unit
6 Recognition unit
7 Typewriter acoustic model creation unit
8 Typewriter penalty value storage unit
9 Garbage model storage unit
10 Speech
11 Word sequence

Continuation of front page

(56) References
Yu Takano, Kenichi Iso, Takao Watanabe, "Word spotting for word recognition based on demisyllable units," Proceedings of the 1996 Spring Meeting of the Acoustical Society of Japan, Japan, March 1996, 3-5-2, pp. 111-112
Mitsuru Endo, Tatsuro Ito, Masakatsu Hoshimi, Katsuyuki Niyata, "A study of methods for processing unknown words," Proceedings of the 1996 Autumn Meeting of the Acoustical Society of Japan, Japan, September 1996, 2-Q-24, pp. 177-178
Shinichi Tanaka, Yasuyuki Masai, Hiroshi Matsuura, Tsuneo Nitta, "A study of word-initial and word-final models suitable for word spotting," Proceedings of the 1996 Autumn Meeting of the Acoustical Society of Japan, Japan, September 1996, 1-3-17, pp. 33-34
Toshiyuki Hanazawa, Kunio Nakajima, "A study on improving an unknown word detection method using a speech typewriter," Proceedings of the 1992 Autumn Meeting of the Acoustical Society of Japan, Japan, October 1992, 2-Q-24, pp. 219-220

(58) Fields searched (Int. Cl.7, DB name)
G10L 15/06
G10L 15/20
JICST file (JOIS)

Claims

(57) [Claims]

[Claim 1] A speech recognition device comprising: a subword acoustic model storage unit that stores subword acoustic models created in advance; a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing an isolated uttered syllable by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models; an unknown word acoustic model creation unit that creates an unknown word acoustic model representing the acoustic features of an unknown word, which is a word other than the registered words, as a sequence of the monosyllable acoustic models; a connection rule storage unit that stores a first connection rule defining in advance the acoustic model sequence corresponding to each word and a second connection rule defining the word sequences that can be output; and a recognition unit that, in accordance with the first connection rule, derives acoustic models of the registered words from the subword acoustic models and an acoustic model of the unknown word from the unknown word acoustic model, that further, in accordance with the second connection rule, derives a plurality of outputtable word sequences from the acoustic models of the registered words and the acoustic model of the unknown word, that matches the input speech against the plurality of word sequences to calculate a plurality of matching scores, and that outputs the word sequence having a high matching score.

[Claim 2] A speech recognition device comprising: a subword acoustic model storage unit that stores subword acoustic models created in advance; a monosyllable acoustic model creation unit that creates monosyllable acoustic models, each representing an isolated uttered syllable by connecting a word-initial subword acoustic model and a word-final subword acoustic model taken from the subword acoustic models; a monosyllable acoustic model selection unit that selects N acoustic models from the monosyllable acoustic models; an unknown word acoustic model creation unit that creates an unknown word acoustic model representing the acoustic features of an unknown word, which is a word other than the registered words, as a sequence of the selected monosyllable acoustic models; a connection rule storage unit that stores a first connection rule defining in advance the acoustic model sequence corresponding to each word and a second connection rule defining the word sequences that can be output; and a recognition unit that, in accordance with the first connection rule, derives acoustic models of the registered words from the subword acoustic models and an acoustic model of the unknown word from the unknown word acoustic model, that further, in accordance with the second connection rule, derives a plurality of outputtable word sequences from the acoustic models of the registered words and the acoustic model of the unknown word, that matches the input speech against the plurality of word sequences to calculate a plurality of matching scores, and that outputs the word sequence having a high matching score.

[Claim 3] The speech recognition device according to claim 1 or claim 2, wherein, when the subword acoustic models are created, a time series of mean vectors and covariance matrices, which are statistics of the feature parameters, or transition probabilities between the series, are used as the acoustic feature quantities.

[Claim 4] The speech recognition device according to claim 1 or claim 2, wherein the matching in the recognition unit obtains, as the matching score, the maximum value of the sum of the similarities between the frames of the input speech and the frames of the plurality of word sequences, and a word sequence having a high matching score is output.

[Claim 5] The speech recognition device according to claim 2, wherein the monosyllable acoustic model selection unit selects, from the monosyllable acoustic models, six monosyllabic vowels (including the moraic nasal), namely the vowels and the moraic nasal.

[Claim 6] The speech recognition device according to any one of claims 1 to 5, wherein cv/vc units (consonant + vowel / vowel + consonant) are used as the subwords.