JP2015075706A

JP2015075706A - Error correction model learning device and program

Info

Publication number: JP2015075706A
Application number: JP2013213106A
Authority: JP
Inventors: 彰夫小林; Akio Kobayashi
Original assignee: Nippon Hoso Kyokai NHK; Japan Broadcasting Corp
Current assignee: Japan Broadcasting Corp
Priority date: 2013-10-10
Filing date: 2013-10-10
Publication date: 2015-04-20
Anticipated expiration: 2033-10-10
Also published as: JP6222821B2

Abstract

PROBLEM TO BE SOLVED: To discriminatively and robustly learn the model parameters of an error correction model by using features such as long context and topic. SOLUTION: A language model learning unit 23 learns, from static text data, a language model that calculates the connection probability of the following word with a recurrent neural network whose inputs are a word in the current utterance, a topic feature extracted from the words in the preceding utterance, and the output of the hidden layer. An alignment unit 32 aligns a correct word sequence to speech data and calculates the hidden-layer output of the recurrent neural network for each word in the correct word sequence. A speech recognition unit 33 performs speech recognition on the speech data and calculates the hidden-layer output for each word of the speech recognition result. An error correction model learning unit 35 statistically learns an error correction model based on the linguistic features and hidden-layer outputs of the words composing the aligned correct word sequence and the linguistic features and hidden-layer outputs of the words composing the speech recognition result.

Description

The present invention relates to an error correction model learning device and a program.

For error correction in speech recognition, there is a technique that statistically learns the error tendencies of speech recognition from speech and its transcriptions (correct sentences) using linguistic features, and improves speech recognition performance with the statistical error correction model obtained by this learning (see, for example, Non-Patent Document 1).

Non-Patent Document 1: Kobayashi et al., "News speech recognition by discriminative scoring based on word error minimization", IEICE Transactions D, IEICE, 2010, vol. J93-D, no. 5, pp. 598-609

An error correction model used in speech recognition takes relatively short contexts (two- or three-word chains) as features in order to learn the error tendencies of speech recognition. In addition, the model parameters of the error correction model are learned not from static text such as news manuscripts or web text, but from speech, its transcriptions, and speech recognition results. For this reason, it is difficult to collect a large amount of training data, and using long contexts is difficult from the viewpoint of the statistical robustness of the model.

However, in natural language, including spoken language, the occurrence of a word depends not only on the context formed by the immediately preceding words but also on factors such as longer context and topic. To learn an error correction model with a high capability for correcting speech recognition errors, it is necessary to use not only simple contexts as before, but also contexts composed of a larger number of words and information such as topics.

The present invention has been made in view of these circumstances, and provides an error correction model learning device and a program that learn the model parameters of an error correction model discriminatively and robustly using features such as long context and topic.

[1] One aspect of the present invention is an error correction model learning device comprising: a language resource storage unit that stores text data of documents; a language model learning unit that learns a language model which calculates the connection probability of the word following a given word, using as inputs to a recurrent neural network the word in a sentence of the text data stored in the language resource storage unit, a topic feature extracted from the sentences preceding that sentence in the text data, and the hidden-layer output of the recurrent neural network calculated for the word preceding the given word; a speech resource storage unit that stores speech data in association with correct word sequences; an alignment unit that aligns the correct word sequence to the speech data stored in the speech resource storage unit and calculates the hidden-layer output of the recurrent neural network obtained when each word constituting the aligned correct word sequence is input to the language model learned by the language model learning unit; a speech recognition unit that performs speech recognition on the speech data stored in the speech resource storage unit and calculates the hidden-layer output of the recurrent neural network obtained when each word constituting the resulting speech recognition result is input to the language model learned by the language model learning unit; a feature definition unit that extracts linguistic features from the words included in the aligned correct word sequence and the words included in the speech recognition result; and an error correction model learning unit that learns an error correction model, which corrects speech recognition scores using linguistic features weighted by the hidden-layer output and model parameters, based on the linguistic features of each word constituting the aligned correct word sequence weighted by the hidden-layer output calculated for that word, and the linguistic features of each word constituting the speech recognition result weighted by the hidden-layer output calculated for that word.
According to this invention, the error correction model learning device learns a language model that uses a recurrent neural network to calculate the connection probability of the following word, taking as inputs a word in a sentence of static text, a topic feature extracted from the sentences preceding that sentence, and the hidden-layer output calculated for the preceding word. When the device reads the speech data and correct word sequences prepared as training data from the speech resource storage unit, it aligns the correct word sequence to the speech data and, using the learned language model, calculates the hidden-layer output of the recurrent neural network obtained when each word constituting the correct word sequence is input. The device further performs speech recognition on the speech data of the training data and, using the learned language model, calculates the hidden-layer output of the recurrent neural network obtained when each word constituting the speech recognition result is input. The device then learns an error correction model, which corrects speech recognition scores using linguistic features weighted by the hidden-layer output and model parameters, based on the linguistic features of each word constituting the aligned correct word sequence weighted by the hidden-layer output calculated for that word, and the linguistic features of each word constituting the speech recognition result weighted by the hidden-layer output calculated for that word.
As a result, because the error correction model learning device uses for learning a recurrent neural network whose inputs are, in addition to the words included in an utterance, the hidden-layer output for the preceding word and the topic feature obtained from the preceding utterance, it can learn an error correction model that takes into account longer context and topic than before. Moreover, because the device uses text data, which is easy to obtain in large quantities, for part of the learning of the error correction model, it can learn a statistically robust error correction model.

[2] One aspect of the present invention is the error correction model learning device described above, wherein the error correction model learning unit statistically calculates the model parameters so as to maximize an evaluation function defined by the difference between the posterior probability of the correct word sequence and the posterior probability of the speech recognition result when the speech data is given.
According to this invention, the error correction model learning device statistically calculates the model parameters of the error correction model based on an evaluation function defined as the difference between the posterior probability of the correct word sequence and the posterior probability of the speech recognition result when the speech data is given.
As a result, the error correction model learning device can efficiently learn the tendencies of recognition errors so that the posterior probability of the correct word sequence becomes higher, and can generate an error correction model.

[3] One aspect of the present invention is the error correction model learning device described above, wherein the topic feature input to the recurrent neural network together with a word is extracted by a statistical dimensionality reduction method from the occurrence frequencies of the words included in the utterance or sentence preceding the utterance or sentence that contains the word.
According to this invention, the error correction model learning device extracts the topic feature by a statistical dimensionality reduction method from the occurrence frequencies of the words included in the utterance or sentence preceding the utterance or sentence that contains the word input to the recurrent neural network.
As a result, the error correction model learning device extracts the topic feature as a representation in which the word matrix obtained from the utterance or sentence preceding the one containing the input word is compressed into low-dimensional factors, so it can learn a language model that is robust against data sparseness.

[4] One aspect of the present invention is the error correction model learning device described above, wherein the text data stored in the language resource storage unit is text data of news manuscripts or text data on the web.
According to this invention, the error correction model learning device learns the language model from text data of news manuscripts or text data on the web.
As a result, because the error correction model learning device uses text that is available in relatively large quantities for part of the learning of the error correction model, the robustness of the error correction model improves and the problem of data sparseness can also be avoided.

[5] One aspect of the present invention is the error correction model learning device described above, wherein the linguistic features are words or the parts of speech of words, and the error correction model is a formula that corrects the speech recognition score with a score obtained by weighting the values of feature functions based on the linguistic features by the hidden-layer output of the recurrent neural network and the model parameters of the feature functions.
According to this invention, the error correction model learning device extracts words or the parts of speech of words as linguistic features. The device learns the error correction model, which is a formula that corrects the speech recognition score using the values of feature functions based on the linguistic features, the hidden-layer output of the recurrent neural network, and the model parameters of the feature functions, based on the feature function values of the aligned correct word sequence weighted by the hidden-layer output calculated for that correct word sequence, and the feature function values of the speech recognition result weighted by the hidden-layer output calculated for that speech recognition result.
As a result, the error correction model learning device can efficiently learn recognition error tendencies based on words or the parts of speech of words, and can generate an error correction model that corrects recognition errors accurately.

[6] One aspect of the present invention is the error correction model learning device described above, further comprising an input speech recognition unit that performs speech recognition on input speech data using an acoustic model and the language model learned by the language model learning unit, and that uses the error correction model learned by the error correction model learning unit to correct errors in the selection of the speech recognition result obtained from the input speech data and outputs the result.
According to this invention, the error correction model learning device uses the error correction model to select the speech recognition result from among the correct-answer candidates obtained by performing speech recognition on the speech data.
As a result, the error correction model learning device can obtain speech recognition results with a high recognition rate for the input speech data.

[7] One aspect of the present invention is a program for causing a computer to function as an error correction model learning device comprising: language model learning means for learning a language model which calculates the connection probability of the word following a given word, using as inputs to a recurrent neural network the word in a sentence of text data stored in language resource storage means, a topic feature extracted from the sentences preceding that sentence in the text data, and the hidden-layer output of the recurrent neural network calculated for the word preceding the given word; alignment means for aligning, to speech data, the correct word sequence stored in speech resource storage means in association with the speech data, and calculating the hidden-layer output of the recurrent neural network obtained when each word constituting the aligned correct word sequence is input to the language model learned by the language model learning means; speech recognition means for performing speech recognition on the speech data stored in the speech resource storage means and calculating the hidden-layer output of the recurrent neural network obtained when each word constituting the resulting speech recognition result is input to the language model learned by the language model learning means; feature extraction means for extracting linguistic features from the words included in the aligned correct word sequence and the words included in the speech recognition result; and error correction model learning means for learning an error correction model, which corrects speech recognition scores using linguistic features weighted by the hidden-layer output and model parameters, based on the linguistic features of each word constituting the aligned correct word sequence weighted by the hidden-layer output calculated for that word, and the linguistic features of each word constituting the speech recognition result weighted by the hidden-layer output calculated for that word.

According to the present invention, the model parameters of an error correction model can be learned discriminatively and robustly using features such as long context and topic.

FIG. 1 is a diagram showing a neural network according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the error correction model learning device according to the embodiment.
FIG. 3 is a diagram showing the overall processing flow of the error correction model learning device according to the embodiment.
FIG. 4 is a diagram showing the processing flow of the error correction model learning process executed by the error correction model learning unit according to the embodiment.
FIG. 5 is a diagram showing the expansion of word hypotheses in speech recognition according to the embodiment.
FIG. 6 is a diagram showing the data structure of the extended node data according to the embodiment.
FIG. 7 is a diagram showing a neural network.
FIG. 8 is a diagram showing an extended recurrent neural network.
FIG. 9 is a diagram showing the relationship of the features in the extended recurrent neural network shown in FIG. 8.
FIG. 10 is a diagram showing the unfolding of the extended recurrent network shown in FIG. 8 into a feedforward neural network.
FIG. 11 is a diagram showing the data structure of node data in conventional speech recognition.

[1. Overview of this embodiment]
So-called error correction models, which reflect the error tendencies of speech recognition, have already been proposed. The model parameters of an error correction model are estimated from training data consisting of speech recognition results and correct word sequences. In actual speech recognition, however, the training data and the recognition target rarely match completely in topic. Even if speech recognition is performed with an error correction model trained on data whose topics do not fully match, the resulting performance is not necessarily optimal for the content of the utterances being recognized. In addition, error correction models usually use features based on relatively short word sequences (contexts) of about two- or three-word chains, whereas in natural language, including spoken language, the dependencies between words in a sentence are considered to rest on longer contexts.

To achieve high speech recognition performance, it is necessary to learn an error correction model that matches the topic and takes into account long contextual dependencies of three-word chains or more. However, training an error correction model requires a large amount of speech data and the correct word sequences that transcribe it, so robustly estimating a model that uses long-distance context and topic has been difficult because of the cost of data collection.

This embodiment therefore realizes an error correction model that uses both topic and long-distance context. The embodiment has two distinguishing features: first, part of the estimation of the model parameters of the error correction model is performed from static text such as news manuscripts and web text; second, the error correction model that is learned reflects the features of topic and long-distance context. Using text that is available in relatively large quantities for part of the estimation of the model parameters can be expected to improve the robustness of the model and also to avoid the problem of data sparseness. This embodiment also describes how to apply an error correction model that uses topic and long-distance context to speech recognition.

[2. Error correction model learning algorithm]
The error correction model learning device of this embodiment learns the parameters of a statistical error correction model, which corrects speech recognition errors, discriminatively and robustly using features such as long context and topic, and applies the model to speech recognition. This adapts the error correction model to the content of the utterances and improves speech recognition performance. The learning algorithm of the error correction model applied to the error correction model learning device of this embodiment is described below.

[2.1 Error correction model of the conventional method]
According to Bayes' theorem, when a speech input x is given, the most likely word sequence ŵ for this speech input x is obtained by the following equation (1).
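Equation (1) is not reproduced in this text. Judging from the definitions of P(w|x), P(x|w), and P(w) in the surrounding paragraphs, it presumably takes the standard Bayes decision form sketched below; the intermediate step with P(x) is added here for illustration and is dropped because P(x) does not depend on w.

```latex
\begin{align}
\hat{w} &= \operatorname*{argmax}_{w} P(w \mid x)
         = \operatorname*{argmax}_{w} \frac{P(x \mid w)\, P(w)}{P(x)} \notag \\
        &= \operatorname*{argmax}_{w} P(x \mid w)\, P(w) \tag{1}
\end{align}
```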

The speech input x and the word sequence w each correspond, for example, to one utterance, and P(w|x) is the posterior probability of obtaining the word sequence (sentence hypothesis) w when the speech input x is given.
P(x|w) is the likelihood expressing how acoustically plausible the speech input is for the word sequence w. The acoustic score, defined as the log-likelihood, is calculated with a statistical acoustic model (hereinafter, "acoustic model") typified by hidden Markov models (HMM) and Gaussian mixture models (GMM). In other words, this score represents the plausibility of each of multiple correct-answer candidate words given the acoustic features.

P(w), on the other hand, is the linguistic generation probability of the word sequence w. The language score, defined as the log generation probability, is calculated with a statistical language model (hereinafter, "language model") such as a word n-gram model. In other words, this score represents the linguistic plausibility of each of multiple correct-answer candidate word sequences given the word sequence before the word to be recognized, the word sequence after it, or both. A word n-gram model is a model that, based on the statistics of N-word chains (N is, for example, 1, 2, or 3), gives the occurrence probability of the next word from a history of (N-1) words.
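As an illustration of the word n-gram model described above, the following minimal sketch estimates a bigram model (N = 2) by maximum likelihood, so the history is a single word. The function names and the toy corpus are hypothetical, and no smoothing is applied, which a practical language model would need.

```python
from collections import Counter


def train_bigram(sentences):
    """Count one-word histories and two-word chains over tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        for prev, nxt in zip(words, words[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    return history_counts, bigram_counts


def bigram_prob(prev, nxt, history_counts, bigram_counts):
    """Maximum-likelihood P(nxt | prev); returns 0.0 for an unseen history."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, nxt)] / history_counts[prev]


# Toy corpus, made up for illustration only.
corpus = [
    ["the", "news", "is", "good"],
    ["the", "news", "is", "late"],
    ["the", "weather", "is", "good"],
]
hist, bi = train_bigram(corpus)
p_news_given_the = bigram_prob("the", "news", hist, bi)
```

Here "the" occurs three times as a history and is followed by "news" twice, so `p_news_given_the` is 2/3.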

When P(x|w)P(w) in equation (1) is at its maximum, its logarithm is also at its maximum. In speech recognition, therefore, based on the Bayes' theorem of equation (1), the evaluation function D(w|x) of a word sequence w, which is a sentence hypothesis (correct-answer candidate) for a given speech input x, is defined as in the following equation (2).
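Equation (2) is not reproduced in this text. Taking the logarithm of P(x|w)P(w) and applying the language-score weight κ that the text introduces for this evaluation function, it presumably has the following form; the exact placement of κ is an assumption based on its description as the weight of the language score relative to the acoustic score.

```latex
\begin{equation}
D(w \mid x) = \log P(x \mid w) + \kappa \log P(w) \tag{2}
\end{equation}
```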

With equation (2) defined, as shown in the following equation (3), the word sequence ŵ that maximizes the evaluation function D(w|x) of equation (2) is selected, from the set of correct-answer candidate word sequences w for the speech input x, as the speech recognition result for x. Here, κ is the weight of the language score relative to the acoustic score.
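The selection step can be sketched as follows, assuming the evaluation function has the form D(w|x) = log P(x|w) + κ log P(w). The function names, the candidate list, and all scores are hypothetical values chosen for illustration, not output from a real recognizer.

```python
def evaluation_score(acoustic_logprob, language_logprob, kappa):
    """D(w|x): acoustic score plus kappa times the language score."""
    return acoustic_logprob + kappa * language_logprob


def select_best(hypotheses, kappa):
    """Return the candidate word sequence with the highest D(w|x).

    hypotheses: list of (words, acoustic_logprob, language_logprob) tuples.
    """
    best = max(hypotheses, key=lambda h: evaluation_score(h[1], h[2], kappa))
    return best[0]


# Hypothetical candidates for one utterance (scores are made up).
nbest = [
    (("recognize", "speech"), -120.0, -8.0),
    (("wreck", "a", "nice", "beach"), -118.0, -15.0),
]
best_with_lm = select_best(nbest, kappa=1.0)
best_acoustic_only = select_best(nbest, kappa=0.0)
```

With κ = 1 the first candidate wins (-128 against -133); with κ = 0 only the acoustic score counts and the second candidate wins, which shows how κ balances the two scores.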

In the error correction model of the conventional method, equation (1) is modified as in the following equation (4).
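Equation (4) is not reproduced in this text. Since the text describes the factor exp Σ_i λ_i g_i(w) as a penalty or reward applied to the word sequence w, equation (4) presumably multiplies the Bayes score of equation (1) by that factor, as sketched below.

```latex
\begin{equation}
\hat{w} = \operatorname*{argmax}_{w} P(x \mid w)\, P(w)\, \exp \sum_{i} \lambda_i\, g_i(w) \tag{4}
\end{equation}
```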

In equation (4), exp Σ_i λ_i g_i(w) is a score that reflects the error tendency of the word sequence w, and it acts as a penalty or reward for w. g_i(w) (i = 1, ...) is the i-th feature function, and the element λ_i of the model parameters Λ = {λ_1, ...} is the weight (feature weight) for the feature function g_i(w). A feature function is defined so that, for a given word sequence (here, w), it returns the number of times a linguistic rule holds, and 0 if the rule does not hold. Examples of such feature functions g_i include the following.

(1) A function that returns the number of contiguous word pairs (u, v) contained in the word sequence w
(2) A function that returns the number of non-contiguous word pairs (u, v) contained in the word sequence w
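The two feature functions listed above, and the weighted sum Σ_i λ_i g_i(w) in the exponent of the correction term, can be sketched as follows. The function names and the weights are hypothetical. The reading of "non-contiguous" as any later position that is not immediately adjacent is also an assumption, since the text does not define it further.

```python
def count_contiguous_pair(words, u, v):
    """Number of positions where u is immediately followed by v."""
    return sum(1 for a, b in zip(words, words[1:]) if a == u and b == v)


def count_noncontiguous_pair(words, u, v):
    """Number of (i, j) with j > i + 1, words[i] == u and words[j] == v."""
    total = 0
    for i, a in enumerate(words):
        if a != u:
            continue
        # Skip the adjacent position i + 1; count every later occurrence of v.
        total += sum(1 for b in words[i + 2:] if b == v)
    return total


def correction_score(words, features, weights):
    """Weighted sum of feature function values for one word sequence."""
    return sum(lam * g(words) for g, lam in zip(features, weights))


words = ["a", "b", "c", "a", "b"]
features = [
    lambda w: count_contiguous_pair(w, "a", "b"),
    lambda w: count_noncontiguous_pair(w, "a", "b"),
]
weights = [0.5, -0.2]  # hypothetical feature weights (lambda_i)
score = correction_score(words, features, weights)
```

In this example the contiguous pair ("a", "b") occurs twice and the non-contiguous pair once, so the weighted sum is 0.5 × 2 − 0.2 × 1 = 0.8. A positive weight rewards a pattern and a negative weight penalizes it.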

The error tendencies of speech recognition are expressed, through the feature functions and feature weights, as penalties on linguistic features, and are estimated based on an evaluation function that minimizes the word errors in the training data. The model parameters Λ are estimated using a set of correct word sequences and speech recognition results, but collecting a large amount of training data with correct word sequences is usually difficult, particularly in terms of cost. For this reason, the error correction model of the conventional method adopts relatively short contexts, such as contiguous word pairs and triples, as features. Even if longer word chains were used as features, the sparseness of the training data would prevent a statistically robust model from being learned.

[2.2 Learning algorithm of the error correction model applied to this embodiment]
In this embodiment, to solve the problems of the conventional method, features related to context and topic are extracted based on a recurrent neural network, and the model parameters of the error correction model are estimated. A recurrent neural network allows dimensionality reduction of the features, but estimating the many parameters that connect the layers of the network requires a large amount of training data. This embodiment solves the data sparseness problem by estimating some of the parameters from static text such as news manuscripts.

FIG. 7 is a diagram illustrating a neural network. It shows a so-called Elman-type recursive neural network (recurrent neural network). The neural network shown in the figure consists of three layers: an input layer, a hidden layer, and an output layer. When used as a statistical language model, it takes a word as input and outputs the occurrence probability (connection probability) of each word that may follow that word. For a neural network language model with vocabulary size N, the input layer consists of N elements, and an input word is represented as a discrete vector in which only the element at the index corresponding to that word is set to 1 and all other elements are set to 0. The hidden layer consists of an arbitrary number of elements. The output layer consists of N elements and gives the occurrence probabilities of the words following the input word. The hidden layer is nonlinearly transformed by a sigmoid function before being passed to the output layer, and the output layer is normalized by a softmax function so that the values of its elements sum to 1, as required of a probability distribution.

従来のfeed-forward型のニューラルネットワークとは異なり、再帰的ニューラルネットワークでは、隠れ層の出力が入力層にフィードバックされる。フィードバックにより、再帰的ニューラルネットワークの出力は過去の隠れ層の系列に依存する。言語モデルであれば、出力である単語の確率分布は、過去の入力単語に依存することを意味する。つまり、再帰的に算出される隠れ層を入力に用いることで、長い文脈が考慮された単語の確率分布が出力される。 Unlike a conventional feed-forward neural network, the recursive neural network feeds back the output of the hidden layer to the input layer. With feedback, the output of the recursive neural network depends on the past hidden layer sequence. In the case of a language model, it means that the probability distribution of words as output depends on past input words. In other words, by using a recursively calculated hidden layer as an input, a probability distribution of words in consideration of a long context is output.

In the definition of the recursive neural network given in T. Mikolov and G. Zweig, "Context Dependent Recurrent Neural Network Language Model," Technical Report MSR-TR-2012-92, Microsoft, 2012, the input layer x_t, the hidden layer h_t, and the output layer o_t at time t are given by Equations (5) to (7) below, respectively.

In Equation (5), the input layer x_t is a vector consisting of the word vector w_t at time t and the hidden layer output h_{t-1} from one time step earlier (t-1). The N elements of the word vector w_t correspond to the individual words; only the element for the relevant word is set to 1, and all other elements are set to 0. The word represented by the word vector w_t is also written as word w_t. In Equation (6), M^h is the weight coefficient matrix for the input layer, and sigmoid(·) is the sigmoid function applied to each element of a vector. The number of elements in the hidden layer is arbitrary and is usually smaller than the number of elements in the input layer. In Equation (7), M^o is the weight coefficient (parameter) matrix for the hidden layer, and softmax(·) is the softmax function for the output layer. The output layer o_t represents the probability distribution of the word following the word represented by w_t. The N elements of the output layer o_t correspond to the individual words, and each element gives the posterior probability of its corresponding word.
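Equations (5) to (7) are not reproduced in this text. The sketch below assumes the forms implied by the description: x_t is the concatenation of w_t and h_{t-1}, h_t = sigmoid(M^h x_t), and o_t = softmax(M^o h_t). The function names and the toy weights are hypothetical and serve only to illustrate one time step.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def one_hot(index: int, size: int) -> Vector:
    """Word vector w_t: 1 at the word's index, 0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def matvec(m: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def sigmoid(z: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in z]


def softmax(z: Vector) -> Vector:
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(a - top) for a in z]
    total = sum(exps)
    return [e / total for e in exps]


def rnn_step(word_index: int, h_prev: Vector, M_h: Matrix, M_o: Matrix,
             vocab_size: int) -> Tuple[Vector, Vector]:
    x_t = one_hot(word_index, vocab_size) + h_prev  # assumed Eq. (5)
    h_t = sigmoid(matvec(M_h, x_t))                 # assumed Eq. (6)
    o_t = softmax(matvec(M_o, h_t))                 # assumed Eq. (7)
    return h_t, o_t


N, H = 3, 2                                   # toy vocabulary and hidden sizes
M_h = [[0.1] * (N + H) for _ in range(H)]     # H x (N + H)
M_o = [[0.2, -0.1], [0.0, 0.3], [0.1, 0.1]]   # N x H
h, o = rnn_step(0, [0.0] * H, M_h, M_o, N)
print(h, o)
```

Feeding the returned h back in as h_prev for the next word is what makes the output depend on the whole preceding word sequence.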

上述の再帰的ニューラルネットワークに基づく統計的言語モデルの学習では、話題に関する特徴を入力するために拡張的な手法が行われている。
図８は、拡張した再帰的ニューラルネットワークを示す図である。通常、統計的言語モデルにおける再帰的ニューラルネットワークでは、単語および１時刻前の隠れ層の出力を入力とする。この入力に、現在着目している発話の直近の発話から得られた話題に関する情報をさらに利用することで、より長い範囲の文脈（話題）をニューラルネットワークに反映できる。つまり、同図に示す拡張した再帰的ニューラルネットワークにおいては、単語ベクトルｗ_ｔおよび１時刻前（ｔ−１）の隠れ層の出力ｈ_ｔ−１に加え、話題に関する特徴量である話題特徴量ベクトルｖをさらに入力としている。 In the learning of a statistical language model based on the recursive neural network described above, an extended method is used to input features relating to topics.
FIG. 8 is a diagram showing an extended recursive neural network. Usually, in a recursive neural network in a statistical language model, a word and an output of a hidden layer one time before are input. By further using information related to the topic obtained from the latest utterance of the utterance currently focused on for this input, a longer range of context (topic) can be reflected in the neural network. That is, in the expanded recursive neural network shown in the figure, in addition to the word vector w _t and the output h _t−1 of the hidden layer one time before (t−1), the topic feature amount vector which is a feature amount related to the topic. v is further input.

FIG. 9 is a diagram showing the relationship among the feature quantities in the extended recursive neural network. In the figure, the predicted word (output layer o_t) that follows a word (word w_t) in the utterance s_n currently under consideration is obtained using a topic-dependent feature (topic feature vector v) and a feature based on the word string preceding w_t in utterance s_n (hidden layer output h_{t-1}). The topic-dependent feature is extracted from the utterances s_{n-m}, ..., s_{n-1} that precede utterance s_n.

The input layer x_t of the extended recursive neural network is given by Equation (8) below. The hidden layer h_t and the output layer o_t are the same as in Equations (6) and (7) above.
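Equation (8) is not reproduced in this text. A minimal sketch, assuming that the extended input simply concatenates the word vector w_t, the topic feature vector v, and the previous hidden layer output h_{t-1}; the ordering of the three parts and the function name are assumptions.

```python
from typing import List


def extended_input(word_index: int, topic_vec: List[float],
                   h_prev: List[float], vocab_size: int) -> List[float]:
    """Build x_t = [w_t, v, h_{t-1}] as one flat vector (assumed layout)."""
    w_t = [1.0 if i == word_index else 0.0 for i in range(vocab_size)]
    return w_t + list(topic_vec) + list(h_prev)


print(extended_input(1, [0.2, 0.8], [0.5], 3))
```

With this layout the weight coefficient matrix M^h needs one column per element of the longer input, that is, N plus the topic dimension plus the hidden layer size.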

上述した再帰的ニューラルネットワークは、話題に関する特徴量の有無にかかわらず、ｎ−ｇｒａｍ言語モデルの条件付き確率を推定するために用いられることがほとんどであり、誤り修正モデルで利用されることはない。 The recursive neural network described above is mostly used to estimate the conditional probability of the n-gram language model regardless of the presence or absence of feature values related to the topic, and is not used in the error correction model. .

ニューラルネットワークでは、入力層、隠れ層、出力層の各素子間をつなぐ重み係数（結合重み）がパラメータとなるが、一般に素子間の重み係数パラメータは数が多いため、統計的に頑健な学習を行うには大量の学習データを必要とする。しかしながら、誤り修正モデルでは、音声に対する正解単語列を人手により用意しなければならないため、ニューラルネットワークの学習に十分なデータを用意することが困難である。この課題を解決するために、本実施形態の誤り修正モデル学習装置では、図１に示すニューラルネットワークを採用する。 In a neural network, the weighting coefficient (coupling weight) that connects each element in the input layer, hidden layer, and output layer is a parameter. Generally, there are many weighting coefficient parameters between elements, so statistically robust learning is possible. To do it requires a lot of learning data. However, in the error correction model, it is difficult to prepare data sufficient for learning of a neural network because a correct word string for speech must be prepared manually. In order to solve this problem, the neural network shown in FIG. 1 is employed in the error correction model learning device of the present embodiment.

FIG. 1 is a diagram showing the neural network employed in the present embodiment. In this neural network, an output layer 2 for the error correction model is added to the recursive neural network shown in FIG. 8. In FIG. 1, the output layer of the recursive neural network shown in FIG. 8 is labeled output layer 1. Output layer 2 is a vector representation of the values of the feature functions used in the error correction model; it outputs the value o′_t obtained by multiplying the hidden layer h_t produced by the recursive neural network by the weight coefficient matrix M^o′. The error correction model learning device of the present embodiment defines the feature functions that make up output layer 2 and then learns the weight coefficient matrix M^o′.
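The computation of output layer 2 can be sketched as a plain matrix product with no normalization. The index convention is an assumption taken from the later description of M^o′, where row j refers to a hidden layer element and column k to a feature function; the function name is hypothetical.

```python
from typing import List


def output_layer2(h_t: List[float], M_o2: List[List[float]]) -> List[float]:
    """o'_k = sum_j M_o2[j][k] * h_t[j]; unlike output layer 1, no softmax."""
    n_features = len(M_o2[0])
    return [sum(M_o2[j][k] * h_t[j] for j in range(len(h_t)))
            for k in range(n_features)]


h_t = [0.5, 1.0]                 # toy hidden layer with 2 elements
M_o2 = [[1.0, 0.0, 2.0],         # row j = hidden element, column k = feature
        [0.0, 1.0, -1.0]]
print(output_layer2(h_t, M_o2))
```

Because only M_o2 is learned from the transcribed speech data, the number of parameters estimated from that scarce data is the hidden layer size times the number of feature functions.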

A feature of the error correction model learning device according to the present embodiment is that, within the neural network shown in FIG. 1, the conventional recursive neural network portion is learned from static text (news manuscripts, web text, and the like), which can be collected in relatively large quantities. That is, the device first obtains from static text the weight coefficient matrix M^h, which holds the connection weights between the input layer and the hidden layer, and then separately learns only the weight coefficient matrix M^o′, which holds the connection weights used for the error correction model. With this procedure, the device can learn the error correction model from learning data consisting of speech recognition results and correct word strings by learning only the connection weights between the hidden layer and the output layer, without learning the connection weights of the lowest layer (input layer to hidden layer) of the recursive neural network. Another feature is that, by employing a recursive neural network, the device obtains an error correction model that takes into account information such as topics and contexts longer than those of the conventional method. That is, the device accounts for long context by repeatedly computing the hidden layer h_t at the current time using the output h_{t-1} of the hidden layer one time step earlier as input, and accounts for topic by using as input the topic feature vector v, a feature quantity related to the topic obtained from the utterances preceding the current utterance s_n.

［３．誤り修正モデル学習装置の構成］
図２は、本発明の一実施形態による誤り修正モデル学習装置１０の構成を示す機能ブロック図であり、本実施形態と関係する機能ブロックのみ抽出して示してある。誤り修正モデル学習装置１０は、コンピュータ装置により実現され、同図に示すように、言語モデル学習処理部２０、誤り修正モデル学習処理部３０、及び音声認識処理部４０を備えて構成される。 [3. Configuration of error correction model learning device]
FIG. 2 is a functional block diagram showing the configuration of the error correction model learning apparatus 10 according to one embodiment of the present invention, and only functional blocks related to the present embodiment are extracted and shown. The error correction model learning device 10 is realized by a computer device, and includes a language model learning processing unit 20, an error correction model learning processing unit 30, and a speech recognition processing unit 40, as shown in FIG.

言語モデル学習処理部２０は、ニュース原稿やウェブ上のテキストデータを学習データとして、図１に示す再帰的ニューラルネットワークに基づく言語モデルを学習する。言語モデルは、ニューラルネットワークにおける重み係数行列Ｍ^ｈ、Ｍ^ｏに相当する。言語モデル学習処理部２０は、言語資源記憶部２１、話題モデル学習部２２、及び言語モデル学習部２３を備えて構成される。 The language model learning processing unit 20 learns a language model based on the recursive neural network shown in FIG. 1 using the news manuscript or text data on the web as learning data. The language model corresponds to the weight coefficient matrices M ^h and M ^o in the neural network. The language model learning processing unit 20 includes a language resource storage unit 21, a topic model learning unit 22, and a language model learning unit 23.

The language resource storage unit 21 stores text data of news manuscripts, text data collected from the web, and the like as learning text. The topic model learning unit 22 learns, from the learning text stored in the language resource storage unit 21, a topic model used to extract topic-related feature quantities (hereinafter, "topic feature quantities") by a statistical dimensionality reduction technique. The topic model learning unit 22 outputs topic model data D1 in which the learned topic model is set. The language model learning unit 23 learns the weight coefficient matrices M^h and M^o of the neural network as a language model, using each sentence in the learning text stored in the language resource storage unit 21 together with the topic feature quantities extracted from the sentences preceding it by means of the topic model set in the topic model data D1. The language model learning unit 23 outputs language model data D2 in which the learned language model is set.

誤り修正モデル学習処理部３０は、図１に示す再帰的ニューラルネットワークに基づき、言語モデル学習処理部２０が得た言語モデルを拡張した誤り修正モデルを、音声認識結果を学習データに用いて学習する。誤り修正モデル学習処理部３０は、音声資源記憶部３１、整列部３２、音声認識部３３、素性定義部３４、及び誤り修正モデル学習部３５を備えて構成される。 The error correction model learning processing unit 30 learns an error correction model obtained by extending the language model obtained by the language model learning processing unit 20 based on the recursive neural network shown in FIG. 1 using the speech recognition result as learning data. . The error correction model learning processing unit 30 includes a speech resource storage unit 31, an alignment unit 32, a speech recognition unit 33, a feature definition unit 34, and an error correction model learning unit 35.

The speech resource storage unit 31 stores learning data consisting of the speech data of utterances and the correct word strings for that speech data. The alignment unit 32 aligns the correct word strings with the corresponding speech data stored in the speech resource storage unit 31, and outputs correct word string data D3 in which the aligned correct word strings are set. The speech recognition unit 33 recognizes the speech data used for aligning the correct word strings, using the acoustic model stored in the acoustic model storage unit 41 and the language model set in the language model data D2, and outputs speech recognition result data D4 in which the recognition results are set. The feature definition unit 34 defines feature functions based on the words contained in the correct word strings set in D3 and the words contained in the speech recognition results set in D4. The error correction model learning unit 35 learns the weight coefficient matrix M^o′, which is the model parameter of the error correction model that uses the feature functions defined by the feature definition unit 34. For this learning it uses the correct word strings set in D3, the speech recognition results set in D4, and the hidden layer output values obtained when the correct word strings and the speech recognition results are each input to the neural network. The error correction model learning unit 35 sets the error correction model with the learned model parameters in error correction model data D5 and outputs it to the speech recognition processing unit 40.

音声認識処理部４０は、音響モデル、言語モデル学習処理部２０が学習した言語モデル、及び誤り修正モデル学習処理部３０が学習した誤り修正モデルを用いて音声認識を行い、認識結果を出力する。音声認識処理部４０は、音響モデル記憶部４１、言語モデル記憶部４２、誤り修正モデル記憶部４３、及び入力音声認識部４４を備えて構成される。 The speech recognition processing unit 40 performs speech recognition using the acoustic model, the language model learned by the language model learning processing unit 20, and the error correction model learned by the error correction model learning processing unit 30, and outputs a recognition result. The speech recognition processing unit 40 includes an acoustic model storage unit 41, a language model storage unit 42, an error correction model storage unit 43, and an input speech recognition unit 44.

音響モデル記憶部４１は、音響モデルを記憶する。言語モデル記憶部４２は言語モデル学習処理部２０において学習した言語モデルを設定した言語モデルデータＤ２を記憶する。誤り修正モデル記憶部４３は、誤り修正モデル学習処理部３０において学習した誤り修正モデルを設定した誤り修正モデルデータＤ５を記憶する。入力音声認識部４４は、音響モデル記憶部４１に記憶されている音響モデル、言語モデル記憶部４２から読み出した言語モデル、及び誤り修正モデル記憶部４３から読み出した誤り修正モデルを用いて入力音声データＤ６を音声認識し、音声認識結果を設定した入力音声認識結果データＤ７を出力する。 The acoustic model storage unit 41 stores an acoustic model. The language model storage unit 42 stores language model data D2 in which the language model learned in the language model learning processing unit 20 is set. The error correction model storage unit 43 stores error correction model data D5 in which the error correction model learned in the error correction model learning processing unit 30 is set. The input speech recognition unit 44 uses the acoustic model stored in the acoustic model storage unit 41, the language model read from the language model storage unit 42, and the error correction model read from the error correction model storage unit 43 to input speech data D6 is voice-recognized, and input voice recognition result data D7 in which a voice recognition result is set is output.

Note that the speech data stored in the language resource storage unit 21 and the input speech data D6 represent feature quantities obtained by short-time spectral analysis of the speech waveforms of utterances.

[4. Processing procedure of error correction model learning device]
FIG. 3 is a diagram showing the overall processing flow of the error correction model learning device 10 shown in FIG. 2. The processing of each step shown in FIG. 3 is described below.

［４．１言語モデル学習処理部２０の処理手順］
言語モデル学習処理部２０は、図１に示す再帰的ニューラルネットワークに基づく言語モデルを学習する。言語資源記憶部２１には、放送局内のニュース原稿を示すテキストデータや、ウェブ上のテキストデータなどが学習テキストとして集積されている。言語資源記憶部２１は、学習テキストを記事ごとに分類し、各記事を文書データとして記憶する。 [4.1 Processing Procedure of Language Model Learning Processing Unit 20]
The language model learning processing unit 20 learns a language model based on the recursive neural network shown in FIG. In the language resource storage unit 21, text data indicating a news manuscript in the broadcasting station, text data on the web, and the like are accumulated as learning texts. The language resource storage unit 21 classifies the learning text for each article and stores each article as document data.

［４．１．１ステップＳ１：話題特徴量抽出処理］
再帰的ニューラルネットワークの学習では、話題特徴量を入力として必要とする。このため、再帰的ニューラルネットワークに基づく言語モデルを学習する前に、話題モデル学習部２２において、話題特徴量の抽出に用いる話題モデルを学習する。本実施形態では、話題特徴量として非負値行列因子分解（Non-negative Matrix Factorization）による特徴ベクトルを用いる。話題モデル学習部２２は、言語資源記憶部２１に記憶されている学習データから各記事の文書データを読み出し、読み出した文書データに非負値行列因子分解を適用することで話題特徴量を抽出する。非負値行列因子分解については、例えば、文献「D. D. Lee and H. S. Seung, Algorithm for Non-negative Matrix Factorization, In Advances in Neural Information Processing Systems, pp. 556-562, 2001.」に記載されている。 [4.1.1 Step S1: Topic Feature Extraction Process]
In recursive neural network learning, topic feature quantities are required as input. Therefore, before learning a language model based on a recursive neural network, the topic model learning unit 22 learns a topic model used for extracting topic feature values. In the present embodiment, feature vectors based on non-negative matrix factorization are used as topic feature quantities. The topic model learning unit 22 reads the document data of each article from the learning data stored in the language resource storage unit 21, and extracts the topic feature amount by applying non-negative matrix factorization to the read document data. Non-negative matrix factorization is described in, for example, the document “DD Lee and HS Seung, Algorithm for Non-Negative Matrix Factorization, In Advances in Neural Information Processing Systems, pp. 556-562, 2001”.

言語資源記憶部２１に記憶されている学習テキストに対してベクトル空間モデルを使えば、ｍ個の単語を含んだｎ個の記事からなる学習テキストの記事集合Ｄは、ｍ行ｎ列の単語−文書行列として表現できる。単語−文書行列の要素は、例えばその要素の列に対応した記事において、その要素の行に対応した単語が出現する相対頻度に基づいて定めることができる。 If a vector space model is used for the learning text stored in the language resource storage unit 21, an article set D of learning text composed of n articles including m words is an m-by-n word- It can be expressed as a document matrix. The elements of the word-document matrix can be determined based on, for example, the relative frequency that the word corresponding to the row of the element appears in the article corresponding to the element column.
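A minimal sketch of building the m-row, n-column word-document matrix from tokenized articles, using relative frequency as the element value. The function name and the alphabetical vocabulary ordering are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def word_document_matrix(articles: List[List[str]]) -> Tuple[List[str], List[List[float]]]:
    """Rows are words, columns are articles; each element is the relative
    frequency of the row's word within the column's article."""
    vocab = sorted({w for art in articles for w in art})
    row_of = {w: i for i, w in enumerate(vocab)}
    matrix = [[0.0] * len(articles) for _ in vocab]
    for col, art in enumerate(articles):
        counts = Counter(art)
        total = sum(counts.values())
        for w, c in counts.items():
            matrix[row_of[w]][col] = c / total
    return vocab, matrix


vocab, D = word_document_matrix([["rain", "rain", "wind"], ["vote"]])
print(vocab)
print(D)
```

Using relative rather than raw frequency keeps long and short articles on a comparable scale; raw counts would be an equally valid choice under the description above.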

In non-negative matrix factorization, when the learning text is represented as an m × n word-document matrix, the following Equation (9) is considered.

そして、記事集合Ｄをなるべく近似できるような上記の式（９）の行列Ｗ、行列Ｖとして、ｍ×ｒの非負行列Ｗ’、およびｒ×ｎの非負行列Ｖ’を求める。ただし、行列Ｅは誤差からなる行列である。式（９）に示す因子分解では、記事をｒ個の因子（潜在トピック）で表現することに相当する。つまり、非負行列Ｗ’には、記事集合Ｄに頻出する単語の出現パターンがｒ個の列それぞれに潜在トピックを表す基底として現れる。そして、非負行列Ｖ’の各列は、その列に対応する記事に対してｒ個の潜在トピックそれぞれが寄与している度合いを表す。ｍ＞＞ｒであれば、非負行列Ｖ’は、元の記事の単語行列を低次元の因子に圧縮した表現となるため、データスパースネスに対し頑健となる。 Then, an m × r non-negative matrix W ′ and an r × n non-negative matrix V ′ are obtained as the matrix W and the matrix V of the above equation (9) that can approximate the article set D as much as possible. However, the matrix E is a matrix composed of errors. The factorization shown in Expression (9) corresponds to expressing an article with r factors (latent topics). That is, in the non-negative matrix W ′, the appearance pattern of words frequently appearing in the article set D appears as a basis representing a latent topic in each of r columns. Each column of the non-negative matrix V ′ represents the degree to which r potential topics contribute to the article corresponding to the column. If m >> r, the non-negative matrix V ′ is an expression in which the word matrix of the original article is compressed to a low-dimensional factor, and is robust against data sparseness.

未知の文書の単語ベクトルｄに対する話題特徴量ベクトルｖは、式（１０）のように当該文書を非負行列Ｗ’により因子分解して得られる。単語ベクトルｄの各要素は単語に対応しており、文書における各単語の出現頻度が設定される。また、話題特徴量ベクトルｖは、非負行列Ｗ’に表れるｒ個の潜在トピックそれぞれが文書に寄与している度合いを表す。この次元圧縮されたｒ次元の話題特徴量ベクトルｖは、統計的言語モデルの話題特徴量として扱われる。 The topic feature vector v for the word vector d of an unknown document is obtained by factorizing the document with a non-negative matrix W ′ as shown in Equation (10). Each element of the word vector d corresponds to a word, and the appearance frequency of each word in the document is set. The topic feature vector v represents the degree to which each of r potential topics appearing in the non-negative matrix W ′ contributes to the document. This dimension-compressed r-dimensional topic feature vector v is treated as a topic feature of a statistical language model.

The topic model learning unit 22 reads the n items of document data stored as learning data in the language resource storage unit 21 and, for each article represented by the document data, counts the appearance frequency of each of the m words. The topic model learning unit 22 generates the article set D whose elements are the counted word frequencies of each article: each element of D holds the appearance frequency, in the article corresponding to its column, of the word corresponding to its row. The topic model learning unit 22 applies non-negative matrix factorization to the generated article set D to calculate the non-negative matrices W′ and V′, and then outputs topic model data D1 in which the calculated non-negative matrix W′ is set as the topic model.
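The factorization step can be sketched with the multiplicative update rules for the squared-error objective from the cited Lee and Seung paper. Which objective the embodiment uses is not stated, so this choice is an assumption, as are the function names, the iteration count, and the fold-in routine `topic_features`, which keeps W′ fixed and solves for a non-negative v for a new document in the spirit of Equation (10).

```python
import random
from typing import List, Tuple

Matrix = List[List[float]]
EPS = 1e-9  # keeps denominators away from zero


def matmul(a: Matrix, b: Matrix) -> Matrix:
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(r) for r in zip(*a)]


def frob_error(d: Matrix, w: Matrix, v: Matrix) -> float:
    """Squared Frobenius norm of D - W V."""
    approx = matmul(w, v)
    return sum((x - y) ** 2 for rd, ra in zip(d, approx) for x, y in zip(rd, ra))


def nmf(d: Matrix, r: int, iters: int = 200, seed: int = 0) -> Tuple[Matrix, Matrix]:
    """Approximate the m x n matrix D by W (m x r) times V (r x n), both non-negative."""
    rng = random.Random(seed)
    m, n = len(d), len(d[0])
    w = [[rng.random() + 0.1 for _ in range(r)] for _ in range(m)]
    v = [[rng.random() + 0.1 for _ in range(n)] for _ in range(r)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, d), matmul(matmul(wt, w), v)
        v = [[v[i][j] * num[i][j] / (den[i][j] + EPS) for j in range(n)] for i in range(r)]
        vt = transpose(v)
        num, den = matmul(d, vt), matmul(w, matmul(v, vt))
        w = [[w[i][j] * num[i][j] / (den[i][j] + EPS) for j in range(r)] for i in range(m)]
    return w, v


def topic_features(d_vec: List[float], w: Matrix, iters: int = 200) -> List[float]:
    """Fold-in for a new document: hold W fixed and update only v."""
    wt = transpose(w)
    num = matmul(wt, [[x] for x in d_vec])
    wtw = matmul(wt, w)
    v = [[1.0] for _ in range(len(w[0]))]
    for _ in range(iters):
        den = matmul(wtw, v)
        v = [[v[i][0] * num[i][0] / (den[i][0] + EPS)] for i in range(len(v))]
    return [row[0] for row in v]


D = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 0.0, 4.0], [0.0, 3.0, 0.0]]
W, V = nmf(D, r=2)
print(frob_error(D, W, V))
print(topic_features([1.0, 0.0, 2.0, 0.0], W))
```

In the terms of the description, W plays the role of the topic model W′ stored in D1, and `topic_features` yields the r-dimensional vector v that is fed to the language model.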

［４．１．２ステップＳ２：言語モデル学習処理］
言語モデル学習部２３は、ステップＳ１において得られた話題モデルを用いて、言語資源記憶部２１に記憶されている学習テキストから再帰的ニューラルネットワークに基づく言語モデルを学習する。同様のモデルの学習は、例えば、文献「T. Mikolov and G. Zweig, Context Dependent Recurrent Neural Network Language Model, Microsoft Research Technical Report MSR-TR-2012-92, Microsoft, 2012.」に記載されているが、その概略は以下の通りである。 [4.1.2 Step S2: Language Model Learning Process]
The language model learning unit 23 learns a language model based on the recursive neural network from the learning text stored in the language resource storage unit 21, using the topic model obtained in step S1. Learning similar models is described in, for example, the document `` T. Mikolov and G. Zweig, Context Dependent Recurrent Neural Network Language Model, Microsoft Research Technical Report MSR-TR-2012-92, Microsoft, 2012. '' The outline is as follows.

まず、図１に示す再帰的ネットワークに含まれる、図８に示す再帰的ネットワークの部分を図１０に示すように展開し、通常のフィードフォワード型ニューラルネットワークで近似する。
図１０は、図８に示す再帰的ネットワークのフィードフォワード型ニューラルネットワークへの展開を示す図である。展開する深さは任意であるが、本実施形態では、深さを３とした例を示す。展開したフィードフォワード型ニューラルネットワークでは、誤差逆伝播法などのアルゴリズムを用いて、ニューラルネットワークの各層の重み係数行列を学習できる。誤差逆伝播法については、例えば、文献「R. Rojas, Neural Networks - A Systematic Introduction, pp.151-184, Springer-Verlag, 1996.」に記載されている。 First, the portion of the recursive network shown in FIG. 8 included in the recursive network shown in FIG. 1 is expanded as shown in FIG. 10 and approximated by a normal feedforward neural network.
FIG. 10 is a diagram showing the development of the recursive network shown in FIG. 8 into a feedforward neural network. The developing depth is arbitrary, but in the present embodiment, an example in which the depth is 3 is shown. In the developed feedforward neural network, the weighting coefficient matrix of each layer of the neural network can be learned using an algorithm such as an error back propagation method. The back propagation method is described in, for example, the document “R. Rojas, Neural Networks-A Systematic Introduction, pp. 151-184, Springer-Verlag, 1996.”.

言語モデル学習部２３は、言語資源記憶部２１の学習テキストを逐次的に処理することで重み係数行列を学習する。この学習には、例えば、文献「P. J. Werbos, Backpropagation Through Time: What It Does and How to Do It, Proceedings of The IEEE, vol. 78, no. 10, pp.1550-1560, 1990.」に記載のBackpropagation Through Time アルゴリズムを用いることができる。学習の手順を以下に示す。 The language model learning unit 23 learns the weighting coefficient matrix by sequentially processing the learning text in the language resource storage unit 21. This learning is described in, for example, the document `` PJ Werbos, Backpropagation Through Time: What It Does and How to Do It, Proceedings of The IEEE, vol. 78, no. 10, pp. 1550-1560, 1990. '' The Backpropagation Through Time algorithm can be used. The learning procedure is shown below.

(Procedure 1) The language model learning unit 23 treats the m sentences {s_{n-m}, s_{n-m+1}, ..., s_{n-1}} immediately preceding a sentence s_n (n = 1, ..., N) of an article in the learning text as a single sentence and counts the appearance frequency of each word. Using the non-negative matrix W′ set in the topic model data D1, the language model learning unit 23 converts the word vector d, which represents the counted word frequencies, into the dimension-reduced basis-vector representation according to Equation (10), and thereby calculates the topic feature vector v_n.
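The counting part of Procedure 1 can be sketched as follows. The function name, the 0-based sentence index n, and the choice to ignore words outside the vocabulary are assumptions; the resulting vector d would then be converted into v_n with the topic model.

```python
from collections import Counter
from typing import Dict, List


def history_word_vector(sentences: List[List[str]], n: int, m: int,
                        vocab_index: Dict[str, int]) -> List[float]:
    """Pool the (up to) m sentences before sentence n and count each word."""
    start = max(0, n - m)  # fewer than m sentences exist near the article start
    counts = Counter(w for s in sentences[start:n] for w in s)
    d = [0.0] * len(vocab_index)
    for w, c in counts.items():
        if w in vocab_index:  # out-of-vocabulary words are skipped
            d[vocab_index[w]] = float(c)
    return d


sentences = [["a", "b"], ["b", "c"], ["a"]]
vocab_index = {"a": 0, "b": 1, "c": 2}
print(history_word_vector(sentences, n=2, m=2, vocab_index=vocab_index))
```

For the first sentence of an article there is no history, so d is all zeros; how the embodiment handles that case is not stated.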

(Procedure 2) For the words {w_1, w_2, ..., w_t, ..., w_nT} that make up the sentence s_n in the learning text, let the t-th input/output of the unrolled recursive neural network be (w_t, w_{t+1}, w_{t+2}, v_n, h_{t-1}, o_{t-1}). Here, the words w_t, w_{t+1}, and w_{t+2} in the t-th input/output are vectors whose dimensionality equals the vocabulary size, in which only the element at the index of the corresponding word is 1 and all other elements are 0. Based on the error back propagation method, the language model learning unit 23 uses the input/output (w_t, w_{t+1}, w_{t+2}, v_n, h_{t-1}, o_{t-1}) to estimate the weight coefficient matrices M^h_t, M^h_{t+1}, M^h_{t+2}, and M^o.

(Procedure 3) The language model learning unit 23 averages the weight coefficient matrices M^h_t, M^h_{t+1}, and M^h_{t+2} estimated in Procedure 2 and updates the weight coefficient matrix M^h of the recursive neural network with this averaged matrix. The language model learning unit 23 also replaces each of M^h_t, M^h_{t+1}, and M^h_{t+2} with the averaged matrix.
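The averaging in Procedure 3 can be sketched as an element-wise mean over the three unrolled copies of the input-to-hidden weights, after which every copy is reset to the mean. The function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def average_matrices(mats: List[Matrix]) -> Matrix:
    """Element-wise mean of equally shaped matrices."""
    k = len(mats)
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / k for j in range(cols)]
            for i in range(rows)]


def tie_unrolled_weights(m_t: Matrix, m_t1: Matrix, m_t2: Matrix) -> Tuple[Matrix, List[Matrix]]:
    """Return the updated M^h and three independent copies for the unrolled layers."""
    avg = average_matrices([m_t, m_t1, m_t2])
    copies = [[row[:] for row in avg] for _ in range(3)]
    return avg, copies


M_h, unrolled = tie_unrolled_weights([[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]])
print(M_h)
```

Resetting the copies keeps the unrolled feedforward network equivalent to a single recurrent layer with shared weights at the start of the next update.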

（手順４）言語モデル学習部２３は、（手順１）〜（手順３）までの処理を、学習テキストすべてについて繰り返す。 (Procedure 4) The language model learning unit 23 repeats the processes from (Procedure 1) to (Procedure 3) for all the learning texts.

（手順５）言語モデル学習部２３は、（手順１）〜（手順４）までの処理を、重み係数行列Ｍ^ｈが収束するまで繰り返す。 (Procedure 5) The language model learning unit 23 repeats the processes from (Procedure 1) to (Procedure 4) until the weight coefficient matrix M ^h converges.

言語モデル学習部２３は、上記の処理によって求めた重み係数行列Ｍ^ｈと重み係数行列Ｍ^ｏを言語モデルとして設定した言語モデルデータＤ２を出力する。 Language model learning unit 23 outputs the language model data D2 obtained by setting the weighting coefficient matrix M ^h and weighting coefficient matrix M ^o obtained by the above process as a language model.

[4.2 Processing of the error correction model learning processing unit]
[4.2.1 Learning method of the error correction model]
The error correction model learning processing unit 30 learns the error correction model using the topic model obtained in step S1 and the recursive-neural-network language model obtained in step S2.

In the present embodiment, the feature functions of the error correction model are defined as functions of a word w. For example, the feature functions are as follows.

(1) A function that returns 1 when the word w is equal to u∈V
(2) A function that returns 1 when the part of speech of the word w is equal to c∈C

Here, V is the vocabulary, u is a word included in V, C is the set of parts of speech, and c is a part of speech included in C.
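The two kinds of feature function can be sketched as indicator functions. This is only an illustration: it assumes each word arrives with its part of speech already tagged and that the functions return 0 when the condition does not hold; the factory names and sample words are hypothetical.

```python
def word_feature(u):
    """Feature (1): returns 1 when the word equals u, otherwise 0."""
    return lambda word, pos: 1 if word == u else 0


def pos_feature(c):
    """Feature (2): returns 1 when the part of speech equals c, otherwise 0."""
    return lambda word, pos: 1 if pos == c else 0


# Two feature functions g_1 and g_2 for illustration.
features = [word_feature("tokyo"), pos_feature("noun")]

# Evaluate every g_k on one tagged word.
values = [g("tokyo", "noun") for g in features]
other = [g("goes", "verb") for g in features]
```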

Now, let the K feature functions be g_k (k = 1, ..., K). The posterior probability P(w|x, v) of obtaining the word string w given a speech input x and a topic feature vector v is defined by the following equation (11).

Here, N is the number of words w_i constituting the word string w, w_0^{i-1} is the history of the word w_i (the immediately preceding word string), and P(w_i | w_0^{i-1}, v) is the output probability of the recursive-neural-network language model (the output from output layer 1). That is, equation (11) takes the product of the n-gram probabilities of the words w_i constituting the word string w. The part following exp is the error correction model's contribution to the probability and is the sum of the outputs from output layer 2 of the recursive neural network. h_j(i) is the value of the j-th element of the hidden layer h(i) when the word w_i is propagated through the recursive neural network as input, and M_jk^o' is the element in row j and column k (the jk component) of the weight coefficient matrix M^o' between the hidden layer and output layer 2.

The error correction model learning processing unit 30 obtains the weight coefficient matrix M^o' in the error correction model learning process. In the present embodiment, the error correction model is learned by margin maximization.
Margin maximization requires the following two items as a pair for each speech input (speech data) x.
(1) The correct word string w^r (aligned with the speech data)
(2) The speech recognition result w^d (aligned with the speech data)

From the log posterior probability obtained by taking the logarithm of equation (11), the score S(w|x) that the error correction model assigns to the word string w for the speech input x is defined as in the following equation (12).

Equation (12) is the error correction model used in the present embodiment. Here, f_am(w|x) is the log score (acoustic score) given by the acoustic model (the HMM-GMM described above). μ_lm is the weight of the language score relative to the acoustic score. f_lm(w_i | w_0^{i-1}, v) is the log score (language score) of the word w_i given by the recursive-neural-network language model and corresponds to the output layer o_t. Σ_i Σ_k g_k(w_i) Σ_j h_j(i) M_jk^o' is a score reflecting the error tendency of the word string w. Thus, the error correction model is an expression that corrects the speech recognition score using linguistic features weighted by the hidden layer outputs and the model parameters.
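The score described for equation (12) can be sketched as follows. The sketch assumes the three terms are simply added, with μ_lm multiplying the summed language scores, which is how the surrounding text reads; the function name, argument layout, and numbers are illustrative only.

```python
def correction_score(acoustic, lm_scores, features, hidden, m_o, mu_lm):
    """Assumed form: f_am + mu_lm * sum_i f_lm(i)
    + sum_i sum_k g_k(w_i) * sum_j h_j(i) * M[j][k]."""
    language = mu_lm * sum(lm_scores)
    correction = 0.0
    for g_i, h_i in zip(features, hidden):      # one entry per word w_i
        for k, g_ik in enumerate(g_i):          # feature values g_k(w_i)
            if g_ik:
                # Hidden-layer output projected onto column k of M^o'.
                correction += g_ik * sum(h_ij * m_o[j][k] for j, h_ij in enumerate(h_i))
    return acoustic + language + correction


# Toy two-word hypothesis with K = 2 features and a two-element hidden layer.
score = correction_score(
    acoustic=-10.0,
    lm_scores=[-1.0, -2.0],
    features=[[1, 0], [0, 1]],
    hidden=[[0.5, 0.5], [1.0, 0.0]],
    m_o=[[1.0, 2.0], [3.0, 4.0]],   # rows: hidden elements j, columns: features k
    mu_lm=2.0,
)
```

With these toy values the language term is -6.0 and the correction term is 4.0, so `score` is -12.0.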

Let S(w^r|x) be the score of the correct word string w^r under equation (12) and S(w^d|x) be the score of the speech recognition result w^d. The evaluation function defined by their difference (margin) is Dm, shown in the following equation (13), and it is maximized with respect to the weight coefficient matrix M^o'.

The gradient of the difference Dm with respect to the jk component M_jk^o' of the weight coefficient matrix M^o' is given by the following equation (14).

This gradient is proportional to the difference between two sums. The first is the sum, over the words w_i^r constituting the correct word string w^r, of the feature function value g_k(w_i^r) weighted by the value h_j^r(i) of the j-th hidden layer element calculated for that word. The second is the sum, over the words w_i'^d constituting the speech recognition result w^d, of the feature function value g_k(w_i'^d) weighted by the value h_j^d(i') of the j-th hidden layer element calculated for that word.
According to the stochastic gradient descent method, the update formula for the weight coefficient matrix M^o' is given by the following equation (15).
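The equation images for (13) to (15) are not present in this text. The block below is a reconstruction from the surrounding prose only, not the published formulas. It assumes that the only part of the score S depending on M^{o'} is the correction term Σ_i Σ_k g_k(w_i) Σ_j h_j(i) M_jk^{o'}, so the acoustic and language terms drop out of the derivative, and that the update adds the gradient because D_m is maximized.

```latex
\begin{align}
D_m &= S(w^r \mid x) - S(w^d \mid x) \tag{13}\\
\frac{\partial D_m}{\partial M^{o'}_{jk}}
  &\propto \sum_{i} g_k(w^r_i)\, h^r_j(i) \;-\; \sum_{i'} g_k(w^d_{i'})\, h^d_j(i') \tag{14}\\
M^{o'}_{jk} &\leftarrow M^{o'}_{jk} + \eta\, \frac{\partial D_m}{\partial M^{o'}_{jk}} \tag{15}
\end{align}
```

The sign and any constant factor in the published equation (15) may differ from this sketch.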

Here, η is a constant.
The error correction model learning processing unit 30 applies the above method to the entire learning data to learn the error correction model. That is, it statistically calculates each element of M^o', the model parameter of the error correction model, so as to maximize the difference Dm. Dm is an evaluation function defined using the generation probability (log posterior probability) of the correct word string and the generation probability (log posterior probability) of the speech recognition result under the error correction model, given the speech data.
The processing by which the error correction model learning processing unit 30 obtains the error correction model is described below.

[4.2.2 Step S3: Aligned correct word string acquisition process]
The alignment unit 32 aligns the corresponding correct words with the speech data stored as learning data in the speech resource storage unit 31. The alignment is performed on the learning data items in the speech resource storage unit 31 in their stored order, because the order of the learning data must be preserved in order to calculate the topic feature vector. The alignment unit 32 also records, for each word, the hidden layer output produced when word prediction is performed with the recursive-neural-network language model.

Specifically, the alignment unit 32 uses existing techniques to associate each word constituting the correct word string with its utterance start time in the speech data, and arranges the correct word string in order of utterance time. When performing the alignment, the alignment unit 32 uses the acoustic model stored in the acoustic model storage unit 41 and the language model and topic model learned by the language model learning processing unit 20 to assign an acoustic score and a language score to each word constituting the correct word string. The alignment unit 32 calculates the language score f_lm(w_i | w_0^{i-1}, v) of each word w_i constituting the correct word string w^r using equations (8), (6), and (7), with the weight coefficient matrices M^h and M^o taken from the language model set in the language model data D2. Here, w_t in equation (8) is the word vector representing the word w_i, and h_{t-1} is the hidden layer output obtained by equation (6) when the language score was calculated for the preceding word w_{i-1}.
The alignment unit 32 calculates the topic feature vector v by equation (10), using the topic model (non-negative matrix W') acquired from the topic model data D1 and a word vector d representing the word frequencies obtained from the group of correct word strings preceding the correct word string w^r. The alignment unit 32 sets the aligned correct word string w^r, with the acoustic score, language score, and hidden layer output of each word attached, in the correct word string data D3 and outputs it. The hidden layer output attached to the word w_i is the value of the hidden layer h_t of equation (6) at the time the language score f_lm(w_i | w_0^{i-1}, v) was calculated.
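The per-word bookkeeping in step S3 can be pictured with the sketch below. Equations (6) to (8) are not reproduced in this section, so a hypothetical callable `lm_step` stands in for them; `ScoredWord`, `annotate`, and `toy_step` are likewise illustrative names, and the toy scoring rule has no linguistic meaning.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredWord:
    word: str
    start_time: float              # utterance start time from the alignment
    acoustic_score: float = 0.0
    language_score: float = 0.0
    hidden: list = field(default_factory=list)  # hidden-layer output recorded for this word


def annotate(words, topic_vector, lm_step, initial_hidden):
    """Thread the hidden state through the words in utterance-time order."""
    hidden = initial_hidden
    for w in sorted(words, key=lambda s: s.start_time):
        # lm_step is a placeholder for equations (6)-(8): it takes the word,
        # the topic vector and the previous hidden output h_{t-1}.
        w.language_score, hidden = lm_step(w.word, topic_vector, hidden)
        w.hidden = hidden
    return words


def toy_step(word, topic_vector, prev_hidden):
    """Stand-in language model step used only to exercise the bookkeeping."""
    new_hidden = [x + len(word) for x in prev_hidden]
    return -float(len(word)), new_hidden


words = [ScoredWord("b", 1.0), ScoredWord("aa", 0.0)]
annotate(words, topic_vector=[], lm_step=toy_step, initial_hidden=[0.0])
```

Each word ends up carrying the hidden output that was computed when its own language score was calculated, which is what the later gradient computation reads back.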

[4.2.3 Step S4: Speech recognition processing of the learning data]
Meanwhile, the speech recognition unit 33 performs speech recognition on the speech data stored as learning data in the speech resource storage unit 31, using the language model indicated by the language model data D2 and the topic model indicated by the topic model data D1, and obtains the speech recognition result w^d. By the same processing as in step S3, and in the same way as for the correct word string w^r indicated by the correct word string data D3, the speech recognition unit 33 assigns an acoustic score and a language score to each word constituting the speech recognition result w^d, and records the hidden layer output produced when word prediction is performed with the recursive neural network. That is, the speech recognition unit 33 calculates the language score f_lm(w_i | w_0^{i-1}, v) of each word w_i constituting the speech recognition result w^d by equations (8), (6), and (7). In doing so, it uses the language model (weight coefficient matrices M^h and M^o) set in the language model data D2 and the topic model (non-negative matrix W') acquired from the topic model data D1. The speech recognition unit 33 sets the speech recognition result w^d, with the acoustic score, language score, and hidden layer output of each word attached, in the speech recognition result data D4 and outputs it.

[4.2.4 Step S5: Feature function definition process]
The feature definition unit 34 extracts linguistic features from the words included in the correct word string w^r indicated by the correct word string data D3 and from the words included in the speech recognition result w^d indicated by the speech recognition result data D4, and obtains feature functions defined by the extracted linguistic features. The feature functions are defined as described above.

For example, the feature definition unit 34 extracts all feature functions conforming to the above rules from the correct word string w^r and the speech recognition result w^d, and counts how often each extracted feature function occurs. It adopts the feature functions whose occurrence count is equal to or greater than a predetermined threshold as the feature functions g_k used for learning the error correction model. Let K be the number of feature functions obtained in this way. The feature definition unit 34 outputs the correct word string data D3, the speech recognition result data D4, and the obtained feature functions g_k to the error correction model learning unit 35.
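The frequency-threshold selection in step S5 can be sketched as below, assuming each word arrives as a (surface form, part of speech) pair and that a feature is identified by a (kind, value) tuple. Both representations and the function name `select_features` are illustrative choices, not taken from the specification.

```python
from collections import Counter


def select_features(sentences, threshold):
    """Count word and part-of-speech features, keep those seen >= threshold times."""
    counts = Counter()
    for sentence in sentences:
        for word, pos in sentence:
            counts[("word", word)] += 1   # feature type (1): word identity
            counts[("pos", pos)] += 1     # feature type (2): part of speech
    return sorted(f for f, n in counts.items() if n >= threshold)


# One correct word string and one recognition result for the same utterance.
reference = [("tokyo", "noun"), ("is", "verb")]
hypothesis = [("kyoto", "noun"), ("is", "verb")]

selected = select_features([reference, hypothesis], threshold=2)
```

Here the K = 3 surviving features are the two part-of-speech features and the word feature for "is"; the two city names occur once each and are dropped.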

[4.2.5 Step S6: Error correction model learning process]
FIG. 4 shows the processing flow of the error correction model learning process executed by the error correction model learning unit 35.
The error correction model learning unit 35 sets n to the initial value 1 (step S11), sets k to the initial value 1 (step S12), and sets j to the initial value 1 (step S13).
The error correction model learning unit 35 obtains the correct word string w^r_n and the speech recognition result w^d_n corresponding to the n-th speech data x_n of the learning data stored in the speech resource storage unit 31. It reads the correct word string w^r_n from the correct word string data D3 and the speech recognition result w^d_n from the speech recognition result data D4. For the k-th feature function, the error correction model learning unit 35 calculates the following equation (16), which is the right-hand side of equation (14) (step S14).

The error correction model learning unit 35 acquires the value h_j^r(i) of the j-th element from the hidden layer output attached to each word w_i^r constituting the correct word string w^r_n, and acquires the value h_j^d(i') of the j-th element from the hidden layer output attached to each word w_i'^d constituting the speech recognition result w^d_n.
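Step S14 can be illustrated with a short sketch that follows the prose description of the gradient: a sum over the correct word string minus a sum over the recognition result. It assumes each word is stored as a pair of (feature values g_k, hidden layer output h_j); the function name and numbers are illustrative.

```python
def margin_gradient(ref_words, hyp_words, j, k):
    """Difference of feature values weighted by the j-th hidden element.

    Each word is a (feature_values, hidden_output) pair.
    """
    ref_sum = sum(g[k] * h[j] for g, h in ref_words)   # correct word string w^r
    hyp_sum = sum(g[k] * h[j] for g, h in hyp_words)   # recognition result w^d
    return ref_sum - hyp_sum


# Toy data: two reference words, one hypothesis word, K = 2, two hidden elements.
ref_words = [([1, 0], [0.5, 1.0]), ([0, 1], [0.25, 0.0])]
hyp_words = [([1, 0], [0.25, 1.0])]

grad_00 = margin_gradient(ref_words, hyp_words, j=0, k=0)
grad_01 = margin_gradient(ref_words, hyp_words, j=0, k=1)
```

When the recognition result equals the correct word string and carries the same hidden outputs, both sums coincide and the gradient is zero, so correctly recognized utterances leave the weights unchanged.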

In accordance with equation (15) above, the error correction model learning unit 35 calculates the jk component M_jk^o' of the weight coefficient matrix M^o' by the following equation (17) (step S15).

In equation (17), (∂D/∂M_jk^o') is the result of calculating equation (16). In the present embodiment, the error correction model learning unit 35 updates the jk component of the weight coefficient matrix M^o' based on the averaged stochastic gradient descent method, as shown in the following equations (18) and (19) (step S16). M~_jk^o' on the left side of equation (18) is the jk component of the updated weight coefficient matrix M^o'. Equation (19) is the average of the jk components of the weight coefficient matrix M^o' calculated in the 1st through n-th iterations of the loop, and is the M~_jk^o' on the right side of equation (18).
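Averaged stochastic gradient updates can be sketched for a single weight as below. Equations (17) to (19) are not reproduced in this text, so the sketch assumes the plain step "value += η × gradient" followed by a running mean over the iterates seen so far; the class name `AveragedWeight` is hypothetical, and how the published equations combine the iterate and the average may differ.

```python
class AveragedWeight:
    """One weight M_jk with its running average over update steps."""

    def __init__(self, value=0.0):
        self.value = value      # current iterate
        self.average = 0.0      # mean of the iterates from steps 1..n
        self.steps = 0

    def update(self, gradient, eta):
        # Gradient step (ascent, since the margin is being maximized).
        self.value += eta * gradient
        self.steps += 1
        # Incremental mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n.
        self.average += (self.value - self.average) / self.steps
        return self.average


weight = AveragedWeight()
averages = [weight.update(g, eta=0.5) for g in (1.0, 1.0, -2.0)]
```

The iterates here are 0.5, 1.0, and 0.0, and the running averages are 0.5, 0.75, and 0.5; averaging damps the swing caused by the last, larger gradient.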

The error correction model learning unit 35 adds 1 to the current value of j and repeats the processing from step S14 until j reaches the number of rows of the weight coefficient matrix M^o' (the number of hidden layer elements) (step S17).
The error correction model learning unit 35 adds 1 to the current value of k and repeats the processing from step S13 until k reaches the number K of feature functions (the number of columns of the weight coefficient matrix M^o') (step S18).
The error correction model learning unit 35 adds 1 to the current value of n and repeats the processing from step S12 until the processing has finished for all speech data x_n of the learning data (step S19).

When the processing has finished for all speech data x_n of the learning data, the error correction model learning unit 35 judges whether the update has converged according to whether the change in the weight coefficient matrix M^o' since the previous convergence check is within a predetermined range (step S20). If it judges that the update has not converged (step S20: NO), it repeats the processing from step S11. If it judges that the update has converged (step S20: YES), it ends the processing of FIG. 4. The error correction model learning unit 35 generates the error correction model by using, in equation (12), the weight coefficient matrix M^o' at the time the update converged, and outputs error correction model data D5, in which the generated error correction model is set, to the speech recognition processing unit 40.
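The convergence check of step S20 can be sketched as follows. The text only says that the change must be within a predetermined range, so measuring the change as the largest absolute element-wise difference is an assumption made here, as are the function name and the tolerance values.

```python
def has_converged(previous, current, tolerance):
    """True when no element of the matrix moved by more than `tolerance`."""
    largest_change = max(
        abs(c - p)
        for prev_row, cur_row in zip(previous, current)
        for p, c in zip(prev_row, cur_row)
    )
    return largest_change <= tolerance


# Weight matrix snapshots from two successive passes over the learning data.
before = [[0.0, 1.0], [2.0, 3.0]]
after = [[0.25, 1.0], [2.0, 3.0]]

loose = has_converged(before, after, tolerance=0.5)
strict = has_converged(before, after, tolerance=0.125)
```

With the loose tolerance the loop of FIG. 4 would stop here; with the strict one it would return to step S11 for another pass.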

[4.3 Processing of the speech recognition processing unit]
Speech recognition algorithms are usually based on Viterbi search, and expand word hypotheses as nodes (vertices) of a graph each time speech input is obtained.
FIG. 5 shows the expansion of word hypotheses in speech recognition. Each node stores the information on the word hypotheses constituting the speech recognition result together with the speech recognition score, so that the path can be traced backward toward the start point of the speech input.

FIG. 11 shows the data structure of node data in conventional speech recognition. The node data of each node is defined as a structure holding the data shown in the figure. That is, the data structure of each node has "int word", an index identifying the word hypothesis corresponding to the node; "float score", which holds the score of this word hypothesis given by the acoustic model, language model, and error correction model; and "node* backptr", which points to the node immediately preceding the node.

However, in a recursive neural network the hidden layer output changes depending on the input word string, so it must be recorded in the node when a hypothesis is expanded. In the present embodiment, the node structure described above is therefore extended as shown in FIG. 6.
FIG. 6 shows the data structure of the extended node data. As shown in the figure, "layer hidden_layer", which holds the hidden layer output, is added to the data structure shown in FIG. 11.
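The extended node can be mirrored in a short sketch. The field names follow the quoted members ("word", "score", "backptr", "hidden_layer"); representing the hidden layer as a list of floats and the `backtrace` helper are assumptions added for illustration, since the figures themselves are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    word: int                          # index identifying the word hypothesis
    score: float                       # acoustic / language / error-correction score
    backptr: Optional["Node"] = None   # node immediately preceding this one
    hidden_layer: List[float] = field(default_factory=list)  # added field: hidden-layer output


def backtrace(node):
    """Follow backptr links to the start and return word indices in spoken order."""
    words = []
    while node is not None:
        words.append(node.word)
        node = node.backptr
    return words[::-1]


first = Node(word=3, score=-1.0, hidden_layer=[0.2, 0.8])
second = Node(word=7, score=-2.5, backptr=first, hidden_layer=[0.6, 0.1])
path = backtrace(second)
```

Storing the hidden output on the node means a later hypothesis can resume the recursive network from any node without replaying the whole word history.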

Meanwhile, for one node to hold a single hidden layer output, one of the multiple nodes connected to the node of interest must be selected. For example, nodes n_1 through n_3 are connected to node n_4 in FIG. 5, so the hidden layer output recorded in node n_4 must be computed with one of the hidden layer outputs h_n1, h_n2, and h_n3 as input. In the present embodiment, among the nodes n_1 through n_3 referenced by node n_4, the hidden layer output of the node on the path with the highest score is used as the hidden layer output h_{t-1} that is input for node n_4 in the recursive neural network.
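Choosing which predecessor supplies h_{t-1} can be sketched as below. It assumes each predecessor is summarized as a (path score, hidden layer output) pair and that scores are log scores, so a larger value is better; the function name and values are illustrative.

```python
def best_predecessor(predecessors):
    """Return the (path_score, hidden_output) pair with the highest path score."""
    return max(predecessors, key=lambda p: p[0])


# Candidate predecessors n_1, n_2, n_3 of node n_4 (toy log scores).
n1 = (-12.0, [0.1, 0.9])
n2 = (-9.5, [0.4, 0.6])
n3 = (-11.0, [0.7, 0.3])

best_score, h_prev = best_predecessor([n1, n2, n3])
```

Here n_2 has the highest score, so its hidden output becomes the h_{t-1} used when the network is evaluated for n_4. Keeping only the best predecessor is the same pruning decision Viterbi search already makes for the score.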

The input speech recognition unit 44 recognizes the input speech data D6 using the acoustic model stored in the acoustic model storage unit 41, the language model indicated by the language model data D2 stored in the language model storage unit 42, and the error correction model indicated by the error correction model data D5 stored in the error correction model storage unit 43. Based on the error correction model scores set in the node data having the data structure shown in FIG. 6, the input speech recognition unit 44 selects the sentence hypothesis with the best score and outputs input speech recognition result data D7, in which the selected sentence hypothesis is set as the speech recognition result. This processing is the same as that of the speech recognition unit 33 of the error correction model learning processing unit 30, except that the speech recognition unit 33 attaches the hidden layer output value of the node data shown in FIG. 6 to each word of the speech recognition result data D4 that it outputs.

[4.3.1 Step S7: Speech recognition processing of input speech]
When input speech data D6 is input as the speech data to be recognized, the input speech recognition unit 44 obtains candidate word strings for the input speech data D6 using the language model stored in the language model storage unit 42 and the acoustic model stored in the acoustic model storage unit 41. For the node corresponding to each word (word hypothesis) constituting a candidate word string obtained by speech recognition, the input speech recognition unit 44 generates node data having the data structure shown in FIG. 6 and sets the word index of the node and the pointer to the preceding node.

For each word constituting a candidate word string, the input speech recognition unit 44 calculates an acoustic score using the acoustic model, and calculates a language score by equations (8), (6), and (7) using the language model (weight coefficient matrices M^h and M^o). When calculating the language score, it uses the hidden layer output set in the node data of the immediately preceding node as the hidden layer output h_{t-1} in equation (8). When there are multiple preceding nodes, it uses the hidden layer output of the node on the path with the best score. The input speech recognition unit 44 calculates the topic feature vector v in equation (8) by equation (10), using the topic model (non-negative matrix W') acquired from the topic model data D1 and a word vector d representing the word frequencies obtained from the speech recognition results of the input speech data preceding the current input speech data D6.

In accordance with the error correction model read from the error correction model storage unit 43, the input speech recognition unit 44 calculates the error correction model score for each word constituting a candidate word string, using the acoustic score, the language score, and the hidden layer output calculated by equation (6) during the language score calculation. It sets the acoustic score, the language score, the error correction model score, and the hidden layer output in the node data. It then selects, as the correct word string, the candidate word string on the path with the best error correction model score, sets it in the input speech recognition result data D7, and outputs it in real time. By using the error correction model, the input speech recognition unit 44 corrects errors in the selection of the speech recognition result obtained from the input speech data D6.

[5. Effects]
According to the error correction model learning device 10 of the present embodiment described above, it becomes possible to construct an error correction model that takes into account longer contexts and topics than before. By performing speech recognition with this error correction model, the input speech recognition unit 44 reduces recognition errors. In addition, because the error correction model learning device 10 of the present embodiment uses text data, which is easy to obtain in large quantities, for part of the learning of the model parameters of the error correction model, the resulting model is statistically robust and recognition errors are reduced.

[6. Others]
The error correction model learning device 10 described above contains a computer system. The steps of the operation of the error correction model learning device 10 are stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

The "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.
The "computer-readable recording medium" refers to portable media such as flexible disks, magneto-optical disks, ROMs, and CD-ROMs, and to storage devices such as hard disks built into a computer system. It also includes media that hold a program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and media that hold a program for a certain period of time, such as the volatile memory inside a computer system serving as the server or client in that case. The program may implement only part of the functions described above, or may implement those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
10 Error correction model learning device
20 Language model learning processing unit
21 Language resource storage unit
22 Topic model learning unit
23 Language model learning unit
30 Error correction model learning processing unit
31 Speech resource storage unit
32 Alignment unit
33 Speech recognition unit
34 Feature definition unit
35 Error correction model learning unit
40 Speech recognition processing unit
41 Acoustic model storage unit
42 Language model storage unit
43 Error correction model storage unit
44 Input speech recognition unit

Claims

An error correction model learning device comprising:
a language resource storage unit that stores text data of documents;
a language model learning unit that learns a language model which calculates a connection probability of a word following a given word, using as inputs to a recursive neural network the given word in a sentence in the text data stored in the language resource storage unit, a topic feature amount extracted from sentences preceding that sentence in the text data, and the output of the hidden layer of the recursive neural network calculated for the word preceding the given word;
a speech resource storage unit that stores speech data and correct word strings in association with each other;
an alignment unit that aligns the correct word string with the speech data stored in the speech resource storage unit, and calculates the output of the hidden layer of the recursive neural network obtained when each word constituting the aligned correct word string is input to the language model learned by the language model learning unit;
a speech recognition unit that performs speech recognition on the speech data stored in the speech resource storage unit, and calculates the output of the hidden layer of the recursive neural network obtained when each word constituting the speech recognition result is input to the language model learned by the language model learning unit;
a feature definition unit that extracts linguistic features from the words included in the aligned correct word string and the words included in the speech recognition result; and
an error correction model learning unit that learns an error correction model, which corrects a speech recognition score using linguistic features weighted by the output of the hidden layer and by model parameters, based on the linguistic features of each word constituting the aligned correct word string, weighted by the output of the hidden layer calculated for that word, and on the linguistic features of each word constituting the speech recognition result, weighted by the output of the hidden layer calculated for that word.
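
As a non-limiting illustration of the language model recited above, the following minimal Python sketch performs one time step of a network that takes the current word, a topic feature vector, and the hidden-layer output for the preceding word, and returns the new hidden-layer output together with the connection probabilities of the following word. The simple Elman-style update, the sigmoid and softmax functions, the weight matrices `U`, `F`, `W`, `V`, and all numeric values are assumptions made for this example and are not taken from the specification.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    m = max(zs)                      # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_lm_step(word_id, topic, h_prev, U, F, W, V):
    """One time step; returns (hidden-layer output, next-word probabilities).

    Hypothetical shapes: U is hidden x vocabulary, F is hidden x topic,
    W is hidden x hidden, V is vocabulary x hidden.
    """
    from_word = [row[word_id] for row in U]   # U times a one-hot word vector
    from_topic = matvec(F, topic)             # contribution of the topic feature
    from_prev = matvec(W, h_prev)             # contribution of the previous hidden output
    h = [sigmoid(a + b + c) for a, b, c in zip(from_word, from_topic, from_prev)]
    probs = softmax(matvec(V, h))             # connection probabilities of the next word
    return h, probs

# Toy sizes: vocabulary 3, topic 2, hidden 2 (made-up weights).
U = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
F = [[0.5, -0.5], [0.2, 0.1]]
W = [[0.3, 0.0], [-0.1, 0.2]]
V = [[0.2, -0.3], [0.0, 0.5], [-0.4, 0.1]]

h = [0.0, 0.0]           # hidden output before the first word
topic = [0.7, 0.3]       # topic feature from preceding sentences (made up)
for word_id in [0, 2, 1]:
    h, probs = rnn_lm_step(word_id, topic, h, U, F, W, V)
```

In this sketch the hidden-layer output `h` obtained for each word is the quantity that the alignment unit and the speech recognition unit would compute for the words of the correct word string and of the recognition result, respectively.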

The error correction model learning device according to claim 1, wherein the error correction model learning unit statistically calculates the model parameters so as to maximize an evaluation function defined by the difference between the posterior probability of the correct word string and the posterior probability of the speech recognition result, given the speech data.
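
As a non-limiting illustration of an evaluation function of this kind, the sketch below computes the difference between the posterior probability of the correct word string and that of one recognition hypothesis. It assumes that each candidate has a log-domain score and that posteriors are obtained by normalizing the exponentiated scores over a finite candidate list that contains the correct word string; this normalization and the function names `posteriors` and `evaluation` are assumptions for the example, not the definition given in the specification.

```python
import math

def posteriors(scores):
    """Normalize log-domain scores into posterior probabilities."""
    m = max(scores)                          # subtract the maximum for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def evaluation(scores, ref_index, hyp_index):
    """Posterior of the correct word string minus posterior of the hypothesis."""
    p = posteriors(scores)
    return p[ref_index] - p[hyp_index]

# Made-up scores: index 0 is the correct word string, index 1 a recognition result.
value = evaluation([-9.0, -10.0, -12.0], ref_index=0, hyp_index=1)
```

Raising the score of the correct word string relative to the other candidates increases this value, which is the direction in which the model parameters would be adjusted under the stated assumptions.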

The error correction model learning device according to claim 1 or 2, wherein the topic feature amount input to the recursive neural network together with a word is extracted, by a statistical dimension compression method, from the appearance frequency of each word included in the utterances or sentences preceding the utterance or sentence that contains the word.
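
As a non-limiting illustration of extracting a topic feature amount from word appearance frequencies, the sketch below uses a projection onto the leading principal directions of a word count matrix, found by power iteration. The claim does not name a specific dimension compression method, so this choice, the function names `topic_directions` and `topic_features`, and the toy counts are all assumptions for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topic_directions(count_rows, k, iterations=200):
    """Return k orthonormal directions in vocabulary space.

    count_rows holds one word-count vector per training document. The
    directions are the leading eigenvectors of the Gram matrix C^T C,
    found by power iteration with deflation.
    """
    n = len(count_rows[0])
    gram = [[sum(r[i] * r[j] for r in count_rows) for j in range(n)]
            for i in range(n)]
    directions = []
    for d in range(k):
        v = [1.0 + 0.1 * (i + d) for i in range(n)]   # deterministic start vector
        for _ in range(iterations):
            w = [dot(row, v) for row in gram]
            for u in directions:                      # remove directions already found
                c = dot(w, u)
                w = [a - c * b for a, b in zip(w, u)]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                raise ValueError("fewer than k independent directions in the counts")
            v = [a / norm for a in w]
        directions.append(v)
    return directions

def topic_features(word_counts, directions):
    """Compress a word-count vector into a k-dimensional topic feature."""
    return [dot(u, word_counts) for u in directions]

# Toy training counts: 4 documents over a 4-word vocabulary (made up).
counts = [[4, 3, 0, 0], [5, 2, 0, 1], [0, 0, 3, 4], [0, 1, 4, 5]]
directions = topic_directions(counts, k=2)

# Word counts accumulated over the preceding utterances (made up).
previous_counts = [2, 1, 0, 0]
features = topic_features(previous_counts, directions)
```

The resulting low-dimensional vector `features` plays the role of the topic feature amount that is input to the network together with each word of the current utterance.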

The error correction model learning device according to any one of claims 1 to 3, wherein the text data stored in the language resource storage unit is text data of news manuscripts or text data on the web.

The error correction model learning device according to any one of claims 1 to 4, wherein the linguistic feature is a word or a part of speech of a word, and the error correction model is a calculation formula that corrects the speech recognition score by a score obtained by weighting the value of a feature function based on the linguistic feature with the output of the hidden layer of the recursive neural network and with the model parameter of the feature function.
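
As a non-limiting illustration of such a calculation formula, the sketch below corrects a recognition score by adding, for every word, the values of binary feature functions (word identity and part of speech) weighted by both the hidden-layer output for that word and a per-feature model parameter. The specific form used here, in which each feature has a parameter vector whose inner product with the hidden-layer output gives its weight, is an assumption for the example, as are the names `active_features` and `corrected_score` and all numeric values.

```python
def active_features(word, pos):
    """Names of the binary feature functions that take the value 1 for this word."""
    return ["w=" + word, "pos=" + pos]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def corrected_score(base_score, words, hidden_outputs, params):
    """Add the weighted feature score to the recognizer's score.

    words:          list of (word, part of speech) pairs
    hidden_outputs: one hidden-layer output vector per word
    params:         hypothetical model parameters, feature name -> weight vector
    """
    correction = 0.0
    for (word, pos), h in zip(words, hidden_outputs):
        for name in active_features(word, pos):
            if name in params:
                # Feature value is 1; its weight combines the parameter and h.
                correction += dot(params[name], h)
    return base_score + correction

# Made-up parameters, hidden output, and recognizer score.
params = {"w=rain": [0.5, 1.0], "pos=noun": [1.0, 0.0]}
score = corrected_score(-10.0, [("rain", "noun")], [[1.0, 0.5]], params)
```

With these toy values the correction is 1.0 for the word feature plus 1.0 for the part-of-speech feature, so the score changes from -10.0 to -8.0.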

The error correction model learning device according to any one of claims 1 to 5, further comprising an input speech recognition unit that performs speech recognition on input speech data using an acoustic model and the language model learned by the language model learning unit, corrects errors in the selection of the speech recognition result obtained from the input speech data by using the error correction model learned by the error correction model learning unit, and outputs the corrected result.

A program for causing a computer to function as an error correction model learning device comprising:
language model learning means for learning a language model which calculates a connection probability of a word following a given word, using as inputs to a recursive neural network the given word in a sentence in text data stored in language resource storage means, a topic feature amount extracted from sentences preceding that sentence in the text data, and the output of the hidden layer of the recursive neural network calculated for the word preceding the given word;
alignment means for aligning a correct word string, stored in speech resource storage means in association with speech data, with the speech data, and for calculating the output of the hidden layer of the recursive neural network obtained when each word constituting the aligned correct word string is input to the language model learned by the language model learning means;
speech recognition means for performing speech recognition on the speech data stored in the speech resource storage means, and for calculating the output of the hidden layer of the recursive neural network obtained when each word constituting the speech recognition result is input to the language model learned by the language model learning means;
feature extraction means for extracting linguistic features from the words included in the aligned correct word string and the words included in the speech recognition result; and
error correction model learning means for learning an error correction model, which corrects a speech recognition score using linguistic features weighted by the output of the hidden layer and by model parameters, based on the linguistic features of each word constituting the aligned correct word string, weighted by the output of the hidden layer calculated for that word, and on the linguistic features of each word constituting the speech recognition result, weighted by the output of the hidden layer calculated for that word.