JP2017090660A

JP2017090660A - Acoustic model learning device, voice recognition device, acoustic model learning method, voice recognition method, and program

Info

Publication number: JP2017090660A
Application number: JP2015220304A
Authority: JP
Inventors: 祐太河内; Yuta Kawachi; 浩和政瀧; Hirokazu Masataki
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-11-10
Filing date: 2015-11-10
Publication date: 2017-05-25
Anticipated expiration: 2035-11-10
Also published as: JP6546070B2

Abstract

PROBLEM TO BE SOLVED: To learn an acoustic model capable of accurately recognizing non-native pronunciation. SOLUTION: A learning data storage unit 14 stores learning data in which a learning input feature, formed by combining a non-native feature representing the non-nativeness of a speaker extracted from learning speech data with an acoustic feature extracted from the same learning speech data, is associated with transcription data representing the utterance content of the learning speech data. An acoustic model learning unit 15 learns an acoustic model using the learning data and stores it in an acoustic model storage unit 16. SELECTED DRAWING: Figure 5

Description

The present invention relates to speech recognition technology, and more particularly to a technique for learning an acoustic model used to recognize non-native utterances.

Compared with native utterances, non-native utterances exhibit acoustic properties not found in native speech, such as misreadings and vowel insertions, which depend on the speaker's language experience, mother tongue, and so on (see, for example, Non-Patent Document 1). These properties specific to non-native utterances degrade the discrimination performance of the discriminator (the acoustic model) that computes the acoustic scores used to identify the phonemes of the input speech. Improving accuracy has therefore been harder for non-native speech recognition than for native speech recognition.

One technique for improving the recognition accuracy of non-native speech is GMM-HMM speech recognition for non-native speakers (see, for example, Non-Patent Document 2). In this approach, labels in which native teachers have manually rated the correctness of pronunciation are attached to a non-native speech data set, the learning data is divided according to these pronunciation ratings, and a separate acoustic model is learned for each pronunciation level. Each model can thus specialize in the pronunciation differences that arise from language experience, which improves speech recognition accuracy.

There is also DNN-HMM speech recognition for non-native speakers, which recognizes non-native utterances using a multilayer neural network acoustic model, a model type that achieves high recognition rates across speech recognition in general (see, for example, Non-Patent Document 3).

Non-Patent Document 1: Tatsuya Kawahara, Nobuaki Minematsu, “Foreign language learning support using speech information processing technology”, IEICE Transactions D, vol. J96-D, no. 7, pp. 1549-1565, 2013

Non-Patent Document 2: Takuya Anzai, 咸聖俊, Akinori Ito, “Study on Acoustic Model Considering Pronunciation Level of Japanese Learners of English”, Proceedings of the Acoustical Society of Japan, 2011

Non-Patent Document 3: Hiroshi Kibishi, Seiichi Nakagawa, “Recognition of Japanese English speech by DNN-HMM”, Proceedings of the Acoustical Society of Japan, 2013

Non-native speech recognition that relies on pronunciation ratings assigned manually by native teachers requires that a rating be set by hand for the speech data used in acoustic model learning. This approach has two problems. First, the ratings are subjective, so they are not necessarily reliable, and there is no guarantee that every utterance is rated by the same standard. Second, employing native teachers is costly. Moreover, unlike GMM-HMM speech recognition, DNN-HMM speech recognition, which uses a multilayer neural network as the acoustic model, has no effective adaptation method comparable to MLLR (Maximum Likelihood Linear Regression), so the acoustic model must be learned again. If the learning data is divided according to pronunciation rating in that case, the drop in recognition rate caused by the smaller amount of learning data per model cannot be avoided. Consequently, the approach of using pronunciation ratings in the same way as in GMM-HMM speech recognition has not been able to improve the recognition rate in DNN-HMM speech recognition.

In view of the above, an object of the present invention is to provide a technique for learning an acoustic model that can recognize non-native utterances with high accuracy and that is applicable even to DNN-HMM speech recognition.

To solve the above problem, an acoustic model learning device according to a first aspect of the present invention includes: a learning data storage unit that stores learning data in which a learning input feature, obtained by combining a non-native feature representing the non-nativeness of a speaker extracted from learning speech data with an acoustic feature extracted from the learning speech data, is associated with transcription data representing the utterance content of the learning speech data; and an acoustic model learning unit that learns an acoustic model using the learning data.

A speech recognition device according to a second aspect of the present invention includes: an acoustic model storage unit that stores an acoustic model generated by the acoustic model learning device; a non-nativeness extraction unit that extracts, from input speech data, a non-native feature representing the non-nativeness of the speaker; an acoustic feature extraction unit that extracts an acoustic feature from the input speech data; and a speech recognition unit that inputs a recognition input feature, obtained by combining the non-native feature with the acoustic feature, into the acoustic model to obtain a speech recognition result for the input speech data.

The acoustic model learning technique of the present invention extracts a highly objective non-native feature that expresses non-nativeness, without relying on the manual work of native teachers with linguistic expertise, and learns an acoustic model from learning data in which that feature is combined with an acoustic feature. As a result, non-native utterances can be recognized with high accuracy even in DNN-HMM speech recognition, where pronunciation ratings could not previously be used to improve the recognition rate.

FIG. 1 is a diagram illustrating the functional configuration of the learning data creation device.
FIG. 2 is a diagram illustrating the processing procedure of the learning data creation method.
FIG. 3 is a diagram showing a specific example of the learning utterance data.
FIG. 4 is a diagram showing a specific example of the learning data.
FIG. 5 is a diagram illustrating the functional configuration of the acoustic model learning device.
FIG. 6 is a diagram illustrating the processing procedure of the acoustic model learning method.
FIG. 7 is a diagram illustrating the functional configuration of the speech recognition device.
FIG. 8 is a diagram illustrating the processing procedure of the speech recognition method.

Embodiments of the present invention are described in detail below. In the drawings, components having the same function are given the same reference numerals, and redundant description is omitted.

The embodiment of the present invention is a speech recognition system composed of the following three devices. The first is a learning data creation device that generates the learning data used for acoustic model learning by appending a non-native feature extracted from learning speech data to an acoustic feature. The second is an acoustic model learning device that learns an acoustic model using that learning data. The third is a speech recognition device that appends a non-native feature extracted from the input speech data to be recognized to an acoustic feature and performs speech recognition using the learned acoustic model.

These devices need not be implemented as three separate units; the configuration can be changed freely by changing which device hosts each component. For example, the acoustic model learning device may include the components of the learning data creation device, yielding a single acoustic model learning device that performs everything from learning data creation to acoustic model learning. Likewise, the speech recognition device may include the components of both the learning data creation device and the acoustic model learning device, yielding a single speech recognition device that performs everything from learning data creation to speech recognition.

Each of the learning data creation device, the acoustic model learning device, and the speech recognition device of the embodiment is, for example, a special device configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (RAM: Random Access Memory), and the like. Each device executes its processing under the control of the central processing unit, for example. Data input to each device and data obtained in each process are stored, for example, in the main storage device, and are read out as needed for use in other processing. At least some of the processing units of each device may be implemented in hardware such as an integrated circuit. Each storage unit of each device can be implemented, for example, by a main storage device such as RAM, by an auxiliary storage device composed of a hard disk, an optical disk, or a semiconductor memory element such as flash memory, or by middleware such as a relational database or a key-value store.

As shown in FIG. 1, the learning data creation device of the embodiment includes a learning speech storage unit 10, a non-nativeness extraction unit 11, an acoustic feature extraction unit 12, a learning data generation unit 13, and a learning data storage unit 14. The learning speech storage unit 10 and the learning data storage unit 14 need not be part of the learning data creation device itself; the device may instead be configured to read from and write to a learning speech storage unit 10 and a learning data storage unit 14 provided in another, external device via communication means such as a network. The learning data creation method of the embodiment is realized by this learning data creation device performing the processing of each step shown in FIG. 2.

The learning speech storage unit 10 stores the learning utterance data used for acoustic model learning. As shown in FIG. 3, each item of learning utterance data associates an “identification number” that uniquely identifies the item, “speech data” giving the path to an audio file in which a non-native speaker's utterance is recorded, and “transcription data” in which the utterance content of the speech data is transcribed.

In step S11, the non-nativeness extraction unit 11 extracts, from the speech data of the learning utterance data, a non-native feature that expresses the non-nativeness of the speaker. The extracted non-native feature is paired with the identification number of the learning utterance data and input to the learning data generation unit 13.

A non-native feature is a quantity, expressed as a continuous or discrete value or vector, that directly or indirectly reflects information specific to a non-native speaker, such as the speaker's language experience, correctness of pronunciation, mother tongue, or region of origin. The non-nativeness extraction unit may be, for example, a discriminator, neural network, or other machine learning device trained in advance to distinguish or evaluate native and non-native speech. In that case, the intermediate processing result or the output obtained when an utterance is input to a machine learning device, such as a multilayer neural network or an SVM (Support Vector Machine) that performs discrimination, regression, or autoencoding, serves as the non-native feature. For a multilayer neural network, for example, the intermediate processing result may be the output values of an intermediate layer other than the final output layer. Training of the discriminator may use speech data of native and non-native utterances, information about non-native speakers, and information such as the words and phonemes of the utterances. The learning algorithm may be supervised or unsupervised.
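The idea of taking an intermediate-layer output as the non-native feature can be sketched as follows. This is only an illustration under assumptions: the layer sizes, the weights in `HIDDEN_W` and `HIDDEN_B`, and the function names are hypothetical and do not come from the patent, which does not specify a network architecture.

```python
import math

def dense(x, weights, bias):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def extract_non_native_feature(x, hidden_w, hidden_b):
    """Return the hidden-layer activations instead of the final classification."""
    return sigmoid(dense(x, hidden_w, hidden_b))

# Hypothetical, hand-picked weights for a 3-input, 2-unit hidden layer.
# In practice these would come from a classifier trained in advance.
HIDDEN_W = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.0]]
HIDDEN_B = [0.0, 0.1]

feature = extract_non_native_feature([1.0, 2.0, 3.0], HIDDEN_W, HIDDEN_B)
```

The output layer of the classifier is simply never evaluated here; the hidden activations are what gets passed on as the feature vector.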

As a specific example, a trained language discrimination model may be used as the non-nativeness extraction unit, with the score of the language discrimination result output as the non-native feature. The score may be, for example, a set of values indicating how likely the speech is to be each language. Another example is a rating from 0 to 1 that approaches 0 the closer the speech is to a first language and approaches 1 the closer it is to a second language. Alternatively, the non-nativeness extraction unit may hold an acoustic model for native speakers (that is, an acoustic model learned from native utterances) and use the score obtained by evaluating the input speech data with that model as the non-native feature. As a further example, the non-nativeness extraction unit may hold a model for native speech recognition (that is, speech recognition targeting native utterances) and use the recognition confidence obtained when the input speech data is recognized with that model as the non-native feature.
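One way to obtain a 0-to-1 rating of the kind described above is to compare the log-likelihoods that two language models assign to the same utterance. The sketch below assumes such log-likelihoods are already available; the function name and the use of a logistic function are illustrative choices, not something the patent prescribes.

```python
import math

def language_score(loglik_first, loglik_second):
    """Score near 0 for the first language, near 1 for the second.

    Computed as a logistic function of the log-likelihood difference,
    arranged so that math.exp never receives a large positive argument.
    """
    diff = loglik_second - loglik_first
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)
```

Equal log-likelihoods give a score of 0.5, and the score saturates toward 0 or 1 as one language becomes much more likely than the other.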

In step S12, the acoustic feature extraction unit 12 extracts an acoustic feature from the speech data of the learning utterance data. The acoustic feature may be any feature used in acoustic model learning for speech recognition, or a transformation of one: for example, mel-frequency cepstral coefficients, a transformed version of them (such as a normalized version), or a concatenation of several temporally adjacent features. The extracted acoustic feature is paired with the identification number of the learning utterance data and input to the learning data generation unit 13.
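The two transformations mentioned here, normalization and concatenation of temporally adjacent frames, can be sketched in plain Python. The sketch assumes the per-frame coefficients (for example MFCCs) have already been computed elsewhere; the function names and the edge-padding choice are assumptions made for illustration.

```python
def normalize(frames):
    """Per-dimension mean and variance normalization over one utterance."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(var ** 0.5 or 1.0)  # constant dimensions keep a divisor of 1
    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]

def splice(frames, context):
    """Concatenate each frame with `context` neighbours on both sides.

    Frames at the edges are padded by repeating the first or last frame.
    """
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-context, context + 1):
            window.extend(frames[min(max(t + offset, 0), last)])
        spliced.append(window)
    return spliced

frames = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
normalized = normalize(frames)
spliced = splice(frames, 1)
```

With a context of 1, each spliced frame is three times as long as an input frame.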

In step S13, the learning data generation unit 13 combines the non-native feature output by the non-nativeness extraction unit 11 with the acoustic feature output by the acoustic feature extraction unit 12, matching them by the identification number paired with each feature, to generate a learning input feature. Combining means appending one feature after the other, and the order of the two features is fixed in advance. For example, when the acoustic feature “xxx” and the non-native feature “yyy” are extracted, the learning input feature is “xxxyyy”, the two joined in that order as they are. The learning data generation unit 13 then generates learning data by associating, as shown in FIG. 4, an “identification number” that uniquely identifies each item, the generated “learning input feature”, and the “transcription data” of the learning utterance data. The generated learning data is stored in the learning data storage unit 14.
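The matching-and-concatenation step can be sketched as a join on the identification number. The record layout and the function name `build_learning_data` are hypothetical, and the sketch treats each feature as a single flat vector per utterance, which the patent does not require.

```python
def build_learning_data(acoustic, non_native, transcripts):
    """Join per-utterance features by identification number.

    Each argument is a dict keyed by identification number. The acoustic
    feature always comes first, so the same order can be reproduced later.
    """
    records = []
    for utt_id in sorted(acoustic):
        if utt_id not in non_native or utt_id not in transcripts:
            continue  # skip utterances with a missing part
        records.append({
            "id": utt_id,
            "input_feature": list(acoustic[utt_id]) + list(non_native[utt_id]),
            "transcript": transcripts[utt_id],
        })
    return records

# Made-up example values, for illustration only.
acoustic = {1: [0.1, 0.2], 2: [0.3, 0.4], 3: [0.5, 0.6]}
non_native = {1: [0.9], 2: [0.2]}
transcripts = {1: "hello", 2: "good morning", 3: "thank you"}

records = build_learning_data(acoustic, non_native, transcripts)
```

Utterance 3 has no non-native feature in this example, so it is left out of the resulting learning data.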

The embodiment above combines the two features to form the learning data for acoustic model learning, but the way the learning data is derived from the two features is not limited to this, provided that the acoustic feature is included in the learning data. For example, a value obtained by inputting the two features into a predetermined function may be appended after (or before) the acoustic feature. The predetermined function may, for example, perform normalization or concatenate several temporally adjacent features, or it may feed the features into a separate machine learning device trained in advance and use that device's intermediate processing result or output as the function's output. Alternatively, the acoustic feature and the non-native feature may first be combined and then subjected to processing such as normalization or multi-frame concatenation to form the learning data.

The embodiment above assigns identification numbers to associate the features, speech data, and transcription data. As a variation, the same speech data may be input to both the non-nativeness extraction unit and the acoustic feature extraction unit, and the transcription data corresponding to that speech data may be associated with the resulting features, so that learning data is generated without using identification numbers.

In the embodiment above, the transcription data is acquired directly as the equivalent of the teacher data used in acoustic model learning, but it may be converted in advance into a different symbol format, such as “symbols corresponding to phonemes”. For example, it may be converted into symbols that represent readings or sounds, such as hiragana, katakana, phonemes, monophones, triphones, clustered triphones, or state numbers, or into numbers corresponding to them. The conversion may be done by a person or by using a separate speech recognition decoder, acoustic model, or the like. For example, the forced alignment processing conventionally used in DNN speech recognition may be used.
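The simplest form of such a conversion is a table lookup from phoneme symbols to numbers. The sketch below assumes the transcription has already been turned into a phoneme sequence; the inventory `PHONEME_IDS` is a made-up example, and a real system would more likely derive frame-level state numbers through forced alignment.

```python
# Hypothetical phoneme inventory; a real system would use a full set.
PHONEME_IDS = {"sil": 0, "a": 1, "i": 2, "k": 3, "s": 4}

def to_symbol_ids(phonemes):
    """Map a phoneme sequence to the numbers used as training targets."""
    unknown = [p for p in phonemes if p not in PHONEME_IDS]
    if unknown:
        raise KeyError(f"unknown phonemes: {unknown}")
    return [PHONEME_IDS[p] for p in phonemes]

targets = to_symbol_ids(["sil", "k", "a", "s", "a", "sil"])
```

Rejecting unknown symbols up front keeps a transcription error from silently producing wrong training targets.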

As shown in FIG. 5, the acoustic model learning device of the embodiment includes a learning data storage unit 14, an acoustic model learning unit 15, and an acoustic model storage unit 16. The learning data storage unit 14 and the acoustic model storage unit 16 need not be part of the acoustic model learning device itself; the device may instead be configured to read from and write to a learning data storage unit 14 and an acoustic model storage unit 16 provided in another device via communication means such as a network. The acoustic model learning method of the embodiment is realized by this acoustic model learning device performing the processing of each step shown in FIG. 6.

The learning data storage unit 14 stores the learning data generated by the learning data creation device. As described above, each item of learning data associates an identification number that uniquely identifies the item, a learning input feature obtained by combining the non-native feature and the acoustic feature extracted from the speech data of the learning utterance data, and transcription data in which the utterance content of the speech data is transcribed.

In step S15, the acoustic model learning unit 15 acquires the learning input features and the corresponding transcription data from the learning data stored in the learning data storage unit 14, and uses that learning data to learn the acoustic model parameters used for speech recognition. One example of a model expressed by acoustic model parameters is a model that converts a waveform into symbols corresponding to phonemes. As the “symbols corresponding to phonemes”, it is possible to use, for example, triphones clustered using a separate acoustic model created in advance, or state numbers representing them.

As shown in FIG. 7, the speech recognition device of the embodiment includes an acoustic model storage unit 16, a language model storage unit 20, a non-nativeness extraction unit 11, an acoustic feature extraction unit 12, a feature combining unit 21, and a speech recognition unit 22. The speech recognition method of the embodiment is realized by this speech recognition device performing the processing of each step shown in FIG. 8.

The acoustic model storage unit 16 stores an acoustic model having the acoustic model parameters generated by the acoustic model learning device. The language model storage unit 20 stores the language model used for speech recognition.

In step S11, the non-nativeness extraction unit 11 extracts, from the input speech data, a non-native feature that expresses the non-nativeness of the speaker. The input speech data is the speech data to be recognized, in which an utterance by a native or non-native speaker is recorded. The non-native feature extracted here is of the same kind as the one extracted by the learning data creation device. The extracted non-native feature is input to the feature combining unit 21.

In step S12, the acoustic feature extraction unit 12 extracts an acoustic feature from the input speech data. The acoustic feature extracted here is of the same kind as the one extracted by the learning data creation device. The extracted acoustic feature is input to the feature combining unit 21.

In step S21, the feature combining unit 21 combines the non-native feature output by the non-nativeness extraction unit 11 with the acoustic feature output by the acoustic feature extraction unit 12, in the same order in which the learning data creation device combined them, to generate a recognition input feature. The generated recognition input feature is input to the speech recognition unit 22.
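Because the acoustic model was trained on inputs with a fixed layout, the recognition side has to reproduce that layout exactly. The sketch below makes two assumptions that the patent leaves open: acoustic features are per-frame vectors, and the non-native feature is one vector per utterance that is appended to every frame. The function names are hypothetical.

```python
def combine(acoustic_frames, non_native_feature):
    """Append the same utterance-level non-native feature to every frame.

    The acoustic feature comes first; training must use the same order.
    """
    return [list(frame) + list(non_native_feature) for frame in acoustic_frames]

def check_input_dim(features, expected_dim):
    """Raise if any frame does not match the input size the model was trained with."""
    for t, frame in enumerate(features):
        if len(frame) != expected_dim:
            raise ValueError(f"frame {t}: got {len(frame)} dims, expected {expected_dim}")
    return features

recognition_input = check_input_dim(
    combine([[0.1, 0.2], [0.3, 0.4]], [0.7]),
    expected_dim=3,
)
```

Using one shared `combine` function for both learning data creation and recognition is a simple way to keep the two sides from drifting apart.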

In step S22, the speech recognition unit 22 uses the acoustic model stored in the acoustic model storage unit 16 to output time-series data of “symbols corresponding to phonemes” from the recognition input feature it receives. If the speech recognition unit has a language model that outputs a speech recognition result (for example, text) from time-series data of “symbols corresponding to phonemes”, the output of the acoustic model is input to the language model and the speech recognition result is output.
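As a rough illustration of what a time series of symbols looks like, the sketch below picks the highest-scoring symbol for each frame from per-frame scores. The scores and symbol set are made up, and this greedy per-frame choice is a simplification: practical decoders typically search over the acoustic and language model scores jointly rather than deciding frame by frame.

```python
def frame_symbols(scores, symbols):
    """Pick the highest-scoring symbol for each frame."""
    return [symbols[max(range(len(row)), key=row.__getitem__)] for row in scores]

# Made-up per-frame scores over a hypothetical three-symbol inventory.
SYMBOLS = ["a", "k", "s"]
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.2, 0.3, 0.5],
]

sequence = frame_symbols(scores, SYMBOLS)
```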

As the non-native feature, the correct labels used to train the non-nativeness extraction unit may be used directly in place of the extraction unit's output. The correct label may be information about the non-native speaker, such as language experience, correctness of pronunciation, mother tongue, or region of origin. At recognition time, a non-native feature estimated from the input speech data is used.
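If the label is categorical, such as the speaker's mother tongue, one plausible encoding is a one-hot vector at training time, with an estimated score vector of the same length substituted at recognition time. The label set and function name below are hypothetical, and the patent does not specify any particular encoding.

```python
# Hypothetical label set for the speaker's mother tongue.
NATIVE_LANGUAGES = ["ja", "ko", "zh"]

def label_feature(native_language):
    """One-hot vector built from the known label, for use during training."""
    if native_language not in NATIVE_LANGUAGES:
        raise ValueError(f"unknown label: {native_language}")
    return [1.0 if lang == native_language else 0.0 for lang in NATIVE_LANGUAGES]

training_feature = label_feature("ko")
```

At recognition time the label is unknown, so a classifier's per-language scores, which have the same length as this vector, would take its place.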

The embodiment above describes the learning data creation device and the speech recognition device as each including a non-nativeness extraction unit. Alternatively, a non-native feature extraction device may exist as an external device separate from the learning data creation device and the speech recognition device; it may take the identification number and speech data from the learning utterance data and supply the identification number and the non-native feature to the learning data creation device and the speech recognition device.

In the above-described embodiment, only utterance data from non-native speakers is used as the learning utterance data. However, utterance data from native speakers may also be included in the learning utterance data. Specifically, the native utterance data is input to the non-nativeness extraction unit, a non-native feature quantity is calculated for the native utterance, and that quantity is combined with the acoustic feature quantity to generate learning data. The acoustic model is then learned using that learning data.
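
The generation of learning data from a mix of native and non-native utterances can be sketched as follows. The sketch assumes that every utterance, native or not, passes through the same two extractors and that the non-native feature quantity is appended to each acoustic frame. The class name, function names, and toy extractors are hypothetical placeholders, not the units of this specification.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence, Tuple

Audio = Sequence[float]


@dataclass
class LearningExample:
    input_features: List[List[float]]  # one combined vector per frame
    transcript: str                    # utterance content of the audio


def build_learning_data(
    utterances: Iterable[Tuple[Audio, str]],
    extract_acoustic: Callable[[Audio], List[List[float]]],
    extract_non_native: Callable[[Audio], List[float]],
) -> List[LearningExample]:
    """Combine both feature types and pair them with the transcript."""
    data = []
    for audio, transcript in utterances:
        non_native = list(extract_non_native(audio))
        frames = [list(f) + non_native for f in extract_acoustic(audio)]
        data.append(LearningExample(frames, transcript))
    return data


def toy_acoustic(audio: Audio) -> List[List[float]]:
    """Placeholder extractor: split the samples into frames of two values."""
    return [list(audio[i:i + 2]) for i in range(0, len(audio), 2)]


def toy_non_native(audio: Audio) -> List[float]:
    """Placeholder extractor: a made-up one-dimensional score."""
    return [float(len(audio))]


utterances = [([0.1, 0.2, 0.3, 0.4], "hello"),  # e.g. a non-native utterance
              ([0.5, 0.6], "thank you")]        # e.g. a native utterance
learning_data = build_learning_data(utterances, toy_acoustic, toy_non_native)
```

Because native and non-native utterances share one code path, the acoustic model sees the non-native feature quantity as an ordinary input dimension across the whole range of speakers.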

As described above, the acoustic model learning technique of the present invention extracts a non-native feature quantity that expresses non-nativeness with high objectivity, without relying on the manual work of native teachers with linguistic expertise, and learns an acoustic model from learning data in which that quantity is combined with the acoustic feature quantity. The speech recognition technique of the present invention combines the non-native feature quantity obtained from the speech data to be recognized with the acoustic feature quantity, and performs speech recognition using the trained acoustic model. With this configuration, non-native utterances can be recognized with high accuracy even in DNN-HMM speech recognition, where it was conventionally not possible to improve the recognition rate by using pronunciation rating values.

The present invention is not limited to the above-described embodiment, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the present invention. The various processes described in the above embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the apparatus that executes them or as needed.

[Program, recording medium]
When the various processing functions of each device described in the above embodiment are realized by a computer, the processing contents of the functions that each device should have are described by a program. By executing this program on a computer, the various processing functions of each of the above devices are realized on the computer.

The program describing these processing contents can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be stored in a storage device of a server computer and distributed by transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing a process, the computer reads the program stored in its own storage device and executes the process according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute the process according to the program. Further, each time a program is transferred from the server computer to the computer, the computer may sequentially execute the process according to the received program. The above-described processing may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and the acquisition of results. Note that the program in this embodiment includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

In this embodiment, the present apparatus is configured by executing a predetermined program on a computer. However, at least a part of these processing contents may be realized by hardware.

DESCRIPTION OF SYMBOLS
10 Learning speech storage unit
11 Non-nativeness extraction unit
12 Acoustic feature quantity extraction unit
13 Learning data generation unit
14 Learning data storage unit
15 Acoustic model learning unit
16 Acoustic model storage unit
20 Language model storage unit
21 Feature quantity combining unit
22 Speech recognition unit

Claims

An acoustic model learning device comprising:
a learning data storage unit that stores learning data in which a learning input feature quantity, obtained by combining a non-native feature quantity representing the non-nativeness of a speaker extracted from learning speech data with an acoustic feature quantity extracted from the learning speech data, is associated with transcription data representing the utterance content of the learning speech data; and
an acoustic model learning unit that learns an acoustic model using the learning data.

The acoustic model learning device according to claim 1, further comprising:
a non-nativeness extraction unit that extracts the non-native feature quantity from the learning speech data;
an acoustic feature quantity extraction unit that extracts the acoustic feature quantity from the learning speech data; and
a learning data generation unit that generates the learning input feature quantity by combining the non-native feature quantity and the acoustic feature quantity, and generates the learning data by associating the learning input feature quantity with the transcription data.

The acoustic model learning device according to claim 1 or 2,
wherein the non-native feature quantity is any one of a score of a language discrimination result obtained by a language discrimination model, a score of an evaluation result obtained by an acoustic model for native speakers, or a confidence of a recognition result obtained by speech recognition for native speakers.

A speech recognition device comprising:
an acoustic model storage unit that stores an acoustic model generated by the acoustic model learning device according to any one of claims 1 to 3;
a non-nativeness extraction unit that extracts, from input speech data, a non-native feature quantity representing the non-nativeness of a speaker;
an acoustic feature quantity extraction unit that extracts an acoustic feature quantity from the input speech data; and
a speech recognition unit that inputs a recognition input feature quantity, obtained by combining the non-native feature quantity and the acoustic feature quantity, into the acoustic model to obtain a speech recognition result for the input speech data.

An acoustic model learning method,
wherein a learning data storage unit stores learning data in which a learning input feature quantity, obtained by combining a non-native feature quantity representing the non-nativeness of a speaker extracted from learning speech data with an acoustic feature quantity extracted from the learning speech data, is associated with transcription data representing the utterance content of the learning speech data,
the method comprising an acoustic model learning step in which an acoustic model learning unit learns an acoustic model using the learning data.

A speech recognition method,
wherein an acoustic model storage unit stores an acoustic model generated by the acoustic model learning method according to claim 5,
the method comprising:
a non-nativeness extraction step in which a non-nativeness extraction unit extracts, from input speech data, a non-native feature quantity representing the non-nativeness of a speaker;
an acoustic feature quantity extraction step in which an acoustic feature quantity extraction unit extracts an acoustic feature quantity from the input speech data; and
a speech recognition step in which a speech recognition unit inputs a recognition input feature quantity, obtained by combining the non-native feature quantity and the acoustic feature quantity, into the acoustic model to obtain a speech recognition result for the input speech data.

A program for causing a computer to function as each unit of the acoustic model learning device according to any one of claims 1 to 3, or as each unit of the speech recognition device according to claim 4.