JP2011169960A

JP2011169960A - Apparatus for estimation of speech content, language model forming device, and method and program used therefor

Info

Publication number: JP2011169960A
Application number: JP2010031255A
Authority: JP
Inventors: Hitoshi Yamamoto (山本 仁); Takafumi Koshinaka (越仲 孝文)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-02-16
Filing date: 2010-02-16
Publication date: 2011-09-01
Anticipated expiration: 2030-02-16
Also published as: JP5585111B2

Abstract

PROBLEM TO BE SOLVED: To accurately recognize phrases of specific content included in an utterance, even when the content changes within one utterance.

SOLUTION: The device includes a first content estimator 101 that estimates the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; a second content estimator 102 that estimates the probability that the content of each such processing unit is a second specific content, defined as the content of words that co-occur with words of the first specific content; and an estimation result output unit 103 that outputs the probability for the first specific content in each processing unit, estimated by the first content estimator 101, together with the probability for the second specific content in each processing unit, estimated by the second content estimator 102, as information indicating the estimation result for the content of the utterance to be processed.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to an utterance content estimation device that estimates the content of an utterance, a language model creation device that creates a language model, an utterance content estimation method, a language model creation method, an utterance content estimation program, and a language model creation program.

Speech recognition devices that recognize, from speech uttered by a user (an utterance), the word string corresponding to that speech are known. Among such devices, speech recognition devices that perform speech recognition processing based on a plurality of content-specific language models stored in advance are widely known.

A content-specific language model is a model that represents the probability that a particular word appears in a word string expressing particular content (such as a topic). For example, in a word string about television programs, words such as program names and performer names appear with higher probability, and in a word string about sports, words such as team names, sports equipment names, and player names appear with higher probability.

In a series of utterances by a user, the content may change even within a single utterance. In that case, if the speech recognition device performs speech recognition processing based on only one content-specific language model, the accuracy of recognizing the word string may decrease. A typical speech recognition device is configured to process a user's series of utterances sequentially, one utterance at a time, which allows recognition results to be output immediately. Here, one utterance is a stretch of speech delimited in time by the speaker's breaths, pauses in conversation, and the like; linguistically it often corresponds to a sentence or a clause. By selecting a content-specific language model for each utterance to be recognized, such a device can adapt to changes of content across a series of utterances. However, this approach may be unable to cope when the content changes within one utterance.

To address this, the speech recognition device described in Non-Patent Document 1 is configured to mix a plurality of content-specific language models within the utterance to be recognized, using weights that differ depending on the position in the utterance.
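The idea of position-dependent mixing can be illustrated with a minimal sketch. It assumes linear interpolation of unigram models; the text does not state which mixing formula Non-Patent Document 1 uses, and the model contents, the weights, and the names `mix_models`, `LM_PROGRAM`, and `LM_SPORTS` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical toy unigram models; the probabilities are illustrative only.
LM_PROGRAM = {"news": 0.6, "soccer": 0.1, "tonight": 0.3}
LM_SPORTS = {"news": 0.1, "soccer": 0.7, "tonight": 0.2}


def mix_models(models: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Linearly interpolate word probabilities (assumed mixing rule).

    The weights are normalised so that the mixture is again a distribution.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    vocab = set().union(*models)
    return {
        word: sum(wt / total * model.get(word, 0.0) for model, wt in zip(models, weights))
        for word in vocab
    }


# One weight vector per position in the utterance (hypothetical values):
# the first position favours the program model, the second the sports model.
position_weights = [[0.9, 0.1], [0.2, 0.8]]
mixed_per_position = [mix_models([LM_PROGRAM, LM_SPORTS], w) for w in position_weights]
```

With these toy numbers the probability of "soccer" rises from 0.16 at the first position to 0.58 at the second, which shows how a poorly chosen weight at a position directly degrades the probability given to the words that matter there.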

Non-Patent Document 1: Hitoshi Yamamoto, Ken Hanazawa, Kiyoichi Miki, "Speech recognition using dynamic control of language model based on category estimation", Proceedings of the 2009 Autumn Meeting of the Acoustical Society of Japan, pp. 181-184

However, the speech recognition device described in Non-Patent Document 1 has the problem that, when the position-dependent weights within the utterance to be recognized are not assigned appropriately, the accuracy of recognizing phrases of specific content decreases.

Accordingly, an object of the present invention is to provide an utterance content estimation device, a language model creation device, an utterance content estimation method, a language model creation method, an utterance content estimation program, and a language model creation program that can recognize phrases of specific content included in an utterance with high accuracy, even when the content changes within one utterance.

An utterance content estimation device according to the present invention includes: first content estimation means for estimating the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; second content estimation means for estimating the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is a second specific content defined as the content of words that co-occur with words of the first specific content; and estimation result output means for outputting the probability for the first specific content in each processing unit, estimated by the first content estimation means, together with the probability for the second specific content in each processing unit, estimated by the second content estimation means, as information indicating the estimation result for the content of the utterance to be processed.

A language model creation device according to the present invention includes: first content estimation means for estimating the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; second content estimation means for estimating the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is a second specific content defined as the content of words that co-occur with words of the first specific content; first content-specific language model storage means for storing a first content-specific language model corresponding to the first specific content; second content-specific language model storage means for storing a second content-specific language model corresponding to the second specific content; and language model creation means for creating a language model for each processing unit, obtained by dividing the utterance to be processed into time sections, using a first score that is the probability for the first specific content in each processing unit estimated by the first content estimation means, a second score that is the probability for the second specific content in each processing unit estimated by the second content estimation means, the first content-specific language model stored in the first content-specific language model storage means, and the second content-specific language model stored in the second content-specific language model storage means.

An utterance content estimation method according to the present invention calculates a first score indicating the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; calculates a second score indicating the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is a second specific content defined as the content of words that co-occur with words of the first specific content; and outputs the first score for the first specific content in each processing unit, together with the second score for the second specific content in each processing unit, as information indicating the estimation result for the utterance content.

A language model creation method according to the present invention calculates a first score indicating the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; calculates a second score indicating the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is a second specific content defined as the content of words that co-occur with words of the first specific content; and creates a language model for each processing unit, obtained by dividing the utterance to be processed into time sections, using the first score for the first specific content in each processing unit, the second score for the second specific content in each processing unit, a first content-specific language model stored in advance and corresponding to the first specific content, and a second content-specific language model stored in advance and corresponding to the second specific content.

An utterance content estimation program according to the present invention causes a computer to execute: a process of calculating a first score indicating the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; a process of calculating a second score indicating the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is a second specific content defined as the content of words that co-occur with words of the first specific content; and a process of outputting the first score for the first specific content in each processing unit, together with the second score for the second specific content in each processing unit, as information indicating the estimation result for the utterance content.

A language model creation program according to the present invention causes a computer to execute: a process of calculating a first score indicating the probability that the content of each processing unit, obtained by dividing an utterance to be processed into time sections, is a first specific content; a process of calculating a second score indicating the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is a second specific content defined as the content of words that co-occur with words of the first specific content; and a process of creating a language model for each processing unit, obtained by dividing the utterance to be processed into time sections, using the first score for the first specific content in each processing unit, the second score for the second specific content in each processing unit, a first content-specific language model stored in advance and corresponding to the first specific content, and a second content-specific language model stored in advance and corresponding to the second specific content.

According to the present invention, phrases of specific content included in an utterance can be recognized with high accuracy, even when the content changes within one utterance.

FIG. 1 is a block diagram showing a configuration example of the speech recognition device of the first embodiment.
FIG. 2 is an explanatory diagram showing an example of parameters (feature information) included in the first content model.
FIG. 3 is an explanatory diagram showing an example of feature values extracted by the first content estimation unit 21.
FIG. 4 is an explanatory diagram showing an example of parameters (weight information) included in the first content model.
FIG. 5 is an explanatory diagram showing an example of a processing unit sequence generated by the first content estimation unit 21.
FIG. 6 is an explanatory diagram showing an example of a lattice generated by the first content estimation unit 21.
FIG. 7 is an explanatory diagram showing an example of first scores calculated by the first content estimation unit 21.
FIG. 8 is an explanatory diagram showing an example of parameters (scores assigned to adjacent processing units) stored as information representing the second content model.
FIG. 9 is an explanatory diagram showing an example of second scores calculated by the second content estimation unit 31.
FIG. 10 is an explanatory diagram showing another example of second scores calculated by the second content estimation unit 31.
FIG. 11 is an explanatory diagram showing another example of second scores calculated by the second content estimation unit 31.
FIG. 12 is an explanatory diagram showing an example of a word list.
FIG. 13 is a flowchart showing an operation example of the speech recognition device of the first embodiment.
FIG. 14 is a block diagram showing an outline of the present invention.
FIG. 15 is a block diagram showing an outline of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing a configuration example of the speech recognition device according to the first embodiment of the present invention. The speech recognition device 100 shown in FIG. 1 includes a speech recognition unit 11, a first content estimation unit 21, a first content model storage unit 22, a second content estimation unit 31, a second content model storage unit 32, a language model creation unit 41, a first content-specific language model storage unit 42, and a second content-specific language model storage unit 43.

The speech recognition device 100 is realized by, for example, a computer including a central processing unit (CPU), storage devices (memory and a hard disk drive (HDD)), an input device, and an output device.

The output device is, for example, a display device. The output device displays an image composed of characters and figures based on image information output by the CPU. The output device may instead be an interface device to a data storage medium or a network, in which case the speech recognition result information is output via the data storage medium or the network.

The input device is, for example, a microphone. The input device inputs, for example, a speech signal from around the microphone (that is, from outside the speech recognition device 100). The input device may instead be an interface device to a data storage medium or a network, in which case the speech signal is input via the data storage medium or the network.

In the present embodiment, the speech recognition device 100 is assumed to be configured to receive a speech signal from outside via the input device and to output a speech recognition result corresponding to the input speech signal.

In the present embodiment, the speech recognition unit 11, the first content estimation unit 21, the second content estimation unit 31, and the language model creation unit 41 are realized by, for example, the CPU of the speech recognition device 100 operating according to a program stored in the storage device. Alternatively, they may be realized by hardware such as logic circuits. The first content model storage unit 22, the second content model storage unit 32, the first content-specific language model storage unit 42, and the second content-specific language model storage unit 43 are realized by, for example, the storage device of the speech recognition device 100.

The speech recognition unit 11 performs speech recognition processing on the input speech signal and outputs a speech recognition hypothesis corresponding to that signal. In the present embodiment, the speech recognition unit 11 outputs the speech recognition hypothesis to the first content estimation unit 21 and to the output device (not shown).

In the speech recognition processing, the speech recognition unit 11 may perform general processing such as searching for the word string that fits the input speech signal according to the scores given by the models for speech recognition (including, for example, an acoustic model, a language model, and a word dictionary). For example, the speech recognition unit 11 may use a hidden Markov model as the acoustic model and word trigrams as the language model. The speech recognition device 100 stores these models in the storage device in advance.

As the speech recognition hypothesis, the speech recognition unit 11 may output, for example, result information (speech recognition result information) that expresses the candidate phrases corresponding to the speech signal as a single word string. Alternatively, it may output result information expressed as a word graph containing a plurality of word strings, or as N-best word strings. In either case, the speech recognition unit 11 outputs result information that includes time information indicating which section of the input speech signal each word in the output word string corresponds to.

The first content estimation unit 21 receives the speech recognition hypothesis output from the speech recognition unit 11, generates a sequence of processing units for use in content estimation, and estimates the content of each generated processing unit. More specifically, it calculates the probability (a score indicating likelihood) that the content of each processing unit is a specific content. This value need only increase as the probability that the content is the specific content increases; for example, it may be the occurrence probability of the specific content, or a value called a likelihood or a weight for the specific content.

The first content estimation unit 21 obtains this score for the specific content for each processing unit, based on the content model stored in the first content model storage unit 22. Hereinafter, the score obtained by the first content estimation unit 21 is referred to as the first score. The first content estimation unit 21 outputs the obtained first scores to the second content estimation unit 31 and the language model creation unit 41.

As the processing units for content estimation, the first content estimation unit 21 uses the sections obtained by dividing the utterance to be recognized, from its start to its end, into a plurality of sections. Each section serving as a processing unit may be defined by its start time and end time within the utterance.

In the present embodiment, the first content estimation unit 21 uses, as the processing unit for content estimation, the section corresponding to each word in the word string of the speech recognition hypothesis. Sections corresponding to the characters or phonemes included in the speech recognition hypothesis may be used as processing units instead. In this way, the first content estimation unit 21 can divide the input speech into a plurality of sections without omission.
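As an illustration of word-level processing units, the following minimal sketch turns a one-best hypothesis with time information into a processing unit sequence and checks that the sections cover the utterance without gaps. The `ProcessingUnit` class, the function names, and the sample words and times are hypothetical; the sketch also assumes that the recognizer reports contiguous word boundaries.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ProcessingUnit:
    index: int    # 1-based position in the processing unit sequence
    word: str     # recognized word for this section
    start: float  # section start time in seconds
    end: float    # section end time in seconds


def to_processing_units(hypothesis: List[Tuple[str, float, float]]) -> List[ProcessingUnit]:
    """Build one processing unit per (word, start, end) entry of the hypothesis."""
    units = []
    for index, (word, start, end) in enumerate(hypothesis, start=1):
        if end < start:
            raise ValueError(f"word {word!r} ends before it starts")
        units.append(ProcessingUnit(index, word, start, end))
    return units


def covers_without_gaps(units: List[ProcessingUnit], utt_start: float,
                        utt_end: float, tol: float = 1e-6) -> bool:
    """Return True if the units tile [utt_start, utt_end] with no gap or overlap."""
    if not units:
        return False
    if abs(units[0].start - utt_start) > tol or abs(units[-1].end - utt_end) > tol:
        return False
    return all(abs(a.end - b.start) <= tol for a, b in zip(units, units[1:]))


# Hypothetical one-best hypothesis: (word, start time, end time).
HYPOTHESIS = [("asano", 0.00, 0.42), ("no", 0.42, 0.55), ("drama", 0.55, 1.10)]
UNITS = to_processing_units(HYPOTHESIS)
```

Character-level or phoneme-level units would use the same structure with finer-grained entries in the hypothesis list.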

In the present embodiment, the case where search conditions for information retrieval are used as the specific content estimated by the first content estimation unit 21 is described as an example. In an utterance for information retrieval, the probability that a particular word appears in the utterance differs depending on the search condition. For example, the words used as search conditions in an utterance for searching for television programs include personal names (including names of celebrities and groups), program names, program genre names (variety, sports, etc.), broadcasting station names, and time expressions (evening, 8 o'clock, etc.). In this example, the category names of the words used in search conditions (also called search-word item names, types, or attributes), such as "personal name", "program name", "program genre name", "broadcasting station name", and "time expression", may be used as the specific content for content estimation. Hereinafter, the specific content that the first content estimation unit 21 uses for content estimation is referred to as the first content.

In this way, to estimate the utterance content, the first content estimation unit 21 calculates, for each section (processing unit) in the utterance, a first score representing the probability that the utterance content corresponding to that section is the first content. Therefore, even when the content changes partway through an utterance, the utterance content can be estimated for each section.

The first content model storage unit 22 stores information on a content model that represents the relationship between a processing unit for content estimation and the probability that the utterance content corresponding to that processing unit is each of a plurality of first contents.

For example, a probabilistic model based on the theory of conditional random fields (CRF) may be used as the content model. This content model is expressed by the following Equation (1).

P(C|W) = exp(Λ·Φ(W, C)) / Z   ... Equation (1)

Here, W is the sequence of processing units for content estimation, and C is the sequence of content labels (labels of the first content) corresponding to the processing units. That is, the left side P(C|W) of Equation (1) represents the probability that the label sequence of the content represented by the processing unit sequence W is C. Φ(W, C) denotes the features extracted from the processing unit sequence W, Λ is the information representing the weight coefficients corresponding to the individual elements of the features, and Z is the normalization term. exp() denotes the exponential function, that is, the power whose base is e, the base of the natural logarithm.
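Equation (1) can be checked numerically on a very small case. The following minimal sketch enumerates every label sequence C for a short W, computes exp(Λ·Φ(W, C)) for each, and divides by their sum Z. The feature function `toy_features`, the weights `TOY_WEIGHTS`, and the two-word input are hypothetical and are not the features of the embodiment; exhaustive enumeration is used only for clarity, since practical CRF implementations compute Z by dynamic programming.

```python
import itertools
import math
from typing import Callable, Dict, Hashable, Sequence, Tuple

Labels = Tuple[str, ...]
FeatureFn = Callable[[Sequence[str], Labels], Dict[Hashable, float]]


def crf_distribution(units: Sequence[str], labels: Sequence[str],
                     feature_fn: FeatureFn,
                     weights: Dict[Hashable, float]) -> Dict[Labels, float]:
    """Return P(C|W) for every label sequence C by brute-force enumeration."""
    unnormalized = {}
    for c in itertools.product(labels, repeat=len(units)):
        phi = feature_fn(units, c)  # sparse feature vector Phi(W, C)
        dot = sum(weights.get(name, 0.0) * value for name, value in phi.items())
        unnormalized[c] = math.exp(dot)  # exp(Lambda . Phi(W, C))
    z = sum(unnormalized.values())  # normalization term Z
    return {c: score / z for c, score in unnormalized.items()}


def toy_features(units: Sequence[str], c: Labels) -> Dict[Hashable, float]:
    """Hypothetical features: counts of (word, label) pairs."""
    phi: Dict[Hashable, float] = {}
    for word, label in zip(units, c):
        key = ("emit", word, label)
        phi[key] = phi.get(key, 0.0) + 1.0
    return phi


# Hypothetical weights Lambda; features without an entry have weight 0.
TOY_WEIGHTS = {
    ("emit", "asano", "person_name"): 2.0,
    ("emit", "drama", "other"): 1.0,
}

dist = crf_distribution(["asano", "drama"], ["person_name", "other"],
                        toy_features, TOY_WEIGHTS)
best = max(dist, key=dist.get)
```

With these toy weights the four label sequences have unnormalized scores exp(3), exp(2), exp(1), and exp(0), so `best` is `("person_name", "other")` and its probability is exp(3) divided by the sum of the four scores.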

As the information representing this content model, the first content model storage unit 22 may store, for example, the definitions (extraction methods) of the features Φ and the weights Λ corresponding to the respective features. The weights Λ may take different values depending on the first content. Giving a different value for each first content makes it possible to differentiate the occurrence probabilities of the respective contents in a processing unit. In the present embodiment, recognition features and co-occurrence features are used as the features extracted for each processing unit.

A recognition feature is a feature based on information obtained in the course of speech recognition processing. For example, the confidence of the speech recognition for the section may be used. The confidence of speech recognition is a value correlated with the correctness of the speech recognition result. For example, a confidence based on the word posterior probability calculated after speech recognition processing may be used. Using such features makes it easier for the first content estimation unit 21 to detect misrecognitions contained in the speech recognition hypothesis. Because words used in search conditions for information retrieval are often misrecognized, using such features is particularly preferable.

A co-occurrence feature is a feature based on a set of linguistic items (for example, words or parts of speech) that appear together in an utterance. For example, it is a feature expressed as "the word X appears in a preceding section" for the case where a processing unit has a certain first content.

FIG. 2 is an explanatory diagram showing an example of the feature information stored in the first content model storage unit 22. In the example shown in FIG. 2, an ID identifying each feature is stored in association with the definition of that feature. For example, in FIG. 2, ID = 1 is defined as the feature "confidence of the speech recognition for the section". This is an example of a recognition feature. ID = 2 is defined as the feature "「出」 appears in a preceding section". This is an example of a co-occurrence feature. FIG. 2 also shows ID = 3, "「出」 appears in a following section"; ID = 4, "「浅野」 (Asano) appears in a preceding section"; and ID = 5, "「浅野」 appears in a following section". Any word contained in the word dictionary may be used as a word in a co-occurrence feature (such as 「出」 and 「浅野」 in this example). When these words need to be narrowed down, it is preferable to select words that co-occur with the first content.

なお、共起特徴は、図２に示したように、ある単語について特徴抽出対象の処理区間に先行するか後続するかによって異なる特徴として扱うようにしてもよいし、前後のいずれかの区間に出現するかというように先行するか後続するかを問わず１つの特徴として扱ってもよい。また、共起特徴を、特徴抽出対象の処理区間の距離により区別して異なる特徴として扱うようにしてもよい。例えば、「隣接する」、「間に１つ挟む」、「間に２つ以上挟む」といった区別により、異なる特徴として扱うようにしてもよい。なお、この距離は、処理区間の数（すなわち、単語の数）であってもよいし、処理区間の発話内における始端時間と終端時間に基づく時間情報の形式（例えば、「３０フレーム離れている」）で表される時間長であってもよい。 As shown in FIG. 2, a co-occurrence feature for a given word may be treated as a different feature depending on whether the word precedes or follows the processing section from which features are extracted, or it may be treated as a single feature that fires when the word appears in any section before or after, regardless of direction. Co-occurrence features may also be distinguished by their distance from the processing section from which features are extracted and treated as different features, for example “adjacent”, “one section in between”, and “two or more sections in between”. This distance may be the number of processing sections (that is, the number of words), or a time length expressed as time information based on the start and end times of the processing sections within the utterance (for example, “30 frames apart”).

第１の内容推定部２１は、例えば、入力された音声認識仮説において、ある処理単位の先行区間に「出」が出現している場合に、その処理単位の内容がある特定の第１の内容である場合における「先行区間に「出」が出現している」という特徴に対して、共起有りを示す特徴量＝１を与えてもよい。なお、先行区間に「出」が出現しない場合には、これらは共起していないとして、共起無しを示す特徴量＝０を与えればよい。図３は、第１の内容推定部２１が抽出した特徴量の例を示す説明図である。図３では、例えば、第１の内容推定部２１が、ある処理単位について、ＩＤ＝１として定義されている特徴（「当該区間の音声認識の信頼度」）について０．３という値（特徴量）を抽出したことを示している。また、例えば、ＩＤ＝２として定義されている特徴について１という値（特徴量）を抽出したことを示している。これは、その処理単位の先行する区間に「出」という単語が出現したことによる。また、例えば、ＩＤ＝５として定義されている特徴について０という値（特徴量）を抽出したことを示している。これは、その処理単位の後続する区間に「浅野」という単語が出現しないことによる。なお、図３では、後述する図５に示す処理単位列における処理単位３について抽出した特徴量の例を示している。 For example, when “出” (“out”) appears in a section preceding a certain processing unit in the input speech recognition hypothesis, the first content estimation unit 21 may assign a feature value of 1, indicating co-occurrence, to the feature “the word “出” appears in a preceding section” for the case where the content of that processing unit is a particular first content. When “出” does not appear in a preceding section, the two are regarded as not co-occurring and a feature value of 0, indicating no co-occurrence, is assigned. FIG. 3 is an explanatory diagram illustrating an example of the feature values extracted by the first content estimation unit 21. FIG. 3 shows, for example, that for a certain processing unit the first content estimation unit 21 extracted a value (feature value) of 0.3 for the feature defined as ID = 1 (“reliability of speech recognition in the section”). It also shows that a value of 1 was extracted for the feature defined as ID = 2, because the word “出” appears in a section preceding that processing unit, and that a value of 0 was extracted for the feature defined as ID = 5, because the word “浅野” (“Asano”) does not appear in a section following that processing unit. Note that FIG. 3 shows the feature values extracted for processing unit 3 in the processing unit sequence shown in FIG. 5, described later.
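The feature extraction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table layout `COOCCURRENCE_FEATURES`, the function name `extract_features`, and all confidence values other than the 0.3 quoted for processing unit 3 are hypothetical and do not come from the source.

```python
# Co-occurrence feature definitions mirroring FIG. 2 (IDs 2 to 5).
# The (direction, word) tuple layout is an assumption made for this sketch.
COOCCURRENCE_FEATURES = {
    2: ("preceding", "出"),
    3: ("following", "出"),
    4: ("preceding", "浅野"),
    5: ("following", "浅野"),
}


def extract_features(words, confidences, index):
    """Return {feature ID: value} for the processing unit at `index` (0-based)."""
    # ID = 1 is the recognition feature: the confidence of the unit itself.
    features = {1: confidences[index]}
    preceding = words[:index]
    following = words[index + 1:]
    for feature_id, (direction, word) in COOCCURRENCE_FEATURES.items():
        context = preceding if direction == "preceding" else following
        # Binary value: 1 if the word co-occurs on that side, otherwise 0.
        features[feature_id] = 1 if word in context else 0
    return features


# Word string of the speech recognition hypothesis in FIG. 5(b).
words = ["浅野", "出", "不在", "る", "とか", "が", "出", "て", "いる", "音楽番組"]
# Only the 0.3 for processing unit 3 is taken from the text; the rest are placeholders.
confidences = [0.5, 0.5, 0.3, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]

unit3 = extract_features(words, confidences, index=2)
print(unit3)
```

For processing unit 3, the values obtained for ID = 1, ID = 2, and ID = 5 agree with those described for FIG. 3 (0.3, 1, and 0).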

また、図４は、第１の内容モデル記憶部２２が記憶する重みの情報の一例を示す説明図である。図４に示す例では、特徴を識別するＩＤと、その特徴に対する第１の内容別の重みΛとが対応づけられて保持されている。なお、図４は、第１の内容が「人名」「ジャンル名」「その他」の３種類である場合の例である。例えば、図４では、ＩＤ＝１の特徴について、「人名」に対しては重みΛ＝０．４を、「ジャンル名」に対しては重みΛ＝０．３を、「その他」に対しては重みΛ＝−０．２を定めている。 FIG. 4 is an explanatory diagram showing an example of the weight information stored in the first content model storage unit 22. In the example shown in FIG. 4, an ID identifying a feature and the weight Λ of that feature for each first content are stored in association with each other. FIG. 4 shows an example in which there are three types of first content: “person name”, “genre name”, and “other”. For example, for the feature with ID = 1, FIG. 4 defines a weight Λ = 0.4 for “person name”, a weight Λ = 0.3 for “genre name”, and a weight Λ = −0.2 for “other”.

第１の内容推定部２１は、各特徴に対して抽出した値（特徴量）と、別途第１の内容に応じて定められている重み係数（Λ）との積に基づいて、第１のスコアを計算する。なお、図４に示すように、第１の内容推定部２１で推定する第１の内容に応じて、各特徴に対する重み係数が異なっているものが含まれていることが好ましい。換言すると、第１の内容別に重み係数が異なるような特徴が多く定義されていることが好ましい。これにより、第１の内容に応じて異なるスコアが得られ、処理単位におけるそれぞれの内容の出現確率に差をつけることができる。 The first content estimation unit 21 calculates the first score based on the product of the value (feature value) extracted for each feature and the weighting factor (Λ) separately determined for each first content. As shown in FIG. 4, it is preferable that the features include ones whose weighting factors differ depending on the first content estimated by the first content estimation unit 21. In other words, it is preferable that many features be defined whose weighting factors differ from one first content to another. This yields a different score for each first content, so that the appearance probabilities of the contents in a processing unit can be differentiated.
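As a rough illustration of how feature values and content-specific weights combine, the sketch below sums the products of feature values and weights for each first content. Only the ID = 1 weights (0.4, 0.3, −0.2) are taken from the description of FIG. 4; every other weight, and the names `WEIGHTS` and `node_score`, are assumptions introduced for illustration. The sum shown here is an unnormalized score; in a CRF it would still have to be turned into a probability over the whole sequence.

```python
# Weights per first content and feature ID.
# ID = 1 values follow the description of FIG. 4; the others are hypothetical.
WEIGHTS = {
    "person name": {1: 0.4, 2: 0.8, 5: 0.1},
    "genre name": {1: 0.3, 2: 0.1, 5: 0.1},
    "other": {1: -0.2, 2: 0.0, 5: 0.0},
}


def node_score(features, content):
    """Sum of (feature value x weight) for one candidate content of one unit."""
    weights = WEIGHTS[content]
    return sum(weights.get(feature_id, 0.0) * value
               for feature_id, value in features.items())


# Feature values described for FIG. 3 (processing unit 3).
features = {1: 0.3, 2: 1, 5: 0}

scores = {content: node_score(features, content) for content in WEIGHTS}
print(scores)
```

Because the weights differ per content, the same feature values give a different score for “person name”, “genre name”, and “other”, which is the property the paragraph above asks for.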

ここで、第１の内容推定部２１が発話中の区間ごとに第１のスコアを算出する方法について具体例を用いて説明する。以下では、内容推定の処理単位が音声認識仮説の各単語に対応する区間であり、かつ第１の内容モデルとしてＣＲＦを用いる場合を例示する。図５は、処理対象の発話に対して入力された音声認識仮説から生成される処理単位列の例を示す説明図である。なお、図５（ａ）は、処理対象の発話の例を示す説明図である。また、図５（ｂ）は、入力される音声認識仮説の単語列の例を示す説明図である。また、図５（ｃ）は、生成した処理単位列の例を示す説明図である。図５（ａ）に示すように、例えば、処理対象の発話が「明日のでエグザイルとかが出ている音楽番組」であった場合に、図５（ｂ）で示すような音声認識仮説が得られたとする。なお、図５（ｂ）では、音声認識仮説として示された単語列が、「浅野／出／不在／る／とか／が／出／て／いる／音楽番組」であることが示されている。ここで”／”は単語の区切りを示している。また、各単語の発話内位置が、単語１：「浅野」＝０〜３０、単語２：「出」＝３０〜４０、単語３：「不在」＝４０〜６０、単語４：「る」＝６０〜７５、単語５：「とか」＝７５〜９５、単語６：「が」＝９５〜１０５、単語７：「出」＝１０５〜１１５、単語８：「て」＝１１５〜１２５、単語９：「いる」＝１２５〜１４０、単語１０：「音楽番組」＝１４０〜２００であることが示されている。このような場合に、第１の内容推定部２１は、例えば、入力される音声認識仮説の単語列に含まれる各単語に対応する区間ごとに、処理単位を生成する。すなわち、単語１に対応する区間に対して処理単位１を生成し、単語２に対応する区間に対して処理単位２を生成し、以下繰り返しで、最後に単語１０に対応する区間に対して処理単位１０を生成することによって処理単位１〜１０からなる処理単位列を生成する。 Here, the method by which the first content estimation unit 21 calculates the first score for each section of the utterance is described using a specific example. The following illustrates the case where the processing unit for content estimation is the section corresponding to each word of the speech recognition hypothesis and a CRF is used as the first content model. FIG. 5 is an explanatory diagram illustrating an example of a processing unit sequence generated from the speech recognition hypothesis input for the utterance to be processed. FIG. 5(a) shows an example of the utterance to be processed, FIG. 5(b) shows an example of the word string of the input speech recognition hypothesis, and FIG. 5(c) shows an example of the generated processing unit sequence. As shown in FIG. 5(a), suppose for example that the utterance to be processed is “明日のでエグザイルとかが出ている音楽番組” (roughly, “a music program on tomorrow in which Exile and the like appear”) and that the speech recognition hypothesis shown in FIG. 5(b) is obtained. FIG. 5(b) shows that the word string of the speech recognition hypothesis is “浅野／出／不在／る／とか／が／出／て／いる／音楽番組”. Here, “／” indicates a word boundary. FIG. 5(b) also shows the position of each word within the utterance: word 1: “浅野” (“Asano”) = 0 to 30, word 2: “出” (“out”) = 30 to 40, word 3: “不在” (“absent”) = 40 to 60, word 4: “る” = 60 to 75, word 5: “とか” = 75 to 95, word 6: “が” = 95 to 105, word 7: “出” = 105 to 115, word 8: “て” = 115 to 125, word 9: “いる” = 125 to 140, and word 10: “音楽番組” (“music program”) = 140 to 200. In such a case, the first content estimation unit 21 generates, for example, one processing unit for each section corresponding to a word in the word string of the input speech recognition hypothesis. That is, it generates processing unit 1 for the section corresponding to word 1, processing unit 2 for the section corresponding to word 2, and so on, finally generating processing unit 10 for the section corresponding to word 10, thereby producing a processing unit sequence consisting of processing units 1 to 10.

次に、第１の内容推定部２１は、生成した処理単位それぞれに対して、第１の内容モデル記憶部２２に記憶されている内容モデルを参照して、各処理単位の内容が、与えられる第１の内容それぞれである確率を計算する。本例では、第１の内容推定部２１は、処理単位列Ｗ（処理単位の単語）が取りうる全ての内容Ｃの組み合わせを表現するラティス（グラフ構造）において、各ノードの事後確率を計算する。 Next, for each generated processing unit, the first content estimation unit 21 refers to the content model stored in the first content model storage unit 22 and calculates the probability that the content of the processing unit is each of the given first contents. In this example, the first content estimation unit 21 calculates the posterior probability of each node in a lattice (graph structure) that represents every combination of contents C that the processing unit sequence W (the words of the processing units) can take.

例えば、内容Ｃの候補（すなわち、与えられる第１の内容）が「人名」「ジャンル名」「その他」の３種類である場合、第１の内容推定部２１は、図６に示すようなラティスを生成する。図６は、第１の内容推定部２１が生成するラティスの例を示す説明図である。図６に示す例では、各処理単位の単語が、それぞれ内容ａ：「人名」、内容ｂ：「ジャンル名」、内容ｃ：「その他」であった場合の組み合わせを表現している。 For example, when there are three candidates for the content C (that is, three given first contents), “person name”, “genre name”, and “other”, the first content estimation unit 21 generates a lattice such as the one shown in FIG. 6. FIG. 6 is an explanatory diagram illustrating an example of the lattice generated by the first content estimation unit 21. The example in FIG. 6 represents the combinations in which the word of each processing unit is content a: “person name”, content b: “genre name”, or content c: “other”.

図６において、例えば、ノード「１ａ」は、処理単位１に対応する単語１（図５（ｂ）の例では「浅野」）が「人名」である状態を示している。また、例えばノード「１ｂ」は、処理単位１に対応する単語１が「ジャンル名」である状態を示している。また、例えばノード「１ｃ」は、処理単位１に対応する単語１が「その他」である状態を示している。同様に、例えば、ノード「３ａ」は、処理単位３に対応する単語３（図５（ｂ）の例では「不在」）が「人名」である状態を示している。 In FIG. 6, for example, the node “1a” indicates a state in which the word 1 corresponding to the processing unit 1 (“Asano” in the example of FIG. 5B) is “person name”. For example, the node “1b” indicates a state in which the word 1 corresponding to the processing unit 1 is “genre name”. For example, the node “1c” indicates a state in which the word 1 corresponding to the processing unit 1 is “others”. Similarly, for example, the node “3a” indicates a state in which the word 3 corresponding to the processing unit 3 (“absent” in the example of FIG. 5B) is “person name”.

第１の内容推定部２１は、各ノードについて、例えば、図２に例示する特徴の定義に基づき特徴量となる値を抽出し、抽出したこれらの値（図３参照。）と、図４に例示する各特徴に対応する重み係数（ここでは、当該ノードに対応づけらえた内容に与えられた重み係数を用いる）の積に基づいて第１のスコアを計算する。 For each node, the first content estimation unit 21 extracts feature values based on, for example, the feature definitions illustrated in FIG. 2, and calculates the first score based on the product of these extracted values (see FIG. 3) and the weighting factors corresponding to each feature illustrated in FIG. 4 (here, the weighting factors given to the content associated with the node).

以下に、前述の式（１）に基づいて、各ノードにおける事後確率を求める方法の一例を示す。事後的な出現確率（事後確率）ｐ（Ｃｉ＝ｊ｜Ｗ）は、前向きアルゴリズムと後向きアルゴリズムを用いた再帰的な計算により算出する。ここで、Ｃｉ＝ｊは、ｉ番目の処理単位の内容が内容ｊであることを示す。第１の内容推定部２１は、この事後確率ｐを、当該区間における各内容の出現確率として求める。図７は、第１の内容推定部２１が算出した各処理区間の各内容の出現確率（第１のスコア）の例を示す説明図である。図７に示すように、第１の内容推定部２１は、処理単位の区間ごとにそれぞれの内容（第１の内容）の出現確率を第１のスコアとして出力する。図７に示す例では、例えば、処理単位１における「その他」の出現確率と「人名」の出現確率とが０．４〜０．５あたりの値であり、「ジャンル名」の出現確率がゼロに近い値であることが示されている。 The following shows an example of a method of obtaining the posterior probability at each node based on equation (1) described above. The posterior appearance probability (posterior probability) p(Ci = j | W) is computed recursively using the forward algorithm and the backward algorithm. Here, Ci = j indicates that the content of the i-th processing unit is content j. The first content estimation unit 21 obtains this posterior probability p as the appearance probability of each content in the section. FIG. 7 is an explanatory diagram illustrating an example of the appearance probability (first score) of each content in each processing section calculated by the first content estimation unit 21. As illustrated in FIG. 7, the first content estimation unit 21 outputs, for each processing unit section, the appearance probability of each content (first content) as the first score. In the example illustrated in FIG. 7, the appearance probabilities of “other” and “person name” in processing unit 1 are around 0.4 to 0.5, and the appearance probability of “genre name” is close to zero.
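The forward and backward recursions mentioned above can be sketched as follows. This is a minimal illustration that assumes a standard linear-chain model with per-node scores and label-to-label transition scores. Equation (1) is not reproduced in this passage, so the exact form of the scores, the function names, and the numeric inputs are all assumptions.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def node_posteriors(node_scores, transition_scores):
    """Posterior p(C_i = j | W) for every node of a linear-chain lattice.

    node_scores[i][j]: log-domain score of content j at processing unit i.
    transition_scores[j][k]: log-domain score of moving from content j to k.
    """
    n, k = len(node_scores), len(node_scores[0])
    forward = [[0.0] * k for _ in range(n)]
    backward = [[0.0] * k for _ in range(n)]  # last row stays 0 (log 1)

    forward[0] = list(node_scores[0])
    for i in range(1, n):
        for j in range(k):
            forward[i][j] = node_scores[i][j] + logsumexp(
                [forward[i - 1][h] + transition_scores[h][j] for h in range(k)])

    for i in range(n - 2, -1, -1):
        for j in range(k):
            backward[i][j] = logsumexp(
                [transition_scores[j][h] + node_scores[i + 1][h] + backward[i + 1][h]
                 for h in range(k)])

    log_z = logsumexp(forward[n - 1])  # log of the partition function
    return [[math.exp(forward[i][j] + backward[i][j] - log_z) for j in range(k)]
            for i in range(n)]


# Hypothetical scores for 3 processing units and 3 contents (a, b, c).
node_scores = [[0.5, 0.1, -0.2], [0.0, 0.3, 0.4], [1.0, -0.5, 0.2]]
transition_scores = [[0.2, -0.1, 0.0], [0.1, 0.3, -0.2], [0.0, 0.0, 0.1]]

posteriors = node_posteriors(node_scores, transition_scores)
for row in posteriors:
    print([round(p, 3) for p in row])
```

Each row corresponds to one processing unit and sums to 1, so it can be read as the per-unit appearance probabilities of the kind plotted in FIG. 7.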

なお、第１の内容モデルであるＣＲＦのモデルパラメタは、予め対応付けられたモデルの入力（Ｗ：処理単位列）と、モデルの出力（Ｃ：内容のラベル列）との組み合わせの組を学習データとして、前述の式（１）の対数尤度を最大化する基準に従って反復計算法等により最適化することによって学習されてもよい。ＣＲＦを用いた識別、識別結果の事後確率の計算、モデルパラメタの学習の具体的方法は、例えば、文献「J. Lafferty, A. McCallum, F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Proceedings of the 18th International Conference on Machine Learning, 2001, p.282-289 」（非特許文献２）に記載されている方法を用いてもよい。 The model parameters of the CRF serving as the first content model may be learned by using, as learning data, pairs of a model input (W: processing unit sequence) and a model output (C: content label sequence) associated with each other in advance, and optimizing the parameters by an iterative calculation method or the like under the criterion of maximizing the log likelihood of equation (1) described above. As a specific method for classification using a CRF, calculation of the posterior probabilities of the classification results, and learning of the model parameters, the method described in, for example, J. Lafferty, A. McCallum, F. Pereira, "Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data", Proceedings of the 18th International Conference on Machine Learning, 2001, pp. 282-289 (Non-Patent Document 2) may be used.

第２の内容推定部３１は、第１の内容推定部２１が出力する各処理単位についての各内容（第１の内容）に対する第１のスコアと、第２の内容モデル記憶部３２に記憶されている第２の内容モデルとに基づいて、各処理単位それぞれに対して、その区間の内容が特定の内容である確率（尤もらしさを示すスコア）を求める。以下、第２の内容推定部３１が求めるスコアを第２のスコアと呼ぶ。第２の内容推定部３１は、求めた第２のスコアを言語モデル作成部４１に出力する。 The second content estimation unit 31 stores the first score for each content (first content) for each processing unit output by the first content estimation unit 21 and the second content model storage unit 32. On the basis of the second content model, a probability (score indicating likelihood) that the content of the section is specific content is obtained for each processing unit. Hereinafter, the score obtained by the second content estimation unit 31 is referred to as a second score. The second content estimation unit 31 outputs the obtained second score to the language model creation unit 41.

第２の内容推定部３１は、処理単位として、第１の内容推定部２１と同様に、音声認識対象の発話を複数の区間に分割したものを用いる。本実施形態では、第１の内容推定部２１が生成した処理単位を用いるものとする。なお、第２の内容推定部３１は、第１の内容推定部２１が生成したものとは異なる時間区間を、スコア算出の処理単位として用いてもよい。 Similar to the first content estimation unit 21, the second content estimation unit 31 uses a speech recognition target speech divided into a plurality of sections as a processing unit. In the present embodiment, it is assumed that the processing unit generated by the first content estimation unit 21 is used. Note that the second content estimation unit 31 may use a time interval different from that generated by the first content estimation unit 21 as a processing unit for score calculation.

本実施形態では、第２の内容推定部３１は、推定する特定の内容として、第１の内容推定部２１で用いた特定の内容（第１の内容）と共起する内容（すなわち、前後表現）を用いる。以下、第２の内容推定部３１が内容推定に用いる特定の内容のことを第２の内容と呼ぶ。情報検索のための発話では、検索条件を表す単語の前後に、検索条件によって異なる特定の表現（言い回し）が頻出する。例えば、テレビ番組を検索するための発話の場合、「人名」の検索条件を表す単語には、「出演している」などの言い回し表現がよく現れる。本例では、第１の内容に対して、対応する言い回し表現を第２の内容として定める。 In the present embodiment, the second content estimation unit 31 uses, as the specific content to be estimated, content that co-occurs with the specific content (first content) used by the first content estimation unit 21 (that is, the expressions before and after it). Hereinafter, the specific content used by the second content estimation unit 31 for content estimation is referred to as the second content. In utterances for information retrieval, specific expressions (phrasings) that differ depending on the search condition frequently appear before and after the word representing the search condition. For example, in an utterance for searching for a TV program, a phrase expression such as “出演している” (“appearing in”) often accompanies a word representing a “person name” search condition. In this example, the phrase expression corresponding to each first content is defined as the second content.

例えば、第１の内容推定部２１で「人名」という表現の括りを第１の内容として用いる場合、第２の内容推定部３１では、「人名の言い回し表現」を第２の内容として用いる。また、例えば、第１の内容推定部２１で、「人名」「ジャンル名」「その他」を第１の内容として用いる場合、第２の内容推定部３１は、「人名の言い回し表現」「ジャンル名の言い回し表現」を第２の内容として用いてもよい。なお、本例では「その他」については特に頻出する言い回し表現はないものとして「その他の言い回し表現」は第２の内容として用いない。 For example, when the first content estimation unit 21 uses the category of expressions “person name” as the first content, the second content estimation unit 31 uses “phrase expression for person name” as the second content. Likewise, when the first content estimation unit 21 uses “person name”, “genre name”, and “other” as the first contents, the second content estimation unit 31 may use “phrase expression for person name” and “phrase expression for genre name” as the second contents. In this example, “other” is assumed to have no particularly frequent phrase expression, so “phrase expression for other” is not used as a second content.

第２の内容モデル記憶部３２は、各処理単位における第２の内容の出現しやすさを表す情報を含む第２の内容モデルの情報を記憶する。図８は、第２の内容モデルを示す情報として記憶されるパラメタの例を示す説明図である。図８に示す例では、第２の内容モデル記憶部３２は、第２の内容ごとに、その出現しやすさに対して与えるスコア（隣接処理単位への付与スコア）を保持する。なお、第１の内容推定で用いる共起特徴として、処理単位と共起する位置関係（前後や距離）によって区別して定義する場合は、その位置関係ごとに異なる値を保持するようにしてもよい。 The second content model storage unit 32 stores information on the second content model, including information representing how likely the second content is to appear in each processing unit. FIG. 8 is an explanatory diagram illustrating an example of the parameters stored as information representing the second content model. In the example illustrated in FIG. 8, the second content model storage unit 32 holds, for each second content, the score to be given for its likelihood of appearance (the score given to adjacent processing units). When the co-occurrence features used in the first content estimation are defined separately according to their positional relationship to the processing unit (before or after, or distance), a different value may be held for each positional relationship.

第２の内容推定部３１は、第２のスコアの算出方法として、各処理単位で最も第１のスコアが大きい第１の内容に基づき、その処理単位に隣接する処理単位における当該第１の内容に対応した第２の内容（言い回し表現）についての第２のスコアが大きくなるように、スコアを与える。例えば、各処理単位で最も第１のスコアが大きい第１の内容に対応する第２の内容に応じて定められているスコア（出現しやうさに対して付与される値）を、その隣接する処理単位に付与する処理を行うことによって、各処理単位について第２の内容である確率を示す第２のスコアを求めてもよい。 As the method of calculating the second score, the second content estimation unit 31 takes the first content with the largest first score in each processing unit and gives scores so that, in the processing units adjacent to that unit, the second score for the second content (phrase expression) corresponding to that first content becomes larger. For example, the second score indicating the probability that each processing unit has the second content may be obtained by giving, to the adjacent processing units, the score defined for the second content that corresponds to the first content with the largest first score in each processing unit (the value given for its likelihood of appearance).

ここで、図５（ｃ）に示した処理単位３を例に用いて、第２の内容推定部３１での第２のスコアの算出方法について説明する。第２の内容推定部３１は、図７に示す例では、この処理単位３における最も大きい第１のスコアが「人名」のスコア（第１の確率）であるため、図８に示した第２の内容モデルに従い、この処理単位３に隣接する処理単位に０．４というスコアを与える。図９は、第２の内容推定部３１が算出した第２のスコアの例を示す説明図である。図９では、図８に示した第２の内容モデルに従って、隣接する４つの処理単位までスコアを与えている。なお、図９に示す例では、同じく「人名」のスコアが最も高い処理単位を除いてスコアを与える例を示しているが、処理単位ごとに単純に隣接する処理単位にスコアを付与してもよい。そのような場合に、例えば、処理単位ごとの隣接処理単位への付与処理によって重複してスコアが与えられる処理単位については、与えるスコア値を調整した上で付与するようにしてもよい。なお、既に説明したように、単純に２倍されるように付与してもよいし、重複してはスコアを与えないようにしてもよい。 Here, the method by which the second content estimation unit 31 calculates the second score is described using processing unit 3 shown in FIG. 5(c) as an example. In the example shown in FIG. 7, the largest first score in processing unit 3 is the score (first probability) for “person name”, so the second content estimation unit 31 gives a score of 0.4 to the processing units adjacent to processing unit 3, in accordance with the second content model shown in FIG. 8. FIG. 9 is an explanatory diagram illustrating an example of the second score calculated by the second content estimation unit 31. In FIG. 9, scores are given to up to four adjacent processing units, in accordance with the second content model shown in FIG. 8. The example in FIG. 9 gives scores only to processing units other than those in which “person name” likewise has the highest score, but scores may instead simply be given to the adjacent processing units of every processing unit. In that case, for a processing unit that would receive scores more than once through this per-unit assignment, the score value may be adjusted before it is given. As already described, the score may simply be doubled, or duplicate scores may be withheld.
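One possible reading of this assignment rule is sketched below. The 0.4 score for the person-name phrase expression and the reach of four adjacent units follow the description of FIG. 8 and FIG. 9; the genre-name value, the first scores, the choice to withhold duplicate scores, and all identifiers are assumptions made for illustration.

```python
# First content -> (second content, score given to adjacent units).
# 0.4 follows the description of FIG. 8; 0.3 is a hypothetical value.
SECOND_CONTENT = {
    "person name": ("person-name phrase expression", 0.4),
    "genre name": ("genre-name phrase expression", 0.3),
}
PHRASES = [phrase for phrase, _ in SECOND_CONTENT.values()]


def assign_second_scores(first_scores, reach=4):
    """Give each unit's neighbours the score of the phrase tied to its top content."""
    n = len(first_scores)
    top = [max(scores, key=scores.get) for scores in first_scores]
    second = [dict.fromkeys(PHRASES, 0.0) for _ in range(n)]
    for i, content in enumerate(top):
        if content not in SECOND_CONTENT:
            continue  # "other" has no phrase expression in this example
        phrase, value = SECOND_CONTENT[content]
        for j in range(max(0, i - reach), min(n, i + reach + 1)):
            # Skip the unit itself and units whose top content is the same.
            if j == i or top[j] == content:
                continue
            # Assignment (not accumulation): duplicates are not given twice.
            second[j][phrase] = value
    return second


# Hypothetical first scores: only processing unit 3 (index 2) favours "person name".
other_top = {"person name": 0.1, "genre name": 0.1, "other": 0.8}
first_scores = [dict(other_top) for _ in range(10)]
first_scores[2] = {"person name": 0.6, "genre name": 0.1, "other": 0.3}

second_scores = assign_second_scores(first_scores)
print([unit["person-name phrase expression"] for unit in second_scores])
```

Replacing the assignment with `+=` would give the "simply doubled" variant that the text also permits.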

また、第２の内容推定部３１は、隣接する処理単位にスコアを与える際に、第１のスコアの大きさによって、第２の内容モデルの値に重みづけてスコアを与えてもよい。例えば、ある処理単位に対して算出された第１のスコアが所定の値よりも小さい場合は、それに基づいて付与するスコアをその分小さくするなどの処理を行うようにしてもよい。また、第１の内容推定で用いる共起特徴として、処理単位と共起する位置関係（前後や距離）により区別された特徴が用いられている場合には、その位置関係ごとに第２の内容のスコアを変えて与えてもよい。例えば、先行する処理単位と後続する処理単位に異なる値を与えたり、１つ隣の処理単位と２つ以上離れた処理単位に異なる値を与えたりするようにしてもよい。 When giving scores to adjacent processing units, the second content estimation unit 31 may weight the value of the second content model according to the magnitude of the first score. For example, when the first score calculated for a certain processing unit is smaller than a predetermined value, the score given on the basis of it may be reduced accordingly. When the co-occurrence features used in the first content estimation are distinguished by their positional relationship to the processing unit (before or after, or distance), the score for the second content may be varied for each positional relationship. For example, different values may be given to preceding and following processing units, or to the immediately adjacent processing unit and to processing units two or more units away.

また、既に説明したように、第２の内容推定部３１は、第１の内容推定部２１とは異なる時間単位をスコア算出の処理単位としてもよい。そのような場合には、第２の内容推定部３１は、第１の内容推定部２１の各処理単位で最も第１のスコアが大きい第１の内容に基づき、その処理単位（第１の内容推定部２１での処理単位）に隣接する処理単位（ここでは、第２の内容推定部３１での処理単位）に対して、当該第１の内容に対応した第２の内容（言い回し表現）についての第２のスコアが大きくなるように、スコアを与えればよい。例えば、第１の内容推定部２１の処理単位を「単語」単位とし、第２の内容推定部３１の処理単位を「フレーム」単位とする場合には、その単語の前後数フレームに対して、スコアを与えるといった処理を行うことも可能である。 As already described, the second content estimation unit 31 may use a time unit different from that of the first content estimation unit 21 as its processing unit for score calculation. In that case, the second content estimation unit 31 takes the first content with the largest first score in each processing unit of the first content estimation unit 21 and gives scores to the processing units (here, those of the second content estimation unit 31) adjacent to that processing unit (that of the first content estimation unit 21), so that the second score for the second content (phrase expression) corresponding to that first content becomes larger. For example, when the processing unit of the first content estimation unit 21 is a “word” and the processing unit of the second content estimation unit 31 is a “frame”, scores may be given to several frames before and after the word.

また、第２の内容推定部３１が、第１のスコアと第２のスコアの両方を出力するようにし、その際、第１の内容推定部２１で求めた第１の内容に対する各スコア（第１のスコア）と、第２の内容推定部３１で求めた第２の内容に対する各スコア（第２のスコア）を処理単位ごとに正規化して出力するようにしてもよい。正規化して出力することで、入力された処理対象の発話について、第１の内容である確率と第２の内容である確率とを総合して評価した発話内容推定結果を出力することができる。なお、第１の内容推定部２１と第２の内容推定部３１で異なる処理単位を用いる場合は小さい方の処理単位に合わせればよい。 The second content estimation unit 31 may also output both the first score and the second score, normalizing for each processing unit the scores for the first contents obtained by the first content estimation unit 21 (first scores) and the scores for the second contents obtained by the second content estimation unit 31 (second scores). Normalizing the output makes it possible to output an utterance content estimation result that evaluates the probability of the first content and the probability of the second content together for the input utterance to be processed. When the first content estimation unit 21 and the second content estimation unit 31 use different processing units, the scores may be aligned to the smaller processing unit.
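The source does not state which normalization is used. The sketch below assumes the simplest choice, dividing by the sum so that all first and second scores of one processing unit add up to 1. The function name and the numeric values are hypothetical.

```python
def normalize_unit(first, second):
    """Merge first and second scores of one processing unit and rescale to sum to 1."""
    merged = {**first, **second}
    total = sum(merged.values())
    if total <= 0:
        return merged  # nothing to normalize
    return {content: score / total for content, score in merged.items()}


# Hypothetical scores for a single processing unit.
first = {"person name": 0.1, "genre name": 0.1, "other": 0.8}
second = {"person-name phrase expression": 0.4, "genre-name phrase expression": 0.0}

normalized = normalize_unit(first, second)
print(normalized)
```

After this step the five values of a unit can be compared directly, for example as the weights of a mixture of content-specific language models.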

なお、本発明を発話内容推定装置として実現する場合には、第１の内容推定部２１によって算出された第１のスコアと、第２の内容推定部３１によって算出された第２のスコアとを併せて、処理対象の発話内容の推定結果を示す情報として出力してもよい。例えば、推定結果出力部（図示せず。）を設けて、第１のスコアと第２のスコアとを入力し、それらに基づいて処理対象の発話内容の推定結果を示す情報を出力させてもよい。なお、推定結果出力部は、第１のスコアと第２のスコアを含む情報をそのまま処理対象の発話内容の推定結果を示す情報として出力してもよいし、上記で説明したように第１のスコアと第２のスコアを処理単位ごとに正規化したものを処理対象の発話内容の推定結果を示す情報として出力してもよい。なお、各内容推定部がこの推定結果出力部の機能を兼用し、それぞれが求めたスコアを出力してもよいし、第２の内容推定部３１がまとめて出力することで推定結果出力部を兼用してもよい。 When the present invention is implemented as an utterance content estimation device, the first score calculated by the first content estimation unit 21 and the second score calculated by the second content estimation unit 31 may be output together as information indicating the estimation result of the content of the utterance to be processed. For example, an estimation result output unit (not shown) may be provided that receives the first score and the second score and outputs, based on them, information indicating the estimation result of the content of the utterance to be processed. The estimation result output unit may output the information containing the first score and the second score as is, or, as described above, may output the first score and the second score normalized for each processing unit. Each content estimation unit may also serve as this estimation result output unit and output the score it obtained, or the second content estimation unit 31 may serve as the estimation result output unit by outputting both scores together.

図１０および図１１は、第２の内容推定部３１が算出する第２のスコアの他の例を示す説明図である。なお、図１０は、処理単位と共起する位置の関係として距離に応じて共起特徴を分類した場合の第２のスコアの付与例を示す説明図である。また、図１１は、処理単位と共起する位置の前後関係に応じて共起特徴を分類した場合の第２のスコアの付与例を示す説明図である。なお、図１０および図１１とも、「人名の言い回し表現」についての第２のスコアの付与例を示している。 FIGS. 10 and 11 are explanatory diagrams illustrating other examples of the second score calculated by the second content estimation unit 31. FIG. 10 shows an example of giving the second score when the co-occurrence features are classified by distance from the processing unit. FIG. 11 shows an example of giving the second score when the co-occurrence features are classified by whether the co-occurring position precedes or follows the processing unit. Both FIG. 10 and FIG. 11 show examples of giving the second score for “phrase expression for person name”.

言語モデル作成部４１は、算出された第１のスコアと、算出された第２のスコアと、第１の内容別言語モデル記憶部４２に記憶されている第１の内容別言語モデルと、第２の内容別言語モデル記憶部４３に記憶されている第２の内容別言語モデルとに基づいて、音声認識対象の発話のうち、処理単位ごとに特定の単語が出現する確率を表す言語モデルを作成する。また、言語モデル作成部４１は、作成した言語モデルを音声認識部１１に出力する。 The language model creation unit 41 creates a language model representing the probability that a specific word appears in each processing unit of the utterance to be recognized, based on the calculated first score, the calculated second score, the first content-specific language model stored in the first content-specific language model storage unit 42, and the second content-specific language model stored in the second content-specific language model storage unit 43. The language model creation unit 41 also outputs the created language model to the speech recognition unit 11.

第１の内容別言語モデル、第２の内容別言語モデル、言語モデル作成部４１の作成する言語モデルは、例えば、ある単語が出現する確率がその直前のＮ−１個の単語に依存すると定義したＮグラム言語モデルであってもよい。 The first content-specific language model, the second content-specific language model, and the language model created by the language model creation unit 41 may each be, for example, an N-gram language model, which defines the probability that a word appears as depending on the immediately preceding N−1 words.

Ｎグラム言語モデルにおいて、ｉ番目の単語ｗ_ｉの出現確率はＰ（ｗ_ｉ｜Ｗ_{ｉ−Ｎ＋１} ^ｉ−１）により表される。ここで、条件部のＷ_{ｉ−Ｎ＋１} ^ｉ−１は、（ｉ−Ｎ＋１）〜（ｉ−１）番目の単語列を表す。なお、Ｎ＝２のモデルをバイグラム（ｂｉｇｒａｍ）モデル、Ｎ＝３のモデルをトライグラム（ｔｒｉｇｒａｍ）モデルと呼ぶ。また、直前の単語に影響されないとの仮定に基づいて構築されたモデルをユニグラム（ｕｎｉｇｒａｍ）モデルと呼ぶ。 In the N-gram language model, the appearance probability of the i-th word $w_i$ is represented by $P(w_i \mid W_{i-N+1}^{i-1})$. Here, $W_{i-N+1}^{i-1}$ in the condition part represents the word string from the $(i-N+1)$-th word to the $(i-1)$-th word. The model with N = 2 is called a bigram model, and the model with N = 3 is called a trigram model. A model constructed on the assumption that a word is not affected by the immediately preceding words is called a unigram model.

Ｎグラム言語モデルによれば、単語列Ｗ_ｉ ^ｎ＝（ｗ_１，ｗ_２，・・・，ｗ_ｎ）が出現する確率Ｐ（Ｗ_ｉ ^ｎ）は以下の式（２）により表される。 According to the N-gram language model, the probability $P(W_i^n)$ that the word string $W_i^n = (w_1, w_2, \ldots, w_n)$ appears is expressed by the following equation (2).

$$P(W_i^n) = \prod_i P(w_i \mid W_{i-N+1}^{i-1}) \qquad (2)$$

また、このようなＮグラム言語モデルに用いられる種々の単語の種々の条件付き確率からなるパラメタは、学習用テキストデータに対する最尤推定等により求められる。 In addition, a parameter composed of various conditional probabilities of various words used in such an N-gram language model is obtained by maximum likelihood estimation or the like for learning text data.
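As an illustration of maximum likelihood estimation for the N = 2 case, the sketch below counts bigrams in a tiny corpus and multiplies the resulting conditional probabilities as in equation (2). The corpus, the sentence-start symbol `<s>`, and the function names are assumptions. No smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter


def train_bigram(sentences):
    """Maximum likelihood bigram estimates: count(prev, cur) / count(prev)."""
    context_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence  # "<s>" marks the sentence start (assumption)
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return {pair: count / context_counts[pair[0]]
            for pair, count in pair_counts.items()}


def sentence_probability(model, sentence):
    """Product of P(w_i | w_{i-1}) over the sentence, as in equation (2) with N = 2."""
    probability = 1.0
    tokens = ["<s>"] + sentence
    for prev, cur in zip(tokens, tokens[1:]):
        probability *= model.get((prev, cur), 0.0)
    return probability


# Hypothetical training text, already split into words.
corpus = [
    ["エグザイル", "が", "出", "て", "いる"],
    ["エグザイル", "の", "音楽番組"],
]

model = train_bigram(corpus)
print(model[("エグザイル", "が")])
print(sentence_probability(model, ["エグザイル", "が", "出", "て", "いる"]))
```

In practice a smoothing or back-off scheme would normally be added so that unseen word sequences do not receive zero probability.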

言語モデル作成部４１は、例えば、第１のスコアおよび第２のスコアによって示される、音声認識対象の発話の各区間における各内容（第１の内容および第２の内容）の確率と、第１の内容別言語モデル記憶部４２および第２の内容別言語モデル記憶部４３に記憶されている複数の内容別言語モデルとを用い、次の式（３）に従って、言語モデルを作成してもよい。 For example, the language model creation unit 41 may create a language model according to the following equation (3), using the probability of each content (first content and second content) in each section of the utterance to be recognized, as indicated by the first score and the second score, together with the plurality of content-specific language models stored in the first content-specific language model storage unit 42 and the second content-specific language model storage unit 43.

$$P_t(w_i) = \sum_j \alpha_j(t)\, P_j(w_i) \qquad (3)$$

式（３）において、Ｐ_ｔ（ｗ_ｉ）は単語ｗ_ｉが区間ｔにおいて出現する確率である。また、α_ｊ（ｔ）は、区間ｔにおける内容が内容ｊである確率（スコア）である。またＰ_ｊ（ｗ_ｉ）は内容ｊに対する内容別言語モデルにおける単語ｗ_ｉが出現する確率である。本例では、言語モデル作成部４１は、第１の内容推定部２１および第２の内容推定部３１により計算されたスコア（発話内の各区間における内容の出現確率）を、式（３）のα_ｊ（ｔ）として用いる。ここで、第１の確率と第２の確率とで異なる係数をかけたものを式（３）のα_ｊ（ｔ）として用いるようにしてもよい。 In equation (3), $P_t(w_i)$ is the probability that the word $w_i$ appears in section $t$, $\alpha_j(t)$ is the probability (score) that the content in section $t$ is content $j$, and $P_j(w_i)$ is the probability that the word $w_i$ appears in the content-specific language model for content $j$. In this example, the language model creation unit 41 uses the scores calculated by the first content estimation unit 21 and the second content estimation unit 31 (the appearance probabilities of the contents in each section of the utterance) as $\alpha_j(t)$ in equation (3). The first probability and the second probability may also be multiplied by different coefficients before being used as $\alpha_j(t)$ in equation (3).
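Equation (3) is a weighted mixture of the content-specific models. The sketch below applies it to unigram models for a single section t. The mixture weights and word probabilities are hypothetical, and the function name `mix_language_models` is introduced only for illustration.

```python
def mix_language_models(alphas, content_models):
    """P_t(w) = sum_j alpha_j(t) * P_j(w), evaluated for one section t."""
    vocabulary = set()
    for model in content_models.values():
        vocabulary.update(model)
    return {word: sum(alphas[content] * content_models[content].get(word, 0.0)
                      for content in alphas)
            for word in vocabulary}


# Hypothetical normalized scores alpha_j(t) for one section t.
alphas = {
    "person name": 0.7,
    "person-name phrase expression": 0.1,
    "other": 0.2,
}
# Hypothetical content-specific unigram models P_j(w).
content_models = {
    "person name": {"エグザイル": 0.9, "出": 0.1},
    "person-name phrase expression": {"出": 0.8, "エグザイル": 0.2},
    "other": {"音楽番組": 1.0},
}

section_model = mix_language_models(alphas, content_models)
print(section_model)
```

When the weights sum to 1 and each content-specific model is a proper distribution, the mixed model is a proper distribution as well, so a section that is probably a person name raises the probability of person-name words for that section only.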

ここで、式（３）のｔは、音声認識処理において用いられる時間フレームに対応する区間であってもよく、発話内の時点を表す時刻等であってもよい。 Here, t in Expression (3) may be a section corresponding to a time frame used in the speech recognition processing, or may be a time representing a time point in the utterance.

第１の内容別言語モデル記憶部４２に記憶されている第１の内容別言語モデルは、第１の内容それぞれを表現する単語を含む単語列を学習データとして用いる。例えば、検索条件を表現する単語リストを用意しておき、それらを含む単語列を学習データとして用いてもよい。 The first content-specific language model stored in the first content-specific language model storage unit 42 uses, as learning data, a word string including words that express the first content. For example, a word list expressing search conditions may be prepared, and a word string including them may be used as learning data.

Likewise, the second content-specific language models stored in the second content-specific language model storage unit 43 are trained on word strings containing words that express the respective second contents. For each search condition, a list of wording expressions selected as specific expressions, as described below, may be prepared, and word strings containing them may be used as training data.

FIG. 12 is an explanatory diagram showing examples of word lists. FIG. 12(a) shows an example of a word list for the first contents, and FIG. 12(b) shows an example of a word list for the second contents. In FIG. 12(a), for example, the first content "person name" includes words such as "Exile" (エグザイル).

For the word list of the second contents, in this example the specific expressions (wording expressions) belonging to a second content are determined according to the first content model. A specific expression belonging to a second content is a concrete expression such as "is appearing in" (出演している) that falls under a second-content category such as "wording expression for a person name". Specifically, among the parameters of the CRF trained as the first content model, the words that contribute strongly to the estimation of each first content are selected. With the parameters (weights) shown in FIG. 4, for example, the word 出 ("appear"), which is used in the feature (ID = 2) that has a large weight for the first content "person name", is selected as a specific expression, and the other words are not. The number of specific expressions is not limited to one; any number may be defined.
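The selection of strongly contributing words can be sketched as follows. This is only an illustration: the patent does not state how a "large" weight is decided, so the fixed threshold is an assumption, and the feature table (apart from the word 出 being tied to feature ID 2 for "person name") uses hypothetical words and weights rather than the values of FIG. 4.

```python
from typing import List, NamedTuple, Set


class CrfFeature(NamedTuple):
    feature_id: int
    word: str       # word the feature fires on
    content: str    # first content the weight applies to
    weight: float   # learned CRF weight (hypothetical values below)


def select_specific_words(
    features: List[CrfFeature], content: str, min_weight: float
) -> Set[str]:
    """Keep words whose feature weight for `content` is at least `min_weight`."""
    return {
        f.word
        for f in features
        if f.content == content and f.weight >= min_weight
    }


# Hypothetical weights; only feature 2 is large for "person_name".
features = [
    CrfFeature(1, "の", "person_name", 0.1),
    CrfFeature(2, "出", "person_name", 1.5),
    CrfFeature(3, "明日", "person_name", -0.3),
]
specific_words = select_specific_words(features, "person_name", min_weight=1.0)
```

Lowering `min_weight` would admit more words, which corresponds to the remark that any number of specific expressions may be defined.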

In this case, the language model creation unit 41 may reflect the second score in α_j(t) only for those expressions in the second-content word list that were selected as specific expressions according to the first content model. In the example of FIG. 12(b), the second content "wording expression for a person name" lists expressions such as 出演する ("appears in"), 出ている ("is on"), and 司会の ("hosted by"). If only the word 出 has been selected as a specific expression, the second score is reflected in α_j(t) only for expressions that use the word 出, such as 出ている. This prevents the second score from being reflected excessively for wording expressions with weak co-occurrence. Whether an expression counts as a specific expression need not be judged word by word; for example, expressions in which part of a word matches may also be included, in which case 出演する is selected as well. The selection of specific expressions for the second contents may be performed at the same time as the training of the first content model, that is, in advance of the series of processes of this embodiment. The choice of what to treat as the "first content" and the "second content" is likewise assumed to be made in advance according to the type of utterance to be processed, the purpose of the content estimation, and so on; in this embodiment it is given as a setting when the series of processes starts.
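The difference between word-level matching and partial matching can be sketched as below. The tokenization of each expression is a hypothetical segmentation chosen for illustration, and the function name and `partial` flag are likewise assumptions rather than anything defined in the patent.

```python
from typing import Dict, List, Sequence, Set


def filter_expressions(
    expressions: Dict[str, Sequence[str]],
    specific_words: Set[str],
    partial: bool = False,
) -> List[str]:
    """Return the expressions that count as specific expressions.

    expressions maps a surface form to its word tokens. With partial=False a
    token must equal a specific word; with partial=True it is enough for a
    token to contain one.
    """
    selected = []
    for surface, tokens in expressions.items():
        if partial:
            hit = any(w in token for token in tokens for w in specific_words)
        else:
            hit = any(token in specific_words for token in tokens)
        if hit:
            selected.append(surface)
    return selected


# Wording expressions from the FIG. 12(b) example, with assumed tokenization.
expressions = {
    "出演する": ["出演", "する"],
    "出ている": ["出", "て", "いる"],
    "司会の": ["司会", "の"],
}
word_level = filter_expressions(expressions, {"出"})
partial_level = filter_expressions(expressions, {"出"}, partial=True)
```

Only the expressions returned by the filter would have the second score reflected in α_j(t); the others would be left out.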

Training the content-specific language models in this way causes different content-specific language models to assign different appearance probabilities to a word that expresses a specific content.

The content-specific language model storage units 42 and 43 may store the word lists described above as lists of words that are particularly desired to be recognized. In this case, the language model creation unit 41 may be configured, for example, to increase by a predetermined value, in each section of the utterance, the appearance probability of the words contained in the word list for the content with the highest score.
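A minimal sketch of this word-list variant follows. The function name, the data layout, and the numbers are hypothetical; the sketch adds the predetermined value and deliberately does not renormalize, because the text does not say whether the result is renormalized.

```python
from typing import Dict, Iterable


def boost_top_content_words(
    section_lm: Dict[str, float],
    scores: Dict[str, float],
    word_lists: Dict[str, Iterable[str]],
    boost: float,
) -> Dict[str, float]:
    """Add `boost` to the probability of every listed word of the top content."""
    top_content = max(scores, key=scores.get)
    boosted = dict(section_lm)
    for word in word_lists.get(top_content, ()):
        boosted[word] = boosted.get(word, 0.0) + boost
    return boosted


# Hypothetical word probabilities and content scores for one section.
section_lm = {"exile": 0.25, "tomorrow": 0.75}
scores = {"person_name": 0.6, "other": 0.4}
word_lists = {"person_name": ["exile"]}
boosted_lm = boost_top_content_words(section_lm, scores, word_lists, boost=0.125)
```

The input table is copied rather than modified in place, so the same base model can be reused for the next section.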

When outputting the created language model, the language model creation unit 41 may output all of the information contained in the language model, or only the information specified externally.

In this way, for each section of the utterance to be recognized, the language model creation unit 41 creates a language model as a weighted combination of the content-specific language models, using the estimated appearance probability of each content in that section as the weighting coefficient.

Because the speech recognition unit 11 subsequently uses the language model created by the language model creation unit 41, it can perform speech recognition with an accurate language model for each section and can therefore output a more accurate speech recognition result.

Next, the operation of this embodiment is described with reference to the flowchart of FIG. 13, which shows an example of the operation of the speech recognition apparatus 100. As shown in FIG. 13, when started, the speech recognition apparatus 100 performs initialization (step S11): it reads the necessary data from a storage device or the like and loads it into the first content model storage unit 22, the second content model storage unit 32, the first content-specific language model storage unit 42, and the second content-specific language model storage unit 43 so that the speech recognition unit 11, the first content estimation unit 21, the second content estimation unit 31, and the language model creation unit 41 can refer to it.

Meanwhile, the speech recognition unit 11 accepts a speech signal in response to a notification from the input device and performs speech recognition (step S12). The speech recognition unit 11 then outputs the speech recognition hypothesis obtained by this process to the first content estimation unit 21.

Next, based on the first content model stored in the first content model storage unit 22, the first content estimation unit 21 calculates, by the method already described or a similar method, a first score, which is the probability that the content corresponding to each processing unit in the processing unit sequence generated from the speech recognition hypothesis is a specific content (first content) (step S13).

Next, based on the appearance probability (first score) of each content (first content) in each processing unit as estimated by the first content estimation unit 21 and on the second content model stored in the second content model storage unit 32, the second content estimation unit 31 calculates, by the method already described or a similar method, a second score, which is the probability that the content corresponding to each processing unit is a specific content (second content) (step S14).

Next, based on the appearance probabilities (first score and second score) of each content (first content and second content) in each processing unit as estimated by the first content estimation unit 21 and the second content estimation unit 31, and on the content-specific language models stored in the first content-specific language model storage unit 42 and the second content-specific language model storage unit 43, the language model creation unit 41 creates a language model representing the probability that a specific word appears at each position in the utterance to be recognized (step S15). The language model creation unit 41 then outputs the created language model to the speech recognition unit 11.

Next, the speech recognition unit 11 performs speech recognition on the utterance to be recognized using the language model created by the language model creation unit 41, and outputs the resulting speech recognition hypothesis to the output device as result information (step S16).

This series of processes yields a more accurate speech recognition hypothesis. The speech recognition unit 11 may also output the speech recognition hypothesis obtained at this point to the first content estimation unit 21 again (returning to step S13).
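The flow of steps S12 to S16, including the optional return to step S13, can be sketched as a small driver function. The four callables stand in for the speech recognition unit 11, the content estimation units 21 and 31, and the language model creation unit 41; their names, signatures, and the `passes` parameter are assumptions made for illustration, not interfaces defined in the patent.

```python
from typing import Any, Callable, Optional


def recognize_with_content_estimation(
    audio: Any,
    recognize: Callable[[Any, Optional[Any]], Any],
    estimate_first: Callable[[Any], Any],
    estimate_second: Callable[[Any, Any], Any],
    create_language_model: Callable[[Any, Any], Any],
    passes: int = 1,
) -> Any:
    """Run steps S12 to S16, repeating S13 to S16 `passes` times."""
    # S12: first recognition pass without an adapted language model.
    hypothesis = recognize(audio, None)
    for _ in range(passes):
        # S13: first scores from the current hypothesis.
        first_scores = estimate_first(hypothesis)
        # S14: second scores, using the first scores as input.
        second_scores = estimate_second(hypothesis, first_scores)
        # S15: per-section language model from both sets of scores.
        language_model = create_language_model(first_scores, second_scores)
        # S16: recognize again with the adapted language model.
        hypothesis = recognize(audio, language_model)
    return hypothesis
```

Setting `passes` above one corresponds to feeding the hypothesis from step S16 back into step S13. Initialization (step S11) is assumed to have happened before the callables are passed in.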

That is, according to this embodiment, when the content of the utterance to be recognized is estimated, not only is the first content estimated (for example, which search condition it is), but, based on that estimation result, the second content is estimated as well (for example, which search condition the wording expression belongs to), the second content being an expression of words that co-occur with the first content. This makes it possible to avoid incorrectly estimating content that changes within the utterance to be recognized.

This is because the second content estimation unit 31 performs content estimation treating the expressions that precede and follow the specific content used by the first content estimation unit 21 as the second content, which makes the appearance probabilities of the contents for the utterance more accurate. For example, when the output of the first content estimation unit 21 is the appearance probabilities shown in FIG. 7, the probability of "other" and the probability of "person name" are about equal in processing unit 1. This is because the probability of "person name" is raised by influences such as the presence of 出 in processing unit 2. If these appearance probabilities were used as they are, and the speech recognition unit 11 subsequently performed recognition with a language model created from them by the language model creation unit 41, the content-specific language model for "person name" would receive a large weight in this section. The recognition result for this section would then be a word belonging to the content "person name" (for example, "Asano" (浅野)), making a recognition error (an insertion error of a search condition) more likely.

However, when the second content estimation unit 31 assigns processing unit 1 a high appearance probability for the content "wording expression for a person name", the probability of "person name" becomes relatively smaller. The appearance probabilities of the contents for the utterance thus become more accurate, and the subsequent speech recognition unit 11 obtains the correct speech recognition hypothesis (in this example, "tomorrow's" (明日の)).
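One way to see how a high second score lowers the relative probability of "person name" is to normalize the first and second scores of a processing unit together. The sketch below assumes joint normalization; the scores are hypothetical and are not the values of FIG. 7, and the function and label names are illustrative only.

```python
from typing import Dict


def normalize_scores(
    first_scores: Dict[str, float], second_scores: Dict[str, float]
) -> Dict[str, float]:
    """Normalize first and second scores of one processing unit to sum to 1."""
    combined = {**first_scores, **second_scores}
    total = sum(combined.values())
    if total <= 0.0:
        raise ValueError("scores must have a positive sum")
    return {content: score / total for content, score in combined.items()}


# Hypothetical scores for processing unit 1: "other" and "person_name" tie
# after the first estimation, then a high second score is added.
first = {"other": 0.5, "person_name": 0.5}
second = {"person_name_wording": 1.0}
normalized = normalize_scores(first, second)
```

In this example the share of "person_name" falls from 0.5 to 0.25 while the wording-expression content takes the largest share, so the language model for the section is no longer dominated by person names.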

In utterances for information retrieval, words expressing search conditions tend to be misrecognized often, whereas words in wording expressions tend to be misrecognized less often. The former also tend to show greater variation in expression. The first content estimation unit 21 exploits this tendency: it estimates the category of a search condition using as clues the wording expressions, which are comparatively easy to recognize and vary little. For that reason, it is comparatively difficult for the first content estimation unit 21 to estimate the "wording expression" content at the same time. On the other hand, because the "search condition" content is comparatively easy for the first content estimation unit 21 to estimate, the second content estimation unit 31 can use the presence of a search condition as a clue to estimate the sections of the wording expressions that precede and follow it.

In this way, because the second content estimation unit 31 estimates an appearance probability for each section of the utterance for the second content (wording expression) in addition to the first content (search condition), the content of the utterance to be recognized can be estimated appropriately.

Accordingly, the speech recognition apparatus 100 can estimate specific content within an utterance more accurately and, as a result, can more accurately recognize words and phrases of specific content contained in the utterance.

Next, an overview of the present invention is given. FIGS. 14 and 15 are block diagrams showing an overview of the present invention. FIG. 14 is a block diagram showing a configuration example in which the present invention is applied to an utterance content estimation device. The utterance content estimation device shown in FIG. 14 includes first content estimation means 101, second content estimation means 102, and estimation result output means 103.

The first content estimation means 101 estimates the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is the first specific content. The first content estimation means 101 is realized by the first content estimation unit 21 of the above embodiment.

The second content estimation means 102 estimates the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is the second specific content, which is defined as the content of words that co-occur with words of the first specific content. The second content estimation means 102 is realized by the second content estimation unit 31 of the above embodiment.

The estimation result output means 103 outputs, together, the probability of the first specific content in each processing unit as estimated by the first content estimation means 101 and the probability of the second specific content in each processing unit as estimated by the second content estimation means 102, as information indicating the estimation result of the utterance content. The estimation result output means 103 is realized by the first content estimation unit 21 and the second content estimation unit 31 of the above embodiment, or by a second content estimation unit 31 that outputs the first score and the second score together or after normalization.

In such an utterance content estimation device, the second content estimation means may estimate different probabilities for the second specific content depending on whether it is positioned before or after the co-occurring first specific content.

The second content estimation means may also estimate different probabilities for the second specific content depending on its positional distance from the co-occurring first specific content.

When calculating, with a conditional random field model, a first score indicating the probability of the first specific content in each processing unit, the first content estimation means may use as the features extracted for each processing unit both recognition features, which are based on information obtained in the course of speech recognition, and co-occurrence features, which are based on sets of linguistic information that appear together in the utterance, and may use, for at least one of the weighting coefficients of the features, a weighting coefficient whose value differs depending on the first specific content. The second content estimation means may then calculate a second score indicating the probability of the second specific content in each processing unit by using the first score calculated by the first content estimation means for the first specific content in each processing unit.

The utterance content estimation device may further include first content-specific language model storage means that stores a first content-specific language model corresponding to the first specific content, second content-specific language model storage means that stores a second content-specific language model corresponding to the second specific content, and language model creation means that creates a language model for each processing unit obtained by dividing the utterance to be processed into time sections. The language model creation means may create the language model for each processing unit using the following, as output from the estimation result output means: the first score, which is the probability of the first specific content in each processing unit as estimated by the first content estimation means, and the second score, which is the probability of the second specific content in each processing unit as estimated by the second content estimation means. It may also use the first content-specific language model stored in the first content-specific language model storage means and the second content-specific language model stored in the second content-specific language model storage means.

FIG. 15 is a block diagram showing a configuration example in which the present invention is applied to a language model creation device. The language model creation device shown in FIG. 15 includes first content estimation means 101, second content estimation means 102, first content-specific language model storage means 104, second content-specific language model storage means 105, and language model creation means 106.

The first content estimation means 101 estimates the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is the first specific content. It may be the same as the first content estimation means 101 shown in FIG. 14.

The second content estimation means 102 estimates the probability that the content of each processing unit, obtained by dividing the utterance to be processed into time sections, is the second specific content, which is defined as the content of words that co-occur with words of the first specific content. It may be the same as the second content estimation means 102 shown in FIG. 14.

The first content-specific language model storage means 104 stores a first content-specific language model corresponding to the first specific content. It is realized by the first content-specific language model storage unit 42 of the above embodiment.

The second content-specific language model storage means 105 stores a second content-specific language model corresponding to the second specific content. It is realized by the second content-specific language model storage unit 43 of the above embodiment.

The language model creation means 106 creates a language model for each processing unit obtained by dividing the utterance to be processed into time sections, using the first score, which is the probability of the first specific content in each processing unit as estimated by the first content estimation means; the second score, which is the probability of the second specific content in each processing unit as estimated by the second content estimation means; the first content-specific language model stored in the first content-specific language model storage means; and the second content-specific language model stored in the second content-specific language model storage means.

(Supplementary Note 1) In the utterance content estimation method according to the present invention, different probabilities may be estimated for the second specific content depending on whether it is positioned before or after the co-occurring first specific content.

(Supplementary Note 2) In the utterance content estimation method according to the present invention, different probabilities may be estimated for the second specific content depending on its positional distance from the co-occurring first specific content.

(Supplementary Note 3) In the utterance content estimation method according to the present invention, when a first score indicating the probability of the first specific content in each processing unit is calculated using a conditional random field model, the features extracted for each processing unit may be recognition features, which are based on information obtained in the course of speech recognition, and co-occurrence features, which are based on sets of linguistic information that appear together in the utterance, and at least one of the weighting coefficients of the features may be a weighting coefficient whose value differs depending on the first specific content. When the second score is calculated, a weighting coefficient whose value differs depending on the first specific content may likewise be used. The second content estimation means may use the same processing units as the first content estimation means and may calculate the second score, which indicates the probability of the second specific content in each processing unit, by using the first score calculated by the first content estimation means for the first specific content in each processing unit.

(Supplementary Note 4) The utterance content estimation method according to the present invention may include a step of creating a language model for each processing unit obtained by dividing the utterance to be processed into time sections, using the first score, which is the probability of the first specific content in each processing unit; the second score, which is the probability of the second specific content in each processing unit as estimated by the second content estimation means; a first content-specific language model corresponding to the first specific content; and a second content-specific language model corresponding to the second specific content.

(Supplementary Note 5) The matters set out in Supplementary Notes 1 to 3 above may also be applied to the language model creation method according to the present invention.

(Supplementary Note 6) The utterance content estimation program according to the present invention may cause a computer, in the process of calculating the second score, to estimate different probabilities for the second specific content depending on whether it is positioned before or after the co-occurring first specific content.

(Supplementary Note 7) The utterance content estimation program according to the present invention may cause a computer, in the process of calculating the second score, to estimate different probabilities for the second specific content depending on its positional distance from the co-occurring first specific content.

(Supplementary Note 8) The utterance content estimation program according to the present invention may cause a computer, in the process of calculating the first score, to use a conditional random field model, to use as the features extracted for each processing unit recognition features, which are based on information obtained in the course of speech recognition, and co-occurrence features, which are based on sets of linguistic information that appear together in the utterance, and to use, for at least one of the weighting coefficients of the features, a weighting coefficient whose value differs depending on the first specific content. In the process of calculating the second score, the program may cause the computer to use the same processing units as the first content estimation means and to calculate the second score, which indicates the probability of the second specific content in each processing unit, by using the first score calculated by the first content estimation means for the first specific content in each processing unit.

(Supplementary Note 9) The utterance content estimation program according to the present invention may cause a computer to execute a process of creating a language model for each processing unit obtained by dividing the utterance to be processed into time sections, using the first score, which is the probability of the first specific content in each processing unit; the second score, which is the probability of the second specific content in each processing unit as estimated by the second content estimation means; a first content-specific language model corresponding to the first specific content; and a second content-specific language model corresponding to the second specific content.

(Supplementary Note 10) The matters set out in Supplementary Notes 6 to 8 above may also be applied to the language model creation program according to the present invention.

The present invention is not limited to directly estimating utterance content or to creating a language model; it is suitably applicable to anything that uses an utterance content estimation result or a language model, whether a device, a system, a method, or a program. For example, it is also suitably applicable to speech recognition processing that recognizes specific words and phrases contained in an utterance.

DESCRIPTION OF SYMBOLS
100 Speech recognition device
101 First content estimation means
102 Second content estimation means
103 Estimation result output means
104 First content-specific language model storage means
105 Second content-specific language model storage means
106 Language model creation means
11 Speech recognition unit
21 First content estimation unit
22 First content model storage unit
31 Second content estimation unit
32 Second content model storage unit
41 Language model creation unit
42 First content-specific language model storage unit
43 Second content-specific language model storage unit

Claims

An utterance content estimation device comprising:
first content estimation means for estimating a probability that the content of a processing unit, obtained by dividing an utterance to be processed into time intervals, is a first specific content;
second content estimation means for estimating a probability that the content of a processing unit, obtained by dividing the utterance to be processed into time intervals, is a second specific content defined as the content of words that co-occur with words of the first specific content; and
estimation result output means for outputting together, as information indicating an estimation result of the utterance content, the probability of the first specific content in each processing unit estimated by the first content estimation means and the probability of the second specific content in each processing unit estimated by the second content estimation means.

The utterance content estimation device according to claim 1, wherein the second content estimation means estimates, for the second specific content, a different probability depending on whether it is positioned before or after the co-occurring first specific content.

The utterance content estimation device according to claim 1 or claim 2, wherein the second content estimation means estimates, for the second specific content, a different probability depending on its positional distance from the co-occurring first specific content.
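As an illustration only, the dependence on relative position (before or after) and on distance described in the two preceding claims could be sketched in Python as follows. The function name `second_scores_from_first`, the `offset_weights` table keyed by signed offset, and the rule of taking the maximum over neighboring units are all hypothetical; the claims do not specify how the probability varies.

```python
from typing import Dict, List


def second_scores_from_first(
    first_scores: List[float],
    offset_weights: Dict[int, float],
) -> List[float]:
    """Derive a second score per unit from the first scores of nearby units.

    offset_weights maps a signed offset (unit index minus the index of the
    first-content unit) to a hypothetical co-occurrence weight. A negative
    offset means the unit precedes the first-content unit, a positive offset
    means it follows, and the magnitude is the distance.
    """
    n = len(first_scores)
    scores: List[float] = []
    for i in range(n):
        best = 0.0
        for offset, weight in offset_weights.items():
            j = i - offset  # index of the candidate first-content unit
            if offset != 0 and 0 <= j < n:
                best = max(best, first_scores[j] * weight)
        scores.append(best)
    return scores


# Unit 1 is certainly first-content; units after it are weighted more heavily
# than the unit before it, and the weight decays with distance.
print(second_scores_from_first([0.0, 1.0, 0.0, 0.0], {-1: 0.25, 1: 0.75, 2: 0.5}))
# prints [0.25, 0.0, 0.75, 0.5]
```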

The utterance content estimation device according to any one of claims 1 to 3, wherein
the first content estimation means, when calculating a first score indicating the probability of the first specific content in each processing unit using a conditional random field model, uses, as the features extracted for each processing unit, a recognition feature, which is a feature based on information obtained in the course of speech recognition processing, and a co-occurrence feature, which is a feature based on a set of linguistic information appearing together within an utterance, and uses, as at least one of the weighting coefficients for the features, a weighting coefficient whose value differs depending on the first specific content, and
the second content estimation means calculates a second score indicating the probability of the second specific content in each processing unit by using the first score for the first specific content in each processing unit calculated by the first content estimation means.

The utterance content estimation device according to any one of claims 1 to 4, further comprising:
first content-specific language model storage means for storing a first content-specific language model corresponding to the first specific content;
second content-specific language model storage means for storing a second content-specific language model corresponding to the second specific content; and
language model creation means for creating a language model for each processing unit obtained by dividing the utterance to be processed into time intervals,
wherein the language model creation means creates the language model for each processing unit obtained by dividing the utterance to be processed into time intervals by using: the first score, output from the estimation result output means, which is the probability of the first specific content in each processing unit estimated by the first content estimation means; the second score, which is the probability of the second specific content in each processing unit estimated by the second content estimation means; the first content-specific language model stored in the first content-specific language model storage means; and the second content-specific language model stored in the second content-specific language model storage means.

A language model creation device comprising:
first content estimation means for estimating a probability that the content of a processing unit, obtained by dividing an utterance to be processed into time intervals, is a first specific content;
second content estimation means for estimating a probability that the content of a processing unit, obtained by dividing the utterance to be processed into time intervals, is a second specific content defined as the content of words that co-occur with words of the first specific content;
first content-specific language model storage means for storing a first content-specific language model corresponding to the first specific content;
second content-specific language model storage means for storing a second content-specific language model corresponding to the second specific content; and
language model creation means for creating a language model for each processing unit obtained by dividing the utterance to be processed into time intervals, by using: a first score, which is the probability of the first specific content in each processing unit estimated by the first content estimation means; a second score, which is the probability of the second specific content in each processing unit estimated by the second content estimation means; the first content-specific language model stored in the first content-specific language model storage means; and the second content-specific language model stored in the second content-specific language model storage means.

An utterance content estimation method comprising:
calculating a first score indicating a probability that the content of a processing unit, obtained by dividing an utterance to be processed into time intervals, is a first specific content;
calculating a second score indicating a probability that the content of a processing unit, obtained by dividing the utterance to be processed into time intervals, is a second specific content defined as the content of words that co-occur with words of the first specific content; and
outputting together, as information indicating an estimation result of the utterance content, the first score for the first specific content in each processing unit and the second score for the second specific content in each processing unit.

A language model creation method comprising:
calculating a first score indicating a probability that the content of a processing unit, obtained by dividing an utterance to be processed into time intervals, is a first specific content;
calculating a second score indicating a probability that the content of a processing unit, obtained by dividing the utterance to be processed into time intervals, is a second specific content defined as the content of words that co-occur with words of the first specific content; and
creating a language model for each processing unit obtained by dividing the utterance to be processed into time intervals, by using: the first score for the first specific content in each processing unit; the second score for the second specific content in each processing unit; a first content-specific language model, stored in advance, corresponding to the first specific content; and a second content-specific language model, stored in advance, corresponding to the second specific content.

An utterance content estimation program for causing a computer to execute:
a process of calculating a first score indicating a probability that the content of a processing unit, obtained by dividing an utterance to be processed into time intervals, is a first specific content;
a process of calculating a second score indicating a probability that the content of a processing unit, obtained by dividing the utterance to be processed into time intervals, is a second specific content defined as the content of words that co-occur with words of the first specific content; and
a process of outputting together, as information indicating an estimation result of the utterance content, the first score for the first specific content in each processing unit and the second score for the second specific content in each processing unit.

A language model creation program for causing a computer to execute:
a process of calculating a first score indicating a probability that the content of a processing unit, obtained by dividing an utterance to be processed into time intervals, is a first specific content;
a process of calculating a second score indicating a probability that the content of a processing unit, obtained by dividing the utterance to be processed into time intervals, is a second specific content defined as the content of words that co-occur with words of the first specific content; and
a process of creating a language model for each processing unit obtained by dividing the utterance to be processed into time intervals, by using: the first score for the first specific content in each processing unit; the second score for the second specific content in each processing unit; a first content-specific language model, stored in advance, corresponding to the first specific content; and a second content-specific language model, stored in advance, corresponding to the second specific content.