JP3794597B2 - Topic extraction method and topic extraction program recording medium - Google Patents

Topic extraction method and topic extraction program recording medium

Info

Publication number
JP3794597B2
Authority
JP
Japan
Prior art keywords
word
topic
sentence
frequency
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP16095497A
Other languages
Japanese (ja)
Other versions
JPH117447A (en)
Inventor
克年 大附
達雄 松岡
昭一 松永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP16095497A priority Critical patent/JP3794597B2/en
Publication of JPH117447A publication Critical patent/JPH117447A/en
Application granted granted Critical
Publication of JP3794597B2 publication Critical patent/JP3794597B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Description

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a method for extracting topic words that represent the content of a word string, such as the word recognition result of continuously uttered speech or the words obtained by segmenting text through morphological analysis, and to a recording medium on which a topic word extraction program is recorded.
[0002]
[Prior art]
In extracting topics that represent the content of continuously uttered speech, a common approach is to preselect keywords that depend strongly on one of 5 to 10 subject fields, detect those keywords in the speech interval (keyword spotting), and output as the result the topic on which the detected keyword set shows the highest dependence. Examples are given in: Yokoi, Kawahara, Doshita, "Topic Identification of News Speech Based on Keyword Spotting", IPSJ SIG Technical Report, SLP6-3, 1995; Sakurai, Ariki, "Indexing and Classification of News Speech by Keyword Spotting", IEICE Technical Report, SP96-66, 1996; and R. C. Rose, E. I. Chang, and R. P. Lippmann, "Techniques for Information Retrieval from Voice Messages", Proc. ICASSP-91, pp. 317-320, 1991.
[0003]
Conventional methods for extracting a topic from text, meanwhile, operate by extracting specific passages from the text, and their processing is complicated.
[0004]
[Problems to be solved by the invention]
Conventional topic extraction methods for continuous speech can use only a limited number of keywords, and increasing the number of keywords increases the number of falsely detected ones; moreover, because only a few topic fields are distinguished, the results cannot be used for information retrieval or indexing. Conventional topic extraction from text is also complicated, since it must search for specific passages; applied to topic extraction from continuous speech, it fails whenever word recognition errs at those particular passages.
[0005]
An object of the present invention is to provide a topic word extraction method capable of extracting topic words by relatively simple processing, and a recording medium on which a program for the method is recorded.
[0006]
[Means for Solving the Problems]
The topic extraction model of the present invention is built from a large number of texts, each consisting of a body and its headline. Each text is morphologically analyzed to obtain body words and topic words (the words in the headline). The appearance frequency of each body word, the appearance frequency of each topic word, and the co-occurrence frequency with which a combination of a body word and a topic word appears in one and the same text are counted; from these frequencies the relevance between each topic word and each body word is obtained, and the results are stored as the topic extraction model.
[0007]
The topic extraction method of the present invention uses this topic extraction model. An input word sequence is obtained by speech recognition of input speech or by morphological analysis of input text. Referring to the topic extraction model, the relevance between each topic word and each word of the input word sequence is obtained, giving a relevance sequence for each topic word; from each relevance sequence the relevance of that topic word to the input word sequence as a whole is computed; and the topic words corresponding to the largest of these relevances are output as the topics of the input speech or text.
[0008]
The recording medium of the present invention stores a program for causing a computer to execute the topic extraction method of the present invention.
[0009]
DETAILED DESCRIPTION OF THE INVENTION
First, an embodiment of the topic extraction model of the present invention and of the method for creating it will be described.
The topic extraction model is learned (created) from a large number of pairs, each consisting of a text that discusses some topic and several topic words representing its content. As an example, when the model is learned from the bodies and headlines of newspaper articles, the headline and body of each article in roughly five years of newspaper text are extracted (S1) and morphologically analyzed (S2), dividing them into words (morphemes) and yielding headline morphemes (topic words) and body morphemes (in-sentence words).
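As an illustration only, the data preparation of steps S1 and S2 (together with the content-word filter of step S5 described below) might be organized as in the following sketch; `tokenize` is a hypothetical stand-in for a morphological analyzer, which the patent does not prescribe, and the part-of-speech tag names are likewise assumptions.

```python
def tokenize(text):
    """Stand-in morphological analyzer returning [(word, pos), ...].
    A real system would call a Japanese analyzer here; this stub merely
    splits on whitespace and tags every token as a noun."""
    return [(w, "noun") for w in text.split()]

# Assumed tag set for the content-word filter of step S5.
CONTENT_POS = {"noun", "verb", "adjective", "adjectival-verb", "adverb"}

def prepare_learning_data(articles):
    """Steps S1-S2 (+ S5): turn (headline, body) article pairs into
    (topic words, in-sentence words) pairs of content words."""
    pairs = []
    for headline, body in articles:
        topic_words = [w for w, p in tokenize(headline) if p in CONTENT_POS]
        body_words = [w for w, p in tokenize(body) if p in CONTENT_POS]
        pairs.append((topic_words, body_words))
    return pairs
```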
[0010]
For these topic words and in-sentence words, the relevance between each in-sentence word and each topic word is obtained from their appearance frequencies and co-occurrence frequencies in the large data set. The number of possible combinations of in-sentence words and topic words is enormous, however. In this embodiment, therefore, topic words are restricted to words appearing at least twice (S3), and in-sentence words to the 150,000 words of highest appearance frequency (S4). Further, from the standpoint of information retrieval, attention is paid to content words, which are considered to convey more semantic information: for both topic words and in-sentence words, only nouns, verbs, adjectives, adjectival verbs, and adverbs are retained (S5). Finally, combinations of a topic word and an in-sentence word that appear together in the same article only once are excluded; only combinations appearing in the same article two or more times are kept (S6). As a result, the total frequency of topic words fell from 12.3×10⁶ to 6.3×10⁶ and their vocabulary from 136×10³ to 74×10³; the total frequency of in-sentence words fell from 218.8×10⁶ to 90.1×10⁶ and their vocabulary from 640×10³ to 147×10³; and about 58 million co-occurrence combinations occurred two or more times.
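Continuing the sketch, the counting and pruning of steps S3, S4, and S6 could look as follows; counting a co-occurrence once per article in which the pair appears together is one reading of the definition in paragraph [0006].

```python
from collections import Counter

def count_and_prune(pairs):
    """Steps S3, S4, S6: count frequencies, then keep topic words seen
    at least twice, the 150,000 most frequent in-sentence words, and
    only (word, topic) pairs co-occurring in at least two articles."""
    topic_freq, body_freq, cooc = Counter(), Counter(), Counter()
    for topic_words, body_words in pairs:
        topic_freq.update(topic_words)
        body_freq.update(body_words)
        for t in set(topic_words):              # once per article
            for w in set(body_words):
                cooc[(w, t)] += 1
    topics = {t for t, f in topic_freq.items() if f >= 2}        # S3
    bodies = {w for w, _ in body_freq.most_common(150_000)}      # S4
    cooc = {p: f for p, f in cooc.items()
            if f >= 2 and p[0] in bodies and p[1] in topics}     # S6
    return topic_freq, body_freq, cooc
```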
[0011]
For these roughly 58 million combinations, the relevance between in-sentence words and topic words is obtained from the word appearance frequencies and the co-occurrence frequencies. The relevance between in-sentence word wi and topic word tj is obtained as follows.

Relevance based on mutual information:

I(wi : tj) = log( P(wi, tj) / ( P(wi)·P(tj) ) ) …(1)

P(wi, tj): probability that wi and tj appear together; P(wi): appearance probability of wi; P(tj): appearance probability of tj.
Relevance based on the χ² method:

χij² = (fij − Fij)² / Fij
[0012]
[Expression 1: definition of the theoretical (expected) frequency Fij, rendered as an image (Figure 0003794597) in the original.]
N: number of in-sentence word types; M: number of topic word types;
fij: frequency of in-sentence word wi for topic word tj;
Fij: theoretical (expected) frequency of in-sentence word wi for topic word tj.

In the mutual-information measure, if no co-occurrence of in-sentence word wi and topic word tj is observed in the learning data, then P(wi, tj) = 0, which causes a problem when the relevances are summed. Accordingly, when no co-occurrence is observed it is treated as though no information were obtained, and the relevance based on mutual information is actually computed as in the following expression.
[0013]
I(wi : tj) = log( P(wi, tj) / ( P(wi)·P(tj) ) )  when wi and tj co-occur in the learning data
I(wi : tj) = 0  when no co-occurrence is observed
The theoretical frequency Fij of the χ² method, on the other hand, is the frequency with which in-sentence word wi would appear if it appeared with equal probability for every topic word. If the actual appearance frequency deviates greatly from the theoretical frequency, the in-sentence word occurs disproportionately often with that topic word. In the χ² formula above, however, the relevance also becomes positive when the actual frequency fij is smaller than the theoretical frequency Fij, so the relevance based on the χ² method is actually computed as in the following expression.
[0014]
χij² = (fij − Fij)² / Fij  when fij ≥ Fij
χij² = 0  when fij < Fij
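In code, the two relevance measures with their zero floors might be computed as below; estimating the probabilities as simple relative frequencies over a single grand total is an assumption, since the patent does not specify the estimator.

```python
import math

def mi_relevance(f_wt, f_w, f_t, total):
    """Mutual-information relevance with the zero floor of [0013].
    f_wt: co-occurrence count; f_w, f_t: marginal counts; total: grand
    total used for relative-frequency estimates (an assumption)."""
    if f_wt == 0:
        return 0.0                      # no co-occurrence observed
    p_wt, p_w, p_t = f_wt / total, f_w / total, f_t / total
    return math.log(p_wt / (p_w * p_t))

def chi2_relevance(f_ij, F_ij):
    """Chi-square relevance with the zero floor of [0014]."""
    if f_ij < F_ij:
        return 0.0                      # under-represented pair
    return (f_ij - F_ij) ** 2 / F_ij
```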
Accordingly, for each combination of an in-sentence word wi and a topic word tj obtained in step S6, the frequencies P(wi), P(tj), and P(wi, tj), or fij, are computed (S7) and stored in the frequency table 11. This is repeated until the learning data is exhausted (S8). When the learning data ends, the relevance I(wi : tj) or χij² is calculated from the frequencies accumulated in the frequency table 11, and the topic extraction model is obtained (S9).
[0015]
The topic extraction model therefore stores, as shown in FIG. 2A, for each topic word type t1, t2, …, tM, its relevance to every in-sentence word that co-occurs with it two or more times: for t1 the relevances r111, r112, r113, … to w11, w12, w13, …; for t2 the relevances r211, r212, r213, … to w21, w22, w23, …; and likewise for the remaining topic words.
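A dictionary per topic word is one natural in-memory layout for the model of FIG. 2A (an implementation choice, not one the patent prescribes). Built from the pruned counts of step S6 and the relevance functions sketched above:

```python
def build_model(topic_freq, body_freq, cooc, total):
    """Steps S7-S9: topic extraction model laid out as
    {topic word: {in-sentence word: relevance}} (layout assumed)."""
    model = {}
    for (w, t), f_wt in cooc.items():
        r = mi_relevance(f_wt, body_freq[w], topic_freq[t], total)
        if r > 0.0:             # zero-relevance entries contribute nothing
            model.setdefault(t, {})[w] = r
    return model
```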
[0016]
Next, a method of extracting topics from a continuous input word string using this topic extraction model will be described with reference to FIG. 2B. When continuously uttered speech is the input, word speech recognition is applied to the input speech (S1), and a word sequence w1, w2, …, wn is obtained as the recognition result (S2). For each word of this sequence, the relevance to each of the topic words t1, t2, …, tM is looked up in the topic extraction model 11: for recognized word w1 the relevances r11, r21, …, rM1 to topic words t1, t2, …, tM are obtained; for word w2 the relevances r12, r22, …, rM2; and so on.
[0017]
For each topic word t1, t2, …, tM, the sum of its relevances to the recognized words w1, w2, …, wn, that is, its relevance Rj to the word sequence, is computed. For topic word t1 the sum R1 = r11 + r12 + … + r1n = Σ(k=1…n) r1k is obtained; for t2 the sum R2 = Σ(k=1…n) r2k; and similarly R3, …, RM (S3). The topic words corresponding to the Q largest (Q being an integer of 1 or more) of the relevances R1, …, RM are taken as the topics of the word sequence (S4). Q may be 1, but is usually several, for example about 5. The topic words corresponding to the largest of R1, …, RM may also be output as ranked candidates.
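Steps S3 and S4 then reduce to summing per-word relevances and keeping the Q largest totals, for example:

```python
def extract_topics(model, words, q=5):
    """Steps S3-S4 of FIG. 2B: sum per-word relevances per topic word
    and return the Q best (Q = 5 as in the embodiment)."""
    totals = {t: sum(rel.get(w, 0.0) for w in words)
              for t, rel in model.items()}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:q]
```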
[0018]
Topics can also be extracted from a word sequence derived from text: the text is input (S5) and morphologically analyzed (S6) to obtain a word string w1, w2, …, wn of morphemes, which is processed with the topic extraction model 11 in the same way as speech input to extract topics for the text.
The relevance based on the mutual information of wi and tj was determined above by equation (1), that is, by the mutual information between two points. The mutual information among n points, by contrast, is defined by the following equation.
[0019]
[Expression 2: the mutual information I(x1 : x2 : … : xn) among n points, equation (2), rendered as an image (Figure 0003794597) in the original; the products Π in the expression are taken over all combinations of distinct subscripts.]
Thus, taking one of x1, x2, …, xn as a topic word and the other n−1 as in-sentence words, their mutual information can be obtained as I(x1 : x2 : … : xn). By learning the relevance between several in-sentence words and one topic word in this way, a topic extraction model can be trained in which, for example, the relevance between "computer" and "Internet", or between "network" and "Internet", is not especially large, yet the relevance to "Internet" becomes large when "computer" and "network" occur in the same sentence. That is, where the relevance of equation (1) fails to extract "Internet" as a topic, the relevance of equation (2) may extract it, yielding a more appropriate topic.
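The patent gives no implementation for the n-point measure; purely as a hypothetical illustration, its effect could be approximated with a second relevance table keyed on pairs of in-sentence words, consulted whenever both members of a pair occur in the input:

```python
from itertools import combinations

def extract_topics_with_pairs(model, pair_model, words, q=5):
    """Illustrative variant (an assumption, not the patent's formula):
    pair_model maps frozenset({w1, w2}) -> {topic: relevance}, so that
    e.g. ("computer", "network") together can boost "Internet"."""
    totals = {t: sum(rel.get(w, 0.0) for w in words)
              for t, rel in model.items()}
    for w1, w2 in combinations(set(words), 2):
        for t, r in pair_model.get(frozenset((w1, w2)), {}).items():
            totals[t] = totals.get(t, 0.0) + r
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:q]
```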
[0020]
The relevance Rk between topic word tk and the word sequence w1, w2, …, wn is obtained as the sum rk1 + rk2 + … + rkn of the relevances of tk to the individual words. When forming this sum, weights s1, s2, …, sn may be attached to the words, computing rk1·s1 + rk2·s2 + … + rkn·sn, so as to obtain a more appropriate relevance Rk. As the weights s1, s2, …, sn one can use the confidence of each word w1, w2, …, wn at speech recognition time (its acoustic likelihood) or its linguistic likelihood, that is, the grammatical and linguistic probability of the word given the preceding words, as given by the language model used in large-vocabulary speech recognition.
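The weighted summation is a small change to the earlier sketch, with per-word weights such as recognition likelihoods:

```python
def extract_topics_weighted(model, words, weights, q=5):
    """Weighted variant of steps S3-S4: each input word contributes its
    relevance scaled by a per-word weight (e.g. an acoustic or
    linguistic likelihood supplied by the recognizer)."""
    totals = {t: sum(rel.get(w, 0.0) * s for w, s in zip(words, weights))
              for t, rel in model.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:q]
```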
[0021]
When topic extraction is performed on the word sequences of a speech recognition result, not only the first-ranked candidate sequence but the candidates up to rank b, (w1-1, w1-2, …, w1-n1), (w2-1, w2-2, …, w2-n2), …, (wb-1, wb-2, …, wb-nb), may be used, and larger weights may be given to higher-ranked candidates. In this case many of the candidate sequences from rank 1 to rank b differ from one another by only one or two words. If these candidate sequences are therefore merged, sharing their identical words, into a multi-word tree structure, word network, or word lattice, and topic extraction for the rank-1 through rank-b candidates is carried out on that structure, the multiple candidate sequences can be stored and processed in a memory of small capacity.
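A rank-weighted combination over the top b candidate sequences might be sketched as follows; the linear weighting is an assumption (the patent says only that higher ranks may receive larger weights), and the memory-saving lattice merge is not shown.

```python
def extract_topics_nbest(model, candidates, q=5):
    """Top-b candidate sequences, best first, combined with assumed
    linear rank weights before selecting the Q best topic words."""
    b = len(candidates)
    totals = {}
    for rank, words in enumerate(candidates):
        weight = (b - rank) / b                 # assumed weighting scheme
        for t, rel in model.items():
            score = weight * sum(rel.get(w, 0.0) for w in words)
            totals[t] = totals.get(t, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:q]
```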
[0022]
[Effects of the Invention]
The invention was evaluated on transcriptions of news speech and on the recognition results of a large-vocabulary continuous speech recognition system with a 20,000-word vocabulary. Topics assigned to the transcriptions by hand by three subjects served as the reference. With the top five topic words output by the topic extraction model taken as the result, the precision (the proportion of correct topic words among the extracted topic words) exceeded 70% against the topics assigned by the three subjects. The precision of topic extraction on recognition results with a word error rate of 25% likewise exceeded 65%. Since the overlap between the topics assigned by the different subjects was about 70%, this topic extraction accuracy can be regarded as usable. Using the χ² method as the relevance measure gave better results than using mutual information.
[0023]
According to the present invention, by using a topic extraction model that has learned the relevance between very many in-sentence words and topic words from a large amount of text data, detailed topic extraction can be performed both on text and on large-vocabulary continuous speech recognition results that contain errors.
That is, in topic extraction from speech, the use of continuous speech recognition makes it possible to exploit far more of the information in the speech than methods based on keyword spotting, which detect only a limited number of keywords; and by extracting several words (topic words) that represent the content of the speech, more detailed topics are obtained than with topic extraction (topic identification or topic recognition) that merely classifies speech into a few fields.
[0024]
In particular, conventional topic extraction from text requires complicated processing because it extracts passages standing in specific relations, whereas the present invention can be carried out comparatively simply. For continuous speech especially, a recognition error at such a specific passage is fatal to the conventional approach, while the present invention examines the relevance of every word of the whole text and can therefore extract topics correctly.
[0025]
Such correct extraction is possible because the method uses a topic extraction model, created from a large amount of text data, that stores the relevance between each topic word and each word.
[Brief description of the drawings]
FIG. 1 is a flowchart showing a model creation method used in the present invention.
FIG. 2A shows an example of the topic extraction model used in the present invention, and FIG. 2B shows the topic extraction method of the present invention.

Claims (7)

1. A processing method performed by a computer having a frequency table and a topic extraction memory, the method taking a word sequence as input and extracting, by means of a topic extraction model, a plurality of topic words that represent the content of the word sequence, the method comprising:
analyzing article texts, using as learning data a large number of article texts each consisting of a body and its headline, to obtain topic words from the headlines of the article texts and in-sentence words from the bodies of the article texts;
counting the appearance frequency of each of the topic words, the appearance frequency of each of the in-sentence words, and the co-occurrence frequency with which each combination of a topic word and an in-sentence word occurs together in one article text, and storing these in the frequency table serving as recording means;
referring to the frequency table and obtaining, on the basis of mutual information, the relevance between each topic word and each in-sentence word from the appearance frequencies of the topic words, the appearance frequencies of the in-sentence words, and the co-occurrence frequencies;
setting the relevance between a topic word and an in-sentence word to zero when no co-occurrence of that topic word and that in-sentence word has been observed;
storing in the topic extraction memory a topic extraction model consisting of the obtained relevances between topic words and in-sentence words together with the topic word and the in-sentence word of each relevance;
for an input word sequence, referring to the topic extraction model stored in the topic extraction memory and, for each topic word of the topic extraction model, associating each input word of the word sequence with an in-sentence word of the topic extraction model and obtaining the relevance between that topic word and each input word, thereby forming a relevance sequence for that topic word;
for each topic word, obtaining the sum of the relevances in the relevance sequence of that topic word as the total relevance of that topic word to the input word sequence; and
outputting the topic words corresponding to the Q largest (Q being an integer of 1 or more) of these total relevances.
2. A processing method performed by a computer having a frequency table and a topic extraction memory, the method taking a word sequence as input and extracting, by means of a topic extraction model, a plurality of topic words that represent the content of the word sequence, the method comprising:
analyzing article texts, using as learning data a large number of article texts each consisting of a body and its headline, to obtain topic words from the headlines of the article texts and in-sentence words from the bodies of the article texts;
counting the appearance frequency of each of the topic words, the appearance frequency of each of the in-sentence words, and the co-occurrence frequency with which each combination of a topic word and an in-sentence word occurs together in one article text, and storing these in the frequency table serving as recording means;
referring to the frequency table and obtaining, on the basis of the χ² method, the relevance between each topic word and each in-sentence word from the frequency f of the in-sentence word with respect to the topic word and the expected frequency F, using the appearance frequencies of the topic words, the appearance frequencies of the in-sentence words, and the co-occurrence frequencies;
setting the relevance between a topic word and an in-sentence word to zero when the appearance frequency f is smaller than the expected frequency F;
storing in the topic extraction memory a topic extraction model consisting of the obtained relevances between topic words and in-sentence words together with the topic word and the in-sentence word of each relevance;
for an input word sequence, referring to the topic extraction model stored in the topic extraction memory and, for each topic word of the topic extraction model, associating each input word of the word sequence with an in-sentence word of the topic extraction model and obtaining the relevance between that topic word and each input word, thereby forming a relevance sequence for that topic word;
for each topic word, obtaining the sum of the relevances in the relevance sequence of that topic word as the total relevance of that topic word to the input word sequence; and
outputting the topic words corresponding to the Q largest (Q being an integer of 1 or more) of these total relevances.
3. The topic word extraction method according to claim 1 or 2, wherein, in obtaining the sum of the relevances, each relevance of the relevance sequence is weighted by the likelihood of the input word corresponding to it.
4. The topic word extraction method according to any one of claims 1 to 3, wherein the input word sequence is obtained by word speech recognition of a continuous speech signal.
5. The topic word extraction method according to any one of claims 1 to 3, wherein input text is morphologically analyzed and the morphemes of the analysis result are used as the input word sequence.
6. The topic word extraction method according to claim 4 or 5, wherein a plurality of top-ranked candidate sequences of the recognition result or of the analysis result are used as the input word sequence, and the sum of the relevances is obtained with larger weights for higher-ranked candidates.
7. A recording medium on which is recorded a computer program for causing a computer to execute the topic word extraction method according to any one of claims 1 to 6.
JP16095497A 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium Expired - Fee Related JP3794597B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP16095497A JP3794597B2 (en) 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP16095497A JP3794597B2 (en) 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium

Publications (2)

Publication Number Publication Date
JPH117447A JPH117447A (en) 1999-01-12
JP3794597B2 true JP3794597B2 (en) 2006-07-05

Family

ID=15725794

Family Applications (1)

Application Number Title Priority Date Filing Date
JP16095497A Expired - Fee Related JP3794597B2 (en) 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium

Country Status (1)

Country Link
JP (1) JP3794597B2 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100327584B1 (en) 1999-07-01 2002-03-14 박종섭 Method of forming high efficiency capacitor in semiconductor device
JP4489994B2 (en) 2001-05-11 2010-06-23 富士通株式会社 Topic extraction apparatus, method, program, and recording medium for recording the program
DE602004025616D1 (en) * 2003-12-26 2010-04-01 Kenwood Corp A facility controller, method and program
JP3923513B2 (en) 2004-06-08 2007-06-06 松下電器産業株式会社 Speech recognition apparatus and speech recognition method
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores
JP4423327B2 (en) 2005-02-08 2010-03-03 日本電信電話株式会社 Information communication terminal, information communication system, information communication method, information communication program, and recording medium recording the same
JP4791984B2 (en) * 2007-02-27 2011-10-12 株式会社東芝 Apparatus, method and program for processing input voice
JP5676552B2 (en) * 2012-12-17 2015-02-25 日本電信電話株式会社 Daily word extraction apparatus, method, and program

Also Published As

Publication number Publication date
JPH117447A (en) 1999-01-12

Similar Documents

Publication Publication Date Title
Halteren et al. Improving accuracy in word class tagging through the combination of machine learning systems
US11182435B2 (en) Model generation device, text search device, model generation method, text search method, data structure, and program
US8543565B2 (en) System and method using a discriminative learning approach for question answering
James The application of classical information retrieval techniques to spoken documents
US20080249764A1 (en) Smart Sentiment Classifier for Product Reviews
Vivaldi et al. Improving term extraction by system combination using boosting
CN108538286A (en) A kind of method and computer of speech recognition
CN111159363A (en) Knowledge base-based question answer determination method and device
Santos et al. Assessing the impact of contextual embeddings for Portuguese named entity recognition
CN109063182B (en) Content recommendation method based on voice search questions and electronic equipment
Ali et al. Genetic approach for Arabic part of speech tagging
CN108345694B (en) Document retrieval method and system based on theme database
Jayasiriwardene et al. Keyword extraction from Tweets using NLP tools for collecting relevant news
Weerasinghe et al. Feature Vector Difference based Authorship Verification for Open-World Settings.
JP3794597B2 (en) Topic extraction method and topic extraction program recording medium
Chifu et al. A system for detecting professional skills from resumes written in natural language
Onyenwe et al. Toward an effective igbo part-of-speech tagger
CN111767733A (en) Document security classification discrimination method based on statistical word segmentation
CN110874408B (en) Model training method, text recognition device and computing equipment
US20110106849A1 (en) New case generation device, new case generation method, and new case generation program
Llitjós et al. Improving pronunciation accuracy of proper names with language origin classes
Rolih Applying coreference resolution for usage in dialog systems
Sonbhadra et al. Email classification via intention-based segmentation
De Kruijf et al. Training a Dutch (+ English) BERT model applicable for the legal domain
Nothman Learning named entity recognition from Wikipedia

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20060306

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20060407

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090421

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100421

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110421

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120421

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130421

Year of fee payment: 7

LAPS Cancellation because of no payment of annual fees