JP3794597B2 - Topic extraction method and topic extraction program recording medium - Google Patents

Topic extraction method and topic extraction program recording medium

Info

Publication number
JP3794597B2
Authority
JP
Japan
Prior art keywords
word
topic
sentence
frequency
relevance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
JP16095497A
Other languages
Japanese (ja)
Other versions
JPH117447A (en)
Inventor
克年 大附
達雄 松岡
昭一 松永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP16095497A priority Critical patent/JP3794597B2/en
Publication of JPH117447A publication Critical patent/JPH117447A/en
Application granted granted Critical
Publication of JP3794597B2 publication Critical patent/JP3794597B2/en
Anticipated expiration legal-status Critical
Expired - Fee Related legal-status Critical Current

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Description

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a method for extracting topic words that represent the content of a word string, such as the word recognition result of continuously uttered speech or the words obtained by segmenting text through morphological analysis, and to a recording medium on which a topic word extraction program is recorded.
[0002]
[Prior art]
In extracting topics that represent the content of continuously uttered speech, a common approach is to preselect keywords that depend strongly on one of 5 to 10 subject fields, detect those keywords in the speech interval (keyword spotting), and output as the result the topic on which the detected keyword set shows the highest dependence. Examples are given in: Yokoi, Kawahara, Doshita, "Topic Identification of News Speech Based on Keyword Spotting", IPSJ SIG Technical Report, SLP6-3, 1995; Sakurai, Ariki, "Indexing and Classification of News Speech by Keyword Spotting", IEICE Technical Report, SP96-66, 1996; and R. C. Rose, E. I. Chang, and R. P. Lippmann, "Techniques for Information Retrieval from Voice Messages", Proc. ICASSP-91, pp. 317-320, 1991.
[0003]
Conventional methods for extracting a topic from text, meanwhile, operate by extracting specific passages from the text, and their processing is complicated.
[0004]
[Problems to be solved by the invention]
Conventional topic extraction methods for continuous speech can use only a limited number of keywords, and increasing the number of keywords increases the number of falsely detected ones; moreover, because only a few topic fields are distinguished, the results cannot be used for information retrieval or indexing. Conventional topic extraction from text is also complicated, since it must search for specific passages; applied to topic extraction from continuous speech, it fails whenever word recognition errs at those particular passages.
[0005]
An object of the present invention is to provide a topic word extraction method capable of extracting topic words by relatively simple processing, and a recording medium on which a program for the method is recorded.
[0006]
[Means for Solving the Problems]
The topic extraction model of the present invention is built from a large number of texts, each consisting of a body and its headline. Each text is morphologically analyzed to obtain body words and topic words (the words in the headline). The appearance frequency of each body word, the appearance frequency of each topic word, and the co-occurrence frequency with which a combination of a body word and a topic word appears in one and the same text are counted; from these frequencies the relevance between each topic word and each body word is obtained, and the results are stored as the topic extraction model.
[0007]
The topic extraction method of the present invention uses this topic extraction model. An input word sequence is obtained by speech recognition of input speech or by morphological analysis of input text. Referring to the topic extraction model, the relevance between each topic word and each word of the input word sequence is obtained, giving a relevance sequence for each topic word; from each relevance sequence the relevance of that topic word to the input word sequence as a whole is computed; and the topic words corresponding to the largest of these relevances are output as the topics of the input speech or text.
[0008]
The recording medium of the present invention stores a program for causing a computer to execute the topic extraction method of the present invention.
[0009]
DETAILED DESCRIPTION OF THE INVENTION
First, an embodiment of the topic extraction model of the present invention and of the method for creating it will be described.
The topic extraction model is learned (created) from a large number of pairs, each consisting of a text that discusses some topic and several topic words representing its content. As an example, when the model is learned from the bodies and headlines of newspaper articles, the headline and body of each article in roughly five years of newspaper text are extracted (S1) and morphologically analyzed (S2), dividing them into words (morphemes) and yielding headline morphemes (topic words) and body morphemes (in-sentence words).
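As an illustration only, the data preparation of steps S1 and S2 (together with the content-word filter of step S5 described below) might be organized as in the following sketch; `tokenize` is a hypothetical stand-in for a morphological analyzer, which the patent does not prescribe, and the part-of-speech tag names are likewise assumptions.

```python
def tokenize(text):
    """Stand-in morphological analyzer returning [(word, pos), ...].
    A real system would call a Japanese analyzer here; this stub merely
    splits on whitespace and tags every token as a noun."""
    return [(w, "noun") for w in text.split()]

# Assumed tag set for the content-word filter of step S5.
CONTENT_POS = {"noun", "verb", "adjective", "adjectival-verb", "adverb"}

def prepare_learning_data(articles):
    """Steps S1-S2 (+ S5): turn (headline, body) article pairs into
    (topic words, in-sentence words) pairs of content words."""
    pairs = []
    for headline, body in articles:
        topic_words = [w for w, p in tokenize(headline) if p in CONTENT_POS]
        body_words = [w for w, p in tokenize(body) if p in CONTENT_POS]
        pairs.append((topic_words, body_words))
    return pairs
```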
[0010]
For these topic words and in-sentence words, the relevance between each in-sentence word and each topic word is obtained from their appearance frequencies and co-occurrence frequencies in the large data set. The number of possible combinations of in-sentence words and topic words is enormous, however. In this embodiment, therefore, topic words are restricted to words appearing at least twice (S3), and in-sentence words to the 150,000 words of highest appearance frequency (S4). Further, from the standpoint of information retrieval, attention is paid to content words, which are considered to convey more semantic information: for both topic words and in-sentence words, only nouns, verbs, adjectives, adjectival verbs, and adverbs are retained (S5). Finally, combinations of a topic word and an in-sentence word that appear together in the same article only once are excluded; only combinations appearing in the same article two or more times are kept (S6). As a result, the total frequency of topic words fell from 12.3×10⁶ to 6.3×10⁶ and their vocabulary from 136×10³ to 74×10³; the total frequency of in-sentence words fell from 218.8×10⁶ to 90.1×10⁶ and their vocabulary from 640×10³ to 147×10³; and about 58 million co-occurrence combinations occurred two or more times.
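Continuing the sketch, the counting and pruning of steps S3, S4, and S6 could look as follows; counting a co-occurrence once per article in which the pair appears together is one reading of the definition in paragraph [0006].

```python
from collections import Counter

def count_and_prune(pairs):
    """Steps S3, S4, S6: count frequencies, then keep topic words seen
    at least twice, the 150,000 most frequent in-sentence words, and
    only (word, topic) pairs co-occurring in at least two articles."""
    topic_freq, body_freq, cooc = Counter(), Counter(), Counter()
    for topic_words, body_words in pairs:
        topic_freq.update(topic_words)
        body_freq.update(body_words)
        for t in set(topic_words):              # once per article
            for w in set(body_words):
                cooc[(w, t)] += 1
    topics = {t for t, f in topic_freq.items() if f >= 2}        # S3
    bodies = {w for w, _ in body_freq.most_common(150_000)}      # S4
    cooc = {p: f for p, f in cooc.items()
            if f >= 2 and p[0] in bodies and p[1] in topics}     # S6
    return topic_freq, body_freq, cooc
```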
[0011]
For these roughly 58 million combinations, the relevance between in-sentence words and topic words is obtained from the word appearance frequencies and the co-occurrence frequencies. The relevance between in-sentence word wi and topic word tj is obtained as follows.

Relevance based on mutual information:

I(wi : tj) = log( P(wi, tj) / ( P(wi)·P(tj) ) ) …(1)

P(wi, tj): probability that wi and tj appear together; P(wi): appearance probability of wi; P(tj): appearance probability of tj.
Relevance based on the χ² method:

χij² = (fij − Fij)² / Fij
[0012]
[Expression 1: definition of the theoretical (expected) frequency Fij, rendered as an image (Figure 0003794597) in the original.]
N: number of in-sentence word types; M: number of topic word types;
fij: frequency of in-sentence word wi for topic word tj;
Fij: theoretical (expected) frequency of in-sentence word wi for topic word tj.

In the mutual-information measure, if no co-occurrence of in-sentence word wi and topic word tj is observed in the learning data, then P(wi, tj) = 0, which causes a problem when the relevances are summed. Accordingly, when no co-occurrence is observed it is treated as though no information were obtained, and the relevance based on mutual information is actually computed as in the following expression.
[0013]
I(wi : tj) = log( P(wi, tj) / ( P(wi)·P(tj) ) )  when wi and tj co-occur in the learning data
I(wi : tj) = 0  when no co-occurrence is observed
The theoretical frequency Fij of the χ² method, on the other hand, is the frequency with which in-sentence word wi would appear if it appeared with equal probability for every topic word. If the actual appearance frequency deviates greatly from the theoretical frequency, the in-sentence word occurs disproportionately often with that topic word. In the χ² formula above, however, the relevance also becomes positive when the actual frequency fij is smaller than the theoretical frequency Fij, so the relevance based on the χ² method is actually computed as in the following expression.
[0014]
χij² = (fij − Fij)² / Fij  when fij ≥ Fij
χij² = 0  when fij < Fij
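In code, the two relevance measures with their zero floors might be computed as below; estimating the probabilities as simple relative frequencies over a single grand total is an assumption, since the patent does not specify the estimator.

```python
import math

def mi_relevance(f_wt, f_w, f_t, total):
    """Mutual-information relevance with the zero floor of [0013].
    f_wt: co-occurrence count; f_w, f_t: marginal counts; total: grand
    total used for relative-frequency estimates (an assumption)."""
    if f_wt == 0:
        return 0.0                      # no co-occurrence observed
    p_wt, p_w, p_t = f_wt / total, f_w / total, f_t / total
    return math.log(p_wt / (p_w * p_t))

def chi2_relevance(f_ij, F_ij):
    """Chi-square relevance with the zero floor of [0014]."""
    if f_ij < F_ij:
        return 0.0                      # under-represented pair
    return (f_ij - F_ij) ** 2 / F_ij
```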
Accordingly, for each combination of an in-sentence word wi and a topic word tj obtained in step S6, the frequencies P(wi), P(tj), and P(wi, tj), or fij, are computed (S7) and stored in the frequency table 11. This is repeated until the learning data is exhausted (S8). When the learning data ends, the relevance I(wi : tj) or χij² is calculated from the frequencies accumulated in the frequency table 11, and the topic extraction model is obtained (S9).
[0015]
The topic extraction model therefore stores, as shown in FIG. 2A, for each topic word type t1, t2, …, tM, its relevance to every in-sentence word that co-occurs with it two or more times: for t1 the relevances r111, r112, r113, … to w11, w12, w13, …; for t2 the relevances r211, r212, r213, … to w21, w22, w23, …; and likewise for the remaining topic words.
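A dictionary per topic word is one natural in-memory layout for the model of FIG. 2A (an implementation choice, not one the patent prescribes). Built from the pruned counts of step S6 and the relevance functions sketched above:

```python
def build_model(topic_freq, body_freq, cooc, total):
    """Steps S7-S9: topic extraction model laid out as
    {topic word: {in-sentence word: relevance}} (layout assumed)."""
    model = {}
    for (w, t), f_wt in cooc.items():
        r = mi_relevance(f_wt, body_freq[w], topic_freq[t], total)
        if r > 0.0:             # zero-relevance entries contribute nothing
            model.setdefault(t, {})[w] = r
    return model
```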
[0016]
Next, a method of extracting topics from a continuous input word string using this topic extraction model will be described with reference to FIG. 2B. When continuously uttered speech is the input, word speech recognition is applied to the input speech (S1), and a word sequence w1, w2, …, wn is obtained as the recognition result (S2). For each word of this sequence, the relevance to each of the topic words t1, t2, …, tM is looked up in the topic extraction model 11: for recognized word w1 the relevances r11, r21, …, rM1 to topic words t1, t2, …, tM are obtained; for word w2 the relevances r12, r22, …, rM2; and so on.
[0017]
For each topic word t1, t2, …, tM, the sum of its relevances to the recognized words w1, w2, …, wn, that is, its relevance Rj to the word sequence, is computed. For topic word t1 the sum R1 = r11 + r12 + … + r1n = Σ(k=1…n) r1k is obtained; for t2 the sum R2 = Σ(k=1…n) r2k; and similarly R3, …, RM (S3). The topic words corresponding to the Q largest (Q being an integer of 1 or more) of the relevances R1, …, RM are taken as the topics of the word sequence (S4). Q may be 1, but is usually several, for example about 5. The topic words corresponding to the largest of R1, …, RM may also be output as ranked candidates.
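Steps S3 and S4 then reduce to summing per-word relevances and keeping the Q largest totals, for example:

```python
def extract_topics(model, words, q=5):
    """Steps S3-S4 of FIG. 2B: sum per-word relevances per topic word
    and return the Q best (Q = 5 as in the embodiment)."""
    totals = {t: sum(rel.get(w, 0.0) for w in words)
              for t, rel in model.items()}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:q]
```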
[0018]
Topics can also be extracted from a word sequence derived from text: the text is input (S5) and morphologically analyzed (S6) to obtain a word string w1, w2, …, wn of morphemes, which is processed with the topic extraction model 11 in the same way as speech input to extract topics for the text.
The relevance based on the mutual information of wi and tj was determined above by equation (1), that is, by the mutual information between two points. The mutual information among n points, by contrast, is defined by the following equation.
[0019]
[Expression 2: the mutual information I(x1 : x2 : … : xn) among n points, equation (2), rendered as an image (Figure 0003794597) in the original; the products Π in the expression are taken over all combinations of distinct subscripts.]
Thus, taking one of x1, x2, …, xn as a topic word and the other n−1 as in-sentence words, their mutual information can be obtained as I(x1 : x2 : … : xn). By learning the relevance between several in-sentence words and one topic word in this way, a topic extraction model can be trained in which, for example, the relevance between "computer" and "Internet", or between "network" and "Internet", is not especially large, yet the relevance to "Internet" becomes large when "computer" and "network" occur in the same sentence. That is, where the relevance of equation (1) fails to extract "Internet" as a topic, the relevance of equation (2) may extract it, yielding a more appropriate topic.
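The patent gives no implementation for the n-point measure; purely as a hypothetical illustration, its effect could be approximated with a second relevance table keyed on pairs of in-sentence words, consulted whenever both members of a pair occur in the input:

```python
from itertools import combinations

def extract_topics_with_pairs(model, pair_model, words, q=5):
    """Illustrative variant (an assumption, not the patent's formula):
    pair_model maps frozenset({w1, w2}) -> {topic: relevance}, so that
    e.g. ("computer", "network") together can boost "Internet"."""
    totals = {t: sum(rel.get(w, 0.0) for w in words)
              for t, rel in model.items()}
    for w1, w2 in combinations(set(words), 2):
        for t, r in pair_model.get(frozenset((w1, w2)), {}).items():
            totals[t] = totals.get(t, 0.0) + r
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:q]
```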
[0020]
The relevance Rk between topic word tk and the word sequence w1, w2, …, wn is obtained as the sum rk1 + rk2 + … + rkn of the relevances of tk to the individual words. When forming this sum, weights s1, s2, …, sn may be attached to the words, computing rk1·s1 + rk2·s2 + … + rkn·sn, so as to obtain a more appropriate relevance Rk. As the weights s1, s2, …, sn one can use the confidence of each word w1, w2, …, wn at speech recognition time (its acoustic likelihood) or its linguistic likelihood, that is, the grammatical and linguistic probability of the word given the preceding words, as given by the language model used in large-vocabulary speech recognition.
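The weighted summation is a small change to the earlier sketch, with per-word weights such as recognition likelihoods:

```python
def extract_topics_weighted(model, words, weights, q=5):
    """Weighted variant of steps S3-S4: each input word contributes its
    relevance scaled by a per-word weight (e.g. an acoustic or
    linguistic likelihood supplied by the recognizer)."""
    totals = {t: sum(rel.get(w, 0.0) * s for w, s in zip(words, weights))
              for t, rel in model.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:q]
```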
[0021]
When topic extraction is performed on the word sequences of a speech recognition result, not only the first-ranked candidate sequence but the candidates up to rank b, (w1-1, w1-2, …, w1-n1), (w2-1, w2-2, …, w2-n2), …, (wb-1, wb-2, …, wb-nb), may be used, and larger weights may be given to higher-ranked candidates. In this case many of the candidate sequences from rank 1 to rank b differ from one another by only one or two words. If these candidate sequences are therefore merged, sharing their identical words, into a multi-word tree structure, word network, or word lattice, and topic extraction for the rank-1 through rank-b candidates is carried out on that structure, the multiple candidate sequences can be stored and processed in a memory of small capacity.
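A rank-weighted combination over the top b candidate sequences might be sketched as follows; the linear weighting is an assumption (the patent says only that higher ranks may receive larger weights), and the memory-saving lattice merge is not shown.

```python
def extract_topics_nbest(model, candidates, q=5):
    """Top-b candidate sequences, best first, combined with assumed
    linear rank weights before selecting the Q best topic words."""
    b = len(candidates)
    totals = {}
    for rank, words in enumerate(candidates):
        weight = (b - rank) / b                 # assumed weighting scheme
        for t, rel in model.items():
            score = weight * sum(rel.get(w, 0.0) for w in words)
            totals[t] = totals.get(t, 0.0) + score
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:q]
```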
[0022]
[Effects of the Invention]
The invention was evaluated on transcriptions of news speech and on the recognition results of a large-vocabulary continuous speech recognition system with a 20,000-word vocabulary. Topics assigned to the transcriptions by hand by three subjects served as the reference. With the top five topic words output by the topic extraction model taken as the result, the precision (the proportion of correct topic words among the extracted topic words) exceeded 70% against the topics assigned by the three subjects. The precision of topic extraction on recognition results with a word error rate of 25% likewise exceeded 65%. Since the overlap between the topics assigned by the different subjects was about 70%, this topic extraction accuracy can be regarded as usable. Using the χ² method as the relevance measure gave better results than using mutual information.
[0023]
According to the present invention, by using a topic extraction model that has learned the relevance between very many in-sentence words and topic words from a large amount of text data, detailed topic extraction can be performed both on text and on large-vocabulary continuous speech recognition results that contain errors.
That is, in topic extraction from speech, the use of continuous speech recognition makes it possible to exploit far more of the information in the speech than methods based on keyword spotting, which detect only a limited number of keywords; and by extracting several words (topic words) that represent the content of the speech, more detailed topics are obtained than with topic extraction (topic identification or topic recognition) that merely classifies speech into a few fields.
[0024]
In particular, conventional topic extraction from text requires complicated processing because it extracts passages standing in specific relations, whereas the present invention can be carried out comparatively simply. For continuous speech especially, a recognition error at such a specific passage is fatal to the conventional approach, while the present invention examines the relevance of every word of the whole text and can therefore extract topics correctly.
[0025]
Such correct extraction is possible because the method uses a topic extraction model, created from a large amount of text data, that stores the relevance between each topic word and each word.
[Brief description of the drawings]
FIG. 1 is a flowchart showing a model creation method used in the present invention.
FIG. 2A shows an example of the topic extraction model used in the present invention, and FIG. 2B shows the topic extraction method of the present invention.

Claims (7)

1. A processing method performed by a computer having a frequency table and a topic extraction memory, the method taking a word sequence as input and extracting, by means of a topic extraction model, a plurality of topic words that represent the content of the word sequence, the method comprising:
analyzing article texts, using as learning data a large number of article texts each consisting of a body and its headline, to obtain topic words from the headlines of the article texts and in-sentence words from the bodies of the article texts;
counting the appearance frequency of each of the topic words, the appearance frequency of each of the in-sentence words, and the co-occurrence frequency with which each combination of a topic word and an in-sentence word occurs together in one article text, and storing these in the frequency table serving as recording means;
referring to the frequency table and obtaining, on the basis of mutual information, the relevance between each topic word and each in-sentence word from the appearance frequencies of the topic words, the appearance frequencies of the in-sentence words, and the co-occurrence frequencies;
setting the relevance between a topic word and an in-sentence word to zero when no co-occurrence of that topic word and that in-sentence word has been observed;
storing in the topic extraction memory a topic extraction model consisting of the obtained relevances between topic words and in-sentence words together with the topic word and the in-sentence word of each relevance;
for an input word sequence, referring to the topic extraction model stored in the topic extraction memory and, for each topic word of the topic extraction model, associating each input word of the word sequence with an in-sentence word of the topic extraction model and obtaining the relevance between that topic word and each input word, thereby forming a relevance sequence for that topic word;
for each topic word, obtaining the sum of the relevances in the relevance sequence of that topic word as the total relevance of that topic word to the input word sequence; and
outputting the topic words corresponding to the Q largest (Q being an integer of 1 or more) of these total relevances.
2. A processing method performed by a computer having a frequency table and a topic extraction memory, the method taking a word sequence as input and extracting, by means of a topic extraction model, a plurality of topic words that represent the content of the word sequence, the method comprising:
analyzing article texts, using as learning data a large number of article texts each consisting of a body and its headline, to obtain topic words from the headlines of the article texts and in-sentence words from the bodies of the article texts;
counting the appearance frequency of each of the topic words, the appearance frequency of each of the in-sentence words, and the co-occurrence frequency with which each combination of a topic word and an in-sentence word occurs together in one article text, and storing these in the frequency table serving as recording means;
referring to the frequency table and obtaining, on the basis of the χ² method, the relevance between each topic word and each in-sentence word from the frequency f of the in-sentence word with respect to the topic word and the expected frequency F, using the appearance frequencies of the topic words, the appearance frequencies of the in-sentence words, and the co-occurrence frequencies;
setting the relevance between a topic word and an in-sentence word to zero when the appearance frequency f is smaller than the expected frequency F;
storing in the topic extraction memory a topic extraction model consisting of the obtained relevances between topic words and in-sentence words together with the topic word and the in-sentence word of each relevance;
for an input word sequence, referring to the topic extraction model stored in the topic extraction memory and, for each topic word of the topic extraction model, associating each input word of the word sequence with an in-sentence word of the topic extraction model and obtaining the relevance between that topic word and each input word, thereby forming a relevance sequence for that topic word;
for each topic word, obtaining the sum of the relevances in the relevance sequence of that topic word as the total relevance of that topic word to the input word sequence; and
outputting the topic words corresponding to the Q largest (Q being an integer of 1 or more) of these total relevances.
3. The topic word extraction method according to claim 1 or 2, wherein, in obtaining the sum of the relevances, each relevance of the relevance sequence is weighted by the likelihood of the input word corresponding to it.
4. The topic word extraction method according to any one of claims 1 to 3, wherein the input word sequence is obtained by word speech recognition of a continuous speech signal.
5. The topic word extraction method according to any one of claims 1 to 3, wherein input text is morphologically analyzed and the morphemes of the analysis result are used as the input word sequence.
6. The topic word extraction method according to claim 4 or 5, wherein a plurality of top-ranked candidate sequences of the recognition result or of the analysis result are used as the input word sequence, and the sum of the relevances is obtained with larger weights for higher-ranked candidates.
7. A recording medium on which is recorded a computer program for causing a computer to execute the topic word extraction method according to any one of claims 1 to 6.
JP16095497A 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium Expired - Fee Related JP3794597B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP16095497A JP3794597B2 (en) 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP16095497A JP3794597B2 (en) 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium

Publications (2)

Publication Number Publication Date
JPH117447A JPH117447A (en) 1999-01-12
JP3794597B2 true JP3794597B2 (en) 2006-07-05

Family

ID=15725794

Family Applications (1)

Application Number Title Priority Date Filing Date
JP16095497A Expired - Fee Related JP3794597B2 (en) 1997-06-18 1997-06-18 Topic extraction method and topic extraction program recording medium

Country Status (1)

Country Link
JP (1) JP3794597B2 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100327584B1 (en) 1999-07-01 2002-03-14 박종섭 Method of forming high efficiency capacitor in semiconductor device
JP4489994B2 (en) 2001-05-11 2010-06-23 富士通株式会社 Topic extraction apparatus, method, program, and recording medium for recording the program
DE602004025616D1 (en) * 2003-12-26 2010-04-01 Kenwood Corp A facility controller, method and program
JP3923513B2 (en) 2004-06-08 2007-06-06 松下電器産業株式会社 Speech recognition apparatus and speech recognition method
US20060025995A1 (en) * 2004-07-29 2006-02-02 Erhart George W Method and apparatus for natural language call routing using confidence scores
JP4423327B2 (en) 2005-02-08 2010-03-03 日本電信電話株式会社 Information communication terminal, information communication system, information communication method, information communication program, and recording medium recording the same
JP4791984B2 (en) * 2007-02-27 2011-10-12 株式会社東芝 Apparatus, method and program for processing input voice
JP5676552B2 (en) * 2012-12-17 2015-02-25 日本電信電話株式会社 Daily word extraction apparatus, method, and program

Also Published As

Publication number Publication date
JPH117447A (en) 1999-01-12

Similar Documents

Publication Publication Date Title
Halteren et al. Improving accuracy in word class tagging through the combination of machine learning systems
US11182435B2 (en) Model generation device, text search device, model generation method, text search method, data structure, and program
US8543565B2 (en) System and method using a discriminative learning approach for question answering
James The application of classical information retrieval techniques to spoken documents
US20080249764A1 (en) Smart Sentiment Classifier for Product Reviews
Vivaldi et al. Improving term extraction by system combination using boosting
CN108538286A (en) A kind of method and computer of speech recognition
CN111159363A (en) Knowledge base-based question answer determination method and device
Santos et al. Assessing the impact of contextual embeddings for Portuguese named entity recognition
CN109063182B (en) Content recommendation method based on voice search questions and electronic equipment
Ali et al. Genetic approach for Arabic part of speech tagging
CN108345694B (en) Document retrieval method and system based on theme database
Jayasiriwardene et al. Keyword extraction from Tweets using NLP tools for collecting relevant news
Weerasinghe et al. Feature Vector Difference based Authorship Verification for Open-World Settings.
JP3794597B2 (en) Topic extraction method and topic extraction program recording medium
Chifu et al. A system for detecting professional skills from resumes written in natural language
Onyenwe et al. Toward an effective igbo part-of-speech tagger
CN111767733A (en) Document security classification discrimination method based on statistical word segmentation
CN110874408B (en) Model training method, text recognition device and computing equipment
US20110106849A1 (en) New case generation device, new case generation method, and new case generation program
Llitjós et al. Improving pronunciation accuracy of proper names with language origin classes
Rolih Applying coreference resolution for usage in dialog systems
Sonbhadra et al. Email classification via intention-based segmentation
De Kruijf et al. Training a Dutch (+ English) BERT model applicable for the legal domain
Nothman Learning named entity recognition from Wikipedia

Legal Events

Date Code Title Description
A521 Request for written amendment filed

Free format text: JAPANESE INTERMEDIATE CODE: A523

Effective date: 20060306

A61 First payment of annual fees (during grant procedure)

Free format text: JAPANESE INTERMEDIATE CODE: A61

Effective date: 20060407

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20090421

Year of fee payment: 3

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20100421

Year of fee payment: 4

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20110421

Year of fee payment: 5

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20120421

Year of fee payment: 6

FPAY Renewal fee payment (event date is renewal date of database)

Free format text: PAYMENT UNTIL: 20130421

Year of fee payment: 7

LAPS Cancellation because of no payment of annual fees