TW200424874A - Automatic thesaurus construction method - Google Patents

Automatic thesaurus construction method

Info

Publication number
TW200424874A
Authority
TW
Taiwan
Prior art keywords
words
index
file
similarity
word
Prior art date
Application number
TW092112651A
Other languages
Chinese (zh)
Other versions
TWI290684B (en)
Inventor
Yuen-Hsien Tseng
Original Assignee
Webgenie Information Ltd
Yuen-Hsien Tseng
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Webgenie Information Ltd, Yuen-Hsien Tseng filed Critical Webgenie Information Ltd
Priority to TW092112651A priority Critical patent/TWI290684B/en
Publication of TW200424874A publication Critical patent/TW200424874A/en
Application granted granted Critical
Publication of TWI290684B publication Critical patent/TWI290684B/en

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An automatic thesaurus construction method is provided. The method first collects keywords in a document set, divides each document in the set into a plurality of logical segments, and performs related-term analysis on the basis of those logical segments.

Description

Description of the Invention

[Technical Field]

The present invention relates to a method for automatically constructing a thesaurus, and more particularly to a method for automatically constructing a co-occurrence thesaurus.

[Prior Art]

"Vocabulary mismatch" has long been one of the main causes of failed searches in information retrieval systems. The problem arises when the terms a user submits in a query differ from the terms the system used to index the documents. For example, different documents may inconsistently write 「筆記型電腦」, 「筆記本電腦」, or 「筆記本型電腦」 (all meaning "notebook computer"). If the system indexes documents directly by their original wording (indexing being done to speed up query matching), a query for 「筆記本電腦」 may miss documents that carry the same meaning under a different string, simply because the strings fail to match.

Library science noticed this phenomenon long ago and proposed tools such as the "authority file" and the "thesaurus" to address it. An authority file records variant forms of the same term so that, at indexing or retrieval time, terms identical in meaning but different in form can be mapped together and treated as one. Terms such as 「筆記型電腦」 and 「筆記本電腦」 ("notebook computer"), 「行政院長」 and 「閣揆」 ("premier"), or 「老人癡呆症」 and 「老人失智症」 ("senile dementia") can all be treated as the same term during indexing and retrieval through an authority file. A thesaurus goes further and records additional relationships between terms: besides synonyms, it lists antonyms, broader terms, narrower terms, related terms, and so on, used to widen or narrow the topical scope of the query vocabulary. For example, "notebook computer" and "palmtop computer" are close in concept and can both be regarded as narrower terms of "portable computer"; conversely, "portable computer" can be regarded as a broader term for both, and expanding the query "notebook computer" through this broader term can retrieve documents that mention "palmtop computer".

By enumerating relationships between terms, a thesaurus supports mutual suggestion of query terms, widening or narrowing the query scope or offering alternative wordings for related concepts, and thereby lifts retrieval from plain string matching to matching at the semantic level. Building such semantic relationships between terms usually requires manual analysis and curation. Manually built thesauri are highly accurate, but they are costly, slow to construct, and hard to maintain, and the vocabulary selected in advance may be irrelevant to later or newly arriving documents. Past information retrieval experiments have shown that applying a general-purpose thesaurus to retrieval in a specialized domain often fails to improve retrieval effectiveness.

Even though a thesaurus captures the semantic gaps between terms, the topics its vocabulary covers may diverge from the topics of the documents, defeating the purpose of using it to improve retrieval. An extreme example is applying a humanities thesaurus to the retrieval of engineering literature; the benefit is naturally hard to demonstrate. Yet building a thesaurus for every document domain is time-consuming and laborious. A method that automatically and immediately generates a thesaurus from the subject matter of the documents themselves is therefore a topic worth exploring.

Automated methods generally rely on the clue that related terms often appear together in documents. A thesaurus built this way may be called a "co-occurrence thesaurus". The relationships between terms in a co-occurrence thesaurus are not precise semantic relationships such as broader or narrower terms, as in a manually built thesaurus, but they are statistically correlated, and such correlations often reveal knowledge implicit in the documents that is hard to detect, discover, or maintain by hand. A method for automatically constructing such a thesaurus can therefore also be regarded as a method of knowledge discovery from text, or text mining.

Gerard Salton, in Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989, proposed a basic method for constructing a co-occurrence thesaurus. The approach first computes the similarity between every pair of terms (based on how often the terms co-occur across documents) and then clusters the terms by that similarity. If document D_i is represented by the document vector D_i = (d_i1, d_i2, ..., d_it), where t is the number of index terms and d_ij is the weight of term T_j in document D_i, then the weights of term T_j across all documents form the term vector T_j = (d_1j, d_2j, ..., d_nj), where n is the number of documents. The similarity between terms T_j and T_k can then be defined as:

  Sim(T_j, T_k) = Σ_{i=1}^{n} d_ij × d_ik

Once the similarity between any two terms has been computed, various clustering techniques can group highly similar terms into the same class. With this construction, however, computing and clustering the pairwise similarities is extremely expensive: applied to a large document collection it consumes enormous memory and long computation time.
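
For illustration only (this code is not part of the patent), a minimal Python sketch of Salton's pairwise term similarity; the documents and weights below are hypothetical:

```python
from collections import defaultdict

# Hypothetical term weights per document: d[i][j] = weight of term j in doc i.
doc_weights = [
    {"notebook computer": 2.0, "portable computer": 1.0},
    {"notebook computer": 1.0, "palmtop computer": 2.0},
    {"palmtop computer": 1.0, "portable computer": 3.0},
]

def salton_similarity(doc_weights):
    """Sim(Tj, Tk) = sum over all documents i of d_ij * d_ik."""
    sim = defaultdict(float)
    for weights in doc_weights:           # one pass per document
        terms = sorted(weights)
        for a in range(len(terms)):
            for b in range(a + 1, len(terms)):
                pair = (terms[a], terms[b])
                sim[pair] += weights[terms[a]] * weights[terms[b]]
    return dict(sim)

print(salton_similarity(doc_weights))
# {('notebook computer', 'portable computer'): 2.0,
#  ('notebook computer', 'palmtop computer'): 2.0,
#  ('palmtop computer', 'portable computer'): 3.0}
```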

Hsinchun Chen et al., in "Automatic Thesaurus Generation for an Electronic Community System," Journal of the American Society for Information Science, 46(3): 175-193, April 1995, also constructed a thesaurus from the statistics of terms co-occurring in documents. They define the weight of term T_j in document D_i as:

  d_ij = tf_ij × log((n / df_j) × w_j)

where n is the total number of documents, df_j is the number of documents in which term T_j appears, tf_ij is the term frequency (number of occurrences) of T_j in document D_i, and w_j is the length of term T_j; for example, "Artificial Intelligence" contains two English words, so its length is defined as 2, while 「數位音樂」 ("digital music") contains four Chinese characters, so its length is defined as 4. The joint weight of terms T_j and T_k in document D_i is defined as:

  d_ijk = tf_ijk × log((n / df_jk) × w_j)

where df_jk is the number of documents in which the two terms appear together and tf_ijk is the number of times the two terms co-occur in D_i, the smaller of the two occurrence counts being taken as tf_ijk. Chen et al. argue that symmetric co-occurrence measures tend to surface the more frequent terms, which contribute little to retrieval effectiveness, so they define an asymmetric clustering scheme. The weight with which term T_j suggests term T_k is:

  Cluster_Weight(T_j, T_k) = (Σ_{i=1}^{n} d_ijk / Σ_{i=1}^{n} d_ij) × weighting_factor(T_k)

and the weight with which term T_k suggests term T_j is:

  Cluster_Weight(T_k, T_j) = (Σ_{i=1}^{n} d_ijk / Σ_{i=1}^{n} d_ik) × weighting_factor(T_j)

where the weighting factor is:

  weighting_factor(T_j) = log(n / df_j) / log(n)

Using these formulas they generated 1,708,551 co-occurrence pairs from 4,714 documents (8 MB of disk space), which amounts to each term being associated with thousands of other terms. Because so many related terms would burden users browsing the suggestions, each term was limited to at most 100 related terms. This removed 60% of the pairs, leaving 709,659 pairs drawn from 7,829 distinct terms. The formulas above are computationally heavy: generating the pairs took 9.2 CPU hours on a workstation, and the pairs occupied 12.3 MB of disk space, more than the original collection.
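
Again purely as illustration, a sketch of the asymmetric cluster weight as reconstructed above; the placement of the length factor w_j inside the joint weight d_ijk is an assumption where the source is garbled, and the toy counts are hypothetical:

```python
import math

def chen_cluster_weight(tf, term_len, t_j, t_k):
    """Cluster_Weight(Tj, Tk): the asymmetric weight with which Tj suggests Tk.

    tf: list of dicts, tf[i][t] = frequency of term t in document i.
    term_len: term -> length w (word count for English, characters for Chinese).
    """
    n = len(tf)
    df = lambda t: sum(1 for d in tf if t in d)
    df_jk = sum(1 for d in tf if t_j in d and t_k in d)

    num = den = 0.0
    for d in tf:
        if t_j in d:
            # d_ij = tf_ij * log(n / df_j * w_j)
            den += d[t_j] * math.log(n / df(t_j) * term_len[t_j])
            if t_k in d:
                # d_ijk = tf_ijk * log(n / df_jk * w_j), tf_ijk = min of the two
                num += min(d[t_j], d[t_k]) * math.log(n / df_jk * term_len[t_j])
    weighting_factor = math.log(n / df(t_k)) / math.log(n)
    return num / den * weighting_factor

docs = [{"a": 2, "b": 1}, {"a": 1}, {"b": 3}]   # hypothetical frequencies
print(chen_cluster_weight(docs, {"a": 1, "b": 1}, "a", "b"))
```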

Mark Sanderson et al., in "Deriving Concept Hierarchies from Text," Proceedings of the 22nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), Aug. 15-19, 1999, Berkeley, U.S.A., pp. 206-213, constructed a hierarchical co-occurrence thesaurus in an entirely different way. Their goal was to automatically generate concept hierarchies from retrieved documents so that users could grasp the general content of the retrieved set. The approach has two steps. The first step selects the terms to be listed in the concept hierarchy for users to browse. Term selection mainly looks, within the best-matching passages of the top-ranked retrieved documents, for terms that frequently appear together. An alternative selection takes, from the passage of each retrieved document most relevant to the query, the terms for which the number of retrieved documents containing the term, divided by the number of documents in the whole collection containing it, is at least 0.1, i.e.:

  (df in retrieved set) / (df in the collection) >= 0.1

With these two selections they extracted on average 2,430 terms from the top 500 documents of each query result over the TREC collection. Given these important terms related to the query, the second step analyzes the associations among them. Every pair of terms is tested for a subsumption relationship: if T_j subsumes T_k, then the conditional probabilities P(T_j | T_k) = 1 and P(T_k | T_j) < 1 should hold. That is, whenever term T_k appears in a document, T_j also appears, but when T_j appears, T_k does not necessarily appear; in that case T_j is said to subsume T_k. In general a broader term subsumes its narrower terms, but some term pairs without a broader/narrower relationship also satisfy this condition. Although the condition is the direct mathematical statement of subsumption, it is too strict: pairs that do not satisfy it exactly may still stand in a semantic subsumption relationship. Sanderson therefore relaxed it: if T_j subsumes T_k, then P(T_j | T_k) >= 0.8 and P(T_k | T_j) < 1 should hold. Testing this relaxed condition extracted on average 200 subsumption pairs per TREC query topic. From these pairs a concept hierarchy can be produced with hierarchical-menu window tools, the subsuming term being the parent node and the subsumed term its child. In their evaluation, 67% of the subsumption pairs were judged relevant, i.e., interesting for further exploring: the subjects found the automatically extracted pairs interesting enough to want to explore why they arose. Experiments showed, however, that random pairing alone already reached 51% relevance, probably because all the selected terms come from the same retrieval result and are therefore very likely to share a topic, so even randomly matched terms retain some degree of relatedness. Overall, the method runs only at query time, so query response time suffers, and the suggested terms are limited to the top N retrieved documents; it is therefore not a method for constructing a global thesaurus.
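
A minimal sketch (not from the paper or the patent) of Sanderson's relaxed subsumption test, with hypothetical document-id sets:

```python
def subsumes(docs_j, docs_k, threshold=0.8):
    """True if term j subsumes term k under the relaxed test:
    P(Tj | Tk) >= threshold and P(Tk | Tj) < 1.

    docs_j, docs_k: sets of ids of the documents containing each term.
    """
    both = len(docs_j & docs_k)
    p_j_given_k = both / len(docs_k)      # how often j appears when k does
    p_k_given_j = both / len(docs_j)
    return p_j_given_k >= threshold and p_k_given_j < 1.0

# Hypothetical: the broader term occurs in docs 1-6, the narrower in 1-4.
print(subsumes({1, 2, 3, 4, 5, 6}, {1, 2, 3, 4}))   # True
```
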
[Summary of the Invention]

An object of the present invention is therefore to provide an automatic thesaurus construction method with a fast construction speed. Another object of the present invention is to provide an automatic thesaurus construction method that can extract related terms from a single document.

To achieve these and other objects, the present invention provides an automatic thesaurus construction method that first extracts a plurality of keywords from a document set, then divides each document of the set into a plurality of logical segments, and performs related-term analysis on the basis of those logical segments.

In a preferred embodiment of the invention, the related-term analysis based on logical segments first computes an association weight between two keywords and accumulates that weight whenever it is at least a threshold. The accumulated weight is finally multiplied by the normalized inverse document frequency to obtain a similarity. When the lengths of the documents in the set differ by more than a preset value, relatively long documents may additionally be cut into several documents.

In another preferred embodiment of the invention, the association weight is computed by the following equation:

  wgt(T_ij, T_ik) = (2 × S(T_ij ∩ T_ik) / (S(T_ij) + S(T_ik))) × ln(1.72 + S_i)

where S_i is the number of logical segments into which document i is divided, S(T_ij) is the number of logical segments of document i in which term j appears, and S(T_ij ∩ T_ik) is the number of logical segments in which terms j and k appear together.

In yet another preferred embodiment of the invention, the similarity is computed by the following equation:

  sim(T_j, T_k) = (log(w_k × n / df_k) / log(n)) × Σ_{i=1}^{n} f(wgt(T_ij, T_ik))

where n is the total number of documents, df_k is the number of documents in which term k appears, and w_k is the length of keyword k; f(wgt(T_ij, T_ik)) = wgt(T_ij, T_ik) when wgt(T_ij, T_ik) is at least a threshold, and f(wgt(T_ij, T_ik)) = 0 otherwise.
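
The association weight above transcribes directly into code; a minimal sketch with hypothetical segment counts, reproducing the single-record bibliographic example worked out in the embodiments below:

```python
import math

def association_weight(s_jk, s_j, s_k, num_segments, c=1.72):
    """wgt(Tij, Tik) = 2*S(Tij ∩ Tik) / (S(Tij) + S(Tik)) * ln(c + Si).

    s_jk: segments of document i where terms j and k co-occur;
    s_j, s_k: segments where each term appears; num_segments: Si.
    """
    dice = 2.0 * s_jk / (s_j + s_k)           # first factor: Dice coefficient
    return dice * math.log(c + num_segments)  # length compensation ln(1.72 + Si)

# A one-segment bibliographic record where both terms appear once:
print(association_weight(1, 1, 1, 1))         # ln(2.72) ≈ 1.0006 > threshold 1.0
```
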
For display, according to a preferred embodiment of the invention, a related term may be shown together with its document frequency and similarity; the display position of a related term may be determined by its similarity; and related terms may further be sorted by document frequency, by time of occurrence, or by similarity.

Because the invention analyzes related terms in units of the logical segments of a text, such as sentences or paragraphs, the related terms and the associations among them can be obtained from a single document in a short time. When new documents are added, only the new documents need to be analyzed against the existing related terms, without redoing the related-term analysis over all documents. The thesaurus can therefore be built much faster than with prior techniques.

To make the above and other objects, features, and advantages of the present invention clearer, a preferred embodiment is described in detail below with reference to the accompanying drawings.

[Embodiments]

Please refer to Fig. 1, which is a flowchart of the steps performed by a preferred embodiment of the automatic co-occurrence thesaurus construction method according to the present invention. In this embodiment, keyword extraction is first performed on the documents (S100); each document is then divided into a plurality of logical segments, and related-term analysis is performed on the basis of those segments (S102); finally, all related terms are accumulated into a related-term bank (S104), which constitutes the co-occurrence thesaurus.

Keywords are the meaningful, representative terms in a document and the smallest units that convey its subject, so every kind of automated document processing must begin with a step of automatic keyword extraction. Since what counts as a keyword is a subjective judgment ill-suited to automatic processing by computer, this method adopts the assumption that if a document discusses some topic, it will mention certain specific strings several times. This tendency of topical terms to repeat is a rule a computer can follow and a basis on which keywords can be extracted. Based on this repetition assumption, the inventor previously developed an automatic keyword extraction method, granted an invention patent of the Republic of China (R.O.C. Invention Patent No. 153789). That method can extract newly coined terms, proper nouns, personal names, place names, organization names, and the like; it does not require document integrity and works in noisy environments such as OCR output or speech recognition transcripts; it needs no extra resources such as dictionaries, lexicons, grammar parsers, or corpora, which take great manual effort to build or maintain in advance; the extracted keywords have no length limit; extraction is fast and uses little memory; keywords with very low statistical support (appearing only twice) can still be extracted; and extraction precision is high, about 86%.

Because this keyword extraction method rests on only one simple assumption, documents from different fields, such as engineering, the humanities, medicine, or patents, and in different languages, such as Chinese, English, or even the melody strings of music, can immediately use it for automatic extraction of keywords, key melodies, or key passages of a document.

As for documents that do not appear to satisfy this assumption, such as bibliographic records or short documents whose concise wording gives terms little chance to repeat, a few techniques can make them satisfy the repetition assumption as well. One technique is to gather related short documents into a longer one, so that important terms have a better chance of recurring, before running keyword extraction. For example, in processing bibliographic data, 350,000 titles can be grouped fifty thousand titles at a time, each group treated as one document, to extract keywords; the keywords are then merged, and matching which terms came from which titles yields a keyword bank for query suggestion and bibliographic retrieval. Alternatively, background documents can be used: with a retrieval function or some other reliable means, find documents whose content is related to the short document to be processed, treat the short document and its background documents as one document so that important terms can recur, run keyword extraction, and finally keep only the terms that appear in the short document itself; these are the keywords automatically extracted for that document.

A method that can extract newly coined terms without relying on a lexicon or dictionary is very important for certain document types and for cross-domain collections. Chinese news documents, for example, constantly introduce new vocabulary. The inventor once compared the vocabulary of news documents against a lexicon of 120,000 Chinese words and against the inventor's own method, and found that the inventor's method extracted on average 33 keywords per news document (terms appearing at least twice), of which 11 (33%) were meaningful terms not collected in that lexicon. This shows that extracting the vocabulary of news documents with a pre-built lexicon alone may miss one third of the important terms.

The keyword extraction method can also be combined with lexicon-based word segmentation to raise extraction precision. Plain lexicon segmentation, for example the longest-match-first method, reads the document one character at a time and looks up the longest lexicon term beginning with that character; if such a term exists in the lexicon, it is cut from the document and matching resumes at the character following the term. If no lexicon term begins with that character, the character is cut as a single-character token (or treated as part of an unknown word) and matching continues at the next character, repeating until the end of the document. The terms segmented this way are all lexicon terms, so precision is high, but the document may contain terms missing from the lexicon, so recall is low. Since the keyword extraction method described above can pick up newly coined terms, it can be applied after lexicon segmentation, letting the strengths of the two methods compensate for each other and yielding both high precision and high recall. Experiments show that segmenting documents first with the 120,000-word lexicon and then applying the automatic keyword extraction method raises keyword precision from the original 86% to 96%.
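
For illustration (not the inventor's patented extractor), a sketch of the longest-match-first segmentation just described; the toy lexicon is hypothetical:

```python
def longest_match_segment(text, lexicon, max_len=8):
    """Greedy longest-match-first word segmentation against a lexicon.

    Characters that begin no lexicon entry are emitted as single-character
    tokens (candidate parts of unknown words), matching the description above.
    """
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in lexicon:
                tokens.append(word)
                i += length
                break
    return tokens

lexicon = {"筆記型", "筆記型電腦", "電腦"}        # hypothetical toy lexicon
print(longest_match_segment("筆記型電腦很好用", lexicon))
# ['筆記型電腦', '很', '好', '用']
```
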
Once the keywords of a document have been extracted, by the method above or by any other method, association analysis can proceed. Past analysis methods were developed around whether terms often co-occur in the same document. In a full-text environment, however, measuring co-occurrence over an entire document easily loses precision, because the farther apart two terms stand in a document, the less likely they are to be closely related. The unit of co-occurrence should therefore be narrowed to a logical paragraph, a paragraph of the text, a sentence, or any other unit into which a document can be divided. One can then count, within a single document, the number of units in which any two terms appear individually and appear together, and obtain their degree of association by applying a similarity formula commonly used in information retrieval, such as Dice, Cosine, Jaccard, or mutual information. After the associations between the important terms of each document have been computed, the related terms whose accumulated association strength exceeds a threshold make up the related-term bank of the whole document collection.

The present invention uses the following formula to compute the association weight of two terms in the same text:

  wgt(T_ij, T_ik) = (2 × S(T_ij ∩ T_ik) / (S(T_ij) + S(T_ik))) × ln(1.72 + S_i)

where S_i is the number of units into which document i is divided, the unit usually being a sentence, a paragraph, or any meaningful division; for convenience of explanation this unit is referred to below as a "sentence". S(T_ij) is the number of sentences of document i in which term j appears, and S(T_ij ∩ T_ik) is the number of sentences in which terms j and k appear together. The first factor of the formula is thus exactly the Dice coefficient. Computed from the Dice coefficient alone, however, the association weights between terms in long documents usually come out lower than those in short documents, so the second factor, ln(1.72 + S_i), compensates the weights of long-document terms, giving documents of different lengths association weights in roughly the same range. Note that the constant 1.72 used here is not the only usable value; in fact, any constant large enough that the argument of the logarithm reaches the base of the natural logarithm can be used here.

After the above computation, the invention accumulates the association weights that are at least a threshold and multiplies the sum by the normalized inverse document frequency (IDF) to obtain the final similarity. The exact formula is:

  sim(T_j, T_k) = (log(w_k × n / df_k) / log(n)) × Σ_{i=1}^{n} f(wgt(T_ij, T_ik))

where n is the total number of documents, df_k is the number of documents in which term k appears, and w_k is the length of term k; f(a) = a when a is at least a threshold, and f(a) = 0 otherwise. In this formula the inverse document frequency can be computed after all documents have been processed, without affecting the extraction of related terms from a single document. In most cases a threshold of 1.0 suits most document types. In a bibliographic database, for instance, one bibliographic record can be taken as one logical sentence, and for two terms each appearing once in a record, the association weight 2 × 1/(1+1) × ln(1.72 + 1) = ln(2.72) > 1.0 already passes the threshold. For document collections with a very uneven distribution of document lengths, it is best to cut extra-long documents (long relative to the other documents in the collection) into several shorter documents before applying the formulas above to obtain related terms and similarities.
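
Putting the two formulas together, a sketch of the per-document accumulation: each document arrives as a list of segments (sets of keywords), so new documents can be folded in incrementally, and the IDF factor is applied only at the end, as the text notes. The data shapes are assumptions, the term length is taken in characters for simplicity, and at least two documents are assumed so log(n) is nonzero:

```python
import math
from collections import defaultdict

def build_thesaurus(doc_segments, threshold=1.0):
    """Accumulate thresholded per-document association weights, then apply
    the normalized IDF factor: sim = log(w_k * n / df_k) / log(n) * sum_i f(wgt).

    doc_segments: list of documents, each a list of sets of keywords
    (one set per logical segment).
    """
    acc = defaultdict(float)              # (Tj, Tk) -> accumulated f(wgt)
    df = defaultdict(int)                 # term -> number of documents containing it
    for segments in doc_segments:         # one pass per document: incremental
        s_i = len(segments)
        seg_count = defaultdict(int)      # term -> segments containing it
        pair_count = defaultdict(int)     # (Tj, Tk) -> segments containing both
        for seg in segments:
            for t in seg:
                seg_count[t] += 1
            terms = sorted(seg)
            for a in range(len(terms)):
                for b in range(a + 1, len(terms)):
                    pair_count[(terms[a], terms[b])] += 1
        for t in seg_count:
            df[t] += 1
        for (tj, tk), s_jk in pair_count.items():
            wgt = 2.0 * s_jk / (seg_count[tj] + seg_count[tk]) * math.log(1.72 + s_i)
            if wgt >= threshold:          # f(): keep only weights over the threshold
                acc[(tj, tk)] += wgt
    n = len(doc_segments)
    sim = {}
    for (tj, tk), total in acc.items():
        w_k = len(tk)                     # term length in characters (simplification)
        sim[(tj, tk)] = math.log(w_k * n / df[tk]) / math.log(n) * total
    return sim
```
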
Changing the unit of co-occurrence from the whole document to a smaller paragraph or sentence has the immediate effect that related terms can be extracted from a single document, without waiting until the entire collection has been processed. This effect has three advantages. First, incremental extraction of related terms becomes easy: for a database to which documents are added daily or periodically, documents already processed need not be processed again; only the related terms of the new documents need to be extracted and added to the overall related-term bank. Second, which related terms appear in which documents is easy to track and record, so sorting related terms by association strength, by the number of documents in which they appear (df), or by document date all become straightforward. Third, related-term extraction is fast. Computing related terms with the document as the unit of co-occurrence requires comparing pairwise all the important terms of all documents to learn whether they often co-occur, a computation of about O(m²n), where m is the number of important terms and n the number of documents. In a document database, m typically runs to tens of thousands or even hundreds of thousands, and n from thousands up to hundreds of thousands or even millions. The modified approach costs about O(nK²s), where K is the number of terms per document taken into the association analysis and s the average number of sentences per document. In typical settings K and s both lie between 10 and 100, and O(nK²s) is much smaller than O(m²n), so the computation time is far shorter.

Experiments show that from 330,000 Chinese news documents (about 323 MB of text), a desktop Pentium II computer took about 5.5 hours to produce 2.5 million related-term pairs (over 250,000 distinct terms in all), averaging 10 related terms per term. By comparison, Chen et al., on a Sun Sparc workstation, took 9.2 hours to find 1,708,551 related-term pairs from 4,714 English documents (8 MB of text in total) and, because there were too many related terms, removed 60% of the pairs to arrive at the final result. Another of Chen et al.'s experiments, on a 2 GB database of English abstracts, produced 4,000,000 related-term pairs from 270,000 terms at a cost of 24.5 CPU hours on a supercomputer.

Fig. 2 shows the content of one document together with an example of the keywords and related terms automatically extracted by the above method. The parentheses after each keyword show the number of times the term appears in the text, while the related terms are displayed as a two-dimensional diagram of the associations among the terms. Once the related terms of every document have been extracted, they accumulate into a related-term bank for query suggestion. Users can thereby grasp the approximate content of the documents and select useful related terms to explore the information recorded in the database.

Fig. 3 shows the result of another preferred embodiment of the present invention. After the user enters DoCoMo, the system suggests in the leftmost column three terms approximating the string DoCoMo: one is the query term itself, and two are terms subsumed by the query term (both can be regarded as narrower terms). Each term is followed by its document count in parentheses. Treating each term as a category, the user can see at a glance that 62 documents belong to the DoCoMo category, 19 to the NTT DoCoMo category, and so on, and can click a smaller category to narrow the scope quickly. These counts also let the user know, without issuing further queries, how many documents each term would return, so that a single query already reveals several query results. The effect resembles a classified directory, except that this directory changes with the query term and with the document database; it is therefore called a "dynamic category directory".

The column labeled "Related Terms" in the middle of Fig. 3 also holds suggestions built from the related-term bank. For each related term of the query DoCoMo, the parentheses that follow show two figures: the first is the term's document frequency (df, the number of documents it appears in), and the second is its accumulated association strength. In this example the related terms of DoCoMo are sorted by association strength in descending order. From these suggestions one can roughly tell that DoCoMo is connected with the Japanese telecommunications industry; since 「三G」 ("3G") and "i-Mode" are terms specific to the mobile-phone field, it is, more precisely, connected with the mobile-phone side of that industry. For a user who wants to explore DoCoMo, these suggestions serve to a considerable degree as a summary. For further detail the user can click an appropriate term, call up the relevant documents, and learn the particulars from their descriptions.

The related terms above are displayed as text in one dimension; they can also be applied in a two-dimensional setting, which gives them a further kind of usefulness. Moreover, if the nature of the terms can be identified, more relevant information can be volunteered when they are displayed. As shown in Fig. 4, after querying MP3 the user finds many terms related to MP3, among which 「中環」 looks like a company name; clicking it brings up the articles relating MP3 to that company, revealing their relationship in detail. Here the system has also matched the company in its vendor database and therefore displays the company's detailed data as well. In this way, associations between things can be obtained from unstructured documents; these associations are like recorded knowledge awaiting exploration, some of it disclosed by the free text of the documents and some supplemented by structured data prepared in advance. Exploring step by step in such a two-dimensional space is like discovering the needed information or knowledge step by step on a map; such a system may therefore be called, for short, a "knowledge map".

As noted earlier, when the related terms of each document are accumulated, the association strength (similarity) between terms, the documents and document counts in which they appear, and their dates and times are all known, so the related terms of a query term can be sorted by any of these. Whatever the ordering, the corresponding set of related terms is the same; only the order of the terms differs. Different orders of presentation, however, give users different impressions of how good the suggested terms are: one order may beat another by placing a higher proportion of relevant terms up front. Please refer to Fig. 5, which shows the related terms of the query 「古蹟」 ("historic sites") sorted four ways: by term frequency, by time, by similarity without IDF, and by similarity. "Term frequency" is the number of documents in which a suggested term appears. "Time" takes, among the documents in which the suggested term co-occurs with the query term, the most recent document date as the date of that suggestion, and then sorts the suggestions from newest to oldest. The intent is to account for "related-term pairs" that have appeared only in recent reports, whose strength or frequency has not yet accumulated enough to rank them near the front; news documents in particular have this character. "Similarity without IDF" is the similarity obtained from the formula:

  simNoIDF(T_j, T_k) = Σ_{i=1}^{n} f(wgt(T_ij, T_ik))

where the function f() is defined as before. Compared with the original, this similarity formula lacks the factor log(w_k × n / df_k) / log(n); its purpose is to measure the effect of removing the inverse document frequency (IDF). Finally, "similarity" is the similarity obtained from the original formula. Among the suggestions in Fig. 5, the terms judged more relevant to the query 「古蹟」 are marked with check marks; in this example, sorting by similarity appears to work best.
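
The four orderings compared in Fig. 5 amount to sorting the same record set by different keys; a minimal sketch with hypothetical record fields:

```python
def rank_related(terms, key="sim"):
    """Sort related-term records by one of the four criteria in the text.

    Each record is a dict with hypothetical fields: 'df' (document
    frequency), 'date' (latest co-occurrence date, ISO format),
    'sim_no_idf' (accumulated weight only), 'sim' (with the IDF factor).
    """
    keys = {
        "df":         lambda r: r["df"],
        "time":       lambda r: r["date"],        # most recent first
        "sim_no_idf": lambda r: r["sim_no_idf"],
        "sim":        lambda r: r["sim"],
    }
    return sorted(terms, key=keys[key], reverse=True)

records = [  # hypothetical related terms of one query term
    {"term": "淡水", "df": 120, "date": "2003-04-30", "sim_no_idf": 8.2, "sim": 5.1},
    {"term": "廟宇", "df": 45,  "date": "2003-05-02", "sim_no_idf": 6.7, "sim": 6.3},
]
print([r["term"] for r in rank_related(records, key="sim")])   # ['廟宇', '淡水']
```
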
In the evaluation, the top N (here N = 50) suggested terms for each prepared query term were examined and judged for relevance to that query; averaging the results over all 30 query terms gives the outcome shown in Fig. 6. When the suggestions are sorted by similarity, the proportion judged relevant is 69%; next comes sorting by similarity without IDF, at 62%; then sorting by time, at 59%; worst is sorting by document count, at 48%. These figures confirm once more that the inverse document frequency often plays an important role in information retrieval, while time ordering, though less effective here, may still have important applications in document databases that emphasize chronological order.

The proportion of relevant terms alone already shows that sorting by similarity works best. When the proportions are equal, computing precision and recall can further expose the differences between methods. For example, when two orderings both reach a 60% relevant proportion, i.e., 30 of the top 50 suggestions are judged relevant, there is clearly still a great difference in effect between those 30 relevant suggestions all appearing at the front and all appearing at the back (the first 20 all irrelevant). The precision and recall program trec_eval, commonly used in the TREC retrieval competitions, can further distinguish the merits of two such orderings. In addition, the four orderings may present somewhat different relevant suggestions within their top 50; pooling all the relevant suggestions found by the four methods and computing recall with trec_eval also roughly reveals whether one method often finds relevant suggestions the others cannot, or whether the relevant suggestions found by other methods are usually found by some single method anyway. Fig. 7 shows the trec_eval output for the query 「古蹟」 sorted by similarity. The table shows that the more the terms judged relevant concentrate at the front, the higher the average precision; if just as many relevant suggestions are found but they are not especially concentrated at the front, the average precision is lower. The final mean average precision aggregates the average precision of all 30 query terms; sorting by similarity remains the most effective, with a mean average precision of 0.5284, followed by sorting by similarity without IDF at 0.4346, then by time at 0.4028, and last by document count at 0.3020.

In summary, the automatic thesaurus construction method proposed by the present invention is fast to build and effective. Compared with past studies, although the evaluation environments differ, the figures show effectiveness at nearly the same level. Furthermore, a single document suffices for extracting related terms, and incremental extraction is easy: for a database to which documents are added daily or periodically, documents already processed need not be processed again; only the related terms of newly added documents need be extracted and added to the overall related-term bank. Finally, the related terms are easy to sort, whether by association strength, by the number of documents in which they appear (df), or by document date.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make minor changes and refinements without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

[Brief Description of the Drawings]

Fig. 1 is a flowchart of the steps performed by a preferred embodiment of the present invention;
Fig. 2 shows the content of a document together with example keywords and related terms automatically extracted by a preferred embodiment of the present invention;
Fig. 3 is a schematic view of one result of another preferred embodiment of the present invention;
Fig. 4 is a schematic view of another result of a preferred embodiment of the present invention;
Fig. 5 is a schematic view of the related terms of the query 「古蹟」 ("historic sites") according to a preferred embodiment of the present invention, sorted by term frequency, by time, by similarity without IDF, and by similarity;
Fig. 6 is a schematic view of results obtained according to a preferred embodiment of the present invention; and
Fig. 7 shows the trec_eval output for the query 「古蹟」 sorted by similarity.

Description of reference numerals: S100-S104: steps performed by a preferred embodiment of the present invention.

Claims (1)

FIG. 5 shows the related terms retrieved for the query "historic site" (古蹟) according to a preferred embodiment of the present invention, ranked in four ways: by term frequency, by time, by similarity without IDF, and by similarity; FIG. 6 is a schematic diagram of results obtained according to a preferred embodiment of the present invention; and FIG. 7 shows the trec_eval output when the related terms of the query term "historic site" are ranked by similarity.

Description of reference numerals: S100~S104: steps performed according to a preferred embodiment of the present invention.

Claims:

1. An automatic thesaurus construction method, comprising: retrieving a plurality of keywords from a document set; dividing each document in the document set into a plurality of logical segments; and performing related-term analysis on the basis of the logical segments.

2. The automatic thesaurus construction method as claimed in claim 1, wherein the step of performing related-term analysis on the basis of the logical segments comprises: calculating an association weight for two keywords; accumulating the association weight when the association weight is greater than or equal to a threshold; and multiplying the accumulated association weight by a normalized inverse document count to obtain a similarity.

3. The automatic thesaurus construction method as claimed in claim 2, wherein the association weight wgt(Tij, Tik) is computed from the following quantities: Si, the number of logical segments into which document i is divided; s(Tij), the number of logical segments of document i in which keyword j appears; and s(Tij ∩ Tik), the number of logical segments in which keywords j and k appear together.

4. The automatic thesaurus construction method as claimed in claim 3, wherein the similarity is obtained by the following equation: sim(Tj, Tk) = Σi f(wgt(Tij, Tik)) × (ln(n/dfk)/ln(n)) × Wk, where n is the total number of documents, dfk is the number of documents in which keyword k appears, and Wk is the length of keyword k; when wgt(Tij, Tik) is greater than or equal to a threshold, f(wgt(Tij, Tik)) = wgt(Tij, Tik), and otherwise f(wgt(Tij, Tik)) = 0.

5. The automatic thesaurus construction method as claimed in claim 1, further comprising: when displaying a related term, also displaying the term frequency and a similarity associated with the related term.

6. The automatic thesaurus construction method as claimed in claim 1, further comprising: when displaying a related term, determining its display position according to a similarity of the related term.

7. The automatic thesaurus construction method as claimed in claim 6, wherein a related term with a lower similarity is displayed at a position farther from a center point.

8. The automatic thesaurus construction method as claimed in claim 1, further comprising the step of ranking related terms by term frequency.

9. The automatic thesaurus construction method as claimed in claim 1, further comprising the step of ranking related terms by time of occurrence.

10. The automatic thesaurus construction method as claimed in claim 1, further comprising the step of ranking related terms by a similarity obtained without the inverse document count.

11. The automatic thesaurus construction method as claimed in claim 1, further comprising: when the difference in document length within the document set exceeds a predetermined value, splitting a relatively long document into multiple documents.
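The claims above state the computation only at the level of the defined quantities, and the printed weight equation of claim 3 is not legible in the source, so the following is a minimal sketch rather than the patented formula: a Dice coefficient over logical segments stands in for wgt(Tij, Tik), while the thresholded accumulation, the normalized inverse document count ln(n/dfk)/ln(n), and the keyword-length factor Wk follow the wording of claims 2 and 4. All identifiers (logical_segments, build_thesaurus, THRESHOLD, and so on) are illustrative assumptions, not names from the patent, and the sketch assumes keywords are already extracted as whitespace-delimited tokens.

import math
import re
from collections import defaultdict
from itertools import combinations

THRESHOLD = 0.2  # assumed value; the claims only require "a threshold"

def logical_segments(document):
    # Split a document into logical segments (sentence-like runs) and
    # return the set of keywords seen in each segment.
    parts = re.split(r"[。.!?;\n]+", document)
    return [set(p.split()) for p in parts if p.strip()]

def build_thesaurus(documents):
    # Returns {(term_j, term_k): similarity} for term_j < term_k.
    n = len(documents)
    df = defaultdict(int)     # df_k: number of documents containing keyword k
    acc = defaultdict(float)  # sum over documents of f(wgt(Tij, Tik))

    for doc in documents:
        s = defaultdict(int)       # s(Tij): segments of this doc containing term j
        s_pair = defaultdict(int)  # s(Tij ∩ Tik): segments containing both terms
        for seg in logical_segments(doc):
            for t in seg:
                s[t] += 1
            for j, k in combinations(sorted(seg), 2):
                s_pair[(j, k)] += 1
        for t in s:
            df[t] += 1
        for (j, k), s_jk in s_pair.items():
            # Stand-in for the illegible claim-3 equation: the Dice
            # coefficient of the two terms' segment counts.
            wgt = 2.0 * s_jk / (s[j] + s[k])
            if wgt >= THRESHOLD:  # f(wgt) = wgt if wgt >= threshold, else 0
                acc[(j, k)] += wgt

    sim = {}
    for (j, k), total in acc.items():
        # Claims 2 and 4: accumulated weight times the normalized inverse
        # document count ln(n/df_k)/ln(n), times the keyword length W_k.
        idf_k = math.log(n / df[k]) / math.log(n) if n > 1 else 1.0
        sim[(j, k)] = total * idf_k * len(k)
    return sim

def related_terms(sim, query, top=10):
    # Rank terms related to `query` by similarity; per claims 8 to 10 the
    # same list could instead be ranked by term frequency, by occurrence
    # time, or by a similarity computed without the IDF factor.
    scored = [(k if j == query else j, value)
              for (j, k), value in sim.items() if query in (j, k)]
    return sorted(scored, key=lambda pair: -pair[1])[:top]

For example, related_terms(build_thesaurus(docs), "古蹟") yields the kind of similarity-ranked list shown in FIG. 5, and omitting the idf_k factor in the final loop reproduces the "similarity without IDF" ranking of claim 10.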
TW092112651A 2003-05-09 2003-05-09 Incremental thesaurus construction method TWI290684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW092112651A TWI290684B (en) 2003-05-09 2003-05-09 Incremental thesaurus construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW092112651A TWI290684B (en) 2003-05-09 2003-05-09 Incremental thesaurus construction method

Publications (2)

Publication Number Publication Date
TW200424874A true TW200424874A (en) 2004-11-16
TWI290684B TWI290684B (en) 2007-12-01

Family

ID=39327562

Family Applications (1)

Application Number Title Priority Date Filing Date
TW092112651A TWI290684B (en) 2003-05-09 2003-05-09 Incremental thesaurus construction method

Country Status (1)

Country Link
TW (1) TWI290684B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI393018B (en) * 2009-02-06 2013-04-11 Inst Information Industry Method and system for instantly expanding keyterm and computer readable and writable recording medium for storing program for instantly expanding keyterm
TWI399655B (en) * 2008-09-15 2013-06-21 Learningtech Corp Patent technical terms system, method for constructing the patent technical terms system, and computer program product
TWI456412B (en) * 2011-10-11 2014-10-11 Univ Ming Chuan Method for generating a knowledge map
TWI638274B (en) * 2016-04-12 2018-10-11 芋頭科技(杭州)有限公司 Semantic matching method and intelligent device
CN110276079A (en) * 2019-06-27 2019-09-24 谷晓佳 A kind of dictionary method for building up, information retrieval method and corresponding system

Also Published As

Publication number Publication date
TWI290684B (en) 2007-12-01

Similar Documents

Publication Publication Date Title
US8346795B2 (en) System and method for guiding entity-based searching
Wei et al. A survey of faceted search
Tseng Automatic thesaurus generation for Chinese documents
US20090024612A1 (en) Full text query and search systems and methods of use
US20060047649A1 (en) Internet and computer information retrieval and mining with intelligent conceptual filtering, visualization and automation
US20110191098A1 (en) Phrase-based document clustering with automatic phrase extraction
WO2005083597A1 (en) Intelligent search and retrieval system and method
Zhang et al. The use of dependency relation graph to enhance the term weighting in question retrieval
Ghanem et al. Stemming effectiveness in clustering of Arabic documents
Nevill-Manning et al. Browsing in digital libraries: a phrase-based approach
CN113761162B (en) Code searching method based on context awareness
Kerremans et al. Using data-mining to identify and study patterns in lexical innovation on the web: The NeoCrawler
TW200424874A (en) Automatic thesaurus construction method
JP2009217406A (en) Document retrieval device, method, and program
Wang Automatic thesaurus development: Term extraction from title metadata
US7970752B2 (en) Data processing system and method
Shah Review of indexing techniques applied in information retrieval
Asrori et al. Performance analysis graph-based keyphrase extraction in Indonesia scientific paper
Campos et al. Automatic hierarchical clustering of web pages
Milić-Frayling Text processing and information retrieval
Hajjem et al. Features extraction to improve comparable tweet corpora building
Gong et al. Subtopic-based multi-documents summarization
Hao et al. Categorizing and ranking search engine's results by semantic similarity
Güven et al. Advanced Information Extraction with n-gram based LSI
Lin et al. A supervised learning approach to biological question answering

Legal Events

Date Code Title Description
MK4A Expiration of patent term of an invention patent