TW591519B - Automatic ontology building system and method thereof - Google Patents
Automatic ontology building system and method thereof
- Publication number
- TW591519B
- Application number
- TW91125217A
- Authority
- TW
- Taiwan
- Prior art keywords
- mentioned
- ontology
- words
- word
- item
- Prior art date
Landscapes
- Machine Translation (AREA)
Description
The present invention relates to object-oriented ontology technology, and in particular to a system that can automatically construct an ontology for a Chinese-language knowledge domain.

An ontology can serve as a core of knowledge management, allowing knowledge to be managed effectively. How to construct the ontology of a specific knowledge domain has therefore become an important subject, so that knowledge management can be made more effective in the future. An ontology must be built for a particular domain (domain); its constituents are objects, values, relations, and functions. An object can represent a concept, and a concept may be a method, an idea, or an entity. Entities can be given values, so that they carry attribute-value data types. The relations between objects are defined explicitly and can be divided into specialization and generalization relations, which allow the inferences between objects to be interpreted either precisely or fuzzily. The components most commonly used in an ontology are the vocabulary representing objects and concepts, the attributes representing relations, and the relation weights representing the relative strength of the relations between objects.

At present, ontologies for Chinese knowledge domains are built with manual participation. Even where semi-automatic construction techniques are used, classifying documents by their keywords rests on an insufficient basis, and the classification of documents is ambiguous: a news article discussing the US stock market, for example, could be filed under the finance category or under the international news category.

Network technology advances by the day, and the era of knowledge sharing has arrived.
591519
V. Description of the Invention (2)

Vast amounts of disorganized information surround us; this is a problem that people must face today. When knowledge expands faster than we can absorb it, the assistance of automated computer processing becomes increasingly important, whether for individuals, companies, or government agencies; all of them encounter this problem in their daily business.

In view of this, the main purpose of the present invention is to provide a system and method that, through software for information and knowledge management, classifies network documents and integrates network knowledge, thereby raising the standard and productivity of the software industry.
Accordingly, the present invention provides an automatic ontology construction system that generates an ontology from a plurality of documents. The system comprises: a word segmentation module for extracting from the documents a plurality of feature words that are contained in the documents and belong to given parts of speech; a word clustering module for producing a plurality of clusters according to the similarity between the feature words, each cluster containing some of the feature words; a data mining module for generating a plurality of association rules, each association rule defining the relevance between feature words; a concept set building module, coupled to the word clustering module and the data mining module, for revising the clusters according to the association rules to produce a plurality of concept sets, each concept set containing some of the feature words; and an ontology construction agent, coupled to the concept set building module and the data mining module, for constructing the ontology according to the concept sets and the association rules.

0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd Page 6

V. Description of the Invention (3)

In this way, through the processing of the mechanisms described above, a specific-domain ontology composed of concept names, attributes, operations, and the relations between the concepts can be constructed.

[Embodiments]

The following embodiments illustrate, without limiting, the present invention. As shown in Fig. 1, the automatic ontology construction system of the invention comprises a word segmentation module 12, a part-of-speech filter 14, a refining filter 16, a frequency analysis module 18, a data mining module 20, a Chinese word clustering module 22, a refining clustering module 24, a concept set building module 26, and an ontology construction agent 30.
The system processes a plurality of documents 10 belonging to a specific domain and collected through the network, and produces an ontology structure 32 for that domain. The construction system achieves automatic ontology building through the processing steps performed by each group of components, and each component may also be implemented as a software program recorded on a recording medium. In this embodiment the automatic ontology construction system and method are designed in an object-oriented manner and construct an ontology for a specific Chinese-language domain; this, however, is not intended to limit the invention.

Word segmentation

First, the word segmentation module 12 performs word segmentation on the input Chinese documents 10 of the specific domain, for example deleting the prepositions and numerals in the documents and then tagging the part of speech of each segmented word, thereby identifying the verbs, nouns, and other words in the documents 10. The word segmentation module 12 thus produces from the documents 10 a first word set 50 tagged with part-of-speech attributes.
V. Description of the Invention (4)

The word segmentation module 12 may be implemented with the Chinese word segmentation system (CKIP) developed by Academia Sinica or with another segmentation system. Next, the part-of-speech filter (stop-word filter) 14 further processes the first word set 50 and selects from it a representative second word set, for example the verbs and nouns, which serves as the set of features for constructing the ontology.
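The filtering stage above can be sketched as follows. This is a minimal illustration, not the CKIP system itself: it assumes the text has already been segmented into (word, part-of-speech) pairs, and the tag names used here are assumptions for illustration. Like the part-of-speech filter 14, it keeps only nouns and verbs, so prepositions and numerals are dropped.

```python
# Hypothetical sketch of the part-of-speech filtering stage.
# Input: (word, POS) pairs as a segmentation system would emit;
# the tag names ("N", "V", "P", "NUM") are illustrative assumptions.
def pos_filter(tagged_words, keep_tags=("N", "V")):
    """Keep only the words whose tag is in keep_tags (nouns and verbs)."""
    return [word for word, tag in tagged_words if tag in keep_tags]

tagged = [
    ("警方", "N"),   # noun: police
    ("趕往", "V"),   # verb: rush to
    ("於", "P"),     # preposition: dropped
    ("三", "NUM"),   # numeral: dropped
    ("火場", "N"),   # noun: fire scene
]
features = pos_filter(tagged)  # the representative second word set
```

The same filter, given a richer tag set, could also implement the stricter corpus-based pass of the refining filter by shrinking `keep_tags`.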
Next, the refining filter 16, based on a Chinese corpus and on the basic part-of-speech analysis described above, filters the words in the documents more precisely and in finer detail, yielding the feature word set 54. In this embodiment the word segmentation module 12, the part-of-speech filter 14, and the refining filter 16 produce the word set for subsequent processing; this procedure, however, is not intended to limit the invention, and those skilled in the art may use other techniques to obtain the word set for subsequent analysis while still achieving the purpose of the invention.

The feature word set 54 and the documents 10 to which it belongs are then submitted simultaneously to term clustering and to data mining. Term clustering handles the similarity between words, while data mining handles the relevance between words; together they establish the concepts that the ontology requires.

Term clustering

In this embodiment, term clustering is performed by the frequency analysis module (term analyzer) 18, the Chinese word clustering module 22, and the refining clustering module 24.

The frequency analysis module 18 analyzes the frequency of the words contained in the documents 10 of each category. Because the high-frequency words of each category account for the majority of occurrences, selecting them reduces the amount of data and the training time, achieving higher efficiency and, in addition, greatly reducing the noise remaining after clustering. Fig. 2 is a table of document and word-count statistics for the news documents used as an example in this embodiment.
In Figs. 2 and 3 the training documents are news articles from the China Times electronic news for 2001, with statistics compiled separately for each news category; the frequency with which different words appear also differs between categories. Fig. 3 is a diagram of the word frequencies and the corresponding numbers of words for the stock-market/finance category of Fig. 2. As Fig. 3 shows, the number of words with low frequency is very large, while comparatively few words occur with high frequency. In this embodiment, therefore, the top-ranked high-frequency words can be taken as the input for clustering, although this selection criterion is not intended to limit the invention.

In this embodiment, the frequency analysis module 18 uses the feature word set 54 to select, according to this screening criterion, an input word set 58, together with the part of speech 55 of each word in the set 58 and the word distance 56 that expresses word similarity, for the subsequent clustering. It should be noted that although this embodiment uses the frequency analysis module 18 to screen the feature word set 54 by frequency of occurrence before clustering, this is not intended to limit the invention; for example, the feature word set 54 may also be fed directly to the clustering stage.
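The frequency screening step can be sketched as counting word occurrences within one category and keeping only the top-ranked words as the clustering input. The cutoff `top_k` is an assumed tunable parameter, since the embodiment states only that the highest-frequency words are kept.

```python
from collections import Counter

def select_input_words(category_words, top_k):
    """Count occurrences of the feature words of one news category and
    return the top_k most frequent ones as the clustering input."""
    freq = Counter(category_words)
    return [word for word, _count in freq.most_common(top_k)]

# Toy data standing in for the feature-word occurrences of one category.
words = ["警方", "警方", "警方", "火警", "火警", "議會"]
top2 = select_input_words(words, top_k=2)
```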
V. Description of the Invention (6)

Supplying the feature words directly as the input word set 58 of the Chinese word clustering module 22 also falls within the scope of the invention. The Chinese word clustering module 22 clusters words of high similarity into the same group. In this embodiment, the Chinese word clustering module 22 mainly uses a self-organizing map (SOM) model, a neural-network architecture; the detailed settings of the model are not described further here, and the following discussion concentrates on the clustering-input problems that must be solved in this embodiment. Since the word distance 56 and the part of speech 55 of each word are taken as the input features for clustering, the Chinese characters contained in the input word set 58 are converted with the UTF-16 mechanism of Unicode. Each character is converted into a 16-bit code; the code range of the characters is 2E80 to 9FFF in hexadecimal, or 11904 to 40959 in decimal. If values in this range were used directly as the input of the SOM clustering model, the clustering would be incorrect because the input values are too large. In this embodiment a mapping function is therefore used to map the values of the character range into the range -100 to 100, avoiding the incorrect clustering caused by excessively large input values. Fig. 4 is a diagram of the Chinese character code mapping function used in this embodiment.

For the correspondence between the input word set 58 and the input units of the SOM clustering model, this embodiment proposes the following three approaches.

The first approach presets SOM input units of fixed length, corresponding respectively to the part of speech of each word of the input word set 58 and to the characters the word contains. For example, the input length may be set to 6, where the first input unit corresponds to the part of speech of each input word and the second to sixth input units correspond to the Chinese characters contained in the word.
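A mapping of the kind shown in Fig. 4 can be sketched as a linear rescaling of the UTF-16 code point from the character range 2E80 to 9FFF into [-100, 100]. The linear form is an assumption; the patent specifies only the source range and the target range.

```python
LO, HI = 0x2E80, 0x9FFF  # character code range, decimal 11904 to 40959

def map_code(ch):
    """Map a character's UTF-16 code point linearly into [-100, 100]
    before it is fed to the SOM clustering model."""
    code = ord(ch)
    if not LO <= code <= HI:
        raise ValueError(f"code point U+{code:04X} outside the character range")
    return (code - LO) / (HI - LO) * 200.0 - 100.0
```

The endpoints map to -100 and 100 exactly, so every character in the range lands inside the interval the SOM expects.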
V. Description of the Invention (7)

This first approach has drawbacks in application. First, the input units of each data record of the SOM clustering model must be fixed, so if an input word is longer than the predetermined number of input units, the excess characters of the word cannot be represented. For example, the word 民進黨秘書長 has length 6, while only five character positions are preset above, so the character 長 cannot be entered for judgment. Second, a character appearing at different positions in different input words causes clustering errors: the words 警員 and 員警, for example, have the same nature and by common sense should be clustered together, but because the positions of their characters differ, they cannot be clustered together.

The second approach treats each character as the input of one corresponding input unit. This approach solves the problems of the first approach, namely over-long input words and the clustering errors caused by differing character positions. However, it makes the dimension of the input layer excessively large, and the training time of the SOM clustering model becomes very long. In the foregoing example, about 3000 distinct characters appear in the news documents of the political-focus category, and the feature words they compose number as many as 14000. Building the SOM input units in the second approach therefore gives an input-layer dimension (number of input units) of 3000, with as many as 14000 training records, so the clustering time of such a SOM model can be expected to be very long. On the other hand, in the second approach each input unit unambiguously represents its character, so no character encoding is needed as input: a character that appears is simply set to 1, and a character that does not appear is set to 0.

The third approach handles the clustering input by mapping multiple characters to a single input unit.
V. Description of the Invention (8)

By mapping several characters to one input unit, the third approach simultaneously solves the character-position problem of the first approach and the excessive input-layer dimension of the second. In this approach each input unit may correspond to more than one distinct character, and the distinct characters assigned to a unit are still distinguishable through their Unicode codes. When forming the character set of each input unit, however, characters that appear together in the same input word must not be listed as corresponding characters of the same input unit; otherwise clustering input errors occur.

Fig. 5 is a flowchart of the method of this embodiment for configuring input units so that several characters correspond to a single unit. The input data are the input word set 58 produced by the frequency analysis module 18, or the feature word set 54 obtained after word segmentation; the output data are the character sets corresponding to each dimension (input unit). As shown in Fig. 5, first the set of characters contained in all input words of the input word set 58 or the feature word set 54 is computed; let this set be {W1, ..., Wn}, with n characters in total (step S1). Next, for each of the characters W1 to Wn, the set of the other characters appearing in the input words to which it belongs is found; these sets are denoted SW1, SW2, ..., SWn (step S2). Finally, the characters W1 to Wn are processed in order and assigned at random to input units, with the restriction that the assigned input unit must not contain any element of the character's co-occurrence set SW1 to SWn (step S3); that is, no input unit may simultaneously correspond to characters contained in the same input word, thereby avoiding clustering input errors.

The following example illustrates this procedure. Consider the input word set {文學, 化學, 文化, 清大, 成大, 警員, 員警, 警察局}, whose character set is {文, 學, 化, 清, 大, 成, 警, 員, 察, 局}. The co-occurrence sets SW1 to SW10 of the characters are then found, namely:
V. Description of the Invention (9)

For each character, the co-occurrence set contains the other characters that appear in the same input words:

文 ~ 學, 化
學 ~ 文, 化
化 ~ 學, 文
清 ~ 大
大 ~ 清, 成
成 ~ 大
警 ~ 員, 察, 局
員 ~ 警
察 ~ 警, 局
局 ~ 警, 察

Fig. 6 shows the resulting assignment of the characters of the input words to the input units (dimensions) in this example. The characters contained in each dimension must be cross-checked against every word, so that characters appearing in the same input word are never listed as contained characters of the same dimension, which would cause clustering input errors. In this example the ten characters can be packed into only three dimensions, the third dimension corresponding, for example, to {化, 局}; because the input layer then has only three dimensions, the time spent on clustering is reduced.

After the Chinese word clustering module 22 produces the clustered word set 60, in this embodiment the clustered word set 60 is further processed by the refining clustering module 24. The clustering module 22 uses the neural-network SOM clustering model to gather the similar words of the input word set into large clusters, and these clusters constitute the clustered word set 60.
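Steps S1 to S3 above can be sketched as a greedy assignment. One simplification is assumed: where the patent assigns characters to units at random subject to the co-occurrence constraint, the sketch uses deterministic first-fit so that the result is reproducible; the constraint itself, that no unit may hold two characters from the same input word, is the one stated in step S3.

```python
def assign_input_units(words):
    # S1: collect the character set {W1, ..., Wn} in order of appearance.
    chars = []
    for w in words:
        for ch in w:
            if ch not in chars:
                chars.append(ch)
    # S2: for each character, the set SWi of other characters that
    # co-occur with it in some input word.
    conflict = {ch: set() for ch in chars}
    for w in words:
        for ch in w:
            conflict[ch].update(c for c in w if c != ch)
    # S3: put each character into the first unit that contains none of
    # its co-occurring characters; open a new unit when no existing fits.
    units = []
    for ch in chars:
        for unit in units:
            if not unit & conflict[ch]:
                unit.add(ch)
                break
        else:
            units.append({ch})
    return units

words = ["文學", "化學", "文化", "清大", "成大", "警員", "員警", "警察局"]
units = assign_input_units(words)
```

On the example word set of the embodiment, three units suffice for the ten characters, in agreement with the three-dimensional input layer described above.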
V. Description of the Invention (10)

The refining clustering module 24 then cuts each large cluster into sub-clusters. For example, when a large cluster is {議程, 常識, 警務, 參議, 警方, 議場, 學識, 警界, 警官, 會議, 檢警, 刑警, 省議會, 議會, 議案, 會議室, 決議案}, the refining clustering module 24 of this embodiment cuts the words of the large cluster that share the same character into the same sub-cluster, so the large cluster of this example can be subdivided into three sub-clusters:

sub-cluster 1: 議程, 參議, 議場, 省議會, 議會, 議案, 會議, 會議室, 決議案;
sub-cluster 2: 警務, 警方, 警界, 警官, 檢警, 刑警;
sub-cluster 3: 常識, 學識.

Finally, the whole term-clustering process produces the clustered word set 62 composed of the sub-clusters.

Data mining

Meanwhile, the feature word set 54 is also submitted to the data mining module 20, which builds association rules between words to find their relevance. Since the clusters above are formed according to literal similarity, the association rules built by the data mining module 20 serve to find the words whose relevance is strong. Each association rule describes the relevance between two words in terms of a support and a confidence: the support is the proportion of sentences in which the two words appear together, and the confidence is the probability that when one of the words appears the other also appears. Because the nature of each news category differs, thresholds of support and confidence must be determined separately for each news category.
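The refinement step, cutting a large SOM cluster into sub-clusters of words that share a character, can be sketched as grouping the words into connected components linked by shared characters. Treating shared-character linkage transitively is an assumption about the exact grouping rule, which the patent describes only by example.

```python
def refine_cluster(cluster):
    """Split a large cluster into sub-clusters of words linked by shared
    characters (connected components of the shared-character relation)."""
    groups = []  # each group: (set_of_characters, list_of_words)
    for word in cluster:
        word_chars = set(word)
        hits = [g for g in groups if g[0] & word_chars]
        if hits:
            # merge every group the word touches, plus the word itself
            merged_chars = set(word_chars)
            merged_words = []
            for g in hits:
                merged_chars |= g[0]
                merged_words.extend(g[1])
            merged_words.append(word)
            groups = [g for g in groups if g not in hits]
            groups.append((merged_chars, merged_words))
        else:
            groups.append((word_chars, [word]))
    return [g[1] for g in groups]

big_cluster = ["議程", "常識", "警務", "參議", "警方", "議場", "學識",
               "警界", "警官", "會議", "檢警", "刑警", "省議會", "議會",
               "議案", "會議室", "決議案"]
sub_clusters = refine_cluster(big_cluster)
```

On the example cluster of the embodiment this yields the three sub-clusters listed above, held together by the characters 議, 警, and 識 respectively.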
V. Description of the Invention (11)

Thresholds of support and confidence are therefore set for this purpose per news category. One way is to preset the (support, confidence) thresholds directly according to the characteristics of the different news categories, for example:

political focus: (10, 50)
international news: (10, 40)
stock market / finance: (15, 40)
cross-strait affairs: (15, 40)
society / local: (12, 40)
living / new knowledge: (8, 45)
sports / entertainment: (12, 50)

Another way is to determine the (support, confidence) thresholds from the number of documents of each news category, for example:

1. when 0 < N < Na/2: θs = 4%, θc = 70%;
2. when Na/2 ≤ N < Na: θs = 3%, θc = 80%;
3. when Na ≤ N < 3Na/2: θs = …, θc = 70%;
4. when 3Na/2 ≤ N < 2Na: θs = 1.5%, θc = ….

Here N is the number of documents of a news category, Na is the average number of documents over all news categories, and (θs, θc) are the support and confidence thresholds. According to condition 1, if the number of documents of a news category is smaller than half of the average number of documents over all news categories, then a rule whose support exceeds the threshold 4% and whose confidence exceeds 70% indicates that the two words are strongly related.
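The support and confidence of a candidate rule can be computed as sketched below: support as the percentage of sentences containing both words, confidence as the percentage of the sentences containing the first word that also contain the second. The per-category thresholds (θs, θc) are passed in as parameters, matching the category-dependent thresholds described above; the sentence matching by substring is an illustrative simplification.

```python
def rule_strength(sentences, a, b):
    """Return (support %, confidence %) of the rule a -> b over sentences."""
    n_total = len(sentences)
    n_a = sum(1 for s in sentences if a in s)
    n_both = sum(1 for s in sentences if a in s and b in s)
    support = 100.0 * n_both / n_total if n_total else 0.0
    confidence = 100.0 * n_both / n_a if n_a else 0.0
    return support, confidence

def strongly_related(sentences, a, b, theta_s, theta_c):
    """Apply the per-category thresholds (theta_s, theta_c) to a rule."""
    s, c = rule_strength(sentences, a, b)
    return s > theta_s and c > theta_c

# Toy sentences standing in for one news category's documents.
sentences = [
    "警方趕往火場",
    "警方逮捕歹徒",
    "火警發生後警方趕往火場",
    "議會開會",
]
```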
V. Description of the Invention (12)

Similarly, under condition 2, a support exceeding the threshold 3% together with a confidence exceeding 80% indicates that the two words are strongly related, and the relevance under conditions 3 and 4 is found in the same way by analogy from their thresholds.

Building concept sets

The concept set building module 26 builds the concept sets from the sub-clusters 62 of the term clustering and from the association rules 66 of the data mining. The sub-clusters obtained by term clustering are based on literal similarity, so the association rules found by data mining can be used in a second step to reinforce the relevance within the sub-clusters. For example, suppose one of the association rules 66 states that 刑事組長 (criminal-division leader) and 警方 (police) have (support, confidence) = (18.2%, 90.7%), indicating that the two words are strongly related; the concept set building module 26 then puts 刑事組長 into the aforementioned sub-cluster 2 to form the corresponding concept, namely:

concept 1: 議程, 參議, 議場, 省議會, 議會, 議案, 會議, 會議室, 決議案;
concept 2: 警務, 警方, 警界, 警官, 檢警, 刑警, 刑事組長;
concept 3: 常識, 學識.

The concept set building module 26 thus forms each concept from the words of a cluster together with the words mapped to it through the association rules.

Constructing the ontology

Finally, according to the association rules 66 and the concept sets 64, the ontology construction agent 30 can establish the name, attributes, and operations of each concept, as well as the relations between the concepts.
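The reinforcement step, pulling a rule-related word into an existing sub-cluster, can be sketched as follows. The rule format (a word pair plus support and confidence percentages) and the threshold values are illustrative assumptions; the (18.2%, 90.7%) figures for 刑事組長 and 警方 are the ones given in the example above.

```python
def build_concepts(sub_clusters, rules, theta_s, theta_c):
    """Copy the sub-clusters, then add each word that a sufficiently
    strong association rule links to a word already in a sub-cluster."""
    concepts = [list(c) for c in sub_clusters]
    for w1, w2, support, confidence in rules:
        if support <= theta_s or confidence <= theta_c:
            continue  # rule too weak for this category's thresholds
        for concept in concepts:
            if w2 in concept and w1 not in concept:
                concept.append(w1)
    return concepts

sub_clusters = [
    ["議程", "參議", "議場", "省議會", "議會", "議案", "會議", "會議室", "決議案"],
    ["警務", "警方", "警界", "警官", "檢警", "刑警"],
    ["常識", "學識"],
]
rules = [("刑事組長", "警方", 18.2, 90.7)]
concepts = build_concepts(sub_clusters, rules, theta_s=15, theta_c=80)
```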
V. Description of the Invention (13)

The ontology construction agent 30 is composed of four parts: a concept names construction agent, an operations construction agent, an attributes construction agent, and a relations construction agent. The part of speech of a concept name is a noun, the part of speech of a concept attribute is likewise a noun, and the part of speech of a concept operation is a verb; these are produced from the words of the same concept in the concept sets 64. A relation is determined by the association rules between different concepts, and its part of speech is a verb. Fig. 7 is a diagram of an example of the concepts and inter-concept relations established in this embodiment. Fig. 7 illustrates nine concepts, denoted by the symbol 90; each concept is subdivided into a concept name 92 (first column), concept attributes 94 (second column), and concept operations 96 (third column). The concept names of these nine concepts are 警方 (police), 青少年 (juveniles), 火場 (fire scene), 火警 (fire alarm), 歹徒 (gangsters), 搶 (robbery), 毒品 (drugs), 色情 (pornography), and 專案小組 (task force). The quadrilaterals between the concepts 90 represent relations 98, established according to the association rules: for example, 趕往 (rush to) is the relation between 警方 and 火場, while 搶救 (rescue) and 撲滅 (extinguish) are the relations between 火場 and 火警.

Finally, the concept names, operations, and attributes of all concepts and the relations between the concepts are stored, producing the ontology 32 of the specific domain and completing the automatic construction of the ontology.

According to the above, the automatic ontology construction system and method disclosed by the present invention can, without operator intervention, automatically construct an ontology from a given set of documents.
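The output of the four construction agents can be represented with simple data structures, sketched below for part of the Fig. 7 example. The class layout is an illustrative assumption; the patent specifies only that a concept carries a name (a noun), attributes (nouns), and operations (verbs), and that a relation between two concepts is a verb.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str                                       # noun, e.g. "警方"
    attributes: list = field(default_factory=list)  # nouns
    operations: list = field(default_factory=list)  # verbs

@dataclass
class Relation:
    verb: str    # e.g. "趕往" (rush to)
    source: str  # concept name
    target: str  # concept name

# Part of the Fig. 7 example: 警方, 火場, 火警 and their relations.
ontology = {
    "concepts": [Concept("警方"), Concept("火場"), Concept("火警")],
    "relations": [
        Relation("趕往", "警方", "火場"),
        Relation("搶救", "火場", "火警"),
        Relation("撲滅", "火場", "火警"),
    ],
}
```

Storing such records for every concept and relation yields the domain ontology 32 described above.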
0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd
Page 17

591519

V. Description of the Invention (14)

documents, generating an ontology database belonging to a specific Chinese knowledge domain; it therefore has considerable industrial value in knowledge-management applications.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Any person skilled in the art may make various modifications and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention shall therefore be as defined by the appended claims.
Page 18

591519

Brief Description of the Drawings

To make the above and other objects, features, and advantages of the present invention more apparent and easier to understand, preferred embodiments are set forth below and described in detail in conjunction with the accompanying drawings:

Fig. 1 shows the object-oriented ontology construction system of an embodiment of the present invention;
Fig. 7 is a schematic diagram of the concepts, and the relations between them, established in the embodiment;
[the captions of the remaining figures and the reference numerals 10 through 20 listed on this page are illegible in the source]
Page 19

591519

Brief Description of the Drawings

22 ~ Chinese word clustering module;
24 ~ cluster refining module;
26 ~ concept group establishment module;
30 ~ ontology construction agent;
32 ~ ontology;
50 ~ first phrase;
52 ~ second phrase;
54 ~ feature phrase;
55 ~ part of speech;
56 ~ word distance;
58 ~ input phrase;
60, 62 ~ clustered phrases;
64 ~ concept group;
66 ~ association rule;
90 ~ concept;
92 ~ concept name;
94 ~ concept attribute;
96 ~ concept operation;
98 ~ relation.
Page 20
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91125217A TW591519B (en) | 2002-10-25 | 2002-10-25 | Automatic ontology building system and method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91125217A TW591519B (en) | 2002-10-25 | 2002-10-25 | Automatic ontology building system and method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
TW591519B true TW591519B (en) | 2004-06-11 |
Family
ID=34057953
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW91125217A TW591519B (en) | 2002-10-25 | 2002-10-25 | Automatic ontology building system and method thereof |
Country Status (1)
Country | Link |
---|---|
TW (1) | TW591519B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832109B2 (en) | 2007-09-03 | 2014-09-09 | British Telecommunications Public Limited Company | Distributed system |
TWI608367B (en) * | 2012-01-11 | 2017-12-11 | 國立臺灣師範大學 | Text readability measuring system and method thereof |
2002
- 2002-10-25 TW TW91125217A patent/TW591519B/en not_active IP Right Cessation
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832109B2 (en) | 2007-09-03 | 2014-09-09 | British Telecommunications Public Limited Company | Distributed system |
TWI608367B (en) * | 2012-01-11 | 2017-12-11 | 國立臺灣師範大學 | Text readability measuring system and method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079444B (en) | Network rumor detection method based on multi-modal relationship | |
Zhao et al. | Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder | |
Tang et al. | Deep learning for sentiment analysis: successful approaches and future challenges | |
Li et al. | Analyzing COVID-19 on online social media: Trends, sentiments and emotions | |
CN111753024B (en) | Multi-source heterogeneous data entity alignment method oriented to public safety field | |
Li et al. | Context-aware group captioning via self-attention and contrastive features | |
Chen et al. | Visual and textual sentiment analysis using deep fusion convolutional neural networks | |
CN114444516B (en) | Cantonese rumor detection method based on deep semantic perception map convolutional network | |
CN112307364B (en) | Character representation-oriented news text place extraction method | |
Liu et al. | AMFF: A new attention-based multi-feature fusion method for intention recognition | |
CN112559747A (en) | Event classification processing method and device, electronic equipment and storage medium | |
Lai et al. | Transconv: Relationship embedding in social networks | |
CN113407697A (en) | Chinese medical question classification system for deep encyclopedia learning | |
CN114818724A (en) | Construction method of social media disaster effective information detection model | |
Chen et al. | Identifying Cantonese rumors with discriminative feature integration in online social networks | |
Liu et al. | Category-universal witness discovery with attention mechanism in social network | |
Zhu et al. | Design of knowledge graph retrieval system for legal and regulatory framework of multilevel latent semantic indexing | |
Ngo et al. | Discovering child sexual abuse material creators' behaviors and preferences on the dark web | |
Zamiralov et al. | Detection of housing and utility problems in districts through social media texts | |
Defersha et al. | Deep learning based multilabel hateful speech text comments recognition and classification model for resource scarce ethiopian language: The case of afaan oromo | |
TW591519B (en) | Automatic ontology building system and method thereof | |
Nguyen et al. | Human vs ChatGPT: Effect of Data Annotation in Interpretable Crisis-Related Microblog Classification | |
Arnfield | Enhanced Content-Based Fake News Detection Methods with Context-Labeled News Sources | |
Krishna et al. | A Deep Parallel Hybrid Fusion Model for disaster tweet classification on Twitter data | |
Singh et al. | Neural approaches towards text summarization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |