TW591519B - Automatic ontology building system and method thereof - Google Patents

Automatic ontology building system and method thereof Download PDF

Info

Publication number
TW591519B
TW591519B TW91125217A
Authority
TW
Taiwan
Prior art keywords
mentioned
ontology
words
word
item
Prior art date
Application number
TW91125217A
Other languages
Chinese (zh)
Inventor
Chia-Hsin Liao
Chang-Shing Lee
Yau-Hwang Kuo
I-Heng Meng
Chun-Chieh Kung
Original Assignee
Inst Information Industry
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inst Information Industry filed Critical Inst Information Industry
Priority to TW91125217A priority Critical patent/TW591519B/en
Application granted granted Critical
Publication of TW591519B publication Critical patent/TW591519B/en

Links

Landscapes

  • Machine Translation (AREA)

Abstract

An automatic ontology building system and method. The system comprises a term segmentation processing device, a term clustering processing device, a data mining module, a concept set creation module, and an ontology building agent. The term segmentation processing device extracts feature terms belonging to the verb and noun parts of speech from training documents. The term clustering processing device generates multiple clusters according to the character similarity of the feature terms. The data mining module generates multiple association rules according to the correlation of feature terms within the documents. The concept set creation module revises the clusters into concept sets according to the correlations defined in the association rules. The ontology building agent completes the construction of the ontology from the concept sets and the association rules.

Description


The present invention relates to object-oriented ontology technology, and in particular to constructing, in an automatic way, an ontology for a Chinese knowledge domain.

An ontology can serve as a core of knowledge management, enabling knowledge management to be carried out effectively; how to construct the ontology of a particular knowledge domain has therefore become an important subject, so that knowledge management can be made more effective in the future. An ontology must be built for a specific domain, and its components are objects, values, relations, and functions. An object can represent a concept, and a concept may be a method, an idea, or an entity. Entities can be given values, so that they carry attribute-value data types. Relations defined between objects can be divided into specialization and generalization, allowing inference between objects to be given a precise or a fuzzy interpretation. The components most commonly used in a general ontology are the vocabulary and concepts representing objects, the attributes representing relations, and the relation weights expressing the strength of the relations between objects.

At present, ontologies for Chinese knowledge domains are built with manual participation or, at best, with semi-automatic construction techniques; classifying documents by their keywords alone is an insufficient basis, and the classification itself is ambiguous. For example, a news item discussing the US stock market could be classified under the finance category or equally under the international news category.

Network technology advances by the day, bringing on the era of knowledge sharing.

Thousands of pieces of disorganized information flood our surroundings, and coping with them is a problem people must face today. As knowledge expands explosively beyond what individuals can absorb, the assistance of automated computer processing becomes ever more important, whether for individuals, companies, or government agencies, all of which encounter this problem in their daily business.

In view of this, the main purpose of the present invention is to provide an ontology automatic construction system and method: through the development of information and knowledge management software, network documents are constructed and classified and network knowledge is integrated, raising the standard and productivity of the software industry.

The invention therefore provides an automatic ontology construction system that generates an ontology from a plurality of documents. It includes: a term segmentation processing device, which extracts from the documents feature terms belonging to given parts of speech; a term clustering processing device, which generates a plurality of clusters according to the character similarity among the feature terms, each cluster containing part of the feature terms; a data mining module, which generates a plurality of association rules, each rule defining a correlation between feature terms; a concept set creation module, coupled to the term clustering processing device and the data mining module, which revises the clusters according to the correlations defined by the association rules to produce a plurality of concept sets, each containing feature terms; and an ontology construction agent, coupled to the concept set creation module and the data mining module, which constructs the ontology from the concept sets and the association rules. Through this training mechanism, the concept names and the relations among concepts constitute a domain-specific ontology.

[Embodiments]

The following embodiment illustrates, and does not limit, the invention. As shown in Figure 1, the ontology automatic construction system of the invention comprises a word segmentation module 12, a part-of-speech filter 14, a refining filter 16, a frequency analysis module 18, a data mining module 20, a Chinese term clustering module 22, a refined clustering module 24, a concept set creation module 26, and an ontology construction agent 30. The system processes a plurality of documents 10 of a specific domain to produce the ontology structure 32 constructed for that domain. The system achieves automatic ontology construction through the processing steps executed by each component; the components may be realized as software programs loaded from a recording medium. In this embodiment the system and method are designed in an object-oriented fashion and construct an ontology for a specific Chinese domain, but this is not intended to limit the invention.

Word segmentation processing

First, the word segmentation module 12 performs word segmentation on the input Chinese domain documents 10, deleting, for example, prepositions and numbers, and then tags the part of speech of each segmented term, thereby identifying the verbs, nouns, and so on in the documents 10.
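The segmentation-and-filtering stage above can be sketched as follows. The CKIP tagger's actual API and tagset are not given in the text, so the snippet assumes tokens arrive already segmented and tagged, with placeholder tag names ("N", "V", "P", "NUM").

```python
# A sketch of the part-of-speech filtering stage (modules 12-16),
# assuming tokens arrive already segmented and POS-tagged.
# The tag names "N" (noun), "V" (verb), "P" (preposition), "NUM" (number)
# are simplified placeholders, not the real CKIP tagset.

def filter_feature_terms(tagged_tokens, keep_tags=("N", "V")):
    """Drop prepositions, numbers, etc.; keep nouns and verbs as features."""
    return [(w, t) for (w, t) in tagged_tokens if t in keep_tags]

tokens = [("警方", "N"), ("於", "P"), ("三", "NUM"), ("趕往", "V"), ("火場", "N")]
features = filter_feature_terms(tokens)
# features == [("警方", "N"), ("趕往", "V"), ("火場", "N")]
```

In the real system the refining filter 16 would further prune this list against a Chinese corpus.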

From the documents 10, the word segmentation module 12 thus produces a first term set 50 whose terms carry part-of-speech attributes. The module 12 can be implemented with the word segmentation system (CKIP) developed by Academia Sinica or with another segmentation system. Next, the part-of-speech filter (Stop Word Filter) 14 further processes the first term set 50, selecting from it a representative second term set, for example the verbs and nouns, to serve as features for constructing the ontology.

The refining filter 16 then, after the basic part-of-speech analysis, applies a Chinese corpus to filter the terms in the documents more precisely and finely, yielding the feature term set 54. In this embodiment the word segmentation module 12, the part-of-speech filter 14, and the refining filter 16 produce the term set for subsequent processing, but this does not limit the invention; those skilled in the art may obtain the terms for subsequent analysis by other techniques and still achieve the purpose of the invention.

The extracted feature terms 54 and the documents 10 they belong to then undergo term clustering and data mining in parallel. Term clustering handles the similarity of terms, while data mining handles the association of terms; together they establish the concepts the ontology requires.

Term clustering processing

In this embodiment, term clustering is performed by the frequency analysis module (term analyzer) 18, the Chinese term clustering module 22, and the refined clustering module 24.

The frequency analysis module 18 analyzes the frequency of the terms contained in the documents 10 of different categories. Because low-frequency terms form the majority in these category documents, filtering so that only the higher-frequency terms of each category remain reduces the data volume and training time, achieving higher efficiency, and in addition greatly reduces the noise produced after clustering.

Figure 2 is a statistics table of the news documents and term counts used as an example in this embodiment. The training documents of Figure 2 are drawn from the China Times electronic news of 2001 and are divided by news category; the frequencies of different terms also differ by category. Figure 3 illustrates the term frequencies and corresponding term counts for the stock-market/finance news category of the Figure 2 example. As Figure 3 shows, the number of low-frequency terms is very large, while the number of higher-frequency terms is comparatively small. This embodiment therefore takes roughly the top 1000 terms as the clustering input, although this selection criterion is not intended to limit the invention.

In this embodiment, from the feature term set 54 the frequency analysis module 18 produces a frequency-screened input term set 58, together with the part of speech 55 of each term and the term distance 56 expressing term similarity, as input for the clustering that follows. It should be noted that although a frequency analysis module 18 is proposed here to screen the clustering input by frequency of occurrence, this does not limit the invention; the feature term set 54 could, for example, be fed directly to the Chinese term clustering module 22 as its input and still conform to the invention.
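The top-term screening performed by the frequency analysis module 18 can be sketched as a simple frequency count. The exact ranking statistic used in the patent is not specified, so plain term frequency over the category's documents is assumed here.

```python
from collections import Counter

def top_k_terms(documents, k=1000):
    """Keep only the k most frequent terms as clustering input,
    discarding the long low-frequency tail illustrated in Figure 3."""
    counts = Counter(term for doc in documents for term in doc)
    return [term for term, _ in counts.most_common(k)]

docs = [["警方", "火場"], ["警方", "議會"], ["警方"]]
print(top_k_terms(docs, k=1))  # ['警方']
```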


The Chinese term clustering module 22 clusters terms of high similarity into the same group according to Chinese term clustering. In this embodiment it mainly uses a self-organizing map algorithm (SOM) model of neural-network architecture; the detailed model settings are not described further, and the following concentrates on the clustering-input problems that must be solved in this embodiment. Since the term distance 56 and the part of speech 55 are taken as the input features for clustering, the Chinese characters contained in the input term set 58 are converted with the Unicode UTF-16 mechanism. Each character converts to a 16-bit code; the character codes fall in the range 2E80 to 9FFF, that is, 11904 to 40959 in decimal. Using values in this range directly as SOM clustering input would make the clustering incorrect because the input values are too large. This embodiment therefore uses a mapping function to map the character-code range into the range -100 to 100, avoiding incorrect clustering caused by overly large input values. Figure 4 is a schematic diagram of the Chinese character code mapping function adopted in this embodiment.

For the problem of matching the input term set 58 to the input units of the SOM clustering model, this embodiment proposes three approaches.

The first approach presets a fixed number of SOM input units corresponding to the part of speech of each term in the input term set 58 and to the Chinese characters it contains. For example, the input-unit length can be set to 6, where the first input unit corresponds to the part of speech of each input term and the second through sixth input units correspond to the Chinese characters contained in each input term.
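The encoding just described can be sketched as follows: UTF-16 code points rescaled into [-100, 100] and packed into a fixed-length vector whose first unit is the part of speech. The patent only shows the mapping function as Figure 4, so a linear rescaling is assumed here, and the part-of-speech code is an arbitrary placeholder.

```python
CODE_LOW, CODE_HIGH = 0x2E80, 0x9FFF  # 11904 .. 40959, per the description

def map_code(ch):
    """Rescale a character's UTF-16 code point into [-100, 100].
    A linear map is assumed; the patent only depicts the function in Figure 4."""
    return (ord(ch) - CODE_LOW) / (CODE_HIGH - CODE_LOW) * 200.0 - 100.0

def encode_fixed(word, pos_code, length=6):
    """First-scheme input vector: unit 1 = part of speech,
    units 2..length = character codes. Characters beyond the preset
    length cannot be represented (a drawback noted in the text)."""
    vec = [float(pos_code)]
    vec += [map_code(ch) for ch in word[:length - 1]]
    vec += [0.0] * (length - len(vec))  # pad short words with zeros
    return vec

v = encode_fixed("警員", pos_code=1)
# len(v) == 6; all character units lie within [-100, 100]
```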


However, this approach has drawbacks in application. First, the number of input units of each SOM data input layer must be fixed, so if an input term is longer than the preset number of units, the excess characters of the term cannot be mapped. For example, the term 「民進黨秘書長」 has length 6, while the character positions above are preset to 5, so the character 「長」 cannot be input for judgment. Second, a character occupying different positions in different input terms causes clustering errors: the two terms 「警員」 and 「員警」 have the same nature and by common sense should cluster together, but because the character positions of the two input terms differ, they cannot be clustered together.

The second approach treats every character as the input of its own corresponding input unit, which solves the first approach's problems of over-long input terms and of clustering errors caused by differing character positions. However, this approach makes the dimensionality of the input layer very large, and the training time of the SOM clustering model becomes very long. In the preceding example, the number of distinct characters appearing in the news documents of the political-focus category is about 3000, and the feature terms they compose number as many as 14000. Building the SOM input units in the second way thus gives an input-layer dimensionality (number of input units) of 3000 with as many as 14000 training records, so the clustering time of such a SOM model is foreseeably very long. On the other hand, in the second approach each input unit unambiguously represents its character, so no character encoding is needed: a character that occurs is simply set to 1 and one that does not is set to 0.
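The second scheme needs no character encoding at all; each distinct character owns one input unit that is 1 when the character occurs in the word and 0 otherwise. A minimal sketch:

```python
def one_hot_vectors(words):
    """One input unit per distinct character: 1 if the character
    appears in the word, 0 otherwise (no Unicode encoding needed)."""
    alphabet = sorted({ch for w in words for ch in w})
    index = {ch: i for i, ch in enumerate(alphabet)}
    vectors = []
    for w in words:
        v = [0] * len(alphabet)
        for ch in w:
            v[index[ch]] = 1
        vectors.append(v)
    return alphabet, vectors

alphabet, vecs = one_hot_vectors(["警員", "員警"])
# 「警員」 and 「員警」 now share one vector, removing the
# character-position problem, at the cost of one unit per character.
```

For a real category with roughly 3000 distinct characters this yields 3000-dimensional inputs, which is precisely the dimensionality problem described above.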

The third approach maps multiple characters to a single input unit, thereby simultaneously solving the character-position problem of the first approach and the excessive input-layer dimensionality of the second. In this approach each input unit can correspond to more than one character, and the different characters of a unit are still distinguished by their Unicode codes. However, when forming the character set corresponding to each input unit, characters that appear together in the same input term must not be listed as corresponding characters of the same input unit, or clustering input errors occur.

Figure 5 is a flowchart of the input-unit configuration method for this multi-character scheme in this embodiment. The input data is the input term set 58 produced by the frequency analysis module 18, or the segmented feature term set 54; the output data is each dimension (input unit) and its corresponding character set. As shown in Figure 5: first, compute the set of characters contained in all input terms of the input term set 58 or feature term set 54; let this set be {W1, ..., Wn}, with n characters in total (step S1). Next, for each character W1 through Wn, find the set of other characters appearing in the input terms to which it belongs, denoted SW1, SW2, ..., SWn respectively (step S2). Finally, process the characters W1 through Wn in order, assigning each at random to an input unit, with the constraint that the assigned unit must not contain any element of that character's co-occurrence set SW1 through SWn (step S3); that is, no input unit simultaneously corresponds to characters belonging to the same input term, thereby avoiding clustering input errors.
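Steps S1 to S3 can be read as coloring a character-conflict graph, where characters co-occurring in an input term must land in different units. A greedy sketch under that reading (deterministic first-fit placement is used here instead of the random placement mentioned in step S3):

```python
def assign_units(words):
    """S1: collect the character set. S2: build each character's
    co-occurrence set. S3: greedily place each character into the first
    input unit containing none of its co-occurring characters."""
    chars = sorted({ch for w in words for ch in w})        # S1
    conflicts = {ch: set() for ch in chars}                # S2
    for w in words:
        for ch in w:
            conflicts[ch].update(c for c in w if c != ch)
    units = []                                             # S3
    for ch in chars:
        for unit in units:
            if not (unit & conflicts[ch]):
                unit.add(ch)
                break
        else:
            units.append({ch})
    return units

words = ["文學", "化學", "文化", "清大", "成大", "警員", "員警", "警察局"]
units = assign_units(words)
# Far fewer units than distinct characters, and no two characters
# of the same input word share a unit.
```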

The flow above is illustrated with a concrete example. Consider the input terms {文學, 化學, 文化, 清大, 成大, 警員, 員警, 警察局}, whose character set is {文, 學, 化, 清, 大, 成, 警, 員, 察, 局}. Then the co-occurrence set SW1 through SW10 of each character, that is, the characters appearing in the same input term as that character, is computed as follows:

文 ~ 學, 化
學 ~ 文, 化
化 ~ 學, 文
清 ~ 大
大 ~ 清, 成
成 ~ 大
警 ~ 員, 察, 局
員 ~ 警
察 ~ 警, 局
局 ~ 警, 察

Figure 6 shows the resulting assignment of the input terms' characters to input units in this example, where the character set of each dimension (input unit) must be cross-checked against every term so that characters appearing in the same term are never listed under the same dimension, which would cause clustering input errors. In this example three dimensions suffice, the third dimension corresponding to {化, 局}; the input layer thus has only 3 dimensions, greatly reducing the time spent on clustering.

After the Chinese term clustering module 22 produces the clustered term set 60, in this embodiment the set 60 is further processed by the refined clustering module 24. The clustering module 22 uses the neural-network SOM clustering model to gather the similar terms of the input set into large clusters.

These large clusters constitute the clustered term set 60; the refined clustering module 24 then cuts each large cluster further into sub-clusters. For example, when a large cluster is {議程, 常識, 警務, 參議, 警方, 議場, 學識, 警界, 警官, 檢警, 刑警, 省議會, 議會, 議案, 會議, 會議室, 決議案}, the refined clustering module 24 of this embodiment cuts out the terms of the large cluster that share a common character into the same sub-cluster, so the large cluster above can be subdivided into three sub-clusters:

Sub-cluster 1: 議程, 參議, 議場, 省議會, 議會, 議案, 會議, 會議室, 決議案
Sub-cluster 2: 警務, 警方, 警界, 警官, 檢警, 刑警
Sub-cluster 3: 常識, 學識

Finally, the whole term clustering process produces the clustered term set 62 composed of the sub-clusters.

Data mining processing

Meanwhile, the feature term set 54 is also passed to the data mining module 20 to establish association rules among terms and find the correlations of terms. Since the clusters above are gathered purely on literal similarity, the association rules established by the data mining module 20 serve to identify the terms whose correlation is strong. Each association rule expresses the correlation between two terms, measured by support and confidence: support is the proportion in which the two terms appear together, and confidence is the probability that when one term appears the other term also appears. Because news categories differ in nature, the support and confidence thresholds must be determined for each news category.
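One reading of the refinement step just described is that words of a large cluster belong to the same sub-cluster when they are linked, directly or transitively, by a shared character. A union-find sketch under that assumption:

```python
def split_subclusters(cluster):
    """Split a large cluster into sub-clusters of words linked by a
    shared character (transitively), modeled as connected components
    of a share-a-character graph -- one reading of the refinement step."""
    parent = {w: w for w in cluster}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    first_word_with = {}
    for w in cluster:
        for ch in w:
            if ch in first_word_with:
                parent[find(w)] = find(first_word_with[ch])
            else:
                first_word_with[ch] = w
    groups = {}
    for w in cluster:
        groups.setdefault(find(w), []).append(w)
    return list(groups.values())

cluster = ["議程", "常識", "警務", "參議", "警方", "議場", "學識",
           "警界", "警官", "檢警", "刑警", "省議會", "議會", "議案",
           "會議", "會議室", "決議案"]
subs = split_subclusters(cluster)
# Three sub-clusters: the 議/會 terms, the 警 terms, and the 識 terms.
```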

One way is to preset the (support, confidence) thresholds directly according to the characteristics of each news category, for example:

政治焦點 (politics): (10, 50)
國際要聞 (international): (10, 40)
股市財金 (stock/finance): (15, 40)
兩岸風雲 (cross-strait): (15, 40)
社會地方 (society/local): (12, 40)
生活新知 (living): (8, 45)
運動娛樂 (sports/entertainment): (12, 50)

Another way determines the (support, confidence) thresholds from the number of documents of each news category. For example:

1. when 0 < N < Na/2, then θs = 4%, θc = 70%;
2. when Na/2 ≤ N < Na, then θs = 3%, θc = 80%;
3. when Na ≤ N < 3Na/2, then θs = …, θc = 70%;
4. when 3Na/2 ≤ N < 2Na, then θs = 1.5%, θc = …;

where N is the number of documents of a news category, Na is the average number of documents over all news categories, and (θs, θc) are the support and confidence thresholds. By condition 1, if the number of documents of a news category is smaller than half the average number of documents of all news categories, and an association rule's support exceeds condition 1's 4% threshold while its confidence exceeds condition 1's 70% threshold, the two terms are strongly associated.
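The two measures can be sketched at document granularity (the text is ambiguous about whether co-occurrence is counted per sentence or per document; documents are assumed here):

```python
def support_confidence(docs, a, b):
    """Support: fraction of documents containing both terms.
    Confidence of a => b: among documents containing a, the fraction
    that also contain b. One standard reading of the garbled passage."""
    n = len(docs)
    with_a = sum(1 for d in docs if a in d)
    with_both = sum(1 for d in docs if a in d and b in d)
    support = with_both / n
    confidence = with_both / with_a if with_a else 0.0
    return support, confidence

docs = [{"刑事組長", "警方"}, {"警方"}, {"刑事組長", "警方"}, {"議會"}]
s, c = support_confidence(docs, "刑事組長", "警方")
# s == 0.5, c == 1.0
```

A rule would then be kept only when (s, c) clears the category's (θs, θc) thresholds.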

Likewise, by condition 2, when the support exceeds the 3% threshold and the confidence exceeds 80%, the two terms are strongly associated, and strong associations are found analogously under the conditions of 3 and 4.

Concept set creation

The sub-clusters produced by the term clustering above are gathered on a literal basis, so the association rules found by the data mining processing can go a step further and reinforce the correlations within the sub-clusters. For example, suppose the association rules 66 contain a rule stating that 「刑事組長」 (criminal squad leader) and 「警方」 (police) have (18.2%, 90.7%), indicating that the two terms are strongly correlated; the concept set creation module 26 then places 「刑事組長」 into the aforementioned sub-cluster 2 to form the corresponding concept, that is:

Concept 1: 議程, 參議, 議場, 省議會, 議會, 議案, 會議, 會議室, 決議案
Concept 2: 警務, 警方, 警界, 警官, 檢警, 刑警, 刑事組長
Concept 3: 常識, 學識

The concept set creation module 26 thus builds each concept from the terms of a cluster together with the terms mapped to it by the association rules.

Construction of the ontology

Finally, from the association rules 66 and the concept sets 64, the ontology construction agent 30 establishes each concept's name, attributes, and operations, as well as the relations between concepts.
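The concept-set revision described above, pulling a rule-linked term into the sub-cluster of its strongly associated partner, can be sketched as follows (the threshold values are chosen arbitrarily for illustration):

```python
def build_concepts(subclusters, rules, s_min, c_min):
    """Promote each sub-cluster to a concept, then pull in every term
    that a sufficiently strong association rule ties to a member."""
    concepts = [set(sc) for sc in subclusters]
    for (a, b), (support, confidence) in rules.items():
        if support >= s_min and confidence >= c_min:
            for concept in concepts:
                if b in concept and a not in concept:
                    concept.add(a)
    return concepts

subclusters = [
    ["議程", "參議", "議場", "省議會", "議會", "議案", "會議", "會議室", "決議案"],
    ["警務", "警方", "警界", "警官", "檢警", "刑警"],
    ["常識", "學識"],
]
rules = {("刑事組長", "警方"): (18.2, 90.7)}  # the patent's example rule
concepts = build_concepts(subclusters, rules, s_min=12.0, c_min=40.0)
# 「刑事組長」 joins the 警 concept (Concept 2)
```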

591519 V. Description of the Invention (13)

(attributes) and operations, as well as the relations between concepts. The ontology construction agent 30 is composed of four parts: a concept-names construction agent, an operations construction agent, an attributes construction agent, and a relations construction agent. The part of speech of a concept name is a noun, and that of a concept attribute is likewise a noun, while that of a concept operation is a verb; all three are produced from the same concept in the concept sets 64. A relation is determined by the association rules between different concepts, and its part of speech is a verb.

Figure 7 is a schematic diagram of an example of the concepts, and the relations between concepts, established in this embodiment. Figure 7 illustrates nine concepts in total, each denoted by reference numeral 90 and subdivided into a concept name 92 (first field), concept attributes 94 (second field), and concept operations 96 (third field). The concept names of the nine concepts are: police, juvenile, fire scene, fire alarm, gangster, robbery, drugs, pornography, and task force. The quadrilaterals between the concepts 90 denote relations 98, established according to the association rules. For example, "rush to" is the relation between "police" and "fire scene", while "rescue" and "extinguish" are the relations between "fire scene" and "fire alarm".

Finally, the concept names, operations, and attributes of all concepts, together with the inter-concept relations, are stored to produce the domain-specific ontology 32, completing the automatic construction of the ontology.
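The three-field concept structure and verb-labeled relations described above map naturally onto simple record types. The following Python sketch is illustrative only — the class names are our own, and the attribute/operation values are invented placeholders, since the text does not reproduce Figure 7's actual field entries — but it rebuilds a small fragment of the Figure 7 example:

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    # Mirrors the three columns of Figure 7:
    # name 92 (noun), attributes 94 (nouns), operations 96 (verbs).
    name: str
    attributes: list = field(default_factory=list)
    operations: list = field(default_factory=list)

@dataclass
class Relation:
    # A relation 98 is a verb linking two concepts,
    # derived from an association rule.
    verb: str
    source: str
    target: str

# A fragment of the Figure 7 example; attribute/operation
# entries below are hypothetical placeholders.
police = Concept("police", operations=["arrest"])
fire_scene = Concept("fire scene")
fire_alarm = Concept("fire alarm")

relations = [
    Relation("rush to", "police", "fire scene"),
    Relation("rescue", "fire scene", "fire alarm"),
    Relation("extinguish", "fire scene", "fire alarm"),
]

# Relation labels are all verbs, as the description requires.
verbs = sorted({r.verb for r in relations})
print(verbs)  # ['extinguish', 'rescue', 'rush to']
```

Storing such records for every concept set, as the last paragraph describes, is what yields the domain-specific ontology 32.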

0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd0213-8459TWF (N); STLC-01-B-9127; Franklin.ptd

Page 17

591519 V. Description of the Invention (14)

According to the above, the automatic ontology construction system and method disclosed by the present invention can, without operator intervention, automatically generate from given documents an ontology database belonging to a specific Chinese knowledge domain, and therefore have considerable industrial value in knowledge-management applications.

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Any person skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention; the scope of protection is therefore defined by the appended claims.

Page 18

591519 Brief Description of the Drawings

To make the above and other objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings, in which:

Figure 1 shows the architecture of the ontology construction system of an embodiment of the present invention;
Figure 2 shows an example of document analysis in an embodiment, taking news documents as the training documents;
Figure 3 is a table of words appearing in example news documents, their occurrence frequencies, and the corresponding word counts;
Figure 4 illustrates the character-encoding mapping function adopted in an embodiment;
Figure 5 is a flowchart of the method of assigning multiple characters to a single input unit;
Figure 6 is a schematic diagram of an example of assigning input word sets to input units; and
Figure 7 is a schematic diagram of the concepts, and the relations between concepts, established in an embodiment.

Reference numerals:
10: document; 12: word segmentation module; 14: part-of-speech filter; 16: refining filter; 18: frequency analysis module; 20: data mining module;

Page 19

22: Chinese word clustering module; 24: refining clustering module; 26: concept set creation module; 30: ontology construction agent; 32: ontology;
50: first word set; 52: second word set; 54: feature word set; 55: part of speech; 56: word distance; 58: input word set; 60, 62: cluster word sets; 64: concept set; 66: association rules;
90: concept; 92: concept name; 94: concept attribute; 96: concept operation; 98: relation.

Page 20
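The reference numerals above trace the front end of the pipeline: a word segmentation module (12), a part-of-speech filter (14) that keeps only nouns and verbs, and a refining filter (16) that checks candidates against a lexicon. A toy Python sketch of that three-stage flow — the mini-lexicon, the sentence, and the whitespace "segmenter" standing in for real Chinese segmentation are all invented for illustration:

```python
# Toy pipeline mirroring modules 12/14/16: segment, filter by POS,
# then keep only words confirmed by a reference lexicon.
# Lexicon and input sentence are invented for illustration.
LEXICON = {
    "police": "noun", "rush": "verb", "to": "prep",
    "fire": "noun", "scene": "noun", "the": "det",
}

def segment(text):
    # Module 12: naive whitespace segmentation stands in for
    # a real Chinese word segmenter.
    return text.lower().split()

def pos_filter(words, keep=("noun", "verb")):
    # Module 14: retain only words whose part of speech is
    # a noun or a verb.
    return [w for w in words if LEXICON.get(w) in keep]

def refine(words):
    # Module 16: drop anything not in the lexicon and
    # de-duplicate while preserving order.
    seen, out = set(), []
    for w in words:
        if w in LEXICON and w not in seen:
            seen.add(w)
            out.append(w)
    return out

features = refine(pos_filter(segment("The police rush to the fire scene")))
print(features)  # ['police', 'rush', 'fire', 'scene']
```

The surviving words play the role of the feature words 54 that the clustering and data-mining stages consume.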

Claims (1)

591519 VI. Claims

1. An automatic ontology construction system for generating an ontology from a plurality of documents, comprising:
a word segmentation processing device, receiving the documents, for extracting from them a plurality of feature words, wherein the feature words are contained in the documents and belong to predetermined parts of speech;
a word clustering processing device, coupled to the word segmentation processing device, for generating a plurality of clusters according to the similarity of the characters contained in the feature words, each cluster containing some of the feature words;
a data mining module, coupled to the word segmentation processing device, for generating a plurality of association rules according to the relevance of the feature words within the documents, each association rule defining the relevance between feature words;
a concept set creation module, coupled to the word clustering processing device and the data mining module, for revising the clusters according to the association rules to produce a plurality of concept sets, each concept set containing some of the feature words; and
an ontology construction agent, coupled to the concept set creation module and the data mining module, for constructing the ontology according to the concept sets and the association rules.

2. The system of claim 1, wherein the predetermined parts of speech include nouns and verbs.

3. The system of claim 1, wherein the word segmentation processing device comprises:
a word segmentation module, for extracting the first plural words contained in the documents and determining their parts of speech;
a part-of-speech filter, for receiving the first plural words and, according to the predetermined parts of speech, determining a plurality of second words belonging to the predetermined parts of speech; and
a refining filter, for receiving the second words and comparing them against a lexicon to determine the feature words.

4. The system of claim 3, wherein the predetermined parts of speech include nouns and verbs.

5. The system of claim 1, wherein the word clustering processing device comprises a neural-network clustering model for generating the clusters according to the similarity of the feature words.

6. The system of claim 5, wherein the word clustering processing device further comprises a frequency analysis module for selecting, according to the occurrence frequencies of the feature words in the documents, the more frequent feature words as input words for the neural-network clustering model.

7. The system of claim 5, wherein the neural-network clustering model takes the feature words together with their part-of-speech and word-distance attributes as input.

8. The system of claim 5 or 6, wherein the feature words or input words are Chinese words, converted by the Unicode UTF-16 conversion mechanism and a mapping function before being used as input to the neural-network clustering model.

9. The system of claim 5 or 6, wherein the plural input units of the input layer of the neural-network clustering model correspond respectively to all the characters contained in the feature words or input words, the number of input units being lower than the number of contained characters.

10. The system of claim 5 or 6, wherein the word clustering processing device further comprises a refining clustering module for further splitting the output of the neural-network clustering model, according to the repetition of contained characters, to produce the clusters.

11. The system of claim 1, wherein the association rules include support and confidence parameters, within the documents, of a first feature word and a second feature word among the feature words.

12. The system of claim 11, wherein the association rules include a support threshold and a confidence threshold, compared with the support and confidence parameters to determine the relevance between the first feature word and the second feature word.

13. The system of claim 12, wherein the concept set creation module decides, according to the relevance between the first feature word and the second feature word, whether to add the first feature word to the cluster containing the second feature word, thereby forming the corresponding concept set.

14. The system of claim 1, wherein the ontology construction agent determines, according to the concept sets and the association rules, the concept names, concept attributes, concept operations, and inter-concept relation information corresponding to the concept sets, and constructs the ontology according to the concept names, attributes, operations, and relation information.
The relation between feature words; group, the above-mentioned relational rule, modifying the clustering and generating a plurality of concepts, each concept group contains a part of the above-mentioned feature words; and constructing the ontology according to the above-mentioned concept group and the above-mentioned relational rule. Method 16 The automatic construction method of ontology as described in item 15 of the scope of patent application, wherein the above-mentioned predetermined part of speech includes nouns and verbs. Fafu 17 The automatic ontology construction method described in item 15 of the patent scope of the application. The above steps for extracting the above-mentioned feature words include the following steps. · Based on the word segmentation module, extract its place from the above documents. Contains the first plural word and determines the part-of-speech of the first plural word; according to the predetermined part-of-speech, selecting the second plural word belonging to the above-mentioned predetermined part-of-speech from the above-mentioned / plural word; and comparing the second plural word with a thesaurus , Determine the feature words corresponding to the above documents. 1 8 · The ontology automatic construction method described in item 7 of the scope of patent application, wherein the above-mentioned predetermined part of speech includes nouns and verbs. 
19 · The ontology automatic construction method as described in item 15 of the scope of patent application, wherein the step of generating the above-mentioned clustering uses a type of neural network clustering model 0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd 第24頁 乃 1519 六 、申請專利範圍 梨實現。 法,20 ·如申請專利範圍第1 9項所述之本體自動建構方 其中更包括一步驟: 徵詞Ϊ據上述特徵詞在上述文件中的出現頻率,從上述特 網路取決定出現頻率較高之複數輸人詞’冑為上述類神經 j硌聚類模型的輸入。 4蜗竹、& 法,21 士如申請專利範圍第2〇項所述之本體自動建構方 ^中上述輸入詞係為中文詞,並且更包含一步驟: 述1丨用Urn code之UTF-丨6轉換機制以及_映射函數將上 w1 m ί做為上述類神經網路聚類模型的輸入。 22 .如申知專利範圍第"項所述之 去’其中上述特徵詞係為中文詞,並且更包含=方 ^ ^ ^#,j α ^ ^ ^ W ± 2Vf申丁上述類神經網路聚類模型的輸入。 23 .如申“利範圍第19項所述之本體 / ’其中上述類神經網路聚類模型之輪 用以分別對應上述特徵詞中所有包含字, π之數量低於上述包含字之數量。 予上述輸入早 24 .如申請專利範圍第2〇;所述 法’其中上述類神經網路聚類模型 ,自動建構方 元1以分別對應上述輪入詞中所“4的=入單 疋之數量低於上述包含字之數量。 子上述輸入單 、25 ·如申請專利範圍第19項所述之本 法,其中更包括一步驟: 本體自動建構方0213-8459TWF (N); STLC-01-B-9127; Franklin.ptd Page 24 is 1519 VI. Patent application scope Pear realization. Law, 20 · The automatic construction method of ontology as described in item 19 of the scope of patent application, which further includes a step: Soliciting words: According to the frequency of appearance of the above-mentioned characteristic words in the above documents, determine the frequency of appearance from the above-mentioned Internet. The high complex number input word '胄 is the input of the above neural-like j 硌 clustering model. 4 Snail, & method, 21 The above-mentioned input words in the automatic construction method of the ontology described in item 20 of the scope of patent application ^ are Chinese words, and further include a step: 1 UTF-Urn- The transformation mechanism and mapping function will use w1 m as the input of the above neural network clustering model. 22. As described in item " of the claimed patent, go to 'where the above-mentioned characteristic words are Chinese words, and further include = 方 ^ ^ ^ #, j α ^ ^ ^ W ± 2Vf Cluster model input. 23. 
The ontology as described in claim 19 of the “Scope of Interest” / wherein the wheel of the above-mentioned neural network clustering model is used to correspond to all the included words in the feature word, respectively, and the number of π is lower than the number of the included words. Give the above input as early as 24. As described in the patent application scope No. 20; the method 'where the above-mentioned neural network clustering model automatically constructs Fang Yuan 1 to correspond to the "4's = entry list" The number is lower than the number of included words mentioned above. The above input form, 25 · The method described in item 19 of the scope of patent application, which further includes a step: the automatic construction method of the ontology 之本體自動建構方 來類模型之輸出進-步分割處理產 26 ·如申請專利範圍第20項所述 古’其中更包括一步驟: 政取根^上述輸入詞之包含字的重複性,將上述類神經網 爪類模型之輸出進一步分割處理產生上述聚類。 法,2J ·如申請專利範圍第丨5項所述之本體自動建構方 1牲r Γ上述關連法則包含上述特徵詞中第一特徵詞和第 、欲词在上述文件之支持度參數和信賴度參數。 法,=·如申請專利範圍第27項所述之本體自動建構方 植值71ί ϊ ΐ連法則包含一支持度門檻值和一信賴度門 上 匕較上述支持度參數和上述信賴度參數,決定 ' 寺欲a司和上述第二特徵詞之關連性。 法,2中t申請專利範圍第28項所述之本體自動建構方 述第二牿上^上述聚類產生上述概念組之步驟,係根據上 述第-特it ::和上述第二特徵詞之關連性,決定是否將上 法,3』中利範圍第15項所述之本體自動建構方 述關連法則冓本體之步驟中,係根據上述概念組和上 和概念運算以及上述概今辦概心们生 述相丨八么p义伐心、、且之間的關係貝汛,並且根據上 心*、概心屬性、概念運算和關係資訊建構上述本 591519 六、申請專利範圍 3 1 · —種儲存媒體,用以儲存可執行於一電腦系統之 程式,上述程式包含用以執行如申請專利範圍第1 5項至第 3 0中任一項之本體自動建構方法之程式碼。The ontology automatically constructs the output of the class model. The step-by-step segmentation process produces 26. As described in item 20 of the scope of the patent application, the ancient 'which also includes a step: the political root ^ the repetition of the above-mentioned input words, including The output of the above-mentioned neural network claw model is further segmented to generate the above-mentioned clusters. 
Method, 2J · The ontology automatic construction method described in item 5 of the scope of the application for patent 1 r r Γ The above-mentioned relational rules include the first feature word, the first feature word, and the desired word in the above documents in the above-mentioned document support parameters and reliability parameter. Method, = · The automatic construction of the ontology as described in item 27 of the scope of the patent application, 71. 法 The Coupling Rule includes a support threshold and a trust threshold. The support parameter and the trust parameter are compared to determine 'The relevance of Siyu a Division and the above-mentioned second characteristic word. The method described in item 2 of the scope of patent application in item 2 of the method described above automatically constructs the second description above. The above step of clustering to generate the above-mentioned concept group is based on the above-mentioned it :: and the second feature word. Relevance determines whether the ontology described in item 15 of the middle scope of 3 ”above is used to automatically construct the dialectic connection rule. The steps of the ontology are based on the above-mentioned concept group and the concept of sum-up and the above-mentioned concepts. We have described the relationship between the Ba Yip and Yi Faxin, and the relationship between them, and constructed the above-mentioned 591519 based on the upper heart *, the generality attribute, the concept operation and the relationship information. A storage medium is used to store a program executable on a computer system, and the program includes code for executing the method for automatically constructing the ontology as described in any of claims 15 to 30 in the scope of patent application. 0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd 第27頁0213-8459TWF (N); STLC-01-B-9127; Franklin.ptd Page 27
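Claims 11–12 and 27–28 rate word pairs by support and confidence against fixed thresholds, in the manner of classic association-rule mining over per-document co-occurrence. A self-contained sketch of that test — the toy corpus and the threshold values are invented for illustration, not taken from the patent:

```python
# Support/confidence over per-document co-occurrence of feature
# words, as in claims 11-12 and 27-28. Corpus and thresholds
# below are invented.
DOCS = [
    {"police", "fire scene", "rescue"},
    {"police", "fire scene"},
    {"police", "drugs"},
    {"fire scene", "rescue"},
]

def support(a, b):
    # Fraction of documents containing both words.
    return sum(1 for d in DOCS if a in d and b in d) / len(DOCS)

def confidence(a, b):
    # Among documents containing a, the fraction also containing b.
    with_a = [d for d in DOCS if a in d]
    return sum(1 for d in with_a if b in d) / len(with_a)

MIN_SUPPORT, MIN_CONFIDENCE = 0.4, 0.6  # illustrative thresholds

def related(a, b):
    # Claims 12/28: both thresholds must be met for the pair to
    # count as relevant.
    return support(a, b) >= MIN_SUPPORT and confidence(a, b) >= MIN_CONFIDENCE

print(related("police", "fire scene"))  # True  (support 0.5, confidence 2/3)
print(related("police", "drugs"))       # False (support 0.25 fails)
```

Pairs passing both thresholds are the ones the concept set creation step (claims 13/29) uses to pull a first feature word into the cluster holding the second, and the relation verbs of claims 14/30 come from these same rules.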
TW91125217A 2002-10-25 2002-10-25 Automatic ontology building system and method thereof TW591519B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW91125217A TW591519B (en) 2002-10-25 2002-10-25 Automatic ontology building system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW91125217A TW591519B (en) 2002-10-25 2002-10-25 Automatic ontology building system and method thereof

Publications (1)

Publication Number Publication Date
TW591519B true TW591519B (en) 2004-06-11

Family

ID=34057953

Family Applications (1)

Application Number Title Priority Date Filing Date
TW91125217A TW591519B (en) 2002-10-25 2002-10-25 Automatic ontology building system and method thereof

Country Status (1)

Country Link
TW (1) TW591519B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8832109B2 (en) 2007-09-03 2014-09-09 British Telecommunications Public Limited Company Distributed system
TWI608367B (en) * 2012-01-11 2017-12-11 國立臺灣師範大學 Text readability measuring system and method thereof


Similar Documents

Publication Publication Date Title
CN111079444B (en) Network rumor detection method based on multi-modal relationship
Zhao et al. Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder
Tang et al. Deep learning for sentiment analysis: successful approaches and future challenges
Li et al. Analyzing COVID-19 on online social media: Trends, sentiments and emotions
CN111753024B (en) Multi-source heterogeneous data entity alignment method oriented to public safety field
Li et al. Context-aware group captioning via self-attention and contrastive features
Chen et al. Visual and textual sentiment analysis using deep fusion convolutional neural networks
CN114444516B (en) Cantonese rumor detection method based on deep semantic perception map convolutional network
CN112307364B (en) Character representation-oriented news text place extraction method
Liu et al. AMFF: A new attention-based multi-feature fusion method for intention recognition
CN112559747A (en) Event classification processing method and device, electronic equipment and storage medium
Lai et al. Transconv: Relationship embedding in social networks
CN113407697A (en) Chinese medical question classification system for deep encyclopedia learning
CN114818724A (en) Construction method of social media disaster effective information detection model
Chen et al. Identifying Cantonese rumors with discriminative feature integration in online social networks
Liu et al. Category-universal witness discovery with attention mechanism in social network
Zhu et al. Design of knowledge graph retrieval system for legal and regulatory framework of multilevel latent semantic indexing
Ngo et al. Discovering child sexual abuse material creators' behaviors and preferences on the dark web
Zamiralov et al. Detection of housing and utility problems in districts through social media texts
Defersha et al. Deep learning based multilabel hateful speech text comments recognition and classification model for resource scarce ethiopian language: The case of afaan oromo
TW591519B (en) Automatic ontology building system and method thereof
Nguyen et al. Human vs ChatGPT: Effect of Data Annotation in Interpretable Crisis-Related Microblog Classification
Arnfield Enhanced Content-Based Fake News Detection Methods with Context-Labeled News Sources
Krishna et al. A Deep Parallel Hybrid Fusion Model for disaster tweet classification on Twitter data
Singh et al. Neural approaches towards text summarization

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees