TW591519B - Automatic ontology building system and method thereof - Google Patents
Automatic ontology building system and method thereof
- Publication number
- TW591519B
- Application number
- TW91125217A
- Authority
- TW
- Taiwan
- Prior art keywords
- mentioned
- ontology
- words
- word
- item
- Prior art date
Landscapes
- Machine Translation (AREA)
Description
The present invention relates to object-oriented ontology technology, and in particular to a system that can automatically construct an ontology for a Chinese-language knowledge domain.

An ontology can serve as a core of knowledge management, allowing knowledge to be managed effectively. How to construct the ontology of a specific knowledge domain has therefore become an important subject, so that knowledge management can be made more effective in the future. An ontology must be built for a particular domain (domain); its constituents are objects, values, relations, and functions. An object can represent a concept, and a concept may be a method, an idea, or an entity. Entities can be given values, so that they carry attribute-value data types. The relations between objects are defined explicitly and can be divided into specialization and generalization relations, which allow the inferences between objects to be interpreted either precisely or fuzzily. The components most commonly used in an ontology are the vocabulary representing objects and concepts, the attributes representing relations, and the relation weights representing the relative strength of the relations between objects.

At present, ontologies for Chinese knowledge domains are built with manual participation. Even where semi-automatic construction techniques are used, classifying documents by their keywords rests on an insufficient basis, and the classification of documents is ambiguous: a news article discussing the US stock market, for example, could be filed under the finance category or under the international news category.

Network technology advances by the day, and the era of knowledge sharing has arrived.
591519
V. Description of the Invention (2)

Vast amounts of disorganized information surround us; this is a problem that people must face today. When knowledge expands faster than we can absorb it, the assistance of automated computer processing becomes increasingly important, whether for individuals, companies, or government agencies; all of them encounter this problem in their daily business.

In view of this, the main purpose of the present invention is to provide a system and method that, through software for information and knowledge management, classifies network documents and integrates network knowledge, thereby raising the standard and productivity of the software industry.
Accordingly, the present invention provides an automatic ontology construction system that generates an ontology from a plurality of documents. The system comprises: a word segmentation module for extracting from the documents a plurality of feature words that are contained in the documents and belong to given parts of speech; a word clustering module for producing a plurality of clusters according to the similarity between the feature words, each cluster containing some of the feature words; a data mining module for generating a plurality of association rules, each association rule defining the relevance between feature words; a concept set building module, coupled to the word clustering module and the data mining module, for revising the clusters according to the association rules to produce a plurality of concept sets, each concept set containing some of the feature words; and an ontology construction agent, coupled to the concept set building module and the data mining module, for constructing the ontology according to the concept sets and the association rules.

0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd Page 6

V. Description of the Invention (3)

In this way, through the processing of the mechanisms described above, a specific-domain ontology composed of concept names, attributes, operations, and the relations between the concepts can be constructed.

[Embodiments]

The following embodiments illustrate, without limiting, the present invention. As shown in Fig. 1, the automatic ontology construction system of the invention comprises a word segmentation module 12, a part-of-speech filter 14, a refining filter 16, a frequency analysis module 18, a data mining module 20, a Chinese word clustering module 22, a refining clustering module 24, a concept set building module 26, and an ontology construction agent 30.
The system processes a plurality of documents 10 belonging to a specific domain and collected through the network, and produces an ontology structure 32 for that domain. The construction system achieves automatic ontology building through the processing steps performed by each group of components, and each component may also be implemented as a software program recorded on a recording medium. In this embodiment the automatic ontology construction system and method are designed in an object-oriented manner and construct an ontology for a specific Chinese-language domain; this, however, is not intended to limit the invention.

Word segmentation

First, the word segmentation module 12 performs word segmentation on the input Chinese documents 10 of the specific domain, for example deleting the prepositions and numerals in the documents and then tagging the part of speech of each segmented word, thereby identifying the verbs, nouns, and other words in the documents 10. The word segmentation module 12 thus produces from the documents 10 a first word set 50 tagged with part-of-speech attributes.
V. Description of the Invention (4)

The word segmentation module 12 may be implemented with the Chinese word segmentation system (CKIP) developed by Academia Sinica or with another segmentation system. Next, the part-of-speech filter (stop-word filter) 14 further processes the first word set 50 and selects from it a representative second word set, for example the verbs and nouns, which serves as the set of features for constructing the ontology.
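The filtering stage above can be sketched as follows. This is a minimal illustration, not the CKIP system itself: it assumes the text has already been segmented into (word, part-of-speech) pairs, and the tag names used here are assumptions for illustration. Like the part-of-speech filter 14, it keeps only nouns and verbs, so prepositions and numerals are dropped.

```python
# Hypothetical sketch of the part-of-speech filtering stage.
# Input: (word, POS) pairs as a segmentation system would emit;
# the tag names ("N", "V", "P", "NUM") are illustrative assumptions.
def pos_filter(tagged_words, keep_tags=("N", "V")):
    """Keep only the words whose tag is in keep_tags (nouns and verbs)."""
    return [word for word, tag in tagged_words if tag in keep_tags]

tagged = [
    ("警方", "N"),   # noun: police
    ("趕往", "V"),   # verb: rush to
    ("於", "P"),     # preposition: dropped
    ("三", "NUM"),   # numeral: dropped
    ("火場", "N"),   # noun: fire scene
]
features = pos_filter(tagged)  # the representative second word set
```

The same filter, given a richer tag set, could also implement the stricter corpus-based pass of the refining filter by shrinking `keep_tags`.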
Next, the refining filter 16, based on a Chinese corpus and on the basic part-of-speech analysis described above, filters the words in the documents more precisely and in finer detail, yielding the feature word set 54. In this embodiment the word segmentation module 12, the part-of-speech filter 14, and the refining filter 16 produce the word set for subsequent processing; this procedure, however, is not intended to limit the invention, and those skilled in the art may use other techniques to obtain the word set for subsequent analysis while still achieving the purpose of the invention.

The feature word set 54 and the documents 10 to which it belongs are then submitted simultaneously to term clustering and to data mining. Term clustering handles the similarity between words, while data mining handles the relevance between words; together they establish the concepts that the ontology requires.

Term clustering

In this embodiment, term clustering is performed by the frequency analysis module (term analyzer) 18, the Chinese word clustering module 22, and the refining clustering module 24.

The frequency analysis module 18 analyzes the frequency of the words contained in the documents 10 of each category. Because the high-frequency words of each category account for the majority of occurrences, selecting them reduces the amount of data and the training time, achieving higher efficiency and, in addition, greatly reducing the noise remaining after clustering. Fig. 2 is a table of document and word-count statistics for the news documents used as an example in this embodiment.
In Figs. 2 and 3 the training documents are news articles from the China Times electronic news for 2001, with statistics compiled separately for each news category; the frequency with which different words appear also differs between categories. Fig. 3 is a diagram of the word frequencies and the corresponding numbers of words for the stock-market/finance category of Fig. 2. As Fig. 3 shows, the number of words with low frequency is very large, while comparatively few words occur with high frequency. In this embodiment, therefore, the top-ranked high-frequency words can be taken as the input for clustering, although this selection criterion is not intended to limit the invention.

In this embodiment, the frequency analysis module 18 uses the feature word set 54 to select, according to this screening criterion, an input word set 58, together with the part of speech 55 of each word in the set 58 and the word distance 56 that expresses word similarity, for the subsequent clustering. It should be noted that although this embodiment uses the frequency analysis module 18 to screen the feature word set 54 by frequency of occurrence before clustering, this is not intended to limit the invention; for example, the feature word set 54 may also be fed directly to the clustering stage.
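The frequency screening step can be sketched as counting word occurrences within one category and keeping only the top-ranked words as the clustering input. The cutoff `top_k` is an assumed tunable parameter, since the embodiment states only that the highest-frequency words are kept.

```python
from collections import Counter

def select_input_words(category_words, top_k):
    """Count occurrences of the feature words of one news category and
    return the top_k most frequent ones as the clustering input."""
    freq = Counter(category_words)
    return [word for word, _count in freq.most_common(top_k)]

# Toy data standing in for the feature-word occurrences of one category.
words = ["警方", "警方", "警方", "火警", "火警", "議會"]
top2 = select_input_words(words, top_k=2)
```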
V. Description of the Invention (6)

Supplying the feature words directly as the input word set 58 of the Chinese word clustering module 22 also falls within the scope of the invention. The Chinese word clustering module 22 clusters words of high similarity into the same group. In this embodiment, the Chinese word clustering module 22 mainly uses a self-organizing map (SOM) model, a neural-network architecture; the detailed settings of the model are not described further here, and the following discussion concentrates on the clustering-input problems that must be solved in this embodiment. Since the word distance 56 and the part of speech 55 of each word are taken as the input features for clustering, the Chinese characters contained in the input word set 58 are converted with the UTF-16 mechanism of Unicode. Each character is converted into a 16-bit code; the code range of the characters is 2E80 to 9FFF in hexadecimal, or 11904 to 40959 in decimal. If values in this range were used directly as the input of the SOM clustering model, the clustering would be incorrect because the input values are too large. In this embodiment a mapping function is therefore used to map the values of the character range into the range -100 to 100, avoiding the incorrect clustering caused by excessively large input values. Fig. 4 is a diagram of the Chinese character code mapping function used in this embodiment.

For the correspondence between the input word set 58 and the input units of the SOM clustering model, this embodiment proposes the following three approaches.

The first approach presets SOM input units of fixed length, corresponding respectively to the part of speech of each word of the input word set 58 and to the characters the word contains. For example, the input length may be set to 6, where the first input unit corresponds to the part of speech of each input word and the second to sixth input units correspond to the Chinese characters contained in the word.
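A mapping of the kind shown in Fig. 4 can be sketched as a linear rescaling of the UTF-16 code point from the character range 2E80 to 9FFF into [-100, 100]. The linear form is an assumption; the patent specifies only the source range and the target range.

```python
LO, HI = 0x2E80, 0x9FFF  # character code range, decimal 11904 to 40959

def map_code(ch):
    """Map a character's UTF-16 code point linearly into [-100, 100]
    before it is fed to the SOM clustering model."""
    code = ord(ch)
    if not LO <= code <= HI:
        raise ValueError(f"code point U+{code:04X} outside the character range")
    return (code - LO) / (HI - LO) * 200.0 - 100.0
```

The endpoints map to -100 and 100 exactly, so every character in the range lands inside the interval the SOM expects.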
V. Description of the Invention (7)

This first approach has drawbacks in application. First, the input units of each data record of the SOM clustering model must be fixed, so if an input word is longer than the predetermined number of input units, the excess characters of the word cannot be represented. For example, the word 民進黨秘書長 has length 6, while only five character positions are preset above, so the character 長 cannot be entered for judgment. Second, a character appearing at different positions in different input words causes clustering errors: the words 警員 and 員警, for example, have the same nature and by common sense should be clustered together, but because the positions of their characters differ, they cannot be clustered together.

The second approach treats each character as the input of one corresponding input unit. This approach solves the problems of the first approach, namely over-long input words and the clustering errors caused by differing character positions. However, it makes the dimension of the input layer excessively large, and the training time of the SOM clustering model becomes very long. In the foregoing example, about 3000 distinct characters appear in the news documents of the political-focus category, and the feature words they compose number as many as 14000. Building the SOM input units in the second approach therefore gives an input-layer dimension (number of input units) of 3000, with as many as 14000 training records, so the clustering time of such a SOM model can be expected to be very long. On the other hand, in the second approach each input unit unambiguously represents its character, so no character encoding is needed as input: a character that appears is simply set to 1, and a character that does not appear is set to 0.

The third approach handles the clustering input by mapping multiple characters to a single input unit.
V. Description of the Invention (8)

By mapping several characters to one input unit, the third approach simultaneously solves the character-position problem of the first approach and the excessive input-layer dimension of the second. In this approach each input unit may correspond to more than one distinct character, and the distinct characters assigned to a unit are still distinguishable through their Unicode codes. When forming the character set of each input unit, however, characters that appear together in the same input word must not be listed as corresponding characters of the same input unit; otherwise clustering input errors occur.

Fig. 5 is a flowchart of the method of this embodiment for configuring input units so that several characters correspond to a single unit. The input data are the input word set 58 produced by the frequency analysis module 18, or the feature word set 54 obtained after word segmentation; the output data are the character sets corresponding to each dimension (input unit). As shown in Fig. 5, first the set of characters contained in all input words of the input word set 58 or the feature word set 54 is computed; let this set be {W1, ..., Wn}, with n characters in total (step S1). Next, for each of the characters W1 to Wn, the set of the other characters appearing in the input words to which it belongs is found; these sets are denoted SW1, SW2, ..., SWn (step S2). Finally, the characters W1 to Wn are processed in order and assigned at random to input units, with the restriction that the assigned input unit must not contain any element of the character's co-occurrence set SW1 to SWn (step S3); that is, no input unit may simultaneously correspond to characters contained in the same input word, thereby avoiding clustering input errors.

The following example illustrates this procedure. Consider the input word set {文學, 化學, 文化, 清大, 成大, 警員, 員警, 警察局}, whose character set is {文, 學, 化, 清, 大, 成, 警, 員, 察, 局}. The co-occurrence sets SW1 to SW10 of the characters are then found, namely:
V. Description of the Invention (9)

For each character, the co-occurrence set contains the other characters that appear in the same input words:

文 ~ 學, 化
學 ~ 文, 化
化 ~ 學, 文
清 ~ 大
大 ~ 清, 成
成 ~ 大
警 ~ 員, 察, 局
員 ~ 警
察 ~ 警, 局
局 ~ 警, 察

Fig. 6 shows the resulting assignment of the characters of the input words to the input units (dimensions) in this example. The characters contained in each dimension must be cross-checked against every word, so that characters appearing in the same input word are never listed as contained characters of the same dimension, which would cause clustering input errors. In this example the ten characters can be packed into only three dimensions, the third dimension corresponding, for example, to {化, 局}; because the input layer then has only three dimensions, the time spent on clustering is reduced.

After the Chinese word clustering module 22 produces the clustered word set 60, in this embodiment the clustered word set 60 is further processed by the refining clustering module 24. The clustering module 22 uses the neural-network SOM clustering model to gather the similar words of the input word set into large clusters, and these clusters constitute the clustered word set 60.
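Steps S1 to S3 above can be sketched as a greedy assignment. One simplification is assumed: where the patent assigns characters to units at random subject to the co-occurrence constraint, the sketch uses deterministic first-fit so that the result is reproducible; the constraint itself, that no unit may hold two characters from the same input word, is the one stated in step S3.

```python
def assign_input_units(words):
    # S1: collect the character set {W1, ..., Wn} in order of appearance.
    chars = []
    for w in words:
        for ch in w:
            if ch not in chars:
                chars.append(ch)
    # S2: for each character, the set SWi of other characters that
    # co-occur with it in some input word.
    conflict = {ch: set() for ch in chars}
    for w in words:
        for ch in w:
            conflict[ch].update(c for c in w if c != ch)
    # S3: put each character into the first unit that contains none of
    # its co-occurring characters; open a new unit when no existing fits.
    units = []
    for ch in chars:
        for unit in units:
            if not unit & conflict[ch]:
                unit.add(ch)
                break
        else:
            units.append({ch})
    return units

words = ["文學", "化學", "文化", "清大", "成大", "警員", "員警", "警察局"]
units = assign_input_units(words)
```

On the example word set of the embodiment, three units suffice for the ten characters, in agreement with the three-dimensional input layer described above.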
V. Description of the Invention (10)

The refining clustering module 24 then cuts each large cluster into sub-clusters. For example, when a large cluster is {議程, 常識, 警務, 參議, 警方, 議場, 學識, 警界, 警官, 會議, 檢警, 刑警, 省議會, 議會, 議案, 會議室, 決議案}, the refining clustering module 24 of this embodiment cuts the words of the large cluster that share the same character into the same sub-cluster, so the large cluster of this example can be subdivided into three sub-clusters:

sub-cluster 1: 議程, 參議, 議場, 省議會, 議會, 議案, 會議, 會議室, 決議案;
sub-cluster 2: 警務, 警方, 警界, 警官, 檢警, 刑警;
sub-cluster 3: 常識, 學識.

Finally, the whole term-clustering process produces the clustered word set 62 composed of the sub-clusters.

Data mining

Meanwhile, the feature word set 54 is also submitted to the data mining module 20, which builds association rules between words to find their relevance. Since the clusters above are formed according to literal similarity, the association rules built by the data mining module 20 serve to find the words whose relevance is strong. Each association rule describes the relevance between two words in terms of a support and a confidence: the support is the proportion of sentences in which the two words appear together, and the confidence is the probability that when one of the words appears the other also appears. Because the nature of each news category differs, thresholds of support and confidence must be determined separately for each news category.
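The refinement step, cutting a large SOM cluster into sub-clusters of words that share a character, can be sketched as grouping the words into connected components linked by shared characters. Treating shared-character linkage transitively is an assumption about the exact grouping rule, which the patent describes only by example.

```python
def refine_cluster(cluster):
    """Split a large cluster into sub-clusters of words linked by shared
    characters (connected components of the shared-character relation)."""
    groups = []  # each group: (set_of_characters, list_of_words)
    for word in cluster:
        word_chars = set(word)
        hits = [g for g in groups if g[0] & word_chars]
        if hits:
            # merge every group the word touches, plus the word itself
            merged_chars = set(word_chars)
            merged_words = []
            for g in hits:
                merged_chars |= g[0]
                merged_words.extend(g[1])
            merged_words.append(word)
            groups = [g for g in groups if g not in hits]
            groups.append((merged_chars, merged_words))
        else:
            groups.append((word_chars, [word]))
    return [g[1] for g in groups]

big_cluster = ["議程", "常識", "警務", "參議", "警方", "議場", "學識",
               "警界", "警官", "會議", "檢警", "刑警", "省議會", "議會",
               "議案", "會議室", "決議案"]
sub_clusters = refine_cluster(big_cluster)
```

On the example cluster of the embodiment this yields the three sub-clusters listed above, held together by the characters 議, 警, and 識 respectively.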
V. Description of the Invention (11)

Thresholds of support and confidence are therefore set for this purpose per news category. One way is to preset the (support, confidence) thresholds directly according to the characteristics of the different news categories, for example:

political focus: (10, 50)
international news: (10, 40)
stock market / finance: (15, 40)
cross-strait affairs: (15, 40)
society / local: (12, 40)
living / new knowledge: (8, 45)
sports / entertainment: (12, 50)

Another way is to determine the (support, confidence) thresholds from the number of documents of each news category, for example:

1. when 0 < N < Na/2: θs = 4%, θc = 70%;
2. when Na/2 ≤ N < Na: θs = 3%, θc = 80%;
3. when Na ≤ N < 3Na/2: θs = …, θc = 70%;
4. when 3Na/2 ≤ N < 2Na: θs = 1.5%, θc = ….

Here N is the number of documents of a news category, Na is the average number of documents over all news categories, and (θs, θc) are the support and confidence thresholds. According to condition 1, if the number of documents of a news category is smaller than half of the average number of documents over all news categories, then a rule whose support exceeds the threshold 4% and whose confidence exceeds 70% indicates that the two words are strongly related.
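The support and confidence of a candidate rule can be computed as sketched below: support as the percentage of sentences containing both words, confidence as the percentage of the sentences containing the first word that also contain the second. The per-category thresholds (θs, θc) are passed in as parameters, matching the category-dependent thresholds described above; the sentence matching by substring is an illustrative simplification.

```python
def rule_strength(sentences, a, b):
    """Return (support %, confidence %) of the rule a -> b over sentences."""
    n_total = len(sentences)
    n_a = sum(1 for s in sentences if a in s)
    n_both = sum(1 for s in sentences if a in s and b in s)
    support = 100.0 * n_both / n_total if n_total else 0.0
    confidence = 100.0 * n_both / n_a if n_a else 0.0
    return support, confidence

def strongly_related(sentences, a, b, theta_s, theta_c):
    """Apply the per-category thresholds (theta_s, theta_c) to a rule."""
    s, c = rule_strength(sentences, a, b)
    return s > theta_s and c > theta_c

# Toy sentences standing in for one news category's documents.
sentences = [
    "警方趕往火場",
    "警方逮捕歹徒",
    "火警發生後警方趕往火場",
    "議會開會",
]
```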
V. Description of the Invention (12)

Similarly, under condition 2, a support exceeding the threshold 3% together with a confidence exceeding 80% indicates that the two words are strongly related, and the relevance under conditions 3 and 4 is found in the same way by analogy from their thresholds.

Building concept sets

The concept set building module 26 builds the concept sets from the sub-clusters 62 of the term clustering and from the association rules 66 of the data mining. The sub-clusters obtained by term clustering are based on literal similarity, so the association rules found by data mining can be used in a second step to reinforce the relevance within the sub-clusters. For example, suppose one of the association rules 66 states that 刑事組長 (criminal-division leader) and 警方 (police) have (support, confidence) = (18.2%, 90.7%), indicating that the two words are strongly related; the concept set building module 26 then puts 刑事組長 into the aforementioned sub-cluster 2 to form the corresponding concept, namely:

concept 1: 議程, 參議, 議場, 省議會, 議會, 議案, 會議, 會議室, 決議案;
concept 2: 警務, 警方, 警界, 警官, 檢警, 刑警, 刑事組長;
concept 3: 常識, 學識.

The concept set building module 26 thus forms each concept from the words of a cluster together with the words mapped to it through the association rules.

Constructing the ontology

Finally, according to the association rules 66 and the concept sets 64, the ontology construction agent 30 can establish the name, attributes, and operations of each concept, as well as the relations between the concepts.
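The reinforcement step, pulling a rule-related word into an existing sub-cluster, can be sketched as follows. The rule format (a word pair plus support and confidence percentages) and the threshold values are illustrative assumptions; the (18.2%, 90.7%) figures for 刑事組長 and 警方 are the ones given in the example above.

```python
def build_concepts(sub_clusters, rules, theta_s, theta_c):
    """Copy the sub-clusters, then add each word that a sufficiently
    strong association rule links to a word already in a sub-cluster."""
    concepts = [list(c) for c in sub_clusters]
    for w1, w2, support, confidence in rules:
        if support <= theta_s or confidence <= theta_c:
            continue  # rule too weak for this category's thresholds
        for concept in concepts:
            if w2 in concept and w1 not in concept:
                concept.append(w1)
    return concepts

sub_clusters = [
    ["議程", "參議", "議場", "省議會", "議會", "議案", "會議", "會議室", "決議案"],
    ["警務", "警方", "警界", "警官", "檢警", "刑警"],
    ["常識", "學識"],
]
rules = [("刑事組長", "警方", 18.2, 90.7)]
concepts = build_concepts(sub_clusters, rules, theta_s=15, theta_c=80)
```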
V. Description of the Invention (13)

The ontology construction agent 30 is composed of four parts: a concept names construction agent, an operations construction agent, an attributes construction agent, and a relations construction agent. The part of speech of a concept name is a noun, the part of speech of a concept attribute is likewise a noun, and the part of speech of a concept operation is a verb; these are produced from the words of the same concept in the concept sets 64. A relation is determined by the association rules between different concepts, and its part of speech is a verb. Fig. 7 is a diagram of an example of the concepts and inter-concept relations established in this embodiment. Fig. 7 illustrates nine concepts, denoted by the symbol 90; each concept is subdivided into a concept name 92 (first column), concept attributes 94 (second column), and concept operations 96 (third column). The concept names of these nine concepts are 警方 (police), 青少年 (juveniles), 火場 (fire scene), 火警 (fire alarm), 歹徒 (gangsters), 搶 (robbery), 毒品 (drugs), 色情 (pornography), and 專案小組 (task force). The quadrilaterals between the concepts 90 represent relations 98, established according to the association rules: for example, 趕往 (rush to) is the relation between 警方 and 火場, while 搶救 (rescue) and 撲滅 (extinguish) are the relations between 火場 and 火警.

Finally, the concept names, operations, and attributes of all concepts and the relations between the concepts are stored, producing the ontology 32 of the specific domain and completing the automatic construction of the ontology.

According to the above, the automatic ontology construction system and method disclosed by the present invention can, without operator intervention, automatically construct an ontology from a given set of documents.
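The output of the four construction agents can be represented with simple data structures, sketched below for part of the Fig. 7 example. The class layout is an illustrative assumption; the patent specifies only that a concept carries a name (a noun), attributes (nouns), and operations (verbs), and that a relation between two concepts is a verb.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    name: str                                       # noun, e.g. "警方"
    attributes: list = field(default_factory=list)  # nouns
    operations: list = field(default_factory=list)  # verbs

@dataclass
class Relation:
    verb: str    # e.g. "趕往" (rush to)
    source: str  # concept name
    target: str  # concept name

# Part of the Fig. 7 example: 警方, 火場, 火警 and their relations.
ontology = {
    "concepts": [Concept("警方"), Concept("火場"), Concept("火警")],
    "relations": [
        Relation("趕往", "警方", "火場"),
        Relation("搶救", "火場", "火警"),
        Relation("撲滅", "火場", "火警"),
    ],
}
```

Storing such records for every concept and relation yields the domain ontology 32 described above.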
0213-8459TWF(N);STLC-01-B-9127;Franklin.ptd
Page 17

591519

V. Description of the Invention (14)

documents, generating an ontology database belonging to a specific Chinese knowledge domain; it therefore has considerable industrial value in knowledge-management applications.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Any person skilled in the art may make various modifications and refinements without departing from the spirit and scope of the invention; the scope of protection of the invention shall therefore be as defined by the appended claims.
Page 18

591519

Brief Description of the Drawings

To make the above and other objects, features, and advantages of the present invention more apparent and easier to understand, preferred embodiments are set forth below and described in detail in conjunction with the accompanying drawings:

Fig. 1 shows the object-oriented ontology construction system of an embodiment of the present invention;
Fig. 7 is a schematic diagram of the concepts, and the relations between them, established in the embodiment;
[the captions of the remaining figures and the reference numerals 10 through 20 listed on this page are illegible in the source]
Page 19

591519

Brief Description of the Drawings

22 ~ Chinese word clustering module;
24 ~ cluster refining module;
26 ~ concept group establishment module;
30 ~ ontology construction agent;
32 ~ ontology;
50 ~ first phrase;
52 ~ second phrase;
54 ~ feature phrase;
55 ~ part of speech;
56 ~ word distance;
58 ~ input phrase;
60, 62 ~ clustered phrases;
64 ~ concept group;
66 ~ association rule;
90 ~ concept;
92 ~ concept name;
94 ~ concept attribute;
96 ~ concept operation;
98 ~ relation.
Page 20
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91125217A TW591519B (en) | 2002-10-25 | 2002-10-25 | Automatic ontology building system and method thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW91125217A TW591519B (en) | 2002-10-25 | 2002-10-25 | Automatic ontology building system and method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
TW591519B true TW591519B (en) | 2004-06-11 |
Family
ID=34057953
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW91125217A TW591519B (en) | 2002-10-25 | 2002-10-25 | Automatic ontology building system and method thereof |
Country Status (1)
Country | Link |
---|---|
TW (1) | TW591519B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832109B2 (en) | 2007-09-03 | 2014-09-09 | British Telecommunications Public Limited Company | Distributed system |
TWI608367B (en) * | 2012-01-11 | 2017-12-11 | 國立臺灣師範大學 | Text readability measuring system and method thereof |
2002
- 2002-10-25 TW TW91125217A patent/TW591519B/en not_active IP Right Cessation
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832109B2 (en) | 2007-09-03 | 2014-09-09 | British Telecommunications Public Limited Company | Distributed system |
TWI608367B (en) * | 2012-01-11 | 2017-12-11 | 國立臺灣師範大學 | Text readability measuring system and method thereof |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079444B (en) | Network rumor detection method based on multi-modal relationship | |
Zhao et al. | Cyberbullying detection based on semantic-enhanced marginalized denoising auto-encoder | |
Tang et al. | Deep learning for sentiment analysis: successful approaches and future challenges | |
Li et al. | Analyzing COVID-19 on online social media: Trends, sentiments and emotions | |
CN111753024B (en) | Multi-source heterogeneous data entity alignment method oriented to public safety field | |
Li et al. | Context-aware group captioning via self-attention and contrastive features | |
Chen et al. | Visual and textual sentiment analysis using deep fusion convolutional neural networks | |
CN114444516B (en) | Cantonese rumor detection method based on deep semantic perception map convolutional network | |
CN112307364B (en) | Character representation-oriented news text place extraction method | |
Liu et al. | AMFF: A new attention-based multi-feature fusion method for intention recognition | |
CN112559747A (en) | Event classification processing method and device, electronic equipment and storage medium | |
Lai et al. | Transconv: Relationship embedding in social networks | |
CN113407697A (en) | Chinese medical question classification system for deep encyclopedia learning | |
CN114818724A (en) | Construction method of social media disaster effective information detection model | |
Chen et al. | Identifying Cantonese rumors with discriminative feature integration in online social networks | |
Liu et al. | Category-universal witness discovery with attention mechanism in social network | |
Zhu et al. | Design of knowledge graph retrieval system for legal and regulatory framework of multilevel latent semantic indexing | |
Ngo et al. | Discovering child sexual abuse material creators' behaviors and preferences on the dark web | |
Zamiralov et al. | Detection of housing and utility problems in districts through social media texts | |
Defersha et al. | Deep learning based multilabel hateful speech text comments recognition and classification model for resource scarce ethiopian language: The case of afaan oromo | |
TW591519B (en) | Automatic ontology building system and method thereof | |
Nguyen et al. | Human vs ChatGPT: Effect of Data Annotation in Interpretable Crisis-Related Microblog Classification | |
Arnfield | Enhanced Content-Based Fake News Detection Methods with Context-Labeled News Sources | |
Krishna et al. | A Deep Parallel Hybrid Fusion Model for disaster tweet classification on Twitter data | |
Singh et al. | Neural approaches towards text summarization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
MM4A | Annulment or lapse of patent due to non-payment of fees |