CN101963989A

CN101963989A - Word elimination process for extracting domain ontology concept

Info

Publication number: CN101963989A
Application number: CN 201010502040
Authority: CN
Inventors: 党延忠; 于娟
Original assignee: Dalian University of Technology
Current assignee: Dalian University of Technology
Priority date: 2010-09-30
Filing date: 2010-09-30
Publication date: 2011-02-02

Abstract

The invention belongs to the technical field of artificial intelligence and relates to a method for extracting domain ontology concepts, in particular to a word elimination method for extracting domain ontology concepts. The technical scheme adopts an elimination method to extract the set of domain ontology concepts automatically, solving the technical problem that thresholds are difficult to set manually when extracting domain concepts. Given the set of words that appear in a domain corpus, the method first calculates the domain correlation degree of each word and deletes the words irrelevant to the domain; it then calculates the domain uniformity of the remaining words and deletes the words that are unevenly distributed in the domain corpus, thereby obtaining the set of domain ontology concepts. The method automatically obtains the set of domain-specific concepts from a text corpus composed of a foreground corpus (i.e. the domain corpus) and a background corpus (i.e. the non-domain corpus), thereby reducing the disagreements caused by subjective factors, such as the knowledge structure of domain experts, in the domain concept extraction process.

Description

Word elimination method for extracting domain ontology concepts

Technical field

The present invention relates to methods for extracting domain ontology concepts, and in particular to a word elimination method for extracting domain ontology concepts.

Background technology

A domain ontology concept (i.e. a concept specific to a domain, hereinafter a domain concept) is a unit of knowledge that describes the common features of a group of domain objects. Domain concept extraction methods are mainly used to support the construction of the set of words that denote domain concepts: they help domain experts collect the words that express domain concepts (domain terms) and unify those concepts, that is, build the set of terms that correspond uniquely to the domain concepts. A domain term is the most appropriate word for describing the domain and is the standardized expression of a domain concept.

Domain concept extraction uses a computer to simulate the behavior of human domain experts; it comprises the machine learning methods and techniques for obtaining the set of words that denote domain concepts. Because text corpora are easy to obtain, domain concept extraction is generally carried out on text corpora. Electronic documents belonging to the same domain contain the same terms, so these terms can be obtained from domain documents as domain concepts. Methods for extracting the set of domain concepts from domain documents fall into three categories: 1) linguistics-based methods, 2) statistics-based methods, and 3) hybrid methods.

Linguistics-based methods first derive templates from the particular lexical structures in which domain concepts occur in real corpora, and then extract the words that match these templates as domain concepts. Because most of these templates are language-specific, such methods require different processing for each language.

Statistics-based methods identify domain concepts mainly by the statistical features that distinguish domain concepts from non-domain-specific concepts in real corpora. Among existing methods for learning Chinese domain concepts, statistics-based methods are the mainstream. Patent 200510011131.0 proposes a method for process extraction, rule analysis and reuse based on mature process documents, which can extract the processes described in mature process documents. Chen Wenliang et al. use the Bootstrapping machine learning technique to obtain domain terms automatically from large-scale corpora without part-of-speech tagging. Zheng Jiaheng et al. propose calculating the weight of candidate words from two factors, position and word frequency, using a nonlinear function combined with the "Paired Comparisons" method, in order to extract keywords automatically. In his doctoral dissertation, Cheng Yong presents a statistics-based method for learning domain concepts from HowNet. He Yan et al. present a statistics-based method for learning computer ontology concepts from a specialized computer dictionary.

Hybrid methods combine linguistic and statistical methods and techniques in the hope of obtaining better learning results. Some methods apply a grammatical filter after statistical processing, extracting word combinations that are statistically significant and match given lexical templates; others first use linguistic methods to select candidates and then score these candidates with statistical methods. Du Bo et al. propose a domain-specific term extraction algorithm that combines rules and statistics. Zhang Xin has also studied an ontology concept learning method that combines rules and statistics.

Existing domain concept extraction methods decide whether a word is a domain-specific concept on the basis of a preset threshold. These methods first calculate, for each word, a statistic reflecting how specific the word is to the domain, and then decide whether the word is a domain concept by checking whether this value exceeds the preset threshold. The higher the specificity, the more likely the word is a domain concept. A higher threshold makes the extraction result more precise but lowers recall, and vice versa. Precision and recall are therefore a pair of conflicting performance measures: higher precision comes at the cost of lower recall. Moreover, because the threshold is set manually, the result depends on subjective factors such as the domain expert's knowledge structure, so the extracted domain concepts are not sufficiently objective.

Summary of the invention

The technical problem to be solved by the present invention is to provide a word elimination method for extracting domain ontology concepts, which removes the difficulty of having to set thresholds manually in the domain concept extraction process.

The present invention extracts the set of domain concepts automatically by eliminating non-domain-specific concepts. Given the set of words that appear in a domain corpus, the method automatically obtains the set of domain concepts from a text corpus composed of a foreground corpus (i.e. the domain corpus) and a background corpus (i.e. the non-domain corpus). The method first calculates the domain correlation degree of each word from the foreground and background corpora and eliminates the words unrelated to the domain; it then calculates the domain uniformity of the remaining words from the domain corpus and eliminates the words that are unevenly distributed in the domain corpus, i.e. words that are not yet in stable use in the domain. The result is the set of domain concepts.

The elimination method of the present invention deletes non-domain-specific concepts in two steps to obtain the set of domain concepts. The specific steps are as follows:

(1) Calculate the domain correlation degree between each word and the domain, and delete from the word set the words unrelated to the domain.

The domain correlation degree measures whether, and to what degree, a word is related to a domain. The domain correlation degree of word t and domain D_k is calculated as:

DR_{t,k} = \lg\left(\frac{P(t \mid Cf_k)}{P(t \mid Cb_k)}\right) \times \lg(TF_{t,k})

where P(t|Cf_k) and P(t|Cb_k) are the probabilities that t occurs in the foreground corpus Cf_k and the background corpus Cb_k, respectively. In actual computation they are estimated as:

E(P(t \mid Cf_k)) = \frac{TF_{t,k}}{mf_k}

E(P(t \mid Cb_k)) = \frac{\sum_{Cf_l \in Cb_k} TF_{t,l}}{mb_k}

TF_{t,i} = \sum_{c_j \in Cf_i} tf_{t,j}

where TF_{t,i} is the frequency with which word t occurs in the foreground corpus Cf_i, mf_i is the number of documents in Cf_i, mb_k is the number of documents in the background corpus Cb_k, and tf_{t,j} is the number of times t occurs in document c_j.
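The DR formula and its estimates can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the data model (each document as a word-to-count dictionary), the function names, and the handling of a word that never occurs in the background corpus are all assumptions, since the text does not specify them.

```python
import math
from typing import Dict, List

# Hypothetical data model: a document maps each word to its count tf_{t,j}.
Document = Dict[str, int]


def term_frequency(word: str, corpus: List[Document]) -> int:
    """TF_{t,i}: total occurrences of `word` over all documents of one corpus."""
    return sum(doc.get(word, 0) for doc in corpus)


def domain_correlation(word: str,
                       foreground: List[Document],
                       background: List[List[Document]]) -> float:
    """DR_{t,k} = lg(P(t|Cf_k) / P(t|Cb_k)) * lg(TF_{t,k}).

    `foreground` is Cf_k; `background` is the list of the other domains'
    foreground corpora Cf_l that together form Cb_k.
    """
    tf_fg = term_frequency(word, foreground)
    if tf_fg == 0:
        raise ValueError("word does not occur in the foreground corpus")
    weight = math.log10(tf_fg)  # lg(TF_{t,k})
    if weight == 0.0:
        # TF = 1: the second factor is zero, so the product is taken as zero.
        # ASSUMPTION: this is also applied when the background count is zero.
        return 0.0
    tf_bg = sum(term_frequency(word, corpus) for corpus in background)
    if tf_bg == 0:
        # ASSUMPTION: the source does not define DR when the word is absent
        # from the background corpus; here the ratio is treated as unbounded.
        return math.inf
    mb = sum(len(corpus) for corpus in background)  # mb_k
    p_fg = tf_fg / len(foreground)                  # E(P(t|Cf_k)) = TF / mf_k
    p_bg = tf_bg / mb                               # E(P(t|Cb_k))
    return math.log10(p_fg / p_bg) * weight
```

A word is kept for the next step only if the returned value is positive; words with DR ≤ 0 are eliminated.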

The DR formula consists of two parts: 1) lg(P(t|Cf_k) / P(t|Cb_k)) indicates whether the word is related to the domain: when the probability of the word occurring in the foreground corpus (i.e. the domain corpus) is higher than in the background corpus (i.e. the non-domain corpus), the word is said to be positively correlated with the domain; otherwise it is unrelated to the domain. Unrelated words are not taken as domain concepts. 2) lg(TF_{t,k}) gives high-frequency words a high DR value, i.e. a high degree of correlation with the domain.

Therefore, for general words such as "effect" and "enterprise", although their frequency in the foreground corpus is high and their distribution is highly uniform, such words are also evenly distributed in the background corpus, so their DR values in most domains are negative or zero; DR thus indicates that they are unrelated to the domain.
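As a worked illustration with made-up numbers (not taken from the patent's experiments), suppose a general word occurs TF_{t,k} = 200 times in a foreground corpus of mf_k = 100 documents, and 3000 times in a background corpus of mb_k = 1000 documents. Then:

```latex
E(P(t \mid Cf_k)) = \frac{200}{100} = 2, \qquad
E(P(t \mid Cb_k)) = \frac{3000}{1000} = 3

DR_{t,k} = \lg\left(\frac{2}{3}\right) \times \lg(200)
         \approx (-0.176) \times 2.301 \approx -0.405
```

Because DR ≤ 0, such a word would be eliminated in step (1) even though it is frequent in the foreground corpus.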

The results of repeated experiments show that, when the set of words contained in the foreground corpus is taken as input, the DR algorithm automatically deletes 40% to 50% of the words in the set.

The time complexity of computing the DR values of all words with this algorithm is O(n' + mf_k + mb_k), where n' is the number of words extracted from the domain corpus, and mf_k and mb_k are the numbers of documents in the foreground corpus and the background corpus, respectively.

(2) Calculate the domain uniformity of each word in the domain, and delete the words that are not yet in stable use in the domain.

The domain uniformity reflects how evenly a word that is positively correlated with the domain (DR > 0) is distributed over the texts of the domain corpus. The domain uniformity of word t in domain D_k is calculated as:

DC_{t,k} = \sum_{c_j \in Cf_k} \left( P(t \mid c_j) \times \lg \frac{1}{P(t \mid c_j)} \right)

where P(t|c_j) is the probability that t occurs in document c_j, and c_j is a document in the foreground corpus Cf_k. In actual computation, the present invention estimates P(t|c_j) as:

E(P(t \mid c_j)) = \frac{tf_{t,j}}{TF_{t,k}}

where tf_{t,j} is the frequency with which word t occurs in the j-th text of the domain foreground corpus Cf_k.
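A minimal Python sketch of the DC computation follows. The word-to-count dictionary representation of a document and the function name are assumptions made for illustration; documents in which the word does not occur are skipped, following the usual entropy convention that a zero-probability term contributes nothing.

```python
import math
from typing import Dict, List

# Hypothetical data model: a document maps each word to its count tf_{t,j}.
Document = Dict[str, int]


def domain_uniformity(word: str, foreground: List[Document]) -> float:
    """DC_{t,k} = sum over c_j in Cf_k of P(t|c_j) * lg(1 / P(t|c_j)),
    with P(t|c_j) estimated as tf_{t,j} / TF_{t,k}."""
    counts = [doc.get(word, 0) for doc in foreground]
    total = sum(counts)  # TF_{t,k}
    if total == 0:
        # ASSUMPTION: a word absent from the foreground corpus gets DC = 0.
        return 0.0
    dc = 0.0
    for tf in counts:
        if tf > 0:  # zero-count documents contribute nothing to the sum
            p = tf / total
            dc += p * math.log10(1.0 / p)
    return dc
```

With this estimate, a word concentrated in a single document yields exactly 0, while a word spread over more documents yields a larger value.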

It can be seen that the definition of DC is similar to that of information entropy. DC indicates whether a word is evenly distributed in Cf_k. The higher the DC value, the more evenly the word is distributed in the domain corpus, that is, the more domain documents of Cf_k it appears in and the more likely it is to be a domain concept. A DC value of 0 means that the word appears in only one domain document of the foreground corpus; such a word is automatically excluded from the set of domain concepts.
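The two boundary cases of DC follow directly from the formula. As a short derivation (n is a hypothetical document count introduced here, not a symbol from the patent): if t occurs in only one document, the single nonzero term has P(t|c_j) = 1; if t occurs equally often in n documents, each of them has P(t|c_j) = 1/n.

```latex
\text{one document:}\quad DC_{t,k} = 1 \times \lg\frac{1}{1} = 0

\text{$n$ documents, equal counts:}\quad
DC_{t,k} = \sum_{j=1}^{n} \frac{1}{n} \times \lg n = \lg n
```

So for an evenly spread word, DC grows with the number of documents in which it appears.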

For example, when learning the concepts of the knowledge management domain, the foreground corpus contains one document on knowledge management in coal enterprises, and "coal enterprise" does not appear in the corpora of the other domains that make up the background corpus, so the DR value of "coal enterprise" is positive. However, because its DC value is 0, it is not mistakenly included in the concept set of the knowledge management domain.

The results of repeated experiments show that, when the set of words contained in the foreground corpus is taken as input, the DC algorithm automatically deletes 20% to 30% of the words in the set.

The time complexity of computing the DC values of all words with this algorithm is O(n'' × mf_k), where n'' is the number of words remaining in the word set after screening by the DR algorithm.
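The two steps can be combined into one elimination pass, sketched below in Python. The data model, the function name `extract_domain_concepts`, and the decision to keep a word that is absent from the background corpus are assumptions for illustration; the text only states that words with DR ≤ 0 or DC = 0 are deleted.

```python
import math
from collections import Counter
from typing import Dict, List, Set

# Hypothetical data model: a document maps each word to its count tf_{t,j}.
Document = Dict[str, int]


def extract_domain_concepts(foreground: List[Document],
                            background: List[List[Document]]) -> Set[str]:
    """Keep the foreground words with DR > 0 and DC > 0."""
    mf = len(foreground)                            # mf_k
    mb = sum(len(corpus) for corpus in background)  # mb_k
    fg_tf: Counter = Counter()                      # TF_{t,k} for every word
    for doc in foreground:
        fg_tf.update(doc)
    bg_tf: Counter = Counter()                      # background totals
    for corpus in background:
        for doc in corpus:
            bg_tf.update(doc)

    # Step (1): delete words unrelated to the domain (DR <= 0).
    related = set()
    for word, tf in fg_tf.items():
        if tf <= 1:
            continue  # lg(TF) = 0, so DR is not positive
        if bg_tf[word] == 0:
            related.add(word)  # ASSUMPTION: absent from background => DR > 0
            continue
        dr = math.log10((tf / mf) / (bg_tf[word] / mb)) * math.log10(tf)
        if dr > 0:
            related.add(word)

    # Step (2): delete words not evenly distributed in the domain (DC = 0).
    concepts = set()
    for word in related:
        total = fg_tf[word]
        dc = sum((doc[word] / total) * math.log10(total / doc[word])
                 for doc in foreground if doc.get(word, 0) > 0)
        if dc > 0:
            concepts.add(word)
    return concepts
```

No threshold other than the sign of DR and DC is needed, which is the point of the elimination approach.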

The effect and benefit of the present invention is that it solves the practical problems and difficulties caused by the need to set thresholds manually in the domain concept extraction process. Domain concepts represent the topics of a domain, and their significance is that they can: 1) form the basis of a domain ontology; 2) standardize domain terms, facilitating smooth communication within the domain and international exchange among scholars; 3) support domain document representation and text mining and knowledge discovery tasks such as text clustering and text retrieval. A good domain ontology concept extraction method can improve the degree of automation and the performance of these tasks. Based on machine learning techniques, the present invention designs and implements a domain-independent method for the intelligent acquisition of domain concepts.

The present invention reduces the tedium of domain ontology concept extraction. It uses a computer to support the domain concept extraction process with machine learning techniques and can automatically delete non-domain-specific words, reducing the workload of domain experts.

Another benefit of the present invention is that it reduces the disagreements that arise during domain concept extraction from subjective factors such as the domain experts' knowledge structure. Based on statistical methods, the invention quantifies the degree to which a word belongs to a particular domain, and the quantified results reduce the disputes caused by subjective factors.

Description of drawings

The accompanying drawing is a block diagram of the structure of the present invention.

Embodiment

The specific embodiment of the present invention is described in detail below with reference to the technical scheme and the attached tables.

Embodiment 1

As shown in the accompanying drawing:

1) Corpus. The method uses a foreground corpus and a background corpus to obtain domain concepts. The foreground corpus is a library of domain documents rich in domain concepts and should generally consist of a number of standardized domain texts. The background corpus is a library of electronic documents that is contrasted with the foreground corpus in order to highlight the different statistical features that domain concepts exhibit in domain and non-domain documents; it is drawn from a corpus covering three or more different domains.

The corpus C consists of the foreground corpora of m domains (m ≥ 3). When learning the domain-specific concepts of domain D_k, the foreground corpus is Cf_k, and the background corpus Cb_k consists of the foreground corpora Cf_l (1 ≤ l ≤ m, l ≠ k) of the other m - 1 domains in the corpus. The foreground corpus (i.e. the domain corpus) Cf_k is required to fully contain all the domain-specific concepts of D_k and to reflect how the concept words are actually used.

2) Word set: the set consisting of the words that denote domain-specific concepts in the foreground corpus (i.e. the domain corpus) together with other general words.

3) Domain ontology concept extraction module: deletes non-domain-specific concepts by calculating the domain correlation degree and the domain uniformity, and outputs the set of domain concepts.

4) Domain ontology concept set: the set of domain-specific concepts.
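The corpus layout described in item 1) can be sketched as follows: the foreground corpus of the target domain D_k is Cf_k, and the background corpus Cb_k is formed from the foreground corpora of all other domains. The function name `split_corpus`, the dictionary keyed by domain name, and the toy domain names in the usage note are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical data model: a document maps each word to its count.
Document = Dict[str, int]


def split_corpus(corpus: Dict[str, List[Document]],
                 domain: str) -> Tuple[List[Document], List[List[Document]]]:
    """Return (Cf_k, Cb_k) for the target domain.

    `corpus` maps each domain name to that domain's foreground corpus.
    Cb_k is returned as the list of the other m - 1 foreground corpora.
    """
    if len(corpus) < 3:
        raise ValueError("the corpus must cover at least three domains (m >= 3)")
    if domain not in corpus:
        raise KeyError(domain)
    foreground = corpus[domain]
    background = [docs for name, docs in corpus.items() if name != domain]
    return foreground, background
```

For example, with a corpus keyed by "knowledge management", "mining" and "finance", calling `split_corpus(corpus, "knowledge management")` would return that domain's documents as the foreground and the other two domains' documents as the background.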

In practice, when building a set of domain ontology concepts, a manual revision step can be added after the automatic extraction in order to improve the accuracy of the result. Manual revision is the process in which experts manually modify the automatically extracted set of domain concepts: the domain experts delete the items in the extraction result that are not domain-specific concepts and add domain concepts that do not occur in the domain corpus.

This embodiment uses the elimination method of the present invention to extract the set of domain ontology concepts of the knowledge management domain. The foreground corpus used in this embodiment consists of abstracts of project proposals in the knowledge management domain: 317 texts totalling 80,000 Chinese characters. The background corpus covers 75 domains and contains 37,443 project proposal abstracts in total, about 10,000,000 Chinese characters.

In this embodiment, the foreground corpus contains 4431 words, i.e. the word set input to the domain concept extraction module contains 4431 words in total. In the automatic extraction, 2059 words are unrelated to the knowledge management domain (DR ≤ 0) and are deleted first. In total, 2597 words are deleted because they are either unrelated to the domain or appear in only one domain document (DR ≤ 0 or DC = 0). The remaining 4431 - 2597 = 1834 words are taken as domain ontology concepts.

Table 1 and Table 2 show the 10 words with the largest DR values and the 10 words with the largest DC values in the knowledge management domain, respectively. The words in each table are sorted in descending order of the column whose header carries the "↓" symbol.

Table 1: The 10 words with the largest DR values in the knowledge management domain

Table 2: The 10 words with the largest DC values in the knowledge management domain

Notes on Table 1 and Table 2:

1) To keep the data concise without affecting the presentation of the results, the values in the DR and DC columns of both tables are rounded.

2) The words in Table 1 are highly correlated with the knowledge management domain: they occur frequently in the knowledge management corpus, far more often than in the background corpus. However, a few words, such as "knowledge conversion", are poorly distributed.

3) The words in Table 2 are very evenly distributed in the foreground corpus and appear in almost every text of it. However, words such as "enterprise" and "management" have low DR values; after the combined processing these words are eliminated, leaving only three words of Table 2: "knowledge", "information management" and "information management theory".

Claims

1. A word elimination method for extracting domain ontology concepts, characterized by comprising the following steps:

(1) calculating the domain correlation degree between each word and the domain, and deleting from the word set the words unrelated to the domain;

the domain correlation degree of word t and domain D_k is calculated as:

DR_{t,k} = \lg\left(\frac{P(t \mid Cf_k)}{P(t \mid Cb_k)}\right) \times \lg(TF_{t,k})

where P(t|Cf_k) and P(t|Cb_k) are the probabilities that t occurs in the foreground corpus Cf_k and the background corpus Cb_k, respectively; in actual computation they are estimated as:

E(P(t \mid Cf_k)) = \frac{TF_{t,k}}{mf_k}

E(P(t \mid Cb_k)) = \frac{\sum_{Cf_l \in Cb_k} TF_{t,l}}{mb_k}

TF_{t,i} = \sum_{c_j \in Cf_i} tf_{t,j}

where TF_{t,i} is the frequency with which word t occurs in the foreground corpus Cf_i, mf_i is the number of documents in Cf_i, mb_k is the number of documents in the background corpus Cb_k, and tf_{t,j} is the number of times t occurs in document c_j;

(2) calculating the domain uniformity of each word in the domain, and deleting the words that are not yet in stable use in the domain;

the domain uniformity reflects how evenly a word that is positively correlated with the domain (DR > 0) is distributed over the texts of the domain corpus; the domain uniformity of word t in domain D_k is calculated as:

DC_{t,k} = \sum_{c_j \in Cf_k} \left( P(t \mid c_j) \times \lg \frac{1}{P(t \mid c_j)} \right)

where P(t|c_j) is the probability that t occurs in document c_j, and c_j is a document in the foreground corpus Cf_k; in actual computation, the present invention estimates P(t|c_j) as:

E(P(t \mid c_j)) = \frac{tf_{t,j}}{TF_{t,k}}