CN101963989A - Word elimination process for extracting domain ontology concept - Google Patents

Word elimination process for extracting domain ontology concept

Info

Publication number
CN101963989A
Authority
CN
China
Prior art keywords
field
domain
word
language material
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 201010502040
Other languages
Chinese (zh)
Inventor
党延忠
于娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology
Priority to CN 201010502040
Publication of CN101963989A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and relates to methods for extracting domain ontology concepts, in particular to a word elimination method for extracting domain ontology concepts. The technical scheme uses an elimination process to extract the set of domain ontology concepts automatically, which solves the technical problem that thresholds are difficult to configure manually when extracting domain concepts. Given the set of words that appear in a domain corpus, the method first calculates the domain relevance of each word and deletes the words irrelevant to the domain; it then calculates the domain uniformity of the remaining words and deletes the words that are unevenly distributed in the domain corpus, which yields the set of domain ontology concepts. The method automatically obtains the set of domain-specific concepts from a text corpus composed of a foreground corpus (i.e. the domain corpus) and a background corpus (i.e. the non-domain corpus), thereby reducing the disagreement caused by subjective factors, such as the knowledge structure of domain experts, in the domain concept extraction process.

Description

Word elimination method for extracting domain ontology concepts
Technical field
The present invention relates to methods for extracting domain ontology concepts, and in particular to a word elimination method for extracting domain ontology concepts.
Background art
A domain ontology concept (a concept specific to a domain, hereafter "domain concept") is a unit of knowledge that describes the common features of a group of domain objects. Domain concept extraction methods are mainly used to support building the set of words for domain concepts: they help domain experts collect the words (domain terms) of domain concepts and unify them, that is, build the set of terms that corresponds uniquely to the domain concepts. A domain term is the most appropriate word for describing the domain and is the standardized term that represents a domain concept.
Domain concept extraction methods are machine learning methods and techniques that use a computer to simulate the behavior of human domain experts in order to obtain the set of words of domain concepts. Because text corpora are easy to obtain, domain concept extraction is generally based on text corpora. Electronic documents that belong to the same domain contain the same terms, so these terms can be obtained from domain documents and used as domain concepts. Methods for extracting the set of domain concepts from domain documents fall into three classes: 1) linguistic methods, 2) statistical methods, and 3) hybrid methods.
Linguistic methods first derive templates from the particular lexical structures in which domain concepts appear in real corpora, and then extract the words that match these templates as domain concepts. Because most of these templates are language-specific, such methods require different processing for each language.
Statistical methods identify domain concepts mainly by the different statistical features that domain concepts and non-domain-specific concepts exhibit in real corpora. Among existing methods for learning Chinese domain concepts, statistical methods are the mainstream. Patent 200510011131.0 proposes a method for process extraction, rule analysis, and reuse based on mature process documents, which can extract the processes described in mature process documents. Chen Wenliang et al. use the Bootstrapping machine learning technique to obtain domain terms automatically from a large-scale corpus without part-of-speech tagging. Zheng Jiaheng et al. proposed computing the weight of candidate words from two factors, position and word frequency, combined with a nonlinear function and the "paired comparisons" method, in order to extract keywords automatically. Cheng Yong, in a doctoral dissertation, gave a statistics-based method for learning domain concepts from HowNet. He Yan et al. gave a statistics-based method for learning computer ontology concepts from a specialized computer dictionary.
Hybrid methods combine linguistic and statistical methods and techniques in the hope of obtaining better learning results. Some methods apply a grammatical filter after statistical processing and extract the word combinations that are statistically significant and match a given lexical template; others first select candidates with linguistic methods and then score these candidates with statistical methods. Du Bo et al. proposed a domain-specific term extraction algorithm that combines rules and statistics. Zhang Xin also studied an ontology concept learning method that combines rules and statistics.
Existing domain concept extraction methods decide whether a word is a domain-specific concept on the basis of a preset threshold. These methods first compute, for each word, a statistic that reflects its degree of domain specificity, and then decide whether the word is a domain concept by checking whether this value exceeds the preset threshold. The higher the specificity, the more likely the word is a domain concept. A higher threshold gives higher precision but lower recall, and vice versa. Precision and recall are thus a pair of conflicting performance measures: raising one generally lowers the other. Moreover, results obtained with a manually set threshold are not sufficiently objective, because they depend on subjective factors such as the domain expert's knowledge structure.
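The trade-off described above can be illustrated with a minimal sketch. The scores, the gold-standard set, and the function name `precision_recall` are hypothetical and are not taken from any of the cited methods; they only show how moving a preset threshold shifts precision against recall.

```python
def precision_recall(scores, gold, threshold):
    """Return (precision, recall) when words scoring above `threshold` are kept."""
    selected = {word for word, score in scores.items() if score > threshold}
    true_positives = len(selected & gold)
    # Convention assumed here: an empty selection counts as precision 1.0.
    precision = true_positives / len(selected) if selected else 1.0
    recall = true_positives / len(gold)
    return precision, recall


# Hypothetical specificity scores and a hypothetical gold-standard concept set.
scores = {"knowledge": 0.9, "ontology": 0.7, "enterprise": 0.5,
          "taxonomy": 0.4, "effect": 0.2}
gold = {"knowledge", "ontology", "taxonomy"}

low = precision_recall(scores, gold, 0.3)   # keeps four words
high = precision_recall(scores, gold, 0.6)  # keeps two words
print(low, high)
```

With the lower threshold every gold concept is found but one general word slips in; with the higher threshold only gold concepts are kept but one of them is missed.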
Summary of the invention
The technical problem to be solved by the invention is to provide a word elimination method for extracting domain ontology concepts that removes the need to set thresholds manually during domain concept extraction.
The invention extracts the set of domain concepts automatically by eliminating the concepts that are not specific to the domain. Given the set of words that appear in a domain corpus, the method automatically obtains the set of domain concepts from a text corpus composed of a foreground corpus (the domain corpus) and a background corpus (the non-domain corpus). The method first calculates the domain relevance of each word from the foreground and background corpora and eliminates the words irrelevant to the domain; it then calculates the domain uniformity of the remaining words from the domain corpus and eliminates the words that are unevenly distributed in the domain corpus, that is, words not yet in stable use in the domain. The result is the set of domain concepts.
The elimination method of the invention deletes the non-domain-specific concepts in two steps to obtain the set of domain concepts. The steps are as follows:
(1) Calculate the domain relevance (DR) between each word and the domain, and delete from the word set the words irrelevant to the domain.
Domain relevance measures whether, and to what degree, a word is related to a domain. The domain relevance of word $t$ with respect to domain $D_k$ is calculated as:
$$DR_{t,k} = \lg\left(\frac{P(t \mid Cf_k)}{P(t \mid Cb_k)}\right) \times \lg(TF_{t,k})$$
where $P(t \mid Cf_k)$ and $P(t \mid Cb_k)$ are the probabilities that $t$ appears in the foreground corpus $Cf_k$ and in the background corpus $Cb_k$, respectively. In actual computation they are estimated as:
$$E(P(t \mid Cf_k)) = \frac{TF_{t,k}}{mf_k}$$
$$E(P(t \mid Cb_k)) = \frac{\sum_{Cf_l \in Cb_k} TF_{t,l}}{mb_k}$$
$$TF_{t,i} = \sum_{c_j \in Cf_i} tf_{t,j}$$
where $TF_{t,i}$ is the frequency with which word $t$ appears in the foreground corpus $Cf_i$, $mf_i$ is the number of documents in $Cf_i$, $mb_k$ is the number of documents in the background corpus $Cb_k$, and $tf_{t,j}$ is the number of times $t$ appears in document $c_j$.
The DR formula consists of two parts. 1) The factor $\lg\left(\frac{P(t \mid Cf_k)}{P(t \mid Cb_k)}\right)$ indicates the direction of the relation: when the probability that a word appears in the foreground corpus (the domain corpus) is higher than in the background corpus (the non-domain corpus), the word is said to be positively correlated with the domain; otherwise it is irrelevant to the domain. Irrelevant words are not taken as domain concepts. 2) The factor $\lg(TF_{t,k})$ gives words with a high frequency a high DR value, that is, a high degree of relevance to the domain.
Consider general words such as "effect" and "enterprise". Although they occur frequently in the foreground corpus and are evenly distributed there, they are also evenly distributed in the background corpus, so their DR value is negative or zero in most domains, which indicates that they are irrelevant to the domain.
Repeated experiments show that, when the set of words contained in the foreground corpus is used as input, the DR step automatically deletes 40% to 50% of the words in the set.
The time complexity of computing the DR values of all words is $O(n' + mf_k + mb_k)$, where $n'$ is the number of words extracted from the domain corpus, and $mf_k$ and $mb_k$ are the numbers of documents in the foreground corpus and the background corpus, respectively.
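As an illustration of step (1), the following sketch computes DR values from tokenized documents. It is a minimal sketch under stated assumptions, not the patented implementation: `lg` is read as the base-10 logarithm, documents are assumed to be already segmented into word lists, both corpora are assumed non-empty, and a word absent from the background corpus is given an infinite log-ratio (the source does not specify this case). The function name `domain_relevance` and the toy documents are hypothetical.

```python
import math
from collections import Counter


def domain_relevance(fg_docs, bg_docs):
    """Return {word: DR} for every word in the foreground documents."""
    mf, mb = len(fg_docs), len(bg_docs)
    tf_fg = Counter(w for doc in fg_docs for w in doc)  # TF in the foreground corpus
    tf_bg = Counter(w for doc in bg_docs for w in doc)  # summed TF over the background
    dr = {}
    for word, tf in tf_fg.items():
        p_fg = tf / mf
        p_bg = tf_bg[word] / mb
        # Assumption: a word never seen in the background gets an infinite ratio term.
        ratio_term = math.inf if p_bg == 0 else math.log10(p_fg / p_bg)
        freq_term = math.log10(tf)
        # lg(1) = 0, so a word seen once has DR = 0; this also avoids inf * 0 = nan.
        dr[word] = 0.0 if freq_term == 0 else ratio_term * freq_term
    return dr


fg_docs = [["knowledge", "management", "enterprise"],
           ["knowledge", "enterprise"],
           ["knowledge", "sharing"]]
bg_docs = [["enterprise", "coal"],
           ["enterprise", "market"],
           ["enterprise", "enterprise", "knowledge"]]

dr = domain_relevance(fg_docs, bg_docs)
kept = {word for word, value in dr.items() if value > 0}  # elimination rule: DR > 0
print(kept)
```

No threshold is tuned here: the only decision is the sign of DR, which is what the elimination step relies on.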
(2) Calculate the domain uniformity (DC) of each word in the domain, and delete the words that are not yet in stable use in the domain.
Domain uniformity reflects how evenly a word that is positively correlated with the domain (DR > 0) is distributed over the texts of the domain corpus. The domain uniformity of word $t$ in domain $D_k$ is calculated as:
$$DC_{t,k} = \sum_{c_j \in Cf_k} \left( P(t \mid c_j) \times \lg \frac{1}{P(t \mid c_j)} \right)$$
where $P(t \mid c_j)$ is the probability that $t$ appears in document $c_j$, and $c_j$ is a document in the foreground corpus $Cf_k$. In actual computation, the invention estimates $P(t \mid c_j)$ as:
$$E(P(t \mid c_j)) = \frac{tf_{t,j}}{TF_{t,k}}$$
where $tf_{t,j}$ is the frequency with which word $t$ appears in the $j$-th text of the domain foreground corpus $Cf_k$.
The definition of DC is similar to that of information entropy. DC indicates whether a word is evenly distributed in $Cf_k$. The higher the DC value, the more evenly the word is distributed in the domain corpus, that is, the more domain documents of $Cf_k$ it appears in, and the more likely it is to be a domain concept. A DC value of 0 means that the word appears in only one domain document of the foreground corpus; such a word is automatically excluded from the set of domain concepts.
For example, when learning the concepts of the knowledge management domain, the foreground corpus contains one document on coal-enterprise information management, and "coal enterprise" does not appear in the other domain corpora that make up the background corpus, so the DR value of "coal enterprise" is positive. However, because its DC value is 0, it is not wrongly admitted into the concept set of the knowledge management domain.
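The zero value follows directly from the DC formula and its estimate. As a short worked case (the index $c_1$ for the single document and the count $n$ are notation introduced here for illustration): if word $t$ occurs only in document $c_1$, then $tf_{t,1} = TF_{t,k}$ and the sum has a single term that vanishes, whereas a word spread evenly over $n$ documents reaches $\lg n$.

```latex
% Word confined to one foreground document c_1:
E(P(t \mid c_1)) = \frac{tf_{t,1}}{TF_{t,k}} = 1
\quad\Longrightarrow\quad
DC_{t,k} = 1 \times \lg\frac{1}{1} = 0

% Word spread evenly over n foreground documents:
E(P(t \mid c_j)) = \frac{1}{n}
\quad\Longrightarrow\quad
DC_{t,k} = \sum_{j=1}^{n} \frac{1}{n} \times \lg n = \lg n
```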
Repeated experiments show that, when the set of words contained in the foreground corpus is used as input, the DC step automatically deletes 20% to 30% of the words in the set.
The time complexity of computing the DC values of all words is $O(n'' \times mf_k)$, where $n''$ is the number of words remaining in the word set after screening by the DR step.
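Step (2) can be sketched in the same style. This is a minimal illustration under assumptions, not the patented implementation: `lg` is again read as the base-10 logarithm, documents are pre-tokenized word lists, and the function name `domain_uniformity` and the toy documents are hypothetical. Documents in which the word does not occur contribute nothing to the sum, following the usual entropy convention.

```python
import math
from collections import Counter


def domain_uniformity(fg_docs, words):
    """Return {word: DC} for the given words over the foreground documents."""
    per_doc = [Counter(doc) for doc in fg_docs]
    dc = {}
    for word in words:
        counts = [c[word] for c in per_doc if c[word] > 0]  # tf per document
        total = sum(counts)                                  # TF in the foreground corpus
        # P = tf / TF, so each term is P * lg(1 / P) = (tf / TF) * lg(TF / tf).
        dc[word] = sum((n / total) * math.log10(total / n) for n in counts)
    return dc


fg_docs = [["knowledge", "management"],
           ["knowledge", "coal", "coal"],
           ["management", "sharing"]]

dc = domain_uniformity(fg_docs, ["knowledge", "coal"])
stable = {word for word, value in dc.items() if value > 0}  # elimination rule: DC > 0
print(dc, stable)
```

A word confined to one document ("coal" here) gets DC = 0 no matter how often it occurs there, which is exactly the case the elimination step removes.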
The effect and benefit of the invention is that it solves the practical problems and difficulties caused by the need to set thresholds manually during domain concept extraction. Domain concepts represent the themes of a domain, and their significance is that they can: 1) form the basis of a domain ontology; 2) standardize domain terms, which facilitates communication within the domain and international exchange among scholars; 3) assist text mining and knowledge discovery tasks such as domain document representation, text clustering, and text retrieval. A good method for extracting domain ontology concepts can raise the degree of automation and the performance of these tasks. Based on machine learning techniques, the invention designs and implements a domain-independent method for the intelligent acquisition of domain concepts.
The invention reduces the tedium of extracting domain ontology concepts. It uses a computer and machine learning techniques to support the extraction of domain concepts and deletes the non-domain-specific words automatically, which reduces the workload of domain experts.
Another benefit of the invention is that it reduces the disagreement caused by subjective factors, such as the knowledge structure of domain experts, during domain concept extraction. The invention uses statistical methods to quantify the degree to which a word belongs to a specific domain, and the quantified result reduces the disputes caused by subjective factors.
Description of drawings
The accompanying drawing is a block diagram of the structure of the invention.
Detailed description
The specific embodiments of the invention are described in detail below with reference to the technical scheme and the attached tables.
Embodiment 1
As shown in the accompanying drawing:
1) Corpus. The method uses a foreground corpus and a background corpus to obtain domain concepts. The foreground corpus is a library of domain documents rich in domain concepts and should generally consist of a number of standardized domain texts. The background corpus is a library of electronic documents used for contrast with the foreground corpus, in order to highlight the different statistical features that domain concepts exhibit in domain documents and in non-domain documents; it consists of domain documents from three or more different domains.
Corpus $C$ consists of the foreground corpora of $m$ ($m \ge 3$) domains. When learning the domain-specific concepts of domain $D_k$, the foreground corpus is $Cf_k$, and the background corpus $Cb_k$ consists of the foreground corpora $Cf_l$ ($1 \le l \le m$, $l \ne k$) of the other $m-1$ domains in the corpus. The foreground corpus (the domain corpus) $Cf_k$ is required to contain all the domain-specific concepts of $D_k$ and to reflect the actual usage of the concept words.
2) Word set: the set of the words for domain-specific concepts in the foreground corpus (the domain corpus) together with the other general words.
3) Domain ontology concept extraction module: deletes the non-domain-specific concepts by calculating domain relevance and domain uniformity, and outputs the set of domain concepts.
4) Domain ontology concept set: the set of domain-specific concepts.
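The corpus arrangement in item 1) can be sketched as follows: for a target domain, its own documents form the foreground corpus and the documents of all other domains are pooled into the background corpus. This is an illustrative sketch only; the function name `split_corpus`, the dictionary layout, and the toy domains are assumptions, and the check mirrors the stated requirement of at least three domains.

```python
def split_corpus(corpus, target):
    """Split {domain: [documents]} into (foreground, background) for `target`."""
    if len(corpus) < 3:
        raise ValueError("the corpus should cover at least three domains (m >= 3)")
    foreground = list(corpus[target])
    # Background = foreground corpora of the other m - 1 domains, pooled together.
    background = [doc for name, docs in corpus.items()
                  if name != target for doc in docs]
    return foreground, background


corpus = {
    "knowledge_management": [["knowledge", "sharing"], ["knowledge", "enterprise"]],
    "finance": [["enterprise", "market"]],
    "energy": [["coal", "enterprise"], ["coal", "policy"]],
}

fg, bg = split_corpus(corpus, "knowledge_management")
print(len(fg), len(bg))
```

Because every domain's documents serve as foreground for that domain and as background for the others, the same corpus can be reused to learn the concepts of each of the $m$ domains in turn.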
In practice, to improve the accuracy of the resulting concept set, a manual revision step can be added after the set of domain concepts has been extracted automatically. Manual revision is the process in which experts modify the automatically extracted set of domain concepts by hand. In this process, domain experts delete the items in the extraction result that are not domain-specific concepts and add the domain concepts that are not contained in the domain corpus.
This embodiment uses the elimination method of the invention to extract the set of domain ontology concepts of the knowledge management domain. The foreground corpus consists of abstracts of project proposals in the knowledge management domain: 317 texts with 80,000 Chinese characters in total. The background corpus covers 75 domains and contains 37,443 project proposal abstracts with about 10,000,000 Chinese characters.
In this embodiment the foreground corpus contains 4431 words, so the word set fed into the domain concept extraction module contains 4431 words in total. After automatic extraction, 2059 words are found to be irrelevant to the knowledge management domain (DR ≤ 0) and are deleted first. In total, 2597 words are either irrelevant to the domain or appear in only one domain document (DR ≤ 0 or DC = 0) and are deleted. The remaining 1834 words are taken as the domain ontology concepts.
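The counts reported above are consistent (4431 - 2597 = 1834). The two-step elimination that produces such a result can be sketched end to end as below. This is a hedged illustration, not the embodiment's code: the function name `extract_domain_concepts` and the toy documents are hypothetical, documents are assumed pre-tokenized, and a word absent from the background corpus is treated as positively correlated. Since $\lg(TF) \ge 0$ for any word that occurs, DR > 0 holds exactly when the foreground probability exceeds the background probability and TF > 1, so the sketch tests those two conditions directly.

```python
import math
from collections import Counter


def extract_domain_concepts(fg_docs, bg_docs):
    """Two-step elimination: keep words with DR > 0, then those with DC > 0."""
    mf, mb = len(fg_docs), len(bg_docs)
    tf_fg = Counter(w for doc in fg_docs for w in doc)
    tf_bg = Counter(w for doc in bg_docs for w in doc)
    per_doc = [Counter(doc) for doc in fg_docs]

    # Step 1: DR > 0 iff lg(P_fg / P_bg) > 0 and lg(TF) > 0.
    relevant = {w for w, tf in tf_fg.items()
                if tf > 1 and tf / mf > tf_bg[w] / mb}

    # Step 2: DC > 0, i.e. the word is spread over more than one document.
    concepts = set()
    for w in relevant:
        total = tf_fg[w]
        dc = sum((c[w] / total) * math.log10(total / c[w])
                 for c in per_doc if c[w] > 0)
        if dc > 0:
            concepts.add(w)
    return concepts


fg_docs = [["knowledge", "management", "enterprise"],
           ["knowledge", "management", "enterprise"],
           ["knowledge", "coal", "coal"]]
bg_docs = [["enterprise", "market"],
           ["enterprise", "enterprise", "policy"],
           ["enterprise", "trade"]]

concepts = extract_domain_concepts(fg_docs, bg_docs)
print(sorted(concepts))
```

In this toy run "enterprise" is removed in step 1 because it is more frequent in the background, and "coal" is removed in step 2 because it occurs in a single foreground document.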
Table 1 and Table 2 show, respectively, the 10 words with the largest DR values and the 10 words with the largest DC values in the knowledge management domain. The words in each table are sorted in descending order of the column whose header carries the "↓" symbol.
Table 1. The 10 words with the largest DR values in the knowledge management domain
(Table 1 appears as image BSA00000296501200071 in the original publication.)
Table 2. The 10 words with the largest DC values in the knowledge management domain
(Table 2 appears as image BSA00000296501200081 in the original publication.)
Notes on Table 1 and Table 2:
1) To keep the data concise without affecting the presentation of the results, the values in the DR and DC columns of the two tables are rounded.
2) The words in Table 1 are highly relevant to the knowledge management domain: they occur frequently in the knowledge management corpus, far more often than in the background corpus. A few of them, however, such as "Knowledge Conversion", are poorly distributed across documents.
3) The words in Table 2 are very evenly distributed in the foreground corpus and appear in almost every text of it. However, the DR values of words such as "enterprise" and "management" are low. After both steps are applied these words are excluded, and only three words from Table 2 remain: "knowledge", "information management", and "information management theory".

Claims (1)

1. A word elimination method for extracting domain ontology concepts, characterized by comprising the following steps:
(1) calculating the domain relevance between each word and the domain, and deleting from the word set the words irrelevant to the domain;
the domain relevance of word $t$ with respect to domain $D_k$ being calculated as:
$$DR_{t,k} = \lg\left(\frac{P(t \mid Cf_k)}{P(t \mid Cb_k)}\right) \times \lg(TF_{t,k})$$
where $P(t \mid Cf_k)$ and $P(t \mid Cb_k)$ are the probabilities that $t$ appears in the foreground corpus $Cf_k$ and in the background corpus $Cb_k$, respectively; in actual computation they are estimated as:
$$E(P(t \mid Cf_k)) = \frac{TF_{t,k}}{mf_k}$$
$$E(P(t \mid Cb_k)) = \frac{\sum_{Cf_l \in Cb_k} TF_{t,l}}{mb_k}$$
$$TF_{t,i} = \sum_{c_j \in Cf_i} tf_{t,j}$$
where $TF_{t,i}$ is the frequency with which word $t$ appears in the foreground corpus $Cf_i$, $mf_i$ is the number of documents in $Cf_i$, $mb_k$ is the number of documents in the background corpus $Cb_k$, and $tf_{t,j}$ is the number of times $t$ appears in document $c_j$;
(2) calculating the domain uniformity of each word in the domain, and deleting the words that are not yet in stable use in the domain;
domain uniformity reflecting how evenly a word that is positively correlated with the domain (DR > 0) is distributed over the texts of the domain corpus; the domain uniformity of word $t$ in domain $D_k$ being calculated as:
$$DC_{t,k} = \sum_{c_j \in Cf_k} \left( P(t \mid c_j) \times \lg \frac{1}{P(t \mid c_j)} \right)$$
where $P(t \mid c_j)$ is the probability that $t$ appears in document $c_j$, and $c_j$ is a document in the foreground corpus $Cf_k$; in actual computation, the invention estimates $P(t \mid c_j)$ as:
$$E(P(t \mid c_j)) = \frac{tf_{t,j}}{TF_{t,k}}$$
where $tf_{t,j}$ is the frequency with which word $t$ appears in the $j$-th text of the domain foreground corpus $Cf_k$.
CN 201010502040 2010-09-30 2010-09-30 Word elimination process for extracting domain ontology concept Pending CN101963989A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010502040 CN101963989A (en) 2010-09-30 2010-09-30 Word elimination process for extracting domain ontology concept

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010502040 CN101963989A (en) 2010-09-30 2010-09-30 Word elimination process for extracting domain ontology concept

Publications (1)

Publication Number Publication Date
CN101963989A true CN101963989A (en) 2011-02-02

Family

ID=43516862

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010502040 Pending CN101963989A (en) 2010-09-30 2010-09-30 Word elimination process for extracting domain ontology concept

Country Status (1)

Country Link
CN (1) CN101963989A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN104376024A (en) * 2013-08-16 2015-02-25 交通运输部科学研究院 Document similarity detecting method based on seed words
CN104750777A (en) * 2014-12-31 2015-07-01 东软集团股份有限公司 Text labeling method and system
CN105260375A (en) * 2015-08-05 2016-01-20 北京工业大学 Event ontology learning method
CN105677640A (en) * 2016-01-08 2016-06-15 中国科学院计算技术研究所 Domain concept extraction method for open texts
CN106610941A (en) * 2016-08-11 2017-05-03 四川用联信息技术有限公司 Improved concept semantic similarity calculation method based on information theory
CN111985211A (en) * 2020-09-01 2020-11-24 中国民航科学技术研究院 Ontology concept obtaining method and device in civil aviation safety field and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
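Journal of the China Society for Scientific and Technical Information (《情报学报》) 20090630 Yu Juan (于娟) et al. Research on methods for extracting domain feature words (领域特征词的提取方法研究) 368-373 1 Vol. 28, No. 3 2 *
Computer Science (《计算机科学》) 20081231 Yu Juan (于娟) et al. A survey of research on ontology integration (本体集成研究综述) 9-13,18 1 Vol. 35, No. 7 2 *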

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN102169495B (en) * 2011-04-11 2014-04-02 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN104376024A (en) * 2013-08-16 2015-02-25 交通运输部科学研究院 Document similarity detecting method based on seed words
CN104376024B (en) * 2013-08-16 2017-12-15 交通运输部科学研究院 A kind of document similarity detection method based on seed words
CN104750777A (en) * 2014-12-31 2015-07-01 东软集团股份有限公司 Text labeling method and system
CN104750777B (en) * 2014-12-31 2018-04-06 东软集团股份有限公司 text marking method and system
CN105260375A (en) * 2015-08-05 2016-01-20 北京工业大学 Event ontology learning method
CN105260375B (en) * 2015-08-05 2019-04-12 北京工业大学 Event ontology learning method
CN105677640A (en) * 2016-01-08 2016-06-15 中国科学院计算技术研究所 Domain concept extraction method for open texts
CN106610941A (en) * 2016-08-11 2017-05-03 四川用联信息技术有限公司 Improved concept semantic similarity calculation method based on information theory
CN111985211A (en) * 2020-09-01 2020-11-24 中国民航科学技术研究院 Ontology concept obtaining method and device in civil aviation safety field and storage medium

Similar Documents

Publication Publication Date Title
CN101963989A (en) Word elimination process for extracting domain ontology concept
CN109933657B (en) Topic mining emotion analysis method based on user feature optimization
CN107329995B (en) A kind of controlled answer generation method of semanteme, apparatus and system
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN105653590A (en) Name duplication disambiguation method of Chinese literature authors
CN106446148A (en) Cluster-based text duplicate checking method
CN109800310A (en) A kind of electric power O&M text analyzing method based on structuring expression
CN109509557B (en) Chinese electronic medical record information extraction preprocessing method based on big data platform
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN110059923A (en) Matching process, device, equipment and the storage medium of post portrait and biographic information
CN109308317A (en) A kind of hot spot word extracting method of the non-structured text based on cluster
CN105930509A (en) Method and system for automatic extraction and refinement of domain concept based on statistics and template matching
CN108228701A (en) A kind of system for realizing Chinese near-nature forest language inquiry interface
CN103092966A (en) Vocabulary mining method and device
CN109947934A (en) For the data digging method and system of short text
CN110457711A (en) A kind of social media event topic recognition methods based on descriptor
CN108363691A (en) A kind of field term identifying system and method for 95598 work order of electric power
CN110929520A (en) Non-named entity object extraction method and device, electronic equipment and storage medium
CN111090734A (en) Method and system for optimizing machine reading understanding capability based on hierarchical attention mechanism
Chekima et al. An automatic construction of malay stop words based on aggregation method
CN115600602B (en) Method, system and terminal device for extracting key elements of long text
CN106776724A (en) A kind of exercise question sorting technique and system
Stańczak et al. Grammatical Gender's Influence on Distributional Semantics: A Causal Perspective
CN116737887A (en) Intelligent question-answering method based on service item elements
Sun Efficient text feature extraction by integrating the average linkage and K-medoids clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20110202