CN103218362A - Method and system for constructing domain ontology - Google Patents

Publication number
CN103218362A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100177727A
Other languages
Chinese (zh)
Other versions
CN103218362B (en)
Inventor
董振江
吉锋
罗圣美
程龚
瞿裕忠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
ZTE Corp
Original Assignee
Nanjing University
ZTE Corp
Application filed by Nanjing University and ZTE Corp
Priority to CN201210017772.7A
Publication of CN103218362A
Application granted
Publication of CN103218362B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for constructing a domain ontology, comprising the following steps: enumerating the names of all terms that the target ontology needs to describe, forming a keyword set W0; sorting all keywords in W0 to form a keyword sequence S0; creating a set O of ontologies to be reused, submitting all keywords in each continuous subsequence extracted from S0 to an ontology retrieval system, and adding the highest-ranked ontology in the retrieval result to O; and applying the set union operation to all ontologies in O to form a new ontology o. The invention also provides a system for constructing a domain ontology. The technical scheme provides a well-defined and operable method for constructing keyword queries for ontology retrieval, and can achieve a higher ontology reuse rate.

Description

A method and system for constructing a domain ontology
Technical field
The present invention relates to the fields of information system modeling and knowledge engineering, and in particular to a method and system for constructing a domain ontology based on ontology reuse.
Background
Tom Gruber defines an ontology as an explicit specification of a shared conceptualization. A conceptualization is a model of the abstract concepts, concrete objects, object properties and object relations in a domain or scope; an ontology expresses a conceptualization explicitly as a specification so that multiple parties can share it. In an ontology, these concepts, relations and so on are collectively called terms; an ontology can be regarded as a set of term descriptions called axioms. Nicola Guarino divides ontologies into top-level ontologies, domain ontologies, task ontologies and application ontologies. A top-level ontology describes general concepts (such as space and time); domain and task ontologies describe, respectively, a generic domain (such as SLR cameras) and a generic task (such as selling cameras); an application ontology describes the specific scope of a particular application (such as a particular website that sells SLR cameras). Top-level ontologies are usually fairly stable, and application ontologies have little value for sharing, so the construction of domain and task ontologies is the most active, and their construction methods matter most.
Existing methods for constructing a domain ontology fall into two classes: manual construction and semi-automatic construction. Manual construction is represented by the ontology description capture method IDEF5 (Integrated Definition for Ontology Description Capture Method), which divides ontology construction into five steps: establishing goals and a team, collecting raw data, analyzing the data, building an initial ontology, and refining and validating the ontology. Every step is carried out by hand. Semi-automatic construction, also called ontology learning, uses a computer program to automatically extract from text the terms that denote concepts, relations between concepts and so on, forming a preliminary ontology that people then refine and validate by hand. However, the preliminary ontologies that programs currently build are usually of poor quality and do not effectively reduce the dependence on manual work, so manual construction remains the mainstream approach.
When constructing a domain ontology manually, one way to improve efficiency is to reuse existing ontologies: an existing ontology from the same or a nearby domain is adapted to the new requirements to become a new ontology, which saves cost compared with developing from scratch. However, means of finding ontologies suitable for reuse among the large number of existing ontologies are scarce. One main approach today is to browse online ontology libraries one by one (such as the ontology library of the DARPA Agent Markup Language (DAML) project, DARPA being the U.S. Defense Advanced Research Projects Agency), which is inefficient. An emerging alternative is ontology retrieval: query keywords are submitted to an ontology retrieval system (such as the Swoogle search engine), and only the ontologies that match the query keywords are retrieved and browsed, which improves efficiency. However, no well-defined method has yet been established to guide this retrieval process, in particular the construction of queries. Another way to speed up manual construction of a domain ontology is collaborative construction by several people; the difficulty of this approach lies in detecting and resolving conflicts among the results built by different people.
Although a domain ontology, as a concept-level model, is independent of natural language, when it is intended for people its terms still need to be named with natural-language words so that people can understand them. Term names are therefore an important part of a domain ontology. Because natural language is diverse, one term may correspond to several synonymous natural-language words (such as "SLR camera" and "single-lens reflex camera"), so an important step in constructing a domain ontology is to obtain the synonyms of each term name as completely as possible.
Existing synonym acquisition methods mainly rely on synonym dictionaries built by linguists (such as WordNet). Although synonym dictionaries are highly precise, their coverage is limited, and few machine-readable synonym dictionaries are currently available; Chinese ones are fewer still. Obtaining synonyms for Chinese term names during domain ontology construction is therefore very difficult and usually relies on the experience of the builder (that is, the domain expert), so quality is hard to guarantee, especially recall (that is, completeness).
Another synonym acquisition method uses the collective intelligence of the public by exploiting the user query logs of a search engine. Its basic idea is that if two keywords often appear in user queries and users often open the same web pages in their respective query results, the two keywords are considered synonyms. The main shortcoming of this method is the low precision (that is, accuracy) of the synonyms obtained. The reason is that a web page may cover several different topics that correspond to several keywords with no synonymy between them; therefore, even if users open the same web page from different query keywords, this does not show that those keywords are necessarily synonyms.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a method and system for constructing a domain ontology, offering a well-defined and operable method for constructing keyword queries for ontology retrieval that can achieve a higher ontology reuse rate.
To achieve the above purpose, the technical scheme of the present invention is realized as follows:
The invention provides a method for constructing a domain ontology, comprising:
enumerating the names of all terms that the target ontology needs to describe, forming a keyword set W0;
sorting all keywords in the keyword set W0 to form a keyword sequence S0;
creating a set O of ontologies to be reused, submitting all keywords in each continuous subsequence extracted from the keyword sequence S0 to an ontology retrieval system, and adding the highest-ranked ontology in the retrieval result to the ontology set O;
applying the set union operation to all ontologies in the ontology set O to form a new ontology o.
In the above method, the method further comprises: naming the terms described in the new ontology o, and acquiring synonyms according to the names of the terms described in the new ontology o.
In the above method, enumerating the names of all terms that the target ontology needs to describe to form the keyword set W0 is:
for the target domain described by the target ontology, using keywords in a natural language L_s to enumerate the names of all terms that the target ontology needs to describe, forming a keyword set W0.
In the above method, sorting all keywords in the keyword set W0 to form the keyword sequence S0 is:
building a tree in which each node has a label and a processing flag;
determining whether the processing flags of all nodes in the tree are "processed"; if not, choosing a current node from the nodes whose processing flag is "unprocessed", the keyword set that serves as the label of the current node being the current set;
determining whether the current set contains only one keyword; when the current set contains more than one keyword, dividing the current set into two subsets, adding the more important subset W_l to the tree as the left child of the current node, adding the other subset W_r to the tree as the right child of the current node, and changing the processing flag of the current node to "processed"; otherwise, changing the processing flag of the current node to "processed"; then again determining whether the processing flags of all nodes in the tree are "processed", until they all are, at which point the keyword sequence S0 is formed according to the depth-first traversal order of the nodes corresponding to the keywords in W0.
In the above method, dividing the current set into two subsets is:
treating the keywords in the current set as a description of a domain or scope, and treating the keywords in the two subsets as descriptions of two different subdomains or subranges of that domain or scope.
In the above method, submitting all keywords in each continuous subsequence extracted from the keyword sequence S0 to the ontology retrieval system and adding the highest-ranked ontology in the retrieval result to the ontology set O is:
creating the set O of ontologies to be reused, denoting the keyword sequence S0 as S, obtaining the longest subsequence S_h among the continuous prefix subsequences of S that satisfy a condition, and cutting S_h off the front of S to obtain the remaining continuous suffix subsequence S_t;
determining whether S_h is an empty sequence; if S_h is empty, deleting the first keyword from S_t; if S_h is not empty, adding the highest-ranked ontology in the retrieval result HITS(S_h) to O;
determining whether S_t is an empty sequence; if S_t is not empty, denoting S_t as S and again obtaining the longest subsequence S_h among the continuous prefix subsequences of S that satisfy the condition and cutting S_h off the front of S to obtain the remaining continuous suffix subsequence S_t; otherwise, if S_t is empty, the flow ends.
In the above method, the condition is that, when all keywords in the subsequence are combined into one query keyword group and the group is submitted to the ontology retrieval system, the retrieval result HITS(S_h) is not empty.
In the above method, applying the set union operation to all ontologies in the ontology set O to form the new ontology o is:
applying the set union operation to all ontologies in the ontology set O to form a new ontology o, and editing the new ontology o according to the needs of describing the target domain;
the editing comprises at least adding terms and axioms, deleting terms and axioms, and modifying terms and axioms.
In the above method, naming the terms described in the new ontology o is: naming each term described in the new ontology o with a word in L_s.
In the above method, acquiring synonyms according to the names of the terms described in the new ontology o is:
for the name t of each term described in the new ontology o, creating three keyword sets SYN, TRANS and TS;
submitting t to a translation system from L_s to another natural language L_t, and adding all keywords in the translation result to the set TRANS;
for each keyword trans in the set TRANS, obtaining all synonyms of trans from a synonym dictionary of L_t, and adding all the synonyms obtained to the set TS;
adding all keywords in the set TS to the set TRANS, and, for each keyword trans' in the set TRANS, submitting trans' to a translation system from L_t to L_s and adding all keywords in the translation result to the set SYN;
deleting from the set SYN all keywords that are not suitable as synonyms of t; the keywords remaining in SYN are the acquired synonyms of t.
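The synonym acquisition steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `acquire_synonyms`, its callable parameters, and the toy lookup tables are all hypothetical stand-ins for a real translation system, a real synonym dictionary, and the manual review that removes unsuitable candidates. The forward translations and dictionary entries mirror the "camera lens" example given in the detailed description; the back-translations are invented for illustration.

```python
from typing import Callable, Iterable, Set


def acquire_synonyms(
    t: str,
    translate_s_to_t: Callable[[str], Iterable[str]],
    synonyms_in_t: Callable[[str], Iterable[str]],
    translate_t_to_s: Callable[[str], Iterable[str]],
    is_suitable: Callable[[str, str], bool],
) -> Set[str]:
    """Pivot through language L_t to collect synonym candidates for t."""
    syn: Set[str] = set()
    trans: Set[str] = set(translate_s_to_t(t))    # translate t from L_s to L_t
    ts: Set[str] = set()
    for word in trans:                            # look up synonyms in L_t
        ts.update(synonyms_in_t(word))
    trans |= ts                                   # add TS to TRANS
    for word in trans:                            # translate back to L_s
        syn.update(translate_t_to_s(word))
    return {w for w in syn if is_suitable(t, w)}  # drop unsuitable candidates


# Toy lookup tables (hypothetical; "src:" marks words of the source language).
FORWARD = {"src:lens": ["shot", "camera lens", "camera shot"]}
THESAURUS = {"shot": ["guess", "snap"], "camera lens": ["optical lens"]}
BACKWARD = {
    "shot": ["src:lens", "src:injection"],
    "camera lens": ["src:lens"],
    "optical lens": ["src:optical-lens"],
    "guess": ["src:guess"],
}
APPROVED = {"src:optical-lens"}  # stands in for the manual review step

result = acquire_synonyms(
    "src:lens",
    lambda w: FORWARD.get(w, []),
    lambda w: THESAURUS.get(w, []),
    lambda w: BACKWARD.get(w, []),
    lambda t, w: w in APPROVED,
)
```

Passing the translators and the dictionary as callables keeps the sketch independent of any particular service; the final filter is where wrong senses introduced by the pivot language (here "src:guess" and "src:injection") are discarded.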
The present invention also provides a system for constructing a domain ontology, comprising an enumeration unit, a sorting unit, an adding unit and an operation processing unit, wherein:
the enumeration unit is configured to enumerate the names of all terms that the target ontology needs to describe, forming a keyword set W0;
the sorting unit is configured to sort all keywords in the keyword set W0 to form a keyword sequence S0;
the adding unit is configured to create a set O of ontologies to be reused, submit all keywords in each continuous subsequence extracted from the keyword sequence S0 to an ontology retrieval system, and add the highest-ranked ontology in the retrieval result to the ontology set O;
the operation processing unit is configured to apply the set union operation to all ontologies in the ontology set O to form a new ontology o.
The above system further comprises:
a naming unit, configured to name the terms described in the new ontology o;
an acquiring unit, configured to acquire synonyms according to the names of the terms described in the new ontology o.
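One way to picture how the four core units fit together is as a pipeline of pluggable functions. The Python sketch below is only an illustration: the class name `ConstructionSystem`, the field names, and the placeholder lambdas are hypothetical, and an ontology is simplified to a frozen set of axiom strings.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set

Ontology = FrozenSet[str]  # an ontology viewed as a set of axioms (strings here)


@dataclass
class ConstructionSystem:
    enumeration_unit: Callable[[], Set[str]]               # produces W0
    sorting_unit: Callable[[Set[str]], List[str]]          # W0 -> S0
    adding_unit: Callable[[List[str]], List[Ontology]]     # S0 -> O
    operation_unit: Callable[[List[Ontology]], Ontology]   # O -> o

    def build(self) -> Ontology:
        w0 = self.enumeration_unit()
        s0 = self.sorting_unit(w0)
        reuse_set = self.adding_unit(s0)
        return self.operation_unit(reuse_set)


# Placeholder units, for illustration only.
system = ConstructionSystem(
    enumeration_unit=lambda: {"pixel", "sensor"},
    sorting_unit=lambda w0: sorted(w0),
    adding_unit=lambda s0: [frozenset({f"axiom about {k}"}) for k in s0],
    operation_unit=lambda onts: frozenset().union(*onts),
)
new_ontology = system.build()
```

The naming and acquiring units would operate on the returned ontology and are left out of this sketch.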
With the method and system for constructing a domain ontology provided by the invention, the names of all terms that the target ontology needs to describe are enumerated to form a keyword set W0; all keywords in W0 are sorted to form a keyword sequence S0; a set O of ontologies to be reused is created, all keywords in each continuous subsequence extracted from S0 are submitted to an ontology retrieval system, and the highest-ranked ontology in the retrieval result is added to O; and the set union operation is applied to all ontologies in O to form a new ontology o. The invention thus provides a method for constructing keyword queries for ontology retrieval that covers more, and more important, keywords while retrieving fewer ontologies; the method is well defined and operable and can achieve a higher ontology reuse rate. In addition, based on this method, the invention can name the terms described in the new ontology o and acquire synonyms according to those names. It thus also provides a synonym acquisition method that relies on natural-language synonym dictionaries, is widely applicable, and can achieve higher precision and recall.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the method for constructing a domain ontology according to the present invention;
Fig. 2 is a schematic flowchart of a specific method for implementing step 102;
Fig. 3 is a schematic flowchart of Embodiment 1 of the specific method for implementing step 102;
Fig. 4 is an example diagram of the binary tree data structure used in the present invention;
Fig. 5 is a schematic flowchart of a specific method for implementing step 103;
Fig. 6 is a schematic flowchart of Embodiment 1 of the specific method for implementing step 103;
Fig. 7 is a schematic flowchart of a specific method for implementing step 106;
Fig. 8 is a schematic structural diagram of the system for constructing a domain ontology according to the present invention.
Detailed description
The basic idea of the present invention is: enumerate the names of all terms that the target ontology needs to describe, forming a keyword set W0; sort all keywords in W0 to form a keyword sequence S0; create a set O of ontologies to be reused, submit all keywords in each continuous subsequence extracted from S0 to an ontology retrieval system, and add the highest-ranked ontology in the retrieval result to O; apply the set union operation to all ontologies in O to form a new ontology o.
The present invention is described in further detail below with reference to the drawings and specific embodiments.
The invention provides a method for constructing a domain ontology. Fig. 1 is a schematic flowchart of the method; as shown in Fig. 1, the method comprises the following steps:
Step 101: enumerate the names of all terms that the target ontology needs to describe, forming a keyword set W0.
Specifically, for the domain (called the target domain, for example the SLR camera domain) described by the ontology to be built (called the target ontology), use keywords in a natural language L_s to enumerate the names of all terms that the target ontology needs to describe, forming a keyword set W0. For example, L_s = Chinese and W0 = {"camera lens", "pixel", "aperture", "focal length", "sensor"} (the Chinese keywords are shown here in English translation).
Step 102: sort all keywords in the keyword set W0 to form a keyword sequence S0.
Step 103: create a set O of ontologies to be reused, extract continuous subsequences from the keyword sequence S0, submit all keywords in each subsequence to the ontology retrieval system, and add the highest-ranked ontology in the retrieval result to O.
Step 104: apply the set union operation to all ontologies in the ontology set O to form a new ontology o.
Specifically, apply the set union operation to all ontologies in O (for example o_1 and o_2, each ontology being regarded as a set of axioms) to form a new ontology o, and edit the new ontology o according to the needs of describing the target domain (for example the SLR camera domain); the editing includes adding terms and axioms, deleting terms and axioms, modifying terms and axioms, and so on.
Step 105: name the terms described in the new ontology o.
Specifically, name each term described in the new ontology o with a word in L_s; for example, one term in the new ontology o is named "camera lens".
Step 106: acquire synonyms according to the names of the terms described in the new ontology o.
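Because step 104 treats each ontology as a set of axioms, the union and the later manual editing can be illustrated with ordinary set operations. The sketch below assumes, purely for illustration, that an axiom is a (subject, relation, object) triple; the axioms themselves and the name `merge_ontologies` are hypothetical and are not taken from any real ontology.

```python
from typing import FrozenSet, Iterable, Tuple

Axiom = Tuple[str, str, str]   # hypothetical (subject, relation, object) triple
Ontology = FrozenSet[Axiom]    # an ontology regarded as a set of axioms


def merge_ontologies(ontologies: Iterable[Ontology]) -> Ontology:
    """Set union over all ontologies; shared axioms are kept only once."""
    merged = set()
    for onto in ontologies:
        merged |= onto
    return frozenset(merged)


o1 = frozenset({
    ("Sensor", "subClassOf", "Component"),
    ("Pixel", "partOf", "Sensor"),
})
o2 = frozenset({
    ("Aperture", "partOf", "Lens"),
    ("Sensor", "subClassOf", "Component"),   # also present in o1
})
o = merge_ontologies([o1, o2])

# Manual editing afterwards: delete one axiom and add another.
o_edited = (o - {("Pixel", "partOf", "Sensor")}) | {("FocalLength", "propertyOf", "Lens")}
```

Modifying an axiom can be expressed in this representation as a deletion followed by an addition.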
Fig. 2 is a schematic flowchart of a specific method for implementing step 102. As shown in Fig. 2, the method comprises the following steps:
Step 201: build a tree in which each node has a label and a processing flag.
Specifically, build a binary tree data structure (called the tree) in which each node carries a label and a processing flag. Initially the tree contains only one node, whose label is W0 and whose processing flag is "unprocessed".
Step 202: determine whether the processing flags of all nodes in the tree are "processed"; if so, go to step 207; otherwise, go to step 203.
Step 203: pick any node from the nodes whose processing flag is "unprocessed"; this node is the current node, and the keyword set that serves as its label is the current set.
Step 204: determine whether the current set contains only one keyword; if so, go to step 206; otherwise, go to step 205.
Step 205: divide the current set into two subsets, add the more important subset W_l to the tree as the left child of the current node, and add the other subset W_r to the tree as the right child of the current node.
Specifically, divide the current set into two subsets according to the following principle: treat the keywords in the current set as a description of a domain or scope, and treat the keywords in the two subsets as descriptions of two different subdomains or subranges of that domain or scope.
Assess how important each of the two subsets is for describing the target domain. Add the more important subset W_l to the tree as the left child of the current node, with label W_l and processing flag "unprocessed"; add the other subset W_r to the tree as the right child of the current node, with label W_r and processing flag "unprocessed".
Step 206: change the processing flag of the current node to "processed", then go to step 202.
Step 207: each keyword w in the keyword set W0 corresponds to the node in the tree that satisfies the following condition: the keyword set that serves as the node's label contains w and only w. Based on this correspondence between the keywords in W0 and the node labels in the tree, form a keyword sequence S0 according to the depth-first traversal order of the nodes corresponding to the keywords in W0.
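Steps 201 to 207 can be sketched in Python as below. The sketch rests on assumptions that go beyond the text: the division into a more important and a less important subset is a human judgment in the method, so it is passed in as a `split` callback, and the `split_by_rank` helper that stands in for that judgment is hypothetical, as are all names used here.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

KeywordSet = FrozenSet[str]
Splitter = Callable[[KeywordSet], Tuple[KeywordSet, KeywordSet]]


@dataclass
class Node:
    label: KeywordSet
    processed: bool = False
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sort_keywords(w0: Iterable[str], split: Splitter) -> List[str]:
    root = Node(frozenset(w0))                  # step 201: single unprocessed node
    pending = [root]                            # nodes flagged "unprocessed"
    while pending:                              # step 202
        node = pending.pop()                    # step 203: any unprocessed node
        if len(node.label) > 1:                 # step 204
            more, less = split(node.label)      # step 205
            if not more or not less or more & less or (more | less) != node.label:
                raise ValueError("split must partition the set into two non-empty parts")
            node.left, node.right = Node(more), Node(less)
            pending += [node.left, node.right]
        node.processed = True                   # step 206
    sequence: List[str] = []                    # step 207: leaves in depth-first order
    stack = [root]
    while stack:
        node = stack.pop()
        if node.left is None:                   # leaf: label holds at most one keyword
            sequence.extend(node.label)
        else:
            stack += [node.right, node.left]    # left child is visited first
    return sequence


def split_by_rank(rank: List[str]) -> Splitter:
    """Hypothetical stand-in for human judgment: better-ranked half goes left."""
    def split(current: KeywordSet) -> Tuple[KeywordSet, KeywordSet]:
        ordered = sorted(current, key=rank.index)
        mid = len(ordered) // 2
        return frozenset(ordered[:mid]), frozenset(ordered[mid:])
    return split


RANK = ["sensor", "pixel", "camera lens", "focal length", "aperture"]
s0 = sort_keywords(set(RANK), split_by_rank(RANK))
```

The order in which unprocessed nodes are picked does not affect the result, because the final sequence depends only on the shape of the finished tree.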
Fig. 3 is a schematic flowchart of Embodiment 1 of the specific method for implementing step 102. As shown in Fig. 3, the method comprises the following steps:
Step 301: build a tree in which each node has a label and a processing flag.
Specifically, build a binary tree data structure (called the tree) in which each node carries a label and a processing flag. Initially the tree contains only one node, for example node A in Fig. 4, whose label is W0 = {"camera lens", "pixel", "aperture", "focal length", "sensor"} and whose processing flag is "unprocessed".
Step 302: choose a current node from the nodes whose processing flag is "unprocessed"; the keyword set that serves as the label of the current node is the current set.
Specifically, because the tree contains a node whose processing flag is "unprocessed", pick one such node at random; this node is called the current node, and the keyword set that serves as its label is called the current set. For example, the current node is node A in Fig. 4 and the current set is {"camera lens", "pixel", "aperture", "focal length", "sensor"}.
Step 303: if the current set contains more than one keyword, divide it into two subsets, for example the subset {"pixel", "sensor"} and the subset {"camera lens", "aperture", "focal length"}.
Step 304: add the more important subset W_l to the tree as the left child of the current node, and add the other subset W_r to the tree as the right child of the current node.
Specifically, assess how important each of the two subsets is for describing the target domain (for example the SLR camera domain). Add the more important subset, for example W_l = {"pixel", "sensor"}, to the tree as the left child (node B in Fig. 4) of the current node (node A in Fig. 4), with label W_l and processing flag "unprocessed". Add the other subset, for example W_r = {"camera lens", "aperture", "focal length"}, to the tree as the right child (node C in Fig. 4) of the current node, with label W_r and processing flag "unprocessed".
Step 305: change the processing flag of the current node (node A in Fig. 4) to "processed". Proceeding in the same way, nodes D, E, F, G, H and I are added to the tree in turn, as shown in Fig. 4; in the course of these additions the processing flags of nodes B and C are changed to "processed".
Step 306: if the tree still contains a node whose processing flag is "unprocessed", pick one such node at random; this node is called the current node, and the keyword set that serves as its label is called the current set. For example, the current node is node D in Fig. 4 and the current set is {"sensor"}.
Step 307: if the current set contains only one keyword, change the processing flag of the current node (for example node D in Fig. 4) to "processed"; proceeding in the same way, change the processing flags of nodes E, F, H and I to "processed".
Step 308: when the processing flags of all nodes in the tree are "processed", form the keyword sequence S0 according to the depth-first traversal order of the nodes corresponding to the keywords in W0.
Specifically, once the processing flags of all nodes in the tree are "processed", use the correspondence between the keywords in W0 and the node labels in the tree: for example, "camera lens" corresponds to the label of node F, "pixel" to node E, "aperture" to node I, "focal length" to node H, and "sensor" to node D. Sort all keywords in W0 according to the depth-first traversal order of their corresponding nodes (for example D, E, F, H, I in Fig. 4) to form a keyword sequence S0, that is, S0 = <"sensor", "pixel", "camera lens", "focal length", "aperture">.
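The finished tree of this embodiment can be written down directly and traversed to check the resulting order. In the Python sketch below the tuple layout and the helper names are hypothetical. The leaf labels and the labels of nodes A, B and C follow the text; the label of the internal node G is not stated in the text and is inferred here from its children H and I.

```python
# Each node is a tuple: (name, label, left child, right child).
def leaf(name, keyword):
    return (name, frozenset({keyword}), None, None)


D = leaf("D", "sensor")
E = leaf("E", "pixel")
F = leaf("F", "camera lens")
H = leaf("H", "focal length")
I = leaf("I", "aperture")
G = ("G", frozenset({"focal length", "aperture"}), H, I)   # inferred label
C = ("C", frozenset({"camera lens", "aperture", "focal length"}), F, G)
B = ("B", frozenset({"pixel", "sensor"}), D, E)
A = ("A", B[1] | C[1], B, C)                               # root labelled W0


def leaves_depth_first(node):
    """Return (node name, keyword) pairs for the leaves, left subtree first."""
    name, label, left, right = node
    if left is None and right is None:
        return [(name, next(iter(label)))]
    found = []
    for child in (left, right):
        if child is not None:
            found += leaves_depth_first(child)
    return found


order = leaves_depth_first(A)
s0 = [keyword for _, keyword in order]
```

Visiting the left child before the right child is what places the keywords of the more important subset earlier in S0.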
Fig. 5 is a schematic flowchart of a specific method for implementing step 103. As shown in Fig. 5, the method comprises the following steps:
Step 501: create a set O of ontologies to be reused; initially O is the empty set.
Step 502: denote the keyword sequence S0 as S.
Step 503: obtain the longest subsequence S_h among the continuous prefix subsequences of S that satisfy the following condition: when all keywords in the subsequence are combined into one query keyword group and the group is submitted to the ontology retrieval system, the retrieval result is not empty (the retrieval result for S_h is denoted HITS(S_h)). Cut S_h off the front of S to obtain the remaining continuous suffix subsequence S_t.
Step 504: determine whether S_h is an empty sequence. If S_h is empty (that is, S has no continuous prefix subsequence that satisfies the condition in step 503), delete the first keyword from S_t; otherwise, add the highest-ranked ontology in HITS(S_h) to O.
Step 505: determine whether S_t is an empty sequence. If S_t is not empty (that is, keywords remain to be processed), denote S_t as S and go to step 503; otherwise, the flow ends.
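Steps 501 to 505 amount to a greedy longest-prefix loop, which might be sketched in Python as follows. The function name `collect_ontologies`, the `search` callback (standing in for the ontology retrieval system and returning a ranked list of ontology identifiers), and the toy index are all assumptions made for illustration; no real retrieval API is implied.

```python
from typing import Callable, List, Sequence


def collect_ontologies(
    s0: Sequence[str],
    search: Callable[[Sequence[str]], List[str]],
) -> List[str]:
    reuse: List[str] = []                     # step 501: O starts empty
    s = list(s0)                              # step 502
    while s:
        s_h: List[str] = []                   # step 503: longest prefix with hits
        hits: List[str] = []
        for end in range(len(s), 0, -1):
            hits = search(s[:end])
            if hits:
                s_h = s[:end]
                break
        s_t = s[len(s_h):]                    # remaining suffix
        if not s_h:                           # step 504: no prefix matched
            s_t = s_t[1:]                     # drop the first keyword
        elif hits[0] not in reuse:
            reuse.append(hits[0])             # keep the top-ranked ontology
        s = s_t                               # step 505: repeat while keywords remain
    return reuse


def toy_search(keywords: Sequence[str]) -> List[str]:
    """Hypothetical retrieval stub with a tiny hard-coded index."""
    index = {("a", "b"): ["onto_ab", "onto_other"], ("c",): ["onto_c"]}
    return index.get(tuple(keywords), [])


reuse_set = collect_ontologies(["a", "b", "x", "c"], toy_search)
```

A list with a membership check is used instead of a set only to keep the order in which ontologies were found; trying the longest prefix first is what lets a single retrieved ontology cover as many keywords as possible.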
Fig. 6 is the schematic flow sheet of embodiment mono-of the concrete grammar of performing step 103 of the present invention, and as shown in Figure 6, the method comprises the following steps:
Step 601, create one and treat multiplexing body set O, and when initial, O is empty set.
Step 602, by keyword sequence S 0be denoted as S, i.e. S=<" sensor ", " pixel ", " camera lens ") " focal length ", " aperture ">.
Step 603, obtain a longest subsequence S in the continuous subsequence of the prefix satisfied condition in S h, by S hclip from the front end of S, obtain the continuous subsequence S of remaining suffix t;
Concrete, obtain a longest subsequence S in the continuous subsequence of the prefix that meets following condition in S h, wherein, described condition is after all keywords in this subsequence are combined into a searching keyword group, for example, after this searching keyword group is submitted to body searching system (Swoogle), result for retrieval is not empty (S hcorresponding result for retrieval is denoted as HITS (S h));
For example, after "sensor pixel camera lens focal length aperture", "sensor pixel camera lens focal length" and "sensor pixel camera lens" are each submitted to Swoogle, the retrieval result is empty in every case; after "sensor pixel" is submitted to Swoogle, the retrieval result is not empty, so S_h = <"sensor", "pixel">. S_h is clipped from the front of S, and the remaining contiguous suffix subsequence is denoted S_t, where S_t = <"camera lens", "focal length", "aperture">.
Step 604: since S_h is not an empty sequence, the highest-ranked ontology o_1 in the retrieval result HITS(S_h) is added to the ontology set O.
Step 605: since S_t is not an empty sequence, the remaining contiguous suffix subsequence S_t is denoted S.
Step 606: obtain the longest subsequence S_h among the contiguous prefix subsequences of S that satisfy the condition, clip S_h from the front of S, and obtain the remaining contiguous suffix subsequence S_t.
Specifically, obtain the longest subsequence S_h among the contiguous prefix subsequences of S that satisfy the following condition: when all keywords in the subsequence are combined into one query keyword group and that group is submitted to the ontology retrieval system (for example, Swoogle), the retrieval result is not empty (the retrieval result corresponding to S_h is denoted HITS(S_h)).
For example, after "camera lens focal length aperture", "camera lens focal length" and "camera lens" are each submitted to Swoogle, the retrieval result is empty in every case, so S_h is the empty sequence. Clipping S_h from the front of S leaves S unchanged, so the remaining contiguous suffix subsequence is S_t = <"camera lens", "focal length", "aperture">.
Step 607: since S_h is an empty sequence, the first keyword, here "camera lens", is deleted from S_t, giving S_t = <"focal length", "aperture">.
Step 608: since S_t is not an empty sequence, S_t is denoted S.
Step 609: obtain the longest subsequence S_h among the contiguous prefix subsequences of S that satisfy the condition, clip S_h from the front of S, and obtain the remaining contiguous suffix subsequence S_t.
Specifically, obtain the longest subsequence S_h among the contiguous prefix subsequences of S that satisfy the following condition: when all keywords in the subsequence are combined into one query keyword group and that group is submitted to the ontology retrieval system (for example, Swoogle), the retrieval result is not empty (the retrieval result corresponding to S_h is denoted HITS(S_h)).
For example, after "focal length aperture" is submitted to Swoogle, the retrieval result is not empty, so S_h = <"focal length", "aperture">. S_h is clipped from the front of S, and the remaining contiguous suffix subsequence S_t is the empty sequence.
Step 610: since S_h is not an empty sequence, the highest-ranked ontology o_2 in HITS(S_h) is added to the ontology set O.
Step 611: since S_t is an empty sequence, the procedure ends, and the final ontology set is O = {o_1, o_2}.
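The loop walked through in the steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `collect_ontologies`, the `search` callback standing in for an ontology retrieval system such as Swoogle, and the identifiers "o1" and "o2" are all assumptions introduced here, and the toy index simply mirrors the worked example.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for an ontology retrieval system such as Swoogle:
# takes a query string and returns a ranked list of ontology identifiers.
SearchFn = Callable[[str], List[str]]


def collect_ontologies(s0: Sequence[str], search: SearchFn) -> List[str]:
    """Greedy longest-prefix retrieval over the keyword sequence S_0."""
    reuse_set: List[str] = []              # ontology set O, in insertion order
    s = list(s0)                           # current sequence S
    while s:
        s_h: List[str] = []                # longest prefix with non-empty hits
        hits: List[str] = []
        for end in range(len(s), 0, -1):   # try the longest prefix first
            result = search(" ".join(s[:end]))
            if result:
                s_h, hits = s[:end], result
                break
        s_t = s[len(s_h):]                 # clip S_h from the front of S
        if s_h:
            if hits[0] not in reuse_set:
                reuse_set.append(hits[0])  # highest-ranked ontology joins O
        else:
            s_t = s_t[1:]                  # no prefix matched: drop first keyword
        s = s_t                            # S_t becomes the new S
    return reuse_set


# Toy index reproducing the worked example; "o1" and "o2" are placeholders.
toy_index = {
    "sensor pixel": ["o1"],
    "focal length aperture": ["o2"],
}
keywords = ["sensor", "pixel", "camera lens", "focal length", "aperture"]
result = collect_ontologies(keywords, lambda q: toy_index.get(q, []))
print(result)  # ['o1', 'o2']
```

Because a keyword is discarded whenever no prefix matches, the loop always shortens S and therefore terminates, at the cost of leaving that keyword uncovered by any reused ontology.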
Fig. 7 is a schematic flowchart of a specific method for implementing step 106 of the present invention. As shown in Fig. 7, the method comprises the following steps:
Step 701: for the name t of each term described in the new ontology o, for example t = "camera lens", create three keyword sets, denoted SYN, TRANS and TS; initially, SYN, TRANS and TS are all empty sets.
Step 702: submit t (for example "camera lens") to a translation system from L_s to another natural language L_t, for example with L_t = English and Google Translate as the translation system, and add all keywords in the translation result, for example "shot", "camera lens" and "camera shot", to the set TRANS, i.e. TRANS = {"shot", "camera lens", "camera shot"}.
Step 703: for each keyword trans in the set TRANS, for example trans = "camera lens", obtain all synonyms of trans from a thesaurus of L_t (for example WordNet), for example "optical lens", and add all synonyms obtained to the set TS. Likewise, the synonyms of "shot" include "guess" and "snap", and "camera shot" has no synonyms, so TS = {"guess", "snap", "optical lens"}.
Step 704: add all keywords in the set TS to the set TRANS, so that TRANS = {"shot", "camera lens", "camera shot", "guess", "snap", "optical lens"}.
Step 705: for each keyword trans' in the set TRANS, for example trans' = "optical lens", submit trans' to a translation system (for example Google Translate) from L_t (English) to L_s (i.e. Chinese), and add all keywords in the translation result, for example "optical lens", to the set SYN. Likewise, the translation result of "shot" includes "shooting", "camera lens" and "dosage"; the translation result of "camera lens" includes "camera lens"; the translation result of "camera shot" includes "camera lens"; the translation result of "guess" includes "conjecture" and "supposition"; and the translation result of "snap" includes "unit" and "dropping fire". The set SYN = {"shooting", "camera lens", "dosage", "conjecture", "supposition", "unit", "dropping fire", "optical lens"}.
Step 706 (optional): to improve the accuracy of the synonym acquisition result, all keywords that are not suitable as synonyms of t (including t itself, for example "camera lens") may also be deleted from the set SYN. For example, "shooting", "camera lens", "dosage", "conjecture", "supposition", "unit" and "dropping fire" are all unsuitable as synonyms of t, and the keywords remaining in SYN, for example "optical lens", are taken as the acquired synonyms of t (for example "camera lens"). Here, a keyword is unsuitable as a synonym of t if it cannot be substituted for t in the current domain, and suitable if it can.
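Steps 701 to 706 amount to a round-trip translation with a thesaurus lookup in the middle, which can be sketched in Python as below. Everything here is illustrative: the function names, the dictionary-backed stand-ins for the translation systems and the thesaurus, and the "zh:" prefix (used only to keep L_s terms apart from their L_t counterparts in this toy data) are assumptions, not part of the described method. Step 706 is modelled as a caller-supplied predicate because the suitability judgement depends on the domain.

```python
from typing import Callable, Dict, Iterable, List, Set

Lookup = Callable[[str], Iterable[str]]  # word -> translations or synonyms


def raw_synonyms(t: str, to_target: Lookup, thesaurus: Lookup,
                 to_source: Lookup) -> Set[str]:
    """Steps 701 to 705: build SYN by round-trip translation."""
    trans: Set[str] = set(to_target(t))      # step 702: fill TRANS
    ts: Set[str] = set()
    for word in trans:                       # step 703: fill TS from the thesaurus
        ts.update(thesaurus(word))
    trans |= ts                              # step 704: add TS to TRANS
    syn: Set[str] = set()
    for word in trans:                       # step 705: translate back into L_s
        syn.update(to_source(word))
    return syn


def filter_synonyms(t: str, syn: Set[str],
                    is_substitutable: Callable[[str], bool]) -> Set[str]:
    """Step 706 (optional): drop t itself and every unsuitable keyword."""
    return {w for w in syn if w != t and is_substitutable(w)}


# Toy data mirroring the worked example (hypothetical stand-ins for
# Google Translate and WordNet).
to_english: Dict[str, List[str]] = {
    "zh:camera lens": ["shot", "camera lens", "camera shot"],
}
english_thesaurus: Dict[str, List[str]] = {
    "camera lens": ["optical lens"],
    "shot": ["guess", "snap"],
}
to_chinese: Dict[str, List[str]] = {
    "shot": ["zh:shooting", "zh:camera lens", "zh:dosage"],
    "camera lens": ["zh:camera lens"],
    "camera shot": ["zh:camera lens"],
    "guess": ["zh:conjecture", "zh:supposition"],
    "snap": ["zh:unit", "zh:dropping fire"],
    "optical lens": ["zh:optical lens"],
}

syn = raw_synonyms(
    "zh:camera lens",
    lambda w: to_english.get(w, []),
    lambda w: english_thesaurus.get(w, []),
    lambda w: to_chinese.get(w, []),
)
accepted = filter_synonyms("zh:camera lens", syn,
                           lambda w: w == "zh:optical lens")
print(sorted(accepted))  # ['zh:optical lens']
```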
To implement the above method, the present invention also provides a system for constructing a domain ontology. Fig. 8 is a schematic structural diagram of this system. As shown in Fig. 8, the system comprises an enumeration unit 81, a sorting unit 82, an adding unit 83 and a union operation processing unit 84, wherein:
the enumeration unit 81 is configured to enumerate the names of all terms that need to be described by the target ontology, forming a keyword set W_0;
the sorting unit 82 is configured to sort all keywords in the keyword set W_0, forming a keyword sequence S_0;
the adding unit 83 is configured to create an ontology set O to be reused, submit all keywords in contiguous subsequences extracted from the keyword sequence S_0 to an ontology retrieval system, and add the highest-ranked ontology in the retrieval result to the ontology set O;
the union operation processing unit 84 is configured to perform a set union operation on all ontologies in the ontology set O, forming a new ontology o.
The system further comprises:
a naming unit 85, configured to name the terms described in the new ontology o;
an acquiring unit 86, configured to perform synonym acquisition according to the names of the terms described in the new ontology o.
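As a rough illustration of what unit 84 does, the set union over the ontologies in O can be sketched in Python. The encoding of an ontology as a frozen set of (subject, predicate, object) triples, the function name and the sample axioms are assumptions made for this sketch; the text does not prescribe a representation.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# Assumed encoding: an ontology is a set of (subject, predicate, object) axioms.
Axiom = Tuple[str, str, str]
Ontology = FrozenSet[Axiom]


def union_of_ontologies(ontologies: Iterable[Ontology]) -> Ontology:
    """Form the new ontology o as the set union of all ontologies in O."""
    merged: Set[Axiom] = set()
    for ontology in ontologies:
        merged |= ontology       # axioms shared by several ontologies are kept once
    return frozenset(merged)


# Hypothetical sample ontologies standing in for o_1 and o_2.
o1: Ontology = frozenset({
    ("Sensor", "subClassOf", "Component"),
    ("Pixel", "partOf", "Sensor"),
})
o2: Ontology = frozenset({
    ("Aperture", "partOf", "Lens"),
    ("Sensor", "subClassOf", "Component"),
})
new_ontology = union_of_ontologies([o1, o2])
print(len(new_ontology))  # 3
```

A plain union keeps every axiom from every reused ontology, so any conflicting or redundant axioms still have to be handled in a later editing pass.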
The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (12)

1. A method for constructing a domain ontology, characterized in that the method comprises:
enumerating the names of all terms that need to be described by the target ontology, forming a keyword set W_0;
sorting all keywords in the keyword set W_0, forming a keyword sequence S_0;
creating an ontology set O to be reused, submitting all keywords in contiguous subsequences extracted from the keyword sequence S_0 to an ontology retrieval system, and adding the highest-ranked ontology in the retrieval result to the ontology set O;
performing a set union operation on all ontologies in the ontology set O, forming a new ontology o.
2. The method according to claim 1, characterized in that the method further comprises: naming the terms described in the new ontology o, and performing synonym acquisition according to the names of the terms described in the new ontology o.
3. The method according to claim 1, characterized in that enumerating the names of all terms that need to be described by the target ontology, forming a keyword set W_0, comprises:
for the target domain described by the target ontology, using keywords in a natural language L_s to enumerate the names of all terms that need to be described by the target ontology, forming a keyword set W_0.
4. The method according to claim 1, characterized in that sorting all keywords in the keyword set W_0, forming a keyword sequence S_0, comprises:
building a tree in which each node has a label and a processing mark;
determining whether the processing marks of all nodes in the tree are "processed"; if not, selecting a current node from among the nodes in the tree whose processing mark is "unprocessed", the label of the current node, a keyword set drawn from W_0, being the current set;
determining whether the current set contains only one keyword; when the current set contains more than one keyword, dividing the current set into two subsets, adding the more important of the two subsets, W_l, to the tree as the left child node of the current node, adding the other subset, W_r, to the tree as the right child node of the current node, and changing the processing mark of the current node to "processed"; otherwise, changing the processing mark of the current node to "processed"; then continuing to determine whether the processing marks of all nodes in the tree are "processed", until the processing marks of all nodes in the tree are "processed", whereupon the keyword sequence S_0 is formed according to the depth-first traversal order of the nodes corresponding to the keywords in the keyword set W_0.
5. The method according to claim 4, characterized in that dividing the current set into two subsets comprises:
taking the keywords in the current set as a description of a domain or scope, and taking the keywords in the two subsets respectively as descriptions of two different sub-domains or sub-scopes of that domain or scope.
6. The method according to claim 1, characterized in that submitting all keywords in contiguous subsequences extracted from the keyword sequence S_0 to an ontology retrieval system, and adding the highest-ranked ontology in the retrieval result to the ontology set O, comprises:
creating the ontology set O to be reused, denoting the keyword sequence S_0 as S, obtaining the longest subsequence S_h among the contiguous prefix subsequences of S that satisfy a condition, clipping S_h from the front of S, and obtaining the remaining contiguous suffix subsequence S_t;
determining whether S_h is an empty sequence; if S_h is an empty sequence, deleting the first keyword from S_t; if S_h is not an empty sequence, adding the highest-ranked ontology in the retrieval result HITS(S_h) to O;
determining whether S_t is an empty sequence; if S_t is not an empty sequence, denoting S_t as S, and again obtaining the longest subsequence S_h among the contiguous prefix subsequences of S that satisfy the condition, clipping S_h from the front of S, and obtaining the remaining contiguous suffix subsequence S_t; otherwise, if S_t is an empty sequence, ending the procedure.
7. The method according to claim 6, characterized in that the condition is that, when all keywords in the subsequence are combined into one query keyword group and the query keyword group is submitted to the ontology retrieval system, the retrieval result HITS(S_h) is not empty.
8. The method according to claim 1, characterized in that performing a set union operation on all ontologies in the ontology set O, forming a new ontology o, comprises:
performing a set union operation on all ontologies in the ontology set O, forming a new ontology o, and editing the new ontology o according to the requirements of describing the target domain;
the editing comprises at least adding terms and axioms, deleting terms and axioms, and modifying terms and axioms.
9. The method according to claim 2, characterized in that naming the terms described in the new ontology o comprises: naming each term described in the new ontology o with a word in L_s.
10. The method according to claim 2, characterized in that performing synonym acquisition according to the names of the terms described in the new ontology o comprises:
for the name t of each term described in the new ontology o, creating three keyword sets SYN, TRANS and TS;
submitting t to a translation system from L_s to another natural language L_t, and adding all keywords in the translation result to the set TRANS;
for each keyword trans in the set TRANS, obtaining all synonyms of trans from a thesaurus of L_t, and adding all synonyms obtained to the set TS;
adding all keywords in the set TS to the set TRANS, and, for each keyword trans' in the set TRANS, submitting trans' to a translation system from L_t to L_s and adding all keywords in the translation result to the set SYN;
deleting from the set SYN all keywords that are not suitable as synonyms of t, the keywords remaining in SYN being the acquired synonyms of t.
11. A system for constructing a domain ontology, characterized in that the system comprises an enumeration unit, a sorting unit, an adding unit and a union operation processing unit, wherein:
the enumeration unit is configured to enumerate the names of all terms that need to be described by the target ontology, forming a keyword set W_0;
the sorting unit is configured to sort all keywords in the keyword set W_0, forming a keyword sequence S_0;
the adding unit is configured to create an ontology set O to be reused, submit all keywords in contiguous subsequences extracted from the keyword sequence S_0 to an ontology retrieval system, and add the highest-ranked ontology in the retrieval result to the ontology set O;
the union operation processing unit is configured to perform a set union operation on all ontologies in the ontology set O, forming a new ontology o.
12. The system according to claim 11, characterized in that the system further comprises:
a naming unit, configured to name the terms described in the new ontology o;
an acquiring unit, configured to perform synonym acquisition according to the names of the terms described in the new ontology o.
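The tree-based ordering recited in claim 4 can be sketched in Python as follows. This is a hedged illustration only: the `Node` class, the function `order_keywords` and the ranking-based `split_by_ranking` helper are hypothetical, and in particular the way a set is divided into a more important subset W_l and a remaining subset W_r is a domain judgement in the claims, modelled here by a fixed importance ranking.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Optional, Tuple

SplitFn = Callable[[FrozenSet[str]], Tuple[FrozenSet[str], FrozenSet[str]]]


@dataclass
class Node:
    label: FrozenSet[str]            # keyword set labelling this node
    processed: bool = False          # processing mark
    left: Optional["Node"] = None    # more important subset W_l
    right: Optional["Node"] = None   # remaining subset W_r


def order_keywords(w0: Iterable[str], split: SplitFn) -> List[str]:
    """Build the binary tree, then read the leaves depth-first, left first."""
    root = Node(frozenset(w0))
    unprocessed = [root]
    while unprocessed:                       # until every node is "processed"
        node = unprocessed.pop()
        if len(node.label) > 1:
            w_l, w_r = split(node.label)
            if not w_l or not w_r or (w_l | w_r) != node.label or (w_l & w_r):
                raise ValueError("split must give two non-empty disjoint subsets")
            node.left, node.right = Node(w_l), Node(w_r)
            unprocessed += [node.left, node.right]
        node.processed = True
    sequence: List[str] = []
    stack = [root]
    while stack:                             # iterative depth-first traversal
        node = stack.pop()
        if node.left is not None and node.right is not None:
            stack += [node.right, node.left]  # left child is visited first
        else:
            sequence.extend(node.label)       # leaf: zero or one keyword
    return sequence


# Hypothetical splitter: rank keywords by an assumed importance order and
# put the more important half in the left subset.
ranking = ["sensor", "pixel", "camera lens", "focal length", "aperture"]


def split_by_ranking(current: FrozenSet[str]):
    ordered = sorted(current, key=ranking.index)
    half = (len(ordered) + 1) // 2
    return frozenset(ordered[:half]), frozenset(ordered[half:])


s0 = order_keywords(ranking, split_by_ranking)
print(s0)  # ['sensor', 'pixel', 'camera lens', 'focal length', 'aperture']
```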
CN201210017772.7A 2012-01-19 2012-01-19 Method and system for constructing domain ontology Active CN103218362B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210017772.7A CN103218362B (en) 2012-01-19 2012-01-19 A kind of Methodologies for Building Domain Ontology and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210017772.7A CN103218362B (en) 2012-01-19 2012-01-19 A kind of Methodologies for Building Domain Ontology and system

Publications (2)

Publication Number Publication Date
CN103218362A true CN103218362A (en) 2013-07-24
CN103218362B CN103218362B (en) 2016-12-14

Family

ID=48816165

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210017772.7A Active CN103218362B (en) 2012-01-19 2012-01-19 Method and system for constructing domain ontology

Country Status (1)

Country Link
CN (1) CN103218362B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398858A (en) * 2008-11-07 2009-04-01 西安交通大学 Web service semantic extracting method based on noumenon learning
US20090204576A1 (en) * 2008-02-08 2009-08-13 Daniel Paul Kolz Constructing a Domain-Specific Ontology by Mining the Web
CN101807181A (en) * 2009-02-17 2010-08-18 日电(中国)有限公司 Method and equipment for restoring inconsistent body
CN101944099A (en) * 2010-06-24 2011-01-12 西北工业大学 Method for automatically classifying text documents by utilizing body
CN102254014A (en) * 2011-07-21 2011-11-23 华中科技大学 Adaptive information extraction method for webpage characteristics

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
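DING L et al.: "Swoogle: A semantic web search and metadata engine", PROC. 13TH ACM CONF. ON INFORMATION AND KNOWLEDGE MANAGEMENT. 2004, 304. *
Liu Wendi et al.: "P2P information retrieval based on keyword settings", Computer Applications and Software *
Wang Xiaoyue et al.: "Ontology integration: a survey of concepts, processes, tools and methods", Library and Information Service *
Chen Gang et al.: "Virtual domain ontology construction based on domain knowledge reuse", Journal of Software *
Ma Xue: "On retrieval techniques for Internet economic information", Contemporary Library *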

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103593410A (en) * 2013-10-22 2014-02-19 上海交通大学 System for search recommendation by means of replacing conceptual terms
CN103593410B (en) * 2013-10-22 2017-04-12 上海交通大学 System for search recommendation by means of replacing conceptual terms
US10095689B2 (en) 2014-12-29 2018-10-09 International Business Machines Corporation Automated ontology building
US10095690B2 (en) 2014-12-29 2018-10-09 International Business Machines Corporation Automated ontology building
US10360307B2 (en) 2014-12-29 2019-07-23 International Business Machines Corporation Automated ontology building
US10360308B2 (en) 2014-12-29 2019-07-23 International Business Machines Corporation Automated ontology building

Also Published As

Publication number Publication date
CN103218362B (en) 2016-12-14

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant