CN104182420A - Ontology-based Chinese name disambiguation method - Google Patents

Publication number
CN104182420A
Authority
CN
China
Prior art keywords
personage
attribute
similarity
information
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310202444.9A
Other languages
Chinese (zh)
Inventor
吕钊
罗年洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University
Priority to CN201310202444.9A
Publication of CN104182420A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an ontology-based Chinese name disambiguation method. The method comprises the following steps: defining person attributes, that is, defining the concepts, attributes and relations involved in a person ontology; and constructing the person ontology by defining, according to the person attribute information, a low-level, detailed application ontology expressed as a quadruple PO = {C, P, R, I}, where C is the set of concepts (classes), P is the set of data properties and object properties, R is the set of relations between concepts, between concepts and their instances, and between concepts and attributes, and I is the set of instances. R covers four core relation types: the category (kind-of) relation, the part-of relation, the instance-of relation and the attribute-of relation. The method can effectively solve the entity linking problem for Chinese personal names, resolves name mismatches better, and improves recognition performance.

Description

An ontology-based Chinese personal name disambiguation method
Technical field
The present invention relates to the field of natural language processing. Specifically, it builds a person ontology that associates Chinese personal names with their attribute information, so that a name can be linked to the real-world entity it refers to and the ambiguity of name keywords can be resolved.
Background art
Name disambiguation has gradually become a focus of retrieval research. Name ambiguity adversely affects applications such as name queries, mining of relationships between people, and filtering of information about sensitive persons. When a name is queried, a search engine returns a large number of web pages containing that name, and these pages may describe several different entities. Personal names are also highly ambiguous: many people share the same name, and a matching string may not be a personal name at all. Researchers in China and abroad have therefore begun to pay increasing attention to the name disambiguation task. Most existing methods use feature information in the documents to cluster the documents in which a name appears, so that the documents referring to the same person form one cluster. However, how to determine which specific real-world person an ambiguous name in a document refers to remains a problem in urgent need of a solution.
The present invention is based on the "seven-step method" of ontology construction and on SUMO (Suggested Upper Merged Ontology), which the application describes as developed at Stanford University. It defines the names of a person's attributes (such as nationality and occupation) and the concepts of the person ontology together with their hierarchy, and creates a knowledge base of person ontology instances. It targets two kinds of information in the person-name entries of Baidu Baike: the semi-structured encyclopedia card (for example, for a celebrity such as Yao Ming) and the unstructured profile (for example, for an ordinary person such as Wang Wei). For these it develops two extraction approaches, one based on the structural features of HTML and one combining natural language understanding with rules, to extract a person's attribute information. Jena is then used to instantiate the extracted information as ontology instances organised as a tree. The similarity between person ontology instances is studied at the concept level and at the attribute-value level of the person ontology, and the two are combined to measure the overall similarity of person instances.
In view of this, the inventors provide an ontology-based Chinese personal name disambiguation method.
Summary of the invention
In view of the defects of the prior art, the invention provides an ontology-based Chinese personal name disambiguation method that overcomes the difficulties of the prior art. A person ontology is first built from information on the web. When information about a person is encountered, its information blocks are extracted, a person instance is created and matched against the information in the ontology, and the name is linked to the corresponding entity in the target entity list. For example, if the text surrounding "Yao Ming" contains information related to the name, such as "Qianmen affection stall tea" and "Liu Xiaoqing", it can be determined that the name refers to the composer Yao Ming rather than the basketball player Yao Ming.
According to one aspect of the present invention, an ontology-based Chinese personal name disambiguation method is provided, comprising the following steps:
Defining person attributes: defining the concepts, attributes and relations involved in the person ontology;
Defining the concepts of the person ontology and their structure: creating the top-level class "entity", then adding the two major subclasses "abstract" and "material" beneath it;
Defining the attributes of the person ontology, which comprise two parts: data properties and object properties;
Extracting person attributes;
Name instantiation: creating corresponding instances for all concepts in the person ontology, mainly by assigning values to the attributes associated with the concepts in the ontology;
Person ontology instance tree matching: measuring the overall similarity between person instances by measuring their similarity at the concept level of the ontology and their similarity at the attribute-value level of the ontology;
Ranking by similarity; and
Linking the name to the most similar person instance.
Preferably, the person attributes are the set of features a person has, including person name attributes, basic attributes, introductory attributes and social attributes.
Preferably, the concept "person", which represents the person itself, is defined beneath the "material" class;
an "attribute" class is built beneath the "abstract" class, and beneath it, at the intermediate level, six major concept classes are added: person name, basic attributes, introductory information, contact information, value class and personal relations, so that the person ontology is organised as a tree with hypernym-hyponym relations.
Preferably, extracting person attributes includes attribute extraction from semi-structured text: the basic information of a person is extracted from the encyclopedia cards of web pages and converted into a structured, custom XML page. The approach mainly combines HTML structure with semi-structured text extraction: the encyclopedia pages corresponding to a name are collected, their source code is parsed, the information blocks to be extracted are identified, the features of the information blocks and the characteristic HTML tags are analysed, and extraction rules for the information items are derived and then applied to subsequent large batches of encyclopedia pages.
Preferably, extracting person attributes includes attribute extraction from unstructured text, in which the person's information is described by an unstructured profile.
Preferably, the extraction rule for each attribute is defined from three aspects: the trigger words before and after the attribute information, the intrinsic features of the attribute information, and the left and right boundaries of the attribute information.
Preferably, the similarity between person instances at the concept level of the ontology is computed as follows:
$$Sim_c(P_1, P_2) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} sim(c_{1i}, c_{2j})}{\min(m, n)},$$
where $c_{1i}$ and $c_{2j}$ denote any concept node in the sets $C_1$ and $C_2$ respectively; $sim(c_{1i}, c_{2j})$ denotes the similarity between the two concept nodes; and $Sim_c(P_1, P_2)$ denotes the similarity of the two person instances $P_1$ and $P_2$ at the concept level of the ontology.
Preferably, the similarity between person instances at the attribute-value level of the ontology is computed as follows:
$$sim(i_{1p}, i_{2q}) = \max_{p = 1..s;\ q = 1..t} sim(v_{1p}^{k}, v_{2q}^{l})$$
where $i_{1p}$ and $i_{2q}$ denote any attribute-value node in the sets $I_1$ and $I_2$ respectively; $v_{1p}^{k}$ and $v_{2q}^{l}$ are individual values in these two attribute-value nodes; $w_v$ is the weight assigned to an attribute value; and $SV_{1p}$ and $SV_{2q}$ denote the sets of words contained in the attribute values $V_{1p}$ and $V_{2q}$ respectively.
Preferably, measuring the overall similarity between person instances comprises letting a pairing between $P_1$ and $P_2$ be $M = (P_1, P_2)$; the overall similarity between the two person instances is then computed as follows:
$$Sim_p(P_1, P_2) = W_c \cdot Sim_c(P_1, P_2) + (1 - W_c) \cdot Sim_i(P_1, P_2)$$
If the similarity between the two trees exceeds a preset threshold, the two trees are judged to be similar.
Compared with the prior art, the ontology-based Chinese personal name disambiguation method of the present invention, by virtue of the above techniques, builds a person ontology and combines it with a top-down matching algorithm to construct a name disambiguation system. Attribute information about a name is extracted from a document, a person hierarchy tree is built and matched level by level, similarities are computed, and the person the name refers to is finally determined. Experimental results on the name text collection of the CLP2012 evaluation and on a data set of Chinese web pages show that the method based on the person ontology can effectively solve the entity linking problem for Chinese personal names. In addition, according to survey statistics, more than 470,000 person-name entries are currently included in the encyclopedia; the method is therefore applicable and effective for ambiguous names in current web pages, resolves name mismatches well, and improves recognition performance.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flowchart of the ontology-based Chinese personal name disambiguation method according to the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the disambiguation process of the ontology-based Chinese personal name disambiguation method according to the first embodiment of the present invention; and
Fig. 3 is a schematic diagram of the person matching stage of the ontology-based Chinese personal name disambiguation method according to the first embodiment of the present invention.
Detailed description of the embodiments
Those skilled in the art will appreciate that variations can be realised by combining the prior art with the embodiments described here. Such variations do not affect the substance of the present invention and are not described further.
First embodiment
Fig. 1 is a flowchart of the ontology-based Chinese personal name disambiguation method according to the first embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S101: defining person attributes, that is, defining the concepts, attributes and relations involved in the person ontology.
Step S102: defining the concepts of the person ontology and their structure: creating the top-level class "entity", then adding the two major subclasses "abstract" and "material" beneath it.
Step S103: defining the attributes of the person ontology, which comprise two parts: data properties and object properties.
Step S104: extracting person attributes.
Step S105: name instantiation: creating corresponding instances for all concepts in the person ontology, mainly by assigning values to the attributes associated with the concepts in the ontology.
Step S106: person ontology instance tree matching: measuring the overall similarity between person instances by measuring their similarity at the concept level of the ontology and their similarity at the attribute-value level of the ontology.
Step S107: ranking by similarity. And
Step S108: linking the name to the most similar person instance.
In step S101, the person attributes are the set of features a person has, including person name attributes, basic attributes, introductory attributes and social attributes.
In step S102, the concept "person", which represents the person itself, is defined beneath the "material" class;
an "attribute" class is built beneath the "abstract" class, and beneath it, at the intermediate level, six major concept classes are added: person name, basic attributes, introductory information, contact information, value class and personal relations, so that the person ontology is organised as a tree with hypernym-hyponym relations.
In step S104, extracting person attributes includes attribute extraction from semi-structured text: the basic information of a person is extracted from the encyclopedia cards of web pages and converted into a structured, custom XML page. The approach mainly combines HTML structure with semi-structured text extraction: the encyclopedia pages corresponding to a name are collected, their source code is parsed, the information blocks to be extracted are identified, the features of the information blocks and the characteristic HTML tags are analysed, and extraction rules for the information items are derived and then applied to subsequent large batches of encyclopedia pages.
Alternatively, extracting person attributes includes attribute extraction from unstructured text, in which the person's information is described by an unstructured profile. The extraction rule for each attribute is defined from three aspects: the trigger words before and after the attribute information, the intrinsic features of the attribute information, and the left and right boundaries of the attribute information.
In step S106, the similarity between person instances at the concept level of the ontology is computed as follows:
$$Sim_c(P_1, P_2) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} sim(c_{1i}, c_{2j})}{\min(m, n)},$$
where $c_{1i}$ and $c_{2j}$ denote any concept node in the sets $C_1$ and $C_2$ respectively; $sim(c_{1i}, c_{2j})$ denotes the similarity between the two concept nodes; and $Sim_c(P_1, P_2)$ denotes the similarity of the two person instances $P_1$ and $P_2$ at the concept level of the ontology.
In step S106, the similarity between person instances at the attribute-value level of the ontology is computed as follows:
$$sim(i_{1p}, i_{2q}) = \max_{p = 1..s;\ q = 1..t} sim(v_{1p}^{k}, v_{2q}^{l})$$
where $i_{1p}$ and $i_{2q}$ denote any attribute-value node in the sets $I_1$ and $I_2$ respectively; $v_{1p}^{k}$ and $v_{2q}^{l}$ are individual values in these two attribute-value nodes; $w_v$ is the weight assigned to an attribute value; and $SV_{1p}$ and $SV_{2q}$ denote the sets of words contained in the attribute values $V_{1p}$ and $V_{2q}$ respectively.
In step S106, measuring the overall similarity between person instances comprises letting a pairing between $P_1$ and $P_2$ be $M = (P_1, P_2)$; the overall similarity between the two person instances is then computed as follows:
$$Sim_p(P_1, P_2) = W_c \cdot Sim_c(P_1, P_2) + (1 - W_c) \cdot Sim_i(P_1, P_2)$$
If the similarity between the two trees exceeds a preset threshold, the two trees are judged to be similar.
The ontology-based Chinese personal name disambiguation method of the present invention is described in detail below:
Step S101: defining person attributes
First, the concepts, attributes and relations involved in the person ontology are defined. Person attributes are the set of features a person has, and usually include person name attributes, basic attributes, introductory attributes and social attributes.
To build the person ontology, a low-level, detailed application ontology is defined according to the person attribute information, and the person ontology is defined as a quadruple PO = {C, P, R, I}. Here C is the set of concepts (classes), P is the set of data properties and object properties, R is the set of relations between concepts, between concepts and their instances, and between concepts and attributes, and I is the set of instances (that is, attribute values). R covers four core relation types: the kind-of relation, the part-of relation, the instance-of relation and the attribute-of relation.
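The quadruple PO = {C, P, R, I} can be pictured as a small data structure. The following Python sketch is only an illustration under stated assumptions: the class name `PersonOntology`, its field names and the sample entries are hypothetical and do not come from the patent, which builds its ontology with Protégé and Jena.

```python
from dataclasses import dataclass, field

# The four core relation types named in the text.
RELATION_TYPES = {"kind-of", "part-of", "instance-of", "attribute-of"}


@dataclass
class PersonOntology:
    """Illustrative container for the quadruple PO = {C, P, R, I}."""
    concepts: set = field(default_factory=set)     # C: concepts / classes
    properties: set = field(default_factory=set)   # P: data and object properties
    relations: set = field(default_factory=set)    # R: (source, type, target) triples
    instances: dict = field(default_factory=dict)  # I: (instance, property) -> value

    def add_relation(self, source, relation_type, target):
        # Reject anything outside the four relation types listed above.
        if relation_type not in RELATION_TYPES:
            raise ValueError(f"unknown relation type: {relation_type}")
        self.relations.add((source, relation_type, target))


# Hypothetical sample content, for illustration only.
po = PersonOntology()
po.concepts.update({"Entity", "Material", "Person"})
po.properties.add("occupation")
po.add_relation("Person", "kind-of", "Material")
po.add_relation("Yao Ming", "instance-of", "Person")
po.add_relation("occupation", "attribute-of", "Person")
po.instances[("Yao Ming", "occupation")] = "basketball player"
```

Storing relations as typed triples keeps the four relation types explicit, which is one simple way to mirror the role of R in the quadruple.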
Step S102: defining the concepts of the person ontology and their structure
The top-level class "entity" is created, and the two major subclasses "abstract" and "material" are added beneath it. Beneath the "material" class, the concept "person" is defined, representing the person itself. Beneath the "abstract" class, an "attribute" class is built, and beneath it, at the intermediate level, six major concept classes are added: person name, basic attributes, introductory information, contact information, value class and personal relations, so that the person ontology is organised as a tree with hypernym-hyponym relations. The OWL representation of part of the concepts built is as follows:
<owl:Class rdf:ID="Person">
  <rdfs:comment rdf:datatype="http://www.w3.org/2001/XMLSchema#string">
    The Person class corresponds to a complete human entity. It has inherent intrinsic attributes (such as sex and nationality), external social attributes (dynamic attributes such as job information and residence information) and relational attributes (such as social relations and family relations).
  </rdfs:comment>
</owl:Class>
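The concept hierarchy described in step S102 can be sketched as a parent-to-children mapping. This Python fragment is a minimal illustration: the English class names are translations chosen here, and the helper `path_to` is a hypothetical function, not part of the patented method.

```python
# Parent -> children mapping for the concept tree described in step S102.
HIERARCHY = {
    "Entity": ["Abstract", "Material"],
    "Material": ["Person"],
    "Abstract": ["Attribute"],
    "Attribute": [
        "PersonName",
        "BasicAttributes",
        "IntroductoryInformation",
        "ContactInformation",
        "ValueClass",
        "PersonalRelations",
    ],
}


def path_to(concept, root="Entity", tree=HIERARCHY):
    """Return the chain of concepts from `root` down to `concept`, or None."""
    if root == concept:
        return [root]
    for child in tree.get(root, []):
        sub_path = path_to(concept, child, tree)
        if sub_path is not None:
            return [root] + sub_path
    return None


print(path_to("PersonName"))  # ['Entity', 'Abstract', 'Attribute', 'PersonName']
```

The path from the root to a concept makes the hypernym-hyponym chain explicit, which is what the tree organisation is meant to capture.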
Step S103: defining the attributes of the person ontology
Attributes comprise two parts: data properties and object properties. A data property gives a concept a value expressed as data; for example, "date" has the property "value". An object property is a relation between two concepts; for example, a "located in" relation holds between two instances (Mao Zedong's birthplace, Changsha) of the concepts "birthplace" and "place".
The OWL representation of part of the attributes built is as follows:
The unique identifier that each concept instance has.
Indicates the specific geographic location of an organisation or similar entity.
Step S104: person attribute extraction algorithm
Person instances fall into three major types: celebrities who appear frequently on the web, ordinary people who are mentioned only occasionally on the web, and people who are not yet included in the encyclopedia. Because the people differ, the texts describing them also fall into three types, each with its own kind of information extraction: structured text, semi-structured text and free text. Free text is the hardest of the three to process and usually requires natural language processing techniques.
Drawing on existing information extraction techniques and on the different forms in which Baidu Baike describes celebrities and ordinary people, two person attribute extraction algorithms are proposed here: attribute extraction from semi-structured text and attribute extraction from unstructured text.
(1) Attribute extraction from semi-structured text
The main idea is to extract the basic information of a person from the encyclopedia cards of web pages and convert it into a structured, custom XML page. The approach mainly combines HTML structure with semi-structured text extraction: the encyclopedia pages corresponding to a name are collected, their source code is parsed, the information blocks to be extracted are identified, the features of the information blocks and the characteristic HTML tags are analysed, and extraction rules for the information items are derived and then applied to subsequent large batches of encyclopedia pages. For example, Yao Ming's attribute information is extracted:
In the result, an entry such as "Chinese name: Yao Ming" can be normalised into the form "attribute name: attribute value".
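The normalisation into "attribute name: attribute value" pairs can be sketched in a few lines of Python. This is a hedged illustration only: it assumes the card text has already been pulled out of the HTML as one string per entry, and the function name `normalize_pairs` and the sample lines are hypothetical. The HTML-specific extraction rules themselves are not given in the source and are not reproduced here.

```python
import re

# Accept both the ASCII colon and the full-width colon common in Chinese text.
_PAIR = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")


def normalize_pairs(lines):
    """Turn 'name: value' strings into a dict, skipping lines without a colon."""
    result = {}
    for line in lines:
        match = _PAIR.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


# Hypothetical card entries, for illustration only.
card = ["Chinese name: Yao Ming", "Occupation：basketball player", "no separator here"]
print(normalize_pairs(card))
```

Lines that do not contain a separator are silently dropped in this sketch; a real extractor would more likely log them for rule refinement.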
(2) Attribute extraction from unstructured text
When the person instances a name refers to are ordinary people, the encyclopedia generally describes them with an unstructured profile. The extraction rule for each attribute is defined from three aspects: the trigger words before and after the attribute information, the intrinsic features of the attribute information, and the left and right boundaries of the attribute information. Take "Gao Shan" (literally "high mountain") as an example, which can be both a common word and a personal name; the extraction result is as follows:
<?xml version="1.0" encoding="GB2312"?>
<PersonInformation>
  <PersonName>
    <Name>
      <ChineseName>Gao Shan</ChineseName>
    </Name>
    <FormerName>Gao Zengwang</FormerName>
  </PersonName>
  <BasicAttributes>
    <Birthplace>Shenmu County, northern Shaanxi</Birthplace>
  </BasicAttributes>
  <IntroductoryInformation>
    <JobInformation>
      <Occupation>singer</Occupation>
    </JobInformation>
    <PersonHonors>
      <RepresentativeWorks>Xin Tian You on a Wheelchair</RepresentativeWorks>
    </PersonHonors>
  </IntroductoryInformation>
  <PersonalRelations>
    <SocialRelations>
      <SocialContact>Zhao Dadi</SocialContact>
    </SocialRelations>
  </PersonalRelations>
</PersonInformation>
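The three-part rule described for unstructured text (trigger words, intrinsic features of the value, and boundaries) can be sketched with regular expressions. The Python example below is an assumption-laden illustration: the rules in `RULES`, the function `extract_attributes` and the English sample sentence are all hypothetical stand-ins, since the source does not list its actual rules and works on Chinese text.

```python
import re

# Each hypothetical rule has three parts, mirroring the three aspects in the text:
# (left trigger phrase, pattern for the value itself, right boundary).
RULES = {
    "birthplace": (r"born in\s+", r"[A-Z][\w\s,]*?", r"(?=[.;])"),
    "occupation": (r"works as an?\s+", r"[a-z ]+?", r"(?=[.;,])"),
}


def extract_attributes(text, rules=RULES):
    """Apply every rule once and collect the first match for each attribute."""
    found = {}
    for name, (left, value, right) in rules.items():
        match = re.search(left + "(" + value + ")" + right, text)
        if match:
            found[name] = match.group(1).strip()
    return found


# Invented English sentence standing in for a Chinese profile.
profile = "Gao Shan was born in Shenmu County, northern Shaanxi; he works as a singer."
print(extract_attributes(profile))
```

Keeping the trigger, value pattern and boundary as separate fields makes it easy to adjust one aspect of a rule without touching the others.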
Step S105: name instantiation
A complete ontology consists of concepts, attributes and instances of concepts. Instantiating the person ontology means creating corresponding instances for all concepts in the person ontology, mainly by assigning values to the attributes associated with the concepts in the ontology. Protégé 3.3.1, developed at Stanford University, is used as the tool for building and editing the ontology. For example, instances of the concept "person" are added, such as Yao Ming, Li Chen, Zhan Tianyou, Li Bai, Yang Guo and Liu Dan; instances of occupation include actor and railway engineer; instances of alma mater include the Central Drama Institute, the Central Cinema Academy and the Hunan Provincial First Normal School; instances of ethnicity include Han and Hui. Jena is used to parse the concepts, attributes and relations in the ontology, and the Jena API is called to create the corresponding instance for each concept automatically. The resulting person instance card is:
Chinese name: Gao Shan
Former name: Gao Zengwang
Birthplace: Shenmu County, northern Shaanxi
Occupation: singer
Social contact: Zhao Dadi
Representative works: "Xin Tian You on a Wheelchair"
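As a rough picture of instantiation, extracted attribute pairs can be hung under their owning concept classes to form an instance tree. The sketch below uses Python's standard `xml.etree.ElementTree` rather than Jena, which the patent actually uses; the mapping `ATTRIBUTE_TO_CONCEPT`, the element names and the function `build_instance` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from an attribute name to the concept class that owns it.
ATTRIBUTE_TO_CONCEPT = {
    "ChineseName": "PersonName",
    "FormerName": "PersonName",
    "Birthplace": "BasicAttributes",
    "Occupation": "IntroductoryInformation",
}


def build_instance(attributes):
    """Build a tree-shaped person instance from attribute name/value pairs."""
    root = ET.Element("PersonInformation")
    for name, value in attributes.items():
        concept = ATTRIBUTE_TO_CONCEPT.get(name)
        if concept is None:
            continue  # attribute not covered by this toy mapping
        parent = root.find(concept)
        if parent is None:
            parent = ET.SubElement(root, concept)
        ET.SubElement(parent, name).text = value
    return root


tree = build_instance(
    {"ChineseName": "Gao Shan", "Occupation": "singer", "Birthplace": "Shenmu County"}
)
print(ET.tostring(tree, encoding="unicode"))
```

Attributes that share a concept class end up under the same parent element, so the resulting tree has one branch per concept class that received at least one value.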
Step S106: person ontology instance tree matching algorithm
The instance matching problem has shifted the focus of current research from schema-level ontology matching to the instance level. Person instances are output as tree-structured XML documents, and the similarity between two person instances $P_1$ and $P_2$ is measured mainly from the following two aspects.
(1) Measuring the similarity between person instances at the concept level of the ontology
$$Sim_c(P_1, P_2) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} sim(c_{1i}, c_{2j})}{\min(m, n)},$$
where $c_{1i}$ and $c_{2j}$ denote any concept node in the sets $C_1$ and $C_2$ respectively; $sim(c_{1i}, c_{2j})$ denotes the similarity between the two concept nodes; and $Sim_c(P_1, P_2)$ denotes the similarity of the two person instances $P_1$ and $P_2$ at the concept level of the ontology.
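The concept-level formula can be written directly as a short function. In this Python sketch the pairwise similarity defaults to exact matching (1 if two concept names are equal, 0 otherwise); that default is an assumption made here, since the source does not spell out how $sim(c_{1i}, c_{2j})$ is computed. With it, the numerator is simply the number of shared concept nodes when names are unique.

```python
def concept_similarity(concepts1, concepts2, sim=None):
    """Sum pairwise concept similarities and divide by min(m, n)."""
    if sim is None:
        # Assumed default: exact-match similarity between concept names.
        sim = lambda a, b: 1.0 if a == b else 0.0
    m, n = len(concepts1), len(concepts2)
    if min(m, n) == 0:
        return 0.0  # no concepts on one side, nothing to compare
    total = sum(sim(a, b) for a in concepts1 for b in concepts2)
    return total / min(m, n)


# Two hypothetical trees with 10 concept nodes each, 9 of them shared.
tree1 = [f"c{i}" for i in range(10)]
tree2 = [f"c{i}" for i in range(9)] + ["other"]
print(concept_similarity(tree1, tree2))  # 0.9
```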
(2) Measuring the similarity between person instances at the attribute-value level of the ontology
Simple attribute values can be matched directly by string comparison. For a complex attribute value, the long string corresponding to the value is first segmented into words, the word-sequence length of each of the two attribute values is computed, and the number of words shared by the two attribute-value texts is counted; the similarity of the two attribute values is measured on this basis. We call this approach "similar matching".
The similarity is computed as follows:
$$sim(i_{1p}, i_{2q}) = \max_{p = 1..s;\ q = 1..t} sim(v_{1p}^{k}, v_{2q}^{l})$$
where $i_{1p}$ and $i_{2q}$ denote any attribute-value node in the sets $I_1$ and $I_2$ respectively; $v_{1p}^{k}$ and $v_{2q}^{l}$ are individual values in these two attribute-value nodes; $w_v$ is the weight assigned to an attribute value; and $SV_{1p}$ and $SV_{2q}$ denote the sets of words contained in the attribute values $V_{1p}$ and $V_{2q}$ respectively.
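A minimal sketch of "similar matching" follows, under loudly stated assumptions. The source says only that shared words are counted against the word lengths of the two values, so the Dice-style ratio used in `value_similarity` is a stand-in chosen here, not the patent's formula. Whitespace splitting likewise stands in for Chinese word segmentation, and both function names are hypothetical.

```python
def value_similarity(value1, value2):
    """Exact match for identical strings, else a shared-word ratio (assumed form)."""
    if value1 == value2:
        return 1.0
    words1, words2 = set(value1.split()), set(value2.split())
    if not words1 or not words2:
        return 0.0
    shared = words1 & words2
    # Assumed Dice-style ratio: 2 * |shared| / (|SV1| + |SV2|).
    return 2 * len(shared) / (len(words1) + len(words2))


def node_similarity(values1, values2):
    """Take the best similarity over all value pairs of two attribute-value nodes."""
    return max(
        (value_similarity(a, b) for a in values1 for b in values2),
        default=0.0,
    )


print(value_similarity("folk singer", "pop singer"))           # 0.5
print(node_similarity(["actor", "folk singer"], ["pop singer"]))  # 0.5
```

Taking the maximum over value pairs follows the formula above; only the per-pair measure is an assumption.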
(3) Measuring the overall similarity between person instances
A pairing between $P_1$ and $P_2$ is written $M = (P_1, P_2)$. Since the degree of matching between person instances is measured mainly by the matching results at the concept level and the attribute-value level, the pairing can be further expressed as:
$$M = (P_1, P_2) = \{(c_{1i}, c_{2j}) \mid i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n\} \cup \{(i_{1p}, i_{2q}) \mid p = 1, 2, \ldots, s;\ q = 1, 2, \ldots, t\}.$$
Moreover, the degree of matching of all entity elements, including concepts, attributes and attribute values, is essentially determined by the similarity between the elements. The overall similarity between two person instances can therefore be expressed as:
$$Sim_p(P_1, P_2) = W_c \cdot Sim_c(P_1, P_2) + (1 - W_c) \cdot Sim_i(P_1, P_2)$$
Because person ontology instances are represented as trees, two trees are judged to be similar if the similarity between them exceeds a preset threshold. Combining this with the person instance similarity measure, we propose a novel top-down person ontology instance tree matching algorithm. The following table compares the disambiguation results for individual names:
Table 1: Comparison of disambiguation results for individual names
Step S107: ranking by similarity according to the above results.
Step S108: linking the name to the person instance ranked highest by similarity.
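Steps S107 and S108 amount to sorting candidates by their overall similarity and linking to the top one. The Python sketch below is illustrative only: the function `link_name`, the default threshold of 0.5 and the candidate scores are all assumptions, since the source says only that a preset threshold is used and does not give its value.

```python
def link_name(candidate_scores, threshold=0.5):
    """Rank candidates by similarity and return (best candidate or None, ranking)."""
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    # Link only when the best score exceeds the (assumed) preset threshold.
    if not ranked or ranked[0][1] <= threshold:
        return None, ranked
    return ranked[0][0], ranked


# Hypothetical overall similarity scores for two candidate entities.
scores = {"Yao Ming (basketball player)": 0.31, "Yao Ming (composer)": 0.82}
best, ranking = link_name(scores)
print(best)
```

Returning `None` when no candidate clears the threshold leaves room for treating the mention as a person not yet in the instance library.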
Fig. 2 is a schematic diagram of the disambiguation process of the ontology-based Chinese personal name disambiguation method according to the first embodiment of the present invention. As shown in Fig. 2, the left side is the instance-building library: attributes are first extracted from web data sources according to the defined attributes and instantiated as person ontology instances, building a large-scale person instance library. The right side is the matching part: when a name appears in a document, the attribute information related to the name is extracted and a person instance is built and matched against the instances in the library. Matching proceeds top-down, mainly from the concept layer to the attribute layer; similarities are computed, and the best-matching person instance is obtained by ranking the similarity values.
Fig. 3 is a schematic diagram of the person matching stage of the ontology-based Chinese personal name disambiguation method according to the first embodiment of the present invention. As shown in Fig. 3, the present embodiment takes Li Chen as an example: the left side is the hierarchy tree built for the person in the instance library, and the right side is the attribute information captured from the person information appearing in the document. Applying the top-down person ontology instance tree matching algorithm together with the person instance similarity measure, the calculation shows that the two person ontology instance trees each have 10 matched concept nodes, 9 of which are common, so the similarity at the concept level is 9/min(10, 10) = 0.9. Let the weight W_c of the concept-level similarity be 0.2, the weight w_v of the Chinese name attribute be 0.7, and the weight w_v of attributes such as occupation and representative works be 0.3. According to the similarity measure, the similarity over all instance values is (0.7*1 + 0.3*1 + 0.3*0)/(0.7*1 + 0.3*1 + 0.3*1) = 0.769, and the total similarity of the person instances is 0.9*0.2 + 0.8*0.769 = 0.795. This indicates that the two person instances are very similar and may be the same person entity.
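The arithmetic of this worked example can be reproduced in a few lines. In the Python sketch below, the weighted-average form of `instance_similarity` is inferred from the numbers quoted above (weights times per-attribute similarities, divided by the sum of the weights); it is an assumption, not a formula stated elsewhere in the text, and both function names are hypothetical.

```python
def instance_similarity(weighted_matches):
    """Weighted average of per-attribute similarities; items are (weight, similarity)."""
    numerator = sum(weight * sim for weight, sim in weighted_matches)
    denominator = sum(weight for weight, _ in weighted_matches)
    return numerator / denominator if denominator else 0.0


def overall_similarity(sim_c, sim_i, w_c):
    """Sim_p = W_c * Sim_c + (1 - W_c) * Sim_i."""
    return w_c * sim_c + (1 - w_c) * sim_i


sim_c = 9 / min(10, 10)  # 9 common concept nodes out of 10 on each side
sim_i = instance_similarity([(0.7, 1), (0.3, 1), (0.3, 0)])
total = overall_similarity(sim_c, sim_i, w_c=0.2)
print(round(sim_c, 3), round(sim_i, 3), round(total, 3))  # 0.9 0.769 0.795
```

With W_c = 0.2 the attribute-value level dominates the result, so a mismatch on a heavily weighted attribute such as the Chinese name would lower the total far more than a missing concept node.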
In summary, a kind of Chinese personal name disambiguation method based on body of the present invention, by personage's body is built, in conjunction with top-down matching algorithm, constructs name disambiguating system.By document name attribute information is extracted, build personage's hierarchical tree, successively coupling, calculates similarity, finally determines this name.Experimental result for the name text set in CLP2012 evaluation and test meeting and the Chinese web page data set in network shows, the method based on personage's body can effectively solve the entity link problems of Chinese personal name.Simultaneously statistics according to investigations, the name entry of having included in encyclopaedia has at present reached 470,000 more than, and therefore, for the ambiguity name in current web page, the method is to adapt to and effectively, preferably resolves the problem of name mistake, has improved recognition effect.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the present invention.

Claims (9)

1. An ontology-based Chinese personal name disambiguation method, characterized in that it comprises the following steps:
Defining person attributes: defining the concepts, attributes, and relations involved in the person ontology;
Defining the concepts of the person ontology and their structure: creating the top-level class "entity", then adding the two major subclasses "abstract" and "material" beneath it;
Defining the attributes of the person ontology, the attributes comprising two parts: data attributes and object attributes;
Extracting person attributes;
Name instantiation: creating corresponding instances for all concepts in the person ontology, mainly by assigning values to the attributes associated with the concepts in the ontology;
Person ontology instance tree matching: measuring the overall similarity between person instances by measuring the similarity between person instances at the concept level of the ontology and the similarity between person instances at the attribute value level of the ontology;
Ranking by similarity; and
Linking the name to the most similar person instance.
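The last two steps of claim 1 (ranking by similarity and linking to the most similar instance) can be illustrated as follows. This is a sketch under stated assumptions: the candidate identifiers, the `link_name` helper, and the optional `threshold` parameter are hypothetical, and the scores are assumed to have been computed by the matching step.

```python
from typing import Optional


def link_name(scores: dict[str, float], threshold: float = 0.0) -> Optional[str]:
    """Return the id of the most similar candidate instance.

    `scores` maps a candidate person-instance id to its overall similarity.
    Returns None when there is no candidate or the best score is below
    `threshold` (both the ids and the threshold are illustrative).
    """
    if not scores:
        return None
    # Rank candidates from most to least similar.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_id, best_score = ranked[0]
    return best_id if best_score >= threshold else None
```

For example, `link_name({"candidate_a": 0.795, "candidate_b": 0.31})` returns `"candidate_a"`.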
2. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: the person attributes are the set of features that a person has, comprising person name attributes, basic person attributes, person introduction attributes, and person social attributes.
3. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: the person concept entity, which represents the person itself, is defined beneath the material class;
an attribute class is built beneath the abstract class, and beneath it, at the intermediate level, six major concept classes are added: person name, basic attributes, introductory information, contact information, value class, and personal relationships, so that the person ontology is organized into a tree structure with hypernym-hyponym relations.
4. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: extracting person attributes comprises attribute extraction from semi-structured text, in which basic person information is extracted from the encyclopedia name cards in web pages and converted into a custom, structured Extensible Markup Language (XML) page. This mainly combines the Hypertext Markup Language (HTML) structure with semi-structured text extraction: the encyclopedia pages corresponding to a name are collected, the source code is parsed, the information blocks to be extracted are determined, the features of the information blocks and the HTML feature tags are analyzed, and extraction rules for the information items are summarized for use in the subsequent large-scale extraction of information from encyclopedia pages.
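The conversion from an HTML name card to a structured XML page might be sketched as below. The card layout (label in `<dt>`, value in `<dd>`), the class name `InfoCardParser`, and the output element names `person` and `attribute` are all assumptions made for illustration; the claim does not specify the page structure or the XML schema.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class InfoCardParser(HTMLParser):
    """Collect (label, value) pairs from <dt>/<dd> elements (assumed layout)."""

    def __init__(self) -> None:
        super().__init__()
        self.pairs: list[tuple[str, str]] = []
        self._tag = None   # tag currently being read: "dt", "dd", or None
        self._label = ""
        self._text = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            self._text = ""

    def handle_data(self, data):
        if self._tag is not None:
            self._text += data

    def handle_endtag(self, tag):
        if tag != self._tag:
            return  # ignore nested tags such as links inside a value
        if tag == "dt":
            self._label = self._text.strip()
        else:
            self.pairs.append((self._label, self._text.strip()))
        self._tag = None


def card_to_xml(html: str) -> str:
    """Convert an HTML name card into a small XML document (hypothetical schema)."""
    parser = InfoCardParser()
    parser.feed(html)
    parser.close()
    root = ET.Element("person")
    for label, value in parser.pairs:
        ET.SubElement(root, "attribute", name=label).text = value
    return ET.tostring(root, encoding="unicode")
```

Once the extraction rule for a page layout is fixed in this way, the same function can be applied to a large batch of pages that share the layout.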
5. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: extracting person attributes comprises attribute extraction from unstructured text, in which information related to the person is described by an unstructured profile.
6. The ontology-based Chinese personal name disambiguation method as claimed in claim 5, characterized in that: the extraction rule for each attribute is defined from three aspects: the trigger words preceding and following the attribute information, the intrinsic features of the attribute information, and the left and right boundaries of the attribute information.
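A rule built from these three aspects could look like the following sketch, which uses a regular expression. The helper names, the English trigger phrase, and the birth-year rule are hypothetical examples; for brevity only a preceding trigger word is used, whereas the claim also allows a following one.

```python
import re
from typing import Optional


def build_rule(left_trigger: str, value_pattern: str, right_boundary: str) -> re.Pattern:
    """Compose a rule from a preceding trigger word, the value's own
    pattern (its intrinsic feature), and a right boundary."""
    return re.compile(
        re.escape(left_trigger)
        + r"\s*(?P<value>" + value_pattern + r")"
        + r"(?=" + right_boundary + r")"
    )


def extract(rule: re.Pattern, text: str) -> Optional[str]:
    """Return the attribute value matched by the rule, or None."""
    match = rule.search(text)
    return match.group("value") if match else None


# Hypothetical rule for a birth year: trigger "born in", a four-digit
# value, ended by whitespace, punctuation, or the end of the text.
BIRTH_YEAR_RULE = build_rule("born in", r"\d{4}", r"[\s,.;]|$")
```

With this rule, `extract(BIRTH_YEAR_RULE, "The person was born in 1978, in a city.")` returns `"1978"`, while a five-digit number after the trigger is rejected by the boundary check.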
7. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: the similarity between person instances at the concept level of the ontology is calculated as follows:
$$Sim_c(P_1, P_2) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} sim(c_{1i}, c_{2j})}{\min(m, n)},$$
where $c_{1i}$ and $c_{2j}$ denote any concept node in the sets $C_1$ and $C_2$, respectively; $sim(c_{1i}, c_{2j})$ denotes the similarity between the concept nodes; and $Sim_c(P_1, P_2)$ denotes the similarity of the two person instances $P_1$ and $P_2$ at the concept level of the ontology.
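The concept-level formula can be sketched in code as follows. The node-level similarity is not defined in the claim, so the `exact_match` function (1 for identical concept labels, 0 otherwise) is an assumption chosen for illustration, as are the function and parameter names.

```python
from typing import Callable, Iterable


def exact_match(a: str, b: str) -> float:
    """Assumed node-level similarity: 1 for identical concepts, else 0."""
    return 1.0 if a == b else 0.0


def concept_level_similarity(
    c1: Iterable[str],
    c2: Iterable[str],
    sim: Callable[[str, str], float] = exact_match,
) -> float:
    """Sum the pairwise node similarities and divide by min(m, n)."""
    nodes1, nodes2 = list(c1), list(c2)
    if not nodes1 or not nodes2:
        return 0.0  # no concept nodes to compare
    total = sum(sim(a, b) for a in nodes1 for b in nodes2)
    return total / min(len(nodes1), len(nodes2))
```

With two trees of 10 concept nodes each that share 9 nodes, this returns 0.9 under the exact-match assumption.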
8. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: the similarity between person instances at the attribute value level of the ontology is calculated as follows:
$$sim(i_{1p}, i_{2q}) = \max_{p = 1..s;\; q = 1..t} sim(v_{1p}^{k}, v_{2q}^{l})$$
where $i_{1p}$ and $i_{2q}$ denote any attribute value node in the sets $I_1$ and $I_2$, respectively; $v_{1p}^{k}$ and $v_{2q}^{l}$ are each a value in these two attribute value nodes; $w_v$ is the weight assigned to an attribute value; and $SV_{1p}$ and $SV_{2q}$ denote the sets of words contained in the attribute values $V_{1p}$ and $V_{2q}$, respectively.
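The maximum over value pairs can be illustrated with the sketch below. The value-level similarity is not recoverable from the claim text, so a Jaccard overlap of word sets is swapped in as an assumed stand-in (motivated only by the mention of word sets), and whitespace splitting stands in for Chinese word segmentation; the function names are hypothetical.

```python
def word_set_similarity(v1: str, v2: str) -> float:
    """Jaccard overlap of the word sets of two values (assumed stand-in)."""
    s1, s2 = set(v1.split()), set(v2.split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)


def node_similarity(values1: list[str], values2: list[str]) -> float:
    """Best similarity over every pair of values drawn from two
    attribute value nodes; 0.0 if either node has no values."""
    return max(
        (word_set_similarity(a, b) for a in values1 for b in values2),
        default=0.0,
    )
```

Taking the maximum means that two attribute nodes are treated as matching when any one of their values matches well, which suits multi-valued attributes.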
9. The ontology-based Chinese personal name disambiguation method as claimed in claim 1, characterized in that: measuring the overall similarity between person instances comprises letting $M = (P_1, P_2)$ be a pairing between $P_1$ and $P_2$, the final overall similarity between the two person instances being calculated as follows:
$$Sim_p(P_1, P_2) = W_c \cdot Sim_c(P_1, P_2) + (1 - W_c) \cdot Sim_i(P_1, P_2)$$
If the similarity between the two trees exceeds a preset threshold, the two trees are judged to be similar.
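The weighted combination and the threshold test can be sketched as follows. The function names and the threshold values used in the example are assumptions; the claim does not state a particular threshold.

```python
def overall_similarity(sim_c: float, sim_i: float, w_c: float) -> float:
    """Weighted sum of concept-level and attribute-value-level similarity."""
    if not 0.0 <= w_c <= 1.0:
        raise ValueError("w_c must lie in [0, 1]")
    return w_c * sim_c + (1.0 - w_c) * sim_i


def trees_similar(sim_c: float, sim_i: float, w_c: float, threshold: float) -> bool:
    """Judge two instance trees similar when the overall similarity
    exceeds the preset threshold (the threshold value is an assumption)."""
    return overall_similarity(sim_c, sim_i, w_c) > threshold
```

For instance, `trees_similar(0.9, 0.769, 0.2, threshold=0.7)` is true, while the same scores with a hypothetical threshold of 0.8 would not qualify.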
CN201310202444.9A 2013-05-27 2013-05-27 Ontology-based Chinese name disambiguation method Pending CN104182420A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310202444.9A CN104182420A (en) 2013-05-27 2013-05-27 Ontology-based Chinese name disambiguation method

Publications (1)

Publication Number Publication Date
CN104182420A true CN104182420A (en) 2014-12-03

Family

ID=51963471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310202444.9A Pending CN104182420A (en) 2013-05-27 2013-05-27 Ontology-based Chinese name disambiguation method

Country Status (1)

Country Link
CN (1) CN104182420A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101763395A (en) * 2009-12-31 2010-06-30 浙江大学 Method for automatically generating webpage by adopting artificial intelligence technology

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
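Zhou Xiao et al.: "Chinese Personal Name Disambiguation Based on Mutually Exclusive Person Attributes" (基于人物互斥属性的中文人名消歧), Proceedings of the 6th National Conference on Information Retrieval (第六届全国信息检索学术会议) *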

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105718433A (en) * 2014-12-05 2016-06-29 富士通株式会社 Table semantic device and method
CN105718433B (en) * 2014-12-05 2019-01-22 富士通株式会社 Table semantization device and method
CN105045826A (en) * 2015-06-29 2015-11-11 华东师范大学 Entity linkage algorithm based on graph model
WO2017168358A1 (en) * 2016-03-30 2017-10-05 International Business Machines Corporation Data processing
US10585893B2 (en) 2016-03-30 2020-03-10 International Business Machines Corporation Data processing
US11188537B2 (en) 2016-03-30 2021-11-30 International Business Machines Corporation Data processing
CN107402933A (en) * 2016-05-20 2017-11-28 富士通株式会社 Entity polyphone disambiguation method and entity polyphone disambiguation equipment
CN106294677B (en) * 2016-08-04 2019-08-16 浙江大学 A kind of name disambiguation method towards author Chinese in english literature
CN106294677A (en) * 2016-08-04 2017-01-04 浙江大学 A kind of towards the name disambiguation method of China author in english literature
CN106651137A (en) * 2016-11-18 2017-05-10 武汉胜成网络科技有限公司 Project quantity coding and standardization method and system
CN108255846A (en) * 2016-12-29 2018-07-06 北京赛时科技有限公司 A kind of method and apparatus for distinguishing author of the same name
CN107341194B (en) * 2017-06-14 2019-04-16 北京金堤科技有限公司 A kind of enterprise's duplication of name people differentiating method and device
CN107341194A (en) * 2017-06-14 2017-11-10 北京金堤科技有限公司 A kind of enterprise's duplication of name people differentiating method and device
CN109271621A (en) * 2017-07-18 2019-01-25 腾讯科技(北京)有限公司 Semanteme disambiguates processing method, device and its equipment
CN109271621B (en) * 2017-07-18 2023-04-18 腾讯科技(北京)有限公司 Semantic disambiguation processing method, device and equipment
CN107908749A (en) * 2017-11-17 2018-04-13 哈尔滨工业大学(威海) A kind of personage's searching system and method based on search engine
CN107908749B (en) * 2017-11-17 2020-04-10 哈尔滨工业大学(威海) Character retrieval system and method based on search engine
CN109635297A (en) * 2018-12-11 2019-04-16 湖南星汉数智科技有限公司 A kind of entity disambiguation method, device, computer installation and computer storage medium
CN109635297B (en) * 2018-12-11 2022-01-04 湖南星汉数智科技有限公司 Entity disambiguation method and device, computer device and computer storage medium
CN110457680A (en) * 2019-07-02 2019-11-15 平安科技(深圳)有限公司 Entity disambiguation method, device, computer equipment and storage medium
CN112825112A (en) * 2019-11-20 2021-05-21 阿里巴巴集团控股有限公司 Data processing method and device and computer terminal
CN112825112B (en) * 2019-11-20 2024-05-31 阿里巴巴集团控股有限公司 Data processing method and device and computer terminal
CN111008285B (en) * 2019-11-29 2021-04-13 中科院计算技术研究所大数据研究院 Author disambiguation method based on thesis key attribute network
CN111008285A (en) * 2019-11-29 2020-04-14 中科院计算技术研究所大数据研究院 Author disambiguation method based on thesis key attribute network
CN111428482A (en) * 2020-03-26 2020-07-17 北京明略软件***有限公司 Information identification method and device
CN111428482B (en) * 2020-03-26 2023-11-24 北京明略软件***有限公司 Information identification method and device
CN112835852A (en) * 2021-04-20 2021-05-25 中译语通科技股份有限公司 Character duplicate name disambiguation method, system and equipment for improving filing-by-filing efficiency
CN112835852B (en) * 2021-04-20 2021-08-17 中译语通科技股份有限公司 Character duplicate name disambiguation method, system and equipment for improving filing-by-filing efficiency

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20141203