CN108520065B - Method, system, equipment and storage medium for constructing named entity recognition corpus - Google Patents

Method, system, equipment and storage medium for constructing named entity recognition corpus

Info

Publication number
CN108520065B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810325492.XA
Other languages
Chinese (zh)
Other versions
CN108520065A (en
Inventor
钱龙华
何云琪
李雁群
王红玲
周国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Application filed by Suzhou University
Priority to CN201810325492.XA
Publication of CN108520065A
Application granted
Publication of CN108520065B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295: Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a computer-based method for constructing a Chinese named entity recognition corpus that uses Chinese Wikipedia as its source. By extracting features of Chinese Wikipedia entries, the method classifies the entries, determines which of them are Chinese wiki entity entries, and predicts the type of the named entity corresponding to each Chinese wiki entity entry. Based on the types and the redirection information, it then constructs a Chinese wiki entity list containing the named entities, and all named entities in that list form the Chinese named entity recognition corpus. The resulting corpus has rich content and wide field coverage. Moreover, because the construction is performed automatically by computer, it saves manpower and material resources. The invention also discloses a system and a device for constructing the Chinese named entity recognition corpus and a computer-readable storage medium, which have the same effects.

Description

Method, system, equipment and storage medium for constructing named entity recognition corpus
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method, a system, equipment and a storage medium for constructing a named entity recognition corpus.
Background
The purpose of information extraction is to extract entities and their interrelations from unstructured free text and convert them into structured expressions, thereby providing a data base for the construction of knowledge bases.
In the prior art, Chinese named entity recognition research mainly uses high-quality manually labeled corpora, such as the People's Daily corpus of January 1998, the Microsoft Research Asia (MSRA) corpus, the CityU corpus of the City University of Hong Kong, and the ACE2005 Chinese corpus. These corpora differ in their named entity categories, labeling rules, and scale. To ensure quality, they are labeled by professionals, which limits their scale and field coverage and consumes a large amount of manpower and material resources. For example, the People's Daily corpus of January 1998 belongs to the news field; its content is dated, and accuracy is low when it is applied to fields other than news.
Therefore, how to automatically construct a Chinese named entity recognition corpus with the advantages of rich content, wide application field and the like is a technical problem that needs to be solved by the technical personnel in the field at present.
Disclosure of Invention
The invention aims to provide a method, a system, equipment and a storage medium for constructing a named entity recognition corpus, which can automatically construct a Chinese named entity recognition corpus with the advantages of rich content, wide application field and the like.
In order to solve the technical problem, the invention provides a method for constructing a Chinese named entity recognition corpus, which is based on a computer and comprises the following steps:
extracting the characteristics of the Chinese wiki encyclopedia entries, and predicting the type of the named entity corresponding to the Chinese wiki entity entries according to the characteristics;
based on the type and the redirection information of the Chinese wiki entity entry, constructing a Chinese wiki entity list containing the named entity to form a corpus;
wherein the Chinese wiki entity entry is a Chinese wiki encyclopedia entry containing the named entity.
Preferably, extracting the features of the Chinese Wikipedia entries specifically comprises:
extracting the features from the infobox, the category box, and the abstract of the Chinese wiki entity entry.
Preferably, after constructing the Chinese wiki entity list containing the named entities, the method further includes:
identifying internal entities in the Chinese wiki entity list and generating nested named entities containing the internal entities;
adding the nested named entities into a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list meets a nested relation;
and removing the label of the first internal entity which meets the nesting relation, and deleting the nested named entity containing the second internal entity which does not meet the nesting relation.
Preferably, identifying the internal entities in the Chinese wiki entity list specifically includes:
determining the named entities without type ambiguity in the Chinese wiki entity list;
and, taking the Chinese wiki entity list as a dictionary, identifying the internal entities contained in the named entities without type ambiguity by using the longest matching principle.
Preferably, the determining whether each internal entity satisfies the nesting relationship specifically includes:
judging whether the Chinese wiki entity entry pointed by the internal entity and the Chinese wiki entity entry pointed by the external entity corresponding to the internal entity have intersection;
if so, determining the internal entity as a first internal entity meeting the nesting relationship;
and if not, determining the internal entity as a second internal entity which does not meet the nesting relationship.
Preferably, after the removing the label of the first internal entity satisfying the nested relationship and deleting the nested named entity containing the second internal entity not satisfying the nested relationship, the method further comprises:
judging whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity;
if so, aggregating the nested structure of the first external entity into the second external entity.
In order to solve the above technical problem, the present invention further provides a computer-based system for constructing a Chinese named entity recognition corpus, comprising:
the prediction module is used for extracting the characteristics of the Chinese wiki encyclopedia entries and predicting the type of the named entity corresponding to the Chinese wiki entity entry according to the characteristics;
the construction module is used for constructing a Chinese wiki entity list containing the named entities to form a corpus based on the type and the redirection information of the Chinese wiki entity entry;
wherein the Chinese wiki entity entry is a Chinese wiki encyclopedia entry containing the named entity.
In order to solve the above technical problem, the present invention further provides a device for constructing a Chinese named entity recognition corpus, comprising:
a memory for storing a construction program;
a processor for implementing the steps of any one of the above-described methods for constructing a Chinese named entity recognition corpus when executing the construction program.
In order to solve the above technical problem, the present invention further provides a computer-readable storage medium on which a construction program is stored, and the construction program, when executed by a processor, implements the steps of any one of the methods for constructing a Chinese named entity recognition corpus as described above.
Compared with the prior art, the method for constructing the Chinese named entity recognition corpus provided by the invention is computer-based and uses Chinese Wikipedia as its source. By extracting features of Chinese Wikipedia entries, it classifies the entries, determines which of them are Chinese wiki entity entries, and predicts the type of the named entity corresponding to each Chinese wiki entity entry. Based on the types and the redirection information, it then constructs a Chinese wiki entity list containing the named entities, and all named entities in that list form the Chinese named entity recognition corpus. Because Wikipedia is a free-content, openly editable, multilingual online encyclopedia collaboration project that covers a large number of named entities across many fields, the corpus constructed by this method likewise has rich content and wide field coverage. Moreover, the construction method is computer-based: a corresponding computer program automatically extracts the features of the Chinese Wikipedia entries, predicts the named entity types, and constructs the Chinese wiki entity list to form the corpus, which saves a large amount of manpower and material resources. Therefore, the construction method can automatically construct a Chinese named entity recognition corpus with rich content and a wide field of application. In addition, the invention also provides a system and a device for constructing the Chinese named entity recognition corpus and a computer-readable storage medium, which have the same effects.
Drawings
In order to illustrate the embodiments of the present invention more clearly, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a method for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention;
FIG. 2 is a flowchart of another method for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention;
FIG. 3 is a flowchart of another method for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a system for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention;
FIG. 5 is a schematic composition diagram of a device for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without any inventive step, are within the scope of the present invention.
The invention aims to provide a method, a system, equipment and a storage medium for constructing a named entity recognition corpus, which can automatically construct a Chinese named entity recognition corpus with the advantages of rich content, wide application field and the like.
In order to make the technical solutions of the present invention better understood, the present invention will be further described in detail with reference to the accompanying drawings and specific embodiments.
FIG. 1 is a flowchart of a method for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention. The method provided by this embodiment is computer-based and, as shown in FIG. 1, includes:
S10: extracting the features of the Chinese Wikipedia entries, and predicting the type of the named entity corresponding to each Chinese wiki entity entry according to the features.
Here, Chinese Wikipedia entries are the entries presented in Chinese in Wikipedia; a Chinese wiki entity entry is a Chinese Wikipedia entry corresponding to a named entity; a named entity refers to an object in the real world and typically consists of one or more consecutive words in text; the type of a named entity can be person (nr), place name (ns), organization (nt), etc. For example, "[Beijing]ns" indicates that Beijing is a place name entity.
Wikipedia is a free-content, openly editable, multilingual online encyclopedia collaboration project. It covers a large number of named entities and has rich content and wide field coverage. Its content is presented in the form of entries, and each entry has a corresponding Wikipedia page. Besides the article text, a Wikipedia page contains rich structured, semi-structured, and unstructured information, such as templates, infoboxes, and page categories, and this information is of great value to research in natural language processing.
The features of the Chinese Wikipedia entries include effective features mined from Wikipedia, as well as extended features and word-sense features added in light of the characteristics of the Chinese language. Their purpose is to classify the Chinese Wikipedia entries and to predict the type of the named entity corresponding to each Chinese wiki entity entry. As a preferred embodiment, extracting the features of the Chinese Wikipedia entries specifically comprises: extracting features from the infobox, the category box, and the abstract of the Chinese wiki entity entry.
For example, suppose a Chinese wiki entity entry corresponds to the named entity "Ma Yun". The infobox of the entry contains summary information about the subject, such as "born September 10, 1964", "residence: China", "nationality: People's Republic of China", "alma mater: Hangzhou Normal University", and "occupation: chairman of the board of Alibaba Group", so the field names "name", "birth", "residence", "alma mater", and "occupation" can be extracted as bag-of-words features. The category box of the entry contains categories such as "1964 births", "living people", "billionaires of the People's Republic of China", and "Alibaba Group", so the head words of the categories, "births", "people", and "billionaires", can be extracted as features. The abstract of the entry begins with the definition "Ma Yun (English name Jack Ma, born September 10, 1964) is an entrepreneur of the People's Republic of China", so the head word "entrepreneur" can be extracted as a feature.
In a specific implementation, some Chinese wiki entity entries already labeled with named entity types can be selected in advance as training data, and an existing classifier can be trained on them to obtain a classification model. For step S10, a plurality of features may be extracted from the infobox, the category box, and the abstract of the Chinese wiki entity entry and combined into a feature vector for the named entity. The existing classifier and the pre-trained classification model are then applied to the feature vector to predict the type of the corresponding named entity.
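The feature extraction and type prediction of step S10 can be sketched roughly as follows. This is only an illustration under stated assumptions: the text says "an existing classifier" without naming one, so a tiny multinomial naive Bayes is swapped in here, and the dictionary keys (`infobox_fields`, `category_heads`, `abstract_head`), the training entries, and the function names are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_features(entry):
    """Turn one entry into bag-of-words features (hypothetical input layout):
    infobox field names, category head words, and the abstract head word."""
    feats = [f"infobox={k}" for k in entry["infobox_fields"]]
    feats += [f"category={w}" for w in entry["category_heads"]]
    feats.append(f"abstract={entry['abstract_head']}")
    return feats


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing.
    Stands in for the unspecified "existing classifier"."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()
        for feats, label in zip(samples, labels):
            self.feat_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, feats):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical labeled entries used as training data.
train = [
    ({"infobox_fields": ["name", "birth", "occupation"],
      "category_heads": ["births", "people"],
      "abstract_head": "entrepreneur"}, "PER"),
    ({"infobox_fields": ["name", "founded", "headquarters"],
      "category_heads": ["companies"],
      "abstract_head": "company"}, "ORG"),
]
model = NaiveBayes().fit([extract_features(e) for e, _ in train],
                         [label for _, label in train])

# An unlabeled entry resembling the "Ma Yun" example in the text.
query = {"infobox_fields": ["name", "birth", "alma mater"],
         "category_heads": ["births", "billionaires"],
         "abstract_head": "entrepreneur"}
predicted = model.predict(extract_features(query))
print(predicted)
```

Prefixing each feature with its source (`infobox=`, `category=`, `abstract=`) keeps the same word from different page regions distinct; whether the patented method does this is not stated.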
S11: based on the type and the redirection information of the Chinese wiki entity entry, a Chinese wiki entity list containing the named entities is constructed to form a corpus.
A named entity may have multiple names, including a standard name and aliases. The redirection information of a Chinese wiki entity entry makes it possible to determine which different names denote the same named entity, and the type of the named entity further helps to identify it. Because a named entity can be determined by its standard name and type, the Chinese wiki entity list constructed from the types and the redirection information of the Chinese wiki entity entries should include at least both the name and the type of each named entity. It can be appreciated that, in the Chinese wiki entity list, the names and types of the same named entity correspond to each other. Finally, all named entities written into the Chinese wiki entity list form the corpus.
Named entities can be divided, according to whether the entity name is ambiguous, into named entities with name ambiguity and named entities without name ambiguity. Named entities with name ambiguity can be further divided, according to whether they have type ambiguity, into named entities with type ambiguity and named entities without type ambiguity. Name ambiguity means that the same name points to two or more Chinese wiki entity entries; type ambiguity means that a name with name ambiguity corresponds to two or more types.
For example, if the name of a named entity is "Wanlong" and the corresponding Chinese wiki entity entry has "22735" and "5044266", the named entity has name ambiguity, and if the type of the named entity in the Chinese wiki entity entry "22735" is "ORG" and the type of the named entity in the Chinese wiki entity entry "5044266" is "PER", the named entity has type ambiguity.
Both the name and the type of each named entity are written into the Chinese wiki entity list. For this purpose, named entities can be divided into two categories: named entities without type ambiguity, which include named entities without name ambiguity and named entities with name ambiguity but without type ambiguity; and named entities with type ambiguity. In a specific implementation, for named entities without name ambiguity and named entities with name ambiguity but without type ambiguity, the name of the named entity is added to the Chinese wiki entity list together with the unique type corresponding to that name; for named entities with type ambiguity, the name is added to the Chinese wiki entity list together with all of the different types corresponding to that name. It will be appreciated that a named entity with type ambiguity necessarily also has name ambiguity.
For example, if the name of the named entity is "clisden/nyagte", the corresponding Chinese wiki entity entry is "125", and the type is "PER", then "clisden/nyagte PER" is added directly to the Chinese wiki entity list. If the name of the named entity is "Wanlong" and the corresponding Chinese wiki entity entries are "22735" and "5044266", where the type of the named entity in entry "22735" is "ORG" and the type in entry "5044266" is "PER", then "Wanlong PER, ORG" needs to be added to the Chinese wiki entity list.
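A minimal sketch of step S11, assuming the predicted types and the redirection data are already available as plain Python structures. The shapes of `entries` and `redirects`, the alias `alias-of-125`, and the function name are hypothetical; the page IDs and types are the ones used in the example above.

```python
from collections import defaultdict


def build_entity_list(entries, redirects):
    """entries:   page_id -> (standard_name, predicted_type)
    redirects: iterable of (alias, page_id) pairs (assumed layout)
    Returns name -> {"types", "name_ambiguous", "type_ambiguous"}."""
    name_to_pages = defaultdict(set)
    for page_id, (name, _type) in entries.items():
        name_to_pages[name].add(page_id)
    # Aliases from redirects point at the same entry as the standard name.
    for alias, page_id in redirects:
        if page_id in entries:
            name_to_pages[alias].add(page_id)

    entity_list = {}
    for name, pages in name_to_pages.items():
        types = sorted({entries[p][1] for p in pages})
        entity_list[name] = {
            "types": types,
            "name_ambiguous": len(pages) > 1,   # points to 2+ entries
            "type_ambiguous": len(types) > 1,   # 2+ distinct types
        }
    return entity_list


entries = {
    "125": ("clisden/nyagte", "PER"),
    "22735": ("Wanlong", "ORG"),
    "5044266": ("Wanlong", "PER"),
}
redirects = [("alias-of-125", "125")]  # placeholder alias, not from the source

entity_list = build_entity_list(entries, redirects)
for name, info in entity_list.items():
    print(name, ", ".join(info["types"]))
```

The types are sorted only to make the output deterministic, so "Wanlong" is printed with "ORG, PER" rather than in the order shown in the text; the order carries no meaning in this sketch.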
In summary, the method for constructing a Chinese named entity recognition corpus provided in the embodiments of the present invention is computer-based and uses Chinese Wikipedia as its source. By extracting features of Chinese Wikipedia entries, it classifies the entries, determines which of them are Chinese wiki entity entries, and predicts the type of the named entity corresponding to each Chinese wiki entity entry. Based on the types and the redirection information, it then constructs a Chinese wiki entity list containing the named entities, and all named entities in that list form the Chinese named entity recognition corpus. Because Wikipedia is a free-content, openly editable, multilingual online encyclopedia collaboration project that covers a large number of named entities across many fields, the corpus constructed by this method likewise has rich content and wide field coverage. Moreover, the construction method is computer-based: a corresponding computer program automatically extracts the features of the Chinese Wikipedia entries, predicts the named entity types, and constructs the Chinese wiki entity list to form the corpus, which saves a large amount of manpower and material resources. Therefore, the construction method can automatically construct a Chinese named entity recognition corpus with rich content and a wide field of application.
The named entities in the Chinese wiki entity list described above include both nested and non-nested named entities without distinguishing between them, so the corpus formed from that list does not distinguish between them either. In practice, nested named entities contain rich entity information and relations between entities, and their structures are complex and variable, so the recognition of nested named entities is one of the tasks worth studying in information extraction.
FIG. 2 is a flowchart of another method for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention. As shown in FIG. 2, based on the above embodiment, as a preferred implementation, after constructing the Chinese wiki entity list containing the named entities, the method further includes:
S20: identifying internal entities in the Chinese wiki entity list, and generating nested named entities containing the internal entities.
S21: and adding the nested named entity into the Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list meets the nested relation.
S22: and removing the label of the first internal entity which meets the nesting relation, and deleting the nested named entity containing the second internal entity which does not meet the nesting relation.
It should be noted that a nested named entity is a named entity that has one or more named entities nested inside it; an internal entity is a named entity nested inside a nested named entity; an external entity is the outermost named entity of a nested named entity; a first internal entity is an internal entity in the Chinese nested entity list that satisfies the nesting relationship, and a second internal entity is an internal entity in the Chinese nested entity list that does not satisfy the nesting relationship.
In this preferred embodiment, after the internal entities in the Chinese wiki entity list are identified, nested named entities containing the internal entities are generated and added to the Chinese nested entity list. Each internal entity in the Chinese nested entity list is then checked against the nesting relationship, which determines which internal entities satisfy it and which do not. Finally, the nested named entities satisfying the nesting relationship are retained, and those not satisfying it are deleted. For a first internal entity, which satisfies the nesting relationship, the label of the first internal entity is removed directly. For a second internal entity, which does not satisfy the nesting relationship, it cannot be determined whether the nested named entity containing it is a true nested named entity, probably because of a missing link in Wikipedia. In that case, the nested named entity containing the second internal entity needs to be deleted from the Chinese wiki entity list. Finally, a Chinese nested named entity recognition corpus can be constructed from the nested named entities in the Chinese nested entity list.
Based on the above embodiment, as a preferred implementation, identifying the internal entities in the Chinese wiki entity list specifically includes:
determining the named entities without type ambiguity in the Chinese wiki entity list;
taking the Chinese wiki entity list as a dictionary, and identifying the internal entities contained in the named entities without type ambiguity by using the longest matching principle.
In this preferred embodiment, the longest matching principle is used to identify the internal entities contained in named entities without type ambiguity, which can improve identification accuracy. Specifically, the longest matching principle may be applied from left to right to identify the internal entities contained in a named entity without type ambiguity.
For example, if the named entity is "[Shanghai Jiao Tong University Xuhui Campus]ns" and the dictionary contains the two named entities "[Shanghai Jiao Tong University]nt" and "[Xuhui]ns", the nested named entity "[[Shanghai Jiao Tong University]nt [Xuhui]ns Campus]ns" is obtained directly.
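The left-to-right longest matching described above can be sketched as follows. This is a simplified illustration, not the patented implementation: the dictionary is assumed to map each name to a single type, matching is done character by character on the Chinese names, and the function names are hypothetical.

```python
def find_internal_entities(name, dictionary):
    """Scan `name` left to right. At each position take the longest
    dictionary entry starting there, excluding `name` itself."""
    spans, i = [], 0
    while i < len(name):
        match = None
        for j in range(len(name), i, -1):      # try longest candidate first
            cand = name[i:j]
            if cand != name and cand in dictionary:
                match = cand
                break
        if match:
            spans.append((i, i + len(match), dictionary[match]))
            i += len(match)                     # continue after the match
        else:
            i += 1
    return spans


def annotate(name, outer_type, spans):
    """Render the nested named entity in the bracket notation of the text."""
    out, pos = [], 0
    for start, end, etype in spans:
        out.append(name[pos:start])
        out.append(f"[{name[start:end]}]{etype}")
        pos = end
    out.append(name[pos:])
    return f"[{''.join(out)}]{outer_type}"


# Dictionary built from the entity list (name -> type); illustrative only.
dictionary = {
    "上海交通大学徐汇校区": "ns",  # Shanghai Jiao Tong University Xuhui Campus
    "上海交通大学": "nt",          # Shanghai Jiao Tong University
    "上海": "ns",                  # Shanghai
    "徐汇": "ns",                  # Xuhui
}
name = "上海交通大学徐汇校区"
spans = find_internal_entities(name, dictionary)
nested = annotate(name, dictionary[name], spans)
print(nested)
```

Because the longest candidate wins at each position, "上海交通大学" is chosen over the shorter "上海", which is why the inner structure of the university name is only recovered later, when nested structures are aggregated.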
Based on the foregoing embodiment, as a preferred implementation, determining whether each internal entity satisfies the nesting relationship specifically includes:
judging whether the Chinese wiki entity entry pointed to by the internal entity and the Chinese wiki entity entry pointed to by the external entity corresponding to the internal entity have an intersection;
if so, determining the internal entity as a first internal entity meeting the nesting relation;
if not, the internal entity is determined to be a second internal entity which does not satisfy the nesting relationship.
The Chinese wiki entity entry pointed to by the internal entity intersects with the Chinese wiki entity entry pointed to by the corresponding external entity in the following cases:
(1) The internal entity has no name ambiguity, and the Chinese wiki entity entry it points to is the same as the one pointed to by its corresponding external entity.
(2) The internal entity has name ambiguity, and one of the Chinese wiki entity entries it points to is the same as the one pointed to by its corresponding external entity.
The Chinese wiki entity entry pointed to by the internal entity does not intersect with the Chinese wiki entity entry pointed to by the corresponding external entity in the following cases:
(1) The internal entity has no name ambiguity, and the Chinese wiki entity entry it points to does not appear in the in-page link list of the Chinese wiki page of its corresponding external entity.
(2) The internal entity has name ambiguity, and none of the Chinese wiki entity entries it points to appears in the in-page link list of the Chinese wiki page of its corresponding external entity.
It should be noted that the in-page link list of a Chinese wiki page refers to all links appearing in that page that point to other wiki pages, each of which corresponds to a Wikipedia entry.
For example, the named entity "[Tibet Autonomous Region]ns" and the named entity "[Tibet]ns" point to the same Chinese wiki entity entry, so the latter cannot be an internal entity of the former. In fact, the named entity "[Tibet Autonomous Region]ns" is a whole that cannot be subdivided, so the label of the named entity "[Tibet]ns" is removed.
"Hong Kong" in the named entity "[Hong Kong district]ns" has name ambiguity, and one of the Chinese wiki entity entries it points to is the same as the entry pointed to by the external entity. In fact, "[Hong Kong district]ns" is a whole that cannot be further divided, so the label of "[Hong Kong]ns" is removed.
The reference to the named entity "[orelbillaz]ns" does not exist in the in-page link list of the named entity "[orelbillaz]ns", so the nesting relationship does not hold, and "[orelbillaz]ns" is removed from the Chinese nested named entity list.
"China" in the named entity "[Chinese City Station]nt" has name ambiguity, and none of the wiki pages it points to appears in the in-page link list of the wiki page of the external entity, so "China" is not an internal entity. "[Chinese City Station]nt" is therefore removed from the Chinese nested named entity list.
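The checks above can be sketched with plain set operations. This is a hedged reading of the text rather than a definitive implementation: the page IDs and link lists below are made up, the function and return-value names are hypothetical, and the final `"keep"` branch (an internal entity whose entry differs from the external entity's but does appear in its in-page link list) is an assumption, since the text does not spell out that case.

```python
def check_internal_entity(internal_pages, external_pages, external_links):
    """internal_pages: entries the internal entity's name points to
    external_pages: entries the external entity's name points to
    external_links: in-page link list of the external entity's wiki page
    Returns the action suggested by the cases described in the text."""
    internal_pages = set(internal_pages)
    if internal_pages & set(external_pages):
        # Same entry as the external entity: not a real internal entity,
        # so only the inner label is removed.
        return "remove_label"
    if not internal_pages & set(external_links):
        # No pointed-to entry is linked from the external page:
        # the nested named entity is deleted from the nested list.
        return "delete_nested_entity"
    # Assumption: linked from the external page, so the nesting is kept.
    return "keep"


# Hypothetical page IDs for the four situations illustrated in the text.
same_entry = check_internal_entity({"p1"}, {"p1"}, {"p7", "p8"})
ambiguous_same = check_internal_entity({"p2", "p3"}, {"p3"}, {"p9"})
not_linked = check_internal_entity({"p4"}, {"p5"}, {"p6"})
ambiguous_not_linked = check_internal_entity({"p4", "p10"}, {"p5"}, {"p6"})
linked = check_internal_entity({"p4"}, {"p5"}, {"p4", "p6"})
print(same_entry, ambiguous_same, not_linked, ambiguous_not_linked, linked)
```

Treating an unambiguous name as a one-element set lets the ambiguous and unambiguous cases share the same two intersection tests.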
Fig. 3 is a flowchart of another method for constructing a chinese named entity recognition corpus according to an embodiment of the present invention. As shown in fig. 3, in order to label the internal entities in the nested named entities as much as possible, so as to further refine the internal structure of the nested named entities, and improve the accuracy of the chinese named entity in identifying the nested named entities in the corpus, as a preferred implementation manner, after step S22, the method further includes:
s30: and judging whether the first external entity in the Chinese nested entity list is the internal entity of the second external entity, if so, entering the step S32, and if not, continuously judging whether the next first external entity in the Chinese nested entity list is the internal entity of the second external entity until the judgment of all the external entities in the Chinese nested entity list is completed.
S31: Merge the nested structure of the first external entity into the second external entity.
In this way, a first external entity that is an internal entity of a second external entity is merged into the second external entity; that is, the first external entity is labeled as an internal entity within the nested named entity corresponding to the second external entity. This refines the internal structure of the nested named entities and improves the accuracy of the nested named entities in the Chinese named entity recognition corpus. Merging in this way also makes the data in the Chinese nested entity list more concise. Finally, a Chinese nested named entity recognition corpus can be constructed from the nested named entities in the Chinese nested entity list.
For example, the named entity "[ [ Shanghai ] ns Jiao Tong University ] nt" appears inside "[ [ Shanghai ] ns Jiao Tong University ] nt [ Xuhui ] ns Campus ] ns", and the two can be merged into the single nested named entity "[ [ [ Shanghai ] ns Jiao Tong University ] nt [ Xuhui ] ns Campus ] ns".
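Steps S30 and S31 can be illustrated with a short Python sketch that uses English renderings of the example entity names. It is a sketch under assumptions rather than the implementation of the invention: the `Entity` tree, the `graft`, `merge_external_entities` and `render` functions are hypothetical names, each external entity is assumed to be stored as a tree whose children are its internal entities, and the second external entity is assumed to contain the first one as a flat (not yet subdivided) internal entity before merging. Whether the first external entity is afterwards dropped from the list is left to the caller.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Entity:
    """A named entity with its label and its internal entities, in text order."""
    text: str
    label: str
    children: list[Entity] = field(default_factory=list)


def graft(first: Entity, second: Entity) -> bool:
    """S31: replace occurrences of `first` inside `second` by first's structure."""
    merged = False
    for i, child in enumerate(second.children):
        if child is first:
            continue  # already merged
        if child.text == first.text and child.label == first.label:
            second.children[i] = first
            merged = True
        elif graft(first, child):
            merged = True
    return merged


def merge_external_entities(entities: list[Entity]) -> list[Entity]:
    """S30: test every ordered pair (first, second) of external entities."""
    for first in entities:
        for second in entities:
            if first is not second:
                graft(first, second)
    return entities


def render(entity: Entity) -> str:
    """Write an entity in bracket notation, e.g. [[Shanghai]ns ...]nt."""
    parts, pos = [], 0
    for child in entity.children:
        start = entity.text.index(child.text, pos)
        parts.append(entity.text[pos:start])
        parts.append(render(child))
        pos = start + len(child.text)
    parts.append(entity.text[pos:])
    return "[" + "".join(parts) + "]" + entity.label


sjtu = Entity("Shanghai Jiao Tong University", "nt", [Entity("Shanghai", "ns")])
campus = Entity(
    "Shanghai Jiao Tong University Xuhui Campus", "ns",
    [Entity("Shanghai Jiao Tong University", "nt"), Entity("Xuhui", "ns")],
)
merge_external_entities([sjtu, campus])
print(render(campus))  # [[[Shanghai]ns Jiao Tong University]nt [Xuhui]ns Campus]ns
```

Storing the internal entities as child nodes means that merging is a single replacement of a flat child by the more detailed tree, which is why the nested structure deepens by one level in the output.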
The embodiments of the method for constructing a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a construction system corresponding to the construction method.
Fig. 4 is a schematic composition diagram of a system for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention. The construction system provided in this embodiment is computer-based. As shown in Fig. 4, the construction system includes:
a prediction module 40, configured to extract features of Chinese wiki encyclopedia entries and to predict, according to the features, the type of the named entity corresponding to each Chinese wiki entity entry; and
a construction module 41, configured to construct, based on the types and the redirection information of the Chinese wiki entity entries, a Chinese wiki entity list containing the named entities, so as to form the corpus;
wherein a Chinese wiki entity entry is a Chinese wiki encyclopedia entry that contains a named entity.
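The division of work between the two modules can be sketched in Python as follows. This is an illustrative sketch only, with several loudly stated assumptions: the `WikiEntry` class, the feature strings and the `CUES` lookup table are hypothetical, and the table merely stands in for whatever type predictor is actually used. The sketch also assumes that redirection information is used to give each redirect title the type of its target entry, which is one plausible reading and is not spelled out in this section. The tags "ns" and "nt" are the ones that appear in the examples of this description.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class WikiEntry:
    """Hypothetical container for the parts of an entry used as feature sources."""
    title: str
    infobox: dict[str, str] = field(default_factory=dict)
    categories: list[str] = field(default_factory=list)
    abstract: str = ""


def extract_features(entry: WikiEntry) -> set[str]:
    """Prediction module, part 1: collect simple string features."""
    feats = {f"infobox:{key}" for key in entry.infobox}
    feats |= {f"category:{cat}" for cat in entry.categories}
    feats |= {f"abstract:{word}" for word in entry.abstract.lower().split()}
    return feats


# Assumed cue table standing in for a trained classifier.
CUES = {"category:cities": "ns", "category:universities": "nt"}


def predict_type(entry: WikiEntry) -> str | None:
    """Prediction module, part 2: return an entity type, or None if no cue fires."""
    for feat in sorted(extract_features(entry)):
        if feat in CUES:
            return CUES[feat]
    return None


def build_entity_list(entries: list[WikiEntry],
                      redirects: dict[str, str]) -> dict[str, str]:
    """Construction module: map entity names (titles and redirects) to types."""
    entity_list: dict[str, str] = {}
    for entry in entries:
        entity_type = predict_type(entry)
        if entity_type is not None:
            entity_list[entry.title] = entity_type
    # Assumption: a redirect title inherits the type of its target entry.
    for alias, target in redirects.items():
        if target in entity_list:
            entity_list.setdefault(alias, entity_list[target])
    return entity_list


entries = [
    WikiEntry("Shanghai", categories=["cities"]),
    WikiEntry("Suzhou University", categories=["universities"]),
    WikiEntry("Photosynthesis", abstract="a process"),
]
redirects = {"Soochow University": "Suzhou University", "Foo": "Photosynthesis"}
entity_list = build_entity_list(entries, redirects)
print(entity_list)
```

Entries for which no type is predicted are left out of the list, so their redirect titles are left out as well.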
Because the system for constructing a Chinese named entity recognition corpus provided in this embodiment corresponds to the construction method described above, it has the same beneficial effects as that method, which are not repeated here.
The embodiments of the method for constructing a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a construction device corresponding to the construction method.
Fig. 5 is a schematic composition diagram of a device for constructing a Chinese named entity recognition corpus according to an embodiment of the present invention. As shown in Fig. 5, the device for constructing a Chinese named entity recognition corpus provided in this embodiment includes:
a memory 50 for storing a construction program;
a processor 51 for implementing, when executing the construction program, the steps of any one of the methods for constructing a Chinese named entity recognition corpus described above.
Because the processor in the device provided in this embodiment can implement the steps of any of the above methods for constructing a Chinese named entity recognition corpus when it calls the construction program stored in the memory, the device has the same beneficial effects as the above construction method, which are not repeated here.
The embodiments of the method for constructing a Chinese named entity recognition corpus provided by the present invention have been described in detail above. The present invention also provides a computer-readable storage medium corresponding to the construction method. Because the embodiment of the computer-readable storage medium corresponds to the embodiments of the method, reference is made to the description of the method embodiments for the storage medium embodiment, and the details common to both are not repeated here.
The present embodiment provides a computer-readable storage medium that stores a construction program. When executed by a processor, the construction program implements the steps of any one of the above methods for constructing a Chinese named entity recognition corpus.
Because the construction program stored in the computer-readable storage medium provided in this embodiment can implement the steps of any of the above methods for constructing a Chinese named entity recognition corpus when it is called by a processor, the computer-readable storage medium has the same beneficial effects as the above construction method, which are not repeated here.
The method, system, device and storage medium for constructing a named entity recognition corpus provided by the present invention have been described in detail above. The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced.
It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.
It is further noted that, in the present specification, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (7)

1. A method for constructing a Chinese named entity recognition corpus, characterized in that the method is computer-based and comprises the following steps:
extracting the characteristics of the Chinese wiki encyclopedia entries, and predicting the type of the named entity corresponding to the Chinese wiki entity entries according to the characteristics;
based on the type and the redirection information of the Chinese wiki entity entry, constructing a Chinese wiki entity list containing the named entity to form a corpus;
wherein the Chinese wiki entity entry is a Chinese wiki encyclopedia entry containing the named entity;
after the constructing the Chinese wiki entity list containing the named entities, the method further comprises the following steps:
identifying internal entities in the Chinese wiki entity list and generating nested named entities containing the internal entities;
adding the nested named entities into a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list meets a nested relation;
removing the label of the first internal entity which meets the nesting relation, and deleting the nested named entity containing the second internal entity which does not meet the nesting relation;
wherein the determining whether each internal entity meets the nesting relation specifically comprises:
judging whether the Chinese wiki entity entry pointed by the internal entity and the Chinese wiki entity entry pointed by the external entity corresponding to the internal entity have intersection;
if so, determining the internal entity as a first internal entity meeting the nesting relationship;
and if not, determining the internal entity as a second internal entity which does not meet the nesting relationship.
2. The method for constructing a Chinese named entity recognition corpus according to claim 1, wherein the extracting the characteristics of the Chinese wiki encyclopedia entries specifically comprises:
extracting the characteristics from the information box, the classification box and the abstract of the Chinese wiki entity entry.
3. The method for constructing a Chinese named entity recognition corpus according to claim 2, wherein the identifying internal entities in the Chinese wiki entity list specifically comprises:
determining the named entity without type ambiguity in the Chinese wiki entity list;
and identifying internal entities contained in the named entities without type ambiguity by using a longest matching principle by taking the Chinese wiki entity list as a dictionary.
4. The method for constructing a Chinese named entity recognition corpus according to claim 3, wherein after the removing the label of the first internal entity which meets the nesting relation and deleting the nested named entity containing the second internal entity which does not meet the nesting relation, the method further comprises:
judging whether a first external entity in the Chinese nested entity list is an internal entity of a second external entity;
if so, aggregating the nested structure of the first external entity into the second external entity.
5. A system for constructing a Chinese named entity recognition corpus, characterized in that the system is computer-based and comprises:
the prediction module is used for extracting the characteristics of the Chinese wiki encyclopedia entries and predicting the type of the named entity corresponding to the Chinese wiki entity entry according to the characteristics;
the construction module is used for constructing a Chinese wiki entity list containing the named entities to form a corpus based on the type and the redirection information of the Chinese wiki entity entry;
wherein the Chinese wiki entity entry is a Chinese wiki encyclopedia entry containing the named entity;
after the constructing the Chinese wiki entity list containing the named entities, the method further comprises the following steps:
identifying internal entities in the Chinese wiki entity list and generating nested named entities containing the internal entities;
adding the nested named entities into a Chinese nested entity list, and determining whether each internal entity in the Chinese nested entity list meets a nested relation;
removing the label of the first internal entity which meets the nesting relation, and deleting the nested named entity containing the second internal entity which does not meet the nesting relation;
wherein the determining whether each internal entity meets the nesting relation specifically comprises:
judging whether the Chinese wiki entity entry pointed by the internal entity and the Chinese wiki entity entry pointed by the external entity corresponding to the internal entity have intersection;
if so, determining the internal entity as a first internal entity meeting the nesting relationship;
and if not, determining the internal entity as a second internal entity which does not meet the nesting relationship.
6. A device for constructing a Chinese named entity recognition corpus, characterized by comprising:
a memory for storing a construction program;
a processor for implementing, when executing the construction program, the steps of the method for constructing a Chinese named entity recognition corpus as claimed in any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a construction program which, when executed by a processor, implements the steps of the method for constructing a Chinese named entity recognition corpus as claimed in any one of claims 1 to 4.
CN201810325492.XA 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus Active CN108520065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810325492.XA CN108520065B (en) 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810325492.XA CN108520065B (en) 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus

Publications (2)

Publication Number Publication Date
CN108520065A CN108520065A (en) 2018-09-11
CN108520065B true CN108520065B (en) 2022-04-12

Family

ID=63432233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810325492.XA Active CN108520065B (en) 2018-04-12 2018-04-12 Method, system, equipment and storage medium for constructing named entity recognition corpus

Country Status (1)

Country Link
CN (1) CN108520065B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399452A (en) * 2019-07-23 2019-11-01 福建奇点时空数字科技有限公司 A kind of name list of entities generation method of Case-based Reasoning feature modeling
CN112182204A (en) * 2020-08-19 2021-01-05 广东汇银贸易有限公司 Method and device for constructing corpus labeled by Chinese named entities
CN111950288B (en) * 2020-08-25 2024-02-23 海信视像科技股份有限公司 Entity labeling method in named entity recognition and intelligent device
CN113065353B (en) * 2021-03-16 2024-04-02 北京金堤征信服务有限公司 Entity identification method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649272A (en) * 2016-12-23 2017-05-10 东北大学 Named entity recognizing method based on mixed model
CN107239481A (en) * 2017-04-12 2017-10-10 北京大学 A kind of construction of knowledge base method towards multi-source network encyclopaedia

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Tree Kernel-Based Semantic Relation Extraction Using Unified Dynamic Relation Tree; Longhua Qian et al.; International Conference on Advanced Language Processing and Web Information Technology; 2008-07-25; full text *

Also Published As

Publication number Publication date
CN108520065A (en) 2018-09-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant