CN115935995A

CN115935995A - Knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field

Info

Publication number: CN115935995A
Application number: CN202211601610.8A
Authority: CN
Inventors: 王昊; 赵梓博; 刘懋霖; 赵萌; 王彦莹
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2022-12-13
Filing date: 2022-12-13
Publication date: 2023-04-07

Abstract

The invention discloses a knowledge graph generation-oriented method for extracting entity relations in the intangible cultural heritage silk weaving field, which specifically comprises the following steps: step one, entity identification; step two, relation extraction; step three, instance expansion; step four, graph construction. The invention relates to the technical field of digital protection of intangible cultural heritage. In this method, entity recognition is carried out using a mature natural language processing tool and a term dictionary, entity relations are extracted by an unsupervised machine learning method, the labeled instances are expanded in a semi-supervised manner, and finally the domain knowledge graph is generated from the labeled relation triples. The method thereby effectively solves the problems of scarce labeled data, difficult text feature selection and shallow application depth, and is suitable for extracting and applying entity relations from large-scale unstructured web text in scenarios where both learning corpora and labeled data are scarce.

Description

Knowledge graph generation-oriented method for extracting entity relations in the intangible cultural heritage silk weaving field

Technical Field

The invention relates to the technical field of digital protection of intangible cultural heritage, and in particular to a knowledge graph generation-oriented method for extracting entity relations in the intangible cultural heritage silk weaving field.

Background

Intangible cultural heritage is an important component of Chinese national culture and mainly comprises folk literature, music, dance, drama, traditional handicraft skills and similar forms of expression. Existing protection of intangible cultural heritage resources faces serious problems such as discontinuity of skills, weakening of practice and a lack of inheritors; part of this culture has been seriously eroded and is in danger of being lost. Domain knowledge graph technology has therefore been adopted for the digital protection of intangible cultural heritage.

Constructing a knowledge graph in the prior art involves key technical paths such as entity identification and relation extraction. The quality of a knowledge graph depends on high-quality knowledge elements, namely entities and entity relation triples. A large amount of unstructured text containing intangible cultural heritage knowledge is distributed across the Chinese Internet, and this text is the main source for acquiring intangible cultural heritage knowledge elements.

The related prior art covers intangible cultural heritage digital humanities research, entity recognition research, relation extraction research, text representation research and knowledge graph construction research. Intangible cultural heritage digital humanities research has mainly gone through two stages: digitization research and knowledge organization research. Entity identification refers to identifying entities with specific meanings in text, typically person names, place names, organization names, and date, time and numeric expressions; existing entity identification methods mainly comprise methods based on rule templates and pattern matching, methods based on statistical machine learning, and methods based on deep neural network learning. Relation extraction is the process of judging, on the basis of entity identification, the semantic relation between two entities in a sentence so as to obtain a relation triple <E1, R, E2>, wherein E1 and E2 refer to the entities and R refers to the relation description; it mainly comprises rule-based methods, supervised learning methods, semi-supervised learning methods and unsupervised learning methods. Text representation methods are mainly divided into traditional word representation methods, static word embedding representation methods and pre-trained language model representation methods. A knowledge graph is a symbolic expression of the physical world and is a concept network formed by connecting entities through semantic relations.

From the above research, the following problems can be identified:

most knowledge organization in the intangible cultural heritage field adopts a top-down organization mode, which requires a certain amount of domain knowledge and expert knowledge, so that the knowledge organization process incurs considerable labor and time costs;

with the gradual maturing of deep learning technology, entity identification and relation extraction research based on deep learning has achieved increasingly good results; however, in many fields researchers still face the data cold-start problem, and how to complete knowledge extraction automatically and with little effort remains a question of wide interest;

the BERT pre-trained language model is one of the best-performing text representation methods at present; however, in the relation extraction task, most existing research only takes the sentence containing the entity pair as the input of the BERT model, or adds grammatical features such as part of speech, syntactic dependency and semantic roles, and rarely considers the semantic importance of the entities and of each part of the context;

from the perspective of knowledge graph construction, little research currently concerns the intangible cultural heritage and silk weaving fields; the intangible cultural heritage knowledge graphs in existing research do not mine domain knowledge sufficiently and are mostly positioned as knowledge display and retrieval services, so that their application and service modes need to be deepened.

Disclosure of Invention

(I) Technical problem to be solved

In view of the defects of the prior art, the invention provides a knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field, which solves the problems of scarce labeled data, difficult text feature selection and shallow application depth.

(II) Technical scheme

To achieve the above purpose, the invention provides the following technical scheme: a knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field, which specifically comprises the following steps:

step one, entity identification: for domain entity recognition, a term dictionary corresponding to the silk weaving field is input into a word segmentation tool as a user-defined dictionary, so that the word segmentation tool segments out domain entities during word segmentation, and dictionary matching is then carried out, thereby improving the entity recall rate; for general type entity recognition, a mature natural language processing tool is used;

step two, relation extraction: a sentence is divided into entity one, entity two, the preceding text before entity one, the following text after entity two, and the middle text between entity one and entity two; text features are constructed by performing BERT embedding representation on the different parts of the text; the features are combined and grouped, and an average feature vector is obtained for each feature scheme; relation clustering is performed with the KMeans, density clustering and spectral clustering algorithms respectively; each group of clustering results is evaluated and analyzed against manually pre-labeled relation tags; and finally the text features and clustering algorithm that give the best relation clustering effect are determined;

step three, instance expansion: high-value samples are selected with the aid of the optimal text features and clustering algorithm and passed to a classifier for learning, so as to obtain a relation classification model;

step four, graph construction: the knowledge graph of the silk weaving field is constructed in the order of knowledge processing, knowledge graph construction and graph application mode design, and humanistic narration is then carried out on the basis of the knowledge graph of the silk weaving field.
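The sentence division described in step two can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `split_sentence`, the `SentenceParts` container and the convention that entity spans are given as character offsets with an exclusive end are all assumptions made for the example.

```python
from typing import NamedTuple, Tuple


class SentenceParts(NamedTuple):
    """The five text parts used to build features (names are illustrative)."""
    before: str   # preceding text before entity one
    entity1: str
    middle: str   # middle text between the two entities
    entity2: str
    after: str    # following text after entity two


def split_sentence(sentence: str,
                   span1: Tuple[int, int],
                   span2: Tuple[int, int]) -> SentenceParts:
    """Split a sentence around two entity spans given as (start, end), end exclusive."""
    # Order the spans so that entity one is whichever entity appears first.
    (s1, e1), (s2, e2) = sorted([span1, span2])
    if e1 > s2:
        raise ValueError("entity spans must not overlap")
    return SentenceParts(
        before=sentence[:s1],
        entity1=sentence[s1:e1],
        middle=sentence[e1:s2],
        entity2=sentence[s2:e2],
        after=sentence[e2:],
    )
```

Each of the returned parts would then be passed to the embedding model; the spans themselves would come from the entity identification output of step one.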

The invention is further configured to: the entity identification in the first step specifically comprises the following steps:

a domain dictionary is imported into a natural language processing tool as a user-defined dictionary, and the natural language processing tool is used to segment sentences, obtaining the segmented tokens and the part-of-speech feature of each token;

the tokens obtained by word segmentation and their part-of-speech feature information are input into the natural language processing tool to perform general type entity identification, which covers four types: time, person name, place name and organization name; the output of the natural language processing tool is the names and types of all general type entities in the corpus and the start and end positions of the entities in the token list;

entity matching is performed token by token, wherein domain entities are matched through the domain dictionary and assigned the entity type "special", general type entities are matched through the output of the natural language processing tool, and the matching output comprises the entity name, entity part of speech, entity type and starting position of the entity in the original text.
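The token-by-token matching above can be sketched as follows. The sketch makes several simplifying assumptions that are not in the source: tokens are supplied already segmented as (word, part of speech) pairs, the general entity recognizer output is reduced to a mapping from token index to entity type (so each general entity covers one token), and the text contains no whitespace between tokens, as in Chinese, so that character offsets can be accumulated from token lengths. The function name `match_entities` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def match_entities(tokens: List[Token],
                   domain_terms: Set[str],
                   general_entities: Dict[int, str]) -> List[dict]:
    """Label each token as a domain entity, a general type entity, or neither."""
    matches = []
    offset = 0  # character offset of the current token in the original text
    for index, (word, pos) in enumerate(tokens):
        if word in domain_terms:
            entity_type = "special"  # domain entities receive the type "special"
        else:
            # None when the recognizer did not mark this token as an entity.
            entity_type = general_entities.get(index)
        if entity_type is not None:
            matches.append(
                {"name": word, "pos": pos, "type": entity_type, "start": offset}
            )
        offset += len(word)
    return matches
```

For the tokens `[("云锦", "n"), ("产于", "v"), ("南京", "ns")]`, a domain dictionary containing 云锦 and a recognizer output marking token 2 as a place, the function returns one "special" entity starting at offset 0 and one "place" entity starting at offset 4.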

The invention is further configured to: the text features in step two specifically include: middle text features generated by embedding the middle text; context features generated by embedding the preceding text and the following text; and part-of-speech features and entity type features generated by embedding, respectively, the part-of-speech text and the entity type text of entity one and entity two.
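The combination of features into schemes, with one average vector per scheme, might look like the sketch below. Two points are assumptions rather than statements of the source: the "average feature vector" is read as the element-wise mean of the feature vectors in a scheme, and every non-empty combination of the four features is enumerated as a candidate scheme. The feature vectors themselves would come from a BERT model; here they are plain lists of floats so that the example stays self-contained.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

# Feature names follow the labels used in the description.
FEATURE_NAMES = ["Middle", "Context", "Pos", "Type"]


def feature_schemes(names: Sequence[str]) -> List[Tuple[str, ...]]:
    """Enumerate every non-empty combination of feature names."""
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def average_vectors(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("at least one vector is required")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all feature vectors must share one dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def scheme_vector(features: Dict[str, Sequence[float]],
                  scheme: Sequence[str]) -> List[float]:
    """Average the vectors of the features that belong to one scheme."""
    return average_vectors([features[name] for name in scheme])
```

Each scheme vector can then be fed to the clustering algorithms, and the scheme with the best clustering scores is retained.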

The invention is further configured to: the third step specifically comprises the following steps:

training a classifier on the corpus with category labels obtained from the relation clustering experiment;

in each iteration, extracting part of the unlabeled corpus, predicting its classes with the classifier, screening out valuable instances in combination with the clustering method, and adding them to the training set after manual labeling;

continuing to train a new classifier on the training set supplemented with the new data, and, when a stopping condition is reached, stopping the iteration and outputting the final classifier.

The invention is further configured to: the valuable instance selection strategies include:

representative sampling: if the clustering algorithm assigns a sample to a cluster of its own, the sample is considered special: it lies near the margin of the distribution in the embedding space, has a certain particularity, and requires the classifier to strengthen its learning;

uncertainty sampling: if the clustering algorithm groups samples of different categories into one cluster, the positions of those samples in the embedding space are close and easily confused; the samples belonging to the categories with relatively few members in the cluster are therefore considered uncertain samples, and the classifier is required to strengthen its memory of them.
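The two selection strategies can be illustrated with the sketch below, which takes the cluster label and the classifier's predicted class for each unlabeled sample and returns the indices to send for manual labeling. The function name `select_valuable` is hypothetical, and one detail is an assumption: a class counts as having "relatively few members" when its count in the cluster is below that of the most frequent class, so that an exact tie selects nothing.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List


def select_valuable(cluster_ids: List[Hashable],
                    predicted: List[Hashable]) -> Dict[str, List[int]]:
    """Pick representative and uncertain samples by index.

    cluster_ids[i] is the cluster of sample i; predicted[i] is the class
    that the current classifier predicts for sample i.
    """
    if len(cluster_ids) != len(predicted):
        raise ValueError("inputs must have the same length")

    members = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        members[cluster].append(index)

    representative, uncertain = [], []
    for indices in members.values():
        if len(indices) == 1:
            # Representative sampling: the sample forms a cluster on its own.
            representative.extend(indices)
            continue
        counts = Counter(predicted[i] for i in indices)
        if len(counts) > 1:
            # Uncertainty sampling: keep samples of the minority classes.
            majority = max(counts.values())
            uncertain.extend(i for i in indices if counts[predicted[i]] < majority)
    return {"representative": representative, "uncertain": uncertain}
```

The selected indices would be labeled manually and appended to the training set before the next classifier is trained.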

The invention is further configured to: the fourth step specifically comprises the following steps:

processing the triple knowledge through triple cleaning, semantic disambiguation and coreference resolution to construct the graph data layer;

constructing a visual interface of a knowledge graph based on a Neo4j graph database;

exploring the application mode of the knowledge graph from the aspects of characteristic knowledge retrieval, family knowledge retrieval and knowledge reasoning;

describing knowledge-graph-based humanistic narratives from the perspectives of the development context, geographic evolution and commonality of silk weaving skills.
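As an illustration of how cleaned triples might be prepared for loading into a Neo4j graph database, the sketch below performs a very simple form of triple cleaning (whitespace stripping, dropping incomplete triples and exact duplicates) and produces parameter dictionaries for a Cypher `MERGE` statement. The node label `Entity`, the relationship type `RELATED`, the property names and the function names are all hypothetical; semantic disambiguation and coreference resolution are not covered. The relation name is stored as a relationship property because the statement passes it as a query parameter.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Cypher statement to run once per parameter dictionary (labels are illustrative).
MERGE_TRIPLE = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[:RELATED {type: $relation}]->(t)"
)


def clean_triples(triples: Iterable[Triple]) -> List[Triple]:
    """Strip whitespace, drop incomplete triples and exact duplicates, keep order."""
    seen = set()
    cleaned = []
    for triple in triples:
        head, relation, tail = (part.strip() for part in triple)
        if not (head and relation and tail):
            continue
        key = (head, relation, tail)
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return cleaned


def to_cypher_params(triples: Iterable[Triple]) -> List[Dict[str, str]]:
    """Build one parameter dictionary per cleaned triple for MERGE_TRIPLE."""
    return [
        {"head": head, "relation": relation, "tail": tail}
        for head, relation, tail in clean_triples(triples)
    ]
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so re-running it on overlapping triple sets does not duplicate nodes or relationships.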

(III) Advantageous effects

The invention provides a knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field. The method has the following beneficial effects:

(1) The invention uses a mature natural language processing tool and a term dictionary to carry out entity recognition, extracts entity relations through an unsupervised machine learning method, expands the labeled instances in a semi-supervised manner, and finally generates the domain knowledge graph from the labeled relation triples, thereby effectively solving the problems of scarce labeled data, difficult text feature selection and shallow application depth; it is suitable for extracting and applying entity relations from large-scale unstructured web text in scenarios where both learning corpora and labeled data are scarce.

(2) The invention constructs an entity relation extraction method that depends only lightly on labeled data and targets unstructured text that is easily obtained from the web, which helps researchers in fields such as digital humanities carry out entity relation extraction efficiently. On the basis of constructing the knowledge graph of the intangible cultural heritage silk weaving domain, it also explores the application modes of the domain knowledge graph, thereby providing a perspective for novel knowledge services based on knowledge graphs and important help for the protection and inheritance of knowledge in the silk weaving domain.

Drawings

FIG. 1 is a schematic diagram of the SREP model structure of the present invention;

FIG. 2 is a flow chart of the present invention for silk entity identification;

FIG. 3 is a flow chart of a relationship extraction experiment of the present invention;

FIG. 4 is a diagram of the active-learning-based relation instance expansion model of the present invention;

FIG. 5 is a schematic diagram of the construction process of knowledge-graph in the field of silk weaving.

Detailed Description

The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

Referring to FIGS. 1-5, an embodiment of the present invention provides the following technical solution: a knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field, which provides a research framework, the SREP model, for relation extraction in the silk weaving field and divides the main process of relation extraction and application into four stages: entity identification, relation extraction, instance expansion and graph construction. As shown in FIG. 1, the SREP model as a whole adopts a pipeline structure, and the method specifically comprises the following steps:

step one, entity identification: considering that silk weaving domain entities and general entities differ in recognition difficulty, for domain entity recognition a term dictionary corresponding to the silk weaving field is input into a word segmentation tool as a user-defined dictionary, so that the word segmentation tool segments out domain entities during word segmentation, and dictionary matching is then performed, thereby improving the entity recall rate; for general type entity identification, a mature natural language processing tool is used, as shown in FIG. 2;

specifically, a domain dictionary is imported into a natural language processing tool as a user-defined dictionary, and the natural language processing tool is used to segment sentences, obtaining the segmented tokens (Token) and the part-of-speech feature (Pos) of each token;

the tokens obtained by word segmentation and their part-of-speech feature information are input into the natural language processing tool to perform general type entity identification, which covers four types: time, person name, place name and organization name; the output of the natural language processing tool is the names and types of all general type entities in the corpus and the start and end positions of the entities in the token list;

entity matching is performed token by token, wherein domain entities are matched through the domain dictionary and assigned the entity type "special", general type entities are matched through the output of the natural language processing tool, and the matching output comprises the entity name, entity part of speech, entity type and starting position of the entity in the original text;

step two, relation extraction: as shown in FIG. 3, the sentence is divided into entity one, entity two, the preceding text before entity one, the following text after entity two, and the middle text between entity one and entity two, and text features are constructed by performing BERT embedding representation on the different parts of the text: the middle text feature (Middle) is generated by embedding the middle text; the context feature (Context) is generated by embedding the preceding text and the following text; and the part-of-speech feature (Pos) and the entity type feature (Type) are generated by embedding, respectively, the part-of-speech text and the entity type text of entity one and entity two. The features are combined and grouped, an average feature vector is obtained for each feature scheme, relation clustering is performed with the KMeans, density clustering and spectral clustering algorithms respectively, each group of clustering results is evaluated and analyzed against manually pre-labeled relation labels, and finally the text features and clustering algorithm that give the best relation clustering effect are determined;

step three, instance expansion: based on an active learning method, the optimal text features and clustering algorithm are used to assist in selecting high-value samples, which are passed to a classifier for learning so as to obtain a relation classification model, as shown in FIG. 4; the specific steps are as follows:

training a classifier on the corpus with category labels obtained from the relation clustering experiment;

continuing to train a new classifier on the training set supplemented with new data, and, when a stopping condition is reached, stopping the iteration and outputting the final classifier;

since relation extraction based on the optimal features and clustering algorithm has been verified to work well on small-scale data sets, the cluster labels of unlabeled samples can be used to assist the classifier in selecting valuable instances, and the valuable instance selection strategy employs representative sampling and uncertainty sampling. Representative sampling means that if the clustering algorithm assigns a sample to a cluster of its own, the sample is considered special: it lies near the margin of the distribution in the embedding space, has a certain particularity, and requires the classifier to strengthen its learning. Uncertainty sampling means that if the clustering algorithm groups samples that the classifier has judged to belong to different categories into one cluster, the positions of those samples in the embedding space are close and easily confused; the samples belonging to the categories with relatively few members in the cluster are therefore considered uncertain samples, and the classifier is required to strengthen its memory of them. Valuable instances can be selected continuously from the unlabeled samples according to this strategy to help the classifier learn;

step four, graph construction: the knowledge graph of the silk weaving field is constructed in the order of knowledge processing, knowledge graph construction and graph application mode design, and humanistic narration is then carried out on the basis of the knowledge graph of the silk weaving field, as shown in FIG. 5; the specific steps are as follows:

processing the triple knowledge through triple cleaning, semantic disambiguation and coreference resolution to construct the graph data layer;

constructing a visual interface of the knowledge graph based on the Neo4j graph database;

exploring the application modes of the knowledge graph from the aspects of characteristic knowledge retrieval, familial knowledge retrieval and knowledge reasoning;

describing knowledge-graph-based humanistic narratives from the perspectives of the development context, regional evolution and commonality of silk weaving skills.

In more detail, by constructing the SREP model research framework, the main process of relation extraction and application is divided into four stages: entity identification, relation extraction, instance expansion and graph construction. The model as a whole adopts a pipeline structure, and semantic representation of text is realized with a BERT model. The SREP model was put into practice using online unstructured text from the China Intangible Cultural Heritage Network and Baidu Encyclopedia as data sources; the study shows that the SREP model can realize entity relation extraction with only lightly labeled data and effectively alleviates the problem of scarce labeled data;

in the entity recognition stage, word segmentation is carried out using the Haugh natural language processing tool, and domain entities and general type entities are recognized respectively by combining the natural language processing tool with a silk weaving term dictionary composed of an original dictionary and a supplementary dictionary of document keywords;

in the relation extraction stage, the influence of different text features on the relation extraction effect is examined; relation clustering is carried out using the KMeans, DBSCAN and spectral clustering algorithms, and the clustering results are evaluated with the ARI, V-Measure and CI indexes. The experimental results show that the KMeans algorithm is the most suitable for relation extraction in the silk weaving field, and that combining the entity middle text features with the entity type features gives the best relation extraction effect; five types of entity relations, namely origin, time, attribute, parallel and inheritance relations, are extracted by the unsupervised clustering method, which solves the problem of difficult text feature selection;

in the instance expansion stage, the robustness of large-scale sample clustering is checked first, and it is found that, as the sample scale increases, the clustering still satisfies quality robustness but no longer satisfies quantity robustness; the instance expansion stage thereby solves the problem that clustering large-scale samples tends to make the classes difficult to label;

in the knowledge graph construction stage, knowledge processing is carried out on the graph data layer through triple cleaning, semantic disambiguation and coreference resolution; the silk weaving domain knowledge graph is then constructed based on the Neo4j graph database; and the application modes of the silk weaving domain knowledge graph are then elucidated, covering characteristic knowledge retrieval, family knowledge retrieval and knowledge reasoning.
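The evaluation of clustering results against manually pre-labeled relation labels, mentioned in the relation extraction stage, can be illustrated with the adjusted Rand index (ARI). The sketch below is a plain standard-library implementation of the standard ARI formula, written for this illustration; it is not taken from the patent, and the convention of returning 1.0 when both partitions are trivial is an assumption of the example.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(truth: Sequence[Hashable],
                        clusters: Sequence[Hashable]) -> float:
    """Adjusted Rand index between reference labels and cluster labels."""
    if len(truth) != len(clusters):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: nothing to disagree on

    # Pairs of samples that share both a reference label and a cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, clusters)).values())
    # Pairs sharing a reference label, and pairs sharing a cluster.
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(clusters).values())

    expected = sum_rows * sum_cols / total_pairs
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:
        return 1.0  # both partitions are trivial (one cluster or all singletons)
    return (sum_cells - expected) / (maximum - expected)
```

The index is 1.0 when the clustering reproduces the reference labels up to renaming of the clusters, and is close to 0 for a random assignment, which makes it suitable for comparing feature schemes and clustering algorithms on the same labeled subset.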

Claims

1. A knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field, characterized in that the method specifically comprises the following steps:

step one, entity identification: for domain entity recognition, a term dictionary corresponding to the silk weaving field is input into a word segmentation tool as a user-defined dictionary, so that the word segmentation tool segments out domain entities during word segmentation, and dictionary matching is then carried out; for general type entity recognition, a mature natural language processing tool is used;

2. The knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field according to claim 1, wherein the entity identification in step one specifically comprises the following steps:

entity matching is performed token by token, wherein domain entities are matched through a domain dictionary and assigned the entity type "special", and general type entities are matched through the output of the natural language processing tool.

3. The knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field according to claim 2, wherein the output result comprises the entity name, entity part of speech, entity type and starting position of the entity in the original text.

4. The knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field according to claim 1, wherein the text features in step two specifically include: middle text features generated by embedding the middle text; context features generated by embedding the preceding text and the following text; and part-of-speech features and entity type features generated by embedding, respectively, the part-of-speech text and the entity type text of entity one and entity two.

5. The knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field according to claim 1, wherein step three specifically comprises the following steps:

6. The knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field according to claim 5, wherein the valuable instance selection strategies include:

representative sampling: selecting the instances that the clustering algorithm assigns to a cluster of their own;

uncertainty sampling: selecting instances from clusters into which the clustering algorithm has grouped samples of different classes.

7. The knowledge graph generation-oriented entity relation extraction method for the intangible cultural heritage silk weaving field according to claim 1, wherein step four specifically comprises the following steps: