CN111444348A

CN111444348A - Method, system and medium for constructing and applying knowledge graph architecture

Info

Publication number: CN111444348A
Application number: CN202010124150.9A
Authority: CN
Inventors: 亓杰星; 李琦; 傅洛伊; 王新兵; 陈贵海
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2020-07-24

Abstract

The invention provides a method, a system and a medium for constructing and applying a knowledge graph architecture, comprising the following steps: step 1: completing knowledge modeling by defining the entities of the academic field and constructing the ontology of the academic knowledge graph; step 2: performing entity alignment, namely, for each entity in the heterogeneous data source knowledge bases, finding the entities that refer to the same real-world object; step 3: enriching the knowledge graph by using a rule-based knowledge graph reasoning method; step 4: evaluating several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG; and step 5: evaluating several state-of-the-art network representation learning methods on AceKG. The invention not only provides pure academic information, but also provides a large-scale benchmark data set for researchers, provides a foundation for evaluating knowledge embedding and network representation learning methods, and enriches the proposed knowledge graph architecture.

Description

Method, system and medium for constructing and applying knowledge graph architecture

Technical Field

The invention relates to the technical field of academic data mining, in particular to a method, a system and a medium for constructing and applying a knowledge graph architecture, and more particularly to AceKG, a large-scale knowledge graph for academic data mining.

Background

A knowledge graph is a structured semantic knowledge base that describes concepts in the physical world and their interrelationships in symbolic form. The basic composition unit is an entity-relation-entity triple, entities and related attribute value pairs thereof, and the entities are mutually connected through relations to form a network knowledge structure.

In the middle of the 20th century, Price et al. proposed using citation networks to study the development of contemporary science, which was the first appearance of the concept of a knowledge graph. In 1977, the concept of knowledge engineering was proposed at the Fifth International Joint Conference on Artificial Intelligence, and knowledge base systems, represented by expert systems, were widely researched and applied. It was not until the 1990s that the concept of an organizational knowledge base was proposed, after which research on knowledge representation and knowledge organization was carried out in depth. Organizational knowledge base systems are widely used for data integration and external publicity in various departments and institutions. In November 2012, Google introduced the concept of the Knowledge Graph (KG) and announced that it would add knowledge graph functionality to its search results. Its purpose is to improve the capability of the search engine and to enhance the search quality and search experience of users. According to statistics from January 2015, the KG constructed by Google already contained 500 million entities and about 3.5 billion entity relationships, and has been widely applied to improving search quality.

Although the term Knowledge Graph is relatively new, it is not a completely new research field. As early as 2006, Berners-Lee proposed the idea of Linked Data and called for popularizing and improving related technical standards such as URI (Uniform Resource Identifier), RDF (Resource Description Framework) and OWL (Web Ontology Language), in preparation for the arrival of the Semantic Web.

Knowledge graphs have become an important resource supporting many artificial intelligence applications, such as graph analysis, question answering systems and web search. A knowledge graph describes and stores entities in triple form; it is a multi-relational graph composed of entities as nodes and relationships as edges of different types. Many companies and research teams now try to organize the knowledge of their fields into machine-readable knowledge graphs. Although these large-scale knowledge graphs gather a great deal of factual information about the world, many areas remain to be studied.

Academic network data mining uses information about entities such as papers, scholars, institutions, venues and research fields to discover hidden relationships and semantics-based information. Based on structured academic data, multiple academic databases and knowledge graphs have been constructed. A public academic knowledge graph can provide convincing academic information for scholars and large-scale benchmark data sets for researchers carrying out data mining projects.

However, existing databases and knowledge graphs have some limitations. First, most existing work provides homogeneous academic graphs, in which the relationships between different types of entities are lost. Second, some databases focus on only one particular research field, which limits projects that aim to discover cross-field knowledge. Third, synonyms and ambiguities also hinder knowledge mining. Assigning unique IDs to entities is a necessary solution, but some databases use entity names directly as IDs.

In view of the shortcomings of the prior art, the present invention aims to provide AceKG, a large-scale knowledge graph architecture for academic data mining, which provides pure academic information and a large-scale benchmark data set for researchers to carry out challenging data mining projects, including link prediction, community detection and scholar classification.

Patent document CN110347844A (application number: 201910633602.3) discloses a space target knowledge graph construction system, which comprises a text information collection and processing module, a document information collection and processing module, a knowledge graph construction module and a knowledge graph display module; aiming at the problems of high information confidentiality, strong specialization, difficult direct acquisition and the like of the space target, the method of data mining is adopted to research the space target information hidden behind massive information by acquiring related information of the space target such as news, microblogs, academic journals and the like, visually display the attribute information of the space target, analyze the mutual relation between the space targets, and integrate two types of knowledge maps by using software technology to construct a space target knowledge map system.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method, a system and a medium for constructing and applying a knowledge graph framework.

The construction and application method of the knowledge graph framework provided by the invention comprises the following steps:

step 1: the knowledge modeling is completed by defining the entity of the academic field and constructing the ontology of the academic knowledge map;

step 2: entity alignment is carried out, namely, for each entity in the heterogeneous data source knowledge base, the same entity belonging to the real world is found out;

step 3: enriching the knowledge graph by using a rule-based knowledge graph reasoning method;

step 4: evaluating several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG;

step 5: evaluating several state-of-the-art network representation learning methods on the knowledge graph architecture AceKG.

Preferably, the entity definition of the academic field comprises 5 types of academic entities: papers, authors, research fields, venues and institutions; the common attributes of each entity and the relationships between entities are described as triples S(h, r, t);

S represents a triple; h represents a head entity; r represents a relationship between entities; t represents a tail entity;

the ontology of the academic knowledge graph covers academic entities described by 31.3 million triples.

Preferably, the step 2 comprises: mapping the AceKG papers in the field of computer science to papers stored in the IEEE, ACM and DBLP databases, all the latest papers in the IEEE, ACM and DBLP databases being aligned with AceKG.

Preferably, the step 3 comprises: performing reasoning through the relevant constraints in the business ontology framework, wherein the reasoning comprises category reasoning and attribute reasoning.

Preferably, the step 4 comprises: given a triple S(h, r, t) consisting of two entities h, t ∈ E and a relation r ∈ R, knowledge embedding maps each entity to a k-dimensional vector in an embedding space and defines a scoring function to evaluate the plausibility of the triple (h, r, t) in the knowledge graph, wherein E represents the set of all entities and R represents the set of relations between the entities.

Preferably, the step 5 comprises: given a network G = (V, E', A), where V denotes the set of vertices, E' denotes the network topology and A denotes the attributes of the nodes to be preserved, the task of network representation learning is to learn a mapping function:

$$f: v \mapsto r_v \in \mathbb{R}^d$$

wherein $r_v$ is the learned representation of the vertex v, d is the dimension of $r_v$, and $\mathbb{R}^d$ represents the d-dimensional real number space.

Preferably, the step 4 comprises:

step 4.1: preparing benchmark data sets, which comprise the existing benchmarks FB15K and WN18 and a new benchmark data set AK18K constructed from AceKG for knowledge embedding;

step 4.2: randomly dividing the extracted data into training/validation/test data sets and storing them;

step 4.3: implementing the models based on OpenKE and testing the link prediction results based on knowledge embedding.

Preferably, the step 5 comprises:

step 5.1: based on AceKG, 5 research fields and 5 sub-fields of each are selected;

step 5.2: all scholars, papers and venues in each research field are extracted, and 5 heterogeneous collaboration networks are constructed;

step 5.3: constructing two academic knowledge graphs;

step 5.4: performing a scholar classification task by logistic regression, and evaluating the classification results by Micro-F1 and Macro-F1 with 5-fold cross-validation;

step 5.5: based on the same node representations as in the scholar classification task, performing a scholar clustering experiment with the k-means algorithm to evaluate the performance of the models.

The system for constructing and applying the knowledge graph architecture provided by the invention comprises the following components:

module M1: the knowledge modeling is completed by defining the entity of the academic field and constructing the ontology of the academic knowledge map;

module M2: entity alignment is carried out, namely, for each entity in the heterogeneous data source knowledge base, the same entity belonging to the real world is found out;

module M3: enriching the knowledge graph by using a rule-based knowledge graph reasoning method;

module M4: evaluating several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG;

module M5: evaluating several state-of-the-art network representation learning methods on the knowledge graph architecture AceKG.

Compared with the prior art, the invention has the following beneficial effects:

1. AceKG not only provides pure academic information, but also provides a large-scale benchmark data set for researchers to carry out challenging data mining projects, including link prediction, community detection and scholar classification; it also helps academic researchers with collaboration prediction tasks.

2. The reference data set constructed by the AceKG also provides a foundation for evaluating knowledge embedding and network representation learning methods;

3. to enrich the proposed knowledge-graph architecture, the present invention also utilizes existing entity attribute information and rule-based reasoning to perform entity alignment.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a diagram of an AceKG ontology for large-scale knowledge-graph architecture for academic data mining;

FIG. 2 is a schematic diagram of rule-based inference, where dashed arrows are inference predicates;

FIG. 3 is a diagram of a specific engineering architecture to which the present invention relates.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention. All of these fall within the scope of protection of the present invention.

The invention designs and implements AceKG, a large-scale knowledge graph architecture for academic data mining, and involves technologies in the knowledge graph field such as knowledge extraction (including entity extraction, relation extraction, attribute extraction and the like), knowledge fusion (including entity disambiguation and the like), knowledge processing (including ontology framework, knowledge reasoning and the like) and knowledge updating; specifically, the method comprises the following steps:

step S1: knowledge modeling is completed by defining main entities of the academic network field and constructing an ontology of an academic knowledge graph.

Step S2: performing entity alignment based on the entity/value information from the first step and the schema of the domain-specific knowledge graph, namely, for each entity in the heterogeneous data source knowledge bases, finding the entities that refer to the same real-world object.

Step S3: the knowledge graph is enriched using a rule-based knowledge graph inference method.

Step S4: evaluating several state-of-the-art knowledge embedding methods on the new large-scale knowledge graph architecture proposed in this patent, AceKG, including TransE, TransH, DistMult, ComplEx, HolE, etc.

Step S5: evaluating several state-of-the-art network representation learning methods on the new large-scale knowledge graph architecture AceKG, including DeepWalk, PTE, LINE, metapath2vec and other related methods, mainly on two tasks: scholar classification and scholar clustering.

The step S1 includes:

Step S1.1: entity definition. All objects (e.g., papers, institutions, authors) are represented as entities in AceKG. Two entities may form a relationship. The common attributes of each entity (including numbers, dates, strings and other text) are also represented. Similar entities are grouped into classes. In total, AceKG defines 5 types of academic entities: papers, authors, research fields, venues and institutions. The common attributes of each entity and the relationships between the entities are described in the knowledge graph as triples. All required data is sourced from Acemap.

Step S1.2: handling synonyms and ambiguities. To handle synonyms and ambiguities, each entity in the defined classes is assigned a URI. For example, ace:7E7A3A69 and ace:7E0D6766 are two scholars who share the same name, Jiawei Han, one of whom is the influential data mining scientist. Compared with a data set that uses entity names directly to represent entities, AceKG can avoid errors caused by synonyms and ambiguities.
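
The role of URIs in disambiguation can be illustrated with a minimal sketch. It assumes triples are stored as (head, relation, tail) tuples; the two URIs are the ones quoted in the text, while the attribute name `author_name` and the literal `"Same Name"` are hypothetical placeholders, not names taken from AceKG.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    head: str      # h: URI of the head entity
    relation: str  # r: relation or attribute name
    tail: str      # t: URI of the tail entity, or a literal attribute value


# Two distinct scholars that share one display name.
# "author_name" and "Same Name" are illustrative assumptions.
triples = [
    Triple("ace:7E7A3A69", "author_name", "Same Name"),
    Triple("ace:7E0D6766", "author_name", "Same Name"),
]


def entities_named(name, triples):
    """Return the URIs of all entities whose name attribute equals `name`."""
    return {t.head for t in triples
            if t.relation == "author_name" and t.tail == name}


same_name = entities_named("Same Name", triples)
```

Because every fact is keyed by URI rather than by name, the two scholars stay separate even though a name lookup returns both.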

Step S1.3: ontology construction of the knowledge graph. FIG. 1 gives an overview of the ontology of AceKG, the large-scale knowledge graph architecture for academic data mining; the entities are connected to each other through their relationships, and the triple statistics for each entity type are shown in Table 1.

TABLE 1 triple statistics

The step S2 includes:

Step S2.1: in order to make AceKG more interconnected and comprehensive, most of the AceKG papers in the field of computer science are mapped to papers stored in the IEEE, ACM and DBLP databases. All the latest papers in these three databases are aligned with AceKG. Table 2 shows some mapping statistics. The knowledge graph is regularly updated with the latest academic information.

Table 2 node mapping statistics

The step S3 includes:

Step S3.1: rule-based knowledge graph reasoning is a typical and key approach to enriching knowledge graphs. The selected inference rules we designed are shown in FIG. 2. With these inference rules, new relationships can be defined on AceKG, thereby providing more comprehensive ground truth.
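
How a single inference rule adds new triples can be sketched as follows. The actual rules are those of FIG. 2, which is not reproduced here, so the rule below is a hypothetical example: if a paper is written by an author and the paper is in a field, the author is inferred to be in that field. The relation name `author_is_in_field` is an assumption made for illustration.

```python
def infer_author_fields(triples):
    """Apply one hypothetical rule:
    (p, paper_is_written_by, a) and (p, paper_is_in_field, f)
        => (a, author_is_in_field, f)
    Returns only the triples that are not already present.
    """
    authors_of = {}  # paper -> set of authors
    fields_of = {}   # paper -> set of fields
    for h, r, t in triples:
        if r == "paper_is_written_by":
            authors_of.setdefault(h, set()).add(t)
        elif r == "paper_is_in_field":
            fields_of.setdefault(h, set()).add(t)

    inferred = set()
    for paper, authors in authors_of.items():
        for field in fields_of.get(paper, ()):
            for author in authors:
                inferred.add((author, "author_is_in_field", field))
    return inferred - set(triples)


facts = [
    ("p1", "paper_is_written_by", "a1"),
    ("p1", "paper_is_in_field", "f1"),
    ("p2", "paper_is_written_by", "a2"),  # no field known, nothing inferred
]
new_facts = infer_author_fields(facts)
```

Each rule of this kind joins two existing relations on a shared entity, which is why the inferred predicates can be drawn as extra (dashed) edges on the ontology.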

The step S4 includes:

Step S4.1: preparing benchmark data sets. In addition to the existing benchmarks FB15K and WN18, a new benchmark data set AK18K is constructed from AceKG for knowledge embedding. To extract AK18K from AceKG, we first selected 68 important international venues (conferences and journals) and the influential papers published in these venues. We then added the triples of the corresponding authors, fields and institutions.

Table 3 shows the statistics of WN18, FB15K and AK18K. AK18K is sparser than FB15K but denser than WN18 (as indicated by the value of #Trip/#E), and it contains only 7 types of relationships. It is therefore used to evaluate the scalability of the models on a knowledge graph with a simple relationship structure but a large number of entities.

TABLE 3 data sets for use in knowledge embedding

Step S4.2: randomly dividing the extracted data into training/validation/test data sets and storing them; Table 3 shows the amount of training/validation/test data for each data set after random partitioning.
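
A random partition of this kind might look like the sketch below. The split ratios and the fixed seed are assumptions chosen for illustration; the text does not state the proportions used for AK18K.

```python
import random


def split_triples(triples, valid_ratio=0.1, test_ratio=0.1, seed=0):
    """Shuffle the triples and cut them into train/validation/test lists.

    The ratios and the seed are illustrative assumptions.
    """
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = int(len(shuffled) * valid_ratio)
    n_test = int(len(shuffled) * test_ratio)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


toy = [("e%d" % i, "r", "e%d" % (i + 1)) for i in range(100)]
train, valid, test = split_triples(toy)
```

Shuffling once and slicing guarantees that the three sets are disjoint and together cover every extracted triple.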

TABLE 4 Link prediction results on AK18K

Step S4.3: implementing the models based on OpenKE and testing the link prediction results based on knowledge embedding; Table 4 gives the link prediction results based on knowledge embedding. The optimal hyperparameter set is obtained by grid search over the hyperparameters reported in the literature.
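
The link prediction metrics discussed for Table 4 (Hits@k and filtered MRR) can be computed from the rank of each correct entity, as in the following sketch. This is a generic illustration and not OpenKE's own code; the function names and the toy scores are assumptions.

```python
def filtered_rank(scores, target, known_true=()):
    """Rank of `target` among candidates sorted by descending score.

    Candidates in `known_true` (other correct answers) are skipped,
    which is what the "filtered" setting means.
    """
    target_score = scores[target]
    better = sum(1 for cand, s in scores.items()
                 if cand != target and cand not in known_true and s > target_score)
    return 1 + better


def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Toy candidate scores for one query (hypothetical values).
scores = {"a": 0.9, "b": 0.8, "c": 0.7}
raw = filtered_rank(scores, "c")
filt = filtered_rank(scores, "c", known_true={"a"})

ranks = [1, 2, 5, 20]
hits10 = hits_at_k(ranks, 10)
mrr = mean_reciprocal_rank(ranks)
```

Hits@1 rewards only exact top predictions, whereas MRR also gives partial credit to near misses, which is why the two can favor different models.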

The state-of-the-art models can be divided into two categories:

(i) translational models (TransE, TransH);

(ii) compositional models (DistMult, HolE, ComplEx).
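
As an illustration of the first category, the scoring function of TransE treats a relation as a translation in the embedding space, so a plausible triple has h + r close to t. The sketch below uses the negative L2 distance as the score; the 2-dimensional vectors are hypothetical toy values, not embeddings learned from AK18K.

```python
import math


def transe_score(h, r, t):
    """Negative L2 distance between h + r and t; higher means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Toy 2-dimensional embeddings (hypothetical values).
paper = [0.1, 0.4]
written_by = [0.3, -0.2]
author = [0.4, 0.2]        # equals paper + written_by, so the triple fits
other_author = [-0.5, 0.9]

good = transe_score(paper, written_by, author)
bad = transe_score(paper, written_by, other_author)
```

In link prediction, every candidate tail entity is scored this way and the candidates are ranked by score.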

TransE outperformed all other models on Hits@10, reaching 89.2%. Although 94.4% of the relationships in our knowledge graph are many-to-many, a setting for which TransH is designed, TransE shows its advantage in modeling sparse, simple knowledge graphs, and TransH does not achieve better results. The reason may be that the number of relationship types is only 7, which is very small (see the article "Knowledge Graph Embedding by Translating on Hyperplanes", according to which TransH should be three times more numerous than TransE). On the other hand, HolE and ComplEx achieve the most significant performance on the other metrics, especially on Hits@1 (83.8% and 75.4%) and filtered MRR (0.482 and 0.440), which confirms their advantage in modeling antisymmetric relationships, since all of our relationships are antisymmetric, such as field_is_part_of and paper_is_written_by.

Compared with the experimental results on FB15K and WN18 reported in the article "Holographic Embeddings of Knowledge Graphs", the performance evaluated on AK18K differs noticeably. First, the results on AK18K are lower than those on WN18 but higher than those on FB15K. This is caused by the limited number of relationship types and the large number of potential entities per relationship. Some relationships, such as paper_is_in_field, may have thousands of possible objects per triple, which limits prediction performance. Second, as the knowledge graph becomes more complicated, the performance gap between the two categories of models becomes more and more obvious, which means that simple translational models may not be good at modeling complex graphs.

The step S5 includes:

Step S5.1: based on AceKG, 5 fields of study (FOS) and 5 main sub-fields of each are first selected; the fields of study chosen here are life sciences, computer science, economics, physics and pharmacology.

Step S5.2: all scholars, papers and venues in each of these fields of study are extracted, and 5 heterogeneous collaboration networks are constructed.

Step S5.3: two larger academic knowledge graphs are constructed:

(1) the 5 networks are integrated into one graph containing all the information of the 5 fields of study;

(2) the 8 venue categories of *** Scholar are matched to AceKG venues, and the related papers and scholars are selected to construct one large heterogeneous collaboration network.

in addition, the method for labeling the category of the scholars is as follows:

(a) to label the papers, we directly use the field-of-study information and the *** Scholar categories as the labels of the papers in the 6 FOS networks and the 1 *** Scholar network, respectively.

(b) the label of a scholar is determined by the majority label of the papers the scholar has published. When several labels are tied in number of papers, one of them is chosen at random.
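
The labeling rule in (b) amounts to a majority vote with random tie-breaking. A minimal sketch follows; the function name, the seeded random generator and the example labels are assumptions for illustration.

```python
import random
from collections import Counter


def scholar_label(paper_labels, rng=None):
    """Return the most frequent label among a scholar's papers.

    Ties are broken at random; a seeded generator is used by default so
    that repeated runs give the same labels (an illustrative choice).
    """
    rng = rng or random.Random(0)
    counts = Counter(paper_labels)
    top = max(counts.values())
    tied = sorted(label for label, c in counts.items() if c == top)
    return rng.choice(tied)


majority = scholar_label(["data mining", "data mining", "databases"])
tie = scholar_label(["data mining", "databases"])
```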

The statistical index values of the constructed 7 data sets are shown in table 5.

TABLE 5 data sets used in network representation learning

Step S5.4: performing a scholar classification task by logistic regression, and evaluating the classification results by Micro-F1 and Macro-F1 with 5-fold cross-validation;

It is worth mentioning that logistic regression has many similarities to multiple linear regression. Both models are built on the linear form wx + b, where w and b are the parameters to be learned; the difference lies in the dependent variable. Logistic regression maps wx + b to a hidden state p through a function L, i.e. p = L(wx + b). If L is the logistic function, the model is logistic regression; if L is a polynomial function, the model is polynomial regression.

Logistic regression is, in essence, a classification method for binary problems.

First, a suitable hypothesis function $h_\theta(x)$ is found. This is the classification function being sought, and it is used to predict the class of the input data. This step requires some knowledge and analysis of the data in order to know the basic characteristics of the prediction function, such as whether it is linear.

The logistic function (sigmoid function) is written as:

$$g(z) = \frac{1}{1 + e^{-z}}$$

The sigmoid function approximates a unit step function to a certain extent and is monotonic and differentiable. The hypothesis function is defined as:

$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$

wherein the value of $h_\theta(x)$ has a special meaning: it indicates the probability that the result takes the value 1. The probabilities of the two classes (y = 1 and y = 0) are therefore:

$$P(y = 1 \mid x; \theta) = h_\theta(x)$$

$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$

The two equations can be combined into a single expression:

$$P(y \mid x; \theta) = (h_\theta(x))^{y} (1 - h_\theta(x))^{1 - y}$$

Taking the likelihood function over m training samples $(x^{(i)}, y^{(i)})$:

$$L(\theta) = \prod_{i=1}^{m} (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$$

Taking the logarithm gives:

$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$

A cost function, i.e. a loss function, is then constructed to represent the deviation between the predicted output and the actual class of a training sample. Over all the data, the costs can be summed or averaged into a function J(θ), which represents the deviation between all predictions and the actual classes of the training data.

The cost function is minimized to obtain the optimal model parameters, namely the minimizer of J(θ), since the smaller the value of the function, the more accurate the prediction. A gradient descent method is generally used. The per-sample cost and the overall cost function are respectively:

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)), & y = 1 \\ -\log(1 - h_\theta(x)), & y = 0 \end{cases}$$

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right]$$
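
The procedure described above, a sigmoid hypothesis fitted by gradient descent on an averaged cross-entropy cost, can be sketched in a few lines. This is a minimal binary illustration on hypothetical toy data; the learning rate, the number of steps and the function names are assumptions, and the experiments in this document use node embeddings as features rather than these toy inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """h_theta(x): estimated probability that the label of x is 1."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def cost(theta, xs, ys):
    """J(theta): average cross-entropy between predictions and labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = predict(theta, x)
        total += -(y * math.log(h) + (1 - y) * math.log(1 - h))
    return total / len(xs)


def fit(xs, ys, lr=0.5, steps=500):
    """Batch gradient descent on J(theta), starting from theta = 0."""
    theta = [0.0] * len(xs[0])
    m = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            err = predict(theta, x) - y  # derivative of the cost w.r.t. theta^T x
            for j, xj in enumerate(x):
                grad[j] += err * xj / m
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


# Toy data: the first feature is a constant 1 (bias), the second carries the signal.
xs = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0, 0, 1, 1]
theta = fit(xs, ys)
```

Folding the bias into the parameter vector through a constant feature keeps the update rule identical for every parameter.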

The F1-score used as the evaluation index is defined as follows:

$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

wherein precision is the proportion of truly positive samples among the samples that the classifier judges to be positive, and recall is the proportion of all positive samples that are predicted to be positive.

TABLE 6 Scholar classification results

Table 6 shows the classification results evaluated by Micro-F1 and Macro-F1. metapath2vec learns heterogeneous node embeddings significantly better than the other methods. We attribute this to its modified heterogeneous sampling and skip-gram algorithm. However, DeepWalk and LINE also achieve comparable performance, showing their scalability on heterogeneous networks.

In addition, the index values also reflect the interdisciplinary level of these fields. For example, the highest Micro-F1 indicates that the sub-fields of biology are the most independent, while the lowest Micro-F1 indicates that the sub-fields of computer science overlap the most. Finally, the sharp drop from Micro-F1 to Macro-F1, especially in economics, indicates an imbalance among the sub-fields in some FOS.
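
Why class imbalance produces a gap between Micro-F1 and Macro-F1 can be seen in a small sketch. Micro-F1 pools the counts of all classes, whereas Macro-F1 averages the per-class F1 values, so a rare class that is never predicted pulls Macro-F1 down sharply. The labels below are hypothetical.

```python
def micro_macro_f1(y_true, y_pred):
    """Return (Micro-F1, Macro-F1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {label: 0 for label in labels}
    fp = {label: 0 for label in labels}
    fn = {label: 0 for label in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true class t was missed

    def f1(tp_, fp_, fn_):
        # 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall.
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# An imbalanced toy case: the rare class "b" is never predicted.
y_true = ["a", "a", "a", "a", "b"]
y_pred = ["a", "a", "a", "a", "a"]
micro, macro = micro_macro_f1(y_true, y_pred)
```

Here the classifier is right on four of five samples, yet the missed minority class halves the macro average.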

Step S5.5: based on the same node representations as in the scholar classification task, a scholar clustering experiment is further performed with the k-means algorithm to evaluate the performance of the models. The k-means algorithm is one of the most commonly used clustering algorithms. Its input is a set of samples (also called points), which it clusters so that samples with similar characteristics are grouped into one class. For each point, the nearest of all center points is found, and the point is assigned to the cluster represented by that center point. After one iteration, the center point of each cluster is recalculated, and the nearest center point is then searched for each point again. This loop continues until the cluster assignments no longer change between two consecutive iterations. The basic steps are as follows:

(1) select the number k of clusters (e.g. k = 3) and select k initial center points;

(2) for each sample point, find the nearest center point (that is, find the group it belongs to); the points closest to the same center point form one class, which completes one round of clustering;

(3) judge whether the class assignments of the sample points are the same before and after this round of clustering; if so, terminate the algorithm, otherwise go to (4);

(4) for the sample points in each class, calculate their center point as the new center point of that class, and return to (2).
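
Steps (1) to (4) translate almost directly into code. The sketch below is a plain-Python illustration on toy 2-dimensional points; the initial centers are passed in explicitly, which is an assumption, since the text does not say how they are chosen.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centers):
    """Index of the center closest to `point` (step 2)."""
    return min(range(len(centers)), key=lambda c: squared_distance(point, centers[c]))


def kmeans(points, initial_centers, max_iter=100):
    centers = [list(c) for c in initial_centers]  # step 1: k initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(p, centers) for p in points]  # step 2
        if new_assignment == assignment:                        # step 3: unchanged, stop
            break
        assignment = new_assignment
        for c in range(len(centers)):                           # step 4: recompute centers
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels, centers = kmeans(points, initial_centers=[[0, 0], [10, 10]])
```

Since the result depends on the initial centers, repeating the experiment and averaging, as done for the clustering results reported here, reduces the effect of any single initialization.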

All clustering experiments were performed 10 times and the average performance is reported.

Table 7 gives the clustering results evaluated by Normalized Mutual Information (NMI). In general, metapath2vec outperforms all other models, which shows that the modified heterogeneous sampling and skip-gram algorithm can better preserve the information in the knowledge graph.
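
NMI compares a clustering with the ground-truth classes without requiring the cluster ids to match the class ids. The sketch below normalizes the mutual information by the arithmetic mean of the two entropies; several normalizations are in use, and which one underlies Table 7 is not stated, so this choice is an assumption.

```python
import math
from collections import Counter


def entropy(counts, n):
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two label entropies."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((nij / n) * math.log(nij * n / (count_true[a] * count_pred[b]))
             for (a, b), nij in joint.items())
    denom = (entropy(count_true, n) + entropy(count_pred, n)) / 2
    return mi / denom if denom > 0 else 1.0


perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])    # same partition, ids swapped
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])  # clusters carry no class information
```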

TABLE 7 Scholar clustering results

Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A construction and application method of a knowledge graph framework is characterized by comprising the following steps:

2. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the entity definition of the academic field comprises 5 types of academic entities: papers, authors, research fields, venues and institutions; the common attributes of each entity and the relationships between entities are described as triples S(h, r, t);

3. The method for constructing and applying a knowledge-graph framework as claimed in claim 1, wherein the step 2 comprises mapping AceKG papers in the field of computer science to papers stored in the IEEE, ACM and DBLP databases, all the latest papers in the IEEE, ACM and DBLP databases being aligned with AceKG.

4. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 3 comprises: performing reasoning through the relevant constraints in the business ontology framework, wherein the reasoning comprises category reasoning and attribute reasoning.

5. The method for constructing and applying the knowledge-graph framework according to claim 2, wherein the step 4 comprises: given a triple S(h, r, t) consisting of two entities h, t ∈ E and a relation r ∈ R, mapping each entity to a k-dimensional vector in an embedding space, and defining a scoring function to evaluate the plausibility of the triple (h, r, t) in the knowledge graph, wherein E represents the set of all entities and R represents the set of relations between the entities.

6. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 5 comprises: given a network G = (V, E', A), where V denotes the set of vertices, E' denotes the network topology and A denotes the attributes of the nodes to be preserved, the task of network representation learning is to learn a mapping function:

7. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 4 comprises:

8. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 5 comprises:

step 5.1: based on AceKG, 5 research fields and 5 sub-fields of each are selected;

step 5.3: constructing two academic knowledge graphs;

9. A system for constructing and applying knowledge graph architecture is characterized by comprising:

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.