CN111444348A - Method, system and medium for constructing and applying knowledge graph architecture - Google Patents

Method, system and medium for constructing and applying knowledge graph architecture

Info

Publication number
CN111444348A
CN111444348A CN202010124150.9A CN202010124150A CN111444348A CN 111444348 A CN111444348 A CN 111444348A CN 202010124150 A CN202010124150 A CN 202010124150A CN 111444348 A CN111444348 A CN 111444348A
Authority
CN
China
Prior art keywords
knowledge
entity
graph
constructing
acekg
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010124150.9A
Other languages
Chinese (zh)
Inventor
亓杰星
李琦
傅洛伊
王新兵
陈贵海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202010124150.9A
Publication of CN111444348A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method, a system and a medium for constructing and applying a knowledge graph architecture, comprising the following steps: step 1: complete knowledge modeling by defining the entities of the academic field and constructing the ontology of the academic knowledge graph; step 2: perform entity alignment, that is, for each entity in the heterogeneous data source knowledge bases, find the entities that refer to the same real-world object; step 3: enrich the knowledge graph using a rule-based knowledge graph reasoning method; step 4: evaluate several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG; step 5: evaluate several state-of-the-art network representation learning methods on AceKG. The invention not only provides pure academic information but also provides researchers with a large-scale benchmark data set, provides a basis for evaluating knowledge embedding and network representation learning methods, and enriches the proposed knowledge graph architecture.

Description

Method, system and medium for constructing and applying knowledge graph architecture
Technical Field
The invention relates to the technical field of academic data mining, in particular to a method, a system and a medium for constructing and applying a knowledge graph architecture, and more particularly to AceKG, a large-scale knowledge graph for academic data mining.
Background
A knowledge graph is a structured semantic knowledge base that describes concepts in the physical world and their interrelationships in symbolic form. The basic composition unit is an entity-relation-entity triple, entities and related attribute value pairs thereof, and the entities are mutually connected through relations to form a network knowledge structure.
In the mid-20th century, Price et al. proposed using citation networks to study the course of contemporary scientific development, and in doing so first put forward the concept of a knowledge map. In 1977, the concept of knowledge engineering was proposed at the Fifth International Joint Conference on Artificial Intelligence, and knowledge base systems represented by expert systems were widely researched and applied. In the 1990s the concept of the institutional knowledge base was proposed, and research on knowledge representation and knowledge organization has been pursued in depth ever since. Institutional knowledge base systems are widely used for data integration and external publicity in various departments and institutions. In November 2012, Google pioneered the concept of the Knowledge Graph (KG) and announced that it would add knowledge graph functionality to its search results, with the aim of improving the capability of the search engine and enhancing search quality and the user's search experience. According to statistics from January 2015, the KG constructed by Google already contained 500 million entities and about 3.5 billion entity relationships, and it has been widely applied to improving the search quality of the search engine.
Although the term Knowledge Graph is relatively new, it is not an entirely new research field. As early as 2006, Berners-Lee proposed the idea of linked data and called for the promotion and refinement of related technical standards such as URI (Uniform Resource Identifier), RDF (Resource Description Framework) and OWL (Web Ontology Language), in preparation for the arrival of the Semantic Web.
Knowledge graphs have become an important resource supporting many applications related to artificial intelligence, such as graph analysis, question answering systems and web search. A knowledge graph describes and stores facts in the form of triples; it is a multi-relational graph composed of entities as nodes and relations as edges of different types. Many companies and research teams now attempt to organize the knowledge in their fields into machine-readable knowledge graphs. Although these large-scale knowledge graphs gather a great deal of factual information about the world, many areas remain to be studied.
Academic network data mining uses information about entities such as papers, scholars, institutions, venues and research fields to uncover hidden relationships and to discover information based on semantics. Using structured academic data, several academic databases and knowledge graphs have been constructed. A public academic knowledge graph can provide scholars with reliable academic information and provide researchers with large-scale benchmark data sets for data mining projects.
However, existing databases and knowledge graphs have some limitations. First, most existing efforts provide a homogeneous academic graph, so the relationships between different types of entities are lost. Second, some databases focus on only one particular research domain, which limits projects aimed at discovering cross-domain knowledge. Third, synonymy and ambiguity also hamper knowledge mining. Assigning unique IDs to entities is a necessary remedy, but some databases use the name of an entity directly as its ID.
In view of these shortcomings of the prior art, the present invention aims to provide AceKG, a large-scale knowledge graph architecture for academic data mining, which provides pure academic information and a large-scale benchmark data set to researchers for carrying out challenging data mining projects, including link prediction, community detection and scholar classification.
Patent document CN110347844A (application number: 201910633602.3) discloses a space target knowledge graph construction system, which comprises a text information collection and processing module, a document information collection and processing module, a knowledge graph construction module and a knowledge graph display module; aiming at the problems of high information confidentiality, strong specialization, difficult direct acquisition and the like of the space target, the method of data mining is adopted to research the space target information hidden behind massive information by acquiring related information of the space target such as news, microblogs, academic journals and the like, visually display the attribute information of the space target, analyze the mutual relation between the space targets, and integrate two types of knowledge maps by using software technology to construct a space target knowledge map system.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method, a system and a medium for constructing and applying a knowledge graph framework.
The method for constructing and applying the knowledge graph architecture provided by the invention comprises the following steps:
step 1: complete knowledge modeling by defining the entities of the academic field and constructing the ontology of the academic knowledge graph;
step 2: perform entity alignment, that is, for each entity in the heterogeneous data source knowledge bases, find the entities that refer to the same real-world object;
step 3: enrich the knowledge graph using a rule-based knowledge graph reasoning method;
step 4: evaluate several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG;
step 5: evaluate several state-of-the-art network representation learning methods on AceKG.
Preferably, the entity definition of the academic field comprises 5 classes of academic entities: papers, authors, research fields, venues and institutions; the common attributes of each entity and the relations between entities are described as triples S(h, r, t);
S denotes a triple; h denotes the head entity; r denotes the relation between the entities; t denotes the tail entity;
the ontology of the academic knowledge graph is built over these academic entities and comprises 31.3 million triples.
Preferably, step 2 comprises mapping the AceKG papers in the field of computer science to the papers stored in the IEEE, ACM and DBLP databases, all the latest papers in the IEEE, ACM and DBLP databases being kept consistent with AceKG.
Preferably, step 3 comprises: performing inference through the relevant constraints in the business ontology framework, the inference including category inference and attribute inference.
Preferably, step 4 comprises: given a triple S(h, r, t) consisting of two entities h, t ∈ E and a relation r ∈ R, knowledge embedding maps each entity to a k-dimensional vector in an embedding space and defines a scoring function on the knowledge graph to evaluate the plausibility of the triple (h, r, t), where E denotes the set of all entities and R denotes the set of relations between entities.
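As an illustration of such a scoring function, the sketch below uses the translation-based score of TransE, one of the models evaluated later in this description. The embedding values, entity names and dimension k = 3 are hypothetical and chosen by hand; in practice the vectors would be learned.

```python
import math

def transe_score(h, r, t):
    """TransE-style plausibility: negative L2 distance ||h + r - t||.

    A higher (less negative) score means a more plausible triple.
    """
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Hypothetical hand-picked embeddings with k = 3 (not learned, not from AceKG).
entity = {
    "paper_1": [0.0, 0.0, 0.0],
    "author_1": [1.0, 0.0, 0.0],
    "author_2": [0.0, 2.0, 0.0],
}
relation = {"paper_is_written_by": [1.0, 0.0, 0.0]}

r_vec = relation["paper_is_written_by"]
good = transe_score(entity["paper_1"], r_vec, entity["author_1"])
bad = transe_score(entity["paper_1"], r_vec, entity["author_2"])
```

Here the triple ending in `author_1` scores higher than the one ending in `author_2`, because `paper_1 + paper_is_written_by` lands exactly on `author_1` in the embedding space.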
Preferably, step 5 comprises: given a network G = (V, E', A), where V denotes the set of vertices, E' denotes the network topology and A denotes the node attributes to be preserved, the task of network representation learning is to learn a mapping function:
$$f: v \mapsto r_v \in \mathbb{R}^{d}$$
where $r_v$ is the learned representation of vertex $v$, $d$ is the dimension of $r_v$, and $\mathbb{R}^{d}$ denotes the d-dimensional real space.
Preferably, the step 4 comprises:
step 4.1: preparing benchmark data sets, wherein the established benchmark data sets comprise FB15K and WN18, and a new benchmark data set AK18K is extracted from AceKG for knowledge embedding;
step 4.2: randomly dividing the extracted data into training, validation and test sets and storing them;
step 4.3: writing code based on OpenKE and testing the link prediction results based on knowledge embedding.
Preferably, the step 5 comprises:
step 5.1: based on AceKG, selecting 5 research fields and 5 sub-fields of each;
step 5.2: extracting all scholars, papers and venues in each research field, and constructing 5 heterogeneous collaboration networks;
step 5.3: constructing two academic knowledge graphs;
step 5.4: performing a scholar classification task using logistic regression, and evaluating the classification results by Micro-F1 and Macro-F1 with 5-fold cross-validation;
step 5.5: using the same node representations as in the scholar classification task, performing a scholar clustering experiment with the k-means algorithm to evaluate the performance of the models.
The system for constructing and applying the knowledge graph architecture provided by the invention comprises the following components:
module M1: completes knowledge modeling by defining the entities of the academic field and constructing the ontology of the academic knowledge graph;
module M2: performs entity alignment, that is, for each entity in the heterogeneous data source knowledge bases, finds the entities that refer to the same real-world object;
module M3: enriches the knowledge graph using a rule-based knowledge graph reasoning method;
module M4: evaluates several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG;
module M5: evaluates several state-of-the-art network representation learning methods on AceKG.
Compared with the prior art, the invention has the following beneficial effects:
1. AceKG not only provides pure academic information but also provides researchers with a large-scale benchmark data set for carrying out challenging data mining projects, including link prediction, community detection and scholar classification; it also offers some help to academic researchers in collaboration prediction tasks.
2. The reference data set constructed by the AceKG also provides a foundation for evaluating knowledge embedding and network representation learning methods;
3. to enrich the proposed knowledge-graph architecture, the present invention also utilizes existing entity attribute information and rule-based reasoning to perform entity alignment.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a diagram of an AceKG ontology for large-scale knowledge-graph architecture for academic data mining;
FIG. 2 is a schematic diagram of rule-based inference, where dashed arrows are inference predicates;
FIG. 3 is a diagram of a specific engineering architecture to which the present invention relates.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.
The invention designs and implements AceKG, a large-scale knowledge graph architecture for academic data mining. It involves knowledge graph technologies including knowledge extraction (entity extraction, relation extraction, attribute extraction, etc.), knowledge fusion (entity disambiguation, etc.), knowledge processing (ontology framework, knowledge reasoning, etc.) and knowledge updating. Specifically, the method comprises the following steps:
Step S1: knowledge modeling is completed by defining the main entities of the academic network field and constructing the ontology of the academic knowledge graph.
Step S2: entity alignment is performed based on the entity/value information from step S1 and the schema of the domain-specific knowledge graph, that is, for each entity in the heterogeneous data source knowledge bases, the entities that refer to the same real-world object are found.
Step S3: the knowledge graph is enriched using a rule-based knowledge graph inference method.
Step S4: several state-of-the-art knowledge embedding methods, including TransE, TransH, DistMult, ComplEx and HolE, are evaluated on AceKG, the new large-scale knowledge graph architecture proposed in this patent.
Step S5: several state-of-the-art network representation learning methods, including DeepWalk, PTE, LINE and metapath2vec, are evaluated on AceKG, mainly through two tasks: scholar classification and scholar clustering.
The step S1 includes:
Step S1.1: entity definition. All objects (e.g., papers, institutions, authors) are represented as entities in AceKG. Two entities may form a relation. The common attributes of each entity (including numbers, dates, strings and other text) are also represented. Similar entities are grouped into classes. In total, AceKG defines 5 classes of academic entities: papers, authors, research fields, venues and institutions. The common attributes of each entity and the relations between entities are described in the knowledge graph as triples. All required data are sourced from Acemap.
Step S1.2: handling synonymy and ambiguity. To handle synonyms and ambiguous names, every entity defined above is assigned a URI. For example, ace:7E7A3A69 and ace:7E0D6766 are two distinct scholars who share the same name, one of whom is an influential data mining scientist. Compared with data sets that represent entities directly by their names, AceKG avoids the errors caused by synonymy and ambiguity.
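A minimal sketch of how URI-keyed triples keep same-named scholars apart. The two `ace:` author identifiers are the ones quoted in the text; the relation name `author_name`, the name string and the paper identifiers are hypothetical placeholders.

```python
from collections import namedtuple

# A triple S(h, r, t): head entity, relation, tail entity.
Triple = namedtuple("Triple", ["head", "relation", "tail"])

# Author URIs come from the text; every other value here is hypothetical.
triples = [
    Triple("ace:7E7A3A69", "author_name", "Same Name"),
    Triple("ace:7E0D6766", "author_name", "Same Name"),
    Triple("ace:PAPER_A", "paper_is_written_by", "ace:7E7A3A69"),
    Triple("ace:PAPER_B", "paper_is_written_by", "ace:7E0D6766"),
]

def uris_for_name(name, kg):
    """All author URIs that carry the given name string."""
    return sorted(t.head for t in kg if t.relation == "author_name" and t.tail == name)

def papers_by(author_uri, kg):
    """Papers attributed to exactly one author URI, regardless of the name."""
    return sorted(
        t.head for t in kg
        if t.relation == "paper_is_written_by" and t.tail == author_uri
    )
```

Looking up by name returns both URIs, while looking up papers by URI returns only the papers of that one scholar, which is the point of not using names as IDs.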
Step S1.3: ontology construction of the knowledge graph. FIG. 1 gives an overview of the ontology of AceKG, the large-scale knowledge graph architecture for academic data mining, in which the entities are linked to one another through their relations. The triple statistics for each entity class are shown in Table 1.
Table 1 Triple statistics
[Table 1 is reproduced only as an image in the original publication (Figure BDA0002393905230000061).]
The step S2 includes:
Step S2.1: to make AceKG more interconnected and comprehensive, most AceKG papers in the field of computer science are mapped to the corresponding papers stored in the IEEE, ACM and DBLP databases, and all the latest papers in these three databases are kept consistent with AceKG. Table 2 shows some mapping statistics. The knowledge graph is updated regularly with the latest academic information.
Table 2 Node mapping statistics
[Table 2 is reproduced only as an image in the original publication (Figure BDA0002393905230000062).]
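The text does not state how papers are matched across databases. Purely as an illustration of entity alignment, the sketch below matches records on a normalized title plus publication year; this matching key, the record layout and all identifiers are assumptions, not the method of the patent.

```python
import re

def normalize_title(title):
    """Lower-case the title and collapse every non-alphanumeric run to one space."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def align(acekg_papers, external_papers):
    """Map AceKG paper IDs to external IDs when (title, year) matches uniquely.

    Both inputs are lists of (paper_id, title, year) tuples (hypothetical layout).
    """
    index = {}
    for ext_id, title, year in external_papers:
        index.setdefault((normalize_title(title), year), []).append(ext_id)
    mapping = {}
    for ace_id, title, year in acekg_papers:
        candidates = index.get((normalize_title(title), year), [])
        if len(candidates) == 1:  # ambiguous or missing matches are left unmapped
            mapping[ace_id] = candidates[0]
    return mapping
```

Leaving ambiguous matches unmapped trades recall for precision; a real pipeline would likely add further evidence such as author lists or venue.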
The step S3 includes:
Step S3.1: rule-based knowledge graph reasoning is a typical and key approach to enriching knowledge graphs. The inference rules we designed are shown in FIG. 2. With these inference rules, we can define new relations on AceKG and thereby provide a more comprehensive set of facts.
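The rules of FIG. 2 are not reproduced in this text, so the sketch below applies one hypothetical rule of the same kind: if a paper is written by an author and the paper is in a field, infer that the author is in that field. The relation name `author_is_in_field` and the sample identifiers are assumptions made for illustration.

```python
def infer_author_fields(triples):
    """Apply one hypothetical rule over (head, relation, tail) tuples:

    (p, paper_is_written_by, a) and (p, paper_is_in_field, f)
        => (a, author_is_in_field, f)

    Returns only triples that are not already present in the input.
    """
    authors_of = {}
    fields_of = {}
    for h, r, t in triples:
        if r == "paper_is_written_by":
            authors_of.setdefault(h, set()).add(t)
        elif r == "paper_is_in_field":
            fields_of.setdefault(h, set()).add(t)

    inferred = set()
    for paper, authors in authors_of.items():
        for field in fields_of.get(paper, ()):
            for author in authors:
                inferred.add((author, "author_is_in_field", field))
    return inferred - set(triples)
```

Adding the returned triples back into the graph is what "enriching" means here: the new relation is derived entirely from facts that were already stored.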
The step S4 includes:
Step S4.1: benchmark data sets are prepared. The established benchmarks for knowledge embedding are mainly FB15K and WN18; in addition, a new benchmark data set, AK18K, is constructed from AceKG. To extract AK18K from AceKG, we first selected 68 important international venues (conferences and journals) and the influential papers published in them. We then added the triples of the corresponding authors, fields and institutes.
Table 3 shows the statistics of WN18, FB15K and AK18K. AK18K is sparser than FB15K but denser than WN18 (as indicated by the value of #Trip/#E), and it has only 7 relation types. It therefore lets us evaluate how well the models scale on a knowledge graph with a simple relation structure but a large number of entities.
Table 3 Data sets used in knowledge embedding
[Table 3 is reproduced only as an image in the original publication (Figure BDA0002393905230000063).]
Step S4.2: the extracted data are randomly divided into training, validation and test sets, which are then stored. Table 3 shows the number of training/validation/test triples in each data set after the random split.
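A minimal sketch of such a random split. The split fractions, the fixed seed and the tab-separated file layout are assumptions; the text does not specify them.

```python
import random

def split_triples(triples, valid_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the triples and cut them into train/validation/test lists."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(triples)
    rng.shuffle(shuffled)
    n_valid = round(len(shuffled) * valid_frac)
    n_test = round(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test

def save_triples(triples, path):
    """Store one triple per line as head<TAB>relation<TAB>tail."""
    with open(path, "w", encoding="utf-8") as f:
        for h, r, t in triples:
            f.write(f"{h}\t{r}\t{t}\n")
```

Each triple ends up in exactly one of the three lists, so no test triple leaks into training.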
Table 4 Link prediction results on AK18K
[Table 4 is reproduced only as an image in the original publication (Figure BDA0002393905230000071).]
Step S4.3: the code is written based on OpenKE, and link prediction based on knowledge embedding is tested. Table 4 gives the link prediction results. The best hyperparameter set was obtained by grid search, following the settings used in the literature.
The state-of-the-art models fall into two categories:
(i) translation models (TransE, TransH);
(ii) compositional models (DistMult, HolE, ComplEx).
TransE outperforms all other models on Hits@10, reaching 89.2%. Although 94.4% of the relations in our knowledge graph are many-to-many, a setting for which TransH is designed, TransE shows its advantage in modeling sparse, simple knowledge graphs, and TransH does not achieve better results. The reason may be that there are only 7 relation types, which is very few. The paper "Knowledge Graph Embedding by Translating on Hyperplanes" is cited here in connection with a threefold difference between TransH and TransE. HolE and ComplEx, on the other hand, achieve the best results on the other metrics, especially Hits@1 (83.8% and 75.4%) and filtered MRR (0.482 and 0.440), which confirms their advantage in modeling antisymmetric relations, since all of our relations are antisymmetric, for example field_is_part_of and paper_is_written_by.
Compared with the experimental results on FB15K and WN18 reported in "Holographic Embeddings of Knowledge Graphs", the performance measured on AK18K differs noticeably. First, the results on AK18K are lower than on WN18 but higher than on FB15K. This is caused by the limited number of relation types and the large number of candidate entities per relation. Some relations, such as paper_is_in_field, may have thousands of candidate objects for a triple, which limits prediction performance. Second, as the knowledge graph becomes more complex, the performance gap between the two categories of models becomes more and more obvious, which suggests that simple translation models may not be well suited to modeling complex graphs.
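The metrics named above can be sketched as follows. Given the rank of the correct entity for each test triple, Hits@k is the fraction of ranks that are at most k, and MRR is the mean of the reciprocal ranks. The filtered setting, which removes other known true triples from the candidates before ranking, is not shown. The example ranks and scores used with this sketch are hypothetical.

```python
def rank_of(true_entity, scores):
    """1-based rank of the true entity among candidates, by descending score."""
    true_score = scores[true_entity]
    better = sum(1 for e, s in scores.items() if e != true_entity and s > true_score)
    return 1 + better

def hits_at_k(ranks, k):
    """Fraction of test triples whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / r for r in ranks) / len(ranks)
```

Hits@1 rewards only exact top predictions, while MRR also gives partial credit to near misses, which is why the two can rank models differently.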
The step S5 includes:
Step S5.1: based on AceKG, 5 fields of study (FOS) and 5 main sub-fields of each were first selected. The fields chosen here are life sciences, computer science, economics, physics and pharmacology.
Step S5.2: all scholars, papers and venues in each of these fields are extracted, and 5 heterogeneous collaboration networks are constructed.
Step S5.3: two larger academic knowledge graphs are constructed;
(1) integrating the 5 networks into a graph containing all information of 5 research fields;
(2) the 8 venue categories of *** Scholar were matched to the venues in AceKG, and the related papers and scholars were selected to construct one large heterogeneous collaboration network.
In addition, the scholars were labeled with categories as follows:
(a) To label the papers, we directly used the field-of-study information and the *** Scholar categories as the paper labels in the 6 FOS networks and the 1 *** Scholar network, respectively.
(b) A scholar's label is the label that occurs most often among the papers he or she has published; when several labels are tied, one of them is selected at random.
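Rule (b) can be sketched as a majority vote with a random tie-break. The function name, the label strings and the seeded random generator are illustrative assumptions.

```python
import random
from collections import Counter

def scholar_label(paper_labels, rng=None):
    """Return the most frequent paper label; break ties uniformly at random."""
    rng = rng or random.Random(0)  # seeded by default so runs are repeatable
    counts = Counter(paper_labels)
    top = max(counts.values())
    tied = sorted(label for label, c in counts.items() if c == top)
    return tied[0] if len(tied) == 1 else rng.choice(tied)
```

For example, a scholar with two computer science papers and one physics paper is labeled with computer science, whereas one paper in each field leads to a random choice between the two.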
The statistical index values of the constructed 7 data sets are shown in table 5.
Table 5 Data sets used in network representation learning
[Table 5 is reproduced only as an image in the original publication (Figure BDA0002393905230000081).]
Step S5.4: a scholar classification task is performed using logistic regression, and the classification results are evaluated by Micro-F1 and Macro-F1 with 5-fold cross-validation.
It is worth mentioning that logistic regression has much in common with multiple linear regression. The model is likewise wx + b, where w and b are the parameters to be learned; the difference lies in the dependent variable. Logistic regression maps wx + b to a hidden state p through a function L, i.e. p = L(wx + b). If L is the logistic function, the model is logistic regression; if L is a polynomial function, it is polynomial regression.
Logistic regression is in essence a classification method for binary problems.
First, a suitable hypothesis function $h_\theta(x)$ must be found. This is the classification function we are looking for, and it is used to predict the outcome for input data. Choosing it requires some knowledge and analysis of the data, so that the basic characteristics of the prediction function are known, for example whether the function is linear.
The logistic function (sigmoid function) is written as:
$$g(z) = \frac{1}{1 + e^{-z}}$$
The sigmoid function approximates a unit step function to a certain extent and is monotonic and differentiable. The hypothesis function is defined as:
$$h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}}$$
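where the value of $h_\theta(x)$ has a special meaning: it is the probability that the result is 1. The probabilities of the two classes, y = 1 and y = 0, are therefore:
$$P(y=1 \mid x;\theta) = h_\theta(x)$$
$$P(y=0 \mid x;\theta) = 1 - h_\theta(x)$$
Using the binomial (Bernoulli) form, the two equations can be written as one:
$$P(y \mid x;\theta) = \left(h_\theta(x)\right)^{y}\left(1 - h_\theta(x)\right)^{1-y}$$
Taking the likelihood function over the $m$ training samples:
$$L(\theta) = \prod_{i=1}^{m} P\left(y^{(i)} \mid x^{(i)};\theta\right) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}}\left(1 - h_\theta(x^{(i)})\right)^{1-y^{(i)}}$$
Taking the logarithm gives:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^{m}\left[y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$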
A cost function, i.e. a loss function, is then constructed to represent the deviation between the predicted output and the actual class of a training sample. When all the data are considered, the costs are summed or averaged and written as J(θ), which represents the deviation between all predictions and the actual classes of the training data.
Minimizing the cost function yields the optimal model parameters, that is, the minimizer of J(θ): the smaller the value of the function, the more accurate the prediction. Gradient descent is generally used for the minimization. The per-sample cost and the overall loss function over the $m$ training samples are, respectively:
$$\mathrm{Cost}\left(h_\theta(x), y\right) = \begin{cases} -\log\left(h_\theta(x)\right), & y = 1 \\ -\log\left(1 - h_\theta(x)\right), & y = 0 \end{cases}$$
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right]$$
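The procedure above can be sketched as batch gradient descent on the averaged cross-entropy J(θ) of a binary logistic regression. The toy data, the learning rate and the number of steps are assumptions chosen only so that the example runs quickly; a multi-class scholar classifier would need an extension such as one-vs-rest, which is not shown.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), read as P(y = 1 | x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

def cost(theta, X, y):
    """Averaged cross-entropy J(theta); eps guards against log(0)."""
    eps = 1e-12
    total = 0.0
    for xi, yi in zip(X, y):
        h = predict_proba(theta, xi)
        total -= yi * math.log(h + eps) + (1 - yi) * math.log(1 - h + eps)
    return total / len(X)

def fit(X, y, lr=0.5, steps=2000):
    """Batch gradient descent: theta_j -= lr * mean((h - y) * x_j)."""
    theta = [0.0] * len(X[0])
    m = len(X)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for xi, yi in zip(X, y):
            err = predict_proba(theta, xi) - yi
            for j, xij in enumerate(xi):
                grad[j] += err * xij / m
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: the first column is a constant bias feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 3.0], [1.0, 4.0]]
y = [0, 0, 1, 1]
theta = fit(X, y)
```

After fitting, `cost(theta, X, y)` is lower than the cost at the all-zero starting point, and `predict_proba` returns values below 0.5 for the two negative samples and above 0.5 for the two positive ones.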
The F1-score used as the evaluation index is defined as:
$$F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
where precision is the proportion of truly positive samples among the samples that the classifier judges to be positive, and recall is the proportion of all positive samples that are predicted to be positive.
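A short sketch of the two averaged variants used in the evaluation. Micro-F1 pools the true-positive, false-positive and false-negative counts over all classes before applying the formula, while Macro-F1 computes F1 per class and then averages. It uses the equivalent form F1 = 2TP / (2TP + FP + FN); the label values in the usage example are hypothetical.

```python
def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p wrongly
            fn[t] += 1  # missed the true class t

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

micro, macro = f1_scores(["a", "a", "a", "b"], ["a", "a", "a", "a"])
```

In this imbalanced example the classifier never predicts the rare class, so Micro-F1 stays at 0.75 while Macro-F1 falls to 3/7, which is the kind of gap between the two scores that signals class imbalance.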
Table 6 Scholar classification results
[Table 6 is reproduced only as an image in the original publication (Figure BDA0002393905230000103).]
Table 6 shows the classification results evaluated by Micro-F1 and Macro-F1. metapath2vec learns heterogeneous node embeddings significantly better than the other methods; we attribute this to its improved heterogeneous sampling and skip-gram algorithms. However, DeepWalk and LINE also achieve comparable performance, which shows their scalability on heterogeneous networks.
In addition, the index values reflect how interdisciplinary these fields are. For example, the highest Micro-F1 indicates that the sub-fields of biology are the most independent, while the lowest Micro-F1 indicates that the sub-fields of CS overlap the most. Finally, the sharp drop from Micro-F1 to Macro-F1, especially in economics, indicates an imbalance among the sub-fields in some FOS.
Step S5.5: using the same node representations as in the scholar classification task, a scholar clustering experiment is carried out with the k-means algorithm to further evaluate the models. The k-means algorithm is one of the most commonly used clustering algorithms. Its input is a set of samples (also called a set of points), which it clusters so that samples with similar characteristics are grouped into one class. For each point, the nearest of all the center points is found, and the point is assigned to the cluster represented by that center point. After an iteration is finished, the center point of each cluster is recalculated, and the nearest center point is then searched for again for each point. This loop continues until the cluster assignments no longer change between two consecutive iterations. The basic steps are as follows:
(1) select the number k of clusters (e.g. k = 3) and select k initial center points;
(2) for each sample point, find the nearest center point; the points closest to the same center point form one class, which completes one round of clustering;
(3) check whether the class assignments of the sample points are the same before and after this round of clustering; if so, terminate the algorithm, otherwise go to (4);
(4) for the sample points in each class, compute their mean as the new center point of that class, and return to (2).
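The four steps above can be sketched in plain Python as follows. Choosing the initial centers by random sampling from the data, the fixed seed, the iteration cap and the toy 2-D points are assumptions; the text does not say how the initial centers are chosen.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Cluster tuples of floats; returns (assignment list, list of centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step (1): pick k initial center points
    assignment = None
    for _ in range(max_iter):
        # Step (2): assign every point to its nearest center (squared distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Step (3): stop when the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step (4): move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(points, 2)
```

On this toy data the first three points end up in one cluster and the last three in the other, whichever two points are drawn as initial centers.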
All clustering experiments were performed 10 times and the average performance is reported.
Table 7 gives the clustering results evaluated by normalized mutual information (NMI). Overall, metapath2vec outperforms all other models, which shows that the improved heterogeneous sampling and skip-gram algorithms better preserve the information in the knowledge graph.
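For reference, NMI between true labels and cluster assignments can be computed as sketched below. Normalizing by the geometric mean of the two entropies is an assumption; the text does not say which normalization was used, and other variants (for example the arithmetic mean) give different values for imperfect clusterings.

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """Normalized mutual information, I(T; P) / sqrt(H(T) * H(P))."""
    n = len(labels_true)
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    # Mutual information from the joint and marginal label counts.
    mi = sum(
        (nij / n) * math.log(nij * n / (count_true[a] * count_pred[b]))
        for (a, b), nij in joint.items()
    )
    h_true = -sum((c / n) * math.log(c / n) for c in count_true.values())
    h_pred = -sum((c / n) * math.log(c / n) for c in count_pred.values())
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case: at least one labeling puts everything in one group.
        return 1.0 if h_true == h_pred else 0.0
    return mi / math.sqrt(h_true * h_pred)
```

NMI is invariant to how the clusters are numbered, so a clustering that matches the true classes under any relabeling scores 1, and a clustering that is independent of the true classes scores 0.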
Table 7 Scholar clustering results
[Table 7 is reproduced only as an image in the original publication (Figure BDA0002393905230000111).]
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A method for constructing and applying a knowledge graph architecture, characterized by comprising the following steps:
step 1: completing knowledge modeling by defining the entities of the academic field and constructing the ontology of the academic knowledge graph;
step 2: performing entity alignment, that is, for each entity in the heterogeneous data source knowledge bases, finding the entities that refer to the same real-world object;
step 3: enriching the knowledge graph by using a rule-based knowledge graph reasoning method;
step 4: evaluating several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG;
step 5: evaluating several state-of-the-art network representation learning methods on the knowledge graph architecture AceKG.
2. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the entity definitions of the academic field comprise 5 types of academic entities: papers, authors, research fields, venues and institutions; the common attributes of each entity and the relationships between entities are described as triples S(h, r, t);
wherein S represents a triple; h represents a head entity; r represents a relationship between entities; t represents a tail entity;
the constructed ontology of the academic knowledge graph covers the academic entities and comprises 31.3 million triples.
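The triple representation S(h, r, t) can be illustrated with a small data structure. This is a sketch only: the entity identifiers, the "type:id" naming scheme, and the relation names are hypothetical and are not taken from the AceKG ontology.

```python
from typing import List, NamedTuple

# The five entity types; "venue" here means a publication venue
# (conference or journal). The short type names are assumptions.
ENTITY_TYPES = {"paper", "author", "field", "venue", "institute"}


class Triple(NamedTuple):
    head: str      # h: head entity
    relation: str  # r: relationship between entities
    tail: str      # t: tail entity


def entity_type(entity_id: str) -> str:
    """Return the type prefix of a hypothetical 'type:id' identifier."""
    prefix = entity_id.split(":", 1)[0]
    if prefix not in ENTITY_TYPES:
        raise ValueError(f"unknown entity type: {prefix}")
    return prefix


def tails(triples: List[Triple], head: str, relation: str) -> List[str]:
    """All tail entities t such that (head, relation, t) is in the graph."""
    return [t.tail for t in triples if t.head == head and t.relation == relation]


# Illustrative facts only.
example_graph = [
    Triple("paper:P1", "written_by", "author:A1"),
    Triple("paper:P1", "written_by", "author:A2"),
    Triple("paper:P1", "published_in", "venue:V1"),
    Triple("author:A1", "works_in", "institute:I1"),
    Triple("paper:P1", "in_field", "field:F1"),
]
```

For example, `tails(example_graph, "paper:P1", "written_by")` returns the two author identifiers of the illustrative paper.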
3. The method for constructing and applying a knowledge-graph framework as claimed in claim 1, wherein the step 2 comprises mapping the AceKG papers in the field of computer science to the papers stored in the IEEE, ACM and DBLP databases, so that all the latest papers in the IEEE, ACM and DBLP databases are aligned with AceKG.
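One simple way to map papers across databases is sketched below. The matching rule (normalized title plus publication year, keeping only unambiguous matches) is an assumed heuristic for illustration; the claim does not specify how the mapping is performed, and the record fields `id`, `title` and `year` are hypothetical.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def align_papers(acekg_papers, external_papers):
    """Map AceKG paper ids to external ids (e.g. IEEE, ACM or DBLP records).

    Each record is a dict with the assumed keys 'id', 'title' and 'year'.
    """
    index = {}
    for rec in external_papers:
        key = (normalize_title(rec["title"]), rec["year"])
        index.setdefault(key, []).append(rec["id"])

    alignment = {}
    for rec in acekg_papers:
        key = (normalize_title(rec["title"]), rec["year"])
        matches = index.get(key, [])
        if len(matches) == 1:  # keep unambiguous matches only
            alignment[rec["id"]] = matches[0]
    return alignment
```

Ambiguous keys are skipped rather than guessed, so a production system would need a further disambiguation step (for instance, comparing author lists).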
4. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 3 comprises: performing inference through the relevant constraints in the business ontology framework, wherein the inference comprises category inference and attribute inference.
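Rule-based enrichment of this kind can be sketched as forward chaining over triples. The rule format and the relation names below are hypothetical; the claim states only that category inference and attribute inference are derived from ontology constraints, so this is an assumed illustration rather than the patented procedure.

```python
def apply_rules(triples, rules):
    """Forward-chain composition rules until no new fact is derived.

    Each rule (r1, r2, r_new) reads: if (x, r1, y) and (y, r2, z) hold,
    then infer (x, r_new, z).
    """
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # Index tails by (head, relation) for fast joins.
        by_head = {}
        for h, r, t in facts:
            by_head.setdefault((h, r), set()).add(t)
        new_facts = set()
        for h, r, t in facts:
            for r1, r2, r_new in rules:
                if r == r1:
                    for z in by_head.get((t, r2), ()):
                        candidate = (h, r_new, z)
                        if candidate not in facts:
                            new_facts.add(candidate)
        if new_facts:
            facts |= new_facts
            changed = True
    return facts


# Hypothetical rules: a category rule (fields inherit along the field
# hierarchy) and an attribute rule (a paper inherits its author's institute).
EXAMPLE_RULES = [
    ("in_field", "subfield_of", "in_field"),
    ("written_by", "works_in", "has_institute"),
]
```

The loop terminates because the set of possible triples over a finite set of entities and relations is finite.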
5. The method for constructing and applying the knowledge-graph framework according to claim 2, wherein the step 4 comprises: given a triple S(h, r, t) consisting of entities h, t ∈ E and a relation r ∈ R, mapping each entity to a k-dimensional vector in an embedding space, and defining a scoring function on the knowledge graph to evaluate the plausibility of the triple (h, r, t), wherein E represents the set of all entities and R represents the set of relations between the entities.
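As one concrete instance of such a scoring function, the sketch below uses a translation-based (TransE-style) score, the negative L2 distance between h + r and t. The claim does not name a specific model, so the choice of TransE, the embedding of relations as vectors, and the helper names are assumptions made for illustration.

```python
import math
import random


def init_embeddings(names, k, seed=0):
    """Map each name to a random k-dimensional vector (untrained)."""
    rng = random.Random(seed)
    return {name: [rng.uniform(-1.0, 1.0) for _ in range(k)] for name in names}


def transe_score(h_vec, r_vec, t_vec):
    """TransE-style plausibility: -||h + r - t||_2 (higher is more plausible)."""
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(h_vec, r_vec, t_vec)))


def rank_of_true_tail(h, r, true_t, entity_vecs, relation_vecs):
    """Rank (1 = best) of the true tail among all candidate tail entities."""
    true_score = transe_score(entity_vecs[h], relation_vecs[r], entity_vecs[true_t])
    better = sum(
        1
        for cand, vec in entity_vecs.items()
        if cand != true_t
        and transe_score(entity_vecs[h], relation_vecs[r], vec) > true_score
    )
    return better + 1
```

Ranking the true tail against all candidate entities is the usual basis for link-prediction metrics such as mean rank and Hits@10.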
6. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 5 comprises: given a network G = (V, E', A), where V denotes the set of vertices, E' denotes the network topology, and A denotes the node attributes to be preserved, the task of network representation learning is to learn a mapping function:
$f: v \mapsto r_v \in \mathbb{R}^d$
wherein $r_v$ is the learned representation of the vertex v, d is the dimension of $r_v$, and $\mathbb{R}^d$ denotes the d-dimensional real space.
7. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 4 comprises:
step 4.1: extracting a benchmark data set from AceKG, wherein the benchmark data sets comprise FB15K and WN18, and constructing a new benchmark data set AK18K from AceKG for knowledge embedding;
step 4.2: randomly dividing the extracted data into training/validation/test data sets and storing them;
step 4.3: implementing the code based on OpenKE and testing the link prediction results based on knowledge embedding.
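The random division in step 4.2 can be sketched as follows. The split fractions, the fixed seed, and the tab-separated output format are assumptions for illustration; the claim does not specify them, and this sketch does not reproduce the file layout that OpenKE expects.

```python
import random


def split_triples(triples, valid_frac=0.1, test_frac=0.1, seed=42):
    """Randomly split triples into disjoint train/valid/test lists."""
    if valid_frac < 0 or test_frac < 0 or valid_frac + test_frac >= 1:
        raise ValueError("fractions must be non-negative and sum to less than 1")
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_valid = int(len(shuffled) * valid_frac)
    n_test = int(len(shuffled) * test_frac)
    valid = shuffled[:n_valid]
    test = shuffled[n_valid:n_valid + n_test]
    train = shuffled[n_valid + n_test:]
    return train, valid, test


def write_triples(triples, file_obj):
    """Store triples one per line as 'head<TAB>relation<TAB>tail'."""
    for h, r, t in triples:
        file_obj.write(f"{h}\t{r}\t{t}\n")
```

Fixing the seed keeps the three partitions stable across runs, so that different embedding models are compared on the same test triples.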
8. The method for constructing and applying a knowledge-graph framework according to claim 1, wherein the step 5 comprises:
step 5.1: selecting 5 research fields and 5 sub-fields based on AceKG;
step 5.2: extracting all scholars, papers and venues in each research field, and constructing 5 heterogeneous collaboration networks;
step 5.3: constructing two academic knowledge graphs;
step 5.4: performing a scholar classification task by logistic regression, and evaluating the classification results with Micro-F1 and Macro-F1 under 5-fold cross-validation;
step 5.5: based on the same node representations used in the scholar classification task, performing a scholar clustering experiment with the k-means algorithm to evaluate the performance of the models.
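The evaluation in step 5.4 can be sketched as below: Micro-F1 pools true positives, false positives and false negatives over all classes, while Macro-F1 averages the per-class F1 scores. The logistic regression classifier itself is omitted; the function names and the contiguous fold layout are assumptions for illustration.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return micro, macro


def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) for k contiguous folds over n samples."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size
```

In practice the samples would be shuffled before folding, and the two scores would be averaged over the 5 folds.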
9. A system for constructing and applying a knowledge graph architecture, characterized by comprising:
module M1: completing knowledge modeling by defining the entities of the academic field and constructing the ontology of the academic knowledge graph;
module M2: performing entity alignment, that is, for each entity in the heterogeneous data source knowledge bases, finding the entities that refer to the same real-world object;
module M3: enriching the knowledge graph by using a rule-based knowledge graph reasoning method;
module M4: evaluating several state-of-the-art knowledge embedding methods on the knowledge graph architecture AceKG;
module M5: evaluating several state-of-the-art network representation learning methods on the knowledge graph architecture AceKG.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202010124150.9A 2020-02-27 2020-02-27 Method, system and medium for constructing and applying knowledge graph architecture Pending CN111444348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010124150.9A CN111444348A (en) 2020-02-27 2020-02-27 Method, system and medium for constructing and applying knowledge graph architecture

Publications (1)

Publication Number Publication Date
CN111444348A (en) 2020-07-24

Family

ID=71627078

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010124150.9A Pending CN111444348A (en) 2020-02-27 2020-02-27 Method, system and medium for constructing and applying knowledge graph architecture

Country Status (1)

Country Link
CN (1) CN111444348A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109376249A (en) * 2018-09-07 2019-02-22 桂林电子科技大学 A kind of knowledge mapping embedding grammar based on adaptive negative sampling
CN110245238A (en) * 2019-04-18 2019-09-17 上海交通大学 The figure embedding grammar and system of Process Based and syntax schema

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
张秋颖: "万物互联:学术数据的互联、挖掘与可视化", 《物联网学报》 *
张秋颖: "万物互联:学术数据的互联、挖掘与可视化", 《物联网学报》, 31 December 2018 (2018-12-31), pages 57 - 60 *
张秋颖等: "万物互联:学术数据的互联、挖掘和可视化", 物联网学报, pages 159 - 160 *
王炳顺, 上海交通大学出版社 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084389A (en) * 2020-08-17 2020-12-15 上海交通大学 Network crawler-based academic institution geographical position information extraction method
CN112115261A (en) * 2020-08-21 2020-12-22 浙江工商大学 Knowledge graph data expansion method based on symmetry and reciprocal relation statistics
CN111813874A (en) * 2020-09-03 2020-10-23 中国传媒大学 Terahertz knowledge graph construction method and system
CN111813874B (en) * 2020-09-03 2023-09-15 中国传媒大学 Terahertz knowledge graph construction method and system
CN112307217A (en) * 2020-09-16 2021-02-02 北京中兵数字科技集团有限公司 Knowledge graph model construction method and device, and storage medium
CN112149400B (en) * 2020-09-23 2021-07-27 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112149400A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112017776A (en) * 2020-10-27 2020-12-01 平安科技(深圳)有限公司 Disease prediction method based on dynamic graph and medical knowledge map and related equipment
CN112017776B (en) * 2020-10-27 2021-01-15 平安科技(深圳)有限公司 Disease prediction method based on dynamic graph and medical knowledge map and related equipment
CN112598428A (en) * 2020-12-25 2021-04-02 北京知因智慧科技有限公司 Transaction data processing method and device, computer equipment and storage medium
CN112800243A (en) * 2021-02-04 2021-05-14 天津德尔塔科技有限公司 Project budget analysis method and system based on knowledge graph
CN112967559B (en) * 2021-03-29 2021-12-28 北京航空航天大学 Assembly skill direct generation method based on virtual assembly environment
CN112967559A (en) * 2021-03-29 2021-06-15 北京航空航天大学 Assembly skill direct generation method based on virtual assembly environment
CN113191497A (en) * 2021-05-28 2021-07-30 国家电网有限公司 Knowledge graph construction method and system for substation stepping exploration site selection
CN113191497B (en) * 2021-05-28 2024-04-23 国家电网有限公司 Knowledge graph construction method and system for substation site selection
CN116069895A (en) * 2023-01-31 2023-05-05 首都医科大学附属北京儿童医院 Knowledge base system for medical institutions of pediatric science

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination