CN114153983A - Multi-source construction method of industry knowledge graph

Info

Publication number: CN114153983A
Application number: CN202111353417.2A
Authority: CN
Inventors: 何伟; 李小超; 谢水庚; 冀天宇; 郝志强
Original assignee: Beijing Casicloud Co ltd
Current assignee: Beijing Casicloud Co ltd
Priority date: 2021-11-16
Filing date: 2021-11-16
Publication date: 2022-03-08

Abstract

The invention discloses a multi-source construction method for an industry knowledge graph, comprising the following steps: S1, extracting industry concepts and entities from four knowledge sources, namely open knowledge bases, online encyclopedias, industry texts, and industry structured data; S2, merging synonymous concepts and entities; S3, extracting the hypernym-hyponym (superior-inferior) relations between concepts; S4, extracting the non-hypernym-hyponym relations and attribute relations of concepts and entities. The method addresses the problems of existing construction methods: heavy manual workload, high consumption of computing resources, excessive fragmented information, incomplete data, and difficulty in extracting and fusing knowledge from different sources in a differentiated manner. Once a target ontology has been constructed, entities and attributes are extracted with a strategy tailored to each data source, taking the characteristics of knowledge from each source into account, and the knowledge graph is constructed semi-automatically with machine learning methods. This greatly reduces the manual effort of building a large-scale knowledge graph while maintaining accuracy.

Description

Multi-source construction method of industry knowledge graph

Technical Field

The invention relates to the technical field of artificial intelligence text processing, in particular to a multi-source construction method of an industry knowledge graph.

Background

An industry knowledge graph contains a large amount of structured information and is usually used for analytical applications or decision support, so it must be highly accurate. Large-scale knowledge graphs are currently constructed in two ways: synchronization with a database and synchronization with an online encyclopedia. The first approach stores the knowledge graph in a specific structure, downloads a large amount of data, and, after manual integration, constructs the graph by sub-graph fusion. This approach is labor intensive, consumes significant computing resources, and cannot guarantee data security during construction. The second approach uses a web crawler to collect data and extract information from related, similar sources. Processing a large number of web pages produces excessive fragmented information, and most websites block crawlers, so the data is incomplete. For a multi-source knowledge graph, knowledge from industry texts, from linked open data sets and knowledge bases, and from encyclopedias has different characteristics, and existing construction methods have difficulty extracting and fusing knowledge from different sources in a differentiated manner.

Disclosure of Invention

Aiming at the technical problems in the related art, the invention provides a multi-source construction method of an industry knowledge graph, which can overcome the defects in the prior art.

In order to achieve the technical purpose, the technical scheme of the invention is realized as follows:

A multi-source construction method of an industry knowledge graph comprises the following steps:

S1: extracting industry concepts and entities from four knowledge sources, namely open knowledge bases, online encyclopedias, industry texts, and industry structured data;

S2: merging synonymous concepts and entities;

S3: extracting the hypernym-hyponym (superior-inferior) relations between concepts;

S4: extracting the non-hypernym-hyponym relations and attribute relations of concepts and entities.

Further, S1 includes the following steps:

S11: collecting the industry core concepts and entities from existing linked open data sets and open knowledge bases, which include DBPedia, YAGO, and Zhishi.me;

S12: collecting the category labels of the classification systems in Wikipedia, encyclopedia, and interactive encyclopedia as concepts and the titles of encyclopedia articles as entity candidates, and using the corresponding introduction text in the online encyclopedia as the abstract of the concept or entity;

S13: for the industry text corpus, finding a keyword set using word-frequency statistics, RAKE, TextRank, and TF-IDF, and preliminarily screening the industry core concepts from the keyword set with the assistance of industry experts;

S14: for the industry structured data, using the D2R Server tool to map the relevant tables in the relational database, and the columns of those tables, to conceptual entities and to attributes of those entities, respectively;

S15: integrating the industry concepts and entities obtained from the four routes S11 to S14.
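
The keyword step in S13 can be sketched for one of the four named methods, TF-IDF. This is a minimal illustration rather than the method's actual implementation: the regex tokenizer stands in for a real word segmenter, the three-sentence corpus is invented, and the expert screening that follows is not modeled.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for a real word segmenter."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Hypothetical industry corpus, for illustration only.
docs = [
    "the spindle drives the cutting tool and the spindle speed is adjustable",
    "the coolant pump cools the cutting tool",
    "the operator reads the manual",
]
candidates = tfidf_keywords(docs, top_k=2)
```

Terms that occur in every document (here "the") receive a score of zero and drop out, which is why the candidate list would still need expert review before being treated as industry core concepts.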

Further, S2 includes the following steps:

S21: identifying the synonymy relations stated explicitly in the linked open data: DBPedia uses "owl:sameAs" to identify synonymous entities, YAGO uses "means" to identify synonymous entities, and Zhishi.me uses "pageRedirects" to identify the redirect pages of synonymous entities;

S22: for encyclopedias, merging the concepts learned within the same encyclopedia: traversing the entity pages in the encyclopedia, identifying page titles that share the same redirect label as the same entity, and identifying the values of the "alias" and "Chinese alias" fields in the entity page information as the same entity;

determining whether same-named entities in different online encyclopedias are synonymous: for page articles in different online encyclopedias, articles that have the same title and an article content similarity above 80% are marked as pages corresponding to the same entity or concept, and the entities or concepts corresponding to the article titles are marked as synonymous;

S23: extracting synonymy relations from industry text: first, defining sentence-pattern rules that describe synonymy, such as "X is also named Y" and "X is also called Y", and matching them against the industry text to extract synonymy relations; then performing word segmentation and part-of-speech tagging on the text with an NLP tool, obtaining training data from the extracted synonymy relations, modeling with the BiLSTM-CRF algorithm, and extracting the synonymy relations;

S24: merging the synonymy relations obtained from the three routes S21 to S23: if synonymy relations obtained from different routes share a concept or entity, the two synonymy relations are merged.
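
For the industry-text route in S23, the hand-off from rule matching to model training can be sketched as follows. This is an illustrative sketch only: the English pattern "X, also called Y", the B-TERM/I-TERM/O tag scheme, and the sample sentence are all assumptions, and the BiLSTM-CRF model that would consume the tagged data is not shown.

```python
import re

# Hypothetical English rendering of one synonymy sentence pattern.
ALSO_CALLED = re.compile(
    r"(?P<x>[A-Za-z ]+?), also called (?P<y>[A-Za-z ]+?)[,.]"
)


def match_synonyms(sentence):
    """Return the (X, Y) pairs matched by the pattern in one sentence."""
    return [
        (m.group("x").strip(), m.group("y").strip())
        for m in ALSO_CALLED.finditer(sentence)
    ]


def bio_tags(tokens, mentions):
    """Label each token B-TERM, I-TERM, or O given the known term mentions."""
    tags = ["O"] * len(tokens)
    for mention in mentions:
        words = mention.split()
        for i in range(len(tokens) - len(words) + 1):
            if tokens[i:i + len(words)] == words:
                tags[i] = "B-TERM"
                for j in range(i + 1, i + len(words)):
                    tags[j] = "I-TERM"
    return tags


# Invented example sentence.
sentence = "Lathes, also called turning machines, shape metal."
pairs = match_synonyms(sentence)
tokens = re.findall(r"[A-Za-z]+", sentence)
tags = bio_tags(tokens, [term for pair in pairs for term in pair])
```

Token sequences labeled this way are the kind of training data a sequence tagger such as BiLSTM-CRF expects, so the rules bootstrap the model instead of requiring fully manual annotation.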

Furthermore, the article content similarity in S22 is obtained by an unsupervised learning method: vector representations of all words are obtained with the word2vec algorithm; for any article, the tf-idf value of each word in the text is used as its weight, and the weighted average of the word vectors of all words in the article serves as the article vector; the cosine similarity between article vectors is used as the article similarity.
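
The similarity computation can be sketched in plain Python. The assumptions are stated up front: the two-dimensional vectors in `WORD_VECTORS` are invented stand-ins for trained word2vec embeddings, the smoothed idf formula is one common variant chosen here for convenience, and the function names are hypothetical. The 0.8 threshold corresponds to the 80% criterion in S22.

```python
import math
from collections import Counter

# Toy 2-d vectors standing in for trained word2vec embeddings (invented values).
WORD_VECTORS = {
    "pump": [1.0, 0.0],
    "valve": [0.9, 0.1],
    "coolant": [0.8, 0.2],
    "invoice": [0.0, 1.0],
    "payment": [0.1, 0.9],
}


def tfidf_weights(tokens, corpus):
    """tf-idf weight per term; the smoothed idf variant is an assumption."""
    n_docs = len(corpus)
    weights = {}
    for term, count in Counter(tokens).items():
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(tokens)) * idf
    return weights


def article_vector(tokens, corpus):
    """tf-idf weighted average of the word vectors of an article."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for term, w in tfidf_weights(tokens, corpus).items():
        vec = WORD_VECTORS.get(term)
        if vec is None:  # skip out-of-vocabulary words
            continue
        total = [t + w * v for t, v in zip(total, vec)]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def same_entity(title_a, tokens_a, title_b, tokens_b, corpus, threshold=0.8):
    """Same title and article similarity above the threshold."""
    sim = cosine(article_vector(tokens_a, corpus), article_vector(tokens_b, corpus))
    return title_a == title_b and sim > threshold


corpus = [
    ["pump", "valve", "coolant"],
    ["pump", "coolant", "valve", "valve"],
    ["invoice", "payment"],
]
match = same_entity("Pump", corpus[0], "Pump", corpus[1], corpus)
mismatch = same_entity("Pump", corpus[0], "Pump", corpus[2], corpus)
```

In practice the word vectors would come from a word2vec model trained on the encyclopedia text, and the document frequencies from the full article collection rather than three toy articles.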

Further, S3 includes the following steps:

S31: extracting the hypernym-hyponym relations between industry core concepts from the linked open data sets and open knowledge bases according to the corresponding rules;

S32: obtaining the hypernym-hyponym relations between core concepts directly from the encyclopedia classification systems;

S33: extracting hypernym-hyponym relations from industry text: first, defining sentence-pattern rules that describe hypernym-hyponym relations, namely "X is Y", "X is Y, Z, etc.", "X comprises Y, Z, etc.", "X has Y, Z, etc.", "X means Y, Z, etc.", and "X (Y, Z)", and matching these patterns against the industry text to extract the hypernym-hyponym relations between entities or concepts; then performing word segmentation and part-of-speech tagging on the text with an NLP tool, obtaining training data from the extracted hypernym-hyponym relations, modeling with the BiLSTM-CRF algorithm, and extracting the hypernym-hyponym relations in the triples;

S34: integrating the hypernym-hyponym relations obtained from the three routes S31 to S33 to construct a classification tree.
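
The rule-matching half of S33 and the tree construction in S34 can be sketched together. The regular expression below is a hypothetical English rendering of just one pattern ("X comprises Y, Z, etc."); the sentences are invented, and the result is stored as a simple parent-to-children mapping rather than any particular graph format.

```python
import re
from collections import defaultdict

# Hypothetical English rendering of the "X comprises Y, Z, etc." pattern.
COMPRISES = re.compile(
    r"^(?P<x>[\w ]+?) (?:comprises|includes) (?P<ys>.+?)(?:,? etc)?\.$"
)


def extract_hypernym_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched in one sentence."""
    m = COMPRISES.match(sentence.strip())
    if not m:
        return []
    hypernym = m.group("x").strip().lower()
    hyponyms = re.split(r",\s*|\s+and\s+", m.group("ys"))
    return [(h.strip().lower(), hypernym) for h in hyponyms if h.strip()]


def build_tree(pairs):
    """Map each hypernym to the sorted list of its direct hyponyms."""
    children = defaultdict(set)
    for hyponym, hypernym in pairs:
        children[hypernym].add(hyponym)
    return {parent: sorted(kids) for parent, kids in children.items()}


# Invented example sentences.
sentences = [
    "Machine tool comprises lathe, milling machine, etc.",
    "Lathe includes CNC lathe and turret lathe.",
    "The operator reads the manual.",
]
pairs = [p for s in sentences for p in extract_hypernym_pairs(s)]
tree = build_tree(pairs)
```

Pairs from the linked open data (S31) and the encyclopedia category systems (S32) could be fed into the same `build_tree` function, which is one simple way to realize the integration in S34.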

Further, S4 includes the following steps:

S41: extracting the attribute relations of concepts directly from the information modules (infoboxes) of the linked open data;

S42: writing an adapter that extracts the attribute relations of the entities of a concept from the information modules (infoboxes) of the online encyclopedias by page parsing, and counting the attributes of the entities that belong to each concept; if the proportion of a concept's entities that have a given attribute exceeds 30%, the attribute is considered common and becomes an attribute of the concept;

S43: extracting non-hypernym-hyponym relations from industry text: first, defining, with the assistance of industry experts, common sentence-pattern rules that describe non-hypernym-hyponym relations, and matching these rules against the industry text to extract the non-hypernym-hyponym relations between entities or concepts; then performing word segmentation and part-of-speech tagging on the text with an NLP tool, obtaining training data from the extracted non-hypernym-hyponym relations, modeling with the BiLSTM-CRF algorithm, and extracting the non-hypernym-hyponym relations;

S44: finally, merging the non-hypernym-hyponym relations obtained from the three routes S41 to S43.
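
The 30% rule in S42 amounts to a frequency count over the entities of a concept. A minimal sketch follows, assuming the infobox attributes have already been parsed into a dictionary; the entity names and attributes are invented.

```python
from collections import Counter


def concept_attributes(entity_attributes, threshold=0.30):
    """Promote attributes held by more than `threshold` of a concept's entities."""
    n_entities = len(entity_attributes)
    if n_entities == 0:
        return set()
    # Count, for each attribute, how many entities carry it.
    counts = Counter(
        attr for attrs in entity_attributes.values() for attr in set(attrs)
    )
    return {attr for attr, c in counts.items() if c / n_entities > threshold}


# Invented infobox attributes for entities under a hypothetical concept "lathe".
lathe_entities = {
    "Lathe A": {"manufacturer", "spindle speed", "weight"},
    "Lathe B": {"manufacturer", "spindle speed"},
    "Lathe C": {"manufacturer", "color"},
    "Lathe D": {"manufacturer"},
}
common = concept_attributes(lathe_entities)
```

Here "manufacturer" (4 of 4 entities) and "spindle speed" (2 of 4) exceed the threshold and are promoted, while "weight" and "color" (1 of 4 each, 25%) are not. The comparison is strict because the text says the proportion must exceed 30%.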

The invention has the following beneficial effects. The multi-source construction method of an industry knowledge graph addresses the problems of existing construction methods: heavy manual workload, high consumption of computing resources, excessive fragmented information, incomplete data, and difficulty in extracting and fusing knowledge from different sources in a differentiated manner. Once a target ontology has been constructed, entities and attributes are extracted with a strategy tailored to each data source, taking the characteristics of knowledge from each source into account, and the knowledge graph is constructed semi-automatically with machine learning methods. This greatly reduces the manual effort of building a large-scale knowledge graph while maintaining accuracy.

Drawings

To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow diagram of a multi-source construction method of an industry knowledge graph according to an embodiment of the invention;

FIG. 2 is a flowchart of determining whether same-named entities in different online encyclopedias are synonymous, in the multi-source construction method of an industry knowledge graph according to an embodiment of the present invention;

FIG. 3 is a flowchart of extracting synonymy, hypernym-hyponym, and non-hypernym-hyponym relations from industry text, in the multi-source construction method of an industry knowledge graph according to an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments that a person of ordinary skill in the art can derive from the embodiments given herein fall within the scope of protection of the present invention.

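As shown in FIGS. 1 to 3, the multi-source construction method of an industry knowledge graph according to the embodiment of the present invention includes the following steps:

S1: extracting industry concepts and entities from four knowledge sources, namely open knowledge bases, online encyclopedias, industry texts, and industry structured data;

S2: merging synonymous concepts and entities;

S3: extracting the hypernym-hyponym (superior-inferior) relations between concepts;

S4: extracting the non-hypernym-hyponym relations and attribute relations of concepts and entities.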

S1 includes sub-steps S11 to S15 described above.

S2 includes sub-steps S21 to S24 described above; S24 is illustrated with an example:

S24: merging the synonymy relations obtained from the three routes S21 to S23: if synonymy relations obtained from different routes share a concept or entity, the two are merged. For example, if the synonymy relation "computer, electronic computer" is obtained from the encyclopedias and the synonymy relation "computer, computer" is obtained from the industry text, the merged synonymy relation is "computer, electronic computer, computer".
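
The merge in S24 is transitive: any two synonym sets that share a term collapse into one. A union-find structure is one common way to implement this; it is a swapped-in technique for illustration, not one the source names. The term "computing machine" and the lathe terms below are invented examples.

```python
def merge_synonym_sets(synonym_sets):
    """Merge synonym sets that share at least one term, transitively."""
    parent = {}

    def find(term):
        # Register unseen terms as their own root, then follow parent links.
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for group in synonym_sets:
        terms = list(group)
        for term in terms:
            find(term)  # make sure single-term groups are registered too
        for other in terms[1:]:
            union(terms[0], other)

    merged = {}
    for term in list(parent):
        merged.setdefault(find(term), set()).add(term)
    return sorted(sorted(group) for group in merged.values())


# Synonym sets from different routes; the second and third are invented.
from_encyclopedia = [["computer", "electronic computer"]]
from_industry_text = [["computer", "computing machine"], ["lathe", "turning machine"]]
merged = merge_synonym_sets(from_encyclopedia + from_industry_text)
```

Because the two sets containing "computer" share a term, they end up in one group, while the unrelated lathe set stays separate.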

The article content similarity in S22 is obtained by an unsupervised learning method: vector representations of all words are obtained with the word2vec algorithm; for any article, the tf-idf value of each word in the text is used as its weight, and the weighted average of the word vectors of all words in the article serves as the article vector; the cosine similarity between article vectors is then used as the article similarity.

S3 includes sub-steps S31 to S34 described above.

S4 includes sub-steps S41 to S44 described above.

To facilitate understanding of the above technical solution, it is described below in terms of a specific mode of use.

In specific use, industry concepts and entities are extracted from the four knowledge sources (open knowledge bases, online encyclopedias, industry texts, and industry structured data); synonymous concepts and entities are merged; the hypernym-hyponym relations between concepts are extracted; and the non-hypernym-hyponym relations and attribute relations of concepts and entities are extracted.

In conclusion, the above technical solution addresses the problems of existing construction methods: heavy manual workload, high consumption of computing resources, excessive fragmented information, incomplete data, and difficulty in extracting and fusing knowledge from different sources in a differentiated manner. Once a target ontology has been constructed, entities and attributes are extracted with a strategy tailored to each data source, taking the characteristics of knowledge from each source into account, and the knowledge graph is constructed semi-automatically with machine learning methods. This greatly reduces the manual effort of building a large-scale knowledge graph while maintaining accuracy.

The above describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within its scope of protection.

Claims

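1. A multi-source construction method of an industry knowledge graph, characterized by comprising the following steps:

S1: extracting industry concepts and entities from four knowledge sources, namely open knowledge bases, online encyclopedias, industry texts, and industry structured data;

S2: merging synonymous concepts and entities;

S3: extracting the hypernym-hyponym (superior-inferior) relations between concepts;

S4: extracting the non-hypernym-hyponym relations and attribute relations of concepts and entities.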

2. The multi-source construction method of industry knowledge graph according to claim 1, wherein the S1 comprises the following steps:

3. The multi-source construction method of industry knowledge graph according to claim 1, wherein the S2 comprises the following steps:

4. The multi-source construction method of an industry knowledge graph according to claim 3, wherein the article content similarity in S22 is obtained by an unsupervised learning method: vector representations of all words are obtained with the word2vec algorithm; for any article, the tf-idf value of each word in the text is used as its weight, and the weighted average of the word vectors of all words in the article serves as the article vector; and the cosine similarity between article vectors is used as the article similarity.

5. The multi-source construction method of industry knowledge graph according to claim 1, wherein the S3 comprises the following steps:

6. The multi-source construction method of industry knowledge graph according to claim 1, wherein the S4 comprises the following steps: