CN109145071B - Automatic construction method and system for geophysical field knowledge graph - Google Patents


Publication number
CN109145071B
Authority
CN
China
Prior art keywords
relation
entities
knowledge
geophysical
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810883507.4A
Other languages
Chinese (zh)
Other versions
CN109145071A (en)
Inventor
董理君
姚宏
赵东阳
康晓军
李新川
郑坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN201810883507.4A
Publication of CN109145071A
Application granted
Publication of CN109145071B
Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to an automated method for constructing a knowledge graph in the geophysical field. First, a concept knowledge base for the geophysical field is established. Second, a relation indicator word library is established for each relation in the geophysical field. A geophysical knowledge data set is then acquired and its text is processed with NLP, after which the geophysical concepts labeled in the text are identified and candidate entity pairs are selected based on word distance and entity distance. A candidate relation indicator set containing noisy data is generated from part-of-speech tags and position information, and the noise is filtered out using the relation indicator word library. The relation indicators predefined for each relation are then converted into vectors and compared by similarity with the vector of the candidate relation indicators, and the relation whose indicators have the highest similarity is selected. Finally, the structured data is imported into the graph database Neo4j to build the geophysical domain knowledge graph.

Description

Automatic construction method and system for geophysical field knowledge graph
Technical Field
The invention particularly relates to an automatic construction method and system for a knowledge graph in the geophysical field.
Background
As theoretical research in geophysics deepens and its applications expand, the volume of knowledge data in the discipline keeps growing, but this data is scattered, so geophysical knowledge lacks systematic organization. In addition, storing knowledge as linear text hinders its rapid circulation and does not meet the demand for quick access to knowledge. With the advent of the big data era in particular, the tension has become increasingly prominent between the demand for rapid access to massive amounts of knowledge on the one hand and, on the other, the difficulty of finding information in scattered knowledge data and the low comprehension efficiency of its linear representation.
To solve the above problems, this patent proposes an automated knowledge graph construction method for building a specialized knowledge graph for the geophysical field. The input is unstructured text from the geophysical domain, and the output is structured knowledge data, commonly referred to as triples.
Many methods for automatically constructing knowledge graphs already exist, but most of them extract triples only for a set of specified relations and are therefore unsuitable for specialized fields with many complex relations. Open triple extraction has been studied mostly for English, with comparatively little work on Chinese; because the two languages differ considerably, English methods cannot be transplanted directly to Chinese, and doing so yields low precision.
Disclosure of Invention
The technical problem addressed by the invention is the inadequacy of existing techniques for automatic open triple extraction. The invention provides a method and system for automatically constructing a knowledge graph in the geophysical field that combines the structural characteristics of theoretical knowledge in the field, a pre-established concept knowledge base and relation indicator word library, and a similarity matching algorithm between the generated candidate relation indicators and the relation indicators of each relation.
An automated construction method for a knowledge graph in the geophysical field comprises the following steps:
step 1: establishing a concept knowledge base containing professional vocabularies in the geophysical field;
step 2: establishing a knowledge data set containing unstructured text in the geophysical field;
step 3: according to the knowledge data set established in step 2, acquiring all relations contained in the knowledge data set and the relation indicators corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
step 4: performing NLP processing on the knowledge data set according to the concept knowledge base, the NLP processing comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field;
step 5: identifying whether a relationship exists between any two entities identified in step 4, and if so, acquiring the relationship between the two entities;
step 6: extracting the nouns and verbs located between and after any two entities as candidate relation indicators, wherein the candidate relation indicators can embody the relationship between the two entities acquired in step 5;
step 7: denoising the candidate relation indicators extracted in step 6 according to the relation indicator word library established in step 3 to obtain high-precision candidate relation indicators;
step 8: converting the relation indicator word library and the high-precision candidate relation indicators obtained in step 7 into vectors, calculating the similarity between them, selecting the relation whose relation indicators have the highest similarity to the high-precision candidate relation indicators as the relation between the two entities, and finally obtaining the structured knowledge data;
step 9: importing the structured knowledge data obtained in step 8 into a graph database to automatically build the geophysical field knowledge graph.
Further, the knowledge data set in step 2 is established using the Scrapy crawler framework.
Further, in step 3, all the relations contained in the knowledge data set and the relation indicators corresponding to each relation are obtained by exhaustive enumeration.
Further, the method for identifying whether a relationship exists between any two entities in step 5 is as follows: when the word distance between the two entities does not exceed a preset maximum word distance and the number of entities between them is less than a preset entity-distance threshold, a relationship is judged to exist between the two entities.
Further, in step 8, the high-precision candidate relation indicators are converted into vectors using the Bag-of-words method.
Further, the structured knowledge data finally obtained in step 8 is triple data.
An automated construction system for a geophysical domain knowledge graph, comprising:
a vocabulary collection module: used for establishing a concept knowledge base containing professional vocabulary of the geophysical field;
a text collection module: used for establishing a knowledge data set containing unstructured text of the geophysical field;
a relationship acquisition module: used for acquiring, according to the knowledge data set established in step 2, all relations contained in the knowledge data set and the relation indicators corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
an entity identification module: used for performing NLP processing on the knowledge data set according to the concept knowledge base, the NLP processing comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field;
a relationship identification module: used for identifying whether a relationship exists between any two entities identified in step 4, and if so, acquiring the relationship between the two entities;
an indicator extraction module: used for extracting the nouns or verbs located between and after any two entities as candidate relation indicators, wherein the candidate relation indicators can embody the relationship between the two entities acquired in step 5;
an indicator denoising module: used for denoising the candidate relation indicators extracted in step 6 according to the relation indicator word library established in step 3 to obtain high-precision candidate relation indicators;
a relationship calculation module: used for converting the relation indicator word library and the high-precision candidate relation indicators obtained in step 7 into vectors, calculating the similarity between the vectors, selecting the relation whose relation indicators have the highest similarity to the high-precision candidate relation indicators as the relation between the two entities, and finally obtaining structured knowledge data;
an automatic construction module: used for importing the structured knowledge data obtained in step 8 into a graph database and automatically building the geophysical domain knowledge graph.
The resulting domain knowledge graph can speed up the flow of knowledge data between people and between machines, and the structured geophysical knowledge data lays a foundation for enabling machines, through representation learning, to understand human knowledge and provide intelligent knowledge services (such as intelligent question answering and intelligent dialogue).
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a flowchart of the automated construction method for a geophysical domain knowledge graph according to the present invention;
FIG. 2 shows the resulting geophysical domain knowledge graph of the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
An automatic construction method for a knowledge graph in the geophysical field comprises the following specific steps:
Step 1: A concept knowledge base for the geophysical field, comprising professional vocabulary of the field, is established and loaded into the Language Technology Platform (LTP) of Harbin Institute of Technology.
Step 2: A knowledge data set for the geophysical field, comprising unstructured geophysical text, is established using the Scrapy crawler framework. Using the concept knowledge base established in step 1 as supervision data, entities (concepts such as gravity field, gravity anomaly, and Kirchhoff interface) can be identified in the knowledge data set, for example "Earth's gravity field" and "geophysics". The relationship between two entities (for example, "research branch"), however, cannot be identified this way. The relationships are contained in the text of the knowledge data set (for example, "Earth's gravity field is one of the important branches of geophysical research") and are mined by the automated method of step 3 rather than by hand.
Step 3: According to the knowledge data set established in step 2, all relations contained in the knowledge data set and the relation indicators corresponding to each relation are acquired by exhaustive enumeration, and a relation indicator word library for the geophysical field is established. For example, the relation between the entity "geophysics" and the entity "Earth's gravity field" is "research branch", and its relation indicators may be "research" and "branch". Conversely, once the two entities have been identified in step 5 in the unstructured text "Earth's gravity field is one of the important branches of geophysical research", the relation indicators "research" and "branch" appear in the text, so the relation between the two entities is found to be "research branch" and the triple (geophysics, research branch, Earth's gravity field) is obtained. The purpose of the relation indicator word library is to provide a basis for deducing, in step 8, the relation from the relation indicators found in unstructured text.
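As a rough illustration of the relation indicator word library of step 3, the sketch below stores each relation with its indicator words and builds a reverse index from indicator word to relation. The names `RELATION_INDICATORS` and `build_reverse_index`, and the "measurement method" entry, are hypothetical; only the "research branch" example comes from the description.

```python
from collections import defaultdict

# Relation -> set of relation indicator words.
# "research branch" follows the example in the text; the second entry
# is a made-up placeholder to show that the library holds many relations.
RELATION_INDICATORS = {
    "research branch": {"research", "branch"},
    "measurement method": {"measure", "method"},  # hypothetical entry
}

def build_reverse_index(library):
    """Map each indicator word to the set of relations it can signal."""
    index = defaultdict(set)
    for relation, words in library.items():
        for word in words:
            index[word].add(relation)
    return dict(index)

REVERSE_INDEX = build_reverse_index(RELATION_INDICATORS)
```

The reverse index is one possible way to look up, for an indicator found in text, which relations it might point to.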
Step 4: NLP processing, comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field, is performed on the knowledge data set using the Language Technology Platform (LTP) of Harbin Institute of Technology loaded with the concept knowledge base.
Step 5: Whether a relationship exists between any two entities identified in step 4 is judged as follows: a relationship is judged to exist when the word distance between the two entities does not exceed the preset maxDistance and the number of entities between them is less than the preset maxEntityDistance. The rationale is that the shorter the word distance and the fewer the intervening entities, the greater the probability that a relationship exists.
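The distance test of step 5 can be sketched as follows. This is a minimal sketch under stated assumptions: word distance is taken to be the number of tokens strictly between two entities, and entity distance the number of other entities between them; the patent does not spell out either definition. The function name `candidate_pairs` is hypothetical, and the parameters mirror maxDistance and maxEntityDistance.

```python
from itertools import combinations

def candidate_pairs(entity_positions, max_distance, max_entity_distance):
    """Return pairs of token indices whose entities may be related.

    entity_positions: sorted token indices at which an entity occurs
    in one sentence (assumed output of the entity identification step).
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(entity_positions), 2):
        word_distance = b - a - 1      # tokens strictly between the two entities
        entities_between = j - i - 1   # other entities lying between them
        if word_distance <= max_distance and entities_between < max_entity_distance:
            pairs.append((a, b))
    return pairs

# Entities at tokens 0, 3 and 10 of a sentence (made-up positions).
pairs = candidate_pairs([0, 3, 10], max_distance=5, max_entity_distance=1)
```

With these thresholds only the entities at tokens 0 and 3 form a candidate pair; raising `max_distance` admits more distant pairs at the cost of more noise.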
Step 6: The nouns and verbs located between and after each entity pair are extracted as candidate relation indicators that can embody the relationship between the two entities identified in step 5. About 70% of relation indicators are located between the two entities and 10%-20% after them; the small remainder either precede the first entity or are absent. Relation indicators mostly appear as nouns or verbs.
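The extraction rule of step 6 could look like the sketch below. It assumes the tagger output is a list of (word, tag) tuples in which nouns are tagged "n" and verbs "v"; the tag names, the function name `extract_candidate_indicators`, and the English glosses standing in for the Chinese tokens are all assumptions for illustration.

```python
def extract_candidate_indicators(tagged, first, second, keep_tags=("n", "v")):
    """Collect nouns and verbs between and after an entity pair.

    tagged: list of (word, pos_tag) tuples for one sentence.
    first, second: token indices of the two entities, first < second.
    """
    between = tagged[first + 1:second]
    after = tagged[second + 1:]
    return [word for word, tag in between + after if tag in keep_tags]

# Gloss of "Earth's gravity field is one of the important branches of
# geophysical research", in a hypothetical source-language token order.
sentence = [
    ("Earth's gravity field", "n"),  # entity, index 0
    ("is", "v"),
    ("geophysics", "n"),             # entity, index 2
    ("research", "v"),
    ("of", "u"),
    ("important", "a"),
    ("branch", "n"),
    ("one of", "r"),
]
candidates = extract_candidate_indicators(sentence, first=0, second=2)
```

The result still contains noise such as "is", which is why the denoising of step 7 follows.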
Step 7: The candidate relation indicators extracted in step 6 are denoised according to the relation indicator word library established in step 3 to obtain high-precision candidate relation indicators.
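One plausible reading of the denoising in step 7 is a simple vocabulary filter: a candidate is kept only if it appears somewhere in the relation indicator word library. The sketch below assumes that reading; the function name `denoise` and the sample library are hypothetical.

```python
def denoise(candidates, library):
    """Keep only candidates that occur in the relation indicator word library.

    library: dict mapping each relation to its set of indicator words.
    """
    vocabulary = set().union(*library.values())
    return [word for word in candidates if word in vocabulary]

library = {"research branch": {"research", "branch"}}
clean = denoise(["is", "research", "branch"], library)
```

Here the copular verb "is" is dropped because no relation lists it as an indicator.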
Step 8: The relation indicators in the word library for each relation and the high-precision candidate relation indicators obtained in step 7 are converted into vectors using the Bag-of-words method, and the similarity between the vectors is calculated. The relation whose relation indicators have the highest similarity to the high-precision candidate relation indicators is selected as the relation between the two entities, finally yielding the structured knowledge data, namely triple data.
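The matching in step 8 can be sketched with bag-of-words count vectors. The patent does not name the similarity measure, so cosine similarity is used here as an assumption; the function names and the "measurement method" entry are hypothetical.

```python
import math
from collections import Counter

def bow_vector(words, vocabulary):
    """Bag-of-words count vector of `words` over a fixed vocabulary order."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_relation(candidates, library):
    """Return the relation whose indicators are most similar to the candidates."""
    vocabulary = sorted(set().union(*library.values()))
    candidate_vec = bow_vector(candidates, vocabulary)
    scores = {
        relation: cosine(candidate_vec, bow_vector(words, vocabulary))
        for relation, words in library.items()
    }
    return max(scores, key=scores.get), scores

library = {
    "research branch": {"research", "branch"},
    "measurement method": {"measure", "method"},  # hypothetical entry
}
relation, scores = best_relation(["research", "branch"], library)
triple = ("geophysics", relation, "Earth's gravity field")
```

In this toy case the candidates match "research branch" exactly and share no words with the other relation, so the triple from the description is recovered.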
Step 9: The triple data obtained in step 8 is imported into the graph database Neo4j to automatically build the geophysical domain knowledge graph.
Importing the structured triple data into the graph database Neo4j also provides the knowledge graph visualization shown in FIG. 2.
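For the import in step 9, one possible approach is a parameterized Cypher `MERGE` statement executed once per triple. The sketch below only prepares the statement and its parameters and does not connect to a database. The node label `Concept`, the relationship type `RELATED_TO`, and the property names are assumptions, not taken from the patent; the relation name is stored as a property because Cypher does not allow a relationship type to be passed as a parameter.

```python
# Hypothetical graph schema: every entity is a :Concept node, every
# extracted relation is a :RELATED_TO edge carrying the relation name.
CYPHER = (
    "MERGE (h:Concept {name: $head}) "
    "MERGE (t:Concept {name: $tail}) "
    "MERGE (h)-[:RELATED_TO {name: $relation}]->(t)"
)

def to_cypher_params(triples):
    """Turn (head, relation, tail) triples into parameter dicts for CYPHER."""
    return [
        {"head": head, "relation": relation, "tail": tail}
        for head, relation, tail in triples
    ]

params = to_cypher_params(
    [("geophysics", "research branch", "Earth's gravity field")]
)
```

Each parameter dictionary could then be passed together with `CYPHER` to a Neo4j client session; `MERGE` avoids creating duplicate nodes when the same entity appears in several triples.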
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (7)

1. An automatic construction method for a knowledge graph in the geophysical field is characterized by comprising the following steps:
step 1: establishing a concept knowledge base containing professional vocabularies in the geophysical field;
step 2: establishing a knowledge data set containing unstructured text in the geophysical field;
step 3: according to the knowledge data set established in step 2, acquiring all relations contained in the knowledge data set and the relation indicators corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
step 4: performing NLP processing on the knowledge data set according to the concept knowledge base, the NLP processing comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field;
step 5: identifying whether a relationship exists between any two entities identified in step 4, and if so, acquiring the relationship between the two entities;
step 6: extracting the nouns or verbs located between and after any two entities as candidate relation indicators, wherein the candidate relation indicators can embody the relationship between the two entities acquired in step 5;
step 7: denoising the candidate relation indicators extracted in step 6 according to the relation indicator word library established in step 3 to obtain high-precision candidate relation indicators;
step 8: converting the relation indicator word library and the high-precision candidate relation indicators obtained in step 7 into vectors, calculating the similarity between them, selecting the relation whose relation indicators have the highest similarity to the high-precision candidate relation indicators as the relation between the two entities, and finally obtaining the structured knowledge data;
step 9: importing the structured knowledge data obtained in step 8 into a graph database to automatically build the geophysical field knowledge graph.
2. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the knowledge data set in step 2 is established using the Scrapy crawler framework.
3. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein exhaustive enumeration is used in step 3 to obtain all the relations contained in the knowledge data set and the relation indicators corresponding to each relation.
4. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the method for identifying whether a relationship exists between any two entities in step 5 is as follows: when the word distance between the two entities does not exceed a preset maximum word distance and the number of entities between them is less than a preset entity-distance threshold, a relationship is judged to exist between the two entities.
5. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the high-precision candidate relation indicators are converted into vectors using the Bag-of-words method in step 8.
6. The automated construction method for the geophysical domain knowledge graph according to claim 1, wherein the structured knowledge data finally obtained in step 8 is triple data.
7. An automated construction system for a geophysical domain knowledge graph, comprising:
a vocabulary collection module: used for establishing a concept knowledge base containing professional vocabulary of the geophysical field;
a text collection module: used for establishing a knowledge data set containing unstructured text of the geophysical field;
a relationship acquisition module: used for acquiring, according to the knowledge data set established by the text collection module, all relations contained in the knowledge data set and the relation indicators corresponding to each relation, and establishing a relation indicator word library for the geophysical field;
an entity identification module: used for performing NLP processing on the knowledge data set according to the concept knowledge base, the NLP processing comprising word segmentation, part-of-speech tagging and entity identification in the geophysical field;
a relationship identification module: used for identifying whether a relationship exists between any two entities identified by the entity identification module, and if so, acquiring the relationship between the two entities;
an indicator extraction module: used for extracting the nouns or verbs located between and after any two entities as candidate relation indicators, wherein the candidate relation indicators can reflect the relationship between the two entities acquired by the relationship identification module;
an indicator denoising module: used for denoising the candidate relation indicators extracted by the indicator extraction module according to the relation indicator word library established by the relationship acquisition module to obtain high-precision candidate relation indicators;
a relationship calculation module: used for converting the relation indicator word library and the high-precision candidate relation indicators obtained by the indicator denoising module into vectors, calculating the similarity between the vectors, selecting the relation whose relation indicators have the highest similarity to the high-precision candidate relation indicators as the relation between the two entities, and finally obtaining structured knowledge data;
an automatic construction module: used for importing the structured knowledge data obtained by the relationship calculation module into a graph database and automatically building the geophysical field knowledge graph.
CN201810883507.4A 2018-08-06 2018-08-06 Automatic construction method and system for geophysical field knowledge graph Active CN109145071B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810883507.4A CN109145071B (en) 2018-08-06 2018-08-06 Automatic construction method and system for geophysical field knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810883507.4A CN109145071B (en) 2018-08-06 2018-08-06 Automatic construction method and system for geophysical field knowledge graph

Publications (2)

Publication Number Publication Date
CN109145071A CN109145071A (en) 2019-01-04
CN109145071B CN109145071B (en) 2021-08-27

Family

ID=64791709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810883507.4A Active CN109145071B (en) 2018-08-06 2018-08-06 Automatic construction method and system for geophysical field knowledge graph

Country Status (1)

Country Link
CN (1) CN109145071B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109933789B (en) * 2019-02-27 2021-04-13 中国地质大学(武汉) Neural network-based judicial domain relation extraction method and system
CN110222196A (en) * 2019-06-18 2019-09-10 卓尔智联(武汉)研究院有限公司 Fishery knowledge mapping construction device, method and computer readable storage medium
CN110222198A (en) * 2019-06-18 2019-09-10 卓尔智联(武汉)研究院有限公司 Non-ferrous metal industry knowledge mapping construction method, electronic device and storage medium
CN112559765B (en) * 2020-12-11 2023-06-16 中电科大数据研究院有限公司 Semantic integration method for multi-source heterogeneous database

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105760425A (en) * 2016-01-17 2016-07-13 曲阜师范大学 Ontology data storage method
CN105760495A (en) * 2016-02-17 2016-07-13 扬州大学 Method for carrying out exploratory search for bug problem based on knowledge map
EP3051435A1 (en) * 2013-09-29 2016-08-03 Peking University Founder Group Co., Ltd Method and system for obtaining a knowledge point implicit relationship
CN106844658A (en) * 2017-01-23 2017-06-13 中山大学 A kind of Chinese text knowledge mapping method for auto constructing and system
EP3051434A4 (en) * 2013-09-29 2017-06-14 Peking University Founder Group Co., Ltd Method and system for measurement of knowledge point relationship strength
CN107609152A (en) * 2017-09-22 2018-01-19 百度在线网络技术(北京)有限公司 Method and apparatus for expanding query formula

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10019538B2 (en) * 2015-04-01 2018-07-10 Tata Consultancy Services Limited Knowledge representation on action graph database

Also Published As

Publication number Publication date
CN109145071A (en) 2019-01-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant