CN117313850A

CN117313850A - Information extraction and knowledge graph construction system and method

Info

Publication number: CN117313850A
Application number: CN202311316939.4A
Authority: CN
Inventors: 谷钢; 王彦功; 张晓明; 杨玺; 尹京刚; 张悦; 王飞
Original assignee: Inspur Software Technology Co Ltd
Current assignee: Inspur Software Technology Co Ltd
Priority date: 2023-10-12
Filing date: 2023-10-12
Publication date: 2023-12-29

Abstract

The invention discloses an information extraction and knowledge graph construction system and method, belonging to the technical field of data processing, which address the technical problem of how to improve the accuracy and completeness of information extraction and how to construct a knowledge graph efficiently and accurately. The system comprises a data preprocessing module, an entity recognition module, an event extraction module, a knowledge graph construction module, and a knowledge representation and retrieval module. The data preprocessing module provides a word segmentation service, a part-of-speech tagging service and a syntax analysis service; the entity recognition module provides an entity feature extraction service and an entity tag prediction service; the event extraction module provides an event feature extraction service, an event template matching service, an event classification and extraction service and an event relation modeling service; the knowledge graph construction module provides a data model definition service, a data storage service, a data update import service and a visual interaction service; and the knowledge representation and retrieval module provides a knowledge representation service, a knowledge retrieval service, a similarity calculation service and a reasoning expansion service.

Description

Information extraction and knowledge graph construction system and method

Technical Field

The invention relates to the technical field of data processing, in particular to an information extraction and knowledge graph construction system and method.

Background

In the field of natural language processing, there are many basic technologies and algorithms, such as part-of-speech tagging, syntactic analysis, semantic role tagging, entity recognition, relation extraction, event extraction, etc., and knowledge graph construction involves knowledge representation, entity recognition, relation extraction, graph database, etc.

The existing limitations lie mainly in the methods themselves and in the complexity of construction. Traditional natural language processing methods are constrained by rules and pattern matching, so they struggle with the complexity of semantics and context, and the extracted information is neither accurate nor complete enough. Constructing a knowledge graph requires complex tasks such as entity recognition, relation extraction and knowledge representation, which demand considerable labor and time, and the accuracy and consistency of the result are sometimes difficult to ensure.

How to improve the accuracy and completeness of information extraction, and how to construct a knowledge graph efficiently and accurately, are the technical problems to be solved.

Disclosure of Invention

The technical task of the invention is to provide an information extraction and knowledge graph construction system and method that solve the technical problems of how to improve the accuracy and completeness of information extraction and how to construct a knowledge graph efficiently and accurately.

In a first aspect, the present invention provides an information extraction and knowledge graph construction system, including:

the data preprocessing module is used for providing a word segmentation service, a part-of-speech tagging service and a syntax analysis service, wherein the word segmentation service is used for segmenting a continuous text sequence into discrete words or tokens, the part-of-speech tagging service is used for determining the part of speech or word class of each word, and the syntax analysis service is used for determining the grammatical relations among the words in a sentence to obtain a syntactic structure;

the entity recognition module is used for providing an entity feature extraction service and an entity tag prediction service, wherein the entity feature extraction service is used for learning context information and semantic features of an entity and for learning long-term dependencies and local features in a text sequence to obtain the features of the entity, and the entity tag prediction service is used for predicting the entity tag of each word based on the dependencies among tags and the learned context features of the entity;

the event extraction module is used for providing an event feature extraction service, an event template matching service, an event classification and extraction service and an event relation modeling service, wherein the event feature extraction service is used for extracting key features of events, the event template matching service is used for identifying and extracting events of a specific type based on a predefined event template, the event classification and extraction service is used for classifying and extracting events by learning from labeled events, and the event relation modeling service is used for constructing relations among events based on time sequence, logical relations and semantic connections in the text;

The knowledge graph construction module is used for providing a data model definition service, a data storage service, a data update import service and a visual interaction service, wherein the data model definition service is used for defining entities, relations and attributes based on graph structures and organizing relations among the entities, the relations and the attributes, the data storage service is used for taking a graph database as a storage engine of the knowledge graph, the data update import service is used for providing a data update import interface, the data update import interface supports the updating of the knowledge graph in a full or incremental mode, and the visual interaction service is used for browsing and navigating the knowledge graph through a graphical interface or a visual tool based on the data model and the graph database;

the knowledge representation and retrieval module is used for providing a knowledge representation service, a knowledge retrieval service, a similarity calculation service and a reasoning expansion service, wherein the knowledge representation service is used for vectorizing the entities and relations in the knowledge graph based on a graph representation learning technology; the knowledge retrieval service is used for supporting a user in performing data query and data filtering through a query language or an API so as to obtain entities and relations; the similarity calculation service is used for matching potentially associated or similar new knowledge based on the similarity between entities or between relations; and the reasoning expansion service is used for analyzing the logical relations and semantic connections between entities and relations and discovering new entities and relations from the knowledge graph.

Preferably, the data preprocessing module is provided with a word segmentation model constructed based on a hidden Markov model, and the word segmentation model is used for providing word segmentation service;

the data preprocessing module is provided with a part-of-speech tagging model constructed based on a maximum entropy model, and part-of-speech tagging service is provided through the part-of-speech tagging model;

the data preprocessing module is provided with a syntactic analysis model constructed based on a statistics-based constituency parser, and the syntax analysis service is provided through the syntactic analysis model, which works as follows: dividing the sentence into a plurality of words, and analyzing the structure of and grammatical relations among the words to obtain a syntactic structure.

Preferably, the entity feature extraction service is used for learning the context information and semantic features of the entity through multiple features, wherein the entity features comprise part of speech, word shape, words in a context window and a bag-of-words model;

the entity feature extraction module is provided with an entity recognition model constructed based on a recurrent neural network and a convolutional neural network, and long-term dependencies and local features in a text sequence are learned through the entity recognition model to obtain the features of an entity;

the entity recognition module is provided with an entity recognition model constructed based on a conditional random field model, and the entity tag prediction service is provided through the entity recognition model.

Preferably, the event extraction module is used for extracting key features of the event through multiple features, wherein the key features of the event comprise verbs, noun phrases, time phrases and parts of speech, and key elements of the event are identified by learning the key features of the event, the key elements comprising actions, participants and time;

the event template describes the relations among the elements of the event;

the event extraction module is provided with an event classification extraction model based on a recurrent neural network, and the event classification and extraction service is provided through the event classification extraction model.

Preferably, for a data model of a graph structure, an entity is a node in the graph, a relationship is an edge in the graph, and attributes are attributes of the node and the edge.

Preferably, the knowledge retrieval service supports inquiring according to the attribute of the entity, the type of the relation and the time of the event as conditions, so as to obtain the entity and the relation meeting the conditions;

the reasoning expansion service is used for analyzing the logical relationship and semantic connection between the entities and the relationship through logical reasoning or graph algorithm and discovering new entities and relationships from the knowledge graph.

In a second aspect, the present invention provides an information extraction and knowledge graph construction method, which performs information extraction and constructs a knowledge graph by using the information extraction and knowledge graph construction system according to any one of the first aspects, the method comprising the following steps:

data preprocessing: segmenting a continuous text sequence into discrete words or tokens through the word segmentation service, determining the part of speech or word class of each word through the part-of-speech tagging service, and determining the grammatical relations among the words in a sentence through the syntax analysis service to obtain a syntactic structure;

entity recognition: learning context information and semantic features of an entity through the entity feature extraction service, and learning long-term dependencies and local features in the text sequence to obtain the features of the entity; and predicting the entity tag of each word through the entity tag prediction service, based on the dependencies among tags and the learned context features of the entity;

event extraction: extracting key features of the events through an event feature extraction service, identifying and extracting the events of a specific type through an event template matching service based on a predefined event template, classifying and extracting the events through learning marked events and an event classifying and extracting service, and constructing the relationship among the events through an event relationship modeling service based on time sequence, logic relationship and semantic connection in a text;

Knowledge graph construction: defining entities, relationships and attributes and organization relations among the entities, relationships and attributes through a data model based on a graph structure, taking a graph database as a storage engine of a knowledge graph, providing a data update import interface for the graph database, supporting the update of the knowledge graph in a full or incremental mode, and browsing and navigating the knowledge graph through a graphical interface or a visual tool based on the data model and the graph database;

knowledge representation and retrieval: vectorizing the entities and relations in the knowledge graph through the knowledge representation service based on a graph representation learning technology; supporting, through the knowledge retrieval service, data query and data filtering by a user via a query language or an API so as to obtain entities and relations; matching potentially associated or similar new knowledge through the similarity calculation service based on the similarity between entities or between relations; and analyzing, through the reasoning expansion service, the logical relations and semantic connections between entities and relations, and discovering new entities and relations from the knowledge graph.

Preferably, for data preprocessing, providing word segmentation service through a word segmentation model constructed based on a hidden Markov model;

providing part-of-speech tagging service through a part-of-speech tagging model constructed based on the maximum entropy model;

providing the syntax analysis service through a syntactic analysis model constructed based on a statistics-based constituency parser.

Preferably, for entity recognition, learning the context information and semantic features of the entity through multiple features, wherein the entity features comprise part of speech, word shape, words in a context window and a bag-of-words model;

learning long-term dependencies and local features in the text sequence through an entity recognition model constructed based on a recurrent neural network and a convolutional neural network, to obtain the features of the entity;

providing entity tag prediction service through an entity identification model constructed based on a conditional random field model;

for event extraction, extracting key features of the event based on multiple features, wherein the key features of the event comprise verbs, noun phrases, time phrases and parts of speech, and key factors in the event are identified by learning the key features of the event, and the key factors comprise actions, participants and time;

the event template is described with relations among all elements in the event;

the event classification and extraction service is provided through an event classification extraction model based on a recurrent neural network.

Preferably, for a data model of a graph structure, an entity is taken as a node in the graph, a relation is taken as an edge in the graph, and an attribute is taken as an attribute of the node and the edge;

For knowledge representation and retrieval, inquiring according to the attribute of the entity, the type of the relationship and the time of the event as conditions, and obtaining the entity and the relationship meeting the conditions through a knowledge retrieval service;

and analyzing the logical relationship and semantic connection between the entity and the relationship through logical reasoning or graph algorithm, and finding out new entity and relationship from the knowledge graph.

The information extraction and knowledge graph construction system and method of the invention have the following advantages:

1. automated information extraction: the system can automatically extract structured information from a large amount of texts, and compared with the traditional manual processing method, the system can greatly improve the efficiency and accuracy of information extraction, and can quickly acquire useful knowledge and information from mass data through automatic information extraction;

2. knowledge graph construction and representation: the knowledge graph is constructed by utilizing the extracted information such as the entity, the relation and the event, the knowledge is represented by a graph structure, the knowledge graph can more intuitively display the association relation among the entities, the user can better understand and explore the knowledge, and the scattered information can be integrated into a unified frame by establishing the structured knowledge graph, so that comprehensive and accurate knowledge expression is provided;

3. Knowledge retrieval and reasoning: the user can perform efficient knowledge retrieval and reasoning, the system provides a powerful query function, complex retrieval operation can be performed according to the conditions of entity attributes, relationship types, event time and the like, the user is helped to quickly acquire required knowledge, meanwhile, the system supports reasoning based on a knowledge graph, and new knowledge and association are found through analysis of logical relationship and semantic connection between the entity and the relationship;

4. knowledge application and intelligent services: based on the knowledge representation service, knowledge retrieval service, similarity calculation service and reasoning expansion service, various knowledge applications and intelligent services can be developed; the knowledge graph can provide a rich knowledge base for search engines, recommendation systems, intelligent question answering and the like, improving user experience and service quality; and more intelligent and personalized knowledge services can be realized by combining the knowledge graph with other artificial intelligence technologies.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

The invention is further described below with reference to the accompanying drawings.

Fig. 1 is a flow chart of a method for information extraction and knowledge graph construction in embodiment 2.

Detailed Description

The invention will be further described with reference to the accompanying drawings and specific examples, so that those skilled in the art can better understand the invention and implement it, but the examples are not meant to limit the invention, and the technical features of the embodiments of the invention and the examples can be combined with each other without conflict.

The embodiment of the invention provides an information extraction and knowledge graph construction system and method, which solve the technical problems of how to improve the accuracy and completeness of information extraction and how to construct a knowledge graph efficiently and accurately.

Example 1:

the invention discloses an information extraction and knowledge graph construction system which comprises a data preprocessing module, an entity identification module, an event extraction module, a knowledge graph construction module and a knowledge representation and retrieval module.

The data preprocessing module is used for providing word segmentation service, part-of-speech tagging service and syntax analysis service, the word segmentation service is used for segmenting continuous text sequences into discrete words or marks, the part-of-speech tagging service is used for determining part of speech or word class of each word, and the syntax analysis service is used for determining the syntax relation among the words in sentences to obtain a syntax structure.

In this embodiment, the data preprocessing module is configured with a word segmentation model constructed based on a hidden Markov model, through which the word segmentation service is provided; a part-of-speech tagging model constructed based on a maximum entropy model, through which the part-of-speech tagging service is provided; and a syntactic analysis model constructed based on a statistics-based constituency parser, through which the syntax analysis service is provided. The syntactic analysis model works as follows: dividing the sentence into a plurality of words, and analyzing the structure of and grammatical relations among the words to obtain a syntactic structure.

Correspondingly, the data preprocessing module in this embodiment may perform word segmentation, part-of-speech tagging, and syntax analysis operations.

Word segmentation is the process of splitting a continuous text sequence into discrete words or tokens. In this embodiment, a hidden Markov model is used as the word segmentation model. The hidden Markov model (HMM) is a common statistical model for modeling and analyzing sequence data, and consists of a Markov chain over hidden states and an observation sequence. It has two key components. Hidden state: a state inside the system that is not directly observed; each hidden state has an associated observed value, and hidden states may be discrete, represented as symbols or tags. Observation sequence: the sequence of data that can be observed directly; the observation sequence is correlated with the hidden states. The core assumption of the hidden Markov model is the Markov property: the probability distribution of the current state depends only on the previous state and is independent of earlier states. This assumption is called the first-order Markov property. Using a hidden Markov model as the word segmentation model strengthens context dependency and reduces error propagation. The model takes context into account, in that the segmentation result for the current word depends on the words before and after it. This context dependency helps resolve word ambiguity and improves segmentation accuracy. Because Chinese words have no explicit boundary markers, segmentation errors affect subsequent processing tasks. Since the hidden Markov model considers the global context during segmentation, it reduces the propagation of such errors and improves overall segmentation quality.
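The HMM-based segmentation described above can be sketched as Viterbi decoding over character tags. This is a minimal illustration, not the patented implementation: it assumes the common B/M/E/S tagging scheme (begin, middle, end of a multi-character word, or single-character word), and the tables START_P, TRANS_P and EMIT_P are hand-set hypothetical values that a real system would estimate from a segmented corpus.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a multi-character word, Single-character word

# Hypothetical toy parameters (assumptions for illustration only).
START_P = {"B": 0.5, "S": 0.5}
TRANS_P = {
    "B": {"M": 0.3, "E": 0.7},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT_P = {
    "B": {"北": 0.9},
    "M": {},
    "E": {"京": 0.9},
    "S": {"我": 0.5, "爱": 0.5},
}


def viterbi_segment(text, start_p, trans_p, emit_p, floor=1e-8):
    """Return the most likely segmentation of `text` under a first-order HMM."""
    if not text:
        return []

    def lp(table, key):
        # Unseen events get a small floor probability instead of zero.
        return math.log(table.get(key, floor))

    # best[t][s] = (log-probability of the best path ending in s at t, previous state)
    best = [{s: (lp(start_p, s) + lp(emit_p[s], text[0]), None) for s in STATES}]
    for ch in text[1:]:
        column = {}
        for s in STATES:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p[p], s), p) for p in STATES
            )
            column[s] = (score + lp(emit_p[s], ch), prev)
        best.append(column)

    # A word cannot stop half-way, so the final tag must be E or S.
    state = max("ES", key=lambda s: best[-1][s][0])
    tags = [state]
    for t in range(len(text) - 1, 0, -1):
        state = best[t][state][1]
        tags.append(state)
    tags.reverse()

    # Cut the character stream after every E or S tag.
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


print(viterbi_segment("我爱北京", START_P, TRANS_P, EMIT_P))
```

The transition table only depends on the previous tag, which is the first-order Markov property stated above; the dynamic program still scores the whole sentence jointly, which is one way to read the remark about considering the global context.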

Part-of-speech tagging is the process of determining the part of speech or word class of each word. This embodiment uses a maximum entropy model as the part-of-speech tagging model. The maximum entropy model is a statistical model for classification and prediction, based on the principle of maximum entropy from information theory, and is widely applied in natural language processing, machine learning, statistics and other fields. Its core idea is, given a set of constraints, to select as the optimal model the one with the greatest entropy among the probability distributions satisfying those constraints. Entropy measures the uncertainty of a probability distribution, and the principle of maximum entropy holds that, absent other prior knowledge, the most uncertain model consistent with the constraints should be selected, which preserves the consistency and robustness of the model. The maximum entropy model can select appropriate features for different languages and application scenarios, and allows multiple features to be combined in one model. This flexibility lets the model make full use of context information, lexical information and other linguistic features, improving part-of-speech tagging accuracy. In part-of-speech tagging, the tag of each word typically depends on the tags of other words in its context. By considering global context information, the maximum entropy model can resolve word ambiguity. It can capture transition probabilities between parts of speech and thereby infer the tag of each word more accurately. The maximum entropy model also has good interpretability: it exposes the weight and contribution of each feature, so that the model's output can be interpreted and understood, which facilitates debugging and improving the model.
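For reference, a conditional maximum entropy tagger is conventionally written in the log-linear form below. The notation is introduced here for illustration and does not appear in the source: t is a candidate tag, c is the context of the word being tagged, f_i are feature functions and \lambda_i are their learned weights.

```latex
p(t \mid c) = \frac{\exp\left(\sum_{i} \lambda_i f_i(c, t)\right)}
                   {\sum_{t'} \exp\left(\sum_{i} \lambda_i f_i(c, t')\right)}
```

Each weight \lambda_i shows how strongly feature f_i pushes the prediction toward tag t, which is the interpretability property mentioned above.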

Syntactic analysis is the process of determining the grammatical relations between the individual words in a sentence. This embodiment uses a statistics-based constituency parser as the syntactic analysis model. The parser performs structured analysis on sentences, dividing them into words and phrases and determining the hierarchical relationships and dependencies between them. Such analysis provides a deeper understanding of sentences by revealing their constituents and the grammatical relations between those constituents. The statistics-based constituency parser uses probabilistic models and training data, and determines the most probable syntactic structure through learning and inference. This helps in understanding the grammatical rules and semantic meaning of sentences, thereby supporting natural language understanding and generation tasks. The constituency parser can also provide richer context information for improving the performance of a language model: by revealing the phrase structure and dependencies in sentences, it supplies a more accurate context representation for the generation and prediction of language models.
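One common way to find "the most probable syntactic structure" is probabilistic CKY parsing. The sketch below is an assumption about how such a parser could look, not the method of the source: it requires a grammar in Chomsky normal form, and the LEXICAL and BINARY rule tables are hypothetical toy values rather than probabilities learned from a treebank.

```python
import math
from collections import defaultdict

# Hypothetical toy grammar (assumption): word -> {label: probability}
LEXICAL = {"the": {"DT": 1.0}, "dog": {"NN": 1.0}, "barks": {"VP": 1.0}}
# Binary rules (parent, left child, right child) -> probability
BINARY = {("S", "NP", "VP"): 1.0, ("NP", "DT", "NN"): 1.0}


def cky_parse(words, lexical, binary, root="S"):
    """Return the most probable tree for `words`, or None if no parse exists."""
    n = len(words)
    # chart[(i, j)][label] = (log-probability, tree) for the span words[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(words):
        for label, prob in lexical.get(word, {}).items():
            chart[(i, i + 1)][label] = (math.log(prob), (label, word))

    for width in range(2, n + 1):          # span length
        for i in range(n - width + 1):     # span start
            j = i + width
            for k in range(i + 1, j):      # split point
                for (parent, left, right), prob in binary.items():
                    if left in chart[(i, k)] and right in chart[(k, j)]:
                        lp_left, t_left = chart[(i, k)][left]
                        lp_right, t_right = chart[(k, j)][right]
                        score = math.log(prob) + lp_left + lp_right
                        cell = chart[(i, j)]
                        # Keep only the best-scoring analysis per label and span.
                        if parent not in cell or score > cell[parent][0]:
                            cell[parent] = (score, (parent, t_left, t_right))

    best = chart[(0, n)].get(root)
    return best[1] if best else None


print(cky_parse(["the", "dog", "barks"], LEXICAL, BINARY))
```

The nested tuples returned by `cky_parse` encode the phrase hierarchy, so downstream modules can read off constituents such as the noun phrase directly.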

The entity recognition module is used for providing an entity feature extraction service and an entity tag prediction service. The entity feature extraction service is used for learning context information and semantic features of an entity and for learning long-term dependencies and local features in a text sequence to obtain the features of the entity, and the entity tag prediction service is used for predicting the entity tag of each word based on the dependencies among tags and the learned context features of the entity.

In this embodiment, the entity feature extraction service learns context information and semantic features of an entity through multiple features, where the entity features include part of speech, word shape, words in a context window, and a bag-of-words model. In addition, the entity feature extraction module is configured with an entity recognition model constructed based on a recurrent neural network and a convolutional neural network, through which long-term dependencies and local features in a text sequence are learned to obtain the features of an entity. The entity recognition module is further configured with an entity recognition model constructed based on a conditional random field model, through which the entity tag prediction service is provided.

Correspondingly, the entity recognition module can perform operations such as feature extraction and entity tag prediction.

In entity recognition, feature extraction is a key step. On the one hand, this embodiment uses multiple features to capture the context information and semantic features of an entity, including part of speech, word shape, words in a context window, a bag-of-words model and the like; using these features together improves the accuracy and robustness of entity recognition. On the other hand, to further improve recognition performance, this embodiment introduces two deep learning models, the recurrent neural network (RNN) and the convolutional neural network (CNN). An entity recognition model built on these networks can learn long-term dependencies and local features in a text sequence, and performs entity recognition through end-to-end training.
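The hand-crafted features listed above (part of speech, word shape, context-window words, bag of words) can be illustrated with a small extractor. This is only a sketch under stated assumptions: the feature names, the window size, the "<PAD>" marker and the helper functions `word_shape` and `token_features` are all hypothetical and are not taken from the source.

```python
import re


def word_shape(token):
    """Map a token to a coarse shape: uppercase -> X, lowercase -> x, digit -> d."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    return re.sub(r"\d", "d", shape)


def token_features(tokens, pos_tags, i, window=1):
    """Build a feature dictionary for the token at position i."""
    feats = {
        "word": tokens[i].lower(),
        "pos": pos_tags[i],
        "shape": word_shape(tokens[i]),
    }
    # Words in the context window, with padding at sentence boundaries.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    # Bag of words: order-free indicators for every word in the window.
    for tok in tokens[max(0, i - window): i + window + 1]:
        feats[f"bow:{tok.lower()}"] = 1
    return feats


tokens = ["Inspur", "released", "2023"]
pos_tags = ["NNP", "VBD", "CD"]
print(token_features(tokens, pos_tags, 0))
```

Dictionaries of this kind can be fed to a feature-based tagger, or replaced by learned embeddings when the RNN and CNN encoders are used instead.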

For entity tag prediction, the embodiment adopts a classical conditional random field (Conditional Random Fields, CRF) model as an entity tag prediction model, the CRF model can predict the entity tag of each word by learning context characteristics in consideration of the mutual dependency relationship among tags, and the CRF model is widely applied to entity recognition tasks and has better performance.
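For reference, a linear-chain CRF is conventionally defined as below. The notation is introduced here for illustration and is not from the source: x is the input word sequence, y the tag sequence, f_k are feature functions over adjacent tags and the input, \lambda_k their weights, and Z(x) the normalizer over all tag sequences.

```latex
p(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, x, t)\right),
\qquad
Z(x) = \sum_{y'} \exp\left(\sum_{t=1}^{T} \sum_{k} \lambda_k f_k(y'_{t-1}, y'_t, x, t)\right)
```

Because each f_k sees both y_{t-1} and y_t, the model scores tag transitions jointly with the context features, which is the mutual dependency among tags referred to above.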

The event extraction module is used for providing event feature extraction service, event template matching service, event classification and extraction service and event relation modeling service, the event feature extraction service is used for extracting key features of the events, the event template matching service is used for identifying and extracting events of a specific type based on a predefined event template, the event classification and extraction service is used for classifying and extracting the events by learning marked events, and the event relation modeling service is used for constructing relations among the events based on time sequence, logical relations and semantic connections in the text.

In this embodiment, the event extraction module extracts key features of an event through multiple features, where the key features include verbs, noun phrases, time phrases, and parts of speech, and identifies key elements of the event by learning these features, where the key elements include actions, participants, and time. The event template describes the relationships between the elements of the event. In addition, the event extraction module is configured with an event classification extraction model based on a recurrent neural network, through which the event classification and extraction service is provided.

Correspondingly, the event extraction module can execute the operations of event feature extraction, event template matching, event classification and extraction, event relationship modeling and the like.

For event feature extraction, the present embodiment employs a variety of features to capture key features of an event, including verbs, noun phrases, time phrases, parts of speech, and the like. By extracting the characteristics, key elements such as actions, participants, time and the like in the event can be identified, so that the accurate extraction of the event is realized.

For event template matching, this embodiment uses an event template matching method to identify and extract events of a specific type. Event templates are predefined patterns or rules describing the relationships between the individual elements of an event. By matching text against the event templates, events of specific types and their related information can be identified.
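Template matching of this kind can be sketched with regular expressions whose named groups stand for event elements. The two templates, the event type names and the role names below are hypothetical examples; the source does not specify a template format.

```python
import re

# Hypothetical event templates (assumption): named groups are the event roles.
EVENT_TEMPLATES = {
    "Acquisition": re.compile(
        r"(?P<buyer>[A-Z]\w+) acquired (?P<target>[A-Z]\w+)"
        r"(?: on (?P<time>\d{4}-\d{2}-\d{2}))?"
    ),
    "Appointment": re.compile(
        r"(?P<organization>[A-Z]\w+) appointed (?P<person>[A-Z]\w+) as (?P<role>\w+)"
    ),
}


def match_events(sentence):
    """Return every event whose template matches `sentence`, with its filled roles."""
    events = []
    for event_type, pattern in EVENT_TEMPLATES.items():
        for match in pattern.finditer(sentence):
            # Optional roles that did not match are dropped.
            roles = {k: v for k, v in match.groupdict().items() if v is not None}
            events.append({"type": event_type, "roles": roles})
    return events


print(match_events("Acme acquired Globex on 2023-10-12."))
```

Surface patterns like these are brittle, which is presumably why the embodiment pairs template matching with a learned classification model.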

For event classification and extraction, this embodiment introduces a recurrent neural network (RNN) and builds an event classification extraction model on it to further improve the accuracy of event extraction. The model learns from a large amount of labeled data, and events are then classified and extracted automatically using the model.

In addition to identifying and extracting individual events, this embodiment can also model the relationships between events. By analyzing time sequence, logical relations and semantic connections in the text, a relation network among events is constructed, further enriching the event information in the knowledge graph.
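As a rough illustration of building a relation network among events, the sketch below links events by time order and by shared participants. It covers only these two simple heuristics; logical and semantic relations would need additional models. The event record layout and the relation names BEFORE and SHARES_PARTICIPANT are assumptions made for this example.

```python
from datetime import date
from itertools import combinations


def build_event_relations(events):
    """Return (source_id, relation, target_id) triples between extracted events.

    Each event is a dict with hypothetical keys: "id", "time" (a date)
    and "participants" (a set of entity names).
    """
    relations = []
    ordered = sorted(events, key=lambda e: e["time"])
    for earlier, later in combinations(ordered, 2):
        # Temporal relation: strictly earlier events precede later ones.
        if earlier["time"] < later["time"]:
            relations.append((earlier["id"], "BEFORE", later["id"]))
        # Simple semantic link: the two events share at least one participant.
        if earlier["participants"] & later["participants"]:
            relations.append((earlier["id"], "SHARES_PARTICIPANT", later["id"]))
    return relations


events = [
    {"id": "e1", "time": date(2023, 1, 1), "participants": {"A", "B"}},
    {"id": "e2", "time": date(2023, 3, 1), "participants": {"B"}},
    {"id": "e3", "time": date(2023, 3, 1), "participants": {"C"}},
]
print(build_event_relations(events))
```

The resulting triples have the same shape as edges in a graph, so they can be written into the knowledge graph alongside entity relations.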

The knowledge graph construction module is used for providing a data model definition service, a data storage service, a data update import service and a visual interaction service, wherein the data model definition service is used for defining entities, relations and attributes based on graph structures and organizing relations among the entities, the relations and the attributes, the data storage service is used for taking a graph database as a storage engine of the knowledge graph, the data update import service is used for providing a data update import interface, the data update import interface supports updating of the knowledge graph in a full or incremental mode, and the visual interaction service is used for browsing and navigating the knowledge graph through a graphical interface or a visual tool based on the data model and the graph database.

In this embodiment, for the data model of the graph structure, the entity is a node in the graph, the relationship is an edge in the graph, and the attribute is an attribute of the node and the edge.

Correspondingly, the knowledge graph construction module of the embodiment can provide operations such as data model design, data storage, data import and update, visualization, interaction and the like.

Data model design: the data model of the knowledge graph is designed, including how entities, relations, attributes, and the like are defined and organized. The embodiment adopts a graph structure as the data model, in which an entity is a node in the graph, a relation is an edge in the graph, and an attribute is a property of a node or an edge. By defining appropriate entity types, relation types, and attribute types, a rich and flexible knowledge graph data model can be established.
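The graph data model described above can be sketched with a few Python dataclasses. The class and field names (`Node`, `Edge`, `KnowledgeGraph`, `rel_type`, and so on) are hypothetical; they only illustrate entities as nodes, relations as edges, and attributes as properties of both.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Node:
    node_id: str
    label: str                       # entity type, e.g. "Company"
    properties: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    rel_type: str                    # relation type, e.g. "acquired"
    properties: Dict[str, Any] = field(default_factory=dict)


class KnowledgeGraph:
    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Edge] = []

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # Edges may only connect entities that already exist in the graph.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges.append(edge)

    def neighbors(self, node_id: str, rel_type: Optional[str] = None) -> List[str]:
        return [e.target for e in self.edges
                if e.source == node_id and (rel_type is None or e.rel_type == rel_type)]


kg = KnowledgeGraph()
kg.add_node(Node("c1", "Company", {"name": "Acme"}))
kg.add_node(Node("c2", "Company", {"name": "Globex"}))
kg.add_edge(Edge("c1", "c2", "acquired", {"year": 2021}))
```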

Data storage: the embodiment adopts a graph database as the storage engine of the knowledge graph. A graph database is a database system dedicated to storing and querying graph data; it can efficiently store large numbers of nodes and edges and provides flexible query and navigation functions. The storage model and indexing mechanism of the graph database effectively support query and analysis operations on the knowledge graph.

Data import and update: the knowledge graph needs to be updated and maintained regularly to keep its content accurate and timely. The embodiment provides a data import and update interface, through which new data sources can be integrated into the knowledge graph and incremental updates and synchronization can be performed, so that the knowledge graph continues to evolve and stay current.
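The difference between a full import and an incremental update can be sketched as follows, using an in-memory dictionary in place of the graph database. The record layout, the `deleted` flag, and both function names are assumptions for illustration only.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]
Store = Dict[str, Record]


def full_import(records: List[Record]) -> Store:
    """Full mode: rebuild the whole store from the given records."""
    return {r["id"]: dict(r) for r in records}


def incremental_update(store: Store, changes: List[Record]) -> Dict[str, int]:
    """Incremental mode: apply only inserts, updates and deletions."""
    stats = {"inserted": 0, "updated": 0, "deleted": 0}
    for change in changes:
        key = change["id"]
        if change.get("deleted"):
            if store.pop(key, None) is not None:
                stats["deleted"] += 1
        elif key in store:
            store[key].update(change)      # merge new attribute values
            stats["updated"] += 1
        else:
            store[key] = dict(change)
            stats["inserted"] += 1
    return stats


store = full_import([{"id": "n1", "name": "Acme"}])
stats = incremental_update(store, [
    {"id": "n1", "city": "Jinan"},         # update an existing node
    {"id": "n2", "name": "Globex"},        # insert a new node
    {"id": "n3", "deleted": True},         # deletion of an unknown node is a no-op
])
```

Incremental mode touches only the changed records, which is why it suits frequent synchronization, while full mode is simpler when the source data is replaced wholesale.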

Visualization and interaction: in order to better show and utilize the content of the knowledge graph, the embodiment provides the visualization and interaction functions based on the knowledge graph and the graph database, and a user can browse and navigate the knowledge graph through a graphical interface or the visualization tool so as to more intuitively understand and explore the association relationship of the knowledge.

The knowledge representation and retrieval module is used for providing a knowledge representation service, a knowledge retrieval service, a similarity calculation service, and an inference expansion service. The knowledge representation service is used for vectorizing the entities and relations in the knowledge graph based on a graph representation learning technology; the knowledge retrieval service is used for supporting a user in performing data query and data filtering through a query language or an API so as to acquire entities and relations; the similarity calculation service is used for finding potentially associated and similar new knowledge based on the similarity between entities or the similarity between relations; and the inference expansion service is used for analyzing the logical relations and semantic connections between entities and relations and discovering new entities and relations from the knowledge graph.

In this embodiment, the knowledge retrieval service supports querying according to the attribute of the entity, the type of the relationship, and the time of the event as conditions, so as to obtain the entity and the relationship meeting the conditions. The inference expansion service is used for analyzing logical relations and semantic connections between entities and relations through logical inference or graph algorithms and discovering new entities and relations from the knowledge graph.

Correspondingly, the knowledge representation and retrieval module in the embodiment can provide knowledge representation, knowledge retrieval, similarity calculation, reasoning, expansion and other operations.

Knowledge representation: the embodiment adopts graph representation learning technology to vectorize the entities and relations in the knowledge graph. Graph representation learning is a technique that maps the nodes and edges of a graph into a low-dimensional vector space; by learning representation vectors for the nodes and edges, the semantic associations between them can be captured. Once entities and relations are mapped to a continuous vector space, knowledge reasoning and analysis can be performed more efficiently and flexibly.
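The embodiment does not name a specific representation learning method. As one well-known example of the idea, the Python sketch below uses a TransE-style scoring function, in which a triple (head, relation, tail) is plausible when the head vector plus the relation vector lies close to the tail vector. The two-dimensional vectors are hand-set for illustration rather than learned, and all names are assumptions.

```python
import math
from typing import Dict, List

# Hand-set toy embeddings; a real system would learn these from the graph.
ENTITY_VECS: Dict[str, List[float]] = {
    "Jinan": [0.0, 0.0],
    "Shandong": [1.0, 0.0],
    "Beijing": [0.0, 1.0],
}
RELATION_VECS: Dict[str, List[float]] = {"located_in": [1.0, 0.0]}


def transe_score(head: str, rel: str, tail: str) -> float:
    """Negative Euclidean distance between (head + relation) and tail."""
    h, r, t = ENTITY_VECS[head], RELATION_VECS[rel], ENTITY_VECS[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head: str, rel: str) -> List[str]:
    """Order candidate tail entities from most to least plausible."""
    candidates = [e for e in ENTITY_VECS if e != head]
    return sorted(candidates, key=lambda e: transe_score(head, rel, e), reverse=True)


ranking = rank_tails("Jinan", "located_in")
```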

Knowledge retrieval: in this embodiment, the user may perform complex query and filtering operations using a query language or an API. The user can query by conditions such as the attributes of entities, the types of relations, and the times of events, and obtain the entities and relations that satisfy the conditions. Knowledge retrieval helps users find relevant knowledge quickly and supports the needs of knowledge reasoning and analysis.
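A conditional query of this kind can be sketched as a filter over stored triples. The triple layout, the parameter names, and the use of ISO date strings for event times are assumptions made for this Python example; an actual deployment would issue the equivalent query to the graph database.

```python
from typing import Any, Dict, List, Optional

Triple = Dict[str, Any]


def query_triples(
    triples: List[Triple],
    rel_type: Optional[str] = None,
    head_attrs: Optional[Dict[str, Any]] = None,
    time_from: Optional[str] = None,
    time_to: Optional[str] = None,
) -> List[Triple]:
    """Filter by relation type, head-entity attributes, and event time range."""
    results = []
    for t in triples:
        if rel_type is not None and t["rel"] != rel_type:
            continue
        if head_attrs and any(t["head"].get(k) != v for k, v in head_attrs.items()):
            continue
        when = t.get("time")
        # Triples without a time never satisfy a time condition.
        if time_from is not None and (when is None or when < time_from):
            continue
        if time_to is not None and (when is None or when > time_to):
            continue
        results.append(t)
    return results


TRIPLES = [
    {"head": {"name": "Acme", "type": "Company"}, "rel": "acquired",
     "tail": {"name": "Globex"}, "time": "2021-06-01"},
    {"head": {"name": "Acme", "type": "Company"}, "rel": "acquired",
     "tail": {"name": "Initech"}, "time": "2023-02-15"},
    {"head": {"name": "Alice", "type": "Person"}, "rel": "works_for",
     "tail": {"name": "Acme"}},
]
hits = query_triples(TRIPLES, rel_type="acquired",
                     head_attrs={"type": "Company"}, time_from="2022-01-01")
```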

Similarity calculation: the knowledge representation and retrieval module of the present embodiment also provides similarity calculation. Potentially associated and similar knowledge can be found by calculating the similarity between entities or between relations. The similarity calculation is based on a vector space model and a graph matching algorithm, and helps users discover new knowledge and associations.
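Two common similarity measures are sketched below in Python: cosine similarity for the vector space side, and Jaccard overlap of neighbor sets as a deliberately simple stand-in for graph-based matching. Neither measure is prescribed by the embodiment, and the function names are assumptions.

```python
import math
from typing import Iterable, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_u * norm_v)


def jaccard_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Overlap between the neighbor sets of two entities in the graph."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
overlap = jaccard_similarity({"a", "b"}, {"b", "c"})
```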

Reasoning and expansion: the knowledge representation and retrieval module of the present embodiment supports reasoning and expansion based on the knowledge graph. By analyzing the logical relations and semantic connections between entities and relations, inference operations can be performed to discover new entities and relations. Reasoning can be achieved through methods such as logical reasoning and graph algorithms, helping the user mine more implicit knowledge and associations from the knowledge graph.
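A simple instance of logical reasoning over the graph is a transitivity rule: if A is located in B and B is located in C, then A is located in C. The Python sketch below applies such a rule until no new triples appear; the relation name and the function `infer_transitive` are hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def infer_transitive(triples: Iterable[Triple], rel: str) -> Set[Triple]:
    """Return the triples newly implied by treating `rel` as transitive."""
    known = set(triples)
    inferred: Set[Triple] = set()
    changed = True
    while changed:                      # repeat until a fixed point is reached
        changed = False
        facts = [(h, t) for h, r, t in known if r == rel]
        for h1, t1 in facts:
            for h2, t2 in facts:
                if t1 == h2 and h1 != t2:
                    new = (h1, rel, t2)
                    if new not in known:
                        known.add(new)
                        inferred.add(new)
                        changed = True
    return inferred


facts = {
    ("A", "located_in", "B"),
    ("B", "located_in", "C"),
    ("C", "located_in", "D"),
}
new_facts = infer_transitive(facts, "located_in")
```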

The system of the embodiment realizes the automatic processing and knowledge representation of a large amount of text data by applying a natural language processing algorithm in the information extraction and knowledge graph construction process. The system can improve the accuracy and efficiency of information extraction, construct a knowledge graph with consistency and expandability, and provide effective support for knowledge representation, retrieval and application. Meanwhile, by considering real-time processing and expandability, the technical scheme can adapt to the increasing data volume and the requirements of real-time application.

Example 2:

In the information extraction and knowledge graph construction method of this embodiment, information extraction is carried out and a knowledge graph is constructed through the system disclosed in Embodiment 1. The method comprises the steps of data preprocessing, entity identification, event extraction, knowledge graph construction, and knowledge representation and retrieval.

Data preprocessing: the continuous text sequence is segmented into discrete words or tokens through the word segmentation service, the part of speech of each word is determined through the part-of-speech tagging service, and the grammatical relations among the words in a sentence are determined through the syntax analysis service, so that the grammatical structure is obtained.

In this embodiment, for data preprocessing, the word segmentation service is provided through a word segmentation model constructed based on a hidden Markov model; the part-of-speech tagging service is provided through a part-of-speech tagging model constructed based on a maximum entropy model; and the syntax analysis service is provided through a syntactic analysis model constructed based on a statistical constituency parser.

The specific implementation of data preprocessing comprises operations such as word segmentation, part-of-speech tagging, and syntactic analysis.

Word segmentation is the process of segmenting a continuous text sequence into discrete words or tokens. In this embodiment, a hidden Markov model is used as the word segmentation model.
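HMM-based segmentation is commonly framed as tagging each character with B (begin), M (middle), E (end), or S (single-character word) and decoding the best tag sequence with the Viterbi algorithm. The Python sketch below follows that framing; the tag scheme, the hand-set probabilities, and the sample string are assumptions for illustration, since the embodiment does not give model parameters.

```python
import math
from typing import Dict, List

STATES = ("B", "M", "E", "S")
NEG_INF = float("-inf")

# Hand-set toy parameters; a real model estimates them from a segmented corpus.
START = {"B": 0.6, "S": 0.4}
TRANS = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.3, "E": 0.7},
         "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.6, "S": 0.4}}
EMIT: Dict[str, Dict[str, float]] = {
    "B": {"北": 1.0}, "M": {}, "E": {"京": 1.0}, "S": {"我": 0.5, "爱": 0.5},
}
FLOOR = 1e-8  # smoothing probability for unseen characters


def _log(p: float) -> float:
    return math.log(p) if p > 0 else NEG_INF


def viterbi(text: str) -> List[str]:
    scores = [{s: _log(START.get(s, 0.0)) + _log(EMIT[s].get(text[0], FLOOR))
               for s in STATES}]
    back: List[Dict[str, str]] = []
    for ch in text[1:]:
        row, ptr = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[-1][p] + _log(TRANS[p].get(s, 0.0)))
            row[s] = scores[-1][prev] + _log(TRANS[prev].get(s, 0.0)) + _log(EMIT[s].get(ch, FLOOR))
            ptr[s] = prev
        scores.append(row)
        back.append(ptr)
    last = max(("E", "S"), key=lambda s: scores[-1][s])  # a word must be closed
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def segment(text: str) -> List[str]:
    if not text:
        return []
    words, current = [], ""
    for ch, tag in zip(text, viterbi(text)):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    return words + ([current] if current else [])


words = segment("我爱北京")  # "I love Beijing", written without spaces
```

With these toy parameters the decoded tags are S, S, B, E, so the last two characters are grouped into the single word for "Beijing".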

Part-of-speech tagging is the process of determining the part of speech of each word. The embodiment uses a maximum entropy model as the part-of-speech tagging model. The maximum entropy model can select suitable features for different languages and application scenarios, and allows multiple features to be combined for modeling. This flexibility enables the model to make full use of context information, lexical information, and other linguistic features, which improves the accuracy of part-of-speech tagging. In the part-of-speech tagging task, the tag of each word typically depends on the tags of other words in its context. By taking global context information into account, the maximum entropy model can resolve part-of-speech ambiguity. It is able to capture transition probabilities between parts of speech and thereby infer the tag of each word more accurately. The maximum entropy model is also comparatively interpretable: it exposes the weight and contribution of each feature in the model, so that the output of the model can be interpreted and understood, which facilitates debugging and improving the model.
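In a maximum entropy (log-linear) tagger, the probability of a tag given its context is proportional to the exponential of the summed weights of the active features. The Python sketch below shows that computation with hand-set weights and greedy left-to-right decoding; the feature templates, weights, tag set, and decoding strategy are assumptions for illustration, and a trained model would learn the weights from annotated text.

```python
import math
from typing import Dict, List

TAGS = ("NOUN", "VERB", "DET")

# Hypothetical weights keyed by (feature, tag).
WEIGHTS = {
    ("word=the", "DET"): 3.0,
    ("prev_tag=<s>", "DET"): 0.5,
    ("prev_tag=DET", "NOUN"): 2.0,
    ("suffix=s", "VERB"): 1.0,
    ("suffix=s", "NOUN"): 0.5,
    ("prev_tag=NOUN", "VERB"): 1.5,
}


def features(words: List[str], i: int, prev_tag: str) -> List[str]:
    """Combine lexical, suffix and previous-tag features for position i."""
    w = words[i].lower()
    return [f"word={w}", f"suffix={w[-1]}", f"prev_tag={prev_tag}"]


def tag_distribution(words: List[str], i: int, prev_tag: str) -> Dict[str, float]:
    feats = features(words, i, prev_tag)
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in feats) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())   # normalization constant
    return {t: math.exp(s) / z for t, s in scores.items()}


def greedy_tag(words: List[str]) -> List[str]:
    tags, prev = [], "<s>"
    for i in range(len(words)):
        dist = tag_distribution(words, i, prev)
        prev = max(dist, key=dist.get)
        tags.append(prev)
    return tags


tags = greedy_tag(["the", "dog", "barks"])
```

Because each weight is attached to a named feature and tag, inspecting `WEIGHTS` shows directly which evidence pushed the model toward a given tag, which is the interpretability property noted above.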

Syntactic analysis is the process of determining the grammatical relations between the individual words in a sentence. The present embodiment uses a statistical constituency parser as the syntactic analysis model, which performs structured analysis on sentences, dividing them into words and phrases and determining the hierarchical relationships and dependencies between them. The constituency parser can provide richer context information for improving the performance of the language model. By revealing the phrase structure and dependencies in sentences, the parser provides a more accurate context representation for the generation and prediction of language models.
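A classic statistical approach to constituency parsing is the CKY algorithm over a probabilistic context-free grammar in Chomsky normal form. The Python sketch below is a minimal version of that approach; the embodiment does not state which parsing algorithm it uses, and the toy grammar, its probabilities, and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Toy PCFG in Chomsky normal form with hypothetical probabilities.
BINARY = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 1.0,
    ("VP", ("VB", "NP")): 1.0,
}
LEXICAL = {
    ("DT", "the"): 1.0,
    ("NN", "dog"): 0.5,
    ("NN", "cat"): 0.5,
    ("VB", "chased"): 1.0,
}


def cky_parse(words: List[str]) -> Optional[Tuple[float, tuple]]:
    """Return (probability, tree) of the best S parse, or None if none exists."""
    n = len(words)
    # best[(i, j)][label] = (probability, tree) for the span words[i:j]
    best: Dict[Tuple[int, int], Dict[str, Tuple[float, tuple]]] = defaultdict(dict)
    for i, w in enumerate(words):
        for (tag, word), p in LEXICAL.items():
            if word == w:
                best[(i, i + 1)][tag] = (p, (tag, w))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for (parent, (left, right)), p in BINARY.items():
                    if left in best[(i, k)] and right in best[(k, j)]:
                        lp, lt = best[(i, k)][left]
                        rp, rt = best[(k, j)][right]
                        prob = p * lp * rp
                        if prob > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (prob, (parent, lt, rt))
    return best[(0, n)].get("S")


result = cky_parse("the dog chased the cat".split())
```

The returned tree nests phrases inside phrases, which is the hierarchical structure that later modules can use as context.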

Entity identification: the entity feature extraction service learns the context information and semantic features of entities, and learns the long-term dependencies and local features in the text sequence, to obtain the features of the entities; the entity tag prediction service learns the context features of the entities and, based on the dependencies among tags, predicts the entity tag of each word.

In this embodiment, for entity feature extraction, the context information and semantic features of an entity are learned through multiple feature types, where the entity features include parts of speech, word shapes, words in a context window, and bag-of-words models. An entity recognition model constructed from a recurrent neural network and a convolutional neural network learns the long-term dependencies and local features in the text sequence to obtain the features of the entity. The entity tag prediction service is provided through an entity recognition model constructed based on a conditional random field model.
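The role of the conditional random field layer can be illustrated by its decoding step: per-token scores from the feature network are combined with tag-to-tag transition scores, and the best overall tag sequence is found with the Viterbi algorithm. In the Python sketch below, the BIO tag set, the emission scores, and the single forbidden transition are hand-set assumptions; a trained model would learn both score tables.

```python
from typing import Dict, List

TAGS = ("O", "B-ORG", "I-ORG")
FORBIDDEN = -1e9

# Transition scores default to 0; I-ORG may not follow O or start a sentence.
TRANSITIONS = {("O", "I-ORG"): FORBIDDEN}
START = {"I-ORG": FORBIDDEN}


def crf_decode(emissions: List[Dict[str, float]]) -> List[str]:
    """Viterbi decoding over emission scores plus transition scores."""
    score = {t: emissions[0][t] + START.get(t, 0.0) for t in TAGS}
    back: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_score, ptr = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + TRANSITIONS.get((p, cur), 0.0))
            new_score[cur] = score[prev] + TRANSITIONS.get((prev, cur), 0.0) + em[cur]
            ptr[cur] = prev
        score = new_score
        back.append(ptr)
    last = max(TAGS, key=score.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical scores for the tokens "Inspur", "Software", "released".
emissions = [
    {"O": 0.1, "B-ORG": 2.0, "I-ORG": 0.5},
    {"O": 0.5, "B-ORG": 0.4, "I-ORG": 1.5},
    {"O": 2.0, "B-ORG": 0.1, "I-ORG": 0.3},
]
labels = crf_decode(emissions)

# Per-token argmax here would give the invalid sequence O, I-ORG.
ambiguous = [
    {"O": 1.0, "B-ORG": 0.8, "I-ORG": 0.0},
    {"O": 0.0, "B-ORG": 0.0, "I-ORG": 1.0},
]
repaired = crf_decode(ambiguous)
```

The second example shows why the dependencies among tags matter: choosing each tag independently would yield an invalid sequence, whereas joint decoding returns a well-formed entity span.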

Specific implementation of entity identification includes feature extraction, entity tag prediction and other operations.

Event extraction: the method comprises the steps of extracting key features of events through an event feature extraction service, identifying and extracting specific types of events through an event template matching service based on a predefined event template, classifying and extracting the events through learning marked events based on an event classifying and extracting service, and constructing relations among the events through an event relation modeling service based on time sequences, logical relations and semantic connections in texts.

In this embodiment, for event extraction, key features of an event are extracted based on multiple features, the key features of the event include verbs, noun phrases, time phrases and parts of speech, and key factors in the event are identified by learning the key features of the event, and the key factors include actions, participants and time. The event template is described with relations among all elements in the event; the present embodiment provides event classification and extraction services by being configured with an event classification extraction model based on a recurrent neural network.

Specific implementation of event extraction includes event feature extraction, event template matching, event classification and extraction, event relationship modeling and other operations.

For modeling the relations between events, a relation network among the events is constructed by analyzing the time sequences, logical relations, and semantic connections in the text, so that the event information in the knowledge graph is further enriched.

Knowledge graph construction: defining entities, relationships and attributes and organization relations among the entities, relationships and attributes through a data model based on a graph structure, taking a graph database as a storage engine of the knowledge graph, providing a data update import interface for the graph database, supporting the update of the knowledge graph in a full or incremental mode, and browsing and navigating the knowledge graph through a graphical interface or a visual tool based on the data model and the graph database.

In this embodiment, for the data model of the graph structure, the entity is the node in the graph, the relationship is the edge in the graph, and the attribute is the attribute of the node and the edge.

The specific implementation of knowledge graph construction comprises operations such as data model design, data storage, data import and update, and visualization and interaction.

In this embodiment, for knowledge representation and retrieval, queries are performed through the knowledge retrieval service using the attributes of entities, the types of relations, and the times of events as conditions, and the entities and relations satisfying the conditions are obtained; the logical relations and semantic connections between entities and relations are analyzed through logical reasoning or graph algorithms, and new entities and relations are discovered from the knowledge graph.

The specific implementation of knowledge representation and retrieval comprises operations such as knowledge representation, knowledge retrieval, similarity calculation, and reasoning and expansion.

While the invention has been illustrated and described in detail in the drawings and in the preferred embodiments, the invention is not limited to the disclosed embodiments, but it will be apparent to those skilled in the art that many more embodiments of the invention can be made by combining the means of the various embodiments described above and still fall within the scope of the invention.

Claims

1. An information extraction and knowledge graph construction system, comprising:

2. The information extraction and knowledge graph construction system according to claim 1, wherein the data preprocessing module is configured with a word segmentation model constructed based on a hidden Markov model, and the word segmentation model is used for providing word segmentation service;

3. The system for information extraction and knowledge graph construction according to claim 1, wherein the entity feature extraction service is configured to learn, through multiple features, context information and semantic features of an entity, the entity features including parts of speech, shapes of words, words in a context window, and word bag models;

4. The system for information extraction and knowledge graph construction according to claim 1, wherein the event extraction module is configured to extract key features of an event, the key features of the event include verbs, noun phrases, time phrases, parts of speech, and identify key factors in the event by learning the key features of the event, the key factors including actions, participants, and time;

The event template is described with relations among all elements in the event;

5. The information extraction and knowledge graph construction system according to claim 1, wherein for the data model of the graph structure, the entities are nodes in the graph, the relationships are edges in the graph, and the attributes are attributes of the nodes and edges.

6. The information extraction and knowledge graph construction system according to claim 1, wherein the knowledge retrieval service supports query according to the attribute of the entity, the type of the relationship, the time of the event as conditions, and obtains the entity and the relationship satisfying the conditions;

7. An information extraction and knowledge graph construction method, characterized in that the information extraction and knowledge graph construction system according to any one of claims 1-6 is used for information extraction and knowledge graph construction, the method comprises the following steps:

8. The information extraction and knowledge graph construction method according to claim 7, wherein for data preprocessing, a word segmentation service is provided through a word segmentation model constructed based on a hidden Markov model;

9. The method for extracting information and constructing a knowledge graph according to claim 7, wherein for entity extraction, the contextual information and semantic features of the entity are learned through multiple features, and the entity features comprise parts of speech, word shape, words in a contextual window, word bag models;

the event template is described with relations among all elements in the event;

10. The method for information extraction and knowledge graph construction according to claim 7, wherein for the data model of the graph structure, the entities are nodes in the graph, the relationships are edges in the graph, and the attributes are attributes of the nodes and the edges;