CN115238029A - Construction method and device of power failure knowledge graph - Google Patents

Construction method and device of power failure knowledge graph

Info

Publication number
CN115238029A
CN115238029A
Authority
CN
China
Prior art keywords
power failure
model
data
knowledge
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210710873.6A
Other languages
Chinese (zh)
Inventor
丁一
滕飞
张磐
霍现旭
庞超
杨挺
尚学军
陈沛
吴磊
张思涵
肖文瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd, Electric Power Research Institute of State Grid Tianjin Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202210710873.6A priority Critical patent/CN115238029A/en
Publication of CN115238029A publication Critical patent/CN115238029A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • General Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Public Health (AREA)
  • Water Supply & Treatment (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and a device for constructing a power failure knowledge graph, comprising the following steps: step 1, acquiring the data to be processed: acquiring power failure preprocessing text data; step 2, performing data preprocessing; step 3, performing entity extraction on the preprocessed data with a combined BERT-BiLSTM-CRF model; step 4, identifying and extracting relations between entities with a method based on dependency analysis, analyzing the dependency relations between sentence components by identifying and locating syntactic relations; step 5, knowledge storage and semantic triple representation; step 6, constructing the power failure knowledge graph. The method and the device can improve the accuracy of Chinese entity recognition and relation extraction.

Description

Construction method and device of power failure knowledge graph
Technical Field
The invention belongs to the technical field of application of artificial intelligence algorithms in power systems, and relates to a method and a device for constructing a knowledge graph, in particular to a method and a device for constructing a power failure knowledge graph.
Background
With the continuous development of power grids, the types and functions of power equipment are more complex than ever before, so the daily operation of the equipment, such as fault diagnosis and maintenance, depends heavily on the professional power knowledge of workers. However, due to the lack of effective knowledge extraction, reasoning and application, the diagnosis of power equipment faults by operation and maintenance personnel relies mainly on their own experience. This subjective approach is not only inefficient but also makes accuracy difficult to ensure.
In fact, the power industry has accumulated a large amount of Chinese technical literature over the decades, which contains rich knowledge of electrical equipment faults. If this knowledge can be extracted accurately and presented to employees in an understandable form or through an intelligent QA system, fault diagnosis will undoubtedly become faster and more accurate. In response to this problem, some academic and applied research attempts to extract, organize and present such knowledge, providing better support for intelligent diagnosis of power equipment faults. Research on knowledge organization of global fault maintenance texts constructs a knowledge model based on various maintenance behaviors, realizes flexible and clear knowledge expression of business logic, helps improve the degree of automation of the power system, and provides global knowledge support for the smart grid. Power text is unstructured data characterized by dense knowledge and rich knowledge types. Compared with structured data with strict format specifications, the expression of power text is more flexible and harder to read and understand.
Therefore, a fault-text natural language processing method and a behavior knowledge organization method suited to the characteristics of power faults need to be explored; how to construct a knowledge graph is the key problem in natural language processing of power text.
Through searching, no prior art publication which is the same as or similar to the present invention is found.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a method and a device for constructing a power failure knowledge graph, and can improve the accuracy of Chinese entity identification and relation extraction.
The invention solves the practical problem by adopting the following technical scheme:
a construction method of a power failure knowledge graph comprises the following steps:
step 1, acquiring data to be processed: acquiring power failure preprocessing text data;
step 2, performing data preprocessing on the power failure preprocessing training data acquired in the step 1;
step 3, adopting a combined BERT-BiLSTM-CRF model to perform entity extraction on the preprocessed data;
step 4, identifying and extracting the relationship between entities by adopting a method based on dependency analysis, and analyzing the dependency relationship between sentence components by identifying and positioning syntactic relationships;
step 5, knowledge storage and semantic triple representation: the knowledge storage specifically comprises the step of storing the entities, the attributes and the relations extracted in the step 4 into a database, and the semantic triple representation specifically comprises the step of representing the extracted knowledge in a triple form;
and 6, constructing the power failure knowledge graph: storing the processed knowledge into a graph database to construct the power failure knowledge graph.
Further, the specific steps of step 2 include:
(1) Word segmentation adopts an HMM-CRF segmentation method: first, the preprocessed data are segmented into words, the words are sorted, and a high-frequency dictionary with characteristic word frequencies is constructed; then the processed document is segmented again with a CRF-based segmentation model and the segmented document is imported into the high-frequency dictionary; a high-precision segmentation result is finally obtained.
(2) Word vector representation uses a Word2vec model to represent the text data; synonyms are identified by computing the cosine similarity between word vectors, and the word vectors obtained from the corpus can also serve as input to the subsequent entity recognition model.
(3) Keywords are extracted and an ontology dictionary is constructed: high-frequency keywords are extracted according to the frequency weight and the mean of the average information entropy, and irrelevant words are removed through manual screening to construct the ontology dictionary.
Moreover, the combined BERT-BiLSTM-CRF model comprises:
(1) BERT layer: performs feature extraction and training through a multilayer neural network, converting the input text into word vectors so that the BiLSTM layer can learn context features; the BERT model converts the input sequence into a comprehensive embedding of three features (tokens, segments and positions), then feeds this embedding into the model for extraction, using a self-attention mechanism and fully connected layers to model the input text.
(2) BiLSTM layer: automatically extracts features of the sentence context, where the input of each BiLSTM unit is a dynamic word vector sequence; the BiLSTM unit then learns how to extract local features of the sentence; finally, the hidden state sequences output by the forward and backward LSTMs are spliced in sentence order to obtain the complete hidden state sequence; the relevant quantities are given by the formulas:
i_t = δ(W_i · [h_{t-1}, x_t] + b_i) (1)
f_t = δ(W_f · [h_{t-1}, x_t] + b_f) (2)
o_t = δ(W_o · [h_{t-1}, x_t] + b_o) (3)
C_t = f_t * C_{t-1} + i_t * tanh(W_c · [h_{t-1}, x_t] + b_c) (4)
h_t = o_t * tanh(C_t) (5)
In formulas (1)-(5), i_t, f_t and o_t denote the three gating units of each LSTM cell: the input gate, the forget gate and the output gate. C_t denotes the cell state at time t, and h_t denotes the output state of the hidden layer at time t. x_t denotes the input at time t. δ(·) is the sigmoid activation function and tanh(·) is the hyperbolic tangent activation function. W_i, W_f, W_o and W_c denote the weight matrices applied to the concatenation of the hidden state vector h_{t-1} and the input vector x_t, and b_i, b_f, b_o and b_c denote the bias vectors.
(3) CRF layer: an undirected probabilistic graphical model of the joint distribution over label sequences; it normalizes local features into global features and, by computing the probability distribution of the whole sequence, obtains the globally optimal solution and alleviates the label bias problem; meanwhile, the CRF model learns implicit constraint rules among labels from the training data.
Moreover, the specific method of the step 4 is as follows:
firstly, the subject and the core predicate are extracted through semantic role labeling; then the object and subject semantically related to the core predicate are found through dependency parsing; finally, the relevant dependency relations in the power failure text and the entity relations based on the ontology structure are obtained through dependency parsing.
A power failure knowledge graph building apparatus comprising:
the data acquisition module is used for acquiring data to be processed and acquiring a power failure preprocessing text;
the data preprocessing module is used for preprocessing the power failure text: segmenting the text, acquiring word vectors, extracting keywords and constructing an ontology dictionary;
the model training module is used for performing entity extraction and relation extraction on the power failure text to be processed: acquiring the word vectors in the preprocessed data, inputting them into a bidirectional long short-term memory (BiLSTM) network for entity extraction, and extracting entity relations through dependency analysis;
and the graph construction module is configured to generate a knowledge graph comprising the entities and the relations between them, according to the entities and relations extracted by the model training module.
The invention has the advantages and beneficial effects that:
the invention fully considers the relation between power entities and the relation between long texts, provides a method and a device for creating a power failure knowledge graph based on a novel model of BERT-BilSTM-CRF, and has the innovation point that the accuracy of Chinese entity identification and relation extraction is improved through subtle fusion. First, the language pre-training module BERT (Bi-directional Encoder reproduction from transformations) uses dynamic word vectors in preliminary entity recognition. The method not only reduces the workload of downstream tasks, but also has higher accuracy in Chinese entity recognition. This is because dynamic word vectors are more advantageous than static word vectors in chinese entity recognition. For example, a dynamic word vector may express different semantics in different contexts. In addition, the CRF (Conditional Random Field) module restrains and reverses the labeling sequence, solves part of labeling deviation, calculates the joint probability of the whole labeling sequence, and can fully ensure the accuracy of Chinese entity identification. According to the method, BERT is used for carrying out data preprocessing on the electric power text in the early stage; the processed data is utilized to carry out electric power entity identification in the middle period; and in the later stage, the CRF is utilized to constrain the labeling sequence, the joint probability is calculated, and the extraction accuracy is ensured. Through experimental data analysis, the BERT-BilSTM-CRF trained by the method aiming at the power text has extremely high accuracy, and has an obvious effect on the construction of a power failure knowledge graph.
Drawings
FIG. 1 is a power failure knowledge graph construction flow diagram of the present invention;
FIG. 2 is a schematic diagram of the CBOW model of the present invention;
FIG. 3 is a block diagram of the BERT-BiLSTM-CRF model of the present invention;
FIG. 4 is an input representation schematic of the BERT model of the present invention;
fig. 5 is a schematic structural diagram of an embodiment of a power failure knowledge graph constructing apparatus provided in the present invention.
Detailed Description
The invention is described in further detail below with reference to the following drawings:
a method for constructing a power failure knowledge graph is shown in FIG. 1, and comprises the following steps:
step 1, acquiring data to be processed: acquiring power failure preprocessing text data;
in this embodiment, the power failure preprocessing training data includes: overhaul texts, operation manuals and the like.
Step 2, performing data preprocessing on the power failure preprocessing training data acquired in the step 1;
in this embodiment, the data preprocessing step specifically includes word segmentation, word vector representation, keyword extraction, and ontology dictionary construction.
The specific steps of the step 2 comprise:
(1) Word segmentation adopts an HMM-CRF segmentation method: first, the preprocessed data are segmented into words, the words are sorted, and a high-frequency dictionary with characteristic word frequencies is constructed; then the processed document is segmented again with a CRF-based segmentation model and the segmented document is imported into the high-frequency dictionary; a high-precision segmentation result is finally obtained.
(2) Word vector representation uses a Word2vec model to represent the text data; a schematic diagram of the CBOW model is shown in fig. 2, comprising an input layer, a hidden layer and an output layer. Synonyms are identified by computing the cosine similarity between word vectors, and the word vectors obtained from the corpus can also serve as input to the subsequent entity recognition model.
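The cosine-similarity computation used to identify synonyms can be sketched as follows; the three-dimensional toy vectors and the vocabulary below are illustrative assumptions, not vectors actually trained by the CBOW model:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy word vectors (a real system would take them from the trained Word2vec model).
vectors = {
    "breaker":    [0.9, 0.1, 0.2],
    "switchgear": [0.8, 0.2, 0.1],
    "insulation": [0.1, 0.9, 0.7],
}

def most_similar(word, vectors):
    # Rank all other words by cosine similarity to `word` and return the best.
    others = [(w, cosine_similarity(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return max(others, key=lambda p: p[1])[0]
```

Here `most_similar("breaker", vectors)` returns `"switchgear"`, the near-synonym, because their vectors point in almost the same direction.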
(3) High-frequency keywords are extracted according to the frequency weight and the mean of the average information entropy, and irrelevant words are removed through manual screening to construct the ontology dictionary.
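One way to read step (3) is as a score combining each word's overall frequency with the entropy of its distribution over documents; the exact weighting below (frequency multiplied by one plus entropy) is an illustrative assumption, since the patent does not give the formula:

```python
import math
from collections import Counter

def keyword_scores(documents):
    # Frequency weight: overall relative frequency of each word.
    freq = Counter(w for doc in documents for w in doc)
    total = sum(freq.values())
    scores = {}
    for word in freq:
        # Information entropy of the word's distribution over the documents
        # it occurs in: words spread evenly across many documents score higher.
        doc_counts = [doc.count(word) for doc in documents if word in doc]
        s = sum(doc_counts)
        entropy = -sum((c / s) * math.log2(c / s) for c in doc_counts)
        scores[word] = (freq[word] / total) * (1.0 + entropy)
    return scores

# Toy segmented corpus of three short fault records.
docs = [["transformer", "fault", "oil"],
        ["transformer", "winding", "fault"],
        ["transformer", "bushing"]]
scores = keyword_scores(docs)
top = max(scores, key=scores.get)
```

With this corpus the top-scoring keyword is "transformer": it is both the most frequent word and the one spread most evenly over the documents. The manual screening step would then remove any irrelevant high scorers before the ontology dictionary is built.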
Step 3, adopting the combined BERT-BiLSTM-CRF model to perform entity extraction on the preprocessed data; the structure of the BERT-BiLSTM-CRF model is shown in figure 3;
the BERT-BilSTM-CRF combined model comprises the following components:
(1) BERT layer: the input representation of the BERT model is shown in fig. 4. Performing feature extraction and training through a multilayer neural network, converting an input text into a word vector, and enabling a BilSTM layer to learn context features; the BERT model converts an input sequence into comprehensive embedding of three characteristics of Tokens, segmentations and Positions, then inputs the comprehensive embedding into the model for extraction, and uses a self-attention mechanism and a full-link layer to model an input text.
In this embodiment, the key part of the BERT model is a deep network based on the self-attention mechanism, which adjusts the weight coefficient matrices mainly through the correlations between words in the same sentence to obtain word representations. Compared with traditional static word vector training, the dynamic word vectors trained by the BERT model contain both the meanings of words and the features of their context words, and can capture implicit sentence-level features.
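The comprehensive embedding described above (token + segment + position) can be sketched with toy lookup tables; the tiny dimension and vocabulary below are illustrative assumptions, far smaller than BERT's real learned embedding matrices:

```python
# Toy lookup tables; real BERT uses learned matrices (vocab ~30k, dim 768).
DIM = 4
token_emb = {"[CLS]": [0.1] * DIM, "fault": [0.2] * DIM, "[SEP]": [0.3] * DIM}
segment_emb = {0: [0.01] * DIM, 1: [0.02] * DIM}       # sentence A vs. sentence B
position_emb = [[0.001 * p] * DIM for p in range(16)]  # positions 0..15

def bert_input(tokens, segments):
    # Comprehensive embedding: elementwise sum of the token, segment and
    # position embeddings at every input position.
    return [[t + s + p for t, s, p in zip(token_emb[tok],
                                          segment_emb[seg],
                                          position_emb[pos])]
            for pos, (tok, seg) in enumerate(zip(tokens, segments))]

x = bert_input(["[CLS]", "fault", "[SEP]"], [0, 0, 0])
```

Each row of `x` is one input position; for example, the second row is 0.2 (token) + 0.01 (segment) + 0.001 (position 1) = 0.211 in every dimension. This summed representation is what the self-attention layers consume.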
(2) BiLSTM layer: automatically extracts features of the sentence context, where the input of each BiLSTM unit is a dynamic word vector sequence; the BiLSTM unit then learns how to extract local features of the sentence; finally, the hidden state sequences output by the forward and backward LSTMs are spliced in sentence order to obtain the complete hidden state sequence. The relevant quantities are given by the formulas:
i_t = δ(W_i · [h_{t-1}, x_t] + b_i) (1)
f_t = δ(W_f · [h_{t-1}, x_t] + b_f) (2)
o_t = δ(W_o · [h_{t-1}, x_t] + b_o) (3)
C_t = f_t * C_{t-1} + i_t * tanh(W_c · [h_{t-1}, x_t] + b_c) (4)
h_t = o_t * tanh(C_t) (5)
In formulas (1)-(5), i_t, f_t and o_t denote the three gating units of each LSTM cell: the input gate, the forget gate and the output gate. C_t denotes the cell state at time t, and h_t denotes the output state of the hidden layer at time t. x_t denotes the input at time t. δ(·) is the sigmoid activation function and tanh(·) is the hyperbolic tangent activation function. W_i, W_f, W_o and W_c denote the weight matrices applied to the concatenation of the hidden state vector h_{t-1} and the input vector x_t, and b_i, b_f, b_o and b_c denote the bias vectors.
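Formulas (1)-(5) can be implemented directly; the sketch below uses a scalar hidden size and hand-picked weights purely for illustration (δ is taken to be the sigmoid, as is standard):

```python
import math

def sigmoid(z):
    # delta(.) in formulas (1)-(3)
    return 1.0 / (1.0 + math.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    # One LSTM step with hidden/input size 1, mirroring formulas (1)-(5);
    # W[k] holds one weight for h_{t-1} and one for x_t.
    hx = [h_prev, x_t]                               # concatenation [h_{t-1}, x_t]
    dot = lambda w: w[0] * hx[0] + w[1] * hx[1]
    i_t = sigmoid(dot(W["i"]) + b["i"])              # (1) input gate
    f_t = sigmoid(dot(W["f"]) + b["f"])              # (2) forget gate
    o_t = sigmoid(dot(W["o"]) + b["o"])              # (3) output gate
    c_t = f_t * c_prev + i_t * math.tanh(dot(W["c"]) + b["c"])  # (4) cell state
    h_t = o_t * math.tanh(c_t)                       # (5) hidden output
    return h_t, c_t

# Illustrative fixed weights; a real BiLSTM learns these during training.
W = {k: [0.5, 0.5] for k in "ifoc"}
b = {k: 0.0 for k in "ifoc"}
h_t, c_t = lstm_cell(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, b=b)
```

A BiLSTM simply runs one such cell forward over the sequence and a second cell backward, then splices the two hidden state sequences, as described above.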
(3) CRF layer: an undirected probabilistic graphical model of the joint distribution over label sequences; it normalizes local features into global features and, by computing the probability distribution of the whole sequence, obtains the globally optimal solution and alleviates the label bias problem; meanwhile, the CRF model learns implicit constraint rules among labels from the training data.
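The globally optimal label sequence that the CRF layer selects can be sketched with Viterbi decoding over emission and transition scores; the toy label set (BIO tags) and the hand-picked scores below are illustrative assumptions:

```python
def viterbi(emissions, transitions, labels):
    # emissions[t][y]: score of label y at position t;
    # transitions[(y_prev, y)]: score of moving from y_prev to y.
    # Returns the single highest-scoring label sequence (the global optimum),
    # which is how the CRF avoids purely local labeling decisions.
    best = {y: (emissions[0][y], [y]) for y in labels}
    for t in range(1, len(emissions)):
        new_best = {}
        for y in labels:
            prev, (score, path) = max(
                ((p, best[p]) for p in labels),
                key=lambda kv: kv[1][0] + transitions[(kv[0], y)])
            new_best[y] = (score + transitions[(prev, y)] + emissions[t][y],
                           path + [y])
        best = new_best
    return max(best.values(), key=lambda sp: sp[0])[1]

labels = ["B", "I", "O"]
# Transition scores encode learned label constraints, e.g. "I cannot follow O".
trans = {(a, b): 0.0 for a in labels for b in labels}
trans[("O", "I")] = -10.0          # strong penalty against the O -> I transition
emis = [{"B": 2.0, "I": 0.0, "O": 1.0},
        {"B": 0.0, "I": 2.0, "O": 0.0},
        {"B": 0.0, "I": 0.0, "O": 2.0}]
path = viterbi(emis, trans, labels)
```

With these scores the decoded path is B, I, O; the transition penalty is exactly the kind of implicit constraint rule the CRF learns from the training data.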
Step 4, identifying and extracting the relationship between entities by adopting a method based on dependency analysis, and analyzing the dependency relationship between sentence components by identifying and positioning syntactic relations;
the specific method of the step 4 comprises the following steps:
firstly, the subject and the core predicate are extracted through semantic role labeling; then the object and subject semantically related to the core predicate are found through dependency parsing; finally, the relevant dependency relations in the power failure text and the entity relations based on the ontology structure are obtained through dependency parsing.
In the present embodiment, dependency parsing is mainly used to analyze four relation structures in a sentence: the subject-verb relation (SBV), the verb-object relation (VOB), the attribute relation (ATT) and the adverbial relation (ADV).
The process of locating and extracting predicates mainly comprises the following parts: locating and extracting all entities that have an SBV structural relation with the core predicate; preferentially locating and extracting entities that have a VOB structural relation with the core predicate; and locating entities that have an ATT structural relation with the core predicate.
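This SBV/VOB/ATT extraction can be sketched over a parse represented as (head, relation, dependent) arcs; a real system would obtain these arcs from a Chinese dependency parser such as LTP (an assumption here, since the patent does not name the parser), and the English example sentence is illustrative only:

```python
def extract_triple(arcs, core_predicate):
    # arcs: list of (head_word, relation, dependent_word) from a dependency parse.
    # First take the entity in an SBV (subject-verb) relation with the core
    # predicate, then the entity in a VOB (verb-object) relation with it.
    subject = next((d for h, r, d in arcs
                    if h == core_predicate and r == "SBV"), None)
    obj = next((d for h, r, d in arcs
                if h == core_predicate and r == "VOB"), None)
    # Attach ATT (attribute) modifiers to refine the subject entity.
    attrs = [d for h, r, d in arcs if h == subject and r == "ATT"]
    entity = " ".join(attrs + [subject]) if subject else None
    return (entity, core_predicate, obj)

# Toy parse of "transformer winding suffered short-circuit".
arcs = [("suffered", "SBV", "winding"),
        ("suffered", "VOB", "short-circuit"),
        ("winding", "ATT", "transformer")]
triple = extract_triple(arcs, "suffered")
```

The result is the semantic triple ("transformer winding", "suffered", "short-circuit"), ready for the storage step below.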
Step 5, knowledge storage and semantic triple representation: the knowledge storage specifically comprises the step of storing the entities, attributes and relations extracted in the step 4 into a database, and the semantic triple representation specifically comprises the step of representing the extracted knowledge in a triple form;
And 6, constructing the power failure knowledge graph: storing the processed knowledge into a graph database to construct the power failure knowledge graph.
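In practice step 6 would target a graph database such as Neo4j; as a self-contained sketch, the minimal in-memory store below is an illustrative stand-in (its class and entity names are assumptions, not part of the patented method):

```python
class TripleStore:
    # Minimal stand-in for a graph database: stores (head, relation, tail)
    # triples and answers one-hop neighbour queries.
    def __init__(self):
        self.triples = []

    def add(self, head, relation, tail):
        self.triples.append((head, relation, tail))

    def neighbours(self, entity):
        # All (relation, tail) pairs whose head is `entity`.
        return [(r, t) for h, r, t in self.triples if h == entity]

kg = TripleStore()
kg.add("transformer", "has_component", "winding")
kg.add("winding", "has_fault", "short-circuit")
kg.add("short-circuit", "causes", "outage")
```

Querying `kg.neighbours("winding")` returns `[("has_fault", "short-circuit")]`; a graph database would answer the same query with an index-backed traversal instead of a linear scan.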
A schematic structural diagram of an embodiment of the power failure knowledge graph construction apparatus is shown in fig. 5. The apparatus comprises:
the data acquisition module is used for acquiring data to be processed and acquiring a power failure preprocessing text;
the data preprocessing module is used for preprocessing the power failure text: segmenting the text, acquiring word vectors, extracting keywords and constructing an ontology dictionary;
the model training module is used for performing entity extraction and relation extraction on the power failure text to be processed: acquiring the word vectors in the preprocessed data, inputting them into a bidirectional long short-term memory (BiLSTM) network for entity extraction, and extracting entity relations through dependency analysis;
and the graph construction module is configured to generate a knowledge graph comprising the entities and the relations between them, according to the entities and relations extracted by the model training module.
It should be emphasized that the embodiments described herein are illustrative and not restrictive, and thus the present invention includes, but is not limited to, the embodiments described in the detailed description, as well as other embodiments that can be derived by one skilled in the art from the teachings herein.

Claims (5)

1. A construction method of a power failure knowledge graph, characterized by comprising the following steps:
step 1, acquiring data to be processed: acquiring power failure preprocessing text data;
step 2, performing data preprocessing on the power failure preprocessing training data acquired in the step 1;
step 3, adopting a combined BERT-BiLSTM-CRF model to perform entity extraction on the preprocessed data;
step 4, identifying and extracting the relationship between entities by adopting a method based on dependency analysis, and analyzing the dependency relationship between sentence components by identifying and positioning syntactic relationships;
step 5, knowledge storage and semantic triple representation: the knowledge storage specifically comprises the step of storing the entities, attributes and relations extracted in the step 4 into a database, and the semantic triple representation specifically comprises the step of representing the extracted knowledge in a triple form;
and 6, constructing the power failure knowledge graph: storing the processed knowledge into a graph database to construct the power failure knowledge graph.
2. The method for constructing the power failure knowledge graph according to claim 1, characterized in that the specific steps of step 2 comprise:
(1) Word segmentation adopts an HMM-CRF segmentation method: first, the preprocessed data are segmented into words, the words are sorted, and a high-frequency dictionary with characteristic word frequencies is constructed; then the processed document is segmented again with a CRF-based segmentation model and the segmented document is imported into the high-frequency dictionary; a high-precision segmentation result is finally obtained.
(2) Word vector representation uses a Word2vec model to represent the text data; synonyms are identified by computing the cosine similarity between word vectors, and the word vectors obtained from the corpus can also serve as input to the subsequent entity recognition model.
(3) Keywords are extracted and an ontology dictionary is constructed: high-frequency keywords are extracted according to the frequency weight and the mean of the average information entropy, and irrelevant words are removed through manual screening to construct the ontology dictionary.
3. The method for constructing the power failure knowledge graph according to claim 1, characterized in that the combined BERT-BiLSTM-CRF model comprises:
(1) BERT layer: performs feature extraction and training through a multilayer neural network, converting the input text into word vectors so that the BiLSTM layer can learn context features; the BERT model converts the input sequence into a comprehensive embedding of three features (tokens, segments and positions), then feeds this embedding into the model for extraction, using a self-attention mechanism and fully connected layers to model the input text.
(2) BiLSTM layer: automatically extracts features of the sentence context, where the input of each BiLSTM unit is a dynamic word vector sequence; the BiLSTM unit then learns how to extract local features of the sentence; finally, the hidden state sequences output by the forward and backward LSTMs are spliced in sentence order to obtain the complete hidden state sequence; the relevant quantities are given by the formulas:
i_t = δ(W_i · [h_{t-1}, x_t] + b_i) (1)
f_t = δ(W_f · [h_{t-1}, x_t] + b_f) (2)
o_t = δ(W_o · [h_{t-1}, x_t] + b_o) (3)
C_t = f_t * C_{t-1} + i_t * tanh(W_c · [h_{t-1}, x_t] + b_c) (4)
h_t = o_t * tanh(C_t) (5)
In formulas (1)-(5), i_t, f_t and o_t denote the three gating units of each LSTM cell: the input gate, the forget gate and the output gate. C_t denotes the cell state at time t, and h_t denotes the output state of the hidden layer at time t. x_t denotes the input at time t. δ(·) is the sigmoid activation function and tanh(·) is the hyperbolic tangent activation function. W_i, W_f, W_o and W_c denote the weight matrices applied to the concatenation of the hidden state vector h_{t-1} and the input vector x_t, and b_i, b_f, b_o and b_c denote the bias vectors.
(3) CRF layer: an undirected probabilistic graphical model of the joint distribution over label sequences; it normalizes local features into global features and, by computing the probability distribution of the whole sequence, obtains the globally optimal solution and alleviates the label bias problem; meanwhile, the CRF model learns implicit constraint rules among labels from the training data.
4. The method for constructing the power failure knowledge graph according to claim 1, characterized in that the specific method of step 4 comprises:
firstly, the subject and the core predicate are extracted through semantic role labeling; then the object and subject semantically related to the core predicate are found through dependency parsing; finally, the relevant dependency relations in the power failure text and the entity relations based on the ontology structure are obtained through dependency parsing.
5. A power failure knowledge graph construction apparatus, characterized by comprising:
the data acquisition module is used for acquiring data to be processed and acquiring a power failure preprocessing text;
the data preprocessing module is used for preprocessing the power failure text: segmenting the text, acquiring word vectors, extracting keywords and constructing an ontology dictionary;
the model training module is used for performing entity extraction and relation extraction on the power failure text to be processed: acquiring the word vectors in the preprocessed data, inputting them into a bidirectional long short-term memory (BiLSTM) network for entity extraction, and extracting entity relations through dependency analysis;
and the graph construction module is configured to generate a knowledge graph comprising the entities and the relations between them, according to the entities and relations extracted by the model training module.
CN202210710873.6A 2022-06-22 2022-06-22 Construction method and device of power failure knowledge graph Pending CN115238029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210710873.6A CN115238029A (en) 2022-06-22 2022-06-22 Construction method and device of power failure knowledge graph


Publications (1)

Publication Number Publication Date
CN115238029A true CN115238029A (en) 2022-10-25

Family

ID=83669662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210710873.6A Pending CN115238029A (en) 2022-06-22 2022-06-22 Construction method and device of power failure knowledge graph

Country Status (1)

Country Link
CN (1) CN115238029A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069955A (en) * 2023-03-06 2023-05-05 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Space-time knowledge extraction method based on MDTA model
CN116414990A (en) * 2023-06-05 2023-07-11 深圳联友科技有限公司 Vehicle fault diagnosis and prevention method
CN116414990B (en) * 2023-06-05 2023-08-11 深圳联友科技有限公司 Vehicle fault diagnosis and prevention method
CN117910567A (en) * 2024-03-20 2024-04-19 道普信息技术有限公司 Vulnerability knowledge graph construction method based on safety dictionary and deep learning network
CN117993499A (en) * 2024-04-03 2024-05-07 江西省水利科学院(江西省大坝安全管理中心、江西省水资源管理中心) Multi-mode knowledge graph construction method for four pre-platforms for flood control in drainage basin
CN117993499B (en) * 2024-04-03 2024-06-04 江西省水利科学院(江西省大坝安全管理中心、江西省水资源管理中心) Multi-mode knowledge graph construction method for four pre-platforms for flood control in drainage basin

Similar Documents

Publication Publication Date Title
CN107679039B (en) Method and device for determining statement intention
CN106407333B (en) Spoken language query identification method and device based on artificial intelligence
CN111737496A (en) Power equipment fault knowledge map construction method
CN115238029A (en) Construction method and device of power failure knowledge graph
CN111931506B (en) Entity relationship extraction method based on graph information enhancement
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN110929030A (en) Text abstract and emotion classification combined training method
CN113642330A (en) Rail transit standard entity identification method based on catalog topic classification
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN111274817A (en) Intelligent software cost measurement method based on natural language processing technology
CN113378547B (en) GCN-based Chinese complex sentence implicit relation analysis method and device
CN114706559A (en) Software scale measurement method based on demand identification
CN113946684A (en) Electric power capital construction knowledge graph construction method
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN115759254A (en) Question-answering method, system and medium based on knowledge-enhanced generative language model
CN111859938A (en) Electronic medical record entity relation extraction method based on position vector noise reduction and rich semantics
CN111178080A (en) Named entity identification method and system based on structured information
CN116522165B (en) Public opinion text matching system and method based on twin structure
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN116467461A (en) Data processing method, device, equipment and medium applied to power distribution network
CN114510943B (en) Incremental named entity recognition method based on pseudo sample replay
CN116484848A (en) Text entity identification method based on NLP
CN114117069B (en) Semantic understanding method and system for intelligent knowledge graph questions and answers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination