CN117194682A - Method, device and medium for constructing knowledge graph based on power grid related file - Google Patents
- Publication number
- CN117194682A (application CN202311466119.3A)
- Authority
- CN
- China
- Prior art keywords
- entity
- constructing
- knowledge graph
- model
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The application discloses a method, a device and a medium for constructing a knowledge graph based on power grid related files, belonging to the field of language processing and addressing the problem that existing models cannot meet the practical requirements of processing power grid related files. The technical scheme is as follows: a method of constructing a knowledge graph based on grid related documents, comprising: obtaining a preprocessed corpus, and predefining an entity set and a relation set; performing multi-entity recognition based on the preprocessed corpus; extracting relations with entity hints based on the multi-entity recognition result; constructing an extraction model, and extracting semantic topics from the multi-entity recognition and relation extraction results; and visualizing the result. A device and a medium for constructing the knowledge graph based on power grid related files are also provided. The application performs semantic topic extraction and division on top of document-level entity recognition and relation extraction, improves the information carrying capacity and the retrieval efficiency of the knowledge graph, and makes the knowledge graph suitable for document-level file processing.
Description
Technical Field
The application belongs to the technical field of language processing, and particularly relates to a method, a device and a medium for constructing a knowledge graph based on a power grid related file.
Background
A Knowledge Graph (KG) is a structured, semantic knowledge representation model that describes entities, concepts, relationships and properties in the real world and represents the associations between them in graph form. Knowledge graphs have great potential in many application fields, providing information organization, information retrieval, and knowledge management and sharing in a visual form. The power industry has a long history and a large body of documents, and the power grid is a fundamental industry of the national economy; constructing a knowledge graph from power grid related files helps business personnel fully understand and apply the relevant texts, greatly improves working efficiency, and is representative for document-level knowledge graph construction in general. However, because the information in power grid related files is complex and wide-ranging, existing models cannot be used to construct a knowledge graph that meets actual requirements. Existing models have the following problems: (1) limited entity recognition: entity recognition, especially the document-level multi-entity problem, is difficult to perform effectively; (2) difficult relation extraction: constructing a knowledge graph requires a large amount of relation extraction work, and document-level relation extraction remains difficult for the prior art; (3) limited knowledge representation: existing knowledge graphs widely use triples (entity, relation, attribute) to represent knowledge, but this approach struggles with document-level multi-fact and topic classification tasks. Therefore, the existing knowledge graph models need to be improved to meet the requirements of the power grid field.
Disclosure of Invention
Aiming at the problem that existing models cannot meet the practical requirements of processing power grid related files, the application provides a method, a device and a medium for constructing a knowledge graph based on power grid related files, which can effectively recognize multiple entities at the document level, extract relations between entities across sentences, and clearly determine the topic classification of different entities and relations.
The application adopts the following technical scheme: the method for constructing the knowledge graph based on the power grid related file comprises the following steps:
step 1, collecting power grid related files to obtain an original corpus, preprocessing the text of the original corpus to obtain a preprocessed corpus, and predefining an entity set and a relation set covering the names appearing in the corpus; annotating the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers;
step 2, based on the corpus with marked entity answers, performing multi-entity recognition using the Rigel-Intersect model;
step 3, based on the multi-entity recognition result, extracting relations through a seq2seq model with entity hints;
step 4, constructing an LDA-BERT based extraction model, and extracting semantic topics from the multi-entity recognition and relation extraction results;
and 5, visualizing the result.
Further, the specific process of step 1 is as follows:
step 1.1, collecting related files about the power industry to form an original corpus;
and 1.2, removing non-text content from the original corpus, performing Chinese word segmentation on the texts with the jieba segmenter, tagging the parts of speech of the words according to the segmentation results, and removing common Chinese stop words from the texts to obtain the preprocessed corpus; the non-text content includes extra spaces and punctuation marks;
and 1.3, predefining an entity set and a relation set, and marking data of the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers.
Further, the specific process of the step 2 is as follows:
step 2.1, constructing a Rigel-Baseline model to perform single entity identification;
the Rigel-Baseline model comprises an encoder and a decoder, wherein the encoder encodes the question and the decoder returns a probability distribution over the relations in the micro knowledge graph;
2.2, constructing the Rigel-Intersect model by extending the Rigel-Baseline model with a differentiable intersection operation, and performing multi-entity recognition;
step 2.2.1, identifying a shared element between two entities by constructing an intersection of two vectors;
step 2.2.2, using an encoder to encode the question text into a question embedding that serves as the input of the model's RoBERTa encoder; for each question entity, concatenating the question text with the entity mention or canonical name, separated by a separator tag; using the embedding at the separator tag index as the entity-specific representation of the question;
step 2.2.3, obtaining a multi-entity identification result in a sentence:
(a) Performing parallel prediction reasoning over each entity-specific question embedding using a decoder, and obtaining intermediate answers from the entities and relations in the micro knowledge graph;
(b) Weighting the entities in each vector according to the attention score;
(c) Intersecting the two intermediate answers to obtain a final answer as an answer to the question; calculating a loss according to the difference between the final answer and the marked entity answer, so that the encoder and the decoder adjust parameters according to the loss result;
this step provides a document-level entity recognition scheme: it implements a new intersection operation to explicitly handle the multi-entity problem, identifies the shared elements between two entities, and realizes end-to-end question answering, showing that introducing the intersection improves performance on both simple and complex multi-entity questions.
Further, the specific process of step 2.1 is as follows:
step 2.1.1, defining an entity relation set expression;
step 2.1.2, constructing a set of triples existing in all the preprocessing corpuses according to the entity relation set;
step 2.1.3, constructing a single entity identification model based on matrix operation:
s2.1.3.1, given the entity vector expression:
s2.1.3.2, given the problem embedding and the relationship vector, calculating an entity vector; the question entity and the prediction relation are tracked in the micro knowledge graph to return a prediction answer;
s2.1.3.3, calculating a jump attention score, associating the relation vector with the entity vector, and updating the entity vector;
s2.1.3.4, the single entity recognition model parameters are updated by constructing a loss function.
Further, the specific process of the step 3 is as follows:
step 3.1, linearizing the source text into character strings based on the result of step 2;
step 3.2, adding entity hints: when an entity appears in the input sentence, the entity and its identifying name are appended to the source text, with the end of the entity hint delimited by a special tag;
and 3.3, constructing a sequence-to-sequence model based on the character string and the entity hint.
Further, the specific process of step 3.1 is as follows:
step 3.1.1, limiting target vocabulary: limiting the target vocabulary to a set of special tags required to model entities and relationships to prevent the model from generating entity references that do not appear in the source text;
step 3.1.2, copy mechanism: all other tokens are copied from the input using a copy mechanism, which extends the target vocabulary with the tokens in the source sequence and allows the model to copy these tokens into the output sequence; the embeddings of the special tags are randomly initialized and learned together with the other model parameters;
step 3.1.3, constrained decoding: a constraint is applied to the decoder by setting the prediction probability of invalid outputs to a very small value; during testing, constraints are applied to the decoder to reduce the likelihood of generating grammatically invalid target strings (strings that do not follow the linearization schema);
step 3.1.4, relation ordering: the relationships between the target strings are ordered according to their order of appearance in the source text, providing a consistent decoding order for the model.
Further, the specific process of step 4 is as follows:
step 4.1, feature extraction: extracting the result of the step 3 through an N-gram algorithm to obtain words, and creating sentence embedding based on an SBERT model;
step 4.2, creating a dictionary: mapping the words extracted in step 4.1 to IDs to generate a document list, and converting the document list into a word matrix as the input of the LDA model;
step 4.3, topic modeling: vectorizing the word matrix, converting the sentence-level operations of the SBERT model obtained in step 4.1 into document-level operations, and concatenating the SBERT embedding of each document with the corresponding row of the word matrix to create a new document representation.
Further, the specific process of step 5 is as follows: the result of step 4 is stored in an Excel table and visualized using knowledge graph construction software.
The device for constructing the knowledge graph based on the power grid related file comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the method for constructing the knowledge graph based on the power grid related file when executing the executable codes.
A computer readable storage medium having stored thereon a program which, when executed by a processor, implements the above-described method of constructing a knowledge graph based on grid-related documents.
The beneficial effects of the application are as follows: the method, device and medium for constructing a knowledge graph based on power grid related files use an intersection operation to explicitly handle the multi-entity problem, solving the limited entity recognition of existing models; by incorporating entity hints, they overcome the difficulty existing models have with document-level relation extraction; and they use the LDA-BERT hybrid model to extract and divide semantic topics. The application extracts and divides semantic topics on top of document-level entity recognition and relation extraction, thereby improving the information carrying capacity and retrieval efficiency of the knowledge graph, making the knowledge graph suitable for document-level file processing and for the requirements of power grid file processing.
Drawings
FIG. 1 is a flow chart of a method of constructing a knowledge graph based on grid related documents;
FIG. 2 is a flow chart of step 2;
FIG. 3 is a schematic diagram of the relation extraction of the sequence-to-sequence model;
fig. 4 is a knowledge graph global view based on a power grid file.
Detailed Description
The technical solutions of the embodiments of the present application are explained and illustrated below with reference to the drawings; the following embodiments are only preferred embodiments of the present application, not all of them. Based on the embodiments herein, all other embodiments obtained by a person skilled in the art without creative effort fall within the protection scope of the present application.
Example 1
The embodiment is a method for constructing a knowledge graph based on a power grid related file, a flowchart is shown in fig. 1, and the embodiment adopts Python software to construct a model, and the method comprises the following steps:
step 1, collecting power grid files to obtain an original corpus, preprocessing the text of the original corpus to obtain a preprocessed corpus, and predefining an entity set and a relation set covering the names appearing in the corpus; annotating the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers; the specific process is as follows:
step 1.1, collecting related files related to the power industry, including related files of a main power company, a branch power company, each department of the power industry and the like, to form an original corpus;
and 1.2, removing non-text content from the original corpus, performing Chinese word segmentation on the texts with the jieba segmenter, tagging the parts of speech of the words according to the segmentation results, and removing common Chinese stop words from the texts to obtain the preprocessed corpus; the non-text content includes extra spaces and punctuation marks;
and 1.3, predefining an entity set and a relation set, and marking the preprocessing corpus based on the predefined entity set to obtain the corpus with marked entity answers.
Step 2, as shown in fig. 2, based on a corpus with marked entity answers, performing multi-entity recognition by using a Rigel-Intersect model;
step 2.1, constructing a Rigel-Baseline model to perform single entity identification;
the Rigel-Baseline model comprises an encoder and a decoder, wherein the encoder encodes the question and the decoder returns a probability distribution over the relations in the micro knowledge graph; the specific process is as follows:
step 2.1.1, defining the entity relation set expression:

K ⊆ E × R × E (1);

wherein K is the entity relation (triple) set, E is the entity set, R is the relation set, and a triple (s, p, o) ∈ K indicates that the relation p holds between the subject entity s and the object entity o.
Step 2.1.2, constructing the set T = {t_k = (s_k, p_k, o_k)} of all triples present in the preprocessed corpus according to the entity relation set, and representing the triple set in an index-based subject matrix M_s, relation matrix M_p and object matrix M_o; here T is the triple set, t_k is the k-th element of T, and N_T is the total number of triples, i.e. the number of triples in T. Specifically:

M_s ∈ {0,1}^(N_T × N_E), M_s(k, j) = 1 if s_k = e_j, else 0 (2);

wherein M_s is the subject matrix, a binary N_T × N_E matrix relating the subject entity of each triple to the entities in the entity set E; N_E is the total number of entities in E; e_j is the j-th element of E, with j ranging from 1 to N_E; M_s(k, j) = 1 if the subject entity s_k of the k-th triple equals the j-th entity e_j, and 0 otherwise.

M_p ∈ {0,1}^(N_T × N_R), M_p(k, l) = 1 if p_k = p_l, else 0 (3);

wherein M_p is the relation matrix, a binary N_T × N_R matrix relating the relation of each triple to the relations in the relation set R; N_R is the total number of relations in R; p_l is the l-th element of R, with l ranging from 1 to N_R; M_p(k, l) = 1 if the relation p_k of the k-th triple equals the l-th relation p_l, and 0 otherwise.

M_o ∈ {0,1}^(N_T × N_E), M_o(k, j) = 1 if o_k = e_j, else 0 (4);

wherein M_o is the object matrix, a binary N_T × N_E matrix relating the object entity of each triple to the entities in E; M_o(k, j) = 1 if the object entity o_k of the k-th triple equals the j-th entity e_j, and 0 otherwise.
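The matrix construction of equations (2)-(4) can be sketched as follows; this is a minimal illustration, not the patent's implementation, and the entity and relation names are purely hypothetical:

```python
# Sketch of equations (2)-(4): building the binary subject/relation/object
# indicator matrices M_s, M_p, M_o from a toy triple set.
def build_indicator_matrices(triples, entities, relations):
    """Return (M_s, M_p, M_o) as nested lists of 0/1 indicators."""
    n_t = len(triples)
    m_s = [[0] * len(entities) for _ in range(n_t)]
    m_p = [[0] * len(relations) for _ in range(n_t)]
    m_o = [[0] * len(entities) for _ in range(n_t)]
    for k, (s, p, o) in enumerate(triples):
        m_s[k][entities.index(s)] = 1   # M_s(k, j) = 1 iff s_k == e_j
        m_p[k][relations.index(p)] = 1  # M_p(k, l) = 1 iff p_k == p_l
        m_o[k][entities.index(o)] = 1   # M_o(k, j) = 1 iff o_k == e_j
    return m_s, m_p, m_o

# Hypothetical mini-corpus of grid-document triples.
entities = ["dispatch_center", "substation_A", "maintenance_dept"]
relations = ["manages", "notifies"]
triples = [("dispatch_center", "manages", "substation_A"),
           ("dispatch_center", "notifies", "maintenance_dept")]
M_s, M_p, M_o = build_indicator_matrices(triples, entities, relations)
```

Each row of the three matrices corresponds to one triple, so a single matrix product can look up all triples touching a given entity or relation at once.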
Step 2.1.3, constructing a single entity recognition model based on matrix operations:

RoBERTa is used as the encoder and a layered decoder W_dec as the decoder, with an attention mechanism, giving a unified approach across data sets. The idea is as follows: the question entity and the predicted relations are traced in the micro knowledge graph to obtain a predicted answer; the predicted answer is compared with the marked entity answer, and the loss value is used to update the model. The specific process is as follows:

s2.1.3.1, given the entity vector expression:

In the encoder, given the entity vector x_{t-1} at time step t-1 and the relation vector r_t, the entity vector x_t is obtained by:

x_t = f(x_{t-1}, r_t) = M_o^T ((M_s x_{t-1}) ⊙ (M_p r_t)) (5);

wherein f(·) is the update function; x_t is the entity vector at time step t, an N_E-dimensional vector, with N_E the number of entities in the entity set; r_t is the relation vector at time step t, an N_R-dimensional vector, with N_R the number of relations in the relation set; M_s is the subject matrix, M_p the relation matrix, M_o the object matrix, and M_o^T the transpose of M_o; ⊙ denotes element-wise multiplication.
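The single update step of equation (5) can be sketched with plain Python lists; a real implementation would use differentiable tensors, and the toy graph below is an assumption for illustration only:

```python
# Sketch of equation (5): x_t = M_o^T ((M_s x_{t-1}) ⊙ (M_p r_t)).
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def rigel_step(m_s, m_p, m_o, x_prev, r_t):
    lhs = matvec(m_s, x_prev)                  # triples whose subject matches x_{t-1}
    rhs = matvec(m_p, r_t)                     # triples whose relation matches r_t
    gated = [a * b for a, b in zip(lhs, rhs)]  # element-wise product over triples
    m_o_t = list(map(list, zip(*m_o)))         # M_o^T
    return matvec(m_o_t, gated)                # scatter back onto object entities

# Toy micro knowledge graph: one triple (e0, r0, e1) over 2 entities, 1 relation.
m_s, m_p, m_o = [[1, 0]], [[1]], [[0, 1]]
x = rigel_step(m_s, m_p, m_o, x_prev=[1.0, 0.0], r_t=[1.0])
# Following r0 from e0 lands the probability mass on e1.
```

Because every operation is a matrix product or an element-wise product, the whole hop is differentiable and can be trained end to end.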
s2.1.3.2, given the question embedding h_q and the relation vectors, the entity vector x_t is calculated according to equations (5) and (6):

r_t = softmax(W_t^dec [h_q, r_{t-1}, r_{t-2}, …, r_1]^T) (6);

wherein softmax(·) is the activation function; W_t^dec is the layered decoder; h_q is the question embedding; r_{t-1}, r_{t-2}, …, r_1 are the historical relation vectors; and [·]^T is the transpose of the column vector formed by concatenating the question embedding with the historical relation vectors;

the question entity and the predicted relations are traced in the micro knowledge graph to return a predicted answer.
s2.1.3.3, calculating the hop attention scores, associating the relation vectors with the entity vector, and updating the entity vector:

c_t = W_t^att [h_q, r_{t-1}, …, r_1]^T (8);

a = softmax([c_1, c_2, …, c_{T_h}]) (9);

wherein c_t is the attention score vector calculated at time step t, each element of which is an attention score between the question and a historical relation vector; W_t^att is the weight matrix; a is the attention weight vector; c_1, …, c_{T_h} are the attention score vectors calculated at the different time steps; and T_h is the maximum time step of the historical relation vectors.

The attention score vector is calculated by the model's attention mechanism and measures the relevance between the question embedding and the historical relation vectors at the current time step. The attention computation takes the question embedding and the input relation vectors; the decoder output is activated with softmax to obtain the new relation vector r_t, and the entity vector is then updated from the new r_t and the current entity vector x_{t-1} by equation (5).
s2.1.3.4, updating the single entity recognition model parameters by constructing a loss function:

Cross entropy is used as the loss function:

ŷ = Σ_{t=1}^{T_h} a_t x_t (10);

L(y, ŷ) = −Σ_{i=1}^{N_E} y_i log ŷ_i (11);

wherein L is the loss function, measuring the performance of the model; y is the marked entity answer; ŷ is the final estimate, i.e. the model's predicted entity answer for the given question and relations; a_t is the attention weight calculated at time step t; x_t is the entity vector calculated at time step t, representing entity information in the knowledge graph; y_i is the true value (0 or 1) of the i-th element of y, with i ranging from 1 to N_E; and ŷ_i is the i-th predicted element, the model's estimate for the i-th entity.
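A minimal numeric sketch of equations (10) and (11), with made-up attention weights and entity vectors rather than values from the patent:

```python
# Sketch of equations (10)-(11): the final estimate as the attention-weighted
# sum of per-step entity vectors, scored against the marked answer with
# cross entropy.
import math

def final_estimate(attn, entity_vecs):
    """y_hat = sum_t a_t * x_t, over N_E entity slots."""
    n_e = len(entity_vecs[0])
    return [sum(a * x[i] for a, x in zip(attn, entity_vecs)) for i in range(n_e)]

def cross_entropy(y_true, y_hat, eps=1e-12):
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_hat))

attn = [0.2, 0.8]                       # a_t over two reasoning hops (illustrative)
entity_vecs = [[0.9, 0.1], [0.1, 0.9]]  # x_t over N_E = 2 entities (illustrative)
y_hat = final_estimate(attn, entity_vecs)
loss = cross_entropy([0, 1], y_hat)     # marked answer is entity 1
```

The loss is low when the attention-weighted estimate concentrates mass on the marked entity, which is exactly what drives the encoder and decoder updates described above.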
2.2, constructing the Rigel-Intersect model by extending the Rigel-Baseline model with a differentiable intersection operation, and performing multi-entity recognition;

step 2.2.1, identifying the shared elements between two entities by constructing the intersection of two vectors:

Let the two entities in a sentence be represented by vectors a and b; the intersection of a and b is defined as the element-wise minimum min(a_k, b_k), which associates the two entities:

min_elem(a, b)_k = min(a_k, b_k) (12);

wherein min_elem(·, ·) is the intersection expression and k indexes the elements of a and b; min(a_k, b_k) returns a_k when a_k < b_k and b_k when b_k < a_k, so any element that is non-zero in both vectors returns a non-zero value, while an element that appears in only one vector, or in neither, returns 0.
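The intersection of equation (12) can be sketched in a few lines; the answer distributions below are hypothetical:

```python
# Sketch of equation (12): the element-wise minimum of two soft entity-answer
# vectors keeps only the entities supported by both question entities.
def min_elem(a, b):
    return [min(ak, bk) for ak, bk in zip(a, b)]

# Toy answer distributions over 4 candidate entities for two question entities.
ans_a = [0.0, 0.7, 0.3, 0.0]
ans_b = [0.2, 0.6, 0.0, 0.0]
shared = min_elem(ans_a, ans_b)  # only entity 1 is non-zero in both
```

Because min is applied element-wise to soft scores, the operation stays (sub)differentiable, which is what allows it to be trained inside the end-to-end model.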
Step 2.2.2, constructing the question embeddings:

Using the encoder f_q, the question text q is encoded to form the question embedding h_q, with RoBERTa as the model encoder; for each question entity, the question text q and the entity mention or canonical name m are concatenated, separated by the separator tag [SEP]; the embedding at the separator tag index i_sep is used as the entity-specific representation of the question, i.e. the entity-specific question embedding:

h_q^m = f_q(q [SEP] m)[i_sep] (13);

wherein q is the question text and m is the entity mention or canonical name.
Step 2.2.3, obtaining a multi-entity identification result in a sentence:
(a) Performing parallel predictive reasoning over each entity-specific question embedding using the decoder, obtaining intermediate answers from the entities and relations in the micro knowledge graph;
(b) Weighting the entities in each vector according to the attention score;
(c) Intersecting the two intermediate answers according to formula (12) to obtain a final answer as an answer to the question; and calculating the loss according to the difference between the final answer and the marked entity answer, and training the whole model so that the encoder and the decoder adjust parameters according to the loss result.
This step provides a document-level entity recognition scheme: it implements a new intersection operation to explicitly handle the multi-entity problem, identifies the shared elements between two entities, and realizes end-to-end question answering, showing that introducing the intersection improves performance on both simple and complex multi-entity questions.
Step 3, based on the multi-entity recognition result, extracting the relationship through a seq2seq model based on entity implications; the specific process is as follows:
step 3.1, linearizing the source text into character strings based on the result of the step 2, wherein the specific process is as follows:
step 3.1.1, limiting target vocabulary: limiting the target vocabulary to a set of special tags required to model entities and relationships to prevent the model from generating entity references that do not appear in the source text;
step 3.1.2, copy mechanism: all other tokens are copied from the input using a copy mechanism, which extends the target vocabulary with the tokens in the source sequence X and allows the model to copy these tokens into the output sequence Y; the embeddings of the special tags are randomly initialized and learned together with the other model parameters;
step 3.1.3, constrained decoding: a constraint is applied to the decoder by setting the prediction probability of invalid outputs to a very small value at each decoding time step; during testing, constraints are applied to the decoder to reduce the likelihood of generating grammatically invalid target strings (strings that do not follow the linearization schema);
step 3.1.4, relation ordering: the relationships between the target strings are ordered according to their order of appearance in the source text, providing a consistent decoding order for the model.
The position of a relation is determined by the first mention of its head entity; when relations share a head, the ordering falls back to the last-mentioned tail entity (and so on for n-ary relations). The relations extracted from a given document are inherently unordered, while the sequence cross entropy loss is permutation-sensitive with respect to the predicted tokens. During training this can force an unnecessary decoding order, and the model may overfit frequent token combinations in the training set. To alleviate this, the relations are ordered.
Given the input text X, each relation in the corresponding target string Y begins with its constituent entities. Co-referent mentions are separated by semicolons (;), and each entity is terminated by a special tag indicating its type (e.g. @department@). Likewise, each relation is terminated by a special tag indicating its type (e.g. @notify@); two or more entities may precede the special relation tag to support n-ary extraction. Entities can be ordered according to whether they act as the head or the tail of a relation. For each document, the target string may contain multiple relations. Entities may be nested in the input text or may be discontinuous.
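The linearization format above can be sketched as a small string-building routine; the tag names (@department@, @notify@, @end@) follow the examples in the text, while the sample mentions are hypothetical:

```python
# Sketch of the relation linearization: each relation is emitted as its
# entities (co-referent mentions joined by ';', each entity closed by a type
# tag) followed by a relation type tag, ending with @end@.
def linearize(relations):
    """relations: list of (args, rel_type); args: list of (mentions, ent_type)."""
    parts = []
    for args, rel_type in relations:
        for mentions, ent_type in args:
            parts.append("; ".join(mentions))   # co-referent mentions split by ';'
            parts.append(f"@{ent_type}@")       # entity terminated by its type tag
        parts.append(f"@{rel_type}@")           # relation terminated by its type tag
    parts.append("@end@")                       # end-of-sequence tag
    return " ".join(parts)

rels = [([(["dispatch center", "the center"], "department"),
          (["substation A"], "facility")], "notify")]
target = linearize(rels)
```

Every token in the output either exists in the source text (handled by the copy mechanism) or belongs to the small fixed set of special tags, which is exactly the restricted target vocabulary of step 3.1.1.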
Step 3.2, adding entity hints: when an entity appears in the input sentence, the entity and its identifying name are appended to the source text, and the end of the entity hint is delimited by the special tag @sep@.
Step 3.3, constructing a sequence to sequence model based on the character string and the entity hint:
step 3.3.1, modeling the conditional probability:

p(Y | X) = Π_{z=1}^{T} p(y_z | y_{<z}, X) (14);

wherein p(Y | X) is the probability of generating the target sequence Y given the source sequence X; y_z is the element of Y at decoding time step z; X is the sequence of length S representing the original text data to be analyzed, i.e. the input text with entity and special tags; Y is the sequence of length T linearizing the relations contained in the source; p(y_z | y_{<z}, X) is the conditional probability of generating the z-th element y_z at decoding time step z given X and the partial target sequence y_{<z} generated so far; multiplying the conditional probabilities over all decoding time steps from 1 to the maximum T yields the joint probability of the whole sequence.
Step 3.3.2, optimizing the sequence cross entropy loss to maximize the log likelihood of the training data:

L(θ) = −Σ_{z=1}^{T} log p_θ(y_z | y_{<z}, X) (15);

wherein L(θ) is the loss function, and the training objective is to minimize it; θ are the model parameters, learned so as to maximize the probability of generating the target sequence; log p_θ(y_z | y_{<z}, X) is the log conditional probability of generating the z-th element y_z at decoding time step z given the source sequence X, the partial target sequence y_{<z} generated so far, and the model parameters θ; summing the negative log conditional probabilities over the decoding time steps from 1 to the maximum T gives the model loss, whose minimization maximizes the log likelihood of the training data.
As shown in fig. 3, each token in the input sentence is mapped to a contextual embedding by the encoder; the autoregressive decoder then generates the output, processing the encoder output one time step at a time, until the end-of-sequence mark (@end@) is generated or the maximum number of tokens is reached.
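The decoding loop can be sketched as follows; `step_fn` is a hypothetical stand-in for the trained decoder, which would in practice score the full target vocabulary at each step:

```python
# Sketch of the autoregressive decoding loop. step_fn stands in for the real
# decoder (an assumption): given the partial output, it returns the next token.
END_TOKEN = "@end@"

def greedy_decode(step_fn, max_len):
    """Emit tokens one per time step until the end-of-sequence mark @end@
    is produced or max_len tokens have been generated."""
    output = []
    for _ in range(max_len):
        token = step_fn(output)
        if token == END_TOKEN:
            break
        output.append(token)
    return output
```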
Step 4, constructing an extraction model based on LDA-BERE, and extracting a semantic topic according to the multi-entity identification result and the relation extraction result; the specific process is as follows:
step 4.1, feature extraction: extracting words from the entity-relation-entity triplet set obtained in step 3 through an N-gram algorithm, and creating sentence embeddings based on an SBERT model, specifically as follows:
The entity-relation-entity triplet set obtained in step 3 is processed with the N-gram algorithm to obtain words, including unigrams (Unigram), bigrams (Bi-gram) and trigrams (Tri-gram); through SBERT, two parallel BERT networks with identical weights are applied to two sentences; the tokens of each sentence are embedded, the data is compressed by average pooling, and a similarity score is then generated to create the sentence embedding; when the similarity score is generated to create the sentence embedding, the sentences may either be tagged directly in the original sentence and embedded correspondingly after conversion to vectors, or collected into a database and embedded independently after conversion.
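The N-gram feature extraction of step 4.1 can be sketched as follows; the token list is an invented example:

```python
def ngrams(tokens, n):
    """All n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["grid", "fault", "report"]
# Unigram, Bi-gram and Tri-gram features, as in step 4.1:
features = ngrams(tokens, 1) + ngrams(tokens, 2) + ngrams(tokens, 3)
# features == ['grid', 'fault', 'report', 'grid fault', 'fault report',
#              'grid fault report']
```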
Step 4.2, creating a dictionary: the words extracted in step 4.1 are mapped to IDs and passed to a corpora.Dictionary() object to generate a document list, and the document list is passed to Dictionary.doc2bow() to generate a word matrix used as the input of the LDA model;
Each word is mapped to a distinct integer ID by converting the text data into a word list; since step 4.1 already extracted the unigrams, bigrams and trigrams with the N-gram algorithm, the text data has already been converted into words, and only the ID mapping is needed here; the words are then passed to a corpora.Dictionary() object; the document list (i.e. the tokenized word list) is used to generate a BoW corpus; the tokenized word list is provided to the Dictionary.doc2bow() object, which then creates the word matrix used as the input of the LDA model; each document corresponds to one of the sentences or pieces of text data used to generate the BoW corpus.
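A minimal pure-Python stand-in for the dictionary and BoW steps, mirroring the behavior of Gensim's `corpora.Dictionary` and `doc2bow` so the word-ID mapping is explicit; the example documents are invented:

```python
# Pure-Python stand-in for Gensim's corpora.Dictionary and Dictionary.doc2bow,
# written out so the word-ID mapping and BoW construction are explicit.
def build_dictionary(documents):
    """Map every distinct word across all documents to an integer ID."""
    token2id = {}
    for doc in documents:
        for word in doc:
            token2id.setdefault(word, len(token2id))
    return token2id

def doc2bow(token2id, doc):
    """Bag-of-words for one document: sorted (word_id, count) pairs."""
    counts = {}
    for word in doc:
        wid = token2id[word]
        counts[wid] = counts.get(wid, 0) + 1
    return sorted(counts.items())

docs = [["grid", "fault", "grid"], ["fault", "report"]]
dictionary = build_dictionary(docs)                  # {'grid': 0, 'fault': 1, 'report': 2}
bow_corpus = [doc2bow(dictionary, d) for d in docs]  # input to the LDA model
```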
Step 4.3, topic modeling: the word matrix is vectorized using Gensim corpora, the operation of the SBERT model obtained in step 4.1 on sentences is converted into an operation on documents, and the SBERT embedding of each document is concatenated with the word-matrix row corresponding to that document to create a new document representation.
The word matrix is vectorized using Gensim corpora, and the SBERT sentence-embedding operation obtained in step 4.1 is converted into an operation on documents, because the vectorized word matrix is effectively a document-word matrix in which each row represents a document and each column represents a word; since all documents are included, the corresponding similarity scores are concatenated according to the document distribution, and the SBERT embedding is produced by a bidirectional Transformer encoder whose activation function is ReLU and whose optimizer is Adam, i.e. the BERT network architecture described in step 4.1; the SBERT embedding of each document is concatenated with the word-matrix row corresponding to that document to create a new document representation, which contains both the word-frequency information and the semantic information of the document; this new document representation is taken as the input of the LDA model, which uses the composite representation to perform topic modeling and decompose the documents into topics.
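The concatenation of a document's word-frequency row with its SBERT embedding can be sketched as follows; the vectors are placeholders, not real SBERT output:

```python
import numpy as np

# Sketch of the composite document representation. The numbers are
# placeholders: bow_row would come from the document-word matrix and
# sbert_vec from the SBERT embedding of the same document.
bow_row = np.array([2.0, 1.0, 0.0])               # word-frequency features
sbert_vec = np.array([0.12, -0.54, 0.33, 0.08])   # semantic features

doc_repr = np.concatenate([bow_row, sbert_vec])   # new document representation
# doc_repr now carries both word-frequency and semantic information and can
# be fed to the LDA model as one row of the composite input.
```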
Step 5, visualizing the result; the specific process is as follows: the entity-relation-entity triplet subsets with topic divisions obtained in the above steps are stored in an Excel table and visualized with the knowledge graph construction software SmartKG. As shown in FIG. 4, the knowledge graph includes the topic divisions between texts, one-to-many relations between texts and text features, one-to-one relations between a text and a text feature, relational connections between texts, and relational connections between text features.
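The export step can be sketched as follows. The description stores the triples in an Excel table for SmartKG; CSV is used here so the sketch needs only the standard library, and the column names and example triples are invented:

```python
import csv

# Sketch of the export step. The description stores the triples in an Excel
# table for SmartKG; CSV is used here so the sketch needs only the standard
# library, and the column names and example triples are assumptions.
triples = [
    ("transformer T1", "located_in", "substation A", "grid equipment"),
    ("substation A", "supplies", "feeder F3", "grid operation"),
]
with open("triples.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["head entity", "relation", "tail entity", "topic"])
    writer.writerows(triples)
```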
Example 2
An apparatus for constructing a knowledge graph based on a grid-related file, comprising a memory and one or more processors, wherein the memory stores executable code, and wherein the one or more processors are configured to implement the method for constructing a knowledge graph based on a grid-related file according to embodiment 1 when executing the executable code.
Example 3
A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the method of constructing a knowledge-graph based on grid-related files described in embodiment 1.
While the application has been described with reference to specific embodiments, those skilled in the art will appreciate that the application is not limited to the embodiments shown in the drawings and described in the foregoing detailed description. Any modification that does not depart from the functional and structural principles of the present application is intended to fall within the scope of the appended claims.
Claims (10)
1. The method for constructing the knowledge graph based on the power grid related file is characterized by comprising the following steps of:
step 1, collecting power-grid-related files to obtain an original corpus, preprocessing the texts of the original corpus to obtain a preprocessed corpus, and simultaneously predefining an entity set and a relation set for the names collected in the corpus; annotating the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers;
step 2, based on the corpus with marked entity answers, performing multi-entity recognition by using a Rigel-Intersect model;
step 3, based on the multi-entity recognition result, extracting relations through a seq2seq model based on entity hints;
step 4, constructing an extraction model based on LDA-BERE, and extracting a semantic topic according to the multi-entity identification result and the relation extraction result;
and 5, visualizing the result.
2. The method for constructing a knowledge graph based on a power grid related file according to claim 1, wherein the specific process of step 1 is as follows:
step 1.1, collecting related files about the power industry to form an original corpus;
step 1.2, removing non-text content from the original corpus, performing Chinese word segmentation on the texts in the original corpus using jieba, tagging the parts of speech of the words according to the segmentation results, and removing common Chinese stop words from the texts to obtain the preprocessed corpus;
and 1.3, predefining an entity set and a relation set, and marking data of the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers.
3. The method for constructing a knowledge graph based on a power grid related file according to claim 2, wherein the specific process of step 2 is as follows:
step 2.1, constructing a Rigel-Baseline model to perform single entity identification;
the Rigel-Baseline model comprises an encoder and a decoder, wherein the encoder is used for encoding the question, and the decoder is used for returning a probability distribution over the relations on the differentiable knowledge graph;
step 2.2, constructing a Rigel-Intersect model by extending the Rigel-Baseline model with a differentiable intersection operation, and performing multi-entity recognition;
step 2.2.1, identifying the shared elements between two entities by constructing the intersection of two vectors;
step 2.2.2, encoding the question text with the encoder to form a question embedding, using RoBERTa as the model encoder; for each question entity, concatenating the question text with the entity mention or canonical name using separator tokens; the embedding at the separator-token index is used as the entity-specific representation of the question;
step 2.2.3, obtaining a multi-entity identification result in a sentence:
(a) Using the decoder to perform predictive inference in parallel on each entity-specific question embedding, and obtaining intermediate answers from the entities and relations in the differentiable knowledge graph;
(b) Weighting the entities according to the attention scores;
(c) Intersecting the two intermediate answers to obtain the final answer to the question; the loss is calculated from the difference between the final answer and the marked entity answer, so that the encoder and decoder adjust their parameters according to the loss.
4. The method for constructing a knowledge graph based on a power grid related file according to claim 3, wherein the specific process of step 2.1 is as follows:
step 2.1.1, defining an entity relation set expression;
step 2.1.2, constructing a set of triples existing in all the preprocessing corpuses according to the entity relation set; and representing the triplet set in a subject matrix, a relationship matrix and an object matrix of a given index;
step 2.1.3, constructing a single entity identification model based on matrix operation:
s2.1.3.1, given an entity vector expression;
s2.1.3.2, given the question embedding and the relation vector, calculating the entity vector; the question entity and the predicted relation are followed in the differentiable knowledge graph to return a predicted answer;
s2.1.3.3, calculating a hop attention score, associating the relation vector with the entity vector, and updating the entity vector;
s2.1.3.4, the single entity recognition model parameters are updated by constructing a loss function.
5. The method for constructing a knowledge graph based on a power grid related file according to claim 3, wherein the specific process of step 3 is as follows:
step 3.1, linearizing the source text into character strings based on the result of step 2;
step 3.2, adding entity hints: when an entity appears in the input sentence, the entity and its canonical name are appended to the source text, and the end of each entity hint is delimited by a special mark;
and 3.3, constructing a sequence-to-sequence model based on the character string and the entity hint.
6. The method for constructing a knowledge graph based on a power grid related file according to claim 5, wherein the specific process of step 3.1 is as follows:
step 3.1.1, limiting target vocabulary;
step 3.1.2, replication mechanism: all other markers are copied from the input using a copy mechanism;
step 3.1.3, constrained decoding: the decoder is constrained by setting the prediction probability of wrong answers to a very small value;
step 3.1.4, relation ordering: and ordering the relations according to the appearance sequence of the target character strings in the source text, and providing a consistent decoding sequence for the model.
7. The method for constructing a knowledge graph based on a power grid related file according to claim 5, wherein the specific process of step 4 is as follows:
step 4.1, feature extraction: extracting the result of the step 3 through an N-gram algorithm to obtain words, and creating sentence embedding based on an SBERT model;
step 4.2, creating a dictionary: the words extracted in step 4.1 are mapped to IDs to generate a document list, and the document list is converted into a word matrix used as the input of the LDA model;
step 4.3, theme modeling: vectorizing the word matrix, converting the operation of the SBERT model obtained in the step 4.1 on sentences into the operation of the documents, embedding the SBERT of each document into the word matrix row corresponding to the document, and splicing the word matrix rows to create a new document representation.
8. The method for constructing a knowledge graph based on a power grid related file according to claim 1, wherein the specific process of step 5 is as follows: and (3) storing the result in the step (4) in an excel table, and visualizing by using knowledge graph construction software.
9. An apparatus for constructing a grid-related file-based knowledge graph, comprising a memory and one or more processors, the memory having executable code stored therein, the one or more processors, when executing the executable code, being configured to implement the method for constructing a grid-related file-based knowledge graph of any one of claims 1 to 8.
10. A computer-readable storage medium, on which a program is stored, which program, when being executed by a processor, implements the method of constructing a knowledge-graph based on grid-related files according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311466119.3A CN117194682B (en) | 2023-11-07 | 2023-11-07 | Method, device and medium for constructing knowledge graph based on power grid related file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117194682A true CN117194682A (en) | 2023-12-08 |
CN117194682B CN117194682B (en) | 2024-03-01 |
Family
ID=88994676
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311466119.3A Active CN117194682B (en) | 2023-11-07 | 2023-11-07 | Method, device and medium for constructing knowledge graph based on power grid related file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117194682B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117786092A (en) * | 2024-02-27 | 2024-03-29 | 成都晓多科技有限公司 | Commodity comment key phrase extraction method and system |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110046260A (en) * | 2019-04-16 | 2019-07-23 | 广州大学 | A kind of darknet topic discovery method and system of knowledge based map |
CN111930784A (en) * | 2020-07-23 | 2020-11-13 | 南京南瑞信息通信科技有限公司 | Power grid knowledge graph construction method and system |
CN113779219A (en) * | 2021-09-13 | 2021-12-10 | 内蒙古工业大学 | Question-answering method for embedding multiple knowledge maps by combining hyperbolic segmented knowledge of text |
CN114077673A (en) * | 2021-06-21 | 2022-02-22 | 南京邮电大学 | Knowledge graph construction method based on BTBC model |
US20220164683A1 (en) * | 2020-11-25 | 2022-05-26 | Fmr Llc | Generating a domain-specific knowledge graph from unstructured computer text |
CN115269857A (en) * | 2022-04-28 | 2022-11-01 | 东北林业大学 | Knowledge graph construction method and device based on document relation extraction |
CN116578717A (en) * | 2023-04-28 | 2023-08-11 | 国网安徽省电力有限公司信息通信分公司 | Multi-source heterogeneous knowledge graph construction method for electric power marketing scene |
CN116628219A (en) * | 2023-05-10 | 2023-08-22 | 浙江工业大学 | Question-answering method based on knowledge graph |
CN116739081A (en) * | 2023-05-25 | 2023-09-12 | 东北电力大学 | Knowledge acquisition and representation method oriented to power grid risk field |
Non-Patent Citations (4)
Title |
---|
AKASH GHOSH等: "Semantic Clone Detection: Can Source Code Comments Help?", 2018 IEEE SYMPOSIUM ON VISUAL LANGUAGES AND HUMAN-CENTRIC COMPUTING (VL/HCC) * |
SONG Weiqiong et al.: "Construction and Application of a Distribution Network Knowledge Graph Based on GCN", Electronic Design Engineering, vol. 30, no. 7 *
TAN Gang; CHEN Yu; PENG Yunzhu: "A Power Grid Customer Service Question-Answering System Integrating a Domain-Feature Knowledge Graph", Computer Engineering and Applications, vol. 56, no. 03 *
WEI Tao; WANG Jinhua: "Knowledge Graph Construction Based on Non-Taxonomic Relation Extraction", Industrial Technology Innovation, vol. 07, no. 02 *
Also Published As
Publication number | Publication date |
---|---|
CN117194682B (en) | 2024-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110851599B (en) | Automatic scoring method for Chinese composition and teaching assistance system | |
CN109753660B (en) | LSTM-based winning bid web page named entity extraction method | |
CN110851596A (en) | Text classification method and device and computer readable storage medium | |
CN112836046A (en) | Four-risk one-gold-field policy and regulation text entity identification method | |
CN112149421A (en) | Software programming field entity identification method based on BERT embedding | |
CN110569332B (en) | Sentence feature extraction processing method and device | |
CN110990555B (en) | End-to-end retrieval type dialogue method and system and computer equipment | |
CN117194682B (en) | Method, device and medium for constructing knowledge graph based on power grid related file | |
US11170169B2 (en) | System and method for language-independent contextual embedding | |
CN112966117A (en) | Entity linking method | |
CN114153978A (en) | Model training method, information extraction method, device, equipment and storage medium | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN112784602A (en) | News emotion entity extraction method based on remote supervision | |
CN116049422A (en) | Echinococcosis knowledge graph construction method based on combined extraction model and application thereof | |
Lefever et al. | Identifying cognates in English-Dutch and French-Dutch by means of orthographic information and cross-lingual word embeddings | |
CN117291265B (en) | Knowledge graph construction method based on text big data | |
Huang et al. | Disease named entity recognition by machine learning using semantic type of metathesaurus | |
Khadija et al. | Automating information retrieval from faculty guidelines: designing a PDF-driven chatbot powered by OpenAI ChatGPT | |
CN111666375A (en) | Matching method of text similarity, electronic equipment and computer readable medium | |
CN115270713A (en) | Method and system for constructing man-machine collaborative corpus | |
CN115203388A (en) | Machine reading understanding method and device, computer equipment and storage medium | |
CN114637852A (en) | Method, device and equipment for extracting entity relationship of medical text and storage medium | |
CN113901218A (en) | Inspection business basic rule extraction method and device | |
CN113254473A (en) | Method and device for acquiring weather service knowledge | |
CN112015891A (en) | Method and system for classifying messages of network inquiry platform based on deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||