CN117194682A - Method, device and medium for constructing knowledge graph based on power grid related file - Google Patents


Info

Publication number
CN117194682A
Authority
CN
China
Prior art keywords
entity
constructing
knowledge graph
model
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311466119.3A
Other languages
Chinese (zh)
Other versions
CN117194682B (en)
Inventor
王庆娟
胡若云
孙钢
楼斐
丁欣玮
陈千羿
陈志伟
宋宛净
沈艳阳
黎佳慧
蒋贝妮
庄琛
徐世予
汪一帆
王晓宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Original Assignee
State Grid Zhejiang Electric Power Co Ltd
Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Zhejiang Electric Power Co Ltd, Marketing Service Center of State Grid Zhejiang Electric Power Co Ltd filed Critical State Grid Zhejiang Electric Power Co Ltd
Priority to CN202311466119.3A priority Critical patent/CN117194682B/en
Publication of CN117194682A publication Critical patent/CN117194682A/en
Application granted granted Critical
Publication of CN117194682B publication Critical patent/CN117194682B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a method, a device and a medium for constructing a knowledge graph based on power grid related files, belonging to the field of language processing. To solve the problem that current models cannot meet the practical requirements of processing power grid related files, the following technical scheme is adopted: a method of constructing a knowledge graph based on grid related documents, comprising: obtaining a preprocessed corpus, and predefining an entity set and a relation set; performing multi-entity recognition based on the preprocessed corpus; extracting relations based on entity hints from the multi-entity recognition result; constructing an extraction model, and extracting semantic topics according to the multi-entity recognition result and the relation extraction result; and visualizing the result. A device and a medium for constructing the knowledge graph based on power grid related files are also provided. The application performs semantic topic extraction and division on top of document-level entity recognition and relation extraction, improves the information-carrying capacity and information retrieval efficiency of the knowledge graph, and makes the knowledge graph suitable for document-level file processing.

Description

Method, device and medium for constructing knowledge graph based on power grid related file
Technical Field
The application belongs to the technical field of language processing, and particularly relates to a method, a device and a medium for constructing a knowledge graph based on a power grid related file.
Background
A Knowledge Graph (KG) is a structured, semantic knowledge representation model that describes entities, concepts, relationships and attributes in the real world and represents the associations between them in graph form. Knowledge graphs have great potential in many application fields, providing information organization, information retrieval, knowledge management and sharing in a visual manner. The power industry has a long history and a large body of documents, and the power grid is an important basic industry of the national economy. Constructing a knowledge graph from power grid related files helps business personnel fully understand and apply the relevant texts, greatly improves working efficiency, and is representative of document-level knowledge graph construction in general. However, because the information in power grid related files is complex and wide-ranging, existing models cannot be used to construct a knowledge graph that meets actual requirements. Existing models have the following problems: (1) limited entity recognition: entity recognition, especially the document-level multi-entity problem, is difficult to perform effectively; (2) difficult relation extraction: constructing a knowledge graph requires a large amount of relation extraction work, and document-level relation extraction is difficult for the prior art; (3) limited knowledge representation: existing knowledge graphs widely use triples (entity, relation, attribute) to represent knowledge, but this approach struggles with document-level multi-fact and topic classification tasks. Therefore, the existing knowledge graph models need to be improved to meet the requirements of the power grid field.
Disclosure of Invention
Aiming at the problem that existing models cannot meet the practical requirements of processing power grid related files, the application provides a method, a device and a medium for constructing a knowledge graph based on power grid related files, which can effectively recognize multiple entities at the document level, extract cross-sentence relations between entities, and clearly determine the topic classification of different entities and relations.
The application adopts the following technical scheme: the method for constructing the knowledge graph based on the power grid related file comprises the following steps:
step 1, collecting power grid related files to obtain an original corpus, preprocessing the text of the original corpus to obtain a preprocessed corpus, and predefining an entity set and a relation set covering the names appearing in the corpus; annotating the preprocessed corpus based on the predefined entity set to obtain a corpus with labeled entity answers;
step 2, based on the corpus with labeled entity answers, performing multi-entity recognition by using a Rigel-Intersect model;
step 3, based on the multi-entity recognition result, extracting relations through a seq2seq model based on entity hints;
step 4, constructing an extraction model based on LDA-BERT, and extracting semantic topics according to the multi-entity recognition result and the relation extraction result;
and 5, visualizing the result.
Further, the specific process of step 1 is as follows:
step 1.1, collecting related files about the power industry to form an original corpus;
and 1.2, removing non-text content from the original corpus, performing Chinese word segmentation on the texts with the jieba segmenter, performing part-of-speech tagging on the words according to the segmentation results, and removing common Chinese stop words from the texts to obtain a preprocessed corpus; the non-text content includes extra spaces and punctuation marks;
and 1.3, predefining an entity set and a relation set, and marking data of the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers.
Further, the specific process of the step 2 is as follows:
step 2.1, constructing a Rigel-Baseline model to perform single entity identification;
the Rigel-Baseline model comprises an encoder and a decoder, wherein the encoder encodes the question and the decoder returns a probability distribution over the relations in the differentiable knowledge graph;
step 2.2, constructing a Rigel-Intersect model by extending the Rigel-Baseline model with a differentiable intersection operation, and performing multi-entity recognition;
step 2.2.1, identifying a shared element between two entities by constructing an intersection of two vectors;
step 2.2.2, encoding the question text with the encoder to form a question embedding, with RoBERTa as the model encoder; for each question entity, concatenating the question text with the entity mention or canonical name, separated by a separator token; using the embedding at the separator token index as the entity-specific representation of the question;
step 2.2.3, obtaining a multi-entity identification result in a sentence:
(a) using the decoder to run parallel predictive reasoning over each entity-specific question embedding, obtaining intermediate answers from the entities and relations in the differentiable knowledge graph;
(b) weighting the entities in each vector according to the attention scores;
(c) intersecting the two intermediate answers to obtain the final answer to the question; computing a loss from the difference between the final answer and the labeled entity answer, so that the encoder and decoder adjust their parameters according to the loss;
This step provides a document-level entity recognition scheme: it implements a new intersection operation to explicitly handle multi-entity questions, identifies the shared elements between two entities, and realizes end-to-end question answering, demonstrating that introducing the intersection operation improves performance on multi-entity questions and complex questions.
Further, the specific process of step 2.1 is as follows:
step 2.1.1, defining an entity relation set expression;
step 2.1.2, constructing the set of triples present in the whole preprocessed corpus according to the entity-relation set;
step 2.1.3, constructing a single entity identification model based on matrix operation:
s2.1.3.1, given the entity vector expression:
s2.1.3.2, given the question embedding and the relation vector, calculating the entity vector; the question entity and the predicted relations are followed in the differentiable knowledge graph to return a predicted answer;
s2.1.3.3, calculating hop attention scores, associating the relation vectors with the entity vectors, and updating the entity vector;
s2.1.3.4, updating the single entity recognition model parameters by constructing a loss function.
Further, the specific process of the step 3 is as follows:
step 3.1, linearizing the source text into character strings based on the result of step 2;
step 3.2, adding entity hints: when an entity appears in the input sentence, the entity and its identifying name are added to the source text, with the end of the entity hint delimited by a special token;
and 3.3, constructing a sequence-to-sequence model based on the character string and the entity hint.
Further, the specific process of step 3.1 is as follows:
step 3.1.1, limiting target vocabulary: limiting the target vocabulary to a set of special tags required to model entities and relationships to prevent the model from generating entity references that do not appear in the source text;
step 3.1.2, copy mechanism: all other tokens are copied from the input using a copy mechanism, which works by extending the target vocabulary with the tokens of the source sequence, allowing the model to copy these tokens into the output sequence; the embeddings of the special tokens are randomly initialized and learned together with the other parameters of the model;
step 3.1.3, constrained decoding: applying a constraint to the decoder by setting the predicted probability of invalid answers to a tiny value; during testing, constraints are applied to the decoder to reduce the likelihood of generating grammatically invalid target strings (strings that do not follow the linearization schema);
step 3.1.4, relation ordering: the relationships between the target strings are ordered according to their order of appearance in the source text, providing a consistent decoding order for the model.
Further, the specific process of step 4 is as follows:
step 4.1, feature extraction: extracting words from the result of step 3 with an N-gram algorithm, and creating sentence embeddings based on an SBERT model;
step 4.2, creating a dictionary: mapping the words extracted in step 4.1 to IDs to generate a document list, and converting the document list into a word matrix used as input to the LDA model;
step 4.3, topic modeling: vectorizing the word matrix, converting the operation of the SBERT model from step 4.1 on sentences into an operation on documents, and concatenating the SBERT embedding of each document with the document's word-matrix row to create a new document representation.
Further, the specific process of step 5 is as follows: the result of step 4 is stored in an Excel table and visualized with knowledge graph construction software.
The device for constructing the knowledge graph based on the power grid related file comprises a memory and one or more processors, wherein executable code is stored in the memory, and when the one or more processors execute the executable code, they implement the above method for constructing the knowledge graph based on the power grid related file.
A computer readable storage medium having stored thereon a program which, when executed by a processor, implements the above-described method of constructing a knowledge graph based on grid-related documents.
The beneficial effects of the application are as follows: the method, device and medium for constructing a knowledge graph based on power grid related files use the intersection operation to explicitly handle multi-entity questions, solving the limited entity recognition of existing models; by incorporating entity hints, the difficulty existing models have with document-level relation extraction can be overcome; and an LDA-BERT hybrid model is used to extract and divide semantic topics. The application extracts and divides semantic topics on top of document-level entity recognition and relation extraction, thereby improving the information-carrying capacity and information retrieval efficiency of the knowledge graph, making the knowledge graph suitable for document-level file processing and for the requirements of power grid file processing.
Drawings
FIG. 1 is a flow chart of a method of constructing a knowledge graph based on grid related documents;
FIG. 2 is a flow chart of step 2;
FIG. 3 is a schematic diagram of the relation extraction of the sequence-to-sequence model;
fig. 4 is a knowledge graph global view based on a power grid file.
Detailed Description
The technical solutions of the embodiments of the present application are explained and illustrated below with reference to the drawings, but the following embodiments are only preferred embodiments of the application, not all of them. Other examples obtained by those skilled in the art based on these embodiments without creative effort fall within the protection scope of the present application.
Example 1
The embodiment is a method for constructing a knowledge graph based on a power grid related file, a flowchart is shown in fig. 1, and the embodiment adopts Python software to construct a model, and the method comprises the following steps:
step 1, collecting power grid files to obtain an original corpus, preprocessing the text of the original corpus to obtain a preprocessed corpus, and predefining an entity set and a relation set covering the names appearing in the corpus; annotating the preprocessed corpus based on the predefined entity set to obtain a corpus with labeled entity answers; the specific process is as follows:
step 1.1, collecting related files about the power industry, including files of the main power company, branch power companies, the departments of the power industry and the like, to form an original corpus;
and 1.2, removing non-text content from the original corpus, performing Chinese word segmentation on the texts with the jieba segmenter, performing part-of-speech tagging on the words according to the segmentation results, and removing common Chinese stop words from the texts to obtain a preprocessed corpus; the non-text content includes extra spaces and punctuation marks;
and 1.3, predefining an entity set and a relation set, and marking the preprocessing corpus based on the predefined entity set to obtain the corpus with marked entity answers.
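The preprocessing and labeling of steps 1.2 and 1.3 can be sketched as follows; this is a minimal illustration in which the `segment` function is a stand-in for the jieba segmenter, and the stop-word list and entity set are invented examples, not the ones used in the application:

```python
import re

# Hypothetical stop words and entity set; the real predefined sets are larger.
STOP_WORDS = {"的", "了", "和", "在"}
ENTITY_SET = {"国家电网", "营销服务中心"}

def segment(text):
    # Stand-in for jieba segmentation; here we assume pre-inserted spaces.
    return text.split()

def preprocess(raw_text):
    # Step 1.2: strip non-text content (punctuation, extra spaces),
    # segment, and remove common stop words.
    cleaned = re.sub(r"[，。、；：！？,.;:!?]", " ", raw_text)
    tokens = [w for w in segment(cleaned) if w and w not in STOP_WORDS]
    # Step 1.3: label tokens that match the predefined entity set.
    labels = [(w, "ENT" if w in ENTITY_SET else "O") for w in tokens]
    return tokens, labels

tokens, labels = preprocess("国家电网 的 营销服务中心 发布 了 新 文件 。")
```

In practice the segmentation and part-of-speech tagging would come from jieba, and the labeled corpus would be stored for training the recognition model.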
Step 2, as shown in fig. 2, based on a corpus with marked entity answers, performing multi-entity recognition by using a Rigel-Intersect model;
step 2.1, constructing a Rigel-Baseline model to perform single entity identification;
the Rigel-Baseline model comprises an encoder and a decoder, wherein the encoder encodes the question and the decoder returns a probability distribution over the relations in the differentiable knowledge graph; the specific process is as follows:
step 2.1.1, defining the entity-relation set expression:

K = {(s, p, o) | s ∈ E, p ∈ R, o ∈ E}   (1)

where K is the entity-relation set, E is the entity set, R is the relation set, and (s, p, o) is a triple indicating that the relation p holds between the subject entity s and the object entity o.
Step 2.1.2, constructing the set T of all triples present in the preprocessed corpus according to the entity-relation set, and representing the triple set in index form with a subject matrix M_s, a relation matrix M_p and an object matrix M_o; here T is the triple set, t_k = (s_k, p_k, o_k) denotes the k-th element of T, and N_T is the total number of triples, i.e. the number of triples in T. Specifically:

M_s(k, j) = 1 if s_k = e_j, otherwise 0   (2)

where M_s is the subject matrix, an N_T × N_E binary matrix relating the subject entity of each triple to the entities in the entity set E; N_E is the total number of entities in E, i.e. the number of entities; e_j is the j-th element of E, with j ranging from 1 to N_E; M_s(k, j) indicates whether the subject entity s_k of the k-th triple equals the j-th entity e_j of the entity set: if so, M_s(k, j) = 1, otherwise 0.

M_p(k, l) = 1 if p_k = p_l, otherwise 0   (3)

where M_p is the relation matrix, an N_T × N_R binary matrix relating the relation of each triple to the relations in the relation set R; N_R is the total number of relations in R, i.e. the number of relations; p_l is the l-th element of R, with l ranging from 1 to N_R; M_p(k, l) indicates whether the relation p_k of the k-th triple equals the l-th relation p_l: if so, M_p(k, l) = 1, otherwise 0.

M_o(k, j) = 1 if o_k = e_j, otherwise 0   (4)

where M_o is the object matrix, an N_T × N_E binary matrix relating the object entity of each triple to the entities in E; M_o(k, j) indicates whether the object entity o_k of the k-th triple equals the j-th entity e_j: if so, M_o(k, j) = 1, otherwise 0.
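The one-hot matrices of equations (2)-(4) can be sketched in plain Python; the toy entities, relations and triples below are invented for illustration:

```python
# Toy entity set E, relation set R, and triple set T (invented for illustration).
E = ["变电站A", "供电公司B", "文件C"]
R = ["隶属于", "发布"]
T = [("变电站A", "隶属于", "供电公司B"),
     ("供电公司B", "发布", "文件C")]

def build_matrices(E, R, T):
    # Equations (2)-(4): an N_T x N_E subject matrix, an N_T x N_R relation
    # matrix and an N_T x N_E object matrix, each row one-hot for the
    # triple's subject, relation and object respectively.
    M_s = [[1 if s == e else 0 for e in E] for (s, p, o) in T]
    M_p = [[1 if p == r else 0 for r in R] for (s, p, o) in T]
    M_o = [[1 if o == e else 0 for e in E] for (s, p, o) in T]
    return M_s, M_p, M_o

M_s, M_p, M_o = build_matrices(E, R, T)
```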
Step 2.1.3, constructing a single entity identification model based on matrix operation:
RoBERTa is used as the encoder and a layered decoder W_t^dec as the decoder, with an attention mechanism, i.e. a unified approach across data sets. The idea is as follows: the question entity and the predicted relations are followed in the differentiable knowledge graph to obtain a predicted answer; the predicted answer is compared with the labeled entity answer, and the loss value is used to update the model. The specific process is as follows:
s2.1.3.1, the entity vector update expression:

in the encoder, given the entity vector x_{t-1} at time step t-1 and the relation vector r_t, the entity vector x_t is obtained as:

x_t = follow(x_{t-1}, r_t) = M_o^T((M_s x_{t-1}) ⊙ (M_p r_t))   (5)

where follow(·) denotes the update function; x_t is the entity vector at time step t and x_{t-1} the entity vector at time step t-1, an N_E-dimensional vector, N_E being the number of entities in the entity set; r_t is the relation vector at time step t, an N_R-dimensional vector, N_R being the number of relations in the relation set; M_s is the subject matrix; M_p is the relation matrix; M_o is the object matrix and M_o^T its transpose; ⊙ denotes element-wise multiplication.
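Equation (5) can be sketched in plain Python; the toy graph below (three entities, two relations, two triples forming a chain) is invented for illustration:

```python
def follow(x, r, M_s, M_p, M_o):
    # Equation (5): x_t = M_o^T((M_s x) ⊙ (M_p r)).
    n_t = len(M_s)                      # number of triples
    sub = [sum(M_s[k][j] * x[j] for j in range(len(x))) for k in range(n_t)]
    rel = [sum(M_p[k][l] * r[l] for l in range(len(r))) for k in range(n_t)]
    w = [sub[k] * rel[k] for k in range(n_t)]   # per-triple activation
    n_e = len(M_o[0])
    return [sum(M_o[k][j] * w[k] for k in range(n_t)) for j in range(n_e)]

# Toy graph (invented): entity 0 --relation 0--> entity 1 --relation 1--> entity 2.
M_s = [[1, 0, 0], [0, 1, 0]]
M_p = [[1, 0], [0, 1]]
M_o = [[0, 1, 0], [0, 0, 1]]
x0 = [1.0, 0.0, 0.0]                          # start at entity 0
x1 = follow(x0, [1.0, 0.0], M_s, M_p, M_o)    # follow relation 0
x2 = follow(x1, [0.0, 1.0], M_s, M_p, M_o)    # then follow relation 1
```

In the model the relation vectors r_t are soft distributions produced by the decoder rather than the one-hot vectors used here.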
s2.1.3.2, given the question embedding h_q and the history of relation vectors, the entity vector x_t is calculated according to equations (5) and (6):

r_t = softmax(W_t^dec [h_q; r_{t-1}; r_{t-2}; …; r_1])   (6)

where softmax(·) denotes the activation function; W_t^dec denotes the layered decoder; h_q denotes the question embedding; r_{t-1}, r_{t-2}, …, r_1 denote the history of relation vectors; [h_q; r_{t-1}; …; r_1] denotes the transposed column vector obtained by concatenating the question embedding with the history relation vectors;

the question entities and predicted relations are followed in the differentiable knowledge graph to return a predicted answer.
s2.1.3.3, calculating hop attention scores, associating the relation vectors with the entity vectors, and updating the entity vector:

c_t = W_t^att [h_q; r_{t-1}; …; r_1]   (8)

a = softmax([c_1, c_2, …, c_{T_h}])   (9)

where c_t denotes the attention score computed at time step t, each element corresponding to the attention score between the question and the historical relation vectors; W_t^att denotes the weight matrix; a denotes the attention weight vector; c_1, …, c_{T_h} denote the attention scores computed at the different time steps; T_h denotes the maximum time step of the historical relation vectors.

The attention scores are computed by the model's attention mechanism and measure the relevance between the question embedding and the historical relation vectors at the current time step. The attention computation takes the question embedding and the input relation vectors, is output by the decoder and then activated with softmax to obtain the new relation vector r_t; based on the new r_t and the entity vector x_{t-1} of the current time step, the entity vector is updated by equation (5).
s2.1.3.4, updating the single entity recognition model parameters by constructing a loss function, using cross entropy as the loss:

L = −Σ_{i=1}^{N_E} y_i log(ŷ_i)   (10)

ŷ = Σ_{t=1}^{T_h} a_t x_t   (11)

where L is the loss function measuring the model's performance; y denotes the labeled entity answer; ŷ is the final estimate, the model's predicted entity answer for the given question and relations; a_t is the attention weight computed at time step t; x_t is the entity vector computed at time step t, representing entity information in the knowledge graph; y_i denotes the true value (0 or 1) of the i-th element of y, with i ranging from 1 to N_E; ŷ_i is the model's prediction for the i-th element, i.e. its estimate for the i-th entity.
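Equations (10) and (11) can be sketched as follows, with invented attention weights and per-hop entity vectors:

```python
import math

def predict_answer(attention, entity_vectors):
    # Equation (11): ŷ = Σ_t a_t x_t, an attention-weighted mix of the
    # per-hop entity vectors.
    n_e = len(entity_vectors[0])
    return [sum(a * x[j] for a, x in zip(attention, entity_vectors))
            for j in range(n_e)]

def cross_entropy(y_true, y_pred, eps=1e-12):
    # Equation (10): L = -Σ_i y_i log(ŷ_i); eps guards against log(0).
    return -sum(y * math.log(p + eps) for y, p in zip(y_true, y_pred))

# Two hops over three entities; the attention favours hop 2 (invented values).
a = [0.2, 0.8]
xs = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_hat = predict_answer(a, xs)
loss = cross_entropy([0, 0, 1], y_hat)
```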
step 2.2, constructing a Rigel-Intersect model by extending the Rigel-Baseline model with a differentiable intersection operation, and performing multi-entity recognition;
step 2.2.1, identifying the shared elements between two entities by constructing the intersection of two vectors:

let the two entities in a sentence be represented by vectors a and b of the same length N; the intersection of a and b is defined as the element-wise minimum min(a_k, b_k), which associates the two entities:

min_elem(a, b) = (min(a_1, b_1), min(a_2, b_2), …, min(a_N, b_N))   (12)

where min_elem(·) is the intersection expression and a_k, b_k are the k-th elements of the vectors a and b; an element that appears (is non-zero) in both vectors returns a non-zero value, while an element that appears in only one vector, or in neither, returns 0.
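The intersection of equation (12) reduces to an element-wise minimum; a sketch with invented intermediate answer vectors:

```python
def min_elem(a, b):
    # Equation (12): element-wise minimum of two equal-length answer vectors;
    # only entities supported by BOTH intermediate answers stay non-zero.
    return [min(ak, bk) for ak, bk in zip(a, b)]

# Invented intermediate answers over four candidate entities for a
# two-entity question; entity 1 is supported by both answers.
answer_a = [0.9, 0.7, 0.0, 0.1]
answer_b = [0.0, 0.8, 0.6, 0.2]
final = min_elem(answer_a, answer_b)
```

Because min is differentiable almost everywhere, the operation can sit inside the end-to-end training loop.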
Step 2.2.2, build problem embedding:
using encodersf q Embedding question text q-code forming questionsh q Roberta input as a model encoder; for each question entity, the question text q and the entity mention or canonical name m are marked with a separator [ SEP ]]Separating and connecting; marking index using delimitersisepEmbedding at as entity specific representation of the problem, i.e. entity specific problem embedding:
(13);
wherein: q is the question text and m is the entity mention or canonical name.
Step 2.2.3, obtaining a multi-entity identification result in a sentence:
(a) using the decoder to run parallel predictive reasoning over each entity-specific question embedding, obtaining intermediate answers from the entities and relations in the differentiable knowledge graph;
(b) weighting the entities in each vector according to the attention scores;
(c) intersecting the two intermediate answers according to formula (12) to obtain the final answer to the question; computing a loss from the difference between the final answer and the labeled entity answer, and training the whole model so that the encoder and decoder adjust their parameters according to the loss.
This step provides a document-level entity recognition scheme: it implements a new intersection operation to explicitly handle multi-entity questions, identifies the shared elements between two entities, and realizes end-to-end question answering, demonstrating that introducing the intersection operation improves performance on multi-entity questions and complex questions.
Step 3, based on the multi-entity recognition result, extracting relations through a seq2seq model based on entity hints; the specific process is as follows:
step 3.1, linearizing the source text into character strings based on the result of the step 2, wherein the specific process is as follows:
step 3.1.1, limiting target vocabulary: limiting the target vocabulary to a set of special tags required to model entities and relationships to prevent the model from generating entity references that do not appear in the source text;
step 3.1.2, copy mechanism: all other tokens are copied from the input using a copy mechanism, which works by extending the target vocabulary with the tokens of the source sequence X, allowing the model to copy these tokens into the output sequence Y; the embeddings of the special tokens are randomly initialized and learned together with the other parameters of the model;
step 3.1.3, constrained decoding: applying a constraint to the decoder by setting the predicted probability of invalid answers to a tiny value at each decoding step; during testing, constraints are applied to the decoder to reduce the likelihood of generating grammatically invalid target strings (strings that do not follow the linearization schema);
step 3.1.4, relation ordering: the relationships between the target strings are ordered according to their order of appearance in the source text, providing a consistent decoding order for the model.
The position of a relation is determined by the first mention of its head entity; when relations tie, the ordering falls back to the last-mentioned tail entity (and so on for n-ary relations). The relations extracted from a given document are inherently unordered, while the sequence cross-entropy loss is sensitive to the permutation of the predicted tokens; during training this can force an arbitrary decoding order and make the model prone to overfitting frequent token combinations in the training set. To alleviate this, the relations need to be sorted.
Given the input text X, each relation in the corresponding target string Y begins with its constituent entities. Coreferent mentions of an entity are separated by semicolons (;), and each entity is terminated with a special token indicating its type (e.g. @department@). Likewise, each relation is terminated with a special token indicating its type (e.g. @notify@); two or more entities may precede the special relation token to support n-ary extraction. Entities are ordered according to whether they play the head or tail role of the relation. For each document, the target string may contain multiple relations. Entities may be nested or discontinuous in the input text.
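The linearization scheme described above can be sketched as follows; the type tokens (@department@, @notify@, @end@) follow the examples in the text, while the helper itself and its inputs are a hypothetical illustration:

```python
def linearize(relations):
    # Each relation: (head mentions, head type, tail mentions, tail type,
    # relation type). Coreferent mentions are joined with ";", each entity
    # ends with its "@TYPE@" token, and the relation ends with its own
    # "@TYPE@" token. Step 3.1.4 assumes the relations list is already
    # sorted by order of appearance in the source text.
    parts = []
    for head, htype, tail, ttype, rtype in relations:
        parts.append(
            f"{';'.join(head)} @{htype}@ {';'.join(tail)} @{ttype}@ @{rtype}@")
    return " ".join(parts) + " @end@"

target = linearize([
    (["营销服务中心", "服务中心"], "department", ["整改通知"], "document", "notify"),
])
```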
Step 3.2, adding entity hints: when an entity appears in the input sentence, the entity and its identifying name are added to the source text, with the end of the entity hint delimited by the special token @sep@.
Step 3.3, constructing a sequence-to-sequence model based on the character strings and the entity hints:

step 3.3.1, modeling the conditional probability:

p(Y | X) = Π_{z=1}^{T} p(y_z | y_{<z}, X)   (14)

where p(Y | X) is the probability of generating the target sequence Y given the source sequence X; y_z is the element of the target sequence Y at decoding step z; X is the source sequence of length S, representing the original text to be analyzed, i.e. the input text with entity and special tokens; Y is the target sequence of length T, the linearization of the relations contained in the source; p(y_z | y_{<z}, X) is the conditional probability of generating the z-th element y_z at decoding step z given the source sequence X and the partial target sequence y_{<z} generated so far; multiplying the conditional probabilities of all decoding steps from 1 to the maximum step T gives the joint probability of the whole sequence.
Step 3.3.2, optimizing the sequence cross-entropy loss to maximize the log-likelihood of the training data:

L(θ) = −Σ_{z=1}^{T} log p(y_z | y_{<z}, X; θ)   (15)

where L(θ) denotes the loss function, and the training objective is to minimize it; θ are the model parameters, learned so as to maximize the probability of generating the target sequence; log p(y_z | y_{<z}, X; θ) is the log-likelihood term: the log conditional probability of generating the z-th element y_z at decoding step z given the source sequence X, the generated partial target sequence y_{<z} and the model parameters θ; summing the negative log conditional probabilities of the decoding steps up to the maximum T gives the model's loss function, whose minimization maximizes the log-likelihood of the training data.
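Equations (14) and (15) amount to multiplying per-step probabilities and summing their negative logs; a sketch with invented gold-token probabilities:

```python
import math

def sequence_prob(step_probs):
    # Equation (14): p(Y|X) = Π_z p(y_z | y_<z, X).
    prob = 1.0
    for p in step_probs:
        prob *= p
    return prob

def sequence_nll(step_probs):
    # Equation (15): L(θ) = -Σ_z log p(y_z | y_<z, X; θ), where
    # step_probs[z] is the model's probability of the gold token at step z.
    return -sum(math.log(p) for p in step_probs)

probs = [0.9, 0.8, 0.95]      # invented per-step gold-token probabilities
loss = sequence_nll(probs)
# Minimizing the loss maximizes the sequence probability:
assert abs(math.exp(-loss) - sequence_prob(probs)) < 1e-12
```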
As shown in fig. 3, each token of the input sentence is mapped to a contextual embedding by the encoder; the autoregressive decoder then generates the output token by token at each decoding step, processing the encoder outputs until an end-of-sequence token (@end@) is generated or the maximum number of tokens is reached.
Step 4, constructing an extraction model based on LDA-BERT, and extracting semantic topics according to the multi-entity recognition result and the relation extraction result; the specific process is as follows:
step 4.1, feature extraction: extracting words from the entity-relation-entity triplet set obtained in step 3 through an N-gram algorithm, and creating sentence embeddings based on an SBERT model, specifically:
words are extracted from the entity-relation-entity triplet set obtained in step 3 through the N-gram algorithm, the words comprising unigrams (Unigram), bigrams (Bi-gram) and trigrams (Tri-gram); SBERT embedding applies two parallel BERT networks with shared weights to two sentences; the marks of each sentence are embedded, the data are compressed by average pooling, and a similarity score is then generated to create the sentence embedding; when generating the similarity score for creating sentence embeddings, a sentence may either be marked directly in the original sentence and embedded correspondingly after being converted into a vector, or a database may be created and the sentence embedded independently after conversion.
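The N-gram part of step 4.1 can be sketched in a few lines of plain Python; the SBERT side would come from a library such as sentence-transformers and is only indicated in a comment (an assumption, not part of the patent text):

```python
def extract_ngrams(tokens, max_n=3):
    """Collect unigrams, bigrams and trigrams from a token list,
    as in the feature-extraction step 4.1."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["transformer", "overload", "protection"]
grams = extract_ngrams(tokens)
assert "transformer" in grams                      # unigram
assert "transformer overload" in grams             # bigram
assert "transformer overload protection" in grams  # trigram
assert len(grams) == 3 + 2 + 1

# The sentence embeddings would then be produced by a siamese SBERT model
# (e.g. sentence-transformers) and compared via a cosine similarity score.
```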
Step 4.2, creating a dictionary: mapping the words extracted in step 4.1 to IDs, passing them to a corpora.Dictionary() object to generate a document list, and passing the document list to dictionary.doc2bow() to generate a word matrix used as the input of the LDA model;
each word is mapped to a distinct integer ID by converting the text data into a word list; since step 4.1 already extracted the unigrams, bigrams and trigrams by the N-gram algorithm, the text data has already been converted into words, so only the mapping to IDs is needed here; the words are then passed to the corpora.Dictionary() object; the document list (i.e. the tokenized word list) is used to generate a BoW corpus; the tokenized word list is provided to the dictionary.doc2bow() object, which then creates the word matrix used as the input of the LDA model; each document corresponds to one of the sentences or pieces of text data used to generate the BoW corpus.
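A pure-Python stand-in for the Gensim calls of step 4.2 (the actual pipeline uses gensim.corpora.Dictionary and its doc2bow method; this sketch only mirrors their behaviour for illustration):

```python
def build_dictionary(documents):
    """Map each distinct word to an integer ID, in the spirit of
    gensim's corpora.Dictionary."""
    token2id = {}
    for doc in documents:
        for word in doc:
            if word not in token2id:
                token2id[word] = len(token2id)
    return token2id

def doc2bow(token2id, doc):
    """Bag-of-words row for one document: sorted (word_id, count) pairs,
    like Dictionary.doc2bow; one row per document forms the word matrix
    that is fed to the LDA model."""
    counts = {}
    for word in doc:
        wid = token2id[word]
        counts[wid] = counts.get(wid, 0) + 1
    return sorted(counts.items())

docs = [["grid", "fault", "grid"], ["fault", "repair"]]
token2id = build_dictionary(docs)
assert doc2bow(token2id, docs[0]) == [(0, 2), (1, 1)]  # grid x2, fault x1
assert doc2bow(token2id, docs[1]) == [(1, 1), (2, 1)]  # fault, repair
```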
Step 4.3, topic modeling: vectorizing the word matrix using Gensim corpora, converting the sentence-level operation of the SBERT model obtained in step 4.1 into a document-level operation, and splicing the SBERT embedding of each document with the word matrix row corresponding to that document to create a new document representation.
The word matrix is vectorized using Gensim corpora, and the SBERT embedding operation for sentences obtained in step 4.1 is converted into an operation on documents, because the vectorized word matrix is effectively a document-word matrix, where each row represents a document and each column represents a word; since all documents are included, the corresponding similarity scores are spliced according to the document distribution, and the SBERT embedding is fed to a bidirectional Transformer encoder, whose activation function is ReLU and whose optimizer is Adam, i.e. the BERT network architecture described in step 4.1; the SBERT embedding of each document is spliced with the word matrix row corresponding to that document to create a new document representation, which contains both the word-frequency information and the semantic information of the document. Taking this new document representation as input, the LDA model uses the composite representation to perform topic modeling, decomposing the documents into topics.
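The composite representation of step 4.3 amounts to concatenating each document's word-matrix row with its SBERT embedding; a minimal sketch (the embedding values here are placeholders, not real SBERT output):

```python
def composite_representation(bow_rows, embeddings):
    """Concatenate each document's bag-of-words row (word-frequency
    information) with its sentence embedding (semantic information),
    producing the new document representation fed to the LDA model."""
    assert len(bow_rows) == len(embeddings)
    return [row + emb for row, emb in zip(bow_rows, embeddings)]

bow = [[2.0, 1.0, 0.0], [0.0, 1.0, 1.0]]  # document-word frequency rows
emb = [[0.1, -0.3], [0.4, 0.2]]           # placeholder SBERT vectors
combined = composite_representation(bow, emb)
assert combined[0] == [2.0, 1.0, 0.0, 0.1, -0.3]
assert len(combined[1]) == len(bow[1]) + len(emb[1])
```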
Step 5, visualizing the result; the specific process is as follows: the entity-relation-entity triplet subsets with topic division obtained in the preceding steps are stored in an Excel table and visualized by using the knowledge graph construction software SmartKG. As shown in FIG. 4, the knowledge graph includes the topic division between texts, one-to-many relations between a text and text features, one-to-one relations between a text and a text feature, relational connections between texts, and relational connections between text features.
Embodiment 2
An apparatus for constructing a knowledge graph based on a grid-related file, comprising a memory and one or more processors, wherein the memory stores executable code, and wherein the one or more processors are configured to implement the method for constructing a knowledge graph based on a grid-related file according to embodiment 1 when executing the executable code.
Embodiment 3
A computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the method of constructing a knowledge-graph based on grid-related files described in embodiment 1.
While the application has been described in terms of specific embodiments, it will be appreciated by those skilled in the art that the application is not limited to the embodiments shown in the drawings and described in the foregoing detailed description. Any modifications which do not depart from the functional and structural principles of the present application are intended to be included within the scope of the appended claims.

Claims (10)

1. A method for constructing a knowledge graph based on a power grid related file, characterized by comprising the following steps:
step 1, collecting power grid related files to obtain an original corpus, preprocessing the texts of the original corpus to obtain a preprocessed corpus, and predefining an entity set and a relation set for the names collected in the corpus; annotating the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers;
step 2, based on the corpus with marked entity answers, performing multi-entity recognition by using a Rigel-Intersect model;
step 3, based on the multi-entity recognition result, extracting relations through a seq2seq model based on entity hints;
step 4, constructing an extraction model based on LDA-BERT, and extracting semantic topics according to the multi-entity recognition result and the relation extraction result;
and 5, visualizing the result.
2. The method for constructing a knowledge graph based on a power grid related file according to claim 1, wherein the specific process of step 1 is as follows:
step 1.1, collecting related files about the power industry to form an original corpus;
step 1.2, removing non-text content from the original corpus, performing Chinese word segmentation on the texts in the original corpus by using jieba word segmentation, carrying out part-of-speech tagging on the words according to the segmentation results, and removing common Chinese stop words from the texts to obtain a preprocessed corpus;
and 1.3, predefining an entity set and a relation set, and marking data of the preprocessed corpus based on the predefined entity set to obtain a corpus with marked entity answers.
3. The method for constructing a knowledge graph based on a power grid related file according to claim 2, wherein the specific process of step 2 is as follows:
step 2.1, constructing a Rigel-Baseline model to perform single entity identification;
the Rigel-Baseline model comprises an encoder and a decoder, wherein the encoder is used for encoding questions, and the decoder is used for returning a probability distribution over the relations on the micro knowledge graph;
step 2.2, constructing a Rigel-Intersect model by extending the Rigel-Baseline model with a differentiable intersection operation, and carrying out multi-entity recognition;
step 2.2.1, identifying a shared element between two entities by constructing an intersection of two vectors;
step 2.2.2, using a RoBERTa encoder to encode the question text to form a question embedding as the input of the model encoder; for each question entity, concatenating the question text with the entity mention or canonical name, separated by separator marks; the embedding at the separator mark index is used as the entity-specific representation of the question;
step 2.2.3, obtaining a multi-entity identification result in a sentence:
(a) using the decoder to perform parallel predictive reasoning on each entity-specific question embedding, and obtaining intermediate answers according to the entities and relations in the micro knowledge graph;
(b) Weighting the entities according to the attention score;
(c) Intersecting the two intermediate answers to obtain a final answer as an answer to the question; the loss is calculated based on the difference between the final answer and the marked entity answer, such that the encoder and decoder adjust the parameters based on the loss result.
4. The method for constructing a knowledge graph based on a power grid related file according to claim 3, wherein the specific process of step 2.1 is as follows:
step 2.1.1, defining an entity relation set expression;
step 2.1.2, constructing a set of triples existing in all the preprocessing corpuses according to the entity relation set; and representing the triplet set in a subject matrix, a relationship matrix and an object matrix of a given index;
step 2.1.3, constructing a single entity identification model based on matrix operation:
s2.1.3.1, given an entity vector expression;
s2.1.3.2, given the question embedding and the relation vector, calculating an entity vector; the question entity and the predicted relation are traced in the micro knowledge graph to return a predicted answer;
s2.1.3.3, calculating a hop attention score, associating the relation vector with the entity vector, and updating the entity vector;
s2.1.3.4, the single entity recognition model parameters are updated by constructing a loss function.
5. The method for constructing a knowledge graph based on a power grid related file according to claim 3, wherein the specific process of step 3 is as follows:
step 3.1, linearizing the source text into character strings based on the result of step 2;
step 3.2, adding entity hints: when an entity appears in the input sentence, the entity and its canonical name are added before the source text, and the end of the entity hint is delimited by a special mark;
and 3.3, constructing a sequence-to-sequence model based on the character string and the entity hint.
6. The method for constructing a knowledge graph based on a power grid related file according to claim 5, wherein the specific process of step 3.1 is as follows:
step 3.1.1, limiting target vocabulary;
step 3.1.2, copy mechanism: all other marks are copied from the input using a copy mechanism;
step 3.1.3, end decoding: constraining the decoder by setting the prediction probability of wrong answers to a tiny value;
step 3.1.4, relation ordering: and ordering the relations according to the appearance sequence of the target character strings in the source text, and providing a consistent decoding sequence for the model.
7. The method for constructing a knowledge graph based on a power grid related file according to claim 5, wherein the specific process of step 4 is as follows:
step 4.1, feature extraction: extracting the result of the step 3 through an N-gram algorithm to obtain words, and creating sentence embedding based on an SBERT model;
step 4.2, creating a dictionary: mapping the words extracted in step 4.1 to IDs to generate a document list, and converting the document list into a word matrix used as the input of the LDA model;
step 4.3, topic modeling: vectorizing the word matrix, converting the sentence-level operation of the SBERT model obtained in step 4.1 into a document-level operation, and splicing the SBERT embedding of each document with the word matrix row corresponding to the document to create a new document representation.
8. The method for constructing a knowledge graph based on a power grid related file according to claim 1, wherein the specific process of step 5 is as follows: storing the result of step 4 in an Excel table, and visualizing it by using knowledge graph construction software.
9. An apparatus for constructing a grid-related file-based knowledge graph, comprising a memory and one or more processors, the memory having executable code stored therein, the one or more processors, when executing the executable code, being configured to implement the method for constructing a grid-related file-based knowledge graph of any one of claims 1 to 8.
10. A computer-readable storage medium, on which a program is stored, which program, when being executed by a processor, implements the method of constructing a knowledge-graph based on grid-related files according to any one of claims 1 to 8.
CN202311466119.3A 2023-11-07 2023-11-07 Method, device and medium for constructing knowledge graph based on power grid related file Active CN117194682B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311466119.3A CN117194682B (en) 2023-11-07 2023-11-07 Method, device and medium for constructing knowledge graph based on power grid related file


Publications (2)

Publication Number Publication Date
CN117194682A true CN117194682A (en) 2023-12-08
CN117194682B CN117194682B (en) 2024-03-01

Family

ID=88994676

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311466119.3A Active CN117194682B (en) 2023-11-07 2023-11-07 Method, device and medium for constructing knowledge graph based on power grid related file

Country Status (1)

Country Link
CN (1) CN117194682B (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046260A (en) * 2019-04-16 2019-07-23 广州大学 A kind of darknet topic discovery method and system of knowledge based map
CN111930784A (en) * 2020-07-23 2020-11-13 南京南瑞信息通信科技有限公司 Power grid knowledge graph construction method and system
CN113779219A (en) * 2021-09-13 2021-12-10 内蒙古工业大学 Question-answering method for embedding multiple knowledge maps by combining hyperbolic segmented knowledge of text
CN114077673A (en) * 2021-06-21 2022-02-22 南京邮电大学 Knowledge graph construction method based on BTBC model
US20220164683A1 (en) * 2020-11-25 2022-05-26 Fmr Llc Generating a domain-specific knowledge graph from unstructured computer text
CN115269857A (en) * 2022-04-28 2022-11-01 东北林业大学 Knowledge graph construction method and device based on document relation extraction
CN116578717A (en) * 2023-04-28 2023-08-11 国网安徽省电力有限公司信息通信分公司 Multi-source heterogeneous knowledge graph construction method for electric power marketing scene
CN116628219A (en) * 2023-05-10 2023-08-22 浙江工业大学 Question-answering method based on knowledge graph
CN116739081A (en) * 2023-05-25 2023-09-12 东北电力大学 Knowledge acquisition and representation method oriented to power grid risk field


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
AKASH GHOSH et al.: "Semantic Clone Detection: Can Source Code Comments Help?", 2018 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC)
SONG Weiqiong et al.: "Construction and Application of a Distribution Network Knowledge Graph Based on GCN", Electronic Design Engineering, vol. 30, no. 7
TAN Gang; CHEN Yu; PENG Yunzhu: "Power Grid Customer Service Question-Answering *** Integrating a Domain Feature Knowledge Graph", Computer Engineering and Applications, vol. 56, no. 03
WEI Tao; WANG Jinhua: "Knowledge Graph Construction Based on Non-taxonomic Relation Extraction Technology", Industrial Technology Innovation, vol. 07, no. 02

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786092A (en) * 2024-02-27 2024-03-29 成都晓多科技有限公司 Commodity comment key phrase extraction method and system
CN117786092B (en) * 2024-02-27 2024-05-14 成都晓多科技有限公司 Commodity comment key phrase extraction method and system

Also Published As

Publication number Publication date
CN117194682B (en) 2024-03-01


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant