CN111783464A - Electric power-oriented domain entity identification method, system and storage medium - Google Patents


Info

Publication number
CN111783464A
Authority
CN
China
Prior art keywords
entity
electric power
field
power
domain
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010625052.3A
Other languages
Chinese (zh)
Inventor
季知祥
施贵荣
蓝海波
蒲天骄
张锐
王晓辉
闵睿
刘鹏
刘剑青
肖凯
蔡常雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Jibei Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
China Electric Power Research Institute Co Ltd CEPRI
State Grid Jibei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, China Electric Power Research Institute Co Ltd CEPRI, State Grid Jibei Electric Power Co Ltd

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/06Energy or water supply
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Water Supply & Treatment (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Public Health (AREA)
  • Animal Behavior & Ethology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an electric power-oriented domain entity recognition method and system that recognize electric power domain entities in electric power domain text by constructing an electric power domain entity recognition algorithm. The method builds the domain entity recognition model from a bidirectional long short-term memory (BiLSTM) network and a conditional random field (CRF). The model combines the sequence modeling capacity of the BiLSTM network, which can capture distant context information and has a neural network's capacity to fit nonlinear relations, with a CRF that optimizes the whole label sequence. According to the invention, this effectively addresses the vanishing or exploding gradients of a traditional recurrent neural network and provides an important basis for constructing a knowledge graph of the electric power domain.

Description

Electric power-oriented domain entity identification method, system and storage medium
Technical Field
The invention relates to the technical field of electric power artificial intelligence, in particular to an entity identification method, system and storage medium for the electric power field.
Background
In recent years, with the development of smart power grids, traditional grid facilities have been continuously upgraded, information systems of many kinds have come into wide use, and smart grids now generate and accumulate massive multi-source heterogeneous data. With the construction of the ubiquitous power Internet of Things, artificial intelligence based on this mass of power data plays an important role in supporting professional applications such as marketing, operation and inspection, materials, dispatching, and safety supervision, and its development promotes the continuous emergence of new services and new application modes. At present, however, the power industry has not established a complete domain knowledge graph, and comprehensive intelligent support based on power knowledge has not been realized.
A knowledge graph describes the concepts, entities, and relations of the objective world in a structured form; it is a structured semantic knowledge base that describes concepts of the physical world and their interrelations symbolically. Its basic units are entity-relation-entity triples and entities with their attribute-value pairs, and entities are connected to one another through relations to form a network-like knowledge structure. In power dispatching, a knowledge graph can capture dispatching regulations and dispatcher experience, and support applications such as grid operation monitoring, exception handling, and operating mode adjustment. In power operation and maintenance, a knowledge graph can store knowledge about equipment, faults, and handling methods, and can support intelligent equipment operation and maintenance.
Constructing a domain knowledge graph first requires extracting the electric power entities. Entity recognition is a basic task in natural language processing with a very wide range of applications. Domain-specific named entity recognition, whose goal is to identify domain-specific entities and their classes, plays an important role in domain document classification, retrieval, and content analysis. It is the basis of deeper and more complex information extraction tasks and a cornerstone of the knowledge computing process that converts data into machine-readable knowledge.
Therefore, it is necessary to develop an electric power-oriented domain entity identification method in order to construct a knowledge graph of the electric power domain.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks mentioned above.
To this end, the present invention proposes an electric power-oriented domain entity identification method comprising the following steps:
S1, performing data extraction on the acquired power grid data to form a data set, and performing corpus labeling on the training data;
S2, dividing the labeled corpus to obtain a test set, inputting the test set into a constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
S3, calculating a weighted harmonic mean of the accuracy and the recall rate from the recognized domain entities, evaluating the electric power domain entity recognition algorithm model with the weighted harmonic mean, and outputting the recognized domain entities when the evaluation score meets the service requirements; when the evaluation score does not meet the service requirements, correcting the electric power domain entity recognition model and repeating steps S2 to S3.
Preferably, in S1, the BIE scheme is used to label the corpus of the training data, wherein the letter B represents the first character of a power entity, the letter I represents an internal character of a power entity, the letter E represents the ending character of a power entity, and the letter O represents any other character that does not belong to a named entity.
Preferably, in any of the above embodiments, the electric power domain entity recognition algorithm model is constructed by combining a bidirectional long short-term memory network with a conditional random field; in S2, the labeled corpus is divided into a test set and a training set, and the constructed domain entity recognition algorithm model is trained using the training set.
Preferably, in any of the above embodiments, the electric power domain entity recognition algorithm model comprises: a first layer, used for mapping each word in a sentence from a one-hot vector to a low-dimensional dense word vector; a second layer, the bidirectional LSTM layer, used for automatically extracting sentence features from the word vectors of the first layer; and a third layer, the CRF layer, used for performing sentence-level sequence labeling on the extracted sentence features and classifying the resulting sentence labels.
Preferably, in any of the above embodiments, in the CRF layer the sentence labels are classified using the following formula:
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}$$
where $\mathrm{score}(x, y)$ represents the score of sentence $x$ with label sequence $y$.
Preferably, in any of the above embodiments, the electric power domain entity recognition algorithm model is trained by maximizing the log-likelihood function; for a training sample the log-likelihood is
$$\log P(y^x \mid x) = \mathrm{score}(x, y^x) - \log\Big(\sum_{y'} \exp(\mathrm{score}(x, y'))\Big)$$
where $\mathrm{score}(x, y^x)$ represents the score of sentence $x$ with label sequence $y^x$, and $(x, y^x)$ is a training sample.
Preferably, in any of the above embodiments, in step S3, the electric power domain entity recognition algorithm model uses the Viterbi algorithm to find the optimal path during prediction, and the solution gives the positions of the identified entities.
Preferably, in any of the above embodiments, in S3, the accuracy $P$ is calculated by the following formula:
$$P = \frac{\text{number of correctly identified entities}}{\text{total number of identified entities}}$$
the recall rate $R$ is calculated by the following formula:
$$R = \frac{\text{number of correctly identified entities}}{\text{total number of entities in the sample}}$$
and, when the entity recognition algorithm model is evaluated, the weighted harmonic mean of the accuracy and the recall rate is calculated by the following formula, where $\beta$ is the weight:
$$F = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
The invention also provides an electric power-oriented domain entity identification system, comprising:
a data acquisition module, used for performing data extraction on the acquired power grid data to form a data set and performing corpus labeling on the training data;
a domain entity identification module, used for dividing the labeled corpus to obtain a test set, inputting the test set into a constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
an evaluation module, used for calculating a weighted harmonic mean of the accuracy and the recall rate from the recognized domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the recognized domain entities when the evaluation score meets the service requirements; when the evaluation score does not meet the service requirements, the electric power domain entity recognition model is corrected and recognition is performed again.
The invention also provides electric power-oriented domain entity identification equipment, comprising a memory for storing a computer program, and a processor for implementing the steps of the electric power domain entity identification method when executing the computer program.
The present invention also provides a computer storage medium storing a computer program which, when executed by a processor, performs the steps of the domain entity identification method described above.
Compared with the prior art, the electric power-oriented domain entity identification method has at least the following advantages:
The corpus of the power grid data is labeled and entities are then recognized with the electric power domain entity recognition model. This helps collect accurate information from the huge volume of power dispatching information, realizes the extraction of electric power domain entities, and lays a solid foundation for the later construction of a power knowledge graph. The electric power domain entity recognition model is evaluated by accuracy and recall rate, so the model can be corrected in time and overfitting of the model is addressed. The entity recognition model is built from a bidirectional long short-term memory network and a conditional random field: it combines the sequence modeling capability of the bidirectional long short-term memory network, which can capture distant context information and has a neural network's capacity to fit nonlinear relations, with a conditional random field that optimizes the whole sequence, thereby effectively addressing the vanishing or exploding gradients of a traditional recurrent neural network.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it. In the drawings:
FIG. 1 is a flow chart of a method for identifying a domain entity for electric power according to the present invention;
FIG. 2 is a diagram of a unit structure of an LSTM model in the method for recognizing an entity in the power-oriented domain according to the present invention;
FIG. 3 is a block diagram of a BLSTM model in the method for recognizing an entity in the power-oriented domain according to the present invention;
FIG. 4 is a schematic diagram of a bidirectional long-and-short-term memory network and a conditional random field electric power domain entity recognition model structure of the electric power-oriented domain entity recognition method provided by the invention;
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
As shown in fig. 1, the present invention provides a method for recognizing an entity in a power-oriented domain, comprising the following steps:
S1, acquiring power grid data, extracting the power grid data to form a data set, and labeling the corpus of the training data with the BIE scheme;
When the corpus is labeled, the letter B represents the first character of an electric power entity, the letter I represents an internal character of an electric power entity, the letter E represents the final character of an electric power entity, and the letter O represents any other character that does not belong to a named entity. Tags are assigned per character of the original Chinese text, so the number of tags does not match the number of words in the English translation. For example, in the text data "analyze the influence of the DC reactor on the stability of the power transmission system", the power entities are "DC reactor" and "power transmission system"; "DC reactor" is labeled "BIIIE" and "power transmission system" is labeled "BIIE". Further examples are shown in Table 1.
TABLE 1 Corpus annotation examples
Sentence containing electric power entity nouns | Corpus annotation result
Analyzing the influence of a DC reactor on the stability of a power transmission system | O O B I I I E O B I I E O O O O O O
Capacitor at converter station outlet | B I E O O O O B I E
Direct current transmission technology based on voltage source type current converter | O O B I I I I I E O B I I I I E
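The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `bieo_tags` is hypothetical, the Chinese sentence is an assumed back-translation of the first Table 1 example (the original Chinese text is not part of this translation), and entities are located by plain substring search.

```python
def bieo_tags(text, entities):
    """Return one B/I/E/O tag per character of `text`.

    `entities` is a list of entity strings; every occurrence is tagged.
    Overlapping entities are not handled in this sketch.
    """
    tags = ["O"] * len(text)
    for entity in entities:
        if len(entity) < 2:
            continue  # the source defines no tag for one-character entities
        start = text.find(entity)
        while start != -1:
            end = start + len(entity) - 1
            tags[start] = "B"                 # first character
            for i in range(start + 1, end):
                tags[i] = "I"                 # internal characters
            tags[end] = "E"                   # final character
            start = text.find(entity, end + 1)
    return tags


# Assumed Chinese form of "Analyzing the influence of a DC reactor on the
# stability of a power transmission system" (18 characters).
sentence = "分析直流电抗器对输电系统稳定性的影响"
print(" ".join(bieo_tags(sentence, ["直流电抗器", "输电系统"])))
# prints: O O B I I I E O B I I E O O O O O O
```

The printed tag sequence has 18 tags, one per character, which agrees with the first row of Table 1.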
S2, dividing the labeled corpus to obtain a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
The electric power domain entity recognition algorithm model is constructed from a bidirectional long short-term memory network and a conditional random field.
It should be noted that the conditional random field (CRF) is a discriminative model that combines features of the maximum entropy (ME) model and the hidden Markov model (HMM), and it performs well on natural language sequence labeling tasks such as part-of-speech tagging and entity recognition.
The CRF model is a statistical model based on a probabilistic graph. It can incorporate a variety of features using context information, performs global normalization, can obtain a globally optimal solution, and avoids the label bias problem.
Typically, an input sequence is denoted $z = \{z_1, z_2, \ldots, z_n\}$, where $z_i$ is the vector of the $i$-th input character, and $y = \{y_1, y_2, \ldots, y_n\}$ denotes a possible tag sequence for the input $z$. $Y(z)$ denotes the set of all possible tag sequences for $z$. The sequence CRF probability model defines a family of conditional probabilities $p(y \mid z; W, b)$ over all possible tag sequences $y$ given $z$, calculated by Equation 1.
$$p(y \mid z; W, b) = \frac{\prod_{i=1}^{n} \psi_i(y_{i-1}, y_i, z)}{\sum_{y' \in Y(z)} \prod_{i=1}^{n} \psi_i(y'_{i-1}, y'_i, z)} \quad \text{(Equation 1)}$$
where
$$\psi_i(y', y, z) = \exp\big(W_{y', y}^{T} z_i + b_{y', y}\big)$$
is a potential function, and $W_{y', y}^{T}$ and $b_{y', y}$ are the weight vector and bias for the tag pair $(y', y)$, respectively.
For CRF model training, maximum likelihood estimation is used. For a training set $\{(z^{(j)}, y^{(j)})\}$, the log-likelihood is given by Equation 2.
$$L(W, b) = \sum_{j} \log p(y^{(j)} \mid z^{(j)}; W, b) \quad \text{(Equation 2)}$$
Maximum likelihood training selects the parameters that maximize $L(W, b)$. Decoding finds the tag sequence $y^*$ with the maximum conditional probability, as shown in Equation 3.
$$y^* = \arg\max_{y \in Y(z)} p(y \mid z; W, b) \quad \text{(Equation 3)}$$
For a sequence CRF model, only the interaction between two consecutive labels is considered, and the Viterbi algorithm is adopted in training and decoding, which gives good results.
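The decoding step of Equation 3 can be illustrated with a small pure-Python Viterbi sketch. It assumes the per-position label scores and the pairwise transition scores have already been computed (the weights $W$ and $b$ are not modeled), and the function name, the score values, and the penalty of -10 for disallowed transitions are all made up for illustration.

```python
def viterbi(emissions, transitions):
    """Return (best_path, best_score) for a linear-chain model.

    emissions[i][j]   : score of label j at position i
    transitions[a][b] : score of moving from label a to label b
    """
    k = len(emissions[0])
    best = list(emissions[0])   # best score of any path ending in each label
    backpointers = []
    for row in emissions[1:]:
        new_best, pointers = [], []
        for j in range(k):
            # Best previous label for arriving at label j.
            prev = max(range(k), key=lambda a: best[a] + transitions[a][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + row[j])
        best = new_best
        backpointers.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


labels = ["B", "I", "E", "O"]
# Illustrative transition scores: -10 penalizes transitions that the
# tagging scheme does not allow (for example B followed directly by O).
transitions = [
    [-10, 0, 0, -10],   # from B
    [-10, 0, 0, -10],   # from I
    [0, -10, -10, 0],   # from E
    [0, -10, -10, 0],   # from O
]
emissions = [
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 0.5, 1.0, 1.5],   # taken alone, this position prefers O
    [0.0, 0.0, 0.0, 2.0],
]
path, score = viterbi(emissions, transitions)
print([labels[j] for j in path], score)   # prints: ['B', 'E', 'O'] 5.0
```

Choosing the best label at each position independently would give B, O, O, which contains a disallowed transition; the sequence-level search returns B, E, O instead, which is the point of optimizing the whole sequence.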
The LSTM (long short-term memory network) is a special recurrent neural network (RNN) that largely solves the RNN vanishing gradient problem by introducing a memory cell and a gating mechanism. The LSTM unit structure is shown in FIG. 2, where x represents the input of the network at different time steps, y represents the output of the network, h represents the hidden layer, u represents the weights from the input layer to the hidden layer, w represents the weights from the hidden layer of the previous node to the hidden layer of the current node, and v represents the weights from the hidden layer to the output layer.
The structure of the bidirectional long short-term memory network (BLSTM) is shown in FIG. 3. When processing sequence data, the BLSTM adds a backward pass to the ordinary LSTM; this pass can use the following context of the sequence. The values calculated in the forward and backward directions are both passed to the output layer, so that all the information of a sequence is obtained from both directions, and the network can be applied to many types of natural language processing tasks.
The BLSTM has strong sequence modeling capability, can capture distant context information, and has a neural network's capacity to fit nonlinear relations. The CRF calculates a joint probability and optimizes the entire sequence rather than stitching together the optimum at each time step, and in this respect it is superior to the LSTM. Both models have their own advantages, and the BiLSTM-CRF model is adopted for electric power entity identification in order to combine them. A CRF layer is added after the output layer of the BLSTM network, so that the problem of vanishing or exploding gradients of the traditional recurrent neural network is effectively addressed.
As shown in FIG. 4, the electric power domain entity recognition model constructed from a bidirectional long short-term memory network and a conditional random field comprises:
The first layer, used for mapping each word in the sentence from a one-hot vector to a low-dimensional dense word vector, giving the sequence (x1, x2, ..., xn);
The second layer, the bidirectional LSTM layer, used for automatically extracting sentence features from the word vectors of the first layer;
Specifically, the word vectors are the input to the bidirectional LSTM, which automatically extracts sentence features. The hidden state sequence output by the forward LSTM, $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$, and the hidden state sequence output by the backward LSTM, $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$, are concatenated position by position to obtain the complete hidden state sequence $h_t$ as the output of the LSTM layer, where $m$ is the number of cells of the hidden layer.
Figure BDA0002565848750000073
Figure BDA0002565848750000074
Figure BDA0002565848750000075
The hidden state sequence is then fed into a linear layer that maps each hidden state from $m$ dimensions to $k$ dimensions ($k$ is the number of labels in the label set), giving the automatically extracted sentence features, which form the feature matrix $P$.
$$P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k} \quad \text{(Equation 7)}$$
Each entry $p_{ij}$ of the feature matrix $P$ can be regarded as the score for classifying word $x_i$ into the $j$-th label. However, labeling each position this way cannot use the labels already assigned to other positions, so a CRF layer is attached next to perform the labeling.
The third layer, the CRF layer, is used for performing sentence-level sequence labeling on the extracted sentence features and classifying the resulting sentence labels.
The CRF layer performs sentence-level sequence labeling. Its parameter is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the $i$-th label to the $j$-th label, so that labels already assigned can be used when labeling a position. For a sentence $x$, the score of the label sequence $y = (y_1, y_2, \ldots, y_n)$ is calculated as shown in Equation 8.
$$\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i} \quad \text{(Equation 8)}$$
The score of the whole sequence is thus the sum of the scores of the positions, and the score of each position has two parts: one is $p_i$, output by the LSTM, and the other is determined by the transition matrix $A$ of the CRF.
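A small worked example may make the two parts of the sequence score concrete. The sketch below assumes that the two extra rows and columns of the $(k+2) \times (k+2)$ matrix stand for a start state and an end state, which is a common convention but is not stated in the text; the function name and all score values are made up.

```python
def sequence_score(P, A, y, start, end):
    """Emission part plus transition part of the score of label sequence y.

    P[i][j] : score of label j at position i (output of the BiLSTM layer)
    A[a][b] : transition score from label a to label b; the indices
              `start` and `end` are the assumed extra start/end states.
    """
    emission = sum(P[i][label] for i, label in enumerate(y))
    padded = [start] + list(y) + [end]
    transition = sum(A[a][b] for a, b in zip(padded, padded[1:]))
    return emission + transition


# k = 2 labels (indices 0 and 1); index 2 is the start state, 3 the end state.
P = [[1.0, 0.25],
     [0.5, 1.0]]
A = [[0.5, -0.5, 0.0, 0.0],
     [-0.25, 0.25, 0.0, 0.25],
     [0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]

# Emissions: 1.0 + 1.0 = 2.0
# Transitions: start->0 (0.5) + 0->1 (-0.5) + 1->end (0.25) = 0.25
print(sequence_score(P, A, [0, 1], start=2, end=3))   # prints: 2.25
```

For a sentence of n positions the transition sum has n + 1 terms, one more than the emission sum, because of the start and end states.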
Because each row of the feature matrix is $k$-dimensional ($k$ is the number of labels in the label set), applying Softmax to the matrix $P$ amounts to classifying each position of the sentence into $k$ classes.
The normalized probability is therefore obtained using Softmax, as shown in Equation 9.
$$P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))} \quad \text{(Equation 9)}$$
After the Softmax normalization, the result of classifying each position according to the $k$ labels is obtained.
The labeled corpus is divided into a training set and a test set, and the constructed entity recognition algorithm model is trained using the training set. The labeled data serve as the corpus for supervised training of entity recognition: 80% is used as training data and the remaining 20% as test data. To speed up model learning, the neural network model uses the Adam optimizer. The parameters of the model are shown in the table below.
TABLE 3 Model training parameter settings
Parameter | Parameter value
batch_size | 64
epoch | 10
hidden_dim | 300
optimizer | Adam
CRF | True
learning rate | 0.001
gradient clipping | 5.0
dropout keep_prob | 0.5
update embedding | True
pretrain embedding | random
embedding_dim | 300
shuffle | True
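The 80/20 division of the labeled corpus described above could be sketched as follows. The function name, the fixed random seed, and the placeholder sentences are assumptions for illustration; the text does not say how the split is drawn.

```python
import random


def split_corpus(samples, train_fraction=0.8, seed=0):
    """Shuffle the labeled samples and split them into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder corpus: each item stands for one labeled sentence.
corpus = [f"sentence-{i}" for i in range(10)]
train, test = split_corpus(corpus)
print(len(train), len(test))   # prints: 8 2
```

The model is fitted only on `train`; `test` is held back for the evaluation in step S3.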
The model is trained by maximizing the log-likelihood function. Equation 10 gives the log-likelihood for a training sample $(x, y^x)$:
$$\log P(y^x \mid x) = \mathrm{score}(x, y^x) - \log\Big(\sum_{y'} \exp(\mathrm{score}(x, y'))\Big) \quad \text{(Equation 10)}$$
The sentence labels are classified using this formula. During prediction, the Viterbi algorithm is used to find the optimal path; decoding finds the label sequence with the maximum probability, and this sequence gives the positions of the identified entities.
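The log-likelihood above contains a sum over every possible label sequence, which grows exponentially with sentence length. A standard way to compute it, not described in the text, is the forward algorithm. The sketch below is an illustration under simplifying assumptions: start and end transitions are omitted, all names and numbers are made up, and the result is checked against brute-force enumeration.

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(P, A, y):
    """Emission plus transition score (start/end transitions omitted)."""
    emission = sum(P[i][t] for i, t in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(y, y[1:]))
    return emission + transition


def log_partition(P, A):
    """log of the sum of exp(score) over all label sequences (forward algorithm)."""
    k = len(P[0])
    alpha = list(P[0])
    for row in P[1:]:
        alpha = [logsumexp([alpha[a] + A[a][j] for a in range(k)]) + row[j]
                 for j in range(k)]
    return logsumexp(alpha)


def log_likelihood(P, A, y):
    """log P(y | x) = score(x, y) - log sum over y' of exp(score(x, y'))."""
    return score(P, A, y) - log_partition(P, A)


P = [[1.0, 0.25], [0.5, 1.0], [0.0, 2.0]]
A = [[0.5, -0.5], [-0.25, 0.25]]
all_sequences = list(product(range(2), repeat=len(P)))

# The forward algorithm agrees with explicit enumeration of all 8 sequences.
brute_force = logsumexp([score(P, A, y) for y in all_sequences])
assert abs(brute_force - log_partition(P, A)) < 1e-9

# The probabilities of all label sequences sum to one.
total = sum(math.exp(log_likelihood(P, A, y)) for y in all_sequences)
print(round(total, 6))   # prints: 1.0
```

Training would adjust the entries of `P` (through the network weights) and `A` so that `log_likelihood` of the gold label sequences increases.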
In one embodiment of the invention, the given text data is "Two sets of safety automatic devices are installed in the Hexi and hollow transformer substations. After the device detects that the hollow double circuit line is disconnected, the river east side switch is tripped". After electric power domain entity identification, the domain entities "Hexi transformer substation", "hollow transformer substation", "safety automatic device", "hollow double circuit line" and "river east side switch" are obtained;
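The reverse decoding from a predicted tag sequence back to entities could be sketched as below. The function name is hypothetical, the input characters are placeholders (the original Chinese sentence is not available in this translation), and the tag sequence is the second row of Table 1.

```python
def decode_entities(chars, tags):
    """Return (start, end_exclusive, text) for every well-formed B I* E span."""
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open a new span
        elif tag == "E" and start is not None:
            entities.append((start, i + 1, "".join(chars[start:i + 1])))
            start = None                   # span closed
        elif tag == "O":
            start = None                   # discard any unfinished span
        # tag "I" continues an open span and is ignored otherwise
    return entities


# Ten placeholder characters standing in for a ten-character sentence.
chars = "abcdefghij"
tags = "B I E O O O O B I E".split()
print(decode_entities(chars, tags))
# prints: [(0, 3, 'abc'), (7, 10, 'hij')]
```

Spans that are opened by B but never closed by E are dropped rather than guessed, which is one possible policy for malformed model output.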
S3, calculating the weighted harmonic mean of the accuracy and the recall rate, evaluating the entity recognition algorithm model, and using the model for electric power entity recognition once the evaluation meets the service requirements. The evaluation indexes adopted for power entity identification are mainly accuracy (precision) $P$ and recall rate $R$, defined in Equation 11 and Equation 12 respectively.
$$P = \frac{\text{number of correctly identified entities}}{\text{total number of identified entities}} \quad \text{(Equation 11)}$$
$$R = \frac{\text{number of correctly identified entities}}{\text{total number of entities in the sample}} \quad \text{(Equation 12)}$$
Both values lie between 0 and 1, and the closer a value is to 1, the higher the accuracy or recall rate. Accuracy and recall rate sometimes conflict, so their weighted harmonic mean, the F value, must be considered as an overall measure; it is defined in Equation 13, where $\beta$ is the weight.
$$F = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R} \quad \text{(Equation 13)}$$
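The evaluation in step S3 can be illustrated with entity-level counts. The sketch below is an assumption-laden example: entities are compared as exact (start, end) spans, the weighted harmonic mean is taken in the common F-beta form with a weight `beta` (beta = 1 gives the usual F1), and the spans are made up.

```python
def evaluate(predicted, gold, beta=1.0):
    """Return (precision, recall, f_value) for two collections of entity spans."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)        # entities found with exact boundaries
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    f_value = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_value


gold_spans = [(2, 7), (8, 12)]
predicted_spans = [(2, 7), (8, 12), (14, 16), (20, 22)]
precision, recall, f_value = evaluate(predicted_spans, gold_spans)
print(precision, recall, round(f_value, 3))   # prints: 0.5 1.0 0.667
```

Here the model finds both true entities (recall 1.0) but also reports two spurious ones (precision 0.5), and the F value of about 0.667 reflects the conflict between the two indexes that the text mentions.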
The invention also provides an electric power-oriented domain entity identification system, comprising:
a data acquisition module, used for acquiring power grid data, performing data extraction on the power grid data to form a data set, and labeling the corpus of the training data with the BIE scheme, as described above for step S1;
a domain entity identification module, used for dividing the labeled corpus to obtain a test set, inputting the test set into a constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
an evaluation module, used for calculating a weighted harmonic mean of the accuracy and the recall rate from the recognized domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the recognized domain entities when the evaluation score meets the service requirements;
when the evaluation score does not meet the service requirements, the electric power domain entity recognition model is corrected and recognition is performed again.
The present invention also provides a computer storage medium storing a computer program which, when executed by a processor, performs the steps of the domain entity identification method described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. An electric power-oriented domain entity identification method, characterized in that the method comprises the following steps:
S1, performing data extraction on the acquired power grid data to form a data set, and performing corpus labeling on the training data;
S2, dividing the labeled corpus to obtain a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse-decoding the recognition result to obtain the identified domain entities;
S3, calculating the weighted harmonic mean of the accuracy rate and the recall rate according to the identified domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the identified domain entities when the resulting evaluation score meets the service requirements;
and when the resulting evaluation score does not meet the service requirements, correcting the electric power domain entity recognition model and then repeating steps S2 to S3.
2. The electric power-oriented domain entity identification method of claim 1, wherein: in S1, BIE tagging is used to label the corpus of training data, where the letter B marks the first character of a power entity, the letter I marks an internal character of a power entity, the letter E marks the last character of a power entity, and the letter O marks any other character that does not belong to a named entity.
3. The electric power-oriented domain entity identification method of claim 1, wherein: the electric power domain entity recognition model is constructed by combining a bidirectional long short-term memory (LSTM) network with a conditional random field (CRF); in S2, the labeled corpus is divided into a test set and a training set, and the constructed domain entity recognition algorithm model is trained using the training set.
4. The electric power-oriented domain entity identification method of claim 3, wherein the electric power domain entity recognition algorithm model comprises:
a first layer, used for mapping each word in a sentence from a one-hot vector to a low-dimensional dense word vector;
a second layer, the bidirectional LSTM layer, used for automatically extracting sentence features from the word vectors of the first layer;
a third layer, the CRF layer, used for performing sentence-level sequence labeling with the extracted sentence features, and classifying the sentence labels formed by the labeling.
5. The electric power-oriented domain entity identification method of claim 4, wherein: in the CRF layer, sentence labels are classified using the following formula:
[equation image FDA0002565848740000021, not reproduced]
where score(x, y) represents the score of sentence x with label y.
6. The electric power-oriented domain entity identification method of claim 5, wherein: when the electric power domain entity recognition algorithm model is trained, the log-likelihood of a training sample is obtained from the log-likelihood function, which is maximized:
log P(y^x | x) = score(x, y^x) - log( Σ_{y′} exp(score(x, y′)) );
where score(x, y^x) represents the score of sentence x with label y^x, and (x, y^x) is a training sample.
7. The electric power-oriented domain entity identification method of claim 1, wherein: in S3, when solving the log-likelihood value, the electric power domain entity recognition algorithm model uses the Viterbi algorithm to find the optimal path during prediction, and the positions corresponding to the identified entities are obtained from the solution.
8. The electric power-oriented domain entity identification method of claim 1, wherein, in S3:
the accuracy rate is calculated by the following formula:
[equation image FDA0002565848740000022, not reproduced]
the recall rate is calculated by the following formula:
[equation image FDA0002565848740000023, not reproduced]
when the domain entity recognition algorithm model is evaluated, the F value, i.e., the weighted harmonic mean of the accuracy rate and the recall rate, is calculated by the following formula:
[equation image FDA0002565848740000024, not reproduced]
9. An electric power-oriented domain entity identification system, characterized by comprising:
a data acquisition module, used for performing data extraction on the acquired power grid data to form a data set and performing corpus labeling on the training data;
a domain entity identification module, used for dividing the labeled corpus to obtain a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse-decoding the recognition result to obtain the identified domain entities;
an evaluation module, used for calculating the weighted harmonic mean of the accuracy rate and the recall rate according to the identified domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the identified domain entities when the resulting evaluation score meets the service requirements; when the resulting evaluation score does not meet the service requirements, the electric power domain entity recognition model is corrected and recognition is performed again.
10. A computer storage medium, characterized in that the computer storage medium has stored thereon a computer program which, when being executed by a processor, realizes the steps of a power domain entity identification method according to any one of claims 1 to 8.
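
The sentence-level scoring, log-likelihood, and Viterbi decoding referred to in claims 5 to 7 can be illustrated with a small linear-chain CRF in plain Python. This is a sketch under stated assumptions, not the claimed implementation: the decomposition of score(x, y) into per-position emission scores (such as a bidirectional LSTM layer would produce) plus tag-to-tag transition scores is the usual BiLSTM-CRF formulation and is not spelled out in the claims, the normalized form P(y|x) = exp(score(x, y)) / Σ_{y′} exp(score(x, y′)) is inferred from the log-likelihood in claim 6, start and end transitions are omitted, and the function names and toy scores are hypothetical.

```python
import math


def sequence_score(emissions, transitions, tags):
    """score(x, y): emission scores plus transition scores along tag path y."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def _logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """log of the sum over all tag paths y' of exp(score(x, y')) (forward pass)."""
    n_tags = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            _logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + emissions[t][j]
            for j in range(n_tags)
        ]
    return _logsumexp(alpha)


def log_likelihood(emissions, transitions, tags):
    """log P(y | x) = score(x, y) - log sum over y' of exp(score(x, y'))."""
    return sequence_score(emissions, transitions, tags) - log_partition(emissions, transitions)


def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score."""
    n_tags = len(transitions)
    best = list(emissions[0])
    backpointers = []
    for t in range(1, len(emissions)):
        step, new_best = [], []
        for j in range(n_tags):
            i_best = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            step.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + emissions[t][j])
        backpointers.append(step)
        best = new_best
    last = max(range(n_tags), key=lambda j: best[j])
    path = [last]
    for step in reversed(backpointers):    # follow back-pointers to the start
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for a 3-character sentence and 4 tags (0=B, 1=I, 2=E, 3=O).
emissions = [[2.0, 0.1, 0.1, 0.5], [0.2, 1.5, 0.8, 0.3], [0.1, 0.4, 2.2, 0.6]]
transitions = [
    [-2.0, 1.0, 0.5, -2.0],   # from B
    [-2.0, 0.5, 1.0, -2.0],   # from I
    [0.5, -2.0, -2.0, 0.5],   # from E
    [0.5, -2.0, -2.0, 0.5],   # from O
]
best_path, best_score = viterbi(emissions, transitions)
```

With these toy scores the decoded path is B, I, E. Training would adjust the emission and transition scores to maximize `log_likelihood` over the training samples, and the decoded tag path is what the reverse-decoding step turns back into entity positions.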
CN202010625052.3A 2020-06-29 2020-07-01 Electric power-oriented domain entity identification method, system and storage medium Pending CN111783464A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020106063404 2020-06-29
CN202010606340 2020-06-29

Publications (1)

Publication Number Publication Date
CN111783464A (en) 2020-10-16

Family

ID=72757799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010625052.3A Pending CN111783464A (en) 2020-06-29 2020-07-01 Electric power-oriented domain entity identification method, system and storage medium

Country Status (1)

Country Link
CN (1) CN111783464A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749283A (en) * 2020-12-31 2021-05-04 江苏网进科技股份有限公司 Entity relationship joint extraction method for legal field
CN113761891A (en) * 2021-08-31 2021-12-07 国网冀北电力有限公司 Power grid text data entity identification method, system, equipment and medium
CN115396143A (en) * 2022-07-21 2022-11-25 沈阳化工大学 BILSTM-CRF-based industrial intrusion detection method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776866A (en) * 2016-11-29 2017-05-31 首都师范大学 A kind of method that meeting original text on University Websites carries out Knowledge Extraction
CN109002436A (en) * 2018-07-12 2018-12-14 上海金仕达卫宁软件科技有限公司 Medical text terms automatic identifying method and system based on shot and long term memory network
CN110232192A (en) * 2019-06-19 2019-09-13 中国电力科学研究院有限公司 Electric power term names entity recognition method and device

Similar Documents

Publication Publication Date Title
CN110245229B (en) Deep learning theme emotion classification method based on data enhancement
CN107992597B (en) Text structuring method for power grid fault case
CN107798624B (en) Technical label recommendation method in software question-and-answer community
CN110033281B (en) Method and device for converting intelligent customer service into manual customer service
CN113505200B (en) Sentence-level Chinese event detection method combined with document key information
CN111783464A (en) Electric power-oriented domain entity identification method, system and storage medium
CN111506732B (en) Text multi-level label classification method
CN110888980A (en) Implicit discourse relation identification method based on knowledge-enhanced attention neural network
CN115357719A (en) Power audit text classification method and device based on improved BERT model
CN111177402A (en) Evaluation method and device based on word segmentation processing, computer equipment and storage medium
CN112347269A (en) Method for recognizing argument pairs based on BERT and Att-BilSTM
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN114492460B (en) Event causal relationship extraction method based on derivative prompt learning
CN112463944A (en) Retrieval type intelligent question-answering method and device based on multi-model fusion
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN111428502A (en) Named entity labeling method for military corpus
CN112989803B (en) Entity link prediction method based on topic vector learning
Arora et al. Sentimental analysis on imdb movies review using bert
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
CN115600595A (en) Entity relationship extraction method, system, equipment and readable storage medium
CN115238077A (en) Text analysis method, device and equipment based on artificial intelligence and storage medium
Zhang et al. Hierarchical attention networks for grid text classification
US20230289533A1 (en) Neural Topic Modeling with Continuous Learning
CN113537802A (en) Open source information-based geopolitical risk deduction method
CN115618092A (en) Information recommendation method and information recommendation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination