CN111783464A - Electric power-oriented domain entity identification method, system and storage medium - Google Patents
- Publication number
- CN111783464A (application CN202010625052.3A)
- Authority
- CN
- China
- Prior art keywords
- entity
- electric power
- field
- power
- domain
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/06—Energy or water supply
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Abstract
The invention provides an electric power-oriented domain entity recognition method and system, which recognize electric power domain entities in electric power domain text by constructing an electric power domain entity recognition algorithm. The method builds the domain entity recognition model from a bidirectional long short-term memory (BiLSTM) network and a conditional random field (CRF). The model draws on the sequence modeling capability of the BiLSTM network, which captures long-distance context information and has the nonlinear fitting capability of a neural network, and uses the CRF to optimize the label sequence as a whole. It effectively addresses the vanishing-gradient and exploding-gradient problems of traditional recurrent neural networks and provides an important basis for constructing a knowledge graph in the electric power domain.
Description
Technical Field
The invention relates to the technical field of electric power artificial intelligence, in particular to an entity identification method, system and storage medium for the electric power field.
Background
In recent years, with the development of smart grids, traditional grid facilities have been continuously upgraded and retrofitted, information systems of all kinds have been widely deployed, and smart grids have generated and accumulated massive multi-source heterogeneous data. With the construction of the ubiquitous power Internet of Things, artificial intelligence based on massive power data plays an important role in supporting the construction of professional applications such as marketing, operation and inspection, materials, dispatching, and safety supervision, and the development of artificial intelligence applications continually promotes new services and new application modes. At present, the power industry has not established a complete domain knowledge graph, and comprehensive intelligent support based on power knowledge has not been realized.
A knowledge graph describes concepts, entities, and their relations in the objective world in a structured form; it is a structured semantic knowledge base that describes concepts in the physical world and their interrelations in symbolic form. Its basic units are entity-relation-entity triples and entities with their associated attribute-value pairs, and entities are connected to one another through relations to form a networked knowledge structure. In power dispatching, a knowledge graph can therefore capture dispatching regulations and dispatcher experience, and help support applications such as grid operation monitoring, exception handling, and operating-mode adjustment. In power operation and maintenance, a knowledge graph can store knowledge about equipment, faults, and handling methods, and can support intelligent equipment operation and maintenance.
Constructing a domain knowledge graph first requires extracting electric power entities. Entity recognition is a basic task in natural language processing with a very wide range of applications. Domain-specific named entity recognition, whose goal is to identify domain-specific entities and their classes, plays an important role in domain document classification, retrieval, and content analysis. It is the basis of deeper and more complex information extraction tasks and a cornerstone of the knowledge computing process that converts data into machine-readable knowledge.
Therefore, it is necessary to develop a domain entity identification method for electric power to construct a knowledge graph of the electric power domain.
Disclosure of Invention
The object of the present invention is to solve at least one of the technical drawbacks mentioned.
To this end, an object of the present invention is to propose an electric power-oriented domain entity identification method, comprising the following steps:
S1, performing data extraction on the acquired power grid data to form a data set, and performing corpus labeling on the training data;
S2, dividing the labeled corpus to obtain a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
S3, calculating the weighted harmonic mean of the accuracy and the recall rate from the recognized domain entities, and evaluating the electric power domain entity recognition algorithm model with it; when the evaluation score meets the service requirements, outputting the recognized domain entities; when the evaluation score does not meet the service requirements, correcting the electric power domain entity recognition model and repeating steps S2 to S3.
Preferably, in S1, BIE is used to perform corpus tagging on the training data, wherein the letter B is used to represent the first character of the power entity, the letter I is used to represent the internal character of the power entity, the letter E is used to represent the ending character of the power entity, and the letter O is used to represent other characters not belonging to the named entity.
In any of the above embodiments, preferably, the electric power field entity recognition algorithm model is constructed by using a method of combining a bidirectional long-and-short-term memory network and a conditional random field; in S2, the labeled corpus is divided into a test set and a training set, and the constructed domain entity recognition algorithm model is trained using the training set.
Preferably, in any one of the above embodiments, the electric power domain entity recognition algorithm model includes: a first layer, used for mapping each word in a sentence from a one-hot vector to a low-dimensional dense word vector; a second layer, the bidirectional LSTM layer, used for automatically extracting sentence features from the word vectors of the first layer; and a third layer, the CRF layer, used for performing sentence-level sequence labeling with the extracted sentence features and classifying the resulting sentence labels.
Preferably, in any of the above embodiments, in the CRF layer, the sentence labels are classified by using the following formula:
where score (x, y) represents the score of sentence x with label y.
In any one of the above embodiments, preferably, when the electric power domain entity recognition algorithm model is trained, the log-likelihood function is maximized; for a training sample $(x, y^{x})$ the log-likelihood is:
$\log P(y^{x} \mid x) = \mathrm{score}(x, y^{x}) - \log\left(\sum_{y'} \exp(\mathrm{score}(x, y'))\right)$;
where $\mathrm{score}(x, y^{x})$ is the score of sentence $x$ with label sequence $y^{x}$, and $(x, y^{x})$ is a training sample.
In any of the above embodiments, preferably, in step S3, the electric power domain entity recognition algorithm model solves for the optimal path with the Viterbi algorithm during prediction, and the positions of the recognized entities are obtained from the solution.
Preferably, in any one of the above embodiments, in S3, the accuracy is calculated by the following formula:
the recall ratio is calculated by the following formula:
when the entity recognition algorithm model is evaluated, the weighted harmonic mean value of the calculation accuracy and the recall rate is calculated by the following formula:
The invention also provides an electric power-oriented domain entity identification system, comprising:
a data acquisition module, used for performing data extraction on the acquired power grid data to form a data set and performing corpus labeling on the training data;
a domain entity identification module, used for dividing the labeled corpus to obtain a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
an evaluation module, used for calculating the weighted harmonic mean of the accuracy and the recall rate from the recognized domain entities, evaluating the entity recognition algorithm model with it, and outputting the recognized domain entities when the evaluation score meets the service requirements; when the evaluation score does not meet the service requirements, the electric power domain entity recognition model is corrected and recognition is performed again.
The invention also provides electric power-oriented domain entity identification equipment, comprising a memory, used for storing a computer program, and a processor, used for implementing the steps of the electric power domain entity identification method when executing the computer program.
The present invention also provides a computer storage medium having a computer program stored thereon, which, when executed by a processor, performs the steps of a method of domain entity identification as described.
Compared with the prior art, the entity identification method for the electric power field at least has the following advantages:
Corpus labeling is performed on the power grid data and the electric power domain entity recognition model is then used to recognize entities. This helps collect information accurately from the large volume of power dispatching information, realizes the extraction of electric power domain entities, and provides a solid foundation for constructing an electric power knowledge graph. The electric power domain entity recognition model is evaluated through the accuracy and the recall rate, so the model can be corrected in time and overfitting of the model is addressed. The method builds the entity recognition model from a bidirectional long short-term memory network and a conditional random field: it draws on the sequence modeling capability of the bidirectional long short-term memory network, which captures long-distance context information and has the nonlinear fitting capability of a neural network, and uses the conditional random field to optimize the label sequence as a whole, thereby effectively addressing the vanishing-gradient and exploding-gradient problems of traditional recurrent neural networks.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it. In the drawings:
FIG. 1 is a flow chart of a method for identifying a domain entity for electric power according to the present invention;
FIG. 2 is a diagram of a unit structure of an LSTM model in the method for recognizing an entity in the power-oriented domain according to the present invention;
FIG. 3 is a block diagram of a BLSTM model in the method for recognizing an entity in the power-oriented domain according to the present invention;
FIG. 4 is a schematic diagram of a bidirectional long-and-short-term memory network and a conditional random field electric power domain entity recognition model structure of the electric power-oriented domain entity recognition method provided by the invention;
Detailed Description
The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings. It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict.
The following detailed description is exemplary in nature and is intended to provide further details of the invention. Unless otherwise defined, all technical terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the invention.
As shown in fig. 1, the present invention provides a method for recognizing an entity in a power-oriented domain, comprising the following steps:
S1, acquiring power grid data, performing data extraction on the power grid data to form a data set, and performing corpus labeling on the training data using BIE;
When the corpus is labeled, the letter B represents the first character of an electric power entity, the letter I represents an internal character of an electric power entity, the letter E represents the final character of an electric power entity, and the letter O represents other characters that do not belong to a named entity. For example, in the text "analyze the influence of the DC reactor on the stability of the power transmission system", the electric power entities are "DC reactor" and "power transmission system"; "DC reactor" is labeled "BIIIE" and "power transmission system" is labeled "BIIE". Labels are assigned per character of the source text, so the number of labels does not match the number of English words. Further examples are shown in the following table.
TABLE 1 Corpus annotation examples
Sentence containing electric power entity nouns | Corpus annotation result |
Analyzing the influence of a DC reactor on the stability of a power transmission system | O O B I I I E O B I I E O O O O O O |
Capacitor at converter station outlet | B I E O O O O B I E |
Direct current transmission technology based on voltage source type current converter | O O B I I I I I E O B I I I I E |
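The labeling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `bioe_tags` and the span representation are assumptions, and the handling of single-character entities (tagged B) is also an assumption, since the scheme described here does not define a tag for them. The spans used in the example are read off the first row of Table 1.

```python
def bioe_tags(length, entity_spans):
    """Return one tag per character: B (first), I (inner), E (last), O (outside).

    entity_spans holds (start, end) pairs with end exclusive. A single-character
    entity gets "B"; that choice is an assumption, not part of the source scheme.
    """
    tags = ["O"] * length
    for start, end in entity_spans:
        if not 0 <= start < end <= length:
            raise ValueError(f"span {(start, end)} is outside the sentence")
        tags[start] = "B"
        for i in range(start + 1, end - 1):
            tags[i] = "I"
        if end - start > 1:
            tags[end - 1] = "E"
    return tags


# First row of Table 1: 18 characters, entities at positions 2-6 and 8-11.
row1 = bioe_tags(18, [(2, 7), (8, 12)])
print(" ".join(row1))
# O O B I I I E O B I I E O O O O O O
```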
S2, dividing the labeled corpus to obtain a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse decoding the recognition results to obtain the recognized domain entities;
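The reverse decoding mentioned in S2 can be pictured as turning a predicted tag sequence back into entity spans. The sketch below is one possible reading, not the patent's implementation: the function name `decode_entities` is hypothetical, malformed runs (a B with no closing E) are simply dropped, and the characters in the example are placeholders rather than real text.

```python
def decode_entities(chars, tags):
    """Collect (start, end, text) for every B ... E run; end is exclusive."""
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open a new candidate entity
        elif tag == "E" and start is not None:
            entities.append((start, i + 1, "".join(chars[start:i + 1])))
            start = None
        elif tag == "O":
            start = None                   # drop an unfinished candidate
    return entities


chars = list("xxABCDEyFGHIzzzzzz")         # placeholder characters, not real text
tags = list("OOBIIIEOBIIEOOOOOO")          # tag sequence of row 1 in Table 1
print(decode_entities(chars, tags))
# [(2, 7, 'ABCDE'), (8, 12, 'FGHI')]
```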
The electric power domain entity recognition algorithm model is constructed using a bidirectional long short-term memory network and a conditional random field.
It should be noted that the conditional random field (CRF) is a discriminative model that combines characteristics of the maximum entropy (ME) model and the hidden Markov model (HMM), and performs well on natural language sequence labeling tasks such as part-of-speech tagging and entity recognition.
The CRF model is a statistical model based on a probabilistic graph. It can incorporate a variety of features using context information, performs global normalization, can obtain a globally optimal solution, and avoids the label bias problem.
Typically, an input sequence is represented by $z = \{z_1, z_2, \ldots, z_n\}$, where $z_i$ is the vector of the $i$-th input character, $y = \{y_1, y_2, \ldots, y_n\}$ is a label sequence for the input $z$, and $Y(z)$ is the set of possible label sequences for $z$. The sequence CRF probability model defines a family of conditional probabilities $p(y \mid z; W, b)$ over all possible label sequences $y$ given $z$, calculated by equation 1.
$p(y \mid z; W, b) = \dfrac{\prod_{i=1}^{n} \psi_i(y_{i-1}, y_i, z)}{\sum_{y' \in Y(z)} \prod_{i=1}^{n} \psi_i(y'_{i-1}, y'_i, z)}$ (equation 1)
where $\psi_i(y', y, z) = \exp(W_{y', y}^{T} z_i + b_{y', y})$ is a potential function, and $W_{y', y}^{T}$ and $b_{y', y}$ are the weight vector and bias for the label pair $(y', y)$, respectively.
For CRF model training, maximum likelihood estimation is used. For a training set $\{(z_i, y_i)\}$, the log-likelihood is given by equation 2.
$L(W, b) = \sum_i \log p(y_i \mid z_i; W, b)$ (equation 2)
Maximum likelihood training selects the parameters that maximize $L(W, b)$. Decoding searches for the label sequence $y^*$ with the highest conditional probability, as shown in equation 3.
$y^* = \arg\max_{y \in Y(z)} p(y \mid z; W, b)$ (equation 3)
For a sequence CRF model, only the interaction between two consecutive labels is considered, and the Viterbi algorithm is adopted in training and decoding to obtain good results.
The long short-term memory (LSTM) network is a special recurrent neural network (RNN) that largely overcomes the vanishing-gradient problem of RNNs by introducing memory cells and a gating mechanism. The LSTM unit structure is shown in FIG. 2, where x denotes the input of the network at different time steps, y denotes the output of the network, h denotes the hidden layer, u denotes the weights from the input layer to the hidden layer, w denotes the weights from the hidden layer of the previous node to the hidden layer of the current node, and v denotes the weights from the hidden layer to the output layer.
The structure of the bidirectional long short-term memory network (BLSTM) is shown in FIG. 3. When processing sequence data, BLSTM adds a backward computation pass compared with an ordinary LSTM. This pass makes use of the information that follows each position in the sequence, and the values computed in the forward and backward directions are both passed to the output layer. All the information in a sequence is thus captured from both directions, and the network can be applied to many types of natural language processing tasks.
BLSTM has strong sequence modeling capability, can capture long-distance context information, and has the nonlinear fitting capability of a neural network. CRF computes a joint probability and optimizes the entire sequence, rather than concatenating the optimum of each time step, and in this respect is superior to LSTM. Both models have their own advantages, and the BiLSTM-CRF model is adopted for electric power entity recognition to combine them. A CRF linear layer is added after the output layer of the BLSTM network, which effectively addresses the vanishing-gradient and exploding-gradient problems of traditional recurrent neural networks.
As shown in FIG. 4, the electric power domain entity recognition model constructed from the bidirectional long short-term memory network and the conditional random field comprises:
The first layer, used for mapping each word of the sentence $(x_1, x_2, \ldots, x_n)$ from a one-hot vector to a low-dimensional dense word vector;
The second layer, the bidirectional LSTM layer, used for automatically extracting sentence features from the word vectors of the first layer;
Specifically, the word vectors serve as the input of the bidirectional LSTM, which automatically extracts sentence features. The hidden state sequence $(\overrightarrow{h_1}, \overrightarrow{h_2}, \ldots, \overrightarrow{h_n})$ output by the forward LSTM and the hidden state sequence $(\overleftarrow{h_1}, \overleftarrow{h_2}, \ldots, \overleftarrow{h_n})$ output by the backward LSTM are concatenated position by position, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, to obtain the complete hidden state sequence $(h_1, h_2, \ldots, h_n)$ as the output of the LSTM layer. Each $h_t$ is $m$-dimensional, where $m$ is the number of cells of the hidden layer.
The hidden state sequence is then fed into a linear layer, which maps each hidden state from $m$ dimensions to $k$ dimensions ($k$ is the number of labels in the label set). This gives the automatically extracted sentence features, which form the feature matrix $P$.
$P = (p_1, p_2, \ldots, p_n) \in \mathbb{R}^{n \times k}$ (formula 7)
Each component $p_{ij}$ of $p_i \in \mathbb{R}^{k}$ can be regarded as the score for classifying the word $x_i$ into the $j$-th label. However, labeling each position in this way cannot make use of the labels already assigned, so a CRF layer is attached next to perform the labeling.
The third layer, the CRF layer, used for performing sentence-level sequence labeling with the extracted sentence features and classifying the resulting sentence labels.
The CRF layer performs sentence-level sequence labeling. Its parameter is a $(k+2) \times (k+2)$ matrix $A$, where $A_{ij}$ is the transition score from the $i$-th label to the $j$-th label, so that the labels already assigned can be used when labeling a position. For a sentence $x$, the score of the label sequence $y = (y_1, y_2, \ldots, y_n)$ is calculated as shown in formula 8.
The score of the entire sequence is thus the sum of the scores of all positions, and the score of each position comes from two parts: one part is determined by $p_i$, output by the LSTM, and the other by the transition matrix $A$ of the CRF.
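Formula 8 itself is not reproduced in the text. A form consistent with the description around it, assuming the common BiLSTM-CRF formulation in which the two extra rows and columns of $A$ hold a start state $y_0$ and an end state $y_{n+1}$, would be:

```latex
\mathrm{score}(x, y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n+1} A_{y_{i-1}, y_i}
```

Here $P_{i, y_i}$ is the entry of the feature matrix $P$ for position $i$ and label $y_i$, and the second sum runs over the transitions between consecutive labels. The start and end states are an assumption made to account for the $(k+2) \times (k+2)$ size of $A$.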
Since each row of the feature matrix is $k$-dimensional ($k$ is the number of labels in the label set), applying Softmax to the matrix $P$ would be equivalent to classifying each position of the sentence independently into $k$ classes.
The normalized probability is therefore obtained using Softmax, as shown in formula 9.
After Softmax normalization, the result of classifying each position according to the $k$ labels is obtained.
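Formula 9 is likewise not reproduced in the text. Taking the exponential of both sides of equation 10 gives the normalization it would have to correspond to, with the sum running over all candidate label sequences $y'$ for the sentence $x$:

```latex
P(y \mid x) = \frac{\exp(\mathrm{score}(x, y))}{\sum_{y'} \exp(\mathrm{score}(x, y'))}
```

Under this reading, the Softmax is taken over whole label sequences rather than over the $k$ labels of a single position, which is what lets the transition scores in $A$ influence the result.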
The labeled corpus is divided into a training set and a test set, and the constructed entity recognition algorithm model is trained using the training set. The labeled data serve as the corpus for supervised training of entity recognition: 80% is used as training data and the remaining 20% as test data. To speed up model learning, the neural network model uses the Adam optimizer. The parameters of the model are shown in the table below.
TABLE 3 model training parameter settings
Parameter(s) | Parameter value |
batch_size | 64 |
epoch | 10 |
hidden_dim | 300 |
optimizer | Adam |
CRF | True |
learning rate | 0.001 |
gradient clipping | 5.0 |
dropout keep_prob | 0.5 |
update embedding | True |
pretrain embedding | random |
embedding_dim | 300 |
shuffle | True |
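The settings in the table and the 80/20 division can be written down as a small configuration sketch. The dictionary keys follow the table, but their exact spelling, the helper names `split_corpus` and `batches`, and the fixed random seed are assumptions for illustration; the 1000 integer samples stand in for labeled sentences.

```python
import random

# Values copied from the parameter table above; the key names are illustrative.
PARAMS = {
    "batch_size": 64,
    "epoch": 10,
    "hidden_dim": 300,
    "optimizer": "Adam",
    "CRF": True,
    "learning_rate": 0.001,
    "gradient_clipping": 5.0,
    "dropout_keep_prob": 0.5,
    "update_embedding": True,
    "pretrain_embedding": "random",
    "embedding_dim": 300,
    "shuffle": True,
}


def split_corpus(samples, train_ratio=0.8, seed=0):
    """Shuffle a copy of the labeled samples and split it into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def batches(samples, batch_size):
    """Yield consecutive mini-batches; the last one may be smaller."""
    for i in range(0, len(samples), batch_size):
        yield samples[i:i + batch_size]


train_set, test_set = split_corpus(range(1000))
n_batches = sum(1 for _ in batches(train_set, PARAMS["batch_size"]))
print(len(train_set), len(test_set), n_batches)
# 800 200 13
```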
When training the model, the log-likelihood function is maximized. Equation 10 gives the log-likelihood for a training sample $(x, y^{x})$:
$\log P(y^{x} \mid x) = \mathrm{score}(x, y^{x}) - \log\left(\sum_{y'} \exp(\mathrm{score}(x, y'))\right)$ (equation 10)
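Equation 10 can be evaluated without enumerating every label sequence by using a forward recursion for the log-sum term. The following is a minimal pure-Python sketch under stated assumptions: `P` is the n-by-k feature matrix, `A` is the (k+2)-by-(k+2) transition matrix whose last two indices are taken to be a start and an end state, and all function names and numeric values are illustrative.

```python
import math
from itertools import product


def sequence_score(P, A, y):
    """score(x, y): emission scores plus transition scores along the path y."""
    k = len(P[0])
    start, end = k, k + 1                  # assumed extra states in A
    path = [start] + list(y) + [end]
    emit = sum(P[i][tag] for i, tag in enumerate(y))
    trans = sum(A[a][b] for a, b in zip(path, path[1:]))
    return emit + trans


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_partition(P, A):
    """log of the sum over all y' of exp(score(x, y')), by forward recursion."""
    k = len(P[0])
    start, end = k, k + 1
    alpha = [A[start][j] + P[0][j] for j in range(k)]
    for row in P[1:]:
        alpha = [logsumexp([alpha[i] + A[i][j] for i in range(k)]) + row[j]
                 for j in range(k)]
    return logsumexp([alpha[j] + A[j][end] for j in range(k)])


def log_likelihood(P, A, y):
    """Right-hand side of equation 10 for one training sample."""
    return sequence_score(P, A, y) - log_partition(P, A)


# Toy example: n = 3 positions, k = 2 labels (values are made up).
P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
A = [[0.5, -0.3, 0.0, 0.1],
     [-0.2, 0.4, 0.0, 0.0],
     [0.2, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
total = sum(math.exp(log_likelihood(P, A, y))
            for y in product(range(2), repeat=3))
print(round(total, 6))                     # probabilities of all paths sum to 1.0
```

The final check enumerates all eight label sequences of the toy example and confirms that the probabilities implied by equation 10 sum to one.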
The labels are classified using the above formula. During prediction, the Viterbi algorithm is used to solve for the optimal path; decoding finds the label sequence with the highest probability, and this sequence gives the positions of the recognized entities.
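The Viterbi step can be sketched as a dynamic program over positions and labels. As before, this is an illustration rather than the patented implementation: `P` is the feature matrix, `A` the transition matrix with an assumed start and end state in its last two indices, and the toy values are made up. The brute-force search at the end is only there to check the result on a tiny example.

```python
from itertools import product


def path_score(P, A, y):
    """Emission plus transition score of the label path y."""
    k = len(P[0])
    path = [k] + list(y) + [k + 1]         # assumed start and end states
    return (sum(P[i][tag] for i, tag in enumerate(y))
            + sum(A[a][b] for a, b in zip(path, path[1:])))


def viterbi(P, A):
    """Return the highest-scoring label path and its score."""
    n, k = len(P), len(P[0])
    start, end = k, k + 1
    best = [A[start][j] + P[0][j] for j in range(k)]
    back = []                              # back[t-1][j]: best previous label
    for t in range(1, n):
        prev = best
        pointers, best = [], []
        for j in range(k):
            i_best = max(range(k), key=lambda i: prev[i] + A[i][j])
            pointers.append(i_best)
            best.append(prev[i_best] + A[i_best][j] + P[t][j])
        back.append(pointers)
    last = max(range(k), key=lambda j: best[j] + A[j][end])
    score = best[last] + A[last][end]
    path = [last]
    for pointers in reversed(back):        # follow back-pointers to the start
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score


P = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.4]]
A = [[0.5, -0.3, 0.0, 0.1],
     [-0.2, 0.4, 0.0, 0.0],
     [0.2, 0.1, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0]]
path, score = viterbi(P, A)
brute = max(product(range(2), repeat=3), key=lambda y: path_score(P, A, y))
print(path, list(brute))                   # [0, 0, 0] [0, 0, 0]
```

The returned label indices would then be mapped back to B, I, E and O tags and decoded into entity positions.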
In one embodiment of the invention, consider the given text: "Two sets of safety automatic devices are installed in the Hexi and hollow substations. After a device detects that the hollow double circuit line is disconnected, the river east side switch is tripped." After electric power domain entity recognition, the domain entities "Hexi transformer substation", "hollow transformer substation", "safety automatic device", "hollow double circuit line" and "river east side switch" are obtained;
and S3, calculating a weighted harmonic average value of the accuracy and the recall rate, evaluating an entity recognition algorithm model, and using the electric power entity recognition after the evaluation meets the service requirements. The evaluation indexes adopted by the power entity identification mainly include accuracy and recall rate, and the definitions of the evaluation indexes are respectively shown as a formula (formula 11) and a formula (formula 12).
Both values lie between 0 and 1; the closer a value is to 1, the higher the accuracy or recall rate. Accuracy and recall sometimes conflict, so their weighted harmonic mean, the F value, must also be considered; it is defined in equation (13).
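A minimal sketch of the evaluation follows. It assumes the standard entity-level definitions (accuracy as correct entities over recognized entities, recall as correct entities over labeled entities) and the common weighted harmonic mean F = (1 + β²)PR / (β²P + R), which reduces to 2PR / (P + R) for β = 1. The exact form and weighting used in the patent are not given in this text, so the formula, the function name, and the sample entities are all assumptions.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Entity-level accuracy (precision), recall, and weighted harmonic mean."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta  # beta weights recall relative to precision (assumed form)
    f_value = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_value


# Hypothetical entities as (text, start, end) tuples.
gold_entities = [("e1", 0, 2), ("e2", 4, 6), ("e3", 8, 9), ("e4", 11, 13), ("e5", 15, 18)]
predicted_entities = [("e1", 0, 2), ("e2", 4, 6), ("e3", 8, 9), ("zz", 20, 22)]
p, r, f = precision_recall_f(predicted_entities, gold_entities)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case three of four predictions are correct and three of five labeled entities are found, so the F value sits between the two rates and closer to the lower one.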
The invention also provides an electric-power-oriented domain entity identification system, which comprises:
a data acquisition module, used for acquiring power grid data, performing data extraction on the power grid data to form a data set, and performing corpus labeling on the training data using BIE, as described for the method above;
a domain entity identification module, used for dividing the labeled corpus into a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse-decoding the recognition result to obtain the recognized domain entities;
an evaluation module, used for calculating the weighted harmonic mean of the accuracy and the recall rate according to the recognized domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the recognized domain entities when the obtained evaluation score meets the service requirements;
when the obtained evaluation score does not meet the service requirements, the electric power domain entity recognition model is corrected and recognition is performed again.
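The interaction between the identification module and the evaluation module can be sketched as a simple loop. Everything named here is hypothetical: the callables `recognize`, `evaluate` and `correct_model` are stand-ins for the modules, the numeric threshold stands in for the "service requirements" (which the source does not quantify), and `max_rounds` is an added safeguard.

```python
def identify_until_acceptable(recognize, evaluate, correct_model, model, test_set,
                              threshold, max_rounds=5):
    """Recognize, evaluate, and correct the model until the score meets the threshold."""
    for round_no in range(1, max_rounds + 1):
        entities = recognize(model, test_set)
        f_value = evaluate(entities, test_set)
        if f_value >= threshold:
            return entities, f_value, round_no  # output the recognized entities
        model = correct_model(model)  # otherwise correct the model and repeat
    raise RuntimeError("evaluation score never met the threshold")


# Stand-in stubs: the "model" is just the number of entities it manages to find.
def recognize(model, data):
    return data[:model]


def evaluate(entities, data):
    return len(entities) / len(data)


def correct_model(model):
    return model + 1


gold = ["a", "b", "c", "d"]
result = identify_until_acceptable(recognize, evaluate, correct_model,
                                   model=2, test_set=gold, threshold=0.75)
print(result)  # → (['a', 'b', 'c'], 0.75, 2)
```

Passing the modules in as callables keeps the control flow separate from any particular model or metric.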
The present invention also provides a computer storage medium having a computer program stored thereon which, when executed by a processor, performs the steps of the domain entity identification method described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting the same, and although the present invention is described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (10)
1. An electric-power-oriented domain entity identification method, characterized in that the method comprises the following steps:
S1, performing data extraction on the acquired power grid data to form a data set, and performing corpus labeling on training data;
S2, dividing the labeled corpus into a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse-decoding the recognition results to obtain the recognized domain entities;
S3, calculating a weighted harmonic mean of the accuracy and the recall rate according to the recognized domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the recognized domain entities when the obtained evaluation score meets the service requirements;
and when the obtained evaluation score does not meet the service requirements, repeating steps S2-S3 after the electric power domain entity recognition model is corrected.
2. The electric-power-oriented domain entity recognition method of claim 1, wherein: in S1, BIE is used to perform corpus labeling on the training data, where letter B represents the first character of the power entity, letter I represents the internal character of the power entity, letter E represents the last character of the power entity, and letter O represents other characters not belonging to the named entity.
3. The electric-power-oriented domain entity recognition method of claim 1, wherein: the electric power domain entity recognition model is constructed by combining a bidirectional long short-term memory network and a conditional random field; in S2, the labeled corpus is divided into a test set and a training set, and the constructed domain entity recognition algorithm model is trained using the training set.
4. The power-oriented domain entity recognition method of claim 3, wherein the power domain entity recognition algorithm model comprises:
the first layer, used for mapping each word in the sentence from a one-hot vector into a low-dimensional dense word vector;
the second layer, the bidirectional LSTM layer, used for automatically extracting sentence features from the word vectors of the first layer;
the third layer, the CRF layer, used for performing sentence-level sequence labeling with the extracted sentence features, and for classifying the sentence labels formed by the labeling.
6. The entity identification method for the electric power field according to claim 5, wherein: when the electric power domain entity recognition algorithm model is trained, the log-likelihood of a training sample is obtained by maximizing the log-likelihood function;
$$\log P(y_x \mid x) = \mathrm{score}(x, y_x) - \log\Big(\sum_{y'} \exp\big(\mathrm{score}(x, y')\big)\Big)$$
wherein score(x, y_x) denotes the score of the label sequence y_x for sentence x, and (x, y_x) is a training sample.
7. The entity identification method for the electric power field according to claim 1, wherein: in step S3, when solving the log-likelihood value, the electric power domain entity recognition algorithm model uses the Viterbi algorithm to solve the optimal path in the prediction process, and the solution gives the positions corresponding to the recognized entities.
8. The entity identification method for the electric power field according to claim 1, wherein: in step S3,
the accuracy is calculated by the following formula:
the recall rate is calculated by the following formula:
when the domain entity recognition algorithm model is evaluated, the weighted harmonic mean of the accuracy and the recall rate, the F value, is calculated by the following formula:
9. An electric-power-oriented domain entity recognition system, characterized by comprising:
a data acquisition module, used for performing data extraction on the acquired power grid data to form a data set and performing corpus labeling on the training data;
a domain entity identification module, used for dividing the labeled corpus into a test set, inputting the test set into the constructed electric power domain entity recognition model for recognition, and reverse-decoding the recognition result to obtain the recognized domain entities;
an evaluation module, used for calculating the weighted harmonic mean of the accuracy and the recall rate according to the recognized domain entities, evaluating the entity recognition algorithm model with the weighted harmonic mean, and outputting the recognized domain entities when the obtained evaluation score meets the service requirements; and when the obtained evaluation score does not meet the service requirements, the electric power domain entity recognition model is corrected and recognition is performed again.
10. A computer storage medium, characterized in that the computer storage medium has stored thereon a computer program which, when being executed by a processor, realizes the steps of a power domain entity identification method according to any one of claims 1 to 8.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2020106063404 | 2020-06-29 | ||
CN202010606340 | 2020-06-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111783464A (en) | 2020-10-16 |
Family
ID=72757799
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010625052.3A Pending CN111783464A (en) | 2020-06-29 | 2020-07-01 | Electric power-oriented domain entity identification method, system and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783464A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112749283A (en) * | 2020-12-31 | 2021-05-04 | 江苏网进科技股份有限公司 | Entity relationship joint extraction method for legal field |
CN113761891A (en) * | 2021-08-31 | 2021-12-07 | 国网冀北电力有限公司 | Power grid text data entity identification method, system, equipment and medium |
CN115396143A (en) * | 2022-07-21 | 2022-11-25 | 沈阳化工大学 | BILSTM-CRF-based industrial intrusion detection method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106776866A (en) * | 2016-11-29 | 2017-05-31 | 首都师范大学 | A kind of method that meeting original text on University Websites carries out Knowledge Extraction |
CN109002436A (en) * | 2018-07-12 | 2018-12-14 | 上海金仕达卫宁软件科技有限公司 | Medical text terms automatic identifying method and system based on shot and long term memory network |
CN110232192A (en) * | 2019-06-19 | 2019-09-13 | 中国电力科学研究院有限公司 | Electric power term names entity recognition method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||