CN112560486A - Power entity identification method based on multilayer neural network, storage medium and equipment - Google Patents

Power entity identification method based on multilayer neural network, storage medium and equipment

Info

Publication number
CN112560486A
CN112560486A
Authority
CN
China
Prior art keywords
electric power
entity
corpus
power entity
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011337566.5A
Other languages
Chinese (zh)
Inventor
刘子全
李睿凡
王泽元
胡成博
熊永平
朱雪琼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
Beijing University of Posts and Telecommunications
State Grid Jiangsu Electric Power Co Ltd
Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, Beijing University of Posts and Telecommunications, State Grid Jiangsu Electric Power Co Ltd, Electric Power Research Institute of State Grid Jiangsu Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202011337566.5A priority Critical patent/CN112560486A/en
Publication of CN112560486A publication Critical patent/CN112560486A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a power entity identification method based on a multilayer neural network, together with a storage medium and a device. A BERT language model is pre-trained on a language model training corpus; power corpus data are labeled with power entity labels to construct a power entity recognition corpus; a Huffman code is constructed for each power entity label according to the number of occurrences of that label in the power entity recognition corpus; a classification layer is added after the pre-trained BERT language model to form a BERT power entity recognition model, which is then further trained on the power entity recognition corpus to obtain the trained BERT power entity recognition model. The method improves the accuracy of Chinese named entity recognition in the power field.

Description

Power entity identification method based on multilayer neural network, storage medium and equipment
Technical Field
The invention relates to the technical field of electric power entity identification, in particular to an electric power entity identification method, a storage medium and equipment based on a multilayer neural network.
Background
Named Entity Recognition (NER), also referred to as entity identification, entity chunking and entity extraction, is a sub-task of information extraction that aims to locate named entities in input text, such as person names, place names, organization names, or entities defined according to specific needs, and to classify them into predefined categories. Traditional named entity recognition covers recognition tasks with 3 major classes (entity, time, and number) and 7 minor classes (person name, organization name, place name, date, time, currency, and percentage). Conventional named entity recognition methods can be divided into dictionary-based methods, rule-based methods, and traditional machine-learning-based methods.
Early studies were based on rule methods, for which the labor cost of writing and maintaining the rules is high. Among traditional machine learning methods, the conditional random field (CRF) model performs feature learning by building a log-likelihood model, but its training cost is high and training is slow. Deep learning models can learn features automatically: the long short-term memory network (LSTM) learns long-distance dependencies through its gate units, and the attention mechanism (Attention) can focus on the parts of the input that are most critical to the NER task.
Most existing NER methods are data-driven: the larger the amount of data, the better the model learns. In some specific fields, however, it is difficult to build a sufficiently large labeled corpus, and model performance drops sharply. At present, labeled corpora are insufficient for building a named entity recognition tool in the power field. In addition, named entity recognition tasks frequently suffer from label imbalance, i.e. the frequencies of different entities differ greatly, so a model trained on such data tends to predict the high-frequency labels, and the shortage of corpora makes the imbalance even harder to handle. Moreover, manual labeling requires professional knowledge of the power field; ordinary annotators can hardly identify power-field entities directly and accurately, which makes labeling costly and slow.
Disclosure of Invention
In order to solve the defects in the prior art, the invention provides a power entity identification method, a storage medium and equipment based on a multilayer neural network, and solves the problems of unbalanced power entity identification labels, inaccurate identification and slow manual labeling.
In order to achieve the above purpose, the invention adopts the following technical scheme: a power entity identification method based on a multilayer neural network comprises the following steps: inputting the electric power corpus to be identified into a pre-constructed BERT electric power entity identification model to obtain a Huffman code of an electric power entity label, and mapping the Huffman code to obtain an entity label so as to obtain an identified entity.
Further, the construction step of the BERT power entity recognition model comprises the following steps:
extracting a mass text corpus, and performing data preprocessing on the mass text corpus to obtain a language model training corpus;
pre-training the BERT language model through a language model training corpus;
labeling a power entity label on the power corpus data, and constructing a power entity identification corpus;
constructing a Huffman code for each electric power entity label according to the number of occurrences of that label in the electric power entity identification corpus;
and adding a classification layer after the pre-trained BERT language model to form a BERT power entity recognition model, and further training the BERT power entity recognition model on the power entity recognition corpus to obtain the trained BERT power entity recognition model.
Further, the process of preprocessing the data of the massive corpus includes:
dividing sentences of the text and constructing sentence pairs, wherein the sentence pairs are connected by using set connecting labels, the head of each sentence is provided with a set head label, and the tail of each sentence is provided with a set tail label; wherein, the sentence pair formed by the sentences connected with the original text is a positive sample, and the unconnected sentences are negative samples; constructing a corpus of a relation prediction task of upper and lower sentences;
in each sentence, randomly masking out a portion of the words for prediction; for the masked characters, a part of the masked characters are replaced by set character string labels, a part of the masked characters are replaced by random characters, and the rest of the masked characters are kept unchanged to form a corpus used for a character prediction task;
generating word labels according to the real words covering the positions, and generating upper and lower sentence relation labels according to the relation of sentence pairs, thereby obtaining the language model training corpus.
Further, the pre-training of the BERT language model by the language model training corpus includes the steps of:
the input of the BERT language model is a preprocessed text, and the output is a word label and a relation label of upper and lower sentences;
calculating the output of the BERT language model and the loss value of the real label, adding the loss value of the word label and the loss value of the upper sentence relation and the lower sentence relation to obtain a final loss value, training the BERT language model by adopting an AdamW optimizer according to the final loss value, stopping training when the loss value of the model on the verification set does not decrease any more, and storing model parameters to obtain the BERT language model.
Further, the labeling of the electric power entity tag to the electric power corpus data and the construction of the electric power entity identification corpus include: manually marking partial electric power corpus data to obtain a knowledge base of an electric power entity; and performing non-artificial power entity label marking on the rest power linguistic data by using the knowledge base to obtain power entity identification linguistic data.
Further, the label labeling of the non-artificial power entity includes:
constructing a power entity recognition corpus in BMEO labeling form: if a character unit is the beginning of an entity word, it is labeled B-entity type; if a character unit is the end of an entity word, it is labeled E-entity type; if a character unit is a non-starting, non-ending character of an entity word, it is labeled M-entity type; if a character does not belong to an entity word, it is labeled O.
Further, the classification layer comprises a full connection layer and a Sigmoid activation function which are connected in series, the input of the classification layer is the output of the BERT language model, and the output of the classification layer is the Huffman coding of the predicted electric power entity label.
Further, training the BERT power entity recognition model comprises:
inputting the electric power entity identification corpus into a BERT electric power entity identification model, outputting a Huffman code of a predicted electric power entity label, and mapping the Huffman code to obtain a corresponding electric power entity label;
calculating the difference between a real label on the electric power entity recognition corpus and an output label of the BERT electric power entity recognition model by adopting cross entropy loss, training the BERT electric power entity recognition model by an AdamW optimizer, stopping training when the loss of the model on the electric power entity recognition corpus verification set does not decrease any more, and storing model parameters to obtain the trained BERT electric power entity recognition model.
A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform a method of multilayer neural network-based power entity identification in accordance with any of the preceding.
A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing a method for multi-layer neural network-based power entity identification according to any of the preceding.
The invention achieves the following beneficial effects:
1. the Huffman coding of the entity labels, through the Huffman tree structure, effectively alleviates the entity label imbalance problem in the power field and improves the recognition accuracy of Chinese named entities in the power field;
2. the pseudo-labeling data annotation method effectively reduces the labor cost of annotating text for entity recognition;
3. according to the invention, the BERT pre-trained model enhances the semantic representation of characters, fine-tuning reduces the number of training parameters and saves training time, and the model performs well even with a small amount of data.
Drawings
FIG. 1 is a flow chart of data annotation in an embodiment of the present invention;
FIG. 2 is a schematic diagram of tag distribution of power entities in accordance with an embodiment of the present invention;
FIG. 3 is a flowchart of the BERT power entity recognition model training process in accordance with an embodiment of the present invention;
FIG. 4 is a schematic diagram of a BERT power entity recognition model in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of Huffman coding according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
Example 1:
a power entity identification method based on a multilayer neural network comprises the following steps:
the following steps 1-5 are a BERT electric power entity recognition model training process, and a flow chart is shown in FIG. 3;
step 1: crawling a mass of texts from Wikipedia and Baidu encyclopedia to form a mass text corpus;
step 2: carrying out data preprocessing on a mass text corpus to form a language model training corpus;
the process of data preprocessing of a massive text corpus mainly comprises the following three parts:
(1) carrying out character level segmentation on the massive text corpora;
(2) segmenting the character-segmented text into sentences and constructing sentence pairs, splitting sentences that exceed the preset length max-num-tokens, connecting each sentence pair with the set connecting tag [SEP], adding the set head tag [CLS] at the head of the sentence pair, and adding the set tail tag [SEP] at the tail of the sentence pair; sentence pairs formed from sentences that are consecutive in the original text are positive samples, and non-consecutive sentences form negative samples; this constructs the corpus for the upper and lower sentence relation prediction task;
(3) in each sentence, randomly masking 15% of the characters for prediction; of the masked characters, 80% are replaced by the set character string [MASK], 10% are replaced by a random character, and 10% are kept unchanged, forming the corpus for the word prediction task (see the sketch after this list);
(4) generating word labels from the real characters at the positions covered in step (3), and generating upper and lower sentence relation labels from the sentence pair relations constructed in step (2);
(5) dividing the language model training corpus constructed above into a training set and a validation set at a ratio of 9:1.
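The masking and sentence-pair construction of steps (2)-(4) can be sketched as follows. This is a minimal illustration rather than the patent's code: the function names, the character-level token lists, and the vocabulary argument are assumptions; only the [CLS]/[SEP]/[MASK] labels and the 15%/80%/10%/10% proportions come from the description above.

    import random

    MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

    def build_sentence_pair(sent_a, sent_b, is_next):
        """Join two character-segmented sentences with the set head/connect/tail labels."""
        tokens = [CLS] + list(sent_a) + [SEP] + list(sent_b) + [SEP]
        return tokens, int(is_next)              # 1 = consecutive in the original text

    def build_mlm_example(tokens, vocab, mask_ratio=0.15):
        """Randomly cover characters for the word prediction task.

        Returns the corrupted token list and the real character at each covered
        position (the word label); uncovered positions carry None."""
        tokens = list(tokens)
        labels = [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if tok in (CLS, SEP) or random.random() >= mask_ratio:
                continue
            labels[i] = tok                      # real character at the covered position
            r = random.random()
            if r < 0.8:                          # 80%: replace with the [MASK] label
                tokens[i] = MASK
            elif r < 0.9:                        # 10%: replace with a random character
                tokens[i] = random.choice(vocab)
            # remaining 10%: keep the character unchanged
        return tokens, labels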
Step 3: pre-training the BERT language model with the language model training corpus;
the BERT (bidirectional transformer coding) language model is an existing model in which a multi-layer neural network exists; the method comprises the steps of firstly forming a vector sequence by embedding layers in a text sequence, and then coding a context by a layer transformer (transformer) coder, so that the conversion from the text sequence to a label sequence or from the text sequence to a single label can be realized.
The task of language model pre-training comprises two parts: a word prediction task and a context relation prediction task;
In the word prediction task, the preprocessed text is input and the characters predicted at the [MASK] label positions of the BERT language model are output; in the upper and lower sentence relation prediction task, the same preprocessed text is input and the upper and lower sentence relation is predicted from the output at the [CLS] position of the BERT language model. Because the two tasks share the same input, two kinds of prediction labels are obtained for the same input data: word labels and upper and lower sentence relation labels.
The model is optimized by computing the loss between the outputs at the corresponding positions of the BERT language model and the real labels; the loss consists of the word label loss and the upper and lower sentence relation loss, both computed with the cross entropy formula, and the two are added to give the final loss value. According to the final loss value, the BERT language model is trained on the training set with the AdamW (Adam with decoupled weight decay) optimizer. When the loss value of the model on the validation set no longer decreases, training stops and the model parameters are saved.
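As a hedged sketch of the loss computation just described (word label cross entropy plus upper and lower sentence relation cross entropy, optimized with AdamW), the fragment below uses PyTorch; the tensor shapes, the ignore_index convention, and the learning-rate and weight-decay values are illustrative assumptions only.

    import torch
    import torch.nn.functional as F

    def pretraining_loss(word_logits, word_labels, relation_logits, relation_labels):
        # cross entropy over the covered positions; positions labeled -100 are ignored
        word_loss = F.cross_entropy(
            word_logits.view(-1, word_logits.size(-1)),
            word_labels.view(-1),
            ignore_index=-100,
        )
        # cross entropy over the upper/lower sentence relation label (one per sentence pair)
        relation_loss = F.cross_entropy(relation_logits, relation_labels)
        return word_loss + relation_loss         # the two losses are simply added

    # optimizer = torch.optim.AdamW(bert_model.parameters(), lr=5e-5, weight_decay=0.01)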
Step 4: labeling the power corpus data with power entity labels (i.e. performing BMEO labeling), constructing the power entity recognition corpus, and preprocessing the power entity recognition corpus;
the electric power corpus to be marked adopts a text of a regulation and a regulation of power transformation maintenance of a national power grid company, and the text comprises: the general management regulations and anti-accident measures (hereinafter referred to as 'five-way measures') for transformer acceptance, operation and maintenance, detection, evaluation and overhaul of the national grid company are adopted. The text comprises a plurality of electric naming entity nouns, and plain text data is obtained through format conversion, washing and deduplication operations.
A quarter of the longer items (about 9,800 entries) are randomly extracted from the State Grid "five regulations and one measure" corpus data to form a labeling pool. A set of labeling standards for power named entities is formulated, yielding the entity knowledge base statistics shown in Table 1. The statistical distribution of the power named entity labels is shown in FIG. 2, from which it can be seen that the labels are imbalanced: the count of the entity label "O" (meaning the character does not belong to an entity word) is 322,331, far higher than that of the other label classes.
Table 1: statistical information of power entity library
In this step, the power corpus needs to be preprocessed before the BERT model can be fine-tuned. As shown in FIG. 1, the labeling process for the power corpus includes:
(1) manually labeling part of the power corpus data (for example, the extracted quarter of the power corpus data) to obtain a knowledge base of power entities, in which (entity type, entity name) pairs are stored;
(2) using the knowledge base to perform non-artificial pseudo labeling of the remaining power corpora, where non-artificial pseudo labeling means that a machine labels the unlabeled power corpus data with power entity labels according to the power entity knowledge base, yielding the power entity recognition corpus, i.e. the pseudo-labeled corpus;
the method comprises the following steps of preprocessing the electric power entity identification corpus, including:
and performing character-level segmentation on the electric power entity recognition corpus, adding a set head tag [ CLS ] at the head of the sentence, and adding a set tail tag [ SEP ] at the tail of the sentence. (the power entity recognition corpus includes several separate sentences).
And (3) identifying corpora of the preprocessed electric power entity according to the proportion of 8: 2, dividing a training set and a verification set;
and during labeling, constructing to obtain a pseudo labeling corpus in a BMEO labeling form. Specifically, if a character unit is the beginning of an entity word, it is labeled as a B-entity type; if a character unit is the end of an entity word, marking the character unit as an E-entity type; if one character unit is a non-starting non-ending character of an entity word, marking the character unit as an M-entity type; if a character does not belong to a physical word, it is labeled O. For example, the sentence "examine reactor of series insulator" is labeled "O, run B-1, run M-1, run E-1, run B-1, run M-1, examine O", and examine O ". By adopting the mode, the labor cost can be greatly reduced.
Step 5: constructing a Huffman code for each power entity label according to the number of occurrences of that label in the power entity recognition corpus;
For the above power named entity labels, a Huffman tree code is used to represent all entity label types. There are 28 entity labels: "B-0" to "B-8", "M-0" to "M-8", "E-0" to "E-8" and "O", where 0-8 are the 9 entity type numbers in Table 1. A Huffman tree is constructed according to the number of occurrences of each of the 28 entity labels in the power entity recognition corpus, and the 28 entity labels are stored in the leaf nodes of the Huffman tree. The Huffman code of each label is the path from the root node to its leaf node, where entering the left subtree is recorded as 0 and entering the right subtree as 1. For example, in FIG. 5 the Huffman code of leaf node X is [1,1,1]; because different leaf nodes lie at different depths, the code of a leaf node whose path is shorter than the longest path is padded with 0 up to the longest path length. If the longest path is 5, the Huffman code of the label at leaf node X becomes [1,1,1,0,0].
Huffman coding the entity labels alleviates the problem caused by class imbalance. When the model predicts a label, its output is the label's Huffman coding path, and the label is recovered from that path by the forward maximum matching method. Because of how a Huffman code is constructed, the two least frequent labels (or subtrees) are merged into a new subtree at each step, with the 0 and 1 branches at the subtree root pointing to the two merged children, so the distribution of 0s and 1s in the codes is generally balanced. Predicting the Huffman coding path therefore effectively mitigates the imbalance in entity label prediction.
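A hedged sketch of constructing the zero-padded label Huffman codes from label counts follows. Python's heapq is used for the standard Huffman construction; the tie-breaking counter and the function name are assumptions, while the left = 0 / right = 1 convention and the zero padding to the longest path come from the description above.

    import heapq
    from itertools import count

    def huffman_codes(label_counts):
        """Build zero-padded Huffman codes for entity labels from their corpus counts."""
        tick = count()                           # tie-breaker so heapq never compares dicts
        heap = [(n, next(tick), {label: []}) for label, n in label_counts.items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            n0, _, left = heapq.heappop(heap)    # the two least frequent labels/subtrees
            n1, _, right = heapq.heappop(heap)
            merged = {}
            for label, code in left.items():
                merged[label] = [0] + code       # left subtree: path bit 0
            for label, code in right.items():
                merged[label] = [1] + code       # right subtree: path bit 1
            heapq.heappush(heap, (n0 + n1, next(tick), merged))
        codes = heap[0][2]
        longest = max(len(c) for c in codes.values())
        return {label: code + [0] * (longest - len(code)) for label, code in codes.items()}

    # toy example (counts other than "O" are made up):
    # huffman_codes({"O": 322331, "B-0": 5120, "M-0": 4870, "E-0": 5120})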
Step 6: after the BERT language model is obtained by pre-training, a classification layer is added to form the BERT power entity recognition model, and the BERT power entity recognition model is further trained on the power entity recognition corpus to obtain the trained BERT power entity recognition model;
and after a BERT language model is obtained through pre-training, a classification layer is added to form a BERT power entity recognition model. As shown in FIG. 4, when an input text passes through a BERT power entity recognition model, the input text is first changed into text vector sequences E [1] to E [12] through an embedding layer, then the text vector sequences are passed through a multi-layer encoder to obtain outputs T [1] to T [12] of the BERT language model, and finally entity labels are obtained through classification. The classification layer comprises a full connection layer and a Sigmoid activation function which are sequentially connected, the input of the classification layer is the output of the BERT language model, the output of the classification layer is the Huffman coding of the predicted electric power entity label, and the corresponding electric power entity label is obtained through Huffman coding mapping. And calculating the difference between a real label on the electric power entity recognition corpus and an output label of the BERT electric power entity recognition model by adopting cross entropy loss, and training the BERT electric power entity recognition model by an AdamW optimizer. And when the loss of the model on the electric power entity recognition corpus verification set is not reduced any more, stopping the training of the model and storing the model parameters.
Step 7: inputting the corpus to be recognized into the BERT power entity recognition model to obtain the Huffman code of each power entity label, and mapping the Huffman codes to entity labels, thereby obtaining the entities of the 9 predefined categories in the corpus.
The corpus to be recognized is input into the BERT power entity recognition model sentence by sentence to obtain the Huffman code of each character's entity label in the sentence; the Huffman code of each character's entity label is then mapped back to the original entity label through the mapping relation between Huffman codes and entity labels. Whether a sequence of entity labels matches one of the set entity templates, such as "B-X", "B-X, E-X", "B-X, M-X, E-X" and "B-X, M-X, M-X, E-X", is then judged, where X denotes an entity type (0-8) and "M-X" may repeat any number of times to cover entities of different lengths; finally, the characters corresponding to a subsequence that matches a template are combined into an entity as the final predicted entity.
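The decoding just described can be sketched as follows: each predicted bit vector is mapped back to an entity label through the code table, and label subsequences matching the "B-X (M-X)* (E-X)?" templates are merged into predicted entities. Thresholding at 0.5 and nearest-code matching stand in here for the forward maximum matching mentioned above; they, together with the helper names, are assumptions of this sketch.

    def decode_labels(bit_probs, codes):
        """Map each character's predicted bit vector to the entity label whose
        zero-padded Huffman code is closest (an exact match when the model is confident)."""
        labels = []
        for probs in bit_probs:                  # one fixed-length probability vector per character
            bits = [1 if p >= 0.5 else 0 for p in probs]
            label = min(codes, key=lambda lb: sum(a != b for a, b in zip(bits, codes[lb])))
            labels.append(label)
        return labels

    def extract_entities(chars, labels):
        """Merge B-X (M-X)* (E-X)? label runs into (entity_type, entity_text) spans."""
        entities, i = [], 0
        while i < len(labels):
            if labels[i].startswith("B-"):
                ent_type, j = labels[i][2:], i + 1
                while j < len(labels) and labels[j] == f"M-{ent_type}":
                    j += 1
                if j < len(labels) and labels[j] == f"E-{ent_type}":
                    j += 1
                entities.append((ent_type, "".join(chars[i:j])))
                i = j
            else:
                i += 1
        return entities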
Example 2:
a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the aforementioned multi-layer neural network-based power entity identification methods.
A computing device, comprising:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the aforementioned multi-layer neural network-based power entity identification methods.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A power entity identification method based on a multilayer neural network is characterized in that:
inputting the electric power corpus to be identified into a pre-constructed BERT electric power entity identification model to obtain a Huffman code of an electric power entity label, and mapping the Huffman code to obtain an entity label so as to obtain an identified entity.
2. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 1, wherein: the construction step of the BERT electric power entity recognition model comprises the following steps:
extracting a mass text corpus, and performing data preprocessing on the mass text corpus to obtain a language model training corpus;
pre-training the BERT language model through a language model training corpus;
labeling a power entity label on the power corpus data, and constructing a power entity identification corpus;
constructing a Huffman code for each electric power entity label according to the number of occurrences of that label in the electric power entity identification corpus;
and adding a classification layer after the pre-trained BERT language model to form a BERT power entity recognition model, and further training the BERT power entity recognition model on the power entity recognition corpus to obtain the trained BERT power entity recognition model.
3. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 2, wherein: the data preprocessing process for the massive text corpus comprises the following steps:
dividing sentences of the text and constructing sentence pairs, wherein the sentence pairs are connected by using set connecting labels, the head of each sentence is provided with a set head label, and the tail of each sentence is provided with a set tail label; wherein, the sentence pair formed by the sentences connected with the original text is a positive sample, and the unconnected sentences are negative samples; constructing a corpus of a relation prediction task of upper and lower sentences;
in each sentence, randomly masking out a portion of the words for prediction; for the masked characters, a part of the masked characters are replaced by set character string labels, a part of the masked characters are replaced by random characters, and the rest of the masked characters are kept unchanged to form a corpus used for a character prediction task;
generating word labels according to the real words covering the positions, and generating upper and lower sentence relation labels according to the relation of sentence pairs, thereby obtaining the language model training corpus.
4. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 3, wherein: the pre-training of the BERT language model through the language model training corpus comprises the following steps:
the input of the BERT language model is a preprocessed text, and the output is a word label and a relation label of upper and lower sentences;
calculating the output of the BERT language model and the loss value of the real label, adding the loss value of the word label and the loss value of the upper sentence relation and the lower sentence relation to obtain a final loss value, training the BERT language model by adopting an AdamW optimizer according to the final loss value, stopping training when the loss value of the model on the verification set does not decrease any more, and storing model parameters to obtain the BERT language model.
5. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 2, wherein: the marking of the electric power entity label to the electric power corpus data and the construction of the electric power entity identification corpus comprise: manually marking partial electric power corpus data to obtain a knowledge base of an electric power entity; and performing non-artificial power entity label marking on the rest power linguistic data by using the knowledge base to obtain power entity identification linguistic data.
6. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 5, wherein: the non-artificial power entity label marking comprises the following steps:
constructing a power entity recognition corpus in BMEO labeling form: if a character unit is the beginning of an entity word, it is labeled B-entity type; if a character unit is the end of an entity word, it is labeled E-entity type; if a character unit is a non-starting, non-ending character of an entity word, it is labeled M-entity type; if a character does not belong to an entity word, it is labeled O.
7. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 2, wherein: the classification layer comprises a full connection layer and a Sigmoid activation function which are connected in series, the input of the classification layer is the output of the BERT language model, and the output of the classification layer is the Huffman coding of the predicted electric power entity label.
8. The method for recognizing the electric power entity based on the multilayer neural network as claimed in claim 7, wherein: training a BERT power entity recognition model, comprising:
inputting the electric power entity identification corpus into a BERT electric power entity identification model, outputting a Huffman code of a predicted electric power entity label, and mapping the Huffman code to obtain a corresponding electric power entity label;
calculating the difference between a real label on the electric power entity recognition corpus and an output label of the BERT electric power entity recognition model by adopting cross entropy loss, training the BERT electric power entity recognition model by an AdamW optimizer, stopping training when the loss of the model on the electric power entity recognition corpus verification set does not decrease any more, and storing model parameters to obtain the trained BERT electric power entity recognition model.
9. A computer readable storage medium storing one or more programs, characterized in that: the one or more programs include instructions that, when executed by a computing device, cause the computing device to perform the multi-layer neural network-based power entity identification method of any one of claims 1-8.
10. A computing device, characterized by comprising:
one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing the multilayer neural network-based power entity identification method of any one of claims 1-8.
CN202011337566.5A 2020-11-25 2020-11-25 Power entity identification method based on multilayer neural network, storage medium and equipment Pending CN112560486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011337566.5A CN112560486A (en) 2020-11-25 2020-11-25 Power entity identification method based on multilayer neural network, storage medium and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011337566.5A CN112560486A (en) 2020-11-25 2020-11-25 Power entity identification method based on multilayer neural network, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN112560486A true CN112560486A (en) 2021-03-26

Family

ID=75043584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011337566.5A Pending CN112560486A (en) 2020-11-25 2020-11-25 Power entity identification method based on multilayer neural network, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN112560486A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113378570A (en) * 2021-06-01 2021-09-10 车智互联(北京)科技有限公司 Entity recognition model generation method, computing device and readable storage medium
CN113378570B (en) * 2021-06-01 2023-12-12 车智互联(北京)科技有限公司 Entity identification model generation method, computing device and readable storage medium
CN113392642A (en) * 2021-06-04 2021-09-14 北京师范大学 System and method for automatically labeling child-bearing case based on meta-learning
CN113392642B (en) * 2021-06-04 2023-06-02 北京师范大学 Automatic labeling system and method for child care cases based on meta learning
CN113407720A (en) * 2021-06-25 2021-09-17 南开大学 Classification system expansion method based on pre-training text coding model
CN113961669A (en) * 2021-10-26 2022-01-21 杭州中软安人网络通信股份有限公司 Training method of pre-training language model, storage medium and server
CN114528394A (en) * 2022-04-22 2022-05-24 杭州费尔斯通科技有限公司 Text triple extraction method and device based on mask language model
CN114528394B (en) * 2022-04-22 2022-08-26 杭州费尔斯通科技有限公司 Text triple extraction method and device based on mask language model
CN115357719A (en) * 2022-10-20 2022-11-18 国网天津市电力公司培训中心 Power audit text classification method and device based on improved BERT model
CN116976351A (en) * 2023-09-22 2023-10-31 之江实验室 Language model construction method based on subject entity and subject entity recognition device
CN116976351B (en) * 2023-09-22 2024-01-23 之江实验室 Language model construction method based on subject entity and subject entity recognition device

Similar Documents

Publication Publication Date Title
CN112560486A (en) Power entity identification method based on multilayer neural network, storage medium and equipment
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN111738004A (en) Training method of named entity recognition model and named entity recognition method
CN110020438A (en) Enterprise or tissue Chinese entity disambiguation method and device based on recognition sequence
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN112818110B (en) Text filtering method, equipment and computer storage medium
CN115687626A (en) Legal document classification method based on prompt learning fusion key words
CN113360582B (en) Relation classification method and system based on BERT model fusion multi-entity information
CN112507337A (en) Implementation method of malicious JavaScript code detection model based on semantic analysis
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN111709225B (en) Event causal relationship discriminating method, device and computer readable storage medium
CN113138920A (en) Software defect report allocation method and device based on knowledge graph and semantic role labeling
CN112394973A (en) Multi-language code plagiarism detection method based on pseudo-twin network
CN115878778A (en) Natural language understanding method facing business field
CN107992468A (en) A kind of mixing language material name entity recognition method based on LSTM
CN113868422A (en) Multi-label inspection work order problem traceability identification method and device
CN112257425A (en) Power data analysis method and system based on data classification model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination