CN113822599A

CN113822599A - Power industry policy management method based on classification tree fusion technology

Info

Publication number: CN113822599A
Application number: CN202111256627.XA
Authority: CN
Inventors: 朱峰; 左强; 邹云峰; 祝宇楠; 范环宇; 蔡明明; 寇文心
Original assignee: State Grid Jiangsu Electric Power Co ltd Marketing Service Center; State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Jiangsu Electric Power Co ltd Marketing Service Center; State Grid Jiangsu Electric Power Co Ltd
Priority date: 2021-10-27
Filing date: 2021-10-27
Publication date: 2021-12-21

Abstract

The application discloses a power industry policy management method based on classification tree fusion technology, comprising the following steps: acquiring power industry policy text and preprocessing the data; encoding the preprocessed policy text information; setting attention weights for the different sentences in the policy text; classifying the policy text based on the encoding and the attention weights; extracting information from the different categories of policy text; and fusing and assembling the extracted information of the different categories to realize power industry policy management. By managing power industry policies with classification tree fusion technology, the method meets the needs of digital and intelligent transformation and of efficient, unified policy management in the power industry, enables classified management of policies, improves management efficiency, and supports quality and efficiency improvements in related power industry services.

Description

Power industry policy management method based on classification tree fusion technology

Technical Field

The invention belongs to the technical field of information perception and identification of the power industry, and relates to a power industry policy management method based on a classification tree fusion technology.

Background

The power industry is the basic industry, the pillar industry and the strategic industry of national economy, and the development of industries such as power informatization, smart grid and power internet of things is an important means for realizing energy production, consumption, technology and system revolution in China.

Power industry policy differs from general policy in that its function is more complex: it is an important component of the national economic system and an important instrument of national economic regulation. Power industry policies are numerous and varied. For example, electricity price policy is an important instrument of national economic regulation, and the prices of different categories of electricity are often adjusted as economic policy changes from period to period. Ensuring that power industry policies are executed fully and accurately, so that the lawful interests of the enterprise are not lost and customers' electricity use is handled accurately, is therefore a key task for power enterprises.

At present, natural language processing technology is gradually mature, but even under the background of vigorously advocating intellectualization and digital transformation in the power industry, the application of natural language processing technology in the power industry, especially in the field of policy management of the power industry, is still lacking.

Therefore, considering the digital and intelligent transformation requirements of the power industry and the requirement of unified management of the policy of the power industry comprehensively, a technical method for efficiently managing the policy of the power industry based on natural language processing is urgently needed, so as to support the management and implementation of the policy of the power industry.

Disclosure of Invention

In order to overcome the defects in the prior art, the power industry policy management method based on the classification tree fusion technology is provided.

In order to achieve the above purpose, the invention adopts the following technical scheme:

a power industry policy management method based on classification tree fusion technology comprises the following steps:

step 1: acquiring a policy text of the power industry and preprocessing data;

step 2: encoding the power industry policy text information after data preprocessing;

step 3: setting information attention weights of different sentences in a policy text of the power industry;

step 4: based on the encoding and attention weights of the power industry policy text information, classifying the power industry policy text by adopting a classification tree fusion technology;

step 5: extracting the triple information of the different types of power industry policy texts;

step 6: based on an entity alignment algorithm, fusing and assembling the extracted different types of information, thereby realizing the policy management of the power industry.

The invention further comprises the following preferred embodiments:

preferably, step 1 specifically comprises:

step 1.1: the method comprises the steps of obtaining a power industry policy text, using a jieba word segmentation tool to segment the power industry policy text, and deleting stop words in the power industry policy text through a stop word vocabulary;

step 1.2: after the preprocessing of step 1.1, the position of each word of a sentence in the vocabulary is looked up, and each word is mapped through the word embedding matrix to its word vector;

step 1.3: based on the word vectors, the convolutional neural network extracts information representations of statements in the power industry policy text.

Preferably, step 1.3 specifically comprises:

step 1.3.1: performing matrix splicing combination on word vectors of each word in the electric power industry policy text sentences to construct a sentence vector matrix of the electric power industry policy text sentences;

step 1.3.2: aiming at the sentence vector matrix, a plurality of convolution kernels with different sizes are arranged in the convolution layer to extract the common information representation among different words;

step 1.3.3: extracting fixed-length information representations from statements of different lengths by using K-Max pooling and Padding.

Preferably, step 2 is specifically:

sequentially inputting the information representation of each statement into a BiLSTM network or a GRU network, in the order in which the statements appear in the power industry policy text, and encoding the power industry policy text information.

Preferably, step 3 is specifically:

setting information attention weights of different sentences in the power industry policy text by using an Attention mechanism, and outputting a power industry policy text vector encoding to which the attention weight information has been added.

Preferably, step 4 is specifically:

inputting the power industry policy text vector encoding with the added attention weight information into a Softmax classifier to obtain a one-hot vector representation of the category to which the power industry policy text belongs, thereby classifying the power industry policy text.

Preferably, step 5 specifically includes:

step 5.1: extracting triple information of the power industry policy text based on the open domain triple extraction tool Open-IE: first extracting all possible Subjects and Predicates of the different types of power industry policies, then judging the association between the Subjects and Predicates, and finally extracting the Objects corresponding to the Subjects and Predicates;

step 5.2: extracting triple information of the power industry policy text based on the closed domain triple extraction tool Close-IE: first extracting the Subjects and Objects, and then classifying the relationship between each Subject and Object.

Preferably, step 5.1 specifically comprises:

step 5.1.1: an encoding Layer Encoder-Layer acquires context information of a statement;

step 5.1.2: the entity extraction layer EntityRelation-Layer extracts all possible Subjects and Predicates;

step 5.1.3: the Multihead-Layer finds all possibly related Subjects and Predicates;

step 5.1.4: the Object-Layer extracts the corresponding Object according to the specified Subject and Predicate;

step 5.1.5: Triple-Result extracts the final (Subject, Predicate, Object) set in the statement according to steps 5.1.1-5.1.4.

Preferably, in step 5.1.2, the start position and the end position of Subject and Predicate are extracted respectively in Span mode, and the formula is as follows:

$$P_i^{start_s} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_s} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

$$P_i^{start_p} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_p} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

wherein $P_i^{start_s}$ represents the probability that the i-th token in the statement is the start position of the Subject, $P_i^{end_s}$ represents the probability that the i-th token is the end position of the Subject, $P_i^{start_p}$ represents the probability that the i-th token is the start position of the Predicate, $P_i^{end_p}$ represents the probability that the i-th token is the end position of the Predicate, $h_i$ represents the encoding of the i-th token in the statement produced by BERT, $W_{(\cdot)}$ represents the model weights to be trained, and $b_{(\cdot)}$ is a bias;

step 5.1.3 the formula used is as follows:

$$P_{i,j} = \mathrm{sigmoid}(h_i, h_j)$$

wherein $h_i$ represents the encoding of the i-th feature in the statement, which is a Subject feature, $h_j$ represents the encoding of the j-th feature in the statement, which is a Predicate feature, and $P_{i,j}$ represents the probability that $(h_i, h_j)$ can form a relationship;

step 5.1.4 the formula used is as follows:

$$P_i^{start_o} = \mathrm{sigmoid}(W_{start_o}(h_i, V^s, V^p) + b_{start_o})$$

$$P_i^{end_o} = \mathrm{sigmoid}(W_{end_o}(h_i, V^s, V^p) + b_{end_o})$$

wherein $P_i^{start_o}$ represents the probability that the i-th token in the statement is the start position of the Object, $P_i^{end_o}$ represents the probability that the i-th token is the end position of the Object, $V^s$ represents the sum of the head and tail features of the Subject, and $V^p$ represents the sum of the head and tail features of the Predicate.

Preferably, step 5.2 specifically comprises:

step 5.2.1: a BERT coding Layer BERT-Layer acquires context information of a statement;

step 5.2.2: the entity extraction layer Entity-Layer extracts all possible Subjects and Objects;

step 5.2.3: the Multihead-Layer finds the possible relationships among all the different tokens in the statement;

step 5.2.4: Triple-Result extracts the final (Subject, Predicate, Object) set in the statement according to steps 5.2.1-5.2.3.

The beneficial effect that this application reached:

the invention is based on a classification tree fusion technology and provides a novel open domain triple information extraction method. It realizes the policy management of the power industry, meets the needs of digital and intelligent transformation and of efficient, unified management of power industry policies, enables classified management of those policies, improves their management efficiency, and supports quality and efficiency improvements in related power industry services.

Drawings

FIG. 1 is a flowchart of a power industry policy management method based on classification tree fusion technology according to the present invention;

FIG. 2 is a diagram of a power industry policy text classifier according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating extraction of Open-IE Open domain information in an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating extraction of Close-IE closed domain information in the embodiment of the present invention.

Detailed Description

The present application is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present application is not limited thereby.

As shown in fig. 1, the method for managing policies in the power industry based on the classification tree fusion technology of the present invention includes the following steps:

step 1: the method includes the steps of obtaining a policy text of the power industry and conducting data preprocessing, and specifically includes the following steps:

step 1.1: acquiring a policy text of the power industry and preprocessing data:

the method comprises the steps of obtaining a power industry policy text, using a jieba word segmentation tool to segment the power industry policy text, and deleting stop words in the power industry policy text through a stop word vocabulary;

For example, for the sentence "the thermal power electricity price in Jiangsu Province is expected to be adjusted this year", word segmentation and stop word removal yield the words "this year", "Jiangsu Province", "thermal power", "electricity price", "expected", "will" and "adjusted".
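The stop word filtering of step 1.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the token list stands in for the output of a word segmenter (the description names jieba, whose `lcut` function returns such a list), English glosses replace the original words, and the stop word set and the helper name `remove_stop_words` are hypothetical.

```python
# Hypothetical stop word list; a real system would load a curated vocabulary.
STOP_WORDS = {"of", "the", "is", "to", "be"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [tok for tok in tokens if tok not in stop_words]


# Stand-in for segmenter output (English glosses of the example sentence).
segmented = ["this year", "Jiangsu Province", "of", "thermal power",
             "electricity price", "expected", "will", "adjusted"]
kept = remove_stop_words(segmented)
print(kept)
```

Keeping the filter separate from the segmenter means the stop word vocabulary can be changed without touching the segmentation step.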

Step 1.2: word embedding:

as shown in the word embedding module in fig. 2, the sentences of a power industry policy document are input into the word embedding layer in sequence; the position of each word of a sentence in the vocabulary is looked up, and each word is mapped through the word embedding matrix to its word vector;

The vocabulary is the set of all possible words, in which the position of any word can be looked up; for example, the position of the word "this year" in the vocabulary is 156.

The word embedding matrix is a two-dimensional matrix of dimension [vocabulary size, word vector length]; the word vector of an input word is obtained by indexing the word embedding matrix with the position of the word in the vocabulary.

The word vector is a vector of fixed size, and each word in the vocabulary corresponds to its own word vector.
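The lookup described in step 1.2 can be illustrated with a short sketch. The helper names, the toy vector length of 8 and the random initialisation are assumptions made for illustration; a trained model would learn the matrix values.

```python
import random


def build_vocab(words):
    """Map each distinct word to its position in the vocabulary."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab


def build_embedding_matrix(vocab_size, dim, seed=0):
    """Random [vocab_size, dim] matrix standing in for learned embeddings."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(vocab_size)]


def embed(tokens, vocab, matrix):
    """Look up the word vector of every token by its vocabulary position."""
    return [matrix[vocab[tok]] for tok in tokens]


tokens = ["this year", "Jiangsu Province", "thermal power",
          "electricity price", "expected", "will", "adjusted"]
vocab = build_vocab(tokens)
matrix = build_embedding_matrix(len(vocab), dim=8)
sentence_matrix = embed(tokens, vocab, matrix)
# Rows = sentence length, columns = word vector length.
print(len(sentence_matrix), len(sentence_matrix[0]))
```

Stacking the looked-up rows in word order yields the [sentence length, word vector dimension] matrix that the convolutional step consumes.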

Step 1.3: extracting statement characterization information by the convolutional neural network:

as shown in the CNN module in fig. 2, extracting information representations of statements in a power industry policy text by using a convolutional neural network specifically includes:

Step 1.3.1: constructing the sentence vector matrix:

the sentence "the thermal power electricity price in Jiangsu Province is expected to be adjusted this year" is again used as an example.

Through step 1.2, word vectors of 7 words contained in the sentence can be obtained, and then the 7 word vectors are spliced and combined to obtain a sentence vector matrix of the sentence.

The sentence vector matrix is a two-dimensional matrix with the dimension [ sentence length, word vector dimension ].

Step 1.3.2: setting a plurality of convolution kernels with different sizes in the convolution layer to extract the common information representation among different words;

for the sentence vector matrix, convolution kernels of 5 sizes are provided, with dimensions [1, word vector dimension], [2, word vector dimension], [3, word vector dimension], [4, word vector dimension] and [5, word vector dimension], and 5 kernels of each size. Taking a kernel of dimension [3, word vector dimension] as an example, the kernel extracts an information representation spanning 3 adjacent words, so setting kernels of several different sizes mines more of the information shared between words.
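The multi-size convolution of step 1.3.2 can be sketched in plain Python as follows. The toy word vector dimension of 8, the random weights, the ReLU activation and the function name `conv_over_words` are assumptions; the description only fixes the five kernel heights and the count of five kernels per height.

```python
import random


def conv_over_words(sentence, kernel):
    """Slide a [k, dim] kernel down a [length, dim] sentence matrix (no padding).

    Returns one feature per window of k consecutive words.
    """
    k = len(kernel)
    features = []
    for start in range(len(sentence) - k + 1):
        total = 0.0
        for row in range(k):
            word_vec = sentence[start + row]
            total += sum(w * x for w, x in zip(kernel[row], word_vec))
        features.append(max(0.0, total))  # ReLU activation (an assumption)
    return features


rng = random.Random(0)
dim, length = 8, 7  # toy vector size; 7 words as in the example sentence
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(length)]

# Five kernel heights (1..5 words) with five kernels each.
feature_maps = {}
for height in range(1, 6):
    kernels = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(height)]
               for _ in range(5)]
    feature_maps[height] = [conv_over_words(sentence, kern) for kern in kernels]

for height, maps in feature_maps.items():
    # kernel height, number of kernels, number of windows
    print(height, len(maps), len(maps[0]))
```

Because a kernel of height k produces length - k + 1 features, the feature maps have different lengths, which is why a fixed-length pooling step follows.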

Step 1.3.3: extracting fixed-length information representations from statements of different lengths through a K-Max pooling layer and Padding.

For example, take two sentences whose sentence vector matrices are [7, word vector dimension] and [18, word vector dimension], where 7 and 18 are the sentence lengths after word segmentation and the word vector dimension is, for example, 768;

after K-Max pooling and Padding, the two sentence vector matrices are compressed into 2 sentence vectors of the same dimension, for example sentence vectors of dimension [1, 200].
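K-Max pooling with padding can be sketched as below. This is an illustrative reading of step 1.3.3: the function name `k_max_pool`, the right-padding with zeros and the value k = 3 are assumptions. The description does not say how the length 200 arises; one possibility, stated here only as an assumption, is concatenating k = 8 values from each of the 25 feature maps.

```python
def k_max_pool(features, k, pad_value=0.0):
    """Keep the k largest values of a feature map in their original order.

    Feature maps shorter than k are right-padded with pad_value first, so the
    output always has length k.
    """
    padded = list(features) + [pad_value] * max(0, k - len(features))
    # Indices of the k largest values, then restored to sequence order.
    top = sorted(range(len(padded)), key=lambda i: padded[i], reverse=True)[:k]
    return [padded[i] for i in sorted(top)]


short_map = [0.2, 0.9]                      # from a short sentence
long_map = [0.1, 0.7, 0.3, 0.8, 0.05, 0.6]  # from a longer sentence
print(k_max_pool(short_map, k=3))
print(k_max_pool(long_map, k=3))
```

Both calls return a list of length 3, so sentences of any length map to vectors of one fixed size.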

Step 2: encoding the electric power industry policy text information after data preprocessing, specifically comprising the following steps:

the BiLSTM network or the GRU network encodes the power industry policy text information:

The BiLSTM network and the GRU network are sequence models, in which each sequence unit outputs its hidden state h.

Suppose a text describing thermal power generation policy is to be encoded, the document contains at most 50 sentences, and each sentence has been converted by the convolutional network into a sentence vector of length 200.

These 50 sentence vectors of length 200 are then sequence modeled by the BiLSTM, which outputs a vector representation for every sequence unit.

The output has dimension [50, 200], where 50 is the number of sequence units of the BiLSTM network and 200 is the length of the output vector of each sequence unit.

As shown in the BiLSTM network in fig. 2, the BiLSTM network effectively captures long-distance information dependencies. Therefore, after a fixed-length information representation has been extracted from each statement in the power industry policy text, the representations are input into the BiLSTM network in the order in which the statements appear in the text, and the power industry policy text information is encoded.

The BiLSTM network or the GRU network needs no pre-training: it outputs the encoding directly from the input representations.
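The bidirectional sequence encoding of step 2 can be sketched with a GRU, one of the two networks the description permits. The code is a plain-Python illustration with untrained random weights and toy sizes; the class and function names are hypothetical. If the two directions are concatenated, as assumed here, an output length of 200 would correspond to 100 hidden units per direction, which the description does not state.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class GRUCell:
    """Standard GRU cell with randomly initialised (untrained) weights."""

    def __init__(self, in_dim, hid_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                    for _ in range(rows)]
        self.hid_dim = hid_dim
        # One input matrix W, one recurrent matrix U and one bias b per gate.
        self.W = {g: mat(hid_dim, in_dim) for g in "zrh"}
        self.U = {g: mat(hid_dim, hid_dim) for g in "zrh"}
        self.b = {g: [0.0] * hid_dim for g in "zrh"}

    def _pre(self, g, x, h):
        wx, uh = matvec(self.W[g], x), matvec(self.U[g], h)
        return [a + c + d for a, c, d in zip(wx, uh, self.b[g])]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._pre("z", x, h)]        # update gate
        r = [sigmoid(v) for v in self._pre("r", x, h)]        # reset gate
        rh = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in self._pre("h", x, rh)]  # candidate state
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def encode_bidirectional(sentences, fwd, bwd):
    """Run one cell left-to-right and one right-to-left, then concatenate."""
    def run(cell, seq):
        h, out = [0.0] * cell.hid_dim, []
        for x in seq:
            h = cell.step(x, h)
            out.append(h)
        return out
    forward = run(fwd, sentences)
    backward = run(bwd, sentences[::-1])[::-1]
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
in_dim, hid_dim, n_sentences = 6, 3, 5  # toy sizes
doc = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_sentences)]
states = encode_bidirectional(doc, GRUCell(in_dim, hid_dim, rng),
                              GRUCell(in_dim, hid_dim, rng))
print(len(states), len(states[0]))  # number of sentences, 2 * hid_dim
```

Each output row combines what precedes and what follows the sentence in the document, which is the property the attention step relies on.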

Step 3: setting information attention weights of different sentences in a policy text of the power industry, specifically comprising the following steps:

as shown in the Attention mechanism in fig. 2, the Attention mechanism is used to set information Attention weights of different sentences in the power industry policy text, and output a power industry policy text vector code added with the Attention weight information.

The input of the Attention mechanism is the output of the BiLSTM network. In the Attention mechanism, the output $h_{i,t}$ of the BiLSTM network is first input into a fully connected layer to obtain the hidden-layer representation $u_{i,t}$ of the attention layer:

$$u_{i,t} = \tanh(W_w h_{i,t} + b_w)$$

The attention weight $\alpha_{i,t}$ of the corresponding information in the document is then calculated through softmax.

Finally, the outputs of the BiLSTM network are weighted by these weights and summed to obtain the weighted vector representation $s_i$.
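The three attention operations described in step 3 (fully connected layer, softmax weights, weighted sum) can be sketched as follows. The scoring of each hidden representation against a context vector `u_w` follows the common hierarchical attention formulation and is an assumption: the description gives only the tanh formula and does not say how the softmax inputs are formed. All names and toy values are hypothetical.

```python
import math


def attention_pool(hidden_states, W_w, b_w, u_w):
    """Weight encoder outputs h_t by attention; return (weights, weighted sum)."""
    # u_t = tanh(W_w h_t + b_w): hidden representation of the attention layer.
    us = [[math.tanh(sum(w * x for w, x in zip(row, h)) + b)
           for row, b in zip(W_w, b_w)] for h in hidden_states]
    # Score each u_t against the context vector u_w (assumed, not in the source).
    scores = [sum(a * c for a, c in zip(u, u_w)) for u in us]
    # Softmax over positions gives the attention weights alpha_t.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # s = sum_t alpha_t * h_t
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states))
              for d in range(dim)]
    return alphas, pooled


hidden = [[0.0, 1.0], [2.0, 0.0], [-1.0, 0.5]]  # toy encoder outputs
W_w = [[1.0, 0.0], [0.0, 1.0]]
b_w = [0.0, 0.0]
u_w = [1.0, 0.0]
alphas, pooled = attention_pool(hidden, W_w, b_w, u_w)
print(alphas, pooled)
```

The weights sum to one, so the pooled vector is a convex combination of the encoder outputs and keeps their dimension.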

Step 4: based on the encoding and attention weights of the power industry policy text information, classifying the power industry policy text by adopting the classification tree fusion technology, specifically comprising:

classifying the power industry policy texts by a Softmax classifier:

as shown in the Softmax classifier module in fig. 2, the power industry policy text vector encoding with the added attention weight information is input into the Softmax classifier to obtain a one-hot vector representation of the category to which the power industry policy text belongs, thereby classifying the power industry policy text.
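The final classification can be sketched as a linear layer followed by softmax and an argmax that yields the one-hot category vector. The category names, weights and input vector below are hypothetical values chosen for illustration; the description does not list the policy categories.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(doc_vector, weights, biases):
    """Return class probabilities and the one-hot vector of the best class."""
    logits = [sum(w * x for w, x in zip(row, doc_vector)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    one_hot = [1 if i == best else 0 for i in range(len(probs))]
    return probs, one_hot


# Hypothetical category names and parameters.
CATEGORIES = ["electricity price", "renewable energy", "grid safety"]
doc_vector = [0.5, -1.0, 2.0]
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0, 0.0]
probs, one_hot = classify(doc_vector, weights, biases)
print(CATEGORIES[one_hot.index(1)], one_hot)
```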

Step 5: extracting the triple information of the different types of power industry policy texts, specifically comprising the following steps:

step 5.1: extracting triple information of the power industry policy text based on the open domain triple extraction tool Open-IE, the specific structure of which is shown in FIG. 3;

Because relevant methods for open domain triple information extraction are currently lacking, a new method for extracting open domain triple information is proposed.

The method first extracts all possible Subjects and Predicates of the different types of power industry policies, then judges the association between the Subjects and the Predicates, and finally extracts the Objects corresponding to the Subjects and the Predicates.

The step 5.1 specifically comprises the following steps:

step 5.1.1: Encoder-Layer:

the text representation capability of the BiLSTM network on the triple extraction task is weak, and its overall effect is poor. Therefore, unlike step 2, which uses the BiLSTM network as the encoding layer, the triple extraction task uses BERT as the encoding layer so that the context information of the statement is captured better.

In the Encoder-Layer of fig. 3, BERT is accordingly used as the feature extraction layer to acquire the context information of the statement.

Step 5.1.2: EntityRelation-Layer:

in the entity extraction layer EntityRelation-Layer of fig. 3, the start position and the end position of the Subject and the Predicate are respectively extracted in a Span manner. The calculation formula is as follows:

$$P_i^{start_s} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_s} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

$$P_i^{start_p} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_p} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

wherein $P_i^{start_s}$ represents the probability that the i-th token in the statement is the start position of the Subject, $P_i^{end_s}$ represents the probability that the i-th token is the end position of the Subject, $P_i^{start_p}$ represents the probability that the i-th token is the start position of the Predicate, $P_i^{end_p}$ represents the probability that the i-th token is the end position of the Predicate, $h_i$ represents the encoding of the i-th token in the statement produced by BERT, $W_{(\cdot)}$ represents the model weights to be trained, and $b_{(\cdot)}$ is a bias.
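Span-style extraction can be sketched as two per-token sigmoid scores followed by a decoding rule. The formulas above give only the probabilities; the decoding rule used here (threshold 0.5, each start paired with the nearest end at or after it) is a common choice and an assumption, as are the toy encodings, the weights and the function names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def position_probs(encodings, weight, bias):
    """P_i = sigmoid(W h_i + b) for every token encoding h_i."""
    return [sigmoid(sum(w * x for w, x in zip(weight, h)) + bias)
            for h in encodings]


def decode_spans(start_probs, end_probs, threshold=0.5):
    """Pair each start position with the nearest end position at or after it."""
    spans = []
    for i, ps in enumerate(start_probs):
        if ps < threshold:
            continue
        for j in range(i, len(end_probs)):
            if end_probs[j] >= threshold:
                spans.append((i, j))
                break
    return spans


# Toy 2-dimensional token encodings standing in for BERT outputs.
encodings = [[4.0, -4.0], [-4.0, 4.0], [-4.0, -4.0]]
start_probs = position_probs(encodings, weight=[1.0, 0.0], bias=0.0)
end_probs = position_probs(encodings, weight=[0.0, 1.0], bias=0.0)
spans = decode_spans(start_probs, end_probs)
print(spans)
```

Here token 0 scores high as a start and token 1 as an end, so the decoded span covers tokens 0 to 1; the same routine would be run once for Subjects and once for Predicates.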

Step 5.1.3: Multihead-Layer:

in the MultiHead-Layer of fig. 3, each token in the statement may form a relationship with other tokens, and this layer finds all Subjects and Predicates that may be related. The calculation formula is as follows:

$$P_{i,j} = \mathrm{sigmoid}(h_i, h_j)$$

wherein $h_i$ represents the encoding of the i-th feature in the statement, which is a Subject feature, $h_j$ represents the encoding of the j-th feature in the statement, which is a Predicate feature, and $P_{i,j}$ represents the probability that $(h_i, h_j)$ can form a relationship.
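The pairing step can be sketched as a score matrix over Subject and Predicate features. The formula writes the sigmoid over the pair (h_i, h_j) without saying how the two vectors are combined; the dot product used below is one simple choice and an assumption, as are the threshold, the toy features and the function names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pair_probs(subject_feats, predicate_feats):
    """P[i][j] = sigmoid(h_i . h_j); the dot product is an assumed combination."""
    return [[sigmoid(sum(a * b for a, b in zip(hs, hp)))
             for hp in predicate_feats] for hs in subject_feats]


def related_pairs(probs, threshold=0.5):
    """Return the (subject index, predicate index) pairs above the threshold."""
    return [(i, j) for i, row in enumerate(probs)
            for j, p in enumerate(row) if p >= threshold]


subject_feats = [[2.0, 0.0], [0.0, 2.0]]      # toy Subject features
predicate_feats = [[1.5, -1.0], [-1.0, 1.5]]  # toy Predicate features
probs = pair_probs(subject_feats, predicate_feats)
pairs = related_pairs(probs)
print(pairs)
```

Scoring every Subject against every Predicate lets one Subject take part in several relationships, which a single best-match rule would not allow.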

Step 5.1.4: Object-Layer:

in the Object-Layer in fig. 3, this layer extracts the Object corresponding to a specified Subject and Predicate. The calculation formula is as follows:

$$P_i^{start_o} = \mathrm{sigmoid}(W_{start_o}(h_i, V^s, V^p) + b_{start_o})$$

$$P_i^{end_o} = \mathrm{sigmoid}(W_{end_o}(h_i, V^s, V^p) + b_{end_o})$$

wherein $P_i^{start_o}$ represents the probability that the i-th token in the statement is the start position of the Object, $P_i^{end_o}$ represents the probability that the i-th token is the end position of the Object, $V^s$ represents the sum of the head and tail features of the Subject, and $V^p$ represents the sum of the head and tail features of the Predicate.
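The conditioning of the Object scores on a chosen Subject and Predicate can be sketched as follows. The span feature is the sum of the encodings at the head and tail of the span, as defined above; reading the triple (h_i, V^s, V^p) as a concatenation is an assumption, and the weights, toy encodings and function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def span_feature(encodings, span):
    """Sum of the encodings at the head and tail positions of a span."""
    start, end = span
    return [a + b for a, b in zip(encodings[start], encodings[end])]


def object_position_probs(encodings, subj_span, pred_span, weight, bias):
    """Score every token as an Object boundary given one Subject and Predicate."""
    v_s = span_feature(encodings, subj_span)
    v_p = span_feature(encodings, pred_span)
    probs = []
    for h in encodings:
        joined = h + v_s + v_p  # concatenation (an assumed reading)
        probs.append(sigmoid(sum(w * x for w, x in zip(weight, joined)) + bias))
    return probs


encodings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [2.0, -1.0]]  # toy values
weight = [1.0, -1.0, 0.1, 0.1, -0.5, -0.5]  # length = 3 * encoding size
start_probs = object_position_probs(encodings, subj_span=(0, 1),
                                    pred_span=(2, 2), weight=weight, bias=0.0)
print(start_probs)
```

The same function would be called with a second weight vector to obtain the end probabilities, after which spans can be decoded as for Subjects and Predicates.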

Step 5.1.5: Triple-Result:

in the Triple-Result layer of FIG. 3, the final (Subject, Predicate, Object) set of the statement is extracted according to the preceding steps.

Step 5.2: extracting the triple information of the power industry policy text based on the closed domain triple extraction tool Close-IE, the specific structure of which is shown in FIG. 4.

Step 5.2 first extracts the Subject and the Object, and then classifies the relationship between the Subject and the Object, specifically including:

step 5.2.1: BERT-Layer:

closed domain triple extraction likewise uses BERT as the encoding layer, because the text representation capability of the BiLSTM network on the triple extraction task is weak.

In the encoding layer BERT-Layer of fig. 4, BERT is used as the feature extraction layer to obtain the context information of the statement.

Step 5.2.2: Entity-Layer:

in the entity extraction layer Entity-Layer of fig. 4, the start position and the end position of the Subject and the Object are extracted respectively in a Span manner. The calculation formula is as follows:

$$P_i^{start_s} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_s} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

$$P_i^{start_o} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_o} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

wherein $P_i^{start_s}$ represents the probability that the i-th token in the statement is the start position of the Subject, $P_i^{end_s}$ represents the probability that the i-th token is the end position of the Subject, $P_i^{start_o}$ represents the probability that the i-th token is the start position of the Object, $P_i^{end_o}$ represents the probability that the i-th token is the end position of the Object, $h_i$ represents the encoding of the i-th token in the statement produced by BERT, $W_{(\cdot)}$ represents the model weights to be trained, and $b_{(\cdot)}$ is a bias.

Step 5.2.3: Multihead-Layer:

in the MultiHead-Layer of fig. 4, each token in the statement may have a relationship with other tokens, and this layer finds the possible relationships among all the different tokens. The calculation formula is as follows:

$$P_{i,j} = \mathrm{sigmoid}(h_i, h_j)$$

wherein $h_i$ represents the encoding of the i-th feature in the statement, which is a Subject feature, $h_j$ represents the encoding of the j-th feature in the statement, which is an Object feature, and $P_{i,j}$ represents the probability that $(h_i, h_j)$ can form a relationship.

Step 5.2.4: Triple-Result:

in the Triple-Result of FIG. 4, the final (Subject, Predicate, Object) set in the statement is extracted according to the preceding steps.

Step 6: based on an entity alignment algorithm, the extracted different types of information are fused and assembled into power industry policy trees of different classifications, thereby realizing the policy management of the power industry.
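The fusion and assembly of step 6 can be illustrated with a small sketch that groups extracted triples by policy category and merges entity mentions that refer to the same thing. The description does not specify the entity alignment algorithm; the alias table lookup below is a deliberately simple stand-in, and the alias table, example triples and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias table mapping surface forms to one canonical entity name.
ALIASES = {"Jiangsu": "Jiangsu Province"}


def align(entity):
    """Return the canonical name of an entity (stand-in for entity alignment)."""
    return ALIASES.get(entity, entity)


def build_policy_tree(classified_triples):
    """Assemble category -> subject -> {(predicate, object)} from the triples."""
    tree = defaultdict(lambda: defaultdict(set))
    for category, (subj, pred, obj) in classified_triples:
        tree[category][align(subj)].add((pred, align(obj)))
    return tree


# Hypothetical output of steps 4 and 5: (category, (Subject, Predicate, Object)).
triples = [
    ("electricity price", ("Jiangsu", "adjusts", "thermal power price")),
    ("electricity price", ("Jiangsu Province", "publishes", "price notice")),
]
tree = build_policy_tree(triples)
for category, subjects in tree.items():
    for subject, facts in subjects.items():
        print(category, subject, sorted(facts))
```

Because both mentions align to the same canonical entity, the two facts end up under a single node of the category tree rather than under two separate ones.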

The present applicant has described and illustrated embodiments of the present invention in detail with reference to the accompanying drawings, but it should be understood by those skilled in the art that the above embodiments are merely preferred embodiments of the present invention, and the detailed description is only for the purpose of helping the reader to better understand the spirit of the present invention, and not for limiting the scope of the present invention, and on the contrary, any improvement or modification made based on the spirit of the present invention should fall within the scope of the present invention.

Claims

1. A power industry policy management method based on classification tree fusion technology is characterized by comprising the following steps:

the method comprises the following steps:

step 1: acquiring a policy text of the power industry and preprocessing data;

2. The power industry policy management method based on classification tree fusion technology as claimed in claim 1, wherein:

the step 1 specifically comprises the following steps:

3. The power industry policy management method based on classification tree fusion technology as claimed in claim 2, wherein:

the step 1.3 specifically comprises:

4. The power industry policy management method based on classification tree fusion technology as claimed in claim 2, wherein:

the step 2 specifically comprises the following steps:

5. The electric power industry policy management method based on classification tree fusion technology as claimed in claim 4, wherein:

the step 3 specifically comprises the following steps:

6. The electric power industry policy management method based on classification tree fusion technology as claimed in claim 5, wherein:

the step 4 specifically comprises the following steps:

7. The power industry policy management method based on classification tree fusion technology as claimed in claim 1, wherein:

the step 5 specifically comprises the following steps:

8. The power industry policy management method based on classification tree fusion technology as claimed in claim 7, wherein:

the step 5.1 specifically comprises the following steps:

9. The power industry policy management method based on classification tree fusion technology as claimed in claim 8, wherein:

in step 5.1.2, the start position and the end position of the Subject and the Predicate are extracted respectively in a Span mode, and the formula is as follows:

$$P_i^{start_s} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_s} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

$$P_i^{start_p} = \mathrm{sigmoid}(W_{start} h_i + b_{start})$$

$$P_i^{end_p} = \mathrm{sigmoid}(W_{end} h_i + b_{end})$$

step 5.1.3 the formula used is as follows:

$$P_{i,j} = \mathrm{sigmoid}(h_i, h_j)$$

step 5.1.4 the formula used is as follows:

$$P_i^{start_o} = \mathrm{sigmoid}(W_{start_o}(h_i, V^s, V^p) + b_{start_o})$$

$$P_i^{end_o} = \mathrm{sigmoid}(W_{end_o}(h_i, V^s, V^p) + b_{end_o})$$

10. The power industry policy management method based on classification tree fusion technology as claimed in claim 7, wherein:

the step 5.2 specifically comprises the following steps: