CN115455155B - Method for extracting subject information of government affair text and storage medium - Google Patents
- Publication number
- CN115455155B (application CN202211402800.7A)
- Authority
- CN
- China
- Prior art keywords
- information
- keywords
- model
- government affair
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method for extracting subject information of government affair text, and a storage medium. The method first preprocesses unstructured government affair text data; a MacBERT model then extracts word vectors from the preprocessed text data; semantic information in the sentences is captured through a BiGRU model to obtain high-level feature vectors of the keywords; and finally the importance of each keyword is calculated, the keywords are sorted by importance in descending order, and the most important keywords are selected as the subject information keywords, thereby extracting the subject information of the government affair text. The invention combines the MacBERT model and the BiGRU model to extract subject information from unstructured government affair text data, which not only reduces the overfitting risk of the models, but also extracts high-level keyword features well, yields more accurate subject information keywords, and helps government departments quickly mine and analyze unstructured texts.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for extracting subject information of government affair texts and a storage medium.
Background
Government affair big data refers to data owned and managed by the government. It has wide sources and various forms, specifically including (but not limited to) natural information, jurisdiction construction, jurisdiction health-management statistical monitoring, and service and civil consumption data. At present, the quantity of unstructured government affair data keeps increasing; its data structure is irregular or incomplete, it lacks a predefined data model, and it is difficult to represent with a two-dimensional database logic table. How to extract the subject information of such government affair data quickly and efficiently has therefore become a technical problem that urgently needs to be solved.
By using natural language processing technology from the field of artificial intelligence to extract the subject information in government affair data, mining and analysis of unstructured text can be realized. For example, for the document of the General Office of the Shanghai Municipal People's Government on issuing the work plan for the safety treatment of self-built houses in Shanghai, a subject information extraction model analyzes the file and the general characteristics of subject expression in the text, finally obtaining the subject information keywords "self-built house", "private", "investigation", "treatment", "elimination", "potential safety hazard", "reinforced guarantee", and "supervision and guidance". Subject information extraction of government affair text thus enables fast text understanding.
Disclosure of Invention
Aiming at the problem of the irregular data structure of unstructured government affair text data, the invention provides a method for extracting the theme information of government affair text, which can effectively extract the theme information of the government affair text and realize quick text understanding.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for extracting subject information of government affair texts comprises the following steps:
data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
text feature vector extraction and processing step S120:
extracting word vectors from the preprocessed government affair text information data by adopting a MacBERT model to obtain keyword feature vectors, then taking the keyword feature vectors as input, capturing semantic information in sentences through a BiGRU model, and optimizing the feature vectors to obtain high-level feature vectors of the keywords;
subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the importance of the keywords in a descending order, and selecting the keywords with high importance as the topic information keywords, thereby realizing the topic information extraction of the government affair text.
Optionally, the preprocessing specifically includes: deleting punctuation marks and spaces, introducing a domain dictionary into the government affair text data, performing word segmentation on the data, filtering stop words using a general stop-word library, and removing the corresponding stop words from the segmented government affair text data.
Optionally, the government affair text information data includes unstructured government affair text data, which specifically includes: natural text language describing information such as jurisdiction construction and the statistical monitoring of jurisdiction health management.
Optionally, in step S120, the BiGRU model is a bidirectional improved recurrent neural network.
Optionally, the BiGRU model includes a forward GRU model →GRU and a reverse GRU model ←GRU, wherein the forward GRU model reads the feature vector sequence of the keywords in the forward direction and the reverse GRU model reads it in the backward direction.

Each GRU model controls information flow through an update gate z_t and a reset gate r_t. The information propagation process inside the GRU model is as follows:

    r_t = σ(W_r · [h_{t−1}, x_t])
    z_t = σ(W_z · [h_{t−1}, x_t])
    h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

wherein x_t is the input vector; W_r is the weight matrix of the reset gate r_t; W_z is the weight matrix of the update gate z_t; W_h is the weight matrix of the present (candidate) information h̃_t; ⊙ denotes element-wise multiplication; σ is the sigmoid function; and tanh is the hyperbolic tangent function. The present information h̃_t is jointly determined by the past information h_{t−1} and the current input x_t; h_t is the information output at time t, comprising the past information h_{t−1} and the present information h̃_t. The update gate z_t is used for controlling how much historical information is forgotten and how much new information is accepted in the current state, and the reset gate r_t is used for controlling how much information in the candidate state is obtained from the historical information.

The output of the BiGRU model at time t is

    h_t = w_t · →h_t + v_t · ←h_t + b_t

wherein →h_t is the output of the forward GRU model, ←h_t is the output of the reverse GRU model, w_t represents the weight corresponding to →h_t at time t, v_t represents the weight corresponding to ←h_t, and b_t represents the bias term corresponding to time t.
Optionally, in step S120, a MacBERT model is used to extract a word vector, the extracted word vector is used to extract context features through a bidirectional GRU model, and high-level feature vectors of the keywords are generated by concatenation.
Optionally, the subject information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

    P = σ(W_p · h_t + b_p)

wherein W_p is the weight matrix of P and b_p is the bias term of P.
Optionally, the keywords are ranked by their topic information importance P in descending order, and the first eight are selected as the topic information keywords.
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer executable instructions, when executed by a processor, perform the above-described government affairs text subject information extraction method.
Compared with the prior art, the method for extracting the theme information of the government affair text has the following advantages:
1) The invention adopts the MacBERT model, which obtains the keyword feature vectors and alleviates the problem of insufficient local feature extraction capability.
2) The invention adopts the BiGRU model, which captures the semantic information in sentences and obtains the high-level feature vectors of the keywords; the text information is used effectively and parallel computation is adopted, greatly improving the efficiency of subject information extraction.
3) The invention fuses the MacBERT model and the BiGRU model, which improves on the subject information extraction effect of a single model, further improves the extraction accuracy of the subject information, and reduces the overfitting risk of the model.
Drawings
Fig. 1 is a basic flowchart of a method for extracting subject information of a government affair text and a storage medium according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The invention is characterized in that a MacBERT model (MLM-as-correction BERT, a Bidirectional Encoder Representations from Transformers variant in which the Masked Language Model task is reformulated as a correction task) and a BiGRU model (Bidirectional Gated Recurrent Unit) are combined to extract the subject information of unstructured government affair text data. First, a MacBERT layer extracts word vectors to obtain keyword feature vectors; then a BiGRU layer captures semantic information in sentences and extracts high-level feature vectors of the keywords, making the features more discriminative; finally, the importance of each keyword is calculated, the keywords are sorted by importance in descending order, and the most important keywords are selected as the subject information keywords, realizing the subject information extraction of the government affair text.
Referring to fig. 1, a basic flowchart of a method for extracting subject information of a government affairs text and a storage medium according to an embodiment of the present invention is shown.
Data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
specifically, the pretreatment specifically comprises: deleting punctuation marks, blank spaces and the like, introducing a field dictionary into the government affair text data, performing word segmentation processing on the data, filtering stop words by using a general stop word library, and removing corresponding stop words in the government affair text data after word segmentation.
Specifically, in step S110, the unstructured government affair text data includes a natural text language describing information such as construction of the jurisdiction and statistical supervision of health management of the jurisdiction.
Of course, the invention is not limited thereto, and the processing method of the invention can be applied to other government affair text information.
Text feature vector extraction and processing step S120:
A MacBERT model performs word vector extraction on the preprocessed government affair text information data, such as unstructured government affair text data, to obtain keyword feature vectors; then, taking the keyword feature vectors as input, a BiGRU model captures the semantic information in the sentences and optimizes the feature vectors to obtain the high-level feature vectors of the keywords.
Specifically, in step S120, the MacBERT model obtains the keyword feature vectors and alleviates the problem of insufficient local feature extraction capability.
Specifically, in step S120, the BiGRU model is a bidirectional improved recurrent neural network, including a forward GRU model →GRU and a reverse GRU model ←GRU, wherein the forward GRU model reads the feature vector sequence of the keywords in the forward direction and the reverse GRU model reads it in the backward direction.

Each GRU model controls information flow through an update gate z_t and a reset gate r_t. The information propagation process inside the GRU model is as follows:

    r_t = σ(W_r · [h_{t−1}, x_t])
    z_t = σ(W_z · [h_{t−1}, x_t])
    h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

wherein x_t is the input vector; W_r is the weight matrix of the reset gate r_t; W_z is the weight matrix of the update gate z_t; W_h is the weight matrix of the present (candidate) information h̃_t; ⊙ denotes element-wise multiplication; σ is the sigmoid function; and tanh is the hyperbolic tangent function. The present information h̃_t is jointly determined by the past information h_{t−1} and the current input x_t; h_t is the information output at time t, comprising the past information h_{t−1} and the present information h̃_t. The update gate z_t controls how much historical information needs to be forgotten and how much new information needs to be accepted in the current state, which helps to capture long-term dependencies in the sequence. The reset gate r_t controls how much information in the candidate state is obtained from the historical information, which helps to capture short-term dependencies in the sequence.

The output of the BiGRU model at time t is

    h_t = w_t · →h_t + v_t · ←h_t + b_t

wherein →h_t is the output of the forward GRU model, ←h_t is the output of the reverse GRU model, w_t represents the weight corresponding to →h_t at time t, v_t represents the weight corresponding to ←h_t, and b_t represents the bias term corresponding to time t.
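The gate mechanism described above can be traced numerically with a single-dimension (scalar) GRU step in plain Python; the weights below are illustrative placeholders rather than trained parameters, and bias terms are omitted for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    """One scalar GRU step: reset gate r_t, update gate z_t,
    candidate (present) information h~_t, and the new state h_t.
    Each W* is a pair (weight on h_prev, weight on x_t)."""
    r_t = sigmoid(Wr[0] * h_prev + Wr[1] * x_t)               # reset gate
    z_t = sigmoid(Wz[0] * h_prev + Wz[1] * x_t)               # update gate
    h_cand = math.tanh(Wh[0] * (r_t * h_prev) + Wh[1] * x_t)  # candidate info
    # Update gate balances forgotten history against accepted new info.
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand
    return h_t

# Run a short sequence through the cell with placeholder weights.
h = 0.0
for x in [0.5, -0.2, 0.8]:
    h = gru_step(x, h, Wr=(0.1, 0.4), Wz=(0.2, 0.3), Wh=(0.5, 0.6))
print(round(h, 4))
```

Because the candidate state passes through tanh and the new state is a convex combination of the previous state and the candidate, the hidden state stays bounded in (−1, 1) regardless of sequence length.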
Specifically, in step S120, the MacBERT model extracts word vectors, context features are extracted from the word vectors through the bidirectional GRU model, and the high-level feature vectors of the keywords are generated by concatenation, so as to improve the extraction accuracy of the topic information.
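The bidirectional pass — a forward GRU, a reverse GRU, and a weighted merge of the two outputs at each position — can be sketched as follows. The scalar cell and the merge parameters w, v, b are illustrative placeholders, not the patent's trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, W=(0.2, 0.3, 0.1, 0.4, 0.5, 0.6)):
    """Scalar GRU step (placeholder weights, biases omitted)."""
    wr_h, wr_x, wz_h, wz_x, wh_h, wh_x = W
    r = sigmoid(wr_h * h_prev + wr_x * x_t)            # reset gate
    z = sigmoid(wz_h * h_prev + wz_x * x_t)            # update gate
    h_cand = math.tanh(wh_h * (r * h_prev) + wh_x * x_t)
    return (1.0 - z) * h_prev + z * h_cand

def bigru(xs, w=0.5, v=0.5, b=0.0):
    """Run the GRU forward and backward over the sequence, then
    combine per-position outputs as h_t = w * fwd_t + v * bwd_t + b."""
    h = 0.0
    fwd = []
    for x in xs:                           # forward GRU pass
        h = gru_step(x, h)
        fwd.append(h)
    h = 0.0
    bwd = [0.0] * len(xs)
    for i in range(len(xs) - 1, -1, -1):   # reverse GRU pass
        h = gru_step(xs[i], h)
        bwd[i] = h
    return [w * f + v * r + b for f, r in zip(fwd, bwd)]

out = bigru([0.5, -0.2, 0.8])
print([round(o, 4) for o in out])
```

Note that position t sees context from both directions: the forward state summarizes x_1..x_t while the backward state summarizes x_t..x_T, which is what lets the BiGRU capture sentence-level semantics around each keyword.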
Subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the importance of the keywords in a descending order, and selecting the keywords with higher importance as the topic information keywords, thereby realizing the topic information extraction of the government affair text.
Specifically, in step S130, the topic information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

    P = σ(W_p · h_t + b_p)

wherein W_p is the weight matrix of P and b_p is the bias term of P. The model is trained on the data to obtain its optimal parameters.
Specifically, the keywords may be ranked by their topic information importance P in descending order, and the first eight selected as the topic information keywords.
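A minimal sketch of this ranking step, assuming scalar stand-ins W_p and b_p for the trained parameters and hypothetical keyword feature values:

```python
import math

def keyword_importance(h_t, W_p=0.8, b_p=-0.1):
    """P = sigmoid(W_p * h_t + b_p); placeholder scalar parameters."""
    return 1.0 / (1.0 + math.exp(-(W_p * h_t + b_p)))

def top_k_keywords(keywords_with_features, k=8):
    """Rank keywords by importance P in descending order and keep
    the first k as the topic-information keywords."""
    scored = [(kw, keyword_importance(h)) for kw, h in keywords_with_features]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [kw for kw, _ in scored[:k]]

# Hypothetical high-level feature values per keyword.
feats = [("self-built house", 2.1), ("safety hazard", 1.8),
         ("investigation", 1.2), ("weather", -0.5)]
print(top_k_keywords(feats, k=3))
# → ['self-built house', 'safety hazard', 'investigation']
```

Since the sigmoid is monotonic, the ranking follows the raw scores W_p·h_t + b_p; the sigmoid only normalizes each importance into (0, 1).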
A storage medium for storing computer-executable instructions, characterized in that:
the computer executable instructions, when executed by a processor, perform the above-described government affairs text subject information extraction method.
Compared with the prior art, the method for extracting the theme information of the government affair text has the following advantages:
1) The invention adopts the MacBERT model, which obtains the keyword feature vectors and alleviates the problem of insufficient local feature extraction capability.
2) The invention adopts the BiGRU model, which captures the semantic information in sentences and obtains the high-level feature vectors of the keywords; the text information is used effectively and parallel computation is adopted, greatly improving the efficiency of subject information extraction.
3) The invention fuses the MacBERT model and the BiGRU model, which improves on the subject information extraction effect of a single model, improves the extraction accuracy of the subject information, and reduces the overfitting risk of the model.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device, or alternatively, they may be implemented using program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the invention with reference to specific preferred embodiments, and the specific implementation of the invention should not be considered as limited to these descriptions. For a person of ordinary skill in the art, simple deductions or substitutions made without departing from the inventive concept shall be considered to fall within the scope of protection defined by the claims as filed.
Claims (5)
1. A method for extracting subject information of a government affair text is characterized by comprising the following steps:
data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
text feature vector extraction and processing step S120:
extracting word vectors from the preprocessed government affair text data by adopting a MacBERT model to obtain keyword feature vectors, then taking the keyword feature vectors as input, capturing semantic information in sentences through a BiGRU model, and optimizing the feature vectors to obtain high-level feature vectors of the keywords;
subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in the step S120, calculating the importance of the keywords, sorting the importance of the keywords in a descending order, and selecting the keywords with high importance as the topic information keywords to realize the topic information extraction of the government affair text;
in step S120, the BiGRU model is a bidirectional modified recurrent neural network;
the BiGRU model comprises a forward GRU modelAnd reverse GRU modelAmong them forward GRU modelIn which the feature vector of the keyword is input in the forward directionReverse GRU modelUsing inverse inputs for the feature vectors of the keywords,
Each GRU modelBy renewing the doorAnd a reset gateThe information propagation process inside the GRU model is as follows:
wherein, the first and the second end of the pipe are connected with each other,in order to input the vector, the input vector is input,to reset the doorThe weight matrix of (a) is determined,for updating the doorThe weight matrix of (a) is determined,for the present informationThe weight matrix of (a) is determined,to be made intoThe elements are multiplied by each other, and the multiplication,is a function of the sigmoid and is,is a hyperbolic tangent function, now informationFrom past informationAnd the current inputThe decision is made in a joint manner,is composed ofOutputting time information including past informationAnd present informationUpdating doorReset gate for controlling how much history information is forgotten and how much new information is accepted in current stateUsed for controlling how much information in the candidate state is obtained from the history information;
wherein, the first and the second end of the pipe are connected with each other,for the output of the forward GRU model,for the output of the inverse GRU model,to representTime of dayThe weight of the corresponding one of the first and second weights,to representThe weight of the corresponding one of the first and second weights,to representTime of dayThe corresponding bias term;
in step S120, the MacBERT model extracts word vectors, context features are extracted from the word vectors through the bidirectional GRU model, and the high-level feature vectors of the keywords are generated by concatenation;
in step S130, the topic information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

    P = σ(W_p · h_t + b_p)

wherein W_p is the weight matrix of P and b_p is the bias term of P.
2. The subject information extraction method according to claim 1, characterized in that:
the pretreatment specifically comprises: and deleting punctuation marks and spaces, introducing a field dictionary into the government affair text data, performing word segmentation on the data, filtering stop words by using a general stop word bank, and removing corresponding stop words in the divided government affair text data.
3. The subject information extraction method according to claim 2, characterized in that:
the government affair text information data comprise unstructured government affair text data, and specifically comprise: and describing natural text language of the statistical monitoring condition of the construction and health management of the district.
4. The subject information extraction method according to claim 1, characterized in that:
The keywords are ranked by their topic information importance P in descending order, and the first eight are selected as the topic information keywords.
5. A storage medium for storing computer-executable instructions, characterized in that:
the computer-executable instructions, when executed by a processor, perform the method of extracting subject information of government affairs texts according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402800.7A CN115455155B (en) | 2022-11-10 | 2022-11-10 | Method for extracting subject information of government affair text and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402800.7A CN115455155B (en) | 2022-11-10 | 2022-11-10 | Method for extracting subject information of government affair text and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115455155A CN115455155A (en) | 2022-12-09 |
CN115455155B true CN115455155B (en) | 2023-03-03 |
Family
ID=84295516
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211402800.7A Active CN115455155B (en) | 2022-11-10 | 2022-11-10 | Method for extracting subject information of government affair text and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115455155B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113535886A (en) * | 2020-04-15 | 2021-10-22 | 北大方正信息产业集团有限公司 | Information processing method, device and equipment |
CN114398877A (en) * | 2022-01-12 | 2022-04-26 | 平安普惠企业管理有限公司 | Theme extraction method and device based on artificial intelligence, electronic equipment and medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11755840B2 (en) * | 2021-02-09 | 2023-09-12 | Tata Consultancy Services Limited | Extracting mentions of complex relation types from documents by using joint first and second RNN layers to determine sentence spans which correspond to relation mentions |
CN114153802A (en) * | 2021-12-03 | 2022-03-08 | 西安交通大学 | Government affair file theme classification method based on Bert and residual self-attention mechanism |
CN114357172A (en) * | 2022-01-07 | 2022-04-15 | 北京邮电大学 | Rumor detection method based on ERNIE-BiGRU-Attention |
CN115310448A (en) * | 2022-08-10 | 2022-11-08 | 南京邮电大学 | Chinese named entity recognition method based on combining bert and word vector |
- 2022-11-10: application CN202211402800.7A filed in China; granted as CN115455155B (active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113535886A (en) * | 2020-04-15 | 2021-10-22 | 北大方正信息产业集团有限公司 | Information processing method, device and equipment |
CN114398877A (en) * | 2022-01-12 | 2022-04-26 | 平安普惠企业管理有限公司 | Theme extraction method and device based on artificial intelligence, electronic equipment and medium |
Non-Patent Citations (1)
Title |
---|
Candidate Sentence Extraction Algorithm for Machine Reading Comprehension; Guo Xin et al.; Computer Science; 2020-05-31; Vol. 47, No. 5; pp. 198-203 *
Also Published As
Publication number | Publication date |
---|---|
CN115455155A (en) | 2022-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144131B (en) | Network rumor detection method based on pre-training language model | |
CN109800310B (en) | Electric power operation and maintenance text analysis method based on structured expression | |
CN111767725B (en) | Data processing method and device based on emotion polarity analysis model | |
CN111368086A (en) | CNN-BilSTM + attribute model-based sentiment classification method for case-involved news viewpoint sentences | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN111222305A (en) | Information structuring method and device | |
CN111325029A (en) | Text similarity calculation method based on deep learning integration model | |
Sartakhti et al. | Persian language model based on BiLSTM model on COVID-19 corpus | |
CN113742733A (en) | Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device | |
CN114925157A (en) | Nuclear power station maintenance experience text matching method based on pre-training model | |
CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
CN114462379A (en) | Improved script learning method and device based on event evolution diagram | |
CN113886601A (en) | Electronic text event extraction method, device, equipment and storage medium | |
Fu et al. | Improving distributed word representation and topic model by word-topic mixture model | |
CN111831783A (en) | Chapter-level relation extraction method | |
Wang et al. | Weighted graph convolution over dependency trees for nontaxonomic relation extraction on public opinion information | |
CN113761192A (en) | Text processing method, text processing device and text processing equipment | |
Behere et al. | Text summarization and classification of conversation data between service chatbot and customer | |
Sairam et al. | Image Captioning using CNN and LSTM | |
Nair et al. | Fake news detection model for regional language | |
CN115455155B (en) | Method for extracting subject information of government affair text and storage medium | |
US20240086643A1 (en) | Visual Dialogue Method and System | |
CN111382333A (en) | Case element extraction method in news text sentence based on case correlation joint learning and graph convolution | |
Bhargava et al. | Deep paraphrase detection in indian languages | |
CN116186241A (en) | Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||