CN115455155B - Method for extracting subject information of government affair text and storage medium - Google Patents
- Publication number
- CN115455155B (application CN202211402800.7A)
- Authority
- CN
- China
- Prior art keywords
- information
- keywords
- model
- government affair
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method for extracting subject information of government affair text, and a storage medium. The method first preprocesses unstructured government affair text data; a MacBERT model then extracts word vectors from the preprocessed text data; semantic information in the sentences is captured through a BiGRU model to obtain high-level feature vectors of the keywords; and finally the importance of each keyword is calculated, the keywords are sorted by importance in descending order, and the most important keywords are selected as the subject information keywords, thereby extracting the subject information of the government affair text. The invention combines the MacBERT model and the BiGRU model to extract subject information from unstructured government affair text data, which not only reduces the overfitting risk of the models, but also extracts high-level keyword features well, yields more accurate subject information keywords, and helps government departments quickly mine and analyze unstructured texts.
Description
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for extracting subject information of government affair texts and a storage medium.
Background
Government affair big data refers to data owned and managed by the government. It has wide sources and various forms, specifically including (but not limited to) natural information, jurisdiction construction, jurisdiction health-management statistical monitoring, and service and civil consumption data. At present, the quantity of unstructured government affair data keeps increasing; its data structure is irregular or incomplete, it lacks a predefined data model, and it is difficult to represent with a two-dimensional database logic table. How to extract the subject information of such government affair data quickly and efficiently has therefore become a technical problem that urgently needs to be solved.
By using natural language processing technology from the field of artificial intelligence to extract the subject information in government affair data, mining and analysis of unstructured text can be realized. For example, for the document of the General Office of the Shanghai Municipal People's Government on issuing the work plan for the safety treatment of self-built houses in Shanghai, a subject information extraction model analyzes the file and the general characteristics of subject expression in the text, finally obtaining the subject information keywords "self-built house", "private", "investigation", "treatment", "elimination", "potential safety hazard", "reinforced guarantee", and "supervision and guidance". Subject information extraction of government affair text thus enables fast text understanding.
Disclosure of Invention
Aiming at the problem of the irregular data structure of unstructured government affair text data, the invention provides a method for extracting the theme information of government affair text, which can effectively extract the theme information of the government affair text and realize quick text understanding.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for extracting subject information of government affair texts comprises the following steps:
data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
text feature vector extraction and processing step S120:
extracting word vectors from the preprocessed government affair text information data by adopting a MacBERT model to obtain keyword feature vectors, then taking the keyword feature vectors as input, capturing semantic information in sentences through a BiGRU model, and optimizing the feature vectors to obtain high-level feature vectors of the keywords;
subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the importance of the keywords in a descending order, and selecting the keywords with high importance as the topic information keywords, thereby realizing the topic information extraction of the government affair text.
Optionally, the preprocessing specifically includes: deleting punctuation marks and spaces, introducing a domain dictionary into the government affair text data, performing word segmentation on the data, filtering stop words using a general stop-word library, and removing the corresponding stop words from the segmented government affair text data.
Optionally, the government affair text information data includes unstructured government affair text data, which specifically includes: natural text language describing information such as jurisdiction construction and the statistical monitoring of jurisdiction health management.
Optionally, in step S120, the BiGRU model is a bidirectional improved recurrent neural network.
Optionally, the BiGRU model includes a forward GRU model →GRU and a reverse GRU model ←GRU, wherein the forward GRU model reads the feature vector sequence of the keywords in the forward direction and the reverse GRU model reads it in the backward direction.

Each GRU model controls information flow through an update gate z_t and a reset gate r_t. The information propagation process inside the GRU model is as follows:

    r_t = σ(W_r · [h_{t−1}, x_t])
    z_t = σ(W_z · [h_{t−1}, x_t])
    h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

wherein x_t is the input vector; W_r is the weight matrix of the reset gate r_t; W_z is the weight matrix of the update gate z_t; W_h is the weight matrix of the present (candidate) information h̃_t; ⊙ denotes element-wise multiplication; σ is the sigmoid function; and tanh is the hyperbolic tangent function. The present information h̃_t is jointly determined by the past information h_{t−1} and the current input x_t; h_t is the information output at time t, comprising the past information h_{t−1} and the present information h̃_t. The update gate z_t is used for controlling how much historical information is forgotten and how much new information is accepted in the current state, and the reset gate r_t is used for controlling how much information in the candidate state is obtained from the historical information.

The output of the BiGRU model at time t is

    h_t = w_t · →h_t + v_t · ←h_t + b_t

wherein →h_t is the output of the forward GRU model, ←h_t is the output of the reverse GRU model, w_t represents the weight corresponding to →h_t at time t, v_t represents the weight corresponding to ←h_t, and b_t represents the bias term corresponding to time t.
Optionally, in step S120, a MacBERT model is used to extract a word vector, the extracted word vector is used to extract context features through a bidirectional GRU model, and high-level feature vectors of the keywords are generated by concatenation.
Optionally, the subject information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

    P = σ(W_p · h_t + b_p)

wherein W_p is the weight matrix of P and b_p is the bias term of P.
Optionally, the keywords are ranked by their topic information importance P in descending order, and the first eight are selected as the topic information keywords.
The invention further discloses a storage medium for storing computer executable instructions, which is characterized in that:
the computer executable instructions, when executed by a processor, perform the above-described government affairs text subject information extraction method.
Compared with the prior art, the method for extracting the theme information of the government affair text has the following advantages:
1) The invention adopts the MacBERT model, which obtains the keyword feature vectors and alleviates the problem of insufficient local feature extraction capability.
2) The invention adopts the BiGRU model, which captures the semantic information in sentences and obtains the high-level feature vectors of the keywords; the text information is used effectively and parallel computation is adopted, greatly improving the efficiency of subject information extraction.
3) The invention fuses the MacBERT model and the BiGRU model, which improves on the subject information extraction effect of a single model, further improves the extraction accuracy of the subject information, and reduces the overfitting risk of the model.
Drawings
Fig. 1 is a basic flowchart of a method for extracting subject information of a government affair text and a storage medium according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The invention is characterized in that a MacBERT model (MLM-as-correction BERT, a Bidirectional Encoder Representations from Transformers variant in which the Masked Language Model task is reformulated as a correction task) and a BiGRU model (Bidirectional Gated Recurrent Unit) are combined to extract the subject information of unstructured government affair text data. First, a MacBERT layer extracts word vectors to obtain keyword feature vectors; then a BiGRU layer captures semantic information in sentences and extracts high-level feature vectors of the keywords, making the features more discriminative; finally, the importance of each keyword is calculated, the keywords are sorted by importance in descending order, and the most important keywords are selected as the subject information keywords, realizing the subject information extraction of the government affair text.
Referring to fig. 1, a basic flowchart of a method for extracting subject information of a government affairs text and a storage medium according to an embodiment of the present invention is shown.
Data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
specifically, the pretreatment specifically comprises: deleting punctuation marks, blank spaces and the like, introducing a field dictionary into the government affair text data, performing word segmentation processing on the data, filtering stop words by using a general stop word library, and removing corresponding stop words in the government affair text data after word segmentation.
Specifically, in step S110, the unstructured government affair text data includes a natural text language describing information such as construction of the jurisdiction and statistical supervision of health management of the jurisdiction.
Of course, the invention is not limited thereto, and the processing method of the invention can be applied to other government affair text information.
Text feature vector extraction and processing step S120:
A MacBERT model performs word vector extraction on the preprocessed government affair text information data, such as unstructured government affair text data, to obtain keyword feature vectors; then, taking the keyword feature vectors as input, a BiGRU model captures the semantic information in the sentences and optimizes the feature vectors to obtain the high-level feature vectors of the keywords.
Specifically, in step S120, the MacBERT model obtains the keyword feature vectors and alleviates the problem of insufficient local feature extraction capability.
Specifically, in step S120, the BiGRU model is a bidirectional improved recurrent neural network, including a forward GRU model →GRU and a reverse GRU model ←GRU, wherein the forward GRU model reads the feature vector sequence of the keywords in the forward direction and the reverse GRU model reads it in the backward direction.

Each GRU model controls information flow through an update gate z_t and a reset gate r_t. The information propagation process inside the GRU model is as follows:

    r_t = σ(W_r · [h_{t−1}, x_t])
    z_t = σ(W_z · [h_{t−1}, x_t])
    h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t])
    h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t

wherein x_t is the input vector; W_r is the weight matrix of the reset gate r_t; W_z is the weight matrix of the update gate z_t; W_h is the weight matrix of the present (candidate) information h̃_t; ⊙ denotes element-wise multiplication; σ is the sigmoid function; and tanh is the hyperbolic tangent function. The present information h̃_t is jointly determined by the past information h_{t−1} and the current input x_t; h_t is the information output at time t, comprising the past information h_{t−1} and the present information h̃_t. The update gate z_t controls how much historical information needs to be forgotten and how much new information needs to be accepted in the current state, which helps to capture long-term dependencies in the sequence. The reset gate r_t controls how much information in the candidate state is obtained from the historical information, which helps to capture short-term dependencies in the sequence.

The output of the BiGRU model at time t is

    h_t = w_t · →h_t + v_t · ←h_t + b_t

wherein →h_t is the output of the forward GRU model, ←h_t is the output of the reverse GRU model, w_t represents the weight corresponding to →h_t at time t, v_t represents the weight corresponding to ←h_t, and b_t represents the bias term corresponding to time t.
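The gate mechanism described above can be traced numerically with a single-dimension (scalar) GRU step in plain Python; the weights below are illustrative placeholders rather than trained parameters, and bias terms are omitted for brevity:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, Wr, Wz, Wh):
    """One scalar GRU step: reset gate r_t, update gate z_t,
    candidate (present) information h~_t, and the new state h_t.
    Each W* is a pair (weight on h_prev, weight on x_t)."""
    r_t = sigmoid(Wr[0] * h_prev + Wr[1] * x_t)               # reset gate
    z_t = sigmoid(Wz[0] * h_prev + Wz[1] * x_t)               # update gate
    h_cand = math.tanh(Wh[0] * (r_t * h_prev) + Wh[1] * x_t)  # candidate info
    # Update gate balances forgotten history against accepted new info.
    h_t = (1.0 - z_t) * h_prev + z_t * h_cand
    return h_t

# Run a short sequence through the cell with placeholder weights.
h = 0.0
for x in [0.5, -0.2, 0.8]:
    h = gru_step(x, h, Wr=(0.1, 0.4), Wz=(0.2, 0.3), Wh=(0.5, 0.6))
print(round(h, 4))
```

Because the candidate state passes through tanh and the new state is a convex combination of the previous state and the candidate, the hidden state stays bounded in (−1, 1) regardless of sequence length.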
Specifically, in step S120, the MacBERT model extracts word vectors, context features are extracted from the word vectors through the bidirectional GRU model, and the high-level feature vectors of the keywords are generated by concatenation, so as to improve the extraction accuracy of the topic information.
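The bidirectional pass — a forward GRU, a reverse GRU, and a weighted merge of the two outputs at each position — can be sketched as follows. The scalar cell and the merge parameters w, v, b are illustrative placeholders, not the patent's trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x_t, h_prev, W=(0.2, 0.3, 0.1, 0.4, 0.5, 0.6)):
    """Scalar GRU step (placeholder weights, biases omitted)."""
    wr_h, wr_x, wz_h, wz_x, wh_h, wh_x = W
    r = sigmoid(wr_h * h_prev + wr_x * x_t)            # reset gate
    z = sigmoid(wz_h * h_prev + wz_x * x_t)            # update gate
    h_cand = math.tanh(wh_h * (r * h_prev) + wh_x * x_t)
    return (1.0 - z) * h_prev + z * h_cand

def bigru(xs, w=0.5, v=0.5, b=0.0):
    """Run the GRU forward and backward over the sequence, then
    combine per-position outputs as h_t = w * fwd_t + v * bwd_t + b."""
    h = 0.0
    fwd = []
    for x in xs:                           # forward GRU pass
        h = gru_step(x, h)
        fwd.append(h)
    h = 0.0
    bwd = [0.0] * len(xs)
    for i in range(len(xs) - 1, -1, -1):   # reverse GRU pass
        h = gru_step(xs[i], h)
        bwd[i] = h
    return [w * f + v * r + b for f, r in zip(fwd, bwd)]

out = bigru([0.5, -0.2, 0.8])
print([round(o, 4) for o in out])
```

Note that position t sees context from both directions: the forward state summarizes x_1..x_t while the backward state summarizes x_t..x_T, which is what lets the BiGRU capture sentence-level semantics around each keyword.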
Subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the importance of the keywords in a descending order, and selecting the keywords with higher importance as the topic information keywords, thereby realizing the topic information extraction of the government affair text.
Specifically, in step S130, the topic information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

    P = σ(W_p · h_t + b_p)

wherein W_p is the weight matrix of P and b_p is the bias term of P. The model is trained on the data to obtain its optimal parameters.
Specifically, the keywords may be ranked by their topic information importance P in descending order, and the first eight selected as the topic information keywords.
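A minimal sketch of this ranking step, assuming scalar stand-ins W_p and b_p for the trained parameters and hypothetical keyword feature values:

```python
import math

def keyword_importance(h_t, W_p=0.8, b_p=-0.1):
    """P = sigmoid(W_p * h_t + b_p); placeholder scalar parameters."""
    return 1.0 / (1.0 + math.exp(-(W_p * h_t + b_p)))

def top_k_keywords(keywords_with_features, k=8):
    """Rank keywords by importance P in descending order and keep
    the first k as the topic-information keywords."""
    scored = [(kw, keyword_importance(h)) for kw, h in keywords_with_features]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [kw for kw, _ in scored[:k]]

# Hypothetical high-level feature values per keyword.
feats = [("self-built house", 2.1), ("safety hazard", 1.8),
         ("investigation", 1.2), ("weather", -0.5)]
print(top_k_keywords(feats, k=3))
# → ['self-built house', 'safety hazard', 'investigation']
```

Since the sigmoid is monotonic, the ranking follows the raw scores W_p·h_t + b_p; the sigmoid only normalizes each importance into (0, 1).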
A storage medium for storing computer-executable instructions, characterized in that:
the computer executable instructions, when executed by a processor, perform the above-described government affairs text subject information extraction method.
Compared with the prior art, the method for extracting the theme information of the government affair text has the following advantages:
1) The invention adopts the MacBERT model, which obtains the keyword feature vectors and alleviates the problem of insufficient local feature extraction capability.
2) The invention adopts the BiGRU model, which captures the semantic information in sentences and obtains the high-level feature vectors of the keywords; the text information is used effectively and parallel computation is adopted, greatly improving the efficiency of subject information extraction.
3) The invention fuses the MacBERT model and the BiGRU model, which improves on the subject information extraction effect of a single model, improves the extraction accuracy of the subject information, and reduces the overfitting risk of the model.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device, or alternatively, they may be implemented using program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the invention with reference to specific preferred embodiments, and the specific implementation of the invention should not be considered as limited to these descriptions. For a person of ordinary skill in the art, simple deductions or substitutions made without departing from the inventive concept shall be considered to fall within the scope of protection defined by the claims as filed.
Claims (5)
1. A method for extracting subject information of a government affair text is characterized by comprising the following steps:
data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
text feature vector extraction and processing step S120:
extracting word vectors from the preprocessed government affair text data by adopting a MacBERT model to obtain keyword feature vectors, then taking the keyword feature vectors as input, capturing semantic information in sentences through a BiGRU model, and optimizing the feature vectors to obtain high-level feature vectors of the keywords;
subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in the step S120, calculating the importance of the keywords, sorting the importance of the keywords in a descending order, and selecting the keywords with high importance as the topic information keywords to realize the topic information extraction of the government affair text;
in step S120, the BiGRU model is a bidirectional modified recurrent neural network;
the BiGRU model comprises a forward GRU modelAnd reverse GRU modelAmong them forward GRU modelIn which the feature vector of the keyword is input in the forward directionReverse GRU modelUsing inverse inputs for the feature vectors of the keywords,
Each GRU modelBy renewing the doorAnd a reset gateThe information propagation process inside the GRU model is as follows:
wherein, the first and the second end of the pipe are connected with each other,in order to input the vector, the input vector is input,to reset the doorThe weight matrix of (a) is determined,for updating the doorThe weight matrix of (a) is determined,for the present informationThe weight matrix of (a) is determined,to be made intoThe elements are multiplied by each other, and the multiplication,is a function of the sigmoid and is,is a hyperbolic tangent function, now informationFrom past informationAnd the current inputThe decision is made in a joint manner,is composed ofOutputting time information including past informationAnd present informationUpdating doorReset gate for controlling how much history information is forgotten and how much new information is accepted in current stateUsed for controlling how much information in the candidate state is obtained from the history information;
wherein, the first and the second end of the pipe are connected with each other,for the output of the forward GRU model,for the output of the inverse GRU model,to representTime of dayThe weight of the corresponding one of the first and second weights,to representThe weight of the corresponding one of the first and second weights,to representTime of dayThe corresponding bias term;
in step S120, the MacBERT model extracts word vectors, context features are extracted from the word vectors through the bidirectional GRU model, and the high-level feature vectors of the keywords are generated by concatenation;
in step S130, the topic information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

    P = σ(W_p · h_t + b_p)

wherein W_p is the weight matrix of P and b_p is the bias term of P.
2. The subject information extraction method according to claim 1, characterized in that:
the pretreatment specifically comprises: and deleting punctuation marks and spaces, introducing a field dictionary into the government affair text data, performing word segmentation on the data, filtering stop words by using a general stop word bank, and removing corresponding stop words in the divided government affair text data.
3. The subject information extraction method according to claim 2, characterized in that:
the government affair text information data comprise unstructured government affair text data, and specifically comprise: and describing natural text language of the statistical monitoring condition of the construction and health management of the district.
4. The subject information extraction method according to claim 1, characterized in that:
The keywords are ranked by their topic information importance P in descending order, and the first eight are selected as the topic information keywords.
5. A storage medium for storing computer-executable instructions, characterized in that:
the computer-executable instructions, when executed by a processor, perform the method of extracting subject information of government affairs texts according to any one of claims 1 to 4.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402800.7A CN115455155B (en) | 2022-11-10 | 2022-11-10 | Method for extracting subject information of government affair text and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211402800.7A CN115455155B (en) | 2022-11-10 | 2022-11-10 | Method for extracting subject information of government affair text and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115455155A CN115455155A (en) | 2022-12-09 |
CN115455155B true CN115455155B (en) | 2023-03-03 |
Family
ID=84295516
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211402800.7A Active CN115455155B (en) | 2022-11-10 | 2022-11-10 | Method for extracting subject information of government affair text and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115455155B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113535886A (en) * | 2020-04-15 | 2021-10-22 | 北大方正信息产业集团有限公司 | Information processing method, device and equipment |
CN114398877A (en) * | 2022-01-12 | 2022-04-26 | 平安普惠企业管理有限公司 | Theme extraction method and device based on artificial intelligence, electronic equipment and medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11755840B2 (en) * | 2021-02-09 | 2023-09-12 | Tata Consultancy Services Limited | Extracting mentions of complex relation types from documents by using joint first and second RNN layers to determine sentence spans which correspond to relation mentions |
CN114153802A (en) * | 2021-12-03 | 2022-03-08 | 西安交通大学 | Government affair file theme classification method based on Bert and residual self-attention mechanism |
CN114357172A (en) * | 2022-01-07 | 2022-04-15 | 北京邮电大学 | Rumor detection method based on ERNIE-BiGRU-Attention |
CN115310448A (en) * | 2022-08-10 | 2022-11-08 | 南京邮电大学 | Chinese named entity recognition method based on combining bert and word vector |
- 2022-11-10: application CN202211402800.7A filed in China; granted as CN115455155B (active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113535886A (en) * | 2020-04-15 | 2021-10-22 | 北大方正信息产业集团有限公司 | Information processing method, device and equipment |
CN114398877A (en) * | 2022-01-12 | 2022-04-26 | 平安普惠企业管理有限公司 | Theme extraction method and device based on artificial intelligence, electronic equipment and medium |
Non-Patent Citations (1)
Title |
---|
Candidate Sentence Extraction Algorithm for Machine Reading Comprehension; Guo Xin et al.; Computer Science; 2020-05-31; Vol. 47, No. 5; pp. 198-203 *
Also Published As
Publication number | Publication date |
---|---|
CN115455155A (en) | 2022-12-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111144131B (en) | Network rumor detection method based on pre-training language model | |
CN109800310B (en) | Electric power operation and maintenance text analysis method based on structured expression | |
CN111767725B (en) | Data processing method and device based on emotion polarity analysis model | |
CN111368086A (en) | CNN-BilSTM + attribute model-based sentiment classification method for case-involved news viewpoint sentences | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN111222305A (en) | Information structuring method and device | |
CN111325029A (en) | Text similarity calculation method based on deep learning integration model | |
Sartakhti et al. | Persian language model based on BiLSTM model on COVID-19 corpus | |
CN113742733A (en) | Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device | |
CN114925157A (en) | Nuclear power station maintenance experience text matching method based on pre-training model | |
CN107526721A (en) | A kind of disambiguation method and device to electric business product review vocabulary | |
CN114462379A (en) | Improved script learning method and device based on event evolution diagram | |
CN113886601A (en) | Electronic text event extraction method, device, equipment and storage medium | |
Fu et al. | Improving distributed word representation and topic model by word-topic mixture model | |
CN111831783A (en) | Chapter-level relation extraction method | |
Wang et al. | Weighted graph convolution over dependency trees for nontaxonomic relation extraction on public opinion information | |
CN113761192A (en) | Text processing method, text processing device and text processing equipment | |
Behere et al. | Text summarization and classification of conversation data between service chatbot and customer | |
Sairam et al. | Image Captioning using CNN and LSTM | |
Nair et al. | Fake news detection model for regional language | |
CN115455155B (en) | Method for extracting subject information of government affair text and storage medium | |
US20240086643A1 (en) | Visual Dialogue Method and System | |
CN111382333A (en) | Case element extraction method in news text sentence based on case correlation joint learning and graph convolution | |
Bhargava et al. | Deep paraphrase detection in indian languages | |
CN116186241A (en) | Event element extraction method and device based on semantic analysis and prompt learning, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||