CN114021574A - Intelligent analysis and structuring method and system for policy file
- Publication number
- CN114021574A (application CN202210003661.4A)
- Authority
- CN
- China
- Prior art keywords
- policy
- reward
- model
- training
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/185—Hierarchical storage management [HSM] systems, e.g. file migration or policies thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a system for intelligently analyzing and structuring policy files. The method comprises: S1, disassembling the policy file according to its item hierarchy, obtaining the data at the different hierarchy levels, and storing the data in the form of a data structure tree; S2, identifying data in different areas of the data structure tree to obtain the required policy file information; S3, identifying reward measures and the declaration condition information corresponding to them by using a graph convolution network and a trained policy condition reward identification model; S4, training a policy label refining model by adding downstream tasks to the policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to the corresponding industry and industry field. The method saves labor cost, analyzes complex policy texts in depth, and automatically extracts the reward measures and declaration conditions of the policy text.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a system for intelligently analyzing and structuring policy files.
Background
Policies are the goals that government agencies set in order to coordinate the healthy development of society, together with the steps and measures needed to achieve those goals. In particular, to promote economic progress and optimize the industrial structure, authorities frequently issue guiding policy texts, which often contain specific reward measures and corresponding conditions and are generally aimed at enterprises and individuals. Policy rewards are the benefits an enterprise can enjoy, and policy conditions are the requirements that must be met to enjoy them.
Faced with a huge volume of policy texts, individuals and enterprises often find it difficult to work out which rewards they can declare given their own circumstances. Existing policy software and websites merely classify policy texts and do not analyze their reward measures and declaration conditions in depth.
Existing policy analysis technology is very simple. Usually a large number of policies are disassembled manually and the knowledge is summarized into a database; alternatively, regular expressions are used to extract certain fixed expressions from the policy, or natural language processing is used to perform semantic analysis on the policy text.
Therefore, the existing policy resolution technology has the following disadvantages:
1. Manual analysis is time-consuming and laborious, requires expert knowledge, and has an excessively high labor cost.
2. Automatic analysis based on regular expressions depends heavily on the policy texts that the author of the expressions has seen, generalizes very poorly to unseen policy wording, and is prone to rule conflicts that cause the analysis to fail.
3. Methods based on semantic analysis outperform automatic analysis based on regular expressions, but current policy analysis technology only analyzes policies superficially: it cannot accurately identify policy conditions and reward text, lacks support for identifying the relationship between conditions and rewards, generalizes weakly, and has low accuracy.
Policies are highly complex and their texts are extremely long. Traditional semantic analysis cannot directly build an end-to-end model over such long text, so features are lost. As a result, global conditions cannot be matched to the rewards they govern, and only the limiting conditions that appear near the reward text can be identified. In addition, conventional semantic analysis can only recognize condition and reward text in a simple way and cannot consolidate conditions with similar meanings; when the number of policies is very large, the data volume becomes very large, which makes building and subsequently using a database inconvenient.
Therefore, it is very important to design a method and a system for intelligently analyzing and structuring a policy document, which can save labor cost, realize deep analysis of a complex policy document, and automatically extract reward measures and declaration conditions of the policy document.
For example, Chinese patent application No. CN201910542701.0 discloses a policy research and interpretation method, system, storage medium, and server. A policy source file is entered, analyzed, and interpreted, and a knowledge base for guiding enterprise declarations is built from it. Guided by the knowledge base, a user can quickly learn whether they are qualified to declare; if so, the user can submit a declaration request and the system automatically declares the project on the user's behalf. That method studies various government support policies, converts them into indexes that are easy for enterprises to understand, and stores the indexes in the knowledge base, which saves enterprises a large amount of policy interpretation time and improves their declaration efficiency and project pass rate. However, it cannot extract the reward measures of the policy text, so the reward measures and the corresponding declaration condition information cannot be analyzed for quick interpretation, which limits the use of the scheme.
Disclosure of Invention
To overcome the problems in the prior art that existing policy analysis technology relies on manual combing, is time-consuming and laborious, and has an excessively high labor cost, the invention provides an intelligent analysis and structuring method and system for policy files that saves labor cost, analyzes complex policy texts in depth, and automatically extracts the reward measures and declaration conditions of the policy text.
In order to achieve the purpose, the invention adopts the following technical scheme:
The intelligent analysis and structuring method for policy files comprises the following steps:
S1, disassembling the policy file according to its item hierarchy, obtaining the data at the different hierarchy levels, and storing the data in the form of a data structure tree;
S2, identifying data in different areas of the data structure tree to obtain the required policy file information;
S3, using a graph convolution network and a policy condition reward identification model, trained on the basis of the established policy pre-training model, to identify reward measures and the declaration condition information corresponding to them;
S4, on the basis of the established policy pre-training model, training a policy label refining model by adding downstream tasks, refining the declaration condition information into labels, and assigning each reward measure to the corresponding industry and industry field.
Preferably, step S2 includes the steps of:
step S21, identifying the policy name in the head and tail areas of the policy in the data structure tree in a mode of combining the regular expression and named entity identification;
step S22, identifying the issuing department in the head and tail regions of the policy in the data structure tree in a named entity identification mode;
step S23, identifying the issuing region in the head and tail regions of the policy in the data structure tree in a keyword searching mode;
step S24, identifying the issue time and the deadline time in the head and tail regions of the policy in the data structure tree, and describing the time of different styles into a uniform format text.
Preferably, the policy pre-training model is constructed by the following method:
cleaning the acquired real policy texts, authority documents, and Wikipedia texts to remove the non-natural-language parts, which include pictures and links;
splitting the text into sentences at periods, limiting the maximum length to 512, and truncating any part that exceeds the maximum length;
when converting the text into training data, keeping each character unchanged with a probability of 90% and replacing it with the token [MASK] with a probability of 10%;
splicing two sentences together as input in the form [sentence 1, sentence 2] and feeding them into a Bidirectional Encoder Representations from Transformers (BERT) model, whose outputs during training are a prediction of whether sentence 1 and sentence 2 are coherent and a prediction of the character originally at each [MASK] position;
and updating the parameters of the policy pre-training model by back propagation to complete the training.
Preferably, step S3 includes the steps of:
S31, regarding the data structure tree as a directed acyclic graph containing a plurality of nodes, wherein each node corresponds to a section of text, and each section of text is converted into a low-dimensional vector by the policy pre-training model;
S32, using a graph convolution network to combine the low-dimensional vector of each node with the low-dimensional vectors of the surrounding nodes, computing a new vector, and replacing the original low-dimensional vector of the node with the new vector;
S33, identifying, from the new vectors, the reward measures and the declaration condition information corresponding to them;
S34, adding a downstream task module for information recognition on top of the policy pre-training model and training it as the policy condition reward identification model, which recognizes the specific condition information in the declaration condition information and the specific reward information in the reward measures.
Preferably, step S4 includes the steps of:
S41, training the policy label refining model by adding a downstream label refining task to the policy pre-training model, using annotated condition text data that contains labels;
S42, using the policy label refining model to identify condition values and refine the declaration condition information into labels;
S43, assigning each reward measure to the corresponding industry and industry field.
Preferably, step S43 further includes the steps of:
when the declaration condition information corresponding to a reward measure does not indicate an obvious industry or industry field, automatically assigning the reward measure to the industries and industry fields to which the policy file applies.
The invention also provides an intelligent analysis and structuring system for policy files, comprising:
a text layering module, used for disassembling the policy file according to its item hierarchy, obtaining the data at the different hierarchy levels, and storing the data in the form of a data structure tree;
a basic information analysis module, used for identifying data in different areas of the data structure tree to obtain the required policy file information;
a condition reward identification module, used for identifying reward measures and the declaration condition information corresponding to them according to the graph convolution network and the policy condition reward identification model;
and a label analysis module, used for training a policy label refining model by adding downstream tasks to the established policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to the corresponding industry and industry field.
Preferably, the policy document intelligent parsing and structuring system further comprises:
a policy pre-training model building module, used for pre-training the Bidirectional Encoder Representations from Transformers (BERT) model on a plurality of real policy texts, authority documents, and Wikipedia texts to build the policy pre-training model.
Preferably, the conditional reward identification module further comprises:
a policy condition reward identification model building module, used for adding a downstream task module for information recognition on top of the policy pre-training model and training it as the policy condition reward identification model, so as to identify the specific condition information in the declaration condition information and the specific reward information in the reward measures.
Preferably, the policy document intelligent parsing and structuring system further comprises:
a policy label refining model building module, used for training the policy pre-training model with added downstream tasks on annotated condition text data that contains labels, to obtain the policy label refining model.
Compared with the prior art, the invention has the following beneficial effects: (1) the invention creates a rule engine for structuring policy texts that can hierarchically disassemble complex policy texts by item, so that different parts of the text can be fed to different downstream identification tasks, which improves the identification accuracy of those tasks; (2) the invention innovatively incorporates a graph convolution network, taking the text of each item as a node and combining the network's ability to fit node relationships with the structured data produced by the rule engine, so that ultra-long policy texts can be modeled and the relationships between conditions and rewards at all levels can be identified; (3) the invention combines artificial intelligence methods with an expert guidance module, extracts a number of universal labels, and uses a named entity recognition model to extract valuable feature values from lengthy condition text, providing concise and reliable data for subsequent use.
Drawings
FIG. 1 is a flow chart of a method for intelligent parsing and structuring of policy documents in accordance with the present invention;
FIG. 2 is a diagram illustrating a policy document for a certain region according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the present invention for hierarchically parsing the text in FIG. 2;
FIG. 4 is a schematic diagram of the text in FIG. 2 being recognized and parsed by the present invention;
FIG. 5 is a schematic diagram of the invention identifying the specific condition information and reward information in the text of FIG. 2 and labeling the condition information.
Detailed Description
In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.
Example 1:
The intelligent analysis and structuring method for policy files shown in FIG. 1 comprises the following steps:
S1, disassembling the policy file according to its item hierarchy, obtaining the data at the different hierarchy levels, and storing the data in the form of a data structure tree;
S2, identifying data in different areas of the data structure tree to obtain the required policy file information;
S3, using a graph convolution network and a policy condition reward identification model, trained on the basis of the established policy pre-training model, to identify reward measures and the declaration condition information corresponding to them;
S4, on the basis of the established policy pre-training model, training a policy label refining model by adding downstream tasks, refining the declaration condition information into labels, and assigning each reward measure to the corresponding industry and industry field.
Further, step S2 includes the following steps:
step S21, identifying the policy name in the head and tail areas of the policy in the data structure tree in a mode of combining the regular expression and named entity identification;
step S22, identifying the issuing department in the head and tail regions of the policy in the data structure tree in a named entity identification mode;
step S23, identifying the issuing region in the head and tail regions of the policy in the data structure tree in a keyword searching mode;
The issuing region is divided according to administrative divisions into a three-level structure of province (autonomous region, municipality directly under the central government), city, and county or district;
step S24, identifying the release time and the deadline time in the head and tail regions of the policy in the data structure tree, and rewriting times expressed in different styles as text in a uniform format. For example, "five past three in the afternoon of July 8, 2020" and "July 8, 2020, 15:05" are both unified into 2020-07-08 15:05.
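The time unification in step S24 can be sketched as follows. This is a minimal illustration, not the patented implementation: the two regular expressions, the function name `normalize_time`, and the choice to handle only Arabic-numeral dates are all assumptions made for the example, and a date without a time of day is rendered as 00:00.

```python
import re
from datetime import datetime

# Illustrative patterns only (assumptions, not taken from the patent).
# Style 1: 2020-7-8 15:05, 2020/07/08, 2020.7.8
ISO_LIKE = re.compile(
    r"(?P<y>\d{4})[-/.](?P<m>\d{1,2})[-/.](?P<d>\d{1,2})"
    r"(?:\s+(?P<H>\d{1,2}):(?P<M>\d{2}))?"
)
# Style 2: 2020年7月8日下午3点05分, 2020年7月8号15:05
CHINESE = re.compile(
    r"(?P<y>\d{4})年(?P<m>\d{1,2})月(?P<d>\d{1,2})[日号]"
    r"(?:(?P<ampm>上午|下午)?(?P<H>\d{1,2})[点时:](?P<M>\d{1,2})分?)?"
)


def normalize_time(text):
    """Return the first recognized date/time in `text` as 'YYYY-MM-DD HH:MM', or None."""
    for pattern in (CHINESE, ISO_LIKE):
        match = pattern.search(text)
        if match is None:
            continue
        g = match.groupdict()
        hour = int(g["H"] or 0)
        minute = int(g["M"] or 0)
        # 下午 means "afternoon": convert a 12-hour reading to 24-hour.
        if g.get("ampm") == "下午" and hour < 12:
            hour += 12
        try:
            moment = datetime(int(g["y"]), int(g["m"]), int(g["d"]), hour, minute)
        except ValueError:
            continue  # impossible date such as month 13; try the next pattern
        return moment.strftime("%Y-%m-%d %H:%M")
    return None


print(normalize_time("2020年7月8日下午3点05分"))
print(normalize_time("2020-7-8 15:05"))
```

Times written with Chinese numerals, relative expressions, and date ranges would need further rules that are not shown here.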
Further, the policy pre-training model is constructed by the following method:
cleaning the acquired real policy texts, authority documents, and Wikipedia texts to remove the non-natural-language parts, which include pictures and links;
splitting the text into sentences at periods, limiting the maximum length to 512, and truncating any part that exceeds the maximum length;
when converting the text into training data, keeping each character unchanged with a probability of 90% and replacing it with the token [MASK] with a probability of 10%;
splicing two sentences together as input in the form [sentence 1, sentence 2] and feeding them into a Bidirectional Encoder Representations from Transformers (BERT) model, whose outputs during training are a prediction of whether sentence 1 and sentence 2 are coherent and a prediction of the character originally at each [MASK] position;
and updating the parameters of the policy pre-training model by back propagation to complete the training.
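The data preparation described above (splitting at periods, truncating to 512 characters, and masking each character with a probability of 10%) can be sketched in a few lines. The function names, the treatment of both "。" and "." as sentence ends, and the use of a seeded `random.Random` are assumptions for illustration; the model training itself is not shown.

```python
import random

MASK = "[MASK]"
MAX_LEN = 512  # maximum sentence length stated in the description


def split_sentences(text, max_len=MAX_LEN):
    """Split at Chinese or Western periods and truncate each piece to max_len characters."""
    pieces, current = [], []
    for ch in text:
        current.append(ch)
        if ch in "。.":
            pieces.append("".join(current))
            current = []
    if current:
        pieces.append("".join(current))
    return [p.strip()[:max_len] for p in pieces if p.strip()]


def mask_characters(sentence, mask_prob=0.10, rng=None):
    """Replace each character with [MASK] with probability mask_prob.

    Returns (tokens, labels): labels[i] holds the original character at a
    masked position and None elsewhere, so a model can be trained to
    predict the characters hidden behind [MASK].
    """
    rng = rng or random.Random()
    tokens, labels = [], []
    for ch in sentence:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            labels.append(ch)
        else:
            tokens.append(ch)
            labels.append(None)
    return tokens, labels


sentences = split_sentences("First sentence. Second sentence.")
tokens, labels = mask_characters(sentences[0], rng=random.Random(0))
```

Splitting at every "." also breaks decimals and abbreviations, so a production cleaner would need a more careful sentence boundary rule.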
Further, step S3 includes the following steps:
S31, regarding the data structure tree as a directed acyclic graph containing a plurality of nodes, wherein each node corresponds to a section of text, and each section of text is converted into a low-dimensional vector by the policy pre-training model;
S32, using a graph convolution network to combine the low-dimensional vector of each node with the low-dimensional vectors of the surrounding nodes, computing a new vector, and replacing the original low-dimensional vector of the node with the new vector;
S33, identifying, from the new vectors, the reward measures and the declaration condition information corresponding to them;
S34, adding a downstream task module for information recognition on top of the policy pre-training model and training it as the policy condition reward identification model, which recognizes the specific condition information in the declaration condition information and the specific reward information in the reward measures.
The downstream task module works as follows: a text is input, the policy pre-training model produces a vector for each character in the text, Softmax is used to calculate the probability that each character belongs to condition information or reward information, and the specific condition text or specific reward text of the input is finally obtained.
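The per-character Softmax classification of the downstream task module can be illustrated as below. The three-label set, the function names, and the hand-written score rows are assumptions; in the described system the scores would come from a trained layer on top of the policy pre-training model.

```python
import math

# Hypothetical label set; the patent only names condition and reward information.
LABELS = ["other", "condition", "reward"]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_spans(text, char_scores, labels=LABELS):
    """Label each character with its most probable class and merge equal neighbours.

    char_scores[i] holds one raw score per label for text[i].
    Returns (label, substring) pairs, skipping the 'other' class.
    """
    spans = []
    for ch, scores in zip(text, char_scores):
        probs = softmax(scores)
        label = labels[probs.index(max(probs))]
        if spans and spans[-1][0] == label:
            spans[-1] = (label, spans[-1][1] + ch)
        else:
            spans.append((label, ch))
    return [(label, chunk) for label, chunk in spans if label != "other"]


# Toy scores standing in for model output (illustrative values only).
toy_scores = [[0, 5, 0], [0, 4, 1], [6, 0, 0], [0, 0, 5], [0, 1, 4]]
print(extract_spans("AB-cd", toy_scores))
```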
Further, step S4 includes the following steps:
S41, training the policy label refining model by adding a downstream label refining task to the policy pre-training model, using annotated condition text data that contains labels;
The added downstream task module works as follows: given a text and its labels, the policy pre-training model produces a vector for each character in the text, and a conditional random field then combines the vectors and their order to obtain the probability that each character in the text belongs to a given label.
S42, using the policy label refining model to identify condition values and refine the declaration condition information into labels;
S43, assigning each reward measure to the corresponding industry and industry field.
Further, step S43 includes the following steps:
when the declaration condition information corresponding to a reward measure does not indicate an obvious industry or industry field, automatically assigning the reward measure to the industries and industry fields to which the policy file applies.
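Step S41 uses a conditional random field to take the order of the character vectors into account. The description does not say how the best label sequence is chosen; a common choice for a linear-chain conditional random field is Viterbi decoding, sketched here under that assumption. The emission and transition scores are toy values, not trained parameters.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for arriving at label j.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Labels: 0 = outside, 1 = begin, 2 = inside (a hypothetical tagging scheme).
emissions = [[2, 1, 0], [0, 0, 3], [0, 0, 2]]
transitions = [[0, 0, -100], [0, 0, 0], [0, 0, 0]]  # forbid outside -> inside
print(viterbi(emissions, transitions))
```

Choosing the best label at each position independently would give outside, inside, inside here; the transition penalty makes the decoder start the span with a begin label instead, which is the kind of ordering constraint a conditional random field contributes.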
As shown in FIG. 1, the invention also provides an intelligent analysis and structuring system for policy files, comprising:
a text layering module, used for disassembling the policy file according to its item hierarchy, obtaining the data at the different hierarchy levels, and storing the data in the form of a data structure tree;
a basic information analysis module, used for identifying data in different areas of the data structure tree to obtain the required policy file information;
a condition reward identification module, used for identifying reward measures and the declaration condition information corresponding to them according to the graph convolution network and the policy condition reward identification model;
and a label analysis module, used for training a policy label refining model by adding downstream tasks to the established policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to the corresponding industry and industry field.
Further, the intelligent policy file parsing and structuring system further comprises:
a policy pre-training model building module, used for pre-training the Bidirectional Encoder Representations from Transformers (BERT) model on a plurality of real policy texts, authority documents, and Wikipedia texts to build the policy pre-training model.
Further, the conditional reward identification module further comprises:
a policy condition reward identification model building module, used for adding a downstream task module for information recognition on top of the policy pre-training model and training it as the policy condition reward identification model, so as to identify the specific condition information in the declaration condition information and the specific reward information in the reward measures.
Further, the intelligent policy file parsing and structuring system further comprises:
a policy label refining model building module, used for training the policy pre-training model with added downstream tasks on annotated condition text data that contains labels, to obtain the policy label refining model.
Based on the technical solution of the invention, the policy file analysis process in a specific implementation is shown in FIGS. 2 to 5:
First, the format of the input data is determined, and the characters in the policy file are extracted with the appropriate tool. If the file is an image, the text is extracted by OCR; PDF and DOC files are converted and extracted directly with the corresponding parsing toolkits.
Taking the policy "Policy of a certain area for supporting the development of an industry in 2018" as an example, the input text is shown in FIG. 2, with part of the text omitted.
The text is layered by the rule engine provided by the invention, as shown in FIG. 3. Using the item labels in the text, the rule engine divides the text into levels and organizes each label together with the content beneath it, which implements step S1.
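A rule engine of this kind can be sketched as a list of numbering patterns, one per level, plus a stack that tracks the current branch of the tree. The three patterns below (for markers such as 第一条, （一）, and 1.) and the names `Node`, `level_of`, and `build_tree` are assumptions chosen for illustration; the patent does not list its actual rules.

```python
import re
from dataclasses import dataclass, field

# Hypothetical numbering rules, ordered from the outermost level to the innermost.
LEVEL_RULES = [
    re.compile(r"^第[一二三四五六七八九十百]+[章条]"),  # e.g. 第一条 (Article 1)
    re.compile(r"^[（(][一二三四五六七八九十]+[)）]"),  # e.g. （一）
    re.compile(r"^\d+[.、]"),                            # e.g. 1.
]


@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)


def level_of(line):
    """Return the 1-based depth of a numbered item, or None for plain text."""
    for depth, rule in enumerate(LEVEL_RULES, start=1):
        if rule.match(line):
            return depth
    return None


def build_tree(lines):
    """Organize numbered lines into a tree; unnumbered text joins the latest item."""
    root = Node(text="", level=0)
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        depth = level_of(line)
        if depth is None:
            target = stack[-1]
            target.text = (target.text + "\n" + line).strip()
            continue
        # Climb back up until the top of the stack is a shallower level.
        while stack[-1].level >= depth:
            stack.pop()
        node = Node(text=line, level=depth)
        stack[-1].children.append(node)
        stack.append(node)
    return root


tree = build_tree(["某区产业扶持政策", "第一条 总则", "（一）适用范围", "1. 企业", "第二条 奖励"])
print([child.text for child in tree.children])
```

In this sketch, text that precedes the first numbered item (the head of the policy) is stored on the root node, while trailing text such as a signature block is attached to the last item, so a fuller implementation would need a separate rule for the tail area used in step S2.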
Then the head and tail data of the layered content produced by the text layering module are sent to the basic information analysis module, which extracts the basic information shown in FIG. 4.
Next, the graph convolution network model identifies the condition-reward relationships among all paragraphs of the text. Specifically, the text of each node is vectorized by the BERT model, the vectors are fed into the graph convolution network according to the hierarchical structure, and the network fits the relationships among the nodes and then identifies them. For example, some of the restrictions in Article 10, Supplementary Provisions, apply to all sub-policies in the policy; that is, to benefit from a given sub-policy, an applicant must satisfy the restrictions in Article 10 in addition to the conditions in the sub-policy itself. The condition-reward relationship identified by the graph convolution network model is shown as module A in FIG. 5. The policy condition reward identification model then further identifies the condition information and the reward information, yielding the condition-reward relationship between them.
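The description does not give the propagation rule of the graph convolution network. One widely used form is H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), where A is the adjacency matrix of the tree, I adds self-loops, D is the degree matrix of A + I, H holds the node vectors, and W is a trainable weight matrix. The sketch below implements one such layer in plain Python, assuming that rule and treating the parent-child edges as undirected; the toy vectors and the identity weight matrix stand in for BERT outputs and trained weights.

```python
import math


def gcn_layer(features, edges, weight):
    """One graph-convolution step with symmetric normalization and ReLU.

    features : list of node vectors (n x in_dim)
    edges    : list of (parent, child) index pairs from the data structure tree
    weight   : projection matrix (in_dim x out_dim)
    """
    n = len(features)
    in_dim, out_dim = len(weight), len(weight[0])
    # Adjacency with self-loops; tree edges are treated as undirected (an assumption).
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for parent, child in edges:
        adj[parent][child] = 1.0
        adj[child][parent] = 1.0
    degree = [sum(row) for row in adj]
    new_features = []
    for i in range(n):
        # Aggregate the normalized vectors of node i and its neighbours.
        agg = [0.0] * in_dim
        for j in range(n):
            if adj[i][j]:
                coeff = 1.0 / math.sqrt(degree[i] * degree[j])
                for k in range(in_dim):
                    agg[k] += coeff * features[j][k]
        # Linear projection followed by ReLU.
        new_features.append([
            max(0.0, sum(agg[k] * weight[k][o] for k in range(in_dim)))
            for o in range(out_dim)
        ])
    return new_features


# Node 0 is a parent item with two child items (toy 2-dimensional vectors).
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = gcn_layer(vectors, [(0, 1), (0, 2)], identity)
```

After one layer each node's vector already mixes in its parent's or children's content, and stacking layers lets information from a distant node, such as a supplementary provision, reach the sub-policies it constrains.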
Finally, the conditions parsed in the previous step need to be labeled: as shown in module B of FIG. 5, the label information in the condition statements identified in module A is recognized. That is, each condition is refined again and assigned to a label, which facilitates subsequent retrieval and use. For example, consider the following sub-policy:
"A one-time reward of 200,000 yuan is given to a medical institution in a certain area that obtains the national drug (device) clinical trial institution (GCP) qualification (Phase I, II, III) for the first time."
The parsed condition is:
[A medical institution in a certain area obtains the national drug (device) clinical trial institution (GCP) qualification (Phase I, II, III) for the first time]
The refined labels are as follows:
Enterprise location: a certain area;
Enterprise type: medical institution;
Enterprise qualification: national drug (device) clinical trial institution (GCP) qualification (Phase I, II, III).
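To illustrate why refined labels make retrieval easier, the sketch below stores the labels of the example sub-policy as key-value pairs and filters on them. The `SubPolicy` class, the label keys, and the `find_policies` helper are hypothetical names chosen for this example; the patent does not specify a storage schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SubPolicy:
    reward: str               # reward measure of the sub-policy
    labels: Dict[str, str]    # refined condition labels (assumed key names)

def find_policies(policies: List[SubPolicy], **wanted: str) -> List[SubPolicy]:
    """Return the sub-policies whose labels match every requested key-value pair."""
    return [
        p for p in policies
        if all(p.labels.get(key) == value for key, value in wanted.items())
    ]

gcp_policy = SubPolicy(
    reward="one-time reward of 200,000 yuan",
    labels={
        "enterprise_location": "a certain area",
        "enterprise_type": "medical institution",
        "enterprise_qualification": "GCP qualification (Phase I, II, III)",
    },
)

matches = find_policies([gcp_policy], enterprise_type="medical institution")
```

Exact string matching is the simplest possible lookup; a production system would more likely normalize label values or index them in a search engine.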
The method integrates several technical means. Policy parsing based on the rule engine and semantic analysis can automatically parse complex and overlong policy documents, and the generic features extracted by experts narrow the range of conditions, so the labor required for policy combing is greatly reduced and users can retrieve the parsed policy information simply, quickly, and accurately. By combining the rule engine with the graph convolution network's ability to aggregate text features, the method refines the condition information and reduces the search space for rewards, which greatly enhances its policy parsing capability.
The foregoing describes the preferred embodiments and principles of the present invention. Those skilled in the art may devise variations of the present invention that remain within the spirit and scope of the appended claims.
Claims (10)
1. An intelligent analysis and structuring method for a policy document, characterized by comprising the following steps:
S1, disassembling the policy file according to its item hierarchy to obtain data of different hierarchy levels, and storing the data in the form of a data structure tree;
S2, identifying data in different areas of the data structure tree to obtain the required policy file information data;
S3, using a graph convolution network and a policy condition reward identification model, trained on the basis of the established policy pre-training model, to identify information containing reward measures and the declaration condition information corresponding to those reward measures;
S4, training a policy label refining model by adding a downstream task to the established policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to its corresponding industry and industry field.
2. The intelligent parsing and structuring method for policy documents according to claim 1, wherein step S2 comprises the following steps:
step S21, identifying the policy name in the head and tail areas of the policy in the data structure tree by combining regular expressions with named entity recognition;
step S22, identifying the issuing department in the head and tail areas of the policy in the data structure tree by named entity recognition;
step S23, identifying the issuing region in the head and tail areas of the policy in the data structure tree by keyword search;
step S24, identifying the issuance time and the deadline in the head and tail areas of the policy in the data structure tree, and rewriting times expressed in different styles as text in a uniform format.
3. The intelligent parsing and structuring method for policy documents according to claim 1, wherein the policy pre-training model is constructed by the following steps:
performing data cleaning on the acquired real policy texts, documents issued by organs of state power, and Wikipedia texts to remove the non-natural-language parts, wherein the non-natural-language parts comprise pictures and links;
splitting the text at periods, limiting the maximum length to 512, and truncating any part that exceeds the maximum length;
when converting the text into the data required for training, keeping each character unchanged with a probability of 90% and replacing the current character with the token [MASK] with a probability of 10%;
splicing two sentences together as input and feeding them, in the form [sentence 1, sentence 2], into a Transformer-based bidirectional encoder representation (BERT) model, wherein the outputs of the model during training are: a prediction of whether sentence 1 and sentence 2 are coherent, and a prediction of the character originally at each [MASK] position;
and updating the parameters of the policy pre-training model by back propagation to complete the training.
4. The intelligent parsing and structuring method for policy documents according to claim 1, wherein step S3 comprises the following steps:
S31, regarding the data structure tree as a directed acyclic graph containing a plurality of nodes, wherein each node corresponds to a section of text, and each section of text is converted into a low-dimensional vector by the policy pre-training model;
S32, using a graph convolution network to combine the low-dimensional vector of each node with the low-dimensional vectors of its surrounding nodes, computing a new vector, and replacing the original low-dimensional vector of the node with the new vector;
S33, identifying, according to the new vectors, the information containing reward measures and the declaration condition information corresponding to the reward measures;
S34, adding a downstream task module for information recognition on the basis of the policy pre-training model and training it as the policy condition reward identification model, which recognizes the specific condition information in the declaration condition information and the specific reward information in the reward measures.
5. The intelligent parsing and structuring method for policy documents according to claim 1, wherein step S4 comprises the following steps:
S41, training the policy label refining model by adding a downstream label refining task to the policy pre-training model, using annotated conditional text data that contains labels;
S42, using the policy label refining model to identify condition values and refine the declaration condition information into labels;
S43, assigning each reward measure to its corresponding industry and industry field.
6. The intelligent parsing and structuring method for policy documents according to claim 5, wherein step S43 further comprises the steps of:
when the declaration condition information corresponding to a reward measure indicates no obvious industry or industry field, automatically assigning the reward measure to the industries and industry fields to which the policy document applies.
7. An intelligent policy document parsing and structuring system, characterized by comprising:
a text layering module, used for disassembling the policy file according to its item hierarchy to obtain data of different hierarchy levels and storing the data in the form of a data structure tree;
a basic information analysis module, used for identifying data in different areas of the data structure tree to obtain the required policy file information data;
a condition reward identification module, used for identifying, according to the graph convolution network and the policy condition reward identification model, information containing reward measures and the declaration condition information corresponding to those reward measures;
and a label analysis module, used for training a policy label refining model by adding a downstream task to the established policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to its corresponding industry and industry field.
8. The intelligent policy document parsing and structuring system according to claim 7, further comprising:
a policy pre-training model building module, used for pre-training a Transformer-based bidirectional encoder representation (BERT) model with a plurality of real policy texts, documents issued by organs of state power, and Wikipedia texts to build the policy pre-training model.
9. The intelligent policy document parsing and structuring system according to claim 7, wherein the condition reward identification module further comprises:
a policy condition reward identification model building module, used for adding a downstream task module for information recognition on the basis of the policy pre-training model and training it as the policy condition reward identification model, so as to recognize the specific condition information in the declaration condition information and the specific reward information in the reward measures.
10. The intelligent policy document parsing and structuring system according to claim 8, further comprising:
a policy label refining model building module, used for training the policy pre-training model, by adding a downstream task, on annotated conditional text data that contains labels, to obtain the policy label refining model.
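The data preparation recited in claim 3 (splitting at periods, truncating to 512 characters, masking 10% of characters, and pairing sentences) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the dictionary layout of a training example, and the choice to split on both Chinese and Western full stops are not from the patent, and no model training is shown.

```python
import random
import re
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"
MAX_LEN = 512  # maximum sentence length stated in claim 3

def split_sentences(text: str) -> List[str]:
    """Split text at periods and truncate each sentence to MAX_LEN characters."""
    parts = re.split(r"[。.]", text)  # Chinese or Western full stop (assumed)
    return [p.strip()[:MAX_LEN] for p in parts if p.strip()]

def mask_characters(
    sentence: str, rng: random.Random, mask_prob: float = 0.10
) -> Tuple[List[str], List[Optional[str]]]:
    """Keep each character with probability 1 - mask_prob, else replace it with [MASK].

    Returns the token list and, per position, the original character to predict
    (None where the character was left unchanged).
    """
    tokens: List[str] = []
    targets: List[Optional[str]] = []
    for ch in sentence:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            targets.append(ch)
        else:
            tokens.append(ch)
            targets.append(None)
    return tokens, targets

def make_training_pair(
    s1: str, s2: str, coherent: bool, rng: random.Random
) -> Dict[str, object]:
    """Build one [sentence 1, sentence 2] example with its two training targets."""
    t1, y1 = mask_characters(s1, rng)
    t2, y2 = mask_characters(s2, rng)
    return {"input": [t1, t2], "mask_targets": [y1, y2], "coherent": coherent}
```

Passing a seeded `random.Random` makes the masking reproducible, which helps when comparing training runs; the `coherent` flag would be true for consecutive sentences and false for randomly paired ones.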
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210003661.4A CN114021574B (en) | 2022-01-05 | 2022-01-05 | Intelligent analysis and structuring method and system for policy file |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114021574A (en) | 2022-02-08 |
CN114021574B (en) | 2022-05-17 |
Family
ID=80069722
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210003661.4A Active CN114021574B (en) | 2022-01-05 | 2022-01-05 | Intelligent analysis and structuring method and system for policy file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114021574B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115525620A (en) * | 2022-10-09 | 2022-12-27 | 金恒智控管理咨询集团股份有限公司 | Method for generating internal control flow based on policy file |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190080225A1 (en) * | 2017-09-11 | 2019-03-14 | Tata Consultancy Services Limited | Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof |
CN110245225A (en) * | 2019-06-21 | 2019-09-17 | 广东政沣云计算有限公司 | A kind of research in policy deciphering method, system, storage medium and server |
CN110609983A (en) * | 2019-08-19 | 2019-12-24 | 广州利科科技有限公司 | Structured decomposition method for policy file |
CN110968776A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Policy knowledge recommendation method, device storage medium and processor |
CN112035653A (en) * | 2020-11-05 | 2020-12-04 | 北京智源人工智能研究院 | Policy key information extraction method and device, storage medium and electronic equipment |
US20210073330A1 (en) * | 2019-09-11 | 2021-03-11 | International Business Machines Corporation | Creating an executable process from a text description written in a natural language |
CN112529071A (en) * | 2020-12-08 | 2021-03-19 | 广州大学华软软件学院 | Text classification method, system, computer equipment and storage medium |
Non-Patent Citations (3)
Title |
---|
ZHEN YU; JIALONG HE; QINQUAN GAO; JIAJIA LIU: "System Design of OE Mail File Parsing Based on DBX File Analysis", 《IEEE》 * |
卢章平等: "国家和地方科技成果转化政策对比分析", 《图书情报工作》 * |
王超群等: "基于政策工具与政策目标双重视角的我国网络视听产业政策分析", 《科技广场》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114021574B (en) | 2022-05-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||