CN114021574A

CN114021574A - Intelligent analysis and structuring method and system for policy file

Info

Publication number: CN114021574A
Application number: CN202210003661.4A
Authority: CN
Inventors: 赵康康; 夏聪
Original assignee: Hangzhou Real Intelligence Technology Co ltd
Current assignee: Hangzhou Real Intelligence Technology Co ltd
Priority date: 2022-01-05
Filing date: 2022-01-05
Publication date: 2022-02-08
Anticipated expiration: 2042-01-05
Also published as: CN114021574B

Abstract

The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a system for intelligently analyzing and structuring policy files. The method comprises: S1, disassembling the policy file according to its item hierarchy to obtain data at different hierarchy levels, and storing the data; S2, identifying data in different regions of the data structure tree to obtain the required policy file information; S3, identifying the reward measures and the declaration condition information corresponding to them by using a graph convolution network and a trained policy condition reward identification model; S4, training a policy label refining model by adding a downstream task to the policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to its corresponding industry and industry field. The method saves labor cost, achieves deep analysis of complex policy texts, and automatically extracts the reward measures and declaration conditions of a policy text.

Description

Intelligent analysis and structuring method and system for policy file

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a system for intelligently analyzing and structuring policy files.

Background

Policies are the goals that government agencies set in order to guide the healthy development of society, together with the steps and measures needed to achieve those goals. In particular, to promote economic progress and optimize the industrial structure, authorities frequently issue guiding policy texts, which often include specific reward measures and the corresponding conditions, and which are generally aimed at enterprises and individuals. Policy rewards are the benefits that an enterprise can enjoy, and policy conditions are the requirements that must be met to enjoy them.

Faced with a huge volume of policy texts, individuals and enterprises often have difficulty identifying and declaring the rewards for which their own circumstances qualify. Existing policy software and websites only classify policy texts in a simple way and do not analyze their reward measures and declaration conditions in depth.

Existing policy analysis technology is very simple. Usually a large number of policies are disassembled manually and the knowledge is summarized into a database; alternatively, regular expressions are used to disassemble certain fixed expressions in a policy; or natural language processing technology is used to perform semantic analysis on the policy text.

Therefore, existing policy analysis technology has the following disadvantages:

1. Manual analysis is time-consuming and laborious, requires a degree of expert knowledge, and incurs excessive labor cost.

2. Automatic analysis based on regular expressions depends heavily on the policy texts that the author of the regular expressions has seen; it generalizes extremely poorly to unseen policy wording, and regular-expression rules easily conflict with one another, causing the analysis to fail.

3. Methods based on semantic analysis are superior to automatic analysis based on regular expressions, but current policy analysis technology performs only simple analysis of a policy. For example, it cannot accurately identify policy condition text and reward text, it lacks support for identifying the relationship between policy conditions and rewards, and its generalization and accuracy are weak.

Policies are highly complex and their texts are extremely long. Traditional semantic analysis technology cannot directly build an end-to-end model for such ultra-long text, so features are lost; as a result, global conditions cannot be matched to rewards, and only the partial limiting conditions that appear near the reward text can be identified. In addition, conventional semantic analysis technology can only recognize the text of conditions and rewards; it cannot consolidate conditions with similar meanings, so when the number of policies is very large the data volume becomes very large, which makes the database inconvenient to build and to use.

Therefore, it is very important to design a method and a system for intelligently analyzing and structuring a policy document, which can save labor cost, realize deep analysis of a complex policy document, and automatically extract reward measures and declaration conditions of the policy document.

For example, Chinese patent application No. CN201910542701.0 discloses a policy research and interpretation method, system, storage medium, and server, in which policy source files are entered, analyzed, and interpreted, and a knowledge base for guiding enterprise declarations is built from them. Through the guidance of the knowledge base, a user can quickly learn whether the user is qualified to declare; if so, the user can submit a declaration request to the system, and the system automatically declares the project for the user. That scheme studies various government support policies, converts them into indexes that enterprises can easily understand, and records them in the knowledge base, so that enterprises can learn and understand the policies quickly; this saves enterprises a great deal of policy interpretation time, improves declaration efficiency and project pass rates, and meets usage requirements. Its drawback is that it cannot extract the reward measures of a policy text, and therefore cannot parse the reward measures of a policy and the corresponding declaration condition information for rapid interpretation, which limits the use of the scheme.

Disclosure of Invention

The invention aims to solve the problems in the prior art that existing policy analysis technology relies on manual combing, is time-consuming and laborious, and incurs excessive labor cost. To that end it provides an intelligent analysis and structuring method and system for policy files that saves labor cost, achieves deep analysis of complex policy texts, and automatically extracts the reward measures and declaration conditions of a policy text.

In order to achieve the purpose, the invention adopts the following technical scheme:

The intelligent analysis and structuring method for policy files comprises the following steps:

S1, disassembling the policy file according to its item hierarchy to obtain data at different hierarchy levels, and storing the data in the form of a data structure tree;

S2, identifying data in different regions of the data structure tree to obtain the required policy file information;

S3, identifying the reward measures and the declaration condition information corresponding to them by using a graph convolution network and a policy condition reward identification model trained on the basis of the established policy pre-training model;

S4, training a policy label refining model by adding a downstream task to the established policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to its corresponding industry and industry field.
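As an illustration of step S1 only, the following is a minimal sketch of disassembling a text into a data structure tree by its item markers. The three marker patterns, the `Node` class, and the function names are assumptions made for this example; the source does not specify which markers its rule engine recognizes.

```python
import re
from dataclasses import dataclass, field

# Hypothetical marker patterns, ordered from the outermost level to the innermost.
LEVEL_PATTERNS = [
    re.compile(r"^[一二三四五六七八九十]+、"),    # level 1, e.g. "一、"
    re.compile(r"^（[一二三四五六七八九十]+）"),  # level 2, e.g. "（一）"
    re.compile(r"^\d+[\.．]"),                   # level 3, e.g. "1."
]


@dataclass
class Node:
    text: str
    level: int
    children: list = field(default_factory=list)


def marker_level(line):
    """Return the 1-based level of the item marker that starts the line, or None."""
    for level, pattern in enumerate(LEVEL_PATTERNS, start=1):
        if pattern.match(line):
            return level
    return None


def build_tree(lines):
    """Disassemble lines into a tree; unmarked lines are appended to the current node."""
    root = Node(text="", level=0)
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        level = marker_level(line)
        if level is None:
            # Continuation text belongs to the most recently opened item.
            stack[-1].text = (stack[-1].text + "\n" + line).strip()
            continue
        # Close items at the same or a deeper level before opening the new one.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(text=line, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root
```

In this sketch the root node collects any text that precedes the first marker, which roughly corresponds to the head region of the policy used in step S2.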

Preferably, step S2 includes the steps of:

step S21, identifying the policy name in the head and tail regions of the policy in the data structure tree by combining regular expressions with named entity recognition;

step S22, identifying the issuing department in the head and tail regions of the policy in the data structure tree by named entity recognition;

step S23, identifying the issuing region in the head and tail regions of the policy in the data structure tree by keyword search;

step S24, identifying the issue time and the deadline in the head and tail regions of the policy in the data structure tree, and converting time expressions of different styles into text in a uniform format.

Preferably, the policy pre-training model is constructed by the following method:

performing data cleaning on the collected real policy texts, authority documents, and Wikipedia texts to remove the non-natural-language parts, which include pictures and links;

splitting the text into sentences at full stops, limiting the maximum length to 512, and truncating any part that exceeds the maximum length;

when converting the text into the data required for training, keeping each character unchanged with a probability of 90% and replacing it with the token [MASK] with a probability of 10%;

splicing two sentences together as input and feeding them, in the form [sentence 1, sentence 2], into a Bidirectional Encoder Representations from Transformers (BERT) model, whose outputs during training are a prediction of whether sentence 1 and sentence 2 are coherent and a prediction of the character originally at each [MASK] position;

and updating the parameters of the policy pre-training model by back propagation to complete the training.
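The data preparation described above can be sketched as follows. This is a simplified illustration, not the applicant's implementation: the function names are hypothetical, and the 50% rate of coherent pairs and the random choice of incoherent partners are assumptions, since the source does not say how sentence pairs are sampled.

```python
import random

MASK = "[MASK]"


def split_sentences(text, max_len=512):
    """Split on the Chinese full stop and truncate each sentence to max_len characters."""
    parts = [p.strip() for p in text.split("。")]
    return [p[:max_len] for p in parts if p]


def mask_characters(sentence, mask_prob=0.10, rng=random):
    """Keep each character with probability 1 - mask_prob, else replace it with [MASK]."""
    tokens, targets = [], []
    for ch in sentence:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            targets.append(ch)    # the model must recover this character
        else:
            tokens.append(ch)
            targets.append(None)  # no prediction target at this position
    return tokens, targets


def make_pair(sentences, index, rng=random):
    """Build a (sentence 1, sentence 2, is_coherent) training example."""
    if len(sentences) < 3:
        raise ValueError("need at least three sentences to sample an incoherent pair")
    first = sentences[index]
    # Assumption: half of the pairs use the true next sentence.
    if index + 1 < len(sentences) and rng.random() < 0.5:
        return first, sentences[index + 1], True
    # Assumption: incoherent partners are drawn at random from the other sentences.
    candidates = [i for i in range(len(sentences)) if i not in (index, index + 1)]
    return first, sentences[rng.choice(candidates)], False
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible.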

Preferably, step S3 includes the steps of:

S31, treating the data structure tree as a directed acyclic graph containing a plurality of nodes, wherein each node corresponds to a piece of text, and each piece of text is encoded as a low-dimensional vector by the policy pre-training model;

S32, using a graph convolution network to combine the low-dimensional vector of each node with the low-dimensional vectors of its surrounding nodes, computing a new vector, and replacing the node's original low-dimensional vector with the new vector;

S33, identifying, from the new vectors, the reward measures and the declaration condition information corresponding to the reward measures;

S34, adding a downstream task module for information recognition on top of the policy pre-training model and training it as the policy condition reward identification model, which identifies the specific condition information in the declaration condition information and the specific reward information in the reward measures.
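The neighbourhood combination in step S32 can be illustrated with a deliberately simplified aggregation step. This sketch assumes mean aggregation over a node and its direct neighbours and treats tree edges as undirected; it omits the learned weight matrix and nonlinearity that an actual graph convolution layer would apply, and the function name is hypothetical.

```python
def gcn_layer(vectors, edges):
    """One simplified graph-convolution step.

    Each node's new vector is the mean of its own vector and the vectors of
    its direct neighbours. vectors is a list of equal-length float lists;
    edges is a list of (parent_index, child_index) pairs from the tree.
    """
    n = len(vectors)
    neighbours = {i: {i} for i in range(n)}  # every node includes itself
    for parent, child in edges:
        neighbours[parent].add(child)
        neighbours[child].add(parent)
    new_vectors = []
    for i in range(n):
        group = [vectors[j] for j in sorted(neighbours[i])]
        dim = len(vectors[i])
        new_vectors.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return new_vectors
```

Applying the function repeatedly lets information travel further along the tree, which is how a condition stated in one item could influence the representation of a reward stated in a distant item.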

Preferably, step S4 includes the steps of:

S41, training the policy label refining model by adding a downstream label refining task to the policy pre-training model, using labeled conditional text data;

S42, using the policy label refining model to identify condition values and refine the declaration condition information into labels;

and S43, assigning each reward measure to its corresponding industry and industry field.

Preferably, step S43 further includes the steps of:

and when the declaration condition information corresponding to a reward measure indicates no obvious industry or industry field, automatically assigning the reward measure to the industry and industry field to which the policy file applies.
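The fallback rule in step S43 can be sketched as below. The keyword table and the use of keyword matching are assumptions made purely for illustration; the source does not describe its industry taxonomy or how the industry of a reward measure is determined.

```python
# Hypothetical keyword table; the source does not list its industry taxonomy.
INDUSTRY_KEYWORDS = {
    "biomedicine": ["medical", "drug", "clinical"],
    "software": ["software", "integrated circuit"],
}


def classify_measure(condition_text, policy_industry):
    """Return the industry for one reward measure, falling back to the policy's own."""
    lowered = condition_text.lower()
    for industry, keywords in INDUSTRY_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return industry
    # No obvious industry in the condition text: inherit from the policy file.
    return policy_industry
```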

The invention also provides an intelligent analysis and structuring system for policy files, comprising:

the text layering module, used for disassembling the policy file according to its item hierarchy to obtain data at different hierarchy levels and storing the data in the form of a data structure tree;

the basic information analysis module, used for identifying data in different regions of the data structure tree to obtain the required policy file information;

the condition reward identification module, used for identifying the reward measures and the declaration condition information corresponding to them according to the graph convolution network and the policy condition reward identification model;

and the label analysis module, used for training a policy label refining model by adding a downstream task to the established policy pre-training model, refining the declaration condition information into labels, and assigning each reward measure to its corresponding industry and industry field.

Preferably, the policy document intelligent parsing and structuring system further comprises:

and the policy pre-training model building module, used for pre-training the Transformer-based bidirectional encoder representation (BERT) model with a plurality of real policy texts, authority documents, and Wikipedia texts to build the policy pre-training model.

Preferably, the conditional reward identification module further comprises:

and the policy condition reward identification model building module, used for adding a downstream task module for information recognition on top of the policy pre-training model and training it as the policy condition reward identification model, so as to identify the specific condition information in the declaration condition information and the specific reward information in the reward measures.

and the policy label refining model building module, used for training the policy pre-training model, with a downstream task added, on labeled conditional text data to obtain the policy label refining model.

Compared with the prior art, the invention has the following beneficial effects: (1) the invention creates a rule engine for structuring policy text, which can disassemble a wide variety of complex policy texts hierarchically by item and allows different parts of the text to be fed to different downstream identification tasks, improving the accuracy of those tasks; (2) the invention innovatively incorporates a graph convolution network: the text of each item is treated as a node, and the network's ability to fit relationships between nodes, combined with the structured data produced by the rule engine, makes it possible to model ultra-long policy texts and to identify the relationships between conditions and rewards at every level; (3) the invention organically combines artificial intelligence methods with an expert guidance module: a number of general labels are distilled, a named entity recognition model is introduced to extract the valuable feature values from lengthy condition text, and concise, reliable data are provided for subsequent use.

Drawings

FIG. 1 is a flow chart of a method for intelligent parsing and structuring of policy documents in accordance with the present invention;

FIG. 2 is a diagram illustrating a policy document for a certain region according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of the present invention for hierarchically parsing the text in FIG. 2;

FIG. 4 is a schematic diagram of the text in FIG. 2 being recognized and parsed by the present invention;

FIG. 5 is a schematic diagram of the present invention identifying specific condition information and reward information in the text of FIG. 2 and labeling the condition information.

Detailed Description

In order to more clearly illustrate the embodiments of the present invention, the following description will explain the embodiments of the present invention with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, other drawings and embodiments can be derived from them without inventive effort.

Example 1:

The intelligent analysis and structuring method for policy files shown in FIG. 1 comprises the following steps:

Further, step S2 includes the following steps:

The issuing regions are divided according to administrative divisions, forming a province (autonomous region, municipality directly under the central government)-city-county/district structure;

step S24, identifying the release time and the deadline in the head and tail regions of the policy in the data structure tree, and converting time expressions of different styles into text in a uniform format; for example, "five past three in the afternoon of July 8, 2020" written out in words and "July 8, 2020, 15:05" written in digits are both unified into 2020-07-08 15:05.
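The normalization in step S24 can be sketched with a small rule-based converter. The regular expression, the numeral table, and the Chinese test strings are assumptions: the strings are a plausible original-language rendering of the two examples above, and the sketch covers only this one date-time pattern.

```python
import re

CN_DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}


def to_int(token):
    """Convert an Arabic or simple Chinese numeral (0-99) to an int."""
    if token.isascii() and token.isdigit():
        return int(token)
    if "十" in token:
        tens, _, ones = token.partition("十")
        return (CN_DIGITS[tens] if tens else 1) * 10 + (CN_DIGITS[ones] if ones else 0)
    value = 0
    for ch in token:  # digit-by-digit reading, e.g. "零五" -> 5
        value = value * 10 + CN_DIGITS[ch]
    return value


NUM = r"[0-9零〇一二两三四五六七八九十]+"
PATTERN = re.compile(
    rf"(?P<year>\d{{4}})年(?P<month>{NUM})月(?P<day>{NUM})[日号]"
    rf"(?P<period>上午|下午)?(?P<hour>{NUM})[点时:：]\s*(?P<minute>{NUM})分?"
)


def normalize(text):
    """Return the first date-time in text as 'YYYY-MM-DD HH:MM', or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    hour = to_int(m["hour"])
    if m["period"] == "下午" and hour < 12:
        hour += 12  # afternoon hours are written on a 12-hour clock
    return "{}-{:02d}-{:02d} {:02d}:{:02d}".format(
        m["year"], to_int(m["month"]), to_int(m["day"]), hour, to_int(m["minute"]))
```

A production system would need further patterns (dates without times, ranges, relative deadlines); those are outside the scope of this sketch.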

Further, the policy pre-training model is constructed by the following method:

Further, step S3 includes the following steps:

The downstream task module works as follows: a text is input, the policy pre-training model produces a vector for each character in the text, Softmax is used to calculate the probability that each character belongs to condition information or reward information, and the specific condition text or specific reward text is finally obtained.
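The per-character classification described above can be sketched as follows. The label set and the raw scores are hypothetical stand-ins for the output of a trained model; only the Softmax computation and the merging of consecutive characters into spans are shown.

```python
import math

LABELS = ["O", "CONDITION", "REWARD"]  # hypothetical label set


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extract_spans(text, char_scores):
    """Label each character with its most probable class and merge runs into spans."""
    spans, current_label, current_chars = [], None, []
    for ch, scores in zip(text, char_scores):
        probs = softmax(scores)
        label = LABELS[probs.index(max(probs))]
        if label != current_label:
            if current_label not in (None, "O"):
                spans.append((current_label, "".join(current_chars)))
            current_label, current_chars = label, []
        current_chars.append(ch)
    if current_label not in (None, "O"):
        spans.append((current_label, "".join(current_chars)))
    return spans
```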

Further, step S4 includes the following steps:

The downstream task module is added as follows: a text and its labels are input, the policy pre-training model produces a vector for each character in the text, and a conditional random field algorithm then combines the vectors and their order to obtain the probability that each character in the text belongs to a given label.
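To show how a conditional random field takes the order of labels into account, the following sketch implements Viterbi decoding for a linear-chain model. It covers only the decoding step; the emission and transition scores are hypothetical inputs that a trained model would supply, and the source does not state which decoding procedure it uses.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][k]   : score of label k at position t (from the encoder)
    transitions[j][k] : score of moving from label j to label k
    """
    n_labels = len(emissions[0])
    scores = list(emissions[0])
    backpointers = []
    for step in emissions[1:]:
        new_scores, pointers = [], []
        for k in range(n_labels):
            # Best previous label for ending at label k in this position.
            best_prev = max(range(n_labels), key=lambda j: scores[j] + transitions[j][k])
            new_scores.append(scores[best_prev] + transitions[best_prev][k] + step[k])
            pointers.append(best_prev)
        scores = new_scores
        backpointers.append(pointers)
    best = max(range(n_labels), key=lambda k: scores[k])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With labels ordered as outside, begin, and inside, a strongly negative transition score from outside to inside stops the decoder from starting a label in the middle of a span, even when the per-character scores alone would prefer it.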

Further, step S43 includes the following steps:

As shown in FIG. 1, the present invention also provides an intelligent analysis and structuring system for policy files, comprising:

Further, the intelligent policy file parsing and structuring system further comprises:

Further, the conditional reward identification module further comprises:

Based on the technical solution of the present invention, a policy file parsing process in the specific implementation and operation process is shown in fig. 2 to 5:

First, the format of the data is determined, and the characters in the policy file are extracted with the appropriate tool. If the file is a picture, the text is extracted using OCR; PDF and DOC files are converted and extracted directly using PDF and DOC parsing toolkits.
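The format check can be sketched as a simple dispatch on the file extension. The suffix lists and the returned tool names are assumptions for illustration; the source does not name the OCR engine or the parsing toolkits it uses.

```python
from pathlib import Path

# Hypothetical list of image suffixes that would be routed to OCR.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def choose_extractor(filename):
    """Map a file name to the kind of extraction tool it needs."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "ocr"
    if suffix == ".pdf":
        return "pdf_parser"
    if suffix in {".doc", ".docx"}:
        return "doc_parser"
    raise ValueError(f"unsupported policy file format: {suffix or filename}")
```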

Taking the policy "2018 industry development support policy of a certain area" as an example, the input text is shown in FIG. 2, with part of the text omitted.

The text layering module is implemented with the rule engine provided by the invention, as shown in FIG. 3. The rule engine uses the item markers in the text to split the text into levels and to organize each marker together with the content beneath it, which implements step S1.

Then, the head and tail data of the layered content from the text layering module are sent to the basic information analysis module, which parses out the basic information shown in FIG. 4.

Next, the graph convolution network model identifies the condition-reward relationships among all paragraphs of the text. Specifically, the text of each node is vectorized by a BERT model, and the vectors are input into the graph convolution network model according to the hierarchical structure, so that the graph convolution network fits the relationships among the nodes and those relationships can then be identified. For example, some of the conditional restrictions in Section 10, Supplementary Provisions, apply to all sub-policies in the policy; that is, to enjoy a given sub-policy, an applicant must satisfy the conditional restrictions in Section 10, Supplementary Provisions, in addition to the conditions in that sub-policy. The condition-reward relationship is identified by the graph convolution network model, shown as module A in FIG. 5. The policy condition reward identification model then further identifies the condition information and the reward information, yielding the condition information, the reward information, and the relationship between them.

Finally, the conditions parsed in the previous step need to be labeled: as shown in module B of FIG. 5, the label information in the condition statements identified in module A is recognized. That is, the conditions are refined again and assigned to particular labels to facilitate subsequent retrieval and use. For example, for the following sub-policy:

"A one-time reward of 200,000 yuan is given to a medical institution in a certain area that obtains the national drug (device) clinical trial institution (GCP) qualification (Phase I, II, III) for the first time."

It contains the following condition:

[a medical institution in a certain area obtains the national drug (device) clinical trial institution (GCP) qualification (Phase I, II, III) for the first time]

The refined labels are as follows:

Enterprise location: a certain area;

Enterprise type: medical institution;

Enterprise qualification: national drug (device) clinical trial institution (GCP) qualification (Phase I, Phase II, Phase III).

The invention integrates several technical means. Policy analysis based on a rule engine and semantic analysis can automatically parse complex, ultra-long policy documents, and the general features extracted by experts narrow the range of conditions, which greatly reduces the labor required for policy combing and lets users retrieve the parsed policy information simply, quickly, and accurately. By combining the rule engine with the graph convolution network's ability to aggregate text features, the invention refines the condition information and reduces the search space for rewards, which greatly enhances policy analysis capability.

The foregoing merely describes the preferred embodiments and principles of the present invention; those skilled in the art may devise variations that fall within the spirit and scope of the appended claims.

Claims

1. The intelligent analysis and structuring method for policy files is characterized by comprising the following steps:

2. The intelligent parsing and structuring method for policy documents according to claim 1, wherein step S2 comprises the following steps:

3. The intelligent parsing and structuring method for policy documents according to claim 1, wherein the policy pre-training model is constructed by the following steps:

4. The intelligent parsing and structuring method for policy documents according to claim 1, wherein step S3 comprises the following steps:

5. The intelligent parsing and structuring method for policy documents according to claim 1, wherein step S4 comprises the following steps:

6. The intelligent parsing and structuring method for policy documents according to claim 5, wherein step S43 further comprises the steps of:

7. The intelligent analysis and structuring system for policy files is characterized by comprising:

8. The intelligent policy document parsing and structuring system according to claim 7 further comprising:

9. The system of claim 7, wherein the conditional reward identification module further comprises:

10. The intelligent policy document parsing and structuring system according to claim 8 further comprising: