CN116562265A

CN116562265A - Information intelligent analysis method, system and storage medium

Info

Publication number: CN116562265A
Application number: CN202310811685.7A
Authority: CN
Inventors: 王铁鑫; 张超; 苏圣阳; 孙进宇; 刘彬
Original assignee: Nanjing Dnet System Technology Co ltd; Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing Dnet System Technology Co ltd; Nanjing University of Aeronautics and Astronautics
Priority date: 2023-07-04
Filing date: 2023-07-04
Publication date: 2023-08-08
Anticipated expiration: 2043-07-04
Also published as: CN116562265B

Abstract

The invention discloses an intelligent information analysis method, system and storage medium, and relates to the field of artificial intelligence. The method comprises the following steps: preprocessing a policy file to obtain the key policy content; training an automatic policy index extraction model using natural language processing techniques, mainly named entity recognition and relation extraction; automatically parsing the policy text into index triples with the automatic policy index extraction model; constructing a policy index knowledge graph and storing the index triple information in a graph database; and providing a policy index knowledge query service for enterprises. The invention effectively addresses the difficulty of interpreting policy texts: it represents policy files as policy index triples, constructs a knowledge graph to store policy information, automatically extracts and stores the key information of policy texts, and provides services such as policy knowledge query.

Description

Information intelligent analysis method, system and storage medium

Technical Field

The invention discloses an information intelligent analysis method, an information intelligent analysis system and a storage medium, and relates to the field of artificial intelligence.

Background

With the development of information technology, information platforms on which users organize and browse information files have become a common and convenient channel, but it is difficult for such platforms to recommend information files accurately, and it is difficult for users to find information files that meet their needs. The main reasons are as follows: users do not know which information files exist, do not understand them, and cannot make use of them; and the number of information files is huge, so screening them consumes a great deal of time and requires a certain level of expertise.

Disclosure of Invention

In view of the above technical problems, the application aims to provide an intelligent information analysis method, system and storage medium that effectively address the difficulty of interpreting information files: index triples of the information files are used to represent policy files, a knowledge graph is constructed to store policy information, key information of the information files is automatically extracted and stored, and services such as knowledge query and knowledge reasoning over the information files are provided.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

An intelligent information analysis method, comprising the following steps:

S1, extracting key content from a set original file using a character recognition method, obtaining the information to be processed, and storing it;

S2, training a BERT-BiLSTM-CRF automatic index extraction model using a natural language processing method, wherein the natural language processing method mainly comprises a named entity recognition process and a relation extraction process;

S3, automatically parsing the information to be processed into index triple information with the trained BERT-BiLSTM-CRF automatic index extraction model;

S4, constructing a set index knowledge graph and storing the index triple information in a graph database;

S5, querying the set index to obtain index triple sequence information and returning it to the user.

Further, the step S1 specifically includes the following:

obtaining the set original file using web crawler technology, extracting key content from the set original file using optical character recognition (OCR), obtaining the information to be processed, and storing the information to be processed in an Excel file.

Further, the step S2 specifically includes the following:

dividing the information to be processed into a training set and a test set at a set ratio, labeling the training set using entity-relation joint extraction, and training the BERT-BiLSTM-CRF automatic index extraction model;

the BERT-BiLSTM-CRF automatic index extraction model comprises a BERT module, a BiLSTM module and a CRF module, wherein the BERT module, which is built on two unsupervised training tasks, converts the input information to be processed into word vectors; the BiLSTM module takes the word vectors output by the BERT module as input, encodes them, and outputs the result to the CRF module; and the CRF module performs the final decoding to obtain the predicted sequence.

Further, the entity-relationship joint extraction method comprises the following steps:

labeling the training set of the information to be processed, wherein the label format comprises three parts. The first part marks the position of a character within an entity, following the BIOES tagging scheme, with labels {B: entity start, I: entity interior, E: entity end, S: single entity}. The second part marks the relation information: the defined relation types are given simplified codes, and the code of the relation type is marked. The third part marks the subject/object role of the entity, that is, the direction of the relation, with labels {1: entity 1, 2: entity 2} or {3: entity}.
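By way of illustration, the three-part label described above may be composed as in the following Python sketch. The `-` separator, the relation code `PL`, the sample sentence and the helper names `make_tags` and `tag_sentence` are assumptions introduced for the example; the text specifies only the three parts of the label, not their concrete encoding.

```python
from typing import List, Sequence, Tuple


def make_tags(length: int, relation_code: str, role: int) -> List[str]:
    """Build position-relation-role tags for one entity of `length` characters."""
    if length < 1:
        raise ValueError("an entity must contain at least one character")
    if role not in (1, 2, 3):
        raise ValueError("role must be 1 (entity 1), 2 (entity 2) or 3 (entity)")
    if length == 1:
        positions = ["S"]  # single-character entity
    else:
        positions = ["B"] + ["I"] * (length - 2) + ["E"]
    return [f"{p}-{relation_code}-{role}" for p in positions]


def tag_sentence(chars: Sequence[str],
                 entities: Sequence[Tuple[int, int, str, int]]) -> List[str]:
    """Tag every character; characters outside any entity are labeled "O".

    Each entity is (start, end, relation_code, role) with `end` exclusive.
    """
    tags = ["O"] * len(chars)
    for start, end, relation_code, role in entities:
        tags[start:end] = make_tags(end - start, relation_code, role)
    return tags


# Hypothetical sentence ("the registered place is Nanjing") with a
# hypothetical relation code "PL" standing for the relation type "place".
chars = list("注册地在南京")
tags = tag_sentence(chars, [(0, 3, "PL", 1), (4, 6, "PL", 2)])
```

In this sketch each tag carries all three parts in one string, so a single sequence-labeling model can predict entity boundaries, relation type and relation direction together.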

Further, the BERT module comprises two unsupervised training tasks, namely masked language modeling (MLM) and next sentence prediction (NSP). In the NSP task, the concatenation of two sentences is input and the model judges whether the second sentence follows the first. In the MLM task, the sentence is segmented into characters, a portion of the characters in the training sample is randomly selected and masked out of the original sentence, and the masked characters are predicted from the remaining characters.

Further, the BiLSTM module and the CRF module together form a BiLSTM-CRF module, and the BiLSTM-CRF module comprises the following contents:

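the word vectors obtained by the BERT module are input into the BiLSTM module for encoding; the BiLSTM module consists of a forward LSTM layer and a backward LSTM layer, and its output is the combination of the two LSTM outputs. The LSTM computation is expressed as:

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

in the above formula: $i_t$ is the input gate, $o_t$ is the output gate, $f_t$ is the forget gate, $c_t$ is the memory cell, $\sigma$ and $\tanh$ are activation functions, $W$ is the weight matrix of a gate, $b$ is the bias vector of a gate, $x_t$ is the input of the current unit, $h_{t-1}$ is the previous hidden state, $c_{t-1}$ and $c_t$ are the previous and current cell states, $\tilde{c}_t$ is the temporary cell state, $[h_{t-1}, x_t]$ denotes vector concatenation and $\odot$ denotes element-wise multiplication; how much of the information passed from the previous unit is kept or discarded, how much of the current input is retained, and what is output to the next unit are all determined by the computed values of $f_t$, $i_t$ and $o_t$;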

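the output of the BiLSTM module is expressed as:

$$
h_t = \left[\overrightarrow{h_t};\ \overleftarrow{h_t}\right]
$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden states of the forward and backward LSTM layers at position $t$;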
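The encoding performed by the BiLSTM module may be sketched in plain Python as follows, assuming the standard LSTM gate formulation (sigmoid input, forget and output gates, a tanh candidate state) and concatenation of the forward and backward hidden states. The parameter layout, the helper names and the constant toy weights are assumptions made for illustration only; a practical system would use a deep learning library and trained weights.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Params = Dict[str, list]  # "W_g" (matrix) and "b_g" (vector) for g in i, f, o, c


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: List[Vector], b: Vector, v: Vector) -> Vector:
    """Compute W·v + b for a small dense matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(p: Params, x: Vector, h_prev: Vector, c_prev: Vector) -> Tuple[Vector, Vector]:
    """One LSTM step: returns the new hidden state and the new cell state."""
    v = h_prev + x  # concatenation [h_{t-1}, x_t]
    i = [sigmoid(z) for z in affine(p["W_i"], p["b_i"], v)]          # input gate
    f = [sigmoid(z) for z in affine(p["W_f"], p["b_f"], v)]          # forget gate
    o = [sigmoid(z) for z in affine(p["W_o"], p["b_o"], v)]          # output gate
    c_tilde = [math.tanh(z) for z in affine(p["W_c"], p["b_c"], v)]  # temporary cell state
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


def run_lstm(p: Params, xs: List[Vector], hidden: int) -> List[Vector]:
    """Run one LSTM layer over the sequence, starting from zero states."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in xs:
        h, c = lstm_step(p, x, h, c)
        outputs.append(h)
    return outputs


def bilstm(p_fwd: Params, p_bwd: Params, xs: List[Vector], hidden: int) -> List[Vector]:
    """Concatenate forward and backward hidden states position by position."""
    forward = run_lstm(p_fwd, xs, hidden)
    backward = run_lstm(p_bwd, xs[::-1], hidden)[::-1]  # re-align with positions
    return [hf + hb for hf, hb in zip(forward, backward)]


def constant_params(hidden: int, inputs: int, value: float) -> Params:
    """Toy parameters (every weight equal to `value`, zero biases)."""
    params: Params = {}
    for gate in ("i", "f", "o", "c"):
        params["W_" + gate] = [[value] * (hidden + inputs) for _ in range(hidden)]
        params["b_" + gate] = [0.0] * hidden
    return params


xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy "word vectors"
encoded = bilstm(constant_params(2, 2, 0.1), constant_params(2, 2, -0.1), xs, hidden=2)
```

Each element of `encoded` has twice the hidden size, because the forward and backward states for the same position are joined before being passed on to the CRF module.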

the CRF module creates a label transition matrix according to the relations between adjacent labels, generates label sequences with different probabilities, and takes the sequence with the highest computed score as the final predicted sequence; for any input sequence $X = (x_1, x_2, \dots, x_n)$ with predicted sequence $Y = (y_1, y_2, \dots, y_n)$, the score in the CRF module is computed as:

$$
s(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}
$$

wherein $Y$ is the predicted sequence for the sequence $X$; $P$ is the score matrix output by the BiLSTM module, and $P_{i,j}$ represents the score of the $j$-th label for the $i$-th word; $A$ represents the transition score matrix, and $A_{i,j}$ represents the score of a transition from label $i$ to label $j$. The probability of generating the predicted sequence $Y$ is:

$$
p(Y \mid X) = \frac{\exp\left(s(X, Y)\right)}{\sum_{\tilde{Y} \in Y_X} \exp\left(s(X, \tilde{Y})\right)}
$$

taking the logarithm of both sides of the equation gives the likelihood function of the predicted sequence:

$$
\log p(Y \mid X) = s(X, Y) - \log \sum_{\tilde{Y} \in Y_X} \exp\left(s(X, \tilde{Y})\right)
$$

wherein $Y$ represents the actual labeling sequence and $Y_X$ represents all possible labeling sequences; decoding finally yields the output sequence with the maximum score:

$$
Y^{*} = \underset{\tilde{Y} \in Y_X}{\arg\max}\ s(X, \tilde{Y})
$$
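The scoring and decoding performed by the CRF module may be illustrated by the following Python sketch, which scores a label path as the sum of emission scores from the BiLSTM score matrix and transition scores between adjacent labels, and finds the highest-scoring path by Viterbi dynamic programming. Viterbi is a standard decoding technique assumed here for illustration; the text states only that the sequence with the maximum score is selected. The matrices `P` and `A` are toy values.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def path_score(P: Matrix, A: Matrix, y: List[int]) -> float:
    """Score of label path y: emission scores plus adjacent-label transition scores."""
    emission = sum(P[i][label] for i, label in enumerate(y))
    transition = sum(A[y[i]][y[i + 1]] for i in range(len(y) - 1))
    return emission + transition


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the label path with the maximum score and that score."""
    n, k = len(P), len(P[0])
    best = list(P[0])              # best score of any path ending in each label
    back: List[List[int]] = []     # back-pointers for positions 1..n-1
    for i in range(1, n):
        new_best, pointers = [], []
        for j in range(k):
            prev = max(range(k), key=lambda p: best[p] + A[p][j])
            new_best.append(best[prev] + A[prev][j] + P[i][j])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    last = max(range(k), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(back):  # follow back-pointers from the end
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example: 3 words, 2 labels.
P = [[1.0, 0.0], [0.0, 1.5], [2.0, 0.5]]   # P[i][j]: score of label j for word i
A = [[0.5, -1.0], [-0.5, 0.2]]             # A[i][j]: score of moving from label i to j
best_path, best_score = viterbi(P, A)
```

The dynamic programme avoids enumerating every possible label sequence: its cost grows linearly with the sentence length and quadratically with the number of labels.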

Further, the step S3 is:

extracting triples of the form <head entity, relation, tail entity> from the information to be processed using the trained BERT-BiLSTM-CRF automatic index extraction model.
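One possible way of turning a predicted label sequence into triples is sketched below in Python. The pairing rule (an entity with role 1 is matched with the next entity with role 2 that carries the same relation code, and an entity with role 3 receives a subject from a lookup table), the `-` separated tag layout, the relation codes and the lookup tables are all assumptions made for the example; the text does not specify how labels are assembled into triples.

```python
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[str, str, int]      # (text, relation code, role)
Triple = Tuple[str, str, str]      # (head entity, relation, tail entity)


def extract_entities(chars: Sequence[str], tags: Sequence[str]) -> List[Entity]:
    """Collect entity spans from position-relation-role tags such as "B-PL-1"."""
    entities: List[Entity] = []
    buffer: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "O":
            buffer = []
            continue
        position, relation, role = tag.split("-")
        if position in ("B", "S"):
            buffer = [ch]          # start a new entity
        else:
            buffer.append(ch)      # "I" or "E" continues the current entity
        if position in ("E", "S"):
            entities.append(("".join(buffer), relation, int(role)))
            buffer = []
    return entities


def to_triples(entities: Sequence[Entity],
               relation_names: Dict[str, str],
               default_heads: Dict[str, str]) -> List[Triple]:
    """Pair role-1 and role-2 entities; supply the omitted subject for role 3."""
    triples: List[Triple] = []
    pending: Dict[str, str] = {}   # relation code -> head entity awaiting its tail
    for text, relation, role in entities:
        if role == 1:
            pending[relation] = text
        elif role == 2 and relation in pending:
            triples.append((pending.pop(relation), relation_names[relation], text))
        elif role == 3:
            triples.append((default_heads[relation], relation_names[relation], text))
    return triples


# Hypothetical relation codes and subject table.
relation_names = {"QT": "qualification type", "PL": "place"}
default_heads = {"QT": "enterprise qualification"}

chars = list("高新技术企业")  # "high and new technology enterprise"
tags = ["B-QT-3", "I-QT-3", "I-QT-3", "I-QT-3", "I-QT-3", "E-QT-3"]
triples = to_triples(extract_entities(chars, tags), relation_names, default_heads)
```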

The application also provides an information intelligent analysis system, which comprises:

the preprocessing module, which extracts key content from the set original file using a character recognition method, obtains the information to be processed and stores it;

the model training module, which trains the set automatic index extraction model using a natural language processing method, the natural language processing method mainly comprising a named entity recognition process and a relation extraction process;

the index extraction module, which automatically parses the information to be processed into index triple information using the set automatic index extraction model;

the storage module, which constructs the set index knowledge graph and stores the index triple information in a graph database;

and the query module, which queries the set index to obtain index triple sequence information and returns it to the user.

The application also provides a computer readable storage medium, wherein the storage medium stores a program, and the program realizes the intelligent information analysis method when being executed by a processor.

The beneficial effects are that:

according to the information intelligent analysis method, system and storage medium, the problem that the information file is difficult to read is effectively solved, the policy file is represented by the index triplets of the information file, the knowledge graph is constructed to store the policy information, key information of the information file can be automatically extracted and stored, and services such as knowledge query, knowledge reasoning and the like of the information file can be provided.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of an intelligent information analysis method provided in an embodiment of the invention;

FIG. 2 is a schematic diagram of an entity-relation joint extraction model in the intelligent information analysis method according to an embodiment of the present invention;

FIG. 3 is a flowchart for constructing a policy knowledge graph according to an embodiment of the present invention;

FIG. 4 is a diagram of an example of the storage of a policy file in a graph database provided in an embodiment of the present invention.

Description of the embodiments

The present invention will be described in further detail below with reference to the drawings and detailed description, wherein the described embodiments are provided as examples of the invention, and other embodiments, which are obtained by persons skilled in the art without making any inventive work, are within the scope of the invention.

Example 1

Fig. 1 is a schematic diagram of the intelligent information analysis method. Existing platforms that provide policy information to enterprises are policy calculators, whose main function is to aggregate national, provincial and municipal policies and provide classified queries. Some policy calculators also provide a self-test function that determines, from data filled in by the enterprise, whether a policy can be applied for. However, this technique has the following disadvantages: data must be filled in repeatedly, and even the initial data entry is tedious; screening is fuzzy and based only on the filled-in data, without mining the data, so accurate matching is difficult; and the policy calculator mainly assists enterprises with applications and cannot be used in government review.

This embodiment applies the method to the intelligent analysis of enterprise support policies, and the method comprises the following steps:

S1, extracting key content from a set original file using a character recognition method, obtaining the information to be processed and storing it. In this embodiment, web crawler technology is used to obtain original policy files from policy information publishing websites, the key content of each policy PDF file is obtained through optical character recognition (OCR), and the main information of the policy text is obtained and stored in an Excel file.

S2, training a BERT-BiLSTM-CRF automatic index extraction model using a natural language processing method; in this embodiment, an automatic policy index extraction model is trained, and the natural language processing method mainly comprises a named entity recognition process and a relation extraction process.

Fig. 2 is a schematic diagram of the entity-relation joint extraction model of the intelligent information analysis method according to an embodiment of the present invention; the model is described as follows:

The information to be processed in the Excel file is divided into a training set and a test set at a ratio of 7:3, and the policy text training set is labeled using entity-relation joint extraction. The form and format of the labels can be user-defined, provided that they reflect the samples and their corresponding features. In this embodiment, the label format used during data labeling consists of three parts. The first part marks the position of a character within an entity, following the BIOES tagging scheme, with labels {B (entity start), I (entity interior), E (entity end), S (single entity)}. The second part marks the relation information: the defined relation types are given simplified codes, and the code of the relation type is marked. The third part marks the subject/object role of the entity, that is, the direction of the relation, with labels {1 (entity 1), 2 (entity 2)} or {3 (entity)}. In the third part, owing to the particular nature of policy texts, some types of policy index relations omit the subject of the policy index; such relations need their subject supplemented, so such entities are labeled separately. All remaining characters that are not part of an entity-relation triple are labeled "O".
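The 7:3 division into a training set and a test set may be sketched in Python as follows. Shuffling before the split, the fixed random seed and the function name `split_dataset` are assumptions made for the example; the text specifies only the ratio.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(records: Sequence[str],
                  train_ratio: float = 0.7,
                  seed: int = 42) -> Tuple[List[str], List[str]]:
    """Shuffle the records reproducibly and split them into training and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)        # assumption: shuffle before splitting
    cut = round(len(shuffled) * train_ratio)     # number of training records
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder policy texts give 7 training records and 3 test records.
records = [f"policy text {i}" for i in range(10)]
train_set, test_set = split_dataset(records)
```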

A knowledge extraction model based on BERT-BiLSTM-CRF, namely the BERT-BiLSTM-CRF automatic index extraction model, is adopted to carry out the joint extraction of entities and relations. The model first inputs the labeled sequence into the BERT layer to obtain contextualized word vectors; the word vectors are then input into the BiLSTM layer for encoding; the encoded result is output to the CRF module, which performs the final decoding to obtain the predicted sequence.

When pre-training the language model, the BERT model builds two unsupervised training tasks, namely masked language modeling, MLM (Masked Language Model), and next sentence prediction, NSP (Next Sentence Prediction). The NSP task takes the concatenation of two sentences as input, and the model judges whether the second sentence follows the first. The MLM task segments the sentence into characters, randomly selects 15% of the characters in the training sample, masks them out of the original sentence, and predicts the masked characters from the remaining characters.
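The character-level masking used by the MLM task may be sketched in Python as follows. Every selected character is simply replaced by a `[MASK]` placeholder, as the description above states; the sample sentence, the fixed random seed and the function name `mask_characters` are assumptions made for the example, and the prediction model itself is not shown.

```python
import random
from typing import Dict, List, Tuple

MASK = "[MASK]"


def mask_characters(sentence: str,
                    mask_ratio: float = 0.15,
                    seed: int = 0) -> Tuple[List[str], Dict[int, str]]:
    """Mask roughly `mask_ratio` of the characters of a sentence.

    Returns the masked character list and a map from masked position to the
    original character, which is what the model is trained to predict.
    """
    chars = list(sentence)                       # segment the sentence into characters
    if not chars:
        return [], {}
    n_mask = max(1, round(len(chars) * mask_ratio))
    positions = random.Random(seed).sample(range(len(chars)), n_mask)
    targets = {p: chars[p] for p in positions}
    masked = [MASK if i in targets else c for i, c in enumerate(chars)]
    return masked, targets


# Hypothetical policy sentence used only as sample input.
sentence = "企业申报高新技术企业认定需满足以下条件"
masked, targets = mask_characters(sentence)
```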

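Contextualized word vectors are obtained through the BERT layer and input into the BiLSTM layer for encoding. The BiLSTM layer consists of a forward LSTM layer and a backward LSTM layer, and its output is the combination of the two LSTM outputs. Gating is the core of the LSTM model; the LSTM model contains a forget gate $f_t$, an input gate $i_t$, an output gate $o_t$ and a memory cell $c_t$. The forget gate and the input gate pass on useful information and filter out useless information during computation, and the output of the memory cell is multiplied by the output of the output gate to give the output of the whole structure. The LSTM computation is as follows:

$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $x_t$ is the current input, $h_{t-1}$ is the previous hidden state, $\tilde{c}_t$ is the temporary cell state, $[h_{t-1}, x_t]$ denotes vector concatenation and $\odot$ denotes element-wise multiplication.

The output of the BiLSTM module is expressed as:

$$
h_t = \left[\overrightarrow{h_t};\ \overleftarrow{h_t}\right]
$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden states of the forward and backward LSTM layers at position $t$.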

S3, using the trained automatic policy index extraction model to extract triples of the form <head entity, relation, tail entity> from the policy text. The relation types of the index triples extracted from policy texts fall into 14 categories: year, place, academy, title, business or institution, type of business, industry, honor or title, type of economy, money, number of people, age, time, place. Owing to the particular nature of policy texts, some types of policy index relations omit the subject of the policy index; for such relations the subject is supplemented automatically, for example <enterprise qualification, qualification type, high and new technology enterprise>. Table 1 lists the categories and examples of policy index triples provided in this embodiment.

TABLE 1 (categories and examples of policy index triples; table content not reproduced)

A client interface is developed to visually display the policy index extraction function and its results. The client is developed on the Vue framework and provides policy text input, data submission through an extraction button, and rendering of the index triple table. After the policy text is typed or pasted in and the extraction button is clicked, the client sends the input policy text to the server. The server interface is developed on the Flask framework; after receiving the policy text from the client, it feeds the text into the trained automatic policy index extraction model, which outputs the recognized sequence of policy index triples, and the server interface returns this sequence to the client. After receiving the policy index triple sequence, the client renders each triple into the table in the form "head entity", "relation", "tail entity".

S4, constructing a policy index knowledge graph, and describing entities and concepts in the policy index and the relation between the entities and concepts.

Fig. 3 is a flowchart of the construction of a policy knowledge graph according to an embodiment of the present invention, where the flowchart includes the following specific details:

Specifically, the existing semantic structures in structured and semi-structured data, such as databases and tables, are first sorted out and combined with the experience of experts in the field of policy declaration to construct the schema layer of the policy knowledge graph from the top down; the index triples extracted in step S3 are then stored in a graph database, thereby constructing the data layer of the knowledge graph.

The data layer of the knowledge graph is constructed as follows. First, the information of each policy file in the Excel sheet obtained by the preprocessing in step S1 is read in turn using Python; the information of one policy file comprises its name, level, category and text content. Levels include district, city, provincial and national; categories include science and technology, letter, talents, etc. Second, the text content of each policy file is input into the model, which returns a sequence of policy index triples. The name, level, category and index triple sequence of the policy file are then stored in a JSON file, until all policy files in the Excel sheet have been parsed and stored. The JSON file serves as the intermediate form of the data to be stored in the graph database.
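The assembly of the intermediate JSON data may be sketched in Python as follows. The field names (`name`, `level`, `category`, `text`, `triples`, `head`, `relation`, `tail`), the stand-in extractor `fake_extractor` and the sample row are assumptions made for the example; reading the Excel sheet and calling the trained model are outside the scope of the sketch.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def parse_policies(rows: Iterable[Dict[str, str]],
                   extract_triples: Callable[[str], List[Triple]]) -> List[dict]:
    """Build one record per policy: name, level, category and its index triples."""
    records = []
    for row in rows:
        triples = extract_triples(row["text"])   # the trained model would be called here
        records.append({
            "name": row["name"],
            "level": row["level"],
            "category": row["category"],
            "triples": [{"head": h, "relation": r, "tail": t} for h, r, t in triples],
        })
    return records


def to_json(records: List[dict]) -> str:
    """Serialize the records; ensure_ascii=False keeps Chinese text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def fake_extractor(text: str) -> List[Triple]:
    """Stand-in for the trained extraction model (hypothetical fixed output)."""
    return [("enterprise qualification", "qualification type",
             "high and new technology enterprise")]


rows = [{"name": "Example policy", "level": "city",
         "category": "science and technology", "text": "sample policy text"}]
records = parse_policies(rows, fake_extractor)
payload = to_json(records)
```

Writing `payload` to a file yields the intermediate JSON form from which the graph database can later be populated.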

A client is developed on the Vue framework; it reads the JSON file and, for each policy in turn, extracts the name, category, level and index triple sequence and sends them to the server. The server is developed on the Spring Boot framework; it receives the policy information sent by the client, connects to the Neo4j database, and stores the policy information in it. In the Neo4j database, for each policy, a root node is first created whose attributes are the name, category and level of the policy. Second, a node is created for each head entity and each tail entity in the index triples, with the entity name as the node attribute. Then an edge is created between the policy node and each head entity node, with "index" as the content of the edge. Finally, an edge is created between the head entity and the tail entity of each triple, whose content is the relation in the corresponding index triple, such as "index includes" or "has". FIG. 4 shows an example of how a policy file is stored in the graph database in this embodiment.
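The node and edge creation described above may be expressed as parameterized Cypher statements, generated here by a short Python sketch. The node labels `Policy` and `Entity`, the relationship types `INDEX` and `RELATION`, the property names and the use of `MERGE` (so that repeated entities are not duplicated) are assumptions made for the example; the embodiment implements the server with Spring Boot, and no database driver is invoked in the sketch.

```python
from typing import Any, Dict, List, Tuple

Statement = Tuple[str, Dict[str, Any]]   # (Cypher text, parameters)

# Root node for the policy, carrying its name, category and level.
POLICY_QUERY = (
    "MERGE (p:Policy {name: $name}) "
    "SET p.category = $category, p.level = $level"
)

# Head and tail entity nodes, the policy-to-head "index" edge, and the
# head-to-tail edge whose content is the relation of the triple.
TRIPLE_QUERY = (
    "MATCH (p:Policy {name: $policy}) "
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (p)-[:INDEX]->(h) "
    "MERGE (h)-[:RELATION {content: $relation}]->(t)"
)


def policy_statements(record: Dict[str, Any]) -> List[Statement]:
    """Turn one policy record into the Cypher statements that store it."""
    statements: List[Statement] = [(POLICY_QUERY, {
        "name": record["name"],
        "category": record["category"],
        "level": record["level"],
    })]
    for triple in record["triples"]:
        statements.append((TRIPLE_QUERY, {
            "policy": record["name"],
            "head": triple["head"],
            "tail": triple["tail"],
            "relation": triple["relation"],
        }))
    return statements


# Hypothetical policy record in the intermediate JSON layout.
record = {
    "name": "Example policy", "level": "city", "category": "science and technology",
    "triples": [{"head": "enterprise qualification", "relation": "qualification type",
                 "tail": "high and new technology enterprise"}],
}
statements = policy_statements(record)
```

Passing values as parameters rather than formatting them into the query text avoids quoting problems with policy names and entity text.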

S5, inquiring the set index to obtain index triplet sequence information, feeding back the index triplet sequence information to the user, and providing policy index inquiring service for enterprises.

After all the policy index data have been stored in the graph database in S4, an enterprise may select a query condition, such as the name, level or category of a policy or a specific index type from Table 1, and obtain the required policy index information. For a policy name condition, the enterprise obtains the policy content corresponding to that name; for a policy level condition, all policy content at that level; for a policy category condition, all policy content in that category; and for a specific index type, all policy content containing that index type. Providing such a policy query service effectively reduces the burden on enterprises of reading large numbers of policy PDF files.
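The four query conditions may be illustrated by the following Python sketch, which filters an in-memory list of policy records instead of querying the graph database. The record layout and the function name `query_policies` are assumptions made for the example; in the embodiment the same conditions would be evaluated against the graph database.

```python
from typing import Any, Dict, List, Optional

Policy = Dict[str, Any]


def query_policies(policies: List[Policy],
                   name: Optional[str] = None,
                   level: Optional[str] = None,
                   category: Optional[str] = None,
                   index_type: Optional[str] = None) -> List[Policy]:
    """Return the policies that satisfy every condition that was supplied."""
    results = []
    for policy in policies:
        if name is not None and policy["name"] != name:
            continue
        if level is not None and policy["level"] != level:
            continue
        if category is not None and policy["category"] != category:
            continue
        if index_type is not None and not any(
                t["relation"] == index_type for t in policy["triples"]):
            continue
        results.append(policy)
    return results


# Two hypothetical policy records.
policies = [
    {"name": "Policy A", "level": "city", "category": "science and technology",
     "triples": [{"head": "enterprise qualification", "relation": "qualification type",
                  "tail": "high and new technology enterprise"}]},
    {"name": "Policy B", "level": "provincial", "category": "talents",
     "triples": [{"head": "applicant", "relation": "age", "tail": "under 40"}]},
]
city_policies = query_policies(policies, level="city")
age_policies = query_policies(policies, index_type="age")
```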

The embodiment provides an information intelligent analysis method, which can effectively solve the problem of difficult interpretation of a policy text, uses a policy index triplet to represent a policy file, constructs a knowledge graph to store policy information, can automatically extract and store key information of the policy text, and provides services such as policy knowledge inquiry, policy knowledge reasoning and the like.

Example 2

The embodiment of the invention provides an intelligent information analysis system comprising a preprocessing module, a model training module, an index extraction module, a storage module and a query module, wherein the preprocessing module uses a character recognition method to extract key content from a set original file, obtains the information to be processed and stores it;

Based on the application scenario of the embodiment, the system is an enterprise support policy intelligent analysis system based on knowledge characterization, and the system comprises the following contents:

the preprocessing module is used for preprocessing the policy file, extracting key content from the policy file by using a character recognition technology, acquiring a policy text and storing the policy text; the model training module trains an efficient policy index extraction model based on a natural language processing method of named entity recognition and relation extraction; the index extraction module is used for automatically analyzing the input policy text into index triples by utilizing the final model in the training module; the storage module is used for constructing an index knowledge graph and storing policy index information by using a graph database; and the query module is used for providing policy index query service for enterprises.

Example 3

The embodiment of the invention provides a computer readable storage medium, wherein a program is stored in the storage medium, and the program realizes the intelligent information analysis method when being executed by a processor.

The invention effectively solves the problem of difficult interpretation of the policy text, uses the policy index triples to represent the policy file, constructs the knowledge graph to store the policy information, can automatically extract and store the key information of the policy text and provides services such as policy knowledge inquiry and the like.

Claims

1. An intelligent information analysis method is characterized by comprising the following steps:

s2, training a BERT-BiLSTM-CRF index automatic extraction model by using a natural language processing method, wherein the natural language processing method comprises the following steps: a named entity identification process and a relationship extraction process;

2. The intelligent information analysis method according to claim 1, wherein the step S1 specifically includes the following steps:

3. The intelligent information analysis method according to claim 2, wherein the step S2 specifically includes the following steps:

4. The method for intelligent resolution of information according to claim 2, wherein the entity-relationship joint extraction method comprises the following steps:

5. The intelligent information analysis method according to claim 3, wherein the BERT module comprises two unsupervised training tasks, namely masked language modeling (MLM) and next sentence prediction (NSP); in the NSP task, the concatenation of two sentences is input and it is judged whether the second sentence follows the first; in the MLM task, the sentence is segmented into characters, a portion of the characters in the training sample is randomly selected and masked out of the original sentence, and the masked characters are predicted from the remaining characters.

6. The intelligent information analysis method according to claim 3, wherein the BiLSTM module and the CRF module together form a BiLSTM-CRF module, the BiLSTM-CRF module comprising:

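$$
\begin{aligned}
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c \cdot [h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

in the above formula: $i_t$ is the input gate, $o_t$ is the output gate, $f_t$ is the forget gate, $c_t$ is the memory cell, $\sigma$ and $\tanh$ are activation functions, $W$ is the weight matrix of a gate, $b$ is the bias vector of a gate, $x_t$ is the input of the current unit, $h_{t-1}$ is the previous hidden state, $c_{t-1}$ and $c_t$ are the previous and current cell states, $\tilde{c}_t$ is the temporary cell state, $[h_{t-1}, x_t]$ denotes vector concatenation and $\odot$ denotes element-wise multiplication; how much of the information passed from the previous unit is kept or discarded, how much of the current input is retained, and what is output to the next unit are all determined by the computed values of $f_t$, $i_t$ and $o_t$;

the output of the BiLSTM module is expressed as:

$$
h_t = \left[\overrightarrow{h_t};\ \overleftarrow{h_t}\right]
$$

where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the hidden states of the forward and backward LSTM layers at position $t$.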

7. the intelligent information analysis method according to claim 1, wherein S3 is:

8. An intelligent information analysis system, characterized in that the analysis system comprises:

9. A computer-readable storage medium, wherein a program is stored in the storage medium, which when executed by a processor implements an intelligent information parsing method according to any one of claims 1 to 7.