CN116562265A - Information intelligent analysis method, system and storage medium - Google Patents

Information intelligent analysis method, system and storage medium

Publication number
CN116562265A
Authority
CN
China
Prior art keywords
information
index
module
policy
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310811685.7A
Other languages
Chinese (zh)
Other versions
CN116562265B (en)
Inventor
王铁鑫
张超
苏圣阳
孙进宇
刘彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Dnet System Technology Co ltd
Nanjing University of Aeronautics and Astronautics
Original Assignee
Nanjing Dnet System Technology Co ltd
Nanjing University of Aeronautics and Astronautics
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Dnet System Technology Co ltd and Nanjing University of Aeronautics and Astronautics
Priority to CN202310811685.7A (patent CN116562265B)
Publication of CN116562265A
Application granted
Publication of CN116562265B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/027 Frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses an information intelligent analysis method, system and storage medium, and relates to the field of artificial intelligence. The method comprises the following steps: preprocessing a policy file to obtain the key policy content; training an automatic policy index extraction model using natural language processing techniques, chiefly named entity recognition and relation extraction; using the automatic policy index extraction model to parse policy text automatically into index triples; constructing a policy index knowledge graph and storing the index triplet information in a graph database; and providing a policy index knowledge query service for enterprises. The invention effectively addresses the difficulty of interpreting policy text: it represents policy files as policy index triples, constructs a knowledge graph to store the policy information, automatically extracts and stores the key information of policy text, and provides services such as policy knowledge query.

Description

Information intelligent analysis method, system and storage medium
Technical Field
The invention discloses an information intelligent analysis method, an information intelligent analysis system and a storage medium, and relates to the field of artificial intelligence.
Background
With the development of information technology, platforms on which users browse and organize information files have become a common and convenient channel, but it is difficult for such platforms to recommend information files accurately, and difficult for users to find the information files that meet their needs. The main reasons are as follows: users are unaware of the information files, do not understand them, and cannot make use of them; and the number of information files is huge, so screening them consumes a great deal of time and requires a certain degree of expertise.
Disclosure of Invention
In view of the above technical problems, the application aims to provide an information intelligent analysis method, system and storage medium that effectively address the difficulty of interpreting information files: policy files are represented by index triples of the information files, a knowledge graph is constructed to store the policy information, key information of the information files is extracted and stored automatically, and services such as knowledge query and knowledge reasoning over the information files are provided.
In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:
An information intelligent analysis method, comprising the following steps:
S1, extracting key content from a set original file by using a character recognition method, acquiring information to be processed and storing the information;
S2, training a BERT-BiLSTM-CRF index automatic extraction model by using a natural language processing method, wherein the natural language processing method mainly comprises a named entity recognition process and a relation extraction process;
S3, automatically parsing the information to be processed into index triplet information through the trained BERT-BiLSTM-CRF index automatic extraction model;
S4, constructing a set index knowledge graph, and storing the index triplet information by using a graph database;
S5, querying the set index to obtain index triplet sequence information, and feeding it back to the user.
Further, the step S1 specifically includes the following:
a set original file is acquired by using crawler technology, key content is extracted from the set original file by optical character recognition (OCR) to obtain the information to be processed, and the information to be processed is stored in an Excel file.
Further, the step S2 specifically includes the following:
the information to be processed is divided into a training set and a test set according to a set ratio, the training set is labeled in the entity-relationship joint extraction manner, and the BERT-BiLSTM-CRF index automatic extraction model is trained on it;
the BERT-BiLSTM-CRF index automatic extraction model comprises a BERT module, a BiLSTM module and a CRF module, wherein the BERT module, pre-trained on two unsupervised training tasks, converts the input information to be processed into word vectors; the BiLSTM module takes the word vectors output by the BERT module as input, performs encoding, and outputs the result to the CRF module; and the CRF module performs the final decoding to obtain the predicted sequence.
Further, the entity-relationship joint extraction method comprises the following steps:
labeling the training set of information to be processed, wherein the label format comprises three parts: the first part labels the position of the current character within an entity, following the BIOES labeling specification, with labels and meanings {B: entity start, I: inside the entity, E: entity end, S: single entity}; the second part labels the relation information, using a simplified code for each defined relation type to mark the type of the relation; the third part labels the subject or object role of the entity, namely the direction of the relation, with labeling rule {1: entity 1, 2: entity 2} or {3: entity}.
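The three-part label format can be illustrated with a minimal Python sketch. The hyphen-separated layout (for example `B-QT-1`), the relation code `QT`, and the helper names `tag_entity` and `tag_sentence` are assumptions made for this example; the method fixes only the three parts of the label, not their textual encoding. Characters outside any entity receive `O`, as in the BIOES specification.

```python
from typing import List, Tuple

Span = Tuple[int, int, str, int]  # (start, end_exclusive, relation_code, role)


def tag_entity(length: int, relation_code: str, role: int) -> List[str]:
    """Return one tag per character for an entity span of the given length."""
    if length < 1:
        raise ValueError("entity span must contain at least one character")
    if role not in (1, 2, 3):
        raise ValueError("role must be 1, 2 or 3")
    if length == 1:
        positions = ["S"]  # single-character entity
    else:
        positions = ["B"] + ["I"] * (length - 2) + ["E"]
    # Hypothetical textual encoding: position-relation-role.
    return [f"{p}-{relation_code}-{role}" for p in positions]


def tag_sentence(n_chars: int, spans: List[Span]) -> List[str]:
    """Label every character of a sentence; non-entity characters get 'O'."""
    tags = ["O"] * n_chars
    for start, end, relation_code, role in spans:
        tags[start:end] = tag_entity(end - start, relation_code, role)
    return tags


# A 10-character sentence with a head entity at [0, 3) and a tail entity at [5, 9).
tags = tag_sentence(10, [(0, 3, "QT", 1), (5, 9, "QT", 2)])
print(tags)
```

Keeping the relation code and the role inside each tag is what lets a single sequence-labeling pass recover both the entities and the relation between them.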
Further, the BERT module is pre-trained on two unsupervised training tasks, namely masked language modeling (MLM) and next sentence prediction (NSP); in the NSP task, two sentences are concatenated as input and the model judges whether the second sentence follows the first; the MLM task splits a sentence into characters, randomly selects some of the characters in the training sample, erases them from the original sentence, and predicts the erased characters from the remaining characters.
Further, the BiLSTM module and the CRF module together form a BiLSTM-CRF module, and the BiLSTM-CRF module comprises the following contents:
the word vector obtained by the BERT module is input into the BiLSTM module for encoding, the BiLSTM module consists of a forward LSTM layer and a backward LSTM layer, the output is the combination of the two LSTM outputs, and the expression of the LSTM calculation is as follows:
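$$
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

in the above formula: $i_t$ is the input gate, $o_t$ is the output gate, $f_t$ is the forget gate, $c_t$ is the memory cell, $\sigma$ and $\tanh$ are the activation functions, $W$ is the weight matrix of a gate, $b$ is the bias vector of a gate, $x_t$ is the input information of the current unit, $h_{t-1}$ is the state of the previous hidden layer, $c_{t-1}$ and $c_t$ are the previous and the current cell state, and $\tilde{c}_t$ is the temporary cell state; how much of the information passed on by the previous unit the current unit keeps or discards, how much of the current input is retained, and what is output to the next unit are all determined by the calculation results of $f_t$, $i_t$ and $o_t$;

the output result expression of the BiLSTM module is:

$$h_t = [\overrightarrow{h_t}\,;\,\overleftarrow{h_t}]$$

the CRF module creates a label transition matrix according to the relation of adjacent labels, generates label sequences with different probabilities, and takes the sequence with the highest calculated score as the final predicted sequence; for any sequence $X = (x_1, x_2, \ldots, x_n)$, the score calculation formula in the CRF module is:

$$s(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$

wherein $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence for sequence $X$; $P$ is the scoring matrix output by the BiLSTM module, i.e. $P_{i,j}$ represents the score of the $j$-th tag for the $i$-th word; $A$ represents the transition score matrix, and $A_{i,j}$ represents the score of transferring from tag $i$ to tag $j$; the probability of generating the predicted sequence $Y$ is:

$$p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}}$$

taking the logarithm of both sides of the equation to obtain the likelihood function of the predicted sequence:

$$\ln p(Y \mid X) = s(X, Y) - \ln \sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}$$

wherein $Y$ represents the actual labeling sequence and $Y_X$ represents the set of all possible labeling sequences, over which $\tilde{Y}$ ranges; after decoding, the output sequence with the maximum score is finally obtained:

$$Y^{*} = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y})$$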
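The CRF scoring and decoding step can be sketched in plain Python. The sketch assumes the sequence score is the sum of the emission scores `P[i][y_i]` and the transition scores `A[y_i][y_{i+1}]` between adjacent tags, with no separate start or end tag. The function names and the small matrices are illustrative, not taken from the patent. `viterbi` finds the highest-scoring sequence by dynamic programming, and `log_probability` enumerates every sequence, which is only practical for toy inputs.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sequence_score(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """Emission scores P[i][y_i] plus transition scores A[y_i][y_{i+1}]."""
    emission = sum(P[i][tag] for i, tag in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(y, y[1:]))
    return emission + transition


def log_probability(P: Matrix, A: Matrix, y: Sequence[int]) -> float:
    """ln p(Y|X) = s(X, Y) - ln(sum of exp(s) over all sequences), by brute force."""
    k = len(P[0])
    scores = [sequence_score(P, A, cand) for cand in product(range(k), repeat=len(P))]
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return sequence_score(P, A, y) - log_z


def viterbi(P: Matrix, A: Matrix) -> Tuple[List[int], float]:
    """Return the highest-scoring tag sequence and its score."""
    k = len(P[0])
    best = list(P[0])  # best[j]: best score of a path ending in tag j
    back: List[List[int]] = []
    for row in P[1:]:
        new_best, ptr = [], []
        for j in range(k):
            q = max(range(k), key=lambda q: best[q] + A[q][j])
            ptr.append(q)
            new_best.append(best[q] + A[q][j] + row[j])
        best = new_best
        back.append(ptr)
    last = max(range(k), key=best.__getitem__)
    path = [last]
    for ptr in reversed(back):  # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    path.reverse()
    return path, best[last]


# Toy example: 3 characters, 2 tags.
P = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]]
A = [[0.5, -0.5], [-1.0, 0.4]]
path, score = viterbi(P, A)
print(path, score, log_probability(P, A, path))
```

Viterbi decoding avoids enumerating all tag sequences, whose number grows exponentially with the sentence length.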
further, the step S3 is:
the trained BERT-BiLSTM-CRF index automatic extraction model is used to extract triples of the form <head entity, relation, tail entity> from the information to be processed.
The application also provides an information intelligent analysis system, which comprises:
the preprocessing module is used for extracting key contents from the set original file by using a character recognition method, acquiring information to be processed and storing the information;
the model training module trains the set index automatic extraction model by using a natural language processing method, and the natural language processing method mainly comprises a named entity recognition process and a relation extraction process;
the index extraction module automatically analyzes the information to be processed into index triplet information according to a set index automatic extraction model;
the storage module is used for constructing a set index knowledge graph and storing the index triplet information by using a graph database;
and the query module queries the set index to obtain index triplet sequence information and feeds the index triplet sequence information back to the user.
The application also provides a computer readable storage medium, wherein the storage medium stores a program, and the program realizes the intelligent information analysis method when being executed by a processor.
The beneficial effects are that:
according to the information intelligent analysis method, system and storage medium, the problem that the information file is difficult to read is effectively solved, the policy file is represented by the index triplets of the information file, the knowledge graph is constructed to store the policy information, key information of the information file can be automatically extracted and stored, and services such as knowledge query, knowledge reasoning and the like of the information file can be provided.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of an intelligent information analysis method provided in an embodiment of the invention;
FIG. 2 is a schematic diagram of an entity-relationship joint extraction model in the information intelligent analysis method according to an embodiment of the present invention;
FIG. 3 is a flowchart for constructing a policy knowledge graph according to an embodiment of the present invention;
FIG. 4 is a diagram of an example of storing a policy file in a graph database, provided in an embodiment of the present invention.
Description of the embodiments
The present invention will be described in further detail below with reference to the drawings and detailed description, wherein the described embodiments are provided as examples of the invention, and other embodiments, which are obtained by persons skilled in the art without making any inventive work, are within the scope of the invention.
Example 1
Fig. 1 is a flow chart of the information intelligent analysis method. On existing platforms that provide policy information to enterprises, the typical tool is a policy calculator, whose main function is to aggregate national, provincial and municipal policies and offer classified queries. Some policy calculators also provide a self-test function that uses data filled in by the enterprise to determine whether it can apply for a policy. However, this technique has the following disadvantages: data must be filled in repeatedly, and even the initial data entry is tedious; screening on the filled-in data is fuzzy, the data are not mined, and accurate matching is difficult; and the policy calculator mainly assists enterprises with applications and cannot be used in government audits.
This embodiment applies the intelligent analysis to enterprise support policies, and the method comprises the following steps:
S1, extracting key content from a set original file by using a character recognition method, acquiring information to be processed and storing the information; in this embodiment, crawler technology is used to obtain the original policy files from policy information publishing websites, the key content of each policy PDF file is obtained by optical character recognition (OCR), and the resulting main information of the policy text is stored in an Excel file.
S2, training a BERT-BiLSTM-CRF index automatic extraction model by using a natural language processing method; in this embodiment, the model trained is the automatic policy index extraction model, and the natural language processing method mainly comprises a named entity recognition process and a relation extraction process;
Fig. 2 is a schematic diagram of the entity-relationship joint extraction model of the information intelligent analysis method according to an embodiment of the present invention, where the entity-relationship joint extraction model includes the following contents:
The information to be processed in the Excel file is divided into a training set and a test set at a ratio of 7:3, and the policy text training set is labeled in the entity-relationship joint extraction manner. The form and format of the labels may be user-defined, provided they reflect the samples and their corresponding features. In this embodiment, the label format used during data labeling consists of three parts. The first part labels the position of the current character within an entity; its labeling rule follows the BIOES labeling specification, with labels and meanings {B (entity start), I (entity interior), E (entity end), S (single entity)}. The second part labels the relation information: each defined relation type is given a simplified code, and the code marks the type of the relation. The third part labels the subject or object role of the entity, namely the direction of the relation, with labeling rule {1 (entity 1), 2 (entity 2)} or {3 (entity)}; because of the specificity of policy text, some types of policy index relations omit the subject of the policy index, and since the subject of such relations must be supplemented, the entities involved are labeled separately. All remaining characters that do not belong to an entity-relation triplet are labeled "O".
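To show how a labeled or predicted tag sequence in this three-part format could be turned back into triples, a minimal Python sketch follows. Several details are assumptions made for illustration: the hyphen-separated tag layout (`B-NP-1`), the relation codes `NP` and `QT`, the reading of role `3` as an entity whose subject is omitted and must be supplied from a lookup table (`default_heads`), and the simple rule that a role-1 entity is paired with the next role-2 entity carrying the same relation code.

```python
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[str, str, int]   # (text, relation_code, role)
Triple = Tuple[str, str, str]   # (head entity, relation code, tail entity)


def collect_entities(chars: Sequence[str], tags: Sequence[str]) -> List[Entity]:
    """Group characters into entity spans using the B/I/E/S position part."""
    entities: List[Entity] = []
    buffer: List[str] = []
    for ch, tag in zip(chars, tags):
        if tag == "O":
            buffer = []
            continue
        position, relation_code, role = tag.split("-")
        if position in ("B", "S"):
            buffer = [ch]          # a new entity starts here
        else:
            buffer.append(ch)      # I or E continues the current entity
        if position in ("E", "S"):
            entities.append(("".join(buffer), relation_code, int(role)))
            buffer = []
    return entities


def to_triples(entities: List[Entity], default_heads: Dict[str, str]) -> List[Triple]:
    """Pair role-1 heads with role-2 tails; role 3 takes a supplemented head."""
    triples: List[Triple] = []
    pending: Dict[str, str] = {}   # relation code -> head waiting for its tail
    for text, relation_code, role in entities:
        if role == 1:
            pending[relation_code] = text
        elif role == 2 and relation_code in pending:
            triples.append((pending.pop(relation_code), relation_code, text))
        elif role == 3:
            triples.append((default_heads[relation_code], relation_code, text))
    return triples


text = "firm has 50 staff"
tags = ["O"] * len(text)
tags[0:4] = ["B-NP-1", "I-NP-1", "I-NP-1", "E-NP-1"]   # "firm"
tags[9:11] = ["B-NP-2", "E-NP-2"]                       # "50"
triples = to_triples(collect_entities(text, tags), default_heads={})

text2 = "HT firm"
tags2 = ["B-QT-3", "E-QT-3"] + ["O"] * 5                # "HT", subject omitted
triples2 = to_triples(collect_entities(text2, tags2), {"QT": "enterprise qualification"})
print(triples, triples2)
```

A production decoder would also need to handle malformed tag sequences and sentences containing several triples of the same relation type, which this sketch does not attempt.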
A knowledge extraction model based on BERT-BiLSTM-CRF, namely the BERT-BiLSTM-CRF index automatic extraction model, is adopted to perform the joint extraction of entities and relations. The model first inputs the labeled sequence into the BERT layer to obtain contextualized word vectors; the word vectors are then input into the BiLSTM layer for encoding, and the encoded result is output to the CRF module, which decodes it to obtain the predicted sequence.
When the language model is pre-trained, the BERT model builds two unsupervised training tasks, namely masked language modeling, MLM (Masked Language Model), and next sentence prediction, NSP (Next Sentence Prediction). The NSP task takes the concatenation of two sentences as input, and the model judges whether the second sentence follows the first. The MLM task splits each sentence into characters, randomly selects 15% of the characters in the training samples, erases them from the original sentence, and predicts the erased characters from the remaining characters.
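The character-level masking of the MLM task can be sketched as follows. This is a simplified illustration under stated assumptions: the `[MASK]` placeholder, the function name `mask_characters` and the rounding rule are choices made for the example, and every selected character is simply replaced by the placeholder, as the description above states. Published BERT pre-training additionally replaces some selected positions with random tokens or leaves them unchanged, which is omitted here.

```python
import random
from typing import Dict, List, Optional, Tuple

MASK = "[MASK]"


def mask_characters(
    sentence: str, mask_ratio: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], Dict[int, str]]:
    """Split a sentence into characters and erase a random subset of them.

    Returns the masked character list and a mapping from each masked position
    to the original character, which serves as the prediction target.
    """
    chars = list(sentence)
    if not chars:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(chars) * mask_ratio))  # mask at least one character
    positions = rng.sample(range(len(chars)), n_mask)
    targets = {p: chars[p] for p in positions}
    masked = [MASK if i in targets else ch for i, ch in enumerate(chars)]
    return masked, targets


original = "abcdefghijklmnopqrst"          # 20 characters, so 3 are masked
masked, targets = mask_characters(original, seed=0)
restored = "".join(targets.get(i, ch) for i, ch in enumerate(masked))
print(masked, targets)
```

During pre-training the model sees only `masked` and is trained to recover the characters stored in `targets`.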
The contextualized word vectors obtained from the BERT layer are input into the BiLSTM layer for encoding. The BiLSTM layer consists of a forward LSTM layer and a backward LSTM layer, and its output is the combination of the two LSTM outputs. Gating is the core of the LSTM model: the gates in the LSTM model include the forget gate $f_t$, the input gate $i_t$ and the output gate $o_t$, together with the memory cell $c_t$. The forget gate and the input gate pass on useful information and filter out useless information during the calculation, and the output of the memory cell is multiplied by the output of the output gate to give the output of the whole structure. The formula for the LSTM calculation is shown below:

$$
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

in the above formula: $i_t$ is the input gate, $o_t$ is the output gate, $f_t$ is the forget gate, $c_t$ is the memory cell, $\sigma$ and $\tanh$ are the activation functions, $W$ is the weight matrix of a gate, $b$ is the bias vector of a gate, $x_t$ is the input information of the current unit, $h_{t-1}$ is the state of the previous hidden layer, $c_{t-1}$ and $c_t$ are the previous and the current cell state, and $\tilde{c}_t$ is the temporary cell state; how much of the information passed on by the previous unit the current unit keeps or discards, how much of the current input is retained, and what is output to the next unit are all determined by the calculation results of $f_t$, $i_t$ and $o_t$;

the output result expression of the BiLSTM module is:

$$h_t = [\overrightarrow{h_t}\,;\,\overleftarrow{h_t}]$$

the CRF module creates a label transition matrix according to the relation of adjacent labels, generates label sequences with different probabilities, and takes the sequence with the highest calculated score as the final predicted sequence; for any sequence $X = (x_1, x_2, \ldots, x_n)$, the score calculation formula in the CRF module is:

$$s(X, Y) = \sum_{i=1}^{n} P_{i, y_i} + \sum_{i=1}^{n-1} A_{y_i, y_{i+1}}$$

wherein $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence for sequence $X$; $P$ is the scoring matrix output by the BiLSTM module, i.e. $P_{i,j}$ represents the score of the $j$-th tag for the $i$-th word; $A$ represents the transition score matrix, and $A_{i,j}$ represents the score of transferring from tag $i$ to tag $j$; the probability of generating the predicted sequence $Y$ is:

$$p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}}$$

taking the logarithm of both sides of the equation to obtain the likelihood function of the predicted sequence:

$$\ln p(Y \mid X) = s(X, Y) - \ln \sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}$$

wherein $Y$ represents the actual labeling sequence and $Y_X$ represents the set of all possible labeling sequences, over which $\tilde{Y}$ ranges; after decoding, the output sequence with the maximum score is finally obtained:

$$Y^{*} = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y})$$
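The gating computation of the BiLSTM layer described in this embodiment can be sketched in plain Python. The sketch assumes the standard LSTM gate equations, in which each gate applies its weight matrix to the concatenation of the previous hidden state and the current input. The function names, the dictionary layout of the parameters and the constant toy weights are assumptions for this example. A real model would learn separate parameters for the forward and backward directions; the example reuses one set for brevity.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Params = Dict[str, list]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def affine(W: List[Vector], v: Vector, b: Vector) -> Vector:
    """Compute W v + b for a row-major weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector, p: Params) -> Tuple[Vector, Vector]:
    """One LSTM unit: gates decide what to forget, what to store, what to emit."""
    z = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]        # input gate
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]        # forget gate
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]        # output gate
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # temporary cell state
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


def run_lstm(xs: List[Vector], p: Params, hidden: int) -> List[Vector]:
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(xs: List[Vector], fwd: Params, bwd: Params, hidden: int) -> List[Vector]:
    """Concatenate forward and backward hidden states at every position."""
    forward = run_lstm(xs, fwd, hidden)
    backward = run_lstm(xs[::-1], bwd, hidden)[::-1]
    return [f + b for f, b in zip(forward, backward)]


def constant_params(input_size: int, hidden: int, bias_c: float = 0.0) -> Params:
    """Toy parameters: zero weights, zero gate biases, constant candidate bias."""
    width = hidden + input_size
    p: Params = {}
    for gate in ("i", "f", "o", "c"):
        p[f"W_{gate}"] = [[0.0] * width for _ in range(hidden)]
        p[f"b_{gate}"] = [0.0] * hidden
    p["b_c"] = [bias_c] * hidden
    return p


xs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
params = constant_params(input_size=2, hidden=1, bias_c=1.0)
outputs = bilstm(xs, params, params, hidden=1)
print(outputs)
```

Each output vector has twice the hidden size because it joins the forward and backward states, and it is these vectors that the CRF module receives as its scoring input after a projection to tag scores.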
S3, using the trained automatic policy index extraction model, triples of the form <head entity, relation, tail entity> are extracted from the administrative policy text. The relation types in the index triples extracted from policy text can be classified into 14 kinds: year, place, academy, title, business or institution, type of business, industry, honor or title, type of economy, money, number of people, age, time, place; because of the specificity of policy text, some types of policy index relations omit the subject of the policy index, and for such relations the subject is supplemented automatically, for example: <enterprise qualification, qualification type, high and new technology enterprise>. Table 1 lists the classes of policy index triples and examples provided in an embodiment of the present invention;
TABLE 1
A client interface is developed to display the policy index extraction function and its results visually. The client is developed on the Vue framework and provides three functions: entering policy text content, sending the data via an extraction button, and rendering the index triplet table. After the policy text content is typed or pasted in and the extraction button is clicked, the client sends the entered policy text content to the server. The server interface is developed on the Flask framework: after the interface receives the policy text content sent by the client, it feeds the content to the trained automatic policy index extraction model, which outputs the recognized policy index triplet sequence, and the server interface returns that sequence to the client. After receiving the policy index triplet sequence, the client renders each triplet into the table in turn under the columns "head entity", "relation" and "tail entity".
S4, constructing a policy index knowledge graph, and describing entities and concepts in the policy index and the relation between the entities and concepts.
Fig. 3 is a flowchart of the construction of a policy knowledge graph according to an embodiment of the present invention, where the flowchart includes the following specific details:
Specifically, the existing semantic structures in structured and semi-structured data, including databases and tables, are first reviewed and combined with the experience of experts in the field of policy application to construct the schema layer of the policy knowledge graph from top to bottom; the index triples extracted in step S3 are then stored in a graph database, thereby constructing the data layer of the knowledge graph.
The data layer of the knowledge graph is constructed as follows. First, a Python program reads, one by one, the information of each policy file in the Excel table produced by the preprocessing in step S1; the information for one policy file comprises its name, level, category and text content. The levels include district level, city level, provincial level and national level; the categories include science and technology, letter, talents, etc. Second, the text content of each policy file is input into the model, which returns a policy index triplet sequence. The name, level, category and index triplet sequence of the policy file are then stored in a JSON file, and this is repeated until all policy files in the Excel table have been parsed and stored. The JSON file serves as an intermediate form of the data to be stored in the graph database.
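The intermediate JSON form can be sketched with the Python standard library. The field names (`name`, `level`, `category`, `triples`, `head`, `relation`, `tail`), the helper `build_records` and the stand-in `fake_extract` function are assumptions made for this example; in the described system the rows would come from the Excel table and the triples from the trained extraction model.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


def build_records(
    rows: Iterable[Dict[str, str]],
    extract: Callable[[str], List[Triple]],
) -> List[dict]:
    """Turn policy rows into records holding metadata plus extracted triples."""
    records = []
    for row in rows:
        triples = extract(row["text"])  # the model call in the real pipeline
        records.append(
            {
                "name": row["name"],
                "level": row["level"],
                "category": row["category"],
                "triples": [
                    {"head": h, "relation": r, "tail": t} for h, r, t in triples
                ],
            }
        )
    return records


def fake_extract(text: str) -> List[Triple]:
    """Hypothetical stand-in for the trained extraction model."""
    return [("enterprise qualification", "qualification type",
             "high and new technology enterprise")]


rows = [
    {
        "name": "Example policy",
        "level": "city level",
        "category": "science and technology",
        "text": "placeholder policy text",
    }
]
records = build_records(rows, fake_extract)
# ensure_ascii=False keeps Chinese policy text readable in the output file.
payload = json.dumps(records, ensure_ascii=False, indent=2)
print(payload)
```

Storing one record per policy keeps the file easy to stream into the graph database loader policy by policy.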
A client is developed on the Vue framework. The client reads the JSON file, extracts the information of each policy in turn (its name, category, level and index triplet sequence) and sends it to the server. The server is developed on the Spring Boot framework; it receives the policy information sent by the client, connects to a Neo4j database, and stores the policy information in that database. In the Neo4j database, a root node is first created for each policy, with the policy's name, category and level as its attributes. Second, a node is created for each head entity and tail entity in the index triples, with the entity name as its attribute. Then an edge is created between the policy node and each head entity node, with the content "index". Finally, an edge is created between the head entity and the tail entity of each triplet, whose content is the relation in the corresponding index triplet, for example "index includes" or "has". FIG. 4 is a diagram of an example of storing a policy file in a graph database, provided in an embodiment of the present invention.
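The node and edge creation order described here can be sketched as parameterized Cypher statements. The sketch is written in Python for illustration only, since the described server is a Spring Boot application, and it merely builds the statements rather than connecting to a database. The node labels `Policy` and `Entity`, the relationship types `INDEX` and `RELATION`, the `content` property and the record layout are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Statement = Tuple[str, Dict[str, str]]  # (Cypher text, parameters)

# Root node per policy, carrying name, category and level as properties.
CREATE_POLICY = (
    "MERGE (p:Policy {name: $name}) "
    "SET p.category = $category, p.level = $level"
)

# Entity nodes, the policy-to-head edge, and the head-to-tail relation edge.
LINK_TRIPLE = (
    "MATCH (p:Policy {name: $policy}) "
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (p)-[:INDEX]->(h) "
    "MERGE (h)-[:RELATION {content: $relation}]->(t)"
)


def policy_statements(record: dict) -> List[Statement]:
    """Build the statements that store one policy and its index triples."""
    statements: List[Statement] = [
        (
            CREATE_POLICY,
            {
                "name": record["name"],
                "category": record["category"],
                "level": record["level"],
            },
        )
    ]
    for triple in record["triples"]:
        statements.append(
            (
                LINK_TRIPLE,
                {
                    "policy": record["name"],
                    "head": triple["head"],
                    "relation": triple["relation"],
                    "tail": triple["tail"],
                },
            )
        )
    return statements


record = {
    "name": "Example policy",
    "category": "science and technology",
    "level": "city level",
    "triples": [
        {"head": "enterprise", "relation": "has", "tail": "high-tech certificate"}
    ],
}
statements = policy_statements(record)
for query, parameters in statements:
    print(query, parameters)
```

`MERGE` is used instead of `CREATE` so that re-running the loader does not duplicate nodes. One consequence of this choice is that entities with the same name are shared across policies, which may or may not match the intended graph design. The relation text is stored as an edge property because Cypher does not allow relationship types to be passed as parameters.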
S5, inquiring the set index to obtain index triplet sequence information, feeding back the index triplet sequence information to the user, and providing policy index inquiring service for enterprises.
After all the policy index data have been stored in the graph database in S4, the enterprise may select a query condition, such as the policy's name, level or category, or a specific index type from Table 1, and obtain the required policy index information. For a policy name query condition, the enterprise obtains the policy content corresponding to that name; for a policy level query condition, all policy content at that level; for a policy category query condition, all policy content in that category; and for a specific index type, all policy content containing that index type. Providing such policy query services can effectively relieve enterprises of the burden of reading a large number of policy PDF files.
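The four query conditions can be illustrated with an in-memory filter. This is only a sketch of the selection logic: the described system queries the graph database, whereas the example below filters a list of policy records whose layout (`name`, `level`, `category`, and `triples` with a `relation` field standing for the index type) is assumed for illustration.

```python
from typing import List, Optional


def query_policies(
    records: List[dict],
    name: Optional[str] = None,
    level: Optional[str] = None,
    category: Optional[str] = None,
    index_type: Optional[str] = None,
) -> List[dict]:
    """Return the policies matching every condition that was supplied."""
    results = []
    for record in records:
        if name is not None and record["name"] != name:
            continue
        if level is not None and record["level"] != level:
            continue
        if category is not None and record["category"] != category:
            continue
        if index_type is not None and not any(
            triple["relation"] == index_type for triple in record["triples"]
        ):
            continue
        results.append(record)
    return results


records = [
    {
        "name": "Policy A",
        "level": "city level",
        "category": "science and technology",
        "triples": [{"head": "enterprise", "relation": "industry", "tail": "software"}],
    },
    {
        "name": "Policy B",
        "level": "provincial level",
        "category": "talents",
        "triples": [{"head": "applicant", "relation": "age", "tail": "under 40"}],
    },
]

by_level = query_policies(records, level="city level")
by_index = query_policies(records, index_type="age")
print([r["name"] for r in by_level], [r["name"] for r in by_index])
```

Conditions left as `None` are ignored, so the same function serves single-condition and combined queries.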
The embodiment provides an information intelligent analysis method, which can effectively solve the problem of difficult interpretation of a policy text, uses a policy index triplet to represent a policy file, constructs a knowledge graph to store policy information, can automatically extract and store key information of the policy text, and provides services such as policy knowledge inquiry, policy knowledge reasoning and the like.
Example 2
The embodiment of the invention provides an information intelligent analysis system, which comprises:
the preprocessing module, which uses a character recognition method to extract key content from a set original file, acquires the information to be processed and stores it;
the model training module, which trains the set index automatic extraction model by using a natural language processing method, the natural language processing method mainly comprising a named entity recognition process and a relation extraction process;
the index extraction module automatically analyzes the information to be processed into index triplet information according to a set index automatic extraction model;
the storage module is used for constructing a set index knowledge graph and storing the index triplet information by using a graph database;
and the query module queries the set index to obtain index triplet sequence information and feeds the index triplet sequence information back to the user.
In the application scenario of this embodiment, the system is an intelligent analysis system for enterprise support policies based on knowledge representation, and comprises the following:
the preprocessing module preprocesses the policy file, extracts key content from it by using a character recognition technology, and obtains and stores the policy text; the model training module trains an efficient policy index extraction model based on a natural language processing method combining named entity recognition and relation extraction; the index extraction module automatically parses the input policy text into index triples by using the final model from the training module; the storage module constructs an index knowledge graph and stores the policy index information in a graph database; and the query module provides a policy index query service for enterprises.
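The way the modules hand data to one another can be sketched as follows. This is a minimal illustration under stated assumptions, not the system itself: `TripleStore` is an in-memory stand-in for the graph database, `toy_recognizer` and `toy_extractor` are hypothetical placeholders for the character recognition step and the trained extraction model, and the model training module is omitted.

```python
from typing import Callable, Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


class TripleStore:
    """In-memory stand-in for the graph database (hypothetical)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, triples: Iterable[Triple]) -> None:
        self._triples.update(triples)

    def query(self, head: Optional[str] = None, relation: Optional[str] = None,
              tail: Optional[str] = None) -> List[Triple]:
        """Return stored triples matching every condition that is not None."""
        wanted = (head, relation, tail)
        return sorted(
            t for t in self._triples
            if all(w is None or w == v for w, v in zip(wanted, t))
        )


def run_pipeline(raw_files: Iterable[bytes],
                 recognize_text: Callable[[bytes], str],
                 extract_triples: Callable[[str], List[Triple]],
                 store: TripleStore) -> TripleStore:
    """Preprocessing -> index extraction -> storage, one file at a time."""
    for raw in raw_files:
        text = recognize_text(raw)          # preprocessing module
        store.add(extract_triples(text))    # index extraction + storage modules
    return store


def toy_recognizer(raw: bytes) -> str:
    """Placeholder for character recognition: just decodes the bytes."""
    return raw.decode("utf-8")


def toy_extractor(text: str) -> List[Triple]:
    """Placeholder for the trained model: expects 'head|relation|tail'."""
    head, relation, tail = text.split("|")
    return [(head, relation, tail)]


store = run_pipeline(
    [b"Policy A|requires|annual revenue", b"Policy B|requires|R&D staff share"],
    toy_recognizer, toy_extractor, TripleStore(),
)
result = store.query(head="Policy A")  # query module
```

Passing the recognizer and extractor in as callables keeps each module replaceable, which mirrors the modular division described above.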
Example 3
This embodiment of the invention provides a computer-readable storage medium in which a program is stored; when executed by a processor, the program implements the intelligent information analysis method described above.
The invention effectively addresses the difficulty of interpreting policy text: it represents a policy file as policy index triples, constructs a knowledge graph to store the policy information, automatically extracts and stores the key information of the policy text, and provides services such as policy knowledge query.

Claims (9)

1. An intelligent information analysis method, characterized by comprising the following steps:
S1, extracting key content from a set original file by using a character recognition method, obtaining information to be processed, and storing it;
S2, training a BERT-BiLSTM-CRF automatic index extraction model by using a natural language processing method, wherein the natural language processing method comprises a named entity recognition process and a relation extraction process;
S3, automatically parsing the information to be processed into index triplet information by means of the trained BERT-BiLSTM-CRF automatic index extraction model;
S4, constructing a set index knowledge graph, and storing the index triplet information in a graph database;
S5, querying the set index to obtain index triplet sequence information, and feeding it back to the user.
2. The intelligent information analysis method according to claim 1, wherein step S1 specifically comprises:
acquiring the set original file by using crawler technology, extracting key content from the set original file by using optical character recognition (OCR), obtaining the information to be processed, and storing the information to be processed in an Excel file.
3. The intelligent information analysis method according to claim 2, wherein the step S2 specifically includes the following steps:
dividing the information to be processed into a training set and a testing set according to a set proportion, marking the training set of the information to be processed by using a mode of entity-relation joint extraction, and training a BERT-BiLSTM-CRF index automatic extraction model;
the automatic BERT-BiLSTM-CRF index extraction model comprises a BERT module, a BiLSTM module and a CRF module, wherein the BERT module converts input information to be processed into word vectors by constructing two unsupervised training tasks, the BiLSTM module takes the output word vectors of the BERT module as input, performs coding calculation and then outputs the result to the CRF module, and performs final decoding calculation in the CRF module to obtain a predicted sequence.
4. The intelligent information analysis method according to claim 2, wherein the entity-relation joint extraction comprises the following steps:
labeling the training set of information to be processed, wherein each label comprises three parts: the first part marks the position of a word within an entity, following the BIOES labeling specification, with the labels and their meanings being {B: entity start, I: inside the entity, E: entity end, S: single entity}; the second part marks the relation information, using a simplified code assigned to each formulated relation type; the third part marks the subject and object roles of the entity, namely the direction of the relation, with the labeling rule {1: entity 1, 2: entity 2} or {3: entity}.
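The three-part label can be illustrated with a small sketch. The concrete string layout `position-relation-role`, the `-` separator, the relation code `REQ`, and the `O` tag for words outside any entity (taken from the general BIOES convention) are assumptions made for illustration; the claim fixes only the three parts and their meanings.

```python
def make_tags(n_tokens, spans):
    """Build one three-part tag per token.

    spans: list of (start, end_inclusive, relation_code, role), where role is
    1 (entity 1), 2 (entity 2) or 3 (a stand-alone entity).
    """
    tags = ["O"] * n_tokens  # "O" marks tokens outside any entity (assumed)
    for start, end, relation, role in spans:
        if start == end:
            positions = ["S"]
        else:
            positions = ["B"] + ["I"] * (end - start - 1) + ["E"]
        for offset, position in enumerate(positions):
            tags[start + offset] = f"{position}-{relation}-{role}"
    return tags


def decode_triples(tokens, tags):
    """Recover (head, relation, tail) triples by pairing role 1 with role 2."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = []
            continue
        position, relation, role = tag.split("-")
        if position in ("B", "S"):
            current = [token]
        else:
            current.append(token)
        if position in ("E", "S"):
            entities.append((" ".join(current), relation, role))
            current = []
    heads = [(text, rel) for text, rel, role in entities if role == "1"]
    tails = [(text, rel) for text, rel, role in entities if role == "2"]
    return [(h, h_rel, t) for h, h_rel in heads for t, t_rel in tails if h_rel == t_rel]


tokens = ["The", "annual", "revenue", "exceeds", "5", "million"]
tags = make_tags(len(tokens), [(1, 2, "REQ", 1), (4, 5, "REQ", 2)])
triples = decode_triples(tokens, tags)
```

Because the relation code and the role are carried in every tag, a single sequence-labeling pass yields both the entities and the relation between them, which is what makes the extraction joint.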
5. The intelligent information analysis method according to claim 3, wherein the BERT module comprises two unsupervised training tasks, namely the masked language model (MLM) task and the next sentence prediction (NSP) task; in the NSP task, two input sentences are concatenated and it is judged whether the second sentence follows the first; in the MLM task, the sentence is split into individual characters, some of the characters in the training sample are randomly selected and masked out of the original sentence, and the masked characters are predicted from the remaining characters.
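The character-level masking step of the MLM task can be sketched as below. The 15% masking ratio and the `[MASK]` placeholder follow the convention of the original BERT model and are assumptions here, since the claim does not state them; the function name `mask_characters` is hypothetical, and BERT's further refinement of sometimes keeping or randomly replacing a selected character is left out.

```python
import random


def mask_characters(sentence, mask_ratio=0.15, mask_token="[MASK]", seed=0):
    """Split a sentence into characters and mask a random subset of them.

    Returns the masked character list and a dict {position: original character}
    holding the prediction targets.
    """
    chars = list(sentence)
    if not chars:
        return [], {}
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_mask = max(1, round(len(chars) * mask_ratio))
    positions = rng.sample(range(len(chars)), n_mask)
    targets = {p: chars[p] for p in positions}
    masked = [mask_token if i in targets else c for i, c in enumerate(chars)]
    return masked, targets


sentence = "企业扶持政策智能解析"
masked, targets = mask_characters(sentence)
# Putting the targets back in place restores the original sentence.
restored = "".join(targets.get(i, c) for i, c in enumerate(masked))
```

During pre-training the model sees `masked` as input and is trained to output the characters stored in `targets`.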
6. The intelligent information analysis method according to claim 3, wherein the BiLSTM module and the CRF module together form a BiLSTM-CRF module, the BiLSTM-CRF module comprising:
the word vectors obtained by the BERT module are input into the BiLSTM module for encoding; the BiLSTM module consists of a forward LSTM layer and a backward LSTM layer, and its output is the combination of the two LSTM outputs; the LSTM calculation is expressed as:
$$
\begin{aligned}
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) \\
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$
in the above formulas, $i_t$ is the input gate, $o_t$ is the output gate, $f_t$ is the forget gate, $c_t$ is the memory cell, $\sigma$ and $\tanh$ are the activation functions, $W$ is the weight matrix of a gate, $b$ is the bias vector of a gate, $x_t$ is the input information of the current cell, $h_{t-1}$ is the state of the previous hidden layer, $c_{t-1}$ and $c_t$ are the previous and the current cell states, and $\tilde{c}_t$ is the temporary cell state; how much of the information passed on by the previous cell the current cell keeps or discards, how much of the current input it retains, and what it outputs to the next cell are all determined by the calculation results of $f_t$, $i_t$ and $o_t$;
the output of the BiLSTM module is expressed as:
$$
h_t = \left[\overrightarrow{h_t};\ \overleftarrow{h_t}\right]
$$
the CRF module creates a label transition matrix according to the relations between adjacent labels, generates label sequences with different probabilities, and takes the sequence with the highest calculated score as the final predicted sequence; for any sequence $X = (x_1, x_2, \ldots, x_n)$, the score in the CRF module is calculated as:
$$
s(X, Y) = \sum_{i=1}^{n} P_{i,\,y_i} + \sum_{i=1}^{n-1} A_{y_i,\,y_{i+1}}
$$
wherein $Y = (y_1, y_2, \ldots, y_n)$ is the predicted sequence of the sequence $X$, $P$ is the scoring matrix output by the BiLSTM module, $P_{i,j}$ represents the score of the $j$-th label for the $i$-th word, $A$ represents the transition score matrix, and $A_{i,j}$ represents the score of a transition from label $i$ to label $j$; the probability of generating the predicted sequence $Y$ is:
$$
p(Y \mid X) = \frac{e^{s(X, Y)}}{\sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}}
$$
taking the logarithm of both sides of the equation gives the likelihood function of the predicted sequence:
$$
\log p(Y \mid X) = s(X, Y) - \log \sum_{\tilde{Y} \in Y_X} e^{s(X, \tilde{Y})}
$$
wherein $Y$ represents the actual labeling sequence and $Y_X$ represents all possible labeling sequences; decoding finally yields the output sequence with the maximum score:
$$
Y^{*} = \arg\max_{\tilde{Y} \in Y_X} s(X, \tilde{Y})
$$
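The scoring and decoding performed by the CRF module can be illustrated with a short sketch. It assumes that the score of a label sequence is the sum of its emission scores from `P` and its adjacent-label transition scores from `A`, that labels are integer indices, and that start and end boundary transitions are omitted for brevity; the function names and the toy matrices are hypothetical. The highest-scoring sequence is found with the Viterbi algorithm and cross-checked by brute force.

```python
from itertools import product


def sequence_score(P, A, y):
    """Score of label sequence y: emission scores plus adjacent transition scores."""
    emission = sum(P[i][label] for i, label in enumerate(y))
    transition = sum(A[a][b] for a, b in zip(y, y[1:]))
    return emission + transition


def viterbi_decode(P, A):
    """Return (best label sequence, its score) by dynamic programming."""
    n_labels = len(A)
    best = list(P[0])      # best[j]: best score of any path ending in label j
    backpointers = []
    for row in P[1:]:
        new_best, pointers = [], []
        for j in range(n_labels):
            prev = max(range(n_labels), key=lambda k: best[k] + A[k][j])
            new_best.append(best[prev] + A[prev][j] + row[j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    last = max(range(n_labels), key=lambda k: best[k])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def brute_force_decode(P, A):
    """Enumerate every label sequence; only feasible for tiny examples."""
    n_labels = len(A)
    return max(product(range(n_labels), repeat=len(P)),
               key=lambda y: sequence_score(P, A, y))


# Toy scores: 3 words, 2 labels (made-up numbers).
P = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]]
A = [[0.5, -0.2], [-0.4, 0.6]]
path, score = viterbi_decode(P, A)
```

Viterbi decoding needs time proportional to the sequence length times the square of the number of labels, whereas enumerating all sequences grows exponentially with the sequence length.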
7. The intelligent information analysis method according to claim 1, wherein S3 is:
extracting triples of the form <head entity, relation, tail entity> from the information to be processed by using the trained BERT-BiLSTM-CRF automatic index extraction model.
8. An intelligent information analysis system, characterized in that the analysis system comprises:
a preprocessing module, which extracts key content from a set original file by using a character recognition method, obtains the information to be processed, and stores it;
a model training module, which trains a set automatic index extraction model by using a natural language processing method, the natural language processing method mainly comprising a named entity recognition process and a relation extraction process;
an index extraction module, which automatically parses the information to be processed into index triplet information according to the set automatic index extraction model;
a storage module, which constructs a set index knowledge graph and stores the index triplet information in a graph database;
and a query module, which queries the set index to obtain index triplet sequence information and feeds it back to the user.
9. A computer-readable storage medium, wherein a program is stored in the storage medium, and the program, when executed by a processor, implements the intelligent information analysis method according to any one of claims 1 to 7.
CN202310811685.7A 2023-07-04 2023-07-04 Information intelligent analysis method, system and storage medium Active CN116562265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310811685.7A CN116562265B (en) 2023-07-04 2023-07-04 Information intelligent analysis method, system and storage medium

Publications (2)

Publication Number Publication Date
CN116562265A (en) 2023-08-08
CN116562265B (en) 2023-12-01

Family

ID=87502139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310811685.7A Active CN116562265B (en) 2023-07-04 2023-07-04 Information intelligent analysis method, system and storage medium

Country Status (1)

Country Link
CN (1) CN116562265B (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140075004A1 (en) * 2012-08-29 2014-03-13 Dennis A. Van Dusen System And Method For Fuzzy Concept Mapping, Voting Ontology Crowd Sourcing, And Technology Prediction
CN105573984A (en) * 2015-12-18 2016-05-11 小米科技有限责任公司 Socio-economic indicator identification method and device
US20190236130A1 (en) * 2018-01-31 2019-08-01 Apple Inc. Knowledge-based framework for improving natural language understanding
CN111428053A (en) * 2020-03-30 2020-07-17 西安交通大学 Tax field knowledge graph construction method
CN112241438A (en) * 2020-10-09 2021-01-19 浙江水木海角科技服务有限公司 Policy service information data processing and query method and system
CN113312501A (en) * 2021-06-29 2021-08-27 中新国际联合研究院 Construction method and device of safety knowledge self-service query system based on knowledge graph
CN113360671A (en) * 2021-06-16 2021-09-07 浙江工业大学 Medical insurance medical document auditing method and system based on knowledge graph
CN113535917A (en) * 2021-06-30 2021-10-22 山东师范大学 Intelligent question-answering method and system based on travel knowledge map
US20220092096A1 (en) * 2020-09-23 2022-03-24 International Business Machines Corporation Automatic generation of short names for a named entity
CN114461781A (en) * 2021-12-30 2022-05-10 阿里云计算有限公司 Data storage method, data query method, server and storage medium
CN114580639A (en) * 2022-02-23 2022-06-03 中南民族大学 Knowledge graph construction method based on automatic extraction and alignment of government affair triples
CN115292490A (en) * 2022-08-02 2022-11-04 福建省科立方科技有限公司 Analysis algorithm for policy interpretation semantics
CN115310425A (en) * 2022-10-08 2022-11-08 浙江浙里信征信有限公司 Policy text analysis method based on policy text classification and key information identification
CN115344666A (en) * 2022-05-30 2022-11-15 招商银行股份有限公司 Policy matching method, device, equipment and computer readable storage medium
CN115470871A (en) * 2022-11-02 2022-12-13 江苏鸿程大数据技术与应用研究院有限公司 Policy matching method and system based on named entity recognition and relation extraction model
CN115906842A (en) * 2022-10-08 2023-04-04 浙江浙里信征信有限公司 Policy information identification method
CN115953041A (en) * 2022-12-30 2023-04-11 广东数源智汇科技有限公司 Construction scheme and system of operator policy system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HERMAN YULIANSYAH: "Taxonomy of Link Prediction for Social Network Analysis: A Review", IEEE, vol. 8, pages 183470, XP011816280, DOI: 10.1109/ACCESS.2020.3029122 *
揣子昂等: "产业政策知识图谱的自动化构建", 情报工程, vol. 8, no. 3, pages 28 *
翟岩慧等: "融合决策蕴涵的知识图谱推理方法", 计算机科学与探索, pages 1 *
黄茜茜等: "基于司法判决书的知识图谱构建与知识服务应用分析", 情报科学, vol. 40, no. 2, pages 133 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117609432A (en) * 2023-12-21 2024-02-27 中国疾病预防控制中心慢性非传染性疾病预防控制中心 Method for realizing intelligent policy retrieval through label extraction strategy
CN117520552A (en) * 2024-01-08 2024-02-06 北京中科江南信息技术股份有限公司 Policy text processing method, device, equipment and storage medium
CN117520552B (en) * 2024-01-08 2024-04-16 北京中科江南信息技术股份有限公司 Policy text processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN116562265B (en) 2023-12-01

Similar Documents

Publication Publication Date Title
CN111428053B (en) Construction method of tax field-oriented knowledge graph
CN110825882B (en) Knowledge graph-based information system management method
CN110427623B (en) Semi-structured document knowledge extraction method and device, electronic equipment and storage medium
CN107748757B (en) Question-answering method based on knowledge graph
Navigli et al. Learning domain ontologies from document warehouses and dedicated web sites
CN116562265B (en) Information intelligent analysis method, system and storage medium
CN111767368B (en) Question-answer knowledge graph construction method based on entity link and storage medium
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
Huang et al. Expert as a service: Software expert recommendation via knowledge domain embeddings in stack overflow
CN109447266A (en) A kind of agricultural science and technology service intelligent sorting method based on big data
CN115470871B (en) Policy matching method and system based on named entity recognition and relation extraction model
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN114911945A (en) Knowledge graph-based multi-value chain data management auxiliary decision model construction method
CN113822026A (en) Multi-label entity labeling method
CN116719913A (en) Medical question-answering system based on improved named entity recognition and construction method thereof
CN113779264A (en) Trade recommendation method based on patent supply and demand knowledge graph
CN116383399A (en) Event public opinion risk prediction method and system
CN112883175A (en) Meteorological service interaction method and system combining pre-training model and template generation
CN112749283A (en) Entity relationship joint extraction method for legal field
CN116186237A (en) Entity relationship joint extraction method based on event cause and effect inference
CN117149974A (en) Knowledge graph question-answering method for sub-graph retrieval optimization
CN117034135A (en) API recommendation method based on prompt learning and double information source fusion
Barale et al. Automated refugee case analysis: An nlp pipeline for supporting legal practitioners
CN116258204A (en) Industrial safety production violation punishment management method and system based on knowledge graph
Palshikar et al. RINX: A system for information and knowledge extraction from resumes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant