CN116522233A

CN116522233A - Method and system for extracting and classifying key point review content of research document

Info

Publication number: CN116522233A
Application number: CN202310803849.1A
Authority: CN
Inventors: 李炳辉; 李晖; 许禹诺; 刘会龙; 代志强; 王登政; 刘兆燕; ***; 吕阳; 闫晓栋; 费长顺
Original assignee: Beijing Sgitg Accenture Information Technology Co ltd; State Grid Beijing Electric Power Co Ltd
Current assignee: Beijing Sgitg Accenture Information Technology Co ltd; State Grid Beijing Electric Power Co Ltd
Priority date: 2023-07-03
Filing date: 2023-07-03
Publication date: 2023-08-01

Abstract

The invention discloses a method and a system for extracting and classifying key point review content of a research document, belonging to the technical field of document content classification. A research document key point review content extraction classification model is established and trained on a research document data set, and predefined label types are preset in the model. Relying on open-source technology and the preset predefined label types, the invention quickly identifies the key point review content of research documents and replaces manual browsing and extraction of key point content. It addresses the loss of key point information, the generation of repeated substrings, and the logical confusion found in comparable techniques for similar services, and it can quickly obtain logically coherent key point information from a research document.

Description

Method and system for extracting and classifying key point review content of research document

Technical Field

The invention belongs to the technical field of document content classification, and particularly relates to a method and a system for extracting and classifying key point review content of a research document.

Background

The review content of State Grid research documents is presented mainly as phrases that vary in length, mix many kinds of characters, and appear at irregular positions. Traditionally, the key point review content of these documents is reviewed and extracted manually, which is inefficient and time-consuming. At present, this content can be extracted with regular expressions or with artificial intelligence models. Regular-expression filtering logic is complicated and hard to maintain. Artificial intelligence approaches include intelligent key point extraction based on TF-IDF and Word2Vec, extraction based on TextRank, PGN (Pointer-Generator Networks), seq2seq, and combinations of machine learning and deep learning algorithms.

The TF-IDF and TextRank approaches compute word frequencies, assign weights to key point information, and rank the weights to extract key point content; the result ignores word order, lacks clear logic, and loses a large amount of key point content. Word2Vec cannot take text as direct input: the text must first be vectorized to obtain word vectors for the key point information to be identified. Because Word2Vec was released early, its coverage of industry terms and rare corpora is incomplete, yet the key point information of State Grid research documents consists mostly of industry terms, so extraction accuracy is low and professional terms are severely lost. The PGN and seq2seq models generate too many repeated substrings, which makes the results hard to read, and because they generate new key point content rather than copying it from the original text, they introduce errors into the key point information.

The extraction result for State Grid research documents must be not only logically coherent and readable but also classified and displayed in a structured way. Traditional text summarization techniques extract key point content with low accuracy, produce repeated substrings with poor readability, and cannot classify key points directly. Entity recognition techniques can display key point content in a structured way, but the result is a set of unordered fields that exposes only part of the key point information and does not preserve the complete logic.

Disclosure of Invention

The invention aims to provide a method and a system for extracting and classifying key point review content of a research document, so as to solve the following problems in the prior art: manual extraction of key points from research documents is slow; text summarization models produce unstructured results with poor readability and serious loss of key point information; and entity extraction models produce results that are classified but lack logical order.

A method for extracting and classifying the key point review content of a research document comprises the following steps:

S1. Collect documents according to the research project settings and establish a research document data set; establish and train a research document key point review content extraction classification model on this data set, and preset predefined label types in the model;

S2. Pass the pre-review document content to the research document key point review content extraction classification model, in the form of parameters the model can read, for prediction and identification; the model classifies and identifies the content according to the predefined label types and outputs key point phrases.

Preferably, in step S1, during establishment of the research document data set, the documents of the research project are converted into text documents.

Preferably, collecting and establishing the research document data set according to the research project settings specifically comprises: applying rule-based filtering to the converted text document data using a scientific computing library, splitting the converted text at full stops, and constructing a research document data set in which each sentence ending in a full stop is one model training sample.

Preferably, in step S1, establishing and training the research document key point review content extraction classification model on the research document data set comprises: analyzing the research document data set, comparing the text documents converted from the research project document contents, and, according to how often words occur in the research project documents, selecting a preset number of the most frequent words to form a hot word list.

Preferably, in step S1, the key point content in the research document data set is classified and labeled according to the key point review requirements and the hot word list; the classified and labeled data set is then converted into the training set text format required for model training, and the corresponding label format is set.

Preferably, constructing the model training data set during establishment and training of the model specifically comprises: classifying the key point content in the research document data set into custom categories according to the key point review requirements; taking the custom categories as the predefined labels and using a labeling tool to label each classified text document in the data set with them; then converting the classified data set into samples in the training set text format as a JSON file, setting the corresponding predefined label for each sample, and using the samples as the model training data set of the research document key point review content extraction classification model.

Preferably, training the research document key point review content extraction classification model specifically comprises: constructing the model from a bidirectional long short-term memory network and a conditional random field, feeding the model training data set into the model for iterative training, and tuning parameters during training, with the F1-score as the final model evaluation index; training stops once the F1-score stabilizes, yielding the trained research document key point review content extraction classification model.

Preferably, the final model evaluation index $F_1$ is computed from the confusion matrix, with the following formula:

$$F_1 = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}$$

In the above formula, $F_1$ is the harmonic mean of precision and recall; $TP/(TP+FP)$ is the precision; $TP/(TP+FN)$ is the recall; TP: true positives; FN: false negatives; FP: false positives; TN: true negatives.

A system for extracting and classifying key point review content of a research document comprises a predefining module and an extraction classification module;

the predefining module is used for establishing and training a research document key point review content extraction classification model on the research document data set, and for presetting predefined label types in the model;

the extraction classification module is used for receiving the pre-review document content and passing it to the model, in the form of parameters the model can read, for prediction and identification; the model classifies and identifies the content according to the predefined label types and outputs key point phrases.

Preferably, the research document key point review content extraction classification model is constructed from a bidirectional long short-term memory network and a conditional random field; the model training data set is fed into the model for iterative training, and parameters are tuned during training, with the F1-score as the final model evaluation index; training stops once the F1-score stabilizes, yielding the trained research document key point review content extraction classification model.

Compared with the prior art, the invention has the following beneficial technical effects:

In the method of the invention for extracting and classifying key point review content of a research document, a research document key point review content extraction classification model is established and trained on a research document data set, and predefined label types are preset in the model. Relying on open-source technology and the preset predefined label types, the invention quickly identifies the key point review content of research documents and replaces manual browsing and extraction of key point content. It addresses the loss of key point information, the generation of repeated substrings, and the logical confusion found in comparable techniques for similar services, and it can quickly obtain logically coherent key point information from a research document.

In the system of the invention for extracting and classifying key point review content of a research document, the pre-review document content is passed to the research document key point review content extraction classification model, in the form of parameters the model can read, for prediction and identification. The research key point review information can be classified online according to the predefined labels and stored in a database in a structured form, with high accuracy and high identification speed.

Drawings

FIG. 1 is a schematic diagram of a classification model for extracting review content of a point of interest of a research document in an embodiment of the invention.

FIG. 2 is a schematic diagram of model evaluation index according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of a relationship between points according to an embodiment of the present invention.

FIG. 4 is a flowchart of a method for extracting and classifying key point review contents of a research document according to an embodiment of the invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As shown in FIG. 1, the invention provides a method for extracting and classifying key point review content of a research document. It addresses the slow manual extraction of key points from research documents, the unstructured and poorly readable results of text summarization models with their serious loss of key point information, and the classified but illogical results of entity extraction models. The invention can quickly and comprehensively extract key point content of research documents that is logically consistent and highly readable, and can classify it. The method specifically comprises the following steps:

The research project data set contains documents in two formats, PDF and Word, and these documents are converted into text documents. In this application, the documents set for a State Grid research project are: reporting, accounting, wholesale, review, drawings, multi-rule unification, contract, settlement, resolution, completion acceptance, stand approval, and transfer documents; all of these are converted into text documents.

Rule-based filtering is applied to the converted text document data using a scientific computing library, the converted text is split at full stops, and a research document data set is constructed in which each sentence ending in a full stop is one model training sample.
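The filtering and splitting step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the filtering rule (dropping page-number lines) and the function name `split_into_samples` are assumptions, and the standard `re` module stands in for the unspecified scientific computing library.

```python
import re
from typing import List

# Hypothetical filtering rule: lines that hold only a page number are noise.
PAGE_NUMBER = re.compile(r"^\s*(第\s*\d+\s*页|\d+)\s*$")


def split_into_samples(text: str) -> List[str]:
    """Filter converted text by simple rules, then split it at full stops."""
    kept = [ln for ln in text.splitlines()
            if ln.strip() and not PAGE_NUMBER.match(ln)]
    # Remove layout whitespace left over from the PDF/Word conversion.
    joined = re.sub(r"\s+", "", "".join(kept))
    # Split after each Chinese full stop, keeping the stop with its sentence.
    return [s for s in re.split(r"(?<=。)", joined) if s]


samples = split_into_samples("新建10千伏线路2公里。\n12\n利旧原有开关柜3面。扩建间隔1个")
```

Each returned string is one training sample; a trailing fragment without a full stop is kept as its own sample in this sketch.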

A hot word list is constructed from the research document data set as follows: the data set is analyzed and the text documents converted from the research project document contents are compared; some words recur at irregular positions throughout the research project documents, and, according to how often they occur, a preset number of the most frequent words are selected to form the hot word list. Combining the words in the hot word list with the classification and identification results reproduces the original text logic more clearly and expresses the key point content accurately. Examples of high-frequency words in State Grid research documents are: new creation, old utilization, expansion, original, existing, and total.
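As a rough sketch of the hot word selection, the following counts word occurrences across already tokenized documents and keeps the most frequent ones. The function name `build_hot_word_list`, the `top_k` parameter (standing in for the preset number of words), and the sample tokens are all illustrative assumptions; how the documents are tokenized is not specified in the text.

```python
from collections import Counter
from typing import Iterable, List


def build_hot_word_list(tokenized_docs: Iterable[List[str]], top_k: int) -> List[str]:
    """Return the top_k most frequent words across all documents."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [word for word, _ in counts.most_common(top_k)]


docs = [
    ["new", "line", "existing"],
    ["new", "existing", "new"],
    ["total"],
]
hot_words = build_hot_word_list(docs, top_k=2)
```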

S2. The pre-review document content is passed to the research document key point review content extraction classification model, in the form of parameters the model can read, for prediction and identification. The model classifies and identifies the content according to the predefined label types and outputs key point phrases, so that key point content of different categories can be read directly in different scenarios.

The key point content in the research document data set is classified and labeled according to the key point review requirements and the hot word list; the classified and labeled data set is then converted into the training set text format required for model training, and the corresponding label format is set. The specific steps are as follows:

The key point content in the research document data set is classified into custom categories according to the key point review requirements. The custom categories are taken as the predefined labels, and a labeling tool is used to label each classified text document in the data set with them. The classified data set is then converted into samples in the training set text format as a JSON file, the corresponding predefined label is set for each sample, and the samples are used as the model training data set of the research document key point review content extraction classification model.

Specifically, label-studio is used to annotate the classified research document data set with the corresponding predefined labels, forming the model training data set.
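One way to picture the conversion from labeled spans to JSON training samples is the sketch below. The record layout (`text` plus `spans` with `start`, `end`, `label`) is a simplified, hypothetical stand-in rather than the actual label-studio export schema, and character-level BIO tagging is assumed because the model outputs described for FIG. 1 are B-/I-/O tags.

```python
import json
from typing import Dict, List


def spans_to_bio(text: str, spans: List[Dict]) -> List[str]:
    """Turn labeled character spans into one BIO tag per character."""
    tags = ["O"] * len(text)
    for span in spans:
        start, end, label = span["start"], span["end"], span["label"]
        tags[start] = f"B-{label}"          # first character of the phrase
        for i in range(start + 1, end):     # remaining characters, end exclusive
            tags[i] = f"I-{label}"
    return tags


def to_training_sample(record: Dict) -> str:
    """Serialize one annotated sentence as a JSON training sample."""
    tags = spans_to_bio(record["text"], record["spans"])
    return json.dumps({"text": list(record["text"]), "labels": tags},
                      ensure_ascii=False)


record = {"text": "新建线路2公里", "spans": [{"start": 0, "end": 4, "label": "LIN"}]}
sample = to_training_sample(record)
```

Writing one such JSON string per line gives a training file in which every sample carries its own label sequence.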

The research document key point review content extraction classification model is constructed from a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF). The network architecture is shown in FIG. 1, where W is the input and B-SUB, I-SUB, O, B-LIN, and I-LIN are the outputs.
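To illustrate what the CRF layer contributes on top of per-character scores, here is a small Viterbi decoding sketch over the five output tags. It is an assumption-laden simplification: the `emissions` dictionaries stand in for BiLSTM outputs, and the learned CRF transition scores are replaced by a hard BIO rule (an I- tag may only follow a B- or I- tag of the same type).

```python
from typing import Dict, List

TAGS = ["B-SUB", "I-SUB", "O", "B-LIN", "I-LIN"]


def allowed(prev: str, curr: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X; other tags are unrestricted."""
    if curr.startswith("I-"):
        return prev in (f"B-{curr[2:]}", f"I-{curr[2:]}")
    return True


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the best-scoring tag path that respects the BIO rule."""
    neg_inf = float("-inf")
    # A sequence cannot start with an I- tag.
    score = {t: (neg_inf if t.startswith("I-") else emissions[0][t]) for t in TAGS}
    back: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score, pointers = {}, {}
        for curr in TAGS:
            best_score, best_prev = max(
                (score[p], p) for p in TAGS if allowed(p, curr))
            new_score[curr] = best_score + emit[curr]
            pointers[curr] = best_prev
        score = new_score
        back.append(pointers)
    path = [max(score, key=score.get)]
    for pointers in reversed(back):   # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return path[::-1]


base = {t: 0.0 for t in TAGS}
emissions = [
    {**base, "B-SUB": 2.0, "O": 1.0, "B-LIN": -5.0},
    {**base, "I-SUB": 0.5, "O": 0.2, "I-LIN": 3.0},
    {**base, "O": 1.0},
]
best_path = viterbi(emissions)
```

In the example, the second character scores highest for I-LIN on its own, but the path constraint steers decoding to a consistent SUB phrase instead.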

The model training data set is fed into the research document key point review content extraction classification model for iterative training, and parameters are tuned during training, with the F1-score as the final model evaluation index; training stops once the F1-score stabilizes, yielding the trained research document key point review content extraction classification model.

The criterion for stopping model training is as follows:

First, an initial model training evaluation index curve is obtained for a set number of training epochs (epoch=20 in this application). The iterative progress of the model is judged from the F1-score in this curve: after the F1-score reaches its extremum, the curve fluctuates periodically within the range 0 to 1, at which point the F1-score is considered to have stabilized. Among the first five training results after stabilization, the model with the highest robustness is selected as the final research document key point review content extraction classification model.

For example, if the F1-score in the initial model training evaluation index curve reaches its extremum at epoch=5, training is run with each of the three settings epoch=4, epoch=5, and epoch=6 to obtain three models. The three models are each tested on the test set and their robustness is compared, and the most robust model is taken as the final research document key point review content extraction classification model.
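The epoch selection in this example can be sketched as below. The helper names are hypothetical, and using a single test-set score per candidate epoch as the measure of "robustness" is an assumption; the text does not say how robustness is quantified.

```python
from typing import Callable, List


def candidate_epochs(f1_curve: List[float]) -> List[int]:
    """Return the 1-based epoch where F1 peaks, plus its two neighbours."""
    peak = max(range(len(f1_curve)), key=lambda i: f1_curve[i]) + 1
    return [e for e in (peak - 1, peak, peak + 1) if 1 <= e <= len(f1_curve)]


def pick_final_epoch(f1_curve: List[float],
                     test_score: Callable[[int], float]) -> int:
    """Among the candidate epochs, keep the one whose model tests best."""
    return max(candidate_epochs(f1_curve), key=test_score)


# Illustrative curve only: F1 per epoch from an initial training run.
curve = [0.30, 0.55, 0.70, 0.82, 0.91, 0.88, 0.90, 0.87]
# Illustrative test-set scores for the models retrained at each candidate epoch.
test_scores = {4: 0.84, 5: 0.86, 6: 0.89}
final_epoch = pick_final_epoch(curve, test_scores.get)
```

In practice `test_score` would retrain the model with the given epoch count and evaluate it on the held-out test set.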

The final model evaluation index F1-score is shown in FIG. 2, wherein the dashed line represents the validation set evaluation index graph and the solid line represents the training set evaluation index graph.

Specifically, $F_1$ is computed from the confusion matrix, with the following formula:

$$F_1 = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}$$

In the above formula, $F_1$ is the harmonic mean of precision and recall; $TP/(TP+FP)$ is the precision; $TP/(TP+FN)$ is the recall; TP: true positives (the true result is positive and the predicted result is positive); FN: false negatives (the true result is positive and the predicted result is negative); FP: false positives (the true result is negative and the predicted result is positive); TN: true negatives (the true result is negative and the predicted result is negative).
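As a worked illustration of this evaluation index, the following computes precision, recall, and their harmonic mean from confusion matrix counts. The function name and the example counts are illustrative only, and returning 0.0 when there are no true positives is a convention chosen here to avoid division by zero.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1-score from true positives, false positives, and false negatives."""
    if tp == 0:
        return 0.0  # convention: no correct positive predictions means F1 = 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Example: 6 correct phrases, 2 spurious, 6 missed.
# precision = 6/8 = 0.75, recall = 6/12 = 0.5, F1 = 0.6
score = f1_from_counts(tp=6, fp=2, fn=6)
```

TN does not enter the formula, which is why the function takes only three counts.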

The confusion matrix is shown in table 1 below:

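TABLE 1 Confusion matrix

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | TP | FN |
| Actual negative | FP | TN |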

The pre-review document content is passed to the research document key point review content extraction classification model in the form of parameters the model can read. The model classifies and identifies the content according to the custom-classified research documents and their corresponding custom labels and outputs key point phrases, so that key point content of different categories can be read directly in different scenarios.
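Turning per-character tag predictions into key point phrases grouped by label might look like the sketch below. It assumes character-level BIO tags such as those listed for FIG. 1; the function name `extract_phrases` and the sample sentence are hypothetical.

```python
from typing import Dict, List


def extract_phrases(chars: List[str], tags: List[str]) -> Dict[str, List[str]]:
    """Group consecutive B-X / I-X characters into phrases, keyed by label X."""
    phrases: Dict[str, List[str]] = {}
    current: List[str] = []
    label = None
    # A trailing sentinel "O" tag closes a phrase that runs to the end.
    for ch, tag in zip(list(chars) + [""], list(tags) + ["O"]):
        continues = tag.startswith("I-") and tag[2:] == label
        if current and not continues:
            phrases.setdefault(label, []).append("".join(current))
            current, label = [], None
        if tag.startswith("B-"):
            current, label = [ch], tag[2:]   # start a new phrase
        elif continues:
            current.append(ch)               # extend the open phrase
    return phrases


chars = list("新建线路和变电站")
tags = ["B-LIN", "I-LIN", "I-LIN", "I-LIN", "O", "B-SUB", "I-SUB", "I-SUB"]
by_label = extract_phrases(chars, tags)
```

Reading `by_label["LIN"]` or `by_label["SUB"]` then gives the phrases of one category directly.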

The research document key point review content extraction classification model builds a triple relation structure from each output classification result, its label, and the text name, and stores it; the stored classification result relations are shown in FIG. 3.
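A minimal sketch of such a triple structure follows. The class and field names and the in-memory index are assumptions made for illustration; the text does not specify the storage backend or schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class KeyPointTriple:
    result: str      # extracted key point phrase (classification result)
    label: str       # predefined label type
    text_name: str   # name of the source document


def index_by_document(triples: Iterable[KeyPointTriple]) -> Dict[str, Dict[str, List[str]]]:
    """Group stored triples as document name -> label -> phrases."""
    index = defaultdict(lambda: defaultdict(list))
    for t in triples:
        index[t.text_name][t.label].append(t.result)
    return {name: dict(labels) for name, labels in index.items()}


triples = [
    KeyPointTriple("新建线路2公里", "LIN", "report.txt"),
    KeyPointTriple("扩建间隔1个", "SUB", "report.txt"),
]
index = index_by_document(triples)
```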

The research document key point review content extraction classification model merges the contents output under the different labels and builds a local key point document through logical deduplication. Word segmentation is then applied with this local key point document loaded: the pre-review document content is segmented in reverse, and all non-key-point content in it is deleted. In this way the key point contents identified by the model, which are classified but lack logical order, are restored to the logic of the original text, finally forming the research key point review content.
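The restoration step can be pictured with the sketch below, which reads "segmented in reverse" as backward maximum matching against the set of extracted key point phrases. That reading, the function names, and the sample sentence are assumptions; the text does not name the word segmentation tool or algorithm.

```python
from typing import List, Set


def backward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Segment text from right to left, preferring the longest dictionary entry."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens: List[str] = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # Fall back to a single character when nothing in the dictionary fits.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    return tokens[::-1]


def keep_key_points(text: str, dictionary: Set[str]) -> List[str]:
    """Drop every segment that is not a key point, keeping original order."""
    return [tok for tok in backward_max_match(text, dictionary) if tok in dictionary]


key_points = {"新建线路2公里", "利旧开关柜"}
restored = keep_key_points("本期新建线路2公里并利旧开关柜", key_points)
```

Because the segments are filtered in document order, the surviving phrases follow the sequence of the original text rather than the order in which the model emitted them.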

Relying on open-source technology, the invention intelligently identifies the key point review content of research documents and replaces manual review and extraction of key points. It addresses the loss of key point information, the generation of repeated substrings, and the logical confusion found in comparable techniques for similar services, and it can quickly obtain logically coherent key point information from a research document. The research key point review information can be classified online according to the predefined labels and stored in a database in a structured form.

Claims

1. The method for extracting and classifying the key point review content of the research document is characterized by comprising the following steps of:

2. The method for extracting and classifying key point review content of a research document according to claim 1, characterized in that, in step S1, during establishment of the research document data set, the documents of the research project are converted into text documents.

3. The method for extracting and classifying key point review content of a research document according to claim 2, characterized in that collecting and establishing the research document data set according to the research project settings specifically comprises: applying rule-based filtering to the converted text document data using a scientific computing library, splitting the converted text at full stops, and constructing a research document data set in which each sentence ending in a full stop is one model training sample.

4. The method for extracting and classifying key point review content of a research document according to claim 1, characterized in that, in step S1, establishing and training the research document key point review content extraction classification model on the research document data set comprises: analyzing the research document data set, comparing the text documents converted from the research project document contents, and, according to how often irregularly occurring words appear in the research project documents, selecting a preset number of the most frequent words to form a hot word list.

5. The method for extracting and classifying key point review content of a research document according to claim 4, characterized in that, in step S1, the key point content in the research document data set is classified and labeled according to the key point review requirements and the hot word list; the classified and labeled data set is converted into the training set text format required for model training, and the corresponding label format is set.

6. The method for extracting and classifying key point review content of a research document according to claim 5, characterized in that constructing the model training data set during establishment and training of the research document key point review content extraction classification model specifically comprises: classifying the key point content in the research document data set into custom categories according to the key point review requirements; taking the custom categories as the predefined labels and using a labeling tool to label each classified text document in the data set with them; then converting the classified data set into samples in the training set text format as a JSON file, setting the corresponding predefined label for each sample, and using the samples as the model training data set of the research document key point review content extraction classification model.

7. The method for extracting and classifying key point review content of a research document according to claim 1, characterized in that training the research document key point review content extraction classification model specifically comprises: constructing the model from a bidirectional long short-term memory network and a conditional random field, feeding the model training data set into the model for iterative training, and tuning parameters during training, with the F1-score as the final model evaluation index; training stops once the F1-score stabilizes, yielding the trained research document key point review content extraction classification model.

8. The method for extracting and classifying key point review content of a research document according to claim 7, characterized in that the final model evaluation index $F_1$ is computed from the confusion matrix, with the following formula:

$$F_1 = \frac{2 \cdot \frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}{\frac{TP}{TP+FP} + \frac{TP}{TP+FN}}$$

9. The system for extracting and classifying the key point review content of the research document is characterized by comprising a predefined module and an extracting and classifying module;

10. The system for extracting and classifying key point review content of a research document according to claim 9, characterized in that the research document key point review content extraction classification model is constructed from a bidirectional long short-term memory network and a conditional random field; the model training data set is fed into the model for iterative training, and parameters are tuned during training, with the F1-score as the final model evaluation index; training stops once the F1-score stabilizes, yielding the trained research document key point review content extraction classification model.