CN112541339A

CN112541339A - Knowledge extraction method based on random forest and sequence labeling model

Info

Publication number: CN112541339A
Application number: CN202011364225.7A
Authority: CN
Inventors: 柳先辉; 周珮; 陈宇飞; 赵卫东
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2020-08-20
Filing date: 2020-11-29
Publication date: 2021-03-23

Abstract

The invention discloses a knowledge extraction method based on a random forest and a sequence labeling model, and in particular a joint entity-relation extraction method based on a random forest and BiLSTM-CRF. First, unstructured text is acquired, preprocessed, and represented as sentence vectors. The sentence sequence is then fed to a sentence selector, which screens out high-quality sentences. The selected sentences are input to a BiLSTM-CRF sequence labeling model for training, and the trained model finally performs sentence-level sequence labeling on input sentences. By combining a random-forest sentence selector with the BiLSTM-CRF sequence labeling model, the invention extracts knowledge from unstructured text and turns it into structured information. The extraction method greatly improves the efficiency of extracting unstructured information, enriches existing knowledge graph resources, and thereby better serves various intelligent applications.

Description

Knowledge extraction method based on random forest and sequence labeling model

Technical Field

The invention belongs to the technical field of knowledge extraction, and particularly relates to a knowledge extraction method based on a random forest and a BiLSTM-CRF sequence labeling model.

Background

With the development of networks and computers, information resources are updated rapidly and in huge quantities, and they contain abundant usable knowledge of high research value. Given such large data volumes and the low information density of these resources, knowledge extraction is of great research significance. Most networked, digitized information resources exist in free-form, semi-structured, or unstructured form; the volume of information is large and complex, and it is updated in real time. Knowledge extraction uses related techniques and methods to extract the knowledge a user needs from this information, so that information resources can be used effectively.

Entity and relation extraction is an important step in knowledge graph construction and lays the foundation for building the graph. Traditional knowledge extraction methods first extract entities and then identify the relations between them. In such a pipeline, the entity recognition results strongly affect relation classification, so errors propagate from one stage to the next.

Disclosure of Invention

Unlike the traditional approach, knowledge extraction based on a random forest and a sequence labeling model combines entity recognition and relation extraction. It defines labels that carry relation information, trains a joint entity-relation extraction model on screened high-quality sentences, and uses the sequence labeling model to extract entities and their relations directly. Entity information and relation information are thereby integrated, which helps ensure accurate and efficient knowledge acquisition.
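As an illustration of labels that carry relation information, the following minimal sketch decodes a tagged sentence into (entity, relation, entity) triples. The tag format `POSITION-RELATION-ROLE` (with positions B/I/E/S, role 1 for the first entity and role 2 for the second), the relation name `LocatedIn`, and the function `decode_triples` are all assumptions made for this example; the patent text does not specify a concrete tag scheme.

```python
from typing import Dict, List, Tuple


def decode_triples(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, str]]:
    """Turn joint tags into (head, relation, tail) triples.

    Assumed tag format: "O" or "POSITION-RELATION-ROLE", e.g. "B-LocatedIn-1".
    Relation names must not contain "-" in this simplified sketch.
    """
    spans: Dict[Tuple[str, str], List[str]] = {}  # (relation, role) -> entity strings
    current: List[str] = []
    key = None

    def flush() -> None:
        # Store the entity collected so far, if any.
        if current:
            spans.setdefault(key, []).append(" ".join(current))

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            current, key = [], None
            continue
        position, relation, role = tag.split("-")
        if position in ("B", "S") or (relation, role) != key:
            flush()
            current, key = [], (relation, role)
        current.append(token)
    flush()

    # Pair the i-th role-1 entity with the i-th role-2 entity of the same relation.
    triples = []
    for relation in sorted({rel for rel, _ in spans}):
        heads = spans.get((relation, "1"), [])
        tails = spans.get((relation, "2"), [])
        triples.extend((h, relation, t) for h, t in zip(heads, tails))
    return triples


tokens = ["Tongji", "University", "is", "located", "in", "Shanghai"]
tags = ["B-LocatedIn-1", "E-LocatedIn-1", "O", "O", "O", "S-LocatedIn-2"]
print(decode_triples(tokens, tags))
```

Because each tag already names the relation and the role of the entity in it, no separate relation classifier runs after entity recognition, which is the point of the joint formulation.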

The invention aims to overcome the shortcomings of traditional knowledge extraction on unstructured text and to improve the accuracy and efficiency of knowledge acquisition. It discloses a random-forest-based knowledge extraction method that is combined with the BiLSTM-CRF sequence labeling model to obtain better extraction results, which helps enrich existing knowledge graph resources and provide better services for various intelligent applications.

In order to solve the technical problems, the technical scheme of the invention is as follows:

A knowledge extraction method based on a random forest and a sequence labeling model, characterized by comprising the following specific steps:

Step 1, inputting the text to be analyzed: the target of knowledge extraction is unstructured text;

Step 2, preprocessing the text to be analyzed: because the acquired information resources are huge, they are prone to contain various errors, and such errors can sometimes seriously affect the results, so preprocessing such as denoising is generally applied to the text to be analyzed;

Step 3, vectorizing words: receiving the preprocessed text to be analyzed and mapping the words of a window to vectors, where each word of the input window is mapped to a distributed vector x_i ∈ R^d using a trained word vector matrix, and d is the dimension of the vector;

Step 4, matrix representation of sentences: generating the matrix representation of a sentence using the trained word vector layer, that is, obtaining the word vector sequence (x₁, x₂, …, x_n);

Step 5, sentence selector: the sentences are divided into different sets according to the entities they contain. The purpose of the sentence selector is to select high-quality sentences without label noise from each entity set, and a random forest model decides which action to take in each state: selecting or not selecting the current sentence as a training sentence for the sequence labeling model (1 denotes selection and 0 denotes non-selection). In other words, the sentences are classified by the random forest, and the random forest model is itself trained on manually labeled sentences (sentences manually labeled 1 or 0);

Step 6, sequence labeling model: the sentences classified as 1 by the sentence selector in step 5, that is, the selected high-quality sentences without label noise, are input to the BiLSTM-CRF sequence labeling model, and the sequence labeling model is trained on them;

Step 7, after the training stage in steps 5 and 6 is completed, a sentence sequence is input to the trained sequence labeling model to obtain the labeling result for each sentence.
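The preprocessing, word vectorization, and sentence matrix steps listed above can be sketched as follows. This is a minimal illustration, not the patented implementation: the denoising rules, the whitespace tokenizer (which suits English rather than Chinese text), the toy table `WORD_VECTORS`, and the dimension d = 4 are all hypothetical stand-ins for a real trained word vector matrix.

```python
import re
from typing import Dict, List

EMBEDDING_DIM = 4  # d, kept tiny for illustration

# Hypothetical "trained" word vector matrix; real values would come from a
# trained embedding model, not from hand-written numbers.
WORD_VECTORS: Dict[str, List[float]] = {
    "knowledge": [0.1, 0.3, -0.2, 0.5],
    "extraction": [0.4, -0.1, 0.2, 0.0],
    "graph": [-0.3, 0.2, 0.1, 0.6],
}
UNKNOWN = [0.0] * EMBEDDING_DIM  # vector used for out-of-vocabulary words


def preprocess(text: str) -> List[str]:
    """Assumed form of step 2: strip markup-like noise, lowercase, tokenize."""
    text = re.sub(r"<[^>]+>", " ", text)         # drop HTML-like tags
    text = re.sub(r"[^0-9A-Za-z\s]", " ", text)  # drop stray symbols
    return text.lower().split()


def sentence_matrix(tokens: List[str]) -> List[List[float]]:
    """Steps 3 and 4: map each word to x_i in R^d and stack into an n x d matrix."""
    return [list(WORD_VECTORS.get(token, UNKNOWN)) for token in tokens]


tokens = preprocess("<p>Knowledge   extraction!!</p>")
matrix = sentence_matrix(tokens)
print(tokens)
print(len(matrix), "x", len(matrix[0]))
```

Each row of `matrix` corresponds to one x_i, so a sentence of n words yields the sequence (x₁, x₂, …, x_n) as an n × d matrix.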

The invention has the beneficial effects that:

The invention relates to a knowledge extraction technique based on a random forest and the BiLSTM-CRF model, suited to scientific and technological resource service platforms. Taking into account the classification and characteristics of scientific and technological resources in a service platform environment, it provides a knowledge extraction scheme based on a random forest and the BiLSTM-CRF model. The scheme consists of a sentence selector and a sequence labeling model and predicts sentence-level sequence labels. Through input and preprocessing of unstructured text, word vectorization, sentence matrix representation, sentence selection, and sentence-level sequence labeling, it performs joint entity-relation knowledge extraction on resources in the scientific and technological service field. This supports efficient organization and management of those resources, as well as resource query, management, selection, and aggregation.
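To make sentence-level label prediction concrete, the sketch below runs Viterbi decoding, the standard inference procedure for a linear-chain CRF layer. The patent does not describe the decoding step, so this is an assumption about how the CRF output would typically be computed. The emission scores stand in for BiLSTM outputs, and all numbers, the label set B/I/O, and the function name `viterbi` are hypothetical.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the label sequence with the highest total emission + transition score."""
    labels = list(emissions[0])
    score = dict(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for step in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at this position.
            prev = max(labels, key=lambda p: score[p] + transitions[p][cur])
            new_score[cur] = score[prev] + transitions[prev][cur] + step[cur]
            pointers[cur] = prev
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: score[label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical per-token scores, as a BiLSTM might produce for three tokens.
emissions = [
    {"B": 2.0, "I": 0.5, "O": 1.0},
    {"B": 0.5, "I": 1.0, "O": 1.2},
    {"B": 0.1, "I": 0.2, "O": 2.0},
]
# Hypothetical transition scores: O -> I is heavily penalized, B -> I is favored.
transitions = {
    "B": {"B": -2.0, "I": 1.0, "O": 0.0},
    "I": {"B": 0.0, "I": 0.5, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.0},
}
print(viterbi(emissions, transitions))  # ['B', 'I', 'O']
```

Choosing the best label per token independently would give B, O, O here; the transition scores shift the second token to I, which shows why the labels are predicted for the sentence as a whole rather than token by token.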

Drawings

FIG. 1 is a diagram of the BiLSTM-CRF model architecture.

FIG. 2 is a flow chart of knowledge extraction based on a random forest and BiLSTM-CRF.

Detailed Description

The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.

As shown in FIG. 2, the knowledge extraction method based on a random forest and BiLSTM-CRF specifically includes the following steps:

1. Inputting the text to be analyzed: knowledge extraction is mainly performed on unstructured text.

2. Preprocessing the text to be analyzed: because the acquired information resources are huge, they are prone to contain various errors that can seriously affect the results, so preprocessing such as denoising and encoding is usually applied to the text to be analyzed.

3. Vectorizing words: receiving the preprocessed text to be analyzed and mapping the words of a window to vectors, where each word of the input window is mapped to a distributed vector x_i ∈ R^d using a trained word vector matrix, and d is the dimension of the vector.

4. Matrix representation of sentences: generating the matrix representation of a sentence using the trained word vector layer, that is, obtaining the word vector sequence (x₁, x₂, …, x_n).

5. Sentence selection: first, the sentences are divided into different sets according to the entities they contain, so that each set corresponds to one entity. The sentence selector aims to select high-quality sentences without label noise from each entity set, and a random forest model decides which action to take in each state: selecting or not selecting the current sentence as a training sentence for the sequence labeling model (1 denotes selection and 0 denotes non-selection). In other words, the sentences are classified by the random forest. The random forest model is itself trained on manually labeled sentences (sentences manually labeled 1 or 0).

6. Sequence labeling model: the sentences classified as 1 by the sentence selector in the previous step, that is, the selected high-quality sentences without label noise, are input to the BiLSTM-CRF sequence labeling model, and the model is trained on them.

7. After the training stage of the model in steps 5 and 6 is completed, a sentence sequence is input to the trained sequence labeling model to obtain the labeling result for each sentence.
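The sentence selector described in the steps above can be sketched with a very small random forest. To stay dependency-free, this example builds a forest of depth-one trees (decision stumps) with bootstrap sampling and a random feature per tree; a practical system would more likely use a library implementation with deeper trees. The two-dimensional sentence features (for instance, mean-pooled word vectors), the toy data, and all function names are hypothetical, since the patent does not state which features the forest consumes.

```python
import random
from collections import Counter
from typing import List, Tuple

Stump = Tuple[int, float, int, int]  # (feature, threshold, label if <=, label if >)


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X: List[List[float]], y: List[int], feature: int) -> Stump:
    """Pick the threshold on one feature that minimizes weighted Gini impurity."""
    majority = Counter(y).most_common(1)[0][0]
    best: Stump = (feature, float("inf"), majority, majority)  # constant fallback
    best_score = gini(y)
    values = sorted({row[feature] for row in X})
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [lab for row, lab in zip(X, y) if row[feature] <= threshold]
        right = [lab for row, lab in zip(X, y) if row[feature] > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_score = score
            best = (feature, threshold,
                    Counter(left).most_common(1)[0][0],
                    Counter(right).most_common(1)[0][0])
    return best


def fit_forest(X: List[List[float]], y: List[int],
               n_trees: int = 25, seed: int = 0) -> List[Stump]:
    """Train each stump on a bootstrap sample with one randomly chosen feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]  # bootstrap sample
        feature = rng.randrange(len(X[0]))        # random feature for this tree
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def predict(forest: List[Stump], row: List[float]) -> int:
    """Majority vote over all stumps: 1 = select the sentence, 0 = discard it."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]


# Manually labeled training sentences, represented by hypothetical features.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.95], [0.85, 0.7],
     [0.1, 0.2], [0.2, 0.1], [0.3, 0.15], [0.05, 0.3]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = fit_forest(X, y)

candidates = {"sentence A": [0.75, 0.85], "sentence B": [0.15, 0.25]}
selected = [name for name, feats in candidates.items() if predict(forest, feats) == 1]
print(selected)
```

Only the sentences in `selected` would be passed on as training input to the sequence labeling model; the rest are treated as noisy and dropped.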

Claims

1. A knowledge extraction method based on random forests and sequence labeling models is characterized by comprising the following specific steps: