CN112541339A - Knowledge extraction method based on random forest and sequence labeling model - Google Patents
Knowledge extraction method based on random forest and sequence labeling model
- Publication number
- CN112541339A
- Authority
- CN
- China
- Prior art keywords
- sentence
- sentences
- model
- sequence
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Animal Behavior & Ethology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a knowledge extraction method based on a random forest and a sequence labeling model, and in particular a joint entity-relationship extraction method based on a random forest and BiLSTM-CRF. First, unstructured text is acquired, preprocessed, and represented as sentence vectors. The sentence sequence is then fed to a sentence selector that screens out high-quality sentences, and the selected sentences are used to train a BiLSTM-CRF sequence labeling model. Finally, the trained model performs sentence-level sequence labeling on input sentences. Using a random forest for sentence selection and the BiLSTM-CRF model for labeling, the method extracts knowledge from unstructured text and turns it into structured information. This improves the efficiency of extracting unstructured information, enriches existing knowledge graph resources, and thereby better serves various intelligent applications.
Description
Technical Field
The invention belongs to the technical field of knowledge extraction, and in particular relates to a knowledge extraction method based on a random forest and a BiLSTM-CRF sequence labeling model.
Background
With the development of networks and computers, information resources are updated quickly and in huge quantities, and they contain abundant usable knowledge of high research value. Given such large volumes of data with a low density of useful information, knowledge extraction is of great research significance. Most networked and digitized information resources exist in free, semi-structured, or unstructured form; the information is voluminous, complex, and updated in real time. Knowledge extraction uses related techniques and methods to extract the knowledge a user needs from this information, so that information resources can be used effectively.
Entity and relationship extraction is an important step in constructing a knowledge graph and lays the foundation for building it. Traditional knowledge extraction methods first extract entities and then identify the relationships between them. As a result, errors in entity recognition seriously affect relationship classification, that is, errors propagate from one stage to the next.
Disclosure of Invention
Unlike the traditional method, knowledge extraction based on a random forest and a sequence labeling model combines entity recognition and relationship extraction. It defines labels that contain relationship information, trains a joint entity-relationship extraction model on screened high-quality sentences, and uses the sequence labeling model to extract entities and their relationships directly. Entity information and relationship information are thereby integrated, which supports the accuracy and efficiency of knowledge acquisition.
The invention aims to overcome the shortcomings of traditional knowledge extraction on unstructured text and to improve the accuracy and efficiency of knowledge acquisition. It discloses a random-forest-based knowledge extraction method combined with the BiLSTM-CRF sequence labeling model to obtain better extraction results, which helps enrich existing knowledge graph resources and in turn provide better services for various intelligent applications.
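The text does not spell out the format of the labels that carry relationship information. As a minimal sketch, assume each token tag is either `O` or `<position>-<relation>-<role>`, where the position is `B` or `I`, the relation name is hypothetical (here `LocatedIn`), and the role is `1` for the first entity and `2` for the second. Under that assumed scheme, a predicted tag sequence can be decoded into (entity, relation, entity) triples as follows:

```python
from collections import defaultdict
from typing import List, Tuple

Span = Tuple[str, str, str]  # (relation, role, entity text)


def extract_spans(tokens: List[str], tags: List[str]) -> List[Span]:
    """Group consecutive tagged tokens into entity spans (assumed B/I scheme)."""
    spans: List[Span] = []
    current = None  # (relation, role, [words]) for the span being built

    def flush() -> None:
        nonlocal current
        if current is not None:
            relation, role, words = current
            spans.append((relation, role, " ".join(words)))
            current = None

    for token, tag in zip(tokens, tags):
        if tag == "O":
            flush()
            continue
        position, relation, role = tag.split("-")
        continues = (
            position == "I"
            and current is not None
            and current[0] == relation
            and current[1] == role
        )
        if continues:
            current[2].append(token)
        else:
            flush()
            current = (relation, role, [token])
    flush()
    return spans


def spans_to_triples(spans: List[Span]) -> List[Tuple[str, str, str]]:
    """Pair role-1 and role-2 spans of the same relation, in order of appearance.

    Pairing by order is a simplification chosen for this sketch.
    """
    heads, tails = defaultdict(list), defaultdict(list)
    for relation, role, text in spans:
        (heads if role == "1" else tails)[relation].append(text)
    triples = []
    for relation, head_list in heads.items():
        for head, tail in zip(head_list, tails[relation]):
            triples.append((head, relation, tail))
    return triples


tokens = ["Tsinghua", "University", "is", "located", "in", "Beijing"]
tags = ["B-LocatedIn-1", "I-LocatedIn-1", "O", "O", "O", "B-LocatedIn-2"]
print(spans_to_triples(extract_spans(tokens, tags)))
```

Because the relation type and the entity role are part of each tag, a single sequence labeling pass yields both the entities and their relationship, which is the point of the joint formulation.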
In order to solve the technical problems, the technical scheme of the invention is as follows:
A knowledge extraction method based on a random forest and a sequence labeling model, characterized by comprising the following specific steps:
Step 1, inputting a text to be analyzed: the target of knowledge extraction is unstructured text;
Step 2, preprocessing the text to be analyzed: because the acquired information resources are huge, various errors easily appear and can sometimes seriously affect the results, so preprocessing such as denoising is generally carried out on the text to be analyzed;
Step 3, vectorization of words: receiving the preprocessed text to be analyzed and vectorizing the words in a window, mapping each word of the input window to a distributed vector $x_i \in \mathbb{R}^d$ by using a trained word vector matrix, where $d$ is the dimension of the vector;
Step 4, matrix representation of sentences: generating the matrix representation of a sentence by using the trained word vector layer, namely obtaining a word vector sequence $(x_1, x_2, \ldots, x_n)$;
Step 5, sentence selector: the sentences are divided into different sets according to different entities; the purpose of the sentence selector is to select high-quality sentences without label noise from the entity sets, and the random forest model decides which operation to execute in each state: selecting or not selecting the current sentence as a training sentence of the sequence labeling model (1 represents selection, 0 represents non-selection), that is, classifying the sentences with a random forest; the random forest model itself needs manually labeled sentences (manually labeled 1 or 0) as training sentences;
Step 6, sequence labeling model: taking the sentences classified as 1 by the sentence selector in step 5, namely the selected high-quality sentences without label noise, as the input of the BiLSTM-CRF sequence labeling model, and training the sequence labeling model;
Step 7, after the training stage of the model is completed in steps 5 and 6, inputting a sentence sequence to the trained sequence labeling model to obtain the labeling result of the sentences.
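Steps 3 and 4 can be illustrated with a small lookup sketch. The word-vector table, its dimension, the zero vector for unknown words, and the mean pooling into a fixed-length sentence vector are all assumptions made for illustration; the text only states that a trained word vector matrix maps each word to a vector of dimension d.

```python
from typing import Dict, List

Vector = List[float]

EMBEDDING_DIM = 3  # d, the dimension of each word vector (toy value)

# Hypothetical stand-in for a trained word vector matrix.
WORD_VECTORS: Dict[str, Vector] = {
    "knowledge": [0.2, 0.1, 0.5],
    "graph": [0.4, 0.3, 0.1],
    "extraction": [0.0, 0.6, 0.2],
}
UNKNOWN: Vector = [0.0] * EMBEDDING_DIM  # assumed fallback for unseen words


def sentence_to_matrix(words: List[str]) -> List[Vector]:
    """Step 3 and step 4: map each word to x_i and stack them as (x_1, ..., x_n)."""
    return [list(WORD_VECTORS.get(word.lower(), UNKNOWN)) for word in words]


def mean_pool(matrix: List[Vector]) -> Vector:
    """Collapse an n x d sentence matrix into one d-dimensional sentence vector.

    Pooling is an assumption: it gives a fixed-length input that a
    classifier such as a random forest could consume.
    """
    if not matrix:
        return [0.0] * EMBEDDING_DIM
    n = len(matrix)
    return [sum(row[j] for row in matrix) / n for j in range(EMBEDDING_DIM)]


matrix = sentence_to_matrix(["Knowledge", "graph", "construction"])
print(len(matrix), "x", len(matrix[0]))  # n rows, d columns
print(mean_pool(matrix))
```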
The invention has the beneficial effects that:
The invention relates to a knowledge extraction technique based on a random forest and a BiLSTM-CRF model, suitable for scientific and technological resource service platforms. Taking into account the classification and characteristics of scientific and technological resources in a service platform environment, it provides a knowledge extraction scheme based on a random forest and a BiLSTM-CRF model. The scheme consists of a sentence selector and a sequence labeling model and predicts sentence-level sequence labels. Through the input and preprocessing of unstructured text, word vectorization, sentence matrix representation, sentence selection, and sentence-level sequence labeling, it performs joint entity-relationship knowledge extraction on resources in the scientific and technological service field, enables efficient organization and management of these resources, and supports querying, managing, selecting, and aggregating scientific and technological resources.
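The text does not describe how the sequence labeling model turns scores into a label sequence. In a typical BiLSTM-CRF, the BiLSTM produces a score for every tag at every token (emissions), the CRF layer holds a score for every tag-to-tag transition, and the best sequence is found with the Viterbi algorithm. The sketch below assumes that setup and uses hand-written scores in place of a trained network; the tag set and all numbers are hypothetical, and start and end transitions are omitted for brevity.

```python
from typing import Dict, List


def viterbi_decode(
    emissions: List[Dict[str, float]],
    transitions: Dict[str, Dict[str, float]],
    tags: List[str],
) -> List[str]:
    """Return the highest-scoring tag sequence under emission + transition scores."""
    if not emissions:
        return []
    # best_score[t]: score of the best path that ends in tag t at the current token
    best_score = {tag: emissions[0][tag] for tag in tags}
    backpointers: List[Dict[str, str]] = []
    for emission in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best_score[p] + transitions[p][tag])
            new_score[tag] = best_score[prev] + transitions[prev][tag] + emission[tag]
            pointers[tag] = prev
        best_score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    path = [max(tags, key=lambda t: best_score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


TAGS = ["B", "I", "O"]
# transitions[previous][current]; O -> I is heavily penalized because an
# inside tag should not follow an outside tag.
TRANSITIONS = {
    "B": {"B": -1.0, "I": 1.0, "O": 0.0},
    "I": {"B": -1.0, "I": 0.5, "O": 0.0},
    "O": {"B": 0.0, "I": -10.0, "O": 0.5},
}
EMISSIONS = [
    {"B": 1.0, "I": 0.0, "O": 1.2},
    {"B": 0.0, "I": 2.0, "O": 0.5},
    {"B": 0.0, "I": 0.1, "O": 1.5},
]
print(viterbi_decode(EMISSIONS, TRANSITIONS, TAGS))  # ['B', 'I', 'O']
```

Picking the top-scoring tag at each token independently would give `O, I, O` here, which is an invalid sequence; scoring transitions jointly is what lets the CRF layer avoid it.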
Drawings
FIG. 1 is a diagram of the BiLSTM-CRF model architecture.
FIG. 2 is a flow chart of knowledge extraction based on a random forest and BiLSTM-CRF.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 2, the knowledge extraction method based on a random forest and BiLSTM-CRF specifically includes the following steps:
1. Inputting a text to be analyzed: knowledge extraction is performed mainly on unstructured text.
2. Preprocessing the text to be analyzed: because the obtained information resources are huge, various errors easily appear and can seriously affect the results, so preprocessing such as denoising and encoding is usually carried out on the text to be analyzed.
3. Vectorizing words: the preprocessed text to be analyzed is received and the words in a window are vectorized; each word of the input window is mapped to a distributed vector $x_i \in \mathbb{R}^d$ by using a trained word vector matrix, where $d$ is the dimension of the vector.
4. Matrix representation of sentences: the matrix representation of a sentence is generated by using the trained word vector layer, namely a word vector sequence $(x_1, x_2, \ldots, x_n)$ is obtained.
5. Sentence selection: first, the sentences are divided into different sets according to different entities, each set corresponding to a different entity. The sentence selector aims to select high-quality sentences without label noise from the entity sets, and a random forest model determines which operation to execute in each state: the current sentence is selected or not selected as a training sentence of the sequence labeling model (1 represents selection, 0 represents non-selection); that is, the sentences are classified with a random forest. The random forest model itself needs manually labeled sentences (manually labeled 1 or 0) as training sentences.
6. Sequence labeling model: the sentences classified as 1 by the sentence selector in the previous step, namely the selected high-quality sentences without label noise, are taken as the input of the BiLSTM-CRF sequence labeling model, and the model is trained.
7. After the training stage of the model is completed in steps 5 and 6, a sentence sequence is input into the trained sequence labeling model to obtain the labeling result of the sentences.
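The sentence selector in step 5 is a binary classifier trained on manually labeled sentences. The sketch below is a deliberately small random forest written in plain Python so that it is self-contained: each tree is a one-split decision stump fitted on a bootstrap sample with a random feature subset, and the forest takes a majority vote. The two-dimensional sentence features and the labels are hypothetical, and a real system would more likely use a library implementation such as scikit-learn's `RandomForestClassifier` with deeper trees.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple

Stump = Tuple[int, float, int, int]  # (feature, threshold, label if <=, label if >)


def majority(labels: Sequence[int]) -> int:
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows: List[List[float]], labels: List[int], feature_ids: List[int]) -> Stump:
    """Pick the single split with the fewest training errors."""
    best: Stump = (feature_ids[0], 0.0, majority(labels), majority(labels))
    best_errors = len(rows) + 1
    for feature in feature_ids:
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            left_label = majority(left)
            right_label = majority(right) if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if errors < best_errors:
                best, best_errors = (feature, threshold, left_label, right_label), errors
    return best


def fit_forest(rows: List[List[float]], labels: List[int], n_trees: int = 25, seed: int = 0) -> List[Stump]:
    """Bagging plus random feature subsets, the two ingredients of a random forest."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, int(n_features ** 0.5))
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature_ids = rng.sample(range(n_features), k)  # random feature subset
        forest.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx], feature_ids))
    return forest


def select(forest: List[Stump], row: List[float]) -> int:
    """Return 1 to keep the sentence for training the labeling model, 0 to drop it."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return majority(votes)


# Hypothetical sentence features with manual labels (1 = high quality, 0 = noisy).
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.85], [0.85, 0.7],
     [0.1, 0.2], [0.2, 0.1], [0.15, 0.3], [0.3, 0.15]]
y = [1, 1, 1, 1, 0, 0, 0, 0]
forest = fit_forest(X, y)
kept = [row for row in [[0.95, 0.95], [0.05, 0.05]] if select(forest, row) == 1]
print(kept)
```

Only the sentences for which `select` returns 1 would be passed on to train the sequence labeling model in step 6.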
Claims (1)
1. A knowledge extraction method based on a random forest and a sequence labeling model, characterized by comprising the following specific steps:
Step 1, inputting a text to be analyzed: the target of knowledge extraction is unstructured text;
Step 2, preprocessing the text to be analyzed: because the acquired information resources are huge, various errors easily appear and can sometimes seriously affect the results, so preprocessing such as denoising is generally carried out on the text to be analyzed;
Step 3, vectorization of words: receiving the preprocessed text to be analyzed and vectorizing the words in a window, mapping each word of the input window to a distributed vector $x_i \in \mathbb{R}^d$ by using a trained word vector matrix, where $d$ is the dimension of the vector;
Step 4, matrix representation of sentences: generating the matrix representation of a sentence by using the trained word vector layer, namely obtaining a word vector sequence $(x_1, x_2, \ldots, x_n)$;
Step 5, sentence selector: the sentences are divided into different sets according to different entities; the purpose of the sentence selector is to select high-quality sentences without label noise from the entity sets, and the random forest model decides which operation to execute in each state: selecting or not selecting the current sentence as a training sentence of the sequence labeling model (1 represents selection, 0 represents non-selection), that is, classifying the sentences with a random forest; the random forest model itself needs manually labeled sentences (manually labeled 1 or 0) as training sentences;
Step 6, sequence labeling model: taking the sentences classified as 1 by the sentence selector in step 5, namely the selected high-quality sentences without label noise, as the input of the BiLSTM-CRF sequence labeling model, and training the sequence labeling model;
Step 7, after the training stage of the model is completed in steps 5 and 6, inputting a sentence sequence to the trained sequence labeling model to obtain the labeling result of the sentences.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010842127 | 2020-08-20 | ||
CN2020108421273 | 2020-08-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112541339A (en) | 2021-03-23 |
Family
ID=75015532
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011364225.7A (Pending), published as CN112541339A (en) | Knowledge extraction method based on random forest and sequence labeling model | 2020-08-20 | 2020-11-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112541339A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113377884A (en) * | 2021-07-08 | 2021-09-10 | 中央财经大学 | Event corpus purification method based on multi-agent reinforcement learning |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108875051A (en) * | 2018-06-28 | 2018-11-23 | 中译语通科技股份有限公司 | Knowledge mapping method for auto constructing and system towards magnanimity non-structured text |
- 2020-11-29: CN202011364225.7A filed (patent CN112541339A), status: Pending
Non-Patent Citations (2)
Title |
---|
方依: ""面向新闻的发生地抽取研究"", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑》 * |
高扬: "《智能摘要与深度学习》", 30 April 2019, 北京理工大学出版社 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111104498B (en) | Semantic understanding method in task type dialogue system | |
WO2018000272A1 (en) | Corpus generation device and method | |
CN111125365B (en) | Address data labeling method and device, electronic equipment and storage medium | |
CN110019839A (en) | Medical knowledge map construction method and system based on neural network and remote supervisory | |
CN110826303A (en) | Joint information extraction method based on weak supervised learning | |
CN103678285A (en) | Machine translation method and machine translation system | |
CN108829823A (en) | A kind of file classification method | |
CN107273295A (en) | A kind of software problem reporting sorting technique based on text randomness | |
CN112364125B (en) | Text information extraction system and method combining reading course learning mechanism | |
CN112541339A (en) | Knowledge extraction method based on random forest and sequence labeling model | |
CN115878818B (en) | Geographic knowledge graph construction method, device, terminal and storage medium | |
CN110362691B (en) | Syntax tree bank construction system | |
CN117473054A (en) | Knowledge graph-based general intelligent question-answering method and device | |
CN116304064A (en) | Text classification method based on extraction | |
CN114118068B (en) | Method and device for amplifying training text data and electronic equipment | |
CN112052652B (en) | Automatic generation method and device for electronic courseware script | |
CN114925206A (en) | Artificial intelligence body, voice information recognition method, storage medium and program product | |
CN112069777B (en) | Two-stage data-to-text generation method based on skeleton | |
CN110727695B (en) | Natural language query analysis method for novel power supply urban rail train data operation and maintenance | |
KR101207375B1 (en) | System and method for managing mathematical contents | |
CN111209726A (en) | Intelligent report generation system | |
CN111369005A (en) | Crowdsourcing marking system | |
CN1570921A (en) | Spoken language analyzing method based on statistic model | |
CN109947953B (en) | Construction method, system and equipment of knowledge ontology in English field | |
CN112181389B (en) | Method, system and computer equipment for generating API (application program interface) marks of course fragments |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20210323 |