CN112541339A - Knowledge extraction method based on random forest and sequence labeling model - Google Patents

Knowledge extraction method based on random forest and sequence labeling model

Info

Publication number
CN112541339A
Authority
CN
China
Prior art keywords
sentence
sentences
model
sequence
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011364225.7A
Other languages
Chinese (zh)
Inventor
柳先辉
周珮
陈宇飞
赵卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a knowledge extraction method based on a random forest and a sequence labeling model, and in particular a joint entity-relation extraction method based on a random forest and BiLSTM-CRF. First, unstructured text is acquired, preprocessed, and represented as sentence vectors. The sentence sequence is then fed to a sentence selector, which selects high-quality sentences; the selected sentences are input to a BiLSTM-CRF sequence labeling model for training. Finally, the trained model performs sentence-level sequence labeling on input sentences. By using a random forest for sentence selection and the BiLSTM-CRF sequence labeling model for extraction, the invention extracts knowledge from unstructured text and turns it into structured information. The method greatly improves the efficiency of extracting unstructured information and enriches existing knowledge graph resources, so that it can better serve various intelligent applications.

Description

Knowledge extraction method based on random forest and sequence labeling model
Technical Field
The invention belongs to the technical field of knowledge extraction, and particularly relates to a knowledge extraction method based on a random forest and a BiLSTM-CRF sequence labeling model.
Background
With the development of networks and computers, information resources are growing in volume and being updated rapidly; they contain abundant usable knowledge and have high research value. Given such large data volumes and the low information density of these resources, knowledge extraction is of great research significance. Most networked, digitized information resources exist in free, semi-structured, or unstructured form, are large and heterogeneous, and are updated in real time. Knowledge extraction applies relevant techniques and methods to extract the knowledge a user needs from this information, thereby making effective use of information resources.
Entity and relation extraction is an important step in constructing a knowledge graph and lays the foundation for building one. Traditional knowledge extraction methods first extract entities and then identify the relations between them, so errors in entity recognition directly degrade relation classification; that is, errors propagate.
Disclosure of Invention
Unlike the traditional method, the knowledge extraction method based on a random forest and a sequence labeling model combines entity recognition and relation extraction: it defines labels that contain relation information, trains a joint entity-relation extraction model on selected high-quality sentences, and uses the sequence labeling model to extract entities and their relations directly. Entity information and relation information are thereby integrated, which helps ensure accurate and efficient knowledge acquisition.
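The text does not spell out the label format, so the following is only a minimal sketch of what "a label containing relation information" could look like. It assumes a hypothetical scheme in which each tag is `position-relation-role` (position `B` or `I`, role `1` for the head entity and `2` for the tail entity) plus `O` for other tokens; the function name `decode_triples`, the relation name `LocatedIn`, and the rule that pairs heads with tails in order of appearance are all illustrative assumptions, not part of the disclosed method.

```python
def decode_triples(tokens, tags):
    """Turn a joint tag sequence into (head, relation, tail) triples.

    Assumed (hypothetical) tag format: "B-<relation>-<role>" or
    "I-<relation>-<role>", with role "1" = head entity, "2" = tail entity,
    and "O" for tokens outside any entity.
    """
    spans = {}      # (relation, role) -> list of token lists, one per entity
    current = None  # ((relation, role), token list) of the span being built
    for token, tag in zip(tokens, tags):
        if tag == "O":
            current = None
            continue
        position, rest = tag.split("-", 1)
        relation, role = rest.rsplit("-", 1)
        key = (relation, role)
        if position == "B" or current is None or current[0] != key:
            # Start a new entity span for this relation and role.
            current = (key, [token])
            spans.setdefault(key, []).append(current[1])
        else:
            # Continue the span that is currently open.
            current[1].append(token)

    triples = []
    for (relation, role), heads in spans.items():
        if role != "1":
            continue
        tails = spans.get((relation, "2"), [])
        # Assumption: heads and tails of one relation pair up in order.
        for head, tail in zip(heads, tails):
            triples.append((" ".join(head), relation, " ".join(tail)))
    return triples


tokens = ["Tongji", "University", "is", "located", "in", "Shanghai"]
tags = ["B-LocatedIn-1", "I-LocatedIn-1", "O", "O", "O", "B-LocatedIn-2"]
triples = decode_triples(tokens, tags)
print(triples)
```

Because the relation type is carried inside each tag, a single sequence labeling pass yields both the entities and the relation between them, which is the property the joint approach relies on.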
The invention aims to overcome the shortcomings of traditional knowledge extraction on unstructured text and to improve the accuracy and efficiency of knowledge acquisition. It discloses a random-forest-based knowledge extraction method combined with the BiLSTM-CRF sequence labeling model to obtain better extraction results, which helps enrich existing knowledge graph resources and in turn provides better services for various intelligent applications.
To solve the above technical problems, the technical solution of the invention is as follows:
A knowledge extraction method based on a random forest and a sequence labeling model, characterized by comprising the following specific steps:
Step 1, inputting a text to be analyzed: the target of knowledge extraction is unstructured text;
Step 2, preprocessing the text to be analyzed: because the acquired information resources are huge, various kinds of erroneous information readily appear, and such errors can sometimes seriously affect the results, so preprocessing such as denoising is generally performed on the text to be analyzed;
Step 3, vectorizing words: receiving the preprocessed text to be analyzed and mapping the words of a window to vectors, where each word of the input window is mapped to a distributed vector $x_i \in \mathbb{R}^d$ by using a trained word vector matrix, and $d$ is the dimension of the vector;
Step 4, expressing sentences as matrices: generating the matrix representation of a sentence by using the trained word vector layer, that is, obtaining the word vector sequence $(x_1, x_2, \ldots, x_n)$;
Step 5, sentence selector: the sentences are divided into different sets according to their entities; the purpose of the sentence selector is to select, within each entity set, high-quality sentences without label noise, and a random forest model decides which operation to execute in each state: selecting or not selecting the current sentence as a training sentence for the sequence labeling model (1 denotes selection and 0 denotes non-selection), that is, the sentences are classified by the random forest; the random forest model is itself trained on manually labeled sentences (sentences manually labeled 1 or 0);
Step 6, sequence labeling model: taking the sentences classified as 1 by the sentence selector in step 5, that is, the selected high-quality sentences without label noise, as the input of the BiLSTM-CRF sequence labeling model, and training the sequence labeling model;
Step 7, after the model training stage of steps 5 and 6 is completed, inputting a sentence sequence to the trained sequence labeling model to obtain the labeling result of the sentences.
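Steps 3 and 4 above can be sketched in a few lines. This is an illustration under stated assumptions only: the tiny `EMBEDDINGS` table with d = 3 stands in for a trained word vector matrix, its values are made up, and unknown words map to a zero vector. The `mean_pool` helper is also an assumption; the text does not say how the sentence matrix is turned into fixed-length features for the random forest, and averaging the word vectors is just one simple option.

```python
# Hypothetical trained word-vector table (d = 3); values are illustrative only.
EMBEDDINGS = {
    "tongji": [1.0, 0.0, 2.0],
    "university": [3.0, 2.0, 0.0],
    "shanghai": [2.0, 4.0, 1.0],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary words


def sentence_matrix(tokens):
    """Step 3 and step 4: map each word to x_i and stack them as (x_1, ..., x_n)."""
    return [EMBEDDINGS.get(token.lower(), UNKNOWN) for token in tokens]


def mean_pool(matrix):
    """Assumed feature extractor: average the word vectors column by column."""
    n = len(matrix)
    return [sum(column) / n for column in zip(*matrix)]


matrix = sentence_matrix(["Tongji", "University", "Shanghai"])
features = mean_pool(matrix)
print(len(matrix), len(matrix[0]))  # n rows, d columns
print(features)
```

The sentence matrix keeps one row per word, which is what a BiLSTM-CRF consumes, while the pooled vector has a fixed length regardless of sentence length, which is what a tree-based classifier needs.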
The invention has the following beneficial effects:
The invention relates to a knowledge extraction technology based on a random forest and a BiLSTM-CRF model and is suitable for scientific and technological resource service platforms. Taking into account the classification and characteristics of scientific and technological resources in a service platform environment, it provides a knowledge extraction scheme based on a random forest and a BiLSTM-CRF model. The scheme consists of a sentence selector and a sequence labeling model and predicts sentence-level sequence labels. Through input and preprocessing of unstructured text, word vectorization, sentence matrix representation, sentence selection, and sentence-level sequence labeling, it achieves joint entity-relation knowledge extraction from resources in the scientific and technological service field. This supports efficient organization and management of those resources and provides support for querying, managing, selecting, and aggregating them.
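How the CRF layer turns per-token scores into one sentence-level label sequence is not described here; a common choice for linear-chain CRFs is Viterbi decoding, sketched below as an assumption. In a real BiLSTM-CRF the emission scores would come from the BiLSTM and the transition scores would be learned; in this sketch both are hand-written, hypothetical numbers, and the tag set `O`/`B`/`I` is simplified.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag path.

    emissions:   list of {tag: score}, one dict per token
    transitions: {(previous_tag, current_tag): score}; missing pairs score 0.0
    """
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for scores in emissions[1:]:
        column = {}
        for cur in tags:
            prev = max(
                tags,
                key=lambda p: best[-1][p][0] + transitions.get((p, cur), 0.0),
            )
            total = best[-1][prev][0] + transitions.get((prev, cur), 0.0) + scores[cur]
            column[cur] = (total, prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


tags = ["O", "B", "I"]
transitions = {("O", "I"): -10.0}  # an "I" tag should not directly follow "O"
emissions = [
    {"O": 1.0, "B": 0.8, "I": 0.0},
    {"O": 0.0, "B": 0.0, "I": 2.0},
    {"O": 1.5, "B": 0.0, "I": 0.0},
]
path = viterbi(emissions, transitions, tags)
print(path)
```

Picking the best tag per token independently would give `O, I, O` here, which violates the tagging constraint; scoring whole paths lets the transition penalty steer the first token to `B`.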
Drawings
FIG. 1 is a diagram of the BiLSTM-CRF model architecture.
FIG. 2 is a flow chart of knowledge extraction based on a random forest and BiLSTM-CRF.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 2, the knowledge extraction method based on a random forest and BiLSTM-CRF specifically includes the following steps:
1. Inputting a text to be analyzed: knowledge extraction is mainly performed on unstructured text.
2. Preprocessing the text to be analyzed: because the obtained information resources are huge, various kinds of erroneous information readily appear, and such errors can seriously affect the results, so preprocessing such as denoising and encoding is usually performed on the text to be analyzed.
3. Vectorizing words: the preprocessed text to be analyzed is received and the words of a window are mapped to vectors; each word of the input window is mapped to a distributed vector $x_i \in \mathbb{R}^d$ by using a trained word vector matrix, where $d$ is the dimension of the vector.
4. Expressing sentences as matrices: the matrix representation of a sentence is generated by using the trained word vector layer, that is, the word vector sequence $(x_1, x_2, \ldots, x_n)$ is obtained.
5. Sentence selection: the sentences are first divided into different sets according to their entities, each set corresponding to a different entity. The sentence selector aims to select, within each entity set, high-quality sentences without label noise. A random forest model determines which operation to execute in each state: the current sentence is either selected or not selected as a training sentence for the sequence labeling model (1 denotes selection and 0 denotes non-selection); that is, the sentences are classified by the random forest. The random forest model is itself trained on manually labeled sentences (sentences manually labeled 1 or 0).
6. Sequence labeling model: the sentences classified as 1 by the sentence selector in the previous step, that is, the selected high-quality sentences without label noise, are input to the BiLSTM-CRF sequence labeling model, and the model is trained.
7. After the model training stage of steps 5 and 6 is completed, a sentence sequence is input to the trained sequence labeling model to obtain the labeling result of the sentences.
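The sentence selector in step 5 can be illustrated with a toy forest written in plain Python. This is a sketch under loud assumptions, not the disclosed implementation: each "tree" is a depth-1 decision stump, each tree sees a bootstrap sample and one randomly chosen feature, bootstrap samples are redrawn until both classes are present, and the two-dimensional sentence features and all function names (`train_stump`, `train_forest`, `select`) are hypothetical. A production system would more likely use a library random forest with full-depth trees.

```python
import random


def train_stump(X, y, feature):
    """Fit a depth-1 tree on one feature: predict 1 when x[feature] > threshold."""
    values = sorted({x[feature] for x in X})
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best_t, best_err = None, None
    for t in candidates:
        err = sum((x[feature] > t) != bool(label) for x, label in zip(X, y))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return feature, best_t


def train_forest(X, y, n_trees=25, seed=0):
    """Bagging plus a random feature per tree. y must contain both 0 and 1."""
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        # Bootstrap sample; redraw until both classes appear (an assumption
        # made here so that every stump has something to split).
        while True:
            idx = [rng.randrange(n) for _ in range(n)]
            if len({y[i] for i in idx}) == 2:
                break
        feature = rng.randrange(n_features)
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest


def select(forest, x):
    """Majority vote: 1 = keep the sentence for training, 0 = discard it."""
    votes = sum(x[feature] > threshold for feature, threshold in forest)
    return int(2 * votes > len(forest))


# Manually labeled sentences, represented by hypothetical 2-D feature vectors.
X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3]]
y = [1, 1, 1, 0, 0, 0]
forest = train_forest(X, y)

candidates = [("sentence A", [0.95, 0.85]), ("sentence B", [0.05, 0.15])]
selected = [text for text, feats in candidates if select(forest, feats) == 1]
print(selected)
```

Only the sentences in `selected` would be passed on to train the BiLSTM-CRF model in step 6, so label noise filtered out here never reaches the sequence labeler.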

Claims (1)

1. A knowledge extraction method based on a random forest and a sequence labeling model, characterized by comprising the following specific steps:
Step 1, inputting a text to be analyzed: the target of knowledge extraction is unstructured text;
Step 2, preprocessing the text to be analyzed: because the acquired information resources are huge, various kinds of erroneous information readily appear, and such errors can sometimes seriously affect the results, so preprocessing such as denoising is generally performed on the text to be analyzed;
Step 3, vectorizing words: receiving the preprocessed text to be analyzed and mapping the words of a window to vectors, where each word of the input window is mapped to a distributed vector $x_i \in \mathbb{R}^d$ by using a trained word vector matrix, and $d$ is the dimension of the vector;
Step 4, expressing sentences as matrices: generating the matrix representation of a sentence by using the trained word vector layer, that is, obtaining the word vector sequence $(x_1, x_2, \ldots, x_n)$;
Step 5, sentence selector: the sentences are divided into different sets according to their entities; the purpose of the sentence selector is to select, within each entity set, high-quality sentences without label noise, and a random forest model decides which operation to execute in each state: selecting or not selecting the current sentence as a training sentence for the sequence labeling model (1 denotes selection and 0 denotes non-selection), that is, the sentences are classified by the random forest; the random forest model is itself trained on manually labeled sentences (sentences manually labeled 1 or 0);
Step 6, sequence labeling model: taking the sentences classified as 1 by the sentence selector in step 5, that is, the selected high-quality sentences without label noise, as the input of the BiLSTM-CRF sequence labeling model, and training the sequence labeling model;
Step 7, after the model training stage of steps 5 and 6 is completed, inputting a sentence sequence to the trained sequence labeling model to obtain the labeling result of the sentences.
CN202011364225.7A, priority date 2020-08-20, filing date 2020-11-29: Knowledge extraction method based on random forest and sequence labeling model (Pending), published as CN112541339A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010842127 2020-08-20
CN2020108421273 2020-08-20

Publications (1)

Publication Number Publication Date
CN112541339A (en) 2021-03-23

Family

ID=75015532

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202011364225.7A | Pending | CN112541339A (en) | 2020-08-20 | 2020-11-29 | Knowledge extraction method based on random forest and sequence labeling model

Country Status (1)

Country Link
CN (1) CN112541339A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377884A (en) * 2021-07-08 2021-09-10 中央财经大学 Event corpus purification method based on multi-agent reinforcement learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
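方依: "Research on extracting places of occurrence from news" (面向新闻的发生地抽取研究), China Master's and Doctoral Theses Full-text Database (Master's), Information Science and Technology series *
高扬: "Intelligent Summarization and Deep Learning" (智能摘要与深度学习), 30 April 2019, Beijing Institute of Technology Press (北京理工大学出版社) *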

Similar Documents

Publication Publication Date Title
CN111104498B (en) Semantic understanding method in task type dialogue system
WO2018000272A1 (en) Corpus generation device and method
CN111125365B (en) Address data labeling method and device, electronic equipment and storage medium
CN110019839A (en) Medical knowledge map construction method and system based on neural network and remote supervisory
CN110826303A (en) Joint information extraction method based on weak supervised learning
CN103678285A (en) Machine translation method and machine translation system
CN108829823A (en) A kind of file classification method
CN107273295A (en) A kind of software problem reporting sorting technique based on text randomness
CN112364125B (en) Text information extraction system and method combining reading course learning mechanism
CN112541339A (en) Knowledge extraction method based on random forest and sequence labeling model
CN115878818B (en) Geographic knowledge graph construction method, device, terminal and storage medium
CN110362691B (en) Syntax tree bank construction system
CN117473054A (en) Knowledge graph-based general intelligent question-answering method and device
CN116304064A (en) Text classification method based on extraction
CN114118068B (en) Method and device for amplifying training text data and electronic equipment
CN112052652B (en) Automatic generation method and device for electronic courseware script
CN114925206A (en) Artificial intelligence body, voice information recognition method, storage medium and program product
CN112069777B (en) Two-stage data-to-text generation method based on skeleton
CN110727695B (en) Natural language query analysis method for novel power supply urban rail train data operation and maintenance
KR101207375B1 (en) System and method for managing mathematical contents
CN111209726A (en) Intelligent report generation system
CN111369005A (en) Crowdsourcing marking system
CN1570921A (en) Spoken language analyzing method based on statistic model
CN109947953B (en) Construction method, system and equipment of knowledge ontology in English field
CN112181389B (en) Method, system and computer equipment for generating API (application program interface) marks of course fragments

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210323