CN108804408A - Information extraction system based on domain-specialist knowledge system and information extraction method - Google Patents
Information extraction system based on domain-specialist knowledge system and information extraction method
- Publication number
- CN108804408A (application number CN201710289555.6A)
- Authority
- CN
- China
- Prior art keywords
- information extraction
- domain
- knowledge
- rule
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Probability & Statistics with Applications (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an information extraction system based on a domain-expert knowledge system, and an information extraction method for that system. The system comprises a resource management module, a preprocessing module, a core processing module, and an output module. The method is as follows: experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build a domain knowledge base, and also identify and define knowledge points through the resource management module; maintenance personnel write extraction rules through the rule base unit, according to the extraction requirements, to form an information extraction rule base; the preprocessing module normalizes and segments the content of the judgment document; the core processing module extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written extraction rule base; and the output module outputs the extraction results. The invention offers advantages such as high universality and low maintenance cost.
Description
Technical field
The present invention relates to the field of information extraction, and specifically to an information extraction system based on a domain-expert knowledge system and its information extraction method.
Background art
A judgment document (court verdict) is a legal term for the document a court writes on the basis of its ruling. It is a form of practical writing commonly used in legal circles, and includes civil judgments, criminal judgments, administrative judgments, and judgments in criminal cases with incidental civil actions.
Under a new rule issued by the Supreme People's Court, effective court judgments have been fully published on the internet since January 1, 2014. Except for four categories of judgments (those involving state secrets, personal privacy, or juvenile crime, and those otherwise unsuitable for public posting), the public can consult them at any time.
At present, existing document extraction techniques are mainly rule-based. The information points they extract are scattered and unsystematic, so they cannot meet the changing demands of extraction tasks. In addition, existing text extraction techniques are costly to maintain and are not suited to wide adoption.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art, namely its lack of universality and its high maintenance threshold, by providing an information extraction system based on a domain-expert knowledge system and its information extraction method.
The technical solution provided by the present invention is as follows. The invention discloses an information extraction system based on a domain-expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module. The resource management module manages the domain knowledge base and the information extraction rule base; the preprocessing module normalizes and segments the content of the judgment document; the core processing module extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written rule resources; and the output module outputs the extraction results.
Preferably, the resource management module includes an expert knowledge base unit and a rule base unit. The expert knowledge base unit is used by experts to organize the knowledge of the judicial domain into a domain knowledge base, and by judicial-domain experts to identify and define knowledge points. The rule base unit is used by maintenance personnel to write extraction rules, according to the extraction requirements, to form the information extraction rule base.
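As an illustration only, the two resources held by the resource management module could be represented as below. The patent does not disclose a schema, so the class names (`KnowledgePoint`, `ExtractionRule`, `ResourceManager`), the use of regular expressions as the rule language, and the sample `case_number` rule are all assumptions.

```python
import re
from dataclasses import dataclass, field


@dataclass
class KnowledgePoint:
    """A knowledge point defined by a domain expert (hypothetical schema)."""
    name: str
    definition: str


@dataclass
class ExtractionRule:
    """A maintainer-written rule that targets one knowledge point."""
    knowledge_point: str
    pattern: str  # regular expression; the first capture group is the value

    def apply(self, text: str) -> list[str]:
        return [m.group(1) for m in re.finditer(self.pattern, text)]


@dataclass
class ResourceManager:
    """Holds the domain knowledge base and the extraction rule base."""
    knowledge_base: dict[str, KnowledgePoint] = field(default_factory=dict)
    rule_base: list[ExtractionRule] = field(default_factory=list)

    def define_point(self, name: str, definition: str) -> None:
        self.knowledge_base[name] = KnowledgePoint(name, definition)

    def add_rule(self, rule: ExtractionRule) -> None:
        # Reject rules that refer to a knowledge point no expert has defined.
        if rule.knowledge_point not in self.knowledge_base:
            raise KeyError(f"undefined knowledge point: {rule.knowledge_point}")
        self.rule_base.append(rule)


resources = ResourceManager()
resources.define_point("case_number", "Identifier the court assigns to a case")
resources.add_rule(ExtractionRule("case_number", r"Case No\.\s*(\S+)"))
```

Tying every rule to an expert-defined knowledge point is one way to keep the extracted information points systematic rather than scattered, which is the defect the background section attributes to purely rule-based systems.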
Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain-expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize the knowledge of the judicial domain through the expert knowledge base unit to build the domain knowledge base; they also identify and define knowledge points through the resource management module.
(2) Maintenance personnel write extraction rules through the rule base unit, according to the extraction requirements, to form the information extraction rule base.
(3) The preprocessing module normalizes and segments the content of the judgment document.
(4) The core processing module extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written information extraction rule base.
(5) The output module outputs the extraction results.
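The automated steps (3) to (5) can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the patented implementation: the function names, the regular-expression rule table `RULES`, the JSON output format, and the English sample text are all hypothetical, and segmentation is reduced here to splitting on blank lines.

```python
import json
import re


def preprocess(document: str) -> list[str]:
    """Step (3), simplified: normalize whitespace and split into paragraphs."""
    paragraphs = re.split(r"\n\s*\n", document.strip())
    return [re.sub(r"\s+", " ", p).strip() for p in paragraphs if p.strip()]


def extract(paragraphs: list[str], rules: dict[str, str]) -> dict[str, list[str]]:
    """Step (4), simplified: apply every rule (name -> regex) to every paragraph."""
    results: dict[str, list[str]] = {name: [] for name in rules}
    for paragraph in paragraphs:
        for name, pattern in rules.items():
            results[name].extend(re.findall(pattern, paragraph))
    return results


def output(results: dict[str, list[str]]) -> str:
    """Step (5): serialize the extraction results."""
    return json.dumps(results, ensure_ascii=False, sort_keys=True)


# Steps (1) and (2) are manual; here their product is a hypothetical rule table.
RULES = {"date": r"\d{4}-\d{2}-\d{2}", "amount": r"\d+(?:\.\d+)? yuan"}

document = """The court accepted   the case on 2016-03-01.

The defendant shall pay 5000 yuan
by 2016-12-31."""

print(output(extract(preprocess(document), RULES)))
```

Keeping the rules in a table that is passed in, rather than hard-coding them in `extract`, mirrors the separation between the rule base and the core processing module described above.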
Preferably, step (3) proceeds as follows: to determine the content that each paragraph states, the paragraphs are classified using a naive Bayes method or a rule-based classification method and then ranked, which achieves automatic segmentation; finally, the classification result is output.
Preferably, the rule-based classification method classifies paragraphs according to rules written by maintenance personnel.
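A rule-based paragraph classifier of this kind might look like the following sketch. The rule format (weighted regular expressions per label), the label names, and the scoring scheme (matched weight divided by total weight) are assumptions made for illustration; the patent states only that maintenance personnel write the rules.

```python
import re

# Hypothetical maintainer-written rules: label -> list of (regex, weight).
SECTION_RULES = {
    "parties": [(r"\bplaintiff\b", 1.0), (r"\bdefendant\b", 1.0)],
    "facts": [(r"\bthe court finds\b", 2.0), (r"\bevidence\b", 1.0)],
    "ruling": [(r"\bit is ordered\b", 2.0), (r"\bshall pay\b", 1.0)],
}


def rule_scores(paragraph: str) -> dict[str, float]:
    """Return a rule-match score in [0, 1] per label: matched weight / total weight."""
    text = paragraph.lower()
    scores = {}
    for label, rules in SECTION_RULES.items():
        total = sum(weight for _, weight in rules)
        matched = sum(weight for pattern, weight in rules if re.search(pattern, text))
        scores[label] = matched / total
    return scores


def classify_by_rules(paragraph: str) -> str:
    """Pick the label whose rules match most strongly."""
    scores = rule_scores(paragraph)
    return max(scores, key=scores.get)
```

Returning a score per label, rather than only the winning label, lets the rule result be combined with a statistical score in a later ranking step.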
Preferably, the ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayesian classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-match score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
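The weighted combination can be made concrete with a short sketch. The function names and the default weights 0.6 and 0.4 are placeholders chosen for illustration; the patent says only that the weights are obtained by training and does not give values or a training procedure.

```python
def fuse_scores(bayes: dict[str, float], rule: dict[str, float],
                w1: float, w2: float) -> dict[str, float]:
    """Compute w1 * f_Bayesian + w2 * f_Rule for every candidate label."""
    labels = set(bayes) | set(rule)
    return {
        label: w1 * bayes.get(label, 0.0) + w2 * rule.get(label, 0.0)
        for label in labels
    }


def rank_labels(bayes: dict[str, float], rule: dict[str, float],
                w1: float = 0.6, w2: float = 0.4) -> list[tuple[str, float]]:
    """Rank labels by fused score, best first (weights here are placeholders)."""
    fused = fuse_scores(bayes, rule, w1, w2)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# The classifier alone prefers "facts", but a strong rule match tips the ranking.
ranking = rank_labels({"facts": 0.7, "ruling": 0.3}, {"facts": 0.0, "ruling": 1.0})
```

In this example the fused scores are 0.6 × 0.3 + 0.4 × 1.0 = 0.58 for "ruling" and 0.6 × 0.7 + 0.4 × 0.0 = 0.42 for "facts", so the paragraph is labeled "ruling".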
Preferably, in step (4), different information points are extracted from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different methods are needed to identify different types.
Compared with the prior art, the present invention has the following advantages:
The invention centers on a systematically organized body of domain business knowledge and adopts an architecture consisting of a preprocessing module and a core processing module. The preprocessing module first normalizes and segments the content of the judgment document. Although there is a drafting standard for judgment documents, it specifies only which information a document should contain and a rough division into blocks, so each judge has a certain degree of freedom when writing. The purpose of segmentation is to determine what each paragraph states; labeling each paragraph is the prerequisite for the subsequent extraction of information points. The core processing module then extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written information extraction rule base. This greatly improves the universality of the extraction system and lowers its maintenance threshold, so that it can cope with changing information extraction demands.
Description of the drawings
Fig. 1 is a system block diagram of the information extraction system based on a domain-expert knowledge system of the present invention;
Fig. 2 is a schematic diagram of Embodiment 1 of the present invention;
Fig. 3 is a schematic structural diagram of step (3) of the information extraction system based on a domain-expert knowledge system of the present invention.
Detailed description of the embodiments
Referring to Figs. 1-3, the invention discloses an information extraction system based on a domain-expert knowledge system, comprising a resource management module 1, a preprocessing module 2, a core processing module 3, and an output module 4. The resource management module 1 manages the domain knowledge base and the information extraction rule base; the preprocessing module 2 normalizes and segments the content of the judgment document; the core processing module 3 extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written rule resources; and the output module 4 outputs the extraction results.
Preferably, the resource management module 1 includes an expert knowledge base unit 11 and a rule base unit 12. The expert knowledge base unit 11 is used by experts to organize the knowledge of the judicial domain into a domain knowledge base, and by judicial-domain experts to identify and define knowledge points. The rule base unit 12 is used by maintenance personnel to write extraction rules, according to the extraction requirements, to form the information extraction rule base.
Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain-expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize the knowledge of the judicial domain through the expert knowledge base unit 11 to build the domain knowledge base; they also identify and define knowledge points through the resource management module.
(2) Maintenance personnel write extraction rules through the rule base unit 12, according to the extraction requirements, to form the information extraction rule base.
(3) The preprocessing module 2 normalizes and segments the content of the judgment document.
(4) The core processing module 3 extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written information extraction rule base.
(5) The output module 4 outputs the extraction results.
Preferably, step (3) proceeds as follows: to determine the content that each paragraph states, the paragraphs are classified using a naive Bayes method or a rule-based classification method and then ranked, which achieves automatic segmentation; finally, the classification result is output.
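For readers unfamiliar with the naive Bayes method named here, the following is a minimal sketch of a multinomial naive Bayes paragraph classifier with add-one smoothing, written with the standard library only. The class name, the whitespace tokenizer, the labels, and the four English training sentences are assumptions for illustration; the patent does not describe its features or training data, and Chinese text would need a word segmenter in place of `split()`.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, paragraphs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(paragraphs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return log P(label) + sum of log P(word | label) for each label."""
        total_docs = sum(self.label_counts.values())
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(doc_count / total_docs)
            for word in text.lower().split():
                if word in self.vocabulary:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return scores

    def probabilities(self, text):
        """Normalize log scores into posterior probabilities that sum to 1."""
        log_scores = self.log_scores(text)
        peak = max(log_scores.values())
        exp = {label: math.exp(s - peak) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: value / norm for label, value in exp.items()}


# Toy training set (hypothetical labels and sentences).
train_texts = [
    "the plaintiff claims damages",
    "the defendant denies the claim",
    "it is ordered that the defendant shall pay",
    "the court orders payment of costs",
]
train_labels = ["parties", "parties", "ruling", "ruling"]
model = NaiveBayes().fit(train_texts, train_labels)
posterior = model.probabilities("it is ordered that costs shall be paid")
```

The normalized posterior for each label is the kind of value that could serve as the Bayesian classification score in a later ranking step.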
Preferably, the rule-based classification method classifies paragraphs according to rules written by maintenance personnel.
Preferably, the ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayesian classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-match score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
Preferably, in step (4), different information points are extracted from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different methods are needed to identify different types.
Embodiment 1
An information extraction method for the above information extraction system based on a domain-expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize the knowledge of the judicial domain through the expert knowledge base unit 11 to build the domain knowledge base; they also identify and define knowledge points through the resource management module.
(2) Maintenance personnel write extraction rules through the rule base unit 12, according to the extraction requirements, to form the information extraction rule base.
(3) The preprocessing module 2 normalizes and segments the content of the judgment document. Specifically, to determine the content that each paragraph states, the paragraphs are classified using a naive Bayes method and then ranked, which achieves automatic segmentation; finally, the classification result is output. The ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$, where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayesian classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-match score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
(4) The core processing module 3 extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written information extraction rule base. In this process, different information points are extracted from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different methods are needed to identify different types. Take person names, place names, times, and organization names as examples: recognizing such types is called named entity recognition (NER) in the field of natural language understanding. For these, the system uses a method that combines statistics and rules, supplemented by part-of-speech information, to reach a comprehensive judgment. Judgment documents also contain descriptions of various complex relations; for these, the system mainly uses rules, defining extraction templates for multiple kinds of relations, supplemented by simple inference.
(5) The output module 4 outputs the extraction results.
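The rule side of step (4) can be illustrated with a small sketch: pattern-based entity recognition plus one relation extraction template with a simple consistency check. Everything named here is hypothetical, including the patterns, the "shall pay" template, and the English sample sentence; the statistical component and the part-of-speech check mentioned in the embodiment are not shown.

```python
import re

# Hypothetical entity patterns (the rule half of a rules-plus-statistics NER).
ENTITY_PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "organization": re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Court|Company|Bank)\b"),
}

# Hypothetical relation template: "<payer> shall pay <payee> <amount> yuan".
PAY_TEMPLATE = re.compile(
    r"(?P<payer>[A-Z][a-z]+(?: [A-Z][a-z]+)*) shall pay "
    r"(?P<payee>[A-Z][a-z]+(?: [A-Z][a-z]+)*) (?P<amount>\d+) yuan"
)


def find_entities(text):
    """Return (type, surface string) pairs for every pattern match."""
    return [
        (kind, match.group(0))
        for kind, pattern in ENTITY_PATTERNS.items()
        for match in pattern.finditer(text)
    ]


def find_payments(text):
    """Fill the payment template and apply one simple sanity check."""
    relations = []
    for match in PAY_TEMPLATE.finditer(text):
        if match.group("payer") != match.group("payee"):  # a party cannot owe itself
            relations.append({
                "payer": match.group("payer"),
                "payee": match.group("payee"),
                "amount": int(match.group("amount")),
            })
    return relations


sentence = ("On 2016-12-31 the Haidian District Court ordered that "
            "Zhang San shall pay Li Si 5000 yuan.")
entities = find_entities(sentence)
payments = find_payments(sentence)
```

Each new relation type would need its own template, which is why the embodiment pairs templates with a maintained rule base rather than fixing them in code.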
The above embodiment merely illustrates the principles and effects of the present invention and is not intended to limit it. Anyone skilled in the art may modify or change the above embodiment without departing from the spirit and scope of the invention. Therefore, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall be covered by the claims of the invention.
Claims (7)
1. An information extraction system based on a domain-expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module, characterized in that: the resource management module is used to manage a domain knowledge base and an information extraction rule base; the preprocessing module is used to normalize and segment the content of a judgment document; the core processing module is used to extract information points with an information extraction algorithm according to the domain knowledge base and manually written rule resources; and the output module is used to output the extraction results.
2. The information extraction system based on a domain-expert knowledge system according to claim 1, characterized in that: the resource management module includes an expert knowledge base unit and a rule base unit; the expert knowledge base unit is used by experts to organize the knowledge of the judicial domain into the domain knowledge base, and by judicial-domain experts to identify and define knowledge points; and the rule base unit is used by maintenance personnel to write extraction rules, according to the extraction requirements, to form the information extraction rule base.
3. An information extraction method for the information extraction system based on a domain-expert knowledge system according to claim 1 or 2, characterized by the following steps:
(1) experts in the judicial domain organize the knowledge of the judicial domain through the expert knowledge base unit to build the domain knowledge base, and also identify and define knowledge points through the resource management module;
(2) maintenance personnel write extraction rules through the rule base unit, according to the extraction requirements, to form the information extraction rule base;
(3) the preprocessing module normalizes and segments the content of the judgment document;
(4) the core processing module extracts information points with an information extraction algorithm, using the domain knowledge base and the manually written information extraction rule base;
(5) the output module outputs the extraction results.
4. The information extraction method for an information extraction system based on a domain-expert knowledge system according to claim 3, characterized in that step (3) proceeds as follows: to determine the content that each paragraph states, the paragraphs are classified using a naive Bayes method or a rule-based classification method and then ranked, which achieves automatic segmentation; finally, the classification result is output.
5. The information extraction method for an information extraction system based on a domain-expert knowledge system according to claim 4, characterized in that: the rule-based classification method classifies paragraphs according to rules written by maintenance personnel.
6. The information extraction method for an information extraction system based on a domain-expert knowledge system according to claim 4, characterized in that: the ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayesian classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-match score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
7. The information extraction method for an information extraction system based on a domain-expert knowledge system according to claim 3, characterized in that: in step (4), different information points are extracted from each paragraph according to the automatic segmentation result of step (3); because judgment documents contain many information points of many types, different methods are needed to identify different types.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710289555.6A CN108804408A (en) | 2017-04-27 | 2017-04-27 | Information extraction system based on domain-specialist knowledge system and information extraction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710289555.6A CN108804408A (en) | 2017-04-27 | 2017-04-27 | Information extraction system based on domain-specialist knowledge system and information extraction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108804408A true CN108804408A (en) | 2018-11-13 |
Family
ID=64069303
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710289555.6A Pending CN108804408A (en) | 2017-04-27 | 2017-04-27 | Information extraction system based on domain-specialist knowledge system and information extraction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804408A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111753020A (en) * | 2019-03-28 | 2020-10-09 | 阿里巴巴集团控股有限公司 | Method and device for establishing relational network model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8024329B1 (en) * | 2006-06-01 | 2011-09-20 | Monster Worldwide, Inc. | Using inverted indexes for contextual personalized information retrieval |
CN102194013A (en) * | 2011-06-23 | 2011-09-21 | 上海毕佳数据有限公司 | Domain-knowledge-based short text classification method and text classification system |
CN103049490A (en) * | 2012-12-05 | 2013-04-17 | 北京海量融通软件技术有限公司 | Attribute generation system and generation method among knowledge network nodes |
CN103123653A (en) * | 2013-03-15 | 2013-05-29 | 山东浪潮齐鲁软件产业股份有限公司 | Search engine retrieving ordering method based on Bayesian classification learning |
CN103618652A (en) * | 2013-12-17 | 2014-03-05 | 沈阳觉醒软件有限公司 | Audit and depth analysis system and audit and depth analysis method of business data |
CN103927302A (en) * | 2013-01-10 | 2014-07-16 | 阿里巴巴集团控股有限公司 | Text classification method and system |
CN105069560A (en) * | 2015-07-30 | 2015-11-18 | 中国科学院软件研究所 | Resume information extraction and characteristic identification analysis system and method based on knowledge base and rule base |
CN105574084A (en) * | 2015-12-10 | 2016-05-11 | 天津海量信息技术有限公司 | Extraction method of case information in webpage |
Non-Patent Citations (1)
Title |
---|
邱莉榕 等编著: "《算法设计与优化》", 31 December 2016, 北京:中央民族大学出版社 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103631859B (en) | Intelligent review expert recommending method for science and technology projects | |
CN105740228B (en) | A kind of internet public feelings analysis method and system | |
CN110619568A (en) | Risk assessment report generation method, device, equipment and storage medium | |
CN110472017A (en) | A kind of analysis of words art and topic point identify matched method and system | |
CN105095190B (en) | A kind of sentiment analysis method combined based on Chinese semantic structure and subdivision dictionary | |
CN107403375A (en) | A kind of listed company's bulletin classification and abstraction generating method based on deep learning | |
CN108763483A (en) | A kind of Text Information Extraction method towards judgement document | |
CN106649223A (en) | Financial report automatic generation method based on natural language processing | |
CN107766371A (en) | A kind of text message sorting technique and its device | |
CN104573094B (en) | Network account identifies matching process | |
CN107273295B (en) | Software problem report classification method based on text chaos | |
CN106447490A (en) | Credit investigation application method based on user figures | |
CN107301170A (en) | The method and apparatus of cutting sentence based on artificial intelligence | |
CN111597331B (en) | Referee document classification method based on Bayesian network | |
CN106022708A (en) | Method for predicting employee resignation | |
CN111259160B (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN109460551A (en) | Signing messages extracting method and device | |
CN106709804A (en) | Interactive wealth planning consulting robot system | |
CN105488098B (en) | A kind of new words extraction method based on field otherness | |
CN109165337A (en) | A kind of method and system of knowledge based map construction bidding field association analysis | |
CN108241867A (en) | A kind of sorting technique and device | |
CN109241297A (en) | A kind of classifying content polymerization, electronic equipment, storage medium and engine | |
CN110034966A (en) | A kind of method for classifying data stream and system based on machine learning | |
CN115423639A (en) | Social network-oriented secure community discovery method | |
CN112580332A (en) | Enterprise portrait method based on label layering and deepening modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20181113 |