CN108804408A - Information extraction system based on domain-specialist knowledge system and information extraction method - Google Patents

Publication number
CN108804408A
Authority
CN
China
Prior art keywords
information extraction
domain
knowledge
rule
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710289555.6A
Other languages
Chinese (zh)
Inventor
司华建
贾真
耿伟
金重九
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Fu Chi Information Technology Co Ltd
Original Assignee
Anhui Fu Chi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Fu Chi Information Technology Co Ltd
Priority to CN201710289555.6A
Publication of CN108804408A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information extraction system based on a domain-expert knowledge system, and an information extraction method for that system. The system comprises a resource management module, a preprocessing module, a core processing module, and an output module. The method is as follows: experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build a domain knowledge base, and also identify and define knowledge points through the resource management module; maintainers write extraction rules through the rule base unit, according to the information extraction requirements, to form an information extraction rule base; the preprocessing module normalizes and segments the content of the judgment document; the core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base; and the output module outputs the extraction result. The invention has the advantages of broad applicability and low maintenance cost.

Description

Information extraction system based on domain-specialist knowledge system and information extraction method
Technical field
The present invention relates to the field of information extraction, and specifically to an information extraction system based on a domain-expert knowledge system and an information extraction method for that system.
Background
A court judgment is a legal term for the document a court writes to record its ruling. It is a common form of practical writing in the legal profession and includes civil judgments, criminal judgments, administrative judgments, and incidental civil judgments.
Under a new rule issued by the Supreme People's Court, legally effective court judgments have been published in full on the internet since 1 January 2014. Except for four classes of judgments (those involving state secrets, personal privacy, or juvenile crime, and those otherwise unsuitable for public posting), the public can consult them at any time.
At present, existing document extraction techniques are mainly rule-based. The information points they extract are scattered and unsystematic, so they cannot meet the changing demands of extraction tasks. In addition, existing text extraction techniques are costly to maintain and are not suitable for wide adoption.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the lack of general applicability and the high maintenance threshold of the prior art, by providing an information extraction system based on a domain-expert knowledge system and an information extraction method for that system.
The technical solution provided by the present invention is as follows. The invention discloses an information extraction system based on a domain-expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module. The resource management module manages the domain knowledge base and the information extraction rule base. The preprocessing module normalizes and segments the content of the judgment document. The core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written rule resources. The output module outputs the extraction result.
Preferably, the resource management module includes an expert knowledge base unit and a rule base unit. The expert knowledge base unit is used by experts to organize the knowledge of the judicial domain and build the domain knowledge base, and by experts in the judicial domain to identify knowledge points and define them. The rule base unit is used by maintainers to write extraction rules, according to the information extraction requirements, to form the information extraction rule base.
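As a rough illustration of the two resources the resource management module holds, the following Python sketch models a knowledge base of expert-defined knowledge points and a rule base of maintainer-written extraction rules. The patent does not specify any schema; the class names, field names, and the use of regular expressions as the rule format are all assumptions made for this example.

```python
from dataclasses import dataclass, field
import re


@dataclass
class KnowledgePoint:
    """A knowledge point defined by a domain expert (hypothetical schema)."""
    name: str
    definition: str


@dataclass
class ExtractionRule:
    """An extraction rule written by a maintainer (hypothetical schema)."""
    knowledge_point: str  # name of the knowledge point this rule fills
    pattern: str          # regular expression with one capture group


@dataclass
class ResourceManager:
    """Holds the domain knowledge base and the extraction rule base."""
    knowledge_base: dict = field(default_factory=dict)
    rule_base: list = field(default_factory=list)

    def define_point(self, name, definition):
        self.knowledge_base[name] = KnowledgePoint(name, definition)

    def add_rule(self, knowledge_point, pattern):
        # A rule may only target a knowledge point an expert has defined.
        if knowledge_point not in self.knowledge_base:
            raise KeyError(f"undefined knowledge point: {knowledge_point}")
        re.compile(pattern)  # reject malformed patterns early
        self.rule_base.append(ExtractionRule(knowledge_point, pattern))


resources = ResourceManager()
resources.define_point("case_number", "Docket number assigned by the court")
resources.add_rule("case_number", r"Case No\. (\S+)")
```

Tying each rule to a defined knowledge point is one way to keep the expert's definitions and the maintainer's rules consistent; other designs are equally compatible with the text.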
Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain-expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build the domain knowledge base; they also identify knowledge points through the resource management module and define them.
(2) Maintainers write extraction rules through the rule base unit, according to the information extraction requirements, to form the information extraction rule base.
(3) The preprocessing module normalizes and segments the content of the judgment document.
(4) The core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base.
(5) The output module outputs the extraction result.
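The automated part of these steps, (3) to (5), can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the representation of the rule base as (knowledge point, regular expression) pairs, and the sample document are all hypothetical.

```python
import re


def preprocess(text):
    """Step (3): normalize whitespace and split the document into paragraphs."""
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in text.split("\n")]
    return [p for p in paragraphs if p]


def extract(paragraphs, rule_base):
    """Step (4): apply every rule to every paragraph and collect matches."""
    results = []
    for index, paragraph in enumerate(paragraphs):
        for point_name, pattern in rule_base:
            for match in re.finditer(pattern, paragraph):
                results.append({
                    "paragraph": index,
                    "point": point_name,
                    "value": match.group(1),
                })
    return results


def run_pipeline(text, rule_base):
    """Steps (3) to (5): preprocess, extract, and return the result."""
    return extract(preprocess(text), rule_base)


# Hypothetical rule base and document, for illustration only.
rule_base = [("case_number", r"Case No\. (\S+)"), ("amount", r"(\d+) yuan")]
document = "Civil Judgment   Case No. 2016-123\n\nThe defendant shall pay 5000 yuan."
output = run_pipeline(document, rule_base)
```

Here `output` holds one record per extracted information point, tagged with the paragraph it came from.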
Preferably, step (3) proceeds as follows: to determine what each paragraph states, the paragraphs are classified using the naive Bayes method or the rule classification method and then ranked, which achieves automatic segmentation; finally, the classification result is output.
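The Bayes-based paragraph classification in step (3) can be sketched with a small multinomial naive Bayes classifier. The patent does not describe features, smoothing, or label names, so the whitespace tokenization, add-one smoothing, labels, and training sentences below are assumptions chosen only to keep the example self-contained.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, paragraphs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(paragraphs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of every label for one paragraph."""
        total = sum(self.label_counts.values())
        scores = {}
        for label, count in self.label_counts.items():
            denominator = sum(self.word_counts[label].values()) + len(self.vocabulary)
            score = math.log(count / total)
            for word in text.lower().split():
                score += math.log((self.word_counts[label][word] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical labelled paragraphs; real training data would be judgment text.
train = [
    ("the plaintiff claims the defendant owes rent", "claims"),
    ("the plaintiff requests compensation from the defendant", "claims"),
    ("the court finds the contract valid", "findings"),
    ("the court finds the evidence sufficient", "findings"),
]
model = NaiveBayes().fit([t for t, _ in train], [l for _, l in train])
label = model.predict("the court finds the lease valid")
```

For Chinese judgment text, the whitespace split would need to be replaced by a word segmenter; that choice is outside what the source specifies.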
Preferably, the rule classification method classifies paragraphs according to rules written by maintainers.
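A rule-based paragraph classifier of this kind might look like the sketch below. The rule format (one regular expression per label), the label names, and the 0/1 scoring are assumptions for illustration; the patent only says that maintainers write the rules.

```python
import re

# Hypothetical maintainer-written rules: (label, pattern) pairs, in priority order.
PARAGRAPH_RULES = [
    ("parties", r"\b(plaintiff|defendant)\b.*\b(born|residing)\b"),
    ("claims", r"\b(claims|requests)\b"),
    ("findings", r"\bthe court finds\b"),
    ("ruling", r"\bit is ordered\b"),
]


def rule_scores(paragraph):
    """Return 1.0 for each label whose rule matches the paragraph, else 0.0."""
    text = paragraph.lower()
    return {
        label: 1.0 if re.search(pattern, text) else 0.0
        for label, pattern in PARAGRAPH_RULES
    }


def classify_by_rules(paragraph):
    """Return the first matching label in rule order, or None if nothing matches."""
    for label, score in rule_scores(paragraph).items():
        if score > 0:
            return label
    return None
```

Returning a per-label score, rather than only a single label, keeps the rule output usable as one input to a weighted ranking.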
Preferably, the ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayes classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
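The weighted combination can be written directly in code. In the sketch below the function names and the sample scores are hypothetical, and the weights are fixed by hand purely for illustration; the source states that w1 and w2 are obtained by training but does not say how.

```python
def combined_score(f_bayesian, f_rule, w1, w2):
    """f_score = w1 * f_Bayesian + w2 * f_Rule for one paragraph and one label."""
    return w1 * f_bayesian + w2 * f_rule


def rank_labels(bayes_scores, rule_scores, w1, w2):
    """Return (label, score) pairs sorted from highest to lowest combined score."""
    labels = set(bayes_scores) | set(rule_scores)
    totals = {
        label: combined_score(
            bayes_scores.get(label, 0.0), rule_scores.get(label, 0.0), w1, w2
        )
        for label in labels
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative numbers only; the patent obtains w1 and w2 by training.
bayes = {"claims": 0.6, "findings": 0.4}
rules = {"claims": 0.0, "findings": 1.0}
ranking = rank_labels(bayes, rules, w1=0.5, w2=0.5)
```

With these sample values the rule match outweighs the slightly higher Bayes score for "claims", so "findings" ranks first.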
Preferably, step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because a judgment document contains many information points of many types, different types need to be identified with different methods.
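One way to read this step is as a dispatch table: the paragraph label from step (3) selects which extractors run on that paragraph. The sketch below assumes that reading; the labels, the two regular-expression extractors, and the sample paragraphs are hypothetical.

```python
import re


def extract_dates(paragraph):
    """Find ISO-style dates such as 2017-01-31."""
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", paragraph)


def extract_amounts(paragraph):
    """Find monetary amounts written as '<number> yuan'."""
    return re.findall(r"\b\d+(?:\.\d+)? yuan\b", paragraph)


# Hypothetical mapping from paragraph label to the extractors that apply to it.
EXTRACTORS_BY_LABEL = {
    "findings": {"date": extract_dates},
    "ruling": {"date": extract_dates, "amount": extract_amounts},
}


def extract_points(labelled_paragraphs):
    """Run only the extractors registered for each paragraph's label."""
    points = []
    for label, paragraph in labelled_paragraphs:
        for point_type, extractor in EXTRACTORS_BY_LABEL.get(label, {}).items():
            for value in extractor(paragraph):
                points.append((label, point_type, value))
    return points


points = extract_points([
    ("findings", "The contract was signed on 2015-03-01 for 800 yuan."),
    ("ruling", "The defendant shall pay 5000 yuan before 2017-01-31."),
])
```

Note that the amount in the "findings" paragraph is not extracted, because no amount extractor is registered for that label in this example.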
Compared with the prior art, the present invention has the following advantages:
The invention centers on a domain business knowledge system organized by experts, and on an architecture built from a preprocessing module and a core processing module. The preprocessing module first normalizes and segments the content of the judgment document. Although there are drafting standards for judgment documents, they only specify which information a document should contain and its rough division into blocks, so each judge has a certain degree of freedom when writing one. The purpose of segmentation is to determine what each paragraph states and to label each paragraph, which is the precondition for the subsequent extraction of information points. The core processing module then extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base. The invention therefore greatly improves the general applicability of the extraction system and eases its maintenance threshold, so that it can cope with changing information extraction demands.
Description of the drawings
Fig. 1 is a system block diagram of the information extraction system based on a domain-expert knowledge system of the present invention;
Fig. 2 is a schematic diagram of Embodiment 1 of the present invention;
Fig. 3 is a structural diagram of step (3) of the information extraction system based on a domain-expert knowledge system of the present invention.
Detailed description
Referring to Figs. 1 to 3, the invention discloses an information extraction system based on a domain-expert knowledge system, comprising a resource management module 1, a preprocessing module 2, a core processing module 3, and an output module 4. The resource management module 1 manages the domain knowledge base and the information extraction rule base. The preprocessing module 2 normalizes and segments the content of the judgment document. The core processing module 3 extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written rule resources. The output module 4 outputs the extraction result.
Preferably, the resource management module 1 includes an expert knowledge base unit 11 and a rule base unit 12. The expert knowledge base unit 11 is used by experts to organize the knowledge of the judicial domain and build the domain knowledge base, and by experts in the judicial domain to identify knowledge points and define them. The rule base unit 12 is used by maintainers to write extraction rules, according to the information extraction requirements, to form the information extraction rule base.
Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain-expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit 11 to build the domain knowledge base; they also identify knowledge points through the resource management module and define them.
(2) Maintainers write extraction rules through the rule base unit 12, according to the information extraction requirements, to form the information extraction rule base.
(3) The preprocessing module 2 normalizes and segments the content of the judgment document.
(4) The core processing module 3 extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base.
(5) The output module 4 outputs the extraction result.
Preferably, step (3) proceeds as follows: to determine what each paragraph states, the paragraphs are classified using the naive Bayes method or the rule classification method and then ranked, which achieves automatic segmentation; finally, the classification result is output.
Preferably, the rule classification method classifies paragraphs according to rules written by maintainers.
Preferably, the ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayes classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
Preferably, step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because a judgment document contains many information points of many types, different types need to be identified with different methods.
Embodiment 1
An information extraction method for the above information extraction system based on a domain-expert knowledge system, with the following steps:
(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit 11 to build the domain knowledge base; they also identify knowledge points through the resource management module and define them.
(2) Maintainers write extraction rules through the rule base unit 12, according to the information extraction requirements, to form the information extraction rule base.
(3) The preprocessing module 2 normalizes and segments the content of the judgment document. Specifically, to determine what each paragraph states, the paragraphs are classified using the naive Bayes method and then ranked, which achieves automatic segmentation; finally, the classification result is output. The ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayes classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
(4) The core processing module 3 extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base. This process extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because a judgment document contains many information points of many types, different types need to be identified with different methods. Take person names, place names, times, and organization names as examples: identifying such types is called Named Entity Recognition (NER) in the field of natural language understanding, and this system uses a method that combines statistics and rules, supplemented by part-of-speech information, to reach a combined judgment. Judgment documents also contain descriptions of various complex relationships; for these, this system mainly uses rules, defining extraction templates for multiple kinds of relationships, supplemented by simple inference.
(5) The output module 4 outputs the extraction result.
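The rule side of step (4) in this embodiment can be illustrated with a small sketch: dictionary and pattern lookups for entities, plus one relation template with a simple consistency check. It does not implement the statistical component or the part-of-speech check the embodiment mentions; a fixed name list stands in for them. The name lists, the "shall pay" template, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical gazetteers standing in for the statistical component.
PERSON_NAMES = {"Zhang San", "Li Si"}
ORGANIZATIONS = {"Hefei Intermediate People's Court"}

# Hypothetical relation template: "<person> shall pay <person> <amount> yuan".
PAY_TEMPLATE = re.compile(
    r"(?P<payer>[A-Z]\w+ [A-Z]\w+) shall pay "
    r"(?P<payee>[A-Z]\w+ [A-Z]\w+) (?P<amount>\d+) yuan"
)


def recognize_entities(text):
    """Return (type, value) pairs found by dictionary and pattern lookup."""
    entities = [("PERSON", name) for name in sorted(PERSON_NAMES) if name in text]
    entities += [("ORG", name) for name in sorted(ORGANIZATIONS) if name in text]
    entities += [("TIME", d) for d in re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)]
    return entities


def extract_pay_relations(text):
    """Apply the payment template and keep matches whose arguments are known persons."""
    relations = []
    for match in PAY_TEMPLATE.finditer(text):
        payer, payee = match.group("payer"), match.group("payee")
        # Simple check: both arguments must have been recognized as persons.
        if payer in PERSON_NAMES and payee in PERSON_NAMES:
            relations.append((payer, "pays", payee, int(match.group("amount"))))
    return relations


sentence = (
    "On 2016-05-20 the Hefei Intermediate People's Court ordered that "
    "Zhang San shall pay Li Si 3000 yuan."
)
entities = recognize_entities(sentence)
relations = extract_pay_relations(sentence)
```

In a fuller system the gazetteer lookup would be replaced or complemented by a trained sequence model, with the templates consuming that model's entity output.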
The above embodiment merely illustrates the principles and effects of the present invention and is not intended to limit it. Anyone skilled in the art may modify or change the above embodiment without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall be covered by the claims of the present invention.

Claims (7)

1. An information extraction system based on a domain-expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module, characterized in that: the resource management module manages the domain knowledge base and the information extraction rule base; the preprocessing module normalizes and segments the content of the judgment document; the core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written rule resources; and the output module outputs the extraction result.
2. The information extraction system based on a domain-expert knowledge system according to claim 1, characterized in that: the resource management module includes an expert knowledge base unit and a rule base unit; the expert knowledge base unit is used by experts to organize the knowledge of the judicial domain and build the domain knowledge base, and by experts in the judicial domain to identify knowledge points and define them; and the rule base unit is used by maintainers to write extraction rules, according to the information extraction requirements, to form the information extraction rule base.
3. An information extraction method for the information extraction system based on a domain-expert knowledge system according to claim 1 or 2, characterized by the following steps:
(1) experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build the domain knowledge base, and also identify knowledge points through the resource management module and define them;
(2) maintainers write extraction rules through the rule base unit, according to the information extraction requirements, to form the information extraction rule base;
(3) the preprocessing module normalizes and segments the content of the judgment document;
(4) the core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base;
(5) the output module outputs the extraction result.
4. The information extraction method according to claim 3, characterized in that step (3) proceeds as follows: to determine what each paragraph states, the paragraphs are classified using the naive Bayes method or the rule classification method and then ranked, which achieves automatic segmentation; finally, the classification result is output.
5. The information extraction method according to claim 4, characterized in that: the rule classification method classifies paragraphs according to rules written by maintainers.
6. The information extraction method according to claim 4, characterized in that: the ranking algorithm is $f_{score} = w_1 \cdot f_{Bayesian} + w_2 \cdot f_{Rule}$,
where $f_{score}$ is the total score for assigning label A to the paragraph, $f_{Bayesian}$ is the Bayes classification score for assigning label A to the paragraph, $f_{Rule}$ is the rule-matching score for assigning label A to the paragraph, and $w_1$ and $w_2$ are weight coefficients obtained by training.
7. The information extraction method according to claim 3, characterized in that: step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3); because a judgment document contains many information points of many types, different types need to be identified with different methods.
CN201710289555.6A 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method Pending CN108804408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710289555.6A CN108804408A (en) 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710289555.6A CN108804408A (en) 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method

Publications (1)

Publication Number Publication Date
CN108804408A true CN108804408A (en) 2018-11-13

Family

ID=64069303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710289555.6A Pending CN108804408A (en) 2017-04-27 2017-04-27 Information extraction system based on domain-specialist knowledge system and information extraction method

Country Status (1)

Country Link
CN (1) CN108804408A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753020A (en) * 2019-03-28 2020-10-09 阿里巴巴集团控股有限公司 Method and device for establishing relational network model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8024329B1 (en) * 2006-06-01 2011-09-20 Monster Worldwide, Inc. Using inverted indexes for contextual personalized information retrieval
CN102194013A (en) * 2011-06-23 2011-09-21 上海毕佳数据有限公司 Domain-knowledge-based short text classification method and text classification system
CN103049490A (en) * 2012-12-05 2013-04-17 北京海量融通软件技术有限公司 Attribute generation system and generation method among knowledge network nodes
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN103618652A (en) * 2013-12-17 2014-03-05 沈阳觉醒软件有限公司 Audit and depth analysis system and audit and depth analysis method of business data
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN105069560A (en) * 2015-07-30 2015-11-18 中国科学院软件研究所 Resume information extraction and characteristic identification analysis system and method based on knowledge base and rule base
CN105574084A (en) * 2015-12-10 2016-05-11 天津海量信息技术有限公司 Extraction method of case information in webpage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邱莉榕 等编著: "《算法设计与优化》", 31 December 2016, 北京:中央民族大学出版社 *

Similar Documents

Publication Publication Date Title
CN103631859B (en) Intelligent review expert recommending method for science and technology projects
CN105740228B (en) A kind of internet public feelings analysis method and system
CN110619568A (en) Risk assessment report generation method, device, equipment and storage medium
CN110472017A (en) A kind of analysis of words art and topic point identify matched method and system
CN105095190B (en) A kind of sentiment analysis method combined based on Chinese semantic structure and subdivision dictionary
CN107403375A (en) A kind of listed company's bulletin classification and abstraction generating method based on deep learning
CN108763483A (en) A kind of Text Information Extraction method towards judgement document
CN106649223A (en) Financial report automatic generation method based on natural language processing
CN107766371A (en) A kind of text message sorting technique and its device
CN104573094B (en) Network account identifies matching process
CN107273295B (en) Software problem report classification method based on text chaos
CN106447490A (en) Credit investigation application method based on user figures
CN107301170A (en) The method and apparatus of cutting sentence based on artificial intelligence
CN111597331B (en) Referee document classification method based on Bayesian network
CN106022708A (en) Method for predicting employee resignation
CN111259160B (en) Knowledge graph construction method, device, equipment and storage medium
CN109460551A (en) Signing messages extracting method and device
CN106709804A (en) Interactive wealth planning consulting robot system
CN105488098B (en) A kind of new words extraction method based on field otherness
CN109165337A (en) A kind of method and system of knowledge based map construction bidding field association analysis
CN108241867A (en) A kind of sorting technique and device
CN109241297A (en) A kind of classifying content polymerization, electronic equipment, storage medium and engine
CN110034966A (en) A kind of method for classifying data stream and system based on machine learning
CN115423639A (en) Social network-oriented secure community discovery method
CN112580332A (en) Enterprise portrait method based on label layering and deepening modeling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181113