CN108804408A

CN108804408A - Information extraction system based on a domain expert knowledge system, and information extraction method

Info

Publication number: CN108804408A
Application number: CN201710289555.6A
Authority: CN
Inventors: 司华建; 贾真; 耿伟; 金重九
Original assignee: Anhui Fu Chi Information Technology Co Ltd
Current assignee: Anhui Fu Chi Information Technology Co Ltd
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2018-11-13

Abstract

The invention discloses an information extraction system based on a domain expert knowledge system, and an information extraction method thereof. The system comprises a resource management module, a preprocessing module, a core processing module, and an output module. The information extraction method is as follows: experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build a domain knowledge base, and also identify and define knowledge points through the resource management module; maintenance personnel write extraction rules through the rule base unit, according to the information extraction requirements, to form an information extraction rule base; the preprocessing module normalizes and segments the content of the judgment document; the core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base; and the output module outputs the extraction results. The invention offers advantages such as high generality and low maintenance cost.

Description

Information extraction system based on a domain expert knowledge system, and information extraction method

Technical field

The present invention relates to the field of information extraction, and specifically to an information extraction system based on a domain expert knowledge system and an information extraction method thereof.

Background art

A judgment (court verdict) is a legal term for the document a court writes in accordance with its decision. It is a commonly used practical writing genre in the legal profession and includes civil judgments, criminal judgments, administrative judgments, and judgments on incidental civil actions.

Under a new rule issued by the Supreme People's Court, the effective judgments of the courts have been published in full on the internet since January 1, 2014. Except for four classes of judgments (those involving state secrets, personal privacy, or juvenile crime, and those otherwise unsuitable for public disclosure), the public can consult them at any time.

At present, existing document extraction techniques are mainly rule-based. The information points they extract are scattered and unsystematic, so they cannot meet the changing demands of extraction tasks. In addition, existing text extraction techniques have high maintenance costs and are unsuitable for wide adoption.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the defects of the prior art, namely its lack of generality and its high maintenance threshold, by providing an information extraction system based on a domain expert knowledge system and an information extraction method thereof.

The technical solution provided by the present invention is as follows. The invention discloses an information extraction system based on a domain expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module. The resource management module is used to manage the domain knowledge base and the information extraction rule base. The preprocessing module is used to normalize and segment the content of a judgment document. The core processing module is used to extract information points with an information extraction algorithm, based on the domain knowledge base and manually written rule resources. The output module is used to output the extraction results.

Preferably, the resource management module comprises an expert knowledge base unit and a rule base unit. The expert knowledge base unit is used by experts to organize judicial knowledge in order to build the domain knowledge base, and by experts in the judicial domain to identify and define knowledge points. The rule base unit is used by maintenance personnel to write extraction rules, according to the information extraction requirements, to form the information extraction rule base.

Preferably, the invention also discloses an information extraction method for the above information extraction system based on a domain expert knowledge system, with the following specific steps:

(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build the domain knowledge base, and also identify and define knowledge points through the resource management module;

(2) Maintenance personnel write extraction rules through the rule base unit, according to the information extraction requirements, to form the information extraction rule base;

(3) The preprocessing module normalizes and segments the content of the judgment document;

(4) The core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base;

(5) The output module outputs the extraction results.
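
The five steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the names `ResourceManager`, `preprocess`, `extract`, and `render_output`, the English sample text, and the `case_number` rule are all hypothetical, and a single regular expression per knowledge point stands in for the information extraction algorithm, which the text does not specify.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Holds the two resources built in steps (1) and (2) (hypothetical layout)."""
    knowledge_points: dict = field(default_factory=dict)   # name -> expert definition
    extraction_rules: dict = field(default_factory=dict)   # name -> regex with one group


def preprocess(text: str) -> list[str]:
    """Step (3): normalize whitespace and split the document into paragraphs."""
    paragraphs = [re.sub(r"\s+", " ", p).strip() for p in text.split("\n")]
    return [p for p in paragraphs if p]


def extract(paragraphs: list[str], resources: ResourceManager) -> dict:
    """Step (4): apply each hand-written rule to the paragraphs in order."""
    results = {}
    for name, pattern in resources.extraction_rules.items():
        if name not in resources.knowledge_points:
            continue  # only extract points that an expert has defined
        for paragraph in paragraphs:
            match = re.search(pattern, paragraph)
            if match:
                results[name] = match.group(1)
                break
    return results


def render_output(results: dict) -> str:
    """Step (5): render the extracted information points as text."""
    return "\n".join(f"{name}: {value}" for name, value in sorted(results.items()))


resources = ResourceManager(
    knowledge_points={"case_number": "Docket identifier of the case"},
    extraction_rules={"case_number": r"Case No\.\s*(\S+)"},
)
document = "Civil Judgment\n\nCase No.  2017-0042\nThe plaintiff claims ..."
extracted = extract(preprocess(document), resources)
print(render_output(extracted))
```

Keeping the knowledge points and the rules in separate mappings mirrors the split between the expert knowledge base unit and the rule base unit: in this sketch a rule is ignored unless an expert has defined the knowledge point it targets.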

Preferably, step (3) is carried out as follows: determine what content each paragraph conveys, classify the paragraphs using the naive Bayes method or a rule-based classification method, then rank the results, thereby achieving automatic segmentation, and finally output the classification results.
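
As a rough illustration of the naive Bayes option, the sketch below implements a multinomial naive Bayes classifier with add-one smoothing. The class name `NaiveBayesParagraphClassifier`, the labels, and the training sentences are hypothetical, and whitespace tokenization is an assumption made for the English sample; Chinese judgment text would need a word segmenter first.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesParagraphClassifier:
    """Multinomial naive Bayes over whitespace tokens with add-one smoothing."""

    def fit(self, paragraphs, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(paragraphs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, paragraph):
        """Return label -> log P(label) + sum of log P(word | label)."""
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, doc_count in self.label_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + vocab_size
            score = math.log(doc_count / total_docs)
            for word in paragraph.lower().split():
                if word in self.vocabulary:  # words never seen in training are ignored
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, paragraph):
        scores = self.log_scores(paragraph)
        return max(scores, key=scores.get)


training = [
    ("the plaintiff and the defendant appeared", "parties"),
    ("plaintiff zhang defendant li", "parties"),
    ("the court orders the defendant to pay", "verdict"),
    ("the court orders compensation", "verdict"),
]
classifier = NaiveBayesParagraphClassifier().fit(*zip(*training))
print(classifier.predict("the court orders payment"))
```

Exponentiating and normalizing the values returned by `log_scores` gives per-label posterior probabilities under this model, which is one plausible way to obtain a Bayesian score per label for later ranking.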

Preferably, the rule-based classification method classifies paragraphs according to rules written by maintenance personnel.
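
A minimal sketch of rule-based paragraph classification follows. The rule table `PARAGRAPH_RULES`, its labels and patterns, and the scoring choice (the fraction of a label's rules that match) are assumptions for illustration; the text does not say how maintainer rules are written or scored.

```python
import re
from typing import Optional

# Hypothetical maintainer-written rules: label -> list of regex patterns.
PARAGRAPH_RULES = {
    "parties": [r"\bplaintiff\b", r"\bdefendant\b"],
    "court_findings": [r"\bthe court finds\b", r"\bupon examination\b"],
    "verdict": [r"\bit is ordered\b", r"\bjudgment is entered\b"],
}


def rule_scores(paragraph: str) -> dict[str, float]:
    """Return, per label, the fraction of that label's rules that match."""
    text = paragraph.lower()
    return {
        label: sum(bool(re.search(p, text)) for p in patterns) / len(patterns)
        for label, patterns in PARAGRAPH_RULES.items()
    }


def classify_by_rules(paragraph: str) -> Optional[str]:
    """Pick the best-scoring label, or None when no rule matches at all."""
    scores = rule_scores(paragraph)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(classify_by_rules("Plaintiff Zhang sued defendant Li."))
```

Returning a score per label rather than a single yes/no answer keeps the rule output usable as one term of a combined paragraph score.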

Preferably, the ranking algorithm is f_score = w₁ * f_Bayesian + w₂ * f_Rule,

where f_score is the total score for assigning label A to the paragraph, f_Bayesian is the Bayesian classification score for assigning label A to the paragraph, f_Rule is the rule-matching score for assigning label A to the paragraph, and w₁ and w₂ are weight coefficients obtained by training.
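
The ranking formula can be sketched as below. The function names, the example scores, and the default weights 0.6 and 0.4 are illustrative assumptions; the text only states that w₁ and w₂ are obtained by training.

```python
def combined_score(f_bayesian: float, f_rule: float, w1: float, w2: float) -> float:
    """f_score = w1 * f_Bayesian + w2 * f_Rule for one (paragraph, label) pair."""
    return w1 * f_bayesian + w2 * f_rule


def rank_labels(bayes: dict, rules: dict, w1: float = 0.6, w2: float = 0.4) -> list:
    """Sort candidate labels by descending f_score; a missing score counts as 0."""
    labels = set(bayes) | set(rules)
    scored = {
        label: combined_score(bayes.get(label, 0.0), rules.get(label, 0.0), w1, w2)
        for label in labels
    }
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical per-label scores for a single paragraph.
bayes = {"parties": 0.2, "verdict": 0.7, "court_findings": 0.1}
rules = {"parties": 0.5, "verdict": 0.5}
ranking = rank_labels(bayes, rules)
print(ranking[0][0])
```

The first entry of `ranking` is the label that would be assigned to the paragraph; the remaining entries show how close the alternatives were.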

Preferably, step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different types need to be identified with different methods.

Compared with the prior art, the present invention has the following beneficial effects:

The present invention centers on an organized domain business knowledge system and adopts an architecture built around a preprocessing module and a core processing module. The preprocessing module first normalizes and segments the content of the judgment document. Although there are standards for writing judgment documents, those standards only specify which information a document should contain and its rough division into blocks, so each judge has a certain degree of freedom when writing. The purpose of segmentation is to determine what content each paragraph conveys; labeling each paragraph is the prerequisite for the subsequent extraction of information points. The core processing module then extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base. This greatly improves the generality of the extraction system and lowers its maintenance threshold, so that it can cope with changing information extraction demands.

Brief description of the drawings

Fig. 1 is a system block diagram of the information extraction system based on a domain expert knowledge system according to the present invention;

Fig. 2 is a schematic diagram of Embodiment 1 of the present invention;

Fig. 3 is a schematic structural diagram of step (3) of the information extraction system based on a domain expert knowledge system according to the present invention.

Detailed description

Referring to Figs. 1 to 3, the invention discloses an information extraction system based on a domain expert knowledge system, comprising a resource management module 1, a preprocessing module 2, a core processing module 3, and an output module 4. The resource management module 1 is used to manage the domain knowledge base and the information extraction rule base. The preprocessing module 2 is used to normalize and segment the content of a judgment document. The core processing module 3 is used to extract information points with an information extraction algorithm, based on the domain knowledge base and manually written rule resources. The output module 4 is used to output the extraction results.

Preferably, the resource management module 1 comprises an expert knowledge base unit 11 and a rule base unit 12. The expert knowledge base unit 11 is used by experts to organize judicial knowledge in order to build the domain knowledge base, and by experts in the judicial domain to identify and define knowledge points. The rule base unit 12 is used by maintenance personnel to write extraction rules, according to the information extraction requirements, to form the information extraction rule base.

(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit 11 to build the domain knowledge base, and also identify and define knowledge points through the resource management module;

(2) Maintenance personnel write extraction rules through the rule base unit 12, according to the information extraction requirements, to form the information extraction rule base;

(3) The preprocessing module 2 normalizes and segments the content of the judgment document;

(4) The core processing module 3 extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base;

(5) The output module 4 outputs the extraction results.

Preferably, the ranking algorithm is f_score = w₁ * f_Bayesian + w₂ * f_Rule.
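
The weight coefficients w₁ and w₂ are described as obtained by training, but the training procedure is not given. One simple possibility, shown purely as an assumption, is a grid search over w₁ with w₂ = 1 - w₁ that maximizes labeling accuracy on annotated paragraphs. The function `fit_weights`, the constraint on the weights, and the sample data are all hypothetical.

```python
def fit_weights(samples, step=0.1):
    """Grid-search w1 in [0, 1] with w2 = 1 - w1 (an assumed constraint).

    samples: list of (bayes_scores, rule_scores, true_label) tuples, where the
    two score arguments are dicts mapping label -> score.
    Returns (w1, w2, accuracy) for the first weight pair with the best accuracy.
    """
    best_w1, best_correct = 0.0, -1
    steps = int(round(1 / step))
    for i in range(steps + 1):
        w1 = i * step
        w2 = 1.0 - w1
        correct = 0
        for bayes, rules, truth in samples:
            labels = sorted(set(bayes) | set(rules))  # sorted for deterministic ties
            predicted = max(
                labels,
                key=lambda l: w1 * bayes.get(l, 0.0) + w2 * rules.get(l, 0.0),
            )
            correct += predicted == truth
        if correct > best_correct:
            best_w1, best_correct = w1, correct
    return best_w1, 1.0 - best_w1, best_correct / len(samples)


# Hypothetical annotated paragraphs: two where the Bayesian score is right and
# the rule score is wrong, and one where the opposite holds.
samples = [
    ({"a": 0.9, "b": 0.1}, {"a": 0.0, "b": 1.0}, "a"),
    ({"a": 0.8, "b": 0.2}, {"a": 0.0, "b": 1.0}, "a"),
    ({"a": 0.4, "b": 0.6}, {"a": 1.0, "b": 0.0}, "a"),
]
w1, w2, accuracy = fit_weights(samples)
print(round(w1, 2), round(w2, 2), accuracy)
```

A grid search is chosen here only because it is easy to inspect; any method that fits two weights against labeled paragraphs would serve the same role.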

Embodiment 1

An information extraction method for the above information extraction system based on a domain expert knowledge system, with the following specific steps:

(3) The preprocessing module 2 normalizes and segments the content of the judgment document. Specifically, the content conveyed by each paragraph is determined, the paragraphs are classified using the naive Bayes method and then ranked, thereby achieving automatic segmentation, and finally the classification results are output. The ranking algorithm is f_score = w₁ * f_Bayesian + w₂ * f_Rule,

where f_score is the total score for assigning label A to the paragraph, f_Bayesian is the Bayesian classification score for assigning label A to the paragraph, f_Rule is the rule-matching score for assigning label A to the paragraph, and w₁ and w₂ are weight coefficients obtained by training;

(4) The core processing module 3 extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base. This process extracts different information points from each paragraph according to the automatic segmentation result of step (3). Because judgment documents contain many information points of many types, different types need to be identified with different methods. Take person names, place names, times, and organization names as examples: in the field of natural language understanding, recognizing such types is called named entity recognition (NER). For these, the system uses a method that combines statistics and rules, supplemented by a comprehensive judgment based on part of speech. Judgment documents also contain a variety of complex relationship descriptions. For these, the system mainly uses a rule-based approach, defining extraction templates for multiple kinds of relationships, supplemented by simple inference;

(5) The output module 4 outputs the extraction results.
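
The rule-based side of step (4) can be sketched as follows. Only the rule part is shown; the statistical tagger and part-of-speech cues mentioned in the text are omitted. The entity patterns, the loan template, the English sample sentence, and the consistency check are hypothetical stand-ins for the extraction templates and the simple inference that the text describes without detail.

```python
import re

# Hypothetical entity rules for two of the entity types named in the text.
ENTITY_RULES = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "ORG": re.compile(r"\b(?:[A-Z][a-z]+ )+Court\b"),
}

# Hypothetical relation template: "<borrower> borrowed <amount> yuan from <lender>".
LOAN_TEMPLATE = re.compile(
    r"(?P<borrower>[A-Z][a-z]+) borrowed (?P<amount>\d[\d,]*) yuan "
    r"from (?P<lender>[A-Z][a-z]+)"
)


def extract_entities(text: str) -> dict:
    """Return entity type -> list of matched strings."""
    return {label: rule.findall(text) for label, rule in ENTITY_RULES.items()}


def extract_loans(text: str) -> list:
    """Apply the relation template, then a simple consistency check per match."""
    relations = []
    for m in LOAN_TEMPLATE.finditer(text):
        amount = int(m.group("amount").replace(",", ""))
        # Simple inference: a party cannot lend to itself, and the amount must be positive.
        if m.group("borrower") != m.group("lender") and amount > 0:
            relations.append(
                (m.group("borrower"), "borrowed_from", m.group("lender"), amount)
            )
    return relations


sentence = (
    "On 2016-03-01 Zhang borrowed 50,000 yuan from Li. "
    "Hefei District Court accepted the case on 2017-01-15."
)
entities = extract_entities(sentence)
loans = extract_loans(sentence)
print(entities)
print(loans)
```

Named groups in the template make each relation slot explicit, so a maintainer could add a new relation type by writing one more pattern rather than changing the extraction code.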

The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims

1. An information extraction system based on a domain expert knowledge system, comprising a resource management module, a preprocessing module, a core processing module, and an output module, characterized in that: the resource management module is used to manage the domain knowledge base and the information extraction rule base; the preprocessing module is used to normalize and segment the content of a judgment document; the core processing module is used to extract information points with an information extraction algorithm, based on the domain knowledge base and manually written rule resources; and the output module is used to output the extraction results.

2. The information extraction system based on a domain expert knowledge system according to claim 1, characterized in that: the resource management module comprises an expert knowledge base unit and a rule base unit; the expert knowledge base unit is used by experts to organize judicial knowledge in order to build the domain knowledge base, and by experts in the judicial domain to identify and define knowledge points; and the rule base unit is used by maintenance personnel to write extraction rules, according to the information extraction requirements, to form the information extraction rule base.

3. An information extraction method for the information extraction system based on a domain expert knowledge system according to claim 1 or 2, characterized in that the specific steps are as follows:

(1) Experts in the judicial domain organize judicial knowledge through the expert knowledge base unit to build the domain knowledge base, and also identify and define knowledge points through the resource management module;

(2) Maintenance personnel write extraction rules through the rule base unit, according to the information extraction requirements, to form the information extraction rule base;

(3) The preprocessing module normalizes and segments the content of the judgment document;

(4) The core processing module extracts information points with an information extraction algorithm, based on the domain knowledge base and the manually written information extraction rule base;

(5) The output module outputs the extraction results.

4. The information extraction method for the information extraction system based on a domain expert knowledge system according to claim 3, characterized in that step (3) is carried out as follows: determine what content each paragraph conveys, classify the paragraphs using the naive Bayes method or a rule-based classification method, then rank the results, thereby achieving automatic segmentation, and finally output the classification results.

5. The information extraction method for the information extraction system based on a domain expert knowledge system according to claim 4, characterized in that: the rule-based classification method classifies paragraphs according to rules written by maintenance personnel.

6. The information extraction method for the information extraction system based on a domain expert knowledge system according to claim 4, characterized in that: the ranking algorithm is f_score = w₁ * f_Bayesian + w₂ * f_Rule,

where f_score is the total score for assigning label A to the paragraph, f_Bayesian is the Bayesian classification score for assigning label A to the paragraph, f_Rule is the rule-matching score for assigning label A to the paragraph, and w₁ and w₂ are weight coefficients obtained by training.

7. The information extraction method for the information extraction system based on a domain expert knowledge system according to claim 3, characterized in that: step (4) extracts different information points from each paragraph according to the automatic segmentation result of step (3), and because judgment documents contain many information points of many types, different types need to be identified with different methods.