CN113609838B

CN113609838B - Document information extraction and mapping method and system

Info

Publication number: CN113609838B
Application number: CN202110795366.2A
Authority: CN
Inventors: 牛硕硕; 王金华; 王盼盼; 李德启; 黄哲
Original assignee: CETC 32 Research Institute
Current assignee: CETC 32 Research Institute
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2024-05-24
Anticipated expiration: 2041-07-14
Also published as: CN113609838A

Abstract

The invention provides a method and a system for extracting document information and mapping it into a graph, comprising the following steps: step 1: obtaining the word-formation features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching; step 2: obtaining the word-formation features of the document in the same way, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching; step 3: mapping the extracted entity, relation and attribute triples to generate a document graph. The invention extracts the relations and attributes of a document on the basis of syntactic and semantic rules, without the data labeling and model training that machine learning methods require, which improves extraction efficiency and reduces the computing resources consumed during extraction.

Description

Document information extraction and mapping method and system

Technical Field

The invention relates to the technical field of natural language understanding and processing, in particular to a method and a system for extracting and mapping document information, and more particularly to a method for extracting and mapping requirement management document information based on syntactic and semantic rules.

Background

With the advent of the information and internet age, the construction of information resources has become the core of the army's current informatization effort. Military equipment is being rapidly updated and upgraded, military organizations and personnel are being redeployed and re-planned, military tactics are being renewed, the volume of army project construction and requirement tasks is growing, and the degree of automation of military information processing needs to be raised further.

Accurate data analysis plays an increasingly prominent role in modern military intelligence research, and the large amount of information available as electronic documents provides the basic conditions for information extraction, data analysis and knowledge graph construction. Military information automation needs to extract the most useful information from military electronic material in real time, mine valuable military intelligence from massive information using data mining and natural language processing, allocate battlefield information resources sensibly across the whole theater of operations, provide army decision makers with comprehensive data evaluation and reliable analysis results, and help them make decisions quickly.

Military requirement documents are important documents for military technical research and project management, and serve as the bridge between the conception of a requirement and its implementation. Faced with a massive number of requirement documents, decision-making staff urgently need automated tools that apply a suitable extraction method to quickly extract entities, relations and attributes from the text and obtain an overall picture of the requirements.

Existing information extraction techniques mostly rely on deep learning, which generally consumes a great deal of manpower and material resources to preprocess and label data and huge computing resources to train models. Moreover, existing techniques usually target concrete entities, whereas most entities to be extracted from military requirement management documents are abstract concepts such as functions, concepts, system descriptions and roles, and the relations to be extracted, such as composition, inclusion, input and output, are likewise fairly abstract. A method is therefore needed that combines natural language processing with lexical, syntactic and semantic features to formulate extraction rules for military requirement management documents and extracts entities, relations and attributes from the perspective of language structure. This reduces, to some extent, the manpower and material cost of data labeling, and because the text is analyzed in terms of its linguistic structure, the results are highly interpretable.

Patent document CN106874378A (application number: CN201710006826.2) discloses a method for constructing a knowledge graph based on entity extraction and relation mining with a rule model. However, that patent extracts from semi-structured encyclopedia data and relies only weakly on natural language processing techniques such as lexical, syntactic and semantic analysis.

Patent document CN108319586A (application number: CN201810097357.4) discloses a method and apparatus for generating information extraction rules and performing semantic analysis. However, that patent cannot prune incorrectly recognized entity words or obtain the classification of entity words, and therefore cannot achieve the purpose of extracting entity words from military requirement documents.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method and a system for extracting and mapping document information.

The document information extraction and mapping method provided by the invention comprises the following steps:

Step 1: obtaining the word-formation features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching;

Step 2: obtaining the word-formation features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching;

Step 3: mapping the extracted entity, relation and attribute triples to generate a document graph.

Preferably, step 1 includes:

Step 1.1: calling the part-of-speech tagging service of a natural language processing platform to obtain lexical feature information, including word segmentation, part-of-speech tags, word length, word offset and word position;

Step 1.2: calling the dependency syntax analysis service of the natural language processing platform, analyzing the lexical feature information to obtain dependency syntax tree information, and obtaining compound noun entities through linguistic analysis of Chinese word formation;

Step 1.3: according to the characteristics of the entity words of the document, pruning incorrectly identified entity words from the dependency syntax tree by means of stop words and trigger words, obtaining the classification of the entity words, and performing entity extraction with the formulated rules and the added general words and trigger words to obtain the entity extraction result.

Preferably, step 2 includes:

Step 2.1: calling the dependency syntax analysis and semantic role labeling services of the natural language processing platform to analyze the dependency syntax and semantic roles of the requirement items, and obtaining the results of the dependency syntax analysis and semantic role labeling;

Step 2.2: scanning the item to obtain relation words, mapping the relation words onto the core word of the dependency syntax analysis, and mapping the relation words onto the predicates of the semantic role labeling;

Step 2.3: extracting the entities that satisfy the logical expression formed by dependency syntax analysis, together with the entities that satisfy semantic role labels A0 and A1 and the predicate, as a relation extraction triple;

Step 2.4: calling the part-of-speech tagging service of the natural language processing platform to obtain lexical feature information;

Step 2.5: extracting the numerals and measure words from the word segmentation result, matching them with the modifiers of attribute values, and concatenating the modifier, numeral and measure word into an attribute value that serves as the trigger condition for attribute extraction;

Step 2.6: scanning and recording the triggered attribute value and the corresponding attribute information, taking the nearest entity before the attribute, or the nearest entity before the modifier, as the attribute's entity object, and linking the attribute to that entity to form the final attribute triple.

Preferably, step 3 includes:

Step 3.1: defining the relation labels and entity labels of the triples;

Step 3.2: defining the entity attributes of the triples as node attributes in the graph, and storing the relation words as attributes of the relation edges in the graph;

Step 3.3: treating each entity as a child object of the instantiated object, and storing the triples in the neo4j graph database as objects.

Preferably, information extraction and visualization are performed in a layered, tightly coupled manner: the word-formation features of Chinese requirement documents are analyzed at the lexical, syntactic and semantic levels with the help of an open-source natural language processing platform, the corresponding information extraction rules are formulated and maintained with the Drools engine, the entities, relations and attributes in the document are extracted, and the results are mapped to form a knowledge graph.

The document information extraction and mapping system provided by the invention comprises:

Module M1: obtaining the word-formation features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching;

Module M2: obtaining the word-formation features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching;

Module M3: mapping the extracted entity, relation and attribute triples to generate a document graph.

Preferably, module M1 comprises:

Module M1.1: calling the part-of-speech tagging service of a natural language processing platform to obtain lexical feature information, including word segmentation, part-of-speech tags, word length, word offset and word position;

Module M1.2: calling the dependency syntax analysis service of the natural language processing platform, analyzing the lexical feature information to obtain dependency syntax tree information, and obtaining compound noun entities through linguistic analysis of Chinese word formation;

Module M1.3: according to the characteristics of the entity words of the document, pruning incorrectly identified entity words from the dependency syntax tree by means of stop words and trigger words, obtaining the classification of the entity words, and performing entity extraction with the formulated rules and the added general words and trigger words to obtain the entity extraction result.

Preferably, module M2 comprises:

Module M2.1: calling the dependency syntax analysis and semantic role labeling services of the natural language processing platform to analyze the dependency syntax and semantic roles of the requirement items, and obtaining the results of the dependency syntax analysis and semantic role labeling;

Module M2.2: scanning the item to obtain relation words, mapping the relation words onto the core word of the dependency syntax analysis, and mapping the relation words onto the predicates of the semantic role labeling;

Module M2.3: extracting the entities that satisfy the logical expression formed by dependency syntax analysis, together with the entities that satisfy semantic role labels A0 and A1 and the predicate, as a relation extraction triple;

Module M2.4: calling the part-of-speech tagging service of the natural language processing platform to obtain lexical feature information;

Module M2.5: extracting the numerals and measure words from the word segmentation result, matching them with the modifiers of attribute values, and concatenating the modifier, numeral and measure word into an attribute value that serves as the trigger condition for attribute extraction;

Module M2.6: scanning and recording the triggered attribute value and the corresponding attribute information, taking the nearest entity before the attribute, or the nearest entity before the modifier, as the attribute's entity object, and linking the attribute to that entity to form the final attribute triple.

Preferably, module M3 comprises:

Module M3.1: defining the relation labels and entity labels of the triples;

Module M3.2: defining the entity attributes of the triples as node attributes in the graph, and storing the relation words as attributes of the relation edges in the graph;

Module M3.3: treating each entity as a child object of the instantiated object, and storing the triples in the neo4j graph database as objects.

Compared with the prior art, the invention has the following beneficial effects:

(1) The invention extracts the relations and attributes of a document on the basis of syntactic and semantic rules, without the data labeling and model training that machine learning methods require, which improves extraction efficiency and reduces the computing resources consumed during extraction;

(2) The invention supports flexible configuration of document extraction rules and performs entity extraction based on Chinese word formation, so the results are highly interpretable;

(3) The invention forms triples from the extracted entities, relations and attributes and maps them into a graph with neo4j, which clearly displays the relationships among elements such as document hierarchy, item structure, functions, data and roles in a requirement document and supports tasks such as item similarity calculation, subgraph matching, item clustering and item tracking. This replaces manual reading and extraction with automatic extraction by computer and greatly improves the efficiency with which staff mine and analyze requirements.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:

FIG. 1 is a system block diagram;

FIG. 2 is a diagram of the requirement knowledge graph metadata model;

FIG. 3 is a diagram of an example requirement knowledge graph metadata definition;

FIG. 4 is a sentence structure diagram of a dependency syntax tree;

FIG. 5 is a logical rule expression formed by dependency syntax analysis;

FIG. 6 is a triplet mapping flow chart.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.

Example 1:

According to the invention, entities, relations and attributes are extracted from the item data of requirement documents in line with actual needs, and imported into a graph database for storage and visualization. The underlying natural language understanding and natural language processing techniques are studied; the word-formation features of Chinese requirement documents are analyzed at the lexical, syntactic and semantic levels with the open-source natural language processing platform LTP; the corresponding information extraction rules are formulated and maintained with the Drools engine; the entities, relations and attributes in the requirement document are extracted; and a requirement knowledge graph is formed through mapping.

The invention provides a method for extracting and mapping requirement management document information based on syntactic and semantic rules, which comprises the following steps:

Step 1: Entity extraction from requirement management documents based on syntactic and semantic rules

Entity extraction from requirement management documents is based on natural language understanding and natural language processing techniques: the word-formation features of the requirement document are obtained from lexical features and the dependency syntax tree, rules are formulated, and entities are extracted by pattern matching.

The method comprises the following steps:

Step 1.1: Call the part-of-speech tagging service of the natural language processing platform (LTP) to obtain lexical feature information such as word segmentation, part-of-speech tags, word length, word offset and word position. LTP part-of-speech tagging uses the national 863 tag set, which contains 28 Chinese parts of speech in total.

Step 1.2: Call the dependency syntax analysis service of LTP to obtain dependency syntax tree information, and obtain compound noun entities of NP and VP form, i.e. Chinese baseNPs, through linguistic analysis of Chinese word formation. NP and VP combinations are combinations of the words produced by LTP word segmentation, and the rules are stored in 5 .drl files according to the length of the word combination. Based on this linguistic analysis, a total of 158 rules for NP and VP word-formation structures were obtained. In addition to rule matching, an entity dictionary for requirement documents is added so that the precision and recall of entity extraction can be tuned manually.

Step 1.3: According to the characteristics of the entity words of requirement documents, prune incorrectly identified entity words from the dependency syntax tree by means of stop words and trigger words, and obtain the classification of the entity words. Finally, perform entity extraction with the formulated rules and the added general words and trigger words to obtain the entity extraction result for the requirement item.
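The pruning and classification in Step 1.3 can be pictured with a minimal sketch. The patent does not give the stop-word or trigger-word lists or say exactly how they are applied, so the word lists, the category names attached to trigger words, and the function name `prune_and_classify` below are all hypothetical.

```python
# Hypothetical word lists; the patent does not publish its own.
STOP_WORDS = {"进行", "以下"}          # e.g. "carry out", "the following"
TRIGGER_WORDS = {                      # trigger word -> assumed entity category
    "系统": "system/software",
    "功能": "function",
    "数据": "information/data",
}

def prune_and_classify(candidates):
    """Drop candidates that contain a stop word, then classify the rest
    by the trigger word they end with (assumed heuristic)."""
    results = []
    for cand in candidates:
        if any(stop in cand for stop in STOP_WORDS):
            continue  # pruned as an incorrectly identified entity word
        category = next(
            (cat for trig, cat in TRIGGER_WORDS.items() if cand.endswith(trig)),
            "unclassified",
        )
        results.append((cand, category))
    return results

entities = prune_and_classify(["指挥系统", "进行查询", "态势数据", "查询功能"])
```

Here "进行查询" is discarded because it contains a stop word, while the remaining candidates receive the category of their trailing trigger word.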

Step 2: Relation and attribute extraction from requirement management documents based on syntactic and semantic rules

Relation and attribute extraction is likewise based on natural language understanding and natural language processing techniques: the word-formation features of the requirement document are obtained from lexical features and the dependency syntax tree, rules are formulated, and relations and the corresponding entity attributes are extracted by pattern matching.

The method comprises the following steps:

Step 2.1: Call the dependency syntax analysis and semantic role labeling services of the natural language processing platform (LTP) to analyze the dependency syntax and semantic roles of the requirement items.

Step 2.2: After the results of dependency syntax analysis and semantic role labeling are obtained, first scan the item to find the relation in the sentence. Based on linguistic analysis, a total of 266 relation words were obtained for relation extraction. These relation words are then mapped onto the HED core word of the dependency syntax analysis and onto the predicates of the semantic role labeling.

Step 2.3: Extract the entities that satisfy the logical expression formed by dependency syntax analysis, together with the entities that satisfy semantic role labels A0 and A1 and the predicate, as a relation extraction triple.

Step 2.4: Attribute extraction also calls the part-of-speech tagging service of the natural language processing platform (LTP) to obtain lexical feature information such as word segmentation, part-of-speech tags, word length, word offset and word position.

Step 2.5: Find the numerals and measure words in the word segmentation result and match them with the attribute value modifier that precedes them, such as "not less than" or "equal to". Concatenate the modifier, numeral and measure word into an attribute value and use it as the trigger condition for attribute extraction. Then record the position of the attribute value and scan the sentence toward its beginning; if an attribute word defined for requirement documents is found (56 attribute words in total, derived from the meta-model definition), record it as the attribute.

Step 2.6: Record the attribute value and the corresponding attribute information, then continue scanning toward the beginning of the sentence. Take the nearest entity before the attribute, or the nearest entity before the modifier, as the attribute's entity object, and link the attribute to that entity to form the final attribute triple.
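Steps 2.4 to 2.6 can be sketched roughly as follows. This is an illustration under stated assumptions rather than the patented implementation: tokens are assumed to arrive as `(word, tag)` pairs with the 863-style tags `m` (numeral) and `q` (measure word), and the modifier list, attribute lexicon and function name `extract_attributes` are hypothetical.

```python
MODIFIERS = {"不小于", "等于", "不大于"}   # hypothetical attribute value modifiers
ATTRIBUTE_WORDS = {"带宽", "时延"}         # hypothetical subset of the attribute lexicon

def extract_attributes(tokens, entities):
    """tokens: list of (word, tag); entities: set of already extracted entity words.
    Returns (entity, attribute, value) triples."""
    triples = []
    for i in range(len(tokens) - 1):
        (word, tag), (next_word, next_tag) = tokens[i], tokens[i + 1]
        if tag != "m" or next_tag != "q":
            continue  # trigger: a numeral immediately followed by a measure word
        start, value = i, word + next_word
        if i > 0 and tokens[i - 1][0] in MODIFIERS:
            start = i - 1
            value = tokens[start][0] + value  # splice the modifier onto the value
        attribute = entity = None
        # Scan toward the beginning: first the attribute word, then the nearest entity.
        for k in range(start - 1, -1, -1):
            candidate = tokens[k][0]
            if attribute is None:
                if candidate in ATTRIBUTE_WORDS:
                    attribute = candidate
            elif candidate in entities:
                entity = candidate
                break
        if attribute and entity:
            triples.append((entity, attribute, value))
    return triples

# "通信模块的带宽不小于100兆" (the communication module's bandwidth is not less than 100 M)
tokens = [("通信模块", "n"), ("的", "u"), ("带宽", "n"), ("不小于", "v"), ("100", "m"), ("兆", "q")]
attribute_triples = extract_attributes(tokens, {"通信模块"})
```

The sketch only links an attribute when both an attribute word and an entity are found before the value; how the original rules handle sentences that lack one of them is not stated in the text.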

Step 3: Graph generation for requirement documents

Graph generation is the process of mapping the generated entity, relation and attribute triples into a graph. The resulting graph visually displays the relationships among elements such as document hierarchy, item structure, functions, data and roles in the requirement document. The knowledge base formed by the graph can also serve as the basis for tasks such as similarity calculation, subgraph matching, item clustering and item tracking.

The graph is stored and visualized with the mainstream neo4j graph database. neo4j is a high-performance NoSQL graph database and can also be regarded as a high-performance graph engine with all the features of a mature database. The nodes and edges of the graph can be manipulated in an object-oriented way within a flexible network structure.

The method comprises the following steps:

Step 3.1: According to the meta-model extracted from the requirement document, 7 relation types are defined, corresponding to 7 kinds of triples. When triples are imported, each triple is treated as an instantiated relation object, with its relation serving as the relation label and the category of each entity serving as the entity label.

Step 3.2: The entity attributes defined in the model are stored as attributes of the nodes in the graph, and the relation words are stored as attributes of the relation edges in the graph.

Step 3.3: Each entity is treated as a child object of this instantiated object. The triples are stored in the neo4j graph database as objects.
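One way to picture Steps 3.1 to 3.3 is to turn each triple into a Cypher `MERGE` statement. The patent describes storing triples "as objects" and does not show its import code, so the helper below is only an assumed sketch: the function name `triple_to_cypher`, the property names `name` and `word`, and the example labels are hypothetical. The statement is built as a string so that the sketch runs without a database; it could then be passed to a neo4j driver session.

```python
def triple_to_cypher(head, head_label, relation_label, relation_word, tail, tail_label):
    """Build a parameterised Cypher statement for one relation triple.
    Labels cannot be passed as Cypher parameters, so they are validated
    and interpolated; entity names and the relation word stay parameters."""
    for label in (head_label, relation_label, tail_label):
        if not label.isidentifier():
            raise ValueError(f"unsafe label: {label!r}")
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[r:{relation_label} {{word: $word}}]->(t)"
    )
    params = {"head": head, "tail": tail, "word": relation_word}
    return query, params

# Entity category -> node label, relation type -> edge label, relation word -> edge attribute.
query, params = triple_to_cypher("指挥系统", "System", "COMPOSITION", "包含", "通信模块", "Function")
```

Using `MERGE` rather than `CREATE` means that an entity appearing in several triples is stored as a single node, which matches the idea of entities shared across relation objects.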

The invention performs information extraction and visualization in a layered, tightly coupled manner, with a data resource layer, a service layer and an application layer arranged from bottom to top. The architecture is shown in FIG. 1.

FIG. 2 is a schematic diagram of the requirement knowledge graph metadata model. The figure defines 7 main relations: the assignment/allocation relation, type relation, composition relation, collection relation, flow relation, input relation and output relation. There are 6 entity categories: functions, systems/software, information/data, running systems, roles and organizations.
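The meta-model vocabulary listed above could be written down as two enumerations, for example as a shared schema for the extraction and import code. The relation and category names come from the text; the Python identifiers and the `Triple` class are illustrative assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class RelationType(Enum):
    ALLOCATION = "assignment/allocation"
    TYPE = "type"
    COMPOSITION = "composition"
    COLLECTION = "collection"
    FLOW = "flow"
    INPUT = "input"
    OUTPUT = "output"

class EntityCategory(Enum):
    FUNCTION = "function"
    SYSTEM_SOFTWARE = "system/software"
    INFORMATION_DATA = "information/data"
    RUNNING_SYSTEM = "running system"
    ROLE = "role"
    ORGANIZATION = "organization"

@dataclass(frozen=True)
class Triple:
    """A relation triple typed against the meta-model (hypothetical structure)."""
    head: str
    head_category: EntityCategory
    relation: RelationType
    tail: str
    tail_category: EntityCategory

example = Triple("指挥系统", EntityCategory.SYSTEM_SOFTWARE,
                 RelationType.COMPOSITION,
                 "通信模块", EntityCategory.SYSTEM_SOFTWARE)
```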

FIG. 3 is an example definition of the requirement knowledge graph meta-model.

FIG. 4 shows the sentence structure of the dependency syntax tree of one item instance, obtained by calling the dependency syntax analysis service. As the right half of the figure shows, when an item sentence is traversed bottom-up toward the root, the NP and VP structures in the sentence are the compound entity words that need to be extracted from the requirement document.

FIG. 5 shows the logical rule expressions formed by dependency syntax analysis that are used for relation extraction, i.e. the rule expressions for triple extraction obtained with the dependency syntax analysis method.
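Since FIG. 5 is not reproduced in the text, the exact rule expressions are unknown. As an assumed example of what such a rule might look like, the sketch below takes a HED core word that is also a known relation word and pairs its SBV (subject) dependent with its VOB (object) dependent. The input layout (parallel lists of words, 1-based head indices with 0 for the root, and relation labels) and the function name are assumptions.

```python
def dependency_triples(words, heads, rels, relation_words):
    """words[i] depends on words[heads[i] - 1] with label rels[i]; heads[i] == 0 marks the root.
    Assumed rule: (SBV dependent, HED relation word, VOB dependent)."""
    triples = []
    for i, (word, rel) in enumerate(zip(words, rels)):
        if rel != "HED" or word not in relation_words:
            continue
        dependents = [j for j in range(len(words)) if heads[j] == i + 1]
        subjects = [words[j] for j in dependents if rels[j] == "SBV"]
        objects = [words[j] for j in dependents if rels[j] == "VOB"]
        triples.extend((s, word, o) for s in subjects for o in objects)
    return triples

# "指挥系统 包含 通信模块" (the command system contains the communication module)
triples = dependency_triples(
    ["指挥系统", "包含", "通信模块"],
    [2, 0, 2],
    ["SBV", "HED", "VOB"],
    {"包含"},
)
```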

FIG. 6 is a detailed flowchart of triple mapping.

Example 2:

Example 2 is a preferred example of example 1.

The method for extracting entities from requirement management documents based on syntactic and semantic rules is as follows:

baseNP: a simple, non-nested noun phrase, first proposed for English by Church in 1988. Chinese non-nested noun phrases differ from English ones, and the formal description of the Chinese baseNP (base entity noun phrase) is divided into 4 classes:

1. baseNP → baseNP + baseNP

2. baseNP → baseNP + noun/noun

The word-formation features of the requirement document are obtained from lexical features and the dependency syntax tree, rules are formulated, and entities are extracted by pattern matching. In essence, this process traverses all NP-type and VP-type phrases in the dependency syntax tree to construct Chinese baseNPs.

The basic templates are used to apply the rules to the input item text and obtain a candidate set of baseNPs. The rule matching process, i.e. the formalization of how rule elements are used to extract entities, is as follows:

1. Each word in the input item text is denoted w_i, and its part of speech after LTP part-of-speech tagging is denoted t_i, so that the input item can be represented as the following symbol string:

w_1|t_1, w_2|t_2, w_3|t_3, …, w_i|t_i, …, w_j|t_j, …, w_N|t_N

2. If, among all 158 rules, there is a rule of the form:

w_i|t_i, …, w_j|t_j → baseNP

where 0 ≤ i ≤ j, then a fragment of the symbol string in item 1 conforms to an NP or VP composition rule for a baseNP.

3. The scanned strings in the item that conform to such a context-free rule are output as the entity extraction result.
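The three-step matching procedure above can be sketched as a longest-match scan over part-of-speech sequences. The two rules below are hypothetical stand-ins for the 158 rules mentioned in the text, which may also constrain the words themselves; the tags follow the 863-style convention (`n` noun, `v` verb), and the function name is an assumption.

```python
# Hypothetical tag-sequence rules standing in for the patent's 158 NP/VP rules.
RULES = {("n", "n"), ("n", "n", "v")}

def match_base_np(tokens, rules=RULES):
    """tokens: list of (word, tag). Scan left to right and, at each position,
    emit the longest fragment whose tag sequence matches a rule."""
    max_len = max(len(rule) for rule in rules)
    results, i = [], 0
    while i < len(tokens):
        for span in range(min(max_len, len(tokens) - i), 0, -1):
            tags = tuple(tag for _, tag in tokens[i:i + span])
            if tags in rules:
                results.append("".join(word for word, _ in tokens[i:i + span]))
                i += span  # continue after the matched fragment
                break
        else:
            i += 1  # no rule starts at this position
    return results

# "系统 提供 态势 数据 查询" (the system provides situation data query)
candidates = match_base_np([("系统", "n"), ("提供", "v"), ("态势", "n"), ("数据", "n"), ("查询", "v")])
```

Trying the longest span first means the three-word fragment "态势数据查询" is preferred over the shorter "态势数据" when both rules apply.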

Chinese has some special classes of verbs, such as formal (dummy) verbs, auxiliary verbs and linking verbs. These verbs normally cannot be part of a base noun phrase, so if one of them appears in a candidate base noun phrase, it is not added to the candidate.

Common formal verbs are: administration, progress, presence, possibility;

Common auxiliary verbs are: when, the, get, dare, meeting, possibility, can, ken, willing, able, capable, let, allowed, hope, want, should, wish, willing, allowed, voluntary;

Common linking verbs (copulas) include the Chinese equivalents of "be" and "act as".

Dependency syntax analysis: dependency grammar reveals the syntactic structure of a natural sentence by analyzing the dependency relations between the components of a language unit. Put simply, dependency syntax analysis identifies grammatical components such as subject, predicate and object, and attributive, adverbial and complement, by analyzing the relations among the structural components of the sentence.

Semantic role labeling: a shallow semantic analysis technique that labels, in a natural sentence, the semantic components (semantic roles) of each predicate, such as time, place, agent, patient, cause and result. There are six core semantic roles, A0 to A5: A0 denotes the agent of the action and A1 denotes the patient affected by the action, while A2 to A5 are flexible and take different meanings depending on the predicate. Relation extraction focuses on the relation among A0, A1 and the predicate, which together form a relation extraction triple. Besides the core roles there are 15 additional semantic roles, such as LOC and TMP, which denote location and time respectively.

Five conditions of the dependency tree:

1. Simple node condition: a sentence contains only terminal nodes and no non-terminal nodes;

2. Single parent condition: every node except the root has exactly one parent node;

3. Single root condition: a dependency tree has exactly one root node;

4. Non-crossing condition: the branches of the dependency tree do not cross;

5. Mutual exclusion condition: the left-to-right precedence relation and the top-down dominance relation are mutually exclusive, i.e. only one of the two holds between any pair of nodes.
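Some of these conditions can be checked mechanically on a parser's output. The sketch below assumes the tree is given as a list of 1-based head indices with 0 marking the root, a common layout that is used here as an assumption. It checks the single root, single parent and non-crossing conditions; the simple node and mutual exclusion conditions are not checked separately.

```python
def is_valid_dependency_tree(heads):
    """heads[i] is the 1-based index of the head of word i + 1; 0 marks the root."""
    n = len(heads)
    # Single root condition: exactly one word is attached to the root.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Single parent condition: one head per word holds by construction;
    # reject out-of-range heads and words that head themselves.
    if any(h < 0 or h > n or h == i + 1 for i, h in enumerate(heads)):
        return False
    # Every word must reach the root, which rules out cycles.
    for i in range(n):
        seen, node = set(), i + 1
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    # Non-crossing condition: no two arcs (a, b) and (c, d) with a < c < b < d.
    arcs = [tuple(sorted((i + 1, h))) for i, h in enumerate(heads) if h != 0]
    return not any(a < c < b < d for a, b in arcs for c, d in arcs)

ok = is_valid_dependency_tree([2, 0, 2])          # subject <- verb -> object
crossing = is_valid_dependency_tree([3, 4, 0, 3])  # arcs 1-3 and 2-4 cross
```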

After an item has been processed with dependency syntax analysis and semantic role labeling, it is matched against the rule base. The rule base for dependency syntax analysis is shown in FIG. 5; the rule for semantic role labeling takes the predicate as the core and looks up its agent role A0 and patient role A1 to form a relation triple.
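The semantic role labeling rule can be sketched as follows. The frame layout used here (a dict with a `predicate` and a `roles` mapping) is a hypothetical intermediate format, not the output format of any particular LTP version, and the function name is likewise an assumption.

```python
def srl_triples(frames, relation_words):
    """frames: list of {"predicate": str, "roles": {role_label: text}}.
    Emit (A0, predicate, A1) when the predicate is a known relation word
    and both core roles are present."""
    triples = []
    for frame in frames:
        predicate, roles = frame["predicate"], frame["roles"]
        if predicate in relation_words and "A0" in roles and "A1" in roles:
            triples.append((roles["A0"], predicate, roles["A1"]))
    return triples

frames = [
    {"predicate": "输出", "roles": {"A0": "处理模块", "A1": "态势数据", "TMP": "实时"}},
    {"predicate": "支持", "roles": {"A0": "系统"}},  # no A1, so no triple
]
relation_triples = srl_triples(frames, {"输出", "支持"})
```

Additional roles such as TMP are ignored here, in line with the text's focus on A0, A1 and the predicate.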

First, the part-of-speech tagging service of the natural language processing platform (LTP) is called to obtain lexical feature information such as word segmentation, part-of-speech tags, word length, word offset and word position. The numerals and measure words in the item's segmentation result are then located and used as the trigger condition for attribute extraction. Next, the segmentation result is traversed toward the beginning of the sentence and matched against the attribute lexicon defined by the meta-model to find the attribute word, which gives the attribute of the attribute triple. Finally, the attribute and the attribute value are linked to the corresponding entity to form the attribute triple of the requirement item.
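The lexical feature information named above (word, part-of-speech tag, length, offset and position) could be held in a small record type. The sketch assumes the words and tags have already been produced by a segmenter and tagger; the `Token` class and `build_tokens` helper are hypothetical names, and offsets are counted in characters.

```python
from dataclasses import dataclass

@dataclass
class Token:
    word: str      # segmented word
    pos: str       # part-of-speech tag, e.g. "n", "v", "m", "q"
    length: int    # word length in characters
    offset: int    # character offset of the word within the sentence
    position: int  # index of the word in the token sequence

def build_tokens(words, tags):
    """Attach length, offset and position to pre-segmented, pre-tagged words."""
    tokens, offset = [], 0
    for position, (word, tag) in enumerate(zip(words, tags)):
        tokens.append(Token(word, tag, len(word), offset, position))
        offset += len(word)
    return tokens

# Hypothetical tagger output for "带宽不小于100兆"
tokens = build_tokens(["带宽", "不小于", "100", "兆"], ["n", "v", "m", "q"])
# Positions of numerals followed by measure words, i.e. the attribute triggers.
triggers = [t.position for t, nxt in zip(tokens, tokens[1:]) if t.pos == "m" and nxt.pos == "q"]
```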

Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.

The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.

Claims

1. The document information extraction and mapping method is characterized by comprising the following steps:

Step 3: mapping the extracted entity, relation and attribute triples to generate a document map;

The step 1 comprises the following steps:

Step 1.3: pruning the wrongly recognized entity words from the dependency syntax tree in the form of stop words and trigger words according to the characteristics of the entity words of the document, acquiring the classification of the entity words, and extracting the entity by using the formulated rules and the added general words and trigger words to acquire an entity extraction result;

The step 2 comprises the following steps:

Step 2.3: extracting, as relation extraction triples, the entities that conform to the logic expression formed by dependency syntactic analysis, together with the relation entities and predicates that conform to semantic role labels A0 and A1; wherein A0 is the agent role and A1 is the patient role;

Step 2.6: scanning and recording each triggered attribute value and its corresponding attribute information, and taking the nearest entity preceding the scanned attribute, or the nearest entity preceding the attribute's modifier, as the attribute entity object, so as to link the entity and the attribute into a final attribute triplet;

The step 3 comprises the following steps:

Step 3.1: defining a relationship label and an entity label of the triplet;

2. The document information extraction and mapping method according to claim 1, wherein information extraction and visualization are performed in a hierarchical, tightly coupled manner; the word-formation characteristics of Chinese requirement documents are analyzed in terms of morphology, syntax and semantics in combination with an open-source natural language processing platform; corresponding information extraction rules are formulated and maintained using a Drools engine; and the entities, relationships and attributes in the document are extracted and mapped to form a knowledge graph.

3. A document information extraction and mapping system, comprising:

Module M3: mapping the extracted entity, relation and attribute triples to generate a document map;

The module M1 includes:

Module M1.3: pruning the wrongly recognized entity words from the dependency syntax tree in the form of stop words and trigger words according to the characteristics of the entity words of the document, acquiring the classification of the entity words, and extracting the entity by using the formulated rules and the added general words and trigger words to acquire an entity extraction result;

The module M2 includes:

Module M2.3: extracting, as relation extraction triples, the entities that conform to the logic expression formed by dependency syntactic analysis, together with the relation entities and predicates that conform to semantic role labels A0 and A1; wherein A0 is the agent role and A1 is the patient role;

Module M2.6: scanning and recording each triggered attribute value and its corresponding attribute information, and taking the nearest entity preceding the scanned attribute, or the nearest entity preceding the attribute's modifier, as the attribute entity object, so as to link the entity and the attribute into a final attribute triplet;

The module M3 includes:

Module M3.1: defining a relationship label and an entity label of the triplet;

4. The document information extraction and mapping system according to claim 3, wherein information extraction and visualization are performed in a hierarchical, tightly coupled manner; the word-formation characteristics of Chinese requirement documents are analyzed in terms of morphology, syntax and semantics in combination with an open-source natural language processing platform; corresponding information extraction rules are formulated and maintained using a Drools engine; and the entities, relationships and attributes in the document are extracted and mapped to form a knowledge graph.