CN113609838B - Document information extraction and mapping method and system - Google Patents

Publication number
CN113609838B
Authority
CN
China
Prior art keywords
entity
words
word
attribute
natural language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110795366.2A
Other languages
Chinese (zh)
Other versions
CN113609838A (en)
Inventor
牛硕硕
王金华
王盼盼
李德启
黄哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 32 Research Institute
Original Assignee
CETC 32 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 32 Research Institute
Priority to CN202110795366.2A
Publication of CN113609838A
Application granted
Publication of CN113609838B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for extracting and mapping document information, comprising the following steps: step 1: obtaining word-formation features of the document from its lexical features and dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching; step 2: obtaining word-formation features of the document in the same way, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching; step 3: converting the extracted entity, relation, and attribute triples into a graph to generate a document graph. The invention extracts the relations and attributes of a document based on syntactic and semantic rules and does not require labeling data or training a model with machine learning methods, which improves extraction efficiency and reduces the computing resources consumed during extraction.

Description

Document information extraction and mapping method and system
Technical Field
The invention relates to the technical field of natural language understanding and processing, and in particular to a method and a system for extracting and mapping document information. More particularly, it relates to a method for extracting and mapping requirement management document information based on syntactic and semantic rules.
Background
With the arrival of the information and internet age, building information resources has become the core of the military's informatization efforts. Military equipment is being updated and upgraded rapidly, military organizations and personnel are being redeployed and replanned, tactics continue to evolve, and the workload of military project construction and requirements keeps growing, so the degree of automation in military information processing needs to be improved further.
Accurate data analysis plays an increasingly prominent role in modern military intelligence research, and the large volume of information available as electronic documents provides the basis for information extraction, data analysis, and knowledge graph construction. Automating military information work requires extracting the most useful information from military electronic text in real time, mining valuable military intelligence from massive amounts of information using data mining and natural language processing, allocating battlefield information resources sensibly across the whole theater, and providing decision makers with comprehensive data evaluation and reliable analysis results so that they can decide quickly.
Military requirement documents are important documents for military technical research and project management, and they bridge the gap between the conception of a requirement and its implementation. Faced with a massive number of requirement documents, decision makers urgently need automated tools that apply a suitable extraction method to quickly extract entities, relations, and attributes from the text and capture the overall requirements of a document.
Most existing information extraction techniques rely on deep learning, which typically requires a great deal of manpower and material resources to preprocess and label data, and large amounts of computing resources to train models. In addition, existing methods mostly extract concrete entities, whereas many of the entities to be extracted from requirement management documents in the military field are abstract concepts such as functions, concepts, system descriptions, and roles, and the relations to be extracted, such as composition, inclusion, input, and output, are likewise relatively abstract. A method is therefore needed that combines natural language processing with lexical, syntactic, and semantic features to formulate extraction rules for military requirement management documents and to extract entities, relations, and attributes from the perspective of language structure. This reduces the manpower and material cost of data labeling to some extent, and because the text is analyzed in terms of its linguistic structure, the results are highly interpretable.
Patent document CN106874378A (application number: CN 201710006826.2) discloses a method for constructing a knowledge graph based on entity extraction and relation mining with a rule model. However, that patent extracts from semi-structured encyclopedia-style data and relies only weakly on natural language processing techniques such as lexical, syntactic, and semantic analysis.
Patent document CN108319586A (application number: CN 201810097357.4) discloses a method and apparatus for generating information extraction rules and for semantic analysis. However, that patent cannot prune incorrectly recognized entity words or obtain the classification of entity words, and therefore cannot achieve the goal of extracting entity words from military requirement documents.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method and a system for extracting and mapping document information.
The document information extraction and mapping method provided by the invention comprises the following steps:
Step 1: obtaining word-formation features of the document from its lexical features and dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching;
Step 2: obtaining word-formation features of the document from its lexical features and dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching;
Step 3: converting the extracted entity, relation, and attribute triples into a graph to generate a document graph.
Preferably, step 1 includes:
Step 1.1: calling the part-of-speech tagging service of a natural language processing platform to obtain lexical feature information comprising word segmentation, part-of-speech tags, word length, word offset, and word position;
Step 1.2: calling the dependency syntax analysis service of the natural language processing platform, analyzing the lexical feature information to obtain dependency syntax tree information, and analyzing Chinese word formation linguistically to obtain compound noun entities;
Step 1.3: according to the characteristics of the entity words of the document, pruning incorrectly recognized entity words from the dependency syntax tree by means of stop words and trigger words, obtaining the classification of the entity words, and performing entity extraction with the formulated rules together with the added general words and trigger words to obtain the entity extraction result.
Preferably, step 2 includes:
Step 2.1: calling the dependency syntax analysis and semantic role labeling services of the natural language processing platform to analyze the dependency syntax and semantic roles of the requirement items, and obtaining the dependency syntax analysis and semantic role labeling results;
Step 2.2: scanning the item to obtain relation words, mapping the relation words onto the core word of the dependency syntax analysis, and mapping the relation words onto the predicates of the semantic role labeling;
Step 2.3: extracting the entities that satisfy the logical expression derived from dependency syntax analysis, together with the entities that fill semantic roles A0 and A1 of a predicate, to form relation triples;
Step 2.4: calling the part-of-speech tagging service of the natural language processing platform to obtain lexical feature information;
Step 2.5: extracting the numerals and quantifiers (measure words) in the word segmentation result, matching them with the modifier that precedes the attribute value, and concatenating the modifier, numeral, and quantifier to form an attribute value that serves as the trigger condition for attribute extraction;
Step 2.6: scanning for and recording the triggered attribute value and the corresponding attribute word, taking the nearest entity preceding the attribute word, or the nearest entity preceding the modifier, as the entity that the attribute belongs to, and linking the attribute to that entity to form the final attribute triple.
Preferably, step 3 includes:
Step 3.1: defining the relation labels and entity labels of the triples;
Step 3.2: defining the entity attributes of the triples as node attributes in the graph, and storing the relation words as attributes of the relation edges in the graph;
Step 3.3: treating each entity as a child object of the instantiated relation object, and storing the triples in the neo4j graph database as objects.
Preferably, information extraction and visualization use a layered, tightly coupled architecture: with the help of an open-source natural language processing platform, the word-formation features of Chinese requirement documents are analyzed at the lexical, syntactic, and semantic levels; corresponding information extraction rules are formulated and maintained with the Drools engine; and the entities, relations, and attributes in the document are extracted and converted into a knowledge graph.
The document information extraction and mapping system provided by the invention comprises:
Module M1: obtaining word-formation features of the document from its lexical features and dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching;
Module M2: obtaining word-formation features of the document from its lexical features and dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching;
Module M3: converting the extracted entity, relation, and attribute triples into a graph to generate a document graph.
Preferably, module M1 includes:
Module M1.1: calling the part-of-speech tagging service of a natural language processing platform to obtain lexical feature information comprising word segmentation, part-of-speech tags, word length, word offset, and word position;
Module M1.2: calling the dependency syntax analysis service of the natural language processing platform, analyzing the lexical feature information to obtain dependency syntax tree information, and analyzing Chinese word formation linguistically to obtain compound noun entities;
Module M1.3: according to the characteristics of the entity words of the document, pruning incorrectly recognized entity words from the dependency syntax tree by means of stop words and trigger words, obtaining the classification of the entity words, and performing entity extraction with the formulated rules together with the added general words and trigger words to obtain the entity extraction result.
Preferably, module M2 includes:
Module M2.1: calling the dependency syntax analysis and semantic role labeling services of the natural language processing platform to analyze the dependency syntax and semantic roles of the requirement items, and obtaining the dependency syntax analysis and semantic role labeling results;
Module M2.2: scanning the item to obtain relation words, mapping the relation words onto the core word of the dependency syntax analysis, and mapping the relation words onto the predicates of the semantic role labeling;
Module M2.3: extracting the entities that satisfy the logical expression derived from dependency syntax analysis, together with the entities that fill semantic roles A0 and A1 of a predicate, to form relation triples;
Module M2.4: calling the part-of-speech tagging service of the natural language processing platform to obtain lexical feature information;
Module M2.5: extracting the numerals and quantifiers (measure words) in the word segmentation result, matching them with the modifier that precedes the attribute value, and concatenating the modifier, numeral, and quantifier to form an attribute value that serves as the trigger condition for attribute extraction;
Module M2.6: scanning for and recording the triggered attribute value and the corresponding attribute word, taking the nearest entity preceding the attribute word, or the nearest entity preceding the modifier, as the entity that the attribute belongs to, and linking the attribute to that entity to form the final attribute triple.
Preferably, module M3 includes:
Module M3.1: defining the relation labels and entity labels of the triples;
Module M3.2: defining the entity attributes of the triples as node attributes in the graph, and storing the relation words as attributes of the relation edges in the graph;
Module M3.3: treating each entity as a child object of the instantiated relation object, and storing the triples in the neo4j graph database as objects.
Preferably, information extraction and visualization use a layered, tightly coupled architecture: with the help of an open-source natural language processing platform, the word-formation features of Chinese requirement documents are analyzed at the lexical, syntactic, and semantic levels; corresponding information extraction rules are formulated and maintained with the Drools engine; and the entities, relations, and attributes in the document are extracted and converted into a knowledge graph.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention extracts the relations and attributes of a document based on syntactic and semantic rules and does not require labeling data or training a model with machine learning methods, which improves extraction efficiency and reduces the computing resources consumed during extraction;
(2) The invention supports flexible configuration of the document extraction rules and performs entity extraction based on Chinese word formation, so the results are highly interpretable;
(3) The invention forms triples from the extracted entities, relations, and attributes and builds a graph from them with neo4j, which clearly displays the relations among elements of the requirement document such as the document hierarchy, item structure, functions, data, and roles. Tasks such as item similarity calculation, subgraph matching, item clustering, and item tracking can be performed on this basis. Replacing manual reading and extraction with automatic extraction by computer greatly improves the efficiency with which staff mine and analyze requirements.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a system block diagram;
FIG. 2 is a diagram of a demand knowledge graph metadata model;
FIG. 3 is a diagram of an example of a demand knowledge graph metadata definition;
FIG. 4 is a sentence structure diagram of a dependency syntax tree;
FIG. 5 is a logical rule expression formed by dependency syntax analysis;
FIG. 6 is a triplet mapping flow chart.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.
Example 1:
The invention extracts entities, relations, and attributes from the item data of requirement documents and imports them into a graph database for storage and visualization. Building on underlying natural language understanding and natural language processing techniques and on the open-source natural language processing platform LTP, the word-formation features of Chinese requirement documents are analyzed at the lexical, syntactic, and semantic levels, corresponding information extraction rules are formulated and maintained with the Drools engine, the entities, relations, and attributes in the requirement document are extracted, and a requirement knowledge graph is formed from them.
The invention provides a demand management document information extraction and mapping method based on syntactic semantic rules, which comprises the following steps:
Step 1: entity extraction from requirement management documents based on syntactic and semantic rules
Entity extraction from requirement management documents based on syntactic and semantic rules relies on natural language understanding and natural language processing techniques: the word-formation features of the requirement document are obtained from its lexical features and dependency syntax tree, rules are formulated, and entities are extracted by pattern matching.
The method comprises the following steps:
Step 1.1: call the part-of-speech tagging service of the natural language processing platform (LTP) to obtain lexical feature information such as word segmentation, part-of-speech tags, word length, word offset, and word position. LTP's part-of-speech tagging uses the national 863 tag set, which comprises 28 Chinese parts of speech in total.
Step 1.2: call the dependency syntax analysis service of LTP to obtain dependency syntax tree information, and obtain compound noun entities in NP and VP form, i.e. Chinese baseNPs, by analyzing Chinese word formation linguistically. The NP and VP combinations are combinations of the words produced by LTP word segmentation, and the rules are placed in 5 drl (Drools rule) files according to the length of the word combination. Based on this linguistic analysis, a total of 158 rules for NP and VP word-formation structures were obtained. On top of rule matching, an entity dictionary for requirement documents is also added so that the precision and recall of entity extraction can be tuned manually.
Step 1.3: according to the characteristics of the entity words of requirement documents, prune incorrectly recognized entity words from the dependency syntax tree by means of stop words and trigger words, and obtain the classification of the entity words. Finally, extract entities using the formulated rules together with the added general words and trigger words to obtain the entity extraction result for the requirement item.
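The pruning and classification in Step 1.3 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the stop words, the trigger words, the category names, and the helper `prune_and_classify` are hypothetical, and the suffix test used for classification is one plausible reading of "trigger word", not the patent's actual rule set.

```python
from typing import Dict, List, Optional

# Hypothetical stop words: a candidate containing one of them is pruned.
STOP_WORDS = {"进行", "以及", "相关"}

# Hypothetical trigger words: a candidate ending in one of them is assigned
# the corresponding entity category.
TRIGGER_WORDS = {"系统": "System", "功能": "Function", "数据": "Data"}


def prune_and_classify(candidates: List[str]) -> Dict[str, Optional[str]]:
    """Drop candidates that contain a stop word and classify the rest."""
    kept: Dict[str, Optional[str]] = {}
    for cand in candidates:
        if any(stop in cand for stop in STOP_WORDS):
            continue  # pruned as an incorrectly recognized entity word
        # Category of the first matching trigger word, or None if none matches.
        kept[cand] = next(
            (cat for trig, cat in TRIGGER_WORDS.items() if cand.endswith(trig)),
            None,
        )
    return kept


print(prune_and_classify(["数据管理功能", "进行处理", "指挥系统", "操作员"]))
```

Candidates without a trigger word are kept here with category `None`; whether such candidates should instead be dropped or looked up in the entity dictionary is a design choice the text leaves open.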
Step 2: relation and attribute extraction from requirement management documents based on syntactic and semantic rules
Relation and attribute extraction from requirement management documents based on syntactic and semantic rules likewise relies on natural language understanding and natural language processing techniques: the word-formation features of the requirement document are obtained from its lexical features and dependency syntax tree, and relations and the corresponding entity attributes are extracted by pattern matching.
The method comprises the following steps:
Step 2.1: call the dependency syntax analysis and semantic role labeling services of the natural language processing platform (LTP) to analyze the dependency syntax and semantic roles of the requirement items.
Step 2.2: after the dependency syntax analysis and semantic role labeling results are obtained, first scan the item to find the relation words in the sentence. Based on linguistic analysis, a total of 266 relation words were collected for relation extraction. These relation words are then mapped onto the HED core word of the dependency syntax analysis and onto the predicates of the semantic role labeling.
Step 2.3: extract the entities that satisfy the logical expression derived from dependency syntax analysis, together with the entities that fill semantic roles A0 and A1 of a predicate, to form relation triples.
Step 2.4: attribute extraction from requirement management documents based on syntactic and semantic rules also calls the part-of-speech tagging service of the natural language processing platform (LTP) to obtain lexical feature information such as word segmentation, part-of-speech tags, word length, word offset, and word position.
Step 2.5: find the numerals and quantifiers (measure words) in the word segmentation result, match them with the attribute-value modifier that precedes them, such as "not less than", "equal to", or "not greater than", and concatenate the modifier, numeral, and quantifier to form the attribute value, which serves as the trigger condition for attribute extraction. Then record the position of the attribute value and scan the sentence toward its beginning; if an attribute word defined for the requirement document is found (56 attribute words in total are generated from the meta-model definition), record it as the attribute.
Step 2.6: record the attribute value and the corresponding attribute word, continue scanning toward the beginning of the sentence, take the nearest entity preceding the attribute word, or the nearest entity preceding the modifier, as the entity that the attribute belongs to, and link the attribute to that entity to form the final attribute triple.
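The attribute extraction in Steps 2.4 to 2.6 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patent's implementation: the tokens are assumed to be already segmented and tagged (with `m` for numerals and `q` for quantifiers, as in the 863 tag set), compound entity words are assumed to be already merged into single tokens, and the modifier list, attribute words, and the helper `extract_attributes` are hypothetical.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical attribute-value modifiers and attribute words (the patent
# derives 56 attribute words from its meta-model; these are placeholders).
MODIFIERS = {"不小于", "不大于", "等于"}
ATTRIBUTE_WORDS = {"响应时间", "容量"}


def extract_attributes(tokens: List[Token], entities: Set[str]) -> List[Tuple[str, str, str]]:
    """Return (entity, attribute, value) triples triggered by numeral + quantifier."""
    triples = []
    for k in range(len(tokens) - 1):
        (num, num_tag), (unit, unit_tag) = tokens[k], tokens[k + 1]
        if num_tag != "m" or unit_tag != "q":
            continue  # trigger condition: a numeral followed by a quantifier
        start, value = k, num + unit
        if k > 0 and tokens[k - 1][0] in MODIFIERS:
            start, value = k - 1, tokens[k - 1][0] + value
        # Scan toward the beginning of the sentence for the attribute word.
        attr_pos = next(
            (j for j in range(start - 1, -1, -1) if tokens[j][0] in ATTRIBUTE_WORDS),
            None,
        )
        if attr_pos is None:
            continue
        # Keep scanning for the nearest preceding entity.
        ent_pos = next(
            (j for j in range(attr_pos - 1, -1, -1) if tokens[j][0] in entities),
            None,
        )
        if ent_pos is None:
            continue
        triples.append((tokens[ent_pos][0], tokens[attr_pos][0], value))
    return triples


tokens = [("查询模块", "n"), ("的", "u"), ("响应时间", "n"),
          ("不大于", "v"), ("2", "m"), ("秒", "q")]
print(extract_attributes(tokens, {"查询模块"}))
```

The sketch only covers the case where the attribute word precedes the value; the alternative in Step 2.6 (taking the nearest entity before the modifier when no attribute word is involved) would need an extra branch.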
Step 3: graph generation for requirement documents
Graph generation for requirement documents is the process of turning the generated entity, relation, and attribute triples into a graph. The generated graph visually displays the relations among elements of the requirement document such as the document hierarchy, item structure, functions, data, and roles. The knowledge base formed by the graph can also serve as the basis for tasks such as similarity calculation, subgraph matching, item clustering, and item tracking.
The graph is stored and visualized using the mainstream neo4j graph database. neo4j is a high-performance NoSQL graph database and can also be regarded as a high-performance graph engine with all the features of a mature database. The nodes and edges of the graph structure can be manipulated in an object-oriented way within a flexible network structure.
The method comprises the following steps:
Step 3.1: according to the meta-model extracted from the requirement document, 7 relation types are defined, corresponding to 7 kinds of triples. When triples are imported, each triple is treated as an instantiated relation object, with the relation type used as the relation label and the entity category used as the entity label.
Step 3.2: the entity attributes defined in the model are stored as attributes of nodes in the graph, and the relation words are stored as attributes of the relation edges in the graph.
Step 3.3: each entity is treated as a child object of this instantiated relation object. The triples are stored in the neo4j graph database as objects.
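One way to picture Steps 3.1 to 3.3 is to generate a parameterized Cypher statement for each relation triple. The sketch below is an assumption-laden illustration: the patent stores triples "as objects" and does not show Cypher, so this swaps in plain Cypher generation; the English label names, the `name` and `word` property names, and the helper `triple_to_cypher` are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical English names for the 6 entity categories and 7 relation types.
ENTITY_LABELS = {"Function", "System", "Data", "RunningSystem", "Role", "Organization"}
RELATION_TYPES = {"ASSIGNMENT", "TYPE", "COMPOSITION", "COLLECTION", "FLOW", "INPUT", "OUTPUT"}


def triple_to_cypher(head: str, head_label: str, rel_type: str, rel_word: str,
                     tail: str, tail_label: str) -> Tuple[str, Dict[str, str]]:
    """Build a Cypher MERGE statement and its parameters for one relation triple."""
    # Labels and relationship types cannot be passed as Cypher parameters,
    # so they are checked against fixed sets before being interpolated.
    if head_label not in ENTITY_LABELS or tail_label not in ENTITY_LABELS:
        raise ValueError("unknown entity label")
    if rel_type not in RELATION_TYPES:
        raise ValueError("unknown relation type")
    query = (
        f"MERGE (h:{head_label} {{name: $head}}) "
        f"MERGE (t:{tail_label} {{name: $tail}}) "
        f"MERGE (h)-[r:{rel_type}]->(t) "
        "SET r.word = $rel_word"  # relation word stored as an edge attribute
    )
    params = {"head": head, "tail": tail, "rel_word": rel_word}
    return query, params


query, params = triple_to_cypher("指挥系统", "System", "COMPOSITION", "包括", "通信模块", "System")
print(query)
print(params)
```

The returned query and parameters would then be handed to whatever Neo4j client is in use; entity attributes from Step 3.2 could be written in the same way with additional `SET h.<attribute> = $value` clauses.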
The invention uses a layered, tightly coupled architecture for information extraction and visualization, consisting, from bottom to top, of a data resource layer, a service layer, and an application layer. The architecture diagram is shown in Fig. 1.
Fig. 2 is a schematic diagram of the requirement knowledge graph metadata model. The figure shows 7 main relations: the assignment/allocation relation, type relation, composition relation, collection relation, flow relation, input relation, and output relation. There are 6 entity categories: functions, systems/software, information/data, running systems, roles, and organizations.
Fig. 3 is a definition example of the requirement knowledge graph meta-model.
Fig. 4 is a sentence structure diagram of the dependency syntax tree of one item instance, obtained by calling the dependency syntax analysis service. As the right half of the figure shows, when an item sentence is traversed bottom-up toward the root, the NP and VP structures in the sentence are the compound-word entities that need to be extracted from the requirement document.
As shown in Fig. 5, relation extraction is implemented with logical rule expressions derived from dependency syntax analysis, i.e. rule expressions for triple extraction obtained with the dependency syntax analysis method.
Fig. 6 is the detailed flowchart of triple mapping.
Example 2:
Example 2 is a preferred example of example 1.
The method for extracting the demand management document entity based on the syntactic semantic rule comprises the following steps:
baseNP: a simple, non-nested noun phrase, first proposed for English by Church in 1988. Chinese non-nested noun phrases differ from English ones, and the formal description of the Chinese baseNP (basic entity noun) is divided into 4 classes:
1. baseNP → baseNP + baseNP
2. baseNP → baseNP + noun/noun
3. baseNP → baseNP + noun/noun
4. baseNP → baseNP + noun/noun
Wherein the definitive notations include: adjective |distinguishment |adverb|verb|noun| the word |english word|number|adverb|.
The word-formation features of the requirement document are obtained from its lexical features and dependency syntax tree, rules are formulated, and entities are extracted by pattern matching. This process is essentially one of traversing all NP-type and VP-type phrases in the dependency syntax tree to construct Chinese baseNPs.
The basic templates are used to apply the rules to the input item text and obtain a candidate set of baseNPs. The rule matching process, which formalizes how the rule elements are used to extract entities, is as follows:
1. Each word in the input item text is denoted wi, and its part of speech after LTP part-of-speech tagging is denoted ti, so that the input item can be represented as the following symbol string:
w1|t1,w2|t2,w3|t3,…,wi|ti,…,wj|tj,…,wN|tN
2. If, among all 158 rules, there is a rule of the form:
wi|ti,…,wj|tj → baseNP
where 0 ≤ i ≤ j, then a fragment of the symbol string from step 1 matches a composition rule for an NP or VP in the baseNP.
3. The scanned strings in the item that match such a context-free rule are output as the result of entity extraction.
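The three matching steps above can be sketched in Python as follows. This is a minimal sketch, not the patent's Drools rules: the tag sequences in `RULES` are hypothetical stand-ins for the 158 NP/VP rules, the helper name `extract_base_nps` and the sample tokens are illustrative, and the longest-match strategy is an assumption the text does not spell out.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word wi, part-of-speech tag ti)

# Hypothetical stand-ins for the NP/VP rules: each rule is a sequence of
# part-of-speech tags (n = noun, v = verb, a = adjective) that may form a baseNP.
RULES: Set[Tuple[str, ...]] = {
    ("n", "n"),
    ("a", "n"),
    ("n", "v", "n"),
}


def extract_base_nps(tokens: List[Token], rules: Set[Tuple[str, ...]] = RULES) -> List[str]:
    """Scan left to right and emit the longest fragment whose tags match a rule."""
    max_len = max(len(rule) for rule in rules)
    results = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest possible fragment first; rules here span 2+ words.
        for length in range(min(max_len, len(tokens) - i), 1, -1):
            fragment = tokens[i:i + length]
            if tuple(tag for _, tag in fragment) in rules:
                results.append("".join(word for word, _ in fragment))
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return results


tokens = [("系统", "n"), ("应", "v"), ("提供", "v"),
          ("数据", "n"), ("管理", "v"), ("功能", "n")]
print(extract_base_nps(tokens))
```

Matching on tag sequences alone is ambiguous in general, which is presumably why the text also relies on the dependency syntax tree, stop words, and trigger words to prune false candidates.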
Chinese has several special classes of verbs, such as formal (dummy) verbs, auxiliary verbs, and linking verbs. These verbs normally cannot be part of a base noun phrase, so if one of them appears in a candidate base noun phrase, it is not added to the candidate.
Common formal verbs are: administration, progress, presence, possibility;
Common auxiliary verbs are: when, the, get, dare, meeting, possibility, can, ken, willing, able, capable, let, allowed, hope, want, should, wish, willing, allowed, voluntary;
common tie verbs are: the terms "a," an, "" the, "" acting, "" when, "" as, "" the, "the" are to be construed as meaning that they are not intended to be limiting.
Dependency syntax analysis: dependency grammar reveals the syntactic structure of a natural sentence by analyzing the dependencies between the components of a language unit. Put simply, dependency syntax analysis identifies grammatical components such as the subject, predicate, and object of a sentence by analyzing the relations among its structural components.
Semantic role labeling: a shallow semantic analysis technique that labels the semantic components (semantic roles) of the predicates in a natural sentence, such as time, place, agent, patient, cause, and result. There are six core semantic roles, A0, A1, A2, A3, A4, and A5. A0 represents the agent of the action, and A1 represents what is affected by the predicate's action. A2 to A5 are flexible and take different meanings depending on the predicate. Relation extraction focuses on the relation between A0, A1, and the predicate, which together form a relation triple. Besides the core semantic roles there are 15 additional semantic roles, such as LOC and TMP, which denote place and time respectively.
The five conditions of a dependency tree:
1. Simple node condition: the sentence contains only terminal nodes and no non-terminal nodes;
2. Single parent node condition: every node except the root node has exactly one parent node;
3. Independent root node condition: a dependency tree has exactly one root node;
4. Non-crossing condition: the branches of the dependency tree do not cross;
5. Mutual exclusion condition: the left-to-right precedence relation and the top-to-bottom dominance relation are mutually exclusive, so only one of them can hold between any two nodes.
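Some of these conditions can be checked mechanically. The sketch below assumes a parse is given as a list of head indices, where `heads[k]` is the 1-based index of the parent of word k+1 and 0 marks the root; this representation is an assumption for illustration and enforces the single parent node condition by construction. The function checks the single root, the absence of cycles, and the non-crossing condition among word-to-word arcs. Conditions 1 and 5 are not checked here.

```python
from typing import List


def is_valid_dependency_tree(heads: List[int]) -> bool:
    """Check single root, reachability of the root (no cycles) and non-crossing arcs."""
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False
    # Independent root node condition: exactly one word is attached to the root (0).
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Every word must reach the root by following parents without revisiting a node.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False  # cycle detected
            seen.add(node)
            node = heads[node - 1]
    # Non-crossing condition: no two arcs (a, b) and (c, d) with a < c < b < d.
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1) if h != 0]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:
                return False
    return True
```

For example, `is_valid_dependency_tree([2, 0, 2])` describes a three-word sentence whose second word is the root and governs the other two.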
Each item is processed with dependency syntactic analysis and semantic role labeling and is then parsed against a rule base. The rule base for dependency syntactic analysis is shown in Fig. 5. The rule for semantic role labeling is to take the predicate as the core and traverse the agent role A0 and the patient role A1, thereby forming a relation triple.
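The semantic role labeling rule can be sketched as follows. The input format (a list of frames, each with a `predicate` and a `roles` dictionary) is a hypothetical structure chosen for illustration and is not the output format of any particular LTP version.

```python
from typing import Dict, List, Tuple

# Hypothetical SRL frame: {"predicate": str, "roles": {role label: text span}}
Frame = Dict[str, object]


def srl_to_triples(frames: List[Frame]) -> List[Tuple[str, str, str]]:
    """Form (A0, predicate, A1) triples from frames that carry both core roles."""
    triples: List[Tuple[str, str, str]] = []
    for frame in frames:
        roles = frame["roles"]
        if "A0" in roles and "A1" in roles:
            triples.append((roles["A0"], frame["predicate"], roles["A1"]))
    return triples


frames = [
    {"predicate": "发送", "roles": {"A0": "系统", "A1": "通知", "TMP": "每天"}},
    {"predicate": "运行", "roles": {"A0": "服务"}},  # no A1, so no triple is formed
]
relation_triples = srl_to_triples(frames)
```

Frames lacking either A0 or A1 are skipped in this sketch; additional roles such as TMP or LOC are ignored.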
First, the part-of-speech tagging service of the natural language processing platform (LTP) is called to obtain lexical feature information such as word segmentation, part-of-speech tags, word length, word offset, and word position. Next, the numerals and quantifiers in the item's word segmentation result are located and used as the trigger condition for attribute extraction. The segmentation result is then traversed forward and matched against the attribute lexicon defined by the meta-model to find the attribute word, which gives the attribute of the attribute triple. Finally, the attribute and attribute value are linked to the corresponding entity to form the attribute triple of the requirement item.
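The attribute extraction flow can be sketched as below. The sketch assumes LTP-style tags in which m marks a numeral and q marks a quantifier, treats the scan as moving from the trigger toward the start of the entry, and uses small hypothetical lexicons in place of the meta-model's attribute lexicon and the extracted entity list.

```python
from typing import List, Optional, Set, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def extract_attribute_triples(tokens: List[Token],
                              attribute_words: Set[str],
                              entity_words: Set[str]) -> List[Tuple[str, str, str]]:
    """Return (entity, attribute, value) triples triggered by numerals and quantifiers."""
    triples: List[Tuple[str, str, str]] = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if tag != "m":
            i += 1
            continue
        # Trigger: a numeral, optionally followed by a quantifier, forms the value.
        value, end = word, i
        if i + 1 < len(tokens) and tokens[i + 1][1] == "q":
            value += tokens[i + 1][0]
            end = i + 1
        # Scan toward the start of the entry for the nearest attribute word.
        attr_idx: Optional[int] = next(
            (k for k in range(i - 1, -1, -1) if tokens[k][0] in attribute_words), None)
        if attr_idx is not None:
            # The nearest entity before the attribute becomes the attribute's owner.
            entity = next(
                (tokens[k][0] for k in range(attr_idx - 1, -1, -1)
                 if tokens[k][0] in entity_words), None)
            if entity is not None:
                triples.append((entity, tokens[attr_idx][0], value))
        i = end + 1
    return triples


item = [("服务器", "n"), ("的", "u"), ("响应时间", "n"), ("不", "d"),
        ("超过", "v"), ("3", "m"), ("秒", "q")]
attribute_triples = extract_attribute_triples(item, {"响应时间"}, {"服务器"})
```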
Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.
The foregoing describes specific embodiments of the present application. It is to be understood that the application is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the application. The embodiments of the application and the features of the embodiments may be combined with each other arbitrarily without conflict.

Claims (4)

1. A document information extraction and mapping method, characterized by comprising the following steps:
Step 1: acquiring word features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching;
Step 2: acquiring word features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching;
Step 3: mapping the extracted entity, relation, and attribute triples to generate a document map;
Step 1 comprises the following steps:
Step 1.1: calling the part-of-speech tagging service of a natural language processing platform to obtain lexical feature information comprising word segmentation, part-of-speech tags, word length, word offset, and word position;
Step 1.2: calling the dependency syntactic analysis service of the natural language processing platform, analyzing the lexical feature information to obtain dependency syntax tree information, and analyzing Chinese word formation linguistically to obtain compound noun entities;
Step 1.3: according to the characteristics of the entity words of the document, pruning wrongly recognized entity words from the dependency syntax tree by means of stop words and trigger words, obtaining the classification of the entity words, and extracting entities using the formulated rules together with the added general words and trigger words to obtain an entity extraction result;
Step 2 comprises the following steps:
Step 2.1: calling the dependency syntactic analysis and semantic role labeling services of the natural language processing platform to analyze the dependency syntax and semantic roles of the requirement items, and obtaining the results of the dependency syntactic analysis and the semantic role labeling;
Step 2.2: scanning the item to obtain a relation word, mapping the relation word to the core word of the dependency syntactic analysis, and mapping the relation word to the predicate of the semantic role labeling;
Step 2.3: extracting the entities that conform to the logical expression formed by the dependency syntactic analysis, and the relation entities that conform to semantic role labels A0 and A1 and the predicate, as relation extraction triples; wherein A0 is the agent role and A1 is the patient role;
Step 2.4: calling the part-of-speech tagging service of the natural language processing platform to acquire lexical feature information;
Step 2.5: extracting the numerals and quantifiers in the word segmentation result, matching them with the modifiers of attribute values, and concatenating the numerals and modifiers to form attribute values, which serve as the trigger condition for attribute extraction;
Step 2.6: scanning and recording the triggered attribute value and the corresponding attribute information, and taking the nearest entity preceding the attribute, or the nearest entity preceding the modifier of the attribute, as the attribute's entity object, so as to link the entity attribute and form the final attribute triple;
Step 3 comprises the following steps:
Step 3.1: defining the relation labels and entity labels of the triples;
Step 3.2: defining the entity attributes of the triples as node attributes in the map, and storing the relation words as attributes on the relation edges of the map;
Step 3.3: using each entity as a sub-object of an instantiated object, and storing the triples in the neo4j graph database as objects.
2. The document information extraction and mapping method according to claim 1, wherein information extraction and visualization are performed in a hierarchical, tightly coupled manner; the word-formation characteristics of Chinese requirements documents are analyzed in terms of morphology, syntax, and semantics in combination with an open-source natural language processing platform; corresponding information extraction rules are formulated; the rules are maintained with the Drools engine; and the entity and relationship attributes in the document are extracted and mapped to form a knowledge graph.
3. A document information extraction and mapping system, comprising:
Module M1: acquiring word features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting entities by pattern matching;
Module M2: acquiring word features of the document from lexical features and the dependency syntax tree by means of natural language understanding and natural language processing techniques, formulating rules, and extracting relations and the corresponding entity attributes by pattern matching;
Module M3: mapping the extracted entity, relation, and attribute triples to generate a document map;
Module M1 includes:
Module M1.1: calling the part-of-speech tagging service of a natural language processing platform to obtain lexical feature information comprising word segmentation, part-of-speech tags, word length, word offset, and word position;
Module M1.2: calling the dependency syntactic analysis service of the natural language processing platform, analyzing the lexical feature information to obtain dependency syntax tree information, and analyzing Chinese word formation linguistically to obtain compound noun entities;
Module M1.3: according to the characteristics of the entity words of the document, pruning wrongly recognized entity words from the dependency syntax tree by means of stop words and trigger words, obtaining the classification of the entity words, and extracting entities using the formulated rules together with the added general words and trigger words to obtain an entity extraction result;
Module M2 includes:
Module M2.1: calling the dependency syntactic analysis and semantic role labeling services of the natural language processing platform to analyze the dependency syntax and semantic roles of the requirement items, and obtaining the results of the dependency syntactic analysis and the semantic role labeling;
Module M2.2: scanning the item to obtain a relation word, mapping the relation word to the core word of the dependency syntactic analysis, and mapping the relation word to the predicate of the semantic role labeling;
Module M2.3: extracting the entities that conform to the logical expression formed by the dependency syntactic analysis, and the relation entities that conform to semantic role labels A0 and A1 and the predicate, as relation extraction triples; wherein A0 is the agent role and A1 is the patient role;
Module M2.4: calling the part-of-speech tagging service of the natural language processing platform to acquire lexical feature information;
Module M2.5: extracting the numerals and quantifiers in the word segmentation result, matching them with the modifiers of attribute values, and concatenating the numerals and modifiers to form attribute values, which serve as the trigger condition for attribute extraction;
Module M2.6: scanning and recording the triggered attribute value and the corresponding attribute information, and taking the nearest entity preceding the attribute, or the nearest entity preceding the modifier of the attribute, as the attribute's entity object, so as to link the entity attribute and form the final attribute triple;
Module M3 includes:
Module M3.1: defining the relation labels and entity labels of the triples;
Module M3.2: defining the entity attributes of the triples as node attributes in the map, and storing the relation words as attributes on the relation edges of the map;
Module M3.3: using each entity as a sub-object of an instantiated object, and storing the triples in the neo4j graph database as objects.
4. The document information extraction and mapping system according to claim 3, wherein information extraction and visualization are performed in a hierarchical, tightly coupled manner; the word-formation characteristics of Chinese requirements documents are analyzed in terms of morphology, syntax, and semantics in combination with an open-source natural language processing platform; corresponding information extraction rules are formulated; the rules are maintained with the Drools engine; and the entity and relationship attributes in the document are extracted and mapped to form a knowledge graph.
CN202110795366.2A 2021-07-14 2021-07-14 Document information extraction and mapping method and system Active CN113609838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110795366.2A CN113609838B (en) 2021-07-14 2021-07-14 Document information extraction and mapping method and system

Publications (2)

Publication Number Publication Date
CN113609838A CN113609838A (en) 2021-11-05
CN113609838B CN113609838B (en) 2024-05-24

Family

ID=78337552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110795366.2A Active CN113609838B (en) 2021-07-14 2021-07-14 Document information extraction and mapping method and system

Country Status (1)

Country Link
CN (1) CN113609838B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115017913B (en) * 2022-04-21 2023-01-31 广州世纪华轲科技有限公司 Semantic component analysis method based on master-slave framework mode
CN115098617A (en) * 2022-06-10 2022-09-23 杭州未名信科科技有限公司 Method, device and equipment for labeling triple relation extraction task and storage medium
CN115238217B (en) * 2022-09-23 2022-12-20 山东省齐鲁大数据研究院 Method for extracting numerical information from bulletin text and terminal

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050302A (en) * 2014-07-10 2014-09-17 华东师范大学 Topic detecting system based on atlas model
CN104636466A (en) * 2015-02-11 2015-05-20 中国科学院计算技术研究所 Entity attribute extraction method and system oriented to open web page
CN106777275A (en) * 2016-12-29 2017-05-31 北京理工大学 Entity attribute and property value extracting method based on many granularity semantic chunks
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method
CN109241538A (en) * 2018-09-26 2019-01-18 上海德拓信息技术股份有限公司 Based on the interdependent Chinese entity relation extraction method of keyword and verb
CN110222200A (en) * 2019-06-20 2019-09-10 京东方科技集团股份有限公司 Method and apparatus for entity fusion
CN110597999A (en) * 2019-08-01 2019-12-20 湖北工业大学 Judicial case knowledge graph construction method of dependency syntactic analysis relation extraction model
CN111027309A (en) * 2019-12-05 2020-04-17 电子科技大学广东电子信息工程研究院 Method for extracting entity attribute value based on bidirectional long-short term memory network
CN111309925A (en) * 2020-02-10 2020-06-19 同方知网(北京)技术有限公司 Knowledge graph construction method of military equipment
CN111353030A (en) * 2020-02-26 2020-06-30 陕西师范大学 Knowledge question and answer retrieval method and device based on travel field knowledge graph
CN111597351A (en) * 2020-05-14 2020-08-28 上海德拓信息技术股份有限公司 Visual document map construction method
CN111708874A (en) * 2020-08-24 2020-09-25 湖南大学 Man-machine interaction question-answering method and system based on intelligent complex intention recognition
CN111897914A (en) * 2020-07-20 2020-11-06 杭州叙简科技股份有限公司 Entity information extraction and knowledge graph construction method for field of comprehensive pipe gallery
CN111897908A (en) * 2020-05-12 2020-11-06 中国科学院计算技术研究所 Event extraction method and system fusing dependency information and pre-training language model

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
《Dependency Parsing-based Entity Relation Extraction over Chinese Complex Text》;Qi S 等;《Transactions on Asian and Low-Resource Language Information Processing 》;20210609;第20卷(第4期);1-34 *
《Internet-based Knowledge Graph Construction Technology in Air Defense Field》;Yu L 等;《2021 International Conference on Computer Technology and Media Convergence Design (CTMCD)》;20210628;25-29 *
《基于深度学习和语法规约的需求文档命名实体识别》;许梦笛 等;《计算机与现代化》;20210131(第1期);105-110 *
《基于远程监督的军事实体关系抽取应用研究》;苟继承;《中国优秀硕士学位论文全文数据库社会科学Ⅰ辑》;20200715(第7期);G112-15 *
《基于非分类关系提取技术的知识图谱构建》;韦韬 等;《工业技术创新》;20200430;第7卷(第2期);23-28 *
《面向军事领域的知识图谱构建与应用研究》;薛坤;《中国优秀硕士学位论文全文数据库社会科学Ⅰ辑》;20210215(第2期);G112-12 *

Also Published As

Publication number Publication date
CN113609838A (en) 2021-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant