CN113312922B - Improved chapter-level triple information extraction method - Google Patents

Improved chapter-level triple information extraction method

Info

Publication number
CN113312922B
CN113312922B (application CN202110399643.8A)
Authority
CN
China
Prior art keywords
entity
node
semantic
verb
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110399643.8A
Other languages
Chinese (zh)
Other versions
CN113312922A (en)
Inventor
李少锋
王妍妍
王玉坤
高菁
陈文颖
张春晖
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN202110399643.8A
Publication of CN113312922A
Application granted
Publication of CN113312922B
Legal status: Active
Anticipated expiration


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an improved chapter-level triple information extraction method, which comprises the following steps: first, preprocessing text data; second, performing chapter-level semantic analysis on the text data, including hierarchical semantic analysis, entity alignment, and dependent-verb extraction; third, performing heuristic learning in multiple iterative rounds and building an event semantic model; fourth, extracting triples based on end-to-end samples, that is, extracting triples based on chapter understanding; and fifth, applying the triplet knowledge extracted in the third and fourth steps to applications such as intelligent retrieval, intelligent question answering, knowledge mining, and decision support. The method builds a triple information extraction model from small samples, has chapter-level triple extraction capability, is easy to popularize and extend, and constitutes an important basic link for large-scale text information extraction, knowledge system construction, and vertical-domain knowledge graph construction.

Description

Improved chapter-level triple information extraction method
Technical Field
The invention relates to an improved chapter-level triple information extraction method.
Background
Research in natural language processing began with vocabulary and lexicography, and in recent years the sentence has been the core object of study, while scholars of many languages have explored chapter-level semantic analysis theoretically. Because the chapter level lacks formal markers, chapter-level language computation has not made particularly significant progress. However, many semantic problems can only be fundamentally solved at the chapter level, such as coreference resolution, chapter structure and semantic relationship recognition, and event fusion and relationship recognition; at the same time, resolving these chapter-level semantic problems is also instructive for both vocabulary-level and sentence-level analysis. On the other hand, the recent development of Chinese vocabulary- and sentence-level natural language processing technology, in particular the staged results obtained in research on word sense disambiguation, syntactic analysis, and semantic role labeling, has created the technical conditions for research on chapter semantic analysis.
In general, Chinese sentences are relatively long, and a single sentence often contains several pieces of entity information, so the number of entity pairs formed is large and the distribution of entity types is uneven. Compared with relation detection and relation extraction on simple sentences, the complex patterns of long sentences make these tasks more difficult; long sentences often contain multiple entities, and multiple verbs frequently appear in sentences where an entity pair spans a long distance. Therefore, selecting verbs that can effectively characterize whether a semantic relationship exists between an entity pair, and of which specific type, becomes the key to relation detection and relation extraction. The biggest challenge for current extraction is that training data is inadequate and the distribution of relation instances across categories is highly unbalanced. At present, entity relation extraction mainly relies on templates, dependency syntax analysis, and deep learning. However, template-based entity relation extraction suffers from low accuracy and recall, and extraction based on dependency syntax faces the problem of semantic loss. Deep-learning-based entity relation extraction has obtained better experimental results in some fields, with no obvious performance differences among models, but at the cost of labeling a large number of training and test samples for the predefined relation categories, where the samples are relatively simple short sentences and the sample distribution of each relation is relatively uniform. Accurately labeling sentence-level data by hand is very costly and requires a great deal of time and effort.
In a practical scenario, manually annotating training data for thousands of relations, tens of millions of entity pairs, and hundreds of millions of sentences is almost impossible. Meanwhile, in practice the relationships among entities and the occurrence frequencies of entity pairs often follow a long-tail distribution, with a large number of relations or entity pairs having few samples. The effect of a neural network model must be ensured by large-scale labeled data, giving it a "ten examples to learn one case" problem; how to improve the learning ability of deep models and achieve "inferring three cases from one example" is a problem that relation extraction must solve. In addition, existing models mainly extract relations between entities from a single sentence, requiring that the sentence contain both entities at the same time. In practice, a large number of relations between entities appear across multiple sentences of a document, or even across multiple documents; how to perform relation extraction in such more complex contexts is another open problem. Existing task settings generally assume a predefined closed set of relations, converting the task into a relation classification problem; in this way, new relations between entities contained in the text cannot be obtained effectively. The above means achieve a certain effect on test sets with relatively simple phrases and a relatively uniform sample distribution per relation, but in practical application, particularly in triplet information extraction from chapter-level texts, various problems remain, such as data scale, learning ability, complex contexts, and open relations. Establishing a theory and method system for chapter semantic analysis with both theoretical depth and practical relevance would therefore be of great significance for the development and application of natural language processing.
In the information age, mining and establishing a comprehensive and accurate knowledge system from massive text data and related reports, constructing vertical-domain knowledge graphs, and supporting subsequent applications such as intelligent search, intelligent question answering, knowledge mining, and decision support have become pressing technical problems. The chapter-level triple information extraction method is an effective means to this end; for knowledge extracted from chapters to be applied at industrial scale, a method is needed that can accurately extract high-quality entity association relations based on a small number of labeled samples.
Disclosure of Invention
The invention aims to provide a chapter-level triple information extraction method for mining and establishing a comprehensive and accurate knowledge system and knowledge graph from massive text data and related annual reports. The method uses natural language processing technology and machine learning algorithms to extract high-quality entity association relations from limited samples, construct a vertical-domain knowledge graph, powerfully support the establishment of a domain knowledge system, and assist in mining and researching information relations.
In order to solve the technical problems, the invention provides an improved chapter-level triplet information extraction method, which comprises the following steps:
step 1, preprocessing text data;
step 2, performing chapter-level semantic analysis on the text data;
step 3, heuristic learning is carried out in a multi-round iteration mode, and an event semantic model is built;
and 4, performing triplet extraction based on the end-to-end sample.
Step 1 comprises the following steps:
step 1-1, converting a text data format, namely converting the acquired text data format into a text data format which can be directly subjected to natural language processing by adopting the existing natural language processing technology, such as extracting text from pdf and doc;
step 1-2, preprocessing and cleaning the text data after format conversion by using a natural language processing technology;
step 1-3, text data chapter structure processing: splitting a long document into text blocks according to paragraphs and periods;
and step 1-4, splitting text data into sentence blocks, further dividing the text blocks into physical sentence blocks separated by punctuation marks.
The step 1-2 comprises: performing the following processes in sequence on the text data after format conversion: full-width/half-width conversion, conversion of uppercase numerals into lowercase numerals, conversion of uppercase letters into lowercase letters, removal of emoticons, removal of all non-Chinese characters so that only Chinese is preserved, Chinese text segmentation, traditional-to-simplified Chinese conversion, and Chinese stop-word filtering.
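The cleaning sequence above can be sketched as follows; a minimal sketch only, where the full-width/half-width mapping uses the standard Unicode offset, while the stop-word set and the "Chinese characters only" filter are illustrative simplifications of the patent's pipeline:

```python
import re

def fullwidth_to_halfwidth(text):
    # Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    # space (U+3000) to their half-width counterparts.
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)

def clean_text(text, stopwords):
    text = fullwidth_to_halfwidth(text)
    text = text.lower()                            # uppercase letters -> lowercase
    text = re.sub(r"[^\u4e00-\u9fff]", "", text)   # keep Chinese characters only
    return "".join(ch for ch in text if ch not in stopwords)
```

In a real pipeline the final step would also run word segmentation and traditional-to-simplified conversion (e.g. with a segmentation library), which this sketch omits.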
The steps 1-4 comprise:
step 1-4-1, for brackets in a text block: if the content inside the brackets is in a close semantic relation with the adjacent component on its left (semantic components of the same semantic segment are closely related, while those of different semantic segments are not), the bracketed content and the adjacent component are merged into one semantic component; otherwise the bracketed content is not merged;
step 1-4-2, for quotation marks in sentence blocks: if the quoted text belongs to part of a named entity (a named entity is an entity with specific meaning in the text, mainly including person names, place names, organization names, proper nouns, and the like), the quotation marks are merged with the named entity; otherwise they are not processed;
step 1-4-3, for other symbols in a sentence block: if the symbol is part of a named entity (such as the interval mark in foreign names, or the title marks surrounding some book titles), the symbol and its related context are merged into one semantic entity; otherwise the symbol serves as a mark for dividing physical sentence blocks.
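The splitting of a text block into physical sentence blocks (steps 1-4) can be sketched as below; the separator set is an illustrative assumption, and the named-entity exceptions of steps 1-4-1 through 1-4-3 are left out for brevity:

```python
# Punctuation treated as physical sentence-block separators. In the full
# method, symbols that are part of a named entity (interval marks in
# foreign names, title marks, entity-internal quotes) would first be
# merged with their context and excluded from this set.
SPLIT_PUNCT = set("。！？；，")

def split_sentence_blocks(text_block):
    blocks, current = [], []
    for ch in text_block:
        if ch in SPLIT_PUNCT:
            if current:
                blocks.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        blocks.append("".join(current))
    return blocks
```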
Step 2 comprises the following steps:
step 2-1, carrying out semantic analysis on continuous texts in the chapters by using known grammar syntax knowledge of linguistics, and respectively generating a list formed by an analysis tree for each continuous text block;
step 2-2, decomposing complex semantics into a hierarchical semantic structure by combining an information structure of the text data, a category of a term which plays a specific role and a category of the text data;
step 2-3, entity alignment is carried out;
and 2-4, extracting the nearest syntactic dependency verb by the entity.
In step 2-2, each level in the hierarchical semantic structure includes N semantic blocks related to facts or concepts, where N is a natural number. Following a post-order traversal, query operations are first executed on the semantic blocks of the nesting layer (the nesting layer consists of the semantic blocks with nested semantics; step 2-2 decomposes complex semantics into a hierarchical semantic structure in which several semantics may be nested) to determine the extension of the nesting layer; after the nesting layer has been processed, query operations are executed on the remaining fact or concept semantic blocks, and the extension of each semantic block is determined.
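The nesting-layer-first (post-order) processing described above can be sketched with a minimal tree type; `SemanticBlock` and the visit callback are illustrative names, not structures defined in the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticBlock:
    text: str
    nested: List["SemanticBlock"] = field(default_factory=list)

def postorder_query(block, visit):
    # Post-order traversal: nested semantic blocks (inner layers) are
    # queried first, so their extension is fixed before the enclosing
    # fact/concept block is processed, as step 2-2 requires.
    results = []
    for child in block.nested:
        results.extend(postorder_query(child, visit))
    results.append(visit(block))
    return results
```

For the Fig. 4 example, the nested expression "incandescent lamp lights up at night" would be queried before the enclosing "Edison invented an incandescent lamp".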
The step 2-3 comprises the following steps:
First, judge according to the entity name whether an entity with the same name exists in a pre-established entity library. If not, generate a new entity pair and add it to the entity library. Otherwise, acquire all entity pairs with the same name, calculate the similarity between the target entity pair and each acquired entity pair, and comprehensively score and rank the candidates according to the similarity of category labels, attribute labels, and unstructured text keywords. If the best score is smaller than a threshold (the threshold cannot be fixed in advance and must be adjusted to the specific situation), add the target entity to the entity library; otherwise select the highest-scoring result as the alignment result of the target entity. Entity alignment determines whether two or more entities from different information sources point to the same object in the real world; if multiple entities characterize the same object, an alignment relationship is constructed between them and the information they contain is fused and aggregated. The target entity is an entity extracted from the text, and the purpose is to determine whether it has a co-reference relationship with an entity in the entity library.
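The alignment decision can be sketched as below; the Jaccard similarity, the weights, and the threshold are illustrative stand-ins for the comprehensive scoring the patent leaves unquantified:

```python
def jaccard(a, b):
    # Set-overlap similarity in [0, 1]; 0 for two empty sets.
    return len(a & b) / len(a | b) if (a | b) else 0.0

def align_entity(target, library, weights=(0.4, 0.3, 0.3), threshold=0.5):
    # Score each same-name candidate by a weighted similarity over
    # category labels, attribute labels, and text keywords (step 2-3).
    # Weights and threshold are hypothetical and would need tuning.
    candidates = [e for e in library if e["name"] == target["name"]]
    best, best_score = None, 0.0
    for cand in candidates:
        score = (weights[0] * jaccard(target["categories"], cand["categories"])
                 + weights[1] * jaccard(target["attributes"], cand["attributes"])
                 + weights[2] * jaccard(target["keywords"], cand["keywords"]))
        if score > best_score:
            best, best_score = cand, score
    if best is None or best_score < threshold:
        library.append(target)   # no match above threshold: new library entry
        return target
    return best                  # alignment result: highest-scoring candidate
```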
The steps 2-4 comprise:
step 2-4-1, let the two different entities be e_i and e_j. Extract, by the following method, the dependency-associated nodes e'_i and e'_j that stand in a parallel-structure or centering-structure relation with e_i and e_j respectively: set the current node to the parent node of e; if the dependency relation of the parent node is a parallel-structure or centering-structure relation, continue traversing upward; otherwise return the parent node;
step 2-4-2, extract, by the following method, the verb V_j with the nearest dependency relation to the dependency-associated node e'_j of the 2nd entity e_j: initialize the return value to null and set the current node to the parent node of e; while the current node is not the root node, judge: if the current node is a verb node, it is the verb with the nearest dependency relation to entity e, so end the loop and return this verb node as the sought nearest dependent verb; otherwise set the parent of the current node as the current node and continue judging;
step 2-4-3, obtain, by the following method, the verb V_i nearest to the dependency-associated node e'_i of the 1st entity e_i in a subject-predicate or fronted-object relation: initialize the return value to null and set the current node to the parent node of e; while the current node is not the root node, judge: if the current node is a verb node and it stands in a subject-predicate or fronted-object relation with the entity, it is the verb with the nearest dependency relation to entity e, so end the loop and return this verb node as the sought nearest dependent verb; otherwise set the parent of the current node as the current node and continue judging;
step 2-4-4, judge whether V_i and V_j are the same verb or stand in a parallel-structure relation; if so, the nearest dependent verb DV of the entity pair <e_i, e_j> is determined.
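The upward search of steps 2-4-2 and 2-4-3 can be sketched over a parent-linked parse tree; the `Node` type and the relation tags ("SBV" for subject-predicate, "FOB" for fronted object, following common Chinese dependency schemes) are assumptions, not structures defined in the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    word: str
    pos: str                       # part of speech, e.g. "v" for verb
    relation: str                  # dependency relation to the parent
    parent: Optional["Node"] = None

def nearest_dependent_verb(entity, relation_filter=None):
    # Climb from the entity's parent toward the root and return the first
    # verb node (step 2-4-2). If relation_filter is given (step 2-4-3),
    # additionally require that the child on the path stands in one of
    # those relations (e.g. {"SBV", "FOB"}) to the verb.
    prev, node = entity, entity.parent
    while node is not None:
        if node.pos == "v" and (relation_filter is None
                                or prev.relation in relation_filter):
            return node
        prev, node = node, node.parent
    return None   # reached the root without finding a qualifying verb
```

Step 2-4-4 would then compare the two returned verbs for identity or a parallel-structure relation to fix the nearest dependent verb DV of the pair <e_i, e_j>.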
Step 3 comprises the following steps:
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and associated knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to an analysis tree and parameter mapping generated by training corpus, which specifically comprises the following steps: step 3-2-1, independently constructing a mapping rule for each semantic hierarchy with parameter mapping; mapping rules refer to rules from a particular semantic hierarchy to a target structure fragment;
step 3-2-2, if parameter mappings at different levels exist in the same parse tree, construct an identification rule containing those levels according to the nested points (preferentially constructing the identification rule from the target structure, and falling back to the semantic structure when the target structure cannot be used); a nested point is a sentence of the text containing several semantic terms; an identification rule means that, for the same target structure occurring in different parse trees, a parse tree with defaulted components or component references can have its components completed by comparison with the complete parse tree;
step 3-2-3, if parameter mapping related to the same target structure exists in different parse trees, constructing a recognition rule of the cross-sentence block according to the association points; the association points are connection points formed by default and reference relations among different sentence blocks, namely the prior language and the corresponding language in the reference, and the prior language and the default language in the default;
step 3-2-4, if more than two sentence blocks appear in the end sample, wherein the sentence blocks all contain parameter mapping, and the end sample does not provide associated labeling information about the sentence blocks, the user is actively prompted to supplement corresponding associated labeling;
step 3-2-5, if the center component of one hierarchy that is modified and limited by the modifier is extracted, and the other components in the hierarchy are not extracted, the hierarchy is not processed.
Step 4 comprises the following steps:
step 4-1, obtaining a primary first-order logical expression according to the hierarchical semantic structure of the input text;
step 4-2, performing association reasoning by using a first-order logical formula (the first-order logical formula is obtained by text semantic analysis and can be a rule or a fact), and realizing variable unification of the first-order logical formula by using default, reference and unification relations among contexts to obtain a unification first-order logical formula after default recovery, reference digestion and entity unification;
step 4-3, performing mapping reasoning by utilizing unified first-order logic formulas, wherein each independent first-order logic formula can generate a primary target structure fragment;
step 4-4, utilizing the unification first-order logic formula or the original target structure segment to conduct recognition reasoning so as to obtain a coupling target structure segment;
step 4-5, if the predicates of two positionally adjacent or overlapping coupling target structure fragments are the same, and either the subjects and objects corresponding to the predicates in the text phrase are completely different or the values of their shared parameters are identical, directly combine the two fragments into a larger target structure as the final output; otherwise, execute step 4-6;
step 4-6, regarding two coupling target structure fragments adjacent or overlapped in position as different target structure examples of the same predicate, and taking the two coupling target structure fragments as final output;
step 4-7, repeating step 4-5, step 4-6 until no new, larger coupling target segments are generated, and obtaining all target structure examples, namely the final output.
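The fixpoint merging of steps 4-5 through 4-7 can be sketched as below; fragments are represented as predicate/argument dictionaries, which is an illustrative encoding rather than the patent's internal structure:

```python
def compatible(a, b):
    # Fragments are mergeable when they do not conflict on any shared
    # parameter: either their arguments are disjoint (step 4-5, "completely
    # different"), or shared parameters carry identical values.
    shared = a["args"].keys() & b["args"].keys()
    return all(a["args"][k] == b["args"][k] for k in shared)

def merge_fragments(fragments):
    # Repeat until no new, larger coupling target fragment is generated
    # (step 4-7); conflicting fragments remain separate target structure
    # instances of the same predicate (step 4-6).
    frags = [{"predicate": f["predicate"], "args": dict(f["args"])}
             for f in fragments]
    changed = True
    while changed:
        changed = False
        for i in range(len(frags)):
            for j in range(i + 1, len(frags)):
                a, b = frags[i], frags[j]
                if a["predicate"] == b["predicate"] and compatible(a, b):
                    a["args"].update(b["args"])
                    frags.pop(j)
                    changed = True
                    break
            if changed:
                break
    return frags
```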
The invention also comprises a step 5: applying the triplet knowledge extracted in steps 3 and 4 to applications such as intelligent retrieval, intelligent question answering, knowledge mining, and decision support.
Compared with the prior art, the invention has the remarkable advantages that:
(1) The method adopts a hierarchical semantic analysis technology based on a semantic model and uses it to realize heuristic learning on end-to-end samples, achieving the effect of inferring three cases from one example, realizing triple information extraction on the basis of chapter-level understanding, and ensuring complete and usable triple extraction results;
(2) Training of small samples is achieved through heuristic learning. Because the knowledge used in the event semantic model is based on semantic patterns, which are highly multiplexed in the natural language expression, one end sample can contribute to highly multiplexed extraction knowledge, so that training can be completed without a huge amount of samples, and the problem of lack of effective samples is effectively solved.
(3) The method is based on chapter-level semantic analysis, is extensible, and can extract both binary relations (triples) and multivariate relations;
(4) The method has higher accuracy and recall rate, and is an effective means for forming a high-quality knowledge graph in the vertical field and realizing intelligent analysis of the field knowledge.
Drawings
The foregoing and/or other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings and detailed description.
Fig. 1 is a flow chart of the present invention.
Fig. 2 is a text data preprocessing flow chart of the present invention.
Fig. 3 is an entity alignment flow chart of the present invention.
FIG. 4 is an exemplary diagram of a hierarchical semantic structure of the present invention.
Detailed Description
Aiming at the common problems in the existing triplet information extraction field, such as incomplete extracted information and large, costly training sample requirements, the method adopts a hierarchical semantic analysis technology based on a semantic model to establish an event semantic model, effectively captures the entity relations and information structures contained in texts, and uses heuristic learning to reduce the number of required samples. It realizes chapter-level triplet information extraction, can effectively solve or mitigate the problems of data scale, learning ability, complex contexts, and open relations, and can form a high-quality vertical-domain knowledge graph. The invention provides an improved chapter-level triplet information extraction method, which, as shown in fig. 1, comprises the following steps:
step 1, preprocessing text data;
step 1-1, converting text data formats, and extracting effective text contents from documents in pdf, docx and other formats;
and step 1-2, preprocessing and cleaning the format-converted text data using natural language processing technology. The converted text data may contain useless information such as advertisements and special characters without practical meaning, so text data preprocessing is performed, including: full-width/half-width conversion, conversion of uppercase numerals into lowercase numerals, conversion of uppercase letters into lowercase letters, removal of emoticons, removal of all non-Chinese characters (only Chinese is preserved), Chinese text word segmentation, traditional-to-simplified Chinese conversion, Chinese stop-word filtering, and the like; the preprocessing flow chart is shown in figure 2;
step 1-3, text data chapter structure processing, namely splitting a longer document into a plurality of text blocks (knowledge points);
step 1-4, splitting text data sentence blocks, and further splitting the text blocks into physical sentence blocks with punctuation mark intervals, wherein the method specifically comprises the following steps:
step 1-4-1, for brackets in sentence blocks: if the content of a bracket body is in a close coupling relation with the adjacent component on its left, merge the bracket body and the adjacent component into one semantic component; otherwise the bracket body is not merged;
step 1-4-2, for quotation marks in sentence blocks, merging quotation marks with a named entity if the quotation marks belong to a part of the named entity, otherwise, not processing;
step 1-4-3, for other symbols in the sentence block, if the symbol is a part of a named entity, merging the punctuation symbol and the related context into a semantic entity, otherwise, using the punctuation symbol as a mark for dividing the physical sentence block;
step 2, performing chapter-level semantic analysis on the text data;
step 2-1, carrying out semantic analysis on continuous texts in chapters by using known language knowledge, and respectively generating a list formed by an analysis tree for each continuous text block;
step 2-2, combining the information structure of the text data, the categories of terms playing specific roles, and the category of the text data, decompose the complex semantics into a hierarchical semantic structure, as shown for example in fig. 4, followed by steps 2-3 and 2-4. To illustrate: the text in the figure, "Edison invented an incandescent lamp that lights up at night", is actually nested from two basic expressions, "Edison invented an incandescent lamp" and "the incandescent lamp lights up at night". Specifically, basic expression 1, the fact "Edison, invented, an incandescent lamp that lights up at night", constitutes the first semantic layer, in which "Edison" is the agent and "an incandescent lamp that lights up at night" is the patient; "an incandescent lamp that lights up at night" constitutes a nested sub-layer with respect to "an incandescent lamp". Put differently, "that lights up at night" is a modifying phrase whose center word is "incandescent lamp", so "incandescent lamp" couples the two layers as the nesting point;
step 2-3, obtaining a hierarchical semantic structure as described above, wherein each hierarchy comprises a plurality of semantic blocks related to facts or concepts;
step 2-4, according to the sequence of the subsequent traversal, preferentially performing operations such as query and the like on the semantic blocks of the nested layer, determining the extension of the semantic blocks, and the like;
step 2-5, as shown in fig. 3, perform entity alignment: first judge according to the entity name whether an entity with the same name exists in the entity library; if not, generate a new entity pair and add it to the entity library; otherwise acquire all entity pairs with the same name, calculate the similarity between the target entity pair and each acquired entity pair, comprehensively score and rank the candidates according to the similarity of category labels, attribute labels, and unstructured text keywords, and, if the best score is smaller than the threshold, add the target entity to the entity library; otherwise select the highest-scoring result as the alignment result of the target entity;
step 2-6, extracting the nearest syntax dependency verb by the entity, wherein the specific steps are step 2-7, step 2-8, step 2-9 and step 2-10;
step 2-7, extracting, for entities e_i and e_j respectively, the dependency-associated nodes e′_i and e′_j that stand in a parallel-structure or centering-structure relation, as in algorithm 2-1;
step 2-8, extracting the verb V_j with the nearest dependency relation to the dependency-associated node e′_j of the 2nd entity e_j, as in algorithm 2-2;
step 2-9, acquiring the verb V_i nearest to the dependency-associated node e′_i of the 1st entity e_i that stands in a subject-predicate or preposed-object relation with it, as in algorithm 2-3;
step 2-10, judging whether verbs V_i and V_j are the same verb or stand in a parallel-structure relation, thereby determining the nearest dependent verb DV of the entity pair <e_i, e_j>;
Algorithm 2-1: extract the dependency-associated node of an entity
Algorithm 2-2: extract the verb with the nearest dependency relation to the 2nd entity
Algorithm 2-3: extract the verb with the nearest subject-predicate or preposed-object relation to the 1st entity
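Algorithms 2-1 through 2-3 all walk upward in the dependency tree from an entity's node. A compact sketch under our own assumptions: each node carries its parent, part of speech, and incoming arc label, and the relation tags (COO for parallel structure, ATT for centering structure, SBV for subject-predicate, FOB for preposed object) follow the LTP tag convention, which the patent does not itself specify.

```python
class Node:
    """A dependency-tree node (our own illustration, not the patent's data type)."""
    def __init__(self, word, pos, rel=None, parent=None):
        self.word, self.pos, self.rel, self.parent = word, pos, rel, parent

def associated_node(node):
    """Algorithm 2-1: climb while the arc is a parallel (COO) or
    centering (ATT) relation; return the last such node."""
    cur = node
    while cur.parent is not None and cur.rel in ("COO", "ATT"):
        cur = cur.parent
    return cur

def nearest_verb(node):
    """Algorithm 2-2: the nearest verb on the path from the node to the root."""
    cur = node.parent
    while cur is not None:
        if cur.pos == "v":
            return cur
        cur = cur.parent
    return None

def nearest_governing_verb(node, rels=("SBV", "FOB")):
    """Algorithm 2-3: the nearest verb that heads the node through a
    subject-predicate (SBV) or preposed-object (FOB) relation."""
    cur = node
    while cur.parent is not None:
        if cur.parent.pos == "v" and cur.rel in rels:
            return cur.parent
        cur = cur.parent
    return None
```

Step 2-10 then compares the verbs returned for the two entities: if they are the same verb (or coordinated verbs), that verb is taken as the nearest dependent verb DV of the pair.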
Step 3, heuristic learning is carried out in a multi-round iteration mode, and an event semantic model is built;
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and associated knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to the parse trees and parameter mappings generated from the end-to-end samples, which specifically comprises the following steps:
step 3-2-1, independently constructing a mapping rule for each semantic hierarchy with parameter mapping;
step 3-2-2, if parameter mapping in different levels exists in the same parse tree, constructing an identification rule containing the levels according to the nested points;
step 3-2-3, if parameter mapping related to the same target structure exists in different parse trees, trying to construct a recognition rule of a cross-sentence block according to the association points;
step 3-2-4, if multiple sentence blocks appear in the end-to-end sample (i.e., there are correspondingly multiple parse trees), all of which contain parameter mappings, and the sample provides no association labeling information about these sentence blocks, the user should be actively prompted to supplement the corresponding association labels;
step 3-2-5, if the head word of a level is extracted but the other components in that level are not, the level can be ignored;
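The case analysis of steps 3-2-1 through 3-2-4 amounts to a dispatch over where the parameter mappings sit: within one level, across levels of one parse tree, or across parse trees. A schematic version, in which the dictionary shapes and rule tuples are entirely our own assumptions:

```python
def build_extraction_rules(parse_trees):
    """Sketch of step 3-2: derive extraction knowledge from parse trees.

    Each tree is assumed to look like
        {"mappings": [{"level": int, "target": str}, ...], "assoc_label": bool}
    -- an illustrative shape, not the patent's representation.
    """
    rules, prompts = [], []
    for tree in parse_trees:
        levels = {m["level"] for m in tree["mappings"]}
        for lvl in levels:                  # step 3-2-1: one mapping rule per level
            rules.append(("map", id(tree), lvl))
        if len(levels) > 1:                 # step 3-2-2: recognition rule at nesting points
            rules.append(("recognize-nested", id(tree), tuple(sorted(levels))))

    targets = {}
    for tree in parse_trees:
        for m in tree["mappings"]:
            targets.setdefault(m["target"], set()).add(id(tree))
    for tgt, trees in targets.items():      # step 3-2-3: cross-sentence-block rule
        if len(trees) > 1:
            rules.append(("recognize-cross", tgt))

    mapped = [t for t in parse_trees if t["mappings"]]
    if len(mapped) > 1 and not any(t.get("assoc_label") for t in mapped):
        prompts.append("please supplement association labels")   # step 3-2-4
    return rules, prompts
```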
step 4, extracting triples based on end-to-end samples;
step 4-1, obtaining a primary first-order logical expression according to the hierarchical semantic structure of the input text;
step 4-2, perform association reasoning with the first-order logic formulas, achieving variable unification among them by exploiting default, reference, and coreference relations across contexts, and obtaining unified first-order logic formulas after default recovery, coreference resolution, and entity unification;
step 4-3, performing mapping reasoning by utilizing unified first-order logic formulas, wherein each independent first-order logic formula can generate a plurality of original target structure fragments;
step 4-4, utilizing the unified first-order logic formula or the target structure fragments to conduct recognition reasoning to obtain a plurality of coupling target structure fragments;
step 4-5, if the predicates of two positionally adjacent or overlapping coupled target-structure fragments are the same, but their parameters are completely different, or the values of their shared parameters are identical, the two fragments can be directly merged into a larger target structure as a final output; otherwise, execute step 4-6;
step 4-6, regard the two fragments as different target-structure instances of the same predicate, and take both as final output;
step 4-7, repeat steps 4-5 and 4-6 until no new, larger coupled target fragments are generated; all target-structure instances obtained are the final output.
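The merge test of steps 4-5 and 4-6 can be sketched with fragments represented as a predicate plus a parameter dictionary. This representation is our own assumption; the patent does not fix a concrete data type.

```python
def merge_fragments(f1, f2):
    """Steps 4-5 / 4-6: try to merge two adjacent or overlapping coupled
    target-structure fragments.

    A fragment is (predicate, params). Merging is allowed when the
    predicates match and the parameter sets are either disjoint or agree
    on every shared key (step 4-5); otherwise the fragments remain
    distinct instances of the same predicate (step 4-6), signalled by None.
    """
    p1, a1 = f1
    p2, a2 = f2
    if p1 != p2:
        return None                          # different predicates: keep both
    shared = set(a1) & set(a2)
    if all(a1[k] == a2[k] for k in shared):  # disjoint, or equal where shared
        return (p1, {**a1, **a2})            # merge into a larger target structure
    return None                              # conflicting values: distinct instances
```

Step 4-7 would apply this repeatedly over the fragment set until no call returns a new, larger structure.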
Step 5, some applications of the triple knowledge extracted in steps 3 and 4: in intelligent search, a Baidu search for "the incumbent President of the United States" mainly displays results about President A, together with further descriptions and retrieval refinements concerning President B; intelligent question answering can be regarded as an extension of semantic search and can be applied to chatbots, providing not only scripted dialogue but also knowledge from various industries — the knowledge graph it relies on is an open-domain graph whose coverage is very broad, supplying users with everyday knowledge as well as chat dialogue; a personalized recommendation system collects users' interest preferences and attributes together with the classification, attributes, and content of products, analyzes the social relations among users and the associations between users and products, and infers users' preferences and needs with a personalization algorithm so as to recommend products or content of interest; decision support analyzes and processes the knowledge in the knowledge graph and reaches conclusions by rule-based logical reasoning, thereby supporting users' decisions.
The present invention provides an improved chapter-level triple information extraction method, and there are many methods and ways to implement this technical scheme; the above is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. Components not explicitly described in this embodiment can be implemented with the prior art.

Claims (2)

1. An improved chapter-level triplet information extraction method, comprising the steps of:
step 1, preprocessing text data;
step 2, performing chapter-level semantic analysis on the text data;
step 3, heuristic learning is carried out in a multi-round iteration mode, and an event semantic model is built;
step 4, extracting triples based on end-to-end samples;
step 1 comprises the following steps:
step 1-1, converting a text data format;
step 1-2, preprocessing and cleaning the text data after format conversion by using a natural language processing technology;
step 1-3, text data chapter structure processing: splitting a long file into text blocks;
step 1-4, dividing text data sentence blocks, and further dividing the text blocks into physical sentence blocks with punctuation mark intervals;
the steps 1-4 comprise:
step 1-4-1, for brackets in a text block, if the content in the brackets and the adjacent components on the left side are in a tight semantic relation, merging the content in the brackets and the text components adjacent to the left brackets into a semantic component, otherwise, not processing the brackets;
step 1-4-2, for quotation marks in sentence blocks, merging quotation mark bodies with a named entity if the quotation mark bodies belong to a part of the named entity, otherwise, disregarding;
step 1-4-3, for other symbols in the sentence block, if the symbols are part of a named entity, combining the other symbols in the sentence block with related contexts to form a semantic entity, otherwise, using the other symbols in the sentence block as marks for dividing the physical sentence block;
step 2 comprises the following steps:
step 2-1, carrying out semantic analysis on continuous texts in the chapters by using known grammar syntax knowledge of linguistics, and respectively generating a list formed by an analysis tree for each continuous text block;
step 2-2, decomposing complex semantics into a hierarchical semantic structure by combining an information structure of the text data, a category of a term which plays a specific role and a category of the text data;
step 2-3, entity alignment is carried out;
step 2-4, extracting the nearest syntax dependency verb by the entity;
in step 2-2, each level in the hierarchical semantic structure includes N semantic blocks related to facts or concepts, where N is a natural number; in post-order traversal order, first perform query operations on the semantic blocks of the nested layers and determine their extensions; after the nested layers are processed, perform query operations on the other fact or concept semantic blocks and determine the extension of each semantic block;
the step 2-3 comprises the following steps:
judging, by entity name, whether an entity with the same name exists in the pre-established entity library; if not, generating a new entity pair and adding it to the entity library; otherwise, obtaining all same-name entity pairs, computing the similarity between the target entity pair and each obtained entity pair, and ranking the candidates with a composite score over the similarities of category labels, attribute labels, and unstructured-text keywords; if the top score is below a threshold, adding the target entity to the entity library, otherwise selecting the highest-scoring candidate as the alignment result of the target entity;
the steps 2-4 comprise:
step 2-4-1, let the two different entities be e_i and e_j respectively, and extract, by the following method, the dependency-associated nodes e′_i and e′_j that stand in a parallel-structure or centering-structure relation with e_i and e_j: set the current node to the parent node of e; if the dependency relation of the parent node is a parallel-structure or centering-structure relation, continue traversing the nodes; as long as that condition is satisfied, continue traversing, otherwise return the parent node;
step 2-4-2, extract, by the following method, the verb V_j with the nearest dependency relation to the dependency-associated node e′_j of the 2nd entity e_j: initialize the return value to null and set the current node to the parent node of e; while the current node is not the root node, judge: if the current node is a verb node, it is the verb with the nearest dependency relation to the entity e; end the loop and return this verb node as the nearest dependent verb sought; otherwise, set the parent of the current node as the current node and continue judging;
step 2-4-3, obtain, by the following method, the verb V_i nearest to the dependency-associated node e′_i of the 1st entity e_i that stands in a subject-predicate or preposed-object relation with it: initialize the return value to null and set the current node to the parent node of e; while the current node is not the root node, judge: if the current node is a verb node and stands in a subject-predicate or preposed-object relation with the entity, it is the verb with the nearest dependency relation to the entity e; end the loop and return this verb node as the nearest dependent verb sought; otherwise, set the parent of the current node as the current node and continue judging;
step 2-4-4, judge whether verbs V_i and V_j are the same verb or stand in a parallel-structure relation for the entity pair <e_i, e_j>, thereby determining a triple;
step 3 comprises the following steps:
step 3-1, performing hierarchical semantic analysis on the text data, and generating mapping knowledge, identification knowledge and associated knowledge according to a hierarchical semantic structure;
step 3-2, generating extraction knowledge according to an analysis tree and parameter mapping generated by training corpus, which specifically comprises the following steps:
step 3-2-1, independently constructing a mapping rule for each semantic hierarchy with parameter mapping;
step 3-2-2, if parameter mapping in different levels exists in the same parse tree, constructing an identification rule containing the levels according to the nested points;
step 3-2-3, if parameter mapping related to the same target structure exists in different parse trees, constructing a recognition rule of the cross-sentence block according to the association points;
step 3-2-4, if more than two sentence blocks appear in the end sample, wherein the sentence blocks all contain parameter mapping, and the end sample does not provide associated labeling information about the sentence blocks, the user is actively prompted to supplement corresponding associated labeling;
step 3-2-5, if the center component of one hierarchy, which is modified and limited by the modifier, is extracted, and the other components in the hierarchy are not extracted, the hierarchy is not processed;
step 4 comprises the following steps:
step 4-1, obtaining a primary first-order logical expression according to the hierarchical semantic structure of the input text;
step 4-2, performing association reasoning with the first-order logic formulas, achieving variable unification among them by exploiting default, reference, and coreference relations across contexts, and obtaining unified first-order logic formulas after default recovery, coreference resolution, and entity unification;
step 4-3, performing mapping reasoning by utilizing unified first-order logic formulas, wherein each independent first-order logic formula can generate a primary target structure fragment;
step 4-4, utilizing the unification first-order logic formula or the original target structure segment to conduct recognition reasoning so as to obtain a coupling target structure segment;
step 4-5, if the predicates of two positionally adjacent or overlapping coupled target-structure fragments are the same, but the subjects and objects corresponding to the predicates in the text phrase are completely different, or the values of their shared parameters are identical, directly merging the two fragments into a larger target structure as the final output; otherwise, executing step 4-6;
step 4-6, regarding the two positionally adjacent or overlapping coupled target-structure fragments as different target-structure instances of the same predicate, and taking both as the final output;
step 4-7, repeating steps 4-5 and 4-6 until no new, larger coupled target fragments are generated, obtaining all target-structure instances, namely the final output.
2. The method of claim 1, wherein step 1-2 comprises: performing the following processing on the format-converted text data in sequence: full-width to half-width conversion, conversion of uppercase numerals to lowercase numerals, conversion of uppercase letters to lowercase letters, removal of emoticons, removal of all non-Chinese characters so that only Chinese is retained, Chinese word segmentation, traditional-to-simplified Chinese conversion, and Chinese stop-word filtering.
CN202110399643.8A 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method Active CN113312922B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110399643.8A CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110399643.8A CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method

Publications (2)

Publication Number Publication Date
CN113312922A CN113312922A (en) 2021-08-27
CN113312922B true CN113312922B (en) 2023-10-24

Family

ID=77372136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110399643.8A Active CN113312922B (en) 2021-04-14 2021-04-14 Improved chapter-level triple information extraction method

Country Status (1)

Country Link
CN (1) CN113312922B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114707520B (en) * 2022-06-06 2022-09-13 天津大学 Session-oriented semantic dependency analysis method and device
CN115081437B (en) * 2022-07-20 2022-12-09 中国电子科技集团公司第三十研究所 Machine-generated text detection method and system based on linguistic feature contrast learning
CN117094396B (en) * 2023-10-19 2024-01-23 北京英视睿达科技股份有限公司 Knowledge extraction method, knowledge extraction device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015080561A1 (en) * 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated relation discovery from texts
CN109446338A (en) * 2018-09-20 2019-03-08 大连交通大学 Drug disease relationship classification method neural network based
CA3060811A1 (en) * 2018-10-31 2020-04-30 Royal Bank Of Canada System and method for cross-domain transferable neural coherence model
CN111274790A (en) * 2020-02-13 2020-06-12 东南大学 Chapter-level event embedding method and device based on syntactic dependency graph
CN111597351A (en) * 2020-05-14 2020-08-28 上海德拓信息技术股份有限公司 Visual document map construction method


Non-Patent Citations (2)

Title
Discourse-level event representation and relevance computation; 刘一仝; China Master's Theses Full-text Database, Information Science and Technology, No. 02; I138-2371 *
End-to-end joint extraction of knowledge triples incorporating adversarial training; 黄培馨 et al.; Journal of Computer Research and Development; Vol. 56, No. 12; pp. 2536-2548 *

Also Published As

Publication number Publication date
CN113312922A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN109684448B (en) Intelligent question and answer method
CN111723215B (en) Device and method for establishing biotechnological information knowledge graph based on text mining
CN108573411B (en) Mixed recommendation method based on deep emotion analysis and multi-source recommendation view fusion of user comments
CN110110054B (en) Method for acquiring question-answer pairs from unstructured text based on deep learning
CN109271529B (en) Method for constructing bilingual knowledge graph of Xilier Mongolian and traditional Mongolian
CN108121829B (en) Software defect-oriented domain knowledge graph automatic construction method
CN113312922B (en) Improved chapter-level triple information extraction method
CN105095204B (en) The acquisition methods and device of synonym
CN110502642B (en) Entity relation extraction method based on dependency syntactic analysis and rules
CN112100322B (en) API element comparison result automatic generation method based on knowledge graph
CN110609983B (en) Structured decomposition method for policy file
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN111061882A (en) Knowledge graph construction method
CN112926345A (en) Multi-feature fusion neural machine translation error detection method based on data enhancement training
CN112733547A (en) Chinese question semantic understanding method by utilizing semantic dependency analysis
Bounhas et al. A hybrid possibilistic approach for Arabic full morphological disambiguation
JPH0816620A (en) Data sorting device/method, data sorting tree generation device/method, derivative extraction device/method, thesaurus construction device/method, and data processing system
CN114997288A (en) Design resource association method
CN115017335A (en) Knowledge graph construction method and system
CN111178080A (en) Named entity identification method and system based on structured information
Wang A cross-domain natural language interface to databases using adversarial text method
CN110750632B (en) Improved Chinese ALICE intelligent question-answering method and system
CN114997398B (en) Knowledge base fusion method based on relation extraction
CN115759037A (en) Intelligent auditing frame and auditing method for building construction scheme

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant