CN111651562A

CN111651562A - Scientific and technological literature content deep revealing method based on content map

Info

Publication number: CN111651562A
Application number: CN202010504233.0A
Authority: CN
Inventors: 王敬东; 宋建磊; 孟凡奇; 李佳
Original assignee: Northeast Dianli University
Current assignee: Northeast Electric Power University
Priority date: 2020-06-05
Filing date: 2020-06-05
Publication date: 2020-09-11
Anticipated expiration: 2040-06-05
Also published as: CN111651562B

Abstract

The invention discloses a method for deeply revealing the content of scientific and technical literature based on a content map, comprising the following steps: extracting knowledge objects and their semantic relations from text data; constructing a content map of multiple scientific and technical documents; and deeply aggregating the content knowledge of the scientific and technical literature. The method deepens the organization of the content knowledge of scientific and technical literature from the external features of the documents to their internal features, discovers implicit knowledge among documents and reasons over that knowledge, generates cross-document knowledge clusters and knowledge chains in planar space and three-dimensional space, and links the 'free-state' knowledge scattered among documents. It thereby achieves effective cooperation and deep aggregation of knowledge, and helps to solve the 'information maze' problem encountered when consulting scientific and technical literature and the 'knowledge dissociation' problem that exists among such literature.

Description

Scientific and technological literature content deep revealing method based on content map

Technical Field

The invention relates to semantic analysis, in particular to a scientific and technical literature content deep revealing method based on a content map.

Background

At present, with the explosive growth in the number of scientific and technical documents, scholars have carried out a large amount of research from different angles and according to different theories in order to reveal the knowledge these documents contain, and have proposed a scientific and technical text knowledge element model, a scientific paper content structure model, a scientific paper content ontology model and the like. Although this research has achieved certain results, it either works mainly from external features such as titles, subjects, authors, keywords and references, or reveals the knowledge objects and semantic relations within a single document only. Deep mining and organization of content across documents on the same subject are lacking. As a result, most of the content knowledge of scientific and technical literature still exists in a "free state", cooperation among knowledge is lacking, and cross-document knowledge clusters and knowledge chains are difficult to generate. This research refers to the phenomenon as "knowledge dissociation".

Organization modes for the content knowledge of scientific and technical literature that are based on knowledge elements or ontologies suffer from problems such as insufficiently fine description granularity of document knowledge, a lack of semantic association among knowledge, and an insufficient degree of semantic association. Moreover, both approaches reveal the content of a single scientific and technical document only and cut off its connection with other documents on the same subject: the knowledge among documents still exists in a "free state", cooperation among knowledge is lacking, cross-document knowledge clusters and knowledge chains cannot be generated effectively, and deep aggregation of document knowledge is hindered. Organization modes based on graph theory are mainly used to identify text topics; they reveal little of the content knowledge of scientific and technical literature and cannot solve the problem of knowledge dissociation.

Disclosure of Invention

The main object of the invention is to provide a method for deeply revealing the content of scientific and technical literature based on a content map. The method uses the content map of scientific and technical literature to link the "free-state" knowledge among documents, achieves effective cooperation and deep aggregation of knowledge, and solves the problem of "knowledge dissociation", so as to effectively meet the complex information needs of users and improve knowledge service capability.

The technical scheme adopted by the invention is as follows: a scientific and technical literature content deep revealing method based on a content map comprises the following steps:

extracting knowledge objects and semantic relations of the knowledge objects from the text data;

constructing a content map of a plurality of scientific and technical documents;

deep aggregation of knowledge of scientific and technical literature contents.

Further, the extracting knowledge objects and semantic relations thereof from the text data includes:

step 1: inputting a prepared text data set;

step 2: manually labeling the unstructured experimental data in the text data and converting it into structured data; marking each text with a text label, title and abstract, identifying the position and number of each sentence, and refining the abstract content into purpose, method, result and conclusion;

step 3: preprocessing the text data, and deleting useless knowledge objects by using a stop word list;

step 4: extracting knowledge object syntactic triples from the processed text data by using the tool ClausIE, and storing the extracted syntactic triples;

step 5: processing the extracted syntactic triples;

step 6: matching the incomplete syntactic triples obtained in step 5 against a domain semantic dictionary, looking up the semantic relation between the head entity and the tail entity to complete each incomplete triple, taking the completed triples as semantic triples, and storing the semantic triples;

step 7: storing each processed semantic triple together with its position information and its syntactic triple to form the required data set, namely the semantic set SS, which serves as the data for constructing the content map of multiple scientific and technical documents; once the knowledge objects of the scientific and technical literature content and their relations have been extracted, the construction of the semantic set is complete.
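Steps 3 and 5 to 7 above can be illustrated with a minimal Python sketch. It assumes the syntactic triples have already been produced by an extraction tool and are available as plain tuples; the stop word list, the domain dictionary entry, the relation label and the record layout are all hypothetical placeholders chosen for illustration, not data from the invention.

```python
from typing import Optional

# Hypothetical stop word list and domain semantic dictionary (illustrative only).
STOP_WORDS = {"the", "a", "an", "it", "they", "this"}
DOMAIN_DICT = {
    ("osimertinib", "non-small-cell lung cancer"): "TREATS",
}


def strip_stop_words(phrase: str) -> str:
    """Step 3: drop stop words from an entity phrase."""
    return " ".join(w for w in phrase.lower().split() if w not in STOP_WORDS)


def to_incomplete(triple: tuple[str, str, str]) -> tuple[str, None, str]:
    """Step 5: keep only the entities, giving (head entity, ___, tail entity)."""
    head, _, tail = triple
    return strip_stop_words(head), None, strip_stop_words(tail)


def complete(incomplete: tuple[str, None, str]) -> Optional[tuple[str, str, str]]:
    """Step 6: look up the semantic relation in the domain dictionary."""
    head, _, tail = incomplete
    relation = DOMAIN_DICT.get((head, tail))
    return (head, relation, tail) if relation else None


def build_semantic_set(syntactic_triples):
    """Step 7: store each semantic triple with its position and syntactic triple."""
    semantic_set = []
    for sentence_no, triple in syntactic_triples:
        semantic = complete(to_incomplete(triple))
        if semantic is not None:
            semantic_set.append(
                {"sentence": sentence_no, "syntactic": triple, "semantic": semantic}
            )
    return semantic_set


# One hypothetical extracted triple, tagged with its sentence number.
extracted = [
    (3, ("Osimertinib", "was given to patients with", "the non-small-cell lung cancer")),
]
semantic_set = build_semantic_set(extracted)
```

Keeping the sentence number next to each semantic triple is what makes the backtracking analysis of extraction results possible later on; triples whose entity pair is not found in the dictionary are simply dropped in this sketch.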

Further, the construction of the content map of the plurality of scientific and technical documents comprises:

a construction process for the content map of multiple scientific and technical documents and a construction algorithm for the content map of multiple scientific and technical documents.

Further, the construction process for the content map of multiple scientific and technical documents comprises:

step 1: collecting several scientific and technical documents on the same subject, establishing a document set, and using the document set as the original data;

step 2: extracting the core terms in the document set, the semantic relations among them, and other semantic elements by means of a domain dictionary;

step 3: disassembling the extracted original semantic elements with respect to their semantic structure to obtain basic semantic elements, and scattering the basic semantic elements;

step 4: recombining the scattered basic semantic elements according to the semantic logical relations inherent among them to form a semantic set;

step 5: constructing a new semantic structure and new semantic features to form the main part of the content map of the scientific and technical documents, thereby completing the construction of the content map.

Further, the construction algorithm for the content map of multiple scientific and technical documents comprises:

step 1: inputting the extracted and processed semantic set SS;

step 2: counting the number of semantic triples, the number of semantic relations, the number of occurrences of each knowledge object and the total number of occurrences of all knowledge objects in the semantic set SS, and then calculating the frequency of each knowledge object by using formula (1),

$$TF(t) = \frac{n_t}{N} \tag{1}$$

in the formula: $n_t$ represents the number of occurrences of knowledge object t in the semantic set SS, and $N$ represents the total number of occurrences of all knowledge objects in the semantic set SS;

step 3: placing the knowledge object A with the highest frequency in the semantic set SS into the scientific and technical literature content map as the first batch of points generated in the content map;

step 4: extracting all semantic triples containing knowledge object A from the semantic set SS and putting them into an empty semantic set to establish the semantic set SSA;

step 5: calculating the importance of the knowledge objects in the semantic set SSA, and sorting the knowledge objects by importance;

step 6: placing the knowledge object with rank n into the scientific and technical literature content map according to the importance of the knowledge objects in the semantic set SSA;

step 7: judging whether the semantic subset SSA is empty; if it is empty, executing step 8, and if it is not empty, returning to step 6;

step 8: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in turn, in order of the importance of the knowledge objects in the semantic set SSA, and executing steps 5 to 7 on each semantic subset;

step 9: judging whether the semantic set SS is empty; if it is empty, executing step 10, and if it is not empty, returning to step 8;

step 10: outputting the content map of the multiple scientific and technical documents.
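The ten steps above can be read as a breadth-first expansion that starts from the most frequent knowledge object. The following Python sketch is one possible reading under stated assumptions, not the definitive algorithm: semantic triples are plain `(head, relation, tail)` tuples, "extracting" a subset removes its triples from SS, the ranking inside each subset uses the subset-level frequency of formula (1) as a stand-in for the importance score, ties are broken alphabetically, and the expansion restarts from the most frequent remaining object if part of SS is not connected to the first point. The function and variable names are hypothetical.

```python
from collections import Counter, deque


def term_frequency(triples):
    """Formula (1): occurrences of each knowledge object / total occurrences."""
    counts = Counter(obj for head, _, tail in triples for obj in (head, tail))
    total = sum(counts.values())
    return {obj: n / total for obj, n in counts.items()}


def build_content_map(ss, importance=term_frequency):
    """Return (knowledge objects in placement order, edges) for semantic set ss."""
    remaining = list(ss)
    nodes, edges = [], []
    queue = deque()
    while remaining:                                  # step 9: loop until SS is empty
        if not queue:                                 # steps 2-3: seed with the most frequent object
            tf = term_frequency(remaining)
            seed = max(sorted(tf), key=tf.get)        # alphabetical tie-break (assumption)
            if seed not in nodes:
                nodes.append(seed)
            queue.append(seed)
        center = queue.popleft()
        # Steps 4 and 8: extract every triple containing the current object.
        subset = [t for t in remaining if center in (t[0], t[2])]
        remaining = [t for t in remaining if center not in (t[0], t[2])]
        if not subset:
            continue
        scores = importance(subset)                   # step 5: score and rank
        ranked = sorted((o for o in scores if o != center),
                        key=lambda o: (-scores[o], o))
        for obj in ranked:                            # steps 6-7: place in rank order
            if obj not in nodes:
                nodes.append(obj)
            queue.append(obj)
        edges.extend(subset)
    return nodes, edges                               # step 10


example_ss = [
    ("A", "treats", "B"),
    ("A", "causes", "C"),
    ("B", "inhibits", "D"),
    ("A", "treats", "D"),
]
nodes, edges = build_content_map(example_ss)
```

Passing a different `importance` callable is the intended extension point: any function that maps a semantic subset to a score per knowledge object can replace the frequency stand-in.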

Further, the deep aggregation of knowledge of the scientific literature content comprises:

performing implicit knowledge discovery and inference among documents by using the constructed content map, and generating a cross-document knowledge cluster from the dimensions of knowledge objects, semantic relations and quantity statistics in a plane space; in a three-dimensional space, a knowledge object is taken as a target, a semantic relation is taken as a clue to perform deep exploration and generate a cross-literature knowledge chain;

through semantic analysis on the constructed content maps of the plurality of scientific and technical documents, knowledge can be associated in two aspects of breadth and depth according to the difference of path lengths. In the three-dimensional space, a certain knowledge object is selected as a starting point, the path length is selected, and the semantic relation is taken as a clue to carry out deep association, so that a plurality of important facts of direct or indirect association between different knowledge objects taking the knowledge object as the starting point can be obtained, namely generation of a cross-literature knowledge chain.
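As an illustration of how a cross-document knowledge chain could be generated from a starting knowledge object and a chosen path length, the sketch below enumerates relation paths in a set of semantic triples. It assumes that chains follow triples from head entity to tail entity and never revisit a knowledge object; both are simplifying assumptions, and the function name and the example facts are hypothetical placeholders.

```python
from collections import defaultdict


def knowledge_chains(triples, start, path_length):
    """List all paths of exactly `path_length` triples that begin at `start`."""
    outgoing = defaultdict(list)
    for head, relation, tail in triples:
        outgoing[head].append((relation, tail))

    chains = []

    def walk(node, path, visited):
        if len(path) == path_length:
            chains.append(path)
            return
        for relation, tail in outgoing.get(node, []):
            if tail not in visited:  # do not revisit a knowledge object
                walk(tail, path + [(node, relation, tail)], visited | {tail})

    walk(start, [], {start})
    return chains


# Hypothetical facts that could come from different documents.
facts = [
    ("drug X", "TREATS", "disease Y"),
    ("disease Y", "CAUSES", "symptom Z"),
    ("drug X", "INTERACTS_WITH", "drug W"),
]
direct = knowledge_chains(facts, "drug X", 1)    # directly associated facts
indirect = knowledge_chains(facts, "drug X", 2)  # indirectly associated facts
```

A path length of 1 returns the facts directly attached to the starting object, while longer path lengths return chains of indirectly associated facts.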

The invention has the advantages that:

The method deepens the organization of the content knowledge of scientific and technical literature from the external features of the documents to their internal features, discovers implicit knowledge among documents and reasons over that knowledge, generates cross-document knowledge clusters and knowledge chains in planar space and three-dimensional space, and links the 'free-state' knowledge scattered among documents. It thereby achieves effective cooperation and deep aggregation of knowledge, and helps to solve the 'information maze' problem encountered when consulting scientific and technical literature and the 'knowledge dissociation' problem that exists among such literature.

In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.

FIG. 1 is a flow chart of a method for deep disclosure of technical literature content based on a content map according to an embodiment of the present invention;

FIG. 2 is a flow chart of knowledge object and semantic relationship extraction for text data according to an embodiment of the present invention;

FIG. 3 is a flow chart of the content mapping of the scientific and technical documents according to the embodiment of the present invention;

FIG. 4 is a diagram of exemplary steps for content mapping according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

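Referring to FIG. 1, a method for deeply revealing the content of scientific and technical literature based on a content map includes:

extracting knowledge objects and their semantic relations from text data;

constructing a content map of multiple scientific and technical documents;

deep aggregation of knowledge of scientific and technical literature contents.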

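Referring to FIG. 2, the extraction of knowledge objects and their semantic relations from text data includes:

step 1: inputting a prepared text data set;

step 2: manually labeling the unstructured experimental data in the text data and converting it into structured data; marking each text with a text label, title and abstract; identifying the position and number of each sentence, so that knowledge objects and the relations between them can be located in the text and the extraction results can be conveniently backtracked and analyzed; and refining the abstract content into purpose, method, result and conclusion;

step 3: preprocessing the text data, mainly deleting useless knowledge objects such as pronouns, articles and punctuation by using a stop word list;

step 4: extracting knowledge object syntactic triples from the processed text data by using the tool ClausIE, and storing the extracted syntactic triples;

step 5: processing each extracted syntactic triple into the format (head entity, ___, tail entity), that is, retaining only the entities (i.e., knowledge objects) of the syntactic triple, for example (osimertinib, ___, non-small-cell lung cancer);

step 6: matching the incomplete syntactic triples obtained in step 5 against a domain semantic dictionary, looking up the semantic relation between the head entity and the tail entity to complete each incomplete triple, taking the completed triples as semantic triples, and storing the semantic triples;

step 7: storing each processed semantic triple together with its position information and its syntactic triple to form the required data set, namely the semantic set SS, which serves as the data for constructing the content map of multiple scientific and technical documents.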

The construction of the content map of the multiple scientific and technical documents comprises the following:

Referring to FIG. 3, the construction process for the content map of the multiple scientific and technical documents includes:

step 1: collecting several scientific and technical documents on the same subject, establishing a document set, and using the document set as the original data;

step 2: extracting, by means of a domain dictionary, the core terms in the document set together with their semantic relations (subject-predicate-object triples) and other semantic elements;

step 3: disassembling the extracted original semantic elements (semantic triples) with respect to their semantic structure to obtain basic semantic elements such as subjects, objects and predicates, or knowledge objects and semantic relations, and scattering the basic semantic elements;

step 4: recombining the scattered basic semantic elements according to the semantic logical relations inherent among them to form a semantic set;

step 5: constructing a new semantic structure and new semantic features to form the main part of the content map of the scientific and technical documents, thereby completing the construction of the content map.

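The construction algorithm for the content map of the multiple scientific and technical documents comprises the following steps:

step 1: inputting the extracted and processed semantic set SS;

step 2: counting the number of semantic triples, the number of semantic relations, the number of occurrences of each knowledge object and the total number of occurrences of all knowledge objects in the semantic set SS, and then calculating the frequency of each knowledge object by using formula (1),

$$TF(t) = \frac{n_t}{N} \tag{1}$$

in the formula: $n_t$ represents the number of occurrences of knowledge object t in the semantic set SS, and $N$ represents the total number of occurrences of all knowledge objects in the semantic set SS;

step 3: placing the knowledge object A with the highest frequency in the semantic set SS into the scientific and technical literature content map as the first batch of points generated in the content map;

step 4: extracting all semantic triples containing knowledge object A from the semantic set SS and putting them into an empty semantic set to establish the semantic set SSA;

step 5: calculating the importance of the knowledge objects in the semantic set SSA, and sorting them by importance, with rank n (n = 1, 2, 3, …);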

In order to determine both the order in which the knowledge objects of a semantic subset are placed into the scientific and technical literature content map and the order in which they are used to return to the original semantic set SS and extract further semantic subsets, a knowledge object importance formula is provided. The formula draws on the TF-IDF method and adds the influence of semantic relations: it takes into account the frequency, the semantic relation weight and the inverse word frequency of a knowledge object in the semantic subset to calculate the importance of the knowledge object comprehensively, as follows:

(2)

in formula (2): R(t) represents the semantic weight of knowledge object t in the semantic subset; the remaining quantities are, in turn, the number of occurrences of the semantic relation between knowledge object t and knowledge object i in the semantic subset (SSi, i = A, B, C, …), the total number of occurrences of all semantic relations in the semantic set (SSi, i = A, B, C, …), and the number of semantic relation types between knowledge object t and knowledge object i.

(3)

in formula (3): ISF(t) represents the inverse word frequency of knowledge object t; M represents the total number of knowledge units (semantic triples) in the semantic set SS; the remaining quantity is the number of knowledge units (semantic triples) in the semantic set SS that contain knowledge object t.

Formula (1) is then applied to the corresponding semantic subset to calculate the frequency TF(t) of knowledge object t in that subset, and formulas (1), (2) and (3) are combined to obtain the knowledge object importance formula (4):

(4)

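step 6: placing the knowledge object with rank n (n starts from 1) into the scientific and technical literature content map according to the importance of the knowledge objects in the semantic set SSA;

step 7: judging whether the semantic subset SSA is empty; if it is empty, executing step 8, and if it is not empty, returning to step 6;

step 8: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in turn, in order of the importance of the knowledge objects in the semantic set SSA, and executing steps 5 to 7 on each semantic subset;

step 9: judging whether the semantic set SS is empty; if it is empty, executing step 10, and if it is not empty, returning to step 8;

step 10: outputting the content map of the multiple scientific and technical documents.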
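The bodies of formulas (2) to (4) are not reproduced in the text above, so the following Python sketch only illustrates how the described quantities could be combined. Every functional form in it is an assumption made for illustration: R(t) is taken as the sum, over partner objects i, of the share of relation occurrences between t and i multiplied by the number of relation types between them; ISF(t) is taken as the logarithm of M divided by the number of triples containing t, in the manner of the usual inverse document frequency; and the importance is taken as the product TF(t) · R(t) · ISF(t). The function name `importance` and the example triples are hypothetical.

```python
import math
from collections import Counter, defaultdict


def importance(subset, full_set):
    """Score each knowledge object in `subset` (assumed forms of formulas (2)-(4))."""
    # TF(t): formula (1) applied to the semantic subset.
    counts = Counter(obj for head, _, tail in subset for obj in (head, tail))
    total = sum(counts.values())

    # ASSUMED R(t): sum over partners i of (occurrences of relations between t and i
    # / all relation occurrences in the subset) * (number of relation types between t and i).
    pair_counts = Counter()
    pair_types = defaultdict(set)
    for head, relation, tail in subset:
        for a, b in {(head, tail), (tail, head)}:
            pair_counts[(a, b)] += 1
            pair_types[(a, b)].add(relation)
    semantic_weight = defaultdict(float)
    for (a, b), n in pair_counts.items():
        semantic_weight[a] += n / len(subset) * len(pair_types[(a, b)])

    # ASSUMED ISF(t): log(M / number of triples in SS that contain t).
    m = len(full_set)
    scores = {}
    for obj, n in counts.items():
        containing = sum(1 for head, _, tail in full_set if obj in (head, tail))
        scores[obj] = (n / total) * semantic_weight[obj] * math.log(m / containing)
    return scores


full_set = [
    ("A", "treats", "B"),
    ("A", "prevents", "B"),
    ("A", "causes", "C"),
    ("C", "affects", "D"),
    ("D", "affects", "E"),
]
ssa = [t for t in full_set if "A" in (t[0], t[2])]   # semantic subset SSA
scores = importance(ssa, full_set)
ranked = sorted((o for o in scores if o != "A"), key=lambda o: -scores[o])
```

Under these assumptions, an object that is linked to A often and through several relation types ranks ahead of one linked only once, which matches the stated intent of weighting frequency by semantic relations.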

Example of constructing the content map of multiple scientific and technical documents:

Based on the proposed method for constructing the content map of multiple scientific and technical documents, the construction process is described graphically and shown in detail; an example is shown in FIG. 4.

Process 1: compiling quantity statistics on the extracted semantic set SS, including the total number of semantic triples, the total number of semantic relations, the total frequency of knowledge objects and the frequency of each knowledge object;

Process 2: placing the knowledge object A with the highest frequency in the semantic set SS into the content map; knowledge object A is the first batch of points generated in the scientific and technical literature content map and is also its first point;

Process 3: extracting all semantic triples containing knowledge object A from the semantic set SS and putting them into a new semantic set SSA;

Process 4: calculating the importance of the knowledge objects other than knowledge object A in the semantic set SSA, and sorting the knowledge objects in SSA by importance according to the calculation result;

Process 5: placing the knowledge objects into the scientific and technical literature content map in turn according to their importance in the semantic set SSA; these knowledge objects are the second batch of points generated in the content map;

Process 6: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in turn according to the importance of the knowledge objects in the semantic set SSA; the extraction of the semantic set SSB is taken as the example in what follows;

Process 7: repeating Process 4, that is, calculating the importance of the knowledge objects other than knowledge object B in the semantic set SSB, and sorting the knowledge objects in SSB by importance according to the calculation result;

Process 8: repeating Process 5, that is, placing the knowledge objects into the scientific and technical literature content map in turn according to their importance in the semantic set SSB; these knowledge objects belong to the third batch of points generated in the content map;

The semantic sets SSC, SSD, SSE, …, extracted in order of the importance of the second batch of points, are processed in the same way as SSB until all of their knowledge objects have been placed in the content map, at which point the content map has generated all of its third batch of points. Semantic subsets are then extracted from the original semantic set SS according to the importance of the third batch of points, and so on until the semantic set SS is empty, which completes the construction of the content map of the multiple scientific and technical documents.

The deep aggregation of the knowledge of the scientific literature content comprises the following steps:

through a plurality of constructed scientific and technical literature content maps, the expression mode of knowledge is deepened from the external characteristics of the original scientific and technical literature to the internal characteristics of the scientific and technical literature, and the downward disclosure part provided by the invention is completed.

Performing implicit knowledge discovery and inference among documents by using the constructed content map, and generating a cross-document knowledge cluster from dimensions such as knowledge objects, semantic relations, quantity statistics and the like in a plane space; in the three-dimensional space, a knowledge object is taken as a target, a semantic relation is taken as a clue to perform deep exploration, and a knowledge chain crossing documents is generated.

Therefore, knowledge of 'free state' among the documents is associated and organized, effective cooperation and deep aggregation among the knowledge are realized, and the 'upward aggregation' part provided by the invention is completed.

Specifically, through semantic analysis of the constructed content map of the multiple scientific and technical documents, knowledge can be associated in both breadth and depth according to differences in path length. In three-dimensional space, a knowledge object is selected as the starting point, a path length is chosen, and deep association is carried out with semantic relations as clues; this yields multiple important facts that are directly or indirectly associated among different knowledge objects starting from that knowledge object, that is, a cross-document knowledge chain is generated.

In planar space, multiple important facts associated through the same knowledge object can likewise be obtained, that is, a cross-document knowledge cluster is generated. By generating cross-document knowledge clusters and knowledge chains from the content map in this way, deep aggregation of knowledge is achieved, implicit knowledge among the scientific and technical documents is revealed, and the value of the documents is increased.
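A cross-document knowledge cluster around one knowledge object can be sketched as a simple grouping over the semantic triples of the content map. The snippet below is an illustrative assumption about what such a cluster might contain along the three dimensions named earlier (knowledge objects, semantic relations and quantity statistics); the function name, the returned field names and the example facts are hypothetical.

```python
from collections import Counter


def knowledge_cluster(triples, knowledge_object):
    """Collect every fact that involves `knowledge_object`, with simple statistics."""
    facts = [t for t in triples if knowledge_object in (t[0], t[2])]
    relation_counts = Counter(relation for _, relation, _ in facts)
    neighbours = sorted(
        {obj for head, _, tail in facts for obj in (head, tail)} - {knowledge_object}
    )
    return {
        "facts": facts,                      # dimension: knowledge objects and their facts
        "relation_counts": relation_counts,  # dimension: semantic relations
        "neighbours": neighbours,
        "size": len(facts),                  # dimension: quantity statistics
    }


# Hypothetical facts that could come from different documents.
content_map = [
    ("drug X", "TREATS", "disease Y"),
    ("disease Y", "CAUSES", "symptom Z"),
    ("drug X", "INTERACTS_WITH", "drug W"),
]
cluster = knowledge_cluster(content_map, "disease Y")
```

Because the triples in the content map originate from several documents, the facts gathered around a single object naturally span document boundaries.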

On the basis of associating and aggregating the content knowledge of the scientific and technical literature, the method draws on related methods of knowledge-graph-oriented knowledge inference and carries out simple manual reasoning over the fine-grained knowledge in the constructed content map.

In the constructed content map, more than 95% of the semantic triples are connected to other semantic triples, that is, the tail entity of one semantic triple is likely to be the head entity of another. Two such semantic triples form a quintuple path, and a semantic relation that can be expressed may be hidden between the head entity and the tail entity of the quintuple.

For example, if the triples (A, father, B) and (B, father, C) exist, it follows that the triple (A, grandfather, C) should exist. The invention combines single-step reasoning with multi-step reasoning and, on the basis of inductive statistics over the knowledge, infers semantic relations between knowledge objects that are not directly connected in different semantic triples, thereby discovering new knowledge.
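The father/grandfather example can be expressed as a small two-hop inference over quintuple paths. The sketch below is illustrative only: the rule table `COMPOSITION_RULES` is a hypothetical stand-in for whatever relation-composition knowledge is induced in practice, and inferred triples remain candidates that still require expert verification.

```python
# Hypothetical composition rules: (first relation, second relation) -> inferred relation.
COMPOSITION_RULES = {("father", "father"): "grandfather"}


def infer_two_hop(triples, rules=COMPOSITION_RULES):
    """Propose new triples from pairs where one triple's tail is another's head."""
    known = set(triples)
    by_head = {}
    for head, relation, tail in triples:
        by_head.setdefault(head, []).append((relation, tail))

    inferred = []
    for a, r1, b in triples:
        for r2, c in by_head.get(b, []):       # quintuple path (a, r1, b, r2, c)
            new_relation = rules.get((r1, r2))
            if new_relation is None or a == c:
                continue
            candidate = (a, new_relation, c)
            if candidate not in known and candidate not in inferred:
                inferred.append(candidate)
    return inferred


candidates = infer_two_hop([("A", "father", "B"), ("B", "father", "C")])
```

Applying the function again to the original triples plus the accepted candidates gives a simple form of multi-step reasoning.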

Extraction of knowledge objects and semantic relations thereof:

One of the keys to deeply revealing the content knowledge of scientific and technical literature is the accuracy with which knowledge objects and their semantic relations are extracted. Provided that a knowledge base (domain dictionary) exists for the field concerned, the method provided by the invention is applicable to deep knowledge revealing of scientific and technical literature content in any field.

In the experiment, the invention selects scientific and technical literature from the medical field, also taking into account that this field has SemRep available as an extraction tool.

Richness of textual relationships:

In scientific and technical literature, multiple kinds of relations exist among knowledge objects, including co-occurrence, syntactic and semantic relations. Although semantic relations carry the richest information, co-occurrence and syntactic relations also carry some, so considering only semantic relations is bound to reveal the text content insufficiently. For example, the term "statistical non-uniform-small-cellulose-document" and the term "syntax" have the semantic relation "tress (refer)" between them and, at the same time, the relation "had not received"; that is, the two terms also have a co-occurrence relation.

Authenticity of implicit knowledge:

Knowledge that is discovered by inference through deep revealing of the content of multiple scientific and technical documents, and that is hidden in one or more documents, needs to be verified by experts. It cannot be guaranteed that every piece of inferred knowledge is true and reliable or is acknowledged to exist; nevertheless, implicit knowledge obtained by inference can offer relevant researchers new ideas and new directions to try.

In order to describe the content knowledge of scientific and technical literature at fine granularity, to aggregate document content deeply starting from the internal features of the literature, to strengthen effective cooperation of knowledge among documents, to solve the problem of knowledge dissociation, to improve the service capability of scientific and technical literature, and to meet users' need for precision, the invention provides a content-map-based method for deeply revealing scientific and technical literature content through "downward revealing and upward aggregation". The invention selects medical literature, for which a well-developed domain dictionary exists, for the experiment, and verifies the feasibility and effectiveness of the method through backtracking analysis and comparative analysis against the original text content. Core terms are randomly selected on the content map to generate cross-document knowledge clusters and knowledge chains in planar space and three-dimensional space, and knowledge reasoning is carried out in subgraphs of the content map, thereby achieving deep aggregation of knowledge in the true sense.

The invention provides a method for deeply disclosing scientific and technical literature contents based on a content map, aiming at the problem that the conventional abstract automatic generation technology cannot deeply and semantically disclose the scientific and technical literature contents. The method comprises the steps of extracting, disassembling, scattering, recombining and the like knowledge units in multiple scientific and technical documents to construct a map of the contents of the multiple scientific and technical documents, revealing the internal features and fine-grained knowledge of the scientific and technical documents downwards on the basis, and performing association aggregation on the fine-grained knowledge of the multiple scientific and technical documents upwards, so that deep semantic revealing of the contents of the scientific and technical documents is realized through the 'downward revealing and upward aggregating' method.

The scientific and technological literature content deep revealing method based on the 'downward revealing and upward gathering' of the content map can effectively achieve fine-grained revealing of the scientific and technological literature content knowledge and multi-dimensional gathering of the scientific and technological literature content knowledge.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

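1. A scientific and technical literature content deep revealing method based on a content map, characterized by comprising the following steps:

extracting knowledge objects and semantic relations of the knowledge objects from the text data;

constructing a content map of a plurality of scientific and technical documents;

deep aggregation of knowledge of scientific and technical literature contents.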

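2. The method of claim 1, characterized in that the extraction of knowledge objects and their semantic relations from the text data comprises the following steps:

step 1: inputting a prepared text data set;

step 2: manually labeling the unstructured experimental data in the text data and converting it into structured data; marking each text with a text label, title and abstract, identifying the position and number of each sentence, and refining the abstract content into purpose, method, result and conclusion;

step 3: preprocessing the text data, and deleting useless knowledge objects by using a stop word list;

step 4: extracting knowledge object syntactic triples from the processed text data by using the tool ClausIE, and storing the extracted syntactic triples;

step 5: processing the extracted syntactic triples;

step 6: matching the incomplete syntactic triples obtained in step 5 against a domain semantic dictionary, looking up the semantic relation between the head entity and the tail entity to complete each incomplete triple, taking the completed triples as semantic triples, and storing the semantic triples;

step 7: storing each processed semantic triple together with its position information and its syntactic triple to form the required data set, namely the semantic set SS, which serves as the data for constructing the content map of multiple scientific and technical documents.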

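3. The method of claim 1, characterized in that the construction of the content map of the multiple scientific and technical documents comprises:

a construction process for the content map of multiple scientific and technical documents and a construction algorithm for the content map of multiple scientific and technical documents.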

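4. The method of claim 3, characterized in that the construction process for the content map of the multiple scientific and technical documents comprises the following steps:

step 1: collecting several scientific and technical documents on the same subject, establishing a document set, and using the document set as the original data;

step 2: extracting the core terms in the document set, the semantic relations among them, and other semantic elements by means of a domain dictionary;

step 3: disassembling the extracted original semantic elements with respect to their semantic structure to obtain basic semantic elements, and scattering the basic semantic elements;

step 4: recombining the scattered basic semantic elements according to the semantic logical relations inherent among them to form a semantic set;

step 5: constructing a new semantic structure and new semantic features to form the main part of the content map of the scientific and technical documents, thereby completing the construction of the content map.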

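5. The method of claim 3, characterized in that the construction algorithm for the content map of the multiple scientific and technical documents comprises the following steps:

step 1: inputting the extracted and processed semantic set SS;

step 2: counting the number of semantic triples, the number of semantic relations, the number of occurrences of each knowledge object and the total number of occurrences of all knowledge objects in the semantic set SS, and then calculating the frequency of each knowledge object by using formula (1),

$$TF(t) = \frac{n_t}{N} \tag{1}$$

in the formula: $n_t$ represents the number of occurrences of knowledge object t in the semantic set SS, and $N$ represents the total number of occurrences of all knowledge objects in the semantic set SS;

step 3: placing the knowledge object A with the highest frequency in the semantic set SS into the scientific and technical literature content map as the first batch of points generated in the content map;

step 4: extracting all semantic triples containing knowledge object A from the semantic set SS and putting them into an empty semantic set to establish the semantic set SSA;

step 5: calculating the importance of the knowledge objects in the semantic set SSA, and sorting the knowledge objects by importance;

step 6: placing the knowledge object with rank n into the scientific and technical literature content map according to the importance of the knowledge objects in the semantic set SSA;

step 7: judging whether the semantic subset SSA is empty; if it is empty, executing step 8, and if it is not empty, returning to step 6;

step 8: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in turn, in order of the importance of the knowledge objects in the semantic set SSA, and executing steps 5 to 7 on each semantic subset;

step 9: judging whether the semantic set SS is empty; if it is empty, executing step 10, and if it is not empty, returning to step 8;

step 10: outputting the content map of the multiple scientific and technical documents.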

6. The method of claim 1, characterized in that the deep aggregation of the content knowledge of the scientific and technical literature comprises the following steps:

semantic analysis is carried out on the constructed content maps of the plurality of scientific and technical documents, and knowledge can be associated in two aspects of breadth and depth according to different path lengths; in the three-dimensional space, a certain knowledge object is selected as a starting point, the path length is selected, and the semantic relation is taken as a clue to carry out deep association, so that a plurality of important facts of direct or indirect association between different knowledge objects taking the knowledge object as the starting point can be obtained, namely generation of a cross-literature knowledge chain.