CN111651562A - Scientific and technological literature content deep revealing method based on content map - Google Patents

Info

Publication number
CN111651562A
CN111651562A CN202010504233.0A CN202010504233A CN111651562A CN 111651562 A CN111651562 A CN 111651562A CN 202010504233 A CN202010504233 A CN 202010504233A CN 111651562 A CN111651562 A CN 111651562A
Authority
CN
China
Prior art keywords
semantic
knowledge
scientific
literature
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010504233.0A
Other languages
Chinese (zh)
Other versions
CN111651562B (en)
Inventor
王敬东
宋建磊
孟凡奇
李佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeast Electric Power University
Original Assignee
Northeast Dianli University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeast Dianli University
Priority to CN202010504233.0A
Publication of CN111651562A
Application granted
Publication of CN111651562B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a content-map-based method for deeply revealing the content of scientific and technical literature, comprising the following steps: extracting knowledge objects and their semantic relations from text data; constructing a content map of multiple scientific and technical documents; and deeply aggregating the knowledge in the content of the scientific and technical literature. The method shifts the organization of literature content knowledge from the external features of documents to their internal features, supports the discovery of implicit knowledge across documents and reasoning over that knowledge, and can generate cross-document knowledge clusters and knowledge chains in planar space and three-dimensional space. By linking and organizing the knowledge that exists in a "free state" across documents, it achieves effective cooperation and deep aggregation of knowledge, and helps solve both the "information maze" encountered when consulting scientific and technical literature and the "knowledge dissociation" that exists among such documents.

Description

Scientific and technological literature content deep revealing method based on content map
Technical Field
The invention relates to semantic analysis, in particular to a scientific and technical literature content deep revealing method based on a content map.
Background
With the explosive growth in the number of scientific and technical documents, scholars have studied how to reveal the knowledge they contain from different angles and on the basis of different theories, proposing a scientific and technical text knowledge-element model, a scientific paper content structure model, a scientific paper content ontology model, and others. Although these studies have achieved certain results, they either work mainly from external features such as titles, topics, authors, keywords and references, or reveal the knowledge objects and semantic relations within a single document only. They lack deep mining and organization of content across documents on the same topic. As a result, most of the knowledge in scientific and technical literature content still exists in a "free state": cooperation among pieces of knowledge is lacking, and cross-document knowledge clusters and knowledge chains are difficult to generate. This phenomenon is referred to here as "knowledge dissociation".
Knowledge organization approaches for scientific and technical literature content that are based on knowledge elements or ontologies describe document knowledge at too coarse a granularity and provide little or insufficient semantic association among pieces of knowledge. Moreover, both approaches reveal the content of a single document only and cut it off from other documents on the same topic: knowledge across documents still exists in a "free state", cooperation among knowledge is lacking, cross-document knowledge clusters and knowledge chains cannot be generated effectively, and deep aggregation of document knowledge is hindered. Graph-theory-based approaches are used mainly to identify text topics; they reveal little of the content knowledge of scientific and technical literature and cannot solve the problem of knowledge dissociation.
Disclosure of Invention
The main purpose of the invention is to provide a content-map-based method for deeply revealing the content of scientific and technical literature. The method uses a scientific and technical literature content map to link and organize the "free state" knowledge across documents, achieves effective cooperation and deep aggregation of knowledge, and solves the problem of "knowledge dissociation", so as to meet users' complex information needs and improve knowledge service capability.
The technical scheme adopted by the invention is as follows: a scientific and technical literature content deep revealing method based on a content map comprises the following steps:
extracting knowledge objects and semantic relations of the knowledge objects from the text data;
constructing a content map of a plurality of scientific and technical documents;
deep aggregation of knowledge of scientific and technical literature contents.
Further, the extracting knowledge objects and semantic relations thereof from the text data includes:
step 1: inputting a prepared text data set;
step 2: manually labeling unstructured experimental data in the text data, and converting the unstructured experimental data into structured data; marking text labels, titles and abstracts on each text, identifying the position and number of each sentence, and refining the content of the abstracts into a purpose, a method, a result and a conclusion;
step 3: preprocessing the text data, and deleting useless knowledge objects by using a stop word list;
step 4: extracting knowledge object syntactic triples from the processed text data by using the tool ClausIE, and storing the extracted syntactic triples;
step 5: processing the extracted syntactic triples;
step 6: with the help of a domain semantic dictionary, matching the incomplete syntactic triples produced in step 5 against the dictionary, looking up the semantic relation between the head entity and the tail entity to complete each incomplete triple, taking the completed triples as semantic triples, and storing them;
step 7: storing each processed semantic triple together with its position information and its syntactic triple to form the required data set, namely the semantic set SS, which serves as the data for constructing the content map of multiple scientific and technical documents; once the knowledge objects and their relations have been extracted from the literature content, construction of the semantic set is complete.
Further, the construction of the content map of the plurality of scientific and technical documents comprises:
the method comprises a multi-scientific literature content map construction process and a multi-scientific literature content map construction algorithm.
Further, the content mapping process of the scientific documents comprises the following steps:
step 1: collecting several scientific and technical documents on the same subject, establishing a document set, and using the document set as the original data;
step 2: extracting the core terms in the document set, the semantic relations among them, and other semantic elements by means of a domain dictionary;
step 3: decomposing the extracted original semantic elements along their semantic structure to obtain basic semantic elements, and scattering the basic semantic elements;
step 4: recombining the scattered basic semantic elements according to the semantic logical relations inherent among them to form a semantic set;
step 5: constructing new semantic structures and semantic features to form the main body of the scientific and technical literature content map, completing the construction of the content map.
Further, the multiple scientific and technical literature content mapping algorithm comprises:
step 1: inputting a semantic set SS which is well extracted and processed;
step 2: counting, in the semantic set SS, the number of semantic triples, the number of semantic relations, the number of occurrences of each knowledge object and the total number of occurrences of all knowledge objects, and then calculating the frequency of each knowledge object by formula (1):
$$TF(t) = \frac{n_t}{\sum_k n_k} \qquad (1)$$
in the formula: $n_t$ represents the number of occurrences of knowledge object $t$ in the semantic set SS, and $\sum_k n_k$ represents the total number of occurrences of all knowledge objects in the semantic set SS;
step 3: placing the knowledge object A with the highest frequency in the semantic set SS into the scientific and technical literature content map, as the first batch of points generated in the map;
step 4: extracting all semantic triples containing knowledge object A from the semantic set SS, putting them into an empty semantic set, and thereby establishing the semantic set SSA;
step 5: calculating the importance of the knowledge objects in the semantic set SSA, and ranking the knowledge objects by importance;
step 6: placing the knowledge object of rank n into the scientific and technical literature content map according to the importance ranking in the semantic set SSA;
step 7: judging whether the semantic subset SSA is empty; if it is empty, executing step 8; if it is not empty, returning to step 6;
step 8: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in order of the importance of the knowledge objects in the semantic set SSA, and executing steps 5 to 7 on each semantic subset;
step 9: judging whether the semantic set SS is empty; if it is empty, executing step 10; if it is not empty, returning to step 8;
step 10: outputting the content map of the multiple scientific and technical documents.
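As a minimal sketch of the frequency calculation in step 2, assume the semantic set SS is held as a list of (head, relation, tail) tuples. The tuple layout, the function name `knowledge_object_frequency` and the sample triples are illustrative assumptions and do not come from the patent. The frequency of a knowledge object is then its occurrence count divided by the total occurrence count of all knowledge objects:

```python
from collections import Counter


def knowledge_object_frequency(semantic_set):
    """Return {knowledge object: occurrences / total occurrences} for a list of triples."""
    counts = Counter()
    for head, _relation, tail in semantic_set:
        # Both the head entity and the tail entity count as knowledge objects.
        counts[head] += 1
        counts[tail] += 1
    total = sum(counts.values())
    return {obj: n / total for obj, n in counts.items()} if total else {}


# Hypothetical sample semantic set (assumed data, for illustration only).
ss = [
    ("osimertinib", "treats", "non-small-cell lung cancer"),
    ("osimertinib", "inhibits", "EGFR"),
    ("osimertinib", "targets", "T790M mutation"),
]
freq = knowledge_object_frequency(ss)
top = max(freq, key=freq.get)  # candidate for knowledge object A in step 3
```

In this sample the most frequent knowledge object would be the starting point of the map; how ties are broken is not specified in the text, and the sketch simply takes the first maximum.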
Further, the deep aggregation of knowledge of the scientific literature content comprises:
performing implicit knowledge discovery and inference among documents by using the constructed content map, and generating a cross-document knowledge cluster from the dimensions of knowledge objects, semantic relations and quantity statistics in a plane space; in a three-dimensional space, a knowledge object is taken as a target, a semantic relation is taken as a clue to perform deep exploration and generate a cross-literature knowledge chain;
through semantic analysis on the constructed content maps of the plurality of scientific and technical documents, knowledge can be associated in two aspects of breadth and depth according to the difference of path lengths. In the three-dimensional space, a certain knowledge object is selected as a starting point, the path length is selected, and the semantic relation is taken as a clue to carry out deep association, so that a plurality of important facts of direct or indirect association between different knowledge objects taking the knowledge object as the starting point can be obtained, namely generation of a cross-literature knowledge chain.
The invention has the advantages that:
The method shifts the organization of literature content knowledge from the external features of documents to their internal features, supports the discovery of implicit knowledge across documents and reasoning over that knowledge, and can generate cross-document knowledge clusters and knowledge chains in planar space and three-dimensional space. By linking and organizing the knowledge that exists in a "free state" across documents, it achieves effective cooperation and deep aggregation of knowledge, and helps solve both the "information maze" encountered when consulting scientific and technical literature and the "knowledge dissociation" that exists among such documents.
In addition to the objects, features and advantages described above, other objects, features and advantages of the present invention are also provided. The present invention will be described in further detail below with reference to the drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the invention and, together with the description, serve to explain the invention and not to limit the invention.
FIG. 1 is a flow chart of a method for deep disclosure of technical literature content based on a content map according to an embodiment of the present invention;
FIG. 2 is a flow chart of knowledge object and semantic relationship extraction for text data according to an embodiment of the present invention;
FIG. 3 is a flow chart of the content mapping of the scientific and technical documents according to the embodiment of the present invention;
FIG. 4 is a diagram of exemplary steps for content mapping according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to FIG. 1, a content-map-based method for deeply revealing the content of scientific and technical literature includes:
extracting knowledge objects and semantic relations of the knowledge objects from the text data;
constructing a content map of a plurality of scientific and technical documents;
deep aggregation of knowledge of scientific and technical literature contents.
The method shifts the organization of literature content knowledge from the external features of documents to their internal features, supports the discovery of implicit knowledge across documents and reasoning over that knowledge, and can generate cross-document knowledge clusters and knowledge chains in planar space and three-dimensional space. By linking and organizing the knowledge that exists in a "free state" across documents, it achieves effective cooperation and deep aggregation of knowledge, and helps solve both the "information maze" encountered when consulting scientific and technical literature and the "knowledge dissociation" that exists among such documents.
Referring to FIG. 2, extracting knowledge objects and their semantic relations from text data includes:
step 1: inputting a prepared text data set;
step 2: manually labeling the unstructured experimental data in the text data to convert it into structured data; marking each text with a text label, title and abstract, and identifying the position and number of each sentence so that knowledge objects and the relations between them can be located in the text, which facilitates backtracking analysis of the extraction results; and refining the abstract content into purpose, method, result and conclusion;
step 3: preprocessing the text data, mainly by using a stop word list to delete useless knowledge objects such as pronouns, articles and punctuation;
step 4: extracting knowledge object syntactic triples from the processed text data by using the tool ClausIE, and storing the extracted syntactic triples;
step 5: processing each extracted syntactic triple into the format (head entity, ___, tail entity), that is, retaining only the entities (i.e., knowledge objects) in the syntactic triple, for example (osimertinib, ___, non-small-cell lung cancer);
step 6: with the help of a domain semantic dictionary, matching the incomplete syntactic triples produced in step 5 against the dictionary, looking up the semantic relation between the head entity and the tail entity to complete each incomplete triple, taking the completed triples as semantic triples, and storing them;
step 7: storing each processed semantic triple together with its position information and its syntactic triple to form the required data set, namely the semantic set SS, which serves as the data for constructing the content map of multiple scientific and technical documents; once the knowledge objects and their relations have been extracted from the literature content, construction of the semantic set is complete.
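Steps 5 to 7 above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the domain semantic dictionary is modeled as a plain mapping from (head entity, tail entity) pairs to a relation name, the record layout and all function names are hypothetical, and triples with no dictionary match are simply skipped because the text does not say how they are handled.

```python
def mask_relation(syntactic_triple):
    """Step 5: keep only the entities, giving (head entity, None, tail entity)."""
    head, _relation, tail = syntactic_triple
    return (head, None, tail)


def complete_triple(masked_triple, domain_dictionary):
    """Step 6: look up the semantic relation for the entity pair, if the dictionary has one."""
    head, _, tail = masked_triple
    relation = domain_dictionary.get((head, tail))
    return (head, relation, tail) if relation is not None else None


def build_semantic_set(records, domain_dictionary):
    """Step 7: store semantic triple, syntactic triple and position together."""
    semantic_set = []
    for record in records:
        semantic = complete_triple(mask_relation(record["syntactic"]), domain_dictionary)
        if semantic is None:
            continue  # assumption: unmatched triples are dropped
        semantic_set.append({
            "semantic": semantic,
            "syntactic": record["syntactic"],
            "position": record["position"],
        })
    return semantic_set


# Hypothetical dictionary entry and extracted records (assumed data).
domain_dictionary = {("osimertinib", "non-small-cell lung cancer"): "TREATS"}
records = [
    {"syntactic": ("osimertinib", "was given for", "non-small-cell lung cancer"),
     "position": ("doc1", 3)},
    {"syntactic": ("patients", "had not received", "therapy"),
     "position": ("doc1", 4)},
]
semantic_set = build_semantic_set(records, domain_dictionary)
```

Keeping the position and the original syntactic triple next to each semantic triple is what allows the backtracking analysis mentioned in step 2.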
The construction of the content map of the plurality of scientific and technical documents comprises the following steps:
the method comprises a multi-scientific literature content map construction process and a multi-scientific literature content map construction algorithm.
Referring to FIG. 3, the process of constructing the content map of the multiple scientific and technical documents includes:
step 1: collecting several scientific and technical documents on the same subject, establishing a document set, and using the document set as the original data;
step 2: extracting, by means of a domain dictionary, the core terms in the document set and semantic elements such as the semantic relations among them (subject-predicate-object triples);
step 3: decomposing the extracted original semantic elements (semantic triples) along their semantic structure to obtain basic semantic elements such as subjects, objects and predicates, or knowledge objects and semantic relations, and scattering the basic semantic elements;
step 4: recombining the scattered basic semantic elements according to the semantic logical relations inherent among them to form a semantic set;
step 5: constructing new semantic structures and semantic features to form the main body of the scientific and technical literature content map, completing the construction of the content map.
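One possible reading of the decomposition and recombination in steps 3 and 4 is sketched below. It assumes triples are (head, relation, tail) tuples and interprets "recombination" as grouping facts from all documents under each knowledge object they mention. That interpretation, the function names and the sample data are assumptions made for illustration.

```python
from collections import defaultdict


def decompose(triples):
    """Step 3: split triples into basic semantic elements (knowledge objects, relations)."""
    knowledge_objects, relations = set(), set()
    for head, relation, tail in triples:
        knowledge_objects.update((head, tail))
        relations.add(relation)
    return knowledge_objects, relations


def recombine(triples):
    """Step 4 (assumed reading): group facts under every knowledge object they mention."""
    grouped = defaultdict(list)
    for triple in triples:
        head, _relation, tail = triple
        for obj in {head, tail}:  # a set avoids double entry for self-loops
            grouped[obj].append(triple)
    return dict(grouped)


# Hypothetical triples from two documents on the same subject.
doc1 = [("osimertinib", "TREATS", "non-small-cell lung cancer")]
doc2 = [("osimertinib", "INHIBITS", "EGFR")]
objects, relations = decompose(doc1 + doc2)
by_object = recombine(doc1 + doc2)
```

After recombination, facts that originated in different documents sit side by side under the shared knowledge object, which is the cross-document link the content map relies on.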
The content mapping algorithm of the scientific and technical documents comprises the following steps:
step 1: inputting a semantic set SS which is well extracted and processed;
step 2: counting, in the semantic set SS, the number of semantic triples, the number of semantic relations, the number of occurrences of each knowledge object and the total number of occurrences of all knowledge objects, and then calculating the frequency of each knowledge object by formula (1):
$$TF(t) = \frac{n_t}{\sum_k n_k} \qquad (1)$$
in the formula: $n_t$ represents the number of occurrences of knowledge object $t$ in the semantic set SS, and $\sum_k n_k$ represents the total number of occurrences of all knowledge objects in the semantic set SS;
step 3: placing the knowledge object A with the highest frequency in the semantic set SS into the scientific and technical literature content map, as the first batch of points generated in the map;
step 4: extracting all semantic triples containing knowledge object A from the semantic set SS, putting them into an empty semantic set, and thereby establishing the semantic set SSA;
step 5: calculating the importance of the knowledge objects in the semantic set SSA and ranking them by importance, with rank n (n = 1, 2, 3, …);
To determine both the order in which the knowledge objects in a semantic subset are placed into the scientific and technical literature content map and the order in which they are used to return to the original semantic set SS and extract further semantic subsets, a knowledge object importance formula is proposed. The formula draws on the TF-IDF method and adds the influence of semantic relations; it takes into account the frequency, the semantic relation weight and the inverse word frequency of a knowledge object in the semantic subset to calculate its importance comprehensively, as follows:
[Formula (2): equation image for the semantic weight R(t)]
in the formula: R(t) represents the semantic weight of knowledge object t in the semantic subset; the remaining terms are the number of occurrences of the semantic relation between knowledge object t and knowledge object i in the semantic subset (SSi, i = A, B, C, …), the total number of occurrences of all semantic relations in the semantic set (SSi, i = A, B, C, …), and the number of semantic relation types between knowledge object t and knowledge object i.
[Formula (3): equation image for the inverse word frequency ISF(t)]
in the formula: ISF(t) represents the inverse word frequency of knowledge object t, M represents the total number of knowledge units (semantic triples) in the semantic set SS, and the remaining term is the number of knowledge units (semantic triples) in the semantic set SS that contain knowledge object t.
Formula (1) is then applied to the corresponding semantic subset to calculate the frequency TF(t) of knowledge object t in that subset, and formulas (1), (2) and (3) are combined to obtain the knowledge object importance formula (4):
[Formula (4): equation image for the knowledge object importance]
step 6: placing the knowledge object of rank n (n starts from 1) into the scientific and technical literature content map according to the importance ranking in the semantic set SSA;
step 7: judging whether the semantic subset SSA is empty; if it is empty, executing step 8; if it is not empty, returning to step 6;
step 8: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in order of the importance of the knowledge objects in the semantic set SSA, and executing steps 5 to 7 on each semantic subset;
step 9: judging whether the semantic set SS is empty; if it is empty, executing step 10; if it is not empty, returning to step 8;
step 10: outputting the content map of the multiple scientific and technical documents.
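The ten steps above amount to a breadth-first expansion in which each batch of points is ordered by importance. The sketch below is one way to express that loop under several assumptions that are not in the patent: triples are (head, relation, tail) tuples, the `importance` argument is a pluggable scoring function whose default is the in-subset frequency of formula (1) (standing in for formula (4), whose exact form is not reproduced in this text), ties keep first-seen order, and a disconnected remainder of SS restarts from its most frequent knowledge object.

```python
from collections import Counter, deque


def frequency(triples):
    """Formula (1) applied to a list of triples."""
    counts = Counter()
    for head, _relation, tail in triples:
        counts[head] += 1
        counts[tail] += 1
    total = sum(counts.values())
    return {obj: n / total for obj, n in counts.items()}


def build_content_map(semantic_set, importance=frequency):
    """Return (knowledge objects in placement order, triples in the order they were added)."""
    remaining = list(semantic_set)
    placed, edges = [], []
    while remaining:                                  # step 9: loop until SS is empty
        freq = frequency(remaining)                   # step 2
        seed = max(freq, key=freq.get)                # step 3: most frequent object
        if seed not in placed:
            placed.append(seed)
        queue = deque([seed])
        while queue:
            center = queue.popleft()
            subset = [t for t in remaining if center in (t[0], t[2])]        # steps 4 and 8
            if not subset:
                continue
            remaining = [t for t in remaining if center not in (t[0], t[2])]
            edges.extend(subset)
            scores = importance(subset)               # step 5
            ranked = sorted((o for o in scores if o != center), key=lambda o: -scores[o])
            for obj in ranked:                        # steps 6 and 7
                if obj not in placed:
                    placed.append(obj)
                    queue.append(obj)
    return placed, edges                              # step 10


# Hypothetical semantic set using the letter names from the text.
ss = [("A", "r1", "B"), ("A", "r2", "C"), ("A", "r1", "B"),
      ("A", "r5", "F"), ("B", "r3", "D"), ("C", "r4", "E")]
placed, edges = build_content_map(ss)
```

With this sample, A is placed first, then B, C and F from SSA in importance order, then D (from SSB) and E (from SSC). Passing a different `importance` callable changes only the ranking inside each subset, not the overall traversal.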
Example of constructing a content map of multiple scientific and technical documents:
Based on the proposed method for constructing a content map of multiple scientific and technical documents, the construction process is illustrated graphically and in detail; an example is shown in FIG. 4.
Process 1: computing count statistics over the extracted semantic set SS, including the total number of semantic triples, the total number of semantic relations, the total frequency of knowledge objects and the frequency of each knowledge object;
Process 2: placing the knowledge object A with the highest frequency in the semantic set SS into the content map; knowledge object A is the first batch of points generated in the scientific and technical literature content map, and also its first point;
Process 3: extracting all semantic triples containing knowledge object A from the semantic set SS and putting them into a new semantic set SSA;
Process 4: calculating the importance of the knowledge objects other than knowledge object A in the semantic set SSA, and ranking them according to the calculation result;
Process 5: placing the knowledge objects into the scientific and technical literature content map one by one in order of their importance in the semantic set SSA; these knowledge objects are the second batch of points generated in the map;
Process 6: extracting the semantic sets SSB, SSC, SSD, … from the semantic set SS in order of the importance of the knowledge objects in the semantic set SSA; the following takes the extraction of the semantic set SSB as an example;
Process 7: repeating Process 4, namely calculating the importance of the knowledge objects other than knowledge object B in the semantic set SSB, and ranking them according to the calculation result;
Process 8: repeating Process 5, namely placing the knowledge objects into the scientific and technical literature content map one by one in order of their importance in the semantic set SSB; these knowledge objects belong to the third batch of points generated in the map;
The semantic sets SSC, SSD, SSE, … extracted according to the importance of the second batch of points are processed in the same way as the semantic set SSB until all of their knowledge objects have been placed into the content map, at which point all of the third batch of points have been generated. Semantic subsets are then extracted from the original semantic set SS according to the importance of the third batch of points, and so on until the semantic set SS is empty, which completes the construction of the content map of the multiple scientific and technical documents.
The deep aggregation of the knowledge of the scientific literature content comprises the following steps:
With the content map of the multiple scientific and technical documents constructed, the way knowledge is expressed is deepened from the external features of the original documents to their internal features, which completes the "downward revealing" part proposed by the invention.
Performing implicit knowledge discovery and inference among documents by using the constructed content map, and generating a cross-document knowledge cluster from dimensions such as knowledge objects, semantic relations, quantity statistics and the like in a plane space; in the three-dimensional space, a knowledge object is taken as a target, a semantic relation is taken as a clue to perform deep exploration, and a knowledge chain crossing documents is generated.
Therefore, knowledge of 'free state' among the documents is associated and organized, effective cooperation and deep aggregation among the knowledge are realized, and the 'upward aggregation' part provided by the invention is completed.
Specifically, through semantic analysis of the constructed content maps of the plurality of scientific and technical documents, knowledge can be related in both the breadth and the depth according to the difference of path lengths. In a three-dimensional space, a certain knowledge object is selected as a starting point, the path length is selected, and the semantic relation is taken as a clue to carry out deep association, so that a plurality of important facts which are directly or indirectly associated among different knowledge objects taking the knowledge object as the starting point can be obtained, namely the generation of a cross-literature knowledge chain;
in planar space, multiple important facts associated by the same knowledge object, i.e., the generation of cross-document knowledge clusters, are also available. By means of the method, the knowledge clusters and knowledge chains across the documents are generated by means of the content map, so that not only can the deep aggregation of knowledge be realized, but also the implicit knowledge among the scientific and technical documents can be revealed, and the value of the scientific and technical documents is improved.
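The two aggregation modes can be sketched on a content map stored as a list of (head, relation, tail) triples. This is an illustrative reading only: a knowledge cluster is taken to be every fact that mentions a chosen knowledge object, and a knowledge chain is taken to be a directed path of a chosen length that starts at the chosen object and does not revisit an object. The function names and sample triples are assumptions.

```python
def knowledge_cluster(triples, knowledge_object):
    """Breadth: all facts that share the given knowledge object."""
    return [t for t in triples if knowledge_object in (t[0], t[2])]


def knowledge_chains(triples, start, path_length):
    """Depth: all directed paths with exactly `path_length` relations starting at `start`."""
    outgoing = {}
    for head, relation, tail in triples:
        outgoing.setdefault(head, []).append((relation, tail))

    chains = []

    def extend(path, visited):
        # A path holds objects and relations alternately: [o0, r1, o1, r2, o2, ...]
        if (len(path) - 1) // 2 == path_length:
            chains.append(tuple(path))
            return
        for relation, tail in outgoing.get(path[-1], []):
            if tail not in visited:
                extend(path + [relation, tail], visited | {tail})

    extend([start], {start})
    return chains


# Hypothetical content map assembled from several documents.
content_map = [
    ("osimertinib", "inhibits", "EGFR"),
    ("EGFR", "mutated_in", "non-small-cell lung cancer"),
    ("osimertinib", "treats", "non-small-cell lung cancer"),
    ("gefitinib", "inhibits", "EGFR"),
]
cluster = knowledge_cluster(content_map, "EGFR")
chains = knowledge_chains(content_map, "osimertinib", 2)
```

Varying `path_length` corresponds to choosing how far the association is followed from the starting knowledge object.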
On the basis of associating and summarizing the knowledge of the scientific and technical literature content, the method refers to a correlation method in knowledge inference facing to a knowledge map, and carries out simple manual reasoning on the fine-grained knowledge in the constructed map of the scientific and technical literature content.
In the constructed content map, more than 95% of the semantic triples are connected to one another; that is, the tail entity of one semantic triple is likely to be the head entity of another. Two such triples form a quintuple path, and a semantic relation that can be expressed explicitly may be hidden between the head and tail entities of that quintuple.
For example, given the triples (A, father, B) and (B, father, C), it follows that the triple (A, grandfather, C) should exist. The invention combines single-step and multi-step reasoning: on the basis of knowledge induction and statistics, it infers semantic relations between knowledge objects that belong to different semantic triples and are not directly connected, thereby discovering new knowledge.
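The father/grandfather example can be written out as a small Python sketch. The rule table `RULES` and the function names are hypothetical; the patent describes the reasoning as manual, so this only illustrates how a quintuple path (head, relation 1, middle entity, relation 2, tail) could be matched against a composition rule.

```python
# Hypothetical composition rules: (relation1, relation2) -> inferred relation.
RULES = {("father", "father"): "grandfather"}


def two_hop_paths(triples):
    """Yield quintuple paths (h, r1, mid, r2, t) where a tail meets a head."""
    by_head = {}
    for h, r, t in triples:
        by_head.setdefault(h, []).append((r, t))
    for h, r1, mid in triples:
        for r2, t in by_head.get(mid, []):
            yield (h, r1, mid, r2, t)


def infer(triples, rules):
    """Single-step inference: apply each rule once to every quintuple path."""
    known = set(triples)
    inferred = []
    for h, r1, _mid, r2, t in two_hop_paths(triples):
        relation = rules.get((r1, r2))
        if relation is not None and (h, relation, t) not in known:
            candidate = (h, relation, t)
            known.add(candidate)
            inferred.append(candidate)
    return inferred


facts = [("A", "father", "B"), ("B", "father", "C")]
new_facts = infer(facts, RULES)
print(new_facts)  # [('A', 'grandfather', 'C')]
```

Multi-step reasoning could be approximated by calling `infer` repeatedly on the union of known and inferred triples until nothing new appears; any triple produced this way is a candidate that still needs expert confirmation.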
Extraction of knowledge objects and their semantic relations:
One key to deep knowledge revealing of scientific and technical literature content is the accuracy with which knowledge objects and their semantic relations are extracted. The proposed method is applicable to literature in any field, provided that a knowledge base (domain dictionary) exists for that field.
Medical literature was selected for the experiments partly because SemRep is available as an extraction tool in that field.
Richness of textual relationships:
In scientific and technical literature, knowledge objects are linked by several kinds of relations, including co-occurrence, syntactic and semantic relations. Semantic relations carry the richest information, but co-occurrence and syntactic relations also carry some, so considering semantic relations alone inevitably reveals the text content incompletely. For example, the term "statistical non-uniform-small-cellulose-document" and the term "syntax" are linked by the semantic relation "tress (refer)" and, at the same time, by the relation "hadnot received"; that is, the two terms also have a co-occurrence relation.
Authenticity of implicit knowledge:
Knowledge hidden in one or more documents and discovered by inference during deep revealing of multiple scientific and technical documents must be verified by experts. It cannot be guaranteed that every inferred piece of knowledge is true, reliable or generally acknowledged; nevertheless, the inferred implicit knowledge can offer relevant researchers new ideas and new directions to try.
To describe the knowledge in scientific and technical literature content at fine granularity, to aggregate literature content deeply on the basis of the internal features of the literature, to strengthen the effective cooperation of knowledge among documents, to solve the problem of knowledge dissociation, to improve the service capability of scientific and technical literature and to meet users' demand for precision, the invention provides a content-map-based deep revealing method of 'downward revealing and upward aggregating'. Medical literature, which has a well-developed domain dictionary, was selected for the experiments. The feasibility and effectiveness of the method were verified through backtracking analysis and comparative analysis against the original text. Core terms were randomly selected on the content map to generate cross-document knowledge clusters and knowledge chains in planar and three-dimensional space, and knowledge reasoning was carried out in subgraphs of the content map, thereby achieving deep aggregation of knowledge in the true sense.
Addressing the problem that existing automatic abstract generation techniques cannot reveal the content of scientific and technical literature deeply and semantically, the invention provides a content-map-based method for deeply revealing that content. Knowledge units in multiple scientific and technical documents are extracted, disassembled, scattered and recombined to construct a content map of the documents. On this basis, the internal features and fine-grained knowledge of the documents are revealed downward, and the fine-grained knowledge of the multiple documents is associated and aggregated upward, so that deep semantic revealing of the content is achieved through 'downward revealing and upward aggregating'.
The content-map-based 'downward revealing and upward aggregating' method can effectively achieve fine-grained revealing and multi-dimensional aggregation of the knowledge in scientific and technical literature content.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A scientific and technical literature content deep revealing method based on a content map, characterized by comprising the following steps:
extracting knowledge objects and semantic relations of the knowledge objects from the text data;
constructing a content map of a plurality of scientific and technical documents;
deep aggregation of knowledge of scientific and technical literature contents.
2. The method of claim 1,
characterized in that extracting the knowledge objects and their semantic relations from the text data comprises the following steps:
step 1: inputting a prepared text data set;
step 2: manually labeling the unstructured experimental data in the text data and converting it into structured data; marking each text with a text label, a title and an abstract, identifying the position and number of each sentence, and refining the content of the abstract into purpose, method, result and conclusion;
step 3: preprocessing the text data and deleting useless knowledge objects by using a stop word list;
step 4: extracting knowledge object syntactic triples from the processed text data by using the tool ClausIE, and storing the extracted syntactic triples;
step 5: processing the extracted syntactic triples;
step 6: matching the incomplete syntactic triples processed in step 5 against a domain semantic dictionary, looking up the semantic relation between the head entity and the tail entity to complete each incomplete syntactic triple, taking the completed triples as semantic triples, and storing the semantic triples;
step 7: storing each processed semantic triple together with its position information and its syntactic triple to form the required data set, namely a semantic set SS, which serves as the data for constructing the content map of the plurality of scientific and technical documents; once the knowledge objects and their relations have been extracted from the scientific and technical literature content, the construction of the semantic set is complete.
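Steps 3, 6 and 7 above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the syntactic triples are assumed to have been produced already (for example by an open information extraction tool), and the stop word list `STOP_WORDS`, the dictionary `DOMAIN_RELATIONS`, the placeholder terms and the function names are all hypothetical.

```python
# Hypothetical stop word list used to clean entity strings (step 3).
STOP_WORDS = {"the", "a", "an", "of"}

# Hypothetical domain semantic dictionary:
# (head term, tail term) -> semantic relation between them (step 6).
DOMAIN_RELATIONS = {("drug_x", "disease_y"): "TREATS"}


def normalize(term):
    """Lower-case a term and drop stop words."""
    return " ".join(w for w in term.lower().split() if w not in STOP_WORDS)


def build_semantic_set(syntactic_triples):
    """Turn (head, verb phrase, tail, position) tuples into semantic set entries.

    A syntactic triple is kept only if the dictionary supplies a semantic
    relation for its head and tail entities; the entry stores the semantic
    triple, the original syntactic triple and the sentence position (step 7).
    """
    semantic_set = []
    for head, verb, tail, position in syntactic_triples:
        h, t = normalize(head), normalize(tail)
        relation = DOMAIN_RELATIONS.get((h, t))
        if relation is not None:
            semantic_set.append({
                "semantic": (h, relation, t),
                "syntactic": (head, verb, tail),
                "position": position,
            })
    return semantic_set


extracted = [
    ("The drug_x", "was given for", "the disease_y", "doc1:s3"),
    ("the study", "included", "patients", "doc1:s1"),
]
SS = build_semantic_set(extracted)
```

Keeping the syntactic triple and the sentence position next to each semantic triple is what allows a fact on the content map to be traced back to its source sentence.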
3. The method of claim 1,
characterized in that constructing the content map of the plurality of scientific and technical documents comprises:
a construction process for the content map of the plurality of scientific and technical documents, and a construction algorithm for the content map of the plurality of scientific and technical documents.
4. The method of claim 3,
characterized in that the construction process of the content map of the plurality of scientific and technical documents comprises the following steps:
step 1: collecting several scientific and technical documents on the same subject, establishing a document set, and using the document set as the original data;
step 2: extracting the core terms in the document set, together with the semantic relations and semantic elements among the core terms, by means of a domain dictionary;
step 3: disassembling the semantic structure of the extracted original semantic elements to obtain basic semantic elements, and scattering the basic semantic elements;
step 4: recombining the scattered basic semantic elements according to their inherent semantic logical relationships to form a semantic set;
step 5: constructing new semantic structures and semantic features to form the content map of the scientific and technical documents and its main parts, thereby completing the construction of the content map.
5. The method of claim 3,
the method is characterized in that the content map construction algorithm of the scientific and technical literature comprises the following steps:
step 1: inputting a semantic set SS which is well extracted and processed;
step 2: counting, in the semantic set SS, the number of semantic triples, the number of semantic relations, the number of occurrences of each knowledge object and the total number of occurrences of all knowledge objects, and then calculating the frequency of each knowledge object using formula (1):
$$f_t = \frac{n_t}{N} \qquad (1)$$
in the formula: $f_t$ is the frequency of knowledge object $t$, $n_t$ represents the number of occurrences of knowledge object $t$ in the semantic set SS, and $N$ represents the total number of occurrences of all knowledge objects in the semantic set SS;
step 3: placing the knowledge object A with the highest frequency in the semantic set SS into the content map as the first batch of points generated for the content map;
step 4: extracting all semantic triples containing the knowledge object A from the semantic set SS and putting them into an empty semantic set to establish a semantic set SSA;
step 5: calculating the importance of the knowledge objects in the semantic set SSA and sorting the knowledge objects by importance;
step 6: placing the knowledge object ranked n into the content map according to the importance of the knowledge objects in the semantic set SSA;
step 7: judging whether the semantic set SSA is empty; if it is empty, executing step 8; if it is not empty, returning to step 6;
step 8: extracting a semantic set SSB, a semantic set SSC, a semantic set SSD … from the semantic set SS in sequence according to the importance of the knowledge objects in the semantic set SSA, and executing steps 5 to 7 on each semantic subset;
step 9: judging whether the semantic set SS is empty; if it is empty, executing step 10; if it is not empty, returning to step 8;
step 10: outputting the content map of the plurality of scientific and technical documents.
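The construction algorithm of steps 1 to 10 can be approximated by the following Python sketch. Several points are assumptions made only for illustration: the semantic set is reduced to plain (head, relation, tail) tuples, a knowledge object's frequency is its share of all head and tail occurrences, the importance in step 5 (which the claim does not define) is approximated by frequency within the subset, and the next seed object is chosen by recomputing frequencies on the remaining triples rather than by the SSA ordering of step 8. The function names are hypothetical.

```python
from collections import Counter


def object_frequencies(triples):
    """Relative frequency of each knowledge object: occurrences / total occurrences."""
    counts = Counter()
    for head, _relation, tail in triples:
        counts[head] += 1
        counts[tail] += 1
    total = sum(counts.values())
    return {obj: n / total for obj, n in counts.items()}


def build_content_map(semantic_set):
    """Return (nodes in placement order, edges) for a list of semantic triples."""
    remaining = list(semantic_set)
    nodes, edges = [], []
    while remaining:  # step 9: repeat until SS is empty
        freq = object_frequencies(remaining)
        # steps 3 and 8: seed with the most frequent object (ties broken by name)
        seed = max(sorted(freq), key=freq.get)
        # step 4: move every triple containing the seed into its own subset
        subset = [t for t in remaining if seed in (t[0], t[2])]
        remaining = [t for t in remaining if seed not in (t[0], t[2])]
        # step 5: rank the subset's objects (assumption: importance = frequency)
        importance = object_frequencies(subset)
        ranked = sorted(importance, key=lambda o: (-importance[o], o))
        # steps 6 and 7: place objects on the map in rank order
        for obj in ranked:
            if obj not in nodes:
                nodes.append(obj)
        edges.extend(subset)
    return nodes, edges  # step 10


SS = [("A", "r1", "B"), ("A", "r2", "C"), ("B", "r3", "D"), ("E", "r4", "F")]
nodes, edges = build_content_map(SS)
```

Because each iteration removes at least one triple from the remaining set, the loop always terminates, and every triple of the semantic set ends up as an edge of the map.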
6. The method of claim 1,
characterized in that the deep aggregation of the knowledge of scientific and technical literature contents comprises:
performing implicit knowledge discovery and inference among documents by using the constructed content map; in planar space, generating cross-document knowledge clusters along the dimensions of knowledge objects, semantic relations and quantity statistics; in three-dimensional space, taking a knowledge object as the target and following semantic relations as clues for in-depth exploration to generate cross-document knowledge chains;
performing semantic analysis on the constructed content map of the plurality of scientific and technical documents, so that knowledge can be associated in both breadth and depth according to path length; in three-dimensional space, selecting a knowledge object as the starting point, choosing a path length, and following semantic relations as clues for in-depth association, thereby obtaining multiple important facts that directly or indirectly link the starting knowledge object to other knowledge objects, namely generating a cross-document knowledge chain.
CN202010504233.0A 2020-06-05 2020-06-05 Scientific and technological literature content deep revealing method based on content map Active CN111651562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010504233.0A CN111651562B (en) 2020-06-05 2020-06-05 Scientific and technological literature content deep revealing method based on content map

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010504233.0A CN111651562B (en) 2020-06-05 2020-06-05 Scientific and technological literature content deep revealing method based on content map

Publications (2)

Publication Number Publication Date
CN111651562A true CN111651562A (en) 2020-09-11
CN111651562B CN111651562B (en) 2023-03-21

Family

ID=72351231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010504233.0A Active CN111651562B (en) 2020-06-05 2020-06-05 Scientific and technological literature content deep revealing method based on content map

Country Status (1)

Country Link
CN (1) CN111651562B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2226233A1 (en) * 1997-01-21 1998-07-21 At&T Corp. Systems and methods for determinizing and minimizing a finite state transducer for speech recognition
US20070118394A1 (en) * 2005-11-12 2007-05-24 Cahoon Kyle A Value synthesis infrastructure and ontological analysis system
CN102508874A (en) * 2011-10-15 2012-06-20 西安交通大学 Method of generating navigation learning path on knowledge map
CN103034657A (en) * 2011-09-29 2013-04-10 日立(中国)研究开发有限公司 Document abstract generating method and device
CN104834735A (en) * 2015-05-18 2015-08-12 大连理工大学 Automatic document summarization extraction method based on term vectors
CN105321177A (en) * 2015-10-09 2016-02-10 浙江工业大学 Automatic hierarchical atlas collaging method based on image importance
CN106408255A (en) * 2016-09-06 2017-02-15 李刚 Building information model (BIM) multidimensional coding and analysis method and system
CN109726298A (en) * 2019-01-08 2019-05-07 上海市研发公共服务平台管理中心 Knowledge mapping construction method, system, terminal and medium suitable for scientific and technical literature
CN109885694A (en) * 2019-01-17 2019-06-14 南京邮电大学 A kind of selection of document and its study precedence determine method
CN110188147A (en) * 2019-05-22 2019-08-30 厦门无常师教育科技有限公司 The document entity relationship of knowledge based map finds method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
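WANG et al.: "Research on Deep Aggregation Method of Scientific Literature Content Based on Semantic Analysis", BASIC & CLINICAL PHARMACOLOGY & TOXICOLOGY *
李树青 et al.: "Research on a medical literature recommendation service based on keyword link network analysis", Journal of the China Society for Scientific and Technical Information (《情报学报》) *
杨光 et al.: "Analysis of the focus of intellectual property search work at different stages of the enterprise product R&D process", Jiangsu Science & Technology Information (《江苏科技信息》) *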

Also Published As

Publication number Publication date
CN111651562B (en) 2023-03-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant