CN115757827A - Knowledge graph creating method and device for patent text, storage medium and equipment - Google Patents

Knowledge graph creating method and device for patent text, storage medium and equipment

Info

Publication number
CN115757827A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211452940.5A
Other languages
Chinese (zh)
Inventor
严妍
汪敏
杨春宇
况海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kaipuyun Information Technology Co ltd
Cape Cloud Information Technology Co ltd
Original Assignee
Beijing Kaipuyun Information Technology Co ltd
Cape Cloud Information Technology Co ltd
Application filed by Beijing Kaipuyun Information Technology Co ltd, Cape Cloud Information Technology Co ltd
Priority to CN202211452940.5A
Publication of CN115757827A
Priority to PCT/CN2023/106264 (WO2024109097A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a method, a device, a storage medium and equipment for creating a knowledge graph of patent text, belonging to the technical field of machine learning. The method comprises: acquiring fields from a plurality of patent texts in the field of traditional Chinese medicine, wherein the fields comprise the invention name, abstract, claims and specification; extracting entities from the fields by using a trained entity extraction model, wherein the entity extraction model is created and trained on the basis of Sentence-BERT-BiGRU-CRF; extracting entities and relations from the fields based on linguistic rules of the overall concept-component concept and the object concept-effect concept; and creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and relations extracted by the linguistic rules. The entity extraction model improves the accuracy of entity recognition, and extracting entities and relations through the linguistic rules improves the coverage of knowledge acquisition and expands the scale of the knowledge graph.

Description

Knowledge graph creating method and device for patent text, storage medium and equipment
Technical Field
The present application relates to the field of machine learning technologies, and in particular, to a method, an apparatus, a storage medium, and a device for creating a knowledge graph of a patent document.
Background
A knowledge graph represents entities in the objective world, and the relationships between those entities, in the form of a graph. Existing traditional Chinese medicine knowledge graphs are constructed according to the characteristics of a traditional Chinese medicine domain model, and cover traditional Chinese medicine diseases, formulas, traditional Chinese medicines, their chemical components, pharmacological actions, traditional Chinese medicine experiments and chemical experiment methods.
When creating a knowledge graph for patent text, an extraction model is generally trained, entities and relationships are extracted from the patent text by using the extraction model, and the knowledge graph is then created based on the entities and relationships. The entities include the patent application number, patent title, inventor, invented drug, herb, disease, dosage and pharmacological action; the relationships include patent title, patent inventor, patent invented drug, drug treats disease, drug efficacy, drug component, drug preparation step, herb dosage, and herb property, taste and channel tropism, etc.
However, such an extraction model must extract both the entities and the relationships between them, which makes the model harder to train and limits its accuracy.
Disclosure of Invention
The application provides a knowledge graph creating method, a knowledge graph creating device, a storage medium and equipment for patent text, which are used to solve the problems of high training difficulty and low accuracy that arise when a single extraction model is used to extract both entities and relations. The technical scheme is as follows:
in one aspect, a method for creating a knowledge graph of patent text is provided, the method comprising:
acquiring fields in a plurality of patent texts in the field of traditional Chinese medicine, wherein the fields comprise the invention name, the abstract, the claims and the specification;
extracting entities from the fields by using a trained entity extraction model, wherein the entity extraction model is created and trained based on Sentence-BERT-BiGRU-CRF;
extracting entities and relations from the fields based on the language rules of the overall concept-component concept and the object concept-effect concept;
and creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules.
In one possible implementation, the creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and relationships extracted by the linguistic rules includes:
storing the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules in a triple form, wherein the triple comprises a head entity, a head entity label, a relation label, a tail entity and a tail entity label;
creating the knowledge-graph based on the triples.
In one possible implementation, the method further includes:
storing the knowledge graph by using a Gephi graph database;
and displaying the knowledge graph in the Gephi graph database.
In one possible implementation, the method further includes:
acquiring a training sample, performing word segmentation and part-of-speech tagging on text contents in fields of the training sample, and tagging entity labels to the word segmentation based on a regular expression and the part-of-speech tagging;
creating a model based on Sentence-BERT-BiGRU-CRF;
and training the model by using the training sample to obtain the entity extraction model.
In one possible implementation, the extracting entities from the fields by using the trained entity extraction model includes:
extracting entities, using the entity extraction model, from the fields that are not labeled with entity labels.
In one possible implementation, the extracting entities and relationships from the fields based on the linguistic rules of the whole concept-component concept and the object concept-effect concept includes:
acquiring a predefined constant term, wherein the constant term is a word or a symbol extracted based on a fixed sentence pattern;
creating a first language rule for the overall concept, the component concept and the constant item according to the fixed sentence pattern, and extracting entities and relations from the fields based on the first language rule;
and creating a second language rule for the object concept, the effect concept and the constant item according to the fixed sentence pattern, and extracting the entity and the relation from the field based on the second language rule.
In one possible implementation, the first language rule includes: the overall concept-constant term-component concept, the component concept-constant term-overall concept;
the second language rule includes: object concept-constant term-effect concept.
In one aspect, an apparatus for creating a knowledge graph of patent text is provided, the apparatus comprising:
the acquisition module is used for acquiring fields in a plurality of patent texts in the field of traditional Chinese medicine, wherein the fields comprise invention names, abstracts, claims and specifications;
an extraction module for extracting entities from the fields by using a trained entity extraction model, wherein the entity extraction model is created and trained based on Sentence-BERT-BiGRU-CRF;
the extraction module is further used for extracting entities and relations from the fields based on the language rules of the overall concept-component concept and the object concept-effect concept;
and the creating module is used for creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the at least one instruction being loaded and executed by a processor to implement the method for creating a knowledge-graph of patent text as described above.
In one aspect, a computer device is provided and includes a processor and a memory, where at least one instruction is stored in the memory, and the instruction is loaded and executed by the processor to implement the method for creating a knowledge-graph of patent text as described above.
The technical scheme provided by the application has the beneficial effects that:
the entity extraction model is created and trained on the basis of Sentence-BERT-BiGRU-CRF. Sentence-BERT can learn the semantic features of the corpus while removing the word-count limitation, with time complexity that grows linearly; the BiGRU can learn longer context relationships between words and speeds up inference of the NER (Named Entity Recognition) model; the CRF can correct errors in the sequence predicted by the BiGRU, thereby improving the accuracy with which the entity extraction model recognizes entities.
The fields in the patent text usually contain a plurality of fixed sentence patterns, the linguistic rules based on the whole concept-component concept and the object concept-effect concept can be extracted according to the fields, and then the entities and the relations are extracted from the fields based on the linguistic rules, so that the coverage of knowledge acquisition can be effectively improved, and the scale of the knowledge graph in the field of traditional Chinese medicine is expanded.
By displaying the knowledge graph in the Gephi graph database, the accuracy of patent examination and analysis can be promoted through the visualized knowledge graph, and an optimal way for professional and non-professional persons to learn knowledge in the field of traditional Chinese medicine can be provided.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method of knowledge-graph creation of patent text provided by an embodiment of the present application;
FIG. 2 is a diagram of a visualization of a knowledge-graph provided by an embodiment of the present application;
FIG. 3 is a block diagram of a knowledge-graph creating apparatus for patent documents according to still another embodiment of the present application;
FIG. 4 is a block diagram of a knowledge-graph creating apparatus for patent documents according to still another embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
The purpose of the application is to create a knowledge graph of patent texts in the field of traditional Chinese medicine. The knowledge graph includes entities such as the patent application number, invention name, inventor, invented drug, herb, disease, dosage and pharmacological action, and relationships such as patent title, patent inventor, patent invented drug, drug treats disease, drug efficacy, drug component, drug preparation step, herb dosage, and herb property, taste and channel tropism. In the application, the core knowledge of patent texts in the field of traditional Chinese medicine is expressed in a structured and formal way, and the extracted knowledge is connected to form the knowledge graph. Because the knowledge graph contains rich semantic relations, it supports functions such as knowledge reasoning, auxiliary analysis and decision support, so that enterprises or users can quickly and conveniently query the knowledge in traditional Chinese medicine patents and the relationships within it, and use the data in subsequent work. The key to constructing a knowledge graph in the field of traditional Chinese medicine is knowledge extraction. In the application, Chinese patent texts in the field of traditional Chinese medicine are collected and sorted, and the various data in the patent texts are cleaned. Knowledge is then extracted by a deep learning method and by a method based on linguistic rules, the extracted entities and relations are assembled into the knowledge graph in the form of triples, and the knowledge graph is imported into a graph database, which makes the patent knowledge in the field of traditional Chinese medicine structured and visual.
Referring to fig. 1, a flowchart of a method for creating a knowledge graph of patent documents, which may be applied to a computer device, according to an embodiment of the present application is shown. The knowledge graph creating method of the patent text can comprise the following steps:
step 101, obtaining fields in a plurality of patent texts in the traditional Chinese medicine field, wherein the fields comprise invention names, abstracts, claims and specifications.
The patent text comprises components such as the invention name, abstract, claims and specification, which contain the entities and relations needed to build the knowledge graph, such as the drug invented in the traditional Chinese medicine patent, its preparation method and the dosages of the various herbs.
It should be noted that, because the text content in the specification is large and the time consumption for extracting the entities and the relations from the text is long, in one example, the computer device can extract the entities and the relations from the invention name, the abstract and the claims.
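As a minimal sketch of step 101, the four fields can be held in a small record, with the specification left out of extraction by default as the example above suggests. The names `PatentText` and `fields_for_extraction` are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass


@dataclass
class PatentText:
    """One patent text split into the four fields named in step 101."""
    invention_name: str
    abstract: str
    claims: str
    specification: str


def fields_for_extraction(patent, include_specification=False):
    """Return the (field name, text) pairs handed to the extractors.

    The specification is skipped by default because it is long and
    slow to process; pass include_specification=True to add it.
    """
    selected = [
        ("invention_name", patent.invention_name),
        ("abstract", patent.abstract),
        ("claims", patent.claims),
    ]
    if include_specification:
        selected.append(("specification", patent.specification))
    return selected


# Hypothetical sample record used only for illustration.
patent = PatentText(
    invention_name="Rose eight-treasure tea",
    abstract="A tea comprising rose and chrysanthemum.",
    claims="1. A tea comprising the following raw materials: rose.",
    specification="(long description text)",
)
selected_names = [name for name, _ in fields_for_extraction(patent)]
print(selected_names)
```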
Step 102, extracting entities from the fields by using a trained entity extraction model, wherein the entity extraction model is created and trained based on Sentence-BERT-BiGRU-CRF.
In this embodiment, the computer device creates an entity extraction model based on a deep learning method, and then extracts a plurality of entities from the field using the entity extraction model.
When creating the entity extraction model, the computer device first acquires a training sample, performs word segmentation and part-of-speech tagging on the text content in the fields of the training sample, and attaches entity labels to the segmented words based on regular expressions and the part-of-speech tags; it then creates a model based on Sentence-BERT-BiGRU-CRF; finally, it trains the model with the training sample to obtain the entity extraction model.
Specifically, the entity labels are annotated first, and then a set of model parameters is defined, which yields a model corresponding to that set of parameters together with its metrics such as precision and recall. Different models are obtained by changing the combination of model parameters, and the computer device can select, as required, the parameter combination with the highest precision, recall or F value to obtain an optimal model. The finally selected model is referred to as the entity extraction model in this embodiment.
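The selection step can be sketched as follows, assuming entity-level scoring over (start, end, label) spans. The parameter-combination names and the sample spans are hypothetical; the application does not state how the metrics are computed.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall and F1 over sets of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


def select_best(candidates, gold, metric="f1"):
    """Return the parameter combination whose predictions score highest on `metric`."""
    index = {"precision": 0, "recall": 1, "f1": 2}[metric]
    return max(
        candidates,
        key=lambda name: precision_recall_f1(gold, candidates[name])[index],
    )


# Hypothetical gold spans and the predictions of two parameter combinations.
gold = {(0, 2, "HERB"), (5, 7, "DISEASE"), (9, 11, "DRUG")}
candidates = {
    "hidden=128,lr=1e-3": {(0, 2, "HERB"), (5, 7, "DISEASE")},
    "hidden=256,lr=1e-4": {
        (0, 2, "HERB"), (5, 7, "DISEASE"), (9, 11, "DRUG"), (12, 13, "HERB"),
    },
}
best_by_f1 = select_best(candidates, gold, metric="f1")
best_by_precision = select_best(candidates, gold, metric="precision")
print(best_by_f1, best_by_precision)
```

Which metric to maximize is a requirement-driven choice: here the first combination wins on precision, while the second wins on recall and F1.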
Sentence-BERT + BiGRU + CRF is an end-to-end deep learning model. BERT has the advantage of learning the semantic features of the corpus, but the BERT structure limits the number of words; Sentence-BERT removes this limitation, and its time complexity grows linearly. The BiGRU can learn longer context relationships between words, and a GRU model speeds up inference of the NER model compared with an LSTM model. The CRF can correct errors in the sequence predicted by the BiGRU, thereby improving the accuracy of entity recognition by the entity extraction model.
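To illustrate how a CRF layer can correct a sequence error, the sketch below runs Viterbi decoding over per-token scores. The emission scores stand in for BiGRU outputs and the single transition penalty is set by hand; in a trained CRF both would be learned, so all numbers and tag names here are assumptions.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions: one {tag: score} dict per token (stand-in for BiGRU outputs).
    transitions: {(previous_tag, tag): score}; missing pairs score 0.
    """
    # best[tag] = (score of the best path ending in tag, that path)
    best = {tag: (emissions[0][tag], [tag]) for tag in tags}
    for scores in emissions[1:]:
        new_best = {}
        for tag in tags:
            previous = max(
                tags,
                key=lambda p: best[p][0] + transitions.get((p, tag), 0.0),
            )
            total = (
                best[previous][0]
                + transitions.get((previous, tag), 0.0)
                + scores[tag]
            )
            new_best[tag] = (total, best[previous][1] + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


TAGS = ["O", "B-HERB", "I-HERB"]
# Hand-set constraint: an inside tag may not follow an outside tag.
transitions = {("O", "I-HERB"): -10.0}
emissions = [
    {"O": 1.0, "B-HERB": 0.9, "I-HERB": 0.0},
    {"O": 0.2, "B-HERB": 0.1, "I-HERB": 2.0},
]

# Picking the best tag per token independently gives an invalid sequence.
greedy = [max(TAGS, key=lambda tag: scores[tag]) for scores in emissions]
decoded = viterbi(emissions, transitions, TAGS)
print(greedy, decoded)
```

Per-token argmax yields O followed by I-HERB, which is not a well-formed BIO sequence; decoding with the transition penalty yields B-HERB followed by I-HERB instead.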
When extracting the entities in the fields with the entity extraction model, the computer device extracts entities from the fields that have not been labeled with entity labels. The computer device may use the jieba word segmentation toolkit to perform word segmentation and part-of-speech tagging on the text content.
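The labeling of training samples from regular expressions and part-of-speech tags could look like the sketch below. It works on (word, part-of-speech) pairs so that it runs without extra packages; with jieba installed, such pairs can be produced by `jieba.posseg.cut`. The rules, label names and English sample words are hypothetical.

```python
import re

# Hypothetical labeling rules: (entity label, word pattern, allowed POS flags).
RULES = [
    ("DOSAGE", re.compile(r"^\d+(\.\d+)?(g|mg|kg)$"), {"m", "eng", "x"}),
    ("HERB", re.compile(r"^(rose|chrysanthemum|green tea)$"), {"n", "eng"}),
]


def label_tokens(tagged_words):
    """Attach an entity label, or "O" for none, to each (word, pos) pair."""
    labeled = []
    for word, pos in tagged_words:
        label = "O"
        for name, pattern, allowed_pos in RULES:
            # A word is labeled only if both the regex and the POS flag agree.
            if pos in allowed_pos and pattern.match(word):
                label = name
                break
        labeled.append((word, pos, label))
    return labeled


# With jieba available, the pairs could be obtained as:
#   import jieba.posseg as pseg
#   tagged = [(pair.word, pair.flag) for pair in pseg.cut(text)]
tagged = [("rose", "n"), ("10g", "m"), ("and", "c"), ("chrysanthemum", "n"), ("5g", "m")]
labels = [label for _, _, label in label_tokens(tagged)]
print(labels)
```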
Step 103, extracting entities and relations from the fields based on the language rules of the overall concept-component concept and the object concept-effect concept.
After reading a large number of patent texts, we found that the invention names, abstracts and claims of patent texts have a relatively standard and uniform structure, that their subject matter is relatively fixed (namely the technical subject, the technical solution and the technical effect), and that many fixed sentence patterns are used. Linguistic rules can therefore be summarized on this basis, and entities and relations can be extracted based on those rules.
Specifically, based on the linguistic rules of the overall concept-component concept and the object concept-effect concept, the extraction of the entity and the relationship from the field can include the following substeps:
(1) Acquiring predefined constant terms, wherein the constant terms are words or symbols extracted based on the fixed sentence patterns.
There are a large number of identical constant terms in different linguistic rules, and for the convenience of illustrating the linguistic rules, the following constant terms are defined in this embodiment:
defconstant constant term
{
Punctuation: , | . | ? | ! | : | ;
Number: 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | <NULL> <0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9>
Quantifier: share | Wei | g | Shuangji | jin
Is-verb: is | as | becomes | is
Have-verb: having | has | possess | containing | exist
Can-verb: can
Realization word: implement | reach | get
Scheme word: solve | treat | radically cure | prevent | avoid | mainly treat
Effect word: effect | efficacy | outcome | efficacy
Object word: will | from | to
Belonging word: belong to
Disclosure word: disclose | relate to | provide
Raw material word: raw material | material | composition
Inclusion word: inclusion | containing | inclusion | including
Composition word: configuration | component | part | assembly | component | device | part | accessory
Component part: activity | period | process | step | phase | operation | degree | cell | action | aspect | mode | process
Consequence word: cause | to | for | is for
Reason word: because | due to
Existence word: at
Position word: upper | lower | left | right | inside | outside
Following word: as follows | below
Parallel word: and
Association word: about | has a relationship with | contact | correlation
}
In the present embodiment, only the above-mentioned constant terms are used for illustration, and in practical applications, more or less constant terms than the above-mentioned constant terms may be set according to requirements.
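One way to hold such constant terms in code is a mapping from class name to term list, compiled into a regular-expression alternation when a rule needs it. The sketch below uses English stand-ins for a few of the classes; the dictionary keys and helper names are assumptions, and the application's own terms are Chinese.

```python
import re

# English stand-ins for a few constant-term classes (keys are hypothetical).
CONSTANT_TERMS = {
    "have_verb": ["having", "has", "possess", "containing", "exist"],
    "scheme_word": ["solve", "treat", "radically cure", "prevent", "avoid", "mainly treat"],
    "effect_word": ["effect", "efficacy", "outcome"],
    "raw_material_word": ["raw material", "material", "composition"],
}


def constant_pattern(class_name):
    """Build a regex alternation that matches any term of one constant class."""
    # Longest terms first, so "raw material" is preferred over "material".
    terms = sorted(CONSTANT_TERMS[class_name], key=len, reverse=True)
    return "(?:" + "|".join(re.escape(term) for term in terms) + ")"


def matches_class(class_name, text):
    """True if `text` is exactly one of the terms in the class."""
    return re.fullmatch(constant_pattern(class_name), text) is not None


found = re.search(constant_pattern("raw_material_word"), "the raw material is rose")
print(found.group(0), matches_class("scheme_word", "prevent"))
```

Adding or removing constant terms then only changes the mapping, not the rules that reference a class.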
(2) A first linguistic rule is created for the overall concept, the constituent concepts, and the constant terms according to the fixed sentence pattern, and entities and relationships are extracted from the fields based on the first linguistic rule.
In this embodiment, the first language rule includes: the overall concept-constant term-component concept, the component concept-constant term-overall concept. The following exemplifies the combination of the overall concept, the composition concept, and the different constant terms.
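[Figures BDA0003952282260000071 and BDA0003952282260000081: example combinations of the overall concept, the component concept and constant terms; reproduced as images in the original publication]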
In the first language rule, the parts marked with "?" are the technical feature words to be extracted, such as [?overall concept] and [?component concept]; the parts marked with "!" denote constant terms of different classes. For example, <!scheme word> indicates that any one of the terms "solve, treat, radically cure, prevent, avoid, mainly treat" must be matched at this position.
In the present embodiment, only the above-mentioned overall concept, component concept and combination of different constant terms are exemplified, and in practical application, other combinations may be set as required.
(3) And creating a second language rule for the object concept, the effect concept and the constant item according to the fixed sentence pattern, and extracting the entity and the relation from the field based on the second language rule.
In this embodiment, the second language rule includes: object concept-constant term-effect concept. The following exemplifies the combination of the object concept, the effect concept, and the different constant terms.
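[Figure BDA0003952282260000091: example combinations of the object concept, the effect concept and constant terms; reproduced as an image in the original publication]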
In the second language rule, the parts marked with "?" are the technical feature words to be extracted, such as [?object concept] and [?effect concept]; the parts marked with "!" denote constant terms of different classes. For example, <!scheme word> indicates that any one of the terms "solve, treat, radically cure, prevent, avoid, mainly treat" must be matched at this position.
In this embodiment, only the combination of the object concept, the effect concept, and different constant terms is illustrated, and in practical application, other combinations may be set as needed.
Take the example of applying the first language rule to extract entities and relations from the abstract and the claims. Assuming the extracted result is "[rose eight-treasure tea] comprises the following raw materials: [rose], [Huangshan tribute chrysanthemum], [Huangshan green tea], …", the following triples can be obtained: rose eight-treasure tea - component - rose, and so on for the other raw materials.
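A rule of the form overall concept, constant terms, component concepts can be approximated with a regular expression, as in the sketch below. The English sentence pattern, the constant-term alternations and the relation name `component` are assumptions chosen to mirror the rose eight-treasure tea example; the application's actual rules are shown only as images.

```python
import re

# English stand-ins for the inclusion and raw-material constant classes.
INCLUSION = r"(?:comprises|includes|contains)"
RAW_MATERIAL = r"(?:raw materials|materials|compositions)"

# [?overall concept] <!inclusion word> the following <!raw material word>: [?component concept], ...
FIRST_RULE = re.compile(
    rf"(?P<whole>[^.:;]+?)\s+{INCLUSION}\s+the following\s+{RAW_MATERIAL}\s*:\s*(?P<parts>[^.;]+)",
    re.IGNORECASE,
)


def extract_whole_component(sentence):
    """Return (overall concept, "component", component concept) triples."""
    match = FIRST_RULE.search(sentence)
    if match is None:
        return []
    whole = match.group("whole").strip()
    parts = [part.strip() for part in match.group("parts").split(",") if part.strip()]
    return [(whole, "component", part) for part in parts]


sentence = (
    "Rose eight-treasure tea comprises the following raw materials: "
    "rose, Huangshan tribute chrysanthemum, Huangshan green tea."
)
triples = extract_whole_component(sentence)
for triple in triples:
    print(triple)
```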
Take the example of applying the second language rule to extract entities and relations from the invention name and the abstract. Assuming the extracted result is "the medicine has the efficacy of enhancing human immunity", the following triple can be obtained: the medicine - pharmacological action - enhancing immunity, where "the medicine" can be replaced with the drug name extracted from the invention name.
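The object concept, constant term, effect concept pattern can be sketched in the same way, including the replacement of a generic subject by the name taken from the invention name. The English pattern, the set of generic subjects and the function name are illustrative assumptions.

```python
import re

# English stand-ins for the have-verb and effect-word constant classes.
HAVE_VERB = r"(?:has|have|possesses)"
EFFECT_WORD = r"(?:efficacy|effect|outcome)"

# [?object concept] <!have verb> the <!effect word> of [?effect concept]
SECOND_RULE = re.compile(
    rf"(?P<object>[^.,;]+?)\s+{HAVE_VERB}\s+the\s+{EFFECT_WORD}\s+of\s+(?P<effect>[^.,;]+)",
    re.IGNORECASE,
)

# Generic subjects that should be replaced by the invention name (assumed list).
GENERIC_OBJECTS = {"the medicine", "the drug", "the invention"}


def extract_object_effect(sentence, invention_name=None):
    """Return one (object, "pharmacological action", effect) triple, or None."""
    match = SECOND_RULE.search(sentence)
    if match is None:
        return None
    obj = match.group("object").strip()
    if invention_name and obj.lower() in GENERIC_OBJECTS:
        obj = invention_name
    return (obj, "pharmacological action", match.group("effect").strip())


sentence = "The medicine has the efficacy of enhancing human immunity."
triple = extract_object_effect(sentence, invention_name="rose eight-treasure tea")
print(triple)
```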
Of course, the user may also customize some language rules based on which entities and relationships are identified.
In this embodiment, after the entities and relations have been extracted based on the language rules, professionals may be organized to review and audit the results, so as to ensure the accuracy of the extracted entities and relations.
Step 104, creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and relations extracted by the language rules.
Specifically, creating the knowledge graph based on the entities extracted by the entity extraction model and the entities and relationships extracted by the language rules may include: storing the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules in a triple form, wherein the triples comprise head entities, head entity labels, relations, relation labels, tail entities and tail entity labels; a knowledge graph is created based on the triplets.
In this embodiment, the entities and relationships extracted from the patent text are stored in the form of triplets, each triplet including: head entity, head entity label, relationship label, tail entity and tail entity label. The following triplets are exemplified:
[Figure BDA0003952282260000101: example triples; reproduced as an image in the original publication]
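A stored triple could be represented as a small record, and the graph assembled by merging triples from both extraction routes. The sketch below follows the six-field listing given for step 104 (head entity, head entity label, relation, relation label, tail entity, tail entity label); the label values and sample triples are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One stored triple with the labels that accompany it."""
    head: str
    head_label: str
    relation: str
    relation_label: str
    tail: str
    tail_label: str


def build_graph(triples):
    """Merge triples into a node table and a de-duplicated edge list."""
    nodes = {}   # entity name -> entity label
    edges = []   # (head, relation, tail), in first-seen order
    for triple in triples:
        nodes.setdefault(triple.head, triple.head_label)
        nodes.setdefault(triple.tail, triple.tail_label)
        edge = (triple.head, triple.relation, triple.tail)
        if edge not in edges:
            edges.append(edge)
    return nodes, edges


# Hypothetical triples; the second repeats the first, as could happen when
# the model and the language rules both find the same fact.
triples = [
    Triple("rose eight-treasure tea", "DRUG", "component", "has_component", "rose", "HERB"),
    Triple("rose eight-treasure tea", "DRUG", "component", "has_component", "rose", "HERB"),
    Triple("rose eight-treasure tea", "DRUG", "pharmacological action", "has_effect",
           "enhancing immunity", "EFFECT"),
]
nodes, edges = build_graph(triples)
print(len(nodes), len(edges))
```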
After the knowledge graph is obtained, the computer device can also store the knowledge graph by using a Gephi graph database, and display the knowledge graph in the Gephi graph database.
Gephi is data visualization software for the field of network analysis; its developers hope it will become the "Photoshop of data visualization". Gephi has the following three main characteristics:
(1) Powered by a built-in fast OpenGL engine, Gephi pushes the envelope with very large networks: it can visualize networks with more than one million elements, and all operations, such as layout and filtering, run in real time;
(2) It is simple to install and use, with a visualization-centric UI (User Interface), much as Photoshop is for image processing;
(3) The framework is built on the NetBeans platform and can be easily extended or reused through a well-written API (Application Program Interface).
The knowledge graph is displayed in the Gephi graph database, so that the relation among all entities can be clearly displayed (as shown in figure 2), on one hand, the accuracy of patent examination and analysis can be further promoted, and on the other hand, the best way for professional and non-professional persons to learn the knowledge in the patent field can be provided.
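One common way to get a graph into Gephi is its spreadsheet import, which reads a node table with `Id` and `Label` columns and an edge table with `Source` and `Target` columns. The sketch below writes both tables as CSV text; the extra `Category` column, the function name and the sample data are assumptions, and the application does not say which import route it uses.

```python
import csv
import io


def to_gephi_csv(nodes, edges):
    """Return (node table, edge table) as CSV text for Gephi's spreadsheet import."""
    node_buffer, edge_buffer = io.StringIO(), io.StringIO()

    node_writer = csv.writer(node_buffer, lineterminator="\n")
    node_writer.writerow(["Id", "Label", "Category"])  # Category: extra attribute
    for name, label in nodes.items():
        node_writer.writerow([name, name, label])

    edge_writer = csv.writer(edge_buffer, lineterminator="\n")
    edge_writer.writerow(["Source", "Target", "Type", "Label"])
    for head, relation, tail in edges:
        edge_writer.writerow([head, tail, "Directed", relation])

    return node_buffer.getvalue(), edge_buffer.getvalue()


# Hypothetical graph content.
nodes = {"rose eight-treasure tea": "DRUG", "rose": "HERB"}
edges = [("rose eight-treasure tea", "component", "rose")]
node_csv, edge_csv = to_gephi_csv(nodes, edges)
print(node_csv)
print(edge_csv)
```

The two strings can be saved as files and loaded through Gephi's import dialog, once as a nodes table and once as an edges table.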
In summary, in the method for creating a knowledge graph of patent text provided in the embodiment of the present application, the entity extraction model is created and trained based on Sentence-BERT-BiGRU-CRF. Sentence-BERT can learn the semantic features of the corpus while removing the word-count limitation, with time complexity that grows linearly; the BiGRU can learn longer context relationships between words and speeds up inference of the NER model; the CRF can correct errors in the sequence predicted by the BiGRU, thereby improving the accuracy with which the entity extraction model recognizes entities.
The fields in the patent text usually contain a plurality of fixed sentence patterns, the linguistic rules based on the overall concept-component concept and the object concept-effect concept can be extracted according to the fields, and then the entities and the relations are extracted from the fields based on the linguistic rules, so that the coverage of knowledge acquisition can be effectively improved, and the scale of the knowledge graph in the field of traditional Chinese medicine is expanded.
By displaying the knowledge graph in the Gephi graph database, the accuracy of patent examination and analysis can be promoted through the visualized knowledge graph, and an optimal way for professional and non-professional persons to learn knowledge in the field of traditional Chinese medicine can be provided.
Referring to fig. 3, a block diagram of a knowledge-graph creating apparatus for patent documents, which may be applied to a computer device, according to an embodiment of the present application is shown. The knowledge graph creating device of the patent text can comprise:
an obtaining module 310, configured to obtain fields in multiple patent texts in the field of traditional Chinese medicine, where the fields include an invention name, an abstract, a claim and a specification;
an extracting module 320, configured to extract entities from the fields by using a trained entity extraction model, where the entity extraction model is created and trained based on Sentence-BERT-BiGRU-CRF;
an extracting module 320, further configured to extract entities and relationships from the fields based on the linguistic rules of the holistic concept-component concept and the object concept-effect concept;
and the creating module 330 is used for creating the knowledge graph based on the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules.
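The cooperation of the three modules can be sketched as a small pipeline. This is only an illustrative outline under stated assumptions: the field keys in `FIELD_NAMES`, the function names, and the two extractor callables (`model_extract` standing in for the trained entity extraction model, `rule_extract` standing in for the language rules) are hypothetical and are not defined by the present application.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Relation = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Assumed dictionary keys for the four fields named in the text.
FIELD_NAMES = ("title", "abstract", "claims", "description")

def obtain_fields(patents: Iterable[Dict[str, str]]) -> List[str]:
    """Obtaining module: collect the text of every field that is present."""
    return [p[name] for p in patents for name in FIELD_NAMES if p.get(name)]

def create_graph(
    fields: Iterable[str],
    model_extract: Callable[[str], List[str]],
    rule_extract: Callable[[str], List[Relation]],
) -> Tuple[Set[str], Set[Relation]]:
    """Extracting and creating modules: merge both extraction results."""
    entities: Set[str] = set()
    relations: Set[Relation] = set()
    for text in fields:
        # Entities found by the (stand-in) entity extraction model.
        entities.update(model_extract(text))
        # Entities and relations found by the (stand-in) language rules.
        for head, relation, tail in rule_extract(text):
            entities.update((head, tail))
            relations.add((head, relation, tail))
    return entities, relations
```

Passing the extractors in as callables keeps the sketch independent of any particular model or rule set, which mirrors the statement that the functions may be distributed across modules as needed.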
In an optional embodiment, the creating module 330 is further configured to:
storing the entities extracted by the entity extraction model and the entities and relations extracted by the language rules in the form of triples, wherein each triple comprises a head entity, a head entity label, a relation label, a tail entity and a tail entity label;
creating the knowledge graph based on the triples.
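The triple form described above, with its five components, can be sketched as a small data structure from which a node table and an adjacency list are built. The class name `Triple`, the function `build_graph` and the sample labels used in the usage note are hypothetical; the application does not prescribe a concrete representation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple

class Triple(NamedTuple):
    """One stored triple with the five components named in the text."""
    head: str
    head_label: str
    relation: str
    tail: str
    tail_label: str

def build_graph(
    triples: Iterable[Triple],
) -> Tuple[Dict[str, str], Dict[str, List[Tuple[str, str]]]]:
    """Return (nodes, edges): nodes maps entity -> entity label,
    edges maps head entity -> list of (relation label, tail entity)."""
    nodes: Dict[str, str] = {}
    edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for t in triples:
        nodes[t.head] = t.head_label
        nodes[t.tail] = t.tail_label
        edge = (t.relation, t.tail)
        if edge not in edges[t.head]:  # skip duplicate triples
            edges[t.head].append(edge)
    return nodes, dict(edges)
```

For example, `Triple("decoction", "FORMULA", "has_component", "ginseng", "HERB")` contributes two labeled nodes and one directed edge from the head entity to the tail entity.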
Referring to fig. 4, in an alternative embodiment, the apparatus further includes:
a storage module 340, configured to store the knowledge graph by using the Gephi graph database;
and a display module 350, configured to display the knowledge graph in the Gephi graph database.
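One possible way to hand a graph to Gephi is through its spreadsheet import, which reads a node table with an `Id` column and an edge table with `Source` and `Target` columns. The sketch below serializes nodes and edges into two CSV strings in that layout. The function name `to_gephi_csv` and the extra `EntityLabel` attribute column are assumptions made for illustration; the application does not state how the data is loaded into Gephi.

```python
import csv
import io
from typing import Dict, Iterable, Tuple

def to_gephi_csv(
    nodes: Dict[str, str],
    edges: Iterable[Tuple[str, str, str]],
) -> Tuple[str, str]:
    """Return (nodes_csv, edges_csv) as text.

    nodes maps entity -> entity label;
    edges holds (head entity, relation label, tail entity) tuples.
    """
    nodes_out, edges_out = io.StringIO(), io.StringIO()

    node_writer = csv.writer(nodes_out, lineterminator="\n")
    # EntityLabel is a hypothetical extra attribute column.
    node_writer.writerow(["Id", "Label", "EntityLabel"])
    for name, label in sorted(nodes.items()):
        node_writer.writerow([name, name, label])

    edge_writer = csv.writer(edges_out, lineterminator="\n")
    edge_writer.writerow(["Source", "Target", "Type", "Label"])
    for head, relation, tail in edges:
        edge_writer.writerow([head, tail, "Directed", relation])

    return nodes_out.getvalue(), edges_out.getvalue()
```

The two returned strings can be written to files such as `nodes.csv` and `edges.csv` (hypothetical names) and imported as a node table and an edge table respectively.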
In an optional embodiment, the obtaining module 310 is further configured to obtain training samples, perform word segmentation and part-of-speech tagging on the text content in the fields of the training samples, and assign entity labels to the segmented words based on regular expressions and the part-of-speech tags;
the creating module 330 is further configured to create a model based on sequence-BERT-BiGRU-CRF;
the apparatus further includes a training module 360, configured to train the model with the training samples to obtain the entity extraction model.
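The labeling of training samples from regular expressions and part-of-speech tags can be sketched as follows. The input is assumed to be already segmented and tagged, as (word, part-of-speech) pairs produced by some external tool. The rule table `RULES`, the entity labels `DOSE` and `HERB`, the part-of-speech codes `m` and `n`, and the patterns themselves are all hypothetical examples and do not come from the present application.

```python
import re
from typing import List, Tuple

# Each rule: (entity label, required part-of-speech tag, pattern on the word).
# All three columns are illustrative assumptions.
RULES = [
    ("DOSE", "m", re.compile(r"^\d+(\.\d+)?(g|mg|ml)$")),
    ("HERB", "n", re.compile(r"^\w+(root|leaf|bark)$")),
]

def label_tokens(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Assign an entity label to each (word, pos) pair, or "O" if no rule fires.

    A rule fires only when both the part-of-speech tag and the regular
    expression match, so the two signals are combined rather than used alone.
    """
    labelled = []
    for word, pos in tokens:
        tag = "O"
        for label, required_pos, pattern in RULES:
            if pos == required_pos and pattern.match(word):
                tag = label
                break
        labelled.append((word, tag))
    return labelled
```

Labels produced this way can then serve as the training targets for the model; words that match a pattern but carry a different part-of-speech tag are left unlabeled.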
In an optional embodiment, the extracting module 320 is further configured to:
using the entity extraction model to extract entities from the fields that have not been annotated with entity labels.
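Once a sequence-labeling model has tagged the tokens of an unlabeled field, the tags still have to be grouped into entities. The sketch below shows one common way to do this, assuming the model emits BIO tags; the BIO scheme, the function name `bio_to_entities` and the `HERB` label are assumptions, since the application does not specify the tagging scheme.

```python
from typing import List, Optional, Tuple

def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity text, entity label) pairs."""
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None

    def flush() -> None:
        # Close the entity being built, if any.
        nonlocal current, label
        if current and label is not None:
            entities.append((" ".join(current), label))
        current, label = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity.
            flush()
    flush()
    return entities
```

Tokens are joined with a space here because the sample text is English; for Chinese text the tokens would normally be concatenated without a separator.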
In an optional embodiment, the extracting module 320 is further configured to:
acquiring predefined constant terms, wherein the constant terms are words or symbols extracted from the fixed sentence patterns;
creating a first language rule for the overall concept, the component concept and the constant terms according to the fixed sentence patterns, and extracting entities and relations from the fields based on the first language rule;
and creating a second language rule for the object concept, the effect concept and the constant terms according to the fixed sentence patterns, and extracting entities and relations from the fields based on the second language rule.
In an alternative embodiment, the first language rule comprises: the overall concept-constant term-component concept, the component concept-constant term-overall concept;
the second language rule comprises: the object concept-constant term-effect concept.
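The two language rules can be sketched with regular expressions in which the constant term is the fixed text between two captured concepts. The English constant terms used here ("comprises", "consists of", "is a component of", "is used for"), the relation names `has_component` and `has_effect`, and the function names are hypothetical stand-ins; the actual constant terms would be taken from the fixed sentence patterns of the patent fields.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# First language rule, both directions (constant terms are assumptions).
WHOLE_PART = re.compile(r"^(?P<whole>.+?) (?:comprises|consists of) (?P<parts>.+)$")
PART_WHOLE = re.compile(r"^(?P<part>.+?) is a component of (?P<whole>.+)$")
# Second language rule.
OBJECT_EFFECT = re.compile(r"^(?P<object>.+?) is used for (?P<effect>.+)$")

def split_list(text: str) -> List[str]:
    """Split an enumeration such as 'a, b and c' into its items."""
    return [part.strip() for part in re.split(r",| and ", text) if part.strip()]

def extract(sentence: str) -> List[Triple]:
    """Apply the rules to one sentence and return the extracted triples."""
    s = sentence.strip().rstrip(".")
    m = WHOLE_PART.match(s)
    if m:  # overall concept - constant term - component concept
        return [(m["whole"], "has_component", p) for p in split_list(m["parts"])]
    m = PART_WHOLE.match(s)
    if m:  # component concept - constant term - overall concept
        return [(m["whole"], "has_component", m["part"])]
    m = OBJECT_EFFECT.match(s)
    if m:  # object concept - constant term - effect concept
        return [(m["object"], "has_effect", m["effect"])]
    return []
```

Both directions of the first rule are normalized to the same relation with the overall concept as the head entity, so the resulting triples are consistent regardless of the word order in the sentence.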
In summary, in the apparatus for creating a knowledge graph of patent text provided in the embodiments of the present application, the entity extraction model is created and trained based on sequence-BERT-BiGRU-CRF. The sequence-BERT can learn the expected semantic features without a limit on the number of words, with time complexity that grows linearly; the BiGRU can learn longer-range contextual relationships between words and can speed up inference of the NER model; and the CRF can correct errors in the sequence predicted by the BiGRU, thereby improving the accuracy of entity recognition by the entity extraction model.
The fields in patent text usually contain a number of fixed sentence patterns. Language rules based on the overall concept-component concept and the object concept-effect concept can be derived from the fields, and entities and relations are then extracted from the fields based on these language rules, which effectively improves the coverage of knowledge acquisition and expands the scale of the knowledge graph in the field of traditional Chinese medicine.
By displaying the knowledge graph in the Gephi graph database, the visualized knowledge graph can help improve the accuracy of patent examination and analysis, and can provide an optimal way for professionals and non-professionals to learn knowledge in the field of traditional Chinese medicine.
One embodiment of the present application provides a computer-readable storage medium having at least one instruction stored therein, which is loaded and executed by a processor to implement the method for creating a knowledge-graph of patent text as described above.
One embodiment of the present application provides a computer device comprising a processor and a memory, wherein the memory stores at least one instruction, and the instruction is loaded and executed by the processor to implement the method for creating a knowledge-graph of patent text as described above.
It should be noted that the apparatus for creating a knowledge graph of patent text provided in the above embodiment is described using the above division of functional modules only as an example. In practical applications, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the apparatus for creating a knowledge graph of patent text provided in the above embodiment and the method for creating a knowledge graph of patent text belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description should not be taken as limiting the embodiments of the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the embodiments of the present application should be included in the scope of the embodiments of the present application.

Claims (10)

1. A method for creating a knowledge graph of patent text, the method comprising:
acquiring fields in a plurality of patent texts in the field of traditional Chinese medicine, wherein the fields comprise invention names, abstracts, claims and specifications;
extracting entities from the fields by using a trained entity extraction model, wherein the entity extraction model is created and trained based on sequence-BERT-BiGRU-CRF;
extracting entities and relations from the fields based on the language rules of the overall concept-component concept and the object concept-effect concept;
and creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules.
2. The method for creating a knowledge graph of patent text according to claim 1, wherein the creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and relations extracted by the language rules comprises:
storing the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules in a triple form, wherein the triple comprises a head entity, a head entity label, a relation label, a tail entity and a tail entity label;
creating the knowledge-graph based on the triples.
3. The method of knowledge-graph creation of patent text according to claim 1, further comprising:
storing the knowledge graph by using a Gephi graph database;
and displaying the knowledge graph in the Gephi graph database.
4. The method of knowledge-graph creation of patent text according to claim 1, further comprising:
acquiring training samples, performing word segmentation and part-of-speech tagging on the text content in the fields of the training samples, and assigning entity labels to the segmented words based on regular expressions and the part-of-speech tags;
creating a model based on the sequence-BERT-BiGRU-CRF;
and training the model by using the training sample to obtain the entity extraction model.
5. The method for creating a knowledge graph of patent text according to claim 4, wherein the extracting entities from the fields by using the trained entity extraction model comprises:
extracting entities, by using the entity extraction model, from the fields that have not been annotated with entity labels.
6. The method for creating a knowledge graph of patent text according to any one of claims 1 to 5, wherein the extracting entities and relations from the fields based on the language rules of the overall concept-component concept and the object concept-effect concept comprises:
acquiring predefined constant terms, wherein the constant terms are words or symbols extracted from fixed sentence patterns;
creating a first language rule for the overall concept, the component concept and the constant terms according to the fixed sentence patterns, and extracting entities and relations from the fields based on the first language rule;
and creating a second language rule for the object concept, the effect concept and the constant terms according to the fixed sentence patterns, and extracting entities and relations from the fields based on the second language rule.
7. The method of knowledge-graph creation of patent text according to claim 6, wherein
the first language rule includes: the overall concept-constant term-component concept, and the component concept-constant term-overall concept;
the second language rule includes: the object concept-constant term-effect concept.
8. A knowledge-graph creation apparatus of patent text, the apparatus comprising:
the acquisition module is used for acquiring fields in a plurality of patent texts in the field of traditional Chinese medicine, wherein the fields comprise invention names, abstracts, claims and specifications;
an extraction module, configured to extract an entity from the field using a trained entity extraction model, where the entity extraction model is created and trained based on sequence-BERT-BiGRU-CRF;
the extraction module is further used for extracting entities and relations from the fields based on the language rules of the overall concept-component concept and the object concept-effect concept;
and the creating module is used for creating a knowledge graph based on the entities extracted by the entity extraction model and the entities and the relations extracted by the language rules.
9. A computer-readable storage medium, wherein at least one instruction is stored in the storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for knowledge-graph creation of patent text according to any one of claims 1 to 7.
10. A computer device comprising a processor and a memory, the memory having stored therein at least one instruction that is loaded and executed by the processor to implement the method of knowledge-graph creation of patent text as claimed in any one of claims 1 to 7.
CN202211452940.5A 2022-11-21 2022-11-21 Knowledge graph creating method and device for patent text, storage medium and equipment Pending CN115757827A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202211452940.5A CN115757827A (en) 2022-11-21 2022-11-21 Knowledge graph creating method and device for patent text, storage medium and equipment
PCT/CN2023/106264 WO2024109097A1 (en) 2022-11-21 2023-07-07 Knowledge map creation method and apparatus for patent text, and storage medium and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211452940.5A CN115757827A (en) 2022-11-21 2022-11-21 Knowledge graph creating method and device for patent text, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN115757827A true CN115757827A (en) 2023-03-07

Family

ID=85333162

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211452940.5A Pending CN115757827A (en) 2022-11-21 2022-11-21 Knowledge graph creating method and device for patent text, storage medium and equipment

Country Status (2)

Country Link
CN (1) CN115757827A (en)
WO (1) WO2024109097A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024109097A1 (en) * 2022-11-21 2024-05-30 开普云信息科技股份有限公司 Knowledge map creation method and apparatus for patent text, and storage medium and device

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222201B (en) * 2019-06-26 2021-04-27 中国医学科学院医学信息研究所 Method and device for constructing special disease knowledge graph
CN111221976A (en) * 2019-11-14 2020-06-02 北京京航计算通讯研究所 Knowledge graph construction method based on bert algorithm model
CN111639498A (en) * 2020-04-21 2020-09-08 平安国际智慧城市科技股份有限公司 Knowledge extraction method and device, electronic equipment and storage medium
CN111639190A (en) * 2020-04-30 2020-09-08 南京理工大学 Medical knowledge map construction method
CN111723570B (en) * 2020-06-09 2023-04-28 平安科技(深圳)有限公司 Construction method and device of medicine knowledge graph and computer equipment
CN112784051A (en) * 2021-02-05 2021-05-11 北京信息科技大学 Patent term extraction method
CN113342989B (en) * 2021-05-24 2022-12-20 北京航空航天大学 Knowledge graph construction method and device of patent data, storage medium and terminal
CN114996477A (en) * 2022-06-15 2022-09-02 湖南中医药大学 Knowledge graph construction method and device based on typhoid theory
CN114817576B (en) * 2022-06-28 2022-11-18 北京邮电大学 Model training and patent knowledge graph complementing method, device and storage medium
CN115757827A (en) * 2022-11-21 2023-03-07 开普云信息科技股份有限公司 Knowledge graph creating method and device for patent text, storage medium and equipment

Also Published As

Publication number Publication date
WO2024109097A1 (en) 2024-05-30

Similar Documents

Publication Publication Date Title
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
US20210173858A1 (en) Apparatus and method for automated and assisted patent claim mapping and expense planning
Pal et al. An approach to automatic text summarization using WordNet
CN110347894A (en) Knowledge mapping processing method, device, computer equipment and storage medium based on crawler
CN110032648A (en) A kind of case history structuring analytic method based on medical domain entity
CN109408811B (en) Data processing method and server
CN106776711A (en) A kind of Chinese medical knowledge mapping construction method based on deep learning
CN113590783B (en) NLP natural language processing-based traditional Chinese medicine health preserving intelligent question-answering system
CN115292457B (en) Knowledge question answering method and device, computer readable medium and electronic equipment
CN112487202A (en) Chinese medical named entity recognition method and device fusing knowledge map and BERT
CN115080694A (en) Power industry information analysis method and equipment based on knowledge graph
CN112035675A (en) Medical text labeling method, device, equipment and storage medium
CN107247739A (en) A kind of financial publication text knowledge extracting method based on factor graph
CN106909572A (en) A kind of construction method and device of question and answer knowledge base
CN111046272A (en) Intelligent question-answering system based on medical knowledge map
CN107357785A (en) Theme feature word abstracting method and system, feeling polarities determination methods and system
CN111651614A (en) Method and system for constructing medicated diet knowledge graph, electronic equipment and storage medium
Pal et al. An approach to automatic text summarization using simplified lesk algorithm and wordnet
CN116340544B (en) Visual analysis method and system for ancient Chinese medicine books based on knowledge graph
CN111428503A (en) Method and device for identifying and processing same-name person
CN107330111A (en) The search method and device of domain body based on common version body
Nguyen et al. Seagull: A bird’s-eye view of the evolution of technical games research
CN114153994A (en) Medical insurance information question-answering method and device
CN115757827A (en) Knowledge graph creating method and device for patent text, storage medium and equipment
CN112800244B (en) Method for constructing knowledge graph of traditional Chinese medicine and national medicine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination