CN112036151B - Gene disease relation knowledge base construction method, device and computer equipment - Google Patents

Info

Publication number
CN112036151B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010941642.7A
Other languages
Chinese (zh)
Other versions
CN112036151A (en
Inventor
张圣
顾大中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202010941642.7A priority Critical patent/CN112036151B/en
Priority to PCT/CN2020/125143 priority patent/WO2021155684A1/en
Publication of CN112036151A publication Critical patent/CN112036151A/en
Application granted granted Critical
Publication of CN112036151B publication Critical patent/CN112036151B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The application relates to the field of artificial intelligence and discloses a method, a device and computer equipment for constructing a gene disease relation knowledge base. The method comprises the following steps: performing dependency analysis on a specified number of natural sentences to obtain their dependency relationships; determining a path descriptor of each natural sentence according to its dependency relationship; generating rule templates according to the path descriptors and establishing a rule template library; and performing knowledge extraction on the full set of medical literature using the rule templates to obtain gene disease relationships and establish a gene disease relation knowledge base. The method automatically learns a large number of rule templates and then uses them to automatically extract gene disease relationship knowledge from the medical literature. It does not require high labor cost, extracts a large amount of knowledge with good extraction quality, transfers well, and can be used to extract relationships among other types of medical entities. The application also relates to blockchain technology, wherein the rule templates, the gene disease relation knowledge base and the like are stored in a blockchain.

Description

Gene disease relation knowledge base construction method, device and computer equipment
Technical Field
The application relates to the field of artificial intelligence, in particular to a method, a device and computer equipment for constructing a gene disease relation knowledge base.
Background
Medical literature contains a large number of natural sentences that describe target relationships between diseases and genes, and knowledge of the target genes of a disease is important for basic medical research, for diagnosis and treatment, and for the development of targeted drugs. Existing high-quality disease-gene target relationships are obtained almost entirely through manual construction by experts. However, with the exponential growth of medical literature, a reasonably complete knowledge base can hardly be built by relying only on experts to manually collate, edit and review the content.
At present there are also technical schemes that use computer technology to automatically acquire medical entity relations from medical literature. They fall mainly into two types: medical entity relation extraction based on manually designed rules, and medical entity relation extraction using machine learning. Rule-based schemes currently require domain experts to summarize usable high-quality rules, so the amount of knowledge obtained depends entirely on the quality and quantity of those rules; most rule-based schemes have high accuracy but low recall and high cost. Among schemes based on machine learning, the best models at present are relation extraction models based on deep learning, but even these still perform relatively poorly on medical relation extraction, and a large gap remains before they are usable in practice. In addition, training a deep learning model requires a large, high-quality labeled data set, and producing high-quality labels for medical relation extraction requires manual annotation by experts.
Disclosure of Invention
The main purpose of the application is to provide a method, a device and computer equipment for constructing a gene disease relation knowledge base, which aim to solve the problems of high construction cost and poor effect of the current gene disease relation knowledge base.
In order to achieve the above object, the present application proposes a method for constructing a knowledge base of gene disease relationships, comprising:
performing dependency relationship analysis on a specified number of natural sentences comprising gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the arrangement sequence of all words on the dependency path of the gene disease entity in the natural sentence;
generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
and carrying out knowledge extraction on the whole medical literature by utilizing the rule templates in the rule template library to obtain a gene disease relationship, and establishing a gene disease relationship knowledge base.
Further, before the step of performing dependency analysis on a specified number of natural sentences including gene-disease entity pairs to obtain the dependency of each natural sentence, the method includes:
acquiring natural sentences comprising gene-disease entity pairs from a designated medical database;
randomly selecting a specified number of natural sentences comprising gene-disease entity pairs.
Further, the step of performing dependency analysis on a specified number of natural sentences including gene-disease entity pairs to obtain the dependency of each natural sentence includes:
performing dependency analysis on each natural sentence using the natural language processing toolkit StanfordNLP to obtain the dependency of each natural sentence.
Further, after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:
calculating edit distances among different path descriptors, and clustering the path descriptors whose edit distance is smaller than or equal to a first specified value into the same path descriptor; and
identifying whether negative semantics exist in the dependency relationship of the natural sentence, and if so, filtering out the path descriptor corresponding to the natural sentence.
Further, the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library includes:
counting the number of cases of natural sentences corresponding to the same path descriptor, and filtering the path descriptors of which the case number is smaller than a second designated value;
and carrying out quality evaluation on the filtered path descriptors, storing the path descriptors passing the quality evaluation as rule templates, and establishing a rule template library.
Further, the step of performing quality evaluation on the filtered path descriptors, storing the path descriptors passing the quality evaluation as rule templates, and establishing a rule template library includes:
counting entity pair sets corresponding to the path descriptors to be evaluated;
counting the number of entity pairs in the entity pair set that exist in the CTD;
if the number of entity pairs that exist in the CTD is larger than a specified number threshold, or the ratio of that number to the total number of entity pairs in the entity pair set is larger than a specified ratio threshold, retaining the path descriptor to be evaluated as an available rule template, and storing it to establish a rule template library.
Further, the step of performing knowledge extraction on the full set of medical literature using the rule templates in the rule template library to obtain the gene disease relationship and establish the gene disease relation knowledge base includes:
performing entity recognition on natural sentences in the full amount of medical documents to obtain natural sentences containing gene-disease entity pairs;
performing dependency relationship analysis on all natural sentences containing gene-disease entity pairs respectively to obtain the dependency relationship of each natural sentence;
determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
judging whether the path descriptor is in the rule template library or not;
if yes, acquiring the gene disease relation according to the path descriptor, and storing the gene disease relation in a gene disease relation knowledge base.
The embodiment of the application also provides a device for constructing the gene disease relation knowledge base, comprising:
the dependency relationship analysis module is used for carrying out dependency relationship analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
a path descriptor determining module, configured to determine a path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
the rule template generation module is used for generating rule templates according to the path descriptors of each natural sentence and establishing a rule template library;
and the knowledge extraction module is used for carrying out knowledge extraction on the full amount of medical documents by utilizing the rule templates in the rule template library, obtaining the gene disease relationship and establishing a gene disease relationship knowledge base.
The present application also provides a computer device comprising a memory storing a computer program and a processor implementing the steps of any of the methods described above when the computer program is executed by the processor.
The present application also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the above.
According to the method, the device and the computer equipment for constructing the gene disease relation knowledge base, a large number of rule templates are automatically learned by analyzing the dependency relationships of a specified number of natural sentences containing gene-disease entity pairs, and the rule templates are then used to automatically extract gene disease relationship knowledge from medical literature. High labor cost is not needed, the amount of extracted knowledge is large, the extraction quality is good, and the approach transfers well, so it can be used to extract relationships among other types of medical entities.
Drawings
FIG. 1 is a flow chart of a method for constructing a knowledge base of genetic disease relationships according to an embodiment of the present application;
FIG. 2 is an exemplary diagram of natural language dependency relationships according to one embodiment of the present application;
FIG. 3 is an exemplary diagram of natural language dependency relationships according to another embodiment of the present application;
FIG. 4 is a schematic block diagram of a knowledge base construction device for gene disease relationship according to an embodiment of the present application;
FIG. 5 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The realization, functional characteristics and advantages of the present application will be further described with reference to the embodiments, referring to the attached drawings.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
Referring to FIG. 1, in an embodiment of the present application, a method for constructing a knowledge base of gene disease relationships is provided, including the steps of:
S1, performing dependency relationship analysis on a specified number of natural sentences comprising gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
S2, determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the arrangement sequence of all words on the dependency path of the gene-disease entity pair in the natural sentence;
S3, generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
S4, performing knowledge extraction on the full set of medical literature using the rule templates in the rule template library to obtain gene disease relationships, and establishing a gene disease relation knowledge base.
The task of medical relationship extraction is to determine the relationship between a given gene-disease entity pair based on the semantic information in the medical text that contains the pair. In this embodiment, the rule templates are not constructed by experts. Expert-constructed rule templates require high labor cost and are few in number, so the medical relationship knowledge extracted with them is small in scale and expensive. In this embodiment, a large number of high-quality rule templates are learned automatically from a specified number of natural sentences containing entity pairs, and these templates are then used for knowledge extraction, so that a large amount of medical relationship knowledge is acquired from the full set of medical literature and a knowledge base is constructed.
As described in step S1 above, the rule templates are extracted on the basis of the dependency relationships of natural sentences. Dependency analysis, also known as dependency parsing, is one of the key technologies in natural language processing; it analyzes an input sentence to obtain its syntactic structure. Commonly used dependency analysis tools include the StanfordNLP toolkit from Stanford University, HanLP, spaCy, and FudanNLP from Fudan University. An example, Case 1, is described below.
Case 1: the dependency relationship of "The profile of the apelin makes it a therapeutic target for ischemic heart disease" is shown in FIG. 2, where each arrow represents the direction of the dependency between two words in the sentence, and the text on the arrow (e.g., det, nsubj, case, nmod) represents the specific dependency relationship type; the dependency relationship types of natural sentences have a widely accepted standardized classification. GENE in the figure represents apelin and DISE represents ischemic heart disease.
It can be seen from the figure that the dependency path between the given GENE entity and DISE entity in this sentence is GENE ← profile ← makes → target → DISE, and that makes is the root node (ROOT) of this path.
As described in step S2 above, a path descriptor can be determined from the dependency relationship. Taking Case 1 as an example, arranging all words on the dependency path between the given GENE entity and DISE entity in the order in which they appear in the natural sentence yields "profile GENE makes target DISE", which is referred to as the path descriptor.
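The derivation of a path descriptor from a dependency tree can be sketched as follows. This is a minimal illustration, not the patented implementation: the parse of Case 1 is written by hand to approximate FIG. 2, and the names `CASE1`, `chain_to_root` and `path_descriptor` are hypothetical. In practice the head indices would come from a dependency parser.

```python
# Hand-written parse of Case 1 with the entities already replaced by GENE and DISE.
# Each token is (index, word, head_index); head_index 0 marks the root.
# This parse is an assumption that approximates FIG. 2; a real parser would supply it.
CASE1 = [
    (1, "The", 2), (2, "profile", 6), (3, "of", 5), (4, "the", 5),
    (5, "GENE", 2), (6, "makes", 0), (7, "it", 6), (8, "a", 10),
    (9, "therapeutic", 10), (10, "target", 6), (11, "for", 12), (12, "DISE", 10),
]


def chain_to_root(index, heads):
    """Return the token indices from `index` up to the root, inclusive."""
    chain = [index]
    while heads[chain[-1]] != 0:
        chain.append(heads[chain[-1]])
    return chain


def path_descriptor(tokens, source="GENE", target="DISE"):
    """Join the words on the dependency path between two entities in sentence order."""
    heads = {index: head for index, _, head in tokens}
    words = {index: word for index, word, _ in tokens}
    position = {word: index for index, word, _ in tokens}
    up_source = chain_to_root(position[source], heads)
    up_target = chain_to_root(position[target], heads)
    # The lowest common ancestor is the first node on the source chain
    # that also lies on the target chain.
    on_target_chain = set(up_target)
    ancestor = next(index for index in up_source if index in on_target_chain)
    path = set(up_source[: up_source.index(ancestor) + 1])
    path.update(up_target[: up_target.index(ancestor) + 1])
    return " ".join(words[index] for index in sorted(path))


print(path_descriptor(CASE1))
```

Sorting the path nodes by token index is what turns the tree path GENE ← profile ← makes → target → DISE into the sentence-order descriptor.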
As described in step S3 above, performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs yields a large number of path descriptors. Operations such as deduplication are then performed on the path descriptors to obtain candidate rule templates. The candidate rule templates are sorted, and path descriptors whose number of corresponding cases is smaller than a preset value are filtered out. The quality of the remaining path descriptors is then evaluated, and the path descriptors that pass the evaluation are stored as rule templates in a rule template library.
As described in step S4 above, after the rule template library is established, knowledge extraction is performed on the full set of medical literature to obtain a large number of gene disease relationships, which are stored to establish the gene disease relation knowledge base.
In one embodiment, before the step of performing dependency analysis on a specified number of natural sentences including gene-disease entity pairs to obtain the dependency of each natural sentence, the method includes:
S01, acquiring natural sentences containing gene-disease entity pairs from a designated medical database;
S02, randomly selecting a specified number of natural sentences comprising gene-disease entity pairs.
As described above, establishing the rule templates first requires dependency analysis of a specified number of natural sentences containing gene-disease entity pairs; that is, the rule templates are learned automatically from given natural sentences. In this embodiment, the designated medical database is PubMed, the largest medical literature database, which contained more than 30 million documents as of 2019. The gene entity library is the NCBI gene entity library and the disease entity library is the MeSH disease entity library; both are widely accepted in the medical field and have high quality and wide coverage. The entity libraries provide the standard English names and aliases of genes and diseases, and these names are used to extract from the medical literature sentences that contain both a gene and a disease, such as "Breastfeeding and the risk of breast cancer in BRCA1 mutation carriers", where breast cancer is the name of a disease in the disease entity library and BRCA1 is the name of a gene in the gene entity library. A set of sentences containing both gene and disease entities is acquired from PubMed, and a specified number of natural sentences are drawn from it for dependency analysis, from which the rule templates are finally obtained. More specifically, the specified number is 1 million.
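Steps S01 and S02 can be sketched as a dictionary lookup followed by random sampling. The sketch below is illustrative only: the two small name sets stand in for the NCBI gene and MeSH disease entity libraries, and `find_mention`, `select_sentences` and `sample_sentences` are hypothetical helper names. Whole-word, case-insensitive matching is an assumption; the source does not state how names are matched.

```python
import random
import re

# Toy stand-ins for the NCBI gene dictionary and the MeSH disease dictionary.
GENE_NAMES = {"BRCA1", "apelin"}
DISEASE_NAMES = {"breast cancer", "ischemic heart disease"}


def find_mention(sentence, names):
    """Return the first dictionary name found in the sentence as a whole word."""
    # Longer names are tried first so that a multi-word name wins over a substring.
    for name in sorted(names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            return name
    return None


def select_sentences(sentences):
    """Keep sentences that mention both a gene and a disease (step S01)."""
    selected = []
    for sentence in sentences:
        gene = find_mention(sentence, GENE_NAMES)
        disease = find_mention(sentence, DISEASE_NAMES)
        if gene and disease:
            selected.append((sentence, gene, disease))
    return selected


def sample_sentences(selected, count, seed=0):
    """Randomly pick at most `count` of the selected sentences (step S02)."""
    return random.Random(seed).sample(selected, min(count, len(selected)))
```

For the example sentence in the text, `select_sentences` returns the sentence paired with `BRCA1` and `breast cancer`; a sentence that names only a gene is dropped.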
In one embodiment, the step of performing dependency analysis on a specified number of natural sentences including gene-disease entity pairs to obtain the dependency of each natural sentence includes:
S11, performing dependency relationship analysis on each natural sentence using the natural language processing toolkit StanfordNLP to obtain the dependency relationship of each natural sentence.
As described above, in this embodiment StanfordNLP is selected as the tool for dependency analysis. The StanfordNLP toolkit supports a complete text analysis pipeline for multiple languages, including tokenization, part-of-speech tagging, lemmatization and dependency analysis. In addition, the toolkit provides a Python interface to CoreNLP, so it can easily be set up in a local Python environment.
In one embodiment, after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further includes:
S21, calculating edit distances among different path descriptors, and clustering the path descriptors whose edit distance is smaller than or equal to a first specified value into the same path descriptor; and
S22, identifying whether negative semantics exist in the dependency relationship of the natural sentence, and if so, filtering out the path descriptor corresponding to the natural sentence.
As described in step S21 above, the path descriptors acquired in step S2 contain a great deal of redundancy. For example, the path descriptors {"GENE target in DISE", "GENE target on DISE", "GENE targets in DISE", "GENE targets on DISE"} are in fact redundant. To address this, the present embodiment calculates the edit distance between different path descriptors and treats path descriptors whose edit distance is less than or equal to a first specified value as the same path descriptor. Subsequent statistics show that clustering path descriptors with a first specified value of 2 (the preferred value) reduces the number of rule templates by 60%, which removes a large amount of redundancy. The edit distance between two given character strings is the minimum number of edit operations needed to convert one into the other, where an edit operation may be a deletion, an insertion or a replacement. For example, "GENE target in DISE" becomes "GENE targets in DISE" through one insertion (inserting s), and then becomes "GENE targets on DISE" through one replacement (replacing i with o). The edit distance between "GENE target in DISE" and "GENE targets on DISE" is therefore 2.
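The edit distance calculation and the merging of near-identical path descriptors can be sketched as follows. The distance function is the standard character-level Levenshtein distance. The greedy clustering strategy, in which each descriptor is mapped to the first earlier representative within the threshold, is an assumption made for illustration; the source does not specify which clustering procedure is used, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Character-level Levenshtein distance; insert, delete and replace each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # replace ch_a with ch_b, or keep a match
            ))
        previous = current
    return previous[-1]


def cluster_descriptors(descriptors, max_distance=2):
    """Greedily map each descriptor to the first representative within max_distance."""
    representatives = []
    assignment = {}
    for descriptor in descriptors:
        for representative in representatives:
            if edit_distance(descriptor, representative) <= max_distance:
                assignment[descriptor] = representative
                break
        else:
            # No existing representative is close enough, so start a new cluster.
            representatives.append(descriptor)
            assignment[descriptor] = descriptor
    return assignment


descriptors = [
    "GENE target in DISE", "GENE target on DISE",
    "GENE targets in DISE", "GENE targets on DISE",
]
print(cluster_descriptors(descriptors))
```

With a threshold of 2, all four example descriptors map to the single representative "GENE target in DISE". A greedy pass depends on input order, so a production system might prefer a more stable clustering method.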
As described in step S22 above, negation information (neg) cannot be detected from the path descriptor alone. This is illustrated with a specific example.
Case 2: the dependency of "The profile of the apelin did not make it a therapeutic target for ischemic heart disease" is shown in FIG. 3.
It can be seen that the dependency path between the given GENE and DISE is GENE ← profile ← make → target → DISE, and the corresponding path descriptor is "profile GENE make target DISE", where make is the root node (ROOT) of the path. Judged by the path descriptor alone, Case 2 appears to express the same semantics as the Case 1 path descriptor "profile GENE makes target DISE", but Case 2 actually expresses negative semantics. This can be seen from the dependencies of the Case 2 root node make, which include a negation dependency (neg). Therefore, all dependencies of the root node are checked, and if any dependency of the root node is neg, the sample is filtered out when generating rule templates.
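The negation check on the root node can be sketched as follows. The parse of Case 2 is written by hand to approximate FIG. 3 and is an assumption, as are the names `CASE2`, `root_has_negation` and `keep_for_templates`. The sketch uses the sentence root, which coincides with the root of the dependency path in both cases discussed here. The relation label for negation is a parameter because parsers differ in how they label it.

```python
# Tokens are (index, word, head_index, relation); head_index 0 marks the root.
# Hand-written parse that approximates FIG. 3 (an assumption, not parser output).
CASE2 = [
    (1, "The", 2, "det"), (2, "profile", 8, "nsubj"), (3, "of", 5, "case"),
    (4, "the", 5, "det"), (5, "GENE", 2, "nmod"), (6, "did", 8, "aux"),
    (7, "not", 8, "neg"), (8, "make", 0, "root"), (9, "it", 8, "dobj"),
    (10, "a", 12, "det"), (11, "therapeutic", 12, "amod"),
    (12, "target", 8, "xcomp"), (13, "for", 14, "case"), (14, "DISE", 12, "nmod"),
]


def root_has_negation(tokens, negation_relations=("neg",)):
    """Return True if any direct dependent of the root carries a negation relation."""
    root = next(index for index, _, head, _ in tokens if head == 0)
    return any(
        head == root and relation in negation_relations
        for _, _, head, relation in tokens
    )


def keep_for_templates(parsed_sentences):
    """Drop parsed sentences whose root is negated before template generation."""
    return [tokens for tokens in parsed_sentences if not root_has_negation(tokens)]


print(root_has_negation(CASE2))
```

Only direct dependents of the root are inspected, which matches the rule stated in the text; negation attached deeper in the tree would not be caught by this sketch.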
In one embodiment, the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library includes:
S31, counting the number of natural sentence cases corresponding to the same path descriptor, and filtering out the path descriptors whose case number is smaller than a second specified value;
S32, performing quality evaluation on the filtered path descriptors, storing the path descriptors that pass the quality evaluation as rule templates, and establishing a rule template library.
As described above, a path descriptor is obtained in step S2. Taking Case 1 as an example, the sentence is "The profile of the apelin makes it a therapeutic target for ischemic heart disease", where the given GENE is apelin and the given DISE is ischemic heart disease. The path descriptor "profile GENE makes target DISE" is derived (here the path descriptor is a candidate rule template). Processing each data sample yields the information {data sample, entity pair in the data sample, corresponding path descriptor}.
For example, Case 1 yields {"The profile of the apelin makes it a therapeutic target for ischemic heart disease", <apelin, ischemic heart disease>, "profile GENE makes target DISE"}.
The edit distance between every two path descriptors is then calculated, and path descriptors whose edit distance is less than or equal to 2 are treated as the same path descriptor, which resolves the redundancy problem. The resulting path descriptors serve as candidate rule templates. For example, the path descriptor "profile GENE makes target DISE" of Case 1 may be simplified to "profile GENE make target DISE", and all data samples whose path descriptors are merged in this way are assigned the new path descriptor. After all data samples are processed, data of the following form is obtained:
{case 1, gene-disease entity pair, path descriptor 1}, …, {case n, gene-disease entity pair, path descriptor m}
From data in this format, the cases of each path descriptor can be obtained through simple statistics:
{path descriptor 1, all corresponding cases, corresponding entity pair set 1}, …, {path descriptor m, all corresponding cases, corresponding entity pair set m}.
From the cases corresponding to each path descriptor, the number of data samples for each path descriptor can be obtained through simple statistics, in the following format: {path descriptor 1, corresponding case number}, …, {path descriptor m, corresponding case number}. The path descriptors are sorted by case number, and those whose case number is smaller than the second specified value (here set to 3) are filtered out. This improves the generality and accuracy of the extracted path descriptors.
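The grouping and frequency filtering in step S31 can be sketched as follows. The record layout (sentence, entity pair, descriptor) follows the format given in the text; the function names and the sample records in the usage note are hypothetical.

```python
from collections import defaultdict


def group_by_descriptor(records):
    """Group (sentence, (gene, disease), descriptor) records by path descriptor."""
    cases = defaultdict(list)  # descriptor -> list of sentences
    pairs = defaultdict(set)   # descriptor -> set of (gene, disease) pairs
    for sentence, pair, descriptor in records:
        cases[descriptor].append(sentence)
        pairs[descriptor].add(pair)
    return dict(cases), dict(pairs)


def filter_by_case_count(cases, min_cases=3):
    """Keep descriptors with at least min_cases sentences, most frequent first."""
    counts = {descriptor: len(sentences) for descriptor, sentences in cases.items()}
    kept = [descriptor for descriptor, count in counts.items() if count >= min_cases]
    return sorted(kept, key=lambda descriptor: counts[descriptor], reverse=True)
```

With the second specified value set to 3, a descriptor supported by exactly three sentences is kept and one supported by two is dropped. The entity pair sets returned by `group_by_descriptor` are what the later quality evaluation consumes.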
And then carrying out quality evaluation on the filtered path descriptors, saving the path descriptors passing the quality evaluation as rule templates, and establishing a rule template library. The quality evaluation method can adopt manual crowdsourcing, supervised learning and the like.
In a specific embodiment, the step of performing quality evaluation on the filtered path descriptors, storing the path descriptors that pass the quality evaluation as rule templates, and establishing a rule template library includes:
S321, counting the entity pair set corresponding to the path descriptor to be evaluated;
S322, counting the number of entity pairs in the entity pair set that exist in the CTD;
S323, if the number of entity pairs that exist in the CTD is larger than a specified number threshold, or the ratio of that number to the total number of entity pairs in the set is larger than a specified ratio threshold, retaining the path descriptor to be evaluated as an available rule template, and storing the available rule template to establish a rule template library.
As described above, in this embodiment the rule templates are evaluated using the idea of distant supervision. The core idea of distant supervision is that if a knowledge triplet (e.g., <ACE, target, heart failure>, indicating that the gene ACE has a target relationship with the disease heart failure) already exists in an existing knowledge base, then a text that mentions the entity pair (e.g., ACE and heart failure) is highly likely to describe the target semantics of that entity pair. Specifically, the existing knowledge base used is CTD (the Comparative Toxicogenomics Database), a medical knowledge base that is widely accepted in the medical field.
Any path descriptor i is selected from the set of path descriptors 1 to m, together with its corresponding entity pair set i, and the number of entity pairs in entity pair set i that exist in the CTD knowledge base is counted. If this number is greater than a specified number threshold (preferably 4), or the ratio of this number to the total number of entity pairs in the entity pair set is greater than a specified ratio threshold (preferably 0.5), path descriptor i is retained as an available rule template. All available rule templates are stored to construct the rule template library.
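The distant supervision check in steps S321 to S323 can be sketched as follows. The set `known_pairs` stands in for the gene-disease pairs recorded in CTD; how those pairs are loaded is outside the scope of this sketch, and the function names are hypothetical. The default thresholds are the preferred values given in the text.

```python
def passes_quality_check(entity_pairs, known_pairs, min_count=4, min_ratio=0.5):
    """Return True if enough of a descriptor's entity pairs appear in the reference base.

    A descriptor passes when the number of supported pairs exceeds min_count,
    or when the supported fraction of its pairs exceeds min_ratio.
    """
    pairs = set(entity_pairs)
    if not pairs:
        return False
    hits = len(pairs & set(known_pairs))
    return hits > min_count or hits / len(pairs) > min_ratio


def build_template_library(pairs_by_descriptor, known_pairs):
    """Keep every descriptor whose entity pair set passes the quality check."""
    return {
        descriptor
        for descriptor, pairs in pairs_by_descriptor.items()
        if passes_quality_check(pairs, known_pairs)
    }
```

Both comparisons are strict, following the wording "greater than" in the text, so a descriptor with exactly half of its pairs supported and no more than four supported pairs is rejected.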
In one embodiment, the step of performing knowledge extraction on the full set of medical literature using the rule templates in the rule template library to obtain the gene disease relationship and establish the gene disease relation knowledge base includes:
S41, performing entity recognition on natural sentences in the full set of medical literature to obtain natural sentences containing gene-disease entity pairs;
S42, performing dependency relationship analysis on each of the natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
S43, determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
S44, judging whether the path descriptor is in the rule template library;
S45, if yes, acquiring a gene disease relationship according to the path descriptor, and storing the gene disease relationship in a gene disease relation knowledge base.
As described above, in steps S1 to S3, path descriptors are obtained by performing dependency analysis on natural sentences, rule templates are then obtained by performing operations such as quality evaluation on the path descriptors, and a rule template library is established. In the process of establishing the rule template library, 1,000,000 natural sentences containing gene-disease entity pairs are selected. In the extraction stage, gene and disease entity recognition is performed on the natural sentences in the full body of medical literature to obtain all natural sentences containing gene-disease entity pairs. Dependency analysis is then performed on these sentences in turn using the toolkit to obtain the dependency relationship of each natural sentence and determine its path descriptor. It is then judged whether the path descriptor is in the rule template library created through steps S1 to S3; if so, the relationship between the gene and the disease (such as target in case 1) is obtained according to the path descriptor, and the gene disease relationship is stored in the gene disease relationship knowledge base.
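As a non-limiting sketch, steps S44 and S45 can be pictured as a lookup of each sentence's path descriptor in the rule template library. The example assumes that entity recognition and dependency analysis (S41 to S43) have already produced a gene, a disease, and a path descriptor for each sentence; the `ParsedSentence` type, the mapping from descriptor to relation label, and the sample data are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class ParsedSentence(NamedTuple):
    """Assumed output of S41 to S43 for one natural sentence."""
    gene: str
    disease: str
    path_descriptor: str


Triple = Tuple[str, str, str]  # (gene, relation, disease)


def extract_relations(
    sentences: Iterable[ParsedSentence],
    template_relations: Dict[str, str],
) -> Set[Triple]:
    """S44/S45: keep a triple only when the descriptor is a known rule template."""
    knowledge_base: Set[Triple] = set()
    for sentence in sentences:
        relation = template_relations.get(sentence.path_descriptor)
        if relation is not None:  # descriptor is in the rule template library
            knowledge_base.add((sentence.gene, relation, sentence.disease))
    return knowledge_base


# Hypothetical rule template library: descriptor -> relation label.
rule_templates = {"GENE target DISEASE": "target"}

parsed = [
    ParsedSentence("ACE", "heart failure", "GENE target DISEASE"),
    ParsedSentence("TP53", "glioma", "GENE measured DISEASE"),  # no matching template
]

kb = extract_relations(parsed, rule_templates)
print(kb)
```

Storing the library as a dictionary keeps the membership test of S44 and the relation lookup of S45 in a single constant-time operation, which matters when the full literature is processed.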
In one embodiment, the medical database, the rule templates, the genetic disease relational knowledge base, and the like are stored in nodes of a blockchain in which the genetic disease relational knowledge base construction method as described above is implemented.
As mentioned above, a blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like. The blockchain underlying platform may include processing modules for user management, basic services, smart contracts, operation monitoring, and the like. The user management module is responsible for identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the correspondence between a user's real identity and blockchain address (authority management), and, where authorized, supervising and auditing the transactions of certain real identities and providing rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices and is used to verify the validity of service requests and to record a valid request to storage once consensus on it has been reached; for a new service request, the basic service first performs interface adaptation parsing and authentication (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits the encrypted information completely and consistently to the shared ledger (network communication), and records and stores it. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; a developer can define contract logic in a programming language and publish it to the blockchain (contract registration), execution is triggered by a key or another event according to the logic of the contract terms to complete the contract logic, and the module also provides functions for upgrading and deregistering contracts. The operation monitoring module is mainly responsible for deployment during product release, configuration modification, contract settings, cloud adaptation, and visual output of real-time status during product operation, for example: alarms, monitoring network conditions, monitoring node device health status, etc.
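The statement that each data block is cryptographically associated with the previous one can be illustrated with a minimal hash chain. This is only a sketch under stated assumptions: it uses SHA-256 from the Python standard library, stores blocks in a local list, and omits consensus, networking, and smart contracts entirely; the field names and the sample payloads are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_PREV_HASH = "0" * 64  # assumed placeholder for the first block


def block_hash(index: int, payload: Any, prev_hash: str) -> str:
    """Hash the block content together with the hash of the previous block."""
    body = json.dumps(
        {"index": index, "payload": payload, "prev_hash": prev_hash},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_block(chain: List[Dict[str, Any]], payload: Any) -> Dict[str, Any]:
    """Create a block linked to the current tail of the chain and append it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    index = len(chain)
    block = {
        "index": index,
        "payload": payload,
        "prev_hash": prev_hash,
        "hash": block_hash(index, payload, prev_hash),
    }
    chain.append(block)
    return block


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Recompute every hash and check that each block points at its predecessor."""
    prev_hash = GENESIS_PREV_HASH
    for position, block in enumerate(chain):
        expected = block_hash(block["index"], block["payload"], block["prev_hash"])
        if (
            block["index"] != position
            or block["prev_hash"] != prev_hash
            or block["hash"] != expected
        ):
            return False
        prev_hash = block["hash"]
    return True


chain: List[Dict[str, Any]] = []
append_block(chain, {"rule_template": "GENE target DISEASE"})
append_block(chain, {"triple": ["ACE", "target", "heart failure"]})
print(is_valid(chain))
```

Altering the payload of any stored block changes its recomputed hash, so `is_valid` then reports the chain as invalid; this is the anti-counterfeiting property referred to above.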
According to the method for constructing the gene disease relationship knowledge base, a large number of rule templates are automatically learned by performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and the rule templates are then used to automatically extract gene-disease relationship knowledge from medical literature. No high labor cost is required, a large amount of knowledge is extracted, and the extraction effect is good; the method also has good transferability and applicability and can be used to extract relationships among more kinds of medical entities.
Referring to fig. 4, an embodiment of the present application further provides a device for constructing a knowledge base of gene disease relationships, including:
the dependency relationship analysis module 1 is used for carrying out dependency relationship analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
a path descriptor determining module 2, configured to determine a path descriptor of each natural sentence according to a dependency relationship of each natural sentence, where the path descriptor refers to an arrangement order of all words on a dependency path of a genetic disease entity in the natural sentence;
a rule template generating module 3, configured to generate a rule template according to the path descriptor of each natural sentence, and establish a rule template library;
and the knowledge extraction module 4 is used for carrying out knowledge extraction on the full amount of medical documents by utilizing the rule templates in the rule template library to acquire gene disease relations and establishing a gene disease relation knowledge base.
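To make the notion of a path descriptor concrete, the following sketch derives one from a dependency parse that is assumed to be already available as a list of (head index, dependent index) edges. The breadth-first search, the treatment of a multi-word entity as a single token, and the replacement of the two entities by the placeholders `GENE` and `DISEASE` are assumptions of this example, not details stated in the description; the hand-written parse is likewise illustrative and is not the output of any particular toolkit.

```python
from collections import deque
from typing import Dict, List, Optional, Sequence, Tuple


def path_descriptor(
    tokens: Sequence[str],
    edges: Sequence[Tuple[int, int]],  # (head index, dependent index)
    gene_idx: int,
    disease_idx: int,
) -> Optional[str]:
    """Return the words on the dependency path from the gene to the disease."""
    # Treat the dependency tree as an undirected graph.
    adjacency: Dict[int, List[int]] = {i: [] for i in range(len(tokens))}
    for head, dependent in edges:
        adjacency[head].append(dependent)
        adjacency[dependent].append(head)

    # Breadth-first search from the gene token, remembering each node's parent.
    parent: Dict[int, Optional[int]] = {gene_idx: None}
    queue = deque([gene_idx])
    while queue:
        node = queue.popleft()
        if node == disease_idx:
            break
        for neighbour in adjacency[node]:
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)

    if disease_idx not in parent:
        return None  # the two entities are not connected in this parse

    # Walk back from the disease to the gene, then reverse the path.
    path: List[int] = []
    cursor: Optional[int] = disease_idx
    while cursor is not None:
        path.append(cursor)
        cursor = parent[cursor]
    path.reverse()

    words = []
    for i in path:
        if i == gene_idx:
            words.append("GENE")
        elif i == disease_idx:
            words.append("DISEASE")
        else:
            words.append(tokens[i])
    return " ".join(words)


# Hand-written toy parse of "ACE is a target for heart_failure".
tokens = ["ACE", "is", "a", "target", "for", "heart_failure"]
edges = [(3, 0), (3, 1), (3, 2), (3, 5), (5, 4)]

descriptor = path_descriptor(tokens, edges, gene_idx=0, disease_idx=5)
print(descriptor)
```

In the toy parse the word "target" governs both entities, so the path runs from the gene through "target" to the disease and words such as "is", "a", and "for" that lie off the path are left out.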
In one embodiment, the genetic disease relationship knowledge base construction apparatus further includes:
a natural sentence acquisition module for acquiring natural sentences containing gene-disease entity pairs in a designated medical database;
and the selection module is used for randomly selecting a specified number of natural sentences containing the gene-disease entity pairs.
In one embodiment, the dependency analysis module 1 includes:
and the dependency relationship analysis unit is used for carrying out dependency relationship analysis on each natural sentence by using the natural language processing toolkit StanfordNLP to obtain the dependency relationship of each natural sentence.
In one embodiment, the genetic disease relationship knowledge base construction apparatus further includes:
the clustering module is used for calculating the edit distance between different path descriptors and clustering the path descriptors whose edit distance is smaller than or equal to a first specified value into the same path descriptor;
the filtering module is used for identifying whether negative semantics exist in the dependency relationship in the natural sentence, and if so, filtering out the path descriptor corresponding to the natural sentence.
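The clustering and filtering modules could, for example, be approximated as below. The sketch assumes a token-level Levenshtein distance, a greedy assignment of each descriptor to the first sufficiently close representative, a hypothetical first specified value of 1, and a simple negation test based on a `neg` dependency label or a small cue-word list; none of these specifics are fixed by the description.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + (token_a != token_b),  # substitution
                )
            )
        previous = current
    return previous[-1]


def cluster_descriptors(
    descriptors: Iterable[str], max_distance: int = 1
) -> Dict[str, str]:
    """Map each descriptor to the representative of its cluster (greedy)."""
    representatives: List[str] = []
    mapping: Dict[str, str] = {}
    for descriptor in descriptors:
        for representative in representatives:
            distance = edit_distance(descriptor.split(), representative.split())
            if distance <= max_distance:
                mapping[descriptor] = representative
                break
        else:  # no close representative found: start a new cluster
            representatives.append(descriptor)
            mapping[descriptor] = descriptor
    return mapping


NEGATION_CUES = {"not", "no", "never", "without", "n't"}  # assumed cue list


def has_negation(dependencies: Iterable[Tuple[str, str, str]]) -> bool:
    """dependencies: (label, head word, dependent word) triples of one sentence."""
    return any(
        label == "neg" or dependent.lower() in NEGATION_CUES
        for label, _head, dependent in dependencies
    )


clusters = cluster_descriptors(
    [
        "GENE is target of DISEASE",
        "GENE is a target of DISEASE",
        "GENE expressed in DISEASE",
    ]
)
print(clusters)
print(has_negation([("nsubj", "target", "ACE"), ("advmod", "target", "not")]))
```

Greedy clustering depends on the order in which descriptors arrive; a different clustering strategy could be substituted without changing the surrounding steps.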
In one embodiment, the rule template generation module 3 includes:
the statistics module is used for counting the number of natural-sentence cases corresponding to the same path descriptor and filtering out the path descriptors whose case number is smaller than a second specified value;
the quality evaluation module is used for carrying out quality evaluation on the filtered path descriptors, saving the path descriptors passing the quality evaluation as rule templates and establishing a rule template library.
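The counting performed by the statistics module may be pictured as a frequency filter over the descriptors observed in the sampled sentences. In this sketch the second specified value is given the hypothetical default of 2, and the input is assumed to be one path descriptor per natural sentence.

```python
from collections import Counter
from typing import Dict, Iterable


def filter_by_case_count(
    sentence_descriptors: Iterable[str], min_cases: int = 2
) -> Dict[str, int]:
    """Drop descriptors supported by fewer than min_cases natural sentences."""
    counts = Counter(sentence_descriptors)
    return {d: n for d, n in counts.items() if n >= min_cases}


observed = [
    "GENE target DISEASE",
    "GENE target DISEASE",
    "GENE target DISEASE",
    "GENE measured DISEASE",
]

frequent = filter_by_case_count(observed)
print(frequent)
```

Only the descriptors that survive this filter would then be passed on to the quality evaluation module.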
In one embodiment, the quality assessment module comprises:
the first statistics unit is used for counting entity pair sets corresponding to the path descriptors to be evaluated;
a second statistics unit, configured to count the number of entity pairs in the entity pair set that exist in the CTD;
and the processing unit is used for reserving the path descriptor to be evaluated as an available rule template and storing the available rule template to establish a rule template library if the existing number is larger than a specified number threshold or the ratio of the existing number to the total number of entity pairs in the entity pair set is larger than a specified ratio threshold.
In one embodiment, the knowledge extraction module 4 comprises:
the entity identification unit is used for carrying out entity identification on natural sentences in the full medical literature to obtain natural sentences containing gene-disease entity pairs;
a dependency relationship analysis unit for performing dependency relationship analysis on all natural sentences including gene-disease entity pairs, respectively, to obtain a dependency relationship of each natural sentence;
a path descriptor determining unit configured to determine a path descriptor of each natural sentence according to a dependency relationship of each natural sentence;
a judging unit, configured to judge whether the path descriptor is in the rule template library;
and the acquisition unit is used for acquiring the gene disease relationship according to the path descriptor if the path descriptor is in the rule template library, and storing the gene disease relationship in a gene disease relationship knowledge base.
As described above, it may be understood that each component of the genetic disease relational knowledge base construction device provided in the present application may implement the function of any one of the genetic disease relational knowledge base construction methods described above, and the specific structure will not be described again.
Referring to fig. 5, a computer device is further provided in an embodiment of the present application, where the computer device may be a server, and the internal structure of the computer device may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data such as rule templates. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for constructing a gene disease relationship knowledge base.
The processor executes the method for constructing the gene disease relation knowledge base, which comprises the following steps:
performing dependency relationship analysis on a specified number of natural sentences comprising gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the arrangement sequence of all words on the dependency path of the gene disease entity in the natural sentence;
generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
and carrying out knowledge extraction on the whole medical literature by utilizing the rule templates in the rule template library to obtain a gene disease relationship, and establishing a gene disease relationship knowledge base.
An embodiment of the present application further provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor, implements a method for constructing a knowledge base of genetic disease relationships, including the steps of:
performing dependency relationship analysis on a specified number of natural sentences comprising gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the arrangement sequence of all words on the dependency path of the gene disease entity in the natural sentence;
generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
and carrying out knowledge extraction on the whole medical literature by utilizing the rule templates in the rule template library to obtain a gene disease relationship, and establishing a gene disease relationship knowledge base.
According to the method for constructing the gene disease relationship knowledge base, a large number of rule templates are automatically learned by performing dependency analysis on a specified number of natural sentences containing gene-disease entity pairs, and the rule templates are then used to automatically extract gene-disease relationship knowledge from medical literature. No high labor cost is required, a large amount of knowledge is extracted, and the extraction effect is good; the method also has good transferability and applicability and can be used to extract relationships among more kinds of medical entities.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (8)

1. The method for constructing the gene disease relation knowledge base is characterized by comprising the following steps:
performing dependency relationship analysis on a specified number of natural sentences comprising gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence, wherein the path descriptor refers to the arrangement sequence of all words on the dependency path of the gene disease entity in the natural sentence;
generating a rule template according to the path descriptor of each natural sentence, and establishing a rule template library;
carrying out knowledge extraction on the whole medical literature by utilizing the rule templates in the rule template library to acquire a gene disease relationship, and establishing a gene disease relationship knowledge base;
after the step of determining the path descriptor of each natural sentence according to the dependency relationship of each natural sentence, and before the step of generating a rule template according to the path descriptor of each natural sentence and establishing a rule template library, the method further comprises:
calculating edit distances between different path descriptors, and clustering the path descriptors whose edit distance is smaller than or equal to a first specified value into the same path descriptor; and
identifying whether negative semantics exist in the dependency relationship in the natural sentence, and if so, filtering out a path descriptor corresponding to the natural sentence;
the step of generating a rule template according to the path descriptor of each natural sentence, and the step of establishing a rule template library comprises the following steps:
counting the number of cases of natural sentences corresponding to the same path descriptor, and filtering the path descriptors of which the case number is smaller than a second designated value;
and carrying out quality evaluation on the filtered path descriptors, storing the path descriptors passing the quality evaluation as rule templates, and establishing a rule template library.
2. The method according to claim 1, wherein the step of performing dependency analysis on a predetermined number of natural sentences including gene-disease entity pairs to obtain dependency of each of the natural sentences comprises:
acquiring natural sentences comprising gene-disease entity pairs from a designated medical database;
a specified number of natural sentences comprising gene-disease entity pairs are randomly selected.
3. The method of claim 1, wherein the step of performing dependency analysis on a specified number of natural sentences including gene-disease entity pairs to obtain the dependency of each of the natural sentences comprises:
and carrying out dependency analysis on each natural sentence by using a natural language processing toolkit StanfordNLP to obtain the dependency of each natural sentence.
4. The method of claim 1, wherein the step of performing quality evaluation on the filtered path descriptors, storing the path descriptors passing the quality evaluation as rule templates, and creating a rule template library comprises:
counting entity pair sets corresponding to the path descriptors to be evaluated;
counting the number of entity pairs in the entity pair set existing in the CTD;
if the number of existing entity pairs is larger than a specified number threshold, or the ratio of the number of existing entity pairs to the total number of entity pairs in the entity pair set is larger than a specified ratio threshold, the path descriptor to be evaluated is retained as an available rule template, and the available rule templates are stored to establish a rule template library.
5. The method for constructing a knowledge base of genetic disease relationships according to claim 1, wherein the step of extracting knowledge from a full amount of medical documents using the rule templates in the rule template library to obtain genetic disease relationships, and establishing the knowledge base of genetic disease relationships comprises:
performing entity recognition on natural sentences in the full amount of medical documents to obtain natural sentences containing gene-disease entity pairs;
performing dependency relationship analysis on all natural sentences containing gene-disease entity pairs respectively to obtain the dependency relationship of each natural sentence;
determining a path descriptor of each natural sentence according to the dependency relationship of each natural sentence;
judging whether the path descriptor is in the rule template library or not;
if yes, acquiring the gene disease relation according to the path descriptor, and storing the gene disease relation in a gene disease relation knowledge base.
6. A genetic disease relational knowledge base construction apparatus, comprising:
the dependency relationship analysis module is used for carrying out dependency relationship analysis on a specified number of natural sentences containing gene-disease entity pairs to obtain the dependency relationship of each natural sentence;
a path descriptor determining module, configured to determine a path descriptor of each natural sentence according to a dependency relationship of each natural sentence, where the path descriptor refers to an arrangement order of all words on a dependency path of a gene disease entity in the natural sentence;
the rule template generation module is used for generating rule templates according to the path descriptors of each natural sentence and establishing a rule template library;
the knowledge extraction module is used for extracting knowledge of the whole medical literature by utilizing the rule templates in the rule template library, obtaining a gene disease relationship and establishing a gene disease relationship knowledge base;
the genetic disease relation knowledge base construction device further comprises:
the clustering module is used for calculating the edit distance between different path descriptors and clustering the path descriptors whose edit distance is smaller than or equal to a first specified value into the same path descriptor;
the filtering module is used for identifying whether negative semantics exist in the dependency relationship in the natural sentence, and if so, filtering out a path descriptor corresponding to the natural sentence;
the rule template generation module comprises:
the statistics module is used for counting the number of natural-sentence cases corresponding to the same path descriptor and filtering out the path descriptors whose case number is smaller than a second specified value;
the quality evaluation module is used for carrying out quality evaluation on the filtered path descriptors, saving the path descriptors passing the quality evaluation as rule templates and establishing a rule template library.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 5 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 5.
CN202010941642.7A 2020-09-09 2020-09-09 Gene disease relation knowledge base construction method, device and computer equipment Active CN112036151B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010941642.7A CN112036151B (en) 2020-09-09 2020-09-09 Gene disease relation knowledge base construction method, device and computer equipment
PCT/CN2020/125143 WO2021155684A1 (en) 2020-09-09 2020-10-30 Gene-disease relationship knowledge base construction method and apparatus, and computer device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010941642.7A CN112036151B (en) 2020-09-09 2020-09-09 Gene disease relation knowledge base construction method, device and computer equipment

Publications (2)

Publication Number Publication Date
CN112036151A CN112036151A (en) 2020-12-04
CN112036151B CN112036151B (en) 2024-04-05

Family

ID=73584487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010941642.7A Active CN112036151B (en) 2020-09-09 2020-09-09 Gene disease relation knowledge base construction method, device and computer equipment

Country Status (2)

Country Link
CN (1) CN112036151B (en)
WO (1) WO2021155684A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113626567A (en) * 2021-07-28 2021-11-09 上海基绪康生物科技有限公司 Method for mining information related to genes and diseases from biomedical literature
CN113836924A (en) * 2021-09-16 2021-12-24 东软集团股份有限公司 Entity relationship extraction method and device, storage medium and electronic equipment
CN114997398B (en) * 2022-03-09 2023-05-26 哈尔滨工业大学 Knowledge base fusion method based on relation extraction

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5424947A (en) * 1990-06-15 1995-06-13 International Business Machines Corporation Natural language analyzing apparatus and method, and construction of a knowledge base for natural language analysis
CN105138507A (en) * 2015-08-06 2015-12-09 电子科技大学 Pattern self-learning based Chinese open relationship extraction method
CN106897568A (en) * 2017-02-28 2017-06-27 北京大数医达科技有限公司 The treating method and apparatus of case history structuring
CN107797991A (en) * 2017-10-23 2018-03-13 南京云问网络技术有限公司 A kind of knowledge mapping extending method and system based on interdependent syntax tree
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
CN109545373A (en) * 2018-11-08 2019-03-29 新博卓畅技术(北京)有限公司 A kind of automatic abstracting method of human body diseases symptom characteristic, system and equipment
CN110032649A (en) * 2019-04-12 2019-07-19 北京科技大学 Relation extraction method and device between a kind of entity of TCM Document
CN111339774A (en) * 2020-02-07 2020-06-26 腾讯科技(深圳)有限公司 Text entity relation extraction method and model training method
CN111428036A (en) * 2020-03-23 2020-07-17 浙江大学 Entity relationship mining method based on biomedical literature
CN111554360A (en) * 2020-04-27 2020-08-18 大连理工大学 Drug relocation prediction method based on biomedical literature and domain knowledge data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140156264A1 (en) * 2012-11-19 2014-06-05 University of Washington through it Center for Commercialization Open language learning for information extraction
CN106021281A (en) * 2016-04-29 2016-10-12 京东方科技集团股份有限公司 Method for establishing medical knowledge graph, device for same and query method for same
CN109902301B (en) * 2019-02-26 2023-02-10 广东工业大学 Deep neural network-based relationship reasoning method, device and equipment
CN111291568B (en) * 2020-03-06 2023-03-31 西南交通大学 Automatic entity relationship labeling method applied to medical texts

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
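Title
"PKDE4J Entity and relation extraction for public knowledge discovery"; Min Song et al.; Journal of Biomedical Informatics; Vol. 57; pp. 320-332 *
"基于生物医学文献挖掘的疾病-基因-药物关系抽取研究" [Research on disease-gene-drug relation extraction based on biomedical literature mining]; Zhai Juye et al.; Journal of Xinyu University; 2018-04-10 (No. 02); pp. 95-100 *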

Also Published As

Publication number Publication date
WO2021155684A1 (en) 2021-08-12
CN112036151A (en) 2020-12-04

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40040362; Country of ref document: HK)
SE01 Entry into force of request for substantive examination
GR01 Patent grant