CN116246701B - Data analysis device, medium and equipment based on phenotype term and variant gene - Google Patents


Info

Publication number
CN116246701B
Authority
CN
China
Prior art keywords
phenotype
term
disease
terms
phenotypic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310116429.6A
Other languages
Chinese (zh)
Other versions
CN116246701A (en)
Inventor
牟文博
汤莎
张晶
方萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kingmed Diagnostics Group Co ltd
Guangzhou Kingmed Diagnostics Central Co Ltd
Original Assignee
Guangzhou Kingmed Diagnostics Group Co ltd
Guangzhou Kingmed Diagnostics Central Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kingmed Diagnostics Group Co ltd, Guangzhou Kingmed Diagnostics Central Co Ltd filed Critical Guangzhou Kingmed Diagnostics Group Co ltd
Priority to CN202310116429.6A priority Critical patent/CN116246701B/en
Publication of CN116246701A publication Critical patent/CN116246701A/en
Application granted granted Critical
Publication of CN116246701B publication Critical patent/CN116246701B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/40 Population genetics; Linkage disequilibrium
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Public Health (AREA)
  • Ecology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Physiology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a data analysis device, medium and equipment based on phenotype terms and variant genes. By acquiring phenotype terms and variant genes, calculating a disease phenotype likelihood ratio, a disease genotype likelihood ratio and a phenotype genotype composite likelihood ratio, and analyzing and outputting a variant gene ranking for the target object together with the diseases related to the ranked variant genes, the invention can effectively quantify the relevance of genes and diseases, provide richer and more accurate information covering more elements for an analyst evaluating the relationship between a subject's phenotype and variants, and provide highly interpretable reasons for its recommendations.

Description

Data analysis device, medium and equipment based on phenotype term and variant gene
Technical Field
The invention relates to the technical field of biological information analysis, in particular to a data analysis device, medium and equipment based on phenotypic terms and variant genes.
Background
Human genetic diseases are diseases caused by changes in genetic material (including changes in chromosome number, structural abnormalities, single-gene mutations, etc.). More than 8,000 genetic diseases are currently known, and new disease types are found each year. Most genetic diseases have an early onset, affect many parts of the body, cause severe symptoms and run in families, which makes them an important threat to human health. Timely and correct diagnosis of genetic diseases allows the prognosis and treatment options of patients to be judged and selected reasonably and the genetic risk to offspring and relatives to be assessed accurately, and is therefore of great clinical significance. Owing to the high complexity of the human genome, clinical diagnosis of genetic diseases often requires a variety of techniques and detection methods. With recent technical developments, detection methods based on high-throughput sequencing (NGS) have been widely used in the molecular diagnosis of genetic diseases and have become one of the most important molecular diagnostic methods.
The human genome is highly complex. Although high-throughput sequencing can simultaneously detect tens of thousands to hundreds of thousands of variants across approximately 20,000 genes of a target object, effectively combining the clinical symptoms of the target object in order to screen and analyze the variant genes related to genetic diseases out of this massive set of variants remains a challenge. Existing retrieval and analysis methods address this problem through multi-step information retrieval, matching and checking, and the whole process is time-consuming and prone to omissions.
Disclosure of Invention
Based on this, it is necessary to provide a data analysis device, medium and equipment based on phenotypic terms and variant genes, to solve the problem that variant genes related to genetic diseases cannot be effectively screened out from a vast number of variants.
A data analysis device based on phenotypic terms and variant genes, the data analysis device comprising:
a phenotypic term primary screening module for acquiring clinical description information of a target object, and screening at least one initial phenotypic term related to the target object according to the clinical description information;
a related disease screening module for searching for at least one related disease associated with the at least one initial phenotypic term according to a hierarchical relationship in a predetermined set of human phenotypic terms;
A phenotypic term optimization module for optimizing the at least one initial phenotypic term based on the number of associated diseases to obtain an optimized at least one optimized phenotypic term;
a disease phenotype likelihood ratio calculation module for calculating a disease phenotype likelihood ratio for the target subject based on the disease condition of each optimized phenotype term;
a variant gene acquisition module, configured to acquire at least one variant gene in the target gene sequence;
a disease genotype likelihood ratio calculation module for calculating a disease genotype likelihood ratio of the target subject based on the pathogenic condition of the at least one variant gene;
a phenotype genotype composite likelihood ratio calculation module for calculating a phenotype genotype composite likelihood ratio for the target subject based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio;
and an analysis output module for ranking variants based on the phenotype genotype composite likelihood ratio and outputting a variant gene ranking consistent with the target object and the diseases related to the ranked variant genes.
In one embodiment, the set of human phenotypic terms includes a plurality of term units, each term unit consisting of a plurality of hierarchically progressive phenotypic terms, the i-th level phenotypic term being a parent term of the (i+1)-th level phenotypic term, the (i+1)-th level phenotypic term being a child term of the i-th level phenotypic term, each phenotypic term being directly associated with at least one disease, the associated disease screening module being specifically configured to:
Acquiring all diseases directly related to the target initial term as related diseases; wherein the target initial term is any one of the at least one initial phenotypic term;
traversing all sub-terms of the target initial term, and taking all diseases directly related to all traversed sub-terms as related diseases.
In one embodiment, the phenotypic term optimization module is specifically configured to:
when the number of the associated diseases is greater than a preset upper limit of the number, retaining all sub-terms associated with the initial phenotypic terms as the optimized phenotypic terms;
when the number of the associated diseases is smaller than a preset lower limit of the number, taking all initial phenotype terms, the child terms associated with all initial phenotype terms and the parent terms of all initial phenotype terms as the optimized phenotype terms, or taking the initial phenotype terms as modified according to a modification instruction as the optimized phenotype terms.
In one embodiment, the disease phenotype likelihood ratio calculation module is specifically configured to:
calculating a first probability of a target optimization term in an individual with a preset disease, and calculating a second probability of the target optimization term in an individual not with the preset disease; wherein the target optimization term is any one of the at least one optimization phenotype term;
Taking the ratio of the first probability to the second probability as a disease phenotype term likelihood ratio of the target optimized term and taking the product of the disease phenotype term likelihood ratios of all optimized phenotype terms as the disease phenotype likelihood ratio.
In one embodiment, the disease phenotype likelihood ratio calculation module is specifically configured to:
in the set of human phenotypic terms, if the target optimization term is directly associated with the preset disease, taking as the first probability the frequency of the acquired target optimization term in the individual suffering from the preset disease;
if the sub-term of the target optimization term is directly related to the preset disease, taking the ratio of the first target frequency to the number of sub-terms of the target optimization term as the first probability; wherein the first target frequency is a maximum frequency of sub-terms of the target optimization term in an individual suffering from the preset disease;
if the parent term of the target optimization term is directly related to the preset disease, taking a second target frequency as the first probability; wherein the second target frequency is a maximum frequency of a parent term of the target optimization term in an individual suffering from the preset disease;
Calculating a sum of frequencies of the target optimization term over the diseases within the set of human phenotypic terms; wherein said sum of frequencies is the sum of the frequencies of the target optimization term in each disease with which it is directly associated within the set of human phenotypic terms;
the ratio of the sum of the frequencies to the number of all diseases within the set of human phenotypic terms is taken as the second probability.
In one embodiment, the disease phenotype likelihood ratio calculation module is further configured to:
logarithmically converting the global minimum likelihood ratio and the disease phenotype term likelihood ratios of all optimized phenotype terms to obtain a correction score for the global minimum likelihood ratio and a base score for each optimized phenotype term; wherein the global minimum likelihood ratio is the minimum of the disease phenotype term likelihood ratios of all optimized phenotype terms;
calculating a relevance score for each optimized phenotype term based on the base score and the correction score for each optimized phenotype term; wherein the relevance score is used to indicate a strength of relevance between the optimized phenotypic term and the preset disease.
In one embodiment, the disease genotype likelihood ratio calculation module is specifically configured to:
Calculating a third probability that a genotype consisting of at least one variant gene of the target object is pathogenic, and calculating a fourth probability that a genotype consisting of at least one variant gene of the target object is non-pathogenic;
and taking the ratio of the third probability to the fourth probability as a disease genotype likelihood ratio of the target object.
In one embodiment, the disease genotype likelihood ratio calculation module is further specifically configured to:
calculating the third probability based on a first Poisson distribution model; wherein the first Poisson distribution model is a probability distribution model indicating that k pathogenic variant genes exist on the genotype, and the event occurrence rate of the first Poisson distribution model is a ratio between an expected number of pathogenic variants of a gene in a preset disease and a probability that a variant gene contained in the genotype is a pathogenic gene;
calculating the fourth probability based on a second Poisson distribution model; wherein the event occurrence rate of the second Poisson distribution model is a ratio between an expected number of variants of a gene in a healthy population and a probability that a variant gene contained in the genotype is a pathogenic gene.
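As a minimal sketch of the third and fourth probabilities and their ratio, the Python below evaluates a Poisson probability of observing k pathogenic variant genes under two event occurrence rates. The function names and the two rate parameters `lam_disease` and `lam_background` are hypothetical; how each rate is derived from the expected variant counts is not reproduced here, so the rates are simply passed in as inputs.

```python
import math


def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson model with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def genotype_likelihood_ratio(k, lam_disease, lam_background):
    """Ratio of the third probability (pathogenic) to the fourth (non-pathogenic).

    k              -- number of pathogenic variant genes on the genotype (non-negative int)
    lam_disease    -- assumed event rate of the first Poisson model
    lam_background -- assumed event rate of the second Poisson model
    """
    third = poisson_pmf(k, lam_disease)
    fourth = poisson_pmf(k, lam_background)
    return third / fourth


# Illustrative rates only; they are not taken from the patent.
lr = genotype_likelihood_ratio(k=1, lam_disease=1.0, lam_background=0.1)
```

A ratio above 1 means the observed number of pathogenic variant genes is better explained by the disease model than by the healthy-population model.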
In one embodiment, the disease genotype likelihood ratio calculation module is further specifically configured to:
determining the probability that the variant genes contained in the genotype are pathogenic genes according to the number of the variant genes contained in the genotype, the preset weight and the pathogenicity probability of the individual variant genes; wherein the pathogenicity probability of the individual variant gene is determined by a preset rating condition.
In one embodiment, the phenotype genotype composite likelihood ratio calculation module is specifically configured to:
taking the product of the disease phenotype likelihood ratio and the disease genotype likelihood ratio as the phenotype genotype composite likelihood ratio of the target object.
In one embodiment, the analysis output module is specifically configured to:
when a target variant gene has a plurality of phenotype genotype composite likelihood ratios, taking the maximum value of all phenotype genotype composite likelihood ratios of the target variant gene as a representative composite likelihood ratio, and when the target variant gene has one phenotype genotype composite likelihood ratio, taking the phenotype genotype composite likelihood ratio of the target variant gene as the representative composite likelihood ratio; wherein the target variant gene is any one of the at least one variant gene;
Sorting all variant genes by the size of the representative composite likelihood ratio of each variant gene, and outputting the variant genes whose representative composite likelihood ratio is greater than a preset disease phenotype conforming threshold and whose corresponding preset gene weighted pathogenicity is greater than a preset pathogenicity threshold, to obtain the variant gene ranking;
and, following the order of the variant gene ranking, outputting in turn the related diseases that are associated with each variant gene in the ranking and whose phenotype genotype composite likelihood ratio is greater than a preset likelihood ratio threshold.
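The selection and ordering steps of the analysis output module can be sketched as follows. This is an illustrative Python outline only: the function name, the dictionary layout and the three threshold parameters are assumptions, and the gene and disease labels are placeholders rather than real annotations.

```python
def rank_variant_genes(composite_lrs, weighted_pathogenicity,
                       conform_threshold, pathogenicity_threshold,
                       disease_lr_threshold):
    """Rank variant genes by their representative composite likelihood ratio.

    composite_lrs          -- gene -> {disease: composite likelihood ratio}
    weighted_pathogenicity -- gene -> gene weighted pathogenicity
    Returns a list of (gene, representative LR, related diseases) in rank order.
    """
    # Representative LR: the maximum over all composite LRs of the gene
    # (for a gene with a single composite LR this is that LR itself).
    representative = {g: max(d.values()) for g, d in composite_lrs.items() if d}

    # Keep genes that pass both the phenotype-conformity and pathogenicity thresholds.
    kept = [g for g, lr in representative.items()
            if lr > conform_threshold
            and weighted_pathogenicity.get(g, 0.0) > pathogenicity_threshold]
    kept.sort(key=lambda g: representative[g], reverse=True)

    # For each ranked gene, list the diseases whose composite LR passes the threshold.
    return [(g, representative[g],
             sorted(d for d, lr in composite_lrs[g].items() if lr > disease_lr_threshold))
            for g in kept]


# Placeholder inputs for illustration.
lrs = {"GENE_A": {"disease_1": 120.0, "disease_2": 2.0},
       "GENE_B": {"disease_3": 300.0},
       "GENE_C": {"disease_4": 0.5}}
pathogenicity = {"GENE_A": 0.9, "GENE_B": 0.1, "GENE_C": 0.95}
ranking = rank_variant_genes(lrs, pathogenicity, 10.0, 0.5, 5.0)
```

In this toy input, GENE_B is dropped for low weighted pathogenicity and GENE_C for a low representative likelihood ratio, so only GENE_A is reported.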
A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform:
acquiring clinical description information of a target object, and screening at least one initial phenotype term related to the target object according to the clinical description information;
searching for at least one associated disease associated with the at least one initial phenotypic term according to a hierarchical relationship in a predetermined set of human phenotypic terms;
optimizing the at least one initial phenotypic term based on the number of associated diseases to obtain an optimized at least one optimized phenotypic term;
Calculating a disease phenotype likelihood ratio for the target subject based on the disease condition of each optimized phenotype term;
obtaining at least one variant gene in the target object gene sequence;
calculating a disease genotype likelihood ratio of the target subject based on the pathogenic condition of the at least one variant gene;
calculating a composite likelihood ratio of the phenotype genotype of the subject of interest based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio;
and ranking variants based on the phenotype genotype composite likelihood ratio, and outputting a variant gene ranking consistent with the target object and the diseases related to the ranked variant genes.
A data analysis device based on phenotypic terms and variant genes, comprising a processor and the above computer readable storage medium.
The invention provides a data analysis device, medium and equipment based on phenotype terms and variant genes. By calculating and analyzing the disease phenotype likelihood ratio, the disease genotype likelihood ratio and the phenotype genotype composite likelihood ratio, and outputting a variant gene ranking for the target object together with the diseases related to the ranked variant genes, the invention can effectively quantify the relevance of genes and diseases, and provides an analyst evaluating the relationship between a subject's phenotype and variants with richer and more accurate information covering more elements, along with highly interpretable reasons for its recommendations.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Wherein:
FIG. 1 is a schematic diagram of a data analysis device based on phenotypic terms and variant genes in one embodiment;
FIG. 2 is a first schematic representation of a set of human phenotype terminology in one embodiment;
FIG. 3 is a second schematic representation of a set of human phenotype terminology in one embodiment;
FIG. 4 is a schematic diagram of an analysis report in one embodiment;
FIG. 5 is a block diagram of a data analysis device based on phenotypic terms and variant genes in one embodiment.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
As shown in FIG. 1, a data analysis device based on phenotypic terms and variant genes in one embodiment includes:
A phenotypic term primary screening module 101 for acquiring clinical descriptive information of the target subject, and screening at least one primary phenotypic term associated with the target subject based on the clinical descriptive information.
Specifically, after receiving clinical description information provided by a doctor, the phenotype term primary screening module selects keywords or key description sentences from the clinical description information, translates them, and searches the human phenotype term set (HPO, Human Phenotype Ontology) to find matching initial phenotype terms.
Wherein, the human phenotype term set is a dictionary library for standardized description of clinical phenotype abnormality, and covers more than 16000 phenotype abnormality characteristic terms related to 8000 genetic diseases, which are called phenotype terms (phenotype terms) for short. As shown in fig. 2, the phenotypic terms are associated in the form of a Directed Acyclic Graph (DAG) in HPO, each describing a unique phenotypic abnormality.
For example, as in the following table, based on the clinical description information "low sodium and high potassium appeared after birth; a CYP21A2 gene abnormality was detected; note the presence or absence of false aldosteronism or low resistance", keywords such as low sodium, high potassium and aldosteronism are selected first, and are then translated and searched to obtain the following initial phenotypic terms.
An associative disease screening module 102 for searching for at least one associative disease associated with at least one initial phenotypic term according to a hierarchical relationship in a predetermined set of human phenotypic terms.
Referring to FIGS. 2 and 3, the set of human phenotypic terms includes a plurality of term units, each term unit consisting of a plurality of hierarchically progressive phenotypic terms, the i-th level phenotypic term being a parent term of the (i+1)-th level phenotypic term, and the (i+1)-th level phenotypic term being a child term of the i-th level phenotypic term, each phenotypic term being directly associated with at least one disease. For example, HP:0008221 is the parent term of HP:0008231 and HP:0008258, and its directly associated disease is "primary pigmentary nodular adrenocortical disease type 3"; HP:0008231 and HP:0008258 are sub-terms of HP:0008221, and the disease directly associated with HP:0008258 is "cytochrome P450 oxidoreductase-deficiency associated abnormal secretion of the steroid hormone".
Specifically, the associated disease screening module first acquires all diseases directly associated with the target initial term as associated diseases; wherein the target initial term is any one of the at least one initial phenotypic term, corresponding to any one of HP:0002902 Hyponatremia, HP:0000859 Hyperaldosteronism (aldosteronism), HP:0002153 Hyperkalemia and HP:0008221 Adrenal hyperplasia in the example above. It then traverses all sub-terms of the target initial term and takes all diseases directly associated with the traversed sub-terms as associated diseases as well.
Illustratively, as shown in FIG. 3 and taking HP:0008221 as an example, OMIM:614190 is a disease directly associated with it and is taken as an associated disease. Its sub-terms are then traversed: OMIM:613571 is a directly associated disease of the sub-term HP:0008258, so it can be considered indirectly associated with HP:0008221 and is also taken as an associated disease. By analogy, the associated diseases of the initial phenotypic terms selected in the phenotypic term primary screening module are counted in this way to form the following table for presentation to an analyst.
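The collection of directly and indirectly associated diseases can be sketched as a breadth-first walk over the sub-terms. The Python below is a minimal illustration: the two dictionaries are hypothetical toy data that mirror only the HP:0008221 example (no disease is stated in the text for HP:0008231, so it is left without one), and a real implementation would load the hierarchy and disease annotations from HPO.

```python
from collections import deque

# Toy hierarchy and annotations mirroring the example in the text (assumed layout).
CHILDREN = {
    "HP:0008221": ["HP:0008231", "HP:0008258"],
    "HP:0008231": [],
    "HP:0008258": [],
}
DIRECT_DISEASES = {
    "HP:0008221": {"OMIM:614190"},
    "HP:0008258": {"OMIM:613571"},
}


def associated_diseases(term, children=CHILDREN, direct=DIRECT_DISEASES):
    """Return (directly associated, indirectly associated) diseases of a term."""
    direct_hits = set(direct.get(term, set()))
    indirect_hits = set()
    seen = {term}
    queue = deque(children.get(term, []))
    while queue:
        child = queue.popleft()
        if child in seen:
            # A term can be reached by several paths in a DAG; visit it once.
            continue
        seen.add(child)
        indirect_hits |= direct.get(child, set())
        queue.extend(children.get(child, []))
    return direct_hits, indirect_hits - direct_hits


direct_set, indirect_set = associated_diseases("HP:0008221")
```

Tracking visited terms matters because the terms form a directed acyclic graph rather than a tree, so the same sub-term may be reachable from more than one parent.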
A phenotypic term optimization module 103 for optimizing at least one initial phenotypic term based on the number of associated diseases to obtain at least one optimized phenotypic term after optimization.
In particular, the phenotypic term optimization module retains all the sub-terms associated with the initial phenotypic terms as optimized phenotypic terms, i.e., deletes the initial phenotypic terms, when the number of associated diseases is greater than a preset upper limit of the number, thereby reducing the number of associated diseases.
When the number of associated diseases is smaller than the preset lower limit of the number, all initial phenotype terms, the child terms associated with all initial phenotype terms and the parent terms of all initial phenotype terms are taken as optimized phenotype terms, i.e., the parent terms of the initial phenotype terms are newly added as optimized phenotype terms, so that the number of associated diseases increases; alternatively, based on a modification instruction, the modified initial phenotypic terms are taken as the optimized phenotypic terms, i.e., the phenotypic terms are redetermined, so that the number of associated diseases increases.
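A minimal Python sketch of this threshold rule follows. The function name, the dictionary layout and the limit values are hypothetical; the branch for a disease count that lies between the two limits is an assumption (the initial terms are kept unchanged), and the manual modification path is not modelled.

```python
def optimize_terms(initial_terms, n_diseases, children, parents,
                   lower_limit, upper_limit):
    """Return the optimized phenotype terms for a given associated-disease count."""
    if n_diseases > upper_limit:
        # Too many associated diseases: keep only the more specific sub-terms.
        return sorted({c for t in initial_terms for c in children.get(t, [])})
    if n_diseases < lower_limit:
        # Too few associated diseases: keep the initial terms and add
        # their child terms and parent terms.
        expanded = set(initial_terms)
        for t in initial_terms:
            expanded.update(children.get(t, []))
            expanded.update(parents.get(t, []))
        return sorted(expanded)
    # Assumption: within the limits the initial terms are used as they are.
    return sorted(set(initial_terms))


children = {"HP:0008221": ["HP:0008231", "HP:0008258"]}
parents = {"HP:0008221": ["HP:0011733"]}
broadened = optimize_terms(["HP:0008221"], 1, children, parents, 5, 100)
narrowed = optimize_terms(["HP:0008221"], 500, children, parents, 5, 100)
```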
Illustratively, because the statistical information in the associated disease screening module shows that none of the 3 selected phenotypic terms has an indirectly associated disease, the small number of associated diseases across all phenotypic terms may lead to insufficient diagnostic sensitivity. The parent term HP:0011733 of HP:0008221, shown in the following table, can therefore be additionally selected, so that more diseases are associated through a broader term, improving the sensitivity of the final result.
Since the phenotypic term primary screening module can only select a limited number of phenotypic terms that match the clinical description, terms may be missed during this process, or the selected phenotypic terms may be too specific to associate enough genes subsequently; alternatively, too many genes may be associated because the selected phenotypic terms are too broad. By combining the associated disease screening module with the phenotypic term optimization module, the invention addresses these problems, so that the final result is more accurate and stable.
A disease phenotype likelihood ratio calculation module 104 for calculating a disease phenotype likelihood ratio of the subject from the disease condition of each optimized phenotype term.
Specifically, the disease phenotype likelihood ratio calculation module first calculates a first probability of the target optimization term in an individual with the preset disease and a second probability of the target optimization term in an individual without the preset disease; wherein the target optimization term is any one of the at least one optimized phenotype term, i.e. the same calculation is performed for every optimized phenotype term. The ratio of the first probability to the second probability is then taken as the disease phenotype term likelihood ratio of the target optimization term, and the product of the disease phenotype term likelihood ratios of all optimized phenotype terms is taken as the disease phenotype likelihood ratio. This can be expressed as:
$$LR(H_i) = \frac{\Pr(H_i \mid D_j)}{\Pr(H_i \mid B_j)}, \qquad LR(\text{phenotype} \mid D_j) = \prod_{i=1}^{n} LR(H_i)$$
In the above, $LR(H_i)$ is the disease phenotype term likelihood ratio, $\Pr(H_i \mid D_j)$ is the first probability, $\Pr(H_i \mid B_j)$ is the second probability, $LR(\text{phenotype} \mid D_j)$ is the disease phenotype likelihood ratio, $H_i$ indicates the i-th optimized phenotype term out of a total of n optimized phenotype terms, $D_j$ indicates individuals with the preset disease j, and $B_j$ indicates individuals not suffering from the preset disease j. Since the target subject usually shows multiple clinical phenotypes in a disease, which can generally be represented by multiple phenotypic terms, $LR(\text{phenotype} \mid D_j)$ is obtained by multiplying the independent $LR(H_i)$ values. The larger the value of $LR(\text{phenotype} \mid D_j)$, the more relevant the disease is to the selected optimized phenotype terms.
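The ratio-and-product rule can be illustrated with a few lines of Python. This is a sketch only: the function names are hypothetical, the probabilities are made-up inputs, and the second probability is assumed to be non-zero.

```python
import math


def term_likelihood_ratio(first_probability, second_probability):
    """Disease phenotype term likelihood ratio LR(H_i) for one optimized term.

    Assumes second_probability > 0; handling of a zero denominator is not
    described in the text and is not modelled here.
    """
    return first_probability / second_probability


def phenotype_likelihood_ratio(term_probabilities):
    """Product of the per-term likelihood ratios for one preset disease.

    term_probabilities -- iterable of (first probability, second probability)
                          pairs, one pair per optimized phenotype term
    """
    return math.prod(term_likelihood_ratio(p1, p2) for p1, p2 in term_probabilities)


# Two illustrative terms: the first supports the disease, the second weakens it.
lr_phenotype = phenotype_likelihood_ratio([(0.5, 0.1), (0.2, 0.4)])
```

Because the per-term ratios are multiplied, a term that is rarer in affected individuals than in the background (ratio below 1) lowers the overall score for that disease.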
Optionally, when calculating the first probability, the disease phenotype likelihood ratio calculation module is specifically configured as follows. Within the set of human phenotype terms, if the target optimization term $H_i$ is directly associated with the preset disease $D_j$, the frequency $f_i$ of the target optimization term in individuals with the preset disease is taken as the first probability.
If a child term of the target optimization term $H_i$ is directly associated with the preset disease $D_j$, the ratio of the first target frequency $\max_k(f_k)$ to the number $n_{\mathrm{desc}}$ of child terms of the target optimization term is taken as the first probability, where the first target frequency is the maximum frequency of the child terms of the target optimization term in individuals with the preset disease.
If a parent term of the target optimization term $H_i$ is directly associated with the preset disease $D_j$, the second target frequency $\max_k(f_k)$ is taken as the first probability, where the second target frequency is the maximum frequency of the parent terms of the target optimization term in individuals with the preset disease. The above is expressed as:
$$\Pr(H_i \mid D_j) = \begin{cases} f_i, & H_i \in \{H_k\} \\ \dfrac{\max_k(f_k)}{n_{\mathrm{desc}}}, & H_k \in \mathrm{descendant}(H_i) \\ \max_k(f_k), & H_k \in \mathrm{ancestor}(H_i) \end{cases}$$
In the above formula, $H_k$ refers to the phenotype terms directly associated with the preset disease $D_j$, descendant denotes child terms, and ancestor denotes parent terms.
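The three cases for the first probability can be sketched as a small lookup function. This is only an illustrative sketch: the dictionary layout, the term identifiers, and the choice to return None when no case applies are assumptions, since the text does not specify them.

```python
from typing import Dict, Optional, Set


def first_probability(
    term: str,
    disease_freqs: Dict[str, float],
    descendants: Dict[str, Set[str]],
    ancestors: Dict[str, Set[str]],
) -> Optional[float]:
    """First probability Pr(H_i | D_j) for one optimized phenotype term."""
    # Case 1: the term itself is directly associated with the disease.
    if term in disease_freqs:
        return disease_freqs[term]
    # Case 2: a child term is directly associated; divide the maximum
    # child frequency by the number of child terms n_desc.
    children = descendants.get(term, set())
    child_hits = [disease_freqs[t] for t in children if t in disease_freqs]
    if child_hits:
        return max(child_hits) / len(children)
    # Case 3: a parent term is directly associated; use the maximum frequency.
    parents = ancestors.get(term, set())
    parent_hits = [disease_freqs[t] for t in parents if t in disease_freqs]
    if parent_hits:
        return max(parent_hits)
    # Assumption: the text does not cover this case, so None is returned.
    return None


# Hypothetical toy hierarchy: T1 has children T1a and T1b; T2a has parent T2.
disease_freqs = {"T0": 0.6, "T1a": 0.4, "T2": 0.3}
descendants = {"T1": {"T1a", "T1b"}}
ancestors = {"T2a": {"T2"}}

p_direct = first_probability("T0", disease_freqs, descendants, ancestors)
p_child = first_probability("T1", disease_freqs, descendants, ancestors)
p_parent = first_probability("T2a", disease_freqs, descendants, ancestors)
p_none = first_probability("T9", disease_freqs, descendants, ancestors)
```

Dividing by the number of child terms in the second case penalizes a broad query term whose evidence comes from only one of many more specific terms.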
Optionally, when calculating the second probability, the disease phenotype likelihood ratio calculation module is specifically configured to first calculate the sum of the frequencies of the target optimization term over the diseases within the set of human phenotype terms, where the sum of frequencies is the sum, over every disease in the set of human phenotype terms, of the frequency with which the target optimization term is directly associated with that disease. The ratio of the sum of frequencies to the number of all diseases in the set of human phenotype terms is then taken as the second probability. The above is expressed as:
$$\Pr(H_i \mid B_j) = \frac{\sum_{d=1}^{N} f_{i,d}}{N}$$
In the above formula, $\sum_{d=1}^{N} f_{i,d}$ is the sum of frequencies, where $f_{i,d}$ is the frequency of the target optimization term $H_i$ in disease d, and N is the number of all diseases within the set of human phenotype terms.
For example, assume that there are only 5 diseases in total in the set of human phenotype terms. An HP term generally represents a symptom, such as HP:0001250 (epilepsy). Assume that the term was observed in 1 of 1000 subjects with disease A and in 2 of 30 subjects with disease B, and in none of the 100, 2000, and 30 subjects with diseases C, D, and E, respectively. The second probability here is (1/1000+2/30+0/100+0/2000+0/30)/5.
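The worked example above can be reproduced with a few lines of Python. The dictionary layout and function name are illustrative assumptions; the counts are those stated in the example.

```python
from typing import Dict, Tuple


def second_probability(observations: Dict[str, Tuple[int, int]]) -> float:
    """Average, over all diseases, of the term frequency within each disease."""
    total = sum(with_term / subjects for with_term, subjects in observations.values())
    return total / len(observations)


# (subjects showing the term, subjects with the disease) for diseases A to E.
observations = {
    "A": (1, 1000),
    "B": (2, 30),
    "C": (0, 100),
    "D": (0, 2000),
    "E": (0, 30),
}
p2 = second_probability(observations)  # about 0.0135
```

Diseases in which the term is never observed still count toward the denominator N, which keeps the second probability small for terms specific to a few diseases.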
Further, the disease phenotype likelihood ratio calculation module may be further configured to: perform a logarithmic transformation on the global minimum likelihood ratio and on the disease phenotype term likelihood ratios of all optimized phenotype terms to obtain a correction score for the global minimum likelihood ratio and a base score for each optimized phenotype term, where the global minimum likelihood ratio is the minimum of the disease phenotype term likelihood ratios of all optimized phenotype terms; and calculate a relevance score for each optimized phenotype term based on its base score and the correction score, where the relevance score indicates the strength of the correlation between the optimized phenotype term and the preset disease.
In one embodiment, the above can be calculated by the following formula:
$$\mathrm{score}(H_i \mid D_j) = \left|\log_{10}\left(\mathrm{LR}(H_i)_{\min}\right)\right| + \log_{10}\left(\mathrm{LR}(H_i)\right)$$
In the above formula, $\mathrm{score}(H_i \mid D_j)$ is the relevance score, $\log_{10}(\mathrm{LR}(H_i)_{\min})$ is the correction score, and $\log_{10}(\mathrm{LR}(H_i))$ is the base score of the i-th optimized phenotype term. Calculating the relevance score $\mathrm{score}(H_i \mid D_j)$ can assist an analyst in understanding the relationship between the disease and the phenotype terms.
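The relevance score can be sketched as below. The function name and the likelihood ratio values are hypothetical, and the minimum is taken over the terms passed in, which is one reading of "global minimum".

```python
from math import log10
from typing import Dict


def relevance_scores(term_lrs: Dict[str, float]) -> Dict[str, float]:
    """score = |log10(min LR)| + log10(LR) for each optimized phenotype term."""
    correction = abs(log10(min(term_lrs.values())))  # correction score
    return {term: correction + log10(lr) for term, lr in term_lrs.items()}


# Hypothetical term likelihood ratios.
scores = relevance_scores({"T1": 100.0, "T2": 1.0, "T3": 0.01})
```

When the minimum likelihood ratio is below 1, as in this example, the shift maps the weakest term to a score of zero and every other term to a positive value, which is convenient for display as bar lengths.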
A variant gene acquisition module 105 for acquiring at least one variant gene in the target gene sequence.
The technology adopted by the variant gene acquisition module belongs to the prior art; reference may be made to high-throughput sequencing data analysis, interpretation workflows for clinical diagnosis, and the like. The general process includes sequencing the genes of the target object by high-throughput sequencing, aligning the detected gene sequence data with the human reference genome, and then detecting the variant genes carried by the target object. Furthermore, the variant genes can be annotated based on various public databases to supplement information such as the HGVS nomenclature of the variant (including gene name, transcript nucleotide change, and amino acid change), variant type, population frequency, gene-related diseases, and known pathogenicity information. Furthermore, variants with high population frequency and known benign variants can be filtered out according to the annotation information.
A disease genotype likelihood ratio calculation module 106 for calculating a disease genotype likelihood ratio of the target subject based on the disease state of the at least one variant gene.
Specifically, the disease genotype likelihood ratio calculation module calculates a third probability that a genotype consisting of at least one variant gene of the target object is pathogenic, and a fourth probability that the genotype consisting of at least one variant gene of the target object is non-pathogenic; the ratio of the third probability to the fourth probability is then taken as the disease genotype likelihood ratio of the target object. This can be expressed as:
$$\mathrm{LR}(\mathrm{genotype} \mid D_j) = \frac{\Pr(G \mid D_j)}{\Pr(G \mid B_j)}$$
In the above formula, $\mathrm{LR}(\mathrm{genotype} \mid D_j)$ is the disease genotype likelihood ratio, $\Pr(G \mid D_j)$ is the third probability, $\Pr(G \mid B_j)$ is the fourth probability, G denotes the genotype, i.e., the group of alleles formed by multiple variants in the same gene, $D_j$ denotes pathogenicity for the preset disease j, and $B_j$ denotes no pathogenicity for the preset disease j.
Preliminary analysis found that the occurrence of variants in the genome follows a Poisson distribution Poisson(k; λ), whereas whether a variant is pathogenic can be considered to follow a binomial distribution Binom(k; n, p).
Therefore, the probability that k pathogenic variants occur on the gene obeys the following compound distribution, i.e., a Poisson distribution model with parameters k and λp (the event occurrence rate) can be constructed:
$$\Pr(X = k) = \mathrm{Binom}(k; n, p)\,\mathrm{Poisson}(k; \lambda) = \mathrm{Poisson}(k; \lambda p)$$
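One standard way to read this compound law is Poisson thinning: if the total number of variants n follows Poisson(λ) and each variant is independently pathogenic with probability p, then the number of pathogenic variants follows Poisson(λp). The sketch below checks this numerically by summing over n; the parameter values and the truncation point n_max are arbitrary assumptions chosen for illustration.

```python
from math import comb, exp, factorial


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k events under Poisson(lam)."""
    return exp(-lam) * lam**k / factorial(k)


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def thinned_pmf(k: int, lam: float, p: float, n_max: int = 60) -> float:
    """Sum over the total variant count n of Binom(k; n, p) * Poisson(n; lam)."""
    return sum(binom_pmf(k, n, p) * poisson_pmf(n, lam) for n in range(k, n_max + 1))


lam, p = 3.0, 0.2  # arbitrary illustrative values
max_gap = max(
    abs(thinned_pmf(k, lam, p) - poisson_pmf(k, lam * p)) for k in range(6)
)
```

The largest gap between the two sides is at the level of floating-point rounding, which is consistent with using a single Poisson model with rate λp.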
Therefore, optionally, when calculating the third probability, the disease genotype likelihood ratio calculation module is specifically configured to calculate the third probability based on a first Poisson distribution model, where the Poisson distribution model is a probability distribution model indicating that k pathogenic variant genes are present in the genotype, and the event occurrence rate of the first Poisson distribution model is the product of the expected number of pathogenic variants of the gene in the preset disease and the probability that the variant gene contained in the genotype is a pathogenic gene. That is, the first Poisson distribution model may be expressed as Poisson(k; $\lambda_D p_i$). $\lambda_D$ is the expected number of pathogenic variants of the gene in disease j: it is 1 for a gene associated with a dominant genetic disease, or with an X-linked recessive genetic disease in a male subject, and 2 for a gene associated with a recessive genetic disease. $p_i$ is the probability that the variant gene contained in the genotype is a pathogenic gene.
Optionally, when calculating the fourth probability, the disease genotype likelihood ratio calculation module is specifically configured to calculate the fourth probability based on a second Poisson distribution model, where the event occurrence rate of the second Poisson distribution model is the product of the expected number of variants of the gene in the healthy population and the probability that the variant gene contained in the genotype is a pathogenic gene. That is, the second Poisson distribution model may be expressed as Poisson(k; $\lambda_B p_i$). $\lambda_B$ is the expected number of variants of the gene in healthy individuals, which is obtained in the present method by averaging the population variant frequencies provided for each gene in the Genome Aggregation Database (gnomAD).
Thus, in general, the above disease genotype likelihood ratio $\mathrm{LR}(\mathrm{genotype} \mid D_j)$ can be expanded into:
$$\mathrm{LR}(\mathrm{genotype} \mid D_j) = \frac{\mathrm{Poisson}(k; \lambda_D p_i)}{\mathrm{Poisson}(k; \lambda_B p_i)}$$
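A minimal sketch of the genotype likelihood ratio as a ratio of two Poisson probabilities is given below. It assumes the second model uses the rate λ_B·p_i, as the definition of λ_B suggests; the function names and all numeric values are hypothetical.

```python
from math import exp, factorial


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events under a Poisson model with the given rate."""
    return exp(-rate) * rate**k / factorial(k)


def genotype_likelihood_ratio(k: int, lambda_d: float, lambda_b: float, p_i: float) -> float:
    """Poisson(k; lambda_d * p_i) / Poisson(k; lambda_b * p_i)."""
    if min(lambda_d, lambda_b, p_i) <= 0:
        # Guard added for this sketch: zero rates would give 0/0 for k >= 1.
        raise ValueError("rates and p_i must be positive")
    return poisson_pmf(k, lambda_d * p_i) / poisson_pmf(k, lambda_b * p_i)


# Hypothetical case: one pathogenic variant in a dominant-disease gene
# (lambda_d = 1), background expectation 0.05, genotype pathogenicity 0.8.
lr = genotype_likelihood_ratio(k=1, lambda_d=1.0, lambda_b=0.05, p_i=0.8)
```

In this hypothetical case the ratio is roughly 9.35, reflecting that one pathogenic variant is far more expected under the disease model than in the background population.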
Optionally, when calculating the probability $p_i$ that the variant gene contained in the genotype is a pathogenic gene, the disease genotype likelihood ratio calculation module is further specifically configured to determine $p_i$ according to the number of variant genes contained in the genotype, a preset weight, and the pathogenicity probability of each individual variant gene.
In one embodiment, the probability p that the variant gene comprised in the genotype is a pathogenic gene i The calculation formula of (2) is expressed as:
In the first case of the above formula (single variant), the genotype contains only one variant, and $p_i$ is equal to the pathogenicity probability $\Pr(V_k \mid D_j)$ of that variant.
In the other cases, the genotype contains multiple variants. Because conventional variant ranking methods score each variant independently, they tend to underestimate the pathogenicity of a genotype; the present method therefore applies a weighting according to the relationship among the variants when integrating their pathogenicity probabilities.
In the second case (compound heterozygous), if family data are available for the target object and the parental origin of the variants establishes a compound heterozygous relationship, a weight ω=1.5 is applied on the basis of the two variants with the largest pathogenicity probability $\Pr(V_k \mid D_j)$;
In the third case (variants in cis), the two variants are in a cis relationship (in-cis) and can be regarded as a single variant whose pathogenicity is underestimated; only the pathogenicity probability of one variant is retained, and it is corrected by a weight ω=1.2.
The pathogenicity probability $\Pr(V_k \mid D_j)$ is determined by a preset rating, which in one particular embodiment is expressed as:
$$\Pr(V_k \mid D_j) = \begin{cases} 1, & \text{high confidence pathogenicity} \\ 0.5 \sim 0.8, & \text{low confidence pathogenicity} \\ 0, & \text{no pathogenicity} \end{cases}$$
In the above formula, a pathogenic or likely pathogenic variant with a "multi-organization consistent" rating conclusion in the ClinVar database, or a pathogenic or likely pathogenic variant confirmed by internal expert review, is a high-confidence pathogenic variant (high confidence pathogenicity), and $\Pr(V_k \mid D_j)=1$.
A pathogenic or likely pathogenic variant in the ClinVar database that does not reach a multi-organization consistent rating, a variant of unknown clinical significance in ClinVar or the internal database, a variant predicted to be pathogenic by the belief function, or a variant assessed as pathogenic by an ACMG automatic evaluation algorithm is a low-confidence pathogenic variant (low confidence pathogenicity), and $\Pr(V_k \mid D_j)=0.5 \sim 0.8$.
Other variants are non-pathogenic variants (no pathogenicity), and $\Pr(V_k \mid D_j)=0$.
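The three-level rating can be sketched as a simple lookup. The rating labels and the default value of 0.65 for low-confidence variants are assumptions made for this illustration; the text only states that the value lies between 0.5 and 0.8.

```python
HIGH_CONFIDENCE = "high_confidence"
LOW_CONFIDENCE = "low_confidence"
NO_PATHOGENICITY = "no_pathogenicity"


def pathogenicity_probability(rating: str, low_confidence_value: float = 0.65) -> float:
    """Map a preset variant rating to Pr(V_k | D_j)."""
    # The text gives a range of 0.5 to 0.8 for low-confidence variants;
    # the exact value within that range is an assumption of this sketch.
    if not 0.5 <= low_confidence_value <= 0.8:
        raise ValueError("low-confidence value must lie in [0.5, 0.8]")
    table = {
        HIGH_CONFIDENCE: 1.0,
        LOW_CONFIDENCE: low_confidence_value,
        NO_PATHOGENICITY: 0.0,
    }
    return table[rating]


p_high = pathogenicity_probability(HIGH_CONFIDENCE)
p_low = pathogenicity_probability(LOW_CONFIDENCE)
p_none = pathogenicity_probability(NO_PATHOGENICITY)
```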
A phenotype genotype composite likelihood ratio calculation module 107 for calculating a phenotype genotype composite likelihood ratio of the target subject based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio.
Optionally, the phenotype genotype composite likelihood ratio calculation module is specifically configured to take the product of the disease phenotype likelihood ratio and the disease genotype likelihood ratio as the phenotype genotype composite likelihood ratio of the target object, expressed as:
$$\mathrm{LR}(D_j) = \mathrm{LR}(\mathrm{phenotype} \mid D_j) \times \mathrm{LR}(\mathrm{genotype} \mid D_j)$$
The larger the value of $\mathrm{LR}(D_j)$, the more relevant the disease is to the corresponding gene and phenotype.
The analysis output module 108 is used for performing variant ranking based on the phenotype genotype composite likelihood ratio, and outputting the variant gene ranking of the target object and the associated diseases of the variant gene ranking.
Specifically, the analysis output module is used for taking the maximum value of all the phenotype genotype composite likelihood ratios of the target variant gene as a representative composite likelihood ratio when a plurality of phenotype genotype composite likelihood ratios exist in the target variant gene, and taking the phenotype genotype composite likelihood ratio of the target variant gene as the representative composite likelihood ratio when one phenotype genotype composite likelihood ratio exists in the target variant gene; wherein the target variant gene is any one of the at least one variant gene, i.e., the same operation is performed on all variant genes.
All variant genes are then sorted by the size of their representative composite likelihood ratios, and the variant genes whose representative composite likelihood ratio is greater than a preset disease phenotype conformity threshold and whose corresponding preset gene-weighted pathogenicity is greater than a preset pathogenicity threshold are output to obtain the variant gene ranking, roughly in the form of variant gene 1, variant gene 2, and so on.
Then, following the order of the variant gene ranking, the associated diseases that are related to each variant gene in the ranking and whose phenotype genotype composite likelihood ratio is greater than a preset likelihood ratio threshold are output. That is, starting from variant gene 1 in the ranking, all phenotype genotype composite likelihood ratios involving variant gene 1 are found, for example those of variant gene 1-disease 1, variant gene 1-disease 2, and variant gene 1-disease 3; those greater than the preset likelihood ratio threshold are then selected, and if variant gene 1-disease 1 and variant gene 1-disease 2 satisfy the threshold, disease 1 and disease 2 are output first. The remaining variant genes are processed in the same way until all associated diseases have been output.
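The ranking and output steps above can be sketched as follows. This is an illustrative sketch only: the data layout, gene and disease names, threshold values, and the choice to order each gene's diseases by descending likelihood ratio are all assumptions.

```python
from typing import Dict, List, Tuple


def rank_variant_genes(
    composite_lr: Dict[Tuple[str, str], float],
    gene_pathogenicity: Dict[str, float],
    phenotype_threshold: float,
    pathogenicity_threshold: float,
    lr_threshold: float,
) -> List[Tuple[str, List[str]]]:
    # Representative composite LR of a gene: maximum over its gene-disease pairs.
    representative: Dict[str, float] = {}
    for (gene, _disease), lr in composite_lr.items():
        representative[gene] = max(lr, representative.get(gene, lr))

    # Keep genes passing both thresholds, sorted by representative LR (descending).
    kept = [
        gene
        for gene, lr in representative.items()
        if lr > phenotype_threshold
        and gene_pathogenicity.get(gene, 0.0) > pathogenicity_threshold
    ]
    kept.sort(key=lambda gene: representative[gene], reverse=True)

    # For each ranked gene, list the diseases whose composite LR exceeds lr_threshold.
    result = []
    for gene in kept:
        pairs = [
            (lr, disease)
            for (g, disease), lr in composite_lr.items()
            if g == gene and lr > lr_threshold
        ]
        pairs.sort(reverse=True)  # assumption: strongest disease first
        result.append((gene, [disease for _lr, disease in pairs]))
    return result


# Hypothetical composite likelihood ratios and gene-weighted pathogenicity values.
composite_lr = {
    ("GENE1", "disease 1"): 50.0,
    ("GENE1", "disease 2"): 20.0,
    ("GENE1", "disease 3"): 0.5,
    ("GENE2", "disease 4"): 30.0,
    ("GENE3", "disease 5"): 2.0,
}
gene_pathogenicity = {"GENE1": 0.9, "GENE2": 0.8, "GENE3": 0.1}
ranking = rank_variant_genes(composite_lr, gene_pathogenicity, 1.0, 0.5, 10.0)
```

In this toy input, GENE3 passes the phenotype threshold but is dropped because its gene-weighted pathogenicity is below the pathogenicity threshold.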
In actual operation, for each target object, as shown in FIG. 4, only the top 20 genes, together with a few genes that are not ranked in the top 20 but are recommended for reporting, may be displayed. In addition to the ranking of genes and the associated diseases, the result details of each gene may also be displayed, including the reporting reasons (Reporting recommendation), a phenotype association diagram, and a variant information table, where:
1) The reporting reasons include the reasons why the variant obtained its current rank, including the highest known variant pathogenicity, the most likely related disease, and other up-weighting or down-weighting reasons;
2) The phenotype association diagram shows, as a bar graph, the strength of the correlation between each selected phenotype and all genetic diseases associated with each gene; the longer the bar, the stronger the correlation, which is calculated based on the phenotype genotype composite likelihood ratio.
3) The variant information table includes information on the variants involved in the ranking, such as variant position, HGVS nomenclature, zygosity, variant origin, known variant rating, and pathogenicity prediction.
Therefore, by calculating and analyzing the disease phenotype likelihood ratio, the disease genotype likelihood ratio, and the phenotype genotype composite likelihood ratio, and by outputting the variant gene ranking of the target object and its associated diseases, the data analysis device based on phenotype terms and variant genes can effectively quantify the relevance of genes and diseases, provide richer, multi-element, and more accurate information for analysts evaluating the relationship between the subject's phenotype and variants, and provide highly interpretable recommendation reasons.
To further illustrate the beneficial effects of the invention, the following is described by way of an experimental example:
In the experiment, 150 genetic cases that had undergone whole-exome sequencing and for which report conclusions had been obtained were selected, including 110 singleton cases and 40 trio families. The report conclusions of the 150 cases, previously obtained based on the ACMG sequence variant interpretation guidelines, were 63 positive, 23 inconclusive, and 64 negative. These report conclusions serve as the reference for the subsequent accuracy evaluation.
Wherein, in the report conclusion above:
(1) A positive report is defined as the detection of a (pathogenic/likely pathogenic) variant that can unequivocally explain the phenotype of the subject.
(2) An inconclusive report is defined as the detection of a variant that matches the subject's major clinical phenotype but whose correlation with disease onset or progression has not been clearly established.
(3) A negative report is defined as the detection of only other variants associated with part of the subject's clinical phenotype, or no detection of any variant associated with the subject's clinical phenotype.
Next, the 150 example cases were again analyzed using the above-described data analysis device based on phenotypic terms and variant genes, and variant gene ranks and associated diseases were output, while four indicators of Top1, top10, top20, and Top20plus were used to evaluate the performance of the device.
Top1, Top10, and Top20 represent the proportion of cases in which the variant reported in the report conclusion ranks 1st, within the top 10, and within the top 20 of the variant ranking, respectively. Top20+ represents the proportion of cases in which the reported variant ranks within the top 20, or is not in the top 20 but is recommended for reporting. The results were as follows:
It can be seen that, when the internal pathogenicity ratings of variants are not retained and only the variant ratings of public databases are used, the Top10 recall of variants reported in positive reports reaches 95.2% (60/63), and the recall of variants reported in positive plus inconclusive reports reaches 96.5% (83/86). When the internal pathogenicity ratings are retained, the Top10 recall of variants reported in positive reports reaches 96.8% (61/63), and the Top20+ recall of variants reported in positive plus inconclusive reports reaches 100%. Furthermore, although the objective of the present method is to maximize recommendation sensitivity, good specificity is still obtained on the negative cases: 39 of the 64 negative cases are successfully predicted as negative.
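The Top1, Top10, Top20, and Top20+ indicators defined above can be computed as in the sketch below. The ranks and recommendation flags are made-up illustrative data, not the experimental cases, and the function names are hypothetical.

```python
from typing import List, Optional


def top_k_rate(ranks: List[Optional[int]], k: int) -> float:
    """Share of cases whose reported variant ranks within the top k."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= k)
    return hits / len(ranks)


def top20_plus_rate(ranks: List[Optional[int]], recommended: List[bool]) -> float:
    """Share of cases ranked within the top 20 or recommended for reporting."""
    hits = sum(
        1
        for rank, flag in zip(ranks, recommended)
        if (rank is not None and rank <= 20) or flag
    )
    return hits / len(ranks)


# Made-up ranks of the reported variant in five cases (None: not ranked).
ranks = [1, 3, 12, 25, None]
recommended = [False, False, False, True, False]

top1 = top_k_rate(ranks, 1)
top10 = top_k_rate(ranks, 10)
top20 = top_k_rate(ranks, 20)
top20_plus = top20_plus_rate(ranks, recommended)
```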
FIG. 5 shows an internal block diagram of a data analysis device based on phenotypic terms and variant genes in one embodiment. As shown in fig. 5, the phenotypic term and variant gene based data analysis device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium of the phenotypic term and variant gene based data analysis device stores an operating system and may also store a computer program which, when executed by a processor, causes the processor to implement a phenotypic term and variant gene based data analysis method. The internal memory may also have stored therein a computer program which, when executed by a processor, causes the processor to perform a data analysis method based on phenotypic terms and variant genes. It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present application and does not constitute a limitation of the phenotypic term and variant gene based data analysis apparatus to which the present application is applied, and that a particular phenotypic term and variant gene based data analysis apparatus may include more or less components than shown in the drawings, or may combine certain components, or have a different arrangement of components.
A computer readable storage medium storing a computer program which when executed by a processor performs the steps of: acquiring clinical description information of a target object, and screening at least one initial phenotype term related to the target object according to the clinical description information; searching for at least one associated disease associated with the at least one initial phenotypic term according to a hierarchical relationship in a predetermined set of human phenotypic terms; optimizing the at least one initial phenotypic term based on the number of associated diseases to obtain an optimized at least one optimized phenotypic term; calculating a disease phenotype likelihood ratio for the target subject based on the disease condition of each optimized phenotype term; obtaining at least one variant gene in the target object gene sequence; calculating a disease genotype likelihood ratio of the target subject based on the pathogenic condition of the at least one variant gene; calculating a composite likelihood ratio of the phenotype genotype of the subject of interest based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio; and carrying out mutation sequencing based on the phenotype genotype composite likelihood ratio, and outputting mutation gene sequencing conforming to the target object and related diseases of the mutation gene sequencing.
The set of human phenotypic terms includes a plurality of term units, each term unit consisting of a plurality of hierarchically progressive phenotypic terms, the phenotypic term of the i-th level being a parent term of the phenotypic term of the (i+1)-th level, the phenotypic term of the (i+1)-th level being a child term of the phenotypic term of the i-th level, and each phenotypic term being directly associated with at least one disease. In a specific embodiment, the computer program when executed by the processor further performs the steps of: acquiring all diseases directly related to the target initial term as related diseases; wherein the target initial term is any one of the at least one initial phenotypic term; traversing all sub-terms of the target initial term, and taking all diseases directly related to all traversed sub-terms as related diseases.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: when the number of the associated diseases is greater than a preset upper limit of the number, retaining all sub-terms associated with the initial phenotypic terms as the optimized phenotypic terms; when the number of the associated diseases is smaller than the preset lower limit of the number, taking all initial phenotype terms, child terms associated with all initial phenotype terms, father terms of all initial phenotype terms as the optimized phenotype terms, or taking the initial phenotype terms after modification as the optimized phenotype terms based on a modification instruction.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: calculating a first probability of a target optimization term in an individual with a preset disease, and calculating a second probability of the target optimization term in an individual not with the preset disease; wherein the target optimization term is any one of the at least one optimization phenotype term; taking the ratio of the first probability to the second probability as a disease phenotype term likelihood ratio of the target optimized term and taking the product of the disease phenotype term likelihood ratios of all optimized phenotype terms as the disease phenotype likelihood ratio.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: in the set of human phenotypic terms, if the target optimization term is directly associated with the preset disease, taking as the first probability the frequency of the acquired target optimization term in the individual suffering from the preset disease; if the sub-term of the target optimization term is directly related to the preset disease, taking the ratio of the first target frequency to the number of sub-terms of the target optimization term as the first probability; wherein the first target frequency is a maximum frequency of sub-terms of the target optimization term in an individual suffering from the preset disease; if the parent term of the target optimization term is directly related to the preset disease, taking a second target frequency as the first probability; wherein the second target frequency is a maximum frequency of a parent term of the target optimization term in an individual suffering from the preset disease; calculating a sum of frequencies of the target optimization term and disease within the set of human phenotypic terms; wherein said sum of frequencies is the sum of frequencies at which said target optimization term is directly associated with a disease within each of said sets of human phenotypic terms; the ratio of the sum of the frequencies to the number of all diseases within the set of human phenotypic terms is taken as the second probability.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: logarithmically converting the global minimum likelihood ratio and the disease phenotype term likelihood ratios of all optimized phenotype terms to obtain a correction score for the global minimum likelihood and a base score for each optimized phenotype term; wherein the global minimum likelihood is the minimum of the disease phenotype term likelihoods for all optimized phenotype terms; calculating a relevance score for each optimized phenotype term based on the base score and the correction score for each optimized phenotype term; wherein the relevance score is used to indicate a strength of relevance between the optimized phenotypic term and the preset disease.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: calculating a third probability that a genotype consisting of at least one variant gene of the target object is pathogenic, and calculating a fourth probability that a genotype consisting of at least one variant gene of the target object is non-pathogenic; and taking the ratio of the third probability to the fourth probability as a disease genotype likelihood ratio of the target object.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: calculating the third probability based on a first Poisson distribution model; wherein the Poisson distribution model is a probability distribution model indicating that k pathogenic variant genes exist on the genotype, and the event occurrence rate of the first Poisson distribution model is the product of the expected number of pathogenic variants of a gene in a preset disease and the probability that a variant gene contained in the genotype is a pathogenic gene; calculating the fourth probability based on a second Poisson distribution model; wherein the event occurrence rate of the second Poisson distribution model is the product of the expected number of variants of a gene in a healthy population and the probability that a variant gene contained in the genotype is a pathogenic gene.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: determining the probability that the variant genes contained in the genotype are pathogenic genes according to the number of the variant genes contained in the genotype, the preset weight and the pathogenicity probability of the individual variant genes; wherein the pathogenicity probability of the individual variant gene is determined by a preset rating condition.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: taking the product of the disease phenotype likelihood ratio and the disease genotype likelihood ratio as the phenotype genotype composite likelihood ratio of the target object.
In a specific embodiment, the computer program when executed by the processor further implements the steps of: when a target variant gene has a plurality of phenotype genotype composite likelihood ratios, taking the maximum value of all phenotype genotype composite likelihood ratios of the target variant gene as a representative composite likelihood ratio, and when the target variant gene has one phenotype genotype composite likelihood ratio, taking the phenotype genotype composite likelihood ratio of the target variant gene as the representative composite likelihood ratio; wherein the target variant gene is any one of the at least one variant gene; sorting all variant genes based on the size of the representative composite likelihood ratio of each variant gene, and outputting variant genes with the representative composite likelihood ratio being greater than a preset disease phenotype conforming threshold and the corresponding preset gene weighted pathogenicity being greater than a preset pathogenicity threshold to obtain the variant gene sorting; and outputting related diseases which are related to the variant genes in the variant gene sequence and have phenotype genotype composite likelihood ratios greater than a preset likelihood ratio threshold value based on the sequence of the variant gene sequence in sequence.
A data analysis device based on phenotypic terms and variant genes, comprising a processor and the above computer readable storage medium.
It should be noted that the above data analysis device, medium and apparatus based on the phenotypic term and the mutated gene belong to one general inventive concept, and the contents of the data analysis device, medium and apparatus embodiments based on the phenotypic term and the mutated gene are mutually applicable.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a non-transitory computer-readable storage medium, and that the program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples represent only a few embodiments of the present application, which are described in more detail and are not thereby to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (11)

1. A data analysis device based on phenotypic terms and variant genes, characterized in that the data analysis device based on phenotypic terms and variant genes comprises:
a phenotypic term primary screening module for acquiring clinical description information of a target object, and screening at least one initial phenotypic term related to the target object according to the clinical description information;
an associated disease screening module for searching for at least one associated disease associated with the at least one initial phenotypic term according to a hierarchical relationship in a predetermined set of human phenotypic terms;
a phenotypic term optimization module for optimizing the at least one initial phenotypic term based on the number of associated diseases to obtain at least one optimized phenotypic term;
a disease phenotype likelihood ratio calculation module for calculating a disease phenotype likelihood ratio of the target object based on the disease condition of each optimized phenotypic term;
a variant gene acquisition module for acquiring at least one variant gene in the gene sequence of the target object;
a disease genotype likelihood ratio calculation module for calculating a disease genotype likelihood ratio of the target object based on the pathogenic condition of the at least one variant gene;
a phenotype genotype composite likelihood ratio calculation module for calculating a phenotype genotype composite likelihood ratio of the target object based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio;
an analysis output module for ranking variants based on the phenotype genotype composite likelihood ratio, and outputting a variant gene ranking that matches the target object together with the associated diseases of the variant gene ranking;
wherein the phenotypic term optimization module is specifically configured to: when the number of associated diseases is greater than a preset upper limit, retain all sub-terms associated with the initial phenotypic terms as the optimized phenotypic terms; when the number of associated diseases is less than a preset lower limit, take all initial phenotypic terms, the sub-terms associated with all initial phenotypic terms, and the parent terms of all initial phenotypic terms as the optimized phenotypic terms, or, based on a modification instruction, take the modified initial phenotypic terms as the optimized phenotypic terms;
the phenotype genotype composite likelihood ratio calculation module is specifically configured to: take the product of the disease phenotype likelihood ratio and the disease genotype likelihood ratio as the phenotype genotype composite likelihood ratio of the target object.
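The term optimization and the composite likelihood ratio of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the dictionary-based ontology layout, and the term labels used in any call are hypothetical, "sub-terms" is read here as direct children, and leaving the terms unchanged when the disease count lies between the two limits is an assumption, since the claim does not address that case.

```python
from typing import Dict, Iterable, List, Set


def optimize_terms(
    initial_terms: Iterable[str],
    children: Dict[str, List[str]],
    parents: Dict[str, List[str]],
    n_associated: int,
    upper: int,
    lower: int,
) -> Set[str]:
    """Return the optimized phenotype terms for a given associated-disease count."""
    initial = set(initial_terms)
    # Direct children of every initial term (assumed reading of "sub-terms").
    child_terms = {c for t in initial for c in children.get(t, [])}
    if n_associated > upper:
        # Too many associated diseases: keep only the more specific sub-terms.
        return child_terms
    if n_associated < lower:
        # Too few associated diseases: widen to initial, child and parent terms.
        parent_terms = {p for t in initial for p in parents.get(t, [])}
        return initial | child_terms | parent_terms
    # Assumption: between the two limits the initial terms are kept unchanged.
    return initial


def composite_lr(phenotype_lr: float, genotype_lr: float) -> float:
    """Composite likelihood ratio as the product of the two likelihood ratios."""
    return phenotype_lr * genotype_lr
```

The manual-modification branch ("based on a modification instruction") is left out because it depends on user input rather than on the ontology.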
2. The data analysis device based on phenotypic terms and variant genes according to claim 1, wherein the set of human phenotypic terms comprises a plurality of term units, each term unit consisting of phenotypic terms with progressive hierarchical relationships, the i-th hierarchical phenotypic term being a parent term of the (i+1)-th hierarchical phenotypic term, the (i+1)-th hierarchical phenotypic term being a child term of the i-th hierarchical phenotypic term, each phenotypic term being directly associated with at least one disease, the associated disease screening module being specifically configured to:
acquiring all diseases directly associated with a target initial term as associated diseases; wherein the target initial term is any one of the at least one initial phenotypic term;
traversing all sub-terms of the target initial term, and taking all diseases directly associated with all traversed sub-terms as associated diseases.
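The traversal in claim 2 can be sketched as a depth-first walk over the term hierarchy. This is an illustrative sketch only: the `children` and `direct` mappings and the function name are hypothetical, and "all sub-terms" is assumed to mean every descendant, not only direct children.

```python
from typing import Dict, List, Set


def associated_diseases(
    term: str,
    children: Dict[str, List[str]],
    direct: Dict[str, Set[str]],
) -> Set[str]:
    """Collect diseases directly linked to a term or to any of its descendants."""
    diseases: Set[str] = set()
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in seen:
            # A term reachable through several parents is visited only once.
            continue
        seen.add(current)
        diseases |= direct.get(current, set())
        stack.extend(children.get(current, []))
    return diseases
```

The size of the returned set is the "number of associated diseases" that the term optimization step compares against its upper and lower limits.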
3. The data analysis device based on phenotypic terms and variant genes according to claim 2, wherein the disease phenotype likelihood ratio calculation module is specifically configured to:
calculating a first probability of a target optimization term in an individual with a preset disease, and calculating a second probability of the target optimization term in an individual not with the preset disease; wherein the target optimization term is any one of the at least one optimization phenotype term;
taking the ratio of the first probability to the second probability as a disease phenotype term likelihood ratio of the target optimized term and taking the product of the disease phenotype term likelihood ratios of all optimized phenotype terms as the disease phenotype likelihood ratio.
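As a minimal sketch of claim 3, the per-term likelihood ratio and its product over all optimized terms might be computed as follows. The function names are hypothetical, and the two probabilities are assumed to be supplied by the caller as positive numbers.

```python
import math
from typing import Iterable, Tuple


def term_likelihood_ratio(p_with_disease: float, p_without_disease: float) -> float:
    """Ratio of the first probability to the second probability for one term."""
    return p_with_disease / p_without_disease


def phenotype_likelihood_ratio(probabilities: Iterable[Tuple[float, float]]) -> float:
    """Product of the per-term likelihood ratios over all optimized terms."""
    return math.prod(term_likelihood_ratio(p1, p2) for p1, p2 in probabilities)
```

Because the result is a product, a single term with a very small ratio can dominate the overall disease phenotype likelihood ratio.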
4. The data analysis device based on phenotypic terms and variant genes according to claim 3, wherein the disease phenotype likelihood ratio calculation module is specifically configured to:
in the set of human phenotypic terms, if the target optimization term is directly associated with the preset disease, taking the acquired frequency of the target optimization term in individuals suffering from the preset disease as the first probability;
if the sub-term of the target optimization term is directly related to the preset disease, taking the ratio of the first target frequency to the number of sub-terms of the target optimization term as the first probability; wherein the first target frequency is a maximum frequency of sub-terms of the target optimization term in an individual suffering from the preset disease;
if the parent term of the target optimization term is directly related to the preset disease, taking a second target frequency as the first probability; wherein the second target frequency is a maximum frequency of a parent term of the target optimization term in an individual suffering from the preset disease;
calculating a sum of frequencies of the target optimization term over the diseases within the set of human phenotypic terms; wherein the sum of frequencies is the sum of the frequencies with which the target optimization term is directly associated with each disease within the set of human phenotypic terms;
the ratio of the sum of the frequencies to the number of all diseases within the set of human phenotypic terms is taken as the second probability.
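The three cases for the first probability and the averaging used for the second probability in claim 4 can be sketched as below. This is a hedged illustration: the `freq` mapping keyed by `(term, disease)`, the function names, the order in which the three cases are tried, and returning `None` when none of them applies are all assumptions not stated in the claim.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Freq = Dict[Tuple[str, str], float]


def first_probability(
    term: str,
    disease: str,
    freq: Freq,
    children: Dict[str, List[str]],
    parents: Dict[str, List[str]],
) -> Optional[float]:
    """Probability of the term in individuals with the disease."""
    if (term, disease) in freq:
        # Case 1: the term itself is directly associated with the disease.
        return freq[(term, disease)]
    kids = children.get(term, [])
    kid_freqs = [freq[(c, disease)] for c in kids if (c, disease) in freq]
    if kid_freqs:
        # Case 2: maximum sub-term frequency divided by the number of sub-terms.
        return max(kid_freqs) / len(kids)
    parent_freqs = [
        freq[(p, disease)] for p in parents.get(term, []) if (p, disease) in freq
    ]
    if parent_freqs:
        # Case 3: maximum parent-term frequency.
        return max(parent_freqs)
    return None  # Assumption: the claim does not define this situation.


def second_probability(term: str, freq: Freq, all_diseases: Iterable[str]) -> float:
    """Sum of the term's direct frequencies divided by the number of all diseases."""
    diseases = list(all_diseases)
    total = sum(freq.get((term, d), 0.0) for d in diseases)
    return total / len(diseases)
```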
5. The data analysis device based on phenotypic terms and variant genes according to claim 3, wherein the disease phenotype likelihood ratio calculation module is further configured to:
logarithmically converting the global minimum likelihood ratio and the disease phenotype term likelihood ratios of all optimized phenotype terms to obtain a correction score for the global minimum likelihood ratio and a base score for each optimized phenotype term; wherein the global minimum likelihood ratio is the minimum of the disease phenotype term likelihood ratios of all optimized phenotype terms;
calculating a relevance score for each optimized phenotype term based on the base score and the correction score for each optimized phenotype term; wherein the relevance score is used to indicate a strength of relevance between the optimized phenotypic term and the preset disease.
6. The data analysis device based on phenotypic terms and variant genes according to claim 1, wherein the disease genotype likelihood ratio calculation module is specifically configured to:
calculating a third probability that a genotype consisting of at least one variant gene of the target object is pathogenic, and calculating a fourth probability that a genotype consisting of at least one variant gene of the target object is non-pathogenic;
taking the ratio of the third probability to the fourth probability as the disease genotype likelihood ratio of the target object.
7. The data analysis device based on phenotypic terms and variant genes according to claim 6, wherein the disease genotype likelihood ratio calculation module is further specifically configured to:
calculating the third probability based on a first Poisson distribution model; wherein the first Poisson distribution model is a probability distribution model indicating that k pathogenic variant genes exist on the genotype, and the event occurrence rate of the first Poisson distribution model is a ratio between an expected number of pathogenic variants of a gene in a preset disease and a probability that a variant gene contained in the genotype is a pathogenic gene;
calculating the fourth probability based on a second Poisson distribution model; wherein the event occurrence rate of the second Poisson distribution model is a ratio between an expected number of variants of a gene in a healthy population and a probability that a variant gene contained in the genotype is a pathogenic gene.
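The genotype likelihood ratio of claims 6 and 7 can be sketched with the standard Poisson probability mass function, P(k; λ) = λ^k e^(−λ) / k!. In this sketch the two event occurrence rates are passed in as plain parameters (`rate_disease` and `rate_background`, both hypothetical names); how they are derived from expected variant counts and pathogenicity probabilities is left to the caller and is not reproduced here.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of observing exactly k events under a Poisson model."""
    return rate ** k * math.exp(-rate) / math.factorial(k)


def genotype_likelihood_ratio(k: int, rate_disease: float, rate_background: float) -> float:
    """Ratio of the third probability to the fourth probability for k variants."""
    third = poisson_pmf(k, rate_disease)      # genotype is pathogenic
    fourth = poisson_pmf(k, rate_background)  # genotype is non-pathogenic
    return third / fourth
```

When the background rate is much lower than the disease rate, observing even one pathogenic variant (k = 1) yields a ratio well above 1.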
8. The data analysis device based on phenotypic terms and variant genes according to claim 7, wherein the disease genotype likelihood ratio calculation module is further specifically configured to:
determining the probability that the variant genes contained in the genotype are pathogenic genes according to the number of the variant genes contained in the genotype, the preset weight and the pathogenicity probability of the individual variant genes; wherein the pathogenicity probability of the individual variant gene is determined by a preset rating condition.
9. The data analysis device based on phenotypic terms and variant genes according to claim 1, wherein the analysis output module is specifically configured to:
when a target variant gene has a plurality of phenotype genotype composite likelihood ratios, taking the maximum value of all phenotype genotype composite likelihood ratios of the target variant gene as a representative composite likelihood ratio, and when the target variant gene has one phenotype genotype composite likelihood ratio, taking the phenotype genotype composite likelihood ratio of the target variant gene as the representative composite likelihood ratio; wherein the target variant gene is any one of the at least one variant gene;
ranking all variant genes by the magnitude of the representative composite likelihood ratio of each variant gene, and outputting the variant genes whose representative composite likelihood ratio is greater than a preset disease phenotype conformity threshold and whose corresponding preset gene weighted pathogenicity is greater than a preset pathogenicity threshold, to obtain the variant gene ranking;
outputting, in the order of the variant gene ranking, the associated diseases that are associated with the variant genes in the variant gene ranking and whose phenotype genotype composite likelihood ratios are greater than a preset likelihood ratio threshold.
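The ranking and filtering steps of claim 9 can be sketched as follows. The nested dictionary `composite_lrs` (gene to disease to composite likelihood ratio), the function names, and the threshold parameter names are hypothetical; sorting in descending order is an assumption, since the claim only says the genes are ordered by the size of the representative ratio.

```python
from typing import Dict, List, Tuple


def rank_variant_genes(
    composite_lrs: Dict[str, Dict[str, float]],
    weighted_pathogenicity: Dict[str, float],
    lr_threshold: float,
    pathogenicity_threshold: float,
) -> List[Tuple[str, float]]:
    """Return (gene, representative LR) pairs that pass both thresholds, best first."""
    # Representative value: the maximum composite LR over a gene's diseases.
    representative = {
        gene: max(lrs.values()) for gene, lrs in composite_lrs.items() if lrs
    }
    kept = [
        gene
        for gene, lr in representative.items()
        if lr > lr_threshold
        and weighted_pathogenicity.get(gene, 0.0) > pathogenicity_threshold
    ]
    kept.sort(key=lambda gene: representative[gene], reverse=True)
    return [(gene, representative[gene]) for gene in kept]


def diseases_for_ranking(
    ranking: List[Tuple[str, float]],
    composite_lrs: Dict[str, Dict[str, float]],
    disease_lr_threshold: float,
) -> List[Tuple[str, List[str]]]:
    """For each ranked gene, list its diseases whose composite LR exceeds the threshold."""
    return [
        (
            gene,
            sorted(d for d, lr in composite_lrs[gene].items() if lr > disease_lr_threshold),
        )
        for gene, _ in ranking
    ]
```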
10. A computer readable storage medium storing a computer program, which when executed by a processor causes the processor to perform:
acquiring clinical description information of a target object, and screening at least one initial phenotype term related to the target object according to the clinical description information;
searching for at least one associated disease associated with the at least one initial phenotypic term according to a hierarchical relationship in a predetermined set of human phenotypic terms;
optimizing the at least one initial phenotypic term based on the number of associated diseases to obtain at least one optimized phenotypic term;
calculating a disease phenotype likelihood ratio of the target object based on the disease condition of each optimized phenotypic term;
obtaining at least one variant gene in the gene sequence of the target object;
calculating a disease genotype likelihood ratio of the target object based on the pathogenic condition of the at least one variant gene;
calculating a phenotype genotype composite likelihood ratio of the target object based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio;
ranking variants based on the phenotype genotype composite likelihood ratio, and outputting a variant gene ranking that matches the target object together with the associated diseases of the variant gene ranking;
wherein said optimizing the at least one initial phenotypic term based on the number of associated diseases to obtain at least one optimized phenotypic term comprises: when the number of associated diseases is greater than a preset upper limit, retaining all sub-terms associated with the initial phenotypic terms as the optimized phenotypic terms; when the number of associated diseases is less than a preset lower limit, taking all initial phenotypic terms, the sub-terms associated with all initial phenotypic terms, and the parent terms of all initial phenotypic terms as the optimized phenotypic terms, or, based on a modification instruction, taking the modified initial phenotypic terms as the optimized phenotypic terms;
said calculating a phenotype genotype composite likelihood ratio of the target object based on the disease phenotype likelihood ratio and the disease genotype likelihood ratio comprises: taking the product of the disease phenotype likelihood ratio and the disease genotype likelihood ratio as the phenotype genotype composite likelihood ratio of the target object.
11. A data analysis device based on phenotypic terms and variant genes, comprising a processor and a computer readable storage medium according to claim 10.
CN202310116429.6A 2023-02-13 2023-02-13 Data analysis device, medium and equipment based on phenotype term and variant gene Active CN116246701B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310116429.6A CN116246701B (en) 2023-02-13 2023-02-13 Data analysis device, medium and equipment based on phenotype term and variant gene

Publications (2)

Publication Number Publication Date
CN116246701A CN116246701A (en) 2023-06-09
CN116246701B CN116246701B (en) 2024-03-22

Family

ID=86627243

Country Status (1)

Country Link
CN (1) CN116246701B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant