CN114373547A - Method and system for predicting disease risk - Google Patents

Method and system for predicting disease risk

Info

Publication number
CN114373547A
Authority
CN
China
Prior art keywords
snp
model
disease
risk
risk prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210026857.5A
Other languages
Chinese (zh)
Inventor
李映雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202210026857.5A
Publication of CN114373547A
Legal status: Pending

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/30 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Genetics & Genomics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Analytical Chemistry (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Chemical & Material Sciences (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention provides a method and a system for predicting disease risk. The method comprises: acquiring single nucleotide polymorphism (SNP) characterization information of gene variation sites and clinical phenotype characteristic information of disease patients, and constructing a data set based on the SNP characterization information and the clinical phenotype characteristic information; building a risk prediction base model based on a neural network; training the risk prediction base model with the data set to obtain an intelligent risk prediction model for predicting the disease risk probability; and evaluating the performance of the intelligent risk prediction model. The scheme uses deep learning to learn the SNP characterization of disease patients, and the deep learning model can capture the association between SNP sites and the disease, so that the accuracy of disease risk prediction can be effectively improved.

Description

Method and system for predicting disease risk
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a system for predicting disease risk, a storage medium and a computing device.
Background
Schizophrenia is the most common mental disease and the one with the most complex etiology and clinical manifestations. Its heritability is estimated at about 80%, and its incidence is around 1% worldwide. Although schizophrenia has been studied for a long time, its pathogenesis has not been clarified, so clinical treatment mainly targets symptoms rather than causes. For this reason researchers have concentrated on the mechanism of schizophrenia for many years.
Research in genomics, transcriptomics, proteomics, metabolomics and related fields in recent years has injected new vitality into the study of the pathogenesis of schizophrenia. In particular, with the development of genome technology, schizophrenia research has produced a series of results in genome-wide association studies (GWAS), low-frequency variant and copy number analysis, gene expression, epigenetic modification and other analyses, and a number of schizophrenia susceptibility genes have been found. Schizophrenia is now generally thought to be a complex disease related to early neurodevelopmental abnormalities that is regulated by multiple genes and influenced by multiple environmental factors.
Current genome-wide association studies on schizophrenia have identified multiple risk genes in numerous molecular pathways, but each gene has only a minor influence. This confirms that, for a polygenic disease such as schizophrenia, the contribution of any single genetic variation is small and the disease results from a combination of multiple unfavorable genetic variation sites, i.e. single nucleotide polymorphisms (SNPs). The existing polygenic risk score (PRS) assumes that each SNP site contributes to the risk of schizophrenia onset in a generalized linear manner, which limits the prediction accuracy for the disease.
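To make the linearity assumption concrete, the following minimal sketch computes an additive polygenic risk score as a weighted sum of genotype values. The function name, the effect weights and the 0/1/2 genotype coding are illustrative assumptions and are not taken from this application.

```python
from typing import Sequence


def polygenic_risk_score(dosages: Sequence[int], weights: Sequence[float]) -> float:
    """Additive PRS: sum over SNPs of (effect weight x genotype value 0, 1 or 2)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


# Hypothetical per-SNP effect weights (for example GWAS log odds ratios).
weights = [0.5, -0.25, 1.0]
# Hypothetical genotype values of one individual at the same three SNPs.
person = [2, 1, 0]

score = polygenic_risk_score(person, weights)  # 0.5*2 + (-0.25)*1 + 1.0*0
```

Because every SNP enters the sum independently, a score of this form cannot express interactions between SNP sites, which is the limitation the background section refers to.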
Disclosure of Invention
In view of the above, the present invention has been made to provide a method and system for predicting risk of disease that overcomes or at least partially solves the above problems.
According to a first aspect of the present invention, there is provided a method for predicting risk of developing a disease, comprising:
acquiring Single Nucleotide Polymorphism (SNP) characterization information and clinical phenotype characteristic information of a gene variation site of a disease patient, and constructing a data set based on the SNP characterization information and the clinical phenotype characteristic information;
building a risk prediction basic model based on a neural network;
training the risk prediction basic model by using the data set to obtain an intelligent risk prediction model for predicting the disease risk probability;
and performing performance evaluation on the intelligent risk prediction model.
Optionally, after acquiring the single nucleotide polymorphism (SNP) characterization information of the gene variation sites and the clinical phenotype characteristic information of the disease patient, the method further comprises:
selecting, based on the SNP characterization information, a plurality of groups of target SNP characterization information located in promoter regions;
and training an independent classifier for each promoter region by using the target SNP characterization information, and outputting, through the classifier, a disease risk prediction accuracy value corresponding to each SNP site in the target SNP characterization information.
Optionally, constructing a data set based on the SNP characterization information and the clinical phenotype characteristic information comprises:
selecting a target promoter region from the plurality of promoter regions according to the disease risk prediction accuracy values produced by the classifier of each promoter region, and taking the SNP characterization information corresponding to the target promoter region as the SNP characterization information used to construct the data set.
Optionally, the building of the risk prediction base model based on the neural network includes:
constructing a risk prediction base model comprising a multi-gene prediction model and a gradient boosting tree model; the multi-gene prediction model is connected with the gradient boosting tree model, and the output variable of the multi-gene prediction model is used as an input variable of the gradient boosting tree model;
the multi-gene prediction model is used for learning a corresponding SNP score value from the SNP characterization information;
and the gradient boosting tree model is used for learning the disease risk probability from the SNP score value and the clinical phenotype characteristic information.
Optionally, the multi-gene prediction model comprises a specific neural network and a score calculation network;
the specific neural network is connected with the score calculation network, and the output of the specific neural network is the input of the score calculation network; the specific neural network is used for learning the SNP characterization information in the data set to obtain first characteristic information corresponding to the SNP characterization information; the score calculation network is used for learning the first characteristic information output by the specific neural network to obtain the SNP score value corresponding to the first characteristic information.
Optionally, the training of the risk prediction base model by using the data set to obtain an intelligent risk prediction model for predicting the disease risk probability includes:
training the multi-gene prediction model by using the SNP characterization information in the data set;
and training the gradient boosting tree model by using the SNP score values output by the multi-gene prediction model and the clinical phenotype characteristic information in the data set as input variables and the disease risk probability as the output variable, to obtain an intelligent risk prediction model for predicting the disease risk probability.
Optionally, the performance evaluation of the intelligent risk prediction model includes:
randomly dividing the acquired data set into a plurality of sub data sets;
performing multiple rounds of tests on the risk prediction model by using the plurality of sub data sets to obtain multiple performance evaluation parameters; in each round, any one of the sub data sets is selected as the test set and the remaining sub data sets are used as the training set;
taking the average value of the multiple performance evaluation parameters as the evaluation index of the risk prediction model, so as to evaluate the performance of the intelligent risk prediction model;
the performance evaluation parameters at least comprise specificity, sensitivity, accuracy, non-model evaluation scoring, area under the receiver operating characteristic (ROC) curve and area under a specificity sensitivity curve.
According to a second aspect of the present invention, there is provided a system for predicting risk of developing a disease, comprising:
the data set establishing module is used for acquiring Single Nucleotide Polymorphism (SNP) characterization information and clinical phenotype characteristic information of a gene variation site of a disease patient and establishing a data set based on the SNP characterization information and the clinical phenotype characteristic information;
the model building module is used for building a risk prediction basic model based on a neural network;
the model training module is used for training the risk prediction basic model by utilizing the data set to obtain an intelligent risk prediction model for predicting the disease risk probability;
and the prediction module is used for predicting the disease risk by utilizing the intelligent risk prediction model.
According to a third aspect of the present invention, there is provided a computer-readable storage medium for storing program code for performing the method for predicting the risk of developing a disease according to any one of the first aspect.
According to a fourth aspect of the invention, there is provided a computing device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute, according to instructions in the program code, the method for predicting the risk of disease according to any one of the first aspect.
The invention provides a method and a system for predicting disease risk, a storage medium and a computing device. Based on the genetic information of patients, deep learning is used to learn the SNP characterization of disease patients, and the deep learning model can capture the association between SNP sites and the disease. Meanwhile, the SNP characterization information and the clinical phenotype characteristic information represent different dimensions of patient information and complement each other, which can improve the accuracy of disease risk prediction, reduce noise interference and improve the robustness of the model.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
The above and other objects, advantages and features of the present invention will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart of a method for predicting risk of disease according to an embodiment of the invention;
FIG. 2 illustrates a schematic diagram of a risk prediction base model structure according to an embodiment of the invention;
FIG. 3 is a diagram illustrating the structure of a specific neural network in a multi-gene prediction model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram illustrating a network structure for calculating scores in a multi-gene prediction model according to an embodiment of the present invention;
FIG. 5A is a schematic diagram of block one in the score calculation network shown in FIG. 4;
FIG. 5B is a schematic diagram of block two in the score calculation network shown in FIG. 4;
FIG. 5C is a schematic diagram of block three in the score calculation network shown in FIG. 4;
FIG. 6 is a schematic diagram of a system for predicting risk of developing a disease according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a system for predicting risk of developing a disease according to another embodiment of the present invention;
FIG. 8 shows a schematic diagram of a computing device architecture according to an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiments of the present invention may acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, methods, techniques and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best result.
Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Fig. 1 is a schematic flow chart of a method for predicting the risk of disease according to an embodiment of the present invention, and as can be seen from fig. 1, the method for predicting the risk of disease according to an embodiment of the present invention at least includes the following steps S101 to S104.
S101, acquiring Single Nucleotide Polymorphism (SNP) characterization information and clinical phenotype characteristic information of a gene variation site of a disease patient, and constructing a data set based on the SNP characterization information and the clinical phenotype characteristic information.
A single nucleotide polymorphism (SNP) at a gene variation site is a DNA sequence polymorphism caused by the variation of a single nucleotide at the genome level, and the SNP characterization information describes such variation. In this embodiment, the SNP characterization information of disease patients can be obtained from a related medical institution. The number of disease patients may be thousands, tens of thousands or even more, which is not limited in the embodiment of the present invention. The method provided in this embodiment can be applied to patients with different types of diseases that are strongly associated with gene sequences, such as schizophrenia and lupus erythematosus, which is also not limited in the embodiments of the present invention.
The clinical phenotype characteristic information mainly includes data related to the clinical manifestation and typing of the disease patient, such as age, sex, BMI and blood test indices. In this embodiment, the SNP characterization information and the clinical phenotype characteristic information may be expressed as sequences or as a plurality of vectors.
In most cases, the SNP sites that are highly correlated with a disease are located in the promoter region upstream of a gene, where transcription is initiated. Therefore, optionally, after the SNP characterization information of the gene variation sites and the clinical phenotype characteristic information of the disease patients are acquired in step S101, a plurality of groups of target SNP characterization information located in promoter regions may be selected based on the SNP characterization information; an independent classifier is then trained for each promoter region by using the target SNP characterization information, and the classifier outputs a disease risk prediction accuracy value corresponding to each SNP site in the target SNP characterization information.
That is, this embodiment uses the SNPs located in promoter regions, trains an independent classifier for each promoter region, and screens the promoter regions with these independently trained classifiers, thereby calculating the disease risk prediction accuracy corresponding to the respective SNP sites. The classifier in this embodiment may be a common machine learning classifier, such as a naive Bayes classifier.
Further, constructing a data set based on the SNP characterization information and the clinical phenotype characteristic information in step S101 includes: selecting a target promoter region from the plurality of promoter regions according to the disease risk prediction accuracy values produced by the classifier of each promoter region, and taking the SNP characterization information corresponding to the target promoter region as the SNP characterization information used to construct the data set. In this example, only the best performing promoter regions are kept for further analysis. Optionally, the top 128 gene promoters ranked by prediction accuracy, or another number of gene promoters, may be considered first.
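The screening step described above can be sketched as follows: one small categorical naive Bayes classifier is fitted per promoter region, its held-out accuracy is recorded, and the regions are ranked by that accuracy. This is a minimal sketch under stated assumptions: the genotype coding {0, 1, 2}, the toy data, the region names and all function names are hypothetical, and the application does not specify how the accuracy value is measured.

```python
import math


def train_nb(X, y, n_values=3, alpha=1.0):
    """Fit a categorical naive Bayes model on genotype codes 0..n_values-1."""
    classes = sorted(set(y))
    log_prior, log_like = {}, {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        for j in range(len(X[0])):
            counts = [alpha] * n_values  # Laplace smoothing
            for x in rows:
                counts[x[j]] += 1
            total = sum(counts)
            for v in range(n_values):
                log_like[(c, j, v)] = math.log(counts[v] / total)
    return classes, log_prior, log_like


def predict_nb(model, x):
    """Return the class with the highest posterior log score."""
    classes, log_prior, log_like = model
    return max(classes, key=lambda c: log_prior[c]
               + sum(log_like[(c, j, v)] for j, v in enumerate(x)))


def region_accuracy(X_train, y_train, X_test, y_test):
    """Held-out accuracy of the classifier trained on a single promoter region."""
    model = train_nb(X_train, y_train)
    hits = sum(predict_nb(model, x) == t for x, t in zip(X_test, y_test))
    return hits / len(y_test)


def select_top_regions(accuracy_by_region, k):
    """Keep the k regions with the highest accuracy."""
    return sorted(accuracy_by_region, key=accuracy_by_region.get, reverse=True)[:k]


# Toy data: two hypothetical regions with two SNPs each; label 1 = case, 0 = control.
y_train, y_test = [1, 1, 1, 0, 0, 0], [1, 0]
regions = {
    "A": ([[2, 2], [2, 1], [2, 2], [0, 0], [0, 1], [0, 0]], [[2, 1], [0, 0]]),  # informative
    "B": ([[1, 1]] * 6, [[1, 1], [1, 1]]),                                      # uninformative
}
accuracy_by_region = {
    name: region_accuracy(X_tr, y_train, X_te, y_test)
    for name, (X_tr, X_te) in regions.items()
}
top = select_top_regions(accuracy_by_region, k=1)
```

With real data, `k` would be 128 or whatever number of promoter regions is chosen, as the text notes.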
And S102, building a risk prediction basic model based on the neural network.
As can be seen from fig. 2, the building of the risk prediction base model based on the neural network includes: constructing a risk prediction base model comprising a multi-gene prediction model and a gradient boosting tree model. The multi-gene prediction model is connected with the gradient boosting tree model, and the output variable of the multi-gene prediction model is used as an input variable of the gradient boosting tree model. The multi-gene prediction model is used for learning a corresponding SNP score value from the SNP characterization information, and the gradient boosting tree model is used for learning the disease risk probability from the SNP score value and the clinical phenotype characteristic information.
With continued reference to fig. 2, the multi-gene prediction model of the present embodiment includes a specific neural network and a score calculation network. The specific neural network is connected with the score calculation network, and the output of the specific neural network is the input of the score calculation network. The specific neural network is used for learning the SNP characterization information in the data set to obtain first characteristic information corresponding to the SNP characterization information; the score calculation network is used for learning the first characteristic information output by the specific neural network to obtain the SNP score value corresponding to the first characteristic information.
In this embodiment, the specific neural network may be represented by promoter-CNN, and the score calculation network for calculating multi-gene scores of samples based on promoter region combinations may be represented by SCH-Net, where promoter-CNN is used to learn the characterization of promoter regions, and the output of promoter-CNN is the input of SCH-Net.
The position of a promoter region on the genome is usually not recorded as specifically as the transcription start site of the gene, and the deep neural network requires the data representation of every promoter region to have the same length. The following method is therefore used to determine the SNPs in a promoter region.
Using the transcription start sites reported in the RefSeq database, the nearest 56 SNPs upstream and the nearest 8 SNPs downstream of the transcription start site are taken as the promoter region of each gene. Each promoter region is thus represented by 64 values, each taken from {0, 1, 2}. A gene may have multiple transcription start sites and therefore multiple promoter regions.
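The window selection described above can be sketched as follows. The sketch assumes SNP positions sorted in ascending order along one chromosome and a gene on the forward strand, so that "upstream" means smaller coordinates; the function name, the toy coordinates and the placeholder genotype values are hypothetical.

```python
from bisect import bisect_right


def promoter_window(snp_positions, tss, n_up=56, n_down=8):
    """Indices of the nearest n_up SNPs upstream and n_down SNPs downstream of a TSS.

    Assumes ascending positions and a forward-strand gene. A SNP located exactly
    at the TSS is counted as upstream, which is an arbitrary choice in this sketch.
    """
    split = bisect_right(snp_positions, tss)
    if split < n_up or len(snp_positions) - split < n_down:
        raise ValueError("not enough SNPs around this transcription start site")
    return list(range(split - n_up, split + n_down))


# Toy chromosome: 100 SNPs at positions 100, 200, ..., 10000.
snp_positions = list(range(100, 10100, 100))
genotypes = [i % 3 for i in range(100)]  # placeholder values in {0, 1, 2}

window = promoter_window(snp_positions, tss=6050)
region = [genotypes[i] for i in window]  # fixed-length representation of 64 values
```

Every promoter region obtained this way has the same length of 64, which is what the fixed-size network input requires.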
Referring to fig. 3, the promoter-CNN uses one input layer, two convolutional layers, one reshape (flatten) layer, two dense layers and one output layer. Unlike the SCH-Net network, the promoter-CNN has a simple structure because its input is low-dimensional. The output dimension of the input layer is (64, 1); one of the convolutional layers may comprise 1 x 1 filters and 4 output channels, with an output dimension of (64, 4); the other convolutional layer may include 4 x 4 filters and 32 output channels, with an output dimension of (64, 32); the reshape layer flattens the features, with an output dimension of (1952, 1); the output dimension of one of the dense layers is (148, 1); the output dimension of the other dense layer is (16, 1); the output layer may include a softmax function, with an output dimension of (2, 1).
The flatten operation is a common operation inside convolutional neural networks: the output of a convolutional layer must be flattened before it can be accepted as input by a fully-connected layer.
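The layer sizes listed above can be checked with simple shape arithmetic. The sketch below assumes one-dimensional convolutions and, for the second convolution, a kernel of length 4 with stride 1 and no padding; these are assumptions, not details stated in the text. Under them the second convolution yields 61 positions, and 61 x 32 = 1952 matches the stated flattened size, whereas a length-preserving output of (64, 32) would flatten to 2048.

```python
import math


def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution along the sequence axis."""
    return (length + 2 * padding - kernel) // stride + 1


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


length = 64                                                    # SNPs per promoter region
after_conv1 = (conv1d_out_len(length, kernel=1), 4)            # kernel 1, 4 channels
after_conv2 = (conv1d_out_len(after_conv1[0], kernel=4), 32)   # kernel 4, 32 channels, no padding
flattened = after_conv2[0] * after_conv2[1]                    # input size of the first dense layer

# Two-class output layer: hypothetical raw scores turned into probabilities.
probabilities = softmax([0.3, -1.2])
```

The two softmax outputs sum to one and can be read as the predicted probabilities of the two classes.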
The structure of the SCH-Net network is shown in fig. 4. The SCH-Net network includes several different blocks; figs. 5A to 5C respectively show block one, block two and block three of the score calculation network shown in fig. 4, and block four has the same structure as block two.
As shown in fig. 4, the input layer of the SCH-Net network is formed by concatenating the SNPs of the target promoter regions selected in step S101, so its length is 64 × 128, where 64 is the number of SNPs in a promoter region and 128 is the number of promoter regions planned to be selected per chromosome (this number may be adjusted according to the actual situation).
In block one, each promoter region is considered separately; that is, information from different promoter regions is not combined. This lets the model obtain a better high-level representation of each individual promoter region before combining it with information from other promoters. The first layer of block one is a convolutional layer whose convolution kernel (a one-dimensional vector) has a length of 64 and a stride of 64. The stride of 64 ensures that information from different promoter regions is not combined, and the kernel length of 64 means that the information of each promoter region is processed as a whole. This layer has 256 output channels, so the information of each promoter region is now represented by 256 values. The second layer is a convolutional layer with a convolution kernel length of 256, a stride of 1 and 256 output channels. Such a convolution kernel can learn linear combinations for each of the 128 promoters, computed 256 times with different weights and biases, once for each output channel. These first two convolutional layers have no activation function. The block ends with a batch normalization layer and a rectified linear unit (ReLU) activation function.
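The effect of setting both the kernel length and the stride to 64 can be illustrated with a single-channel strided convolution in plain Python. This is a minimal sketch: the all-ones kernel stands in for learned weights, and the function name is hypothetical. Changing one genotype value changes exactly one output position, which shows that the windows do not overlap and that regions stay separate.

```python
def strided_conv1d(x, kernel, stride):
    """Single-channel 1-D convolution (cross-correlation) without padding."""
    k = len(kernel)
    return [
        sum(w * v for w, v in zip(kernel, x[i:i + k]))
        for i in range(0, len(x) - k + 1, stride)
    ]


REGION_LEN, N_REGIONS = 64, 128

x = [0] * (REGION_LEN * N_REGIONS)   # concatenated input of length 64 * 128
kernel = [1.0] * REGION_LEN          # placeholder weights, not learned values

baseline = strided_conv1d(x, kernel, stride=REGION_LEN)

x[5 * REGION_LEN + 10] = 2           # change one genotype inside region 5
changed = strided_conv1d(x, kernel, stride=REGION_LEN)

# Output positions whose value changed: only the one belonging to region 5.
affected = [i for i, (a, b) in enumerate(zip(baseline, changed)) if a != b]
```

With 256 such kernels (one per output channel), each of the 128 output positions would carry 256 values describing a single promoter region.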
The output of block one is reshaped into a three-dimensional tensor (see fig. 4) in which the information for each promoter region is reshaped from a vector of length 256 into a 16 x 16 matrix. This three-dimensional tensor can be viewed as an image of 16 x 16 pixels with 128 channels, each corresponding to a promoter region.
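The reshaping step can be sketched with nested lists. Row-major ordering of the 256 values is an assumption of this sketch, as are the placeholder values and the function name; the text only states the resulting shape.

```python
def reshape_region(vec, side=16):
    """Reshape a length side*side vector into a side x side matrix (row-major)."""
    if len(vec) != side * side:
        raise ValueError("vector length must equal side * side")
    return [vec[r * side:(r + 1) * side] for r in range(side)]


# Placeholder block-one output: 128 promoter regions, 256 values each.
block_one_out = [[float(c) for c in range(256)] for _ in range(128)]

# One 16 x 16 "image" per promoter region, i.e. 128 channels of 16 x 16 pixels.
image = [reshape_region(v) for v in block_one_out]
```

Here `image[ch][row][col]` addresses one pixel of the channel belonging to promoter region `ch`.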
In block two, the promoter regions are combined, so that from this step on the information from all promoter regions is considered as a whole. Similar to block one, block two is composed of three convolutional layers. It accommodates an input vector of larger dimension without using too many trainable parameters. The way block four is connected is shown in fig. 4, and the dimensions of the convolution kernels are determined during model optimization. To prevent the model from overfitting, pooling layers are used in block three. In addition, to prevent information loss in deeper layers, residual connections are used (implemented directly as an addition and denoted by '+' in the drawings). The model ends with two dense layers and a sigmoid activation function as output. During model training, the loss function measures the difference between the predicted value in [0, 1] and the true value (0 or 1) of the binary phenotype, and the model parameters are optimized through back propagation. After training is finished, a new input is fed into the model, and the SNP score value of the individual patient is obtained from the computation of the network and the output value of the sigmoid activation function.
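The output stage and the residual connection described above can be sketched in a few lines. The text does not name the loss function; binary cross-entropy is used here as an assumption because it is a common choice for a sigmoid output and a 0/1 label. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic (S-shaped) function mapping a real value to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(p, y, eps=1e-12):
    """Loss between a predicted probability p and a binary label y (assumed loss)."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def residual_add(x, fx):
    """Residual connection: element-wise sum of a block's input and its output."""
    return [a + b for a, b in zip(x, fx)]


snp_score = sigmoid(1.5)                     # hypothetical final pre-activation value
loss_if_case = binary_cross_entropy(snp_score, 1)
loss_if_control = binary_cross_entropy(snp_score, 0)
skip = residual_add([1.0, 2.0], [0.5, -2.0])
```

A score above 0.5 is penalized less when the true label is 1 than when it is 0, which is the signal back propagation uses to adjust the weights.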
S103, training the risk prediction basic model by using the data set to obtain an intelligent risk prediction model for predicting the disease risk probability.
Optionally, when the risk prediction base model is trained to obtain the intelligent risk prediction model, the multi-gene prediction model may first be trained by using the SNP characterization information in the data set. The SNP score values output by the multi-gene prediction model and the clinical phenotype characteristic information in the data set are then used as input variables, and the disease risk probability is used as the output variable, to train the gradient boosting tree model. This yields the intelligent risk prediction model for predicting the disease risk probability, which is denoted as the XGBoost model in this embodiment.
The structure of the XGBoost model is as follows. XGBoost is one kind of gradient boosting tree model. The idea of the gradient boosting algorithm is to train a plurality of base classifiers in series, integrate them into a strong classifier, and take the sum of the outputs of all base classifiers as the output. XGBoost typically uses CART regression trees as base classifiers. In this embodiment, the clinical phenotype characteristic information and the SNP score are used as input variables of the intelligent risk prediction model, and learning from both together can make the prediction result of the intelligent risk prediction model more accurate.
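The boosting idea described above, base learners trained in series with their outputs summed, can be illustrated with a small squared-error gradient boosting routine over decision stumps. This is a minimal sketch and not XGBoost itself, which uses full regression trees, regularization and second-order gradient information. The feature layout (SNP score, age), the toy data and all function names are hypothetical.

```python
def fit_stump(X, residuals):
    """Find the one-feature threshold split that minimises squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            left = [r for x, r in zip(X, residuals) if x[j] <= t]
            right = [r for x, r in zip(X, residuals) if x[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:  # no valid split exists: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (0, float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    j, t, left_value, right_value = stump
    return left_value if x[j] <= t else right_value


def fit_boosting(X, y, n_rounds=20, lr=0.5):
    """Fit stumps in series, each one on the residuals left by the previous ones."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        stumps.append(stump)
        pred = [p + lr * stump_predict(stump, x) for p, x in zip(pred, X)]
    return base, lr, stumps


def predict_boosting(model, x):
    """Sum of the base value and the scaled outputs of all base learners."""
    base, lr, stumps = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)


# Toy rows: [SNP score from the multi-gene model, age]; label 1 = case, 0 = control.
X = [[0.9, 30], [0.8, 45], [0.7, 50], [0.2, 35], [0.1, 40], [0.3, 55]]
y = [1, 1, 1, 0, 0, 0]
model = fit_boosting(X, y)
high_risk = predict_boosting(model, [0.85, 33])
low_risk = predict_boosting(model, [0.15, 33])
```

In practice the SNP score and the clinical features would be concatenated into one feature row per patient and passed to a gradient boosting library rather than to a hand-written routine like this one.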
S104, predicting the disease risk by using the intelligent risk prediction model.
After the intelligent risk prediction model is obtained through training, it can be used to predict the disease risk. In the method for predicting disease risk provided by this embodiment, a data set is constructed from the SNP (single nucleotide polymorphism) characterization information and the clinical phenotype characteristic information of disease patients, and a risk prediction base model built on a neural network is trained with this data set to obtain the intelligent risk prediction model. Because the judgment combines characteristic information of multiple dimensions of the disease patients, the accuracy of predicting the disease risk of other target individuals is effectively improved.
In this embodiment of the present invention, after step S103 of training the risk prediction base model by using the data set to obtain the intelligent risk prediction model for predicting the disease risk probability, the method may further include performing performance evaluation on the intelligent risk prediction model.
Optionally, performing performance evaluation on the intelligent risk prediction model includes: randomly dividing the acquired data set into a plurality of sub data sets; performing multiple rounds of tests on the risk prediction model by using the plurality of sub data sets to obtain multiple performance evaluation parameters, where in each round any one of the sub data sets is selected as the test set and the remaining sub data sets are used as the training set; and taking the average value of the multiple performance evaluation parameters as the evaluation index of the risk prediction model to evaluate the performance of the intelligent risk prediction model. The performance evaluation parameters at least include specificity, sensitivity, accuracy, F score, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the precision-recall curve (AUPRC).
In this embodiment, 5-fold cross validation may be used to evaluate the performance of the model. The data set is randomly divided into 5 equal parts; in each round, 1 part is used as the test set and the other 4 parts are used as the training set, and the performance of the model is evaluated on the test set of that round. The average of the 5 evaluation results is taken as the final evaluation index of the model. To measure the effect of each classifier comprehensively, the model is evaluated with indexes such as specificity, sensitivity, accuracy, F score, AUROC (area under the ROC curve) and AUPRC (area under the precision-recall curve).
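The evaluation procedure can be sketched in a few lines. The example below splits the sample indices into 5 folds, trains on 4 and tests on 1 in each round, and averages the per-round metrics. AUROC and AUPRC are omitted for brevity because they need ranked scores rather than hard labels. The threshold classifier and the toy data are placeholders standing in for the risk prediction model, and all names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def binary_metrics(y_true, y_pred):
    """Specificity, sensitivity, accuracy and F score from 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Ratios with an empty denominator are reported as 0.0 (a convention chosen here).
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_score = (2 * precision * sensitivity / (precision + sensitivity)
               if precision + sensitivity else 0.0)
    accuracy = (tp + tn) / len(y_true)
    return {"specificity": specificity, "sensitivity": sensitivity,
            "accuracy": accuracy, "f_score": f_score}


def cross_validate(xs, ys, train_fn, predict_fn, k=5, seed=0):
    """Run k rounds; each fold serves as the test set once, the rest as the training set."""
    folds = k_fold_indices(len(xs), k, seed)
    per_round = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train_fn([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        preds = [predict_fn(model, xs[j]) for j in test_idx]
        per_round.append(binary_metrics([ys[j] for j in test_idx], preds))
    # The final index is the mean of each metric over the k rounds.
    return {name: sum(m[name] for m in per_round) / k for name in per_round[0]}


def train_threshold(xs, ys):
    """Placeholder model: the midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    return 1 if x > threshold else 0


# Toy, well-separated data: 10 controls near 0 and 10 cases near 1.
toy_x = [0.01 * i for i in range(10)] + [1.0 + 0.01 * i for i in range(10)]
toy_y = [0] * 10 + [1] * 10
scores = cross_validate(toy_x, toy_y, train_threshold, predict_threshold)
```

Because the split is random and not stratified, a test fold can contain a single class, in which case sensitivity or specificity for that round falls back to 0.0; a stratified split would avoid this.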
The invention provides a method for predicting disease risk that, based on the genetic information of patients, uses deep learning to learn the SNP (single nucleotide polymorphism) representation of schizophrenia patients and captures the association between SNP sites and schizophrenia through the deep learning model.
Based on the same inventive concept, an embodiment of the present invention further provides a system for predicting disease risk. Referring to fig. 6, the system of the present embodiment may include:
a data set establishing module 610, configured to obtain Single Nucleotide Polymorphism (SNP) characterization information and clinical phenotype characteristic information of a gene mutation site of a disease patient, and construct a data set based on the SNP characterization information and the clinical phenotype characteristic information;
a model building module 620, configured to build a risk prediction base model based on a neural network;
a model training module 630, configured to train the risk prediction base model using the data set, to obtain an intelligent risk prediction model for predicting risk probability of disease;
and a prediction module 640, configured to predict the disease risk by using the intelligent risk prediction model.
In an alternative embodiment of the present invention, as shown in fig. 7, the system for predicting risk of disease may further include a classification training module 650;
the classification training module 650 is configured to select multiple sets of target SNP characterization information located in promoter regions based on the SNP characterization information;
and to train an independent classifier for each promoter region by using the target SNP characterization information, and output, through the classifier, a disease risk prediction accuracy value corresponding to each SNP locus in the target SNP characterization information.
In an optional embodiment of the present invention, the data set establishing module 610 may further be configured to: select a target promoter region from the multiple groups of promoter regions according to the disease risk prediction accuracy value that the classifier of each promoter region outputs for each SNP locus, and take the SNP characterization information corresponding to the target promoter region as the SNP characterization information required for constructing the data set.
In an optional embodiment of the present invention, the model building module 620 may be further configured to:
constructing a risk prediction base model comprising a multi-gene prediction model and a gradient lifting tree model; the multi-gene prediction model is connected with the gradient lifting tree model, and the output variable of the multi-gene prediction model is used as the input variable of the gradient lifting tree model;
the multi-gene prediction model is configured to learn a corresponding SNP score value from the SNP characterization information;
and the gradient lifting tree model is configured to learn the disease risk probability from the SNP score value and the clinical phenotype characteristic information.
In an alternative embodiment of the present invention, the model training module 630 may be further configured to:
train the multi-gene prediction model by using the SNP (single nucleotide polymorphism) characterization information in the data set;
and train the gradient lifting tree model by using the SNP score values output by the multi-gene prediction model and the clinical phenotype characteristic information in the data set as input variables and the disease risk probability as the output variable, to obtain an intelligent risk prediction model for predicting the disease risk probability.
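The hand-off between the two training stages can be sketched as a simple feature-assembly step: the first stage turns each patient's SNP vector into one score, and that score is placed alongside the clinical features to form the second-stage input row. The scoring function below is only a placeholder for the trained multi-gene network, and the genotype coding (0/1/2) and the clinical columns (age, sex) are assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def build_stage_two_inputs(
    snp_vectors: Sequence[Sequence[float]],
    clinical_rows: Sequence[Sequence[float]],
    score_fn: Callable[[Sequence[float]], float],
) -> List[List[float]]:
    """Prepend each patient's stage-one SNP score to that patient's clinical features."""
    if len(snp_vectors) != len(clinical_rows):
        raise ValueError("each patient needs both SNP and clinical data")
    return [[score_fn(snps)] + list(clinical)
            for snps, clinical in zip(snp_vectors, clinical_rows)]


def placeholder_score(snps: Sequence[float]) -> float:
    """Stand-in for the trained multi-gene model: mean allele dosage scaled to [0, 1]."""
    return sum(snps) / (2 * len(snps))


# Hypothetical patients: genotype dosages (0/1/2) and clinical features (age, sex).
snp_vectors = [[0, 1, 0, 1], [2, 1, 2, 1]]
clinical_rows = [[34.0, 1.0], [51.0, 0.0]]
stage_two_rows = build_stage_two_inputs(snp_vectors, clinical_rows, placeholder_score)
```

Each resulting row would then be paired with the patient's 0/1 disease label to train the second-stage tree model.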
In an alternative embodiment of the present invention, as shown in fig. 7, the system for predicting disease risk may further include a model evaluation module 660, configured to perform performance evaluation on the intelligent risk prediction model.
The model evaluation module 660 may also be configured to: randomly divide the acquired data set into a plurality of sub data sets;
perform multiple rounds of tests on the risk prediction model by using the plurality of sub data sets to obtain multiple performance evaluation parameters, where in each round any one of the sub data sets is selected as the test set and the remaining sub data sets are used as the training set;
take the average value of the multiple performance evaluation parameters as the evaluation index of the risk prediction model to evaluate the performance of the intelligent risk prediction model;
the performance evaluation parameters at least include specificity, sensitivity, accuracy, F score, the area under the receiver operating characteristic (ROC) curve (AUROC), and the area under the precision-recall curve (AUPRC).
It should be noted that other corresponding descriptions of the functional units related to the apparatus provided in the embodiment of the present application may refer to the corresponding descriptions of the method embodiments, and are not described herein again.
The embodiment of the present invention further provides a computer-readable storage medium, which is used for storing a program code, where the program code is used for executing the method for predicting the risk of disease according to the above embodiment.
An embodiment of the present invention further provides a computing device, where the computing device includes a processor and a memory: the memory is used for storing program codes and transmitting the program codes to the processor; the processor is configured to execute the method for predicting risk of disease according to any one of the above embodiments according to instructions in the program code.
As shown in fig. 8, the computing device according to the embodiment of the present invention includes a communication bus, a processor, a memory, and a communication interface, and may further include an input/output interface and a display device, where the functional units may communicate with each other through the bus. The memory stores computer programs, and the processor is used for executing the programs stored in the memory and executing the method of the embodiment.
It is clear to those skilled in the art that the specific working processes of the above-described systems, devices, modules and units may refer to the corresponding processes in the foregoing method embodiments, and for the sake of brevity, further description is omitted here.
In addition, the functional units in the embodiments of the present invention may be physically independent of each other, two or more functional units may be integrated together, or all the functional units may be integrated in one processing unit. The integrated functional units may be implemented in the form of hardware, or in the form of software or firmware.
Those of ordinary skill in the art will understand that the integrated functional units, if implemented in software and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions that, when executed, cause a computing device (e.g., a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
Alternatively, all or part of the steps of implementing the foregoing method embodiments may be implemented by hardware (such as a computing device, e.g., a personal computer, a server, or a network device) associated with program instructions, which may be stored in a computer-readable storage medium, and when the program instructions are executed by a processor of the computing device, the computing device executes all or part of the steps of the method according to the embodiments of the present invention.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments can be modified or some or all of the technical features can be equivalently replaced within the spirit and principle of the present invention; such modifications or substitutions do not depart from the scope of the present invention.

Claims (10)

1. A method for predicting risk of developing a disease, comprising:
acquiring Single Nucleotide Polymorphism (SNP) characterization information and clinical phenotype characteristic information of a gene variation site of a disease patient, and constructing a data set based on the SNP characterization information and the clinical phenotype characteristic information;
building a risk prediction basic model based on a neural network;
training the risk prediction basic model by using the data set to obtain an intelligent risk prediction model for predicting the disease risk probability;
and predicting the disease risk by using the intelligent risk prediction model.
2. The method according to claim 1, wherein the obtaining of the characterization information of the single nucleotide polymorphism SNP and the clinical phenotype characterization information of the genetic variation sites of the disease patients further comprises:
selecting a plurality of groups of target SNP characterization information positioned in a promoter region based on the SNP characterization information;
and training an independent classifier for each promoter region by using the target SNP characterization information, and outputting, through the classifier, a disease risk prediction accuracy value corresponding to each SNP locus in the target SNP characterization information.
3. The method of claim 2, wherein constructing a data set based on the SNP characterization information and clinical phenotype characterization information comprises:
selecting a target promoter region from the multiple groups of promoter regions according to the disease risk prediction accuracy value that the classifier of each promoter region outputs for each SNP locus, and taking the SNP characterization information corresponding to the target promoter region as the SNP characterization information required for constructing the data set.
4. The method of claim 1, wherein building a risk prediction base model based on a neural network comprises:
constructing a risk prediction base model comprising a multi-gene prediction model and a gradient lifting tree model; the multi-gene prediction model is connected with the gradient lifting tree model, and the output variable of the multi-gene prediction model is used as the input variable of the gradient lifting tree model; the multi-gene prediction model is used for learning to obtain a corresponding SNP score value according to the SNP representation information;
and the gradient lifting tree model is used for learning to obtain the disease risk probability according to the SNP score value and the clinical phenotype characteristic information.
5. The method of claim 4, wherein the multi-gene prediction model comprises a specific neural network and a score computation network;
the specific neural network is connected with the score calculation network, and the output of the specific neural network is the input of the score calculation network; the specific neural network is used for learning SNP (single nucleotide polymorphism) representation information in the data set so as to obtain first characteristic information corresponding to the SNP representation information; the score calculation network is used for learning the first characteristic output by the specific neural network so as to obtain the SNP score value corresponding to the first characteristic information.
6. The method of claim 4, wherein training the risk prediction base model using the data set to obtain an intelligent risk prediction model for predicting risk probability of disease comprises:
training a multi-gene prediction model by utilizing SNP (single nucleotide polymorphism) representation information in the data set;
and training the gradient lifting tree model by using the SNP score values output by the multi-gene prediction model and the clinical phenotype characteristic information in the data set as input variables and using the disease risk probability as an output variable to obtain an intelligent risk prediction model for predicting the disease risk probability.
7. The method according to any one of claims 1-6, wherein after training the risk prediction base model using the data set to obtain an intelligent risk prediction model for predicting risk probability of disease, the method further comprises performing performance evaluation on the intelligent risk prediction model;
the performance evaluation of the intelligent risk prediction model comprises:
randomly dividing the acquired data set into a plurality of groups of subdata sets;
performing multiple rounds of tests on the risk prediction model by using the plurality of sub data sets to obtain multiple performance evaluation parameters, wherein in each round any one of the sub data sets is selected as the test set and the remaining sub data sets are used as the training set;
taking the average value of the multiple performance evaluation parameters as the evaluation index of the risk prediction model to evaluate the performance of the intelligent risk prediction model;
the performance evaluation parameters at least comprise specificity, sensitivity, accuracy, F score, the area under the receiver operating characteristic (ROC) curve, and the area under the precision-recall curve.
8. A system for predicting risk of developing a disease, comprising:
the data set establishing module is used for acquiring Single Nucleotide Polymorphism (SNP) characterization information and clinical phenotype characteristic information of a gene variation site of a disease patient and establishing a data set based on the SNP characterization information and the clinical phenotype characteristic information;
the model building module is used for building a risk prediction basic model based on a neural network;
the model training module is used for training the risk prediction basic model by utilizing the data set to obtain an intelligent risk prediction model for predicting the disease risk probability;
and the prediction module is used for predicting the disease risk by utilizing the intelligent risk prediction model.
9. A computer-readable storage medium for storing program code for performing the method for predicting the risk of a disease according to any one of claims 1 to 7.
10. A computing device, the computing device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the method for predicting the risk of a disease according to any one of claims 1 to 7 according to instructions in the program code.
CN202210026857.5A 2022-01-11 2022-01-11 Method and system for predicting disease risk Pending CN114373547A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210026857.5A CN114373547A (en) 2022-01-11 2022-01-11 Method and system for predicting disease risk

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210026857.5A CN114373547A (en) 2022-01-11 2022-01-11 Method and system for predicting disease risk

Publications (1)

Publication Number Publication Date
CN114373547A true CN114373547A (en) 2022-04-19

Family

ID=81144136

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210026857.5A Pending CN114373547A (en) 2022-01-11 2022-01-11 Method and system for predicting disease risk

Country Status (1)

Country Link
CN (1) CN114373547A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841280A (en) * 2022-05-20 2022-08-02 北京安智因生物技术有限公司 Prediction classification method, system, medium, equipment and terminal for complex diseases
CN115602323A (en) * 2022-09-07 2023-01-13 浙江一山智慧医疗研究有限公司(Cn) Combined risk assessment model, method and application suitable for disease risk assessment
CN116072214A (en) * 2023-03-06 2023-05-05 之江实验室 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114841280A (en) * 2022-05-20 2022-08-02 北京安智因生物技术有限公司 Prediction classification method, system, medium, equipment and terminal for complex diseases
CN114841280B (en) * 2022-05-20 2023-02-14 北京安智因生物技术有限公司 Prediction classification method, system, medium, equipment and terminal for complex diseases
CN115602323A (en) * 2022-09-07 2023-01-13 浙江一山智慧医疗研究有限公司(Cn) Combined risk assessment model, method and application suitable for disease risk assessment
CN116072214A (en) * 2023-03-06 2023-05-05 之江实验室 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Similar Documents

Publication Publication Date Title
Martorell-Marugán et al. Deep learning in omics data analysis and precision medicine
CN114373547A (en) Method and system for predicting disease risk
JP7200294B2 (en) A variant pathogenicity classifier trained to avoid overfitting the location-frequency matrix
WO2020077232A1 (en) Methods and systems for nucleic acid variant detection and analysis
JP7276915B2 (en) Method and System for Individualized Prediction of Psychiatric Disorders Based on Monkey-Human Species Transfer of Brain Function Maps
KR20200011445A (en) Semi-supervised learning to train ensemble of deep convolutional neural networks
CN110785814A (en) Predicting quality of sequencing results using deep neural networks
US20230222311A1 (en) Generating machine learning models using genetic data
JP6312253B2 (en) Trait prediction model creation method and trait prediction method
Zaman et al. Codon based back propagation neural network approach to classify hypertension gene sequences
US20230073973A1 (en) Deep learning based system and method for prediction of alternative polyadenylation site
WO2023196928A2 (en) True variant identification via multianalyte and multisample correlation
US20230410941A1 (en) Identifying genome features in health and disease
EP4392988A1 (en) Neural-network-based classifier
JP2023535285A (en) Mutant Pathogenicity Scoring and Classification and Their Use
Ali et al. MACHINE LEARNING IN EARLY GENETIC DETECTION OF MULTIPLE SCLEROSIS DISEASE: A Survey
Khan et al. Genetic Algorithm for Biomarker Search Problem and Class Prediction
KR102659915B1 (en) Method of gene selection for predicting medical information of patients and uses thereof
CN113284611B (en) Cancer diagnosis and prognosis prediction system, apparatus and storage medium based on individual pathway activity
KR102659917B1 (en) Method for developing meta-gene based on non-negative matrix factorization and applications thereof
US20220301713A1 (en) Systems and methods for disease and trait prediction through genomic analysis
TWI650664B (en) Method for establishing assessment model for protein loss of function and risk assessment method and system using the assessment model
Xin Machine Learning Classification of Response to Internet-based Cognitive-Behavioural Therapy using Genome-Wide Association Study Data
Abass et al. Deep Learning Prediction of Exonic Sequence
Omar Ali A Comparative study of cancer detection models using deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination