CN114300036A

CN114300036A - Genetic variation pathogenicity prediction method and device, storage medium and computer equipment

Info

Publication number: CN114300036A
Application number: CN202111638768.8A
Authority: CN
Inventors: 彭继光; 韦荔全; 彭智宇
Original assignee: BGI Shenzhen Co Ltd
Current assignee: Shanghai Huada Medical Laboratory Co ltd; BGI Shenzhen Co Ltd
Priority date: 2021-12-29
Filing date: 2021-12-29
Publication date: 2022-04-08

Abstract

According to the genetic variation pathogenicity prediction method, the genetic variation pathogenicity prediction device, the storage medium and the computer equipment, firstly, characteristic data corresponding to a genetic variation locus to be predicted are determined, then, the characteristic data are predicted through a genetic variation pathogenicity prediction model, and a genetic variation pathogenicity prediction result is obtained; the genetic variation pathogenicity prediction model is obtained by integrating a plurality of target prediction submodels designed according to different pathogenicity grade classifications, and each target prediction submodel is trained by taking the characteristic data of a training genetic variation locus of the corresponding pathogenicity grade classification as a training sample and taking the pathogenicity grade classification of the training genetic variation locus as a sample label, and the training sample contains all mutation types.

Description

Genetic variation pathogenicity prediction method and device, storage medium and computer equipment

Technical Field

The present application relates to the field of genetic testing technologies, and in particular, to a method, an apparatus, a storage medium, and a computer device for predicting pathogenicity of genetic variation.

Background

With the gradual maturity and cost reduction of high-throughput sequencing technology, sequencing has become an important means for clinical research and diagnosis. Through years of clinical application accumulation, a large amount of variation site information is collected by each large public genome database, and the accurate interpretation of the variation sites is the key for realizing accurate medical treatment of human beings.

In 2015, the American College of Medical Genetics and Genomics (ACMG) issued standards for classifying and interpreting gene variation sites. To assess the pathogenicity and potential risk of a variation site, the standards divide sites into five classes: pathogenic, likely pathogenic, uncertain clinical significance, likely benign, and benign. However, the genetic variation pathogenicity prediction methods in common use today mainly target missense or non-synonymous mutations, cannot predict all mutation types, and do not consider the detailed classification of genetic variation sites, so their prediction accuracy is not high.

Disclosure of Invention

The purpose of the present application is to solve at least one of the above technical drawbacks, and in particular, to solve the technical drawback that the prediction accuracy is not high because all mutation types cannot be predicted in the prior art.

The application provides a method for predicting the pathogenicity of genetic variation, which comprises the following steps:

obtaining a gene variation site to be predicted;

determining characteristic data corresponding to the gene variation site to be predicted;

inputting the characteristic data into a pre-configured genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result output by the genetic variation pathogenicity prediction model;

the genetic variation pathogenicity prediction model is obtained by integrating a plurality of target prediction submodels designed according to different pathogenicity grade classifications, each target prediction submodel is obtained by taking the characteristic data of a training genetic variation locus of the corresponding pathogenicity grade classification as a training sample and taking the pathogenicity grade classification of the training genetic variation locus as a sample label for training.

Optionally, the determining feature data corresponding to the genetic variation site to be predicted includes:

calling a local annotation file, wherein the local annotation file is an annotation file downloaded from a variation site function annotation library in advance;

searching characteristic data corresponding to the gene variation site to be predicted in the local annotation file;

and if the characteristic data corresponding to the genetic variation locus to be predicted is not found in the local annotation file, calling a webpage API (application program interface), and acquiring the characteristic data corresponding to the genetic variation locus through the webpage API.

Optionally, before inputting the feature data into the pre-configured genetic variation pathogenicity prediction model, the method further includes:

adding derivative variables to the characteristic data according to the missing condition of the characteristic data;

completing the characteristic data that the derived variables mark as missing;

and carrying out normalization processing on the supplemented characteristic data.

Optionally, the training process of each target predictor model includes:

determining an initial predictor model, and acquiring a pathogenicity grade classification corresponding to the initial predictor model and characteristic data of a training gene variation locus of the pathogenicity grade classification;

dividing the characteristic data of the training gene variation sites into k data sets by k-fold cross validation, and performing k rounds of model training;

selecting one data set as a verification set and the rest data sets as training sets during each round of model training, training the initial prediction submodel by using the characteristic data in the training sets, and verifying the trained initial prediction submodel by using the characteristic data in the verification sets to obtain a verification result;

and selecting, from the k rounds of model training, the model with the highest prediction accuracy in the verification results as the target prediction sub-model for the pathogenicity grade classification corresponding to the initial prediction sub-model.

Optionally, the genetic variation pathogenicity prediction model comprises a main direction prediction layer, a direction correction layer, a degree prediction layer and a mapping layer;

inputting the characteristic data into a pre-configured genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result output by the genetic variation pathogenicity prediction model, wherein the genetic variation pathogenicity prediction result comprises the following steps:

inputting the characteristic data into the main direction prediction layer, and predicting the genetic variation main direction of the characteristic data to obtain a genetic variation main direction prediction result;

correcting the genetic variation main direction of the genetic variation main direction prediction result through the direction correction layer to obtain a genetic variation main direction correction result;

carrying out probability value prediction on the genetic variation main direction correction result by utilizing the degree prediction layer to obtain a probability value prediction result;

and mapping the probability value prediction result to a corresponding mapping interval through the mapping layer to obtain a final mapping score, and taking the mapping score as a genetic variation pathogenicity prediction result.
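The mapping step above can be illustrated with a minimal sketch. The text does not state how a probability value is placed inside its mapping interval, so the linear mapping below, the function name `map_to_interval` and the interval bounds in the usage line are all assumptions made for illustration only.

```python
def map_to_interval(prob: float, low: float, high: float) -> float:
    """Linearly place a probability from [0, 1] inside [low, high].

    Assumption: the mapping is linear. The source only says that the
    probability is mapped to a corresponding interval.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")
    if low > high:
        raise ValueError("low must not exceed high")
    return low + prob * (high - low)


# Hypothetical interval for one pair of adjacent grades.
score = map_to_interval(0.5, low=0.8, high=1.0)
```

Giving each pair of adjacent grades its own non-overlapping interval would keep the final scores ordered across grades, which is one plausible reason for a mapping layer of this kind.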

Optionally, the target predictor models comprise three-class models and two-class models;

the three-class models are applied to the main direction prediction layer and the direction correction layer, and the two-class models are applied to the direction correction layer and the degree prediction layer.

Optionally, the three-class models applied to the main direction prediction layer comprise a first classification model for predicting the main direction of genetic variation of the feature data, the main direction of genetic variation comprising a pathogenic tendency, uncertain clinical significance and a benign tendency;

the three-class models applied to the direction correction layer comprise a second classification model and a third classification model, wherein the second classification model is used for correcting among pathogenic, likely pathogenic and uncertain clinical significance in the main direction of genetic variation, and the third classification model is used for correcting among benign, likely benign and uncertain clinical significance in the main direction of genetic variation;

the two-class models applied to the direction correction layer comprise a fourth classification model for correcting between the pathogenic tendency and the benign tendency in the main direction of genetic variation;

the two-class models applied to the degree prediction layer comprise a fifth classification model, a sixth classification model, a seventh classification model and an eighth classification model;

the fifth classification model is used for predicting the probability value of pathogenic or likely pathogenic in the genetic variation main direction correction result, the sixth classification model is used for predicting the probability value of likely pathogenic or uncertain clinical significance in the genetic variation main direction correction result, the seventh classification model is used for predicting the probability value of benign or likely benign in the genetic variation main direction correction result, and the eighth classification model is used for predicting the probability value of likely benign or uncertain clinical significance in the genetic variation main direction correction result.

The present application also provides a genetic variation pathogenicity prediction apparatus, including:

the locus acquisition module is used for acquiring a gene variation locus to be predicted;

the characteristic determining module is used for determining characteristic data corresponding to the gene variation site to be predicted;

the pathogenicity prediction module is used for inputting the characteristic data into a pre-configured genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result output by the genetic variation pathogenicity prediction model.

The present application also provides a storage medium characterized in that: the storage medium has stored therein computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform a method of predicting the pathogenicity of a genetic variation as described in any one of the embodiments above.

The present application also provides a computer device, comprising: one or more processors, and a memory;

the memory has stored therein computer readable instructions that, when executed by the one or more processors, perform a method of predicting the pathogenicity of a genetic variation as described in any one of the above embodiments.

According to the technical scheme, the embodiment of the application has the following advantages:

according to the genetic variation pathogenicity prediction method, the genetic variation pathogenicity prediction device, the storage medium and the computer equipment, when the pathogenicity of a genetic variation locus to be predicted is predicted, firstly, characteristic data corresponding to the genetic variation locus to be predicted is determined, then, the characteristic data is predicted through a pre-configured genetic variation pathogenicity prediction model, and a genetic variation pathogenicity prediction result is obtained; in the application, because the genetic variation pathogenicity prediction model is obtained by integrating a plurality of target prediction submodels designed according to different pathogenicity grades in a classified mode, and each target prediction submodel is trained by taking the characteristic data of a training genetic variation locus of the corresponding pathogenicity grade classification as a training sample and taking the pathogenicity grade classification of the training genetic variation locus as a sample label, compared with a network model obtained by directly training according to a single mutation type, the genetic variation pathogenicity prediction model uses the characteristic data of the training genetic variation locus of different pathogenicity grade classifications as the training sample, so that the training sample contains all the mutation types, therefore, the genetic variation pathogenicity prediction model is suitable for all the mutation types and can obtain an accurate genetic variation pathogenicity prediction result according to the input characteristic data of the genetic variation locus, thereby effectively improving the prediction accuracy of the pathogenicity of the genetic variation.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a method for predicting the pathogenicity of a genetic variation according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of a training architecture of each target predictor model provided in an embodiment of the present application;

FIG. 3 is a schematic structural diagram of training parameters in a model training phase according to an embodiment of the present disclosure;

FIG. 4 is a diagram of a network architecture of a genetic variation pathogenicity prediction model provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an apparatus for predicting the pathogenicity of a genetic variation according to an embodiment of the present disclosure;

FIG. 6 is a schematic internal structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

In one embodiment, as shown in FIG. 1, which is a schematic flow chart of a method for predicting the pathogenicity of a genetic variation according to an embodiment of the present disclosure, the application provides a method for predicting the pathogenicity of genetic variation, which comprises the following steps:

S110: Obtaining the gene variation site to be predicted.

In this step, predicting the pathogenicity of genetic variation first requires obtaining a genetic variation site to be predicted. In the present application, the genetic variation site may be a variation site obtained by sequencing and sequence alignment of nucleic acid, a variation site recorded in a public database or published by others, or an artificially simulated variation site.

In the present application, when variation sites obtained by sequencing and sequence alignment of nucleic acids are used, a plurality of different nucleic acid-containing samples may be obtained in advance. The type of nucleic acid is not particularly limited: it may be deoxyribonucleic acid (DNA) or ribonucleic acid (RNA), and is preferably DNA. RNA can also be reverse transcribed into DNA experimentally for subsequent detection and analysis.

Furthermore, after a plurality of samples are obtained, the gene variation sites of each sample can be determined by sequence alignment. For example, the present application may sequence each sample with a gene sequencer to obtain sequencing data, map the sequencing data onto a reference genome sequence, and compare the two, thereby finding the single-base sites in the sequencing data that differ from the reference genome sequence, i.e., the gene variation sites.
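As a toy illustration of the comparison step, the sketch below reports single-base differences between a reference fragment and a sample fragment. It assumes the two sequences are already aligned, ungapped and of equal length; real pipelines rely on a read aligner and a variant caller instead. The function name `find_snvs` and the coordinate offset `start` are hypothetical.

```python
from typing import List, Tuple


def find_snvs(reference: str, sample: str, start: int = 1) -> List[Tuple[int, str, str]]:
    """Return (position, ref_base, alt_base) for each mismatching base.

    Assumes both sequences are pre-aligned and ungapped. Positions are
    1-based by default; bases reported as N are skipped as uninformative.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have the same aligned length")
    pairs = zip(reference.upper(), sample.upper())
    return [
        (start + offset, ref, alt)
        for offset, (ref, alt) in enumerate(pairs)
        if ref != alt and "N" not in (ref, alt)
    ]


# Fragment assumed to start at reference coordinate 100.
snvs = find_snvs("ACGTACGT", "ACGAACGC", start=100)
```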

It is understood that the reference genome sequence herein refers to a gene sequence fragment having a unique base arrangement pattern, and the position on the chromosome can be accurately located by aligning with such a fragment. Reference genomic sequences with version numbers hg38, hg19, or hg18 may be used in the examples of the present application without limitation.

Furthermore, because one person's whole genome sequencing data can yield four million SNVs (single-base sites that differ from the reference genome) and also fifty-one hundred thousand indels (insertions or deletions), there may be multiple genetic variation sites in the present application; the specific number can be determined according to the actual situation and is not limited herein.

S120: Determining characteristic data corresponding to the gene variation site to be predicted.

In this step, after the genetic variation site to be predicted is obtained through S110, the feature data corresponding to the genetic variation site to be predicted may be determined.

The feature data herein may include, for the gene variation site, functional conservation features (GERP++, LRT, PhyloP and phastCons), functional impact features (SIFT, PolyPhen2, PROVEAN, PrimateAI, MutationAssessor, ClinPred, M-CAP, MVP, FATHMM-MKL, TraP, SilVA, usDSM, PrDSM and EnDSM), splicing impact features (SpliceAI), population frequency features (AF_popmax), and the like.

The characteristic data in the application can be acquired by calling a local annotation file or by calling a webpage API (application programming interface). The local annotation file can be downloaded in advance from an official variation site annotation database or a third-party variation site annotation database. For example, the dbNSFP database provides functional prediction and annotation data for human genetic variation sites, with one-stop query and download for non-synonymous single nucleotide variations (nsSNVs).

Furthermore, after the feature data of each gene variation site is obtained, preprocessing operation can be performed on the feature data, so that the preprocessed feature data can be applied to model training of the deep neural network.

When the feature data is preprocessed, the feature data to be acquired for each genetic variation site across multiple dimensions can be determined first; for example, four categories of feature data can be acquired: functional conservation features, functional impact features, splicing impact features and population frequency features. The acquired feature data of each genetic variation site is then checked for missing values, and any missing feature data is completed with a neutral threshold to enhance the learning capability of the model on mutation types.

In addition, the present application can also normalize the feature data, scaling the feature values to the range [-1, 1] so that the model can be trained better.

S130: Inputting the characteristic data into a pre-configured genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result output by the genetic variation pathogenicity prediction model.

In this step, after the characteristic data corresponding to the genetic variation locus is determined by S120, the characteristic data may be input into a pre-configured genetic variation pathogenicity prediction model, so as to predict the characteristic data through the genetic variation pathogenicity prediction model, thereby obtaining a genetic variation pathogenicity prediction result, which may represent the pathogenicity of the genetic variation locus to be predicted.

The genetic variation pathogenicity prediction model is obtained by integrating a plurality of target predictor models designed according to different pathogenicity grade classifications. It can be understood that, under the 5-level classification standard of ACMG, the pathogenicity of a genetic variation site is classified as pathogenic (P), likely pathogenic (LP), uncertain clinical significance (VUS), likely benign (LB) or benign (B). The present application can design a plurality of target predictor models according to this 5-level standard (P/LP/VUS/LB/B) and the interpretation experience of a clinical interpretation team, taking into account how difficult each pathogenicity level is to distinguish, and integrate them into the final genetic variation pathogenicity prediction model. The model comprehensively covers features of genetic variation in multiple dimensions and screens them for effectiveness, which brings the model closer to clinical application scenarios.

When a plurality of target predictor models are integrated, the network architecture of the integrated genetic variation pathogenicity prediction model can be determined according to the prediction directions and prediction range of each target predictor model. For example, a target predictor model with more prediction directions and a wider prediction range can serve as a network layer close to the input layer, and a target predictor model with fewer prediction directions and a narrower prediction range can serve as a network layer close to the output layer, so that the overall network architecture proceeds from coarse granularity to fine granularity and the accuracy of model prediction is improved.

Furthermore, before the target predictor models are used for prediction, the different target predictor models can be trained independently. During training, each target predictor model uses the feature data of the training genetic variation sites of its corresponding pathogenicity grade classification as training samples and the pathogenicity grade classification of those sites as sample labels. For example, the training data of some target predictor models contain no VUS-rated variation sites, while the training data of others contain only LB-rated and VUS-rated variation sites. Likewise, the sample labels need to be reclassified before model training; for example, depending on the target predictor model, the sample label of an LB-rated variation site may be BLB (benign tendency) or LB.

In the above embodiment, when predicting the pathogenicity of a genetic variation locus to be predicted, first determining feature data corresponding to the genetic variation locus to be predicted, and then predicting the feature data by using a pre-configured genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result; in the application, because the genetic variation pathogenicity prediction model is obtained by integrating a plurality of target prediction submodels designed according to different pathogenicity grades in a classified mode, and each target prediction submodel is trained by taking the characteristic data of a training genetic variation locus of the corresponding pathogenicity grade classification as a training sample and taking the pathogenicity grade classification of the training genetic variation locus as a sample label, compared with a network model obtained by directly training according to a single mutation type, the genetic variation pathogenicity prediction model uses the characteristic data of the training genetic variation locus of different pathogenicity grade classifications as the training sample, so that the training sample contains all the mutation types, therefore, the genetic variation pathogenicity prediction model is suitable for all the mutation types and can obtain an accurate genetic variation pathogenicity prediction result according to the input characteristic data of the genetic variation locus, thereby effectively improving the prediction accuracy of the pathogenicity of the genetic variation.

In one embodiment, the determining the feature data corresponding to the genetic variation site to be predicted in S120 may include:

S121: Calling a local annotation file, wherein the local annotation file is an annotation file downloaded in advance from a variation site function annotation library.

S122: Searching the local annotation file for characteristic data corresponding to the gene variation site to be predicted.

S123: If the characteristic data corresponding to the genetic variation locus to be predicted is not found in the local annotation file, calling a webpage API (application program interface) and acquiring the characteristic data corresponding to the genetic variation locus through the webpage API.

In this embodiment, the feature data corresponding to the genetic variation site may be obtained by calling a local annotation file, which may be downloaded in advance from an official variation site annotation database or a third-party variation site annotation database and stored locally. For example, the dbNSFP database provides functional prediction and annotation data for human genetic variation sites, with one-stop query and download for non-synonymous single nucleotide variations (nsSNVs).

It can be understood that, because the annotation files in the local repository are large, the genetic variation sites can be associated with their feature data by creating an index. When features are obtained, the genetic variation site is used as the key for a local query, so that the feature data corresponding to the site can be found quickly.

Further, since part of the feature data does not provide a download file, the feature data corresponding to the genetic variation site cannot be found through the local annotation file, and at this time, the corresponding feature data may be obtained in a web page query manner, for example, the feature data of the part may be obtained in a form of calling a web page API interface.
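The local-first lookup with a web fallback described in S121 to S123 might be sketched as follows. The variant key layout, the row fields (`chrom`, `pos`, `ref`, `alt`, `features`) and the `fetch_remote` callable are hypothetical; `fetch_remote` merely stands in for whatever webpage API is actually queried, which the text does not name.

```python
from typing import Callable, Dict, Iterable, Tuple

VariantKey = Tuple[str, int, str, str]  # (chromosome, position, ref, alt)
Features = Dict[str, float]


def build_index(rows: Iterable[dict]) -> Dict[VariantKey, Features]:
    """Index local annotation rows by variant key for fast lookup."""
    return {(r["chrom"], r["pos"], r["ref"], r["alt"]): r["features"] for r in rows}


def get_features(
    key: VariantKey,
    index: Dict[VariantKey, Features],
    fetch_remote: Callable[[VariantKey], Features],
) -> Features:
    """Look the variant up locally first; fall back to the web API if absent."""
    local = index.get(key)
    if local is not None:
        return local
    return fetch_remote(key)


# Usage with a stub in place of the real web query (hypothetical values).
index = build_index(
    [{"chrom": "1", "pos": 100, "ref": "A", "alt": "G", "features": {"feat_a": 0.2}}]
)
calls = []


def fake_api(key: VariantKey) -> Features:
    calls.append(key)
    return {"feat_a": 0.9}


hit = get_features(("1", 100, "A", "G"), index, fake_api)
miss = get_features(("2", 55, "C", "T"), index, fake_api)
```

Passing the remote query in as a callable keeps the lookup logic testable without network access.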

In one embodiment, before inputting the feature data into the pre-configured genetic variation pathogenicity prediction model in S130, the method may further include:

S210: Adding derived variables to the characteristic data according to which characteristic data is missing.

S220: Completing the characteristic data that the derived variables mark as missing.

S230: Normalizing the completed characteristic data.

In this embodiment, after the feature data of the genetic variation locus is obtained, the feature data may be preprocessed, so that the preprocessed feature data may be transmitted to a genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result.

When the feature data is preprocessed, the dimensions of feature data to be acquired for each genetic variation site can be determined first; for example, four categories of feature data can be acquired: functional conservation features, functional impact features, splicing impact features and population frequency features. The acquired feature data of each genetic variation site is then checked for missing values, and any missing feature data is completed to enhance the learning capability of the model on mutation types.

When checking whether feature data is missing, the present application may add a derived variable to each feature. The purpose of this variable is to mark whether the feature data is missing (i.e. not found or not returned by a query); if a feature is marked as missing, it needs to be completed with a neutral threshold.

For example, 10 features correspond to 10 derived variables. If a feature does not exist or is not returned by a query, that is, if it is missing, its derived variable is set to 1; otherwise, its derived variable is set to 0.

After the derived variables are added, the flagged features themselves still have no value. To further improve the learning capability of the model on mutation types, the feature data that the derived variables mark as missing can be completed. The officially recommended threshold of the feature's source can be used as the fill value; because this threshold represents the most neutral value of the index, filling with it does not impair the learning capability of the model and can improve the training effect.
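The flag-then-fill procedure can be sketched as below. The feature names `feat_a` and `feat_b` and the neutral values in the usage lines are placeholders rather than officially recommended thresholds, and the `_missing` suffix for the derived variables is a naming assumption.

```python
from typing import Dict, Optional


def flag_and_fill(
    features: Dict[str, Optional[float]],
    neutral: Dict[str, float],
) -> Dict[str, float]:
    """Add a 0/1 missing flag per feature and fill gaps with a neutral value.

    `neutral` lists every expected feature together with its fill value.
    A feature absent from `features`, or present as None, counts as missing.
    """
    completed: Dict[str, float] = {}
    for name, fill_value in neutral.items():
        value = features.get(name)
        is_missing = value is None
        completed[name] = fill_value if is_missing else value
        completed[f"{name}_missing"] = 1.0 if is_missing else 0.0
    return completed


# Placeholder neutral values, not real tool thresholds.
row = flag_and_fill({"feat_a": 0.3, "feat_b": None}, {"feat_a": 0.5, "feat_b": 2.0})
```

Keeping the flag alongside the filled value lets a model distinguish a genuinely neutral score from one that was filled in.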

In addition, the present application can also normalize the completed feature data. In particular, the feature values may be scaled to the interval [-1, 1] so that the model can be trained better.
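The text does not say which normalization is used. Assuming simple min-max scaling, a feature value $x$ with observed minimum $x_{\min}$ and maximum $x_{\max}$ would be mapped to $2(x - x_{\min})/(x_{\max} - x_{\min}) - 1$. A minimal sketch of that assumption follows; the function name is hypothetical.

```python
from typing import List, Sequence


def scale_to_unit_range(values: Sequence[float]) -> List[float]:
    """Min-max scale one feature column to [-1, 1].

    Assumption: min-max scaling. A constant column carries no information
    and is mapped to 0.0 to avoid division by zero.
    """
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]
    span = highest - lowest
    return [2.0 * (v - lowest) / span - 1.0 for v in values]


scaled = scale_to_unit_range([0.0, 5.0, 10.0])
```

In practice the minimum and maximum would be computed on the training data and reused at prediction time, so that new variation sites are scaled consistently.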

In one embodiment, the training process for each target predictor model may include:

S310: Determining an initial predictor model, and acquiring a pathogenicity grade classification corresponding to the initial predictor model and characteristic data of a training gene variation locus of the pathogenicity grade classification.

In this step, when each target predictor model is trained, an initial predictor model may be determined first, and then the pathogenicity grade classification corresponding to the initial predictor model and the feature data of the training genetic variation site of the pathogenicity grade classification are obtained.

When the initial prediction submodel is determined, a complete set of classification rules can be designed according to clinical interpretation experience, with different classification strategies adopted in different links. For example, some links need to distinguish three classes, in which case a three-class model can be selected; if only two classes need to be distinguished, a two-class model can be selected.

In the application, each target prediction submodel is trained mainly by supervised learning, that is, the training data comprise training samples and sample labels. The training samples may be the feature data of the training genetic variation sites of the pathogenicity grade classification corresponding to the initial predictor model, and the sample labels may be that pathogenicity grade classification. For example, when the pathogenicity grade classification of a target predictor model is P, LP, B or LB, its training data contain no VUS-rated variation sites; when the pathogenicity grade classification of a target predictor model is LB and VUS, its training data contain only LB-rated and VUS-rated variation sites. Likewise, the sample labels are reclassified; for example, depending on the target predictor model, the sample label of an LB-rated variation site can be BLB or LB.
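The per-submodel filtering and relabeling described here could look like the sketch below. The label `BLB` comes from the text; the label `PLP` for the pathogenic tendency and the names `MAIN_DIRECTION`, `LB_VS_VUS` and `prepare_training_set` are illustrative assumptions.

```python
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[List[float], str]  # (feature vector, ACMG grade)

# Three-class main-direction labels: P/LP and LB/B are merged into tendencies.
# "PLP" is a hypothetical name for the pathogenic tendency.
MAIN_DIRECTION: Dict[str, str] = {
    "P": "PLP", "LP": "PLP", "VUS": "VUS", "LB": "BLB", "B": "BLB",
}
# Two-class degree model that only separates LB from VUS.
LB_VS_VUS: Dict[str, str] = {"LB": "LB", "VUS": "VUS"}


def prepare_training_set(samples: Sequence[Sample], label_map: Dict[str, str]) -> List[Sample]:
    """Keep only samples whose grade the submodel covers, then relabel them."""
    return [(x, label_map[grade]) for x, grade in samples if grade in label_map]


samples = [([0.1], "P"), ([0.2], "LB"), ([0.3], "VUS")]
main_set = prepare_training_set(samples, MAIN_DIRECTION)
degree_set = prepare_training_set(samples, LB_VS_VUS)
```

The same LB-rated site thus carries the label BLB in one submodel and LB in another, as the text describes.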

Further, during model training, genetic variation sites with a review status of 2 stars or more in the ClinVar database of the National Center for Biotechnology Information (NCBI) may be selected as training genetic variation sites. Review status in the ClinVar database is generally graded from 0 to 4 stars; the higher the star level, the more reliable the interpretation of the genetic variation site. Training the model on sites of 2 stars or more and their corresponding rating results can therefore effectively improve the reliability of the model.
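A minimal sketch of the star-level filter follows. It assumes the ClinVar records have already been parsed into dictionaries; the field names `review_stars` and `grade` and the site identifiers are hypothetical and do not reflect the actual ClinVar file format.

```python
from typing import Dict, List

MIN_STARS = 2  # threshold stated in the text


def select_training_sites(records: List[Dict], min_stars: int = MIN_STARS) -> List[Dict]:
    """Return only the records whose review status has at least min_stars stars."""
    return [record for record in records if record["review_stars"] >= min_stars]


# Made-up records for illustration only.
records = [
    {"site": "site_A", "review_stars": 0, "grade": "VUS"},
    {"site": "site_B", "review_stars": 1, "grade": "LP"},
    {"site": "site_C", "review_stars": 2, "grade": "P"},
    {"site": "site_D", "review_stars": 3, "grade": "B"},
    {"site": "site_E", "review_stars": 4, "grade": "LB"},
]
training_sites = select_training_sites(records)
```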

S320: dividing the feature data of the training genetic variation sites into k data sets by k-fold cross validation, and performing k rounds of model training.

In this step, each target prediction submodel can be trained in rounds by k-fold cross validation, and the model with the highest prediction accuracy is finally retained as the target prediction submodel. Moreover, k-fold cross validation is performed independently for each target prediction submodel, without interference between submodels, which can further improve the prediction accuracy of the model.

It can be understood that k-fold cross validation divides the training data into k parts, of which 1 part is used as validation data and the other k-1 parts are used as training data; this is repeated k times, each time with a different validation part, to obtain the optimal model. In the present application, k is set to 5, and the k rounds of model training are implemented with the scikit-learn library.
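The splitting scheme can be sketched in plain Python as follows. The text states that scikit-learn is used, where `sklearn.model_selection.KFold` provides equivalent splitting; the function `k_fold_indices` below is a hypothetical stand-in written only to show the mechanics, and the sample count of 23 is arbitrary.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (training indices, validation indices) for each of the k rounds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_samples=23, k=5))
```

Each sample index appears in exactly one validation fold and in the training set of the other four rounds.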

S330: in each round of model training, selecting one data set as the verification set and using the remaining data sets as the training set, training the initial prediction submodel with the feature data in the training set, and verifying the trained initial prediction submodel with the feature data in the verification set to obtain a verification result.

In this step, when the k rounds of model training are performed, in each round one data set may be selected as the verification set and the remaining data sets used as the training set; the initial prediction submodel is trained with the feature data in the training set, and the trained initial prediction submodel is verified with the feature data in the verification set, so as to obtain a verification result.

For example, the present application may use PyTorch to construct a deep neural network. The number of neurons in the input layer equals the number of feature data used for training; for example, the input layer may be written as $\{x_i \mid x_1, x_2, \ldots, x_m\}$, where $x_i$ is the $i$-th input feature and $m$ is the number of input features. The number of neurons in the output layer varies with the number of classes (for example, 1 for two-class classification and 3 for three-class classification). For example, the output of the deep neural network can be calculated as $O = HW_o + b_o$, $H = R(XW_h + b_h)$, where $O$ is the output of the deep neural network, $H$ is the output of the hidden layer, $R$ is the activation function, $X$ is the input feature matrix, $W_h$ and $b_h$ are the weight and bias parameters of the hidden layer, respectively, and $W_o$ and $b_o$ are the weight and bias parameters of the output layer, respectively.

In the weight update formula, $w$ is a weight of the neural network, $\eta$ is the learning rate, $\alpha$ is the L2 regularization weight parameter, and Loss is the loss function.
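The forward computation $O = HW_o + b_o$, $H = R(XW_h + b_h)$ can be sketched in plain Python as follows. Two points are assumptions rather than statements of the present application: ReLU is used for the activation function $R$, and the weight update is taken to be the common gradient descent step with L2 regularization, $w \leftarrow w - \eta\,(\partial \mathrm{Loss}/\partial w + \alpha w)$, since the text does not reproduce the formula. All function names and numeric values are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def relu(v: float) -> float:
    """Assumed activation function R."""
    return v if v > 0.0 else 0.0


def affine(X: Matrix, W: Matrix, b: List[float]) -> Matrix:
    """Compute XW + b row by row."""
    return [
        [sum(x_i * W[i][j] for i, x_i in enumerate(row)) + b[j] for j in range(len(b))]
        for row in X
    ]


def forward(X: Matrix, W_h: Matrix, b_h: List[float], W_o: Matrix, b_o: List[float]) -> Matrix:
    H = [[relu(v) for v in row] for row in affine(X, W_h, b_h)]  # H = R(X W_h + b_h)
    return affine(H, W_o, b_o)  # O = H W_o + b_o


def sgd_l2_step(w: float, grad: float, eta: float, alpha: float) -> float:
    """Assumed update: w <- w - eta * (dLoss/dw + alpha * w)."""
    return w - eta * (grad + alpha * w)


# One site with m = 2 features, 2 hidden neurons, 1 output neuron (two-class case).
X = [[1.0, 2.0]]
W_h = [[1.0, -1.0], [0.5, 0.5]]
b_h = [0.0, 0.0]
W_o = [[1.0], [2.0]]
b_o = [0.5]
O = forward(X, W_h, b_h, W_o, b_o)
w_new = sgd_l2_step(w=1.0, grad=0.5, eta=0.1, alpha=0.01)
```

In PyTorch the same structure would normally be expressed with linear layers and an optimizer whose weight decay setting plays the role of $\alpha$.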

It can be understood that the learning rate, the L2 regularization weight parameter, the dropout ratio, the batch size and the like are all hyper-parameters set for model training. Hyper-parameters are specified before training; repeatedly adjusting them while training and evaluating the model is referred to as parameter tuning. Regularization is commonly used in machine learning to control model complexity and reduce overfitting, and L2 regularization is one such method; the L2 regularization weight and the dropout ratio differ according to the type of model.

Further, for the two-classification model, BCEWithLogitsLoss may be used, which combines the sigmoid activation function with BCELoss. The specific formula is as follows:

$$l_n = -w_n\left[y_n \cdot \log \sigma(x_n) + (1 - y_n) \cdot \log\bigl(1 - \sigma(x_n)\bigr)\right]$$

where $N$ is the batch size, $n$ indexes the $n$-th site in a batch ($n = 1, \ldots, N$), $x_n$ is the neural network output value for the $n$-th site, $y_n$ is the sample label value of the $n$-th site, $l_n$ is the value of the loss function at the $n$-th site, $w_n$ is the weight value, and $\sigma(x_n)$ is the sigmoid activation function. When the two-classification model of the present application is trained, the per-site values $l_n$ are aggregated over the $N$ sites of a batch.
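The per-site loss can be checked numerically with a short plain-Python sketch. It mirrors the formula for $l_n$ directly; averaging over the batch is shown only as one common choice and is an assumption, since the text does not reproduce the batch-level formula. The function name `bce_with_logits` is hypothetical.

```python
import math
from typing import List, Optional


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_with_logits(
    logits: List[float], targets: List[float], weights: Optional[List[float]] = None
) -> List[float]:
    """Return l_n = -w_n [y_n log s(x_n) + (1 - y_n) log(1 - s(x_n))] for each site."""
    if weights is None:
        weights = [1.0] * len(logits)
    losses = []
    for x, y, w in zip(logits, targets, weights):
        p = sigmoid(x)
        losses.append(-w * (y * math.log(p) + (1.0 - y) * math.log(1.0 - p)))
    return losses


# Two sites: raw network outputs x_n and binary labels y_n.
per_site = bce_with_logits([0.0, 2.0], [1.0, 0.0])
batch_mean = sum(per_site) / len(per_site)  # assumed aggregation over the batch
```

This direct form is for illustration with moderate output values; library implementations rearrange the expression to stay numerically stable for very large or very small logits.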

For the three-classification model, CrossEntropyLoss may be used, which combines the softmax activation function with NLLLoss. The specific formula is as follows:

$$\mathrm{loss}(x, \mathit{class}) = -\log\left(\frac{\exp(x[\mathit{class}])}{\sum_{j=1}^{C} \exp(x[j])}\right) = -x[\mathit{class}] + \log\left(\sum_{j=1}^{C} \exp(x[j])\right)$$

where class is the class label value, C is the total number of classes, j indexes the j-th class, x[j] is the neural network output value for the j-th class, and x[class] is the neural network output value for the target (true) class.
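As an illustration of how the quantities x[j] and x[class] enter the loss, the following plain-Python sketch computes the negative log of the softmax probability of the target class, which is what combining softmax with a negative log-likelihood loss yields. The function name `cross_entropy` and the example values are hypothetical.

```python
import math
from typing import List


def cross_entropy(x: List[float], target_class: int) -> float:
    """Return -x[class] + log(sum_j exp(x[j])) for one site."""
    m = max(x)  # subtract the maximum before exponentiating to avoid overflow
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in x))
    return -x[target_class] + log_sum_exp


# Three classes (C = 3), e.g. the outputs of a PLP/VUS/BLB model.
loss_uniform = cross_entropy([0.0, 0.0, 0.0], target_class=1)
loss_confident = cross_entropy([5.0, 0.0, 0.0], target_class=0)
```

Equal outputs for all three classes give a loss of log 3, and the loss shrinks as the output for the true class grows relative to the others.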

S340: selecting, according to the pathogenicity grade classification corresponding to the initial prediction submodel, the model with the highest prediction accuracy among the verification results of the k rounds of model training as the target prediction submodel.

In the step, after k rounds of model training are carried out on the initial prediction submodels, the prediction accuracy of each round of model training can be verified according to the pathogenicity grade classification corresponding to the initial prediction submodel, and then the initial prediction submodel with the highest prediction accuracy in the verification results of the k rounds of model training is selected as the target prediction submodel, so that the generalization capability of the models can be improved.
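The selection step can be sketched as follows, assuming each round has already produced predicted labels for its own verification set. The round identifiers, the label lists and the function names are hypothetical.

```python
from typing import List, Tuple


def accuracy(predicted: List[str], actual: List[str]) -> float:
    """Fraction of verification sites whose predicted label matches the true label."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def select_best_round(rounds: List[Tuple[str, List[str], List[str]]]) -> Tuple[str, float]:
    """rounds holds (round id, predicted labels, true labels); return the best round."""
    scored = [(round_id, accuracy(pred, true)) for round_id, pred, true in rounds]
    return max(scored, key=lambda item: item[1])


# Made-up verification results for k = 5 rounds of a PLP/BLB submodel.
rounds = [
    ("round_1", ["PLP", "BLB", "PLP", "BLB"], ["PLP", "BLB", "BLB", "BLB"]),
    ("round_2", ["PLP", "BLB", "BLB", "BLB"], ["PLP", "BLB", "BLB", "BLB"]),
    ("round_3", ["BLB", "BLB", "PLP", "BLB"], ["PLP", "BLB", "BLB", "BLB"]),
    ("round_4", ["PLP", "PLP", "BLB", "BLB"], ["PLP", "BLB", "BLB", "BLB"]),
    ("round_5", ["BLB", "PLP", "PLP", "PLP"], ["PLP", "BLB", "BLB", "BLB"]),
]
best_round, best_accuracy = select_best_round(rounds)
```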

In one embodiment, the genetic variation pathogenicity prediction model may include a main direction prediction layer, a direction correction layer, a degree prediction layer, and a mapping layer.

Step S130, inputting the feature data into a pre-configured genetic variation pathogenicity prediction model to obtain the genetic variation pathogenicity prediction result output by the genetic variation pathogenicity prediction model, may include:

S131: inputting the feature data into the main direction prediction layer, and predicting the genetic variation main direction of the feature data to obtain a genetic variation main direction prediction result.

S132: correcting the genetic variation main direction prediction result through the direction correction layer to obtain a genetic variation main direction correction result.

S133: predicting a probability value for the genetic variation main direction correction result through the degree prediction layer to obtain a probability value prediction result.

S134: mapping the probability value prediction result into the corresponding mapping interval through the mapping layer to obtain a final mapping score, and taking the mapping score as the genetic variation pathogenicity prediction result.

In this embodiment, when the pathogenicity of a genetic variation site is predicted, the network architecture of the genetic variation pathogenicity prediction model may be determined according to the prediction directions and prediction range of each target prediction submodel. For example, a target prediction submodel with more prediction directions and a wider prediction range is used as a network layer close to the input layer of the genetic variation pathogenicity prediction model, and a target prediction submodel with fewer prediction directions and a narrower prediction range is used as a network layer close to the output layer, so that the overall network architecture proceeds from coarse granularity to fine granularity, which improves the accuracy of model prediction.

Specifically, when the pathogenicity of a genetic variation site is predicted, the feature data of the site can be input into the main direction prediction layer, which predicts the genetic variation main direction of the feature data to obtain a genetic variation main direction prediction result. According to that result, the feature data enters the corresponding part of the direction correction layer for direction correction, yielding a genetic variation main direction correction result. The degree prediction layer then predicts a probability value for the genetic variation main direction correction result to obtain a probability value prediction result. Finally, the mapping layer maps the probability value prediction result into the corresponding mapping interval to obtain the final mapping score.

In one embodiment, the target prediction submodels may include three-classification models and two-classification models; the three-classification models are applied to the main direction prediction layer and the direction correction layer, and the two-classification models are applied to the direction correction layer and the degree prediction layer.

In this embodiment, the target prediction submodels may be divided into two-classification models and three-classification models. A three-classification model distinguishes among three classes and can therefore be applied to the main direction prediction layer and the direction correction layer of the genetic variation pathogenicity prediction model; a two-classification model distinguishes between two classes and can therefore be applied to the direction correction layer and the degree prediction layer of the genetic variation pathogenicity prediction model.

For example, eight two-classification or three-classification models may be designed in combination with clinical interpretation experience and trained separately; see fig. 2, which is a schematic diagram of the training architecture of each target prediction submodel provided in the embodiment of the present application. It can be understood that, according to the 5-level classification criteria of the ACMG, the pathogenicity of a genetic variation site can be classified as pathogenic (P), likely pathogenic (LP), of uncertain clinical significance (VUS), likely benign (LB) or benign (B). Based on the experience of the clinical interpretation team and the ease of distinguishing each rating, the present application can design a plurality of target prediction submodels: a PLP/VUS/BLB three-classification model, a P/LP/VUS three-classification model, a PLP/BLB two-classification model, a B/LB/VUS three-classification model, a P/LP two-classification model, an LP/VUS two-classification model, a B/LB two-classification model and an LB/VUS two-classification model.

The PLP/VUS/BLB three-classification model is formed by combining the P, LP, VUS, LB and B grades and is used for predicting the genetic variation main direction of the feature data, where the main direction may be a pathogenic tendency (PLP), uncertain clinical significance (VUS) or a benign tendency (BLB). The P/LP/VUS three-classification model is formed by combining the P, LP and VUS grades and is used for correcting among pathogenic, likely pathogenic and uncertain clinical significance within the genetic variation main direction. The remaining models are defined analogously, giving eight two-classification or three-classification models in total.

Further, when the eight two-classification or three-classification models are trained, the hyper-parameters selected also differ, because the training samples and sample labels differ between models. This is shown in fig. 3, which is a schematic diagram of the training parameters used in the model training stage provided in the embodiment of the present application. The training parameters in fig. 3 differ according to the pathogenicity grade classification handled by each neural network, and the types of training parameters may include an optimizer, an activation function, a loss function, regularization, a batch size, a dropout ratio, a learning rate, and the like, which is not limited herein. In addition, the values of the training parameters in fig. 3 are the final values determined after repeated parameter tuning.

Further, when the model is trained, submodels may alternatively be trained separately for each pathogenicity grade (P/LP/VUS/LB/B). Each genetic variation site is then predicted by all of the models to obtain a plurality of prediction scores, and the output value of the most reliable classification is selected, which achieves a technical effect similar to that of the present application.

In one embodiment, the three-classification model applied to the main direction prediction layer may include a first classification model for predicting the genetic variation main direction of the feature data, the genetic variation main direction including a pathogenic tendency, uncertain clinical significance and a benign tendency.

The three-classification models applied to the direction correction layer may include a second classification model for correcting among pathogenic, likely pathogenic and uncertain clinical significance within the genetic variation main direction, and a third classification model for correcting among benign, likely benign and uncertain clinical significance within the genetic variation main direction.

The two-classification model applied to the direction correction layer may include a fourth classification model for correcting between the pathogenic tendency and the benign tendency within the genetic variation main direction.

The two-classification models applied to the degree prediction layer may include a fifth, a sixth, a seventh and an eighth classification model. The fifth classification model predicts the probability value of pathogenic versus likely pathogenic in the genetic variation main direction correction result; the sixth predicts the probability value of likely pathogenic versus uncertain clinical significance; the seventh predicts the probability value of benign versus likely benign; and the eighth predicts the probability value of likely benign versus uncertain clinical significance.

In this embodiment, in the genetic variation pathogenicity prediction model, 8 target predictor models respectively exert different classification effects at different levels. Schematically, as shown in fig. 4, fig. 4 is a network architecture diagram of a genetic variation pathogenicity prediction model provided by an embodiment of the present application; in fig. 4, the main direction prediction layer of the genetic variation pathogenicity prediction model only comprises a first classification model, such as the PLP/VUS/BLB model 10, and performs prediction of the main direction of genetic variation, which has a PLP direction, a VUS direction and a BLB direction, i.e., a pathogenicity tendency, a clinical ambiguity and a benign tendency, respectively. The direction correction layer includes a second classification model, a third classification model and a fourth classification model, wherein the second classification model may be a P/LP/VUS model 20, the third classification model may be a B/LB/VUS model 30, and the fourth classification model may be a PLP/BLB model 40, which are respectively used for correction of three main directions of the main direction prediction layer. Meanwhile, the prediction result of the layer determines the mapping interval of the subsequent mapping layer, and the mapping interval is established by combining bioinformatics and clinical medicine interpretation experience.

Further, the degree prediction layer in fig. 4 may include a fifth classification model, a sixth classification model, a seventh classification model and an eighth classification model; for example, the fifth classification model may be the P/LP model 50, the sixth classification model may be the LP/VUS model 60, the seventh classification model may be the B/LB model 70, and the eighth classification model may be the LB/VUS model 80. The probability values predicted by these models need to be retained for the score mapping of the subsequent mapping layer. The mapping layer in the present application mainly combines the mapping interval determined by the direction correction layer with the probability value from the degree prediction layer to obtain a final mapping score, namely the genetic variation pathogenicity prediction result.
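The assignment of the eight submodels to layers can be summarized as a small registry. This is only an illustrative data structure: the `SubModel` class, its field names and the layer identifiers are hypothetical, while the reference numerals, model names and the output neuron counts (1 for two classes, 3 for three classes) follow the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SubModel:
    ref: int                  # reference numeral used in fig. 4
    name: str
    layer: str                # hypothetical layer identifier
    classes: Tuple[str, ...]


SUB_MODELS = (
    SubModel(10, "PLP/VUS/BLB", "main_direction", ("PLP", "VUS", "BLB")),
    SubModel(20, "P/LP/VUS", "direction_correction", ("P", "LP", "VUS")),
    SubModel(30, "B/LB/VUS", "direction_correction", ("B", "LB", "VUS")),
    SubModel(40, "PLP/BLB", "direction_correction", ("PLP", "BLB")),
    SubModel(50, "P/LP", "degree", ("P", "LP")),
    SubModel(60, "LP/VUS", "degree", ("LP", "VUS")),
    SubModel(70, "B/LB", "degree", ("B", "LB")),
    SubModel(80, "LB/VUS", "degree", ("LB", "VUS")),
)


def output_neurons(model: SubModel) -> int:
    """1 output neuron for a two-classification model, 3 for a three-classification model."""
    return 1 if len(model.classes) == 2 else len(model.classes)


models_by_layer = {}
for model in SUB_MODELS:
    models_by_layer.setdefault(model.layer, []).append(model.name)
```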

Referring to fig. 4, the feature matrix of a variation site is first passed into the PLP/VUS/BLB model 10. If its classification result is VUS, the feature matrix is passed into the PLP/BLB model 40; if the classification result of that model is PLP, the feature matrix is passed into the LP/VUS model 60, and the final prediction score is mapped into the interval [0.5, 0.85] as the genetic variation pathogenicity prediction result.
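The example path through fig. 4 can be sketched with stand-in models. Several points are assumptions: the submodels are represented by plain callables, the LP/VUS model is taken to return a probability in [0, 1] where higher means closer to LP, and the mapping into the interval is taken to be linear. The text gives only the interval [0.5, 0.85], not the mapping function, and the other branches of fig. 4 are not covered here.

```python
from typing import Callable, List, Optional

Features = List[float]


def map_to_interval(p: float, low: float, high: float) -> float:
    """Assumed linear mapping of a probability in [0, 1] onto [low, high]."""
    return low + p * (high - low)


def route_example_path(
    features: Features,
    model_10: Callable[[Features], str],    # PLP/VUS/BLB main direction model
    model_40: Callable[[Features], str],    # PLP/BLB direction correction model
    model_60: Callable[[Features], float],  # LP/VUS degree model (probability)
) -> Optional[float]:
    """Follow only the VUS -> PLP -> LP/VUS path described in the text."""
    if model_10(features) != "VUS":
        return None  # other branches of fig. 4 are outside this sketch
    if model_40(features) != "PLP":
        return None
    probability = model_60(features)
    return map_to_interval(probability, 0.5, 0.85)


# Stand-in models returning fixed outputs, for illustration only.
score = route_example_path(
    [0.3, 0.7],
    model_10=lambda f: "VUS",
    model_40=lambda f: "PLP",
    model_60=lambda f: 0.4,
)
other_branch = route_example_path(
    [0.3, 0.7],
    model_10=lambda f: "BLB",
    model_40=lambda f: "PLP",
    model_60=lambda f: 0.4,
)
```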

It can be understood that the output of the genetic variation pathogenicity prediction model in the present application is a score between 0 and 1, and a definite pathogenicity grade can be determined by partitioning the score range with thresholds. The thresholds can be customized by the user according to the trained model and the application requirements, for example, defining [0.95, 1] as P, [0.85, 0.95) as LP, [0.5, 0.85) as VUS, [0.1, 0.5) as LB and [0, 0.1) as B.
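With the example thresholds given above, converting a score into a grade is a simple lookup. The function name `grade_from_score` is hypothetical, and the cut-offs are the example values from the text, which a user may replace.

```python
def grade_from_score(score: float) -> str:
    """Map a prediction score in [0, 1] to a grade using the example thresholds."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie between 0 and 1")
    if score >= 0.95:
        return "P"    # [0.95, 1]
    if score >= 0.85:
        return "LP"   # [0.85, 0.95)
    if score >= 0.5:
        return "VUS"  # [0.5, 0.85)
    if score >= 0.1:
        return "LB"   # [0.1, 0.5)
    return "B"        # [0, 0.1)


grades = [grade_from_score(s) for s in (0.99, 0.9, 0.64, 0.3, 0.05)]
```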

The genetic variation pathogenicity prediction device provided by the embodiment of the present application is described below, and the genetic variation pathogenicity prediction device described below and the genetic variation pathogenicity prediction method described above may be referred to correspondingly.

In one embodiment, as shown in fig. 5, which is a schematic structural diagram of a genetic variation pathogenicity prediction device provided by an embodiment of the present application, the present application further provides a genetic variation pathogenicity prediction device comprising a site acquisition module 210, a feature determination module 220 and a pathogenicity prediction module 230, specifically as follows:

The site acquisition module 210 is configured to acquire the genetic variation site to be predicted.

The feature determination module 220 is configured to determine the feature data corresponding to the genetic variation site to be predicted.

The pathogenicity prediction module 230 is configured to input the feature data into a pre-configured genetic variation pathogenicity prediction model, so as to obtain a genetic variation pathogenicity prediction result output by the genetic variation pathogenicity prediction model.

In the above embodiment, when the pathogenicity of a genetic variation site to be predicted is predicted, the feature data corresponding to the site is first determined, and the feature data is then predicted by a pre-configured genetic variation pathogenicity prediction model to obtain a genetic variation pathogenicity prediction result. In the present application, the genetic variation pathogenicity prediction model is obtained by integrating a plurality of target prediction submodels designed for different pathogenicity grade classifications, and each target prediction submodel is trained with the feature data of training genetic variation sites of the corresponding pathogenicity grade classification as training samples and the pathogenicity grade classification of those sites as sample labels. Compared with a network model trained directly on a single mutation type, the model therefore uses training samples that cover all mutation types. As a result, the genetic variation pathogenicity prediction model is applicable to all mutation types and can produce an accurate prediction result from the input feature data of a genetic variation site, effectively improving the accuracy of genetic variation pathogenicity prediction.

In one embodiment, the present application further provides a storage medium characterized by: the storage medium has stored therein computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform a method of predicting the pathogenicity of a genetic variation as described in any one of the embodiments above.

In one embodiment, the present application further provides a computer device, comprising: one or more processors, and a memory.

Fig. 6 is a schematic diagram illustrating an internal structure of a computer device according to an embodiment of the present disclosure, and the computer device 300 may be provided as a server, as shown in fig. 6. Referring to fig. 6, computer device 300 includes a processing component 302 that further includes one or more processors, and memory resources, represented by memory 301, for storing instructions, such as application programs, that are executable by processing component 302. The application programs stored in memory 301 may include one or more modules that each correspond to a set of instructions. Further, the processing component 302 is configured to execute instructions to perform the genetic variation pathogenicity prediction method of any of the embodiments described above.

The computer device 300 may also include a power component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input/output (I/O) interface 305. The computer device 300 may operate based on an operating system stored in memory 301, such as Mac OS X™, Linux, FreeBSD™, or the like.

Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, the embodiments may be combined as needed, and the same and similar parts may be referred to each other.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for predicting the pathogenicity of a genetic variation, the method comprising:

obtaining a gene variation site to be predicted;

2. The method of claim 1, wherein determining feature data corresponding to the genetic variation site to be predicted comprises:

3. The method of claim 1 or 2, wherein before inputting the characteristic data into a pre-configured genetic variation pathogenicity prediction model, further comprising:

displaying the derived variables as missing characteristic data for completion;

4. The method of any one of claims 1-3, wherein the training process for each target predictor model comprises:

5. The method of any one of claims 1-4, wherein the genetic variation pathogenicity prediction model comprises a main direction prediction layer, a direction correction layer, a degree prediction layer, and a mapping layer;

6. The method of any of claims 1-5, wherein the target predictor models comprise a three-classification model and a two-classification model;

7. The method according to any one of claims 1-6, wherein the three-classification model applied to the main direction prediction layer comprises a first classification model for predicting a genetic variation main direction of the feature data, the genetic variation main direction comprising a pathogenic tendency, uncertain clinical significance, and a benign tendency;

8. An apparatus for predicting the pathogenicity of a genetic variation, comprising:

9. A storage medium, characterized by: the storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of predicting the pathogenicity of a genetic variation as set forth in any one of claims 1 to 7.

10. A computer device, comprising: one or more processors, and a memory;

the memory has stored therein computer readable instructions which, when executed by the one or more processors, perform the steps of the method of predicting the pathogenicity of a genetic variation as set forth in any one of claims 1 to 7.