CN116072214A

CN116072214A - Phenotype intelligent prediction and training method and device based on gene significance enhancement

Info

Publication number: CN116072214A
Application number: CN202310202392.9A
Authority: CN
Inventors: 应志文; 章依依; 徐晓刚; 王军
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-03-06
Filing date: 2023-03-06
Publication date: 2023-05-05
Anticipated expiration: 2043-03-06
Also published as: JP7490156B1; CN116072214B

Abstract

The invention discloses a phenotype intelligent prediction and training method and device based on gene significance enhancement. An actual distribution table is constructed from gene morphology and high/low phenotype category, and an expected distribution table of gene morphology and phenotype category is constructed under the chi-square hypothesis. A chi-square test is carried out for each gene locus against the phenotype, the probability that the chi-square hypothesis holds is obtained from the chi-square table, and the significance value of the gene locus for the phenotype is derived from it; the gene loci are encoded at the same time. The code of each gene locus is then amplified according to its significance value, which strengthens the association between the gene data and the phenotype and improves the accuracy of predicting the phenotype from the gene loci. The invention is aimed at organisms with diploid chromosomes, adopts a deep learning training method, and improves the prediction accuracy from gene loci to phenotypes by enhancing the gene locus data.

Description

Phenotype intelligent prediction and training method and device based on gene significance enhancement

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a phenotype intelligent prediction and training method and device based on gene significance enhancement.

Background

In predicting phenotypes from genes, prediction with deep learning models has received wide attention and application. A current mainstream method uses convolutional neural networks to extract convolutional features from gene data and so train a gene-to-phenotype prediction model. However, this approach ignores the fact that individual genes contribute to the phenotype to different degrees, which lowers the accuracy of the predicted phenotype.

Disclosure of Invention

In order to solve the defects in the prior art and achieve the purpose of improving the prediction precision of the gene prediction phenotype, the invention adopts the following technical scheme:

the phenotype intelligent prediction training method based on gene significance enhancement comprises the following steps:

step one: obtaining a phenotype value and a corresponding gene sequence of a gene sample, wherein the gene sequence comprises a group of gene loci;

step two: calculating a phenotype average value from the phenotype values, classifying the phenotypes of the gene samples by the phenotype average value, and constructing an actual distribution table of gene morphology and phenotype category; obtaining an expected distribution table of gene morphology and phenotype category based on the hypothesis that gene morphology is uncorrelated with phenotype category; calculating the chi-square statistic under the hypothesis from the actual distribution table and the expected distribution table of the samples, obtaining the probability value that the hypothesis holds by querying the chi-square table, and calculating the significance value of the gene locus for the phenotype based on the probability value;

step three: scaling the encoded gene loci by the significance value to obtain gene enhancement data corresponding to the gene samples;

step four: constructing a neural network model, and performing phenotype prediction training with the gene enhancement data set to obtain a trained model for predicting phenotype from gene enhancement data.

The second step comprises the following steps:

step 2.1: letting k be the gene locus whose significance value is to be calculated, and making the chi-square hypothesis $H_{0k}$: gene locus k has no significant relationship with phenotype y;

step 2.2: calculating the average value of the N gene sample phenotypes, classifying all the gene samples according to the average value, and, together with the multiple forms of the gene locus, constructing an actual distribution table of gene forms and phenotype categories, giving the actual count $O_{mn}$ of each gene form under each phenotype category;

step 2.3: according to the chi-square hypothesis $H_{0k}$ that gene morphology and phenotype have no significant relationship, constructing an expected distribution table of gene morphology and phenotype, giving the expected count $E_{mn}$ of each gene morphology under each phenotype category; calculating the chi-square statistic:

$$\chi_k^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

wherein m represents the number of gene locus forms and n represents the number of phenotype categories of the gene samples; the chi-square table is queried with the chi-square statistic to obtain the probability $P_k$ that the chi-square hypothesis holds, and based on the probability $P_k$ the significance values of all loci are calculated.
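The statistic and table lookup of step 2.3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the counts are hypothetical, and the chi-square table lookup is replaced by the closed form that holds for a 3 × 2 table, which has (3 − 1)(2 − 1) = 2 degrees of freedom and therefore an upper-tail probability of exp(−χ²/2).

```python
import math

def chi_square_statistic(observed, expected):
    """Sum (O - E)^2 / E over every cell of two equally shaped tables."""
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )

def p_value_3x2(chi2):
    """Upper-tail probability for 2 degrees of freedom (3 forms x 2 classes).

    Assumption: this closed form stands in for querying a chi-square table.
    """
    return math.exp(-chi2 / 2.0)

# Illustrative counts only; rows are AA, Aa, aa and columns are high, low.
observed = [[30, 10], [15, 15], [5, 25]]
expected = [[20.0, 20.0], [15.0, 15.0], [15.0, 15.0]]

chi2 = chi_square_statistic(observed, expected)
p_k = p_value_3x2(chi2)
```

A small `p_k` means the hypothesis of no relationship is unlikely to hold for this locus, so the locus should receive a large significance value.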

In the step 2.2, the gene loci take three forms, AA, Aa and aa, and deletions are not counted.

In the step 2.2, all the gene samples are classified into two categories according to the mean value:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$

wherein $\bar{y}$ represents the average value; samples equal to or greater than the average value form one class, HN samples in total; samples less than the average value form the other class, LN samples in total;

in the step 2.3, according to the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype, the ratio of high to low phenotype is the same for every gene morphology:

$$E_{m1} : E_{m2} = HN : LN$$

thus obtaining the expected distribution of gene morphology and phenotype:

$$E_{m1} = (O_{m1}+O_{m2})\cdot\frac{HN}{N},\qquad E_{m2} = (O_{m1}+O_{m2})\cdot\frac{LN}{N},\qquad m = 1, 2, 3$$

wherein $O_{11}$, $E_{11}$ represent the high-phenotype actual and expected values for gene morphology AA; $O_{12}$, $E_{12}$ the low-phenotype actual and expected values for gene morphology AA; $O_{21}$, $E_{21}$ the high-phenotype actual and expected values for gene morphology Aa; $O_{22}$, $E_{22}$ the low-phenotype actual and expected values for gene morphology Aa; $O_{31}$, $E_{31}$ the high-phenotype actual and expected values for gene morphology aa; and $O_{32}$, $E_{32}$ the low-phenotype actual and expected values for gene morphology aa.
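The expected distribution can be derived directly from the actual one. The sketch below assumes the usual rule of the chi-square test of independence, where each expected count is the row total multiplied by the column share; the function name and the counts are hypothetical. Column totals are taken from the table itself, so they coincide with HN and LN only when no sample has a missing genotype at the locus.

```python
def expected_table(observed):
    """Expected counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Illustrative counts only; rows are AA, Aa, aa and columns are high, low.
observed = [[30, 10], [15, 15], [5, 25]]
expected = expected_table(observed)
```

Here every gene form is expected to split evenly between the two phenotype classes, because the illustrative data contain as many high-phenotype samples as low-phenotype samples.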

The gene enhancement data in the third step:

$$X = \{\,x_k \cdot (-\log_{10} P_k)\,\},\quad k = 1, 2, \dots, K$$

wherein $x_k$ represents the encoded gene locus and $-\log_{10} P_k$ represents its significance value.

The gene loci are encoded with one-hot encoding.
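A minimal sketch of the scaling in step three, assuming the significance value of a locus is −log10 of its probability as in the embodiment. The function name and the probabilities are hypothetical.

```python
import math

def enhance(encoded_loci, p_values):
    """Multiply each one-hot vector x_k by the significance value -log10(P_k)."""
    enhanced = []
    for x_k, p_k in zip(encoded_loci, p_values):
        weight = -math.log10(p_k)
        enhanced.append([weight * bit for bit in x_k])
    return enhanced

encoded = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]   # AA, Aa, aa at three loci
p_values = [0.001, 0.5, 0.01]                 # hypothetical probabilities P_k
enhanced = enhance(encoded, p_values)
```

The locus with the smallest probability (0.001) ends up with the largest non-zero entry, so it weighs most in the features seen by the network.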

In the fourth step, the gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training. First, the amount of data input to the network each time is set as the batch size, so that the input dimension is batch size × m × K, where m represents the number of gene locus forms and K represents the sequence length. The features of the gene enhancement data are extracted by the neural network and connected through a fully connected layer to output a predicted phenotype value. The real phenotype value and the predicted phenotype value are input into a loss network for loss calculation, the resulting loss value is back-propagated, and the corresponding parameters are updated. After repeated iterative updating the loss value converges and iteration stops, giving the trained model for predicting phenotype from gene enhancement data.
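The input dimension batch size × m × K can be made concrete with a small batching sketch. Nested Python lists stand in for tensors here, and the helper names, sample count and batch size are hypothetical.

```python
def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

def shape(batch):
    """Return (batch size, m, K) for a batch of m x K matrices."""
    return (len(batch), len(batch[0]), len(batch[0][0]))

m, K = 3, 4                                    # 3 gene forms, 4 loci
samples = [[[0.0] * K for _ in range(m)] for _ in range(10)]
batches = list(iter_batches(samples, batch_size=4))
```

With 10 samples and a batch size of 4, the last batch holds only the 2 remaining samples, while m and K stay fixed.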

The phenotype intelligent prediction training device based on the gene significance enhancement comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the phenotype intelligent prediction training method based on the gene significance enhancement when executing the executable codes.

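In the phenotype intelligent prediction method based on gene significance enhancement, the phenotype of a gene sample is predicted with the model for predicting phenotype from gene enhancement data that has been trained by the above phenotype intelligent prediction training method based on gene significance enhancement.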

The phenotype intelligent prediction device based on gene significance enhancement comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the phenotype intelligent prediction method based on gene significance enhancement when executing the executable codes.

The invention has the advantages that:

In the phenotype intelligent prediction and training method and device based on gene significance enhancement, the significance value of each SNP locus is calculated by a chi-square test, the significance value is used as the contribution of the gene locus to scale the gene coding data, and a deep learning neural network then extracts features from the scaled gene data. Existing intelligent prediction methods simply extract features from the gene data with a deep learning network. The present method instead first scales the gene coding data by the significance of each locus for the phenotype and only then extracts features with the deep learning network; adding the contribution of each gene to the phenotype in this way improves the accuracy of gene-based phenotype prediction.

Drawings

FIG. 1 is a flow chart of a method for intelligently predicting phenotypes based on gene significance enhancement in an embodiment of the invention.

FIG. 2 is a schematic diagram of the significance enhancement process in an embodiment of the invention.

FIG. 3 is a schematic structural diagram of a phenotype intelligent prediction device based on gene significance enhancement in an embodiment of the invention.

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

As shown in fig. 1, the intelligent phenotype prediction method based on gene significance enhancement comprises the following steps:

Step one: in the embodiment of the invention, the phenotype values of N gene samples are collected; the gene sequence corresponding to each gene sample has length K and consists of gene loci (single nucleotide polymorphisms, SNPs).

Step two: in the embodiment of the invention, the significance of each gene locus for the phenotype is calculated. First, the phenotype mean $\bar{y}$ of the N samples in the data set is calculated. According to the phenotype mean $\bar{y}$, the phenotype is divided into two classes, and the actual distribution table of the gene sample phenotypes under the three gene forms is obtained from the phenotype classes and the three gene forms. The hypothesis is made that gene morphology is not related to phenotype, which gives the expected distribution table of the gene sample phenotypes under the three gene forms. The chi-square statistic under the hypothesis is calculated from the actual and expected distribution tables of the samples, the probability value that the hypothesis holds is obtained by querying the chi-square table, and the significance value of the gene locus for the phenotype is calculated based on the probability value. As shown in fig. 2, this specifically comprises the following steps:

Step 2.1: let k be the gene locus whose significance value is to be calculated, and make the chi-square hypothesis $H_{0k}$: gene locus k has no significant relationship with phenotype y.

Step 2.2: calculate the average value of the N gene sample phenotypes, classify all the gene samples according to the average value, and, together with the multiple forms of the gene locus, construct an actual distribution table of gene forms and phenotype categories, giving the actual count $O_{mn}$ of each gene form under each phenotype category.

Specifically, the mean value $\bar{y}$ of the N gene sample phenotypes is calculated:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$

According to the mean value $\bar{y}$, all gene samples are divided into two classes: samples equal to or higher than the average form one class, HN samples in total; samples lower than the average form the other class, LN samples in total. The three forms of a gene locus are denoted AA, Aa and aa; deletions are not counted. Counting then gives the actual distribution table of gene morphology and phenotype:

$$\begin{array}{c|cc} \text{gene morphology} & \text{high phenotype} & \text{low phenotype} \\ \hline AA & O_{11} & O_{12} \\ Aa & O_{21} & O_{22} \\ aa & O_{31} & O_{32} \end{array}$$

$O_{11}$ represents the high-phenotype actual value for gene morphology AA, $O_{12}$ the low-phenotype actual value for gene morphology AA, $O_{21}$ the high-phenotype actual value for gene morphology Aa, $O_{22}$ the low-phenotype actual value for gene morphology Aa, $O_{31}$ the high-phenotype actual value for gene morphology aa, and $O_{32}$ the low-phenotype actual value for gene morphology aa.
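The counting in step 2.2 can be sketched as follows. The helper names and the sample data are hypothetical, and a missing genotype is represented here by `None`, which is an assumption about the input format rather than something the text specifies.

```python
FORMS = ("AA", "Aa", "aa")

def split_by_mean(phenotypes):
    """Label each sample 'high' (>= mean) or 'low' (< mean)."""
    mean = sum(phenotypes) / len(phenotypes)
    return ["high" if y >= mean else "low" for y in phenotypes]

def observed_table(genotypes, labels):
    """Count samples per gene form (rows) and phenotype class (columns)."""
    table = {form: [0, 0] for form in FORMS}
    for genotype, label in zip(genotypes, labels):
        if genotype not in table:          # deletion: not counted
            continue
        column = 0 if label == "high" else 1
        table[genotype][column] += 1
    return [table[form] for form in FORMS]

phenotypes = [5.0, 4.0, 1.0, 2.0, 6.0, 0.0]          # mean is 3.0
genotypes = ["AA", "Aa", "aa", "AA", None, "aa"]     # one locus, six samples
labels = split_by_mean(phenotypes)
observed = observed_table(genotypes, labels)
```

The fifth sample has a high phenotype but no genotype at this locus, so it contributes to HN without appearing in the table.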

Step 2.3: according to the chi-square hypothesis $H_{0k}$ that gene morphology and phenotype have no significant relationship, an expected distribution table of gene morphology and phenotype is constructed, giving the expected count $E_{mn}$ of each gene morphology under each phenotype category; the chi-square statistic is calculated, and the chi-square table is queried with it to obtain the probability $P_k$ that the chi-square hypothesis holds; based on the probability $P_k$, the significance values of all loci are calculated.

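Specifically, according to the chi-square hypothesis $H_{0k}$ that gene morphology has no significant relationship with phenotype, the ratio of high to low phenotype is theoretically the same for every gene morphology:

$$E_{m1} : E_{m2} = HN : LN$$

thus obtaining the expected distribution table of each gene morphology and phenotype:

$$\begin{array}{c|cc} \text{gene morphology} & \text{high phenotype} & \text{low phenotype} \\ \hline AA & E_{11} & E_{12} \\ Aa & E_{21} & E_{22} \\ aa & E_{31} & E_{32} \end{array}\qquad E_{m1} = (O_{m1}+O_{m2})\cdot\frac{HN}{N},\quad E_{m2} = (O_{m1}+O_{m2})\cdot\frac{LN}{N}$$

wherein $E_{11}$ represents the high-phenotype expected value for gene morphology AA, $E_{12}$ the low-phenotype expected value for gene morphology AA, $E_{21}$ the high-phenotype expected value for gene morphology Aa, $E_{22}$ the low-phenotype expected value for gene morphology Aa, $E_{31}$ the high-phenotype expected value for gene morphology aa, and $E_{32}$ the low-phenotype expected value for gene morphology aa.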

Calculating the chi-square statistic:

$$\chi_k^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{(O_{ij}-E_{ij})^2}{E_{ij}}$$

wherein m represents the number of gene locus forms and n represents the number of phenotype categories of the gene samples; the chi-square table is queried to obtain the probability $P_k$ that the chi-square hypothesis holds, and based on this probability $P_k$ the significance values of all loci are calculated.
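As a worked illustration of step 2.3, the following uses made-up counts (not data from the patent). It assumes the standard chi-square test of independence, in which each expected count is the row total multiplied by the column share, and it uses the fact that a 3 × 2 table has (3 − 1)(2 − 1) = 2 degrees of freedom, for which the tail probability is exactly $e^{-\chi^2/2}$:

$$
\begin{aligned}
&O = \begin{pmatrix} 30 & 10 \\ 15 & 15 \\ 5 & 25 \end{pmatrix},\qquad HN = LN = 50,\quad N = 100 \\
&E_{11} = E_{12} = 40\cdot\tfrac{50}{100} = 20,\qquad E_{21} = E_{22} = E_{31} = E_{32} = 30\cdot\tfrac{50}{100} = 15 \\
&\chi^2 = \frac{10^2}{20} + \frac{10^2}{20} + 0 + 0 + \frac{10^2}{15} + \frac{10^2}{15} = \frac{70}{3} \approx 23.33 \\
&P_k = e^{-\chi^2/2} \approx 8.6\times 10^{-6},\qquad -\log_{10} P_k \approx 5.07
\end{aligned}
$$

A locus with these counts would therefore have its code scaled by a factor of about 5, whereas a locus whose actual table matched the expected table exactly would have $\chi^2 = 0$, $P_k = 1$ and a scaling factor of 0.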

Step three: in the embodiment of the invention, the significance values of all K gene loci for the phenotype are calculated through step two, and each gene locus of each gene sample is one-hot encoded to obtain the gene locus code $x_k$, which gives the three gene forms equal weight. For example, the gene locus form AA is encoded as [1,0,0], Aa as [0,1,0], aa as [0,0,1], and a deletion as [0,0,0]. The encoded gene data is then scaled: the code $x_k$ of each gene locus is multiplied by the significance value $-\log_{10} P_k$ of the corresponding locus, giving the gene enhancement data corresponding to the gene sample:

$$X = \{\,x_k \cdot (-\log_{10} P_k)\,\},\quad k = 1, 2, \dots, K$$
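The encoding of a whole sample can be sketched as below, arranged as an m × K matrix to match the input layout used for training. The names are hypothetical, and representing a deletion as `None` is an assumption about the input format.

```python
ONE_HOT = {"AA": [1, 0, 0], "Aa": [0, 1, 0], "aa": [0, 0, 1]}
MISSING = [0, 0, 0]

def encode_sample(genotypes):
    """Return an m x K matrix (m = 3 forms, K = loci) for one sample."""
    per_locus = [ONE_HOT.get(g, MISSING) for g in genotypes]   # K rows of m
    return [list(row) for row in zip(*per_locus)]              # m rows of K

sample = ["AA", "aa", None, "Aa"]        # four loci, the third one missing
matrix = encode_sample(sample)
```

Each column of `matrix` is the one-hot code of one locus; the all-zero third column marks the deletion, which stays zero whatever significance value the locus is scaled by.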

Step four: the gene enhancement data set X is divided into a training set and a test set, and the training set is input into a neural network model for training. First, the amount of data input to the network each time (the batch size) is set, so that the input dimension is batch size × m × K, where m represents the number of gene locus forms and K represents the sequence length. The features of the gene enhancement data are extracted by the neural network and connected through a fully connected layer to output a predicted phenotype value. The real phenotype value and the predicted phenotype value are input into a loss network for loss calculation, the resulting loss value is back-propagated, and the corresponding parameters are updated. After repeated iterative updating the loss value converges and iteration stops, giving the trained model for predicting phenotype from gene enhancement data.

In the embodiment of the invention, a neural network model is constructed: a CNN and a fully connected neural network form a convolutional neural network for feature extraction, and L1 loss is used as the loss network of the model. The parameters of the whole neural network are initialized, including the condition for stopping iteration and the hyperparameters. The gene enhancement data set X obtained in step three is divided into a training set and a test set at a ratio of 7:3, and the training set is input into the deep learning network for training. First, the amount of data input to the network each time (the batch size) is set; the features of the gene enhancement data are extracted by the convolutional neural network and connected through a fully connected layer to output a predicted phenotype value. The real phenotype value and the predicted phenotype value are input into the loss network for loss calculation, the resulting loss value is back-propagated, and the corresponding network parameters are updated. One pass over all the data is recorded as one training iteration; the number of iterations may be set to, for example, 200 or 300, chosen mainly so that the loss value converges. Iteration stops once the loss value converges or the stop condition is reached, giving the trained model for predicting phenotype from gene enhancement data.
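The training procedure (7:3 split, mini-batches, L1 loss, repeated passes over the data) can be sketched with the standard library alone. To stay self-contained, a plain linear model stands in for the CNN and fully connected layers of the embodiment, and its parameters are updated with the subgradient of the L1 loss; this substitution, all names, the learning rate and the toy data are assumptions made for illustration only.

```python
import random

def predict(weights, bias, features):
    """Linear stand-in for the CNN + fully connected layers."""
    return sum(w * x for w, x in zip(weights, features)) + bias

def l1_loss(predictions, targets):
    """Mean absolute error between predicted and real phenotype values."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def train(features, targets, epochs=200, batch_size=4, lr=0.01, seed=0):
    rng = random.Random(seed)
    indices = list(range(len(features)))
    rng.shuffle(indices)
    cut = len(indices) * 7 // 10                   # 7:3 train/test split
    train_idx, test_idx = indices[:cut], indices[cut:]
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):                        # one epoch = one full pass
        rng.shuffle(train_idx)
        for start in range(0, len(train_idx), batch_size):
            batch = train_idx[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(weights), 0.0
            for i in batch:
                error = predict(weights, bias, features[i]) - targets[i]
                sign = (error > 0) - (error < 0)   # subgradient of |error|
                for j, x in enumerate(features[i]):
                    grad_w[j] += sign * x / len(batch)
                grad_b += sign / len(batch)
            weights = [w - lr * g for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b
    test_predictions = [predict(weights, bias, features[i]) for i in test_idx]
    test_loss = l1_loss(test_predictions, [targets[i] for i in test_idx])
    return weights, bias, test_loss

# Toy data: each row stands for one flattened, significance-scaled sample.
features = [[float(i % 3), float(i % 5)] for i in range(20)]
targets = [2.0 * a + b for a, b in features]
weights, bias, test_loss = train(features, targets)
```

In a real setting the linear model would be replaced by the convolutional network, and the fixed epoch count would be combined with a convergence check on the loss value.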

Step five: the phenotype of a gene sample is predicted with the trained model for predicting phenotype from gene enhancement data.

Corresponding to the foregoing embodiment of the phenotype intelligent prediction method based on gene significance enhancement, the invention also provides an embodiment of the phenotype intelligent prediction device based on gene significance enhancement.

Referring to fig. 3, the intelligent prediction apparatus for gene significance enhancement provided by the embodiment of the invention comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the intelligent prediction method for gene significance enhancement in the embodiment when executing the executable codes.

The embodiment of the phenotype intelligent prediction device based on gene significance enhancement can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented in software, in hardware, or in a combination of hardware and software. Taking software implementation as an example, the device in a logical sense is formed when the processor of the device with data processing capability on which it resides reads the corresponding computer program instructions from nonvolatile memory into memory and runs them. In terms of hardware, fig. 3 shows the hardware structure of the device with data processing capability on which the phenotype intelligent prediction device based on gene significance enhancement resides. In addition to the processor, memory, network interface and nonvolatile memory shown in fig. 3, that device generally includes other hardware according to its actual function, which is not described here again.

The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.

For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.

The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the phenotype intelligent prediction method based on gene significance enhancement in the above embodiment.

The computer readable storage medium may be an internal storage unit of any device with data processing capability described in any of the previous embodiments, such as a hard disk or a memory. It may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash memory card (Flash Card) provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.

The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims

1. The phenotype intelligent prediction training method based on gene significance enhancement is characterized by comprising the following steps of:

step two: calculating a phenotype average value from the phenotype values, classifying the phenotypes of the gene samples by the phenotype average value, and constructing an actual distribution table of gene morphology and phenotype category; obtaining an expected distribution table of gene morphology and phenotype category based on the hypothesis that gene morphology is uncorrelated with phenotype category; calculating the chi-square statistic from the actual distribution table and the expected distribution table of the samples, obtaining the probability value that the hypothesis holds by querying the chi-square table, and calculating the significance value of the gene locus for the phenotype based on the probability value;

2. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: the second step comprises the following steps:

step 2.3: according to the chi-square hypothesis $H_{0k}$ that gene morphology and phenotype have no significant relationship, constructing an expected distribution table of gene morphology and phenotype, giving the expected count $E_{mn}$ of each gene morphology under each phenotype category; calculating the chi-square statistic:

$$\chi_k^2 = \sum_{i=1}^{m}\sum_{j=1}^{n}\frac{(O_{ij}-E_{ij})^2}{E_{ij}},$$

3. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: in the step 2.2, the gene loci take three forms, AA, Aa and aa, and deletions are not counted.

4. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 3, characterized in that: in the step 2.2, all the gene samples are classified into two categories according to the mean value:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i,$$

wherein $\bar{y}$ represents the average value; samples equal to or greater than the average value form one class, HN samples in total; samples less than the average value form the other class, LN samples in total;

$$E_{m1} : E_{m2} = HN : LN,$$

$$E_{11} = (O_{11}+O_{12})\cdot\frac{HN}{N},$$

$$E_{12} = (O_{11}+O_{12})\cdot\frac{LN}{N},$$

$$E_{21} = (O_{21}+O_{22})\cdot\frac{HN}{N},$$

$$E_{22} = (O_{21}+O_{22})\cdot\frac{LN}{N},$$

$$E_{31} = (O_{31}+O_{32})\cdot\frac{HN}{N},$$

$$E_{32} = (O_{31}+O_{32})\cdot\frac{LN}{N},$$

wherein $O_{11}$, $E_{11}$ represent the high-phenotype actual and expected values for gene morphology AA; $O_{12}$, $E_{12}$ the low-phenotype actual and expected values for gene morphology AA; $O_{21}$, $E_{21}$ the high-phenotype actual and expected values for gene morphology Aa; $O_{22}$, $E_{22}$ the low-phenotype actual and expected values for gene morphology Aa; $O_{31}$, $E_{31}$ the high-phenotype actual and expected values for gene morphology aa; and $O_{32}$, $E_{32}$ the low-phenotype actual and expected values for gene morphology aa.

5. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: the gene enhancement data in the third step:

$$X = \{\,x_k \cdot (-\log_{10} P_k)\,\},\quad k = 1, 2, \dots, K,$$

6. The gene significance enhanced based phenotype intelligent predictive training method according to claim 1 or 5, wherein: the gene loci are encoded with one-hot encoding.

7. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: in the fourth step, the gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training; first, the amount of data input to the network each time is set as the batch size, so that the input dimension is batch size × m × K, where m represents the number of gene locus forms and K represents the sequence length; the features of the gene enhancement data are extracted by the neural network and connected through a fully connected layer to output a predicted phenotype value; the real phenotype value and the predicted phenotype value are input into a loss network for loss calculation, the resulting loss value is back-propagated, and the corresponding parameters are updated; after repeated iterative updating the loss value converges and iteration stops, giving the trained model for predicting phenotype from gene enhancement data.

8. A phenotype intelligent prediction training device based on gene significance enhancement, characterized in that: it comprises a memory and one or more processors, the memory having executable code stored therein, and the one or more processors, when executing the executable code, implement the gene significance enhancement based phenotype intelligent predictive training method of any of claims 1-7.

9. A phenotype intelligent prediction method based on gene significance enhancement, characterized in that: the phenotype of a gene sample is predicted with the model for predicting phenotype from gene enhancement data trained by the gene significance enhancement based phenotype intelligent predictive training method of claim 1.

10. A phenotype intelligent prediction device based on gene significance enhancement, characterized in that: it comprises a memory and one or more processors, the memory having executable code stored therein, and the one or more processors, when executing the executable code, implement the gene significance enhancement based phenotype intelligent prediction method of claim 9.