CN116072214A - Phenotype intelligent prediction and training method and device based on gene significance enhancement - Google Patents

Info

Publication number
CN116072214A
Authority
CN
China
Prior art keywords
gene
phenotype
value
morphology
significance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310202392.9A
Other languages
Chinese (zh)
Other versions
CN116072214B (en)
Inventor
应志文
章依依
徐晓刚
王军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202310202392.9A
Publication of CN116072214A
Application granted
Publication of CN116072214B
Priority to JP2024013809A (JP7490156B1)
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20 Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Genetics & Genomics (AREA)
  • Evolutionary Computation (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Data Mining & Analysis (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a phenotype intelligent prediction and training method and device based on gene significance enhancement. An actual distribution table is constructed from gene morphology and phenotype level (high or low), and an expected distribution table of gene morphology and phenotype level is constructed under the chi-square null hypothesis. A chi-square test is carried out for each gene locus against the phenotype, the probability that the null hypothesis holds is obtained from the chi-square distribution table, and the significance value of the gene locus for the phenotype is derived from that probability; the genes are encoded at the same time. The code of each gene locus is then amplified according to its significance value, which strengthens the association between the gene data and the phenotype and improves the accuracy of predicting the phenotype from gene loci. The invention targets organisms with diploid chromosomes and adopts a deep learning training method, improving prediction accuracy from gene loci to phenotype by enhancing the gene locus data.

Description

Phenotype intelligent prediction and training method and device based on gene significance enhancement
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a phenotype intelligent prediction and training method and device based on gene significance enhancement.
Background
In predicting phenotypes from genes, prediction with deep learning models has received wide attention and application. A current mainstream approach uses convolutional neural networks to extract convolutional features from gene data and thereby train a gene-to-phenotype prediction model. However, this approach ignores the fact that individual genes contribute to the phenotype to different degrees, which lowers the accuracy of phenotype prediction.
Disclosure of Invention
To overcome the shortcomings of the prior art and improve the accuracy of predicting phenotypes from genes, the invention adopts the following technical solution:
The phenotype intelligent prediction training method based on gene significance enhancement comprises the following steps:
Step one: obtain the phenotype values and corresponding gene sequences of gene samples, where each gene sequence comprises a group of gene loci;
Step two: calculate the phenotype mean from the phenotype values, classify the phenotypes by the mean, and construct an actual distribution table of gene morphology versus phenotype category; obtain an expected distribution table of gene morphology versus phenotype category under the hypothesis that gene morphology is unrelated to phenotype category; calculate the chi-square statistic under this hypothesis from the actual and expected distribution tables, obtain the probability that the hypothesis holds by looking up the chi-square distribution table, and calculate the significance value of the gene locus for the phenotype from this probability;
Step three: scale the encoded gene loci by their significance values to obtain the gene enhancement data corresponding to the gene samples;
Step four: construct a neural network model and train it for phenotype prediction on the gene enhancement data set, obtaining a trained model that predicts phenotype from gene enhancement data.
Step two comprises the following steps:
Step 2.1: let k be the gene locus whose significance value is to be calculated, and state the chi-square null hypothesis H_0k: gene locus k has no significant relationship with phenotype y;
Step 2.2: calculate the mean of the N gene sample phenotypes, classify all gene samples by the mean, and, combining the several forms of the gene locus, construct an actual distribution table of gene morphology versus phenotype category, giving the actual count O_mn of each gene morphology in each phenotype category;
Step 2.3: under the null hypothesis H_0k that gene morphology has no significant relationship with phenotype, construct an expected distribution table of gene morphology versus phenotype category, giving the expected count E_mn of each gene morphology in each phenotype category; calculate the chi-square statistic:
$$\chi_k^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where m is the number of gene locus forms and n is the number of gene sample categories; the chi-square distribution table is looked up with the chi-square statistic to obtain the probability P_k that the null hypothesis holds, and significance values are calculated for all loci from P_k.
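For the 3 × 2 table used here, the table lookup has a closed form. The note below is a sketch that assumes the "chi-square list" is the standard chi-square distribution table with (m-1)(n-1) degrees of freedom; the text itself does not state the degrees of freedom.

```latex
\nu = (m-1)(n-1) = (3-1)(2-1) = 2,
\qquad
P_k = \Pr\left(\chi^2_{\nu=2} \ge \chi^2_k\right) = e^{-\chi^2_k / 2}
```

For example, under this assumption a statistic of 9.21 gives P_k ≈ 0.01, so a significance value expressed as -log10 P_k (the form used in the embodiment) would be about 2.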
In step 2.2, the gene locus takes three forms, AA, Aa and aa; deletions (missing genotypes) are not counted.
In step 2.2, all gene samples are classified into two categories according to the mean:
$$c_i = \begin{cases} \text{high}, & y_i \ge \bar{y} \\ \text{low}, & y_i < \bar{y} \end{cases}$$
where $\bar{y}$ denotes the mean; samples at or above the mean form one class, HN strains in total; samples below the mean form the other class, LN strains in total;
in step 2.3, under the chi-square null hypothesis H_0k that gene morphology has no significant relationship with phenotype:
Figure SMS_4
thus obtaining the expected distribution of the morphology and phenotype of the gene:
Figure SMS_5
Figure SMS_6
Figure SMS_7
Figure SMS_8
Figure SMS_9
Figure SMS_10
where O_11 and E_11 denote the actual and expected values of the high phenotype for gene morphology AA; O_12 and E_12, those of the low phenotype for AA; O_21 and E_21, those of the high phenotype for Aa; O_22 and E_22, those of the low phenotype for Aa; O_31 and E_31, those of the high phenotype for aa; and O_32 and E_32, those of the low phenotype for aa.
The gene enhancement data in the third step:
Figure SMS_11
where x_k denotes the encoded gene locus and P_k denotes the significance value.
The gene loci are encoded with one-hot encoding.
In step four, the gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training. The amount of data fed to the network at a time is first set as the batch size, so the input dimension is batch size × m × K, where m is the number of gene locus forms and K is the sequence length. The neural network extracts features from the gene enhancement data, and a fully connected layer connects the features to output a predicted phenotype value. The true and predicted phenotype values are fed to the loss network to compute the loss, the loss is propagated back through the network, and the corresponding parameters are updated. After repeated iterative updates the loss converges, iteration stops, and the trained model for predicting phenotype from gene enhancement data is obtained.
The phenotype intelligent prediction training device based on the gene significance enhancement comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the phenotype intelligent prediction training method based on the gene significance enhancement when executing the executable codes.
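In the phenotype intelligent prediction method based on gene significance enhancement, the phenotype of a gene sample is predicted by the model for predicting phenotype from gene enhancement data that was trained with the above phenotype intelligent prediction training method based on gene significance enhancement.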
The phenotype intelligent prediction device based on gene significance enhancement comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the phenotype intelligent prediction method based on gene significance enhancement when executing the executable codes.
The invention has the advantages that:
With the phenotype intelligent prediction and training method and device based on gene significance enhancement, the significance value of each SNP locus is calculated by a chi-square test and used as the contribution of that gene locus to scale the gene coding data, and a deep learning neural network then extracts features from the scaled gene data. Existing intelligent prediction simply applies a deep learning network to the gene data for feature extraction; the present method instead first scales the gene coding data by each locus's significance value for the phenotype and then extracts features with the deep learning network, so that the contribution of each gene to the phenotype is taken into account and the accuracy of gene-based phenotype prediction is improved.
Drawings
FIG. 1 is a flow chart of a method for intelligently predicting phenotypes based on gene significance enhancement in an embodiment of the invention.
FIG. 2 is a schematic diagram of the significance enhancement process in an embodiment of the invention.
FIG. 3 is a schematic structural diagram of a phenotype intelligent prediction device based on gene significance enhancement in an embodiment of the invention.
Detailed Description
The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in fig. 1, the intelligent phenotype prediction method based on gene significance enhancement comprises the following steps:
step one: obtaining a phenotype value and a corresponding gene sequence of a gene sample, wherein the gene sequence comprises a group of gene loci;
In this embodiment, the phenotype values of N gene samples are collected; the gene sequence corresponding to each gene sample has length K and consists of gene loci (single nucleotide polymorphisms, SNPs).
Step two: calculate the phenotype mean from the phenotype values, classify the phenotypes by the mean, and construct an actual distribution table of gene morphology versus phenotype category; obtain an expected distribution table of gene morphology versus phenotype category under the hypothesis that gene morphology is unrelated to phenotype category; calculate the chi-square statistic under this hypothesis from the actual and expected distribution tables, obtain the probability that the hypothesis holds by looking up the chi-square distribution table, and calculate the significance value of the gene locus for the phenotype from this probability;
In this embodiment, the significance of each gene locus for the phenotype is calculated. First, the phenotype mean $\bar{y}$ of the N samples in the data set is calculated. The phenotypes are then divided into two classes according to $\bar{y}$, and the two phenotype classes together with the three gene morphologies give the actual distribution table of gene sample phenotypes under the three gene morphologies. The hypothesis is made that gene morphology is unrelated to phenotype, which gives the expected distribution table of gene sample phenotypes under the three gene morphologies. The chi-square statistic under this hypothesis is calculated from the actual and expected distribution tables, the probability that the hypothesis holds is obtained by looking up the chi-square distribution table, and the significance value of the gene locus for the phenotype is calculated from this probability. As shown in FIG. 2, this specifically comprises the following steps:
Step 2.1: let k be the gene locus whose significance value is to be calculated, and state the chi-square null hypothesis H_0k: gene locus k has no significant relationship with phenotype y.
Step 2.2: calculate the mean of the N gene sample phenotypes, classify all gene samples by the mean, and, combining the several forms of the gene locus, construct an actual distribution table of gene morphology versus phenotype category, giving the actual count O_mn of each gene morphology in each phenotype category.
Specifically, the mean of the N gene sample phenotypes is calculated:
$$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$$
According to the mean $\bar{y}$, all gene samples are divided into two classes: samples at or above the mean form one class, HN strains in total; samples below the mean form the other class, LN strains in total. The three morphologies of the gene locus are denoted AA, Aa and aa; deletions are not counted. Counting the samples gives the actual distribution table of gene morphology and phenotype:

Gene morphology    High phenotype    Low phenotype
AA                 O_11              O_12
Aa                 O_21              O_22
aa                 O_31              O_32

O_11 denotes the actual value of the high phenotype for gene morphology AA, O_12 the actual value of the low phenotype for AA, O_21 the actual value of the high phenotype for Aa, O_22 the actual value of the low phenotype for Aa, O_31 the actual value of the high phenotype for aa, and O_32 the actual value of the low phenotype for aa.
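The counting in step 2.2 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `observed_table`, the genotype strings, and the treatment of any other string as a deletion are assumptions made for the example. The mean is taken over all N samples, as the text describes.

```python
from statistics import mean

GENOTYPES = ("AA", "Aa", "aa")  # row order of the table; any other value is treated as a deletion

def observed_table(genotypes, phenotypes):
    """Return a 3x2 table of counts: rows AA/Aa/aa, columns high/low phenotype."""
    y_bar = mean(phenotypes)                  # mean over all N samples
    table = [[0, 0] for _ in GENOTYPES]
    for g, y in zip(genotypes, phenotypes):
        if g not in GENOTYPES:
            continue                          # deletion: not counted
        col = 0 if y >= y_bar else 1          # at or above the mean -> high class
        table[GENOTYPES.index(g)][col] += 1
    return table

# Hypothetical data for one locus: six samples, one with a missing genotype.
genos = ["AA", "AA", "Aa", "aa", "aa", "NA"]
phenos = [9.0, 8.0, 6.0, 2.0, 3.0, 8.0]
O = observed_table(genos, phenos)
print(O)
```

Each row of `O` corresponds to one gene morphology and each column to one phenotype class, matching O_11 to O_32 in the table above.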
Step 2.3: under the null hypothesis H_0k that gene morphology has no significant relationship with phenotype, construct an expected distribution table of gene morphology versus phenotype category, giving the expected count E_mn of each gene morphology in each phenotype category; calculate the chi-square statistic, look up the chi-square distribution table with it to obtain the probability P_k that the null hypothesis holds, and calculate significance values for all loci from P_k.
Specifically, under the null hypothesis H_0k that gene morphology has no significant relationship with phenotype, the following holds in theory:
Figure SMS_18
which gives the expected distribution table of each gene morphology and phenotype:

Gene morphology    High phenotype    Low phenotype
AA                 E_11              E_12
Aa                 E_21              E_22
aa                 E_31              E_32

where E_11 denotes the expected value of the high phenotype for gene morphology AA, E_12 the expected value of the low phenotype for AA, E_21 the expected value of the high phenotype for Aa, E_22 the expected value of the low phenotype for Aa, E_31 the expected value of the high phenotype for aa, and E_32 the expected value of the low phenotype for aa.
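The patent's own expressions for E_11 to E_32 are given in its figures and are not reproduced in this text. As an illustration only, the sketch below uses the textbook expected count for a test of independence (row total × column total / grand total); whether the patent normalizes by the table total or by N and HN/LN directly is an assumption left open here, and `expected_table` is a hypothetical helper name.

```python
def expected_table(observed):
    """Expected counts under independence: row_total * column_total / grand_total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

# Hypothetical actual distribution table: rows AA/Aa/aa, columns high/low.
O = [[30, 10], [20, 20], [10, 30]]
E = expected_table(O)
print(E)
```

With equal row and column totals, as in this example, every cell has the same expected count, so any imbalance in `O` shows up directly as a deviation from `E`.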
The chi-square statistic is calculated:
$$\chi_k^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where m is the number of gene locus forms and n is the number of gene sample categories. The chi-square distribution table is looked up to obtain the probability P_k that the null hypothesis holds, and significance values are calculated for all loci from P_k.
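A minimal sketch of the statistic and the lookup, using only the Python standard library. It assumes the standard Pearson test of independence with (rows-1)(columns-1) degrees of freedom and replaces the table lookup with the exact survival functions for 1 and 2 degrees of freedom; the function name and the handling of empty rows are illustrative choices, not taken from the patent.

```python
import math

def chi_square_test(observed):
    """Pearson chi-square statistic and p-value for a table with up to 3 rows and 2 columns."""
    rows = [row for row in observed if sum(row) > 0]    # ignore genotypes never observed
    col_totals = [sum(col) for col in zip(*rows)]
    total = sum(col_totals)
    if len(rows) < 2 or min(col_totals) == 0:
        return 0.0, 1.0                                 # degenerate table: no evidence against H0
    df = (len(rows) - 1) * (len(col_totals) - 1)
    if df not in (1, 2):
        raise ValueError("this sketch only handles 1 or 2 degrees of freedom")
    stat = 0.0
    for row in rows:
        r = sum(row)
        for o, c in zip(row, col_totals):
            e = r * c / total                           # expected count under independence
            stat += (o - e) ** 2 / e
    if df == 2:
        p = math.exp(-stat / 2)                         # exact survival function, df = 2
    else:
        p = math.erfc(math.sqrt(stat / 2))              # exact survival function, df = 1
    return stat, p

stat, p = chi_square_test([[30, 10], [20, 20], [10, 30]])
significance = -math.log10(p)                           # significance value used for scaling
print(stat, p, significance)
```

A locus whose genotype distribution differs strongly between the high and low classes gets a small P_k and therefore a large -log10 P_k.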
Step three: scale the encoded gene loci by their significance values to obtain the gene enhancement data corresponding to the gene samples;
In this embodiment, step two yields the significance values of all K gene loci for the phenotype, and each gene locus of each gene sample is one-hot encoded to obtain the gene locus code x_k, which keeps the weights of the three gene morphologies balanced. For example, gene locus morphology AA is encoded as [1,0,0], Aa as [0,1,0], aa as [0,0,1], and a deletion as [0,0,0]. The encoded gene data are then scaled: the code x_k of each gene locus is multiplied by the significance value of the corresponding gene, -log10 P_k, to obtain the gene enhancement data corresponding to the gene sample:
$$X = \{\, x_k \cdot (-\log_{10} P_k) \,\}, \quad k = 1, \dots, K$$
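The encoding and scaling of step three can be sketched as below. The names `ONE_HOT`, `MISSING` and `enhance` are hypothetical, and the lower clamp on P_k is an added guard against taking the logarithm of zero, not something the patent specifies.

```python
import math

ONE_HOT = {"AA": [1, 0, 0], "Aa": [0, 1, 0], "aa": [0, 0, 1]}
MISSING = [0, 0, 0]  # a deletion carries no signal

def enhance(sample, p_values):
    """Scale the one-hot code of each locus by its significance value -log10(P_k)."""
    enhanced = []
    for genotype, p in zip(sample, p_values):
        weight = -math.log10(max(p, 1e-300))   # clamp is an assumption to avoid log10(0)
        code = ONE_HOT.get(genotype, MISSING)
        enhanced.append([weight * bit for bit in code])
    return enhanced

# One hypothetical sample with K = 3 loci and the P_k of each locus.
x = enhance(["AA", "aa", "NA"], [0.01, 0.1, 0.5])
print(x)
```

The result has one length-m code per locus (K × m per sample); it may need transposing to match the batch size × m × K input layout described for the network.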
step four: and constructing a neural network model, and performing phenotype prediction training through the gene enhancement data set to obtain a trained model of the phenotype predicted by the gene enhancement data.
The gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training. The amount of data fed to the network at a time is first set as the batch size, so the input dimension is batch size × m × K, where m is the number of gene locus forms and K is the sequence length. The neural network extracts features from the gene enhancement data, and a fully connected layer connects the features to output a predicted phenotype value. The true and predicted phenotype values are fed to the loss network to compute the loss, the loss is propagated back through the network, and the corresponding parameters are updated. After repeated iterative updates the loss converges, iteration stops, and the trained model for predicting phenotype from gene enhancement data is obtained.
In this embodiment, a neural network model is constructed: a convolutional neural network (CNN) for feature extraction is combined with a fully connected neural network, and L1 loss is used as the loss network of the model. The parameters of the whole network are initialized, including the condition parameters for stopping iteration and the hyperparameters. The gene enhancement data set X obtained in step three is split 7:3 into a training set and a test set, and the training set is input into the deep learning network for training. The amount of data fed to the network at a time is first set as the batch size; the convolutional neural network extracts features from the gene enhancement data, and a fully connected layer connects the features to output a predicted phenotype value. The true and predicted phenotype values are fed to the loss network to compute the loss, the loss is propagated back through the network, and the corresponding network parameters are updated. One pass over all the data is recorded as one training iteration; the number of iterations may be set to, for example, 200 or 300, the main requirement being that the loss converges. Iteration stops once the loss has converged or the stopping condition is met, giving the trained model for predicting phenotype from gene enhancement data.
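The training procedure can be illustrated with a deliberately simplified sketch. To stay within the standard library, a single linear layer stands in for the CNN plus fully connected network, so this shows only the 7:3 split, mini-batching, L1 loss and iterative parameter updates, not the patent's architecture. The learning rate, seed and toy data are assumptions, and a fixed number of epochs replaces the convergence check.

```python
import random

def predict(w, b, row):
    """Linear stand-in for the CNN + fully connected network."""
    return sum(wi * xi for wi, xi in zip(w, row)) + b

def l1_loss(w, b, X, y, indices):
    """Mean absolute error over the selected samples."""
    return sum(abs(predict(w, b, X[i]) - y[i]) for i in indices) / len(indices)

def train(X, y, epochs=200, batch_size=4, lr=0.05, seed=0):
    rng = random.Random(seed)
    order = list(range(len(X)))
    rng.shuffle(order)
    cut = len(order) * 7 // 10                       # 7:3 train/test split
    train_idx, test_idx = order[:cut], order[cut:]
    w, b = [0.0] * len(X[0]), 0.0
    loss_before = l1_loss(w, b, X, y, test_idx)
    for _ in range(epochs):                          # one epoch = one pass over the training set
        rng.shuffle(train_idx)
        for start in range(0, len(train_idx), batch_size):
            batch = train_idx[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(w), 0.0
            for i in batch:
                err = predict(w, b, X[i]) - y[i]
                sign = (err > 0) - (err < 0)         # subgradient of |err|
                for j, xj in enumerate(X[i]):
                    grad_w[j] += sign * xj / len(batch)
                grad_b += sign / len(batch)
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b, loss_before, l1_loss(w, b, X, y, test_idx)

# Toy data: one flattened feature per sample and a linear phenotype.
X = [[i / 10] for i in range(1, 21)]
y = [3 * row[0] + 1 for row in X]
w, b, loss_before, loss_after = train(X, y)
print(loss_before, loss_after)
```

In a real setting the rows of `X` would be the flattened gene enhancement data, and the linear layer would be replaced by the convolutional and fully connected layers in a deep learning framework.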
Step five: predict the phenotype of a gene sample with the trained model for predicting phenotype from gene enhancement data.
Corresponding to the foregoing embodiments of the phenotype intelligent prediction method based on gene significance enhancement, the invention also provides embodiments of a phenotype intelligent prediction device based on gene significance enhancement.
Referring to fig. 3, the intelligent prediction apparatus for gene significance enhancement provided by the embodiment of the invention comprises a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for realizing the intelligent prediction method for gene significance enhancement in the embodiment when executing the executable codes.
The embodiment of the phenotype intelligent prediction device based on gene significance enhancement can be applied to any device with data processing capability, such as a computer. The device embodiment may be implemented in software, in hardware, or in a combination of hardware and software. Taking software implementation as an example, the device in the logical sense is formed when the processor of the device with data processing capability on which it resides reads the corresponding computer program instructions from nonvolatile memory into memory and runs them. In terms of hardware, FIG. 3 shows the hardware structure of the device with data processing capability on which the phenotype intelligent prediction device based on gene significance enhancement resides; in addition to the processor, memory, network interface and nonvolatile memory shown in FIG. 3, that device may also include other hardware according to its actual function, which is not described further here.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements the phenotype intelligent prediction method based on gene significance enhancement in the above embodiment.
The computer readable storage medium may be an internal storage unit of any device with data processing capability described in any of the previous embodiments, such as a hard disk or memory. It may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card or a flash card provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of the device. The computer readable storage medium is used for storing the computer program and other programs and data required by the device, and may also be used for temporarily storing data that has been output or is to be output.
The above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (10)

1. The phenotype intelligent prediction training method based on gene significance enhancement is characterized by comprising the following steps of:
Step one: obtain the phenotype values and corresponding gene sequences of gene samples, where each gene sequence comprises a group of gene loci;
Step two: calculate the phenotype mean from the phenotype values, classify the phenotypes by the mean, and construct an actual distribution table of gene morphology versus phenotype category; obtain an expected distribution table of gene morphology versus phenotype category under the hypothesis that gene morphology is unrelated to phenotype category; calculate the chi-square statistic from the actual and expected distribution tables, obtain the probability that the hypothesis holds by looking up the chi-square distribution table, and calculate the significance value of the gene locus for the phenotype from this probability;
Step three: scale the encoded gene loci by their significance values to obtain the gene enhancement data corresponding to the gene samples;
Step four: construct a neural network model and train it for phenotype prediction on the gene enhancement data set, obtaining a trained model that predicts phenotype from gene enhancement data.
2. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: the second step comprises the following steps:
step 2.1: assuming that the gene locus where the significance value needs to be calculated is k, a chi-square assumption H is made 0k : assuming that the gene locus k has no significant relationship with phenotype y;
step 2.2: calculating the average value of N gene sample phenotypes, classifying all the gene samples according to the average value, combining multiple forms of gene loci, and constructing an actual distribution list of gene forms and phenotype categories to obtain the actual distribution condition O of each gene form under different phenotype categories mn
Step 2.3: according to chi-square hypothesis H 0k The gene morphology and the phenotype have no significant relation, and an expected distribution list of the gene morphology and the phenotype is constructed to obtain an expected distribution condition E of each gene morphology under different phenotype categories mn The method comprises the steps of carrying out a first treatment on the surface of the Calculating chi-square statistics:
Figure QLYQS_1
wherein m represents the number of gene locus forms and n represents the number of gene sample categories; the chi-square distribution table is queried with the chi-square statistic to obtain the probability P_k that the chi-square hypothesis holds, and significance values are calculated for all gene loci based on the probability P_k.
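A sketch of the statistic in step 2.3, assuming Figure QLYQS_1 denotes the standard Pearson chi-square statistic for an m × n contingency table. The index names i and j, the symbol N_k (the number of samples with a non-missing form at locus k), the independence rule for E_ij and the degrees-of-freedom expression are textbook conventions, not text taken from the claim:

```latex
\chi_k^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
E_{ij} = \frac{\left(\sum_{j'=1}^{n} O_{ij'}\right)\left(\sum_{i'=1}^{m} O_{i'j}\right)}{N_k},
\qquad
\mathrm{df} = (m - 1)(n - 1)
```

Under these assumptions, P_k is the upper-tail probability of a chi-square distribution with df degrees of freedom evaluated at the computed statistic, which is what a lookup in the chi-square table provides.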
3. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: in the step 2.2, the gene locus takes three forms, AA, Aa and aa, and missing values are not counted.
4. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 3, wherein: in the step 2.2, all the gene samples are classified into two categories according to the mean value:
Figure QLYQS_2
wherein
Figure QLYQS_3
represents the mean value; samples whose phenotype value is greater than or equal to the mean form one class, HN samples in total; samples whose phenotype value is less than the mean form the other class, LN samples in total;
in the step 2.3, according to the chi-square hypothesis H_0k, gene morphology has no significant relationship with phenotype, so that:
Figure QLYQS_4
thus obtaining the expected distribution of the morphology and phenotype of the gene:
Figure QLYQS_5
Figure QLYQS_6
Figure QLYQS_7
Figure QLYQS_8
Figure QLYQS_9
Figure QLYQS_10
wherein O_11 and E_11 represent the actual and expected values of the high phenotype for gene form AA; O_12 and E_12 represent the actual and expected values of the low phenotype for gene form AA; O_21 and E_21 represent the actual and expected values of the high phenotype for gene form Aa; O_22 and E_22 represent the actual and expected values of the low phenotype for gene form Aa; O_31 and E_31 represent the actual and expected values of the high phenotype for gene form aa; and O_32 and E_32 represent the actual and expected values of the low phenotype for gene form aa.
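Steps 2.1 to 2.3, specialised to the three gene forms and two phenotype classes of claims 3 and 4, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `chi_square_locus` and the sample data are hypothetical, the expected counts use the standard rule row total × column total / total for independence, and the chi-square table lookup is replaced by the closed form p = exp(-chi2 / 2), which is valid only when the table has 2 degrees of freedom (all three forms present and both classes non-empty).

```python
import math
from collections import Counter

GENOTYPES = ("AA", "Aa", "aa")  # forms named in claim 3; anything else counts as missing
CLASSES = ("H", "L")            # H: phenotype >= mean, L: phenotype < mean


def chi_square_locus(genotypes, phenotypes):
    """Return (chi2, p_value) for one locus against a mean-split phenotype.

    Assumes every form and both classes occur at least once, so that
    df = (3 - 1) * (2 - 1) = 2 and no expected count is zero.
    """
    # The mean is taken over all N samples, as in step 2.2.
    mean_y = sum(phenotypes) / len(phenotypes)

    # Actual distribution table O; samples with a missing form are skipped.
    observed = Counter()
    for g, y in zip(genotypes, phenotypes):
        if g in GENOTYPES:
            observed[(g, "H" if y >= mean_y else "L")] += 1

    total = sum(observed.values())
    row = {g: sum(observed[(g, c)] for c in CLASSES) for g in GENOTYPES}
    col = {c: sum(observed[(g, c)] for g in GENOTYPES) for c in CLASSES}

    # Expected table E under independence, accumulated into the statistic.
    chi2 = 0.0
    for g in GENOTYPES:
        for c in CLASSES:
            expected = row[g] * col[c] / total
            chi2 += (observed[(g, c)] - expected) ** 2 / expected

    # Upper-tail probability of a chi-square variable with df = 2.
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


# Hypothetical data: AA samples are all high, aa samples all low, Aa is split.
genotypes = ["AA"] * 4 + ["Aa"] * 4 + ["aa"] * 4
phenotypes = [10, 10, 10, 10, 10, 10, 0, 0, 0, 0, 0, 0]
chi2, p_value = chi_square_locus(genotypes, phenotypes)
print(chi2, p_value)
```

For this toy data every expected count is 2, so the statistic is 8 and the probability is exp(-4), roughly 0.018, which would mark the locus as strongly associated with the phenotype.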
5. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: the gene enhancement data in the third step:
Figure QLYQS_11
wherein x_k represents the encoded gene locus and P_k represents the significance value.
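The scaling in step three can be sketched as an element-wise multiplication of each encoded locus by its significance value. The exact formula of Figure QLYQS_11 is not available in this text, so both helpers below are hypothetical; in particular, mapping the chi-square probability to a significance value as 1 - p is an assumption made only for illustration.

```python
def significance_from_p(p_values):
    """Assumed mapping (not from the claim): a small p gives a significance near 1."""
    return [1.0 - p for p in p_values]


def enhance(encoded_loci, significance):
    """Scale each encoded locus vector x_k by the significance value of locus k."""
    if len(encoded_loci) != len(significance):
        raise ValueError("one significance value is needed per locus")
    return [[s * v for v in x_k] for x_k, s in zip(encoded_loci, significance)]


# Two loci, already one-hot encoded over (AA, Aa, aa).
encoded = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
weights = significance_from_p([0.25, 0.5])
enhanced = enhance(encoded, weights)
print(enhanced)
```

With this choice, loci whose association with the phenotype is weak contribute smaller input magnitudes to the network, while the position of the non-zero entry still identifies the gene form.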
6. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1 or 5, wherein: the gene loci are encoded using one-hot encoding.
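A minimal sketch of one-hot encoding for the three gene forms of claim 3. The function names are hypothetical, and encoding a missing call as an all-zero vector is an assumption; the claims only state that missing values are not counted in the chi-square table.

```python
GENOTYPES = ("AA", "Aa", "aa")


def one_hot(genotype):
    """Return a length-3 indicator vector; unknown or missing calls map to all zeros."""
    return [1.0 if genotype == g else 0.0 for g in GENOTYPES]


def encode_sample(sample):
    """Encode a sequence of K loci as a K x 3 list of one-hot vectors."""
    return [one_hot(g) for g in sample]


encoded = encode_sample(["AA", "aa", "NN"])
print(encoded)
```

Each sample then has shape K × m with m = 3, which can be transposed or flattened to match whatever input layout the network expects.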
7. The gene significance enhanced-based phenotype intelligent predictive training method according to claim 1, wherein: in the step four, the gene enhancement data set X is divided into a training set and a test set, and the training set is input into the neural network model for training; first, the amount of data fed to the network at a time is set as the batch size, so that the input dimension is batch size × m × K, where m represents the number of gene locus forms and K represents the sequence length; features of the gene enhancement data are extracted by the neural network and connected through a fully connected layer to output a predicted phenotype value; the actual phenotype value and the predicted phenotype value are input into a loss network for loss calculation; the resulting loss value is back-propagated and the corresponding parameters are updated; after repeated iterative updates the loss value converges, iteration is stopped, and the trained model that predicts phenotype from gene enhancement data is obtained.
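The training loop of step four can be illustrated with a deliberately small stand-in. The claim does not specify the network architecture or the loss, so this sketch assumes a single fully connected layer on the flattened m × K input, mean squared error as the loss, and hand-written mini-batch gradient descent in place of a deep-learning framework. All function names, hyperparameters and data are hypothetical.

```python
import random


def predict(w, b, x):
    """Fully connected layer with one output: predicted phenotype value."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def mse(w, b, X, y):
    """Mean squared error between actual and predicted phenotype values."""
    return sum((predict(w, b, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(X)


def train(X, y, batch_size=4, lr=0.1, max_epochs=500, tol=1e-8, seed=0):
    """Mini-batch gradient descent; stops once the loss change falls below tol."""
    rng = random.Random(seed)
    w, b = [0.0] * len(X[0]), 0.0
    prev_loss = float("inf")
    for _ in range(max_epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(w), 0.0
            for i in batch:
                err = predict(w, b, X[i]) - y[i]
                for j, v in enumerate(X[i]):
                    grad_w[j] += 2.0 * err * v / len(batch)
                grad_b += 2.0 * err / len(batch)
            # Parameter update from the gradient of the batch loss.
            w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b
        loss = mse(w, b, X, y)
        if abs(prev_loss - loss) < tol:  # convergence check
            break
        prev_loss = loss
    return w, b, loss


# One locus (K = 1, m = 3), one-hot vectors scaled by a significance of 0.9.
X = [[0.9, 0.0, 0.0], [0.0, 0.9, 0.0], [0.0, 0.0, 0.9]] * 4
y = [10.0, 5.0, 0.0] * 4
w, b, final_loss = train(X, y)
print(final_loss, predict(w, b, [0.9, 0.0, 0.0]))
```

In practice the feature extractor would be a deeper network trained with an automatic-differentiation library, and the held-out test set mentioned in the claim would be used to check the trained model; the control flow (batching, loss calculation, parameter update, stopping on convergence) stays the same.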
8. A phenotype intelligent prediction training device based on gene significance enhancement, characterized by comprising a memory and one or more processors, the memory having executable code stored therein, wherein the one or more processors, when executing the executable code, implement the gene significance enhanced-based phenotype intelligent predictive training method of any of claims 1-7.
9. A phenotype intelligent prediction method based on gene significance enhancement, characterized by comprising: predicting the phenotype of a gene sample using the model that predicts phenotype from gene enhancement data, the model being trained by the gene significance enhanced-based phenotype intelligent predictive training method of claim 1.
10. A phenotype intelligent prediction device based on gene significance enhancement, characterized by comprising a memory and one or more processors, the memory having executable code stored therein, wherein the one or more processors, when executing the executable code, implement the gene significance enhancement based phenotype intelligent prediction method of claim 9.
CN202310202392.9A 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement Active CN116072214B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310202392.9A CN116072214B (en) 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement
JP2024013809A JP7490156B1 (en) 2023-03-06 2024-02-01 Intelligent phenotype prediction based on gene significance enhancement, training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310202392.9A CN116072214B (en) 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Publications (2)

Publication Number Publication Date
CN116072214A true CN116072214A (en) 2023-05-05
CN116072214B CN116072214B (en) 2023-07-11

Family

ID=86182149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310202392.9A Active CN116072214B (en) 2023-03-06 2023-03-06 Phenotype intelligent prediction and training method and device based on gene significance enhancement

Country Status (2)

Country Link
JP (1) JP7490156B1 (en)
CN (1) CN116072214B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060195269A1 (en) * 2004-02-25 2006-08-31 Yeatman Timothy J Methods and systems for predicting cancer outcome
CN101210266A (en) * 2006-12-30 2008-07-02 苏州市长三角***生物交叉科学研究院有限公司 Measuring method for relativity of interaction and genetic character between genome genetic markers
GB201408687D0 (en) * 2014-05-16 2014-07-02 Univ Leuven Kath Method for predicting a phenotype from a genotype
CN105936907A (en) * 2016-04-27 2016-09-14 湖南杂交水稻研究中心 Seed breeding method for reducing cadmium content in rice grains
CN108256293A (en) * 2018-02-09 2018-07-06 哈尔滨工业大学深圳研究生院 A kind of statistical method and system of the disease association assortment of genes
CN108959848A (en) * 2018-05-30 2018-12-07 广州普世医学科技有限公司 Based on genetic mutation and the matched hereditary disease forecasting system of disease phenotype auto-associating
CN109182538A (en) * 2018-09-29 2019-01-11 南京农业大学 Mastadenitis of cow key SNPs site rs88640083 and 2b-RAD Genotyping and analysis method
CN110400597A (en) * 2018-04-23 2019-11-01 成都二十三魔方生物科技有限公司 A kind of genetype for predicting method based on deep learning
US20200135296A1 (en) * 2018-10-31 2020-04-30 Ancestry.Com Dna, Llc Estimation of phenotypes using dna, pedigree, and historical data
CN113502293A (en) * 2021-08-25 2021-10-15 湖南工业大学 Camellia oleifera self-incompatible associated gene, SNP molecular marker and application
CN114373547A (en) * 2022-01-11 2022-04-19 平安科技(深圳)有限公司 Method and system for predicting disease risk
CN115148278A (en) * 2022-07-21 2022-10-04 平安科技(深圳)有限公司 Training method and device of gene sequencing model, electronic equipment and storage medium
CN115331732A (en) * 2022-10-11 2022-11-11 之江实验室 Gene phenotype training and predicting method and device based on graph neural network
CN115547408A (en) * 2022-07-15 2022-12-30 宋炜宸 Method and equipment for predicting individual phenotype based on human whole genome genotype
CN115691661A (en) * 2022-09-26 2023-02-03 之江实验室 Gene coding breeding prediction method and device based on graph clustering

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2017242028A1 (en) 2016-03-29 2018-09-06 Regeneron Pharmaceuticals, Inc. Genetic variant-phenotype analysis system and methods of use
GB201912331D0 (en) 2019-08-28 2019-10-09 Genomics Plc Computer-implemented method and apparatus for analysing genentic data
US20220301658A1 (en) 2021-03-19 2022-09-22 X Development Llc Machine learning driven gene discovery and gene editing in plants

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUNDAY O. PETERS et al.: "Genomic Prediction With Different Heritability, QTL, and SNP Panel Scenarios Using Artificial Neural Network", 《IEEE ACCESS》, vol. 8, pages 147995, XP011806045, DOI: 10.1093/jas/skz258.532 *
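张霄帅 (Zhang Xiaoshuai): "Research on network-structure-driven biomarker screening and disease prediction models" (网络结构驱动的生物标记筛选及疾病预测模型研究), 《China Doctoral Dissertations Full-text Database, Medicine and Health Sciences》, vol. 2016, no. 9, pages 055 - 3 *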

Also Published As

Publication number Publication date
JP7490156B1 (en) 2024-05-24
CN116072214B (en) 2023-07-11

Similar Documents

Publication Publication Date Title
Cui et al. A multi-objective particle swarm optimization algorithm based on two-archive mechanism
Sanchez et al. Deep learning for population size history inference: Design, comparison and combination with approximate Bayesian computation
CN111832101B (en) Construction method of cement strength prediction model and cement strength prediction method
CN111898689B (en) Image classification method based on neural network architecture search
Anderson Large-scale parentage inference with SNPs: an efficient algorithm for statistical confidence of parent pair allocations
CN111428853B (en) Negative sample countermeasure generation method with noise learning function
CN110766044A (en) Neural network training method based on Gaussian process prior guidance
CN114118369B (en) Image classification convolutional neural network design method based on group intelligent optimization
CN111462157B (en) Infrared image segmentation method based on genetic optimization threshold method
WO2023124342A1 (en) Low-cost automatic neural architecture search method for image classification
CN111008224A (en) Time sequence classification and retrieval method based on deep multitask representation learning
CN114496069A (en) Method for predicting off-target of CRISPR/Cas9 system based on Transformer architecture
CN104809229B (en) A kind of text feature word extracting method and system
CN116401555A (en) Method, system and storage medium for constructing double-cell recognition model
CN109842614B (en) Network intrusion detection method based on data mining
CN116072214B (en) Phenotype intelligent prediction and training method and device based on gene significance enhancement
CN113762591A (en) Short-term electric quantity prediction method and system based on GRU and multi-core SVM counterstudy
CN108509764A (en) A kind of extinct plants and animal pedigree evolution analysis method based on genetic property yojan
CN110705704A (en) Neural network self-organizing genetic evolution algorithm based on correlation analysis
CN115908909A (en) Evolutionary neural architecture searching method and system based on Bayes convolutional neural network
CN115063374A (en) Model training method, face image quality scoring method, electronic device and storage medium
CN110533186B (en) Method, device, equipment and readable storage medium for evaluating crowdsourcing pricing system
CN103218543B (en) A kind of method and system distinguishing protein coding gene and Noncoding gene
CN117422428B (en) Automatic examination and approval method and system for robot based on artificial intelligence
Shen et al. A simple yet strong approach for installation prediction in sharechat ads

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant