CN115273973A - QTL sample processing, model training and identifying method, device and equipment - Google Patents

Publication number
CN115273973A
Authority
CN
China
Prior art keywords
qtl
snp
training
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210790511.2A
Other languages
Chinese (zh)
Inventor
李林
李昭
陈晓轩
李伟夫
陈洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong Agricultural University
Original Assignee
Huazhong Agricultural University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong Agricultural University
Priority to CN202210790511.2A
Publication of CN115273973A

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The application relates to the technical field of QTL identification, and in particular to methods, devices and equipment for QTL sample processing, model training and identification. In the QTL sample processing method, DNA mixed pools are constructed from a QTL positioning population ranked by the phenotype of the target trait; the pools are sequenced and SNPs are identified to obtain the SNP data and sample marks of the pools, which are then used to train an RU-net (residual U-net) model, yielding a QTL identification model. The identification model can effectively identify the maize plant height QTL, the rice flowering period QTL, the Wuchang fish intermuscular spine QTL and the like; the identified results have lower deviation and signal-to-noise ratio relative to the ΔSNP-index, ED4, G', SmoothLOD and Ridit algorithms, and minor-effect loci with a phenotype interpretation rate (phenotypic variance explained) as low as 5% can be effectively identified.

Description

QTL sample processing, model training and identifying method, device and equipment
Technical Field
The application relates to the technical field of QTL identification, in particular to QTL sample processing, model training and identifying methods, devices and equipment.
Background
QTL is the abbreviation of quantitative trait locus (plural: loci), which refers to the position in the genome of a gene that controls a quantitative trait. Positioning (mapping) a QTL requires genetic markers: by looking for an association between a genetic marker and the quantitative trait of interest, one or more QTLs are located next to a marker on the same chromosome; in other words, the marker and the QTL are linked.
Bulked segregant analysis (mixed-pool sequencing analysis), proposed by Michelmore et al. in 1991, has proved to be an effective method for QTL analysis. To date, a variety of algorithms have been developed for mixed-pool sequencing, such as the ΔSNP-index method based on the difference in allele frequency between the high and low pools (Takagi et al., 2013), the ED4 method based on Euclidean distance (Hill et al., 2013), the G' method based on the G value (Magwene et al., 2011), the SmoothLOD method based on LOD values (Zhang et al., 2019), and the Ridit method based on a non-parametric test (Wang et al., 2019). Of these, only Ridit analysis can be applied to three or more mixed pools; the other algorithms apply only to the analysis of two mixed pools. In addition, all of these methods are designed from theoretical knowledge to detect target trait loci, and they have difficulty detecting minor-effect loci for complex traits and complex backgrounds.
Disclosure of Invention
To address the difficulty that existing algorithms have in detecting minor-effect loci, the inventors of the application have developed a sample processing method applied to QTL identification model training, a model training method, a QTL identification method, and corresponding devices and equipment. In the sample processing method, the individuals of a QTL positioning population are ranked and grouped and their DNA is mixed to obtain DNA mixed pools; SNP data and sample marks are obtained from the pools and used to construct training samples for training a model. The resulting QTL identification model characterizes the response relationship between the SNP data of the QTL positioning population and the QTL of that population. The trained model can effectively identify the maize plant height QTL, the rice flowering period QTL, the Wuchang fish intermuscular spine QTL and the like; its identification results have lower deviation and signal-to-noise ratio relative to the ΔSNP-index, ED4, G', SmoothLOD and Ridit algorithms, and minor-effect loci with a phenotype interpretation rate of 5% can be effectively identified.
To this end, in a first aspect, an embodiment of the present application discloses a sample processing method applied to QTL recognition model training, which includes:
constructing a QTL positioning group;
ordering and grouping individuals in the QTL positioning population according to the phenotype of the QTL positioning population to obtain a plurality of ordered groups;
mixing the individual DNA samples in each group to obtain a plurality of ordered DNA mixed pools;
carrying out fragmentation, sequencing, SNP identification and calculation on the plurality of DNA mixed pools respectively to obtain SNP data of the plurality of mixed pools;
and marking the SNP data respectively to obtain the SNP data and sample marks thereof, which are used as samples for training the QTL recognition model.
In certain embodiments, the SNP data includes SNP position information and SNP frequency information, the SNP frequency being the frequency with which the SNP occurs among the fragments of each DNA pool, and the method of marking the SNP data comprises:
SNP data whose SNP frequency continuously increases or continuously decreases across the plurality of ordered DNA pools is marked as 1, and otherwise as 0.
In certain embodiments, the sample processing method comprises the step of filtering the SNP data to remove low-quality SNP sites; the low-quality SNP locus has at least one of the following characteristics:
the read number of the corresponding DNA mixed pool sequencing is lower than a threshold value;
the SNP frequencies in the DNA mixed pools deviate significantly from 0.5 in the same direction; and
the difference in allele frequency between a SNP site and an adjacent SNP is greater than 0.1.
In a second aspect, the embodiment of the application discloses a training method applied to a QTL recognition model, which includes:
obtaining a training set, wherein the training set comprises a plurality of training samples obtained according to the sample processing method of any one of claims 1 to, each training sample comprises SNP data and sample marks of the SNP data, and the SNP data is obtained by sequencing a QTL positioning group mixed pool sample;
and inputting the two-dimensional tensor formed from the marked samples into the input layer of a residual U-net model, iteratively updating the weights of each layer by back-propagation with stochastic gradient descent according to the loss value of the forward pass, and stopping training when the loss value of the model converges, to obtain the QTL identification model.
In certain embodiments, the trained QTL identification model characterizes a response relationship between SNP data for the QTL localization population and the QTL of the localization population.
In some embodiments, the residual U-net model consists of an input layer, an encoder, a decoder and an output layer. Each input to the input layer is a two-dimensional tensor composed of 64 marked sites. The encoder consists of a convolutional neural network with a residual network and obtains deep information through four down-sampling steps; the deep information is converted back into shallow information through four up-sampling steps; finally, the output passes through a convolutional layer with a 1 × 1 kernel and a sigmoid activation function. The encoder and the decoder are built from three convolutional layers with kernel sizes of 1 × 1, 3 × 3 and 3 × 3 respectively; down-sampling is performed by max pooling, and each up-convolution is followed by one convolutional layer. A residual connection is applied after every three convolutions throughout the network, and there are also residual connections between corresponding layers of the encoder and the decoder.
In a third aspect, the embodiment of the application discloses a QTL identification method, which includes:
obtaining a QTL identification model obtained by the training method in the second aspect;
inputting sample data to be detected into the QTL identification model to obtain output information, wherein the output information comprises position information of SNP loci on genes and confidence of the SNP loci belonging to QTL intervals;
and identifying a QTL interval according to the position information and the confidence coefficient.
In certain embodiments, the identification method is used for at least one of the following:
identifying the maize plant height QTL;
identifying the rice plant height QTL;
identifying the rice flowering period QTL; and
identifying the Wuchang fish intermuscular spine QTL.
In a fourth aspect, an embodiment of the present application discloses a QTL identification apparatus, including:
the acquisition module is used for acquiring a training set, wherein the training set comprises SNP data and sample marks of the SNP data, and the SNP data is obtained by sequencing a QTL positioning group mixed pool sample;
the residual U-Net model is formed by respectively adding residual connection into an encoding part and a decoding part on the basis of the overall structure of the U-Net, wherein the encoding part is used for extracting SNP data with low resolution, and the decoding part is used for extracting SNP data with high resolution;
and the training module is used for training the residual U-net model with the training set to obtain the trained residual U-net model.
In a fifth aspect, an embodiment of the present application discloses an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the method of the second aspect when executing the computer program.
Drawings
Fig. 1 is a schematic flow chart illustrating an implementation of a sample processing method for QTL recognition model training provided in an embodiment of the present application.
Fig. 2 is a schematic flow chart illustrating an implementation process of a training method applied to a QTL recognition model according to an embodiment of the present application.
Fig. 3 is a schematic flow chart illustrating an implementation process of a QTL identification method according to an embodiment of the present application.
Fig. 4 is a schematic view illustrating an implementation of a specific method for identifying a QTL interval of a plant height of corn by using a sample processing method, a QTL model training method, and a QTL identification method according to an embodiment of the present application.
FIG. 5 is a training set sample for QTL model training provided by an embodiment of the present application.
FIG. 6 is a signal diagram of the deviation and SNR calculations performed using the simulation data of the present application, with the ordinate being the recognition value given to each SNP by the trained residual U-net model and the abscissa being the SNP position.
FIG. 7 is a diagram of the results of maize plant height QTL interval identification provided in an embodiment of the present application.
FIG. 8 compares the QTL identification method provided in an embodiment of the present application with different model methods for identifying maize plant height QTL intervals; the left graph compares the deviation of the recognition results, and the right graph compares their signal-to-noise ratio.
FIG. 9 shows the data of a sample to be tested for rice plant height QTL identification provided by an embodiment of the application.
FIG. 10 compares the QTL identification method provided in an embodiment of the present application with different model methods for identifying rice plant height QTL intervals; the left graph compares the deviation of the recognition results, and the right graph compares their signal-to-noise ratio.
FIG. 11 shows the data of a sample to be tested for rice flowering period QTL identification provided by an embodiment of the present application.
FIG. 12 compares the QTL identification method provided in an embodiment of the present application with different model methods for identifying rice flowering period QTL intervals; the left graph compares the deviation of the recognition results, and the right graph compares their signal-to-noise ratio.
FIG. 13 shows the data of a sample to be tested for Wuchang fish intermuscular spine QTL identification provided by an embodiment of the present application.
FIG. 14 compares the QTL identification method provided in an embodiment of the present application with different model methods for identifying Wuchang fish intermuscular spine QTL intervals; the left graph compares the deviation of the recognition results, and the right graph compares their signal-to-noise ratio.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of and not restrictive on the broad application. Reagents not individually specified in detail in this application are conventional and commercially available; methods not specifically described in detail are all routine experimental methods and are known from the prior art.
It should be noted that the terms "first", "second", and the like in the description and claims of the present invention and in the drawings are used for distinguishing similar objects, and do not necessarily have to be used for describing a specific order or sequence or have a substantial limitation on technical features thereafter. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Recently, methods based on convolutional neural networks have been applied to fast magnetic resonance imaging. Such a method uses a large amount of prior information to train a convolutional neural network and obtain optimized network parameters, and the trained network can quickly reconstruct a high-quality MRI image, so it is a fast MRI imaging method with great application potential. The residual U-net is one kind of convolutional neural network; it has a relatively simple structure, fewer training parameters and a shorter training time, and it alleviates the performance degradation that deep convolutional neural networks suffer at extreme depth. At present, existing applications of the residual U-net mainly focus on image recognition, which depends on the convolution result of each layer and on feature extraction; for applications in other fields, processing the target object and extracting its features is difficult.
In order to overcome the defects of the prior art, the embodiments of the application disclose a sample processing method applied to QTL identification model training and a training method for the QTL identification model. The trained QTL identification model characterizes the response relationship between the SNP data of the QTL positioning population and the QTL of that population. For example, the response relationship includes the following: a QTL of the positioning population corresponds to a run of consecutive SNP data on the genome of the positioning population whose sample marks are 1. For instance, where the sample mark of the SNP data at a site is 1 and the sample marks of the SNP data 5 to 10 bp before and after that site are also all 1, the region between them forms an identified QTL interval.
Fig. 1 shows a sample processing method for QTL recognition model training according to an embodiment of the present application. A sample obtained by this method includes SNP data and sample marks thereof; a two-dimensional tensor is constructed from the marked samples and is suitable for training a residual U-net model. For example, the sample processing method includes:
S100, constructing a QTL positioning population;
S200, ranking and grouping the individuals in the QTL positioning population according to the phenotype of the QTL positioning population to obtain a plurality of ordered groups;
S300, mixing the individual DNA samples in each group to obtain a plurality of ordered DNA mixed pools;
S400, respectively carrying out fragmentation, sequencing, SNP identification and calculation on the plurality of DNA mixed pools to obtain a plurality of SNP data;
S500, marking the SNP data respectively to obtain the SNP data and sample marks thereof, which are used as samples for training the QTL recognition model.
In some embodiments of S100, construction of the QTL positioning population comprises: selecting two parents that differ markedly in the target trait, carrying out one or more generations of crossing, selfing or backcrossing, and constructing a population such as F2, RIL or NIL, which is the positioning population. Obtaining the DNA samples comprises: planting the constructed positioning population in the field; when the plants reach the six-leaf stage, taking about 0.1 g of fresh leaf from each individual plant with a puncher, extracting leaf DNA by the Trizol method, and measuring the DNA concentration with a spectrophotometer.
In some embodiments of S200, the phenotype of the trait of interest is measured for each individual, and all individuals are ranked according to phenotype (if the population consists of different families, then individuals within the family are ranked separately). For example, when the trait of interest is plant height, the individual plant heights in the population can be ranked from high to low.
In some embodiments of S300, all individuals are divided evenly into N groups (N between 2 and 11), taking into account the range of phenotypic variation and the experimental cost. The DNA of the individuals in each group is mixed in equal amounts to obtain N DNA pools.
In some embodiments of S400, the DNA from each DNA pool is fragmented by sonication, and a sequencing library with inserts of 400-500 bp is constructed according to the KAPA Hyper Prep Kit protocol. The DNA library of each pool was then loaded into one lane using the Illumina Hiseq2500 system and sequenced using Illumina Hiseq Xten, generating 150 bp paired-end reads.
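The ranking and grouping of S200 and S300 can be sketched in a few lines of NumPy. This is a minimal illustration, not the applicants' implementation: the function name `assign_pools`, its arguments, and the use of `np.array_split` to handle group sizes that do not divide evenly are all assumptions.

```python
import numpy as np

def assign_pools(phenotypes, n_pools, families=None):
    """Rank individuals by phenotype and split them into n_pools ordered groups.

    Returns one pool index per individual (0 = lowest phenotype values).
    If `families` is given, ranking and splitting are done within each family.
    """
    phenotypes = np.asarray(phenotypes, dtype=float)
    if families is None:
        families = np.zeros(len(phenotypes), dtype=int)
    families = np.asarray(families)
    pools = np.empty(len(phenotypes), dtype=int)
    for fam in np.unique(families):
        idx = np.flatnonzero(families == fam)
        # indices of this family's members, ordered from low to high phenotype
        order = idx[np.argsort(phenotypes[idx], kind="stable")]
        # array_split yields near-equal groups when the size is not divisible
        for pool_id, members in enumerate(np.array_split(order, n_pools)):
            pools[members] = pool_id
    return pools
```

Individuals sharing a pool index would then have their DNA mixed in equal amounts to form that pool.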
In some embodiments of S500, the SNP data includes SNP position information and SNP frequency information, the SNP frequency being the frequency with which the SNP occurs among the sequenced fragments of each DNA pool (this is the calculation step of S400). The SNP data is marked as follows: SNP data whose SNP frequency continuously increases or continuously decreases across the plurality of ordered DNA pools is marked as 1; otherwise it is marked as 0. For example, the information in all pools of 128 consecutive SNP sites around a SNP site is examined to judge whether the region is a target locus interval, and the middle 64 sites are marked; that is, the surrounding information is used to decide whether the current sites are positive or negative. SNPs whose frequency change follows the increasing or decreasing trend are marked as positive samples (1), and those that do not are marked as negative samples (0).
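The marking rule can be illustrated with a small helper. It is a sketch under stated assumptions: "continuously increasing or decreasing" is read here as strictly monotonic across the ordered pools, and the function name `label_snp` and the optional `min_total_change` parameter are hypothetical. Real pool frequencies are noisy, so a practical rule would probably need some tolerance.

```python
import numpy as np

def label_snp(freqs, min_total_change=0.0):
    """Return 1 if the allele frequency rises or falls steadily across the
    ordered pools (lowest to highest phenotype), otherwise 0."""
    freqs = np.asarray(freqs, dtype=float)
    diffs = np.diff(freqs)
    increasing = bool(np.all(diffs > 0))
    decreasing = bool(np.all(diffs < 0))
    # optional guard (an assumption): require a minimum overall change
    big_enough = abs(freqs[-1] - freqs[0]) >= min_total_change
    return int((increasing or decreasing) and big_enough)
```

For instance, frequencies of 0.2, 0.3, 0.45 and 0.6 across four ordered pools give 1, while frequencies hovering around 0.5 give 0.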
In some embodiments, the sample processing method comprises the step of filtering the SNP data to remove low-quality SNP sites; the low-quality SNP locus has at least one of the following characteristics:
the read number of the corresponding DNA mixed pool sequencing is lower than a threshold value; this threshold is typically 1/3 of the sequencing depth;
the SNP frequencies in the DNA mixed pools deviate significantly from 0.5 in the same direction; and
the difference in allele frequency between a SNP site and an adjacent SNP is greater than 0.1.
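The three filtering criteria can be sketched as a boolean mask over SNPs. This is only an illustration with several assumptions: the significance test for same-direction deviation is replaced by a simple fixed margin (`skew_margin`, hypothetical), the "adjacent SNP" comparison uses the mean frequency across pools and flags a SNP only when it differs from both neighbours, and all names are invented for the example.

```python
import numpy as np

def low_quality_mask(depth, freq, min_reads, max_neighbor_diff=0.1, skew_margin=0.2):
    """depth, freq: arrays of shape (n_snps, n_pools).
    Returns a boolean array that is True for SNPs to be filtered out."""
    depth = np.asarray(depth)
    freq = np.asarray(freq, dtype=float)
    # 1) some pool has fewer reads at this site than the threshold
    low_depth = (depth < min_reads).any(axis=1)
    # 2) every pool is skewed away from 0.5 in the same direction
    #    (a fixed margin stands in for the significance test)
    skewed = ((freq > 0.5 + skew_margin).all(axis=1)
              | (freq < 0.5 - skew_margin).all(axis=1))
    # 3) mean allele frequency differs from both neighbours by more than the limit
    mean_freq = freq.mean(axis=1)
    jump = np.zeros(len(mean_freq), dtype=bool)
    if len(mean_freq) > 1:
        prev_diff = np.abs(mean_freq - np.roll(mean_freq, 1))
        next_diff = np.abs(mean_freq - np.roll(mean_freq, -1))
        prev_diff[0] = next_diff[0]      # first SNP has only a right neighbour
        next_diff[-1] = prev_diff[-1]    # last SNP has only a left neighbour
        jump = (prev_diff > max_neighbor_diff) & (next_diff > max_neighbor_diff)
    return low_depth | skewed | jump
```

Rows where the mask is False would be kept, for example with `freq[~mask]`.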
In order to overcome the defects of the prior art, an embodiment of the present application discloses a training method applied to a QTL recognition model; fig. 2 is a schematic flow chart of the implementation of this training method. The method in this embodiment can be executed by an electronic device. Electronic devices include, but are not limited to, computers, tablets, servers, cell phones, cameras, wearable devices, or the like. The server includes, but is not limited to, a standalone server or a cloud server. As shown in FIG. 2, the method for training the residual U-net model for identifying the maize plant height QTL can comprise steps S101 to S102. A model trained by this method can effectively detect minor-effect QTL with an interpretation rate as low as 5% for complex traits and complex backgrounds.
The QTL model training method comprises the following steps:
S101, obtaining a training set, wherein the training set comprises a plurality of training samples, each training sample comprises SNP data and a sample mark of the SNP data, and the SNP data is obtained by sequencing mixed-pool samples of a QTL positioning population; the sample marking method is as follows: SNP data whose SNP frequency continuously increases or continuously decreases across the plurality of ordered DNA mixed pools is marked as 1, and otherwise as 0;
S102, inputting the two-dimensional tensor formed from the marked samples into the input layer of a residual U-net model, iteratively updating the weights of each layer by back-propagation with stochastic gradient descent according to the loss value of the forward pass, and stopping training when the loss value of the model converges, to obtain the QTL recognition model.
In some embodiments of S102, the residual U-net model consists of an input layer, an encoder, a decoder and an output layer. Each input to the input layer is a two-dimensional tensor composed of 64 marked sites from the training set. The encoder consists of a convolutional neural network with a residual network and obtains deep information through four down-sampling steps; the deep information is converted back into shallow information through four up-sampling steps; finally, the output passes through a convolutional layer with a 1 × 1 kernel and a sigmoid activation function. The encoder and the decoder are built from three convolutional layers with kernel sizes of 1 × 1, 3 × 3 and 3 × 3 respectively; down-sampling is performed by max pooling, and each up-convolution is followed by one convolutional layer. A residual connection is applied after every three convolutions throughout the network, and there are also residual connections between corresponding layers of the encoder and the decoder.
In some embodiments, "training" includes pre-training and fine-tuning; during fine-tuning, the last 10 layers of the model are trained and the parameters of the other layers are frozen.
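To make the four down-sampling and four up-sampling steps concrete, the following sketch traces the length of the site axis through the network. It assumes that each max-pooling step halves the 64-site axis and each up-convolution doubles it; the pooling factor and the axis along which pooling is applied are not stated in the text, so both are assumptions, as is the function name.

```python
def unet_site_lengths(n_sites=64, depth=4, pool=2):
    """Return (encoder_lengths, decoder_lengths) of the site axis for a U-net
    with `depth` pooling steps and `depth` up-sampling steps."""
    encoder = [n_sites]
    for _ in range(depth):
        if encoder[-1] % pool:
            raise ValueError("site axis is not divisible by the pooling factor")
        encoder.append(encoder[-1] // pool)   # max pooling shrinks the axis
    # each up-convolution restores one level of resolution
    decoder = [encoder[-1] * pool ** k for k in range(1, depth + 1)]
    return encoder, decoder
```

Under these assumptions the encoder goes 64, 32, 16, 8, 4 and the decoder returns 8, 16, 32, 64, so each decoder stage has a same-length encoder stage to connect to and the output again covers 64 sites.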
In some embodiments, the SNP data in the training set is n × 64 × m, where n represents the batch size, 64 represents 64 sites, and m represents frequency information for m pools.
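A batch with the n × 64 × m layout can be assembled from a per-SNP frequency matrix as sketched below. The helper name `make_windows` and the non-overlapping default stride are assumptions made for illustration; the text does not say how windows are positioned along the genome.

```python
import numpy as np

def make_windows(freq, window=64, stride=64):
    """freq: array of shape (n_snps, m_pools), SNPs in genomic order.
    Returns an array of shape (n_windows, window, m_pools)."""
    freq = np.asarray(freq, dtype=np.float32)
    if freq.shape[0] < window:
        raise ValueError("fewer SNPs than one window")
    starts = range(0, freq.shape[0] - window + 1, stride)
    return np.stack([freq[s:s + window] for s in starts])
```

With 200 SNPs and 10 pools this yields a tensor of shape (3, 64, 10); trailing SNPs that do not fill a whole window are dropped in this sketch.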
Fig. 3 is a schematic flow chart illustrating an implementation of a QTL identification method according to an embodiment of the present application. The method in this embodiment may be performed by an electronic device. Electronic devices include, but are not limited to, computers, tablets, servers, cell phones, cameras, wearable devices, or the like. The server includes, but is not limited to, a standalone server or a cloud server. As shown in fig. 3, the QTL identification method includes steps S101 to S102, and further includes:
S103: inputting sample data to be detected into the QTL identification model to obtain output information, wherein the output information comprises position information of SNP sites on a gene and the confidence that each SNP site belongs to a QTL interval;
S104: identifying the QTL interval according to the position information and the confidence.
In some embodiments, the position information of the SNP sites on the gene and the confidence of the SNP sites belonging to the QTL interval are tensors of 64 × 1, and the segmentation effect is achieved by identifying each SNP site.
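One simple way to turn per-SNP confidences into intervals is to report runs of consecutive SNPs whose confidence exceeds a cut-off. The sketch below is an assumption about how S104 could be carried out, not the rule used in the application; the threshold of 0.5, the minimum run length and the function name are all hypothetical.

```python
def call_intervals(positions, confidences, threshold=0.5, min_snps=2):
    """Return (start, end) position pairs for runs of at least `min_snps`
    consecutive SNPs whose confidence is >= threshold."""
    intervals = []
    start, last, count = None, None, 0
    for pos, conf in zip(positions, confidences):
        if conf >= threshold:
            if start is None:          # a new run begins here
                start, count = pos, 0
            count += 1
            last = pos
        else:
            if start is not None and count >= min_snps:
                intervals.append((start, last))
            start = None
    if start is not None and count >= min_snps:   # run reaching the last SNP
        intervals.append((start, last))
    return intervals
```

Positions are assumed to be sorted along one chromosome; separate chromosomes would be processed separately.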
In some embodiments, in order to make the recognition result more effective, all the points are subjected to a cubic kernel regression smoothing process, which can effectively improve the signal-to-noise ratio and reduce the influence of noise.
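Kernel regression smoothing of the per-SNP values can be sketched as a Nadaraya-Watson estimator. The text does not define the "cubic kernel"; the tricube kernel used here, the bandwidth parameter and the function name are assumptions chosen for illustration.

```python
import numpy as np

def kernel_smooth(positions, values, bandwidth):
    """Nadaraya-Watson smoothing with a tricube kernel.
    Builds an n x n weight matrix, so it suits one chromosome or a
    subsample rather than a whole genome at once."""
    x = np.asarray(positions, dtype=float)
    y = np.asarray(values, dtype=float)
    u = np.abs(x[:, None] - x[None, :]) / bandwidth   # scaled pairwise distances
    w = np.where(u < 1.0, (1.0 - u ** 3) ** 3, 0.0)   # tricube weights, 0 outside the bandwidth
    return (w @ y) / w.sum(axis=1)                    # weighted average around each SNP
```

Each SNP has weight 1 on itself, so the denominator is never zero; a larger bandwidth gives a smoother curve at the cost of flatter peaks.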
In order to further detail the QTL interval recognition of the corn plant height by using the sample processing method, the QTL model training method, and the QTL recognition method, fig. 4 shows a schematic diagram of an implementation process of the embodiment, which is specifically as follows:
FIG. 4a illustrates the implementation of S100 to S300 of the sample processing method for QTL recognition model training. In the embodiment of this step, the classical maize inbred lines Huangzaosi and 1462 are selected; Huangzaosi, which has the shorter plant height, is used as the female parent, and 1462, which has the taller plant height, as the male parent. They are crossed to obtain F1, and F1 is selfed to obtain F2; the F2 population is the positioning population. In the example of this step, the target trait to be mapped is maize plant height. The F2 population was sown in a field in Beijing, giving a total of 7160 individual plants from 47 families with a wide variation in plant height. At the six-leaf stage, about 0.1 g of fresh leaf was taken from each individual plant with a puncher, leaf DNA was extracted by the Trizol method, and the DNA concentration was measured with a spectrophotometer, giving the DNA samples for the target trait. Within each family, the individual plant heights were sorted from low to high and divided into 10 equal parts, the lowest 10% of plant heights being the first part, the 10% to 20% range the second part, and so on.
FIG. 4b illustrates the implementation process of steps S400-S500 of the sample processing method for QTL recognition model training.
In the example of this step, the DNA in the DNA mixing pool was fragmented by sonication according to KAPA Hyper Prep Kit (KaPA)
Figure BDA0003733720630000121
Platform) to construct a sequencing library with inserts of 400-500 bp in size. The DNA library of each pool was then loaded into one lane using the Illumina Hiseq2500 system and sequenced using Illumina Hiseq Xten, generating a 150bp double-ended read. Each timeThe sequencing data volume of each mixed pool is 200Gb, and the coverage depth is 100X.
In an embodiment of this step, reads generated by sequencing are aligned onto the reference genome of version B73V 4 using BWA software, and the generated sam file is converted into a bam file using SAMtools software. The bam file contents are then sorted using Picard software and the reads that are repeatedly generated by the PCR are deleted. Then, the HaplotypeCaller module of the GATK software is used for detecting genome-wide SNP, and all parameters are software default parameters. Finally, a VCF file containing ten pool variation information is generated.
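Once the multi-sample VCF exists, the per-pool allele frequency of a SNP can be derived from the allelic depth (AD) field that GATK writes for each sample. The parser below is a minimal sketch: it assumes biallelic sites, defines the frequency as alternate reads over reference plus alternate reads, and uses a hypothetical function name. The application may define the frequency relative to a parental allele instead.

```python
def pool_allele_freqs(vcf_line):
    """Parse one VCF data line and return (chrom, pos, freqs), where freqs
    holds one alternate-allele frequency per pool, or None if missing."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    ad_idx = fields[8].split(":").index("AD")    # position of AD in FORMAT
    freqs = []
    for sample in fields[9:]:
        parts = sample.split(":")
        try:
            ref, alt = (int(v) for v in parts[ad_idx].split(",")[:2])
        except (IndexError, ValueError):         # missing call such as "./.:.:."
            freqs.append(None)
            continue
        total = ref + alt
        freqs.append(alt / total if total else None)
    return chrom, pos, freqs
```

Applying this to every data line (those not starting with `#`) gives the SNP positions and the pool-by-pool frequencies that the marking and filtering steps operate on.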
In order to filter low-quality SNP sites, in an embodiment of this step, S400 further includes filtering out low-quality SNP sites. Low-quality SNP sites include a corresponding DNA pool with a read number below a threshold (typically set to 1/3 of the sequencing depth), a significant co-directional deviation of SNP frequencies in the DNA pool of 0.5 (P < 0.01), and an allele frequency difference between a certain SNP site and an adjacent SNP of greater than 0.1.
In the embodiment of this step, positive and negative samples are labeled, and the training set and validation set are divided. The VCF file obtained above is combined with the positions of some known plant height genes, and a number of positions across the whole genome are selected for labeling positive and negative samples. At each selected position, the frequencies of 128 consecutive SNP sites in the ten pools are examined. If the allele frequency increases or decreases across the pools, the middle 64 sites are labeled as a positive sample (1); if the allele frequency fluctuates around 0.5, the middle 64 sites are labeled as a negative sample (0). FIG. 4b shows, for a given SNP site, the frequency in each pool of a sample labeled 1 (Relevant) and of a sample labeled 0 (Irrelevant). The labeled SNP data were divided into training, validation and test sets according to 6. FIG. 5 is an example of the SNP data of training set samples, showing in sequence SNP sites at different positions on chromosome 1 and the allele frequencies of those sites in the 10 pools.
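A rule-based approximation of this labeling step is sketched below. The source describes the labeling as an inspection of the frequency trend, so the concrete rule used here (the window-averaged pool frequencies must be monotonic and span at least `min_span`) is an assumption, as are the function name `label_window` and the `min_span` parameter.

```python
import numpy as np

def label_window(window, min_span=0.1):
    """window: array of shape (128, n_pools), pools ordered by phenotype.

    Returns the middle 64 sites and a label: 1 if the pool frequencies rise
    or fall steadily across the ordered pools, otherwise 0.
    """
    window = np.asarray(window, dtype=float)
    if window.shape[0] != 128:
        raise ValueError("expected 128 consecutive SNP sites")
    pool_means = window.mean(axis=0)            # one average frequency per pool
    steps = np.diff(pool_means)
    monotonic = bool(np.all(steps >= 0) or np.all(steps <= 0))
    span = abs(pool_means[-1] - pool_means[0])  # total change, lowest to highest pool
    label = 1 if (monotonic and span >= min_span) else 0
    start = (128 - 64) // 2
    return window[start:start + 64], label
```

A window whose pool frequencies rise steadily from 0.2 to 0.8 is labeled 1, while one that alternates between 0.48 and 0.52 is labeled 0.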
FIG. 4c shows a schematic diagram of training the constructed RU-net with the training set and validating it with the validation set. The left panel of FIG. 4c is a schematic diagram of the RU-net structure provided in this embodiment, and the right panel is the result of validating the trained model on the validation set formed from the 10 DNA pools. The result is expressed as an AUC, which shows that the identification by this detection method is highly reliable.
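For readers unfamiliar with the AUC metric mentioned above, a minimal sketch of how it can be computed from validation labels and model confidences follows. It uses the pairwise (Mann-Whitney) definition of the area under the ROC curve; the function name `roc_auc` is hypothetical, and the source does not say which software was used to compute its AUC.

```python
import numpy as np

def roc_auc(labels, scores):
    """AUC as the probability that a positive site outscores a negative one.

    Ties between a positive and a negative score count as half a win.
    """
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    n_pos = int(labels.sum())
    n_neg = int((~labels).sum())
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    pos = scores[labels][:, None]               # column of positive scores
    neg = scores[~labels][None, :]              # row of negative scores
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (n_pos * n_neg)
```

An AUC of 1.0 means every positive site received a higher confidence than every negative site, and 0.5 corresponds to random ranking.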
FIG. 4d shows the result of recognition using the trained RU-net model. In this example, SNPs across the entire genome were scanned with the trained RU-net model to obtain the confidence that each SNP is associated with the phenotype. As shown in FIG. 7, the SNP position is plotted on the abscissa and the SNP confidence on the ordinate, and a curve is fitted with LOWESS; the peaks of the fitted curve are the identified target sites. As shown on the right side of FIG. 7, QTL intervals are identified on chromosomes 1, 2, 3 and 6.
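The smoothing-and-peak step can be sketched as follows. This is a simplified LOWESS (tricube-weighted local linear regression without the robustness iterations of the full algorithm), written with NumPy only; the function name `lowess_fit` and the `frac` default are assumptions, and the source does not specify its LOWESS settings.

```python
import numpy as np

def lowess_fit(x, y, frac=0.3):
    """Smooth y over x with tricube-weighted local linear fits (no robust pass)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(frac * n)))          # neighbors used per local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        bandwidth = np.sort(dist)[min(k, n) - 1]
        if bandwidth == 0:                      # all neighbors share one position
            fitted[i] = y[i]
            continue
        w = np.clip(1 - (dist / bandwidth) ** 3, 0, None) ** 3
        sw = np.sqrt(w)
        design = np.stack([np.ones(n), x - x[i]], axis=1)
        coef, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]                     # local line evaluated at x[i]
    return fitted

# Example: the position with the highest smoothed confidence is a candidate site.
positions = np.linspace(0, 100, 101)
confidence = np.exp(-((positions - 60) / 10) ** 2)
smoothed = lowess_fit(positions, confidence, frac=0.1)
peak_position = positions[np.argmax(smoothed)]
```

In practice the positions would be SNP coordinates on one chromosome and the confidences would come from the model output; a QTL interval could then be read off around each peak.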
In addition, 10 QTL loci are randomly placed on the genome and the generation of a maize F2 population is simulated. Mixed-pool sequencing is performed as described above to obtain a VCF file containing the variant information of the ten pools. Because some of the existing methods can only be applied to two mixed pools, the highest pool and the lowest pool are selected for SNP filtering and labeling, and the G', ED4, K, smoothLOD, Ridit and SNP-index methods are used for identification. The identified QTL signals are plotted together with the true QTL signals (as shown in FIG. 6), and the deviation and signal-to-noise ratio are calculated. The differences in deviation and signal-to-noise ratio between these methods and the RU-net model (deep learning model, DL) of the embodiment of the present application are summarized in FIG. 8. Taking deviation (left panel) and signal-to-noise ratio (right panel) together, the RU-net model gives the best identification result, which demonstrates the accuracy of the RU-net model training method provided by the present application in identifying maize QTL information.
In one embodiment, the sample processing method, the QTL model training method and the QTL recognition method are further illustrated by identifying QTL intervals for rice plant height. Referring to the implementation process shown in FIGS. 1 to 5, the QTL positioning population for rice plant height is processed by the sample processing method described above to obtain the sample data to be tested (as shown in FIG. 9), which is input to the QTL recognition model obtained by training. The recognition result of the RU-net model (deep learning model, DL) is optimal in terms of deviation and signal-to-noise ratio. The output is shown in FIG. 10: FIG. 10a is the LOWESS fitted curve, and FIG. 10b is the identified QTL interval result. This shows that the RU-net model training method and the RU-net model provided by the present application can effectively identify rice QTL information with high accuracy.
In one embodiment, referring to the implementation process shown in FIGS. 1 to 5, the sample processing method is applied to a QTL positioning population for the rice flowering stage. Individuals in the positioning population are grouped in order of flowering time to obtain a plurality of DNA mixed pools, which are subjected to fragmentation, sequencing, SNP identification and calculation to obtain a plurality of SNP data. The SNP data are labeled to obtain the sample data to be tested (shown in FIG. 11), which is input to the QTL identification model obtained by training. The output is shown in FIG. 12: FIG. 12a is the LOWESS fitted curve, and FIG. 12b is the identified QTL interval result. This shows that the RU-net model training method and the RU-net model provided by the present application can effectively identify rice QTL information with high accuracy.
In one embodiment, referring to the implementation process shown in FIGS. 1 to 5, the sample processing method is applied to a QTL positioning population for intermuscular bones (presence or absence of intermuscular bones) in Wuchang fish (Megalobrama amblycephala). Individuals in the positioning population are grouped according to the phenotype to obtain 2 DNA mixed pools, which are subjected to fragmentation, sequencing, SNP identification and calculation to obtain a plurality of SNP data. The SNP data are used as the sample data to be tested (shown in FIG. 13) and input to the QTL identification model obtained by training. The output is shown in FIG. 14: FIG. 14a is the LOWESS fitted curve, and FIG. 14b is the identified QTL interval result. This shows that the RU-net model training method and the RU-net model provided by the present application can effectively identify Wuchang fish QTL information with high accuracy.
Accordingly, an embodiment of the present application also provides a QTL identification device. The device comprises an acquisition module, a construction module, a training module and an output module. The acquisition module is used for acquiring a training set; the training set comprises SNP data and sample marks of the SNP data, and the SNP data is obtained by sequencing mixed-pool samples of a QTL positioning population. The construction module is used for constructing a residual U-net model, which is formed by adding residual connections to the encoding part and the decoding part on the basis of the overall U-Net structure, wherein the encoding part is used for extracting low-resolution features of the SNP data and the decoding part is used for extracting high-resolution features of the SNP data. The training module is used for training the residual U-net model with the training set to obtain the trained residual U-net model. The output module is used for identifying QTL information in a given DNA sequence using the trained residual U-net model.
Embodiments of the present application also provide an electronic device, which may include one or more processors (only one of which is shown), a memory, and a computer program stored in the memory and executable on the one or more processors, for example a program implementing the sample processing, model training or QTL identification method described above. The one or more processors, when executing the computer program, may implement the steps in the above method embodiments. Alternatively, the one or more processors may implement the functions of each module/unit in the above device embodiment when executing the computer program, and the present application is not limited thereto.
Those skilled in the art will appreciate that the electronic devices of the present application may include more or fewer components than those shown, or some of the components may be combined, or different components, e.g., the electronic devices may also include input-output devices, network access devices, buses, etc.
In one embodiment, the Processor may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In one embodiment, the memory may be an internal storage unit of the electronic device, such as a hard disk or a memory of the electronic device. The memory may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the electronic device. Further, the memory may include both an internal storage unit and an external storage device of the electronic device. The memory is used for storing the computer program and other programs and data required by the electronic device. The memory may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/electronic device and method may be implemented in other ways. For example, the apparatus/electronic device embodiments described above are merely illustrative: the division into modules or units is only a division by logical function, and other divisions are possible in actual implementation. For example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of the present application may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium can include any entity or device capable of carrying computer program code, recording media, USB flash drives, removable hard disks, magnetic disks, optical disks, computer memory, read-only memory (ROM), random access memory (RAM), electrical carrier signals, telecommunications signals, software distribution media, and the like. It should be noted that the content of the computer-readable medium may be appropriately extended or restricted according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals or telecommunications signals in accordance with legislation and patent practice.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application.

Claims (10)

1. A sample processing method applied to QTL recognition model training comprises the following steps:
constructing a QTL positioning group;
ordering and grouping individuals in the QTL positioning population according to the phenotype of the QTL positioning population to obtain a plurality of ordered groups;
mixing the individual DNA samples in each group to obtain a plurality of ordered DNA mixed pools;
carrying out fragmentation, sequencing, SNP identification and calculation on the plurality of DNA mixed pools respectively to obtain SNP data of the plurality of mixed pools;
and marking the SNP data respectively to obtain the SNP data and sample marks thereof to be used as samples for training the QTL identification model.
2. The sample processing method according to claim 1, wherein the SNP data includes SNP position information and SNP frequency information, the SNP frequency being the frequency of occurrence of the SNP in the fragments of each DNA mixed pool, and the method of labeling the SNP data includes:
labeling SNP data whose SNP frequency continuously increases or decreases across the plurality of ordered DNA mixed pools as 1, and otherwise labeling it as 0.
3. The sample processing method of claim 2, comprising the step of filtering the SNP data to remove low-quality SNP sites; the low-quality SNP locus has at least one of the following characteristics:
the read number of the corresponding DNA mixed pool sequencing is lower than a threshold value;
the SNP frequency in the DNA mixing pool is obviously deviated from 0.5 in the same direction; and
the difference in allele frequency between a SNP site and an adjacent SNP is greater than 0.1.
4. A training method applied to a QTL recognition model, comprising:
obtaining a training set, wherein the training set comprises a plurality of training samples obtained according to the sample processing method of any one of claims 1 to, each training sample comprises SNP data and sample marks of the SNP data, and the SNP data is obtained by sequencing a QTL positioning group mixed pool sample;
and inputting the two-dimensional tensor formed by the sample marks into a residual U-net model as an input layer, iteratively updating the weights of each layer by a back-propagation algorithm and stochastic gradient descent according to the loss value of forward propagation, and stopping training when the loss value of the model converges, to obtain the QTL identification model.
5. The training method of claim 4, wherein the trained QTL identification model characterizes the response relationship between SNP data of the QTL location population and the QTL of the location population.
6. The training method of claim 5, wherein the residual U-net model consists of an input layer, an encoder, a decoder and an output layer; each input of the input layer is a two-dimensional tensor formed by 64 sites of the sample marks; the encoder is formed by a convolutional neural network and a residual network and obtains deep information through four down-sampling operations; the decoder converts the deep information into shallow information through four up-sampling operations; the output layer is a convolutional layer formed by a 1 × 1 convolution kernel and a sigmoid activation function; the encoder and the decoder are composed of three convolutional layers whose convolution kernels are 1 × 1, 3 × 3 and 3 × 3 respectively; down-sampling is performed by max pooling, and each up-convolution is followed by one convolutional layer; a residual connection is applied after every three convolutions throughout the network, and there are also residual connections between corresponding layers of the encoder and the decoder.
7. A method of identifying a QTL, comprising:
obtaining a QTL recognition model obtained by the training method of any one of claims 4 to 6;
inputting sample data to be detected into the QTL identification model to obtain output information, wherein the output information comprises position information of SNP loci on genes and confidence of the SNP loci belonging to QTL intervals;
and identifying a QTL interval according to the position information and the confidence coefficient.
8. The identification method according to claim 6, for identifying at least one of:
a maize plant height QTL;
a rice plant height QTL;
a rice flowering stage QTL; and
a Wuchang fish intermuscular bone QTL.
9. A QTL identification device, comprising:
the acquisition module is used for acquiring a training set, wherein the training set comprises SNP data and sample marks of the SNP data, and the SNP data is obtained by sequencing a QTL positioning group mixed pool sample;
the construction module is used for constructing a residual U-net model, wherein the residual U-net model is formed by adding residual connections to an encoding part and a decoding part on the basis of the overall U-Net structure, the encoding part is used for extracting low-resolution features of the SNP data, and the decoding part is used for extracting high-resolution features of the SNP data;
the training module is used for training the residual U-net model with the training set to obtain a trained residual U-net model;
and the output module is used for identifying QTL information in a given DNA sequence by using the trained residual U-net model.
10. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor being adapted to perform the method of any of claims 1 to 6 and/or the method of claim 7 or 8 when the computer program is executed.
CN202210790511.2A 2022-07-06 2022-07-06 QTL sample processing, model training and identifying method, device and equipment Pending CN115273973A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210790511.2A CN115273973A (en) 2022-07-06 2022-07-06 QTL sample processing, model training and identifying method, device and equipment

Publications (1)

Publication Number Publication Date
CN115273973A true CN115273973A (en) 2022-11-01

Family

ID=83763308

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210790511.2A Pending CN115273973A (en) 2022-07-06 2022-07-06 QTL sample processing, model training and identifying method, device and equipment

Country Status (1)

Country Link
CN (1) CN115273973A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116403642A (en) * 2023-06-06 2023-07-07 中国农业科学院作物科学研究所 Quick and fine positioning method for QTL
CN116403642B (en) * 2023-06-06 2023-08-04 中国农业科学院作物科学研究所 Quick and fine positioning method for QTL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination