CN115948521A

CN115948521A - Method for detecting aneuploid missing chromosome information

Info

Publication number: CN115948521A
Application number: CN202211716471.3A
Authority: CN
Inventors: 陈肃; 陈嵩; 于越; 王鑫宇; 周妍; 何蕊含; 刘文轩; 刘宣晨
Original assignee: Northeast Forestry University
Current assignee: Northeast Forestry University
Priority date: 2022-12-29
Filing date: 2022-12-29
Publication date: 2023-04-11

Abstract

The invention discloses a method for detecting aneuploid missing chromosome information, comprising the following steps: extracting DNA from an organism to be tested and performing whole genome sequencing to obtain sequencing reads; aligning the sequencing reads to a reference genome and obtaining a sequencing-depth frequency scatter plot for each chromosome of the organism to be tested; fitting the frequency scatter plot of each chromosome and obtaining the sequencing depths corresponding to all Gaussian peaks in the fitted curve; and clustering the sequencing depths to obtain the chromosome ploidy of the organism to be tested. The method is based on second-generation high-throughput sequencing. When handling large batches of samples it takes much less time than traditional detection methods, can be automated, and has the advantages of standardization and repeatability.

Description

Method for detecting aneuploid missing chromosome information

Technical Field

The invention belongs to the field of genome sequencing and bioinformatics, and particularly relates to a method for detecting aneuploid deletion chromosome information.

Background

In most cases, aneuploidies are fatal to animals and humans, but plants often exhibit greater tolerance to aneuploidies, particularly in allopolyploid plants. The aneuploidy has incomparable advantages in the physical position determination of genes and molecular markers, gene transfer and the establishment of the corresponding relation between linkage groups and chromosomes, has important significance for the research of heredity and breeding of plants, and simultaneously obtains a plurality of achievements in the application of actual breeding.

Through aneuploidy research, the genetic rules governing the various traits of a plant can be worked out more quickly and systematically, and the relationship between the chromosomes of the plant and those of related plants can be determined, so that new varieties with special and superior traits can be bred more systematically. However, such research involves a large number of hybridization experiments, so the workload is heavy and the time required is long, and research of this kind on forest trees is relatively scarce. Because forest trees have a long growth period and the breeding direction is difficult to adjust once breeding is under way, sufficient chromosome information must be obtained before the formal experiment is carried out.

Karyotyping based on individual chromosome sets (such as the C-banding and G-banding methods), flow cytometry, and fluorescence in situ hybridization (FISH) based on chromosome-specific probes are the common aneuploidy identification methods today. However, most of these methods depend strongly on the type of experimental material and require lengthy experimental preparation, which makes them laborious for screening experimental material on a large scale. A ploidy analyzer can quickly determine whether an individual in a created population is aneuploid, but it is difficult to determine the specific chromosome composition of each individual.

In addition, a conventional approach that applies a t-test after high-throughput sequencing, to detect whether the coverage depth of a sample differs significantly from that of a standard reference sample, has been used clinically for human diseases such as Down syndrome and trisomy 18. In aneuploid plant breeding, however, this approach is difficult to apply because of the large number of hybrid varieties, the large genome variation, the large gains and losses in chromosome number, and the difficulty of obtaining a standard reference. Therefore, it is desirable to provide a method for detecting aneuploid missing chromosome information.

Disclosure of Invention

The invention aims to provide a method for detecting aneuploidy missing chromosome information so as to solve the problems in the prior art.

In order to achieve the above object, the present invention provides a method for detecting aneuploid missing chromosome information, comprising the steps of:

extracting DNA of an organism to be tested and sequencing a whole genome to obtain a sequencing sequence;

comparing the sequencing sequences to a reference genome, and acquiring a frequency scatter diagram of each chromosome of the organism to be detected;

fitting the frequency scatter diagram of each chromosome, and acquiring sequencing depths corresponding to all Gaussian peaks in a fitting curve;

and clustering the sequencing depth to further obtain the chromosome ploidy of the organism to be detected.

Optionally, before whole genome sequencing of the extracted DNA, the method further comprises: detecting the integrity of the DNA by agarose gel electrophoresis, and detecting the concentration of the DNA with a microplate reader.

Optionally, the reference genome is the genome of the species of the organism to be tested or of a closely related species, and has been assembled to the chromosome level.

Optionally, the process of obtaining a frequency scatter plot of each chromosome comprises: obtaining the sequencing depth of each base on each chromosome, and counting the frequency of occurrence of each sequencing depth, thereby obtaining a frequency scatter plot of each chromosome.

Optionally, fitting the frequency scatter plot of each chromosome comprises: fitting the frequency scatter plot of each chromosome to a single Gaussian curve or to a Gaussian mixture model formed by superposing x Gaussian curves, wherein x is the number of peaks in the frequency scatter plot of that chromosome.

Optionally, before clustering the sequencing depths, the method further comprises: sorting the sequencing depths to obtain the Gaussian peak with the largest sequencing depth in each chromosome.

Optionally, the process of obtaining the chromosome ploidy of the organism to be tested comprises: performing one-dimensional array clustering on the sequencing depths to obtain different cluster groups; obtaining the median sequencing depth of each cluster group, the medians of the different cluster groups being in a ploidy (integer-multiple) relationship; and assigning each chromosome to a cluster group based on the group medians and the Gaussian peak with the largest sequencing depth in that chromosome, thereby obtaining the ploidy of each chromosome of the organism to be tested.

The invention has the technical effects that:

The method identifies the true sequencing depth of each chromosome by counting the frequency of occurrence of each sequencing depth across the genome, decomposes the sequencing-depth frequency curve with a Gaussian mixture model so as to separate out the factors that make the sequencing depth of a chromosome unstable, and applies a clustering algorithm to group all sequencing-depth peak values so as to determine the sequencing depth of a monosome (a single chromosome copy), thereby improving the accuracy of chromosome ploidy detection.

The method is based on second-generation high-throughput sequencing. When handling large batches of samples it takes much less time than traditional detection methods, can be automated, and has the advantages of standardization and repeatability.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:

FIG. 1 is a flowchart of a method for detecting aneuploidy missing chromosome information in an embodiment of the invention;

FIG. 2 is a frequency scattergram of 19 chromosomes according to an embodiment of the present invention;

FIG. 3 is a graph showing the fitting of sample No. 1 in the example of the present invention after Gaussian fitting.

Detailed Description

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.

Example one

As shown in fig. 1, the present embodiment provides a method for detecting aneuploidy missing chromosome information, including the following steps:

DNA extraction:

A suitable DNA extraction protocol is selected according to the characteristics of the animal or plant sample to be tested, and the DNA content and purity are measured; the sample quality must meet the official loading (on-machine) requirements of the sequencer.

In this example, 6 aneuploid plants obtained by hybridization were selected as experimental samples and two aneuploid plants were selected as controls, and standard DNA extraction was performed using the MGIEasy universal DNA library preparation kit. After extraction, sample integrity was checked by agarose gel electrophoresis and concentration was measured with a microplate reader, using the DNABR detection kit. The results show that the quality of all extracted samples met the on-machine standard of the sequencing platform.

Whole genome sequencing:

Based on second-generation high-throughput sequencing technology, sequencing library preparation and on-machine sequencing are carried out on an Illumina or BGI sequencing platform according to the official instruction manual, with instrument parameters and operating procedures strictly following the manual for the corresponding sequencing platform.

In this example, sequencing library preparation and on-machine sequencing were carried out on an MGISEQ-2000 sequencing platform according to the official instruction manual. The library type was DNBSEQ WGS and the sequencing mode was PE150 whole genome sequencing, with instrument parameters and operating procedures strictly following the manual for the sequencing platform.

Comparing the sequencing sequence with a reference genome and counting the sequencing depth:

After the raw data come off the sequencer, the reads obtained by paired-end sequencing are aligned to a reference genome. The reference genome may be that of the species itself or of a closely related species, but it must be assembled to the chromosome level. Because of allelic differences between individuals within a species, an alignment method with a high tolerance for mismatches should be chosen where possible. After alignment, the sequencing depth of each nucleotide on the reference genome is calculated, and the frequency of occurrence of each sequencing depth is counted chromosome by chromosome. The results are presented as a scatter plot, with sequencing depth on the abscissa and the frequency of occurrence of that sequencing depth on the ordinate.

Taking sample No. 1 as an example, 266.24 M paired-end reads were obtained from the sequencer. After low-quality reads were filtered out, the reads were aligned to the poplar reference genome using BWA-MEM, with an alignment rate of 92.06%. The sequencing depth of each base on each chromosome was then calculated, and the frequencies were counted chromosome by chromosome. FIG. 2 shows the resulting line-connected scatter plots for the 19 chromosomes, where the abscissa is the sequencing depth and the ordinate is the frequency of occurrence of that sequencing depth.
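The depth-frequency counting described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the example: it assumes the per-base depths are available as tab-separated records of chromosome, position and depth (the three-column layout that tools such as samtools depth produce), and the function names and the demo records are hypothetical.

```python
from collections import Counter, defaultdict


def depth_frequencies(lines):
    """Count how often each sequencing depth occurs on each chromosome.

    `lines` yields tab-separated records: chromosome, position, depth
    (an assumed input layout, not one specified by the source text).
    """
    freq = defaultdict(Counter)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        chrom, _pos, depth = line.split("\t")
        freq[chrom][int(depth)] += 1
    return freq


def scatter_points(counter):
    """Return (depth, frequency) pairs sorted by depth, ready for plotting."""
    return sorted(counter.items())


# Tiny made-up input, for illustration only.
demo = [
    "Chr01\t1\t34",
    "Chr01\t2\t34",
    "Chr01\t3\t68",
    "Chr02\t1\t35",
]
freq = depth_frequencies(demo)
print(scatter_points(freq["Chr01"]))  # [(34, 2), (68, 1)]
```

Each chromosome's list of (depth, frequency) pairs corresponds to one panel of the scatter plot: depth on the abscissa, frequency on the ordinate.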

Drawing a fitting curve by taking a chromosome as a unit and calculating a peak value:

Using the Gaussian fitting principle, the frequency scatter plot of each chromosome is fitted to a single Gaussian curve or to a Gaussian mixture model formed by superposing x Gaussian curves, where x is the number of peaks appearing in the frequency curve. After fitting, the sequencing depths corresponding to all Gaussian peaks are recorded and sorted from smallest to largest.
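The text does not say how the number of peaks x is determined. One possible way, shown here purely as an assumption, is to smooth the frequency curve with a moving average and count the local maxima that rise above a fraction of the highest point. The window size, the 5% height threshold, the function names and the synthetic two-peak histogram are all illustrative choices, not values from the source.

```python
import math


def moving_average(values, window=5):
    """Smooth a sequence with a centred moving average (shorter at the ends)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def find_peaks(depths, freqs, window=5, min_rel_height=0.05):
    """Return the depths at which the smoothed frequency curve has a local maximum."""
    smooth = moving_average(freqs, window)
    floor = min_rel_height * max(smooth)  # ignore tiny bumps in the tails
    peaks = []
    for i in range(1, len(smooth) - 1):
        is_local_max = smooth[i - 1] < smooth[i] and smooth[i] >= smooth[i + 1]
        if is_local_max and smooth[i] >= floor:
            peaks.append(depths[i])
    return peaks


def gaussian(x, mu, sigma):
    """Unnormalised Gaussian bump used to build the synthetic histogram."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2)


# Synthetic histogram with two bumps, for illustration only.
depths = list(range(1, 121))
freqs = [4000 * gaussian(d, 34, 5) + 5000 * gaussian(d, 68, 6) for d in depths]
peaks = find_peaks(depths, freqs)
print(len(peaks), peaks)  # 2 [34, 68]
```

The number of peaks found gives x, and the peak positions can serve as starting values for the Gaussian fit. Real depth histograms are noisier than this synthetic one, so the window and threshold would need tuning.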

Curve fitting was performed on each chromosome. As shown in FIG. 3, taking chromosome 1 of sample No. 1 as an example, Gaussian fitting according to the number of peaks gave 2 normal distribution curves, with an R-square (R²) of 0.988. The sequencing depth values corresponding to the peaks of the fitted curves, 34X and 68X, were recorded. This was repeated for the remaining 18 chromosomes, finally giving 38 normal distribution curves.
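A Gaussian mixture fit of a depth-frequency histogram can be sketched as below. Note the substitution: the example reports an R² value, which suggests a least-squares curve fit, whereas this sketch uses the expectation-maximization (EM) algorithm with the frequencies as weights, because it needs only the Python standard library. The function names, the starting values, the lower bound on sigma and the synthetic histogram are assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_mixture(depths, freqs, init_means, n_iter=200):
    """Fit a Gaussian mixture to a depth-frequency histogram with weighted EM.

    depths, freqs: the scatter-plot points; init_means: one rough guess per peak.
    Returns a list of (weight, mean, sigma) tuples sorted by mean.
    """
    k = len(init_means)
    total = float(sum(freqs))
    means = list(init_means)
    sigmas = [max(1.0, 0.1 * m) for m in means]  # crude starting widths
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: share of each depth value attributed to each component.
        resp = []
        for x in depths:
            p = [w * gaussian_pdf(x, m, s)
                 for w, m, s in zip(weights, means, sigmas)]
            norm = sum(p) or 1e-300  # guard against division by zero
            resp.append([v / norm for v in p])
        # M-step: frequency-weighted updates of weight, mean and sigma.
        for j in range(k):
            nj = sum(f * r[j] for f, r in zip(freqs, resp))
            if nj <= 0:
                continue
            means[j] = sum(f * r[j] * x
                           for x, f, r in zip(depths, freqs, resp)) / nj
            var = sum(f * r[j] * (x - means[j]) ** 2
                      for x, f, r in zip(depths, freqs, resp)) / nj
            sigmas[j] = max(math.sqrt(var), 0.5)  # keep widths from collapsing
            weights[j] = nj / total
    return sorted(zip(weights, means, sigmas), key=lambda c: c[1])


# Synthetic two-component histogram, for illustration only.
depths = list(range(1, 121))
freqs = [1e6 * (0.4 * gaussian_pdf(d, 34, 5) + 0.6 * gaussian_pdf(d, 68, 6))
         for d in depths]
components = fit_mixture(depths, freqs, init_means=[30, 70])
peak_depths = [round(mean, 1) for _w, mean, _s in components]
print(peak_depths)
```

The fitted means are the sequencing depths of the Gaussian peaks. The last entry of the sorted result is the peak with the largest sequencing depth for that chromosome, which is the value carried into the clustering step.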

Judging the specific ploidy of the chromosome in the organism:

All the values recorded in the previous step are subjected to one-dimensional array clustering using DBSCAN or another clustering algorithm that does not require the number of groups to be set in advance. The medians of the different cluster groups should be in a ploidy relationship: if the sequencing depth corresponding to the first group of peaks (i.e., the monosomic depth) is y, the sequencing depth corresponding to the second group of peaks should be 2y, that of the third group 3y, and that of the nth group n × y. The Gaussian peak with the largest sequencing depth in each chromosome is then determined; if its sequencing depth is clustered into the nth group, the ploidy of that chromosome in the organism is n.
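The clustering and ploidy assignment can be sketched as follows. To stay within the standard library, this sketch replaces a library DBSCAN call with a gap-based grouping: in one dimension, splitting the sorted values wherever the gap exceeds eps gives the same groups as DBSCAN with a minimum cluster size of 1. The eps value, the function names and the peak depths in the demo are hypothetical, and the sketch assumes, as the text does, that the first group corresponds to a single chromosome copy.

```python
from statistics import median


def cluster_1d(values, eps):
    """Group sorted values into runs whose neighbouring gaps are <= eps."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= eps:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


def call_ploidy(peaks_by_chrom, eps=5.0):
    """Return (group medians, {chromosome: ploidy}) from fitted peak depths."""
    all_peaks = [p for peaks in peaks_by_chrom.values() for p in peaks]
    groups = cluster_1d(all_peaks, eps)
    medians = [median(g) for g in groups]
    ploidy = {}
    for chrom, peaks in peaks_by_chrom.items():
        top = max(peaks)  # Gaussian peak with the largest sequencing depth
        idx = next(i for i, g in enumerate(groups) if top in g)
        ploidy[chrom] = idx + 1  # peak in the nth group means ploidy n
    return medians, ploidy


# Made-up peak depths for three chromosomes, for illustration only.
demo_peaks = {
    "Chr01": [34.0, 68.0],
    "Chr02": [33.0, 67.0],
    "Chr05": [34.5, 101.0],
}
medians, ploidy = call_ploidy(demo_peaks)
ratios = [round(m / medians[0], 2) for m in medians]
print(medians)  # [34.0, 67.5, 101.0]
print(ratios)   # [1.0, 1.99, 2.97]
print(ploidy)   # {'Chr01': 2, 'Chr02': 2, 'Chr05': 3}
```

The ratios of the group medians to the first median should sit close to 1, 2, 3 and so on; a ratio far from an integer would suggest that a group is missing or that eps is badly chosen. Summing the per-chromosome ploidies gives the chromosome count of the cell: 15 disomic and 4 trisomic chromosomes, for instance, give 15 × 2 + 4 × 3 = 42.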

The recorded results were subjected to one-dimensional array clustering using the DBSCAN algorithm, which divided the curves into three groups with median sequencing depths of 34X, 68X and 101X. The last peak of chromosome 1 (sequencing depth: 68X) was clustered into the second group, indicating that the ploidy of this chromosome in the plant is 2, i.e., disomic. The last peaks of chromosomes 5, 8, 13 and 19 were clustered into the third group, indicating that the ploidy of these chromosomes in the plant is 3, i.e., trisomic. Sample No. 1 was therefore finally judged to be a plant trisomic for chromosomes 5, 8, 13 and 19, with 42 chromosomes per cell. The results of the tests on the samples of this example are shown in Table 1:

TABLE 1

As can be seen from Table 1, the detection results of this example are consistent with those of the ploidy analyzer. Because the method is based on second-generation high-throughput sequencing, it takes much less time than traditional detection methods when a large number of samples are tested, can be automated, and has the advantages of standardization and repeatability.

The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

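1. A method for detecting aneuploidy deleted chromosome information, comprising the steps of:

extracting DNA of an organism to be detected and carrying out whole genome sequencing to obtain a sequencing sequence;

comparing the sequencing sequences to a reference genome, and acquiring a frequency scatter diagram of each chromosome of the organism to be detected;

fitting the frequency scatter diagram of each chromosome, and acquiring sequencing depths corresponding to all Gaussian peaks in a fitting curve;

and clustering the sequencing depth to further obtain the chromosome ploidy of the organism to be detected.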

2. The method for detecting aneuploidy deleted chromosome information according to claim 1,

before whole genome sequencing of the extracted DNA, the method also comprises the following steps: and detecting the integrity of the DNA based on agarose gel electrophoresis, and detecting the concentration of the DNA by using a microplate reader.

3. The method for detecting aneuploidy deleted chromosome information according to claim 1,

the reference genome is the genome of the species of the organism to be tested or of a closely related species, and has been assembled to the chromosome level.

4. The method for detecting aneuploidy deleted chromosome information according to claim 1,

the process of obtaining a frequency scattergram for each chromosome includes: and acquiring the sequencing depth of each base on each chromosome, and counting the occurrence frequency of each sequencing depth to further acquire a frequency scatter diagram of each chromosome.

5. The method for detecting aneuploidy deleted chromosome information according to claim 4,

the process of fitting the frequency scatter diagram of each chromosome includes: and fitting the frequency scatter diagram of each chromosome into a mixed Gaussian model formed by superposition of a single Gaussian curve or x Gaussian curves, wherein x is the number of peaks in the frequency scatter diagram of each chromosome.

6. The method for detecting aneuploidy deleted chromosome information according to claim 1,

before the clustering process is carried out on the sequencing depths, the method further comprises the following step: sorting the sequencing depths to obtain the Gaussian peak with the largest sequencing depth in each chromosome.

7. The method for detecting aneuploidy deleted chromosome information according to claim 6,

the process of obtaining the chromosome ploidy of the test organism comprises: performing one-dimensional array clustering on the sequencing depths to obtain different cluster groups; obtaining the median sequencing depth of each cluster group, the medians of the different cluster groups being in a ploidy (integer-multiple) relationship; and assigning each chromosome to a cluster group based on the group medians and the Gaussian peak with the largest sequencing depth in that chromosome, thereby obtaining the ploidy of each chromosome of the organism to be detected.