CN111128304A

CN111128304A - Quality detection method and device for second-generation sequencing data

Info

Publication number: CN111128304A
Application number: CN201911292413.0A
Authority: CN
Inventors: 孙丰龙; 吕小莹
Original assignee: Digital China Health Technologies Co ltd
Current assignee: Digital China Health Technologies Co ltd
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2020-05-08

Abstract

The application provides a quality detection method and a quality detection device for second-generation sequencing data. The method comprises the following steps: acquiring data to be detected of a target sample; determining evaluation information of the target sample according to the data to be detected, wherein the evaluation information comprises at least two of the following kinds of sub-evaluation information: base distribution ratio, base number ratio, high-quality alignment ratio, cross contamination statistical value and hybridization capture quality value; and determining whether the target sample is a qualified sample according to the evaluation information.

Description

Quality detection method and device for second-generation sequencing data

Technical Field

The application relates to the field of data analysis, in particular to a quality detection method and device for second-generation sequencing data.

Background

With the continuous development of second-generation sequencing technology, its price keeps falling. WES (whole exome sequencing) and WGS (whole genome sequencing) of human samples are becoming more and more common in the diagnosis of genetic diseases and cancer. However, there are currently more than two hundred sequencing service providers in the domestic market, and each applies different quality control standards to laboratory library construction and subsequent bioinformatics analysis, which seriously affects the subsequent interpretation of genetic disease loci. At present, the cost of WES data in particular is about one third of that of WGS data, so the volume of sequencing data will inevitably increase in the future, and how to form a strict and complete quality control system for sequencing data has become a bottleneck for the development of the industry.

In the prior art, data quality is controlled by applying self-defined thresholds to individual metrics of the second-generation sequencing data, so the reliability and validity of the data results cannot be ensured.

Disclosure of Invention

In view of this, an object of the present application is to provide a method and an apparatus for detecting quality of second-generation sequencing data, so as to solve the problem of how to improve the reliability of the quality control result of the second-generation sequencing data in the prior art.

In a first aspect, an embodiment of the present application provides a method for detecting quality of second-generation sequencing data, where the method includes:

acquiring to-be-detected data of a target sample;

determining evaluation information of the target sample according to the data to be detected, wherein the evaluation information comprises at least two of the following kinds of sub-evaluation information: base distribution ratio, base number ratio, high-quality alignment ratio, cross contamination statistical value and hybridization capture quality value;

and determining whether the target sample is a qualified sample according to the evaluation information.

According to a first aspect, the present embodiments provide a first possible implementation manner of the first aspect, wherein determining whether the target sample is a qualified sample according to the evaluation information includes:

judging whether each kind of sub-evaluation information is in a qualified state;

and if each kind of sub-evaluation information is in a qualified state, determining the target sample as a qualified sample.

According to a first possible embodiment of the first aspect, the present examples provide a second possible embodiment of the first aspect, wherein whether the base distribution ratio is in a qualified state is determined as follows:

calculating the base distribution ratio of the target sample according to the data to be detected, and judging whether the base distribution ratio meets a first condition; the first condition is that the base distribution ratio of the target sample falls within a preset base distribution ratio interval;

if the data to be detected meet the first condition, determining that the base distribution ratio is in a qualified state;

and if the base distribution ratio and other sub-evaluation information are in qualified states, determining that the target sample is a qualified sample.

According to a first possible embodiment of the first aspect, the present examples provide a third possible embodiment of the first aspect, wherein whether the base number ratio is in a qualified state is determined as follows:

calculating the base number ratio of each quality value of the target sample according to the data to be detected, and judging whether the base number ratio meets a second condition; the second condition is that the base number ratio of each quality value of the target sample falls within the corresponding preset base number ratio interval;

if the data to be detected meet the second condition, determining that the base number ratio is in a qualified state;

and if the base number ratio and other sub-evaluation information are in qualified states, determining the target sample as a qualified sample.

According to a first possible embodiment of the first aspect, the present examples provide a fourth possible embodiment of the first aspect, wherein whether the high-quality alignment ratio is in a qualified state is determined as follows:

performing genome alignment on the data to be detected to obtain conversion data to be detected;

calculating the high-quality alignment ratio of the target sample according to the conversion data to be detected, and judging whether the high-quality alignment ratio meets a third condition; the third condition is that the high-quality alignment ratio of the target sample is not less than a preset alignment ratio threshold;

if the conversion data to be detected meets the third condition, determining that the high-quality alignment ratio is in a qualified state;

and if the high-quality alignment ratio and other sub-evaluation information are both in a qualified state, determining that the target sample is a qualified sample.

According to the fourth possible embodiment of the first aspect, the present examples provide a fifth possible embodiment of the first aspect, wherein whether the cross contamination statistic value is in a qualified state is determined as follows:

calculating a cross contamination statistic value of the target sample according to the to-be-detected conversion data, and judging whether the cross contamination statistic value meets a fourth condition; the fourth condition is that the cross contamination statistic value of the target sample is not greater than a preset cross contamination statistic threshold value;

if the conversion data to be detected meets the fourth condition, determining that the cross contamination statistic value is in a qualified state;

and if the cross contamination statistic value and other sub-evaluation information are both in a qualified state, determining that the target sample is a qualified sample.

According to a first possible embodiment of the first aspect, the present examples provide a sixth possible embodiment of the first aspect, wherein whether the hybridization capture quality value is in a qualified state is determined as follows:

performing genome alignment on the data to be detected to obtain conversion data to be detected;

calculating a plurality of hybridization capture quality values of the target sample according to the to-be-detected conversion data, and judging whether the plurality of hybridization capture quality values meet a fifth condition; the fifth condition is that each hybridization capture quality value of the target sample is within a corresponding preset hybridization capture quality value interval;

if the conversion data to be detected meets the fifth condition, determining that the hybridization capture quality value is in a qualified state;

and if the hybridization capture quality value and the other sub-evaluation information are both in a qualified state, determining that the target sample is a qualified sample.

In a second aspect, the present application provides an apparatus for detecting quality of second-generation sequencing data, the apparatus including:

the acquisition module is used for acquiring to-be-detected data of the target sample;

the calculation module is used for determining evaluation information of the target sample according to the data to be detected, wherein the evaluation information comprises at least two of the following kinds of sub-evaluation information: base distribution ratio, base number ratio, high-quality alignment ratio, cross contamination statistical value and hybridization capture quality value;

and the judging module is used for determining whether the target sample is a qualified sample according to the evaluation information.

In a third aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the method according to any one of the first aspect and possible implementation manners when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the steps of the method of any one of the above first aspect and its possible implementations.

According to the quality detection method for the second-generation sequencing data, the obtained data to be detected of the target sample is analyzed, the evaluation information of the target sample is determined, and whether the target sample is a qualified sample is determined according to at least two kinds of evaluation sub-information contained in the evaluation information. The quality detection method for the second-generation sequencing data, provided by the embodiment of the application, can effectively detect the sample with the problem in the second-generation sequencing data, and improve the reliability of the quality control result of the second-generation sequencing data, so that the availability of the qualified sample after quality control is improved.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

FIG. 1 is a schematic flow chart of a method for detecting quality of second-generation sequencing data according to an embodiment of the present disclosure;

FIG. 2 is a schematic flowchart of a method for detecting quality of second-generation sequencing data according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of an apparatus for detecting quality of second-generation sequencing data according to an embodiment of the present disclosure;

FIG. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application provides a quality detection method for second-generation sequencing data, as shown in fig. 1, comprising the following steps:

step S101, acquiring to-be-detected data of a target sample;

step S102, determining evaluation information of the target sample according to the data to be detected, wherein the evaluation information comprises at least two of the following kinds of sub-evaluation information: base distribution ratio, base number ratio, high-quality alignment ratio, cross contamination statistical value and hybridization capture quality value;

and step S103, determining whether the target sample is a qualified sample according to the evaluation information.

Specifically, in order to ensure the reliability of the quality control result of the sample, the embodiment of the present application uses multiple indexes of the second-generation sequencing data as the basis for quality control of the sample. The evaluation information of the target sample is calculated from the data to be detected, and at least two kinds of sub-evaluation information among the base distribution ratio, the base number ratio, the high-quality alignment ratio, the cross contamination statistical value and the hybridization capture quality value are used to determine whether the target sample is qualified.

In an alternative embodiment, the step S103 of determining whether the target sample is a qualified sample according to the evaluation information includes, as shown in fig. 2:

step S1031, judging whether each kind of sub-evaluation information is in a qualified state;

step S1032, if each kind of sub-evaluation information is in a qualified state, determining that the target sample is a qualified sample.

In order to ensure the availability of the target sample, it is necessary to ensure that each evaluation sub-information in the evaluation information is in a qualified state, so as to determine that the target sample is a qualified sample.
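The decision rule of steps S1031 and S1032 can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function name `is_qualified_sample` and the snake_case kind names are hypothetical labels chosen here for the five kinds of sub-evaluation information listed in the text.

```python
from typing import Mapping

# Hypothetical identifiers for the five kinds of sub-evaluation
# information named in the text.
KNOWN_KINDS = {
    "base_distribution_ratio",
    "base_number_ratio",
    "high_quality_alignment_ratio",
    "cross_contamination_statistic",
    "hybridization_capture_quality",
}


def is_qualified_sample(states: Mapping[str, bool]) -> bool:
    """Return True only if every evaluated kind is in a qualified state.

    `states` maps a kind name to True (qualified) or False (unqualified).
    At least two kinds must be evaluated, as the method requires.
    """
    unknown = set(states) - KNOWN_KINDS
    if unknown:
        raise ValueError(f"unknown evaluation kinds: {sorted(unknown)}")
    if len(states) < 2:
        raise ValueError("at least two kinds of sub-evaluation information are required")
    return all(states.values())
```

A single unqualified kind is enough to reject the sample, which matches the requirement that every kind of sub-evaluation information be in a qualified state.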

The qualified state of each kind of sub-evaluation information is determined using a standard library, established from standard samples in the 1000G database (1000 Genomes Project) and the ExAC database (Exome Aggregation Consortium), and quality control index thresholds set on the basis of that standard library.

In an alternative embodiment, whether the base distribution ratio is in the acceptable state is determined as follows:

step 2011, calculating the base distribution ratio of the target sample according to the data to be detected, and judging whether the base distribution ratio meets a first condition; the first condition is that the base distribution ratio of the target sample falls within a preset base distribution ratio interval;

step 2012, if the data to be detected meets the first condition, determining that the base distribution ratio is in a qualified state;

and step 2013, if the base distribution ratio and other sub-evaluation information are in qualified states, determining that the target sample is a qualified sample.

Specifically, the acquired data to be detected of the target sample is an initial FastQ-formatted file, and the base distribution ratio of the target sample can be obtained by analyzing the data in the FastQ-formatted file.

For example, for target samples that are all Chinese, the average base distribution ratio of the standard samples can be calculated from the Chinese data in the 1000G database and the ExAC database and used as the standard value of the base distribution ratio, and the preset base distribution ratio interval is determined according to the allowable error range.

If the base distribution ratio of the target sample falls outside the preset base distribution ratio interval, the experimental sequencing process of the target sample may have been problematic, and the target sample cannot be used as a qualified sample.
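Steps 2011 and 2012 might be sketched as below. The sketch assumes that the base distribution ratio means the overall fraction of each base across all reads, and that the FastQ file uses the common four-line record layout; the text fixes neither point. The function names and any interval values passed in are hypothetical, since the text gives no numeric interval for this index.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def base_distribution(fastq_lines: Iterable[str]) -> Dict[str, float]:
    """Fraction of each base over all sequence lines of a FastQ stream.

    Assumes 4-line records, where the sequence is the second line.
    """
    counts: Counter = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # sequence line of the record
            counts.update(line.strip().upper())
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no bases found")
    return {base: n / total for base, n in counts.items()}


def base_distribution_qualified(
    ratios: Mapping[str, float],
    intervals: Mapping[str, Tuple[float, float]],
) -> bool:
    """First condition: each base ratio lies inside its preset interval."""
    return all(
        low <= ratios.get(base, 0.0) <= high
        for base, (low, high) in intervals.items()
    )
```

In practice the intervals would come from the standard value and allowable error range described above, for example one `(low, high)` pair per base.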

In an alternative embodiment, whether the base number ratio is in a qualified state is determined as follows:

step 2021, calculating the base number ratio of each quality value of the target sample according to the data to be detected, and judging whether the base number ratio meets a second condition; the second condition is that the base number ratio of each quality value of the target sample falls within the corresponding preset base number ratio interval;

step 2022, if the data to be detected satisfies the second condition, determining that the ratio of the number of bases is in a qualified state;

step 2023, if the base number ratio and other sub-evaluation information are all in a qualified state, determining that the target sample is a qualified sample.

Specifically, the base number ratio of Q20 and Q30 of the target sample can be found by analyzing the data in the file in the FastQ format.

Based on the Chinese data in the 1000G database, the average and standard deviation of the Q20 base number ratio of the standard samples were calculated to be 0.907234047 and 0.030598, respectively, and the average and standard deviation of the Q30 base number ratio were 0.632925 and 0.062931, respectively. These serve as the standard values of the base number ratio. In accordance with the 3σ principle, the preset interval of the Q20 base number ratio is (0.815440047, 1) and the preset interval of the Q30 base number ratio is (0.444132, 1), where each lower bound is the average minus three standard deviations.
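The arithmetic behind the two preset intervals can be checked with a few lines. This sketch only reproduces the lower bounds from the averages and standard deviations quoted in the text; the upper bound of 1 is taken directly from the text, and the function name is a hypothetical label.

```python
from typing import Tuple


def three_sigma_interval(mean: float, std: float) -> Tuple[float, float]:
    """Interval whose lower bound is mean - 3 * std and whose upper bound is 1."""
    return (mean - 3.0 * std, 1.0)


# Values quoted in the text for the standard samples.
Q20_INTERVAL = three_sigma_interval(0.907234047, 0.030598)
Q30_INTERVAL = three_sigma_interval(0.632925, 0.062931)

print(f"Q20 interval: ({Q20_INTERVAL[0]:.9f}, 1)")
print(f"Q30 interval: ({Q30_INTERVAL[0]:.6f}, 1)")
```

The lower bounds agree with the stated values 0.815440047 and 0.444132.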

If the Q20 or Q30 base number ratio of the target sample falls outside the corresponding preset base number ratio interval, the sequencing quality of the target sample is poor and the target sample cannot be used as a qualified sample.
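A possible way to compute the Q20 and Q30 base number ratios from a FastQ file and apply the second condition is sketched below. It assumes Phred+33 quality encoding and four-line FastQ records, neither of which is stated in the text; the function names are hypothetical, and the default lower bounds are the interval bounds quoted above.

```python
from typing import Iterable


def base_number_ratio(fastq_lines: Iterable[str], min_quality: int, offset: int = 33) -> float:
    """Fraction of bases whose Phred quality is at least `min_quality`.

    Assumes 4-line records (quality is the fourth line) and Phred+33 encoding.
    """
    total = 0
    passed = 0
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # quality line of the record
            for ch in line.rstrip("\r\n"):
                total += 1
                if ord(ch) - offset >= min_quality:
                    passed += 1
    if total == 0:
        raise ValueError("no quality values found")
    return passed / total


def base_number_qualified(
    q20_ratio: float,
    q30_ratio: float,
    q20_lower: float = 0.815440047,
    q30_lower: float = 0.444132,
) -> bool:
    """Second condition: both ratios lie above their preset lower bounds.

    The upper bound of 1 needs no check because a ratio cannot exceed 1.
    """
    return q20_ratio > q20_lower and q30_ratio > q30_lower
```

For a real sample the lines would be read once for Q20 (`min_quality=20`) and once for Q30 (`min_quality=30`), or both counters could be updated in a single pass.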

In an alternative embodiment, whether the high-quality alignment ratio is in a qualified state is determined as follows:

step 2031, performing genome alignment on the data to be detected to obtain conversion data to be detected;

step 2032, calculating the high-quality alignment ratio of the target sample according to the conversion data to be detected, and judging whether the high-quality alignment ratio meets a third condition; the third condition is that the high-quality alignment ratio of the target sample is not less than a preset alignment ratio threshold;

step 2033, if the conversion data to be detected meets the third condition, determining that the high-quality alignment ratio is in a qualified state;

step 2034, if the high-quality alignment ratio and the other sub-evaluation information are all in a qualified state, determining that the target sample is a qualified sample.

Specifically, a BAM-format file of the data to be detected of the target sample is obtained by performing genome alignment on the FastQ-format file, and the high-quality alignment ratio of the target sample, namely the proportion of high-quality reads aligned to the reference genome, can be obtained by analyzing the data in the BAM-format file.

By analyzing the standard samples, the average proportion of high-quality reads aligned to the reference genome is 0.991966 and the minimum is 0.976797, so the preset alignment ratio threshold is chosen between this minimum and this average. The embodiment of the application preferably uses 0.98.

If the high-quality alignment ratio of the target sample is less than 0.98, the target sample is judged to be contaminated by DNA of other species.
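The third condition can be illustrated with the sketch below, which computes the high-quality alignment ratio (the proportion of high-quality reads aligned to the reference genome) from per-read alignment records. The text does not define what makes a read "high-quality" or what the denominator is, so two assumptions are made here and should be treated as such: a read counts as high-quality when it is mapped with mapping quality of at least 20, and the denominator is the total number of reads. The function names are hypothetical, and in practice the `(is_mapped, mapq)` pairs would be extracted from the BAM-format file with a BAM parser.

```python
from typing import Iterable, Tuple


def high_quality_alignment_ratio(
    reads: Iterable[Tuple[bool, int]],
    min_mapq: int = 20,  # assumed cut-off, not given in the text
) -> float:
    """Share of reads that are mapped with mapping quality >= min_mapq.

    Each element of `reads` is an (is_mapped, mapping_quality) pair.
    """
    total = 0
    high_quality = 0
    for is_mapped, mapq in reads:
        total += 1
        if is_mapped and mapq >= min_mapq:
            high_quality += 1
    if total == 0:
        raise ValueError("no reads found")
    return high_quality / total


def alignment_qualified(ratio: float, threshold: float = 0.98) -> bool:
    """Third condition: the ratio is not less than the preset threshold."""
    return ratio >= threshold
```

The default threshold of 0.98 is the preferred value from the text; a ratio exactly equal to the threshold passes because the condition is "not less than".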

In an alternative embodiment, whether the cross contamination statistic is in a qualified state is determined as follows:

step 2041, performing genome alignment on the data to be detected to obtain conversion data to be detected;

step 2042, calculating a cross contamination statistic of the target sample according to the to-be-detected conversion data, and judging whether the cross contamination statistic meets a fourth condition; the fourth condition is that the cross contamination statistic value of the target sample is not greater than a preset cross contamination statistic threshold value;

step 2043, if the conversion data to be detected meets the fourth condition, determining that the cross contamination statistic is in a qualified state;

and step 2044, if the cross contamination statistic and other sub-evaluation information are both in a qualified state, determining that the target sample is a qualified sample.

Specifically, a BAM-format file of the data to be detected of the target sample is obtained by performing genome alignment on the FastQ-format file, and the cross contamination statistical value of the target sample can be obtained by analyzing the data in the BAM-format file.

The cross-contamination statistical threshold is set from the data in the ExAC database, preferably the calculated cross-contamination statistical threshold is set to 0.075.

When the cross-contamination statistic for the target sample is greater than 0.075, the target sample is already contaminated and cannot be used as a qualified sample.
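The fourth condition reduces to a single comparison, sketched below for a batch of samples. How the cross contamination statistic itself is estimated from the BAM-format file is not described in the text, so the sketch takes precomputed values as input; the function names and sample identifiers are hypothetical.

```python
from typing import Dict, List

# Preferred threshold given in the text.
CROSS_CONTAMINATION_THRESHOLD = 0.075


def cross_contamination_qualified(
    statistic: float,
    threshold: float = CROSS_CONTAMINATION_THRESHOLD,
) -> bool:
    """Fourth condition: the statistic is not greater than the threshold."""
    return statistic <= threshold


def contaminated_samples(
    statistics: Dict[str, float],
    threshold: float = CROSS_CONTAMINATION_THRESHOLD,
) -> List[str]:
    """Names of samples whose statistic exceeds the threshold, sorted."""
    return sorted(
        name
        for name, value in statistics.items()
        if not cross_contamination_qualified(value, threshold)
    )
```

A sample whose statistic equals the threshold exactly still passes, since the condition is "not greater than".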

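In an alternative embodiment, whether the hybridization capture quality value is in a qualified state is determined as follows:

step 2051, performing genome alignment on the data to be detected to obtain conversion data to be detected;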

step 2052, calculating a plurality of hybridization capture quality values of the target sample according to the to-be-detected conversion data, and judging whether the plurality of hybridization capture quality values meet a fifth condition; the fifth condition is that each hybridization capture quality value of the target sample is within a corresponding preset hybridization capture quality value interval;

step 2053, if the conversion data to be detected meets the fifth condition, determining that the hybridization capture quality value is in a qualified state;

and step 2054, if the hybridization capture quality value and the other sub-evaluation information are both in a qualified state, determining that the target sample is a qualified sample.

Specifically, a BAM-format file of the data to be detected of the target sample is obtained by performing genome alignment on the FastQ-format file, and a plurality of hybridization capture quality values of the target sample can be obtained by analyzing the data in the BAM-format file, including: the average depth of the target capture region, the fold enrichment, the percentage of the target design region that is uncovered, and the percentage of total reads that align uniquely to the reference genome.

For the average depth of the target capture region, the standard samples from the 1000G database have an average value of 106.7 and a lowest value of 50, so the preset average depth of the target capture region can be set to 50; if the average depth of the target capture region of the target sample is lower than 50, the target sample is unqualified.

The lowest fold enrichment calculated from the standard samples of the 1000G database is 18.4; preferably, the preset fold enrichment is set to 15.

The maximum uncovered percentage of the target design region calculated from the standard samples of the 1000G database is 10.4%; preferably, the preset uncovered percentage is set to 10%, and if the uncovered percentage of the target design region of the target sample exceeds 10%, the target sample is unqualified.

The percentage of total reads that align uniquely to the reference genome varies widely among the standard samples of the 1000G database but is generally greater than 70%, so the preset percentage is preferably set to 70%; if this index of the target sample is less than 70%, the target sample is unqualified.
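The fifth condition over the four hybridization capture quality values can be sketched as below, using the preferred thresholds from the text (50, 15, 10% and 70%). The class and field names are hypothetical. The text does not state explicitly which direction fails for fold enrichment; the sketch assumes that a value below 15 fails, by analogy with the depth check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CaptureMetrics:
    mean_target_depth: float        # average depth of the target capture region
    fold_enrichment: float          # enrichment of on-target reads
    uncovered_target_pct: float     # percent of target design region uncovered
    unique_aligned_read_pct: float  # percent of total reads aligned uniquely


def failed_capture_checks(m: CaptureMetrics) -> List[str]:
    """Names of the hybridization capture quality values that fail."""
    failures = []
    if m.mean_target_depth < 50:
        failures.append("mean_target_depth")
    if m.fold_enrichment < 15:  # assumed direction, see the note above
        failures.append("fold_enrichment")
    if m.uncovered_target_pct > 10:
        failures.append("uncovered_target_pct")
    if m.unique_aligned_read_pct < 70:
        failures.append("unique_aligned_read_pct")
    return failures


def capture_qualified(m: CaptureMetrics) -> bool:
    """Fifth condition: every capture quality value passes its check."""
    return not failed_capture_checks(m)
```

Returning the list of failed indexes, rather than only a boolean, makes it easier to report why a sample was rejected.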

The embodiment of the present application provides a quality detection apparatus for second-generation sequencing data, as shown in fig. 3, the apparatus includes:

the acquiring module 30 is used for acquiring to-be-detected data of the target sample;

the calculating module 31 is configured to determine evaluation information of the target sample according to the data to be detected, where the evaluation information includes at least two of the following kinds of sub-evaluation information: base distribution ratio, base number ratio, high-quality alignment ratio, cross contamination statistical value and hybridization capture quality value;

and the judging module 32 is used for determining whether the target sample is a qualified sample according to the evaluation information.
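The three modules could be wired together in software roughly as follows. This is one possible arrangement under stated assumptions, not the structure of the claimed apparatus: the class name, the callables passed in, and the data representation are all hypothetical.

```python
from typing import Any, Callable, Dict

Calculator = Callable[[Any], float]  # data to be detected -> index value
Judge = Callable[[float], bool]      # index value -> qualified state


class QualityDetectionApparatus:
    """Acquisition, calculation and judging modules behind one entry point."""

    def __init__(
        self,
        acquire: Callable[[str], Any],
        calculators: Dict[str, Calculator],
        judges: Dict[str, Judge],
    ) -> None:
        if len(calculators) < 2:
            raise ValueError("at least two kinds of sub-evaluation information are required")
        if set(calculators) != set(judges):
            raise ValueError("each calculator needs a matching judge")
        self.acquire = acquire
        self.calculators = calculators
        self.judges = judges

    def detect(self, sample_id: str) -> bool:
        # Acquisition module: obtain the data to be detected.
        data = self.acquire(sample_id)
        # Calculation module: determine the evaluation information.
        evaluation = {name: calc(data) for name, calc in self.calculators.items()}
        # Judging module: the sample is qualified only if every index passes.
        return all(self.judges[name](value) for name, value in evaluation.items())
```

Injecting the calculators and judges as callables keeps the thresholds outside the class, so the same wiring can serve any combination of at least two indexes.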

Corresponding to the quality detection method of the second-generation sequencing data in fig. 1, an embodiment of the present application further provides a computer device 400, as shown in fig. 4, the device includes a memory 401, a processor 402, and a computer program stored on the memory 401 and executable on the processor 402, wherein the processor 402 implements the quality detection method of the second-generation sequencing data when executing the computer program.

Specifically, the memory 401 and the processor 402 can be general-purpose memories and processors, which are not limited in this respect, and when the processor 402 runs a computer program stored in the memory 401, the method for detecting the quality of the second-generation sequencing data can be executed, so as to solve the problem in the prior art of how to improve the reliability of the quality control result of the second-generation sequencing data.

Corresponding to the quality detection method of the second-generation sequencing data in fig. 1, the embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to perform the steps of the quality detection method of the second-generation sequencing data.

Specifically, the storage medium can be a general storage medium, such as a mobile disk, a hard disk, and the like, and when a computer program on the storage medium is run, the method for detecting the quality of the second-generation sequencing data can be executed, so that the problem of how to improve the reliability of the quality control result of the second-generation sequencing data in the prior art is solved. The quality detection method for the second-generation sequencing data, provided by the embodiment of the application, can effectively detect the sample with the problem in the second-generation sequencing data, and improve the reliability of the quality control result of the second-generation sequencing data, so that the availability of the qualified sample after quality control is improved.

In the embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.

It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.

Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present application, any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the present disclosure and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A quality detection method of second-generation sequencing data is characterized by comprising the following steps:

acquiring to-be-detected data of a target sample;

2. The method of claim 1, wherein determining whether the target sample is a qualified sample based on the evaluation information comprises:

judging whether each kind of sub-evaluation information is in a qualified state;

3. The method according to claim 2, wherein whether the base distribution ratio is in a qualified state is determined as follows:

4. The method according to claim 2, wherein whether the base number ratio is in a qualified state is determined as follows:

5. The method according to claim 2, wherein whether the high-quality alignment ratio is in a qualified state is determined as follows:

6. The method according to claim 2, wherein whether the cross contamination statistic value is in a qualified state is determined as follows:

7. The method according to claim 2, wherein whether the hybridization capture quality value is in a qualified state is determined as follows:

8. A quality detection device for second-generation sequencing data is characterized by comprising:

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of the preceding claims 1-7 are implemented by the processor when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of the preceding claims 1 to 7.