Disclosure of Invention
In order to solve the above problems, the present invention provides a method for determining copy number variation with the aid of single nucleotide polymorphism, comprising the following steps:
step 1: calculating the CNV score of the gene according to the third data of the detection sample and the reference sample, screening the gene with possible copy number variation according to the CNV score, and selecting a second genome;
step 2: performing an snp analysis based on said third data of said reference sample, screening a first set of loci with stable mutation frequencies from loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp base information of the detection sample in the first locus set as fifth data;
and 3, step 3: according to the fifth data, judging the gene copy number variation condition of the detection sample;
and the third data is obtained by removing low-quality sequences from the original sequencing data, aligning the original sequencing data to a reference genome, and performing duplication removal and quality control on the original sequencing data.
The low-quality sequence refers to a sequence with the average base quality and the reads length lower than set values, and an exemplary low-quality sequence removal mode is given in the first embodiment of the invention.
The present invention can be used for second generation sequencing, or sequencing results from high throughput sequencing platforms (e.g., illumina or MGI sequencing data) to identify the copy number variation status of a first genome.
The third data is obtained by the following method:
sequencing the test sample or the reference sample to obtain raw sequencing data including first genome information, wherein the first genome is one or more genes selected according to the analysis purpose;
clear data obtained by removing low-quality sequences from the original sequencing data is used as first data, and the first data is compared with a reference genome, so that the sequencing sequences in the first data are positioned on related genes, and comparison result data is obtained and used as second data;
and carrying out duplicate removal and quality control on the second data to obtain third data.
The method of alignment to the reference genome is: using the bwa software, the first data obtained by removing the low-quality sequences was aligned to the human reference genome hg19 using the mem algorithm, thereby mapping the sequenced sequences to the relevant genes.
The reference sample is a human leukocyte sample, and in order to ensure the diversity of the sample and the statistical significance of the result, a blood sample of a plurality of people can be selected, and 20 blood samples are selected to construct CNV and snp baselines in the invention.
Preferably, step 1 specifically comprises the following steps:
step 1.1 obtaining corrected standardized sequencing depths of a test sample and a reference sample;
when multiple reference samples are used, averaging the corrected standardized sequencing depths of all the reference samples, and calculating an average reference sequencing depth;
step 1.2, calculating the CNV copy number score of the detection sample; the calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth;
step 1.3, screening samples with the copy number score of CNV being more than or equal to 0 and less than or equal to 1 or the copy number score of CNV being more than or equal to 4 and less than 6 as genes with possible copy number variation.
Preferably, the step 2 specifically comprises the following steps:
step 2.1 obtaining base information of the snp sites of the second genome, including reads supporting base mutations and reads of reference bases, from the third data of the reference sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fourth data of the reference sample;
the fourth data for the reference sample was used for the following calculations and screening:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
the screening standard is that the snp site mutation frequency is judged to be stable when being between 0.4 and 0.6, and the snp site mutation frequency is classified as a first site group.
Step 2.2 obtaining base information of the snp sites of the first set of sites including reads supporting base mutation and reads of reference base from the third data of the test sample; sites where the sequencing depth was below the set value were removed and the resulting data was referred to as fifth data for the reference sample.
Step 3.1 using the fifth data of the test sample to perform the following calculations:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
step 3.2, if the distribution frequency of the snp is less than 0.3, the deletion of the copy number is possible, and if the distribution frequency of the snp after the gene standardization correction is more than 0.7, the amplification of the copy number is possible.
The present invention further provides a system for determining copy number variation using single nucleotide polymorphism, comprising:
a memory storing executable instructions, and; and
one or more processors in communication with the memory to execute executable instructions to:
step 1: calculating the CNV score of the gene according to the third data of the detection sample and the reference sample, screening the gene with possible copy number variation according to the CNV score, and selecting a second genome;
step 2: performing a snp analysis based on said third data from said reference sample to screen out a first set of loci with stable mutation frequencies from said loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp mutation frequency data of the detection sample in the first locus set as fifth data;
and step 3: judging the copy number variation condition of the genes in the second genome of the detection sample according to the fifth data;
and the third data is obtained by removing low-quality sequences, aligning to a reference genome, removing duplication and controlling quality of the original sequencing data.
Preferably, the one or more processors are in communication with the memory to execute executable instructions to:
step 1.1 obtaining corrected standardized sequencing depths of a test sample and a reference sample;
when multiple reference samples are used, averaging the corrected standardized sequencing depths of all the reference samples, and calculating an average reference sequencing depth;
step 1.2, calculating the CNV copy number score of the detection sample;
the calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth;
step 1.3, screening samples with the copy number score of CNV being more than or equal to 0 and less than or equal to 1 or the copy number score of CNV being more than or equal to 4 and less than 6 as genes with possible copy number variation;
step 2.1 obtaining base information of the snp sites of the second genome, including reads supporting base mutations and reads of reference bases, from the third data of the reference sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fourth data of the reference sample;
the fourth data for the reference sample was used for the following calculations and screening:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
the screening standard is that the snp site mutation frequency is judged to be stable when being between 0.4 and 0.6, and the snp site mutation frequency is classified as a first site group;
step 2.2 obtaining base information of the snp sites of the first set of sites including reads supporting base mutation and reads of reference base from the third data of the test sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fifth data of the reference sample;
step 3.1 using the fifth data of the test sample to perform the following calculations:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
step 3.2, if the distribution frequency of the snps is less than 0.3, the deletion of the copy number is possible, and if the distribution frequency of the snps after the gene standardization correction is more than 0.7, the amplification of the copy number is possible.
The invention has the advantages that:
the snp distribution frequency is adopted to assist in judging the gene copy number variation condition, and compared with the traditional method which simply depends on sequencing deep judgment, the result is more accurate.
In a preferred embodiment, errors due to poor quality of sequencing data, insufficient depth, and the like can be eliminated.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Example one
In this example, a tumor-associated gene was selected as the first genome.
Selecting a tumor sample to be detected in a certain Guangdong hospital as a detection sample; leukocyte samples extracted from 20 blood samples derived from 20 healthy adults, respectively, were selected as reference samples.
Sequencing and pre-processing of sequencing data can be accomplished using conventional methods and software, and this example provides exemplary operations as follows:
sequencing the detection sample and the reference sample by adopting an illumina second-generation sequencing platform to obtain original sequencing data.
Carrying out early-stage processing on the original sequencing data of the detection sample: a. removal of low quality sequences of the original data: because the sequencing instrument generally has the condition that the base quality of sequencing data fluctuates at the sequencing starting stage and the sequencing ending stage and the sequencing quality is low, if the insert is short, sequencing is performed, namely the sequencing contains a sequencing adaptor, and therefore, in order to ensure the accuracy of a sample analysis result, a sequence with low sequencing quality or a sequencing adaptor introduced by sequencing needs to be removed. The removal criteria were as follows: removing the illumina linker used for sequencing, removing bases with the average base mass lower than 15 by taking 4bp as a sliding window, and removing reads with the length lower than 26 to obtain clean data, which is hereinafter referred to as first data for convenience of expression.
b. Aligning the reference genome: and (3) aligning the clean data sequence with the low-quality sequence removed onto a human reference genome by utilizing a bwa algorithm, so as to position the sequencing sequence onto the related gene, thereby obtaining second data.
c. And (3) removing duplication of comparison results: because PCR amplification exists in the experimental process, in order to avoid the influence caused by the amplification data, the aligned data (i.e., the second data) is subjected to deduplication by using the MarkDuplicates module of Picard software.
d. Quality control of sample sequencing quality: and evaluating the sequencing quality and the comparison quality of the sample based on the comparison result so as to judge whether the sample quality meets the requirement, wherein the accuracy of the analysis result cannot be ensured for the sample which cannot meet the requirement. And (3) carrying out quality control on the data after the duplication removal, wherein the quality control standard is as follows: at target >60%, Q30>85%, alignment >95%, mean depth after deduplication >1000x. And the second data is subjected to de-duplication and quality control to obtain third data.
The same process is applied to the raw sequencing data for each reference sample, resulting in third data for each reference sample.
Step 1
Calculation and screening of gene CNV score:
1.1 obtaining corrected normalized sequencing depth
Calculating the sequencing depth of each gene in the first genome region of the third data of the sample by using a multicov function of the bedtools software; calculating the total sequencing depth of the whole sample in the first genome region; standardizing according to the sequencing depth of each gene and the total sequencing depth obtained by calculation; the normalized sequencing depth results were corrected based on GC content using the loess method.
For the reference samples, the data were processed in the same manner, and the average value was taken after the processing results for each reference sample were obtained, to obtain the average reference sequencing depth.
1.2 calculating the CNV copy number score of the test sample
The calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth;
1.3 selection of genes which are likely to have copy number variation
The screening standard is as follows: if the copy number score of the gene CNV is more than or equal to 4 and less than 6, amplification may occur; deletion may occur if the CNV score is less than or equal to 1; the genes in both cases were selected into the second genome.
Referring to FIG. 1 of the specification, the CNV copy number scores of EGFR, MET, and ERBB2 genes exemplarily indicated in the figure meet the screening criteria, and there is a large possibility that copy number variation occurs.
Step 2
The processing of the snp data specifically comprises:
2.1 screening for sites with stable snp mutation frequencies
The analysis of the snp site mutation frequency of 20 leukocyte samples is summarized, and the site with stable mutation frequency in the second genome is selected.
Specifically, base information of the snp site of the second genome is obtained from the third data of each reference sample by using the mpieup function of bcftools software; reading in the R language with the input, all relevant reads in the second genome in the sample can be obtained, including reads supporting base mutation and reads of the reference base. For each snp site, if the number of reads supporting base mutation + the number of reads of the reference base is < 200, it proves that the sequencing depth is not enough, which may affect the accuracy of the analysis result and needs to be removed. The data obtained after the above-described processing is referred to as fourth data of the reference sample.
The fourth data for the reference sample was used for the following calculations and screening:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation)
The screening criteria is that the frequency of the snp site mutation is judged to be stable when the frequency is between 0.4 and 0.6, and the snp site mutation is classified into a first site group.
To visually display the results, a baseline of the snp mutation frequency of the white blood cells can be made, see the attached figure 2 of the specification.
The snp site with unstable mutation frequency in the white blood cells is generally unstable in other cells, so that the method cannot be used for analyzing the snp site, and the data of the partial snp site are removed, so that the result error caused by the abnormality of the partial snp site can be avoided.
2.2 obtaining fifth data of the test sample
Obtaining base information of the snp sites of the first set of sites from the third data of the test sample, including reads supporting base mutations and reads of reference bases; sites where the sequencing depth was below the set value were removed and the resulting data was referred to as fifth data for the reference sample.
Specifically, base information of the snp site of the first site group is acquired from the third data of the detection sample by using the mpireup function of bcftools software; for each snp site, if the number of reads supporting base mutation + the number of reads of the reference base is < 200, it proves that the sequencing depth is not enough, which may affect the accuracy of the analysis result and needs to be removed. The data obtained by the above processing is referred to as fifth data of the test sample.
And 3, judging copy number variation.
3.1 calculation of the snp mutation frequency of the test specimens
The following calculation is performed using the fifth data of the test sample:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
3.2 copy number variation determination
And (3) screening a gene with the snp mutation frequency less than 0.3 as a fourth genome which is possible to generate copy number deletion and screening a gene with the snp mutation frequency more than 0.7 as a fifth genome which is possible to generate copy number amplification according to the calculated snp mutation frequency data in the step 3.1.
For visual display, the calculated snp mutation frequency data can be used for mapping, as shown in fig. 3 in the specification, and the position difference of possible copy number variation of different genes is clearly shown in fig. 3. In the case of CNV without deletion and mutation, the frequency of snp mutations is generally distributed between 0.4 and 0.6, close to the baseline of the snp of the white blood cells. There is a possibility of copy number deletion in the CDKN2A and CDKN2B genes, and copy number amplification in the ERBB2 and MET genes.
As an example of the advantages of the present invention, the EGFR gene judged to be likely to have copy number variation according to FIG. 1 can be substantially excluded from the copy number variation in FIG. 3. This may prove that the present invention has a higher accuracy using the snp assisted judgment than the prior art.
It should be noted that the drawing is meant to more intuitively display data, and the technical solution of the present invention can be completed by data processing even without drawing.
Example two
The application also provides a system for judging copy number variation by using single nucleotide polymorphism, which can be realized in the forms of a mobile terminal, a Personal Computer (PC), a tablet computer, a server and the like. Referring now to FIG. 5, a schematic diagram of a system for determining copy number variation using single nucleotide polymorphisms is shown, which is suitable for implementing embodiments of the present application.
As shown in fig. 5, the computer system 300 includes one or more processors, communication sections, and the like, for example: one or more Central Processing Units (CPUs) 301, and/or one or more image processors (GPUs) 313, etc., which may perform various appropriate actions and processes according to executable instructions stored in a Read Only Memory (ROM) 302 or loaded from a storage section 308 into a Random Access Memory (RAM) 303. The communication section 312 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card.
The processor may communicate with the read-only memory 302 and/or the random access memory 303 to execute the executable instructions, connect with the communication part 312 through the bus 304, and communicate with other target devices through the communication part 312, so as to complete the operations corresponding to any one of the methods provided by the embodiments of the present application, for example:
step 1.2, calculating the CNV copy number score of the detection sample;
the calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth;
step 1.3, screening samples with the copy number score of CNV being more than or equal to 0 and less than or equal to 1 or the copy number score of CNV being more than or equal to 4 and less than 6 as genes with possible copy number variation;
step 2.1 obtaining base information of the snp sites of the second genome, including reads supporting base mutations and reads of reference bases, from the third data of the reference sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fourth data of the reference sample;
the set value of the sequencing depth can be automatically adjusted by an operator according to the amplification times and the like during sequencing, so that the interference is reduced, and the analysis accuracy is improved. In this example, steps 2.1 and 2.2 are both set to 200.
The fourth data for the reference sample was used for the following calculations and screening:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
the screening standard is that the snp site mutation frequency is judged to be stable when being between 0.4 and 0.6, and the snp site mutation frequency is classified as a first site group;
step 2.2 obtaining base information of the snp sites of the first set of sites including reads supporting base mutation and reads of reference base from the third data of the test sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fifth data of the reference sample;
step 3.1 uses the fifth data of the test sample to perform the following calculations:
snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);
step 3.2, if the distribution frequency of the snps is less than 0.3, the deletion of the copy number is possible, and if the distribution frequency of the snps after the gene standardization correction is more than 0.7, the amplification of the copy number is possible.
In addition, in the RAM 303, various programs and data necessary for the operation of the apparatus can also be stored. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304. The ROM 302 is an optional module in the presence of the RAM 303. The RAM 303 stores or writes executable instructions into the ROM 302 at runtime, and the executable instructions cause the processor 301 to perform operations corresponding to the above-described communication method. An input/output interface (I/O interface) 305 is also connected to the bus 304. The communication unit 312 may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) and connected to the bus link.
The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output section 307 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage unit 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary.
It should be noted that the architecture shown in fig. 5 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 5 may be selected, deleted, added, or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication unit 312 may be separately set or integrated on the CPU or the GPU, and so on. These alternative embodiments are all within the scope of the present disclosure.
In particular, the process described with reference to the flowchart of fig. 5 may be implemented as a computer program product according to the present application. For example, the present application provides a computer program product comprising computer readable instructions that when executed by a processor perform the following: in such embodiments, the computer program product may be downloaded and installed from a network via the communication section 309, and/or read and installed from the removable medium 311. The above-described functions as defined in the method of the present application are performed when the computer program product is executed by a Central Processing Unit (CPU) 301.
The solution of the present application may be implemented in many ways. For example, the technical solutions of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The order of the steps used to describe the method is provided for clarity only. Unless specifically limited, the method steps of the present application are not limited to the order specifically described above. Furthermore, in some embodiments, the present application may also be implemented as a storage medium storing a computer program product.