CN115762630A

CN115762630A - Method and system for judging copy number variation by using single nucleotide polymorphism

Info

Publication number: CN115762630A
Application number: CN202211451852.3A
Authority: CN
Inventors: 李奎; 全智慧; 裘宇容; 于雪; 滕祥云
Original assignee: Guangzhou Huayinkang Medical Group Co ltd; Guangzhou Huayin Medical Laboratory Center Co Ltd
Current assignee: Guangzhou Huayinkang Medical Group Co ltd; Guangzhou Huayin Medical Laboratory Center Co Ltd
Priority date: 2022-11-21
Filing date: 2022-11-21
Publication date: 2023-03-07

Abstract

The invention relates to a method for assisting in judging copy number variation by utilizing single nucleotide polymorphism and a data processing system capable of implementing the method. The method for assisting in judging copy number variation by using single nucleotide polymorphism comprises the following steps: calculating the CNV score of the gene according to the third data of the detection sample and the reference sample, screening the gene with possible copy number variation according to the CNV score, and selecting a second genome; performing a snp analysis based on said third data from said reference sample to screen out a first set of loci with stable mutation frequencies from said loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp base information of the detection sample in the first locus set as fifth data; and judging the gene copy number variation condition of the detection sample according to the fifth data.

Description

Method and system for judging copy number variation by using single nucleotide polymorphism

Technical Field

The present invention relates to a method for determining copy number variation and a data processing system, and more particularly, to a method for determining copy number variation using single nucleotide polymorphism and a data processing system for implementing the same.

Background

With the development of individualized medicine and the proposal of the concept of 'precise medicine', the tumor drug treatment develops rapidly, and more gene mutations related to the prediction of the curative effect of drug treatment are gradually discovered and confirmed in clinical research. The traditional gene mutation detection method, such as Sanger sequencing, pyrosequencing, real-time fluorescence PCR and the like, can only detect single genes or partial exon mutation of single genes, and the traditional gene mutation detection method is adopted to simultaneously detect a plurality of genes, so that the required sample size is large, and longer detection time and larger workload are required. High-throughput sequencing (HTS), that is, next-generation sequencing technology (NGS), can simultaneously sequence millions or even billions of DNA fragments, and can detect up to hundreds of tumor-associated genes, whole exons, and whole genomes at a time at a low cost without increasing the amount of samples required. Because of the advantages of NGS in flux, cost and efficiency, NGS has wide application prospect in somatic mutation of solid tumors.

Copy number variation plays an important role in human disease and biology. For example, germ cell copy number variations can have a tremendous impact on development, such as trisomy 21, trisomy 18, and trisomy 13, leading to down syndrome, edwards syndrome, and pato syndrome, respectively, among others. On the other hand, somatic copy number variations are often found in patients with cancer, and are the major drivers of tumor development and drug resistance. Copy number amplification of MYC and copy number deletion of PTEN and TP53 were found in pan-cancer genomic analysis reports in many tumor types. In Acute Myeloid Leukemia (AML), deletions involving a large number of chromosomes 5 and 7 often occur in patients with an unfavorable cytogenetic risk. In multiple myeloma, deletions of chromosome 17 are associated with more aggressive disease, and obtaining deletions of chromosome 17 during disease progression may lead to a worse prognosis. In addition, deletion of TP53 chromosome or amplification of chromosome 1 results in abnormality of myeloma pathogenesis-related genes (e.g., CKS1B, MCL 1), which is associated with poor prognosis. Therefore, the description of cancer-associated copy number variation events is of great interest for determining patient subpopulations and for insight into prognosis and potential therapeutic strategies.

Currently, the NGS method can mainly use 1) for detecting CNV, read depth (read depth): indicating copy number amplifications and deletions based on sliding window read depths; 2) The pair-end method: confirming copy number variation according to the difference between the distance between the two ends of pair-end and the reference genome; 3) And the sequence assembling method comprises the following steps: short reads are assembled and differences between them and the reference genome are sought to confirm copy number variation. The first method based on read depth is a method which is widely applied at present, and the latter two methods are mainly used for detecting other structural variations, such as conversion and transversion. The core technology of the read depth detection method is mainly based on a probability statistical model. The detection method based on probability statistics has an assumption premise that: the read depth and the copy number variation number are in a linear relation, namely the sequencing process is uniform by default, and the read depth of the sliding statistics according to a specific window on a chromosome obeys a specific distribution, such as a Poisson distribution, a Gaussian distribution and the like. If the read depth increases or decreases in the sliding window, it means that the copy number amplification or deletion occurs. However, errors accumulated in the sequencing process cause that the read depth and the copy number variation number are not in a linear relation, so that the method is based on the wrong assumption that the result is relatively large in error.

In addition, the method is low in efficiency and has more subjectivity through manual identification according to the cell marker genes.

To address the limitations of the above approaches, a method has been developed to aid in determining the status of gene copy number variation based on the frequency of multiple single nucleotide polymorphisms in the gene range.

Disclosure of Invention

In order to solve the above problems, the present invention provides a method for determining copy number variation with the aid of single nucleotide polymorphism, comprising the following steps:

step 1: calculating the CNV score of the gene according to the third data of the detection sample and the reference sample, screening the gene with possible copy number variation according to the CNV score, and selecting a second genome;

step 2: performing an snp analysis based on said third data of said reference sample, screening a first set of loci with stable mutation frequencies from loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp base information of the detection sample in the first locus set as fifth data;

and 3, step 3: according to the fifth data, judging the gene copy number variation condition of the detection sample;

and the third data is obtained by removing low-quality sequences from the original sequencing data, aligning the original sequencing data to a reference genome, and performing duplication removal and quality control on the original sequencing data.

The low-quality sequence refers to a sequence with the average base quality and the reads length lower than set values, and an exemplary low-quality sequence removal mode is given in the first embodiment of the invention.

The present invention can be used for second generation sequencing, or sequencing results from high throughput sequencing platforms (e.g., illumina or MGI sequencing data) to identify the copy number variation status of a first genome.

The third data is obtained by the following method:

sequencing the test sample or the reference sample to obtain raw sequencing data including first genome information, wherein the first genome is one or more genes selected according to the analysis purpose;

clear data obtained by removing low-quality sequences from the original sequencing data is used as first data, and the first data is compared with a reference genome, so that the sequencing sequences in the first data are positioned on related genes, and comparison result data is obtained and used as second data;

and carrying out duplicate removal and quality control on the second data to obtain third data.

The method of alignment to the reference genome is: using the bwa software, the first data obtained by removing the low-quality sequences was aligned to the human reference genome hg19 using the mem algorithm, thereby mapping the sequenced sequences to the relevant genes.

The reference sample is a human leukocyte sample, and in order to ensure the diversity of the sample and the statistical significance of the result, a blood sample of a plurality of people can be selected, and 20 blood samples are selected to construct CNV and snp baselines in the invention.

Preferably, step 1 specifically comprises the following steps:

step 1.1 obtaining corrected standardized sequencing depths of a test sample and a reference sample;

when multiple reference samples are used, averaging the corrected standardized sequencing depths of all the reference samples, and calculating an average reference sequencing depth;

step 1.2, calculating the CNV copy number score of the detection sample; the calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth;

step 1.3, screening samples with the copy number score of CNV being more than or equal to 0 and less than or equal to 1 or the copy number score of CNV being more than or equal to 4 and less than 6 as genes with possible copy number variation.

Preferably, the step 2 specifically comprises the following steps:

step 2.1 obtaining base information of the snp sites of the second genome, including reads supporting base mutations and reads of reference bases, from the third data of the reference sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fourth data of the reference sample;

the fourth data for the reference sample was used for the following calculations and screening:

snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation);

the screening standard is that the snp site mutation frequency is judged to be stable when being between 0.4 and 0.6, and the snp site mutation frequency is classified as a first site group.

Step 2.2 obtaining base information of the snp sites of the first set of sites including reads supporting base mutation and reads of reference base from the third data of the test sample; sites where the sequencing depth was below the set value were removed and the resulting data was referred to as fifth data for the reference sample.

Step 3.1 using the fifth data of the test sample to perform the following calculations:

step 3.2, if the distribution frequency of the snp is less than 0.3, the deletion of the copy number is possible, and if the distribution frequency of the snp after the gene standardization correction is more than 0.7, the amplification of the copy number is possible.

The present invention further provides a system for determining copy number variation using single nucleotide polymorphism, comprising:

a memory storing executable instructions, and; and

one or more processors in communication with the memory to execute executable instructions to:

step 2: performing a snp analysis based on said third data from said reference sample to screen out a first set of loci with stable mutation frequencies from said loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp mutation frequency data of the detection sample in the first locus set as fifth data;

and step 3: judging the copy number variation condition of the genes in the second genome of the detection sample according to the fifth data;

and the third data is obtained by removing low-quality sequences, aligning to a reference genome, removing duplication and controlling quality of the original sequencing data.

Preferably, the one or more processors are in communication with the memory to execute executable instructions to:

step 1.2, calculating the CNV copy number score of the detection sample;

the calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth;

step 1.3, screening samples with the copy number score of CNV being more than or equal to 0 and less than or equal to 1 or the copy number score of CNV being more than or equal to 4 and less than 6 as genes with possible copy number variation;

the screening standard is that the snp site mutation frequency is judged to be stable when being between 0.4 and 0.6, and the snp site mutation frequency is classified as a first site group;

step 2.2 obtaining base information of the snp sites of the first set of sites including reads supporting base mutation and reads of reference base from the third data of the test sample; removing sites in which the sequencing depth is lower than a set value, and obtaining data called fifth data of the reference sample;

step 3.2, if the distribution frequency of the snps is less than 0.3, the deletion of the copy number is possible, and if the distribution frequency of the snps after the gene standardization correction is more than 0.7, the amplification of the copy number is possible.

The invention has the advantages that:

the snp distribution frequency is adopted to assist in judging the gene copy number variation condition, and compared with the traditional method which simply depends on sequencing deep judgment, the result is more accurate.

In a preferred embodiment, errors due to poor quality of sequencing data, insufficient depth, and the like can be eliminated.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a CNV score distribution chart of a gene of example;

FIG. 2 is a baseline graph of the snp mutation frequency of a reference sample (white blood cells) in example;

FIG. 3 is a graph showing the frequency distribution of the snp mutations at the first set of positions in the sample according to one embodiment;

FIG. 4 is a flow chart of an embodiment of the present invention;

FIG. 5 is a diagram of a data processing system according to a second embodiment of the present invention.

Detailed Description

The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.

It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

Example one

In this example, a tumor-associated gene was selected as the first genome.

Selecting a tumor sample to be detected in a certain Guangdong hospital as a detection sample; leukocyte samples extracted from 20 blood samples derived from 20 healthy adults, respectively, were selected as reference samples.

Sequencing and pre-processing of sequencing data can be accomplished using conventional methods and software, and this example provides exemplary operations as follows:

sequencing the detection sample and the reference sample by adopting an illumina second-generation sequencing platform to obtain original sequencing data.

Carrying out early-stage processing on the original sequencing data of the detection sample: a. removal of low quality sequences of the original data: because the sequencing instrument generally has the condition that the base quality of sequencing data fluctuates at the sequencing starting stage and the sequencing ending stage and the sequencing quality is low, if the insert is short, sequencing is performed, namely the sequencing contains a sequencing adaptor, and therefore, in order to ensure the accuracy of a sample analysis result, a sequence with low sequencing quality or a sequencing adaptor introduced by sequencing needs to be removed. The removal criteria were as follows: removing the illumina linker used for sequencing, removing bases with the average base mass lower than 15 by taking 4bp as a sliding window, and removing reads with the length lower than 26 to obtain clean data, which is hereinafter referred to as first data for convenience of expression.

b. Aligning the reference genome: and (3) aligning the clean data sequence with the low-quality sequence removed onto a human reference genome by utilizing a bwa algorithm, so as to position the sequencing sequence onto the related gene, thereby obtaining second data.

c. And (3) removing duplication of comparison results: because PCR amplification exists in the experimental process, in order to avoid the influence caused by the amplification data, the aligned data (i.e., the second data) is subjected to deduplication by using the MarkDuplicates module of Picard software.

d. Quality control of sample sequencing quality: and evaluating the sequencing quality and the comparison quality of the sample based on the comparison result so as to judge whether the sample quality meets the requirement, wherein the accuracy of the analysis result cannot be ensured for the sample which cannot meet the requirement. And (3) carrying out quality control on the data after the duplication removal, wherein the quality control standard is as follows: at target >60%, Q30>85%, alignment >95%, mean depth after deduplication >1000x. And the second data is subjected to de-duplication and quality control to obtain third data.

The same process is applied to the raw sequencing data for each reference sample, resulting in third data for each reference sample.

Step 1

Calculation and screening of gene CNV score:

1.1 obtaining corrected normalized sequencing depth

Calculating the sequencing depth of each gene in the first genome region of the third data of the sample by using a multicov function of the bedtools software; calculating the total sequencing depth of the whole sample in the first genome region; standardizing according to the sequencing depth of each gene and the total sequencing depth obtained by calculation; the normalized sequencing depth results were corrected based on GC content using the loess method.

For the reference samples, the data were processed in the same manner, and the average value was taken after the processing results for each reference sample were obtained, to obtain the average reference sequencing depth.

1.2 calculating the CNV copy number score of the test sample

1.3 selection of genes which are likely to have copy number variation

The screening standard is as follows: if the copy number score of the gene CNV is more than or equal to 4 and less than 6, amplification may occur; deletion may occur if the CNV score is less than or equal to 1; the genes in both cases were selected into the second genome.

Referring to FIG. 1 of the specification, the CNV copy number scores of EGFR, MET, and ERBB2 genes exemplarily indicated in the figure meet the screening criteria, and there is a large possibility that copy number variation occurs.

Step 2

The processing of the snp data specifically comprises:

2.1 screening for sites with stable snp mutation frequencies

The analysis of the snp site mutation frequency of 20 leukocyte samples is summarized, and the site with stable mutation frequency in the second genome is selected.

Specifically, base information of the snp site of the second genome is obtained from the third data of each reference sample by using the mpieup function of bcftools software; reading in the R language with the input, all relevant reads in the second genome in the sample can be obtained, including reads supporting base mutation and reads of the reference base. For each snp site, if the number of reads supporting base mutation + the number of reads of the reference base is < 200, it proves that the sequencing depth is not enough, which may affect the accuracy of the analysis result and needs to be removed. The data obtained after the above-described processing is referred to as fourth data of the reference sample.

snp mutation frequency = number of reads supporting base mutation/(number of reads of reference base + number of reads supporting base mutation)

The screening criteria is that the frequency of the snp site mutation is judged to be stable when the frequency is between 0.4 and 0.6, and the snp site mutation is classified into a first site group.

To visually display the results, a baseline of the snp mutation frequency of the white blood cells can be made, see the attached figure 2 of the specification.

The snp site with unstable mutation frequency in the white blood cells is generally unstable in other cells, so that the method cannot be used for analyzing the snp site, and the data of the partial snp site are removed, so that the result error caused by the abnormality of the partial snp site can be avoided.

2.2 obtaining fifth data of the test sample

Obtaining base information of the snp sites of the first set of sites from the third data of the test sample, including reads supporting base mutations and reads of reference bases; sites where the sequencing depth was below the set value were removed and the resulting data was referred to as fifth data for the reference sample.

Specifically, base information of the snp site of the first site group is acquired from the third data of the detection sample by using the mpireup function of bcftools software; for each snp site, if the number of reads supporting base mutation + the number of reads of the reference base is < 200, it proves that the sequencing depth is not enough, which may affect the accuracy of the analysis result and needs to be removed. The data obtained by the above processing is referred to as fifth data of the test sample.

And 3, judging copy number variation.

3.1 calculation of the snp mutation frequency of the test specimens

The following calculation is performed using the fifth data of the test sample:

3.2 copy number variation determination

And (3) screening a gene with the snp mutation frequency less than 0.3 as a fourth genome which is possible to generate copy number deletion and screening a gene with the snp mutation frequency more than 0.7 as a fifth genome which is possible to generate copy number amplification according to the calculated snp mutation frequency data in the step 3.1.

For visual display, the calculated snp mutation frequency data can be used for mapping, as shown in fig. 3 in the specification, and the position difference of possible copy number variation of different genes is clearly shown in fig. 3. In the case of CNV without deletion and mutation, the frequency of snp mutations is generally distributed between 0.4 and 0.6, close to the baseline of the snp of the white blood cells. There is a possibility of copy number deletion in the CDKN2A and CDKN2B genes, and copy number amplification in the ERBB2 and MET genes.

As an example of the advantages of the present invention, the EGFR gene judged to be likely to have copy number variation according to FIG. 1 can be substantially excluded from the copy number variation in FIG. 3. This may prove that the present invention has a higher accuracy using the snp assisted judgment than the prior art.

It should be noted that the drawing is meant to more intuitively display data, and the technical solution of the present invention can be completed by data processing even without drawing.

Example two

The application also provides a system for judging copy number variation by using single nucleotide polymorphism, which can be realized in the forms of a mobile terminal, a Personal Computer (PC), a tablet computer, a server and the like. Referring now to FIG. 5, a schematic diagram of a system for determining copy number variation using single nucleotide polymorphisms is shown, which is suitable for implementing embodiments of the present application.

As shown in fig. 5, the computer system 300 includes one or more processors, communication sections, and the like, for example: one or more Central Processing Units (CPUs) 301, and/or one or more image processors (GPUs) 313, etc., which may perform various appropriate actions and processes according to executable instructions stored in a Read Only Memory (ROM) 302 or loaded from a storage section 308 into a Random Access Memory (RAM) 303. The communication section 312 may include, but is not limited to, a network card, which may include, but is not limited to, an IB (Infiniband) network card.

The processor may communicate with the read-only memory 302 and/or the random access memory 303 to execute the executable instructions, connect with the communication part 312 through the bus 304, and communicate with other target devices through the communication part 312, so as to complete the operations corresponding to any one of the methods provided by the embodiments of the present application, for example:

step 1.2, calculating the CNV copy number score of the detection sample;

the set value of the sequencing depth can be automatically adjusted by an operator according to the amplification times and the like during sequencing, so that the interference is reduced, and the analysis accuracy is improved. In this example, steps 2.1 and 2.2 are both set to 200.

step 3.1 uses the fifth data of the test sample to perform the following calculations:

In addition, in the RAM 303, various programs and data necessary for the operation of the apparatus can also be stored. The CPU 301, ROM 302, and RAM 303 are connected to each other via a bus 304. The ROM 302 is an optional module in the presence of the RAM 303. The RAM 303 stores or writes executable instructions into the ROM 302 at runtime, and the executable instructions cause the processor 301 to perform operations corresponding to the above-described communication method. An input/output interface (I/O interface) 305 is also connected to the bus 304. The communication unit 312 may be integrated, or may be provided with a plurality of sub-modules (e.g., a plurality of IB network cards) and connected to the bus link.

The following components are connected to the I/O interface 305: an input portion 306 including a keyboard, a mouse, and the like; an output section 307 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage unit 308 including a hard disk and the like; and a communication section 309 including a network interface card such as a LAN card, a modem, or the like. The communication section 309 performs communication processing via a network such as the internet. A drive 310 is also connected to the I/O interface 305 as needed. A removable medium 311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 310 as necessary.

It should be noted that the architecture shown in fig. 5 is only an optional implementation manner, and in a specific practical process, the number and types of the components in fig. 5 may be selected, deleted, added, or replaced according to actual needs; in different functional component settings, separate settings or integrated settings may also be used, for example, the GPU and the CPU may be separately set or the GPU may be integrated on the CPU, the communication unit 312 may be separately set or integrated on the CPU or the GPU, and so on. These alternative embodiments are all within the scope of the present disclosure.

In particular, the process described with reference to the flowchart of fig. 5 may be implemented as a computer program product according to the present application. For example, the present application provides a computer program product comprising computer readable instructions that when executed by a processor perform the following: in such embodiments, the computer program product may be downloaded and installed from a network via the communication section 309, and/or read and installed from the removable medium 311. The above-described functions as defined in the method of the present application are performed when the computer program product is executed by a Central Processing Unit (CPU) 301.

The solution of the present application may be implemented in many ways. For example, the technical solutions of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The order of the steps used to describe the method is provided for clarity only. Unless specifically limited, the method steps of the present application are not limited to the order specifically described above. Furthermore, in some embodiments, the present application may also be implemented as a storage medium storing a computer program product.

Claims

1. A method for assisting in judging copy number variation by using single nucleotide polymorphism is characterized by comprising the following steps:

and 2, step: performing an snp analysis based on said third data of said reference sample, screening a first set of loci with stable mutation frequencies from loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp base information of the detection sample in the first locus set as fifth data;

and 3, step 3: judging the gene copy number variation condition of the detection sample according to the fifth data;

and the third data is obtained by removing low-quality sequences from the original sequencing data, comparing the low-quality sequences to a reference genome, removing duplication and performing quality control, wherein the low-quality sequences refer to sequences of which the average base quality and the reads length are lower than preset values.

2. The method of claim 1, wherein the third data is obtained by:

sequencing the test sample or the reference sample to obtain raw sequencing data including information of a first genome, the first genome being one or more genes selected by a person according to an analysis purpose;

taking clean data obtained by removing a low-quality sequence from the original sequencing data as first data, and comparing the first data to a reference genome, so that a sequencing sequence in the first data is positioned on a related gene to obtain comparison result data as second data;

3. The method of claim 1, wherein the reference sample is a human leukocyte sample.

4. The method of claim 1, wherein said calculating a gene CNV score comprises the steps of:

step 1.2, calculating the CNV copy number score of the detection sample;

the calculation method is as follows: for each region, the CNV copy number score for that region =2 × corrected normalized sequencing depth/average reference sequencing depth.

5. The method of claim 4, wherein the method of screening for genes that may have copy number variation comprises the steps of:

6. The method of claim 1, wherein the step of selecting a first site group with stable mutation frequency comprises the steps of:

7. The method according to claim 1, wherein the fifth data acquisition method comprises the steps of:

8. The method of claim 1, wherein step 3 comprises the steps of:

9. A data processing system, comprising:

a memory storing executable instructions; and the number of the first and second groups,

step 1: calculating the CNV score of the gene according to the third data of the detection sample and the reference sample, screening genes with possible copy number variation according to the CNV score, and selecting a second genome;

step 2: performing an snp analysis based on said third data of said reference sample, screening a first set of loci with stable mutation frequencies from loci of said second genome; performing an snp analysis according to the third data of the detection sample to obtain the snp mutation frequency data of the detection sample in the first locus set as fifth data;

and the third data is obtained by removing low-quality sequences from the original sequencing data, comparing the low-quality sequences to a reference genome, removing duplication and controlling quality, wherein the low-quality sequences refer to sequences with average base quality and reads length lower than set values.

10. The data processing system of claim 9, wherein the one or more processors are in communication with the memory to execute executable instructions to:

step 1.2, calculating the CNV copy number score of the detection sample;

the screening standard is that the snp site mutation frequency is judged to be stable when being between 0.4 and 0.6, and the snp site mutation frequency is classified into a first site group;