CN113299342B - Copy number variation detection method and detection device based on chip data - Google Patents


Info

Publication number
CN113299342B
Authority
CN
China
Prior art keywords
window
detected
signal intensity
fluorescence signal
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110673034.7A
Other languages
Chinese (zh)
Other versions
CN113299342A (en)
Inventor
卢娜如
张军
孔令印
梁波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Basecare Medical Device Co ltd
Original Assignee
Suzhou Basecare Medical Device Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Basecare Medical Device Co ltd
Priority to CN202110673034.7A
Publication of CN113299342A
Application granted
Publication of CN113299342B
Legal status: Active

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Investigating, Analyzing Materials By Fluorescence Or Luminescence (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The application relates to a copy number variation detection method and device based on chip data. In the method, the sequence to be detected of the fluorescence signal intensity data of each window of the sample to be detected on the genome is compared with the reference sequence of the fluorescence signal intensity data of the corresponding window of a reference sample in a pre-established reference library, yielding a comparison data sequence of the fluorescence signal intensity of each window. Significant differences between the windows are then detected, and the target window and the corresponding region to be determined in the sample to be detected are determined from them. An abnormal region in the sample to be detected can thus be effectively identified according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold, and the boundary of the abnormal region can be determined directly from the target window corresponding to the abnormal region without manual intervention, so that high-resolution copy number variation detection is achieved.

Description

Copy number variation detection method and detection device based on chip data
Technical Field
The present application relates to the field of genetic data analysis technology, and in particular, to a method, a device, a computer device, and a storage medium for detecting copy number variation based on chip data.
Background
Copy number variation (CNV) is caused by rearrangement of the genome. It generally refers to an increase or decrease in the copy number of a genome fragment 1 kb or more in length, and mainly manifests as deletion and duplication at the sub-microscopic level. In recent years, various techniques have been developed for the detection of CNV in the human genome.
Current CNV detection methods mainly comprise CNV detection based on high-throughput sequencing and CNV detection based on chips. Among chip-based approaches, the high-density SNP (Single Nucleotide Polymorphism) chip is commonly used. SNP chips can be divided into five types: allele-specific hybridization (ASH), allele-specific primer extension (ASPE), single-base chain extension (SBCE), allele-specific cleavage (ASC), and allele-specific ligation (ASL). Algorithms commonly used for CNV detection with SNP chips include PennCNV and cnvPartition. PennCNV extracts the fluorescence signal intensity of the alleles from the SNP chip, integrates SNP position and SNP allele frequency information, and identifies CNVs with a Hidden Markov Model (HMM) algorithm; cnvPartition identifies CNVs mainly from two indicators, the Log R Ratio (LRR) and the B Allele Frequency (BAF), and reports a confidence value.
However, these SNP-chip CNV detection methods mainly rely on semi-automatic analysis in Windows software, and the off-chip data must first be converted into a fixed format by third-party data conversion software before it can be analyzed. This is cumbersome to operate, and it also suffers from low sensitivity in detecting mosaicism (chimerism) and an excessively high false-positive rate in CNV identification. In addition, the start and end positions of a CNV cannot be given accurately, so the CNV boundary has to be determined by manual inspection. As can be seen, the conventional technology lacks an effective detection scheme for copy number variation.
Disclosure of Invention
Based on this, it is necessary to provide a copy number variation detection method, a detection apparatus, a computer device, and a storage medium based on chip data capable of effectively detecting copy number variation, in view of the problems of the copy number variation detection in the conventional techniques described above.
A copy number variation detection method based on chip data, the method comprising:
according to the gene sequencing data of a sample to be detected, obtaining a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window;
detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensities of the windows, and determining a target window and a corresponding region to be determined in the sample to be detected based on the significant differences between the windows, wherein the target window is a window corresponding to a starting position and a window corresponding to an ending position of the region to be determined in the sample to be detected;
and determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensity of each window and a preset variation threshold value.
In one embodiment, the preset variation threshold includes a preset copy number deletion threshold and a preset copy number duplication threshold; the determining the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensities of each window and the preset variation threshold comprises the following steps: obtaining the average of the comparison data of the window fluorescence signal intensities in the region to be determined, according to the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined in the sample to be detected and the comparison data sequence of the window fluorescence signal intensities; matching this average against the preset copy number deletion threshold and the preset copy number duplication threshold; if the average of the region to be determined matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with a copy number deletion; if the average of the region to be determined matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with a copy number duplication; and if the average of the region to be determined matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determining that the region to be determined is not an abnormal region.
In one embodiment, after the determining the abnormal region in the sample to be detected, the method further includes: obtaining the copy number of the abnormal region in the sample to be detected; and calculating the chimeric proportion of the abnormal region according to the copy number of the abnormal region and a preset variation threshold value.
In one embodiment, the calculating the chimeric proportion of the abnormal region includes: if the abnormal region is determined to be an abnormal region with a copy number duplication, calculating a first difference between the copy number of the abnormal region and 2, and determining the quotient of the first difference and the copy number duplication threshold as the chimeric proportion of the abnormal region; if the abnormal region is determined to be an abnormal region with a copy number deletion, calculating a second difference between 2 and the copy number of the abnormal region, and determining the quotient of the second difference and the copy number deletion threshold as the chimeric proportion of the abnormal region.
In one embodiment, the obtaining the sequence to be detected of the fluorescence signal intensity data of each window of the genome of the sample to be detected according to the gene sequencing data of the sample to be detected includes: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each position corresponding to the gene sequencing data of the sample to be detected; window segmentation is carried out on the genome of the sample to be detected according to window division conditions of the reference sample, and the fluorescent signal intensity to be detected of the window is obtained according to the fluorescent signal intensity to be detected of the sites in each window; carrying out local linear regression correction on the fluorescence signal intensity to be detected of the windows in the sample to be detected to obtain corrected fluorescence signal intensity data to be detected of each window; and generating a sequence to be detected of the fluorescence signal intensity data based on the fluorescence signal intensity data to be detected of each window in the sample to be detected.
In one embodiment, the comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window on the genome of a reference sample in a pre-established reference library based on each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window includes: for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence; acquiring the ratio of the fluorescence signal intensity data to be detected and the reference fluorescence signal intensity data of the window, and determining the logarithm of the ratio based on 2 as comparison data of the fluorescence signal intensity of the window; and obtaining comparison data sequences of the fluorescence signal intensities of the windows based on the comparison data of the fluorescence signal intensities of the windows of each genome.
In one embodiment, the detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensities of the windows, and determining the target window and the corresponding region to be determined in the sample to be detected based on the significant differences between the windows includes: identifying the significant differences between the windows by a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; obtaining, based on the significant differences between the windows, a plurality of regions that differ significantly from one another, wherein no significant difference exists between the windows within each region; identifying the significant differences between the plurality of regions, and merging adjacent regions if no significant difference exists between them; and if a significant difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its starting position and the window corresponding to its ending position, based on the adjacent regions with the significant difference.
A copy number variation detection apparatus based on chip data, the apparatus comprising:
the fluorescence signal intensity data acquisition module is used for acquiring a sequence to be detected of fluorescence signal intensity data of each window of a genome of a sample to be detected according to gene sequencing data of the sample to be detected;
the comparison processing module is used for comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window;
the region identification module is used for detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining a target window and a corresponding region to be determined in a sample to be detected based on the significant differences between the windows, wherein the target window is a window corresponding to the starting position and a window corresponding to the ending position of the region to be determined in the sample to be detected;
the abnormal region determining module is used for determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensity of each window and a preset variation threshold value.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the method as described above when the processor executes the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor realizes the steps of the method as described above.
According to the copy number variation detection method, the detection device, the computer equipment and the storage medium, the sequence to be detected of the fluorescence signal intensity data of each window of the sample to be detected on the genome is obtained from the gene sequencing data of the sample to be detected. For each window on the genome, the sequence to be detected is compared with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample, yielding a comparison data sequence of the fluorescence signal intensity of each window. Significant differences between the windows are then detected, and the target window and the corresponding region to be determined in the sample to be detected are determined from them. The abnormal region in the sample to be detected can thus be effectively identified according to the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold, and the boundary of the abnormal region can be determined directly from the target window corresponding to the abnormal region without manual intervention, so that high-resolution copy number variation detection is achieved.
Drawings
FIG. 1 is a flow chart of a copy number variation detection method according to an embodiment;
FIG. 2 is a flowchart of the step of acquiring the sequence to be detected of the fluorescence signal intensity data in one embodiment;
FIG. 3 is a flow chart of a comparison step of fluorescent signal intensity data according to one embodiment;
FIG. 4 is a flow chart of a copy number variation detection method according to another embodiment;
FIG. 5 is a block diagram showing a copy number variation detecting apparatus according to an embodiment;
FIG. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, a method for detecting copy number variation based on chip data is provided, and this embodiment is applied to a terminal for illustration, where it is understood that the method may also be applied to a server, and may also be applied to a system including a terminal and a server, and implemented through interaction between the terminal and the server. The terminal may be, but not limited to, various personal computers, notebook computers, smartphones, tablet computers and portable wearable devices, and the server may be implemented by a separate server or a server cluster formed by a plurality of servers. In this embodiment, the method includes the steps of:
Step 102, obtaining a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome according to the gene sequencing data of the sample to be detected.
The sample to be detected is a sample on which copy number variation detection is to be performed. The gene sequencing data refers to the raw data output for the corresponding sample after the chip run. The windows are obtained by dividing the chip SNP (Single Nucleotide Polymorphism) sites on each chromosome of the whole genome according to a preset rule. The sequence to be detected is obtained based on the fluorescence signal intensity data of each window of the corresponding sample on the genome. In this embodiment, the gene sequencing data of the sample to be detected is processed to obtain the fluorescence signal intensity data of each window of the sample to be detected on the genome, and the sequence to be detected of the fluorescence signal intensity data of the sample to be detected is generated based on the fluorescence signal intensity data of each window and the order of the windows on the genome.
Step 104, comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of the fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of the fluorescence signal intensity of each window.
The reference sample is a normal sample, i.e. a sample without copy number variation. The reference library is a sample database built from the reference samples. In this embodiment, the reference library stores a reference sequence of fluorescence signal intensity data for each window of the reference sample on the genome. Specifically, the gene sequencing data of the reference sample is processed in advance to obtain fluorescence signal intensity data of each window of the reference sample on the genome; a reference sequence of the fluorescence signal intensity data of the reference sample is generated based on the fluorescence signal intensity data of each window and the order of the windows on the genome; and the reference library is built from this reference sequence, so that the reference sequence can be retrieved directly from the reference library when copy number variation detection is required for a sample to be detected. It can be understood that the data processing and window division applied to the sample to be detected should follow the same procedure used to obtain the reference sequence of the reference sample, so that the windows of the sample to be detected correspond one-to-one to the windows of the reference sample on the genome and the data have the same dimension. The comparison refers to the process of comparing, under specified conditions, the fluorescence signal intensity data of each window of the sample to be detected with that of the corresponding window of the reference sample.
Specifically, in this embodiment, the comparison data of the fluorescence signal intensity of each window is obtained by comparing the fluorescence signal intensity data of the sample to be detected corresponding to each window with the fluorescence signal intensity data of the corresponding window in the reference sample based on each window on the genome, and then the corresponding comparison data sequence of the fluorescence signal intensity is obtained according to the comparison data of the fluorescence signal intensity of each window on the genome and the arrangement sequence of each window on the genome.
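The per-window comparison can be illustrated with a minimal Python sketch. It assumes, as stated in one embodiment of the disclosure, that the comparison data of a window is the base-2 logarithm of the ratio between the intensity to be detected and the reference intensity. The function name and the example values are illustrative only, and all intensities are assumed to be positive.

```python
from math import log2


def log2_ratio_sequence(test, reference):
    """Return the per-window log2(test / reference) comparison sequence.

    `test` and `reference` are lists of window intensities in genome order;
    both must describe the same windows (same length, same order).
    """
    if len(test) != len(reference):
        raise ValueError("test and reference must cover the same windows")
    return [log2(t / r) for t, r in zip(test, reference)]


# Illustrative values: window 2 is elevated, window 3 is halved.
comparison = log2_ratio_sequence([1.0, 1.5, 0.5], [1.0, 1.0, 1.0])
```

A window whose intensity equals the reference gives 0, and a window at half the reference intensity gives -1, which is why a log scale is convenient for setting deletion and duplication thresholds on either side of zero.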
Step 106, detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensities of the windows, and determining a target window and a corresponding region to be determined in the sample to be detected based on the significant differences between the windows.
The to-be-determined areas are areas with statistical differences identified based on the significant differences among the windows, the windows in each to-be-determined area have no significant differences, and the target windows are windows corresponding to the starting positions and the ending positions of the to-be-determined areas in the to-be-detected sample. Specifically, in this embodiment, the significance difference between the windows on the genome is detected based on the comparison data sequence of the fluorescence signal intensities of the windows, and then the windows are combined based on the significance difference between the windows on the genome, that is, the front and rear windows having no significance difference are combined, so as to obtain one or more regions to be determined after the combination.
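The merging of adjacent windows or regions that show no significant difference can be sketched as follows. The source does not name the statistical algorithm, so this sketch substitutes Welch's t statistic with a fixed cutoff purely for illustration; the cutoff value, the function names and the tuple layout are all hypothetical. Each input region is assumed to hold at least two values, and the input list is assumed to be non-empty.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t statistic for two samples (each with >= 2 values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if mean(a) == mean(b) else float("inf")
    return abs(mean(a) - mean(b)) / se


def merge_regions(regions, cutoff=3.0):
    """Greedily merge adjacent regions that do not differ significantly.

    `regions` is a list of (start_window, end_window, values) in genome order.
    `cutoff` is a hypothetical significance cutoff on the t statistic.
    """
    merged = [regions[0]]
    for start, end, values in regions[1:]:
        prev_start, _prev_end, prev_values = merged[-1]
        if welch_t(prev_values, values) < cutoff:
            # No significant difference: extend the previous region.
            merged[-1] = (prev_start, end, prev_values + values)
        else:
            # Significant difference: this region starts a new segment.
            merged.append((start, end, values))
    return merged


regions = [
    (0, 2, [0.0, 0.02, -0.01]),
    (3, 5, [0.01, -0.02, 0.0]),
    (6, 8, [0.58, 0.6, 0.57]),
    (9, 11, [0.01, 0.0, -0.01]),
]
boundaries = [(s, e) for s, e, _ in merge_regions(regions)]
```

In this example the first two regions are merged, while the elevated third region keeps its own start and end windows, which play the role of the target windows.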
And step 108, determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensities of each window and a preset variation threshold.
Since copy number variation is generally expressed as copy number loss or copy number duplication, the variation threshold set in advance in this embodiment may include a copy number loss threshold and a copy number duplication threshold. In this embodiment, a comparison data average value of the window fluorescent signal intensities in the to-be-determined area is obtained based on the comparison data sequence of the window fluorescent signal intensities, and then the comparison data average value of the window fluorescent signal intensities in the to-be-determined area is matched with a preset variation threshold value, if so, the corresponding to-be-determined area in the to-be-detected sample is determined to be an abnormal area, otherwise, the corresponding to-be-determined area in the to-be-detected sample is determined to be not an abnormal area.
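The thresholding step can be sketched in Python as below. The source does not give threshold values or define what "matching" a threshold means, so this sketch assumes that a region mean at or below the deletion threshold indicates a deletion and a mean at or above the duplication threshold indicates a duplication. The threshold values and the function name are hypothetical.

```python
from statistics import mean


def classify_region(comparison, start, end,
                    deletion_thr=-0.3, duplication_thr=0.25):
    """Classify a region to be determined from its mean comparison value.

    `comparison` is the per-window comparison data sequence, and `start` and
    `end` are the indices of the target windows (inclusive). The two
    thresholds are hypothetical placeholders, not values from the source.
    """
    region_mean = mean(comparison[start:end + 1])
    if region_mean <= deletion_thr:
        return "deletion"
    if region_mean >= duplication_thr:
        return "duplication"
    return "normal"


comparison = [0.0, 0.01, -0.9, -1.1, -1.0, 0.02]
```

With these illustrative values, `classify_region(comparison, 2, 4)` evaluates a region whose mean is -1.0, while `classify_region(comparison, 0, 1)` evaluates a region close to zero.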
According to the copy number variation detection method, the sequence to be detected of the fluorescence signal intensity data of each window of the sample to be detected on the genome is obtained from the gene sequencing data of the sample to be detected. For each window on the genome, the sequence to be detected is compared with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample, yielding a comparison data sequence of the fluorescence signal intensity of each window. Significant differences between the windows are then detected, and the target window and the corresponding region to be determined in the sample to be detected are determined from them. The abnormal region in the sample to be detected can thus be effectively identified according to the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold, and the boundary of the abnormal region can be determined directly from the target window corresponding to the abnormal region, so that high-resolution copy number variation detection is achieved.
In one embodiment, as shown in fig. 2, according to the gene sequencing data of the sample to be detected, the obtaining the sequence to be detected of the fluorescence signal intensity data of each window of the sample to be detected on the genome may specifically include:
Step 202, performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each position corresponding to the gene sequencing data of the sample to be detected.
The gene sequencing data refers to the original data of the chip corresponding to the sample. In this embodiment, the fluorescence signal intensity of each site corresponding to the gene sequencing data of the sample to be detected is obtained by performing data conversion on the gene sequencing data of the sample to be detected, and in this embodiment, in order to distinguish the fluorescence signal intensity of the sample to be detected from the fluorescence signal intensity of the reference sample, the fluorescence signal intensity of each site of the sample to be detected is referred to as the fluorescence signal intensity to be detected, and the fluorescence signal intensity of each site of the reference sample is referred to as the reference fluorescence signal intensity.
Further, because the sample loading amount and other factors differ between samples, the R values (the fluorescence signal intensities to be detected) of the sites of each sample to be detected need to be normalized so that the R values of all samples are mapped to the same scale, thereby eliminating errors caused by differing scales. In addition, differences between samples in the chip, the reagents or the operator may introduce measurement errors, so the normalized fluorescence signal intensity at the same site may still fluctuate, and sites with large fluctuations can affect subsequent CNV detection. Therefore, in this embodiment, the chip sites may be filtered using statistics that reflect the dispersion of the data at the same site, such as standard deviation (SD) and evenness, and sites whose fluorescence signal intensity is 0 in the sample are removed at the same time, which improves the accuracy of subsequent detection.
Step 204, carrying out window segmentation on the genome of the sample to be detected according to the window division condition of the reference sample, and obtaining the fluorescence signal intensity to be detected of each window according to the fluorescence signal intensity to be detected of the sites in that window.
Because the sequence to be detected of the fluorescence signal intensity data of the sample to be detected is compared, window by window on the genome, with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample, the window division conditions must be consistent. For this reason, in this embodiment the genome of the sample to be detected is divided into windows under the same conditions as the reference sample. Specifically, the window division condition of the reference sample requires that the number of sites in one window is ≤ 10 or that the window size is ≤ 18 kb. To improve the accuracy of the comparison and avoid errors caused by differing chip data, the gene sequencing data of the sample to be detected obtained in the above steps should come from the same chip type as the gene sequencing data of the reference sample. In this embodiment, the genome of the sample to be detected is divided according to the window division condition of the reference sample to obtain N windows; the average of the fluorescence signal intensities to be detected of all sites in each window is then computed and used as the fluorescence signal intensity to be detected of the corresponding window, denoted B. The calculation formula is as follows:
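The normalization and site filtering described above can be sketched as follows. The source does not specify the normalization scheme or any cutoff, so this sketch assumes that each sample is scaled by its own median R value and that a site is kept when the standard deviation of its normalized values across samples does not exceed a hypothetical `max_sd`. The evenness statistic is omitted, and the site names and values are invented for illustration.

```python
from statistics import median, pstdev


def normalize(sample):
    """Scale one sample's per-site R values by the sample median.

    Median scaling is an assumed scheme; the sample median must be nonzero.
    """
    m = median(sample.values())
    return {site: r / m for site, r in sample.items()}


def keep_sites(samples, max_sd=0.2):
    """Return sites that are nonzero in every sample and stable across samples.

    `samples` is a list of {site_id: R value} dicts sharing the same sites.
    `max_sd` is a hypothetical cutoff on the normalized standard deviation.
    """
    normalized = [normalize(s) for s in samples]
    kept = []
    for site in samples[0]:
        if any(s[site] == 0 for s in samples):
            continue  # drop sites with zero intensity in any sample
        if pstdev([n[site] for n in normalized]) <= max_sd:
            kept.append(site)
    return kept


samples = [
    {"rs1": 1.0, "rs2": 2.0, "rs3": 0.0, "rs4": 1.0},
    {"rs1": 2.0, "rs2": 4.0, "rs3": 2.0, "rs4": 8.0},
]
kept = keep_sites(samples)
```

Here `rs3` is dropped because of its zero intensity, and `rs2` and `rs4` are dropped because their normalized values fluctuate more than the assumed cutoff allows.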
$$B_i = \frac{1}{n}\sum_{j=1}^{n} R_j$$

where $B_i$ is the R value of the i-th window (i.e., the fluorescence signal intensity to be detected of the i-th window), $n$ is the number of sites contained in the i-th window, and $R_j$ is the R value of the j-th site in the window (i.e., the fluorescence signal intensity to be detected at the j-th site).
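The window average described above can be sketched in a few lines of Python. The function name and the example values are illustrative, and every window is assumed to contain at least one site after filtering.

```python
from statistics import mean


def window_intensities(windows):
    """Return the mean per-site R value of each window, in genome order.

    `windows` is a list of lists, one inner list of site R values per window.
    """
    return [mean(sites) for sites in windows]


b = window_intensities([[1.0, 2.0, 3.0], [4.0]])
```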
Step 206, performing local linear regression correction on the fluorescence signal intensity to be detected of the windows in the sample to be detected, to obtain corrected fluorescence signal intensity data to be detected for each window.
In a genome sequence, properties such as GC content (the proportion of G and C bases among the A, T, G and C bases of a stretch of genome sequence) affect both the amplification efficiency of the sample sequence and the binding efficiency between the target sequence and the probe during chip sequencing, so that the light intensity ratio of the genome shows a nonlinear distribution. To eliminate the influence of such genome properties, this embodiment performs GC correction on the fluorescence signal intensity value of each window. The principle of GC correction is that the fluorescence signal intensities of regions with the same GC content are affected by GC content in the same way, whereas the fluorescence signal intensities of regions with different GC content should in theory differ. On this basis, the fluorescence signal intensities of windows with the same GC content can be multiplied by a fixed weight, correcting the fluorescence signal intensity back to a linear level.
Specifically, the GC content of the chromosomal sequence in each window of the sample to be detected is calculated and denoted C_1, C_2, C_3, …, C_N. It will be appreciated that, where the fluorescence signal intensity of each window is known, the GC content of each window can be obtained by the corresponding conversion. Let M be the median of the fluorescence signal intensities of the N windows (i.e., all windows in the sample to be detected). Windows whose GC content is equal to the thousandths place are grouped into one category, giving categories G_1, G_2, G_3, …, G_M (the subscript M here is the number of categories, distinct from the median M). Suppose a given GC content G_i contains n windows whose fluorescence signal intensity values are B_1, B_2, B_3, …, B_n; the median of these n values is M_i. On this basis a weight is assigned to each window, eliminating the effect of GC content on fluorescence signal intensity. GC correction yields the corrected fluorescence signal intensity data to be detected for every window on all chromosomes. The correction can be performed with the following formula:

$$B_{GC} = B \times \frac{M}{M_i}$$

where M is the median of the fluorescence signal intensity values of all windows in the sample to be detected, M_i is the median of the fluorescence signal intensities of all windows with GC content G_i, B is the fluorescence signal intensity of a given window with GC content G_i before correction, and B_GC is the fluorescence signal intensity of that window after correction. Specifically, the value of M depends on the parity of the number N of windows in the sample to be detected. With the window intensities sorted in ascending order as B_(1), B_(2), …, B_(N), when N is odd,

$$M = B_{\left(\frac{N+1}{2}\right)}$$

and when N is even,

$$M = \frac{B_{\left(\frac{N}{2}\right)} + B_{\left(\frac{N}{2}+1\right)}}{2}$$
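A minimal sketch of the median-based GC correction follows. It assumes that the fixed weight for a GC category is the ratio M / M_i of the overall median to the category median, which is what the symbol definitions suggest; the function name `gc_correct` and the input layout are illustrative only.

```python
from collections import defaultdict
from statistics import median


def gc_correct(intensities, gc_contents):
    """Rescale window intensities so every GC category shares one median.

    intensities: window intensities B before correction (assumed positive).
    gc_contents: GC fraction of each window (0..1), in the same order.
    Assumption: weight = M / M_i (overall median over category median).
    """
    overall = median(intensities)  # M
    groups = defaultdict(list)
    for b, gc in zip(intensities, gc_contents):
        groups[round(gc, 3)].append(b)  # categories at three decimal places
    group_median = {g: median(values) for g, values in groups.items()}  # M_i
    corrected = []
    for b, gc in zip(intensities, gc_contents):
        m_i = group_median[round(gc, 3)]
        if m_i == 0:
            raise ValueError("category median is zero; cannot rescale")
        corrected.append(b * overall / m_i)  # B_GC = B * M / M_i
    return corrected


raw = [1.0, 2.0, 3.0, 4.0]
gc = [0.400, 0.400, 0.500, 0.500]
print(gc_correct(raw, gc))
```

After correction, the median of each GC category equals the overall median M, so differences that remain between windows are no longer attributable to GC content under this model.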
Step 208, generating a sequence to be detected of the fluorescence signal intensity data based on the fluorescence signal intensity data to be detected of each window in the sample to be detected.
Specifically, the preceding steps apply local linear regression correction to the fluorescence signal intensity to be detected of each window in the sample to be detected, yielding corrected fluorescence signal intensity data for each window. The sequence to be detected of the fluorescence signal intensity data is then formed by arranging the corrected data of the windows in the order of the windows on the genome.
In one embodiment, as shown in fig. 3, comparing, window by window on the genome, the sequence to be detected of the fluorescence signal intensity data with the reference sequence of the fluorescence signal intensity data of the corresponding window of the reference sample in the pre-established reference library, to obtain the comparison data sequence of the fluorescence signal intensity of each window, specifically includes:
Step 302, for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence.
It should be noted that, in this embodiment, the reference sequence of the fluorescence signal intensity data of the corresponding windows of the reference sample is obtained by processing the gene sequencing data of the reference sample with the same method as shown in fig. 2. The gene sequencing data of the sample to be detected and of the reference sample come from the same chip, and both are divided into windows under the same window division condition, so that the data of each window can be compared directly. Specifically, for each window on the genome, the fluorescence signal intensity data to be detected of the window is extracted from the sequence to be detected, the reference fluorescence signal intensity data of the corresponding window is extracted from the reference sequence, and the two are compared in the subsequent steps.
Step 304, obtaining the ratio of the fluorescence signal intensity data to be detected of the window to the reference fluorescence signal intensity data of the window, and taking the base-2 logarithm of the ratio as the comparison data of the window fluorescence signal intensity.
For each window on the genome, the corrected fluorescence signal intensity data to be detected is compared with the corrected reference fluorescence signal intensity data of the same window, and the comparison data of the window fluorescence signal intensity is calculated. In particular, the comparison data CNV_i of the window fluorescence signal intensity (also known as the log2RR value) can be determined using the following formula:

$$CNV_i = \log_2\left(\frac{TR_i}{R_i}\right)$$

where i is the number of the window, TR_i is the corrected fluorescence signal intensity data to be detected of the i-th window in the sample to be detected, R_i is the corrected reference fluorescence signal intensity data of the i-th window in the reference sample, and CNV_i is the comparison data of the window fluorescence signal intensity of the i-th window.
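The window-by-window comparison can be sketched as below. The function name `log2_ratio_sequence` is illustrative; the only logic taken from the text is the base-2 logarithm of the ratio of the sample value to the reference value for the same window.

```python
import math


def log2_ratio_sequence(test_values, reference_values):
    """Return CNV_i = log2(TR_i / R_i) for each window, in genome order."""
    if len(test_values) != len(reference_values):
        raise ValueError("sample and reference must have the same windows")
    ratios = []
    for tr, r in zip(test_values, reference_values):
        if tr <= 0 or r <= 0:
            raise ValueError("intensities must be positive to take a log ratio")
        ratios.append(math.log2(tr / r))
    return ratios


# A window with twice the reference intensity gives +1, half gives -1.
print(log2_ratio_sequence([2.0, 1.0, 0.5], [1.0, 1.0, 1.0]))
```

A value near zero indicates that the sample and the reference have similar intensity in that window; positive values point toward a gain and negative values toward a loss.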
Step 306, obtaining comparison data sequences of each window fluorescence signal intensity based on the comparison data of the window fluorescence signal intensity of each window on the genome.
Specifically, based on the steps, comparing the to-be-detected fluorescent signal intensity data of each window in the sample to be detected with the reference fluorescent signal intensity data of the corresponding window in the reference sample, so as to obtain comparison data of the window fluorescent signal intensity of each window on the genome, and further obtaining the comparison data sequence of the corresponding window fluorescent signal intensity according to the arrangement sequence of each window on the genome.
In one embodiment, detecting significant differences between windows according to the comparison data sequence of the fluorescence signal intensities of the windows, and determining a target window and a corresponding region to be determined in the sample to be detected based on those differences, includes: identifying significant differences between the windows with a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; obtaining, based on the significant differences between the windows, a plurality of regions that differ significantly from one another, wherein no significant difference exists between the windows within each region; identifying significant differences between the plurality of regions, and merging adjacent regions if no significant difference exists between them; and, if a significant difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its start position and the window corresponding to its end position, based on the adjacent regions that differ significantly.
For example, after the comparison data sequence of the fluorescence signal intensities of the windows is obtained through the above steps, circular binary segmentation (CBS) or a hidden Markov model (HMM) is used to statistically analyze the log2RR values of the windows on each chromosome and identify statistically different regions. The regions may be divided on the following principle: there is no significant difference between windows within a region, while there is a significant difference between two adjacent windows located in different regions.
Further, because the genome contains a large number of windows, algorithms such as CBS or HMM identify many statistically different regions, and such regions do not by themselves indicate the presence of a CNV. To obtain the final complete CNV regions, this embodiment further processes the identified regions with a small-segment CNV merging algorithm. Specifically, the algorithm uses a standard Z test to check whether a region and the region that follows it differ significantly. If they do not, the two regions are merged; if they do, the breakpoint position (i.e., the position of the target window) is determined. The next breakpoint is determined by the same procedure. Once two breakpoints are determined, the region between them is analyzed as the region to be determined, in which a chromosomal abnormality may exist. The window at the first breakpoint is the window corresponding to the start position of the region to be determined, and the window at the second breakpoint is the window corresponding to the end position of the region to be determined.
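The merging step can be illustrated with the sketch below. The source states only that a standard Z test is used, so the two-sample statistic, the 1.96 cut-off, the requirement of at least two windows per region, and the names `z_statistic` and `merge_regions` are all assumptions made for illustration. The input is taken to be the output of a prior segmentation such as CBS.

```python
import math
from statistics import fmean, variance


def z_statistic(a, b):
    """Two-sample Z statistic for the difference between the means of a and b."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = fmean(a) - fmean(b)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def merge_regions(regions, z_crit=1.96):
    """Merge neighbouring regions whose mean log2RR does not differ significantly.

    regions: lists of per-window log2RR values, in genome order; each
             region is assumed to contain at least two windows.
    z_crit: hypothetical cut-off (two-sided 5% level), not from the source.
    """
    if not regions:
        return []
    merged = [list(regions[0])]
    for region in regions[1:]:
        if abs(z_statistic(merged[-1], region)) < z_crit:
            merged[-1].extend(region)    # no significant difference: merge
        else:
            merged.append(list(region))  # a breakpoint lies between the two
    return merged


segments = [
    [0.01, -0.02, 0.00, 0.02],
    [0.00, 0.01, -0.01, 0.02],
    [0.55, 0.60, 0.58, 0.62],
    [0.02, -0.01, 0.00, 0.01],
]
print([len(r) for r in merge_regions(segments)])
```

In the example, the first two segments are merged, while the elevated third segment keeps a breakpoint on each side; those two breakpoints delimit the region to be determined.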
In one embodiment, the preset variation threshold includes a preset copy number deletion threshold and a preset copy number duplication threshold; determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensities of the windows, and the preset variation threshold specifically includes:
The comparison data mean of the window fluorescence signal intensity in the region to be determined is obtained from the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected, together with the comparison data sequence of the window fluorescence signal intensities. Specifically, the start and end windows identify which windows belong to the region to be determined; the comparison data of those windows are extracted from the comparison data sequence, and their mean is calculated. The mean is then matched against the preset copy number deletion threshold and copy number duplication threshold. If the mean matches the preset copy number deletion threshold, the region to be determined is determined to be an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, the region to be determined is determined to be an abnormal region with a copy number duplication; and if the mean matches neither threshold, the region to be determined is determined not to be an abnormal region.
In this way it can be determined whether a chromosomal abnormality exists in the sample to be detected. Specifically, in this embodiment, the mean is considered to match the preset copy number duplication threshold when it is greater than that threshold, in which case the region to be determined is an abnormal region with a copy number duplication; the mean is considered to match the preset copy number deletion threshold when it is smaller than that threshold, in which case the region to be determined is an abnormal region with a copy number deletion.
Further, when the comparison data mean of the region to be determined is greater than zero and smaller than the preset copy number duplication threshold, the duplication chimeric proportion of the region can be calculated according to the subsequent steps; when the mean is greater than the preset copy number deletion threshold and smaller than zero, the deletion chimeric proportion can be calculated according to the subsequent steps.
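The threshold matching above can be sketched as follows. The function name `classify_region` and the threshold values used in the example (0.4 and -0.4) are placeholders chosen for illustration; the source does not state numeric thresholds.

```python
from statistics import fmean


def classify_region(log2rr, start, end, dup_threshold, del_threshold):
    """Classify the region covering windows start..end (inclusive).

    log2rr: comparison data sequence (one log2RR value per window).
    Returns a (label, mean) pair, where label is "duplication",
    "deletion" or "normal".
    """
    mean_value = fmean(log2rr[start:end + 1])
    if mean_value > dup_threshold:
        return "duplication", mean_value
    if mean_value < del_threshold:
        return "deletion", mean_value
    return "normal", mean_value


sequence = [0.0, 0.5, 0.6, 0.55, 0.0, -0.9, -1.0, 0.0]
# Hypothetical thresholds, for illustration only.
print(classify_region(sequence, 1, 3, dup_threshold=0.4, del_threshold=-0.4))
print(classify_region(sequence, 5, 6, dup_threshold=0.4, del_threshold=-0.4))
```

A region labelled "normal" whose mean lies between zero and one of the thresholds is the case the text singles out for a chimeric proportion calculation.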
In one embodiment, as shown in fig. 4, after determining the abnormal region in the sample to be detected, the method further includes:
Step 402, obtaining the copy number of the abnormal region in the sample to be detected.
In this embodiment, after it is determined that an abnormal region exists in the sample to be detected, the chimeric proportion of the abnormal region may be further calculated, so as to improve the sensitivity of CNV detection and reduce false positives in CNV identification. Specifically, the copy number of the abnormal region may be obtained with the following formula:

$$CN = 2 \times 2^{\overline{\log_2 RR}}$$

where \overline{\log_2 RR} is the mean of the log2RR values of the windows in the abnormal region, and CN is the copy number of the abnormal region.
Step 404, calculating the chimeric proportion of the abnormal region according to the copy number of the abnormal region and the preset variation threshold.
In this embodiment, if the abnormal region is determined to be an abnormal region with a copy number duplication, a first difference between the copy number of the abnormal region and 2 is calculated, and the quotient of the first difference and the copy number duplication threshold is taken as the chimeric proportion of the abnormal region. If the abnormal region is determined to be an abnormal region with a copy number deletion, a second difference between 2 and the copy number of the abnormal region is calculated, and the quotient of the second difference and the copy number deletion threshold is taken as the chimeric proportion of the abnormal region.
For example, let the copy number of the abnormal region be CN, and let the preset variation threshold include a copy number duplication threshold a and a copy number deletion threshold b. For an abnormal region with a copy number duplication, the chimeric proportion of the duplicated region is:

$$\frac{CN - 2}{a}$$

For an abnormal region with a copy number deletion, the chimeric proportion of the deleted region is:

$$\frac{2 - CN}{b}$$
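The two calculations of steps 402 and 404 can be sketched together. Two points are assumptions rather than statements from the source: the copy number is derived from the mean log2RR against a diploid baseline as CN = 2 × 2^mean, and the deletion difference is taken as 2 − CN. The function names and the example values of a and b are placeholders.

```python
def copy_number(mean_log2rr):
    """Copy number from the mean log2RR of a region.

    Assumption: diploid baseline, so CN = 2 * 2 ** mean_log2rr.
    """
    return 2.0 * 2.0 ** mean_log2rr


def chimeric_proportion(cn, kind, a, b):
    """Chimeric proportion of an abnormal region.

    cn: copy number of the region.
    kind: "duplication" or "deletion".
    a: copy number duplication threshold; b: copy number deletion threshold.
    """
    if kind == "duplication":
        return (cn - 2.0) / a  # first difference divided by threshold a
    if kind == "deletion":
        return (2.0 - cn) / b  # second difference divided by threshold b
    raise ValueError("kind must be 'duplication' or 'deletion'")


# Placeholder thresholds a = b = 1.0, for illustration only.
print(copy_number(0.0))
print(chimeric_proportion(2.5, "duplication", a=1.0, b=1.0))
print(chimeric_proportion(1.5, "deletion", a=1.0, b=1.0))
```

With these placeholder thresholds, a region at copy number 2.5 is reported as a duplication present in half of the cells, which is the kind of intermediate result the chimeric proportion is meant to capture.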
In this embodiment, calculating the chimeric proportion of the abnormal region helps improve the detection rate, reduce false positives, and improve the accuracy of CNV identification.
It should be understood that, although the steps in the flowcharts of FIGS. 1-4 are shown in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated herein, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in FIGS. 1-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a copy number variation detecting apparatus based on chip data, including: a fluorescence signal intensity data acquisition module 501, an alignment processing module 502, a region identification module 503, and an abnormal region determination module 504, wherein:
the fluorescence signal intensity data acquisition module 501 is used for acquiring a sequence to be detected of fluorescence signal intensity data of each window of a genome of a sample to be detected according to gene sequencing data of the sample to be detected;
the comparison processing module 502 is configured to compare, based on each window on the genome, the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window on the genome of a reference sample in a pre-established reference library, so as to obtain a comparison data sequence of fluorescence signal intensity of each window;
the region identifying module 503 is configured to detect a significance difference between windows according to a comparison data sequence of fluorescent signal intensities of the windows, and determine a target window in a sample to be detected and a corresponding region to be determined based on the significance difference between the windows, where the target window is a window corresponding to a start position and a window corresponding to an end position of the region to be determined in the sample to be detected;
The abnormal region determining module 504 is configured to determine an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensities of each window, and a preset variation threshold.
In one embodiment, the preset variation threshold includes a preset copy number deletion threshold and a preset copy number duplication threshold; the abnormal region determining module is specifically configured to: obtain the comparison data mean of the window fluorescence signal intensity in the region to be determined according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the window fluorescence signal intensities; match the comparison data mean against the preset copy number deletion threshold and copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determine that the region to be determined is an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, determine that the region to be determined is an abnormal region with a copy number duplication; and if the mean matches neither threshold, determine that the region to be determined is not an abnormal region.
In one embodiment, the device further comprises a chimeric proportion calculating module, which is used for obtaining the copy number of the abnormal region in the sample to be detected after determining the abnormal region in the sample to be detected; and calculating the chimeric proportion of the abnormal region according to the copy number of the abnormal region and a preset variation threshold value.
In one embodiment, the chimeric proportion calculating module is specifically configured to: if the abnormal region is determined to be an abnormal region with a copy number duplication, calculate a first difference between the copy number of the abnormal region and 2, and take the quotient of the first difference and the copy number duplication threshold as the chimeric proportion of the abnormal region; if the abnormal region is determined to be an abnormal region with a copy number deletion, calculate a second difference between 2 and the copy number of the abnormal region, and take the quotient of the second difference and the copy number deletion threshold as the chimeric proportion of the abnormal region.
In one embodiment, the fluorescence signal intensity data acquisition module is specifically configured to: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each position corresponding to the gene sequencing data of the sample to be detected; window segmentation is carried out on the genome of the sample to be detected according to window division conditions of the reference sample, and the fluorescent signal intensity to be detected of the window is obtained according to the fluorescent signal intensity to be detected of the sites in each window; carrying out local linear regression correction on the fluorescence signal intensity to be detected of the windows in the sample to be detected to obtain corrected fluorescence signal intensity data to be detected of each window; and generating a sequence to be detected of the fluorescence signal intensity data based on the fluorescence signal intensity data to be detected of each window in the sample to be detected.
In one embodiment, the comparison processing module is specifically configured to: for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence; acquiring the ratio of the fluorescence signal intensity data to be detected and the reference fluorescence signal intensity data of the window, and determining the logarithm of the ratio based on 2 as comparison data of the fluorescence signal intensity of the window; and obtaining comparison data sequences of the fluorescence signal intensities of the windows based on the comparison data of the fluorescence signal intensities of the windows of each genome.
In one embodiment, the region identifying module is specifically configured to: identify significant differences between the windows with a statistical algorithm according to the comparison data sequence of the fluorescence signal intensity of each window; obtain, based on the significant differences between the windows, a plurality of regions that differ significantly from one another, wherein no significant difference exists between the windows within each region; identify significant differences between the plurality of regions, and merge adjacent regions if no significant difference exists between them; and, if a significant difference exists between adjacent regions, obtain the region to be determined, together with the window corresponding to its start position and the window corresponding to its end position, based on the adjacent regions that differ significantly.
For specific limitations of the copy number variation detection device, refer to the limitations of the copy number variation detection method described above; they are not repeated here. Each module in the copy number variation detection device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in software form in a memory of the computer device, so that the processor can call them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a method of detecting copy number variation. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor when executing the computer program performing the steps of:
according to the gene sequencing data of a sample to be detected, obtaining a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window;
detecting significance differences among windows according to comparison data sequences of fluorescence signal intensities of the windows, and determining a target window and a corresponding region to be determined in a sample to be detected based on the significance differences among the windows, wherein the target window is a window corresponding to a starting position and a window corresponding to an ending position of the region to be determined in the sample to be detected;
And determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensity of each window and a preset variation threshold value.
In one embodiment, the preset variation threshold includes a preset copy number deletion threshold and a preset copy number duplication threshold; the processor, when executing the computer program, further implements the steps of: obtaining the comparison data mean of the window fluorescence signal intensity in the region to be determined according to the window corresponding to the start position and the window corresponding to the end position of the region to be determined in the sample to be detected and the comparison data sequence of the window fluorescence signal intensities; matching the comparison data mean against the preset copy number deletion threshold and copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with a copy number deletion; if the mean matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with a copy number duplication; and if the mean matches neither threshold, determining that the region to be determined is not an abnormal region.
In one embodiment, the processor when executing the computer program further performs the steps of: obtaining the copy number of the abnormal region in the sample to be detected; and calculating the chimeric proportion of the abnormal region according to the copy number of the abnormal region and a preset variation threshold value.
In one embodiment, the processor, when executing the computer program, further performs the steps of: if the abnormal region is determined to be an abnormal region with a copy number duplication, calculating a first difference between the copy number of the abnormal region and 2, and taking the quotient of the first difference and the copy number duplication threshold as the chimeric proportion of the abnormal region; if the abnormal region is determined to be an abnormal region with a copy number deletion, calculating a second difference between 2 and the copy number of the abnormal region, and taking the quotient of the second difference and the copy number deletion threshold as the chimeric proportion of the abnormal region.
In one embodiment, the processor when executing the computer program further performs the steps of: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected of each position corresponding to the gene sequencing data of the sample to be detected; window segmentation is carried out on the genome of the sample to be detected according to window division conditions of the reference sample, and the fluorescent signal intensity to be detected of the window is obtained according to the fluorescent signal intensity to be detected of the sites in each window; carrying out local linear regression correction on the fluorescence signal intensity to be detected of the windows in the sample to be detected to obtain corrected fluorescence signal intensity data to be detected of each window; and generating a sequence to be detected of the fluorescence signal intensity data based on the fluorescence signal intensity data to be detected of each window in the sample to be detected.
In one embodiment, the processor when executing the computer program further performs the steps of: for each window on the genome, extracting the fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting the reference fluorescence signal intensity data of the corresponding window from the reference sequence; obtaining the ratio of the fluorescence signal intensity data to be detected to the reference fluorescence signal intensity data of the window, and determining the base-2 logarithm of the ratio as the comparison data of the fluorescence signal intensity of the window; and obtaining the comparison data sequence of the fluorescence signal intensity of each window based on the comparison data of the fluorescence signal intensity of each window on the genome.
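The comparison step above amounts to a per-window log2 ratio. A minimal sketch, assuming both sequences are aligned window by window and contain only positive intensities (the function name and the input validation are illustrative additions):

```python
import numpy as np

def log2_ratio(test_intensity, reference_intensity):
    """Per-window comparison data: log2(test / reference)."""
    test = np.asarray(test_intensity, dtype=float)
    ref = np.asarray(reference_intensity, dtype=float)
    if test.shape != ref.shape:
        raise ValueError("sequences must cover the same windows")
    if np.any(test <= 0) or np.any(ref <= 0):
        raise ValueError("intensities must be positive")
    return np.log2(test / ref)
```

A window whose intensity equals the reference gives 0; a window at twice the reference gives 1, and one at half the reference gives -1.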
In one embodiment, the processor when executing the computer program further performs the steps of: identifying significant differences between the windows by applying a statistical algorithm to the comparison data sequence of the fluorescence signal intensity of each window; obtaining, based on the significant differences between the windows, a plurality of regions delimited by significant differences, wherein no significant difference exists among the windows within each region; identifying significant differences among the plurality of regions, and merging adjacent regions if no significant difference exists between them; and if a significant difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its starting position and the window corresponding to its ending position, based on the adjacent regions that differ significantly.
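The embodiment does not name the statistical algorithm, so the following is only a simplified stand-in that shows the merge-adjacent-regions idea. It starts from fixed-size blocks of windows and repeatedly merges the most similar adjacent pair until every remaining pair differs by at least a Welch t statistic cutoff. The block size, the cutoff, and the use of a t statistic as the significance criterion are all assumptions for this sketch; region boundaries are resolved only to block edges.

```python
import numpy as np

def welch_t(a, b):
    """Absolute Welch t statistic between two groups of windows."""
    if len(a) < 2 or len(b) < 2:
        return 0.0  # too few windows to test; treat as not different
    se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    diff = abs(a.mean() - b.mean())
    if se == 0:
        return 0.0 if diff == 0 else np.inf
    return diff / se

def merge_segments(log_ratios, block=5, t_cutoff=4.0):
    """Return half-open (start, end) window index ranges of the regions."""
    x = np.asarray(log_ratios, dtype=float)
    bounds = [(s, min(s + block, len(x))) for s in range(0, len(x), block)]
    while len(bounds) > 1:
        ts = [welch_t(x[a0:a1], x[b0:b1])
              for (a0, a1), (b0, b1) in zip(bounds, bounds[1:])]
        i = int(np.argmin(ts))
        if ts[i] >= t_cutoff:
            break  # every adjacent pair now differs significantly
        # Merge the two adjacent regions that are most alike.
        bounds[i:i + 2] = [(bounds[i][0], bounds[i + 1][1])]
    return bounds
```

For each returned range `(start, end)`, the window at index `start` corresponds to the starting position of a candidate region and the window at index `end - 1` to its ending position.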
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
according to the gene sequencing data of a sample to be detected, obtaining a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window;
detecting significance differences among windows according to comparison data sequences of fluorescence signal intensities of the windows, and determining a target window and a corresponding region to be determined in a sample to be detected based on the significance differences among the windows, wherein the target window is a window corresponding to a starting position and a window corresponding to an ending position of the region to be determined in the sample to be detected;
and determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensity of each window and a preset variation threshold value.
In one embodiment, the preset variation threshold includes a preset copy number deletion threshold and a preset copy number duplication threshold; the computer program when executed by the processor also performs the steps of: obtaining the mean of the comparison data of the window fluorescence signal intensities within the region to be determined, according to the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined in the sample to be detected and the comparison data sequence of the window fluorescence signal intensities; matching the mean of the comparison data within the region to be determined against the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with copy number deletion; if the mean matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with copy number duplication; and if the mean matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determining that the region to be determined is not an abnormal region.
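The threshold matching above can be sketched as a small classifier. This is an illustrative sketch only: it assumes that "matching" means the regional mean is at or below the deletion threshold, or at or above the duplication threshold, and that the start and end window indices are inclusive. The function name and the threshold values used in any call are hypothetical.

```python
def classify_region(log_ratios, start, end, del_threshold, dup_threshold):
    """Classify a candidate region by the mean of its comparison data.

    start and end are inclusive window indices; both thresholds are on the
    same scale as the comparison data (assumed here to be log2 ratios).
    """
    values = log_ratios[start:end + 1]
    mean = sum(values) / len(values)
    if mean <= del_threshold:
        return "deletion", mean
    if mean >= dup_threshold:
        return "duplication", mean
    return "normal", mean
```

The function returns the mean alongside the label so that a caller can report how far the region lies from either threshold.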
In one embodiment, the computer program when executed by the processor further performs the steps of: obtaining the copy number of the abnormal region in the sample to be detected; and calculating the chimeric proportion of the abnormal region according to the copy number of the abnormal region and a preset variation threshold value.
In one embodiment, the computer program when executed by the processor further performs the steps of: if the abnormal region is determined to be an abnormal region with copy number duplication, calculating a first difference between the copy number of the abnormal region and 2, and determining the quotient of the first difference and the copy number duplication threshold as the chimeric proportion of the abnormal region; if the abnormal region is determined to be an abnormal region with copy number deletion, calculating a second difference between the copy number of the abnormal region and 2, and determining the quotient of the second difference and the copy number deletion threshold as the chimeric proportion of the abnormal region.
In one embodiment, the computer program when executed by the processor further performs the steps of: performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected at each site corresponding to the gene sequencing data of the sample to be detected; dividing the genome of the sample to be detected into windows according to the window division used for the reference sample, and obtaining the fluorescence signal intensity to be detected of each window from the fluorescence signal intensities to be detected of the sites in that window; performing local linear regression correction on the fluorescence signal intensity to be detected of the windows in the sample to be detected to obtain corrected fluorescence signal intensity data to be detected for each window; and generating the sequence to be detected of the fluorescence signal intensity data based on the fluorescence signal intensity data to be detected of each window in the sample to be detected.
In one embodiment, the computer program when executed by the processor further performs the steps of: for each window on the genome, extracting the fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting the reference fluorescence signal intensity data of the corresponding window from the reference sequence; obtaining the ratio of the fluorescence signal intensity data to be detected to the reference fluorescence signal intensity data of the window, and determining the base-2 logarithm of the ratio as the comparison data of the fluorescence signal intensity of the window; and obtaining the comparison data sequence of the fluorescence signal intensity of each window based on the comparison data of the fluorescence signal intensity of each window on the genome.
In one embodiment, the computer program when executed by the processor further performs the steps of: identifying significant differences between the windows by applying a statistical algorithm to the comparison data sequence of the fluorescence signal intensity of each window; obtaining, based on the significant differences between the windows, a plurality of regions delimited by significant differences, wherein no significant difference exists among the windows within each region; identifying significant differences among the plurality of regions, and merging adjacent regions if no significant difference exists between them; and if a significant difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its starting position and the window corresponding to its ending position, based on the adjacent regions that differ significantly.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may perform the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this description.
The above examples represent only a few embodiments of the present application. They are described in specific detail, but are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and these would fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims (9)

1. A method for detecting copy number variation based on chip data, the method comprising:
according to the gene sequencing data of a sample to be detected, obtaining a sequence to be detected of fluorescence signal intensity data of each window of the sample to be detected on a genome;
comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window;
detecting significance differences among windows according to comparison data sequences of fluorescence signal intensities of the windows, and determining a target window and a corresponding region to be determined in a sample to be detected based on the significance differences among the windows, wherein the target window is a window corresponding to a starting position and a window corresponding to an ending position of the region to be determined in the sample to be detected;
determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and a preset variation threshold;
the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; the determining of the abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescence signal intensity of each window and the preset variation threshold comprises: obtaining the mean of the comparison data of the window fluorescence signal intensities within the region to be determined, according to the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined in the sample to be detected and the comparison data sequence of the window fluorescence signal intensities; matching the mean of the comparison data within the region to be determined against the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determining that the region to be determined is an abnormal region with copy number deletion; if the mean matches the preset copy number duplication threshold, determining that the region to be determined is an abnormal region with copy number duplication; and if the mean matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determining that the region to be determined is not an abnormal region.
2. The method of claim 1, wherein after the determining of the abnormal region in the sample to be detected, the method further comprises:
obtaining the copy number of the abnormal region in the sample to be detected;
and calculating the chimeric proportion of the abnormal region according to the copy number of the abnormal region and a preset variation threshold value.
3. The method according to claim 2, wherein the calculating of the chimeric proportion of the abnormal region comprises:
if the abnormal region is determined to be an abnormal region with copy number duplication, calculating a first difference between the copy number of the abnormal region and 2, and determining the quotient of the first difference and the copy number duplication threshold as the chimeric proportion of the abnormal region;
if the abnormal region is determined to be an abnormal region with copy number deletion, calculating a second difference between the copy number of the abnormal region and 2, and determining the quotient of the second difference and the copy number deletion threshold as the chimeric proportion of the abnormal region.
4. The method according to claim 1, wherein the obtaining the sequence to be detected of the fluorescence signal intensity data of each window of the sample to be detected on the genome according to the gene sequencing data of the sample to be detected comprises:
performing data conversion on the gene sequencing data of the sample to be detected to obtain the fluorescence signal intensity to be detected at each site corresponding to the gene sequencing data of the sample to be detected;
window segmentation is carried out on the genome of the sample to be detected according to window division conditions of the reference sample, and the fluorescent signal intensity to be detected of the window is obtained according to the fluorescent signal intensity to be detected of the sites in each window;
carrying out local linear regression correction on the fluorescence signal intensity to be detected of the windows in the sample to be detected to obtain corrected fluorescence signal intensity data to be detected of each window;
and generating a sequence to be detected of the fluorescence signal intensity data based on the fluorescence signal intensity data to be detected of each window in the sample to be detected.
5. The method according to claim 1, wherein the comparing the sequence to be detected of the fluorescence signal intensity data based on each window on the genome with a reference sequence of fluorescence signal intensity data of a corresponding window on the genome of a reference sample in a pre-established reference library to obtain a comparison data sequence of fluorescence signal intensity of each window comprises:
for each window on the genome, extracting fluorescence signal intensity data to be detected of the window from the sequence to be detected, and extracting reference fluorescence signal intensity data of the corresponding window from the reference sequence;
obtaining the ratio of the fluorescence signal intensity data to be detected to the reference fluorescence signal intensity data of the window, and determining the base-2 logarithm of the ratio as the comparison data of the fluorescence signal intensity of the window;
and obtaining the comparison data sequence of the fluorescence signal intensity of each window based on the comparison data of the fluorescence signal intensity of each window on the genome.
6. The method according to claim 1, wherein the detecting of significant differences between windows according to the comparison data sequence of the fluorescence signal intensity of each window, and the determining of the target window and the corresponding region to be determined in the sample to be detected based on the significant differences between windows, comprise:
identifying significant differences between the windows by applying a statistical algorithm to the comparison data sequence of the fluorescence signal intensity of each window;
obtaining, based on the significant differences between the windows, a plurality of regions delimited by significant differences, wherein no significant difference exists among the windows within each region;
identifying significant differences among the plurality of regions, and merging adjacent regions if no significant difference exists between them; and if a significant difference exists between adjacent regions, obtaining the region to be determined, together with the window corresponding to its starting position and the window corresponding to its ending position, based on the adjacent regions that differ significantly.
7. A copy number variation detection apparatus based on chip data, the apparatus comprising:
the fluorescence signal intensity data acquisition module is used for acquiring a sequence to be detected of fluorescence signal intensity data of each window of a genome of a sample to be detected according to gene sequencing data of the sample to be detected;
the comparison processing module is used for comparing the sequence to be detected of the fluorescence signal intensity data with a reference sequence of fluorescence signal intensity data of a corresponding window of a reference sample in a pre-established reference library on the basis of each window on the genome to obtain a comparison data sequence of fluorescence signal intensity of each window;
the region identification module is used for detecting significant differences between the windows according to the comparison data sequence of the fluorescence signal intensity of each window, and determining a target window and a corresponding region to be determined in the sample to be detected based on the significant differences between the windows, wherein the target window is the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined in the sample to be detected;
the abnormal region determining module is used for determining an abnormal region in the sample to be detected according to the region to be determined, the comparison data sequence of the fluorescent signal intensity of each window and a preset variation threshold value;
the preset variation threshold comprises a preset copy number deletion threshold and a preset copy number duplication threshold; the abnormal region determining module is specifically configured to: obtain the mean of the comparison data of the window fluorescence signal intensities within the region to be determined, according to the window corresponding to the starting position and the window corresponding to the ending position of the region to be determined in the sample to be detected and the comparison data sequence of the window fluorescence signal intensities; match the mean of the comparison data within the region to be determined against the preset copy number deletion threshold and the preset copy number duplication threshold; if the mean matches the preset copy number deletion threshold, determine that the region to be determined is an abnormal region with copy number deletion; if the mean matches the preset copy number duplication threshold, determine that the region to be determined is an abnormal region with copy number duplication; and if the mean matches neither the preset copy number duplication threshold nor the preset copy number deletion threshold, determine that the region to be determined is not an abnormal region.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202110673034.7A 2021-06-17 2021-06-17 Copy number variation detection method and detection device based on chip data Active CN113299342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110673034.7A CN113299342B (en) 2021-06-17 2021-06-17 Copy number variation detection method and detection device based on chip data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110673034.7A CN113299342B (en) 2021-06-17 2021-06-17 Copy number variation detection method and detection device based on chip data

Publications (2)

Publication Number Publication Date
CN113299342A CN113299342A (en) 2021-08-24
CN113299342B true CN113299342B (en) 2024-03-15

Family

ID=77328615

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110673034.7A Active CN113299342B (en) 2021-06-17 2021-06-17 Copy number variation detection method and detection device based on chip data

Country Status (1)

Country Link
CN (1) CN113299342B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376613A (en) * 2022-09-13 2022-11-22 郑州思昆生物工程有限公司 Base type detection method, device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104204220A (en) * 2011-12-31 2014-12-10 深圳华大基因医学有限公司 Method for detecting genetic variation
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
CN111916150A (en) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 Method and device for detecting genome copy number variation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7822555B2 (en) * 2002-11-11 2010-10-26 Affymetrix, Inc. Methods for identifying DNA copy number changes

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104204220A (en) * 2011-12-31 2014-12-10 深圳华大基因医学有限公司 Method for detecting genetic variation
CN105574361A (en) * 2015-11-05 2016-05-11 上海序康医疗科技有限公司 Method for detecting variation of copy numbers of genomes
CN111916150A (en) * 2019-05-10 2020-11-10 北京贝瑞和康生物技术有限公司 Method and device for detecting genome copy number variation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
拷贝数变异的全基因组关联分析;孙玉琳;刘飞;赵晓航;;生物化学与生物物理进展(第08期);全文 *

Also Published As

Publication number Publication date
CN113299342A (en) 2021-08-24

Similar Documents

Publication Publication Date Title
US8594951B2 (en) Methods and systems for nucleic acid sequence analysis
RU2768718C2 (en) Detection of somatic variation of number of copies
CN108920899B (en) Single exon copy number variation prediction method based on target region sequencing
JP2018522531A5 (en)
CN111462816B (en) Method, electronic device and computer storage medium for detecting microdeletion and microduplication of germ line genes
CN110268044B (en) Method and device for detecting chromosome variation
CN110648721B (en) Method and device for detecting copy number variation by aiming at exon capture technology
US20230287487A1 (en) Systems and methods for genetic identification and analysis
KR102273257B1 (en) Copy number variations detecting method based on read-depth and analysis apparatus
KR20200107774A (en) How to align targeting nucleic acid sequencing data
CN111968701A (en) Method and device for detecting somatic copy number variation of designated genome region
CN115064209B (en) Malignant cell identification method and system
CN113299342B (en) Copy number variation detection method and detection device based on chip data
CN117334249A (en) Method, apparatus and medium for detecting copy number variation based on amplicon sequencing data
EP4016533A1 (en) Method and apparatus for machine learning based identification of structural variants in cancer genomes
Duan et al. Common copy number variation detection from multiple sequenced samples
CN111696622B (en) Method for correcting and evaluating detection result of mutation detection software
CN111508559B (en) Method and device for detecting target area CNV
CN112863602B (en) Chromosome abnormality detection method, chromosome abnormality detection device, chromosome abnormality detection computer device, and chromosome abnormality detection storage medium
CN114694752B (en) Method, computing device and medium for predicting homologous recombination repair defects
JP2007520829A (en) Method and system for linked analysis of array CGH data and gene expression data
EP1798651A1 (en) Gene information display method and apparatus
CN114613434A (en) Method and system for detecting gene copy number variation based on population sample depth information
Vepakomma et al. Diverse data selection via combinatorial quasi-concavity of distance covariance: A polynomial time global minimax algorithm
JP2008226095A (en) Gene expression variation analysis method, system and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant