CN103778350B - Method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model - Google Patents

Info

Publication number
CN103778350B
Authority
CN
China
Prior art keywords
scna
statistic
random
dimensional
amplitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410010002.9A
Other languages
Chinese (zh)
Other versions
CN103778350A (en)
Inventor
袁细国
张军英
杨利英
张胜利
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201410010002.9A
Publication of CN103778350A
Application granted
Publication of CN103778350B
Legal status: Expired - Fee Related
Anticipated expiration

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

A method for detecting the significance of somatic copy number alterations (SCNAs) based on a two-dimensional statistical model, comprising: S1, collecting SCNA data and preprocessing the SCNA data; S2, computing the correlation coefficient between adjacent SCNA sites and dividing each chromosome into multiple relatively independent SCNA structural units; S3, computing the statistic of each SCNA structural unit and performing a two-dimensional random permutation across the whole genome; S4, for each length L of SCNA structural unit, computing the statistic of SCNA patterns of length L at random positions in the permuted samples, thereby constructing a length-specific null distribution $D_L$ in the two-dimensional space; comparing the statistic of each SCNA with the corresponding $D_L$ to obtain a p-value; if the p-value is below a preset threshold, the SCNA is deemed significant and may play a functional role in cancer.

Description

Method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model
Technical field
The present invention relates to a method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model.
Background art
Somatic copy number alteration (SCNA) is an important phenomenon in cancer genomes. It manifests mainly as two states, amplification and deletion of copy number, and is closely linked to the initiation and progression of cancer cells. Systematic analysis of SCNAs therefore provides an important route to studying the pathogenesis of cancer at the molecular level. The most fundamental and central problem is how to distinguish SCNA patterns that play a functional role in cancer from SCNAs that occur at random.
Numerous studies have shown that functional SCNA patterns tend to lie in regions that are consistently altered across cancer genome samples. Establishing statistically grounded computational methods that detect how significantly an SCNA recurs across multiple samples therefore provides a direct and feasible means of identifying functional SCNA patterns and discovering potential cancer genes, which in turn supplies important information for biologists and physicians in predicting and diagnosing cancer. Building a sound and effective statistical test model is thus essential.
The density of SCNA sites in high-throughput whole-genome data and the complexity of their structure pose great challenges to building a statistical test model and detecting SCNA significance, mainly in the following two respects. First, difficulties inherent to the problem: a) the number of sites exceeds 1.8 million while the number of samples is usually small, giving high-dimensional, small-sample data; b) SCNA sites are strongly correlated rather than independent, so the tested factors influence one another; c) a copy number amplification or deletion state has two features, alteration frequency and alteration amplitude, which calls for a reasonable mechanism to balance them; d) SCNA structural patterns differ in length, so SCNAs of different lengths need different background distributions. Second, challenges in the theory and methods for solving the problem: a) the data are large, so effectively controlling time and space complexity is a challenge; b) how to take full account of the correlation between SCNA sites, and so reduce the conservativeness of the significance estimate, is a difficult point; c) how to establish a null hypothesis consistent with the distribution of the statistic, and so strengthen the statistical meaning of the significance estimate, is a key problem that has not yet been solved.
Summary of the invention
To solve the above problems, the present invention provides a method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model, characterized in that it comprises:
S1: collecting SCNA data and preprocessing the SCNA data;
S2: computing the correlation coefficient between adjacent SCNA sites and dividing each chromosome into multiple relatively independent SCNA structural units;
S3: computing the statistic of each SCNA structural unit and performing a two-dimensional random permutation across the whole genome;
S4: for each length L of SCNA structural unit, computing the statistic of SCNA patterns of length L at random positions in the permuted samples, thereby constructing a length-specific null distribution $D_L$ in the two-dimensional space; comparing the statistic of each SCNA with the corresponding $D_L$ to obtain a p-value; if the p-value is below a preset threshold, the SCNA is deemed significant and may play a functional role in cancer.
On the basis of the above technical solution, step S1 comprises:
processing the SCNA signal to obtain SCNA signals that are comparable across samples; applying a segmentation algorithm to handle noise; and defining the SCNA amplification and deletion states.
On the basis of the above technical solution, step S2 comprises: using the Pearson formula to compute the correlation coefficient between adjacent SCNA sites, and dividing each chromosome into multiple relatively independent SCNA structural units.
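Step S2 can be illustrated with a minimal sketch. It assumes each site is represented by its copy number signal across all samples and that a new structural unit starts whenever the Pearson correlation between two neighbouring sites drops below a cut-off. The function names and the cut-off `min_corr` are hypothetical; the patent does not state a threshold or a splitting rule.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either vector is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def split_into_units(sites: List[List[float]], min_corr: float = 0.9) -> List[List[int]]:
    """Group consecutive site indices into structural units.

    sites[i] holds the signal of site i across all samples.
    min_corr is an assumed cut-off, not a value taken from the patent.
    """
    if not sites:
        return []
    units = [[0]]
    for i in range(1, len(sites)):
        if pearson(sites[i - 1], sites[i]) >= min_corr:
            units[-1].append(i)   # strongly correlated: same unit
        else:
            units.append([i])     # correlation breaks: start a new unit
    return units
```

Because only neighbouring sites are compared, the pass is linear in the number of sites, which matters at the scale of more than a million sites.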
On the basis of the above technical solution, step S3 comprises:
constructing a training set from known functional SCNA patterns, learning the weight $w_1$ of the frequency and the weight $w_2$ of the amplitude, and computing the statistic
$S_{test} = w_1 f + w_2 a$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency, the amplitude and the statistic value of a functional SCNA pattern in the training set.
On the basis of the above technical solution, step S3 further comprises:
The two-dimensional random permutation proceeds as follows:
a) for the frequency with which SCNAs occur, randomly permute the positions at which they occur in the whole genome; for each permuted sample set, compute the occurrence frequency of the random SCNAs and build the frequency-based null distribution $D_f$;
b) for the alteration amplitude of SCNAs, randomly permute the positions at which the amplitudes occur in the whole genome; for each permuted sample set, compute the amplitude of the random SCNAs and build the amplitude-based null distribution $D_a$;
c) using the weights $w_1$ and $w_2$ obtained by supervised learning, construct the null distribution $D$ against which the significance of the statistic is tested:
$D = w_1 D_f + w_2 D_a$
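The permutation procedure can be sketched as follows, under several assumptions that are not in the patent: the data are stored as per-sample lists over structural units, positions are shuffled independently within each sample, and one null draw is read off a fixed position after shuffling. The function name and parameters are hypothetical.

```python
import random
from typing import List


def permutation_null(
    altered: List[List[int]],      # altered[s][u] = 1 if unit u is altered in sample s
    amplitude: List[List[float]],  # amplitude[s][u] = alteration amplitude of unit u in sample s
    w1: float,
    w2: float,
    n_perm: int = 1000,
    seed: int = 0,
) -> List[float]:
    """Return n_perm draws from the combined null w1 * D_f + w2 * D_a."""
    rng = random.Random(seed)
    n_samples = len(altered)
    n_units = len(altered[0])
    null = []
    for _ in range(n_perm):
        # Dimension 1: shuffle where alterations occur, independently per sample.
        f_rows = [rng.sample(row, n_units) for row in altered]
        # Dimension 2: shuffle where amplitudes occur, independently per sample.
        a_rows = [rng.sample(row, n_units) for row in amplitude]
        # After shuffling all positions are exchangeable, so position 0 is a random draw.
        d_f = sum(row[0] for row in f_rows) / n_samples
        d_a = sum(row[0] for row in a_rows) / n_samples
        null.append(w1 * d_f + w2 * d_a)
    return null
```

Permuting structural units rather than individual sites keeps each list short, which is the source of the time saving the description attributes to unit-based permutation.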
Compared with the prior art, the present invention treats the two features of copy number alteration, alteration frequency and alteration amplitude, as both biologically important; constructing the statistic and the statistical test model from both features helps estimate the significance of copy number alterations objectively. The prior art mostly emphasizes alteration frequency alone and tends to overlook the importance of alteration amplitude. The present invention therefore builds a two-dimensional statistical test model on the feature space spanned by these two features and uses a supervised learning strategy to balance them when computing the statistic. This makes the hypothesis test model consistent with the statistic and strengthens both the statistical and the biological meaning of the significance estimate.
Brief description of the drawings
Fig. 1 is a flow chart of the present invention.
Detailed description of the invention
Referring to Fig. 1, a method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model is characterized in that it comprises:
S1: collecting SCNA data and preprocessing the SCNA data;
S2: computing the correlation coefficient between adjacent SCNA sites and dividing each chromosome into multiple relatively independent SCNA structural units;
S3: computing the statistic of each SCNA structural unit and performing a two-dimensional random permutation across the whole genome;
S4: for each length L of SCNA structural unit, computing the statistic of SCNA patterns of length L at random positions in the permuted samples, thereby constructing a length-specific null distribution $D_L$ in the two-dimensional space; comparing the statistic of each SCNA with the corresponding $D_L$ to obtain a p-value; if the p-value is below a preset threshold, the SCNA is deemed significant and may play a functional role in cancer.
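The comparison in S4 can be sketched as an empirical p-value against the null distribution for the matching unit length. The estimator below, with its add-one correction, is a common choice for permutation tests and is an assumption here; the patent does not give a formula. The names `null_by_length` and `alpha` are hypothetical.

```python
from typing import Dict, List


def empirical_p_value(observed: float, null: List[float]) -> float:
    """Fraction of null draws at least as large as the observed statistic.

    The +1 terms keep the estimate away from exactly zero (assumed convention).
    """
    exceed = sum(1 for v in null if v >= observed)
    return (exceed + 1) / (len(null) + 1)


def significant_units(
    stats: Dict[str, float],                # unit id -> observed statistic
    lengths: Dict[str, int],                # unit id -> unit length L
    null_by_length: Dict[int, List[float]], # L -> draws from the length-specific null
    alpha: float = 0.05,                    # assumed significance threshold
) -> List[str]:
    """Return the ids of units whose p-value against their own null falls below alpha."""
    hits = []
    for unit, s in stats.items():
        p = empirical_p_value(s, null_by_length[lengths[unit]])
        if p < alpha:
            hits.append(unit)
    return hits
```

Keying the null by length means a long unit is never judged against a background built from short ones, which is the point of constructing a separate distribution per L.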
On the basis of the above technical solution, step S1 comprises:
processing the SCNA signal to obtain SCNA signals that are comparable across samples; applying a segmentation algorithm to handle noise; and defining the SCNA amplification and deletion states. SCNA signal preprocessing means normalizing and log-transforming the signal: for each cancer sample, its copy number signal is compared with the copy number signal of its matched normal tissue, and a reference sample is built from the analyzed sample set so that all samples can be normalized. This weakens batch effects between samples and removes the influence of germline variation on the SCNA signal.
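A minimal sketch of the tumour-versus-normal part of this preprocessing is given below. It assumes a base-2 log-ratio and symmetric calling thresholds; neither the logarithm base nor the thresholds are specified in the patent, and the reference-sample normalization and the segmentation step are deliberately left out.

```python
import math
from typing import List


def log2_ratio(tumor: List[float], normal: List[float]) -> List[float]:
    """Per-site log2 ratio of a tumour signal to its matched normal signal."""
    return [math.log2(t / n) for t, n in zip(tumor, normal)]


def call_state(value: float, amp: float = 0.3, dele: float = -0.3) -> int:
    """Map a log-ratio to a state: 1 = amplification, -1 = deletion, 0 = neutral.

    amp and dele are hypothetical thresholds chosen for illustration.
    """
    if value > amp:
        return 1
    if value < dele:
        return -1
    return 0
```

Dividing by the matched normal cancels copy number differences the patient carries in every cell, which is how the comparison removes germline effects from the somatic signal.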
On the basis of the above technical solution, step S2 comprises: using the Pearson formula to compute the correlation coefficient between adjacent SCNA sites, and dividing each chromosome into multiple relatively independent SCNA structural units.
On the basis of the above technical solution, step S3 comprises:
constructing a training set from known functional SCNA patterns, learning the weight $w_1$ of the frequency and the weight $w_2$ of the amplitude, and computing the statistic
$S_{test} = w_1 f + w_2 a$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency, the amplitude and the statistic value of a functional SCNA pattern in the training set.
On the basis of the above technical solution, step S3 further comprises:
The two-dimensional random permutation proceeds as follows:
a) for the frequency with which SCNAs occur, randomly permute the positions at which they occur in the whole genome; for each permuted sample set, compute the occurrence frequency of the random SCNAs and build the frequency-based null distribution $D_f$;
b) for the alteration amplitude of SCNAs, randomly permute the positions at which the amplitudes occur in the whole genome; for each permuted sample set, compute the amplitude of the random SCNAs and build the amplitude-based null distribution $D_a$;
c) using the weights $w_1$ and $w_2$ obtained by supervised learning, construct the null distribution $D$ against which the significance of the statistic is tested:
$D = w_1 D_f + w_2 D_a$
Meanwhile, the present invention evaluates the performance of the algorithm in three respects: a) whether the algorithm achieves a high true positive rate (TPR) while the false positive rate (FPR) is controlled; b) whether the algorithm estimates p-values accurately (Type I error rate), i.e. whether its statistical model is statistically sound; c) the computational complexity of the algorithm. To this end, we plan to take normal-cell copy numbers measured on the Affymetrix Genome-Wide SNP 6.0 array as background and, on the basis of probability theory and a non-stationary model, build a Markov SCNA simulation method that generates large-scale SCNA data for testing the performance of the method. Regarding c), in theory the number of SCNA structural units is far smaller than the number of sites, so a permutation strategy based on structural units takes far less computing time than one based on sites, and the time complexity of the algorithm is correspondingly low.
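Criterion a) can be made concrete with a small helper that scores a set of called units against the units planted in a simulation. This is an illustrative sketch only; the function name and the representation of units as integer ids are assumptions, and the Markov simulation itself is not shown.

```python
from typing import Set, Tuple


def tpr_fpr(called: Set[int], truth: Set[int], n_units: int) -> Tuple[float, float]:
    """True and false positive rates of called units against simulated ground truth.

    called:  ids of units reported as significant
    truth:   ids of units that carry a planted functional SCNA
    n_units: total number of units tested
    """
    tp = len(called & truth)          # correctly reported units
    fp = len(called - truth)          # reported units with no planted SCNA
    positives = len(truth)
    negatives = n_units - positives
    tpr = tp / positives if positives else 0.0
    fpr = fp / negatives if negatives else 0.0
    return tpr, fpr
```

Criterion b) could be checked in the same setting by running the method on data with no planted SCNAs and comparing the fraction of p-values below the threshold with the threshold itself.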
In summary, the above is only a preferred embodiment of the present invention and does not limit its scope of protection; all equivalent changes and modifications made in accordance with the claims and the description of the present invention fall within the scope covered by this patent.

Claims (3)

1. A method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model, characterized in that it comprises:
S1: collecting SCNA data and preprocessing the SCNA data;
S2: computing the correlation coefficient between adjacent SCNA sites and dividing each chromosome into multiple relatively independent SCNA structural units;
S3: computing the statistic of each SCNA structural unit and performing a two-dimensional random permutation across the whole genome; constructing a training set from known functional SCNA patterns, learning the weight $w_1$ of the frequency and the weight $w_2$ of the amplitude, and computing the statistic
$S_{test} = w_1 f + w_2 a$
where $f$, $a$ and $S_{test}$ denote, respectively, the frequency, the amplitude and the statistic value of a functional SCNA pattern in the training set;
the two-dimensional random permutation proceeds as follows:
a) for the frequency with which SCNAs occur, randomly permute the positions at which they occur in the whole genome; for each permuted sample set, compute the occurrence frequency of the random SCNAs and build the frequency-based null distribution $D_f$;
b) for the alteration amplitude of SCNAs, randomly permute the positions at which the amplitudes occur in the whole genome; for each permuted sample set, compute the amplitude of the random SCNAs and build the amplitude-based null distribution $D_a$;
c) using the weights $w_1$ and $w_2$ obtained by supervised learning, construct the null distribution $D$ against which the significance of the statistic is tested:
$D = w_1 D_f + w_2 D_a$
S4: for each length L of SCNA structural unit, computing the statistic of SCNA patterns of length L at random positions in the permuted samples, thereby constructing a length-specific null distribution $D_L$ in the two-dimensional space; comparing the statistic of each SCNA with the corresponding $D_L$ to obtain a p-value; if the p-value is below a preset threshold, the SCNA is deemed significant and may play a functional role in cancer.
2. The method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model according to claim 1, characterized in that step S1 comprises:
preprocessing the SCNA signal to obtain SCNA signals that are comparable across samples; applying a segmentation algorithm to handle noise; and defining the SCNA amplification and deletion states.
3. The method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model according to claim 1, characterized in that step S2 comprises: using the Pearson formula to compute the correlation coefficient between adjacent SCNA sites, and dividing each chromosome into multiple relatively independent SCNA structural units.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410010002.9A CN103778350B (en) 2014-01-09 2014-01-09 Method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410010002.9A CN103778350B (en) 2014-01-09 2014-01-09 Method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model

Publications (2)

Publication Number Publication Date
CN103778350A CN103778350A (en) 2014-05-07
CN103778350B true CN103778350B (en) 2016-10-05

Family

ID=50570578

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410010002.9A Expired - Fee Related CN103778350B (en) 2014-01-09 2014-01-09 Method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model

Country Status (1)

Country Link
CN (1) CN103778350B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016025892A1 (en) * 2014-08-15 2016-02-18 Life Technologies Corporation Methods and systems for detecting minor variants in a sample of genetic material
CN105760712B (en) * 2016-03-01 2019-03-26 西安电子科技大学 A kind of copy number mutation detection method based on new-generation sequencing
CN106682455B (en) * 2016-11-24 2019-03-26 西安电子科技大学 A kind of Statistical Identifying Method of multisample copy number consistency variable region
CN106650312B (en) * 2016-12-29 2022-05-17 浙江安诺优达生物科技有限公司 Device for detecting copy number variation of circulating tumor DNA

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5919624A (en) * 1997-01-10 1999-07-06 The United States Of America As Represented By The Department Of Health & Human Services Methods for detecting cervical cancer
CN102103750A (en) * 2011-01-07 2011-06-22 杭州电子科技大学 Vision significance detection method based on Weber's law and center-periphery hypothesis
CN103093119A (en) * 2013-01-24 2013-05-08 南京大学 Method for recognizing significant biologic pathway through utilization of network structural information

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7822555B2 (en) * 2002-11-11 2010-10-26 Affymetrix, Inc. Methods for identifying DNA copy number changes

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
E. S. Venkatraman et al. "A faster circular binary segmentation algorithm for the analysis of CGH data". 《original paper》. 2007-01-18, Vol. 23, No. 6, pp. 657-663. *
Li Ping et al. "Improved gene copy number variation detection algorithm". 《Computer Engineering》. 2013-01-31, Vol. 39, No. 1, pp. 309-312. *
Vonn Walter et al. "DiNAMIC: A method to identify recurrent DNA copy number aberrations in tumors". 《Bioinformatics》. 2010, Vol. 27, No. 5, pp. 678-685. *
Xiguo Yuan et al. "TAG: A method to identify significant consensus events of copy number alterations in cancer". 《PLoS ONE》. 2012, Vol. 7, No. 7, pp. 1-10. *

Also Published As

Publication number Publication date
CN103778350A (en) 2014-05-07

Similar Documents

Publication Publication Date Title
CN105760712B (en) A kind of copy number mutation detection method based on new-generation sequencing
US11507049B2 (en) Method for detecting abnormity in unsupervised industrial system based on deep transfer learning
CN103778350B (en) Method for detecting the significance of somatic copy number alterations based on a two-dimensional statistical model
CN103544392B (en) Medical science Gas Distinguishing Method based on degree of depth study
Brill et al. Testing for differential abundance in compositional counts data, with application to microbiome studies
Bao et al. One-dimensional convolutional neural network for damage detection of jacket-type offshore platforms
CN104318059B (en) Method for tracking target and tracking system for non-linear Gaussian Systems
CN103245907B (en) A kind of analog-circuit fault diagnosis method
CN104820993B (en) It is a kind of to combine particle filter and track the underwater weak signal target tracking for putting preceding detection
CN102829967A (en) Time-domain fault identifying method based on coefficient variation of regression model
CN111562108A (en) Rolling bearing intelligent fault diagnosis method based on CNN and FCMC
CN112949387B (en) Intelligent anti-interference target detection method based on transfer learning
CN104330721A (en) Integrated circuit hardware Trojan horse detection method and integrated circuit hardware Trojan horse detection system
CN108549908A (en) Chemical process fault detection method based on more sampled probability core principle component models
CN103323228A (en) Mining drill fault intelligent identification method
CN113239022B (en) Method and device for complementing missing data in medical diagnosis, electronic device and medium
CN105447243A (en) Weak signal detection method based on adaptive fractional order stochastic resonance system
CN103885867A (en) Online evaluation method of performance of analog circuit
CN117495640A (en) Regional carbon emission prediction method and system
He et al. Study on missing data imputation and modeling for the leaching process
CN115310499B (en) Industrial equipment fault diagnosis system and method based on data fusion
Huang et al. Threshold-optimized swarm decomposition using grey wolf optimizer for the acoustic-based internal defect detection of arc magnets
Tang et al. A rolling bearing signal model based on a correlation probability box
Mills et al. Phase space sampling and operator confidence with generative adversarial networks
CN112651168B (en) Construction land area prediction method based on improved neural network algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161005
