CN107944225A - Gene high-throughput sequencing data mutation detection method

Info

Publication number: CN107944225A
Application number: CN201711214506.2A
Authority: CN
Inventors: 李超
Original assignee: Hui - Ying Medical Technology (shanghai) Co Ltd
Current assignee: Huisuan Gene Technology Shanghai Co ltd
Priority date: 2017-11-28
Filing date: 2017-11-28
Publication date: 2018-04-20
Anticipated expiration: 2037-11-28
Also published as: CN107944225B

Abstract

The present invention provides a gene high-throughput sequencing data mutation detection method, comprising the steps of: S1: obtaining high-throughput sequencing data of a gene sample; S2: generating a position information tag for each gene sequence of the high-throughput sequencing data of the gene sample; S3: grouping the gene sequences according to the position information tags and calculating a total mutation amount; S4: substituting the total mutation amount into a background model to output a mutation detection result. The method reduces noise by combining virtual molecular tags with a background database, improving the specificity and sensitivity of detection. It effectively reduces random errors arising during the experiment without increasing experimental cost and, combined with the background database's correction of systematic errors, achieves accurate identification of low-abundance mutations.

Description

Gene high-throughput sequencing data mutation detection method

Technical Field

The invention relates to the technical field of gene detection, in particular to a gene high-throughput sequencing data mutation detection method.

Background

In past clinical and research applications of tumor gene mutation detection, attention was generally paid only to high-abundance gene mutations in tumor tissue. Because the content of mutant nucleic acid is low, low-abundance mutations are prone to missed detection or false positives when sequencing coverage is low. However, some application scenarios, such as detecting low-abundance tumor mutant nucleic acid in blood by liquid biopsy, require accurate detection of low-abundance mutations. Combining targeted capture or amplification for high-throughput sequencing with high-depth sequencing increases the sequencing coverage of important tumor mutation sites and improves detection sensitivity. However, because noise is inherent in high-throughput sequencing, it remains difficult to distinguish true mutations from noise by experimental means alone; the problem must be addressed algorithmically, by establishing a model for noise reduction and mutation detection.

In the existing scheme, sequencing data from healthy people are used as background values, and a background noise threshold for each site is determined by fitting a normal distribution, so that true positive sites are distinguished from noise. However, this scheme has several problems: 1. Batch effects exist in high-throughput sequencing experiments and data generation; a background model built from sequencing data of healthy people can remove the systematic errors of the sequencing system, but cannot effectively remove the experimental errors generated randomly in each experiment. 2. The background data of healthy people must be established by measuring a large number of sites in a large number of samples, which is costly, and no noise reduction is possible for sites not yet covered by the background database.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a gene high-throughput sequencing data mutation detection method that reduces noise by combining virtual molecular tags with a background database. The method improves the specificity and sensitivity of detection, effectively reduces random errors in the experiment without increasing experimental cost, and, combined with the background database's correction of systematic errors, achieves accurate identification of low-abundance mutations.

In order to achieve the above object, the present invention provides a gene high-throughput sequencing data mutation detection method, comprising the steps of:

S1: obtaining high-throughput sequencing data of a gene sample;

S2: generating a position information tag for each gene sequence of the high-throughput sequencing data of the gene sample;

S3: grouping the gene sequences according to the position information tags and calculating a total mutation amount;

S4: substituting the total mutation amount into a background model to output a mutation detection result.

Preferably, the S2 step further comprises the steps of:

S21: aligning each gene sequence to a reference genome using a sequence alignment algorithm to produce alignment information for each gene sequence;

S22: storing the alignment information in a SAM/BAM format file;

S23: determining the template strand Ti of the sequence source of each gene sequence from the SAM/BAM format file, where 1 ≤ i ≤ n and n is the number of gene sequences;

S24: generating a position information tag for each gene sequence from the template strand Ti of the sequence source and the SAM/BAM format file.

Preferably, the S23 step further comprises the steps of:

extracting from the SAM/BAM format file, for each gene sequence, a first alignment start position Pi, a second alignment start position Qi of the aligned sequence from the same fragment, strand information Si (positive or negative), and a sequence number Ri of the gene sequence;

the template strand Ti of the sequence source is positive when the sequence number Ri of the gene sequence is equal to the value of the read1 position of the SAM/BAM format file and the strand information Si is equal to the value of the forward position of the SAM/BAM format file, or when Ri is not equal to the value of the read1 position and Si is not equal to the value of the forward position;

the template strand Ti of the sequence source is negative when the sequence number Ri of the gene sequence is equal to the value of the read1 position of the SAM/BAM format file and the strand information Si is not equal to the value of the forward position of the SAM/BAM format file, or when Ri is not equal to the value of the read1 position and Si is equal to the value of the forward position.
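The rule above amounts to checking whether the read1 status and the forward-strand status of a read agree. A minimal sketch follows, assuming that the "read1 position" and "forward position" correspond to the standard SAM FLAG bits 0x40 (first segment in the template) and 0x10 (read aligned to the reverse strand); the function name `template_strand` is hypothetical and does not appear in the original text.

```python
READ1_FLAG = 0x40    # SAM FLAG bit: first segment in the template (read1)
REVERSE_FLAG = 0x10  # SAM FLAG bit: read is aligned to the reverse strand


def template_strand(flag: int) -> str:
    """Return '+' or '-' for the template strand T_i of one aligned read.

    Assumption: R_i and S_i are read from the SAM FLAG field.
    """
    is_read1 = bool(flag & READ1_FLAG)
    is_forward = not (flag & REVERSE_FLAG)
    # Positive when both conditions hold or both fail; negative when
    # exactly one of them holds.
    return "+" if is_read1 == is_forward else "-"
```

For example, a forward read1 (FLAG 99) and its reverse read2 mate (FLAG 147) both map to the positive template strand, while a reverse read1 (FLAG 83) and a forward read2 (FLAG 163) map to the negative one.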

Preferably, the position information tag is represented as (Pi, Qi, Ti).

Preferably, the S3 step further comprises the steps of:

S31: assigning gene sequences with identical position information tags to the same group;

S32: for each group, at the current gene position corresponding to a target gene position g_i of the reference genome, counting the number v_j of gene sequences that carry the mutant genotype with base quality q > 30, where j is a natural number greater than or equal to 1;

if v_j > 0, recording the number n_j of gene sequences with base quality q > 30 at the current gene position;

if v_j < f*n_j, setting v_j = 0, where f is a preset minimum base identity ratio;

S33: repeating step S32 to obtain the mutation number v_j of each target gene position, and calculating a total mutation amount [formula] from the mutation numbers, wherein [formula];

when [formula], retaining [formula] and continuing the subsequent steps;

when [formula], resetting the value of [formula] and continuing the subsequent steps.
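The grouping and filtering of steps S31 and S32 can be sketched as follows for a single target gene position. This is an illustration under stated assumptions, not the patented implementation: the function name `group_mutation_counts`, the input layout (one `(tag, is_mutant, base_quality)` triple per gene sequence) and the example value f = 0.5 are all hypothetical. Because the formula for the total mutation amount is not reproduced in this text, the sketch stops at the per-group mutation counts.

```python
from collections import defaultdict


def group_mutation_counts(reads, f, min_quality=30):
    """Per-group mutation counts at one target position (steps S31 and S32).

    reads: iterable of (tag, is_mutant, base_quality), where tag is the
           position information tag (P, Q, T) of the gene sequence.
    f:     preset minimum base identity ratio (value chosen by the user).
    """
    # S31: gene sequences with identical tags fall into the same group.
    groups = defaultdict(list)
    for tag, is_mutant, quality in reads:
        groups[tag].append((is_mutant, quality))

    counts = {}
    for tag, members in groups.items():
        # n: sequences with base quality above the threshold at this position.
        n = sum(1 for _, q in members if q > min_quality)
        # v: those among them that carry the mutant genotype.
        v = sum(1 for mutant, q in members if mutant and q > min_quality)
        # Discard mutations supported by too small a share of the group.
        if 0 < v < f * n:
            v = 0
        counts[tag] = v
    return counts


example = [
    ((100, 250, "+"), True, 35), ((100, 250, "+"), True, 36),
    ((100, 250, "+"), True, 20), ((100, 250, "+"), False, 38),
    ((120, 270, "-"), True, 37), ((120, 270, "-"), False, 40),
    ((120, 270, "-"), False, 39), ((120, 270, "-"), False, 33),
]
result = group_mutation_counts(example, f=0.5)
```

In this example the first group keeps two mutant sequences (the third mutant read is ignored for low base quality), while the single mutant read in the second group is below the ratio f and is reset to zero, which is how random errors confined to a minority of copies of one template are suppressed.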

Preferably, the S4 step further comprises the steps of:

S41: establishing a background model, wherein the formula of the background model is as follows:

[formula]

wherein P_gi is the cumulative distribution frequency, γ is a first fitting parameter, δ is a second fitting parameter, ε is a third fitting parameter, and λ is a fourth fitting parameter;

obtaining the first fitting parameter, the second fitting parameter, the third fitting parameter and the fourth fitting parameter according to fitting of a plurality of sample data;

S42: substituting the total mutation amount into the background model, and calculating the cumulative distribution frequency;

S43: when the cumulative distribution frequency is greater than 0.95, determining that the gene locus corresponding to the current position information tag is a positive locus.

Preferably, the number of the sample data is greater than or equal to 1000.
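The decision in steps S42 and S43 can be sketched as follows. Since the formula of the background model is not reproduced in this text, the sketch takes the cumulative distribution function as a parameter; `placeholder_cdf` is a stand-in (a normal CDF with an arbitrary mean and standard deviation) and is not the four-parameter model of the invention. All function names are hypothetical.

```python
import math
from typing import Callable


def placeholder_cdf(x: float) -> float:
    """Stand-in background CDF (normal, mean 2, sd 1); NOT the patent's model."""
    return 0.5 * (1.0 + math.erf((x - 2.0) / math.sqrt(2.0)))


def is_positive_site(
    total_mutations: float,
    background_cdf: Callable[[float], float],
    threshold: float = 0.95,
) -> bool:
    """S42: evaluate the background CDF; S43: compare with the 0.95 cutoff."""
    return background_cdf(total_mutations) > threshold
```

With the stand-in CDF, `is_positive_site(5, placeholder_cdf)` is true and `is_positive_site(2, placeholder_cdf)` is false. In practice the callable would be the fitted background model with its four fitting parameters.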

Due to the adoption of the technical scheme, the invention has the following beneficial effects:

1. Random noise in high-throughput sequencing is removed without adding experimental steps or cost.

2. A calculation model for distinguishing the positive mutation sites is established by modeling sequencing data of healthy people after random noise is removed.

Finally, the sensitivity and specificity of low-abundance variation detection can be obviously improved on the premise of not changing the existing experimental system.

Drawings

FIG. 1 is a flow chart of a gene high throughput sequencing data mutation detection method according to an embodiment of the present invention.

Detailed Description

The preferred embodiment of the present invention is described below with reference to FIG. 1, to enable a better understanding of its functions and features.

Referring to FIG. 1, a gene high-throughput sequencing data mutation detection method according to an embodiment of the present invention includes the steps of:

S1: obtaining high-throughput sequencing data of a gene sample.

S2: generating a positional information tag for each gene sequence of the high-throughput sequencing data of the gene sample.

Wherein, the step of S2 further comprises the steps of:

S21: aligning each gene sequence to a reference genome using a sequence alignment algorithm to produce alignment information for each gene sequence; any existing sequence alignment algorithm may be used, without particular limitation; the alignment information includes first alignment start position information, second alignment start position information, base quality information, strand information, sequence number information of the gene sequence, and the like;

S22: storing the alignment information in a SAM/BAM format file;

S23: determining the template strand Ti of the sequence source of each gene sequence from the SAM/BAM format file, where 1 ≤ i ≤ n and n is the number of gene sequences;
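As an illustration of the alignment information listed in step S21, the sketch below reads it from one text-format SAM record. It assumes the standard SAM column layout (FLAG in column 2, POS in column 4, PNEXT in column 8, QUAL in column 11, with Phred+33 quality encoding); the names `AlignmentInfo` and `parse_sam_line` are hypothetical, and a production pipeline would more likely use an existing SAM/BAM library.

```python
from typing import List, NamedTuple


class AlignmentInfo(NamedTuple):
    start: int                 # first alignment start position (POS)
    mate_start: int            # second alignment start position (PNEXT)
    is_forward: bool           # strand information (FLAG bit 0x10 unset)
    is_read1: bool             # sequence number (FLAG bit 0x40 set)
    base_qualities: List[int]  # Phred base qualities decoded from QUAL


def parse_sam_line(line: str) -> AlignmentInfo:
    """Extract the alignment information of one SAM alignment record."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return AlignmentInfo(
        start=int(fields[3]),
        mate_start=int(fields[7]),
        is_forward=not (flag & 0x10),
        is_read1=bool(flag & 0x40),
        # Phred+33 encoding: quality = ASCII code minus 33.
        base_qualities=[ord(c) - 33 for c in fields[10]],
    )


record = "r1\t99\tchr1\t100\t60\t4M\t=\t250\t154\tACGT\tIIII"
info = parse_sam_line(record)
```

Here `info` holds a start of 100, a mate start of 250, a forward read1, and four base qualities of 40.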

Wherein, the step of S23 further comprises the steps of:

extracting from the SAM/BAM format file, for each gene sequence, a first alignment start position Pi, a second alignment start position Qi of the aligned sequence from the same fragment, strand information Si (positive or negative), and a sequence number Ri of the gene sequence; the logical relationship for the template strand Ti of the sequence source can be expressed as:

when the sequence number Ri of the gene sequence is equal to the value of the read1 position of the SAM/BAM format file and the strand information Si is equal to the value of the forward position of the SAM/BAM format file, or when Ri is not equal to the value of the read1 position and Si is not equal to the value of the forward position, the template strand Ti of the sequence source is positive;

when the sequence number Ri of the gene sequence is equal to the value of the read1 position of the SAM/BAM format file and the strand information Si is not equal to the value of the forward position of the SAM/BAM format file, or when Ri is not equal to the value of the read1 position and Si is equal to the value of the forward position, the template strand Ti of the sequence source is negative.

In the present embodiment, the position information tag is represented as (Pi, Qi, Ti). This triplet uniquely identifies all sequences derived from the same template nucleic acid and distinguishes between the sense and antisense strands of the template.
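The behaviour of the triplet can be illustrated with a small sketch. The names `PositionTag` and `make_tag` are hypothetical, and the strand rule is the one stated above (positive when the read1 status and the forward status agree). Two copies of the same template receive the same tag, while a read with the same coordinates that originates from the opposite template strand receives a different one.

```python
from typing import NamedTuple


class PositionTag(NamedTuple):
    p: int   # first alignment start position
    q: int   # second alignment start position (same fragment)
    t: str   # template strand of the sequence source, '+' or '-'


def make_tag(start: int, mate_start: int, is_read1: bool, is_forward: bool) -> PositionTag:
    """Build the position information tag for one gene sequence."""
    strand = "+" if is_read1 == is_forward else "-"
    return PositionTag(start, mate_start, strand)


# Two copies (for example amplification duplicates) of one template.
tag_a = make_tag(100, 250, is_read1=True, is_forward=True)
tag_a_copy = make_tag(100, 250, is_read1=True, is_forward=True)
# Same coordinates, but the forward read is read2: opposite template strand.
tag_b = make_tag(100, 250, is_read1=False, is_forward=True)
```

Because the tag is a hashable tuple, it can be used directly as a dictionary key when the gene sequences are grouped in step S3.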

S3: grouping the gene sequences according to the position information tags and calculating a total mutation amount.

Wherein, the step S3 further comprises the steps of:

S31: assigning gene sequences with identical position information tags to the same group;

S32: for each group, at the current gene position corresponding to a target gene position g_i of the reference genome, counting the number v_j of gene sequences that carry the mutant genotype with base quality q > 30, where j is a natural number greater than or equal to 1;

if v_j > 0, recording the number n_j of gene sequences with base quality q > 30 at the current gene position;

when [formula], retaining [formula] and continuing the subsequent steps;

Wherein, the step S4 further comprises the steps of:

obtaining a first fitting parameter, a second fitting parameter, a third fitting parameter and a fourth fitting parameter according to fitting of more than 1000 sample data;

S42: substituting the total mutation amount into the background model, and calculating the cumulative distribution frequency;

The gene high-throughput sequencing data mutation detection method provided by the embodiment of the invention has the following beneficial effects:

1. Random noise in high-throughput sequencing is removed without adding experimental steps or cost.

2. A calculation model for distinguishing the positive mutation sites is established by modeling sequencing data of healthy people without random noise.

While the present invention has been described in detail and with reference to the embodiments thereof as illustrated in the accompanying drawings, it will be apparent to one skilled in the art that various changes and modifications can be made therein. Therefore, certain details of the embodiments are not to be interpreted as limiting, and the scope of the invention is to be determined by the appended claims.

Claims

1. A gene high-throughput sequencing data mutation detection method comprises the following steps:

S1: obtaining high-throughput sequencing data of a gene sample;

2. The method for detecting mutation in gene high-throughput sequencing data according to claim 1, wherein the step S2 further comprises the steps of:

S22: storing the alignment information in a SAM/BAM format file;

3. The method for detecting mutation in gene high-throughput sequencing data according to claim 2, wherein said step of S23 further comprises the steps of:

4. The method for detecting mutation in gene high-throughput sequencing data according to claim 3, wherein said position information tag is represented as (Pi, Qi, Ti).

5. The method for detecting mutation in gene high throughput sequencing data according to any one of claims 1 to 4, wherein said S3 step further comprises the steps of:

S31: assigning gene sequences with identical position information tags to the same group;

S32: for each group, at the current gene position corresponding to a target gene position g_i of the reference genome, counting the number v_j of gene sequences that carry the mutant genotype with base quality q > 30, where j is a natural number greater than or equal to 1;

if v_j > 0, recording the number n_j of gene sequences with base quality q > 30 at the current gene position;

when [formula], retaining [formula] and continuing the subsequent steps;

when [formula], resetting the value of [formula] and continuing the subsequent steps.

6. The method for detecting mutation in gene high-throughput sequencing data according to claim 5, wherein the step S4 further comprises the steps of:

wherein P_gi is the cumulative distribution frequency, γ is a first fitting parameter, δ is a second fitting parameter, ε is a third fitting parameter, and λ is a fourth fitting parameter;

7. The method of detecting mutations in gene high-throughput sequencing data according to claim 6, wherein the number of sample data is 1000 or more.