CN111161801A

CN111161801A - Method for automatically identifying heterozygous mutation in first-generation gene sequencing

Info

Publication number: CN111161801A
Application number: CN201911403408.2A
Authority: CN
Inventors: 杨琦; 张未波; 李孝尧; 施笑蕾; 濮娜; 张国福; 陈炜炜; 柯路; 童智慧; 李维勤
Original assignee: Individual
Current assignee: Individual
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-15
Anticipated expiration: 2039-12-31
Also published as: CN111161801B

Abstract

The invention relates to a method for automatically identifying heterozygous mutations in first-generation gene sequencing, comprising the following steps: S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is a positive real-valued point; S2, calculating the envelope area of the detected signal intensity of each base at each integer point; S3, grouping the integer points so that the maximum envelope areas of the base detection signals within each group are close to one another; S4, determining, within each group of integer points, an area difference threshold and an area ratio threshold for identifying heterozygous mutations; S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm; S6, for each integer point, obtaining the area difference threshold and the area ratio threshold from its group, calculating the area difference and the area ratio of the two bases with the largest envelope areas at that point, and judging whether a mutation is present. The invention improves the efficiency and accuracy of heterozygous mutation identification and reduces labor cost.

Description

Method for automatically identifying heterozygous mutation in first-generation gene sequencing

Technical Field

The invention relates to a method for analyzing gene sequencing results, and in particular to a method for automatically identifying heterozygous mutations in first-generation gene sequencing.

Background

Results from existing first-generation gene sequencing (the Sanger sequencing method) lack a mature technique for automatically identifying heterozygous mutations. Identification is therefore performed manually, which is inefficient and inaccurate and incurs high labor and time costs.

Disclosure of Invention

The invention aims to address the shortcomings of the prior art by providing a method for automatically identifying heterozygous mutations in first-generation gene sequencing. The method greatly improves the efficiency and accuracy of heterozygous mutation identification, reduces labor and time costs, and has broad application prospects.

The technical scheme of the invention is as follows:

A method for automatically identifying heterozygous mutations in first-generation gene sequencing comprises the following steps:

S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is a real-valued point and whose ordinate is the detection signal intensity of the corresponding base; the unit of the real-valued point is bp;

S2, based on the coordinate sequence of each base, calculating, for each integer point, the area enclosed by the base's detection signal intensity curve and the abscissa axis within the 0.5 bp interval on either side of that integer point, and taking it as the envelope area of that base's detection signal intensity at that integer point;

S3, grouping the integer points, using the maximum envelope area of the base detection signals at each integer point as the classification basis, so that the maximum envelope areas within each group are close to one another;

S4, determining, within each group of integer points, an area difference threshold and an area ratio threshold for identifying heterozygous mutations, according to the statistical characteristics of the frequency of heterozygous mutations and the graphical characteristics within the 0.5 bp interval on either side of an integer point at which a heterozygous mutation occurs;

S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm;

S6, for each integer point not in the noise integer point set, obtaining the area difference threshold and the area ratio threshold from the group to which the point belongs; calculating, from the envelope areas of the detection signal intensities of the bases at that point, the area difference and the area ratio of the two bases with the largest areas; and, if the area difference is smaller than the area difference threshold and the area ratio is smaller than the area ratio threshold, judging the point to be a suspected heterozygous mutation.

Preferably, the step S1 includes the steps of:

S1.1, reading the positive rational number sequence composed of the detection signal intensity values of each base, hereinafter referred to as the detection sequence;

S1.2, reading the detection sequence subscript at which the current integer point is located; the unit of the point is bp; the subscripts are positive integers;

S1.3, calculating the difference between the detection sequence subscripts of the current integer point and the previous integer point, and taking the reciprocal of this subscript difference as the spacing of the data points between the previous integer point and the current integer point, referred to as the step unit value; the step unit value is a positive rational number; its unit is bp;

S1.4, taking out the detection sequence data in subscript order, starting from the previous point on the abscissa; the data are positive rational numbers;

S1.5, taking the detection sequence data as the ordinate, and taking the accumulated step unit value obtained in S1.3 as the abscissa, up to the next integer point; appending the sequence formed by the corresponding abscissa and ordinate pairs to the end of the coordinate sequence;

S1.6, if the current integer point does not exceed the length of the sequence under test, adding one to the current integer point and returning to step S1.3; if the current integer point has reached the length of the sequence under test, the complete coordinate sequence has been obtained.

Preferably, the step S2 includes the steps of:

S2.1, for each base, extracting the portion of the coordinate sequence within the 0.5 bp interval on either side of the current integer point, referred to as the truncated sequence;

S2.2, when in S2.1 there is no data at either end point 0.5 bp from the current integer point, the end points can be supplied by an interpolation algorithm, which improves the precision of the envelope area calculation;

S2.3, calculating the area of the trapezoid formed by every two adjacent coordinates in the truncated sequence of each integer point, accumulating the trapezoidal areas to obtain the envelope area of each base at each point, and appending the envelope area to the end of the area sequence;

S2.4, if the current integer point does not exceed the length of the sequence under test, adding one to the current integer point and returning to step S2.1; if the current integer point has reached the length of the sequence under test, the complete area sequence has been obtained.

Preferably, the step S4 includes the steps of:

S4.1, at each point, subtracting the second-largest base envelope area from the largest to obtain the maximum envelope area difference sequence, referred to as the area difference sequence;

S4.2, at each point, dividing the largest base envelope area by the second-largest to obtain the maximum envelope area ratio sequence, referred to as the area ratio sequence;

S4.3, calculating the median of the area difference sequence of S4.1, referred to as the area difference median; calculating the median of the area ratio sequence of S4.2, referred to as the area ratio median;

S4.4, the area difference threshold equals the area difference median multiplied by the maximum rate of change of the waveform peak within the group; the area ratio threshold equals the area ratio median multiplied by the maximum rate of change of the waveform peak within the group.

Preferably, the step S6 includes the steps of:

S6.1, if the current point is in the noise point set, skipping it and continuing with the next point;

S6.2, determining the integer point group to which the current point belongs from the maximum base envelope area at the current point, and obtaining the area difference threshold and the area ratio threshold of that group;

S6.3, taking the area difference and the area ratio of the point under analysis from the respective sequences; if the area difference is smaller than or equal to the area difference threshold and the area ratio is smaller than or equal to the area ratio threshold, the point is considered a suspected heterozygous mutation point.

The invention has the beneficial effects that:

The invention is reasonably designed and easy to use. By applying computer graphics methods, it imitates the way a human observer identifies heterozygous mutations, greatly improves the efficiency and accuracy of heterozygous mutation identification, reduces labor and time costs, and has broad application prospects.

Drawings

FIG. 1 is a line graph of the fluorescence signal intensity curves of the different bases.

FIG. 2 is a partially enlarged view of FIG. 1.

FIG. 3 shows the area enclosed by two sets of coordinate points.

FIG. 4 shows the area enclosed by the envelope and the point axis within each integer point range.

Wherein:

In FIG. 1, the numbers at the top are the integer points, the letters on the curves are the base types corresponding to the data points, the letters A, T, C and G below are the base types called at each integer point, and the numbers at the bottom are the Phred scores evaluating sequencing quality.

The integer point 313 outlined in FIG. 2 shows the curve characteristic of a heterozygous mutation between the C and T bases: the areas enclosed by the envelopes of the two bases and the point axis are close to each other in both area difference and area ratio.

In FIG. 3, the area enclosed by two sets of coordinate points is calculated using the trapezoidal area formula.

FIG. 4 illustrates how, for the fluorescence curve of a given base type, the trapezoidal areas within each integer point range are calculated and summed to obtain the area enclosed by that curve and the point axis within each integer point range.

Detailed Description

The invention is further described below with reference to the figures and examples.

As shown in figures 1,2,3 and 4.

S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is the point position, with the positive integer points shown at the top of FIG. 1 and FIG. 2, and whose ordinate is the detection signal intensity of the corresponding base. Because detection signals of several bases can appear at the abscissa of each positive integer point, the waveforms in FIG. 1 and FIG. 2 are labeled with the letter of the corresponding base. The unit of the real-valued point is bp; for example, the point framed in FIG. 2 is 313, and the waveforms within it consist mainly of the detection signals of the C and T bases;

S1.1, this example uses the nucleotide sequence corresponding to the human LPL protein. First, the positive rational number sequence composed of the detection signal intensity values of each base is read; for example, the detection signal intensity sequence of the A base in a certain sequence is [50, 99, 203, 389, 679, 3455, 3816, 4172 … ], which contains the intensity values at 5921 sampling points. This is hereinafter referred to as the detection sequence;

S1.2, reading the detection sequence subscript at which the current integer point is located; the unit of the point is bp; the subscripts are positive integers, e.g., [3, 39, 44, 58, 76, 94, 101, 109 … ], a sequence giving the detection sequence subscripts of 494 integer points. For example, point 1 corresponds to the three data points with subscripts 0, 1 and 2, i.e., 50, 99 and 203 for the A base;

S1.3, calculating the difference between the detection sequence subscripts of the current integer point and the previous integer point, and taking the reciprocal of this subscript difference as the spacing of the data points between the previous integer point and the current integer point, referred to as the step unit value; the step unit value is a positive rational number; its unit is bp. Continuing from S1.1 and S1.2, the subscript difference between point 2 and point 1 is 39 - 3 = 36, which indicates 36 uniformly spaced sampling points between them, so the step unit value is 1/36. Point 1 is a special case: its subscript is differenced against 0, giving 3 - 0 = 3 and a step unit value of 1/3;

S1.4, taking out the detection sequence data in subscript order, starting from the previous point on the abscissa; the data are positive rational numbers. Continuing the example of S1.3, the detection signal intensity values corresponding to integer point 1 are the 3 samples 50, 99 and 203, and those corresponding to integer point 2 are the 36 samples 389, 679, 3455, 3816, 4172 …;

S1.5, taking the detection sequence data as the ordinate, and taking the accumulated step unit value obtained in S1.3 as the abscissa, up to the next integer point; appending the sequence formed by the corresponding abscissa and ordinate pairs to the end of the coordinate sequence. Continuing from S1.4, the abscissas of the three samples of integer point 1 start from 0 and increase by 1/3 per data point, forming the coordinates (1/3, 50), (2/3, 99), (3/3, 203); the abscissas of the 36 samples of integer point 2 start from 1 and increase by 1/36 per data point, forming the coordinates (1+1/36, 389), (1+2/36, 679), (1+3/36, 3455) …, and so on;

S1.6, if the current integer point does not exceed the length of the sequence under test, adding one to the current integer point and returning to step S1.3; if the current integer point has reached the length of the sequence under test, the complete coordinate sequence has been obtained;
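Steps S1.1 to S1.6 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the names `build_coordinates`, `signal` and `peak_indices` are hypothetical, and exact fractions are used only so that the abscissas match the worked example.

```python
from fractions import Fraction


def build_coordinates(signal, peak_indices):
    """Map raw samples of one base channel to (point in bp, intensity) pairs.

    signal       -- detection sequence of one base (S1.1)
    peak_indices -- detection sequence subscript of each integer point (S1.2)
    """
    coords = []
    prev_index = 0
    for point, index in enumerate(peak_indices, start=1):
        # S1.3: the step unit value is the reciprocal of the subscript difference.
        step = Fraction(1, index - prev_index)
        # S1.4 / S1.5: walk the samples in subscript order and accumulate the step.
        for k, sample in enumerate(signal[prev_index:index], start=1):
            coords.append((point - 1 + k * step, sample))
        prev_index = index  # S1.6: move on to the next integer point
    return coords


# Only the first eight samples of the example detection sequence are known here.
a_channel = [50, 99, 203, 389, 679, 3455, 3816, 4172]
coords = build_coordinates(a_channel, [3, 39])
print(coords[:4])
```

With the example data, the first three coordinates are (1/3, 50), (2/3, 99) and (1, 203), and the fourth is (1+1/36, 389), as in S1.5.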

S2.1, for each base, extracting the portion of the coordinate sequence within the 0.5 bp interval on either side of the current integer point, referred to as the truncated sequence. Continuing from S1.4, the interval 0.5 bp on either side of integer point 1 is [0.5 bp, 1.5 bp], so the truncated sequence at integer point 1 is (2/3, 99), (3/3, 203), (1+1/36, 389), (1+2/36, 679), (1+3/36, 3455) …, up to the abscissa 1+18/36;

S2.2, when in S2.1 there is no data at either end point 0.5 bp from the current integer point, the end points can be supplied by an interpolation algorithm, which improves the precision of the envelope area calculation. Taking the data of S2.1 with integer point 1 as the current point, a data point exists at 1.5 bp on the right side, but none exists at 0.5 bp on the left side. The line through (1/3, 50) and (2/3, 99) is therefore obtained from the slope equation, and the interpolated value at 0.5 bp is found to be approximately 74, giving the point (0.5, 74);

S2.3, as shown in FIG. 3, calculating the area of the trapezoid formed by every two adjacent coordinates in the truncated sequence of each integer point using the trapezoidal area formula, accumulating the trapezoidal areas to obtain the envelope area of each base at each point as shown in FIG. 4, and appending the envelope area to the end of the area sequence. Taking the corrected data of S2.2 as an example, the area enclosed by the two points (0.5, 74) and (2/3, 99) is given by (upper base + lower base) × height / 2.0, that is, (74 + 99) × (2/3 - 0.5) / 2.0 ≈ 14.4;

S2.4, if the current integer point does not exceed the length of the sequence under test, adding one to the current integer point and returning to step S2.1; if the current integer point has reached the length of the sequence under test, the complete area sequence has been obtained;
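The envelope area calculation of S2.1 to S2.3 can be sketched as below. The function names are hypothetical, the coordinate list is assumed to be sorted by abscissa, and linear interpolation is assumed for S2.2 (the text only says "an interpolation algorithm").

```python
def interpolate(p, q, x):
    """Intensity at abscissa x on the straight line through points p and q."""
    (x0, y0), (x1, y1) = p, q
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def envelope_area(coords, point, half_width=0.5):
    """Trapezoidal area under one base's curve on [point - 0.5, point + 0.5]."""
    lo, hi = point - half_width, point + half_width
    # S2.1: truncated sequence for this integer point.
    window = [(x, y) for x, y in coords if lo <= x <= hi]
    left = [pt for pt in coords if pt[0] < lo]
    right = [pt for pt in coords if pt[0] > hi]
    # S2.2: add interpolated end points when no sample sits on a boundary.
    if window and window[0][0] > lo and left:
        window.insert(0, (lo, interpolate(left[-1], window[0], lo)))
    if window and window[-1][0] < hi and right:
        window.append((hi, interpolate(window[-1], right[0], hi)))
    # S2.3: sum the trapezoids formed by consecutive coordinates.
    return sum((y0 + y1) * (x1 - x0) / 2.0
               for (x0, y0), (x1, y1) in zip(window, window[1:]))


# Boundary value from the worked example: the line through (1/3, 50) and (2/3, 99).
print(interpolate((1 / 3, 50), (2 / 3, 99), 0.5))
```

For the boundary in the worked example, linear interpolation yields 74.5, which corresponds to the value of approximately 74 used in S2.2.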

S3, grouping the integer points, using the maximum envelope area of the base detection signals at each integer point as the classification basis, so that the maximum envelope areas within each group are close to one another; common grouping methods include K-Means, SVM and the like. As can be seen from FIG. 1 and FIG. 2, the maximum envelope peak of the base detection signals fluctuates, and grouping reduces the influence of this fluctuation on the judgment of heterozygous mutations, thereby improving accuracy. For example, suppose the envelope areas of the A, T, C and G bases within the 0.5 bp range around successive integer points are (0, 100, 0, 0), (98, 0, 91, 0), (10, 2, 3, 1), (2, 50, 49, 1), (4, 2, 30, 1), (0, 8, 0, 8), …; the maximum envelope area sequence is then (100, 98, 10, 50, 30, 8, …). If the points are divided into high, medium and low groups, the high group contains (0, 100, 0, 0), (98, 0, 91, 0), …, the medium group contains (2, 50, 49, 1), (4, 2, 30, 1), …, and the low group contains (10, 2, 3, 1), (0, 8, 0, 8), ….
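One possible grouping is sketched below. It is an assumption made for illustration, not the method prescribed by the text: instead of iterative K-Means it uses an exact one-dimensional variant that sorts the maximum envelope areas and chooses the split into `k` contiguous groups with the smallest within-group squared error. The name `group_by_area` is hypothetical, and at least `k` values are assumed.

```python
def group_by_area(values, k=3):
    """Label each value 0..k-1 (0 = lowest group) by optimal 1-D partitioning."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    n = len(xs)
    # Prefix sums give the squared error of any sorted slice in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def cost(a, b):
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    # best[g][j]: minimal error when xs[:j] is split into g groups.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for g in range(1, k + 1):
        for j in range(g, n + 1):
            for a in range(g - 1, j):
                c = best[g - 1][a] + cost(a, j)
                if c < best[g][j]:
                    best[g][j], cut[g][j] = c, a
    # Walk the recorded cuts backwards to label every original index.
    labels = [0] * n
    j = n
    for g in range(k, 0, -1):
        a = cut[g][j]
        for idx in order[a:j]:
            labels[idx] = g - 1
        j = a
    return labels


max_areas = [100, 98, 10, 50, 30, 8]
print(group_by_area(max_areas))  # 2 = high, 1 = medium, 0 = low
```

On the six example values this reproduces the grouping described above: 100 and 98 fall in the high group, 50 and 30 in the medium group, and 10 and 8 in the low group.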

S4.1, at each point, subtracting the second-largest base envelope area from the largest to obtain the maximum envelope area difference sequence, referred to as the area difference sequence. Continuing from S3, the area difference sequence of the high group is (100, 7, …), that of the medium group is (1, 26, …), and that of the low group is (7, 0, …);

S4.2, at each point, dividing the largest base envelope area by the second-largest to obtain the maximum envelope area ratio sequence, referred to as the area ratio sequence. Continuing from S3, the area ratio sequence of the high group is (positive infinity, 98/91, …), that of the medium group is (50/49, 30/4, …), and that of the low group is (10/3, 8/8, …);

S4.3, calculating the median of the area difference sequence of S4.1, referred to as the area difference median; continuing from S4.1, the medians are, for example, 96 for the high group, 24 for the medium group and 8 for the low group; calculating the median of the area ratio sequence of S4.2, referred to as the area ratio median; continuing from S4.2, the medians are, for example, 400 for the high group, 280 for the medium group and 80 for the low group;

S4.4, the area difference threshold equals the area difference median multiplied by the maximum rate of change of the waveform peak within the group; the area ratio threshold equals the area ratio median multiplied by the maximum rate of change of the waveform peak within the group. Assuming the maximum rate of change of the waveform peak within each group is 20%, the area difference thresholds are 96 × 20% = 19.2 for the high group, 24 × 20% = 4.8 for the medium group and 8 × 20% = 1.6 for the low group, and the area ratio thresholds are 400 × 20% = 80 for the high group, 280 × 20% = 56 for the medium group and 80 × 20% = 16 for the low group;
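The threshold calculation of S4.1 to S4.4 for a single group can be sketched as follows. The function names are hypothetical, and treating a zero second-largest area as an infinite ratio is an assumption taken from the "positive infinity" entry in the example.

```python
from statistics import median


def top_two(areas):
    """Largest and second-largest envelope area at one integer point."""
    first, second = sorted(areas, reverse=True)[:2]
    return first, second


def group_thresholds(area_rows, max_change_rate):
    """Return (area difference threshold, area ratio threshold) for one group.

    area_rows       -- envelope areas (one value per base) of the group's points
    max_change_rate -- maximum rate of change of the waveform peak, e.g. 0.2
    """
    diffs, ratios = [], []
    for areas in area_rows:
        first, second = top_two(areas)
        diffs.append(first - second)                                # S4.1
        ratios.append(first / second if second else float("inf"))  # S4.2
    # S4.3 takes the medians; S4.4 scales them by the maximum change rate.
    return median(diffs) * max_change_rate, median(ratios) * max_change_rate


medium_rows = [(2, 50, 49, 1), (4, 2, 30, 1)]
print(group_thresholds(medium_rows, 0.2))
```

The two rows used here are only the medium-group points that the example lists explicitly, so the result is not expected to match the illustrative medians quoted in S4.3, which refer to the full groups.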

S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm, for example:

a Modified Mott trimming algorithm based on the sequencing quality score (Phred score) can be used to find the low-noise subsequence with the maximum Phred score; the subsequences on either side of it are the head and tail noise sequences, and their elements are noise points;

the noise spectrum in the coordinate sequence can be removed using a wavelet transform or a Fourier transform;

the graphical quality of each point can be evaluated by a neural-network-based method, and points evaluated as having poor graphical quality are identified as noise;

other noise identification algorithms may also be used;
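The first option can be sketched as below, using a common formulation of Mott-style quality trimming: each Phred score is converted to an error probability, the probability is subtracted from a cutoff, and the contiguous segment with the largest running sum is kept. The cutoff of 0.05 and the function names are assumptions; the text does not specify parameters, and published implementations differ in details.

```python
def mott_trim(phred_scores, cutoff=0.05):
    """Return (start, end) of the kept low-noise window, end exclusive, 0-based."""
    best_sum, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i, q in enumerate(phred_scores):
        # Positive contribution when the error probability is below the cutoff.
        running += cutoff - 10 ** (-q / 10.0)
        if running <= 0:
            running, start = 0.0, i + 1  # restart after a low-quality stretch
        elif running > best_sum:
            best_sum, best_start, best_end = running, start, i + 1
    return best_start, best_end


def noise_points(phred_scores, cutoff=0.05):
    """1-based integer points lying outside the kept window."""
    start, end = mott_trim(phred_scores, cutoff)
    return [p for p in range(1, len(phred_scores) + 1) if not start < p <= end]


# Hypothetical Phred scores: poor quality at the head and tail of the read.
scores = [5, 8, 9, 40, 45, 50, 42, 6, 7]
print(mott_trim(scores), noise_points(scores))
```

For these hypothetical scores the kept window covers points 4 to 7, so points 1, 2, 3, 8 and 9 are treated as head and tail noise.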

S6, for each integer point not in the noise integer point set, obtaining the area difference threshold and the area ratio threshold from the group to which the point belongs; calculating, from the envelope areas of the detection signal intensities of the bases at that point, the area difference and the area ratio of the two bases with the largest areas; and, if the area difference is smaller than the area difference threshold and the area ratio is smaller than the area ratio threshold, judging the point to be a suspected heterozygous mutation. As can be seen from FIG. 2, at a point with a heterozygous mutation the area difference is small and the area ratio is close to 1, which agrees with what is observed visually;

S6.1, if the current point is in the noise point set, skipping it and continuing with the next point. For example, if the noise point set is (1, 2, 3, 412, 413) and the current point is 1, it is skipped directly; because 2 and 3 are also noise points, the analysis starts from point 4;

S6.2, determining the integer point group to which the current point belongs from the maximum base envelope area at the current point, and obtaining the area difference threshold and the area ratio threshold of that group. For example, suppose the envelope area sequences of the base fluorescence signals at points 4, 5 and 6 are (2, 50, 49, 1), (4, 2, 30, 1) and (0, 8, 0, 8). Here (2, 50, 49, 1) and (4, 2, 30, 1) belong to the medium group, whose area difference threshold is 4.8 and area ratio threshold is 56, while (0, 8, 0, 8) belongs to the low group, whose area difference threshold is 1.6 and area ratio threshold is 16;

S6.3, taking the area difference and the area ratio of the point under analysis from the respective sequences; if the area difference is smaller than or equal to the area difference threshold and the area ratio is smaller than or equal to the area ratio threshold, the point is considered a suspected heterozygous mutation point. Continuing the example of S6.2, the area differences at points 4, 5 and 6 are 1, 26 and 0, of which those at points 4 and 6 are below the corresponding area difference thresholds; the area ratios are 50/49, 30/4 and 1, all below the corresponding area ratio thresholds. Integer points 4 and 6, which satisfy both conditions, are therefore identified as heterozygous mutation points.
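The decision rule of S6.1 to S6.3 can be sketched as follows, using the example values from S6.1 and S6.2. The function name and the dictionary layout are hypothetical choices for illustration.

```python
def call_heterozygous(areas_by_point, group_of, thresholds, noise):
    """Return the integer points judged to be suspected heterozygous mutations.

    areas_by_point -- {point: envelope areas of the bases at that point}
    group_of       -- {point: group name}
    thresholds     -- {group name: (area diff threshold, area ratio threshold)}
    noise          -- set of noise integer points
    """
    suspects = []
    for point in sorted(areas_by_point):
        if point in noise:                                   # S6.1
            continue
        diff_max, ratio_max = thresholds[group_of[point]]    # S6.2
        first, second = sorted(areas_by_point[point], reverse=True)[:2]
        diff = first - second
        ratio = first / second if second else float("inf")
        if diff <= diff_max and ratio <= ratio_max:          # S6.3
            suspects.append(point)
    return suspects


areas = {4: (2, 50, 49, 1), 5: (4, 2, 30, 1), 6: (0, 8, 0, 8)}
groups = {4: "medium", 5: "medium", 6: "low"}
limits = {"medium": (4.8, 56), "low": (1.6, 16)}
print(call_heterozygous(areas, groups, limits, noise={1, 2, 3, 412, 413}))
```

With the example data the sketch reports points 4 and 6; point 5 is rejected because its area difference of 26 exceeds the medium-group threshold of 4.8.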

By applying computer graphics principles, the invention imitates the way a human observer identifies heterozygous mutations, greatly improves the efficiency and accuracy of heterozygous mutation identification, and reduces labor and time costs, and therefore has broad application prospects.

Parts not described in the present invention are the same as the prior art or can be implemented using the prior art.

Claims

1. A method for automatically identifying heterozygous mutation in first-generation gene sequencing is characterized in that: the method comprises the following steps:

2. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein: the step S1 includes the steps of:

S1.6, if the current integer point does not exceed the length of the sequence under test, adding one to the current integer point and returning to step S1.3; if the current integer point has reached the length of the sequence under test, the complete coordinate sequence has been obtained.

3. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein: the step S2 includes the steps of:

4. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein: the step S4 includes the steps of:

5. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein: the step S6 includes the steps of:

S6.2, determining the integer point group to which the current point belongs from the maximum base envelope area at the current point, and obtaining the area difference threshold and the area ratio threshold of that group