CN111161801A - Method for automatically identifying heterozygous mutation in first-generation gene sequencing - Google Patents


Publication number
CN111161801A
Authority
CN
China
Prior art keywords
area
sequence
integer
point
base
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911403408.2A
Other languages
Chinese (zh)
Other versions
CN111161801B (en)
Inventor
杨琦
张未波
李孝尧
施笑蕾
濮娜
张国福
陈炜炜
柯路
童智慧
李维勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN201911403408.2A
Publication of CN111161801A
Application granted
Publication of CN111161801B
Current legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention relates to a method for automatically identifying heterozygous mutations in first-generation gene sequencing, comprising the following steps: S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is a positive real-number point position; S2, calculating the envelope area of the detected signal intensity of each base at each integer point; S3, grouping the integer points so that the maximum envelope area values of the base detection signals within each group are close to each other; S4, determining, within each group of integer points, an area difference threshold and an area ratio threshold for identifying heterozygous mutations; S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm; S6, for each integer point, looking up the area difference threshold and area ratio threshold of its group, calculating the area difference and area ratio of the two bases with the largest envelope areas at that point, and judging whether a mutation is present. The invention can improve the efficiency and accuracy of heterozygous mutation identification and reduce labor cost.

Description

Method for automatically identifying heterozygous mutation in first-generation gene sequencing
Technical Field
The invention relates to a method for analyzing gene sequencing results, in particular to a method for automatically identifying heterozygous mutations in gene sequencing, and specifically to a method for automatically identifying heterozygous mutations in first-generation gene sequencing.
Background
Existing first-generation gene sequencing (Sanger sequencing) lacks a mature technique for automatically identifying heterozygous mutations in its results. Manual identification is inefficient and inaccurate, and its labor and time costs are very high.
Disclosure of Invention
The invention aims to address the shortcomings of the prior art by providing a method for automatically identifying heterozygous mutations in first-generation gene sequencing that greatly improves the efficiency and accuracy of heterozygous mutation identification, reduces labor and time costs, and has broad application prospects.
The technical scheme of the invention is as follows:
A method for automatically identifying heterozygous mutations in first-generation gene sequencing comprises the following steps:
S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is a real-number point position and whose ordinate is the detected signal intensity value of the corresponding base, wherein the unit of the real-number point position is bp;
S2, based on the coordinate sequence of each base, calculating, for each base, the area of the figure enclosed by the curve through that base's detected signal intensity points and the abscissa axis within the interval extending 0.5 bp to either side of each integer point, and taking this area as the envelope area of that base's detected signal intensity at that integer point;
S3, grouping the integer points, using the maximum envelope area among the base detection signals at each integer point as the classification basis, so that the maximum envelope area values within each group are close to each other;
S4, according to the statistical characteristics of the occurrence frequency of heterozygous mutant genes and the graphical characteristics within the 0.5 bp interval on either side of an integer point where a heterozygous mutation occurs, determining, for each group of integer points, an area difference threshold and an area ratio threshold for identifying heterozygous mutations;
S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm;
S6, for each integer point not in the set of noise integer points, determining the area difference threshold and area ratio threshold according to the group to which the integer point belongs; calculating, from the envelope areas of the detected signal intensities of the bases at that integer point, the area difference and area ratio of the two bases with the largest areas; and, if the area difference is smaller than the area difference threshold and the area ratio is smaller than the area ratio threshold, judging the point to be a suspected heterozygous mutation.
Preferably, step S1 comprises the following steps:
S1.1, reading, for each base, a sequence of positive rational numbers consisting of the detected signal intensity values of that base, hereinafter referred to as the detection sequence;
S1.2, reading the detection sequence subscript at which the current integer point is located; the unit of the point is bp; the subscripts are positive integers;
S1.3, calculating the difference between the detection sequence subscripts of the current integer point and the previous integer point, and taking the reciprocal of this subscript difference as the spacing between data points from the previous integer point to the current integer point, referred to as the step unit value; the step unit value is a positive rational number in units of bp;
S1.4, taking out the detection sequence data in subscript order, starting from the previous point on the abscissa, the data being positive rational numbers;
S1.5, taking the detection sequence data as ordinates and the accumulated step unit values obtained in S1.3 as abscissas, up to the next integer point; appending the sequence of points formed by the corresponding abscissas and ordinates to the end of the coordinate sequence;
S1.6, if the current integer point has not reached the length of the sequence under test, incrementing the current integer point by one and returning to step S1.3; if the current integer point has reached the length of the sequence under test, the complete coordinate sequence has been obtained.
Preferably, step S2 comprises the following steps:
S2.1, at the current integer point, extracting for each base the portion of the coordinate sequence lying within 0.5 bp on either side, referred to as the truncated sequence;
S2.2, when an endpoint 0.5 bp to the left or right of the current integer point in S2.1 has no data point, the endpoint can be filled in by an interpolation algorithm, improving the accuracy of the envelope area calculation;
S2.3, calculating the area of the trapezoid formed by each pair of adjacent coordinates in the truncated sequence of each integer point, summing the trapezoid areas to obtain the envelope area of each base at each point, and appending the envelope area to the end of the area sequence;
S2.4, if the current integer point has not reached the length of the sequence under test, incrementing the current integer point by one and returning to step S2.1; if the current integer point has reached the length of the sequence under test, the complete area sequence has been obtained.
Preferably, step S4 comprises the following steps:
S4.1, at each point, subtracting the second-largest base envelope area from the largest to obtain the maximum envelope area difference sequence, referred to as the area difference sequence;
S4.2, at each point, dividing the largest base envelope area by the second-largest to obtain the maximum envelope area ratio sequence, referred to as the area ratio sequence;
S4.3, calculating the median of the area difference sequence of S4.1, referred to as the area difference median; calculating the median of the area ratio sequence of S4.2, referred to as the area ratio median;
S4.4, the area difference threshold equals the area difference median multiplied by the maximum rate of change of the waveform peak within the group; the area ratio threshold equals the area ratio median multiplied by the maximum rate of change of the waveform peak within the group.
Preferably, step S6 comprises the following steps:
S6.1, if the current point is in the set of noise points, skipping it and continuing with the next point;
S6.2, determining the group of the current point from its maximum base envelope area, and obtaining the area difference threshold and area ratio threshold of that group;
S6.3, taking the area difference and area ratio of the point under analysis from the sequences; if the area difference is less than or equal to the area difference threshold and the area ratio is less than or equal to the area ratio threshold, the point is considered a suspected heterozygous mutation point.
The invention has the following beneficial effects:
The invention is reasonably designed and convenient to use. By using computer graphics methods to imitate the way a human observes heterozygous mutations, it greatly improves the efficiency and accuracy of heterozygous mutation identification, reduces labor and time costs, and has broad application prospects.
Drawings
FIG. 1 shows the fluorescence signal intensity curves of the different bases as line graphs.
FIG. 2 is a partially enlarged view of FIG. 1.
FIG. 3 shows the area of the figure enclosed by two sets of coordinate points.
FIG. 4 shows the area of the figure enclosed by the envelope and the point position axis within the range of each integer point.
Wherein:
In FIG. 1, the numbers at the top are the integer points, the letters on the curves are the base types corresponding to the data points, the letters A, T, C, G below are the base types identified at the integer points, and the numbers at the bottom are the Phred scores used to evaluate sequencing quality.
At integer point 313, outlined in FIG. 2, the curves of the C and T bases show the characteristic pattern of a heterozygous mutation: the areas enclosed by the envelopes of the two bases and the point position axis are close, so their area difference is small and their area ratio is near 1.
In FIG. 3, the area enclosed by two sets of coordinate points is calculated using the trapezoid area formula.
FIG. 4 illustrates calculating the areas of the multiple trapezoids under the fluorescence curve of a given base type within the range of each integer point and summing them to obtain the envelope area enclosed by that base's fluorescence curve and the point position axis within that range.
Detailed Description
The invention is further described below with reference to the figures and examples.
As shown in FIGS. 1, 2, 3 and 4:
A method for automatically identifying heterozygous mutations in first-generation gene sequencing comprises the following steps:
S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is the real-number point position, with the positive integer points shown at the top of FIG. 1 and FIG. 2, and whose ordinate is the detected signal intensity value of the corresponding base. Because detection signals of several bases can appear at the abscissa of each positive integer point, the waveforms in FIG. 1 and FIG. 2 are labeled with the letter of the corresponding base. The unit of the real-number point position is bp; for example, the point framed in FIG. 2 is 313, and the detection signals of the waveforms inside the frame are mainly those of the C and T bases;
S1.1, this example uses the nucleotide sequence corresponding to the human LPL protein. First, a sequence of positive rational numbers consisting of detected signal intensity values is read for each base; for example, the detected signal intensity values of the A base in a certain sequence are [50, 99, 203, 389, 679, 3455, 3816, 4172, …], comprising the values at 5921 sampling points. This is hereinafter referred to as the detection sequence;
S1.2, reading the detection sequence subscript at which the current integer point is located; the unit of the point is bp; the subscripts are positive integers, e.g., [3, 39, 44, 58, 76, 94, 101, 109, …], a sequence containing the detection sequence subscripts of 494 integer points. For example, point 1 corresponds to the three data points with subscripts 0, 1 and 2, which for the A base are 50, 99 and 203;
S1.3, calculating the difference between the detection sequence subscripts of the current integer point and the previous integer point, and taking the reciprocal of this subscript difference as the spacing between data points from the previous integer point to the current integer point, referred to as the step unit value; the step unit value is a positive rational number in units of bp. Continuing the example of S1.1 and S1.2, the subscript difference between point 2 and point 1 is 39 - 3 = 36, indicating 36 uniformly spaced sampling points between them, so the step unit value is 1/36. Point 1 is a special case: its subscript difference is taken relative to 0, i.e. 3 - 0 = 3, so the step unit value is 1/3;
S1.4, taking out the detection sequence data in subscript order, starting from the previous point on the abscissa, the data being positive rational numbers. Continuing the example of S1.3, integer point 1 corresponds to the 3 samples 50, 99 and 203, and integer point 2 corresponds to the 36 samples 389, 679, 3455, 3816, 4172, …;
S1.5, taking the detection sequence data as ordinates and the accumulated step unit values obtained in S1.3 as abscissas, up to the next integer point; appending the sequence of points formed by the corresponding abscissas and ordinates to the end of the coordinate sequence. Continuing from S1.4, for the three samples of integer point 1 the abscissa starts from 0 and increases by 1/3 per data point, giving the coordinates (1/3, 50), (2/3, 99), (3/3, 203); for the 36 samples of integer point 2 the abscissa starts from 1 and increases by 1/36 per data point, giving (1+1/36, 389), (1+2/36, 679), (1+3/36, 3455), …, and so on;
S1.6, if the current integer point has not reached the length of the sequence under test, incrementing the current integer point by one and returning to step S1.3; if the current integer point has reached the length of the sequence under test, the complete coordinate sequence has been obtained;
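The conversion in S1.1 to S1.6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `build_coordinates` and its parameters are hypothetical, each subscript in S1.2 is treated as the exclusive end of that point's run of samples (as the worked example implies), and `fractions.Fraction` is used only so the abscissas match the fractions in the text exactly.

```python
from fractions import Fraction


def build_coordinates(signal, peak_subscripts):
    """Map one base's raw samples to (position_bp, intensity) pairs.

    signal          -- detection sequence of one base (S1.1)
    peak_subscripts -- for integer points 1..N, the subscript that closes each
                       point's run of samples, e.g. [3, 39, ...] (S1.2);
                       treated as an exclusive end (an assumption)
    """
    coords = []
    prev = 0  # point 1 is measured from subscript 0 (special case in S1.3)
    for point, end in enumerate(peak_subscripts, start=1):
        count = end - prev                 # subscript difference (S1.3)
        if count <= 0:
            raise ValueError("peak subscripts must be strictly increasing")
        step = Fraction(1, count)          # step unit value, in bp
        for k in range(1, count + 1):      # S1.4 / S1.5
            x = (point - 1) + k * step     # accumulate steps from the previous point
            coords.append((x, signal[prev + k - 1]))
        prev = end
    return coords


# First two integer points of the A-base example. Samples that the text does
# not list are filled with zeros purely as placeholders.
signal_a = [50, 99, 203, 389, 679, 3455] + [0] * 33
coords_a = build_coordinates(signal_a, [3, 39])
```

With these inputs the first coordinates are (1/3, 50), (2/3, 99), (1, 203) and (1+1/36, 389), as in the S1.5 example.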
S2, based on the coordinate sequence of each base, calculating, for each base, the area of the figure enclosed by the curve through that base's detected signal intensity points and the abscissa axis within the interval extending 0.5 bp to either side of each integer point, and taking this area as the envelope area of that base's detected signal intensity at that integer point;
S2.1, at the current integer point, extracting for each base the portion of the coordinate sequence lying within 0.5 bp on either side, referred to as the truncated sequence. Continuing the example above, the 0.5 bp interval on either side of integer point 1 is [0.5 bp, 1.5 bp], so the truncated sequence at integer point 1 is (2/3, 99), (3/3, 203), (1+1/36, 389), (1+2/36, 679), (1+3/36, 3455), …, up to abscissa 1+18/36;
S2.2, when an endpoint 0.5 bp to the left or right of the current integer point in S2.1 has no data point, it can be filled in by an interpolation algorithm to improve the accuracy of the envelope area calculation. Taking the data of S2.1 with current integer point 1 as an example, a data point exists at the right endpoint 1.5 bp but not at the left endpoint 0.5 bp. The slope of the line through (1/3, 50) and (2/3, 99) is therefore computed, and the value on that line at 0.5 bp gives the interpolated point, approximately 74 (exactly 74.5), i.e. (0.5, 74);
S2.3, as shown in FIG. 3, calculating the area of the trapezoid formed by each pair of adjacent coordinates in the truncated sequence of each integer point using the trapezoid area formula, summing the trapezoid areas to obtain the envelope area of each base at each point as shown in FIG. 4, and appending the envelope area to the end of the area sequence. Taking the interpolated data of S2.2 as an example, the area enclosed by the two points (0.5, 74) and (2/3, 99) is given by (upper base + lower base) × height / 2.0, i.e. (74 + 99) × (2/3 - 0.5) / 2.0 ≈ 14.4;
S2.4, if the current integer point has not reached the length of the sequence under test, incrementing the current integer point by one and returning to step S2.1; if the current integer point has reached the length of the sequence under test, the complete area sequence has been obtained;
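The truncation, endpoint interpolation and trapezoid summation of S2.1 to S2.3 might look like the sketch below. The helper names (`interpolate`, `value_at`, `envelope_area`) are hypothetical, linear interpolation is assumed as the interpolation algorithm (the worked example uses a straight line through two neighbouring points), and the coordinate list is assumed to be sorted by strictly increasing abscissa.

```python
def interpolate(p0, p1, x):
    """Intensity at abscissa x on the straight line through p0 and p1 (S2.2)."""
    (x0, y0), (x1, y1) = p0, p1
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def value_at(coords, x):
    """Interpolated intensity at x, or None if x lies outside the sampled range."""
    for p0, p1 in zip(coords, coords[1:]):
        if p0[0] <= x <= p1[0]:
            return interpolate(p0, p1, x)
    return None


def envelope_area(coords, point):
    """Trapezoidal area under one base's curve on [point - 0.5, point + 0.5]."""
    lo, hi = point - 0.5, point + 0.5
    # S2.1: samples strictly inside the window (the truncated sequence)
    window = [(x, y) for x, y in coords if lo < x < hi]
    # S2.2: add both window edges, interpolating where no sample falls on them
    y_lo, y_hi = value_at(coords, lo), value_at(coords, hi)
    if y_lo is not None:
        window.insert(0, (lo, y_lo))
    if y_hi is not None:
        window.append((hi, y_hi))
    # S2.3: (upper base + lower base) * height / 2 for each adjacent pair
    return sum((y0 + y1) * (x1 - x0) / 2.0
               for (x0, y0), (x1, y1) in zip(window, window[1:]))


# Left-edge fill from the worked example: line through (1/3, 50) and (2/3, 99)
left_fill = interpolate((1 / 3, 50), (2 / 3, 99), 0.5)

# A small synthetic curve (not patent data) to exercise the area routine
demo_curve = [(0.25, 0.0), (0.75, 10.0), (1.25, 10.0), (1.75, 0.0)]
demo_area = envelope_area(demo_curve, 1)
```

For the synthetic curve, both window edges are interpolated to an intensity of 5, and the three trapezoids sum to 1.875 + 5 + 1.875 = 8.75.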
S3, grouping the integer points, using the maximum envelope area among the base detection signals at each integer point as the classification basis, so that the maximum envelope area values within each group are close to each other. Common grouping methods include K-Means, SVM and the like. As can be seen from FIG. 1 and FIG. 2, the maximum envelope peak of the base detection signals fluctuates; grouping reduces the influence of this fluctuation on the judgment of heterozygous mutations and thereby improves accuracy. For example, suppose the envelope areas of the A, T, C, G bases within 0.5 bp on either side of integer points 1 to 6 are (0, 100, 0, 0), (98, 0, 91, 0), (10, 2, 3, 1), (2, 50, 49, 1), (4, 2, 30, 1), (0, 8, 0, 8), …; the maximum envelope area sequence is then (100, 98, 10, 50, 30, 8, …). If the points are divided into three groups, high, medium and low, the high group contains (0, 100, 0, 0), (98, 0, 91, 0), …; the medium group contains (2, 50, 49, 1), (4, 2, 30, 1), …; and the low group contains (10, 2, 3, 1), (0, 8, 0, 8), …;
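One way to sketch the grouping of S3 is shown below. The text names K-Means and SVM without fixing a method. As a substitute, this sketch uses an exact one-dimensional dynamic program that minimizes the same within-group squared-deviation objective as K-Means, chosen because it is deterministic on small inputs. The function name `group_by_max_area` and the label convention (0 for the lowest group, k - 1 for the highest) are assumptions.

```python
def group_by_max_area(max_areas, k=3):
    """Split values into k groups of similar magnitude.

    Exact 1-D optimum of the k-means objective, found by dynamic programming
    over the sorted values. Returns one label per input value: 0 is the
    lowest group and k - 1 the highest.
    """
    n = len(max_areas)
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= number of values")
    order = sorted(range(n), key=lambda i: max_areas[i])
    vals = [max_areas[i] for i in order]

    # Prefix sums give the squared deviation of vals[a:b] in O(1)
    s1, s2 = [0.0], [0.0]
    for v in vals:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def sse(a, b):
        total = s1[b] - s1[a]
        return (s2[b] - s2[a]) - total * total / (b - a)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):
        for b in range(g, n + 1):
            for a in range(g - 1, b):
                c = cost[g - 1][a] + sse(a, b)
                if c < cost[g][b]:
                    cost[g][b], cut[g][b] = c, a

    # Walk the cut points back from the end to label each value
    labels = [0] * n
    b = n
    for g in range(k, 0, -1):
        a = cut[g][b]
        for j in range(a, b):
            labels[order[j]] = g - 1
        b = a
    return labels


# Maximum envelope areas of integer points 1..6 from the example in S3
labels = group_by_max_area([100, 98, 10, 50, 30, 8], k=3)
```

On the six example values this yields the labels [2, 2, 0, 1, 1, 0]: points 1 and 2 form the high group, points 4 and 5 the medium group, and points 3 and 6 the low group, matching the grouping given in the text.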
S4, according to the statistical characteristics of the occurrence frequency of heterozygous mutant genes and the graphical characteristics within the 0.5 bp interval on either side of an integer point where a heterozygous mutation occurs, determining, for each group of integer points, an area difference threshold and an area ratio threshold for identifying heterozygous mutations;
S4.1, at each point, subtracting the second-largest base envelope area from the largest to obtain the maximum envelope area difference sequence, referred to as the area difference sequence. Continuing the S3 example, the area difference sequence of the high group is (100, 7, …), that of the medium group is (1, 26, …) and that of the low group is (7, 0, …);
S4.2, at each point, dividing the largest base envelope area by the second-largest to obtain the maximum envelope area ratio sequence, referred to as the area ratio sequence. Continuing the S3 example, the area ratio sequence of the high group is (positive infinity, 98/91, …), that of the medium group is (50/49, 30/4, …) and that of the low group is (10/3, 8/8, …);
S4.3, calculating the median of the area difference sequence of S4.1, referred to as the area difference median; continuing from S4.1, suppose the medians are 96 for the high group, 24 for the medium group and 8 for the low group. Calculating the median of the area ratio sequence of S4.2, referred to as the area ratio median; continuing from S4.2, suppose the medians are 400 for the high group, 280 for the medium group and 80 for the low group;
S4.4, the area difference threshold equals the area difference median multiplied by the maximum rate of change of the waveform peak within the group; the area ratio threshold equals the area ratio median multiplied by the maximum rate of change of the waveform peak within the group. Assuming the maximum rate of change of the waveform peak is 20% in every group, the area difference thresholds are 96 × 20% = 19.2 for the high group, 24 × 20% = 4.8 for the medium group and 8 × 20% = 1.6 for the low group, and the area ratio thresholds are 400 × 20% = 80 for the high group, 280 × 20% = 56 for the medium group and 80 × 20% = 16 for the low group;
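The threshold computation of S4.1 to S4.4 can be sketched as follows. The function names and the `max_change_rate` parameter are hypothetical, and a ratio whose denominator is zero is treated as positive infinity, as in the S4.2 example. Note that the medians quoted in the text (96, 24, 8 and so on) are taken over complete groups whose members are elided, so running this sketch on only the six listed points gives different numbers.

```python
from statistics import median


def top_two(areas):
    """Largest and second-largest envelope area at one integer point."""
    ranked = sorted(areas, reverse=True)
    return ranked[0], ranked[1]


def area_diff(areas):
    """S4.1: largest minus second-largest envelope area."""
    largest, second = top_two(areas)
    return largest - second


def area_ratio(areas):
    """S4.2: largest divided by second-largest; infinity if the latter is 0."""
    largest, second = top_two(areas)
    return float("inf") if second == 0 else largest / second


def group_thresholds(envelopes, labels, max_change_rate=0.20):
    """S4.3 / S4.4: {group: (area_diff_threshold, area_ratio_threshold)}."""
    thresholds = {}
    for group in set(labels):
        members = [e for e, lab in zip(envelopes, labels) if lab == group]
        diff_median = median(area_diff(e) for e in members)
        ratio_median = median(area_ratio(e) for e in members)
        thresholds[group] = (diff_median * max_change_rate,
                             ratio_median * max_change_rate)
    return thresholds


# The six (A, T, C, G) envelope areas listed in S3, labeled 0 = low,
# 1 = medium, 2 = high as in the grouping given there.
envelopes = [(0, 100, 0, 0), (98, 0, 91, 0), (10, 2, 3, 1),
             (2, 50, 49, 1), (4, 2, 30, 1), (0, 8, 0, 8)]
labels = [2, 2, 0, 1, 1, 0]
thresholds = group_thresholds(envelopes, labels)
```

For the medium group, for instance, the listed differences are 1 and 26, so the median over just these two points is 13.5 and the sketch's area difference threshold is 13.5 × 20% = 2.7.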
S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm, for example:
using the Modified Mott trimming algorithm, which is based on sequencing quality scores (Phred scores), to find the low-noise subsequence with the maximum Phred-based score; the subsequences on either side of it are the head and tail noise sequences, whose elements are noise points;
removing the noise spectrum from the coordinate sequence using a wavelet transform or a Fourier transform;
evaluating the graphical quality of each point with a neural-network-based method and identifying points of poor graphical quality as noise;
other noise identification algorithms may also be used;
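As an illustration of the first option, the sketch below follows the commonly described form of Modified Mott trimming: each base scores an error-probability cutoff minus its Phred-derived error probability, and the highest-scoring contiguous segment is kept. The cutoff of 0.05, the function names and the 1-based point numbering are assumptions; the patent text does not specify these details.

```python
def mott_trim(phred_scores, cutoff=0.05):
    """Return (start, end) of the best-scoring segment, 0-based, end exclusive.

    Each base contributes cutoff - 10 ** (-Q / 10). The running sum is reset
    whenever it drops to zero or below, and the segment with the largest
    running sum is kept.
    """
    best_sum, best_start, best_end = 0.0, 0, 0
    running, start = 0.0, 0
    for i, q in enumerate(phred_scores):
        running += cutoff - 10 ** (-q / 10.0)
        if running <= 0:
            running, start = 0.0, i + 1          # restart after a bad stretch
        elif running > best_sum:
            best_sum, best_start, best_end = running, start, i + 1
    return best_start, best_end


def noise_points(phred_scores, cutoff=0.05):
    """1-based integer points in the head and tail outside the kept segment."""
    start, end = mott_trim(phred_scores, cutoff)
    n = len(phred_scores)
    return set(range(1, start + 1)) | set(range(end + 1, n + 1))


# Synthetic Phred scores (not patent data): poor ends, good middle
scores = [5, 5, 5, 40, 40, 40, 40, 5, 5]
noisy = noise_points(scores)
```

Here the kept segment is the four middle bases, so the noise set is {1, 2, 3, 8, 9}, the same head-and-tail shape as the noise set used as an example in S6.1.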
S6, for each integer point not in the set of noise integer points, determining the area difference threshold and area ratio threshold according to the group to which the integer point belongs; calculating, from the envelope areas of the detected signal intensities of the bases at that integer point, the area difference and area ratio of the two bases with the largest areas; and, if the area difference is smaller than the area difference threshold and the area ratio is smaller than the area ratio threshold, judging the point to be a suspected heterozygous mutation. As can be seen from FIG. 2, at a point with a heterozygous mutation the area difference is small and the area ratio is close to 1, which agrees with visual inspection;
S6.1, if the current point is in the set of noise points, skipping it and continuing with the next point. For example, if the set of noise points is (1, 2, 3, 412, 413) and the current point is 1, it is skipped directly; because 2 and 3 are also noise points, the analysis starts from point 4;
S6.2, determining the group of the current point from its maximum base envelope area, and thereby obtaining the area difference threshold and area ratio threshold of that group. For example, if the envelope areas of the fluorescence signals of the bases at points 4, 5 and 6 are (2, 50, 49, 1), (4, 2, 30, 1), (0, 8, 0, 8), then (2, 50, 49, 1) and (4, 2, 30, 1) belong to the medium group, with area difference threshold 4.8 and area ratio threshold 56, and (0, 8, 0, 8) belongs to the low group, with area difference threshold 1.6 and area ratio threshold 16;
S6.3, taking the area difference and area ratio of the point under analysis from the sequences; if the area difference is less than or equal to the area difference threshold and the area ratio is less than or equal to the area ratio threshold, the point is considered a suspected heterozygous mutation point. Continuing the example of S6.2, the area differences at points 4, 5 and 6 are 1, 26 and 0, of which those at points 4 and 6 do not exceed their area difference thresholds; the area ratios are 50/49, 30/4 and 1, all of which are below the corresponding thresholds. Integer points 4 and 6, which satisfy both conditions, are therefore identified as heterozygous mutation points.
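The decision rule of S6.1 to S6.3 can be checked against the worked example with a short sketch. The function name `call_heterozygous` and the dictionary-based inputs are hypothetical; the envelope areas, group assignments, thresholds and noise set are the ones stated in S6.1 and S6.2.

```python
def call_heterozygous(envelopes, group_of, thresholds, noise):
    """Return the integer points judged to be suspected heterozygous mutations.

    envelopes  -- {point: (A, T, C, G) envelope areas}
    group_of   -- {point: group name}
    thresholds -- {group name: (area_diff_threshold, area_ratio_threshold)}
    noise      -- set of noise integer points to skip (S6.1)
    """
    calls = []
    for point in sorted(envelopes):
        if point in noise:
            continue                                   # S6.1
        largest, second = sorted(envelopes[point], reverse=True)[:2]
        diff = largest - second
        ratio = float("inf") if second == 0 else largest / second
        diff_thr, ratio_thr = thresholds[group_of[point]]   # S6.2
        if diff <= diff_thr and ratio <= ratio_thr:         # S6.3
            calls.append(point)
    return calls


envelopes = {4: (2, 50, 49, 1), 5: (4, 2, 30, 1), 6: (0, 8, 0, 8)}
group_of = {4: "medium", 5: "medium", 6: "low"}
thresholds = {"medium": (4.8, 56), "low": (1.6, 16)}
noise = {1, 2, 3, 412, 413}

calls = call_heterozygous(envelopes, group_of, thresholds, noise)
```

Point 5 is rejected because its area difference of 26 exceeds the medium-group threshold of 4.8, so the result is [4, 6], in agreement with the text.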
By applying the principles of computer graphics to simulate the way humans observe heterozygous mutations, the invention greatly improves the efficiency and accuracy of heterozygous mutation identification and reduces labor and time costs, and therefore has broad application prospects.
Parts not covered by the present invention are the same as the prior art or can be implemented using the prior art.

Claims (5)

1. A method for automatically identifying heterozygous mutations in first-generation gene sequencing, characterized in that the method comprises the following steps:
S1, converting the detection signal data of each base in the first-generation sequencing result into a coordinate sequence whose abscissa is a real-number point position and whose ordinate is the detected signal intensity value of the corresponding base, wherein the unit of the real-number point position is bp;
S2, based on the coordinate sequence of each base, calculating, for each base, the area of the figure enclosed by the curve through that base's detected signal intensity points and the abscissa axis within the interval extending 0.5 bp to either side of each integer point, and taking this area as the envelope area of that base's detected signal intensity at that integer point;
S3, grouping the integer points, using the maximum envelope area among the base detection signals at each integer point as the classification basis, so that the maximum envelope area values within each group are close to each other;
S4, according to the statistical characteristics of the occurrence frequency of heterozygous mutant genes and the graphical characteristics within the 0.5 bp interval on either side of an integer point where a heterozygous mutation occurs, determining, for each group of integer points, an area difference threshold and an area ratio threshold for identifying heterozygous mutations;
S5, identifying the set of noise integer points in the sequence under test using a noise identification algorithm;
S6, for each integer point not in the set of noise integer points, determining the area difference threshold and area ratio threshold according to the group to which the integer point belongs; calculating, from the envelope areas of the detected signal intensities of the bases at that integer point, the area difference and area ratio of the two bases with the largest areas; and, if the area difference is smaller than the area difference threshold and the area ratio is smaller than the area ratio threshold, judging the point to be a suspected heterozygous mutation.
2. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein step S1 comprises the following steps:
S1.1, reading the sequence of positive rational numbers formed by the detection signal intensity values corresponding to each base, hereinafter referred to as the detection sequence;
S1.2, reading the detection sequence subscript at which the current integer point is located; the unit of a point position is bp, and the subscripts are positive integers;
S1.3, calculating the difference between the detection sequence subscripts of the current integer point and the previous integer point, and taking the reciprocal of this difference as the spacing between adjacent data points from the previous integer point to the current integer point, referred to as the step unit value for short; the step unit value is a positive rational number, in bp;
S1.4, taking the detection sequence data in subscript order, starting from the abscissa position of the previous integer point; each datum is a positive rational number;
S1.5, taking the detection sequence data as the ordinate and the accumulated step unit value from S1.3 as the abscissa, up to the next integer point; appending the resulting sequence of (abscissa, ordinate) points to the end of the coordinate sequence;
S1.6, if the current integer point has not reached the length of the sequence to be detected, adding one to the current integer point and returning to step S1.3; if the current integer point has reached the length of the sequence to be detected, the complete coordinate sequence has been obtained.
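Steps S1.1 to S1.6 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation of the claimed method: the names `build_coordinates`, `trace` and `peak_indices` are hypothetical, the trace is assumed to hold the intensities of a single base channel, `peak_indices` is assumed to hold the detection sequence subscript of each integer point in increasing order, and the first integer point is assumed to sit at 1 bp.

```python
def build_coordinates(trace, peak_indices):
    """Map raw samples of one base channel to (position in bp, intensity) pairs.

    trace        -- detection sequence (intensities) of one base; hypothetical input
    peak_indices -- subscript in `trace` of each integer point, increasing; hypothetical input
    """
    coords = []
    for k in range(1, len(peak_indices)):
        prev_idx, cur_idx = peak_indices[k - 1], peak_indices[k]
        # S1.3: the step unit value is the reciprocal of the subscript difference.
        step = 1.0 / (cur_idx - prev_idx)
        # S1.4 / S1.5: walk the samples in subscript order and accumulate the step;
        # integer point number k (1-based) is the start of this segment.
        for j in range(cur_idx - prev_idx):
            coords.append((k + j * step, trace[prev_idx + j]))
    # S1.6: close the coordinate sequence with the last integer point itself.
    coords.append((float(len(peak_indices)), trace[peak_indices[-1]]))
    return coords


if __name__ == "__main__":
    demo_trace = [0, 1, 4, 1, 0, 2, 6, 2, 0]
    print(build_coordinates(demo_trace, [2, 6]))
```

Because the step is recomputed for every pair of neighbouring integer points, unevenly spaced peaks in the raw trace are each stretched or compressed to exactly 1 bp on the abscissa.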
3. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein step S2 comprises the following steps:
S2.1, intercepting, for each base, the sequence within the 0.5 bp intervals on both sides of the current integer point, referred to as the intercepted sequence for short;
S2.2, when no data point lies exactly at the two ends of the 0.5 bp intervals on both sides of the current integer point in S2.1, the values at the two ends may be obtained by an interpolation algorithm, thereby improving the precision of the envelope area calculation;
S2.3, calculating the area of the trapezoid formed by each pair of adjacent coordinates in the intercepted sequence of each integer point, summing the trapezoid areas to obtain the envelope area of each base at each point, and appending the envelope area to the end of the area sequence;
S2.4, if the current integer point has not reached the length of the sequence to be detected, adding one to the current integer point and returning to step S2.1; if the current integer point has reached the length of the sequence to be detected, the complete area sequence has been obtained.
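The envelope area of steps S2.1 to S2.3 can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, linear interpolation is chosen for S2.2 (the claim does not name a specific interpolation algorithm), and the window is clipped to the range of available data at the two ends of the read.

```python
def interpolate(coords, x):
    """Linearly interpolate the intensity at abscissa x (assumed choice for S2.2)."""
    if x <= coords[0][0]:
        return coords[0][1]
    if x >= coords[-1][0]:
        return coords[-1][1]
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def envelope_area(coords, position, half_width=0.5):
    """Trapezoidal area of one base channel within half_width bp of an integer point.

    coords -- (position in bp, intensity) pairs sorted by position
    """
    # S2.1: window of 0.5 bp on each side, clipped to the data (assumption).
    lo = max(position - half_width, coords[0][0])
    hi = min(position + half_width, coords[-1][0])
    inner = [(x, y) for x, y in coords if lo < x < hi]
    # S2.2: add interpolated end points so the window edges are exact.
    window = [(lo, interpolate(coords, lo))] + inner + [(hi, interpolate(coords, hi))]
    # S2.3: sum the trapezoids formed by adjacent coordinates.
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(window, window[1:]))
```

Calling `envelope_area` once per base and per integer point yields the four areas per point that the later steps compare.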
4. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein step S4 comprises the following steps:
S4.1, subtracting the second-largest value from the largest value of the base areas at each point to obtain the maximum envelope area difference sequence, referred to as the area difference sequence for short;
S4.2, dividing the largest value of the base areas at each point by the second-largest value to obtain the maximum envelope area ratio sequence, referred to as the area ratio sequence for short;
S4.3, calculating the median of the area difference sequence of S4.1, referred to as the area difference median for short; calculating the median of the area ratio sequence of S4.2, referred to as the area ratio median for short;
S4.4, the area difference threshold is equal to the area difference median multiplied by the maximum rate of change of the waveform peak within the group; the area ratio threshold is equal to the area ratio median multiplied by the maximum rate of change of the waveform peak within the group.
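The threshold computation of steps S4.1 to S4.4 for a single integer point group can be sketched as below. The function names are hypothetical, and `max_peak_change_rate` is passed in as a parameter because this claim does not state how the maximum rate of change of the waveform peak is obtained. Treating the ratio as infinite when the second-largest area is zero is also an assumption.

```python
from statistics import median


def area_difference_and_ratio(areas):
    """Return (largest - second largest, largest / second largest) for one point."""
    first, second = sorted(areas, reverse=True)[:2]
    # Assumption: an undefined ratio (second area of zero) is treated as infinite.
    ratio = first / second if second > 0 else float("inf")
    return first - second, ratio


def group_thresholds(area_rows, max_peak_change_rate):
    """Compute (area difference threshold, area ratio threshold) for one group.

    area_rows            -- per-point lists of the envelope areas of the four bases
    max_peak_change_rate -- maximum rate of change of the waveform peak in the group
                            (supplied by the caller; hypothetical parameter)
    """
    diffs, ratios = zip(*(area_difference_and_ratio(row) for row in area_rows))
    # S4.3: medians of the two sequences; S4.4: scale by the peak change rate.
    return (median(diffs) * max_peak_change_rate,
            median(ratios) * max_peak_change_rate)
```

For example, three points with areas `[10, 2, 1, 0]`, `[8, 4, 0, 0]` and `[12, 3, 1, 1]` give an area difference median of 8 and an area ratio median of 4, so a rate of 0.5 yields thresholds of 4 and 2.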
5. The method for automatically identifying heterozygous mutations in first-generation gene sequencing according to claim 1, wherein step S6 comprises the following steps:
S6.1, if the current point is in the set of noise points, skipping the point and continuing with the analysis of the next point;
S6.2, obtaining the integer point group of the current point according to its largest base envelope area, and thereby obtaining the area difference threshold and the area ratio threshold of that group;
S6.3, retrieving the area difference and the area ratio of the point under analysis from the sequences; if the area difference of the point is smaller than or equal to the area difference threshold and the area ratio of the point is smaller than or equal to the area ratio threshold, the point is considered a suspected heterozygous mutation point.
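The decision loop of steps S6.1 to S6.3 can be sketched as follows. All names are hypothetical: `group_of` stands in for whatever rule assigns a point to an integer point group from its largest envelope area, and `thresholds` is assumed to map each group to its precomputed pair of thresholds. The comparison uses "smaller than or equal to", following the wording of S6.3.

```python
def find_suspected_heterozygous(area_rows, noise_positions, group_of, thresholds):
    """Return the 1-based integer points judged to be suspected heterozygous mutations.

    area_rows       -- per-point lists of the envelope areas of the four bases
    noise_positions -- set of 1-based points flagged as noise
    group_of        -- callable mapping the largest area to a group key (hypothetical)
    thresholds      -- dict: group key -> (area difference threshold, area ratio threshold)
    """
    suspects = []
    for position, areas in enumerate(area_rows, start=1):
        if position in noise_positions:              # S6.1: skip noise points
            continue
        first, second = sorted(areas, reverse=True)[:2]
        diff_thr, ratio_thr = thresholds[group_of(first)]   # S6.2: thresholds of the group
        diff = first - second
        # Assumption: an undefined ratio (second area of zero) is treated as infinite.
        ratio = first / second if second > 0 else float("inf")
        if diff <= diff_thr and ratio <= ratio_thr:  # S6.3: both criteria must hold
            suspects.append(position)
    return suspects
```

A point is reported only when the two largest base areas are close both in absolute terms and in proportion, which is the signature of two overlapping peaks of similar size.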
CN201911403408.2A 2019-12-31 2019-12-31 Method for automatically identifying heterozygous mutation in first generation gene sequencing Active CN111161801B (en)

Priority Applications (1)

Application Number Publication Number Priority Date Filing Date Title
CN201911403408.2A CN111161801B (en) 2019-12-31 2019-12-31 Method for automatically identifying heterozygous mutation in first generation gene sequencing

Publications (2)

Publication Number Publication Date
CN111161801A (en) 2020-05-15
CN111161801B (en) 2023-06-06

Family

ID=70559791

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201911403408.2A Active CN111161801B (en) 2019-12-31 2019-12-31 Method for automatically identifying heterozygous mutation in first generation gene sequencing

Country Status (1)

Country Link
CN (1) CN111161801B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102234690A (en) * 2010-04-30 2011-11-09 爱科来株式会社 Method for detecting mutation in exon 12 of JAK2 gene, and nucleic acid probe and kit therefor
CN104630375A (en) * 2015-02-16 2015-05-20 北京圣谷同创科技发展有限公司 Cancer gene mutation and gene amplification detection
CN107326064A (en) * 2016-04-29 2017-11-07 天昊生物医药科技(苏州)有限公司 Gene inversion mutation detection methods
CN106202991A (en) * 2016-06-30 2016-12-07 厦门艾德生物医药科技股份有限公司 The detection method of abrupt information in a kind of genome multiplex amplification order-checking product
CN107944225A (en) * 2017-11-28 2018-04-20 慧算医疗科技(上海)有限公司 Gene high-flux sequence data mutation detection methods

Also Published As

Publication number Publication date
CN111161801B (en) 2023-06-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant