CN113380323B - Sanger sequencing peak image interception identification method and system, computer equipment and storage medium - Google Patents

Sanger sequencing peak image interception identification method and system, computer equipment and storage medium

Info

Publication number
CN113380323B
Authority
CN
China
Prior art keywords
sequence
base
detection site
genotype
corrected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110814332.3A
Other languages
Chinese (zh)
Other versions
CN113380323A (en
Inventor
陈文拴
郭惠民
张盼
陆文俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dipu Medical Laboratory Co ltd
Original Assignee
Zhejiang Dipu Diagnosis Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dipu Diagnosis Technology Co ltd filed Critical Zhejiang Dipu Diagnosis Technology Co ltd
Priority to CN202110814332.3A priority Critical patent/CN113380323B/en
Publication of CN113380323A publication Critical patent/CN113380323A/en
Publication of CN113380323B publication Critical patent/CN113380323B/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Chemical & Material Sciences (AREA)
  • Molecular Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a Sanger sequencing peak diagram intercepting and identifying method, a system for implementing the method, computer equipment and a storage medium. The method comprises the following steps: reading a sequencing file and configuration file information, and deriving a base peak diagram and a base sequence from the sequencing file; processing the base peak diagram and the base sequence to identify extended base information; and intercepting and identifying the peak diagram of the extended base sequence based on the identified extended base information. The method makes it possible to carry out screenshot identification processing on a large number of Sanger sequencing peak diagram results.

Description

Sanger sequencing peak image interception identification method and system, computer equipment and storage medium
Technical Field
The invention belongs to the technical field of gene detection, and particularly relates to a Sanger sequencing peak diagram intercepting and identifying method, a Sanger sequencing peak diagram intercepting and identifying system and a storage medium.
Background
The PCR-time-of-flight mass spectrometry can detect the polymorphism of a set nucleic acid site by utilizing the difference of nucleotide molecular weight, so that a gene detection kit based on a time-of-flight mass spectrometry platform can be developed by utilizing the principle. Before the time-of-flight mass spectrometry platform and the gene detection kit based on the platform are actually applied to clinic, a large amount of clinical samples are required to be used for detection test, and the detection result is compared with the gold standard of the clinical samples for evaluation so as to confirm the effectiveness of the platform.
Since Sanger sequencing is one of the gold standard methods for detecting nucleic acid site polymorphisms in clinical samples, during a clinical confirmation test that detects nucleic acid site polymorphisms on a time-of-flight mass spectrometry platform, the same nucleic acid detection sites in the same sample need to be confirmed by Sanger sequencing. For reasons inherent to the Sanger sequencing technology, detecting a nucleic acid site polymorphism requires sequencing a nucleic acid sequence of at least 150 bp in total around the site. The sequencing result returned by the sequencing company is an ab1 file recording the base peak pattern of each base, and special software is required to open the file and view the base peak diagram and base sequence. After the file is opened, the detection site can be located either by entering a sequence containing (or adjacent to) the detection site in the search box of the software, or by reading the sequence in the peak diagram by eye, and the detected genotype can then be identified. The submitted clinical test material needs to include, for each sample in the clinical test, the Sanger sequencing peak diagram screenshot and genotype statistics of each detection site, and each screenshot needs to carry related information such as the sample name and a mark on the detection site base in the base sequence.
The manual screenshot identification process for the Sanger sequencing peak map is as follows:
first, opening a Sanger sequencing ab1 file with the Chromas software to display the base sequence peak diagram, and adjusting the horizontal or vertical scaling so that the width and height of the peak diagram on the screen are suitable for display;
second, locating the detection site visually by dragging the horizontal scroll bar, or by searching for a sequence, and adjusting the horizontal scroll bar so that the detection site lies in a suitable screenshot area;
third, taking a screenshot with a shortcut key (Alt+A), adjusting the screenshot area up, down, left and right, marking the detection site base conspicuously with a tool in the screenshot toolbar (for example by adding a red vertical arrow or a red box), and saving the screenshot according to the file naming format;
fourth, adding the screenshot to a Word or PPT document, adding a text box in the blank area on the left side of the picture, entering related information such as the site name and sample ID in the text box, grouping the picture and the text box, saving them as a picture, and naming the file according to the file naming format.
The manual screenshot identification method is only suitable for handling a small number of Sanger sequencing results, and it can misidentify genotypes when an aberrant sequencing peak is encountered (for example, a normal heterozygous double peak sequenced as two single peaks). When there are many samples and each sample has several detection sites, the manual process is time-consuming, laborious, inefficient and error-prone. In practice, manual screenshot identification of 1 site in 180 samples takes one person about 8 hours on average, not counting the time for genotype statistics. The sample size in a clinical validation experiment varies from hundreds to thousands of samples, and each sample usually has many detection sites (typically about 10 to 20), so an automatic batch screenshot identification system is needed to replace the manual method.
Disclosure of Invention
Manual screenshot identification is inefficient and therefore unsuitable for Sanger sequencing results covering a large number of samples and multiple sites, and it may misidentify genotypes when abnormal sequencing peaks are encountered. To solve these problems, the invention provides a Sanger sequencing peak diagram intercepting and identifying system.
The purpose of the invention is realized by the following technical scheme:
In a first aspect, the invention provides a Sanger sequencing peak diagram intercepting and identifying method, which comprises the following steps:
S1, reading the sequencing file and configuration file information, and deriving a base peak diagram and a base sequence from the sequencing file;
S2, processing the base peak diagram and the base sequence to identify the extended base information;
S3, intercepting and identifying the peak diagram of the extended base sequence based on the identified extended base information.
Further, in step S1, the sequencing file and the configuration file include: a single Sanger sequencing ab1 file or a compressed batch of ab1 files, and a json configuration file containing at least the sequencing primer names, detection site names and recognition sequence information.
Further, step S1 specifically includes:
S11, grouping all ab1 files in the sequencing files by sequencing primer name;
S12, processing each group of ab1 files according to the recognition sequence, and deriving a base peak diagram and a base sequence from each ab1 file using the sangerseqR package, wherein the base peak diagram comprises a full-length sequencing peak diagram and a peak diagram screenshot of 20 nt base length containing the detection site, and the base sequence comprises a primary sequence and a secondary sequence.
Further, in step S12, when deriving the 20 nt peak diagram screenshot, the number of bases trimmed from the 5' end of the full-length sequence (trim5) and the number of bases trimmed from the 3' end of the full-length sequence (trim3) are determined as follows:
S121, identifying the position of the detection site using the recognition sequence;
S122, when the position of the detection site in the primary sequence is successfully identified, obtaining the head base sequence and the tail base sequence, neither of which contains the segmentation sequence;
S123, when the recognition sequence is at the 5' end of the detection site, trim5 = head base sequence length + segmentation sequence length - 10, and trim3 = tail base sequence length - 10; when the recognition sequence is at the 3' end of the detection site, trim5 = head base sequence length - 11, and trim3 = tail base sequence length + segmentation sequence length - 9.
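As an illustration of step S123, the following minimal Python sketch computes trim5 and trim3 from the three lengths. The function name `trim_counts`, its parameters and the error raised for negative results are assumptions made for this example and are not part of the described software. Both branches leave a 20 nt window whose 11th base is the detection site.

```python
from typing import Tuple


def trim_counts(head_len: int, seg_len: int, tail_len: int,
                recog_at_5prime: bool) -> Tuple[int, int]:
    """Return (trim5, trim3) following the formulas of step S123.

    head_len / tail_len: lengths of the sequences before and after the
    segmentation sequence; seg_len: length of the segmentation sequence.
    """
    if recog_at_5prime:
        # Detection site is the first base of the tail sequence.
        trim5 = head_len + seg_len - 10
        trim3 = tail_len - 10
    else:
        # Detection site is the last base of the head sequence.
        trim5 = head_len - 11
        trim3 = tail_len + seg_len - 9
    if trim5 < 0 or trim3 < 0:
        # Assumption: a site too close to either end cannot be windowed.
        raise ValueError("detection site is too close to an end of the read")
    return trim5, trim3
```

With `seq` as the full-length sequence, the retained window would be `seq[trim5:len(seq) - trim3]`.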
Further, in step S121, the detection site position identification process is as follows:
using the full length of the recognition sequence as the segmentation sequence for splitting the sequencing sequence: if the segmentation sequence is completely present in the sequencing sequence, the sequencing sequence is split, otherwise it is not split;
if the sequence cannot be split, cutting one base off the end of the original segmentation sequence and retrying the split with the result as the new segmentation sequence, and repeating this process while the split still fails, wherein a recognition sequence marked with '3' is cut from its 3' end, and otherwise the recognition sequence is cut from its 5' end;
if the sequence still cannot be split after 5 bases have been cut off the end of the recognition sequence, or if a split yields more than two segments, splitting is stopped, identification of the detection site is judged to have failed, and no subsequent processing is carried out.
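The position identification process above can be sketched in Python as follows. This is only an illustrative sketch: the function name `split_on_recognition`, the return convention (a triple, or `None` on failure) and the guard against an empty segmentation sequence are assumptions of this example. A leading character '3' is treated as the marker for a recognition sequence located at the 3' end of the detection site.

```python
from typing import Optional, Tuple


def split_on_recognition(read: str, recognition: str,
                         max_trim: int = 5) -> Optional[Tuple[str, str, str]]:
    """Return (head, segmentation_sequence, tail), or None on failure."""
    marked_3prime = recognition.startswith("3")
    seg = recognition[1:] if marked_3prime else recognition
    for trimmed in range(max_trim + 1):
        if trimmed:
            # Cut one more base: from the 3' end for a '3'-marked sequence,
            # otherwise from the 5' end.
            seg = seg[:-1] if marked_3prime else seg[1:]
        if not seg:
            return None  # assumption: nothing left to search for
        parts = read.split(seg)
        if len(parts) == 2:
            return parts[0], seg, parts[1]
        if len(parts) > 2:
            return None  # more than two segments: identification failed
    return None  # still not found after cutting max_trim bases
```

For a sequence recognized at the 5' end, the detection site is then the first base of the returned tail; for a '3'-marked sequence it is the last base of the returned head.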
Further, step S2 specifically includes:
S21, identifying the detected genotype of the sample from the derived base sequence;
S22, filtering out samples whose detection site was wrongly identified, according to the detection site index values;
S23, identifying the base sequence in the screenshot and the pixel abscissa values of the left and right sides of each base;
S24, correcting wrongly identified genotypes according to the error correction sequence;
S25, determining the two horizontal pixel coordinate values of the genotype to be marked with a red box, and storing this information in a database.
Further, step S21 specifically includes:
S211, identifying the position of the detection site in the primary sequence and in the secondary sequence respectively;
S212, when the recognition sequence is at the 5' end of the detection site and the position of the detection site is successfully identified, the detection site index value = length of the first segment after splitting + length of the segmentation sequence + 1, the detection site base is the base at that position, and the error correction sequence is the 4 bases adjacent to the 3' end of the detection site;
when the recognition sequence is at the 3' end of the detection site and the position of the detection site is successfully identified, the detection site index value = length of the first segment after splitting, the detection site base is the base at that position, and the error correction sequence is the 4 bases adjacent to the 5' end of the detection site;
S213, when the base characters (one of A, T, C, G) at the detection site of the primary sequence and of the secondary sequence are the same, judging the genotype to be homozygous, and otherwise judging it to be heterozygous;
S214, storing the three identified items, namely the detection site index value, the genotype and the error correction sequence, in a database.
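Steps S212 and S213 can be illustrated with the short Python sketch below. It makes several simplifying assumptions that the text does not state: the same 1-based index is used for the primary and the secondary sequence, the error correction sequence is read from the primary sequence, and the genotype is represented as a two-letter string. The names `site_index` and `call_genotype` are hypothetical.

```python
from typing import Tuple


def site_index(head_len: int, seg_len: int, recog_at_5prime: bool) -> int:
    """1-based index of the detection site (step S212)."""
    return head_len + seg_len + 1 if recog_at_5prime else head_len


def call_genotype(primary: str, secondary: str, index: int,
                  recog_at_5prime: bool) -> Tuple[str, str, str]:
    """Return (genotype, zygosity, error_correction_sequence)."""
    site = index - 1  # convert the 1-based index to a 0-based offset
    p, s = primary[site], secondary[site]
    if recog_at_5prime:
        # 4 bases adjacent to the 3' end of the detection site
        ecc = primary[site + 1:site + 5]
    else:
        # 4 bases adjacent to the 5' end of the detection site
        ecc = primary[max(0, site - 4):site]
    zygosity = "homozygous" if p == s else "heterozygous"
    return p + s, zygosity, ecc
```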
Further, step S22 specifically includes:
S221, exporting from the database the detection site index values of all samples for the detection site;
S222, judging any index value outside the interval [Q1 - IQR × par, Q3 + IQR × par] to be an outlier, concluding that the detection site of that sample was wrongly identified, and carrying out no subsequent processing for it; wherein Q1 is the lower quartile of the detection site index value data set, Q3 is the upper quartile of the data set, IQR = Q3 - Q1, and par is a preset constant.
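A minimal Python sketch of the outlier filter in step S222 is given below. The text does not specify how the quartiles are computed or what value par takes, so the use of `statistics.quantiles` with the "inclusive" method and the value 1.5 in the usage note are assumptions of this example, as is the function name `filter_site_indices`.

```python
import statistics
from typing import Dict, Tuple


def filter_site_indices(index_by_sample: Dict[str, int],
                        par: float) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Split samples into (kept, rejected) by the interval of step S222."""
    values = list(index_by_sample.values())
    # Assumption: quartiles by the "inclusive" method; needs >= 2 values.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - iqr * par, q3 + iqr * par
    kept = {k: v for k, v in index_by_sample.items() if low <= v <= high}
    rejected = {k: v for k, v in index_by_sample.items() if not low <= v <= high}
    return kept, rejected
```

For example, with index values 120, 121, 120, 122, 119 and 260 and par = 1.5, only the sample with index 260 falls outside the interval and is rejected.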
Further, step S23 specifically includes:
S231, setting the intercepted peak diagram picture to 576 pixels high by 2448 pixels wide, and selecting as the base coordinate recognition line a horizontal straight line 88 pixels below the upper boundary of the picture; the line passes through the base sequence from left to right at about the upper 1/3 of the vertical height of each base letter, and its length is the full 2448-pixel width;
S232, reading the RGB color codes along the recognition line from left to right, namely reading and processing in turn the R, G and B values of the RGB color codes at points (88, 0), (88, 1), (88, 2), ..., (88, 2447) of the picture;
S233, recognizing the characters in the picture according to the correspondence between the RGB color codes output along the recognition line and the actual characters in the picture, as follows:
(i) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R > 100 & G < 100 & B < 100, to RGB3: R > 100 & G > 100 & B > 100, the letter C appears at the abscissa x of RGB2; the left boundary abscissa L of the letter C is x + left boundary compensation distance (-2), and its right boundary abscissa R is x + right boundary compensation distance (25);
(ii) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R < 100 & G > 100 & B < 100, to RGB3: R > 100 & G > 100 & B > 100, the letter A tentatively appears at the abscissa x of RGB2; the left boundary abscissa L of the letter A is x + left boundary compensation distance (-8), and its right boundary abscissa R is x + right boundary compensation distance (22);
(iii) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R < 100 & G < 100 & B > 100, to RGB3: R > 100 & G > 100 & B > 100, the letter T appears at the abscissa x of RGB2; the left boundary abscissa L of the letter T is x + left boundary compensation distance (-12), and its right boundary abscissa R is x + right boundary compensation distance (14);
(iv) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R < 100 & G < 100 & B < 100, to RGB3: R > 100 & G > 100 & B > 100, the letter G appears at the abscissa x of RGB2; the left boundary abscissa L of the letter G is x + left boundary compensation distance (-2), and its right boundary abscissa R is x + right boundary compensation distance (24);
(v) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R > 100 & G < 100 & B > 100, to RGB3: R > 100 & G > 100 & B > 100, the letter R tentatively appears at the abscissa x of RGB2; the left boundary abscissa L of the letter R is x + left boundary compensation distance (-2), and its right boundary abscissa R is x + right boundary compensation distance (24);
when an A or an R is tentatively recognized, if it is the first letter recognized in the picture, it is accepted directly; otherwise it is accepted only when the difference between its left boundary and the right boundary of the previous letter is not less than 23.
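The recognition rules of step S233 can be sketched in Python as below. The input is assumed to be the list of (R, G, B) triples already read along the recognition line; loading the picture is left out. The sketch treats each run of non-background pixels bounded by background on both sides as one RGB1 → RGB2 → RGB3 event and uses the first pixel of the run as x, which is an interpretation rather than something the text states. A channel value of exactly 100 is treated as "< 100", and the names `RULES`, `MIN_GAP` and `read_letters` are hypothetical. The colour-to-letter table follows the text verbatim.

```python
from typing import List, Tuple

BG = (True, True, True)  # R > 100 & G > 100 & B > 100
# (R > 100, G > 100, B > 100) -> (letter, left offset, right offset, tentative)
RULES = {
    (True, False, False): ("C", -2, 25, False),
    (False, True, False): ("A", -8, 22, True),
    (False, False, True): ("T", -12, 14, False),
    (False, False, False): ("G", -2, 24, False),
    (True, False, True): ("R", -2, 24, True),
}
MIN_GAP = 23  # minimum gap required to accept a tentative A or R


def read_letters(line: List[Tuple[int, int, int]]) -> List[Tuple[str, int, int]]:
    """Return (letter, left, right) for each letter found on the line."""
    keys = [(r > 100, g > 100, b > 100) for r, g, b in line]
    letters: List[Tuple[str, int, int]] = []
    x = 1
    while x < len(keys) - 1:
        if keys[x] != BG and keys[x - 1] == BG:
            end = x
            while end < len(keys) and keys[end] != BG:
                end += 1
            # Accept only runs that are closed by background again (RGB3).
            if end < len(keys) and keys[x] in RULES:
                letter, d_left, d_right, tentative = RULES[keys[x]]
                left, right = x + d_left, x + d_right
                if (not tentative or not letters
                        or left - letters[-1][2] >= MIN_GAP):
                    letters.append((letter, left, right))
            x = end
        else:
            x += 1
    return letters
```

The gap test is what discards the second stroke of an A or an R, which crosses the recognition line twice.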
Further, step S24 specifically includes:
S241, taking the most frequent error correction sequence in the database as the standard sequence, and treating all other error correction sequences as sequences to be corrected;
S242, generating a set of 4 possible genotypes from the bases contained in the two most frequent genotypes in the database;
S243, when the sequence to be corrected is at the 3' end of the detection site, the first base at the 5' end of the sequence to be corrected is taken as the undetermined genotype; if the correspondence between the bases of the sequence to be corrected and the bases of the standard sequence meets one of the following conditions:
first, bases 2 to 4 of the sequence to be corrected are the same as bases 1 to 3 of the standard sequence;
second, bases 3 and 4 of the sequence to be corrected are the same as bases 2 and 3 of the standard sequence, and base 2 of the sequence to be corrected differs from base 1 of the standard sequence;
third, bases 2 and 4 of the sequence to be corrected are the same as bases 1 and 3 of the standard sequence, and base 3 of the sequence to be corrected differs from base 2 of the standard sequence;
fourth, bases 2 and 3 of the sequence to be corrected are the same as bases 1 and 2 of the standard sequence, and base 4 of the sequence to be corrected differs from base 3 of the standard sequence;
and the undetermined genotype is present in the set of 4 possible genotypes, the initially identified genotype and the undetermined genotype are combined as the corrected genotype;
when the sequence to be corrected is at the 5' end of the detection site, the first base at the 3' end of the sequence to be corrected is taken as the undetermined genotype; if the correspondence between the bases of the sequence to be corrected and the bases of the standard sequence meets one of the following conditions:
first, bases 1 to 3 of the sequence to be corrected are the same as bases 2 to 4 of the standard sequence;
second, bases 2 and 3 of the sequence to be corrected are the same as bases 3 and 4 of the standard sequence, and base 1 of the sequence to be corrected differs from base 2 of the standard sequence;
third, bases 1 and 3 of the sequence to be corrected are the same as bases 2 and 4 of the standard sequence, and base 2 of the sequence to be corrected differs from base 3 of the standard sequence;
fourth, bases 1 and 2 of the sequence to be corrected are the same as bases 2 and 3 of the standard sequence, and base 3 of the sequence to be corrected differs from base 4 of the standard sequence;
and the undetermined genotype is present in the set of 4 possible genotypes, the initially identified genotype and the undetermined genotype are combined as the corrected genotype.
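Taken together, the four conditions in each case say that the three remaining bases of the sequence to be corrected, shifted by one position, match the standard sequence with at most one mismatch. The Python sketch below checks only this pattern and returns the undetermined base; the membership test against the set of 4 possible genotypes and the combination with the initially identified genotype are deliberately left to the caller, since the text does not fix their representation. The function name `pending_base` and its parameters are hypothetical.

```python
from typing import Optional


def pending_base(to_correct: str, standard: str,
                 on_3prime_side: bool) -> Optional[str]:
    """Return the undetermined base if the shift pattern of S243 holds.

    on_3prime_side: True when the error correction sequence lies at the
    3' end of the detection site, False when it lies at the 5' end.
    """
    if len(to_correct) != 4 or len(standard) != 4:
        raise ValueError("error correction sequences must be 4 bases long")
    if on_3prime_side:
        # Candidate is the first base; bases 2-4 are compared with
        # bases 1-3 of the standard sequence.
        candidate, shifted, reference = to_correct[0], to_correct[1:], standard[:3]
    else:
        # Candidate is the last base; bases 1-3 are compared with
        # bases 2-4 of the standard sequence.
        candidate, shifted, reference = to_correct[3], to_correct[:3], standard[1:]
    mismatches = sum(a != b for a, b in zip(shifted, reference))
    # Zero mismatches is condition one; exactly one mismatch covers
    # conditions two to four.
    return candidate if mismatches <= 1 else None
```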
Further, in step S25, determining the two horizontal pixel coordinate values of the genotype to be marked with the red box specifically comprises:
for a genotype that needs no error correction, the two horizontal pixel coordinate values of the red box are the left boundary abscissa L and the right boundary abscissa R of the 11th base character in the base sequence recognized from the screenshot;
for a genotype after error correction with the recognition sequence at the 5' end of the detection site, the two horizontal pixel coordinate values of the red box are the left boundary abscissa L of the 11th base character and the right boundary abscissa R of the 12th base character in the base sequence recognized from the screenshot;
for a genotype after error correction with the recognition sequence at the 3' end of the detection site, the two horizontal pixel coordinate values of the red box are the left boundary abscissa L of the 10th base character and the right boundary abscissa R of the 11th base character in the base sequence recognized from the screenshot.
Further, step S3 specifically includes:
S31, whiting out the 440-pixel-wide region on the left side of the screenshot and the region where the secondary sequence is located, except that a heterozygous genotype base in the secondary sequence, if present, is not erased;
S32, adding a red box to the detection site, wherein the abscissas of the upper left and lower right corners of the red box are, respectively, the left boundary abscissa recorded in the database minus 10 and the right boundary abscissa recorded in the database plus 10; if the genotype is homozygous, or heterozygous after error correction, the ordinates of the upper left and lower right corners of the red box are 5 and 60; if the genotype was heterozygous on initial identification, the ordinates of the upper left and lower right corners are 5 and 115;
S33, adding the detection site name and sample name at a fixed position in the white area on the left side of the picture, and cropping and saving the picture to the specified path.
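The geometry of the red box described in steps S25 and S32 can be summarized in a short Python sketch. It assumes that `letters` holds one (letter, left, right) entry per base character recognized in the 20 nt screenshot, and it reads the corrected-genotype cases as spanning two adjacent base characters; the function name `red_box` and its parameters are hypothetical, and drawing the box on the picture is not shown.

```python
from typing import List, Tuple


def red_box(letters: List[Tuple[str, int, int]], corrected: bool,
            recog_at_5prime: bool,
            initially_heterozygous: bool) -> Tuple[int, int, int, int]:
    """Return (x_left, y_top, x_right, y_bottom) of the red box."""
    if not corrected:
        first = last = 11          # the detection site base only
    elif recog_at_5prime:
        first, last = 11, 12       # corrected genotype spans bases 11-12
    else:
        first, last = 10, 11       # corrected genotype spans bases 10-11
    x_left = letters[first - 1][1] - 10
    x_right = letters[last - 1][2] + 10
    # An initially identified heterozygote gets the taller box (5 to 115);
    # homozygous and corrected heterozygous genotypes use 5 to 60.
    y_bottom = 115 if (initially_heterozygous and not corrected) else 60
    return x_left, 5, x_right, y_bottom
```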
The invention provides a Sanger sequencing peak diagram intercepting and identifying system in a second aspect, which comprises a local terminal and a server;
the local terminal uploads a Sanger sequencing ab1 file to be processed to a server based on a configured file uploading module;
the server realizes the Sanger sequencing peak graph interception and identification method according to any one of the claims 1-12 based on a configured ab1 file processing module, a sequence image information processing module and a screenshot identification processing module; the ab1 file processing module is used for reading sequencing files and configuration file information and deriving a base peak map and a base sequence based on the sequencing files, the sequence image information processing module is used for processing the base peak map and the base sequence and identifying extended base information, and the screenshot identification processing module is used for intercepting and identifying the extended base sequence peak map based on the identified extended base information;
based on a configured data display and download module, the local terminal views the marked extended base sequence peak diagrams, modifies bases in the base sequence of a picture as needed, or downloads specified processed picture files, sample processing statistics files, or intermediate files of the screenshot processing.
A third aspect of the present invention provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method according to the first aspect of the present invention when the processor executes the computer program.
A fourth aspect of the invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method according to the first aspect of the invention as described above.
The invention has the beneficial effects that:
With the Sanger sequencing peak diagram intercepting and identifying method and system, screenshot identification processing can be carried out efficiently on the Sanger sequencing peak diagrams of a large number of clinical samples, and the genotypes of the detection sites in the peak diagrams are recognized and counted at the same time. Screenshot personnel no longer need to capture each picture by hand, visually read the genotype of each detection site of each sample, and then tally the results manually. This greatly reduces the workload of capturing screenshots and counting detection site genotypes, improves working efficiency, and avoids errors caused by human oversight.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate an exemplary embodiment of the invention and, together with the description, serve to explain the invention and are not intended to limit the invention.
FIG. 1 is a schematic flow diagram of the process of the present invention.
FIG. 2 is a flow chart of data processing according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of genotype correction in an embodiment of the present invention.
Fig. 4 is a schematic diagram of a process of identifying a base character and a horizontal coordinate region value thereof in a screenshot according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating identification of different areas of a screenshot operation interface according to an embodiment of the present invention.
FIG. 6 is a diagram of an interface of base modification operation in the screenshot according to an embodiment of the present invention.
Detailed Description
In order to make the technical scheme and advantages of the present invention more clearly understood, the present invention is further described in detail below with reference to the accompanying drawings and embodiments.
Referring to fig. 1, the invention provides a Sanger sequencing peak graph intercepting and identifying method, which comprises the following steps:
S1, reading the sequencing file and configuration file information, and deriving a base peak diagram and a base sequence from the sequencing file;
S2, processing the base peak diagram and the base sequence to identify the extended base information;
S3, intercepting and identifying the peak diagram of the extended base sequence based on the identified extended base information.
The above process is further illustrated by the following examples. It should be noted that the method of the present invention can be executed by a plurality of functional modules configured on a computer or a cloud server, each having a corresponding function and being able to implement a given step of the invention individually or in cooperation, so that together they implement the method completely. The functional modules in the following embodiments are intended only to help those skilled in the art understand how the method is implemented, and do not limit the method of the present invention.
Referring to fig. 2, in the present embodiment, the method of the present invention is implemented based on a file uploading module, an ab1 file processing module, a sequence image information processing module, a screenshot identification processing module, and a data display downloading module configured on a computer or a cloud server. The file uploading module is used for uploading files from a local computer to the server; the ab1 file processing module is used for outputting sequence peak diagram information from a sequencing file; the sequence image information processing module is used for identifying the extended base information; the screenshot identification processing module is used for intercepting and identifying the extended base sequence peak diagram; the data display downloading module is used for viewing and downloading the processing results.
In one embodiment, the file uploading module uploads, with one click, a single Sanger sequencing ab1 file or a compressed batch of ab1 files together with a json configuration file, and compressed files are decompressed after upload. The json configuration file contains information such as sequencing primer names, detection site names and recognition sequences; the names are used for file naming, and the json file only needs to be provided the first time screenshots are made for sequencing results of the same panel.
In one embodiment, the ab1 file processing module first groups all uploaded ab1 files by sequencing primer name, then calls the sangerseqtopdf.r script built into the software to process each group of ab1 files according to the recognition sequence. The script mainly uses the sangerseqR package to derive a base peak diagram and a base sequence from each ab1 file, wherein the base peak diagram comprises two pdf files, namely a full-length sequencing peak diagram and a 20 nt peak diagram screenshot containing the detection site, and the base sequence comprises a primary sequence and a secondary sequence. When the base peak is a single peak, the bases at that position in the primary and secondary sequences are the same; when the base peak is a double peak, the corresponding base in the primary sequence is the base with the higher peak and the corresponding base in the secondary sequence is the base with the lower peak. The primary and secondary sequences can therefore be combined to make a preliminary call of homozygous or heterozygous genotype. Sometimes, however, sequencing renders a double peak as two crossed single peaks or two normal single peaks, so that the preliminary call identifies a heterozygote as a homozygote, and error correction is then needed. When deriving the 20 nt peak diagram screenshot, it is necessary to determine trim5, the number of bases trimmed from the 5' end of the full-length sequence, and trim3, the number of bases trimmed from the 3' end, as follows:
(1) The position of the detection site is identified by means of the recognition sequence. The recognition sequence (taken from the json configuration file) is a base sequence adjacent to the detection site, with a default length of 10 nt. It may be taken from the 5' side or the 3' side of the detection site as circumstances require; when it lies on the 3' side, the character '3' must be added to the left of the recognition sequence as a marker. Because occasional sequencing errors in individual bases mean that the recognition sequence does not always match the corresponding region of the derived sequence exactly, a dedicated procedure is needed to locate the detection site. The procedure is as follows:
(a) The full-length recognition sequence (with the leading character '3' removed, if present) is used as the segmentation sequence for splitting the sequencing sequence. If the segmentation sequence occurs intact in the sequencing sequence, the sequencing sequence can be split; otherwise it cannot. Under normal conditions the sequencing sequence is split into two segments;
(b) If the sequencing sequence cannot be split, one base is removed from the end of the segmentation sequence (from the 3' end for a recognition sequence marked with '3', otherwise from the 5' end), and the shortened sequence is used as the new segmentation sequence for another attempt. If the sequencing sequence still cannot be split into two segments, this process is repeated;
(c) At most 5 bases are removed from the end of the recognition sequence. If the sequencing sequence still cannot be split after 5 bases have been removed, or if splitting yields more than two segments, splitting stops: recognition of the detection site has failed and no further processing is carried out.
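The detection site location procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `locate_site` and its return shape are hypothetical, and the choice to give up as soon as a split yields more than two segments is one reading of the stopping rule rather than a confirmed detail of the software.

```python
def locate_site(read, recognition, max_trim=5):
    """Split `read` around the recognition sequence (hypothetical helper).

    Returns (head, segmentation_seq, tail, side), or None on failure.
    side is "3" when the recognition sequence carries the '3' marker
    (it lies 3' of the detection site), otherwise "5".
    """
    side = "3" if recognition.startswith("3") else "5"
    seq = recognition[1:] if side == "3" else recognition
    for trimmed in range(max_trim + 1):
        # Shorten from the end farthest from the detection site:
        # the 3' end for '3'-marked sequences, otherwise the 5' end.
        candidate = seq[:len(seq) - trimmed] if side == "3" else seq[trimmed:]
        if not candidate:
            return None
        parts = read.split(candidate)
        if len(parts) == 2:
            return parts[0], candidate, parts[1], side
        if len(parts) > 2:
            return None  # assumption: more than two segments means failure
    return None
```

For a recognition sequence marked with '3', the detection site is then the last base of the returned head segment; otherwise it is the first base of the tail segment.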
(2) After the position of the detection site in the primary sequence has been successfully identified, the head base sequence and the tail base sequence, neither of which contains the segmentation sequence, are obtained;
(3) When the recognition sequence is on the 5' side of the detection site, trim5 = length of the head base sequence + length of the segmentation sequence - 10, and trim3 = length of the tail base sequence - 10;
(4) When the recognition sequence is on the 3' side of the detection site, trim5 = length of the head base sequence - 11, and trim3 = length of the tail base sequence + length of the segmentation sequence - 9.
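As a check on the two trim formulas, the short Python sketch below computes trim5 and trim3 from the segment lengths. The function name and the `side` argument ("5" or "3", for the side of the detection site on which the recognition sequence lies) are illustrative assumptions. In both cases the formulas leave a 20 nt window with the detection site at position 11.

```python
def trim_lengths(head_len, seg_len, tail_len, side):
    """Return (trim5, trim3) for the 20 nt screenshot window.

    head_len / tail_len: lengths of the segments before and after the
    segmentation sequence; seg_len: length of the segmentation sequence.
    """
    if side == "5":
        # Detection site is the first base of the tail segment.
        return head_len + seg_len - 10, tail_len - 10
    # side == "3": detection site is the last base of the head segment.
    return head_len - 11, tail_len + seg_len - 9
```

A negative result would mean the detection site is too close to an end of the read for a full 20 nt window; how the software handles that case is not stated in the text.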
In one embodiment, the sequence image information processing module identifies each sample's detected genotype from the derived base sequences, filters out samples whose detection site was misidentified, identifies the base sequence in the screenshot together with the pixel abscissa values of the left and right sides of each base, corrects misidentified genotypes, determines the two horizontal pixel coordinates of the red box that marks the genotype, and stores this information in a database.
The sequence image information processing module filters out samples with detection site identification errors according to the detection site index value, and performs genotype error correction according to the error correction sequence; the detection site index value, the error correction sequence and related information are generated during genotype recognition. The genotype recognition process is as follows:
(1) The position of the detection site is identified in the primary sequence and in the secondary sequence separately, using the same procedure as in the ab1 file processing module;
(2) When the recognition sequence is on the 5' side of the detection site and the position of the detection site is successfully identified, the detection site index value = length of the first segment after splitting + length of the segmentation sequence + 1, the detection site base is the base at that position, and the error correction sequence is the 4 bases adjacent to the 3' side of the detection site;
(3) When the recognition sequence is on the 3' side of the detection site and the position of the detection site is successfully identified, the detection site index value = length of the first segment after splitting, the detection site base is the base at that position, and the error correction sequence is the 4 bases adjacent to the 5' side of the detection site;
(4) When the detection site bases of the primary and secondary sequences (only the four characters A, T, C and G are allowed) are the same, the genotype is homozygous; otherwise it is heterozygous;
(5) The identified detection site index value, genotype, error correction sequence and so on are stored in the database. When any one of these items cannot be correctly identified, the detection site index value is set to 0, the genotype to 'N' and the error correction sequence to '-' (a minus sign).
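The genotype recognition rules above can be illustrated with the following Python sketch. The function names are hypothetical. Two details are assumptions not stated in the text: the stored index and error correction sequence are taken from the primary sequence, and a heterozygous call is written as primary base, slash, secondary base.

```python
VALID = set("ATCG")
FAILED = (0, "N", "-")  # index 0, genotype 'N', error correction sequence '-'


def site_info(seq, head_len, seg_len, side):
    """Return (1-based index, site base, 4-base error correction sequence)."""
    if side == "5":
        index = head_len + seg_len + 1
        ec = seq[index:index + 4]               # 4 bases on the 3' side
    else:
        index = head_len
        ec = seq[max(index - 5, 0):index - 1]   # 4 bases on the 5' side
    if index < 1 or index > len(seq) or len(ec) != 4:
        return None
    return index, seq[index - 1], ec


def call_genotype(primary_info, secondary_info):
    """Combine the primary and secondary site bases into a genotype call."""
    if primary_info is None or secondary_info is None:
        return FAILED
    index, p_base, ec = primary_info
    s_base = secondary_info[1]
    if p_base not in VALID or s_base not in VALID:
        return FAILED
    genotype = p_base if p_base == s_base else f"{p_base}/{s_base}"
    return index, genotype, ec
```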
Owing to occasional sequencing errors in the recognition sequence region, there is a small probability that the segmentation sequence is absent next to the detection site but occurs elsewhere in the sequence, which causes the detection site to be misidentified. Although the sequencing sequences of different samples for the same detection site are not exactly the same length, the overall deviation is small, so the position index value of the detection site in the sequencing sequence should vary within a limited interval. Misidentified detection sites can therefore be found from abnormal index values. The sequence image information processing module filters them as follows: the detection site index values of all samples for a given detection site are retrieved from the database, and any index value outside the interval [Q1 - IQR × par, Q3 + IQR × par] is treated as an outlier, indicating that the detection site of that sample was misidentified; no further processing is performed for that sample. Here Q1 is the lower quartile and Q3 the upper quartile of the detection site index value data set, IQR = Q3 - Q1, and par is a constant, set to 4 on the basis of tests on 3000 sequencing files.
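A minimal Python sketch of this interquartile range filter is given below. The function name is hypothetical, and the quartile convention (the "inclusive" method of the standard library's `statistics.quantiles`) is an assumption, since the text does not say how Q1 and Q3 are computed.

```python
from statistics import quantiles


def abnormal_index_values(index_values, par=4):
    """Return the index values outside [Q1 - IQR*par, Q3 + IQR*par]."""
    # Assumption: inclusive quartiles; the source does not name a method.
    q1, _, q3 = quantiles(index_values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - iqr * par, q3 + iqr * par
    return [v for v in index_values if not (low <= v <= high)]
```

Samples whose index value is returned by this function would be excluded from further processing.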
Genotypes are misidentified because, owing to abnormal sequencing, the normal double peak of a heterozygous genotype (e.g. A/T), that is, two overlapping peaks at the same base position, is detected as two adjacent peaks, so that two adjacent bases (e.g. AT) are output. Within a batch of sequencing data, abnormal sequencing is a low-probability event, so most genotype calls at a given site are correct and, likewise, most error correction sequences are identical and do not contain a detection site base. An error correction sequence associated with a misidentified genotype, by contrast, contains a detection site base. The sequence image information processing module corrects the initially recognized genotype on this basis. Referring to FIG. 3, the genotype error correction process is as follows:
(1) The most frequent error correction sequence in the database is taken as the standard sequence (an error correction sequence whose corresponding genotypes include heterozygotes is preferred), and all other error correction sequences are sequences to be corrected. A set of the 4 genotypes that may exist, for example A, T, A/T and T/A, is generated from the bases contained in the two most frequent genotypes in the database.
(2) When the sequence to be corrected is on the 3' side of the detection site, the first base at its 5' end is taken as the candidate genotype. If the bases of the sequence to be corrected and those of the standard sequence satisfy one of the following conditions, and the candidate genotype is in the set of 4 genotypes, the initially recognized genotype and the candidate genotype are merged to give the corrected genotype. For example, if the initially recognized genotype is A and the candidate genotype is T, the merged genotype is A/T.
(a) Bases 2 to 4 of the sequence to be corrected are the same as bases 1 to 3 of the standard sequence;
(b) Bases 3 and 4 of the sequence to be corrected are the same as bases 2 and 3 of the standard sequence, and base 2 of the sequence to be corrected differs from base 1 of the standard sequence;
(c) Bases 2 and 4 of the sequence to be corrected are the same as bases 1 and 3 of the standard sequence, and base 3 of the sequence to be corrected differs from base 2 of the standard sequence;
(d) Bases 2 and 3 of the sequence to be corrected are the same as bases 1 and 2 of the standard sequence, and base 4 of the sequence to be corrected differs from base 3 of the standard sequence.
(3) When the sequence to be corrected is on the 5' side of the detection site, the first base at its 3' end is taken as the candidate genotype. If the bases of the sequence to be corrected and those of the standard sequence satisfy one of the following conditions, and the candidate genotype is in the set of 4 genotypes, the initially recognized genotype and the candidate genotype are merged to give the corrected genotype.
(a) Bases 1 to 3 of the sequence to be corrected are the same as bases 2 to 4 of the standard sequence;
(b) Bases 2 and 3 of the sequence to be corrected are the same as bases 3 and 4 of the standard sequence, and base 1 of the sequence to be corrected differs from base 2 of the standard sequence;
(c) Bases 1 and 3 of the sequence to be corrected are the same as bases 2 and 4 of the standard sequence, and base 2 of the sequence to be corrected differs from base 3 of the standard sequence;
(d) Bases 1 and 2 of the sequence to be corrected are the same as bases 2 and 3 of the standard sequence, and base 3 of the sequence to be corrected differs from base 4 of the standard sequence.
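Taken together, the four conditions in each case amount to saying that the three bases of the sequence to be corrected, shifted by one position, match the corresponding three bases of the standard sequence with at most one mismatch. The Python sketch below encodes that reading. The function names, the `ec_side` argument (the side of the detection site on which the error correction sequence lies) and the guard that skips a candidate equal to the initial genotype are illustrative assumptions, as is the order in which the merged genotype is written.

```python
def candidate_base(to_correct, standard, ec_side):
    """Return the candidate detection site base, or None if no condition holds."""
    if ec_side == "3":
        # Error correction sequence lies 3' of the site: candidate is base 1,
        # bases 2-4 are compared with bases 1-3 of the standard sequence.
        cand, shifted, ref = to_correct[0], to_correct[1:4], standard[0:3]
    else:
        # Error correction sequence lies 5' of the site: candidate is base 4,
        # bases 1-3 are compared with bases 2-4 of the standard sequence.
        cand, shifted, ref = to_correct[3], to_correct[0:3], standard[1:4]
    mismatches = sum(a != b for a, b in zip(shifted, ref))
    return cand if mismatches <= 1 else None


def correct_genotype(initial, to_correct, standard, ec_side, genotype_set):
    """Merge the initial genotype with the candidate base when allowed."""
    cand = candidate_base(to_correct, standard, ec_side)
    if cand is None or cand not in genotype_set:
        return initial
    if cand == initial:  # assumption: nothing to merge in this case
        return initial
    return f"{initial}/{cand}"
```

For example, with the standard sequence GGCA on the 3' side and a sample read as A followed by TGGC, the candidate base is T and the corrected genotype is A/T.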
Screenshots are standardized when they are captured: the picture has a fixed length and width, the base sequence in the picture is always 20 nt long, the baseline of the peak map and the base sequence sit at fixed heights, and a detection site that has not been error-corrected is at position 11 of the base sequence in the screenshot. However, the width of the sequencing peak corresponding to each base is not fixed, so the abscissa of the detection site genotype differs between samples, and its position can shift left or right within a region of the picture. In addition, an initially recognized heterozygous genotype appears as two vertically stacked bases in the picture, whereas a corrected heterozygous genotype appears as two horizontally adjacent bases. The red box therefore cannot be added at fixed coordinates; it must be placed according to the actual coordinates of the detection site genotype in each sample, which requires determining the coordinates of the top-left and bottom-right corners of the red box. To this end, the sequence image information processing module identifies the base sequence and the abscissa of each base from the changes in RGB color code along the base sequence row of the picture. Referring to FIG. 4, the identification process is as follows:
(1) The captured peak map picture is 576 pixels (height) × 2448 pixels (width). A horizontal line 88 pixels below the upper edge of the picture is chosen as the base coordinate identification line. It crosses the base sequence from left to right, passing through each base letter at about the upper third of its height, and its length equals the picture width of 2448 pixels;
(2) The R, G and B values of the RGB color codes at the points (88, 0), (88, 1), (88, 2) ... (88, 2447) on the identification line are read in order from left to right for processing;
(3) Based on a study of the correspondence between the RGB color codes read along the identification line and the actual characters in the picture, a picture character recognition method was designed. The recognition process is as follows:
(a) If the RGB color code changes from RGB1 (R>100 & G>100 & B>100) to RGB2 (R>100 & G<100 & B<100) and then to RGB3 (R>100 & G>100 & B>100), a letter C is present at the abscissa x of RGB2; the left boundary abscissa of the letter C is L = x + left boundary compensation distance (-2), and the right boundary abscissa is R = x + right boundary compensation distance (25);
(b) If the RGB color code changes from RGB1 (R100 & G100 & B100) to RGB2 (R100 & G100 & B100) and then to RGB3 (R100 & G100 & B100), a letter A is tentatively present at the abscissa x of RGB2; the left boundary abscissa of the letter A is L = x + left boundary compensation distance (-8), and the right boundary abscissa is R = x + right boundary compensation distance (22);
(c) If the RGB color code changes from RGB1 (R & 100 & G & 100 & B & 100) to RGB2 (R & 100 & G & 100 & B & 100) and then to RGB3 (R & 100 & G & 100 & B & 100), a letter T is present at the abscissa x of RGB2; the left boundary abscissa of the letter T is L = x + left boundary compensation distance (-12), and the right boundary abscissa is R = x + right boundary compensation distance (14);
(d) If the RGB color code changes from RGB1 (R100 & G100 & B100) to RGB2 (R100 & G100 & B100) and then to RGB3 (R100 & G100 & B100), a letter G is present at the abscissa x of RGB2; the left boundary abscissa of the letter G is L = x + left boundary compensation distance (-2), and the right boundary abscissa is R = x + right boundary compensation distance (24);
(e) If the RGB color code changes from RGB1 (R>100 & G>100 & B>100) to RGB2 (R>100 & G<100 & B>100) and then to RGB3 (R>100 & G>100 & B>100), a letter R is tentatively present at the abscissa x of RGB2 (R occurs when the base cannot be read as A, T, C or G from the ab1 file for sequencing reasons; its actual probability of occurrence is low); the left boundary abscissa of the letter R is L = x + left boundary compensation distance (-2), and the right boundary abscissa is R = x + right boundary compensation distance (24);
(f) Because of the shapes of the letters A and R, the identification line crosses each of these letters twice, so the rules above alone would recognize a single A or R as two letters. An additional condition is therefore required: if the letter is the first one recognized in the picture, it is accepted as A or R directly; otherwise it is accepted as A or R only when the difference between its left boundary and the right boundary of the previous letter is not less than 23 (the minimum width of the blank area between two base letters in the picture).
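The scan along the identification line can be sketched in Python as below. This is a simplified illustration, not the software's implementation. It works on a plain list of (R, G, B) tuples for the row, so reading the row from the picture is left out. Only the color rules for C and R are encoded, because those are the two whose thresholds are stated explicitly in this text; rules for A, T and G would be added to `LETTER_RULES` in the same way. Taking x as the first pixel of the RGB2 run is also an assumption.

```python
# Color rules for RGB2, as stated in the text for the letters C and R only.
LETTER_RULES = {
    "C": lambda p: p[0] > 100 and p[1] < 100 and p[2] < 100,
    "R": lambda p: p[0] > 100 and p[1] < 100 and p[2] > 100,
}
# (left, right) boundary compensation distances from the text.
OFFSETS = {"C": (-2, 25), "A": (-8, 22), "T": (-12, 14),
           "G": (-2, 24), "R": (-2, 24)}
MIN_GAP = 23  # minimum blank width between two base letters


def is_background(p):
    return p[0] > 100 and p[1] > 100 and p[2] > 100


def scan_row(row):
    """Return [(letter, L, R), ...] found along one row of RGB pixels."""
    found = []
    x = 1
    while x < len(row):
        letter = next((k for k, rule in LETTER_RULES.items() if rule(row[x])), None)
        if letter and is_background(row[x - 1]):
            end = x
            while end < len(row) and LETTER_RULES[letter](row[end]):
                end += 1
            if end < len(row) and is_background(row[end]):
                left = x + OFFSETS[letter][0]
                right = x + OFFSETS[letter][1]
                # A and R are crossed twice; drop the second crossing.
                second_stroke = (letter in "AR" and found
                                 and left - found[-1][2] < MIN_GAP)
                if not second_stroke:
                    found.append((letter, left, right))
            x = end
        else:
            x += 1
    return found
```

Real screenshots contain anti-aliased pixels at letter edges, which this sketch does not handle.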
In the 20 nt peak map screenshot, an initially recognized extension base is always at position 11. After error correction, however, the extension bases of a corrected (heterozygous) genotype are at positions 11 and 12 when the recognition sequence is on the 5' side of the extension base, and at positions 10 and 11 when it is on the 3' side. Accordingly, for a genotype that needs no error correction, the two horizontal pixel coordinates of the red box are the left boundary abscissa L and the right boundary abscissa R of the 11th base character in the base sequence recognized from the screenshot. For a corrected genotype with the recognition sequence on the 5' side of the detection site, they are the left boundary abscissa L of the 11th base character and the right boundary abscissa R of the 12th base character. For a corrected genotype with the recognition sequence on the 3' side of the detection site, they are the left boundary abscissa L of the 10th base character and the right boundary abscissa R of the 11th base character.
In one embodiment, the screenshot identification processing module processes the picture according to information read from the database. The module first whites out a 440-pixel-wide region on the left side of the screenshot and the region where the secondary sequence is located; a heterozygous genotype base in the secondary sequence is not erased. It then adds a red box around the detection site: the abscissas of the top-left and bottom-right corners of the red box are, respectively, the left boundary abscissa minus 10 and the right boundary abscissa plus 10 recorded in the database for the red box (that is, for the base letters of the detected genotype). If the genotype is homozygous, or heterozygous after error correction, the ordinates of the top-left and bottom-right corners are 5 and 60; if the genotype is an initially recognized heterozygote, they are 5 and 115. Finally, the detection site name and sample name are added at a fixed position in the whited-out area on the left of the picture, and the picture is saved to the specified path, which completes the picture processing.
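The red box placement rules can be summarized in a short Python sketch. The function name and argument layout are hypothetical; `letters` stands for the 20 recognized base characters with their left and right boundary abscissas, and the actual drawing step with an imaging library is omitted.

```python
def red_box(letters, corrected, side, initial_heterozygous):
    """Return (x1, y1, x2, y2) for the red box.

    letters: list of (base, L, R) for the 20 bases in the screenshot.
    corrected: True if the genotype was error-corrected.
    side: "5" or "3", side of the detection site with the recognition sequence.
    initial_heterozygous: True for an initially recognized heterozygote.
    """
    if not corrected:
        first, last = 11, 11
    elif side == "5":
        first, last = 11, 12
    else:
        first, last = 10, 11
    x1 = letters[first - 1][1] - 10   # left boundary abscissa minus 10
    x2 = letters[last - 1][2] + 10    # right boundary abscissa plus 10
    # Stacked bases of an initial heterozygote need a taller box.
    y2 = 115 if (initial_heterozygous and not corrected) else 60
    return x1, 5, x2, y2
```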
The method of the present invention will be described in detail with reference to further embodiments, which are implemented based on computer software programs written according to the method of the present invention and running on local computers and servers.
Example: Sanger sequencing of clinical trial samples for a multi-gene combined detection kit for human drug metabolism and drug action targets (hereinafter referred to as C17).
C17 detects 17 sites on a nucleic acid mass spectrometry platform. In the clinical confirmation trial, the C17 results for clinical samples are confirmed by Sanger sequencing: the genotype detected on the mass spectrometry platform is compared with the genotype of the detection site identified from the Sanger sequencing peak map, the concordance between the two is tallied, and the Sanger sequencing peak map of each detection site must be provided as supporting evidence for the confirmation result.
Before Sanger sequencing, the nucleic acid sequences containing the 17 sites are first amplified and the amplification products are then sequenced, each with one sequencing primer. Because the two sites c.526C>T and c.388T>C lie on the same amplification product, only 16 sequencing primers are needed for the 17 sites. The recognition sequences of 8 of the 17 sites lie on the 3' side of the detection site, and these recognition sequences carry the character '3' on their left. The screenshot identification configuration file C17.json of the C17 project panel is as follows:
{
"H1-F": {"c.100C>T": "TGCACGCTAC"},
"H2-R": {"c.526C>T": "3CTTCTGCAGG", "c.388T>C":"3CACGTCCTCC"},
"H3-R": {"c.430C>T": "3GTCCTCAATG"},
"H4-R": {"c.1166A>C": "3GCTCATTTGG"},
"H5-R": {"c.1173C>T": "GATCATCGAC"},
"H6-F": {"c.1165G>C": "CAGAGCAGTC"},
"H8-F": {"c.1510G>A": "GGCATACACT"},
"H9-F": {"c.1075A>C": "CCAGAGATAC"},
"H10-F": {"c.-1639G>A": "CCACCGCACC"},
"H11-F": {"c.681G>A": "ATTATTTCCC"},
"H12-R": {"c.636G>A": "TTACCTGGAT"},
"H13-R": {"c.-806C>T": "3CTTTGAGAAC"},
"H14-R": {"c.388A>G": "3GATATTAGTT"},
"H15-F": {"I/D": "AGTCACTTTT"},
"H16-R": {"c.521T>C": "3CATATATCCA"},
"H17-R": {"c.665C>T": "3CTCCCGCAGA"}
}
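A configuration file of this shape can be read with a few lines of Python, shown here as a sketch. The function name and the keys of the returned records are illustrative assumptions; the only convention taken from the text is that a leading '3' marks a recognition sequence on the 3' side of the detection site.

```python
import json


def parse_panel_config(text):
    """Flatten the panel json into one record per detection site."""
    records = []
    for primer, sites in json.loads(text).items():
        for site, recognition in sites.items():
            on_3_side = recognition.startswith("3")
            records.append({
                "primer": primer,
                "site": site,
                "side": "3" if on_3_side else "5",
                # Strip the '3' marker to get the bare recognition sequence.
                "sequence": recognition[1:] if on_3_side else recognition,
            })
    return records
```

Applied to the C17.json content above, this would yield 17 records across 16 primers, 8 of them with side "3".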
The specific screenshot operation based on the software program comprises the following steps:
Step one: Log in on the login interface; an unregistered user is first directed to the registration page to register. After a successful login the user is taken to the screenshot operation interface, as shown in FIG. 5. In the actual clinical trial, amplification and sequencing of the 17 sites for all samples were performed in batches; the following process performs screenshot identification processing on the sequencing results of 4 sites in one batch.
Step two: In the file upload operation area, select the Sanger sequencing ab1 files to be processed (covering the 4 sequencing primers H10-F, H11-F, H12-R and H13-R, with 207 samples per primer). If the panel project is being processed for the first time, select the json file (in this example, the C17.json file is selected from the corresponding path); otherwise, select the name of the corresponding panel project in the configuration file selection box. After the submit button is clicked, the file uploading module creates a path for storing the processed data files according to the user name and the panel project name. After a successful upload, the software automatically decompresses the compressed file and, once decompression is complete, displays a message such as 'upload success' in the system prompt area.
Step three: Click the 'AB1to PDF' button in the data processing operation area to start the ab1 file processing module. The module first groups all uploaded ab1 files by sequencing primer name, then, group by group, outputs for every ab1 file a pdf file of the full-length base sequence peak map and a pdf file of the 20 nt base sequence peak map containing the detection site, together with text files of the primary and secondary base sequences. While the module is running, a progress bar is shown in the display area at the lower left corner of the control interface, and the system prompt area shows that ab1 to pdf is running and the start time (2021-06-08 09:33:37). When the run is complete, the progress bar disappears, the system prompt area reports that the run took 156 seconds and that a total of 828 files were processed, with 828 successes and 0 failures, and the preview area on the right lists the ab1 file processing state (success or fail) of each sample in a table.
Step four: Click the 'Treat extended Base' button in the data processing operation area to start the sequence image information processing module, which identifies the sample detection genotypes in two processes: extended base identification (including base coordinate identification) and extended base error correction (including determination of the abscissas of the red box marking the genotype). While each process is running, the display area shows its progress and the system prompt area shows 'running extended base registration' or 'running extended base correction'. When the run is complete, the progress bar disappears and the system prompt area reports that identification and error correction of the extended bases took 82 seconds. The preview area on the right adds, to the previous result information, the genotype and the position index of the detection site in the sequencing sequence; for samples whose genotype was corrected, 'correct_heterozygosity' is shown in the remarks column.
Step five: Click the screenshot annotation button in the data processing operation area to start the screenshot identification processing module. While it is running, the display area shows the processing progress and the system prompt area shows that screenshot annotation is running. When the run is complete, the progress bar disappears and the system prompt area reports that a total of 825 screenshots were annotated in 233 seconds.
Step six: Click a 'showjpg' button in the operation column of the table on the right to start the data display downloading module and view data; the final identified screenshot of the corresponding sample is shown in the display area at the lower left. By default the display area shows the screenshot of only one sample, that is, viewing the next sample's screenshot replaces the previous one. To display several sample screenshots at once, select the 'Multiple Display' radio button in the data processing operation area; select the 'Single Display' radio button to return to the default. To view the original screenshot before identification processing, select the 'Middle Picture' radio button in the data processing operation area; select the 'Final Picture' radio button to return to the default. To modify base letters in the final screenshot (for the rare case where the color of a peak does not match the base), click the corresponding 'modifybase' button in the operation column of the table on the right. An operation box for modifying the base sequence of the screenshot then pops up (as shown in FIG. 6). Select the base index and the corresponding base type in the box; if several bases are to be modified, the base index values and base types must be given in matching order. Click the 'submit' button to complete the modification.
Step seven: Click the 'Download File' button in the data processing operation area to start the data display downloading module for data download; three hidden download buttons, 'Download Jpg', 'Download Xlsx' and 'Download Pdf', unfold beneath it. Clicking 'Download Jpg' packs and compresses the processed final identified screenshot files; when processing is complete, the system prompt area reports that the download file Extended_Base_Sanger_Peak.zip has been generated, and the browser pops up a download box for saving it. Clicking 'Download Xlsx' exports 8 columns of information from the database to an excel file, namely sample name, sequencing primer name, detection site name, ab1topdf state, detected genotype, detection site index, whether a final screenshot exists, and remarks; when processing is complete, the system prompt area reports that the download file Extended_Base_gene_statistics.xlsx has been generated, and the browser pops up a download box for saving it. Clicking 'Download Pdf' packs and compresses the intermediate process files, comprising the primary sequence file, the secondary sequence file, the full-length sequence peak map pdf file, the 20 nt base sequence peak map pdf file and the jpg picture files converted from the peak map pdf files; when processing is complete, the system prompt area reports that the download file Intermediate_process_document.zip has been generated, and the browser pops up a download box for saving it.
Step eight: Click the exit link at the upper right of the screenshot operation interface to log out.
The software processing results show that, of the 828 sequencing ab1 files, 3 had detection site genotype identification errors that the software judged could not be corrected, owing to sequencing problems; no final identified screenshot was generated for them and they need to be re-sequenced. Another 96 ab1 files had initial genotype identification errors owing to sequencing problems, which were corrected by error correction, and their final identified screenshots were output normally. Processing of the remaining ab1 files showed no abnormality.
The software completed screenshot identification and genotype statistics for the 828 sequencing results in 8 min. Done manually, the screenshot identification work alone would take one person at least 37 hours in total, not counting genotype statistics.
In conclusion, Sanger sequencing peak map interception and identification software implemented on the basis of the method of the invention can replace the manual workflow of opening each ab1 sequencing file one by one in Chromas, adjusting the peak map area, capturing and marking the screenshot with shortcut keys and the mouse, saving it, and finally adding the sample name and site name in a text box. With only a few simple mouse clicks, the software efficiently performs screenshot identification processing on the Sanger sequencing peak maps of a large number of clinical samples, and at the same time identifies and tallies the genotypes of the detection sites in the peak maps, so that screenshot personnel no longer need to identify the genotype of each detection site by eye after capturing each screenshot and then collate the genotypes of every site of every sample by hand. This greatly reduces the workload of screenshot identification and genotype statistics, improves the efficiency of the screenshot identification process by at least 280 times, and avoids possible errors due to human oversight.
The above description is illustrative of the present invention and is not to be construed as limiting the invention. Modifications and variations may be made by those skilled in the art in light of the embodiments disclosed herein without departing from the scope and spirit of the invention.

Claims (9)

1. A Sanger sequencing peak map interception identification method, characterized by comprising the following steps:
S1, reading a sequencing file and a configuration file, wherein the sequencing file comprises a single Sanger sequencing ab1 file or a compressed batch of Sanger sequencing ab1 files, and the configuration file is a json configuration file comprising at least a sequencing primer name, a detection site name, and recognition sequence information; and deriving a base peak map and a base sequence from the sequencing file, which specifically comprises:
S11, grouping all ab1 files in the sequencing file according to sequencing primer name;
S12, processing each group of ab1 files according to the recognition sequence, and deriving the base peak map and the base sequence from the ab1 files using the sangerseqR package, wherein the base peak map comprises a full-length sequencing peak map and a screenshot of a 20 nt peak map containing the detection site, and the base sequence comprises a primary sequence and a secondary sequence;
S2, processing the base peak map and the base sequence to identify extended base information, which specifically comprises:
S21, identifying the detection-site genotype of the sample from the derived base sequence, which specifically comprises:
S211, identifying the position of the detection site in the primary sequence and in the secondary sequence respectively;
S212, when the recognition sequence is at the 5' end of the detection site and the position of the detection site is successfully identified, the index value of the detection site = the length of the first segment of the sequence after segmentation + the length of the segmentation sequence + 1, the detection-site base is the base at that position, and the error correction sequence is the 4 bases adjacent to the 3' end of the detection site;
when the recognition sequence is at the 3' end of the detection site and the position of the detection site is successfully identified, the index value of the detection site = the length of the first segment of the sequence after segmentation, the detection-site base is the base at that position, and the error correction sequence is the 4 bases adjacent to the 5' end of the detection site;
S213, when the detection-site base characters (each one of A, T, C, G) of the primary sequence and the secondary sequence are the same, judging the genotype to be homozygous; otherwise, judging the genotype to be heterozygous;
S214, storing the identified index value, genotype, and error correction sequence of the detection site in a database;
S22, filtering out samples whose detection site was identified incorrectly, according to the index value of the detection site;
S23, identifying the base sequence in the screenshot and the pixel abscissa values of the left and right sides of each base;
S24, correcting incorrectly identified genotypes according to the error correction sequence;
S25, determining the two horizontal pixel coordinate values of the red box that marks the genotype, and storing this information in the database, specifically:
for a genotype that does not need error correction, the two horizontal pixel coordinate values of the red box are respectively the left boundary abscissa L and the right boundary abscissa R of the 11th base character in the base sequence recognized from the screenshot;
for an error-corrected genotype with the recognition sequence at the 5' end of the detection site, the two horizontal pixel coordinate values of the red box are respectively the left boundary abscissa L and the right boundary abscissa R of the pair formed by the 11th and 12th base characters in the base sequence recognized from the screenshot;
for an error-corrected genotype with the recognition sequence at the 3' end of the detection site, the two horizontal pixel coordinate values of the red box are respectively the left boundary abscissa L and the right boundary abscissa R of the pair formed by the 10th and 11th base characters in the base sequence recognized from the screenshot;
S3, intercepting and marking the extended base sequence peak map based on the identified extended base information, which specifically comprises:
S31, whitening a region 440 pixels wide on the left side of the screenshot and the region where the secondary sequence is located, wherein, if a heterozygous genotype base is present in the secondary sequence, that base is not whitened;
S32, adding a red box mark to the detection site, wherein the abscissa values of the upper left corner and the lower right corner of the red box are, respectively, the left boundary abscissa recorded in the database - 10 and the right boundary abscissa recorded in the database + 10; if the genotype is homozygous, or heterozygous after error correction, the ordinate values of the upper left corner and the lower right corner of the red box are 5 and 60; if the genotype is heterozygous on initial identification, the ordinate values of the upper left corner and the lower right corner of the red box are 5 and 115;
S33, adding the detection site name and the sample name at a fixed position in the whitened area on the left side of the picture, and cropping and saving the picture to the specified path.
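The red box geometry of steps S25 and S32 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `red_box`, its parameters, and the `(L, R)` tuple layout are hypothetical, and it assumes that, for an error-corrected genotype, the box runs from the left boundary of the first of the two named characters to the right boundary of the second.

```python
from typing import List, Tuple

def red_box(letters: List[Tuple[int, int]], corrected: bool,
            recognition_side: str, initially_heterozygous: bool
            ) -> Tuple[int, int, int, int]:
    """Return (x_left, y_top, x_right, y_bottom) of the red box.

    letters: (L, R) boundary abscissas of the base characters recognized
             in the 20 nt screenshot, in left-to-right order (hypothetical layout).
    recognition_side: "5" or "3", the side of the detection site on which
             the recognition sequence lies.
    """
    if not corrected:
        first, last = 11, 11          # S25: 11th character only
    elif recognition_side == "5":
        first, last = 11, 12          # S25: 11th and 12th characters
    else:
        first, last = 10, 11          # S25: 10th and 11th characters
    left = letters[first - 1][0]      # claim positions are 1-based
    right = letters[last - 1][1]
    # S32: a taller box only for genotypes that were heterozygous on
    # initial identification; corrected or homozygous genotypes use 60.
    y_bottom = 115 if (initially_heterozygous and not corrected) else 60
    return (left - 10, 5, right + 10, y_bottom)

# Synthetic boundaries: 20 characters, 30 px apart, each 25 px wide.
letters = [(100 + 30 * i, 125 + 30 * i) for i in range(20)]
print(red_box(letters, corrected=False, recognition_side="5",
              initially_heterozygous=False))
```

The coordinates in the example are synthetic; in the method they would come from the character boundaries stored in the database in step S25.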
2. The Sanger sequencing peak map interception identification method according to claim 1, wherein in step S12, when deriving the 20 nt peak map screenshot, the number of bases trimmed from the 5' end of the full-length sequence, trim5, and the number of bases trimmed from the 3' end of the full-length sequence, trim3, are determined as follows:
S121, identifying the position of the detection site using the recognition sequence;
S122, when the position of the detection site in the primary sequence is successfully identified, obtaining the head base sequence and the tail base sequence, neither of which contains the segmentation sequence;
S123, when the recognition sequence is at the 5' end of the detection site, trim5 = head base sequence length + segmentation sequence length - 10 and trim3 = tail base sequence length - 10; when the recognition sequence is at the 3' end of the detection site, trim5 = head base sequence length - 11 and trim3 = tail base sequence length + segmentation sequence length - 9.
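The trimming arithmetic of step S123 can be checked with a short sketch. The function name `trim_counts` and the example lengths are hypothetical; the formulas are those of S123, and the site index is that of step S212 in claim 1. The check confirms that the retained window is 20 nt long and that the detection site falls on its 11th base, which matches the 11th base character used in step S25.

```python
from typing import Tuple

def trim_counts(head_len: int, split_len: int, tail_len: int,
                recognition_side: str) -> Tuple[int, int, int]:
    """Return (trim5, trim3, site_index) for one read.

    head_len / tail_len: lengths of the sequences before / after the
    segmentation sequence; split_len: length of the segmentation sequence.
    recognition_side: "5" or "3" relative to the detection site.
    site_index is the 1-based index from step S212.
    """
    if recognition_side == "5":
        trim5 = head_len + split_len - 10
        trim3 = tail_len - 10
        site_index = head_len + split_len + 1
    else:
        trim5 = head_len - 11
        trim3 = tail_len + split_len - 9
        site_index = head_len
    return trim5, trim3, site_index

# Hypothetical read: 120 nt head, 15 nt segmentation sequence, 300 nt tail.
for side in ("5", "3"):
    trim5, trim3, site = trim_counts(120, 15, 300, side)
    window = (120 + 15 + 300) - trim5 - trim3   # bases left after trimming
    position = site - trim5                     # 1-based position in the window
    print(side, trim5, trim3, window, position)
```

In both orientations the window length comes out as 20 and the site position as 11, so the two sets of constants in S123 are consistent with each other.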
3. The Sanger sequencing peak map interception identification method according to claim 2, wherein in step S121 the position of the detection site is identified as follows:
the full-length recognition sequence is used as the segmentation sequence for segmenting the sequencing sequence; if the segmentation sequence is present in its entirety in the sequencing sequence, the sequencing sequence is segmented, otherwise it cannot be segmented;
if the sequencing sequence cannot be segmented, one base is cut off from the end of the current segmentation sequence and the result is used as a new segmentation sequence for another segmentation attempt; if segmentation still fails, this process is repeated; when a base is cut off, a recognition sequence marked '3' is cut from its 3' end, and otherwise the recognition sequence is cut from its 5' end;
if segmentation is still impossible after 5 bases have been cut off from the end of the recognition sequence, or if segmentation yields more than two segments, segmentation stops, identification of the detection site is judged to have failed, and no subsequent processing is performed.
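The progressive segmentation described in this claim can be sketched as below. This is an illustration under stated assumptions: the name `locate_site` is hypothetical, sequences are plain strings written 5' to 3', and a recognition sequence "marked '3'" is read as one lying on the 3' side of the detection site, so that trimming always removes the end farthest from the site.

```python
from typing import Optional, Tuple

MAX_TRIM = 5  # the claim gives up once 5 bases have been cut off

def locate_site(read: str, recognition: str,
                recognition_side: str) -> Optional[Tuple[str, str, str]]:
    """Return (head, segmentation_sequence, tail), or None on failure."""
    for cut in range(MAX_TRIM + 1):
        if recognition_side == "3":
            split_seq = recognition[:len(recognition) - cut]  # trim the 3' end
        else:
            split_seq = recognition[cut:]                     # trim the 5' end
        if not split_seq:
            return None
        parts = read.split(split_seq)
        if len(parts) == 2:
            return parts[0], split_seq, parts[1]
        if len(parts) > 2:      # ambiguous: more than two segments
            return None
    return None

rec = "GTAGCTAGGA"
clean = "TTGACC" + rec + "TCCAGG"
noisy = "TTGACC" + "CAAGCTAGGA" + "TCCAGG"   # first two bases miscalled
print(locate_site(clean, rec, "5"))
print(locate_site(noisy, rec, "5"))
```

Because the end of the recognition sequence adjacent to the detection site is never trimmed, the sum of the head length and the segmentation sequence length is unchanged by trimming, so the index formula of step S212 still points at the same base.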
4. The Sanger sequencing peak map interception identification method according to claim 1, wherein step S22 specifically comprises:
S221, retrieving from the database the detection site index values of all samples for the detection site;
S222, judging any index value outside the interval [Q1 - IQR × par, Q3 + IQR × par] to be an abnormal value, whereupon identification of the detection site for that sample is judged to be wrong and no subsequent processing is performed; wherein Q1 is the lower quartile of the set of detection site index values, Q3 is the upper quartile of that set, IQR = Q3 - Q1, and par is a preset constant.
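The interquartile-range filter of step S222 can be sketched as follows. The claim fixes neither the value of par nor the quartile convention, so both are assumptions here: par defaults to 1.5, and quartiles are computed with the "inclusive" method of Python's `statistics.quantiles`. The function name and the sample names are hypothetical.

```python
import statistics
from typing import Dict, List

def flag_index_outliers(index_by_sample: Dict[str, int],
                        par: float = 1.5) -> List[str]:
    """Return the samples whose detection site index is an abnormal value.

    par = 1.5 is an assumed default; the claim only says it is a preset
    constant. At least two samples are needed to compute quartiles.
    """
    values = list(index_by_sample.values())
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - iqr * par, q3 + iqr * par
    return [name for name, idx in index_by_sample.items()
            if not (low <= idx <= high)]

# Synthetic index values: one sample is far from the rest.
indices = {"s1": 100, "s2": 101, "s3": 100, "s4": 102,
           "s5": 101, "s6": 100, "s7": 150}
print(flag_index_outliers(indices))
```

The filter works because reads sequenced with the same primer place the detection site at nearly the same index, so an index far from the bulk usually means the recognition sequence matched at the wrong place.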
5. The Sanger sequencing peak map interception identification method according to claim 4, wherein step S23 specifically comprises:
S231, setting the cropped peak map to 576 pixels high by 2448 pixels wide, and selecting as the base coordinate recognition line the horizontal line 88 pixels below the upper boundary of the picture, which passes through the base sequence from left to right;
S232, reading the RGB color codes along the recognition line from left to right, that is, reading in turn the R, G, and B values of the RGB color codes at the points (88, 0), (88, 1), (88, 2), ..., (88, 2447) of the picture for processing;
S233, recognizing the characters in the picture and their coordinates according to the correspondence between the RGB color codes output along the recognition line and the actual characters in the picture, as follows:
(i) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R > 100 & G < 100 & B < 100, to RGB3: R > 100 & G > 100 & B > 100, the letter C appears at the abscissa x of RGB2; the left boundary abscissa L of the letter C is x + the left boundary compensation distance, which is -2, and the right boundary abscissa R of the letter C is x + the right boundary compensation distance, which is 25;
(ii) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R < 100 & G > 100 & B < 100, to RGB3: R > 100 & G > 100 & B > 100, the letter A tentatively appears at the abscissa x of RGB2; the left boundary abscissa L of the letter A is x + the left boundary compensation distance, which is -8, and the right boundary abscissa R of the letter A is x + the right boundary compensation distance, which is 22;
(iii) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R < 100 & G < 100 & B > 100, to RGB3: R > 100 & G > 100 & B > 100, the letter T appears at the abscissa x of RGB2; the left boundary abscissa L of the letter T is x + the left boundary compensation distance, which is -12, and the right boundary abscissa R of the letter T is x + the right boundary compensation distance, which is 14;
(iv) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R < 100 & G < 100 & B < 100, to RGB3: R > 100 & G > 100 & B > 100, the letter G appears at the abscissa x of RGB2; the left boundary abscissa L of the letter G is x + the left boundary compensation distance, which is -2, and the right boundary abscissa R of the letter G is x + the right boundary compensation distance, which is 24;
(v) if the RGB color codes change from RGB1: R > 100 & G > 100 & B > 100, to RGB2: R > 100 & G < 100 & B > 100, to RGB3: R > 100 & G > 100 & B > 100, the letter R tentatively appears at the abscissa x of RGB2; the left boundary abscissa L of the letter R is x + the left boundary compensation distance, which is -2, and the right boundary abscissa R of the letter R is x + the right boundary compensation distance, which is 24;
(vi) if a tentatively recognized letter A or R is the first letter recognized in the picture, it is directly judged to be A or R; otherwise, it is judged to be A or R only when the difference between its left boundary and the right boundary of the previous letter is not less than 23.
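The scan along the recognition line can be sketched as below, working on one row of RGB tuples so that no image library is needed. Several points are assumptions rather than statements of the claim: a letter stroke is treated as a run of identically classified pixels bounded by background on both sides, x is taken as the first pixel of that run, and pixels matching none of the listed color patterns are ignored. The names `classify`, `scan_row`, `OFFSETS`, and `MIN_GAP` are hypothetical.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]

# letter -> (left compensation, right compensation), as listed in the claim
OFFSETS = {"C": (-2, 25), "A": (-8, 22), "T": (-12, 14),
           "G": (-2, 24), "R": (-2, 24)}
MIN_GAP = 23  # minimum distance from the previous letter for A or R

def classify(pixel: Pixel) -> Optional[str]:
    """Map one RGB value to "bg", a letter, or None (no listed pattern)."""
    if 100 in pixel:            # exactly 100 satisfies neither > nor <
        return None
    key = tuple(channel > 100 for channel in pixel)
    return {(True, True, True): "bg",
            (True, False, False): "C",
            (False, True, False): "A",
            (False, False, True): "T",
            (False, False, False): "G",
            (True, False, True): "R"}.get(key)

def scan_row(row: List[Pixel]) -> List[Tuple[str, int, int]]:
    """Return (letter, L, R) for each letter found on the recognition line."""
    found: List[Tuple[str, int, int]] = []
    x, n = 0, len(row)
    while x < n:
        letter = classify(row[x])
        if letter in (None, "bg"):
            x += 1
            continue
        start = x
        while x < n and classify(row[x]) == letter:
            x += 1              # consume the whole colored run
        bounded = (start > 0 and x < n and
                   classify(row[start - 1]) == "bg" and
                   classify(row[x]) == "bg")
        if not bounded:
            continue
        left = start + OFFSETS[letter][0]
        right = start + OFFSETS[letter][1]
        # A and R are only tentative: reject them when too close to the
        # previous letter (for example a second stroke of the same glyph).
        if letter in ("A", "R") and found and left - found[-1][2] < MIN_GAP:
            continue
        found.append((letter, left, right))
    return found

# Synthetic recognition line: white background with a few colored pixels.
row = [(255, 255, 255)] * 2448
row[500] = (255, 0, 0)                 # pattern listed for C
row[560] = row[570] = (0, 255, 0)      # two strokes of one A
row[620] = (0, 0, 255)                 # pattern listed for T
print(scan_row(row))
```

In the synthetic row, the second green stroke at x = 570 is rejected by the 23-pixel rule because it falls inside the extent already assigned to the letter A found at x = 560.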
6. The Sanger sequencing peak map interception identification method according to claim 5, wherein step S24 specifically comprises:
S241, taking the most frequent error correction sequence in the database as the standard sequence, and taking the other error correction sequences as sequences to be corrected;
S242, generating a set of 4 possible genotypes from the bases contained in the two most frequent genotypes in the database;
S243, when the sequence to be corrected is at the 3' end of the detection site, the first base at the 5' end of the sequence to be corrected is taken as the undetermined genotype; if the correspondence between the bases of the sequence to be corrected and the bases of the standard sequence meets one of the following conditions:
(i) the 2nd to 4th bases of the sequence to be corrected are the same as the 1st to 3rd bases of the standard sequence;
(ii) the 3rd and 4th bases of the sequence to be corrected are the same as the 2nd and 3rd bases of the standard sequence, and the 2nd base of the sequence to be corrected is different from the 1st base of the standard sequence;
(iii) the 2nd and 4th bases of the sequence to be corrected are the same as the 1st and 3rd bases of the standard sequence, and the 3rd base of the sequence to be corrected is different from the 2nd base of the standard sequence;
(iv) the 2nd and 3rd bases of the sequence to be corrected are the same as the 1st and 2nd bases of the standard sequence, and the 4th base of the sequence to be corrected is different from the 3rd base of the standard sequence;
and the undetermined genotype is present in the set of 4 genotypes, the initially identified genotype and the undetermined genotype are combined as the corrected genotype;
when the sequence to be corrected is at the 5' end of the detection site, the first base at the 3' end of the sequence to be corrected is taken as the undetermined genotype; if the correspondence between the bases of the sequence to be corrected and the bases of the standard sequence meets one of the following conditions:
(i) the 1st to 3rd bases of the sequence to be corrected are the same as the 2nd to 4th bases of the standard sequence;
(ii) the 2nd and 3rd bases of the sequence to be corrected are the same as the 3rd and 4th bases of the standard sequence, and the 1st base of the sequence to be corrected is different from the 2nd base of the standard sequence;
(iii) the 1st and 3rd bases of the sequence to be corrected are the same as the 2nd and 4th bases of the standard sequence, and the 2nd base of the sequence to be corrected is different from the 3rd base of the standard sequence;
(iv) the 1st and 2nd bases of the sequence to be corrected are the same as the 2nd and 3rd bases of the standard sequence, and the 3rd base of the sequence to be corrected is different from the 4th base of the standard sequence;
and the undetermined genotype is present in the set of 4 genotypes, the initially identified genotype and the undetermined genotype are combined as the corrected genotype.
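The four matching conditions of step S243 amount to comparing three bases of the sequence to be corrected with three bases of the standard sequence, shifted by one position, and allowing at most one mismatch. The sketch below covers only that comparison and the choice of the undetermined base; the membership test against the set of 4 genotypes and the final combination are left out. The function names are hypothetical, and both sequences are assumed to be 4-character strings written 5' to 3'.

```python
def meets_shift_condition(to_fix: str, standard: str, side: str) -> bool:
    """True when one of the four listed conditions holds.

    side: "3" if the error correction sequence lies on the 3' side of the
    detection site, "5" if it lies on the 5' side.
    """
    if side == "3":
        # bases 2-4 of the sequence to be corrected vs bases 1-3 of the standard
        pairs = zip(to_fix[1:4], standard[0:3])
    else:
        # bases 1-3 of the sequence to be corrected vs bases 2-4 of the standard
        pairs = zip(to_fix[0:3], standard[1:4])
    mismatches = sum(a != b for a, b in pairs)
    # zero mismatches is the first condition; exactly one mismatch at the
    # first, middle, or last pair gives the other three
    return mismatches <= 1

def undetermined_base(to_fix: str, side: str) -> str:
    """Base of the sequence to be corrected that is nearest the detection site."""
    return to_fix[0] if side == "3" else to_fix[-1]

standard = "GATC"
print(meets_shift_condition("AGAT", standard, "3"), undetermined_base("AGAT", "3"))
print(meets_shift_condition("ACCT", standard, "3"))
```

In the first example the sequence to be corrected, AGAT, is the standard GATC shifted by one base, so its leading A is the base considered for combination with the initially identified genotype.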
7. A Sanger sequencing peak map interception identification system, characterized by comprising a local terminal and a server;
the local terminal uploads the Sanger sequencing ab1 files to be processed to the server by means of a configured file uploading module;
the server implements the Sanger sequencing peak map interception identification method of any one of claims 1 to 6 by means of a configured ab1 file processing module, sequence image information processing module, and screenshot identification processing module; the ab1 file processing module reads the sequencing file and configuration file information and derives the base peak map and the base sequence from the sequencing file; the sequence image information processing module processes the base peak map and the base sequence and identifies the extended base information; and the screenshot identification processing module intercepts and marks the extended base sequence peak map based on the identified extended base information;
the local terminal, by means of a configured data display and download module, receives and displays the marked extended base sequence peak map, modifies bases in the base sequence of the picture as needed, or downloads specified processed picture files, sample processing statistics files, or intermediate screenshot-processing files.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor realizes the steps of the method according to any of claims 1-6 when executing the computer program.
9. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
CN202110814332.3A 2021-07-19 2021-07-19 Sanger sequencing peak image interception identification method and system, computer equipment and storage medium Active CN113380323B (en)


Publications (2)

Publication Number Publication Date
CN113380323A CN113380323A (en) 2021-09-10
CN113380323B true CN113380323B (en) 2022-09-23

Family

ID=77582507

Country Status (1)

Country Link
CN (1) CN113380323B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221102

Address after: Room 102, Building 3, No. 355, Xingzhong Road, Hangzhou Yuhang Economic and Technological Development Zone, Yuhang District, Hangzhou City, Zhejiang Province, 311100

Patentee after: Hangzhou Dipu Medical Laboratory Co.,Ltd.

Address before: 311101 second floor, building 9, 355 Xingzhong Road, Yuhang Economic and Technological Development Zone, Yuhang District, Hangzhou City, Zhejiang Province

Patentee before: ZHEJIANG DIPU DIAGNOSIS TECHNOLOGY Co.,Ltd.