CN111816249B - Cyclization analysis method of genome - Google Patents
- Publication number
- CN111816249B (application CN202010484602.4A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- software
- sequences
- splicing
- genome
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
Abstract
The invention discloses a method for circularization analysis of a genome, comprising the following steps: first assembling the sequencing data and self-correcting the reads; selecting template sequences from the assembly result and aligning the self-corrected sequences to the template sequences to obtain overlapping sequences; assembling the overlapping sequences to obtain contig sequences; aligning these contig sequences to the assembled sequence and, according to the alignment result, joining the contigs into a circle; and determining the starting site by aligning the circularized sequence to bacterial dnaA gene sequences. The method automatically determines the relationships and junction points between contigs, so that the contigs can be joined and the whole genome circularized, and it locates the origin of replication or the starting gene so that the starting point of the genome can be adjusted.
Description
Technical Field
The invention relates to the field of biological analysis, and in particular to a method for circularization analysis of a genome.
Background
Bacterial genomes are generally circular, so assembling one requires circularizing the assembled sequence and adjusting its starting point.
At present, a complete (finished) bacterial genome is generally assembled from third-generation sequencing data and then corrected with second-generation sequencing data.
Because the genome itself is a closed circle joined end to end, a linear assembly has to be circularized. The approach generally used today assembles the data with several different assembly programs. Since the starting points of the results produced by different programs very probably differ, the sequences are subjected to collinearity analysis to find overlapping regions, and the sequences are then circularized according to those regions.
To find the genome starting point, gene prediction is performed after assembly, followed by annotation to locate dnaA, which makes the operation cumbersome.
Disclosure of Invention
To overcome the above drawbacks of the prior art, the object of the present invention is to provide a method for circularization analysis of a genome that automatically determines the relationships and junction points between contigs, so that the contigs can be joined and the whole genome circularized, and that locates the origin of replication or the starting gene so that the starting point of the genome can be adjusted.
To achieve this object, the following technical scheme is adopted:
a method for circularization analysis of a genome, comprising the following steps:
step one: running the HGAP software, wherein the Genome Length parameter is set close to the size of the reference genome and, together with the other parameters, is used to assemble the PacBio third-generation sequencing data to obtain an assembled sequence;
step two: self-correcting the PacBio third-generation data with the canu software to obtain corrected sequences;
step three: taking the 50 kb regions cut from the two ends of each contig in the assembly result of step one as template sequences for screening;
step four: running the minimap program to align the corrected sequences from step two to the template sequences from step three, and screening out the overlapping sequences, namely the corrected sequences that overlap the outer edge of a template sequence;
step five: running the canu software to assemble the overlapping sequences screened in step four, obtaining contig sequences;
step six: running the blast software to align the contig sequences from step five to the assembled sequence from step one and find the overlapping regions, obtaining an alignment result;
step seven: determining the relationships between the contigs from the alignment result of step six and joining them to obtain joined contigs; because the sequences generated in step five are assembled from data at the contig edges, the search for relationships between contigs is more targeted and more likely to succeed;
step eight: taking the joined contigs as input and repeating steps three to seven until the sequence forms a circle;
step nine: running the makeblastdb software with bacterial dnaA gene sequences as input to build a database, and aligning the circularized sequence from step eight to the bacterial dnaA gene sequences with blast to obtain alignments;
step ten: selecting, from the alignments of step nine, the result with the highest identity, similarity and coverage values to determine the starting site.
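Steps three and four can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the `Hit` record, the `_head`/`_tail` naming, and the `slack` and `min_overhang` thresholds are all assumptions introduced here, and only forward-strand alignments are considered.

```python
from typing import Dict, NamedTuple

END_LEN = 50_000  # 50 kb end regions, as in step three


def end_templates(contigs: Dict[str, str], end_len: int = END_LEN) -> Dict[str, str]:
    """Return the head and tail region of every contig as template sequences."""
    templates = {}
    for name, seq in contigs.items():
        n = min(end_len, len(seq))
        templates[f"{name}_head"] = seq[:n]
        templates[f"{name}_tail"] = seq[-n:]
    return templates


class Hit(NamedTuple):
    """One read-to-template alignment (hypothetical record, 0-based, end exclusive)."""
    read: str
    read_len: int
    read_start: int
    read_end: int
    template: str   # e.g. "ctg1_head" or "ctg1_tail"
    tmpl_len: int
    tmpl_start: int
    tmpl_end: int


def overhangs_outer_edge(hit: Hit, slack: int = 100, min_overhang: int = 1000) -> bool:
    """True if the read reaches the outer edge of the template and extends past it."""
    if hit.template.endswith("_head"):
        # Outer edge is the template start; the read must carry unaligned
        # sequence to the left of the aligned part.
        return hit.tmpl_start <= slack and hit.read_start >= min_overhang
    # Tail template: outer edge is the template end.
    return (hit.tmpl_len - hit.tmpl_end) <= slack and (hit.read_len - hit.read_end) >= min_overhang
```

Reads for which `overhangs_outer_edge` returns `True` would be the ones passed on to the assembly in step five.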
In a preferred embodiment of the present invention, step nine is specifically:
downloading all bacterial dnaA gene sequences in the NCBI database in batches with EDirect, the NCBI website's own software;
running the makeblastdb software to build a database from the downloaded gene sequences;
running the blast software to align the circularized sequence to the dnaA gene database.
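The database construction and alignment commands might look as follows. This sketch assumes a nucleotide-to-nucleotide search with `blastn` (the text only says "blast") and tabular output; all file and database names are hypothetical. The commands are built and printed, not executed.

```python
from typing import List


def makeblastdb_cmd(dnaa_fasta: str, db_name: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from the dnaA sequences."""
    return ["makeblastdb", "-in", dnaa_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(genome_fasta: str, db_name: str, out_tsv: str) -> List[str]:
    """Command that aligns the circularized genome to the dnaA database."""
    # -outfmt 6 requests tab-separated output, one alignment per line.
    return ["blastn", "-query", genome_fasta, "-db", db_name,
            "-outfmt", "6", "-out", out_tsv]


if __name__ == "__main__":
    # Hypothetical file names; pass each list to subprocess.run(cmd, check=True)
    # to actually run the tools.
    for cmd in (makeblastdb_cmd("dnaA.fasta", "dnaA_db"),
                blastn_cmd("circular_genome.fasta", "dnaA_db", "dnaA_hits.tsv")):
        print(" ".join(cmd))
```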
Owing to the above steps, the beneficial effects of the invention are:
the method automatically determines the relationships and junction points between contigs, so that the contigs can be joined and the whole genome circularized, and it locates the origin of replication or the starting gene so that the starting point of the genome can be adjusted.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows the assembly result for the end regions in the present invention.
FIG. 3 shows the alignment result used to find overlapping regions in the present invention.
FIG. 4 shows the alignment result used to find the starting point in the present invention.
FIG. 5 shows the assembly result of the canu software in the comparative example.
FIG. 6 shows the result of searching for overlapping regions in the comparative example.
Detailed Description
Example 1:
Circularization procedure:
1. The genome of Bacillus velezensis is about 4 Mb. HGAP was run with the Genome Length parameter set to 4 Mb, and assembly produced a contig 3,980,777 bp in length; whether it could be circularized could not be determined.
2. The canu software was run to self-correct the PacBio third-generation data; unreliable portions were trimmed after correction to obtain high-quality sequences.
3. From the assembly result of step 1, the 50 kb regions at the two ends of the contig were taken as templates for screening.
4. minimap was run to align the high-quality sequences from step 2 to the template sequences from step 3, and the high-quality sequences overlapping the outer edge of a template sequence were screened out.
5. The canu software was run to assemble the sequences screened in step 4. The assembly result is shown in FIG. 2.
6. The blast software was run to align the sequences generated in step 5 to the contig sequence obtained in step 1 and find the overlapping regions. The alignment result is shown in FIG. 3.
7. The alignment result of step 6 shows that the sequence tig00000001 generated in step 5 joins the head and tail of the contig generated in step 1 and fills a 1 bp gap, so circularization succeeded.
Because the sequence generated in step 5 is assembled from data at the contig edges, the search for relationships between contigs is more targeted and more likely to succeed.
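The junction check in steps 6 and 7 can be expressed as a small function. This is a sketch under stated assumptions: the `BlastHit` record and `bridge_gap` name are introduced here for illustration, only forward-strand hits are handled, and the coordinates in the usage example are invented. Only the contig length (3,980,777 bp) and the 1 bp gap come from the example above.

```python
from typing import List, NamedTuple, Optional


class BlastHit(NamedTuple):
    """Forward-strand hit of the edge contig (query) on the main contig (subject).

    Coordinates are 1-based and inclusive.
    """
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def bridge_gap(hits: List[BlastHit], subject_len: int, tol: int = 0) -> Optional[int]:
    """Bases of the edge contig lying between the subject's tail and head.

    Returns None when no pair of hits spans the junction. A negative value
    means the two ends of the subject overlap on the edge contig.
    """
    tails = [h for h in hits if subject_len - h.s_end <= tol]   # reach the last base
    heads = [h for h in hits if h.s_start - 1 <= tol]           # reach the first base
    gaps = [h.q_start - t.q_end - 1
            for t in tails for h in heads
            if t.q_start < h.q_start]  # tail hit must precede head hit on the query
    if not gaps:
        return None
    return min(gaps, key=abs)


if __name__ == "__main__":
    # Illustrative coordinates, not taken from FIG. 3.
    hits = [BlastHit(1, 20000, 3960778, 3980777),
            BlastHit(20002, 40001, 1, 20000)]
    print(bridge_gap(hits, subject_len=3980777))  # 1
```

A non-`None` result indicates that the edge contig covers both ends of the main contig, which is the evidence for circularization used in step 7.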
Starting-point adjustment procedure:
1. All bacterial dnaA gene sequences in the NCBI database were downloaded in batches with EDirect, the NCBI website's own software.
2. The makeblastdb software was run to build a database from the gene sequences downloaded in step 1.
3. The blast software was run to align the circularized sequence to the dnaA gene database. The alignment result is shown in FIG. 4.
4. From the alignments in FIG. 4, those whose start on the target sequence [S2] is position 1 and whose Identities [% IDY] and Similarity [% SIM] values are highest were taken as the best alignments. Row 11 of the alignment result is the best, so base 26007 of [S1] was chosen as the position of the dnaA gene in the genome and as the starting point of the genome.
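The selection rule in step 4 and the subsequent rotation of the genome can be sketched as follows. The `DnaAHit` record and function names are hypothetical, the field comments map to the column labels quoted above, and the sketch assumes dnaA lies on the forward strand (a reverse-strand hit would additionally need a reverse complement).

```python
from typing import List, NamedTuple


class DnaAHit(NamedTuple):
    genome_start: int   # [S1]: 1-based start on the circularized genome
    target_start: int   # [S2]: 1-based start on the dnaA sequence
    identity: float     # [% IDY]
    similarity: float   # [% SIM]


def pick_origin(hits: List[DnaAHit]) -> int:
    """Genome coordinate of the best alignment that starts at position 1 of dnaA."""
    candidates = [h for h in hits if h.target_start == 1]
    if not candidates:
        raise ValueError("no alignment starts at position 1 of a dnaA sequence")
    best = max(candidates, key=lambda h: (h.identity, h.similarity))
    return best.genome_start


def rotate(genome: str, origin: int) -> str:
    """Rotate a circular genome so that 1-based position `origin` becomes position 1."""
    i = origin - 1
    return genome[i:] + genome[:i]
```

With the values from the example, `pick_origin` would return 26007 and `rotate(genome, 26007)` would place the dnaA start at the first base of the output sequence.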
The main feature of the invention is that, since the assembled sequences are linear, the two ends of a linear sequence are joined to form a circle.
The existing method assembles the data with several different assembly programs. Since the starting points of the results produced by different programs very probably differ, the sequences are subjected to collinearity analysis to find overlapping regions, and the sequences are then circularized according to those regions.
Comparative example 1:
Circularization procedure:
1. The genome of Bacillus velezensis is about 4 Mb. HGAP was run with the Genome Length parameter set to 4 Mb, and assembly produced a contig 3,980,777 bp in length; whether it could be circularized could not be determined.
2. The canu software was run to assemble the PacBio third-generation data de novo. The canu assembly result is shown in FIG. 5; 31 contigs were assembled in total. The conventional approach is to compare the assembly results of the two programs (the results of other assembly software could also be used here).
3. The blast software was used to align the contigs generated in step 2 to the contig sequence obtained in step 1 and find the overlapping regions. Part of the alignment result is shown in FIG. 6.
As can be seen from rows 5 and 6 of FIG. 6, positions 1-10036 of tig00000015 align to positions 10039-1 of 000000F|arrow, and positions 10061-125104 of tig00000015 align to positions 3980754-3865712 of 000000F|arrow. Because the region at the very end of 000000F|arrow, positions 3980755-3980777, is not aligned, tig00000015 does not completely cover the head-tail region of 000000F|arrow and cannot be used as evidence of circularization.
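The coordinates quoted from FIG. 6 can be checked with a few lines of arithmetic. The helper name `uncovered_ends` is introduced here for illustration; it looks only at the outermost aligned positions on the main contig and accepts reverse-strand coordinate pairs.

```python
from typing import List, Tuple


def uncovered_ends(subject_spans: List[Tuple[int, int]], subject_len: int) -> Tuple[int, int]:
    """Number of unaligned bases at the head and at the tail of the subject.

    Each span is (start, end) on the subject, 1-based and inclusive; start may
    be larger than end for reverse-strand alignments.
    """
    lowest = min(min(s, e) for s, e in subject_spans)
    highest = max(max(s, e) for s, e in subject_spans)
    return lowest - 1, subject_len - highest


if __name__ == "__main__":
    # Subject coordinates of rows 5 and 6 as quoted in the text.
    spans = [(10039, 1), (3980754, 3865712)]
    print(uncovered_ends(spans, subject_len=3980777))  # (0, 23)
```

The head of the contig is reached, but 23 bases at the tail (3980755 to 3980777) are left without an alignment, which is why the pair of hits is not accepted as evidence of circularization.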
Claims (2)
1. A method for circularization analysis of a genome, comprising the following steps:
step one: running the HGAP software, wherein the Genome Length parameter is set close to the size of the reference genome and, together with the other parameters, is used to assemble the PacBio third-generation sequencing data to obtain an assembled sequence;
step two: self-correcting the PacBio third-generation data with the canu software to obtain corrected sequences;
step three: taking the 50 kb regions cut from the two ends of each contig in the assembly result of step one as template sequences for screening;
step four: running the minimap program to align the corrected sequences from step two to the template sequences from step three, and screening out the overlapping sequences, namely the corrected sequences that overlap the outer edge of a template sequence;
step five: running the canu software to assemble the overlapping sequences screened in step four, obtaining contig sequences;
step six: running the blast software to align the contig sequences from step five to the assembled sequence from step one and find the overlapping regions, obtaining an alignment result;
step seven: determining the relationships between the contigs from the alignment result of step six and joining them to obtain joined contigs;
step eight: taking the joined contigs as input and repeating steps three to seven until the sequence forms a circle;
step nine: running the makeblastdb software with bacterial dnaA gene sequences as input to build a database, and aligning the circularized sequence from step eight to the bacterial dnaA gene sequences with blast to obtain alignments;
step ten: selecting, from the alignments of step nine, the result with the highest identity, similarity and coverage values to determine the starting site.
2. The method of claim 1, wherein step nine is specifically:
downloading all bacterial dnaA gene sequences in the NCBI database in batches with EDirect, the NCBI website's own software;
running the makeblastdb software to build a database from the downloaded gene sequences;
running the blast software to align the circularized sequence to the dnaA gene database.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010484602.4A CN111816249B (en) | 2020-06-01 | 2020-06-01 | Cyclization analysis method of genome |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111816249A | 2020-10-23 |
CN111816249B | 2023-12-08 |
Family
ID=72848167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010484602.4A (Active), CN111816249B (en) | Cyclization analysis method of genome | 2020-06-01 | 2020-06-01 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111816249B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104531848A (en) * | 2014-12-11 | 2015-04-22 | 杭州和壹基因科技有限公司 | Method and system for assembling genome sequence |
CN105989249A (en) * | 2014-09-26 | 2016-10-05 | 叶承羲 | Method, system and device for assembling genomic sequence |
CN106055925A (en) * | 2016-05-24 | 2016-10-26 | 中国水产科学研究院 | Method and apparatus for assembling genome sequence based on transcriptome paired-end sequencing data |
CN106778060A (en) * | 2016-10-09 | 2017-05-31 | 南京双运生物技术有限公司 | A kind of utilization prokaryotic gene group high-quality sketch completes the method for figure |
CN111199772A (en) * | 2019-12-27 | 2020-05-26 | 上海派森诺生物科技股份有限公司 | PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190385703A1 (en) * | 2016-06-08 | 2019-12-19 | The Broad Institute, Inc. | Linear genome assembly from three dimensional genome structure |
WO2019005913A1 (en) * | 2017-06-28 | 2019-01-03 | Icahn School Of Medicine At Mount Sinai | Methods for high-resolution microbiome analysis |
Non-Patent Citations (2)
Title |
---|
Creating a functional single-chromosome yeast; Yangyang Shao et al.; Nature; Vol. 560 (No. 331); full text * |
Long-reads reveal that Rhododendron delavayi plastid genome contains extensive repeat sequences, and recombination exists among plastid genomes of photosynthetic Ericaceae; Huie Li et al.; PeerJ; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN111816249A (en) | 2020-10-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||