CN111816249B - Cyclization analysis method of genome - Google Patents

Info

Publication number
CN111816249B
CN111816249B CN202010484602.4A CN202010484602A CN111816249B CN 111816249 B CN111816249 B CN 111816249B CN 202010484602 A CN202010484602 A CN 202010484602A CN 111816249 B CN111816249 B CN 111816249B
Authority
CN
China
Prior art keywords
sequence
software
sequences
splicing
genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010484602.4A
Other languages
Chinese (zh)
Other versions
CN111816249A (en
Inventor
崔天一
张海焕
姜丽荣
孙子奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Personal Biotechnology Co ltd
Original Assignee
Shanghai Personal Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Personal Biotechnology Co ltd
Priority to CN202010484602.4A
Publication of CN111816249A
Application granted
Publication of CN111816249B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a genome circularization (cyclization) analysis method comprising the following steps: first, the sequencing data are assembled and, separately, self-corrected; template sequences are screened from the assembly result, and the self-corrected sequences are aligned to the template sequences to obtain overlapping sequences; the overlapping sequences are assembled as input to obtain contig sequences, which are then aligned to the assembled sequence, and the contigs are joined according to the alignment result until the genome forms a circle; finally, the start site is determined by aligning the circularized sequence against bacterial dnaA gene sequences. The method automatically determines the relationships and junction points between contigs, so that the contigs can be joined and the whole genome circularized, and it identifies the origin of replication or the start gene so that the genome start point can be adjusted.

Description

Cyclization analysis method of genome
Technical Field
The invention relates to the field of biological analysis, in particular to a method for genome circularization analysis.
Background
Bacterial genomes are generally circular, so assembly requires circularizing the genome and adjusting its start point.
At present, a finished (complete) bacterial genome map is generally assembled from third-generation sequencing data and corrected with second-generation sequencing data.
Because the genome itself is a closed circle joined end to end, the approach generally used at present is to assemble the data with several different assembly programs. Since the start points of the assemblies produced by different programs will most likely differ, the resulting sequences are subjected to collinearity analysis to find overlapping regions, and the sequences are then circularized according to those overlaps.
To find the genome start point, genes are predicted after assembly and then annotated to locate dnaA, which is a cumbersome procedure.
Disclosure of Invention
To overcome the above drawbacks of the prior art, the object of the present invention is to provide a genome circularization analysis method that automatically determines the relationships and junction points between contigs, so that the contigs can be joined and the whole genome circularized, and that identifies the origin of replication or the start gene so that the genome start point can be adjusted.
To achieve this object, the following technical scheme is adopted:
A method for circularization analysis of a genome, comprising the following steps:
Step one: run HGAP software, setting the Genome Length parameter to approximately the size of the reference genome, and with the other parameters assemble the PacBio third-generation sequencing data to obtain an assembled sequence;
Step two: self-correct the PacBio third-generation data with canu software to obtain corrected sequences;
Step three: cut the 50 kb regions from both ends of each contig in the assembly result of step one and use them as template sequences for screening;
Step four: run the minimap program to align the corrected sequences from step two to the template sequences from step three, and screen out the overlapping sequences, i.e. the corrected sequences that overlap the outer edge of a template sequence;
Step five: run canu software to assemble the overlapping sequences screened in step four as input, obtaining contig sequences;
Step six: run blast software to align the contig sequences from step five to the assembled sequence from step one and find overlapping regions, obtaining an alignment result;
Step seven: determine the relationships between contigs from the alignment result of step six and join them to obtain joined contigs; because the sequences produced in step five are assembled from data at the contig edges, the search for relationships between contigs is more targeted and the success rate is higher;
Step eight: take the joined contigs as input and repeat steps three to seven until the genome forms a circle;
Step nine: run makeblastdb software with bacterial dnaA gene sequences as input to build a database, and use blast to align the circularized sequence from step eight against the bacterial dnaA gene sequences to obtain alignments;
Step ten: from the alignments of step nine, select the result with the highest identity, similarity and coverage values to determine the start site.
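Step three can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the patented implementation: the function name `end_templates`, the in-memory dictionary of contigs, and the rule of keeping a short contig whole are hypothetical choices made for this example. Only the 50 kb end size comes from the method.

```python
from typing import Dict


def end_templates(contigs: Dict[str, str], size: int = 50_000) -> Dict[str, str]:
    """Return the end regions of each contig for use as screening templates.

    contigs maps a contig name to its sequence. For each contig the first
    and last `size` bases are returned under the keys <name>_left and
    <name>_right.
    """
    templates: Dict[str, str] = {}
    for name, seq in contigs.items():
        if len(seq) <= 2 * size:
            # Assumption: when the two end regions would overlap, the whole
            # contig is kept as a single template.
            templates[f"{name}_whole"] = seq
            continue
        templates[f"{name}_left"] = seq[:size]
        templates[f"{name}_right"] = seq[-size:]
    return templates


# Toy example with a 5-base end size instead of 50 kb.
toy = {"contig1": "A" * 10 + "C" * 10 + "G" * 10}
print(end_templates(toy, size=5))
```

With real data the contigs would be read from the HGAP FASTA output and the templates written back to a FASTA file for the alignment in step four.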
In a preferred embodiment of the present invention, step nine is specifically:
download in batch all bacterial dnaA gene sequences in the NCBI database using the EDirect (Entrez Direct) software provided by NCBI;
run makeblastdb software to build a database with the downloaded gene sequences as input;
run blast software to align the circularized sequence against the dnaA gene database.
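The database construction and alignment in this embodiment could be driven from Python roughly as follows. The sketch only builds the command lines. The file names are hypothetical, and the choice of `blastn` with tabular output (`-outfmt 6`) is an assumption, since the text says only that blast software is run.

```python
from typing import List


def makeblastdb_cmd(dnaa_fasta: str, db_name: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from dnaA sequences."""
    return ["makeblastdb", "-in", dnaa_fasta, "-dbtype", "nucl", "-out", db_name]


def blastn_cmd(genome_fasta: str, db_name: str, out_tsv: str) -> List[str]:
    """Command that aligns the circularized genome against the dnaA database."""
    return [
        "blastn",
        "-query", genome_fasta,
        "-db", db_name,
        "-outfmt", "6",  # tabular output, one alignment per line
        "-out", out_tsv,
    ]


# Hypothetical file names, used here only for illustration.
print(" ".join(makeblastdb_cmd("dnaA.fasta", "dnaA_db")))
print(" ".join(blastn_cmd("circular_genome.fasta", "dnaA_db", "dnaA_hits.tsv")))
```

On a machine where BLAST+ is installed, each returned list can be passed to `subprocess.run(cmd, check=True)`.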
With the above steps, the invention has the following beneficial effects:
the method automatically determines the relationships and junction points between contigs, so that the contigs can be joined and the whole genome circularized, and it identifies the origin of replication or the start gene so that the genome start point can be adjusted.
Drawings
FIG. 1 is a flow chart of the present invention.
FIG. 2 shows the assembly result for the end regions in the present invention.
FIG. 3 shows the alignment result used to find overlapping regions in the present invention.
FIG. 4 shows the alignment result used to find the start point in the present invention.
FIG. 5 shows the canu assembly result in the comparative example.
FIG. 6 shows the result of searching for overlapping regions in the comparative example.
Detailed Description
Example 1:
Circularization workflow:
1. The Bacillus velezensis genome is about 4 Mb. HGAP was run with the Genome Length parameter set to 4 Mb, and assembly produced a contig 3,980,777 bp long; whether it could be circularized had not yet been determined.
2. canu software was run to self-correct the PacBio third-generation data; after correction, unreliable portions were trimmed off to obtain high-quality sequences.
3. In the assembly result from step 1, the 50 kb regions at both ends of the contig were taken as templates for screening sequences.
4. minimap was run to align the high-quality sequences from step 2 to the template sequences from step 3, and the high-quality sequences that overlap the outer edge of a template sequence were screened out.
5. canu software was run to assemble the sequences screened in step 4 as input. The assembly result is shown in FIG. 2.
6. blast software was run to align the sequences produced in step 5 to the contig sequence obtained in step 1 and find overlapping regions. The alignment result is shown in FIG. 3.
7. The alignment result of step 6 shows that sequence tig00000001 produced in step 5 joins the head and tail of the contig obtained in step 1 and fills a 1 bp gap, so circularization succeeded.
Because the sequences produced in step 5 are assembled from data at the contig edges, the search for relationships between contigs is more targeted and the success rate is higher.
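The screening in step 4 can be sketched as a filter over alignment records. This is a simplified illustration, not the screening rule used by the inventors: the field names loosely follow PAF-style alignment columns as an assumption, the tolerance and minimum-overhang thresholds are illustrative values not given in the text, and only forward-strand alignments are handled.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    read: str        # name of the corrected read
    read_len: int    # total read length
    read_start: int  # 0-based start of the aligned part on the read
    read_end: int    # end (exclusive) of the aligned part on the read
    tmpl_len: int    # template length, e.g. 50,000
    tmpl_start: int  # 0-based start of the aligned part on the template
    tmpl_end: int    # end (exclusive) of the aligned part on the template
    side: str        # "left" or "right" end of the contig


def overhangs_outer_edge(hit: Hit, tol: int = 100, min_overhang: int = 500) -> bool:
    """True if the read reaches the template's outer edge and extends past it."""
    if hit.side == "left":
        # Outer edge of a left-end template is template position 0.
        reaches_edge = hit.tmpl_start <= tol
        overhang = hit.read_start
    else:
        # Outer edge of a right-end template is the template's last base.
        reaches_edge = hit.tmpl_len - hit.tmpl_end <= tol
        overhang = hit.read_len - hit.read_end
    return reaches_edge and overhang >= min_overhang


hits = [
    Hit("read1", 10_000, 2_000, 10_000, 50_000, 0, 8_000, "left"),
    Hit("read2", 10_000, 0, 10_000, 50_000, 20_000, 30_000, "left"),
    Hit("read3", 10_000, 0, 7_000, 50_000, 43_000, 50_000, "right"),
]
print([h.read for h in hits if overhangs_outer_edge(h)])  # ['read1', 'read3']
```

Reads kept by this filter carry sequence beyond the contig end, which is what allows the assembly in step 5 to bridge the two ends.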
Start-point adjustment workflow:
1. All bacterial dnaA gene sequences in the NCBI database were downloaded in batch using the EDirect (Entrez Direct) software provided by NCBI.
2. makeblastdb software was run to build a database with the gene sequences downloaded in step 1 as input.
3. blast software was run to align the circularized sequence against the dnaA gene database. The alignment result is shown in FIG. 4.
4. From the alignment result in FIG. 4, the hit whose target-sequence start [S2] is position 1 and whose Identity [% IDY] and Similarity [% SIM] values are highest was selected as the best alignment. Row 11 of the result is the best hit, so base 26,007 in [S1] was taken as the position of the dnaA gene in the genome and as the genome start point.
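The selection rule in step 4 and the subsequent start-point adjustment can be sketched in Python. The sketch rests on several assumptions: the dictionary keys `s1`, `s2`, `idy` and `sim` are hypothetical names for the [S1], [S2], [% IDY] and [% SIM] columns, the hit values are invented for illustration, ties are broken by identity first and similarity second, and the dnaA hit is assumed to lie on the forward strand (a reverse-strand hit would also require reverse-complementing the genome).

```python
from typing import Dict, List


def pick_start(hits: List[Dict[str, float]]) -> int:
    """Return the 1-based genome position [S1] of the best dnaA hit.

    Only hits that begin at position 1 of the dnaA sequence ([S2] == 1) are
    considered; among them the one with the highest identity, then the
    highest similarity, is chosen.
    """
    candidates = [h for h in hits if h["s2"] == 1]
    if not candidates:
        raise ValueError("no alignment starts at position 1 of a dnaA sequence")
    best = max(candidates, key=lambda h: (h["idy"], h["sim"]))
    return int(best["s1"])


def rotate(genome: str, start: int) -> str:
    """Rotate a circular genome so that 1-based position `start` comes first."""
    i = start - 1
    return genome[i:] + genome[:i]


# Invented hit values for illustration only.
hits = [
    {"s1": 100, "s2": 5, "idy": 99.0, "sim": 99.5},
    {"s1": 26007, "s2": 1, "idy": 98.0, "sim": 99.0},
    {"s1": 300, "s2": 1, "idy": 90.0, "sim": 95.0},
]
print(pick_start(hits))         # 26007
print(rotate("ACGTACGGTT", 4))  # TACGGTTACG
```

Because the genome is circular, rotation changes only the coordinate origin; no sequence is added or lost.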
The main feature of the invention is as follows: the assembled sequence is linear, and the method joins the two ends of this linear sequence to form a circle.
The existing method, by contrast, assembles the data with several different assembly programs; since the start points of the assemblies produced by different programs will most likely differ, the sequences are then subjected to collinearity analysis to find overlapping regions and are circularized according to those overlaps.
Comparative example 1:
Circularization workflow:
1. The Bacillus velezensis genome is about 4 Mb. HGAP was run with the Genome Length parameter set to 4 Mb, and assembly produced a contig 3,980,777 bp long; whether it could be circularized had not yet been determined.
2. canu software was run for de novo assembly of the PacBio third-generation data. The canu assembly result is shown in FIG. 5; 31 contigs were assembled in total. The conventional approach is to compare the assembly results of the two programs (results from other assembly software could also be used here).
3. blast software was used to align the contigs produced in step 2 to the contig sequence obtained in step 1 and find overlapping regions. Part of the alignment result is shown in FIG. 6.
As FIG. 6 shows, rows 5 and 6 align positions 1-10036 of tig00000015 to positions 10039-1 of 000000F|arrow, and positions 10061-125104 of tig00000015 to positions 3980754-3865712 of 000000F|arrow. Because the region at the very end of 000000F|arrow, positions 3980755-3980777, is not aligned, tig00000015 does not completely cover the head-tail junction of 000000F|arrow and cannot serve as evidence of circularization.
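The coverage argument can be checked with simple arithmetic. The following Python sketch uses the coordinates reported for rows 5 and 6 of FIG. 6 on the 3,980,777 bp HGAP contig; the function name `uncovered_ends` is a hypothetical helper introduced for this illustration.

```python
from typing import Iterable, Tuple


def uncovered_ends(contig_len: int, spans: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (head_gap, tail_gap): bases at each contig end covered by no span.

    Spans are 1-based and inclusive and may be given in either orientation,
    as in alignment output for reverse-strand hits.
    """
    normalized = [(min(a, b), max(a, b)) for a, b in spans]
    head_gap = min(lo for lo, _ in normalized) - 1
    tail_gap = contig_len - max(hi for _, hi in normalized)
    return head_gap, tail_gap


head_gap, tail_gap = uncovered_ends(3_980_777, [(10039, 1), (3980754, 3865712)])
print(head_gap, tail_gap)  # 0 23
```

The head of the contig is covered from position 1, but 23 bases at the tail (positions 3980755 to 3980777) are not, which is why this alignment does not bridge the junction.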

Claims (2)

1. A method for circularization analysis of a genome, comprising the following steps:
Step one: run HGAP software, setting the Genome Length parameter to approximately the size of the reference genome, and with the other parameters assemble the PacBio third-generation sequencing data to obtain an assembled sequence;
Step two: self-correct the PacBio third-generation data with canu software to obtain corrected sequences;
Step three: cut the 50 kb regions from both ends of each contig in the assembly result of step one and use them as template sequences for screening;
Step four: run the minimap program to align the corrected sequences from step two to the template sequences from step three, and screen out the overlapping sequences, i.e. the corrected sequences that overlap the outer edge of a template sequence;
Step five: run canu software to assemble the overlapping sequences screened in step four as input, obtaining contig sequences;
Step six: run blast software to align the contig sequences from step five to the assembled sequence from step one and find overlapping regions, obtaining an alignment result;
Step seven: determine the relationships between contigs from the alignment result of step six and join them to obtain joined contigs;
Step eight: take the joined contigs as input and repeat steps three to seven until the genome forms a circle;
Step nine: run makeblastdb software with bacterial dnaA gene sequences as input to build a database, and use blast to align the circularized sequence from step eight against the bacterial dnaA gene sequences to obtain alignments;
Step ten: from the alignments of step nine, select the result with the highest identity, similarity and coverage values to determine the start site.
2. The method of claim 1, wherein step nine is specifically:
download in batch all bacterial dnaA gene sequences in the NCBI database using the EDirect (Entrez Direct) software provided by NCBI;
run makeblastdb software to build a database with the downloaded gene sequences as input;
run blast software to align the circularized sequence against the dnaA gene database.
CN202010484602.4A 2020-06-01 2020-06-01 Cyclization analysis method of genome Active CN111816249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010484602.4A CN111816249B (en) 2020-06-01 2020-06-01 Cyclization analysis method of genome

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010484602.4A CN111816249B (en) 2020-06-01 2020-06-01 Cyclization analysis method of genome

Publications (2)

Publication Number Publication Date
CN111816249A (en) 2020-10-23
CN111816249B (en) 2023-12-08

Family

ID=72848167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010484602.4A Active CN111816249B (en) 2020-06-01 2020-06-01 Cyclization analysis method of genome

Country Status (1)

Country Link
CN (1) CN111816249B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
CN105989249A (en) * 2014-09-26 2016-10-05 叶承羲 Method, system and device for assembling genomic sequence
CN106055925A (en) * 2016-05-24 2016-10-26 中国水产科学研究院 Method and apparatus for assembling genome sequence based on transcriptome paired-end sequencing data
CN106778060A (en) * 2016-10-09 2017-05-31 南京双运生物技术有限公司 A kind of utilization prokaryotic gene group high-quality sketch completes the method for figure
CN111199772A (en) * 2019-12-27 2020-05-26 上海派森诺生物科技股份有限公司 PEDV (porcine epidemic diarrhea Virus) genome analysis method based on next generation sequencing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190385703A1 (en) * 2016-06-08 2019-12-19 The Broad Institute, Inc. Linear genome assembly from three dimensional genome structure
WO2019005913A1 (en) * 2017-06-28 2019-01-03 Icahn School Of Medicine At Mount Sinai Methods for high-resolution microbiome analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Creating a functional single-chromosome yeast; Yangyang Shao et al.; Nature; Vol. 560 (No. 331); full text *
Long-reads reveal that Rhododendron delavayi plastid genome contains extensive repeat sequences, and recombination exists among plastid genomes of photosynthetic Ericaceae; Huie Li et al.; PeerJ; full text *

Also Published As

Publication number Publication date
CN111816249A (en) 2020-10-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant