CN111816249B - Cyclization analysis method of genome

Info

Publication number: CN111816249B
Application number: CN202010484602.4A
Authority: CN
Inventors: 崔天一; 张海焕; 姜丽荣; 孙子奎
Original assignee: Shanghai Personal Biotechnology Co ltd
Current assignee: Shanghai Personal Biotechnology Co ltd
Priority date: 2020-06-01
Filing date: 2020-06-01
Publication date: 2023-12-08
Anticipated expiration: 2040-06-01
Also published as: CN111816249A

Abstract

The invention discloses a genome circularization analysis method comprising the following steps: first assembling the data and performing self-correction; selecting template sequences from the assembly result; aligning the self-corrected sequences to the template sequences to obtain overlapping sequences; assembling the overlapping sequences as input to obtain contig sequences; aligning these contig sequences to the assembled sequence to obtain an alignment result; joining the contigs according to the alignment result to form a circle; and determining the start site by aligning the circularized sequence against bacterial dnaA gene sequences. The method can automatically determine the relationships and junction points between contigs, thereby joining the contigs and circularizing the whole genome, and can identify the origin of replication or the starting gene so as to adjust the start point of the genome.

Description

Cyclization analysis method of genome

Technical Field

The invention relates to the field of biological analysis, in particular to a cyclization analysis method of genome.

Background

Bacterial genomes are generally circular, so the assembly process requires circularizing the genome and adjusting its start point.

At present, a complete bacterial genome map is generally produced by assembling third-generation sequencing data and correcting the result with second-generation sequencing data.

Because the genome itself is a closed circle joined end to end, the approach generally used at present is to assemble the data with different assembly software. Since the start points of the results produced by different software will most likely differ, the sequences are subjected to collinearity analysis, the overlapping region is located, and the sequences are then circularized according to that overlapping region.

To find the genome start point, gene prediction is performed after assembly, followed by annotation to determine the position of dnaA, which makes the operation cumbersome.

Disclosure of Invention

In order to overcome the above-mentioned drawbacks of the prior art, an object of the present invention is to provide a genome circularization analysis method that can automatically determine the relationships and junction points between contigs, thereby joining the contigs and circularizing the whole genome, and that can identify the origin of replication or the starting gene so as to adjust the start point of the genome.

In order to achieve the purpose of the invention, the technical scheme adopted is as follows:

a method of circularization analysis of a genome, comprising the steps of:

step one: running HGAP software, with the Genome Length parameter set to approximately the size of the reference genome, and using the other parameters to assemble the PacBio third-generation sequencing data, so as to obtain an assembled sequence;

step two: performing self-correction on the PacBio third-generation data with canu software to obtain corrected sequences;

step three: taking the 50 kb regions at the two ends of each contig in the assembly result of step one as template sequences for screening;

step four: running the minimap program to align the corrected sequences from step two to the template sequences from step three, and screening out the overlapping sequences, i.e. the corrected sequences that overlap the outer edge of a template sequence;

step five: running canu software to assemble the overlapping sequences screened in step four as input, so as to obtain contig sequences;

step six: running blast software to align the contig sequences obtained in step five to the assembled sequence from step one and find the overlapping regions, so as to obtain an alignment result;

step seven: according to the alignment result of step six, determining the relationships among the contigs and then joining them to obtain joined contigs; because the sequences generated in step five are assembled from data at the contig edges, the search for relationships among contigs is more targeted and the success rate is higher;

step eight: taking the joined contigs as input and repeating steps three to seven until a circle is formed;

step nine: running makeblastdb software with bacterial dnaA gene sequences as input to construct a database, and aligning the circularized sequence from step eight against the bacterial dnaA gene sequences with blast to obtain alignment results;

step ten: selecting, from the alignment results of step nine, the result with the highest identity, similarity and coverage values to determine the start site.
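By way of illustration only, steps three and four can be sketched in Python as follows. The sketch assumes that the aligner output is available in PAF format (the tab-separated format written by minimap and minimap2), that the templates are named with hypothetical `_L` and `_R` suffixes, and that the tolerance and minimum-overhang thresholds are arbitrary example values; none of these details is stated in the original text.

```python
from typing import Dict, NamedTuple

END_LEN = 50_000  # 50 kb end regions, as in step three


def end_templates(contigs: Dict[str, str], end_len: int = END_LEN) -> Dict[str, str]:
    """Return the left and right end regions of each contig as templates.

    The "_L" / "_R" suffixes are a hypothetical naming convention.
    """
    templates = {}
    for name, seq in contigs.items():
        templates[f"{name}_L"] = seq[:end_len]
        templates[f"{name}_R"] = seq[-end_len:]
    return templates


class PafHit(NamedTuple):
    qname: str
    qlen: int
    qstart: int  # 0-based, inclusive, on the original read strand
    qend: int    # 0-based, exclusive
    strand: str  # "+" or "-"
    tname: str
    tlen: int
    tstart: int
    tend: int


def parse_paf_line(line: str) -> PafHit:
    """Parse the first nine mandatory PAF columns."""
    f = line.rstrip("\n").split("\t")
    return PafHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                  f[5], int(f[6]), int(f[7]), int(f[8]))


def overhangs_outer_edge(hit: PafHit, tol: int = 100, min_overhang: int = 1000) -> bool:
    """True if the read reaches the outer edge of a template and extends past it.

    tol and min_overhang are assumed example thresholds.
    """
    before = hit.qstart            # unaligned read bases before the alignment
    after = hit.qlen - hit.qend    # unaligned read bases after the alignment
    if hit.strand == "-":
        # On the reverse strand the read's tail lies on the template's start side.
        before, after = after, before
    if hit.tname.endswith("_L"):   # outer edge is the template start
        return hit.tstart <= tol and before >= min_overhang
    if hit.tname.endswith("_R"):   # outer edge is the template end
        return hit.tlen - hit.tend <= tol and after >= min_overhang
    return False
```

Reads for which `overhangs_outer_edge` returns true would then be collected as the input to the assembly of step five.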

In a preferred embodiment of the present invention, the step nine is specifically:

using the EDirect software provided by the NCBI website to download in batches all bacterial dnaA gene sequences in the NCBI database;

running makeblastdb software to construct a database with the downloaded gene sequences as input;

running blast software to align the circularized sequence against the dnaA gene database.
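The database construction and alignment described above might be driven from Python as sketched below. This is an assumption-laden illustration: the text does not say which blast program or output format was used, so nucleotide `blastn` with tabular output (`-outfmt 6`) is assumed, and all file names are hypothetical.

```python
import subprocess
from typing import List


def makeblastdb_cmd(dnaa_fasta: str, db_prefix: str) -> List[str]:
    """Command that builds a nucleotide BLAST database from dnaA sequences."""
    return ["makeblastdb", "-in", dnaa_fasta, "-dbtype", "nucl", "-out", db_prefix]


def blastn_cmd(genome_fasta: str, db_prefix: str, out_tsv: str) -> List[str]:
    """Command that aligns the circularized genome against the dnaA database.

    blastn and tabular output are assumptions, not stated in the source.
    """
    return ["blastn", "-query", genome_fasta, "-db", db_prefix,
            "-outfmt", "6", "-out", out_tsv]


def run(cmd: List[str]) -> None:
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)
```

With BLAST+ installed, calling `run(makeblastdb_cmd("dnaA.fasta", "dnaA_db"))` followed by `run(blastn_cmd("genome.fasta", "dnaA_db", "hits.tsv"))` would perform the two operations in order.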

By virtue of the above steps, the beneficial effects of the invention are as follows:

the method of the invention can automatically determine the relationships and junction points between contigs, thereby joining the contigs and circularizing the whole genome, and can identify the origin of replication or the starting gene so as to adjust the start point of the genome.

Drawings

FIG. 1 is a flow chart of the present invention.

FIG. 2 shows the assembly result for the end regions in the present invention.

FIG. 3 shows the alignment result used to find overlapping regions in the present invention.

FIG. 4 shows the alignment result used to find the start point in the present invention.

FIG. 5 shows the assembly result of the canu software in the comparative example.

FIG. 6 shows the result of finding overlapping regions in the comparative example.

Detailed Description

Example 1:

Circularization procedure:

1. The genome of Bacillus velezensis is about 4 Mb. HGAP was run with the Genome Length parameter set to 4m, and assembly yielded a contig 3,980,777 bp in length; whether it formed a circle could not be determined.

2. canu software was run to self-correct the PacBio third-generation data, and unreliable portions were trimmed after correction to obtain high-quality sequences.

3. From the assembly result obtained in step (1), the 50 kb regions at the two ends of the contig were taken as templates for screening sequences.

4. minimap was run to align the high-quality sequences obtained in step (2) to the template sequences obtained in step (3), and the high-quality sequences that overlap the outer edge of a template sequence were screened out.

5. canu software was run to assemble the sequences screened in step (4) as input. The assembly result is shown in FIG. 2.

6. blast software was run to align the sequences generated in step (5) to the contig sequence obtained in step (1) and to find the overlapping regions. The alignment result is shown in FIG. 3.

7. From the alignment result of step (6), the sequence tig00000001 generated in step (5) was found to join the head and tail of the contig generated in step (1) and to fill a 1 bp gap, so circularization succeeded.

Because the sequence generated in step (5) is assembled from data at the contig edges, the search for relationships between contigs is more targeted and the success rate is higher.
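The junction reasoning of step 7 can be illustrated with the following minimal Python sketch. It assumes that both blast hits lie on the forward strand, that coordinates are 1-based and inclusive, and that the hit on the contig tail precedes the hit on the contig head along the bridging sequence; the `Hit` type and function names are hypothetical and are not part of the original method.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    qstart: int  # start on the bridging sequence (1-based, inclusive)
    qend: int    # end on the bridging sequence
    sstart: int  # start on the main contig
    send: int    # end on the main contig


def junction_gap(tail_hit: Hit, head_hit: Hit, contig_len: int) -> Optional[int]:
    """Number of bridging bases lying between the contig's last and first base.

    Returns None if the bridge does not reach both contig termini.
    A negative value means the contig ends overlap by that many bases.
    """
    if tail_hit.send != contig_len or head_hit.sstart != 1:
        return None
    return head_hit.qstart - tail_hit.qend - 1


def circularize(contig: str, bridge: str, tail_hit: Hit, head_hit: Hit) -> Optional[str]:
    """Return the contig closed into one circular unit, or None without evidence."""
    gap = junction_gap(tail_hit, head_hit, len(contig))
    if gap is None:
        return None
    if gap >= 0:
        # Fill the gap with the bridging bases between the two hits.
        return contig + bridge[tail_hit.qend:tail_hit.qend + gap]
    # Ends overlap: drop the duplicated bases from the contig tail.
    return contig[:gap]
```

In the example above, a result equivalent to `junction_gap(...) == 1` corresponds to the 1 bp gap that tig00000001 fills.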

Start-point adjustment procedure:

1. All bacterial dnaA gene sequences in the NCBI database were downloaded in batches using the EDirect software provided by the NCBI website.

2. makeblastdb software was run to construct a database with the gene sequences downloaded in step (1) as input.

3. blast software was run to align the circularized sequence against the dnaA gene database. The alignment result is shown in FIG. 4.

4. From the alignment results in FIG. 4, the alignment whose start position on the target sequence [S2] is 1 and whose Identities [% IDY] and Similarity [% SIM] values are highest was selected as the best alignment. Row 11 of the alignment results is the best, so base 26007 of [S1] was taken as the position of the dnaA gene in the genome and as the start point of the genome.
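As a hedged illustration of step 4, the selection of the best alignment and the subsequent re-origin of the circular sequence could look like the Python sketch below. The field names mirror the [S1], [S2], [% IDY] and [% SIM] columns mentioned above; the sketch assumes the dnaA hit lies on the forward strand of the genome, and the type and function names are hypothetical.

```python
from typing import List, NamedTuple


class DnaAHit(NamedTuple):
    s1: int     # alignment start on the genome, 1-based ([S1])
    s2: int     # alignment start on the dnaA target sequence ([S2])
    idy: float  # percent identity ([% IDY])
    sim: float  # percent similarity ([% SIM])


def best_dnaa_hit(hits: List[DnaAHit]) -> DnaAHit:
    """Keep hits that begin at base 1 of the dnaA target, then take the
    one with the highest identity, using similarity as the tie-breaker."""
    candidates = [h for h in hits if h.s2 == 1]
    if not candidates:
        raise ValueError("no alignment starts at position 1 of a dnaA sequence")
    return max(candidates, key=lambda h: (h.idy, h.sim))


def rotate_to_start(genome: str, start: int) -> str:
    """Rotate a circular genome so that 1-based position `start` comes first."""
    i = start - 1
    return genome[i:] + genome[:i]
```

For the example above, `rotate_to_start(genome, 26007)` would place base 26007 at the beginning of the output sequence. A hit on the reverse strand would additionally require reverse-complementing the genome, which this sketch does not handle.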

The main feature of the invention is as follows: the sequence assembled by the invention is linear, and the two ends of this linear sequence are joined to form a circle.

The existing method is to assemble the data with different assembly software. Because the start points of the results produced by different software will most likely differ, the sequences are subjected to collinearity analysis, the overlapping region is located, and the sequences are then circularized according to that overlapping region.

Comparative example 1:

Circularization procedure:

1. The genome of Bacillus velezensis is about 4 Mb. HGAP was run with the Genome Length parameter set to 4m, and assembly yielded a contig 3,980,777 bp in length; whether it formed a circle could not be determined.

2. canu software was run to perform de novo assembly of the PacBio third-generation data. The canu assembly result is shown in FIG. 5; 31 contigs were assembled in total. The conventional approach is to compare the assembly results of the two programs (results from other assembly software may also be used here).

3. blast software was used to align the contigs generated in step (2) to the contig sequence obtained in step (1) and to find overlapping regions. Part of the alignment results is shown in FIG. 6.

As can be seen from rows 5 and 6 of FIG. 6, positions 1-10036 of tig00000015 align to positions 10039-1 of 000000F|arrow, and positions 10061-125104 of tig00000015 align to positions 3980754-3865712 of 000000F|arrow. Because the region at the end of 000000F|arrow, positions 3980777-3980755, is not aligned, tig00000015 cannot completely cover the head-to-tail region of 000000F|arrow and cannot serve as evidence of circularization.
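The coverage argument of the comparative example can be checked with a few lines of Python. The sketch below takes the contig length and the aligned intervals quoted above and counts the terminal bases left uncovered; the function names and the zero-tolerance default are assumptions made for illustration.

```python
from typing import List, Tuple


def uncovered_termini(contig_len: int, intervals: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Return (bases uncovered at the contig head, bases uncovered at the tail).

    Intervals are 1-based, inclusive, and may be given in either orientation.
    """
    if not intervals:
        return contig_len, contig_len
    spans = [(min(a, b), max(a, b)) for a, b in intervals]
    head = min(start for start, _ in spans) - 1
    tail = contig_len - max(end for _, end in spans)
    return head, tail


def supports_circularization(contig_len: int, intervals: List[Tuple[int, int]],
                             tol: int = 0) -> bool:
    """True only if the alignments reach both termini within `tol` bases."""
    head, tail = uncovered_termini(contig_len, intervals)
    return head <= tol and tail <= tol


# Coordinates on the HGAP contig taken from the comparative example.
hgap_len = 3_980_777
tig15_on_hgap = [(10039, 1), (3980754, 3865712)]
print(uncovered_termini(hgap_len, tig15_on_hgap))         # (0, 23)
print(supports_circularization(hgap_len, tig15_on_hgap))  # False
```

The 23 uncovered bases at the tail correspond to positions 3980755 to 3980777, which is why tig00000015 is rejected as evidence of circularization.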

Claims

1. A method for circularization analysis of a genome, comprising the steps of:

step seven: according to the comparison result in the step six, searching the relation between the contigs and then connecting to obtain the connected contigs;

2. The method of claim 1, wherein the step nine is specifically: