CN103810402A

CN103810402A - Data processing method and device for genomes

Info

Publication number: CN103810402A
Application number: CN201410064832.XA
Authority: CN
Inventors: 江文恺; 占伟
Original assignee: Nuo Hezhi Source Beijing Bioinformation Science And Technology Ltd
Current assignee: Beijing Polytron Technologies Inc
Priority date: 2014-02-25
Filing date: 2014-02-25
Publication date: 2014-05-21
Anticipated expiration: 2034-02-25
Also published as: CN103810402B

Abstract

The invention discloses a data processing method and device for genomes. The data processing method comprises: carrying out a first comparison of the information of a target genome with the information of a reference genome to obtain a first comparison result; obtaining, from the first comparison result, the information of the genomic fragments that do not meet the comparison conditions; carrying out a second comparison of the information of those genomic fragments with the information of the reference genome to obtain a second comparison result; and obtaining the information of the distinguished sequences of the target genome from the second comparison result. The method and device solve the problem that accurate distinguished sequences are difficult to obtain in the related art.

Description

Data processing method and device for genomes

Technical field

The present invention relates to the field of data processing, and in particular to a data processing method and device for genomes.

Background art

Comparative genome analysis pursues two main directions: first, finding similar gene sequences in the genomes of different species in order to study the similar gene functions and mechanisms that may exist between species; second, finding similar and distinguished sequences across broader genomic regions between species in order to study the evolutionary history of species and the genome variation events that arise during evolution.

At present, in the related art, distinguished sequences between the genomes of species are found merely by comparing the genome protein sequences of the species under study with those of evolutionarily closely related species to obtain comparison information for the protein sequences between species, and then clustering that comparison information to obtain the distinguished sequences. Because a genome contains the sequences of other elements in addition to protein sequences, accurate distinguished sequences are difficult to obtain in this way.

In addition, because a genome carries a large amount of information, comparing genome protein sequences under the above scheme consumes a great deal of time and memory.

No effective solution has yet been proposed for the problem that accurate distinguished sequences are difficult to obtain in the related art.

Summary of the invention

The main purpose of the present invention is to provide a data processing method and device for genomes, so as to solve the problem that accurate distinguished sequences are difficult to obtain in the related art.

To achieve the above purpose, according to one aspect of the present invention, a data processing method for genomes is provided. The method comprises: carrying out a first comparison of the information of a target genome with the information of a reference genome to obtain a first comparison result; obtaining the information of the unaligned genomic fragments from the first comparison result; carrying out a second comparison of the information of the unaligned genomic fragments with the information of the reference genome to obtain a second comparison result; and obtaining the information of the distinguished sequences of the target genome from the second comparison result.

Further, carrying out the second comparison of the information of the unaligned genomic fragments with the information of the reference genome to obtain the second comparison result comprises: detecting whether repeated sequence information exists in the information of the unaligned genomic fragments; if repeated sequence information is detected, marking the repeated sequence information to obtain marked information; filtering the marked information out of the information of the unaligned genomic fragments to obtain filtered information; and comparing the filtered information with the information of the reference genome to obtain the second comparison result.

Further, the first comparison result comprises multiple homologous genome fragments, the homologous genome fragments being the aligned genomic fragments, and obtaining the information of the unaligned genomic fragments from the first comparison result comprises: filtering the multiple homologous genome fragments out of the first comparison result to obtain multiple unaligned genomic sub-fragments; sorting the unaligned genomic sub-fragments by their positions in the target genome to obtain a sequence of unaligned genomic sub-fragments; merging any two genomic sub-fragments that are adjacent in the sequence and overlap, to obtain a sequence comprising the merged unaligned genomic sub-fragments; and connecting all the genomic sub-fragments in that sequence to obtain the information of the unaligned genomic fragments.

Further, the second comparison result comprises multiple homologous genome fragments, and obtaining the information of the distinguished sequences of the target genome from the second comparison result comprises: extracting the multiple homologous genome fragments; sorting them by their positions in the target genome to obtain a sequence of homologous genome fragments; detecting whether any two homologous genome fragments adjacent in the sequence overlap; if an overlap is detected, merging the overlapping parts to obtain multiple merged homologous genome fragments; and filtering the information comprising the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequences of the target genome.

Further, before extracting the multiple homologous genome fragments, the data processing method also comprises: judging whether the length of the multiple genome fragments is greater than or equal to a preset length; if so, judging whether the similarity of the multiple genome fragments is greater than or equal to a preset similarity; if so, judging whether the comparison rate of the multiple genome fragments is greater than or equal to a preset comparison rate; and if so, taking the information of the multiple genome fragments as the information of the multiple homologous genome fragments.

To achieve the above purpose, according to another aspect of the present invention, a data processing device for genomes is provided. The device comprises: a first comparing unit, configured to carry out a first comparison of the information of a target genome with the information of a reference genome to obtain a first comparison result; a first acquiring unit, configured to obtain the information of the unaligned genomic fragments from the first comparison result; a second comparing unit, configured to carry out a second comparison of the information of the unaligned genomic fragments with the information of the reference genome to obtain a second comparison result; and a second acquiring unit, configured to obtain the information of the distinguished sequences of the target genome from the second comparison result.

Further, the second comparing unit comprises: a first detection module, configured to detect whether repeated sequence information exists in the information of the unaligned genomic fragments; a labeling module, configured to mark the repeated sequence information to obtain marked information if repeated sequence information is detected in the information of the unaligned genomic fragments; a first filtering module, configured to filter the marked information out of the information of the unaligned genomic fragments to obtain filtered information; and a comparing module, configured to compare the filtered information with the information of the reference genome to obtain the second comparison result.

Further, the first comparison result comprises multiple homologous genome fragments, the homologous genome fragments being the aligned genomic fragments, and the first acquiring unit comprises: a second filtering module, configured to filter the multiple homologous genome fragments out of the first comparison result to obtain multiple unaligned genomic sub-fragments; a first sorting module, configured to sort the unaligned genomic sub-fragments by their positions in the target genome to obtain a sequence of unaligned genomic sub-fragments; a first merging module, configured to merge any two genomic sub-fragments that are adjacent in the sequence and overlap, to obtain a sequence comprising the merged unaligned genomic sub-fragments; and a connecting module, configured to connect all the genomic sub-fragments in that sequence to obtain the information of the unaligned genomic fragments.

Further, the second comparison result comprises multiple homologous genome fragments, and the second acquiring unit comprises: an extraction module, configured to extract the multiple homologous genome fragments; a second sorting module, configured to sort the homologous genome fragments by their positions in the target genome to obtain a sequence of homologous genome fragments; a second detection module, configured to detect whether any two homologous genome fragments adjacent in the sequence overlap; a second merging module, configured to merge the overlapping parts to obtain multiple merged homologous genome fragments if an overlap is detected; and a third filtering module, configured to filter the information comprising the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequences of the target genome.

Further, the data processing device also comprises: a first judging module, configured to judge, before the multiple homologous genome fragments are extracted, whether the length of the multiple genome fragments is greater than or equal to a preset length; a second judging module, configured to judge whether the similarity of the multiple genome fragments is greater than or equal to a preset similarity if the length is greater than or equal to the preset length; a third judging module, configured to judge whether the comparison rate of the multiple genome fragments is greater than or equal to a preset comparison rate if the similarity is greater than or equal to the preset similarity; and a determining module, configured to determine the information of the multiple genome fragments to be the information of the multiple homologous genome fragments if the comparison rate is greater than or equal to the preset comparison rate.

With the present invention, a first comparison of the information of the target genome with the information of the reference genome is carried out to obtain a first comparison result; the information of the unaligned genomic fragments is obtained from the first comparison result; a second comparison of the information of the unaligned genomic fragments with the information of the reference genome is carried out to obtain a second comparison result; and the information of the distinguished sequences of the target genome is obtained from the second comparison result. This solves the problem that accurate distinguished sequences are difficult to obtain in the related art, and thereby improves the accuracy of the distinguished sequences.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of a data processing device for genomes according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a preferred data processing device for genomes according to an embodiment of the present invention;

Fig. 3 is a flowchart of a data processing method for genomes according to an embodiment of the present invention; and

Fig. 4 is a flowchart of a preferred data processing method for genomes according to an embodiment of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second" and the like in the specification, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or precedence. It should be understood that items so designated are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have", and any variations of them, are intended to cover non-exclusive inclusion.

According to an embodiment of the present invention, a data processing device for genomes is provided. The device is used to obtain the information of accurate distinguished sequences, which creates the conditions for accurate genetic analysis.

Fig. 1 is a schematic diagram of a data processing device for genomes according to an embodiment of the present invention.

As shown in Fig. 1, the device comprises a first comparing unit 10, a first acquiring unit 20, a second comparing unit 30 and a second acquiring unit 40.

The first comparing unit 10 is configured to carry out a first comparison of the information of the target genome with the information of the reference genome to obtain a first comparison result.

Specifically, the first comparison of the target genome with the reference genome may be carried out with the nucmer tool in the MUMmer software, yielding the first comparison result between the two genomes. It should be noted that, for the first comparison at whole-genome scale, the nucmer tool may be replaced by another tool.

The target genome and the reference genome may come from different species. The target genome may be the genome of the species under study, and the reference genome may be the genome of a species whose gene information is known. For example, when the genome of the willow is analyzed, the willow genome serves as the target genome. To analyze the gene function relationship between the willow and a second tree species, the genome of that second species serves as the reference genome; to analyze the gene function relationship between the willow and the Chinese scholar tree, the genome of the Chinese scholar tree serves as the reference genome. The first comparison may be a preliminary comparison, in which case the first comparison result is a preliminary comparison result. It should be noted that the species under study may include plants, animals, microorganisms and so on.

Preferably, in the first comparison, the target genome and the reference genome may each be divided into N genome regions, and the N regions of the target genome may be compared with the N regions of the reference genome simultaneously. This saves comparison time and improves comparison efficiency.
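The region-wise parallel comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the names `split_into_regions`, `count_shared_kmers` and `compare_in_parallel` are hypothetical, only the target sequence is split, and a toy shared k-mer count stands in for a real aligner such as nucmer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_regions(sequence, n):
    """Split a sequence into n contiguous regions of near-equal length.

    Returns (offset, subsequence) pairs so results can be mapped back
    to positions in the original genome.
    """
    size, extra = divmod(len(sequence), n)
    regions, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        regions.append((start, sequence[start:end]))
        start = end
    return regions


def count_shared_kmers(region, reference, k=4):
    """Toy stand-in for an aligner: count k-mers of the region found in the reference."""
    offset, seq = region
    ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}
    hits = sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] in ref_kmers)
    return offset, hits


def compare_in_parallel(target, reference, n):
    """Compare the n regions of the target against the reference concurrently."""
    regions = split_into_regions(target, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(lambda r: count_shared_kmers(r, reference), regions))
```

Note that splitting at fixed boundaries ignores matches that straddle two regions; a real pipeline would need overlapping windows or a post-processing step to recover them.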

Optionally, the data processing device may also comprise a third acquiring unit and a fourth acquiring unit. Before the first comparison of the information of the target genome with the information of the reference genome is carried out to obtain the first comparison result, the third acquiring unit obtains the information of the target genome and the fourth acquiring unit obtains the information of the reference genome.

The first acquiring unit 20 is configured to obtain the information of the unaligned genomic fragments from the first comparison result.

The first comparison result may comprise the information of the aligned genomic fragments and the information of the unaligned genomic fragments. An aligned genomic fragment may also be called a homologous genome fragment.

Specifically, the first acquiring unit 20 may obtain the information of the unaligned genomic fragments by either of the following two methods:

Method one: extract the information of the unaligned genomic fragments from the first comparison result.

The information of the unaligned genomic fragments may comprise: the information of genomic fragments whose similarity to the reference genome is less than a first preset similarity, where this preset value may be, for example, 98%; the information of genomic fragments whose length is less than a first preset length, where the first preset length may be, for example, 40 bp, or 90 bp if the first preset length is a cluster length; and the information of genomic fragments whose comparison rate is less than a first preset comparison rate. The comparison rate may be the ratio of the sequence to be compared in a genomic fragment of the target genome to the sequence to be compared in the reference genome.
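The three criteria above can be expressed as a single predicate. The sketch below is illustrative only: the `Fragment` record and the `is_unaligned` function are hypothetical names, the 98% and 40 bp defaults are the example values given in the text, and the default for the first preset comparison rate is a placeholder because the text does not state a value for it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    length_bp: int          # length of the genomic fragment in base pairs
    similarity: float       # percent similarity to the reference genome
    comparison_rate: float  # comparison rate, expressed here as a percentage


def is_unaligned(frag, min_similarity=98.0, min_length_bp=40, min_rate=50.0):
    """Treat a fragment as unaligned if it fails any one of the three criteria.

    min_rate=50.0 is an assumed placeholder, not a value from the source.
    """
    return (
        frag.similarity < min_similarity
        or frag.length_bp < min_length_bp
        or frag.comparison_rate < min_rate
    )


fragments = [
    Fragment(length_bp=120, similarity=99.1, comparison_rate=95.0),
    Fragment(length_bp=30, similarity=99.5, comparison_rate=95.0),
    Fragment(length_bp=120, similarity=97.0, comparison_rate=95.0),
]
unaligned = [f for f in fragments if is_unaligned(f)]
```

In this example the second fragment fails the length criterion and the third fails the similarity criterion, so both end up in `unaligned`.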

Method two: filter the information of the homologous genome fragments out of the first comparison result to obtain the information of the remaining genomic fragments, and take the information of the remaining genomic fragments as the information of the unaligned genomic fragments.

The information of the homologous genome fragments may be filtered out with the bedtools tool. Filtering out the information of the homologous genome fragments in this way reduces the consumption of computer memory.

The second comparing unit 30 is configured to carry out a second comparison of the information of the unaligned genomic fragments with the information of the reference genome to obtain a second comparison result.

The second comparison result may comprise multiple homologous genome fragments and distinguished sequences. A homologous genome fragment is a genome fragment that was aligned; a distinguished sequence is a sequence that was not aligned, and it may comprise gene sequences and other element sequences.

Specifically, the information of the unaligned genomic fragments may be compared with the information of the reference genome using the blastn software to obtain the second comparison result. This comparison is a fine comparison, and the second comparison result is correspondingly a fine comparison result. In this way, homologous genome fragments that the first comparison did not identify can be found and filtered out, so that accurate distinguished sequences can be obtained. This is because, in the second comparison, the length of a homologous genome fragment may be a second preset length, which may be greater than the first preset length, for example 100 bp; the similarity of a homologous genome fragment may be a second preset similarity; and the comparison rate of the second comparison may be a second preset comparison rate, for example 90.

Preferably, in the second comparison, the unaligned genomic fragments and the reference genome may each be divided into N genome regions, and the N regions of the unaligned genomic fragments may be compared with the N regions of the reference genome simultaneously. This saves comparison time and improves comparison efficiency.

The second acquiring unit 40 is configured to obtain the information of the distinguished sequences of the target genome from the second comparison result.

The method of obtaining the information of the distinguished sequences of the target genome from the second comparison result is similar to the method of obtaining the information of the unaligned genomic fragments from the first comparison result, and is not repeated here.

According to this embodiment of the present invention, the information of the target genome is compared with the information of the reference genome twice in succession, in a first comparison and a second comparison, and each comparison uses different comparison software and different values for the preset length, preset similarity, preset comparison rate and similar parameters. This improves the accuracy of the distinguished sequences. In addition, using the MUMmer software and the blastn software together makes it possible to analyze the differences of the distinguished sequences at the level of gene structure.

Fig. 2 is a schematic diagram of a preferred data processing device for genomes according to an embodiment of the present invention.

As shown in Fig. 2, this embodiment may serve as a preferred implementation of the embodiment shown in Fig. 1. It comprises the first comparing unit 10, the first acquiring unit 20, the second comparing unit 30 and the second acquiring unit 40 of the first embodiment of the data processing device for genomes, wherein the second comparing unit 30 comprises a first detection module 301, a labeling module 302, a first filtering module 303 and a comparing module 304.

The functions of the first comparing unit 10, the first acquiring unit 20 and the second acquiring unit 40 are the same as in the first embodiment and are not repeated here.

The first detection module 301 is configured to detect whether repeated sequence information exists in the information of the unaligned genomic fragments.

Preferably, when the species under study is a plant, it is meaningful to detect whether repeated sequence information exists in the information of the unaligned genomic fragments, because plant genomes contain a large number of repeated sequences. When the species under study is an animal, this detection may be omitted, because animal genomes contain only a small number of repeated sequences.

The labeling module 302 is configured to mark the repeated sequence information to obtain marked information if repeated sequence information is detected in the information of the unaligned genomic fragments.

Specifically, the repeated sequence information may be identified with the RepeatMasker software and marked with characters, digits or other symbols that differ from the base symbols. This prevents the marked information from being confused with base sequence information.
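The marking step can be illustrated with a short sketch. It assumes the repeat coordinates are already known (for instance from a repeat-finding tool) as half-open intervals; the function name `mask_repeats` and the choice of `X` as the marking character are assumptions made for illustration, not details taken from the text.

```python
def mask_repeats(sequence, repeats, mask_char="X"):
    """Replace every base inside a repeat interval with a non-base character.

    `repeats` is a list of half-open (start, end) intervals on `sequence`.
    """
    if len(mask_char) != 1 or mask_char.upper() in "ACGT":
        raise ValueError("mask character must be a single non-base symbol")
    chars = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


masked = mask_repeats("ACGTACGTAC", [(2, 5)])
```

Because the marking character is rejected if it is one of A, C, G or T, marked positions cannot be mistaken for bases in later steps.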

The first filtering module 303 is configured to filter the marked information out of the information of the unaligned genomic fragments to obtain filtered information.

It should be noted that the marked information need not be filtered out; instead, it may be skipped when the comparison with the information of the reference genome is carried out.
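Both options, filtering out the marked information or skipping it during comparison, can be served by one helper that returns the unmarked stretches together with their offsets. This is a sketch only; `unmasked_segments` is a hypothetical name, and `X` is assumed to be the marking character.

```python
import re


def unmasked_segments(masked, mask_char="X"):
    """Return (offset, segment) pairs for each run of unmarked bases."""
    pattern = "[^" + re.escape(mask_char) + "]+"
    return [(m.start(), m.group()) for m in re.finditer(pattern, masked)]


segments = unmasked_segments("ACXXXCGTAC")

# Option 1: filter the marked information out entirely.
filtered = "".join(seg for _, seg in segments)

# Option 2: keep the marks but compare only the unmarked segments,
# using each offset to map hits back to the original coordinates.
offsets = [offset for offset, _ in segments]
```

Keeping the offsets is a design choice made here so that positions reported by a later comparison can still be related to the original fragment.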

The comparing module 304 is configured to compare the filtered information with the information of the reference genome to obtain the second comparison result.

According to this embodiment of the present invention, repeated sequence information is detected and then either filtered out or skipped when the comparison with the information of the reference genome is carried out. This reduces the amount of genome sequence to be compared and therefore improves comparison efficiency, and filtering out the marked information reduces the computer memory consumed by the genome.

Optionally, in this embodiment of the present invention, the first comparison result may comprise multiple homologous genome fragments, the homologous genome fragments being the aligned genomic fragments, and the first acquiring unit may comprise a second filtering module, a first sorting module, a first merging module and a connecting module.

The second filtering module is configured to filter the multiple homologous genome fragments out of the first comparison result to obtain multiple unaligned genomic sub-fragments.

It should be noted that the above step of filtering the multiple homologous genome fragments out of the first comparison result to obtain multiple unaligned genomic sub-fragments may be replaced by a step of extracting the multiple unaligned genomic sub-fragments directly.

The first sorting module is configured to sort the unaligned genomic sub-fragments by their positions in the target genome to obtain a sequence of unaligned genomic sub-fragments.

The first merging module is configured to merge any two genomic sub-fragments that are adjacent in the sequence and overlap, to obtain a sequence comprising the merged unaligned genomic sub-fragments.

Specifically, the overlapping genomic sub-fragments may be merged with the bedtools tool.

Preferably, before the merge, it may first be detected whether any two genomic sub-fragments adjacent in the sequence overlap. If an overlap is detected, the two adjacent, overlapping genomic sub-fragments are merged to obtain a sequence comprising the merged unaligned genomic sub-fragments. If no overlap is detected, the merging step is skipped. The overlap may be between part of one genomic sub-fragment and part of the other, between the whole of both genomic sub-fragments, or between the whole of one genomic sub-fragment and part of the other.
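The three overlap cases can be distinguished with a small classifier. This sketch assumes each sub-fragment is a half-open `(start, end)` interval on the target genome; the function name `overlap_kind` and the labels it returns are illustrative, not terms from the text.

```python
def overlap_kind(a, b):
    """Classify how two half-open (start, end) intervals overlap.

    Returns "none", "identical" (the whole of both overlap),
    "contained" (the whole of one lies within the other),
    or "partial" (part of each overlaps part of the other).
    """
    a_start, a_end = a
    b_start, b_end = b
    if a_end <= b_start or b_end <= a_start:
        return "none"
    if (a_start, a_end) == (b_start, b_end):
        return "identical"
    a_contains_b = a_start <= b_start and b_end <= a_end
    b_contains_a = b_start <= a_start and a_end <= b_end
    if a_contains_b or b_contains_a:
        return "contained"
    return "partial"
```

Only the `"none"` result would cause the merging step to be skipped; the other three results all call for a merge.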

Merging the repeated parts of the unaligned genomic sub-fragments reduces repeated comparison of identical genomic fragments in the second comparison and therefore shortens comparison time. Merging the repeated parts also reduces the consumption of computer memory.

The connecting module is configured to connect all the genomic sub-fragments in the sequence comprising the merged unaligned genomic sub-fragments, to obtain the information of the unaligned genomic fragments.

For example, after the multiple homologous genome fragments in the first comparison result have been filtered out, four unaligned genomic sub-fragments may be obtained: a first, a second, a third and a fourth sub-fragment. These are arranged from left to right into a sequence according to their positions in the genome. In this sequence the tail of the third sub-fragment overlaps the head of the fourth sub-fragment, so the overlapping part can be merged and the third and fourth sub-fragments combined into a new genomic sub-fragment, the fifth sub-fragment. This yields a new sequence consisting of the first, second and fifth sub-fragments. Connecting the first, second and fifth sub-fragments of this new sequence in order gives the information of the unaligned genomic fragments.
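The four-sub-fragment example can be reproduced with a short sketch. The genome string and the coordinates below are invented for illustration, sub-fragments are assumed to be half-open `(start, end)` intervals, and `merge_and_connect` is a hypothetical name.

```python
def merge_and_connect(genome, sub_fragments):
    """Sort sub-fragments by position, merge overlapping neighbours, then connect.

    Returns the merged intervals and the concatenated sequence.
    """
    merged = []
    for start, end in sorted(sub_fragments):
        if merged and start < merged[-1][1]:
            # Overlaps the previous sub-fragment: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    connected = "".join(genome[s:e] for s, e in merged)
    return merged, connected


genome = "AAAACCCCGGGGTTTTACGTACGT"
# First, second, third and fourth sub-fragments; the third and fourth overlap.
subs = [(0, 4), (8, 12), (14, 18), (16, 22)]
merged, connected = merge_and_connect(genome, subs)
```

Here the third and fourth sub-fragments collapse into one interval, `(14, 22)`, which plays the role of the fifth sub-fragment in the example above, and `connected` is the concatenation of the first, second and fifth sub-fragments.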

Optionally, the second comparison result may comprise multiple homologous genome fragments, and the second acquiring unit may comprise an extraction module, a second sorting module, a second detection module, a second merging module and a third filtering module.

The extraction module is configured to extract the multiple homologous genome fragments. The second sorting module is configured to sort the homologous genome fragments by their positions in the target genome to obtain a sequence of homologous genome fragments; specifically, the homologous genome fragments may be sorted with the sort tool in bedtools. The second detection module is configured to detect whether any two homologous genome fragments adjacent in the sequence overlap. The second merging module is configured to merge the overlapping parts to obtain multiple merged homologous genome fragments if an overlap is detected. The third filtering module is configured to filter the information comprising the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequences of the target genome; the information filtered out here comprises not only the information of the merged homologous genome fragments but also the information of the homologous genome fragments that have no overlap. The step of filtering out the homologous genome fragments may be replaced by a step of inverting the homologous genome fragments; specifically, the homologous genome fragments may be inverted with the complement tool.
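The net effect of sorting, merging and then filtering (or inverting) the homologous genome fragments can be sketched as one pass over the intervals. The sketch assumes half-open `(start, end)` intervals that lie within a fragment of known length; `distinguished_intervals` is a hypothetical name, and the code is a pure-Python stand-in rather than a call into bedtools.

```python
def distinguished_intervals(fragment_length, homologous):
    """Return the parts of [0, fragment_length) not covered by any homologous interval."""
    kept, cursor = [], 0
    for start, end in sorted(homologous):  # sort by position
        if start > cursor:                 # a gap before this homologous fragment
            kept.append((cursor, start))
        cursor = max(cursor, end)          # overlapping fragments are merged implicitly
    if cursor < fragment_length:
        kept.append((cursor, fragment_length))
    return kept


hits = [(50, 60), (10, 20), (15, 30)]
remaining = distinguished_intervals(100, hits)
```

With these invented coordinates, the overlapping hits `(10, 20)` and `(15, 30)` behave as one merged fragment, and `remaining` holds the three stretches that would be reported as distinguished sequence.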

It should be noted that the function of the second acquiring unit may be replaced by the function of the first acquiring unit, which is not repeated here.

Preferably, the data processing device may also comprise a first judging module, a second judging module, a third judging module and a determining module. The first judging module is configured to judge, before the multiple homologous genome fragments are extracted, whether the length of the multiple genome fragments is greater than or equal to a preset length, where the preset length is the same as the second preset length. The second judging module is configured to judge whether the similarity of the multiple genome fragments is greater than or equal to a preset similarity if the length is greater than or equal to the preset length, where the preset similarity is the same as the second preset similarity. The third judging module is configured to judge whether the comparison rate of the multiple genome fragments is greater than or equal to a preset comparison rate if the similarity is greater than or equal to the preset similarity, where the preset comparison rate is the same as the second preset comparison rate. The determining module is configured to take the information of the multiple genome fragments as the information of the multiple homologous genome fragments if the comparison rate is greater than or equal to the preset comparison rate.
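The ordered chain of judgments (length, then similarity, then comparison rate) can be sketched as follows. The `Hit` record and function names are hypothetical. The 100 bp default is the example value given for the second preset length, the comparison-rate default of 90 is assumed here to be a percentage, and the similarity default is a placeholder because the text gives no value for the second preset similarity.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    length_bp: int          # length of the genome fragment in base pairs
    similarity: float       # percent similarity
    comparison_rate: float  # comparison rate, assumed to be a percentage


def first_failed_check(hit, min_length_bp=100, min_similarity=90.0, min_rate=90.0):
    """Run the checks in order and return the name of the first one that fails.

    Returns None when all three pass, i.e. the hit counts as homologous.
    min_similarity=90.0 is an assumed placeholder, not a value from the source.
    """
    if hit.length_bp < min_length_bp:
        return "length"
    if hit.similarity < min_similarity:
        return "similarity"
    if hit.comparison_rate < min_rate:
        return "comparison_rate"
    return None


def select_homologous(hits, **thresholds):
    """Keep only the hits that pass every check."""
    return [h for h in hits if first_failed_check(h, **thresholds) is None]


hits = [Hit(80, 99.0, 99.0), Hit(150, 85.0, 99.0), Hit(150, 95.0, 50.0), Hit(150, 95.0, 95.0)]
homologous = select_homologous(hits)
```

Because each check runs only after the previous one has passed, cheaper tests such as length can reject a fragment before the later comparisons are made.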

According to an embodiment of the present invention, a data processing method for genomes is provided. The method is used to obtain accurate distinguished-sequence information, creating the conditions for accurate genetic analysis. The method can run on computer processing equipment. It should be noted that the data processing method for genomes provided by the embodiment of the present invention can be carried out by the data processing device for genomes of the embodiment, and that device can likewise be used to carry out the method.

Fig. 3 is a flowchart of the data processing method for genomes according to an embodiment of the present invention.

As shown in Fig. 3, the method comprises the following steps S302 to S308:

Step S302: perform a first comparison between the information of the target genome and the information of the reference genome to obtain a first comparison result.

Preferably, in the first comparison, the target genome and the reference genome can each be divided into N genome regions, and the N regions of the target genome can be compared with the N regions of the reference genome simultaneously. This shortens the comparison time and improves comparison efficiency.

Optionally, before the first comparison between the information of the target genome and the information of the reference genome is performed to obtain the first comparison result, the data processing method can further comprise: obtaining the information of the target genome and obtaining the information of the reference genome.

Step S304: obtain the information of the unaligned genome fragments from the first comparison result.

Specifically, the information of the unaligned genome fragments can be obtained by the following two methods:

The information of the homologous genome fragments can be filtered out using the nucmer tool in the MUMmer software. Filtering out this information reduces the consumption of computer memory.

Step S306: perform a second comparison between the information of the unaligned genome fragments and the information of the reference genome to obtain a second comparison result.

Specifically, the information of the unaligned genome fragments can be compared with the information of the reference genome using the blastn software to obtain the second comparison result. This is a fine-grained comparison, and the second comparison result is correspondingly a fine-grained comparison result. In this way, homologous genome fragments that were not picked up in the first comparison can be found and filtered out, so that an accurate distinguished sequence is obtained. This is because, in the second comparison, the length of a homologous genome fragment can be the second preset length, which can be greater than the first preset length (for example, the second preset length can be 100 bp); the similarity of a homologous genome fragment can be the second preset similarity; and the alignment rate of the second comparison can be the second preset alignment rate (for example, 90).

Preferably, in the second comparison, the unaligned genome fragments and the reference genome can each be divided into N genome regions, and the N regions of the unaligned genome fragments can be compared with the N regions of the reference genome simultaneously. This shortens the comparison time and improves comparison efficiency.

Step S308: obtain the information of the distinguished sequence of the target genome from the second comparison result.
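The region-wise simultaneous comparison described for the first and second comparisons can be illustrated with a minimal sketch. It is not the patented implementation: the function names are hypothetical, the comparison is a placeholder exact-substring check rather than a real aligner such as nucmer or blastn, and pairing every target region with every reference region is an assumption, since the text does not specify how regions are paired.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def split_into_regions(sequence, n):
    """Split a sequence into n contiguous regions of near-equal length.

    Returns a list of (offset, region_sequence) tuples.
    """
    size, extra = divmod(len(sequence), n)
    regions, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        regions.append((start, sequence[start:end]))
        start = end
    return regions


def compare_region(target_region, reference_regions):
    """Placeholder comparison (assumption): a target region counts as matched
    if it occurs verbatim inside any single reference region."""
    offset, fragment = target_region
    matched = any(fragment in ref for _, ref in reference_regions)
    return offset, len(fragment), matched


def parallel_compare(target, reference, n):
    """Compare the n target regions against the n reference regions concurrently."""
    target_regions = split_into_regions(target, n)
    reference_regions = split_into_regions(reference, n)
    worker = partial(compare_region, reference_regions=reference_regions)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map preserves the order of the target regions
        return list(pool.map(worker, target_regions))


if __name__ == "__main__":
    for offset, length, matched in parallel_compare("AAAACCCCGGGG", "CCCCTTTTAAAA", 3):
        print(offset, length, "matched" if matched else "unmatched")
```

A real pipeline would replace `compare_region` with a call to an alignment tool and would need to handle matches that span region boundaries, which this sketch ignores.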

As shown in Fig. 4, the data processing method for genomes comprises the following steps S402 to S414; this embodiment can serve as a preferred implementation of the embodiment shown in Fig. 3.

Steps S402 to S404 are the same as steps S302 to S304 of the embodiment shown in Fig. 3, respectively, and are not repeated here.

Step S406: detect whether repeated sequence information exists in the information of the unaligned genome fragments.

Preferably, when the species under study is a plant, it is worthwhile to detect whether repeated sequence information exists in the information of the unaligned genome fragments, because plant genomes contain a large number of repeated sequences. When the species under study is an animal, this detection can be skipped, because animal genomes contain only a small number of repeated sequences.

Step S408: if repeated sequence information is detected in the information of the unaligned genome fragments, mark the repeated sequence information to obtain marked information.

Specifically, the repeated sequence information can be identified with the RepeatMasker software and marked with characters, digits, or other symbols that differ from the base symbols. This prevents the marked information from being confused with the base sequence information.

Step S410: filter the marked information out of the information of the unaligned genome fragments to obtain filtered information.

Step S412: compare the filtered information with the information of the reference genome to obtain the second comparison result.

Step S414 is the same as step S308 of the embodiment shown in Fig. 3 and is not repeated here.

With this embodiment of the present invention, repeated sequence information is detected and then filtered out or skipped when comparing against the information of the reference genome. This reduces the number of genome sequences to be compared and thereby improves comparison efficiency; filtering out the marked information also reduces the computer memory consumed by the genome.
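The marking and filtering of steps S408 and S410 can be sketched as follows. This is an illustrative assumption, not RepeatMasker's actual behavior or output format: the repeat intervals are taken as given input, and the mark character `#` and both function names are hypothetical.

```python
def mark_repeats(sequence, repeat_intervals, mark="#"):
    """Replace every base inside a repeat interval with a non-base mark character.

    repeat_intervals holds half-open (start, end) positions; how they are found
    (for example by a repeat-detection tool) is outside this sketch.
    """
    chars = list(sequence)
    for start, end in repeat_intervals:
        for i in range(start, end):
            chars[i] = mark
    return "".join(chars)


def filter_marked(marked_sequence, mark="#"):
    """Drop the marked positions and return the remaining unmarked pieces."""
    return [piece for piece in marked_sequence.split(mark) if piece]


if __name__ == "__main__":
    unaligned = "ACGTATATATGGCC"
    marked = mark_repeats(unaligned, [(4, 10)])  # positions 4..9 hold the repeat
    print(marked)
    print(filter_marked(marked))
```

Because the mark is not one of the base symbols, the marked positions cannot be mistaken for sequence content, and only the unmarked pieces go on to the second comparison.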

Optionally, in the embodiment of the present invention, the first comparison result can comprise multiple homologous genome fragments, where the homologous genome fragments are the genome fragments that were aligned, and obtaining the information of the unaligned genome fragments from the first comparison result can comprise the following steps:

First, filter the multiple homologous genome fragments out of the first comparison result to obtain multiple unaligned genome sub-fragments.

Then, sort the unaligned genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of the unaligned genome sub-fragments.

Then, merge any two genome sub-fragments that are positionally adjacent in the ordered sequence and overlap, obtaining an ordered sequence that contains the merged unaligned genome sub-fragments.

Finally, connect all the genome sub-fragments in that ordered sequence to obtain the information of the unaligned genome fragments.

For example, after the multiple homologous genome fragments in the first comparison result are filtered out, four unaligned genome sub-fragments can be obtained: a first, a second, a third, and a fourth sub-fragment. These are arranged from left to right into an ordered sequence according to their positions in the genome, and in this sequence the tail of the third sub-fragment overlaps the head of the fourth sub-fragment. The overlapping portion can therefore be merged, combining the third and fourth sub-fragments into a new genome sub-fragment, the fifth sub-fragment. This yields a new sequence consisting of the first, second, and fifth sub-fragments. Connecting the first, second, and fifth sub-fragments in order gives the information of the unaligned genome fragments.
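The four-sub-fragment example above can be traced with a short sketch. It assumes that each sub-fragment is represented by half-open `(start, end)` coordinates on the target genome; that representation, the function names, and the toy genome string are illustrative assumptions rather than details given in the text.

```python
def merge_overlapping(fragments):
    """Sort fragments by position and merge neighbours that overlap.

    fragments: iterable of half-open (start, end) coordinates.
    """
    merged = []
    for start, end in sorted(fragments):
        if merged and start < merged[-1][1]:
            # Overlaps the previous fragment: extend it instead of adding a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def connect_fragments(genome, fragments):
    """Concatenate the sequences of the merged fragments in positional order."""
    return "".join(genome[start:end] for start, end in merge_overlapping(fragments))


if __name__ == "__main__":
    genome = "AAAACCCCGGGGTTTTACGT"
    # Four unaligned sub-fragments; the third (12, 16) and fourth (14, 20) overlap
    sub_fragments = [(12, 16), (0, 3), (5, 8), (14, 20)]
    print(merge_overlapping(sub_fragments))
    print(connect_fragments(genome, sub_fragments))
```

Merging before concatenation matters: connecting the third and fourth sub-fragments without merging would duplicate the bases in their overlapping portion.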

Optionally, the second comparison result can comprise multiple homologous genome fragments, and obtaining the information of the distinguished sequence of the target genome from the second comparison result can comprise the following steps:

First, extract the multiple homologous genome fragments. Second, sort the homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of the fragments; specifically, the fragments can be sorted with the sort tool in bedtools. Third, detect whether any two positionally adjacent homologous genome fragments in the ordered sequence overlap. Then, if such an overlap is detected, merge the overlapping portions to obtain multiple merged homologous genome fragments. Finally, filter the information of the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequence of the target genome; the information filtered out here includes not only the merged homologous genome fragments but also the homologous genome fragments that have no overlapping portion. The step of filtering out the homologous genome fragments can be replaced by a step of inverting the homologous genome fragments (taking their complement); specifically, this can be done with the complement tool.
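The sort, merge, and filter (or complement) steps above amount to interval arithmetic, sketched below. The sketch assumes that homologous fragments are half-open `(start, end)` intervals on a single coordinate range of known length; the function names are hypothetical, and the code does not reproduce the exact behavior of any bedtools command.

```python
def merge_intervals(intervals):
    """Sort intervals by position and merge those that overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement_intervals(intervals, total_length):
    """Return the regions of [0, total_length) not covered by any interval.

    With homologous fragments as input, the result is the set of candidate
    distinguished regions.
    """
    result, cursor = [], 0
    for start, end in merge_intervals(intervals):
        if start > cursor:
            result.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < total_length:
        result.append((cursor, total_length))
    return result


if __name__ == "__main__":
    homologous = [(30, 50), (10, 20), (15, 25)]
    print(merge_intervals(homologous))
    print(complement_intervals(homologous, 60))
```

Taking the complement directly gives the same regions as removing every homologous interval, merged or not, which is why the text allows the filtering step to be replaced by the inversion step.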

It should be noted that the step of obtaining the information of the distinguished sequence of the target genome from the second comparison result can be replaced with the step of obtaining the information of the unaligned genome fragments from the first comparison result, which is not repeated here.

Preferably, before the multiple homologous genome fragments are extracted, the data processing method can further comprise the following. First, judge whether the length of multiple genome fragments is greater than or equal to a preset length, where the preset length is the same as the second preset length. Then, if the length of the genome fragments is greater than or equal to the preset length, judge whether the similarity of the genome fragments is greater than or equal to a preset similarity, where the preset similarity is the same as the second preset similarity. Then, if the similarity of the genome fragments is greater than or equal to the preset similarity, judge whether the alignment rate of the genome fragments is greater than or equal to a preset alignment rate, where the preset alignment rate is the same as the second preset alignment rate. Finally, if the alignment rate of the genome fragments is greater than or equal to the preset alignment rate, take the information of the genome fragments as the information of the multiple homologous genome fragments.
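The three successive checks above can be written as a short predicate. In this sketch the `Fragment` record and its field names are hypothetical. The default length of 100 bp and alignment rate of 90 follow the examples given earlier in the description, while the default similarity threshold of 90.0 is purely an assumption, because the text gives no value for it.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    length: int            # aligned length in bp
    similarity: float      # similarity to the reference, in percent
    alignment_rate: float  # alignment rate, in percent


def is_homologous(fragment, min_length=100, min_similarity=90.0, min_alignment_rate=90.0):
    """Apply the checks in the stated order: length, then similarity, then alignment rate.

    min_similarity is an assumed placeholder value.
    """
    if fragment.length < min_length:
        return False
    if fragment.similarity < min_similarity:
        return False
    return fragment.alignment_rate >= min_alignment_rate


if __name__ == "__main__":
    candidates = [
        Fragment("f1", 250, 97.5, 95.0),  # passes all three checks
        Fragment("f2", 80, 99.0, 99.0),   # too short
        Fragment("f3", 300, 85.0, 99.0),  # similarity too low
        Fragment("f4", 300, 95.0, 70.0),  # alignment rate too low
    ]
    print([f.name for f in candidates if is_homologous(f)])
```

Checking the cheapest criterion first lets a fragment be rejected without evaluating the remaining ones, which mirrors the conditional ordering in the text.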

As can be seen from the above description, by using long-sequence alignment software together with short-sequence alignment software, the present invention obtains accurate distinguished sequences of all types between species (not limited to protein sequences) and reduces the time and memory required for genome alignment, which can provide the conditions for subsequent analysis of species diversity.

It should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.

Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into an individual integrated circuit module, or several of them can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A data processing method for genomes, characterized by comprising:

performing a first comparison between information of a target genome and information of a reference genome to obtain a first comparison result;

obtaining information of unaligned genome fragments from said first comparison result;

performing a second comparison between the information of said unaligned genome fragments and the information of said reference genome to obtain a second comparison result; and

obtaining information of a distinguished sequence of said target genome from said second comparison result.

2. The data processing method according to claim 1, characterized in that performing the second comparison between the information of said unaligned genome fragments and the information of said reference genome to obtain the second comparison result comprises:

detecting whether repeated sequence information exists in the information of said unaligned genome fragments;

if repeated sequence information is detected in the information of said unaligned genome fragments, marking said repeated sequence information to obtain marked information;

filtering said marked information out of the information of said unaligned genome fragments to obtain filtered information; and

comparing said filtered information with the information of said reference genome to obtain said second comparison result.

3. The data processing method according to claim 1, characterized in that said first comparison result comprises multiple homologous genome fragments, wherein said multiple homologous genome fragments are the genome fragments that were aligned, and obtaining the information of the unaligned genome fragments from said first comparison result comprises:

filtering said multiple homologous genome fragments out of said first comparison result to obtain multiple unaligned genome sub-fragments;

sorting said multiple unaligned genome sub-fragments according to their positions in said target genome to obtain an ordered sequence of the multiple unaligned genome sub-fragments;

merging any two genome sub-fragments that are positionally adjacent in said ordered sequence and overlap, to obtain an ordered sequence comprising multiple merged unaligned genome sub-fragments; and

connecting all the genome sub-fragments in said ordered sequence comprising the multiple merged unaligned genome sub-fragments to obtain the information of said unaligned genome fragments.

4. The data processing method according to claim 1, characterized in that said second comparison result comprises multiple homologous genome fragments, and obtaining the information of the distinguished sequence of said target genome from said second comparison result comprises:

extracting said multiple homologous genome fragments;

sorting said multiple homologous genome fragments according to their positions in said target genome to obtain an ordered sequence of said multiple homologous genome fragments;

detecting whether any two positionally adjacent homologous genome fragments in said ordered sequence overlap;

if any two positionally adjacent homologous genome fragments in said ordered sequence are detected to overlap, merging the overlapping portions to obtain multiple merged homologous genome fragments; and

filtering the information comprising the multiple merged homologous genome fragments out of said second comparison result to obtain the information of the distinguished sequence of said target genome.

5. The data processing method according to claim 4, characterized in that, before said multiple homologous genome fragments are extracted, said data processing method further comprises:

judging whether the length of multiple genome fragments is greater than or equal to a preset length;

if the length of said multiple genome fragments is judged to be greater than or equal to the preset length, judging whether the similarity of said multiple genome fragments is greater than or equal to a preset similarity;

if the similarity of said multiple genome fragments is judged to be greater than or equal to the preset similarity, judging whether the alignment rate of said multiple genome fragments is greater than or equal to a preset alignment rate; and

if the alignment rate of said multiple genome fragments is judged to be greater than or equal to the preset alignment rate, taking the information of said multiple genome fragments as the information of said multiple homologous genome fragments.

6. A data processing device for genomes, characterized by comprising:

a first comparing unit, configured to perform a first comparison between information of a target genome and information of a reference genome to obtain a first comparison result;

a first acquisition unit, configured to obtain information of unaligned genome fragments from said first comparison result;

a second comparing unit, configured to perform a second comparison between the information of said unaligned genome fragments and the information of said reference genome to obtain a second comparison result; and

a second acquisition unit, configured to obtain information of a distinguished sequence of said target genome from said second comparison result.

7. The data processing device according to claim 6, characterized in that said second comparing unit comprises:

a first detection module, configured to detect whether repeated sequence information exists in the information of said unaligned genome fragments;

a marking module, configured to mark said repeated sequence information to obtain marked information if repeated sequence information is detected in the information of said unaligned genome fragments;

a first filtering module, configured to filter said marked information out of the information of said unaligned genome fragments to obtain filtered information; and

a comparing module, configured to compare said filtered information with the information of said reference genome to obtain said second comparison result.

8. The data processing device according to claim 6, characterized in that said first comparison result comprises multiple homologous genome fragments, wherein said multiple homologous genome fragments are the genome fragments that were aligned, and said first acquisition unit comprises:

a second filtering module, configured to filter said multiple homologous genome fragments out of said first comparison result to obtain multiple unaligned genome sub-fragments;

a first sorting module, configured to sort said multiple unaligned genome sub-fragments according to their positions in said target genome to obtain an ordered sequence of the multiple unaligned genome sub-fragments;

a first merging module, configured to merge any two genome sub-fragments that are positionally adjacent in said ordered sequence and overlap, to obtain an ordered sequence comprising multiple merged unaligned genome sub-fragments; and

a connecting module, configured to connect all the genome sub-fragments in said ordered sequence comprising the multiple merged unaligned genome sub-fragments to obtain the information of said unaligned genome fragments.

9. The data processing device according to claim 6, characterized in that said second comparison result comprises multiple homologous genome fragments, and said second acquisition unit comprises:

an extraction module, configured to extract said multiple homologous genome fragments;

a second sorting module, configured to sort said multiple homologous genome fragments according to their positions in said target genome to obtain an ordered sequence of said multiple homologous genome fragments;

a second detection module, configured to detect whether any two positionally adjacent homologous genome fragments in said ordered sequence overlap;

a second merging module, configured to merge the overlapping portions to obtain multiple merged homologous genome fragments if any two positionally adjacent homologous genome fragments in said ordered sequence are detected to overlap; and

a third filtering module, configured to filter the information comprising the multiple merged homologous genome fragments out of said second comparison result to obtain the information of the distinguished sequence of said target genome.

10. The data processing device according to claim 9, characterized by further comprising:

a first judging module, configured to judge, before said multiple homologous genome fragments are extracted, whether the length of multiple genome fragments is greater than or equal to a preset length;

a second judging module, configured to judge whether the similarity of said multiple genome fragments is greater than or equal to a preset similarity if the length of said multiple genome fragments is judged to be greater than or equal to the preset length;

a third judging module, configured to judge whether the alignment rate of said multiple genome fragments is greater than or equal to a preset alignment rate if the similarity of said multiple genome fragments is judged to be greater than or equal to the preset similarity; and

a determination module, configured to confirm the information of said multiple genome fragments as the information of said multiple homologous genome fragments if the alignment rate of said multiple genome fragments is judged to be greater than or equal to the preset alignment rate.