CN109243529A

CN109243529A - Horizontal gene transfer identification method based on locality-sensitive hashing

Info

Publication number: CN109243529A
Application number: CN201810988512.1A
Authority: CN
Inventors: 江育娥; 魏静; 林劼
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2018-08-28
Filing date: 2018-08-28
Publication date: 2019-01-18
Anticipated expiration: 2038-08-28
Also published as: CN109243529B

Abstract

The invention discloses a horizontal gene transfer identification method based on locality-sensitive hashing, comprising the following steps: step 1, cutting whole-genome data into N segments with the same number of bp according to a set step size; step 2, applying K-word processing to each segment; step 3, counting the frequency of each K-word in each segment and standardizing the frequencies; step 4, constructing a family of m hash functions and computing m hash values for each segment; step 5, dividing the m hash values of each gene segment into bands and comparing the hash values in each band of the segment with the hash values in the corresponding band of the whole genome, where the segment is considered similar to the whole genome if all hash values in at least one band are equal, and horizontal gene transfer is predicted otherwise; step 6, computing the Euclidean distance between the m hash values of each candidate segment in the similar set and those of the whole genome, where horizontal gene transfer is considered to have occurred if the distance is greater than a set threshold. The invention helps reduce the use of computing resources and improves the computational efficiency of prediction.

Description

Horizontal gene transfer identification method based on locality-sensitive hashing

Technical field

The invention relates to the field of bioinformatics, and in particular to a horizontal gene transfer identification method based on locality-sensitive hashing.

Background art

Horizontal gene transfer (HGT) refers to the lateral exchange of genetic material between unicellular and/or multicellular organisms, that is, the process by which an organism acquires genetic material from an individual other than its parent. It is an important process driving biological evolution. Horizontal gene transfer can occur between different species (interspecific horizontal gene transfer) or within the same species (intraspecific horizontal gene transfer). It breaks down the boundaries of kinship and makes the possible routes of gene flow more complex.

As genome sequencing projects for humans and other organisms have been completed, large numbers of homologous genes have been found in the genomes of different species, even of distantly related organisms, which further demonstrates how widespread horizontal gene transfer is and how distant the lineages it connects can be. Qualitative and quantitative prediction of horizontally transferred genes is important for understanding biological evolution and the exchange of genetic material between species. In recent years, large numbers of DNA molecules with transforming activity, and competent cells able to take up exogenous DNA actively, have been found in natural environments, giving a new understanding of the horizontal gene transfer that occurs in the environment. Further study of horizontal gene transfer and its ecological effects will help in re-evaluating genetically engineered organisms, so that genetic engineering and genetically modified organisms can be applied to greater effect.

Methods for detecting horizontal gene transfer (HGT) fall into two main classes: phylogenetic methods and parametric methods. Methods based on phylogenetic trees detect HGT with higher confidence. However, phylogenetic methods depend heavily on the accuracy of the input gene trees and species trees, which can be challenging to build. Even when the input trees contain no errors, phylogenetic conflict may be the result of evolutionary processes other than HGT (such as duplication and loss), leading these methods to infer HGT erroneously.

Summary of the invention

The object of the invention is to provide a horizontal gene transfer identification method based on locality-sensitive hashing.

The technical solution adopted by the invention is as follows:

A horizontal gene transfer identification method based on locality-sensitive hashing, comprising the following steps:

Step 1: cut the whole-genome data into N segments with the same number of bp according to a set step size;

Step 2: apply K-word processing to each segment obtained by the segmentation, i.e. pass a sliding window of length K over the segment, the stretch of sequence inside the window being one K-word; each gene segment yields a total of $|\Sigma|^K$ words, where $|\Sigma|$ is the size of the gene alphabet and K is the window length of the K-word processing;

Step 3: count the frequency of each K-word in each segment to obtain the K-word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency with which the t-th K-word occurs in the current segment and n is the total number of K-words; then standardize the frequency of each K-word to obtain the standardized result $S_t$, forming the standardized frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized frequency of the t-th K-word and n is the total number of K-words;

The standardized frequency $S_t$ of each K-word is calculated as follows:

$$S_t = \frac{X_t - \mu}{\delta}$$

where μ is the mean of all K-word frequencies and δ is the standard deviation of the K-word frequencies;

Step 4: construct a family of m hash functions and compute m hash values for each segment;

Step 5: divide the m hash values of each gene segment into bands, and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome,

When all hash values in one or more bands are equal, the segment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;

Step 6: compute the Euclidean distance between the m hash values of each candidate segment in the similar set and those of the whole genome; when the Euclidean distance is greater than the set threshold, whose value is defined in terms of K (the window length of the K-word processing, i.e. the length of a K-word), horizontal gene transfer is considered to have occurred;

Euclidean distance formula are as follows:

M is hash function number, i.e. m is the hash value number of sequence.

Further, in step 1 the whole-genome data are cut, with a step size of 500, into N overlapping segments of 5000 bp each.

Further, the hash family in step 4 has the form:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

where b is a random number in (0, r), r is the length of the intervals into which the line is divided, and the functions in the family are indexed by the different values of a and b.

By adopting the above technical solution, the invention proposes a prediction method based on locality-sensitive hashing that builds on alignment-free methods, the most widely used of the parametric methods. Using locality-sensitive hashing not only improves the efficiency of searching for horizontally transferred genes but also increases sensitivity to ameliorated genes, thereby improving prediction accuracy. It overcomes the insensitivity of parametric methods to anciently transferred genes that have since undergone amelioration and, without reducing prediction accuracy, reduces the use of computing resources and improves the computational efficiency of prediction. The alignment-free approach used by the invention detects more efficiently and overcomes the drawbacks of phylogenetic methods, namely high computational cost and excessive reliance on a reliable phylogenetic tree.

Brief description of the drawings

The invention is described in further detail below with reference to the drawings and specific embodiments;

Fig. 1 is a flow diagram of the horizontal gene transfer identification method based on locality-sensitive hashing according to the invention.

Specific embodiment

Horizontal gene transfer occurs frequently in nature and is an important factor in the evolution of many organisms. The genetic material it introduces plays a very important role in promoting genome innovation and may give the host organism a selective advantage that improves its adaptation to new environments. Because horizontal gene transfer can bring into a genome genes from the entirely different genotypes of distant phylogenetic lineages, or new genes with new functions, it is a major source of phenotypic innovation and a mechanism of niche adaptation, and it also enriches species diversity. Detecting HGT events helps us better understand the historical background of a genome and explain the emergence of new factors during biological evolution, which is of biological significance and enriches classical Darwinian evolutionary theory.

The invention uses the p-stable LSH algorithm, which does not need to map the original space into Hamming space and can perform locality-sensitive hashing directly in Euclidean space.

Locality sensitivity means that points close together in the original space have a high probability of remaining adjacent in the new data space, while points far apart have a very low probability of being mapped to the same bucket. Locality-sensitive hashing (LSH) uses this property to solve the approximate nearest-neighbour search problem. The original LSH method maps the original space into Hamming space: the representation of a point in the original space is converted into the representation of a point in Hamming space, and the distance metric is likewise converted into the Hamming-space metric. The search time of the LSH method is linear in the dimension and sub-exponential in the scale of the space, which greatly reduces search time.

p-stable LSH is applied in d-dimensional Euclidean space, with 0 < p ≤ 2. It is an advanced form of LSH that uses the concept of p-stable distributions. A p-stable distribution is not one specific distribution but a family of distributions satisfying certain conditions: p = 1 corresponds to the standard Cauchy distribution and p = 2 to the Gaussian distribution. The p-stable LSH family has the form $h_{a,b}(v) = \lfloor (a \cdot v + b)/r \rfloor$, where a is a d-dimensional vector whose entries are drawn independently from a p-stable distribution, b is a random number in (0, r), r is the length of the intervals into which the line is divided, and the functions in the family are indexed by the different values of a and b. The dot product of the vectors a and v is used to generate the hash family, and this family is locality-sensitive. The p-stable LSH method divides a line into equal intervals of length r; points mapped onto the same interval are given the same hash value, and points mapped onto different intervals are given different hash values.
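The interval-based bucketing described above can be illustrated with a minimal sketch. It assumes the standard p-stable form floor((a · v + b) / r); the function name `pstable_hash` and all numeric values are illustrative assumptions and are not taken from the patent.

```python
import math

def pstable_hash(v, a, b, r):
    """Project v onto a, shift by b, and return the index of the length-r interval hit."""
    projection = sum(ai * vi for ai, vi in zip(a, v))
    return math.floor((projection + b) / r)

# Illustrative values only (assumptions, not from the patent).
a = [1.0, 0.0]   # projection direction
b = 0.5          # offset chosen inside (0, r)
r = 4.0          # length of each interval on the line

near_1 = pstable_hash([1.0, 2.0], a, b, r)  # (1.0 + 0.5) / 4 = 0.375 -> interval 0
near_2 = pstable_hash([1.5, 2.0], a, b, r)  # (1.5 + 0.5) / 4 = 0.5   -> interval 0
far = pstable_hash([9.0, 2.0], a, b, r)     # (9.0 + 0.5) / 4 = 2.375 -> interval 2
```

The two nearby points fall in the same interval and share a hash value, while the distant point lands in a different interval.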

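As shown in Fig. 1, the invention discloses a horizontal gene transfer identification method based on locality-sensitive hashing, comprising steps 1 to 6 as set out in the summary of the invention above.

The Euclidean distance formula used in step 6 is:

$$d = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

where $x_i$ and $y_i$ are the i-th hash values of the candidate segment and of the whole genome respectively, and m is the number of hash functions, i.e. the number of hash values of a sequence.

Further, the hash family in step 4 has the form:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$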

To verify the prediction performance of the invention, the predicted HGT segments can be compared with the actual HGT segments in an analysis of experimental results, and the model of the horizontal gene transfer identification method based on locality-sensitive hashing is evaluated using the F-measure.
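The F-measure evaluation can be sketched as follows. This is a minimal sketch assuming that predictions and ground truth are both given as sets of segment indices; the function name `f_measure`, the `beta` parameter and the example indices are hypothetical, since the patent does not specify how segments are matched.

```python
def f_measure(predicted, actual, beta=1.0):
    """Return (precision, recall, F) for predicted vs. actual sets of HGT segment indices."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f

# Hypothetical segment indices, for illustration only.
precision, recall, f1 = f_measure(predicted=[2, 3, 7, 9], actual=[2, 3, 4])
```

With `beta=1.0` the score is the harmonic mean of precision and recall.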

The processing procedure of the invention is described in detail below:

To make the description of the data processing in this patent clearer, simulated data are used: an Escherichia coli fragment of 300 bp is taken as the recipient, genome fragments of 50 bp each are cut at random from Haemophilus influenzae and Bacillus subtilis as the donor genome, and two randomly selected positions in the recipient genome are used as insertion sites, giving simulated HGT data. The steps of the horizontal gene transfer identification method based on locality-sensitive hashing are as follows:
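The construction of the simulated data can be sketched as below. This is a minimal sketch under stated assumptions: the function name `simulate_hgt` is hypothetical, and the sequences in the usage lines are single-letter placeholders rather than real genomic data.

```python
import random

def simulate_hgt(recipient, donor_fragments, rng):
    """Insert each donor fragment at a random position of the recipient sequence.

    Returns the chimeric sequence and the (start, end) interval occupied by
    each donor fragment in the final sequence.
    """
    # Draw insertion points in recipient coordinates, then insert left to right.
    points = sorted(rng.randrange(len(recipient) + 1) for _ in donor_fragments)
    pieces, intervals, previous, offset = [], [], 0, 0
    for point, fragment in zip(points, donor_fragments):
        pieces.append(recipient[previous:point])
        start = point + offset  # shift by the donor bases already inserted
        pieces.append(fragment)
        intervals.append((start, start + len(fragment)))
        offset += len(fragment)
        previous = point
    pieces.append(recipient[previous:])
    return "".join(pieces), intervals

# Placeholder sequences (assumptions): a 300 bp recipient and two 50 bp donors.
recipient = "A" * 300
donors = ["C" * 50, "G" * 50]
chimera, intervals = simulate_hgt(recipient, donors, random.Random(0))
```

The returned intervals record where the donor fragments ended up, which gives the ground truth needed for later evaluation.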

For example, take a sequence with a total length of 400 bp, part of which consists of fragments that have undergone HGT:

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATCTGTCAGAAAAATGACTAAATAGCGGCTCCCACAATGTTCAAATGTGGGGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCGCTGGCGTTTCTCTGCTCGACGGTCACCGGGATTTTATTTGGCTGGTTACACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGG

Step 1: cut the sequence into overlapping segments of 100 bp with a step size of 20, giving 20 segments;

For example, the first segment is:

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT

The second segment is:

ACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATC

and so on. The first two segments are used as examples in the steps that follow.
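The segmentation of step 1 can be sketched as a sliding window. This is a minimal sketch; the function name `split_into_segments` and the `keep_partial` flag are assumptions introduced for illustration. For a 400 bp sequence with 100 bp windows and a step of 20, keeping only full-length windows gives (400 − 100) / 20 + 1 = 16 segments, whereas counting every start position (including shorter windows at the tail) gives the 20 segments stated above; which convention the patent intends is not specified.

```python
def split_into_segments(sequence, length, step, keep_partial=False):
    """Cut a sequence into overlapping windows of `length` bp taken every `step` bp."""
    segments = []
    for start in range(0, len(sequence), step):
        window = sequence[start:start + length]
        if len(window) < length and not keep_partial:
            break  # remaining windows would all be shorter than `length`
        segments.append(window)
    return segments

# Placeholder 400 bp sequence (an assumption, not the sequence from the text).
sequence = "ACGT" * 100
full_windows = split_into_segments(sequence, length=100, step=20)
all_starts = split_into_segments(sequence, length=100, step=20, keep_partial=True)
```

The same function covers the 5000 bp / 500 step setting by changing `length` and `step`.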

Step 2: apply K-word processing to each segment obtained by the segmentation, i.e. pass a sliding window of length K over the segment, the stretch of sequence inside the window being one K-word; each gene segment yields a total of $|\Sigma|^K$ words;

For example, with a sliding-window length of K = 2, the K-words that each segment can yield are: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
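The K-word processing can be sketched as follows. The helper names `all_k_words` and `sliding_k_words` are hypothetical; the alphabet is assumed to be the four DNA bases.

```python
from itertools import product

def all_k_words(k, alphabet="ACGT"):
    """List every possible K-word over the alphabet in lexicographic order."""
    return ["".join(letters) for letters in product(alphabet, repeat=k)]

def sliding_k_words(segment, k):
    """Return the K-words read by a window of length k moved one base at a time."""
    return [segment[i:i + k] for i in range(len(segment) - k + 1)]

words = all_k_words(2)                       # 4**2 = 16 possible words
observed = sliding_k_words("AGCTTTTCAT", 2)  # first 10 bases of the first segment
```

A segment of L bases produces L − K + 1 windows, so a 100 bp segment with K = 2 yields 99 K-word occurrences spread over the 16 possible words.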

Step 3: count the frequency $X_t$ of each K-word occurring in each gene segment, where $X_t$ is the frequency with which the t-th K-word occurs in the segment, and standardize the frequencies; the standardized result is $S_t$, the standardized frequency of the t-th K-word;

Table 1: frequency of each word in the two sequences

Table 2: results after word-frequency standardization
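Step 3 can be sketched as below, applied to the first segment from the text. This is a minimal sketch under stated assumptions: frequencies are taken as raw occurrence counts, and δ is computed as the population standard deviation (`statistics.pstdev`); the patent does not say which standard deviation is meant. The function name is hypothetical.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

def standardized_k_word_frequencies(segment, k, alphabet="ACGT"):
    """Return (words, X, S): the K-words, their counts, and their z-scores."""
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    x = [counts[w] for w in words]      # X_t: count of the t-th K-word
    mu, delta = mean(x), pstdev(x)      # assumption: population standard deviation
    s = [(xt - mu) / delta if delta else 0.0 for xt in x]  # S_t = (X_t - mu) / delta
    return words, x, s

first_segment = (
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG"
    "CTTCTGAACTGGTTACCTGCCGTGAGTAAAT"
)
words, x, s = standardized_k_word_frequencies(first_segment, k=2)
```

The vector `s` is the 16-dimensional input that the hash functions of step 4 operate on.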

Step 4: construct a family of m hash functions and compute m hash values for each segment; for example, construct a family of 6 hash functions.

a and b are 16-dimensional vectors: the entries of a follow a Gaussian distribution, and the entries of b are random numbers in (0, r).

a1[0.2728,-0.5261,-1.1690,-1.0743,1.0394,0.0461,0.1844,0.0007,0.0448,-1.1816,-0.2499,-1.8294,-0.3410,1.8652,0.2897,1.9430]

a2[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a3[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a4[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a5[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a6[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

b1[4.5692,1.2317,5.4587,0.7120,4.6902,3.0200,5.4109,2.5662,4.5222, 2.4379,5.7483,1.7211,1.7362,4.0020,0.2040,5.9893]

b2[3.4901,1.3752,0.0630,3.3493,3.6519,3.3700,2.7356,3.2697,3.7174, 4.9290,3.5241,5.7950,5.5712,0.2950,2.7426,0.5739]

b3[2.7009,0.8035,4.7329,1.3775,4.6591,0.6980,4.4261,0.7008,0.8813, 0.8644,2.3402,5.8353,1.3543,3.7460,2.9168,5.3354]

b4[2.5758,2.2662,5.6072,1.1562,1.1456,3.6358,5.4605,4.5566,2.8090, 1.1688,5.1190,0.4194,2.6726,3.0355,3.5281,4.6226]

b5[0.9153,2.3426,3.6952,4.1016,0.0308,1.0368,1.8780,3.7876,5.7924, 2.4167,5.8599,0.6297,5.6887,4.0295,1.6678,1.9148]

b6[1.8040,3.3165,2.1672,3.7913,4.2346,3.4207,0.8061,3.8269,3.8916, 3.2757,1.0729,3.6728,2.2185,2.7205,3.6779,0.1046]

The two segments are hashed according to the p-stable LSH formula, and each segment yields 6 hash values;

The 6 hash values of the first segment are [7,6,6,6,6,6]

The 6 hash values of the second segment are [7,7,6,6,6,7]
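Computing an m-value signature can be sketched as follows. This is a minimal sketch, not the patent's exact procedure: it assumes the standard p-stable form with a scalar offset b per hash function (the text lists b as 16-dimensional vectors), draws a and b at random rather than reusing the listed values, and takes r = 6.0 as an assumption. The function names are hypothetical.

```python
import math
import random

def make_hash_family(m, dim, r, rng):
    """Draw m (a, b) pairs: a has standard Gaussian entries, b is uniform between 0 and r."""
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, r))
            for _ in range(m)]

def signature(v, family, r):
    """Return the m hash values floor((a . v + b) / r) of the vector v."""
    return [math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / r)
            for a, b in family]

r = 6.0                                   # interval length (assumption)
family = make_hash_family(m=6, dim=16, r=r, rng=random.Random(0))
zero_signature = signature([0.0] * 16, family, r)
other_signature = signature([0.5] * 16, family, r)
```

Because every segment and the whole genome are hashed with the same family, their signatures are directly comparable in steps 5 and 6.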

Step 5: divide the 6 hash values of each gene segment into bands, and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome; if all hash values in one or more bands are equal, the segment is considered similar to the whole genome, otherwise horizontal gene transfer is predicted to have occurred;

Step 6: compute the Euclidean distance between the hash values of each candidate segment in the similar set and those of the whole genome; if the Euclidean distance is greater than the threshold, horizontal gene transfer is considered to have occurred;
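Steps 5 and 6 can be sketched together. This is a minimal sketch under stated assumptions: the band size `rows_per_band` and the numeric threshold are hypothetical (the threshold expression in K is not reproduced in this text), and the second example signature is used purely as a stand-in for the whole-genome signature.

```python
import math

def shares_a_band(sig_a, sig_b, rows_per_band):
    """True if at least one band (block of consecutive hash values) matches exactly."""
    for start in range(0, len(sig_a), rows_per_band):
        if sig_a[start:start + rows_per_band] == sig_b[start:start + rows_per_band]:
            return True
    return False

def euclidean_distance(sig_a, sig_b):
    """Euclidean distance between two signatures of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)))

def is_hgt(segment_sig, genome_sig, rows_per_band, threshold):
    """Step 5 then step 6: no shared band means HGT; otherwise compare the distance."""
    if not shares_a_band(segment_sig, genome_sig, rows_per_band):
        return True
    return euclidean_distance(segment_sig, genome_sig) > threshold

segment_sig = [7, 6, 6, 6, 6, 6]
genome_sig = [7, 7, 6, 6, 6, 7]   # stand-in for the whole-genome signature (assumption)
similar = shares_a_band(segment_sig, genome_sig, rows_per_band=2)
distance = euclidean_distance(segment_sig, genome_sig)
flagged = is_hgt(segment_sig, genome_sig, rows_per_band=2, threshold=2.0)
```

With two hash values per band, the middle band (6, 6) matches, so the segment enters the similar set and the final decision rests on the distance check.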

The invention predicts horizontal gene transfer with an alignment-free method, on the assumption that horizontally transferred genes are highly similar to genes of distantly related organisms. Alignment-free methods are generally more sensitive to recently transferred genes and less sensitive to anciently transferred genes that have undergone amelioration. They are usually very fast and can quickly produce a candidate list of putative horizontally transferred regions. They avoid the problem of aligning large amounts of data that may be too divergent to yield a meaningful alignment. Because they do not take the gene as the unit of analysis, they can detect genomic regions of any possible lateral origin.

Using locality-sensitive hashing, the invention not only improves the efficiency of searching for horizontally transferred genes but also increases sensitivity to ameliorated genes, thereby improving prediction accuracy. The object of the invention is to overcome the drawbacks of phylogenetic methods, namely high computational cost and excessive reliance on a reliable phylogenetic tree. Building on alignment-free methods, the most widely used of the parametric methods, the invention proposes a prediction method based on locality-sensitive hashing; by exploiting the properties of locality-sensitive hash functions, it increases sensitivity to anciently transferred genes that have undergone amelioration, overcomes the insensitivity of parametric methods to such genes and, without reducing prediction accuracy, reduces the use of computing resources and improves the computational efficiency of prediction.

Claims

1. A horizontal gene transfer identification method based on locality-sensitive hashing, characterized in that it comprises the following steps:

Step 1: cut the whole-genome data into N segments with the same number of bp according to a set step size;

Step 2: apply K-word processing to each segment obtained by the segmentation, each gene segment yielding a total of $|\Sigma|^K$ words, where $|\Sigma|$ is the size of the gene alphabet and K is the window length of the K-word processing;

Step 3: count the frequency of each K-word in each segment to obtain the K-word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency with which the t-th K-word occurs in the current segment and n is the total number of K-words; then standardize the frequency of each K-word to obtain the standardized result $S_t$, forming the standardized frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized frequency of the t-th K-word and n is the total number of K-words;

Step 4: construct a family of m hash functions and compute m hash values for each segment;

Step 5: divide the m hash values of each gene segment into bands, and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome,

When all hash values in one or more bands are equal, the segment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;

Step 6: compute the Euclidean distance between the m hash values of each candidate segment in the similar set and those of the whole genome; when the Euclidean distance is greater than the set threshold, whose value is defined in terms of K (the window length of the K-word processing, i.e. the length of a K-word), horizontal gene transfer is considered to have occurred;

The Euclidean distance formula is:

$$d = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

where $x_i$ and $y_i$ are the i-th hash values of the candidate segment and of the whole genome respectively, and m is the number of hash functions, i.e. the number of hash values of a sequence.

2. The horizontal gene transfer identification method based on locality-sensitive hashing according to claim 1, characterized in that: in step 1 the whole-genome data are cut, with a step size of 500, into N overlapping segments of 5000 bp each.

3. The horizontal gene transfer identification method based on locality-sensitive hashing according to claim 1, characterized in that: the hash family in step 4 has the form:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

where b is a random number in (0, r), r is the length of the intervals into which the line is divided, and the functions in the family are indexed by the different values of a and b.