CN109243529B

CN109243529B - Horizontal transfer gene identification method based on locality sensitive hashing

Info

Publication number: CN109243529B
Application number: CN201810988512.1A
Authority: CN
Inventors: 江育娥; 魏静; 林劼
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2018-08-28
Filing date: 2018-08-28
Publication date: 2021-09-07
Anticipated expiration: 2038-08-28
Also published as: CN109243529A

Abstract

The invention discloses a horizontal transfer gene identification method based on locality sensitive hashing, which comprises the following steps: step 1, cutting the whole genome data into N fragments with the same bp number according to a set step length; step 2, performing K-word processing on each segment; step 3, counting word frequency of K-words in each segment and carrying out standardization processing; step 4, constructing m hash function families, and calculating m hash values of each segment; step 5, dividing m hash values of each gene segment into lines, comparing the hash values in the lines of the segment with the hash values in the lines of the whole genome, and considering that the segment is similar to the whole genome when all the hash values in at least one line are equal; otherwise, predicting the occurrence of horizontal gene transfer; and 6, calculating the Euclidean distance between the candidate fragments in the similar set and the m hash values of the whole genome, and if the Euclidean distance is larger than a set threshold value, generating horizontal gene transfer. The invention is beneficial to reducing the use resources of the computer and improving the prediction calculation efficiency.

Description

Horizontal transfer gene identification method based on locality sensitive hashing

Technical Field

The invention relates to the field of biological information processing, in particular to a horizontal transfer gene identification method based on locality sensitive hashing.

Background

Horizontal Gene Transfer (HGT) refers to the communication of genetic material between unicellular and/or multicellular organisms in a transverse manner, i.e., the process by which the organism obtains the genetic material from individuals other than the parent, and is an important process in promoting species evolution. Horizontal gene transfer can occur between different species (interspecies-level gene transfer) or between the same species (intraspecies-level gene transfer). It breaks the boundaries of genetic relationships and makes the possibility of gene flow more complex.

As genome sequencing of humans and other organisms has progressed, large numbers of homologous genes have been found in the genomes of different species, even of distantly related organisms, further confirming that horizontal gene transfer is widespread and can span large evolutionary distances. Predicting horizontally transferred genes is of great significance for understanding biological evolution and for the qualitative and quantitative estimation of genetic material exchanged between species. In recent years, the discovery that the natural environment contains large amounts of DNA with transforming activity, as well as competent cells capable of actively taking up foreign DNA, has led to new insights into horizontal gene transfer in the environment. In-depth research on horizontal gene transfer and its ecological effects helps in re-evaluating genetically engineered organisms, so that genetic engineering technology and transgenic organisms can play a greater role.

Methods for detecting horizontal gene transfer (HGT) fall into two main categories: phylogenetic methods and parametric methods. Methods based on phylogenetic trees detect HGT with a higher degree of confidence. However, they rely heavily on the accuracy of the input gene and species trees, which can be challenging to construct. Even when the input trees contain no errors, phylogenetic conflicts may result from evolutionary processes other than HGT (e.g., duplication and loss), leading these methods to infer HGT incorrectly.

Disclosure of Invention

The invention aims to provide a horizontal transfer gene identification method based on locality sensitive hashing.

The technical scheme adopted by the invention is as follows:

the method for identifying the horizontal transfer gene based on locality sensitive hashing comprises the following steps:

step 1, cutting the whole genome data into N fragments with the same bp number according to a set step length;

step 2, performing K-word processing on each fragment, namely passing a sliding window of length K along the fragment and taking the subsequence inside the window as a K-word, giving a total of $|\Sigma|^K$ possible K-words for each gene fragment, where $|\Sigma|$ is the size of the gene's character set and K is the window length of the K-word processing;

step 3, counting the word frequency of the K-words in each fragment to obtain the word-frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency with which the t-th K-word occurs in the current fragment and n is the total number of K-words; and standardizing the word frequency of each K-word to obtain the normalized result $S_t$, forming the normalized word-frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the normalized word frequency of the t-th K-word;

wherein the normalized result $S_t$ of each K-word's frequency is calculated as follows:

$$S_t = \frac{X_t - \mu}{\delta}$$

where $\mu$ is the mean of the word frequencies of all K-words and $\delta$ is the standard deviation of the word frequencies of the K-words;

step 4, constructing m hash function families, and calculating m hash values of each fragment;

step 5, dividing the m hash values of each gene fragment into row bands, and comparing the hash values in the bands of the fragment with the hash values in the bands of the whole genome;

when all hash values in one or more bands are equal, the fragment is considered similar to the whole genome and is added to a similar set; otherwise, horizontal gene transfer is predicted to have occurred;

step 6, calculating Euclidean distances between the candidate segments in the similar set and m hash values of the whole genome, wherein when the Euclidean distances are larger than a set threshold, the value of the threshold is

K is the length of the window for processing the K-word, namely the length of the K-word, and the horizontal gene transfer is considered to occur;

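the Euclidean distance formula is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

where $x_i$ is the i-th hash value of the gene fragment, $y_i$ is the i-th hash value of the whole genome, and m is the number of hash functions, i.e., the number of hash values of the sequence.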

Further, in step 1 the whole genome data is cut into N overlapping fragments of 5000 bp with a step size of 500.

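Further, the hash function family in step 4 is in the form of:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

wherein v is the vector being hashed, a is a random vector generated from a p-stable distribution, b is a random number in (0, r), r is the length of the segments into which the straight line is divided, and each function in the function family is indexed by its particular choice of a and b.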

With the above technical scheme, the invention provides a prediction method based on locality sensitive hashing that uses the alignment-free approach, the most common kind of parametric method. Locality sensitive hashing improves the efficiency of searching for horizontally transferred genes and the sensitivity to modified genes, thereby improving prediction accuracy. This overcomes the insensitivity of parametric methods to modified genes that were horizontally transferred in the distant past, reduces the computer resources used without reducing prediction accuracy, and improves the efficiency of the prediction calculation. Because the alignment-free approach detects more efficiently, the invention also overcomes the drawbacks of phylogenetic methods, namely their high computational cost and excessive dependence on a reliable phylogenetic tree.

Drawings

The invention is described in further detail below with reference to the accompanying drawings and the detailed description;

FIG. 1 is a schematic flow chart of the horizontal transfer gene identification method based on locality sensitive hashing according to the present invention.

Detailed Description

Horizontal gene transfer (HGT) occurs frequently in nature and is an important factor in the evolution of many organisms. The genetic material it introduces plays an important role in promoting genome innovation, can give the host organism a selective advantage, and improves its adaptability to new environments. Because horizontal gene transfer can bring completely different genotypes from distant phylogenetic lineages, or new genes with new functions, into a genome, it is a major source of phenotypic innovation and a mechanism of niche adaptation, and it also enriches species diversity. Detecting HGT events helps us better understand the history of existing genomes and explain the emergence of new factors during biological evolution; it is therefore biologically significant and enriches classical Darwinian evolutionary theory.

The invention uses the p-stable LSH algorithm, which performs locality sensitive hashing directly in Euclidean space without mapping the original space to Hamming space.

Locality sensitivity means that points close together in the original space have a high probability of remaining adjacent in the new data space, while points far apart have a low probability of being mapped to the same bucket. Locality-Sensitive Hashing (LSH) is a method that uses this property to solve the approximate nearest neighbor search problem. The original LSH method maps the original space into Hamming space; that is, the representation of points in the original space is converted into a representation of points in Hamming space, and the distance measure is likewise converted into a distance measure in Hamming space. The search time of the LSH method is linear in the dimensionality and sub-exponential in the scale of the space, so the search time is greatly reduced.

The p-stable LSH is applied in d-dimensional Euclidean space, with 0 < p ≤ 2. It is an advanced form of LSH that applies the concept of p-stable distributions. A p-stable distribution is not one specific distribution but a family of distributions satisfying a certain condition. When p is 1, it is the standard Cauchy distribution; when p is 2, it is the Gaussian distribution. The p-stable LSH function family is of the form:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

wherein v is the vector being hashed, a is a random vector generated from a p-stable distribution, b is a random number in (0, r), r is the length of the segments into which a straight line is divided, and each function in the hash function family is indexed by its particular choice of a and b. The dot product of the vectors a and v is used to generate the hash function family, and this family is locality sensitive. The p-stable LSH method divides a straight line into equal-length segments of length r; points mapped onto the same segment are given the same hash value, and points mapped onto different segments are given different hash values.
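A single hash function of this family can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `pstable_hash`, the segment length `r = 6.0`, the seed and the toy vector `v` are assumptions made here, and `a` is drawn from a Gaussian (the p = 2 case).

```python
import math
import random


def pstable_hash(v, a, b, r):
    """Project v onto a, shift by b, and return the index of the
    length-r segment of the line that the projection falls into."""
    dot = sum(ai * vi for ai, vi in zip(a, v))
    return math.floor((dot + b) / r)


# Hypothetical parameters, chosen only for illustration.
rng = random.Random(0)
dim, r = 16, 6.0
a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian entries: the p = 2 case
b = rng.uniform(0.0, r)                        # random offset within one segment
v = [0.1 * i for i in range(dim)]              # toy 16-dimensional input vector

print(pstable_hash(v, a, b, r))
```

Two vectors whose projections onto a land in the same length-r segment receive the same value; a larger r makes collisions more likely.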

As shown in FIG. 1, the invention discloses a method for identifying a horizontal transfer gene based on locality sensitive hashing, which comprises the following steps:

step 3, counting the word frequency of the K-words in each fragment to obtain the word-frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency with which the t-th K-word occurs in the current fragment and n is the total number of K-words; and standardizing the word frequency of each K-word to obtain the normalized result $S_t$, forming the normalized word-frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the normalized word frequency of the t-th K-word;

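the Euclidean distance formula is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

where $x_i$ is the i-th hash value of the gene fragment, $y_i$ is the i-th hash value of the whole genome, and m is the number of hash functions.

Further, the hash function family in step 4 is in the form of:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

wherein v is the vector being hashed, a is a random vector generated from a p-stable distribution, b is a random number in (0, r), and r is the length of the segments into which the straight line is divided.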

To verify the prediction performance of the method, the predicted HGT fragments can be compared with the actual HGT fragments, and the F-measure can be used to evaluate the horizontal transfer gene identification model based on locality sensitive hashing.
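The F-measure evaluation can be sketched as follows, assuming that predictions and ground truth are both given as sets of fragment indices. The function name `f_measure` and the `beta` parameter are illustrative choices; the text does not say which variant of the F-measure is used, so the balanced form (beta = 1) is taken as the default.

```python
def f_measure(predicted, actual, beta=1.0):
    """F-measure of predicted HGT fragments against the actual ones.

    predicted, actual: iterables of fragment identifiers.
    beta = 1 gives the harmonic mean of precision and recall.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Toy example: fragments 2 and 3 are correctly predicted, 1 is a false
# positive and 4 is missed.
print(f_measure({1, 2, 3}, {2, 3, 4}))
```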

The following is a detailed description of the process of the present invention:

To describe the processing of simulated data more clearly, an excerpted Escherichia coli fragment of 300 bp is taken as the recipient genome, a gene fragment of 50 bp is randomly excerpted from each of Haemophilus influenzae and Bacillus subtilis as the donor genomes, and two positions on the recipient genome are randomly selected for insertion, forming the simulated HGT data. The horizontal transfer gene identification method based on locality sensitive hashing then proceeds as follows:

for example: a sequence with a total length of 400 bp, part of which consists of HGT-transferred fragments;

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATCTGTCAGAAAAATGACTAAATAGCGGCTCCCACAATGTTCAAATGTGGGGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCGCTGGCGTTTCTCTGCTCGACGGTCACCGGGATTTTATTTGGCTGGTTACACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGG
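Simulated data of this kind could be assembled along the following lines. This is a sketch only: the helper names `random_fragment` and `insert_donors` are hypothetical, and the toy genomes are placeholder strings rather than real Escherichia coli, Haemophilus influenzae or Bacillus subtilis sequences.

```python
import random


def random_fragment(genome, length, rng):
    """Cut a random fragment of the given length out of a genome string."""
    start = rng.randrange(len(genome) - length + 1)
    return genome[start:start + length]


def insert_donors(recipient, donors, rng):
    """Insert each donor fragment at a random position of the recipient.

    Returns the chimeric sequence and the (start, end) span of every
    inserted fragment in the coordinates of the final sequence.
    """
    positions = sorted(rng.randrange(len(recipient) + 1) for _ in donors)
    pieces, spans, prev, offset = [], [], 0, 0
    for pos, donor in zip(positions, donors):
        pieces.append(recipient[prev:pos])
        start = pos + offset
        pieces.append(donor)
        spans.append((start, start + len(donor)))
        offset += len(donor)
        prev = pos
    pieces.append(recipient[prev:])
    return "".join(pieces), spans


# Placeholder genomes (assumptions, not real data).
rng = random.Random(1)
recipient = "A" * 300
donor_genomes = ["C" * 1000, "G" * 1000]
donors = [random_fragment(g, 50, rng) for g in donor_genomes]
simulated, spans = insert_donors(recipient, donors, rng)
print(len(simulated), spans)
```

Keeping the spans makes it possible to label, afterwards, which fragments of the simulated sequence overlap an inserted region.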

step 1, cutting the sequence into overlapping fragments of 100 bp, wherein the step length is 20, and 20 fragments can be obtained;

for example, the first fragment is:

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT

the second fragment is:

ACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATC

and so on. The following is presented by way of example of the first two segments:
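The cutting in step 1 can be sketched as below. The function name `cut_fragments` is an assumption, and the sketch keeps only full-length windows. Applied to the first 120 bp of the example sequence with a window of 100 bp and a step of 20, it yields the two fragments listed above.

```python
def cut_fragments(sequence, window, step):
    """Cut a sequence into overlapping fragments of equal length.

    Only full-length windows are returned (an assumption of this sketch).
    """
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, step)]


# First 120 bp of the example sequence.
PREFIX = (
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC"
    "TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATC"
)

fragments = cut_fragments(PREFIX, window=100, step=20)
for fragment in fragments:
    print(fragment)
```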

step 2, performing K-word processing on each fragment, namely passing a sliding window of length K along the fragment and taking the subsequence inside the window as a K-word, giving a total of $|\Sigma|^K$ possible K-words for each gene fragment;

for example, assuming the sliding window length K is 2, each fragment has 4^2 = 16 possible K-words: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
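The K-word vocabulary and the sliding window can be sketched as follows, assuming the DNA alphabet A, C, G, T. The function names are illustrative only.

```python
from itertools import product


def all_kwords(k, alphabet="ACGT"):
    """Enumerate every possible K-word over the alphabet, in sorted order."""
    return ["".join(p) for p in product(alphabet, repeat=k)]


def sliding_kwords(fragment, k):
    """Return the K-words seen by a window of length k moved one base at a time."""
    return [fragment[i:i + k] for i in range(len(fragment) - k + 1)]


print(all_kwords(2))
print(sliding_kwords("AGCTT", 2))
```

A fragment of length L contains L - K + 1 windows, so a 100 bp fragment with K = 2 yields 99 K-word occurrences spread over the 16 possible words.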

Step 3, counting the word frequency $X_t$ of the K-words appearing in each gene fragment, where $X_t$ is the frequency with which the t-th K-word appears in the fragment, and normalizing the word frequencies, where the normalized result $S_t$ is the normalized word frequency of the t-th K-word;

table one: word frequency of each word in two sequences

Table two: word frequency normalized results
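Counting and standardizing the K-word frequencies might look like the sketch below. Two assumptions are made here: raw occurrence counts are used as the word frequency, and the standard deviation is the population standard deviation (`statistics.pstdev`); the text does not specify either. The short input is simply the first 10 bases of the first fragment.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev


def kword_frequencies(fragment, k, alphabet="ACGT"):
    """Count each possible K-word in the fragment, in sorted K-word order."""
    counts = Counter(fragment[i:i + k] for i in range(len(fragment) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def normalize(freqs):
    """Standardize frequencies: subtract the mean, divide by the std. deviation."""
    mu = mean(freqs)
    delta = pstdev(freqs)  # population standard deviation (assumption)
    if delta == 0:
        return [0.0] * len(freqs)
    return [(x - mu) / delta for x in freqs]


x = kword_frequencies("AGCTTTTCAT", 2)
s = normalize(x)
print(x)
print([round(value, 3) for value in s])
```

Because the standardization subtracts the mean and divides by the standard deviation, using relative frequencies instead of raw counts would give the same normalized values.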

Step 4, constructing m hash function families and calculating m hash values of each fragment; for example, constructing 6 hash function families.

a and b are 16-dimensional vectors that follow a Gaussian distribution.

a1[0.2728,-0.5261,-1.1690,-1.0743,1.0394,0.0461,0.1844,0.0007,0.0448,-1.1816,-0.2499,-1.8294,-0.3410,1.8652,0.2897,1.9430]

a2[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a3[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a4[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a5[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

a6[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]

b1[4.5692,1.2317,5.4587,0.7120,4.6902,3.0200,5.4109,2.5662,4.5222,2.4379,5.7483,1.7211,1.7362,4.0020,0.2040,5.9893]

b2[3.4901,1.3752,0.0630,3.3493,3.6519,3.3700,2.7356,3.2697,3.7174,4.9290,3.5241,5.7950,5.5712,0.2950,2.7426,0.5739]

b3[2.7009,0.8035,4.7329,1.3775,4.6591,0.6980,4.4261,0.7008,0.8813,0.8644,2.3402,5.8353,1.3543,3.7460,2.9168,5.3354]

b4[2.5758,2.2662,5.6072,1.1562,1.1456,3.6358,5.4605,4.5566,2.8090,1.1688,5.1190,0.4194,2.6726,3.0355,3.5281,4.6226]

b5[0.9153,2.3426,3.6952,4.1016,0.0308,1.0368,1.8780,3.7876,5.7924,2.4167,5.8599,0.6297,5.6887,4.0295,1.6678,1.9148]

b6[1.8040,3.3165,2.1672,3.7913,4.2346,3.4207,0.8061,3.8269,3.8916,3.2757,1.0729,3.6728,2.2185,2.7205,3.6779,0.1046]

The two fragments are hashed according to the p-stable LSH formula, and each fragment obtains 6 hash values;

the 6 hash values of the first fragment are [7, 6, 6, 6, 6, 6]

the 6 hash values of the second fragment are [7, 7, 6, 6, 6, 7]
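Computing the m hash values of a fragment can be sketched as follows. This is a sketch under stated assumptions, not a reproduction of the numbers above: each hash function uses a freshly drawn Gaussian vector `a` and a scalar offset `b` drawn uniformly from (0, r), as in the p-stable LSH formula, and the names `make_hash_family` and `signature`, the seed and `r = 6.0` are hypothetical.

```python
import math
import random


def make_hash_family(m, dim, r, seed=0):
    """Draw m (a, b) pairs: a is a Gaussian vector, b is uniform in (0, r)."""
    rng = random.Random(seed)
    return [([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, r))
            for _ in range(m)]


def signature(v, family, r):
    """Apply every hash function of the family to the vector v."""
    return [math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / r)
            for a, b in family]


r = 6.0                                   # hypothetical segment length
family = make_hash_family(m=6, dim=16, r=r)
v = [0.5 - 0.05 * i for i in range(16)]   # toy normalized word-frequency vector
print(signature(v, family, r))
```

The same family must be used for every fragment and for the whole genome, otherwise the resulting hash values are not comparable.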

Step 5, dividing the 6 hash values of each gene fragment into row bands, and comparing the hash values in the bands of the fragment with the hash values in the bands of the whole genome; if all hash values in one or more bands are equal, the fragment is considered similar to the whole genome; otherwise, horizontal gene transfer is predicted to have occurred;
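The band comparison of step 5 can be sketched as below. Two assumptions are made: bands are compared position by position, and the band size (`rows_per_band`) is a free parameter, since the text does not state it. The reference signature in the example is hypothetical and stands in for the hash values of the whole genome.

```python
def split_into_bands(signature, rows_per_band):
    """Split a list of hash values into consecutive bands of equal size."""
    return [tuple(signature[i:i + rows_per_band])
            for i in range(0, len(signature), rows_per_band)]


def is_similar(fragment_sig, genome_sig, rows_per_band):
    """True when at least one band has all of its hash values equal."""
    fragment_bands = split_into_bands(fragment_sig, rows_per_band)
    genome_bands = split_into_bands(genome_sig, rows_per_band)
    return any(f == g for f, g in zip(fragment_bands, genome_bands))


fragment_sig = [7, 6, 6, 6, 6, 6]
genome_sig = [7, 7, 6, 6, 6, 7]   # hypothetical reference signature
print(is_similar(fragment_sig, genome_sig, rows_per_band=2))
```

Smaller bands make a match easier, so more fragments enter the similar set; larger bands make the filter stricter.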

step 6, calculating the Euclidean distance between the hash values of each candidate fragment in the similar set and the hash values of the whole genome; if the Euclidean distance is larger than a threshold, horizontal gene transfer is considered to have occurred.
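The distance check of step 6 can be sketched as follows. The threshold is passed in by the caller as a plain number, and the value 1.0 used in the example is purely hypothetical; the reference signature is likewise a stand-in for the hash values of the whole genome.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two equally long lists of hash values."""
    if len(x) != len(y):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def is_hgt_candidate(fragment_sig, genome_sig, threshold):
    """Flag the fragment when its distance to the genome exceeds the threshold."""
    return euclidean_distance(fragment_sig, genome_sig) > threshold


fragment_sig = [7, 6, 6, 6, 6, 6]
genome_sig = [7, 7, 6, 6, 6, 7]   # hypothetical reference signature
print(euclidean_distance(fragment_sig, genome_sig))
print(is_hgt_candidate(fragment_sig, genome_sig, threshold=1.0))
```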

The invention uses an alignment-free method to predict horizontal gene transfer, on the assumption that horizontally transferred genes are highly similar to genes of distantly related organisms. Alignment-free methods are generally more sensitive to recently transferred genes and less sensitive to genes that were transferred long ago and have since been modified. They are generally fast and can quickly produce a candidate list of putative horizontally transferred regions. They avoid the problem of aligning large amounts of data that may be too divergent to yield a meaningful alignment. Because they do not use genes as the unit of analysis, they can detect any genomic region of possible lateral origin.

The method uses locality sensitive hashing to improve the efficiency of searching for horizontally transferred genes and the sensitivity to modified genes, thereby improving prediction accuracy. The invention aims to overcome the drawbacks of phylogenetic methods, namely their high computational cost and excessive dependence on a reliable phylogenetic tree. To this end, it provides a prediction method based on locality sensitive hashing that uses the alignment-free approach, the most common kind of parametric method. The properties of locality-sensitive hash functions improve the method's sensitivity to anciently transferred genes that have since been modified, overcoming the insensitivity of parametric methods to such genes, while reducing the computer resources used without reducing prediction accuracy and improving the efficiency of the prediction calculation.

Claims

1. A horizontal transfer gene identification method based on locality sensitive hashing, characterized by comprising the following steps:

step 2, performing K-word processing on each fragment, giving a total of $|\Sigma|^K$ possible K-words for each gene fragment, where $|\Sigma|$ is the size of the gene's character set and K is the window length of the K-word processing;

step 4, constructing m hash functions, and calculating m hash values of each segment;

the Euclidean distance formula is as follows:

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

wherein $x_i$ is the i-th hash value of the gene fragment, $y_i$ is the i-th hash value of the whole genome, and m is the number of hash functions, i.e., the number of hash values of the sequence.

2. The locality-sensitive-hash-based horizontal transfer gene identification method according to claim 1, wherein: in step 1 the whole genome data is cut into N overlapping fragments of 5000 bp with a step size of 500.

3. The locality-sensitive-hash-based horizontal transfer gene identification method according to claim 1, wherein: the hash function family in step 4 is in the form of:

$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$

wherein v is the vector being hashed, a is a random number sequence generated from a p-stable distribution, b is a random number in (0, r), r is the length of the segments into which the straight line is divided, and each function in the function family is indexed by its particular choice of a and b.