CN109243529B - Horizontal transfer gene identification method based on locality sensitive hashing - Google Patents

Publication number
CN109243529B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201810988512.1A
Other languages
Chinese (zh)
Other versions
CN109243529A (en)
Inventor
江育娥
魏静
林劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2018-08-28
Filing date
2018-08-28
Publication date
2021-09-07
Application filed by Fujian Normal University
Priority to CN201810988512.1A
Publication of CN109243529A
Application granted
Publication of CN109243529B
Legal status: Expired - Fee Related

Abstract

The invention discloses a horizontal transfer gene identification method based on locality sensitive hashing, which comprises the following steps: step 1, cutting the whole genome data into N fragments with the same number of bp according to a set step length; step 2, performing K-word processing on each fragment; step 3, counting the word frequency of the K-words in each fragment and standardizing it; step 4, constructing m hash function families and calculating m hash values for each fragment; step 5, dividing the m hash values of each gene fragment into bands (row bars), comparing the hash values in the bands of the fragment with the hash values in the bands of the whole genome, and considering the fragment similar to the whole genome when all the hash values in at least one band are equal; otherwise, predicting that horizontal gene transfer has occurred; step 6, calculating the Euclidean distance between the m hash values of each candidate fragment in the similar set and the m hash values of the whole genome, and considering that horizontal gene transfer has occurred if the Euclidean distance is larger than a set threshold value. The invention helps to reduce the computer resources used and to improve the efficiency of the prediction calculation.

Description

Horizontal transfer gene identification method based on locality sensitive hashing
Technical Field
The invention relates to the field of biological information processing, in particular to a horizontal transfer gene identification method based on locality sensitive hashing.
Background
Horizontal Gene Transfer (HGT) refers to the lateral exchange of genetic material between unicellular and/or multicellular organisms, i.e., the process by which an organism obtains genetic material from individuals other than its parents, and is an important process in promoting species evolution. Horizontal gene transfer can occur between different species (interspecies horizontal gene transfer) or within the same species (intraspecies horizontal gene transfer). It breaks the boundaries of genetic relationships and makes the possibilities of gene flow more complex.
As the genomes of humans and other organisms have been sequenced one after another, a large number of homologous genes have been found in the genomes of different species, even of distantly related organisms, further confirming how widespread horizontal gene transfer is and how large the phylogenetic distances it can cross. The prediction of horizontally transferred genes is of great significance for understanding the process of biological evolution and for the qualitative and quantitative estimation of genetic material exchanged between species. In recent years, the discovery in the natural environment of large numbers of DNA molecules with transforming activity, and of competent cells capable of actively taking up foreign DNA, has led to new insights into the horizontal gene transfer that occurs in the environment. In-depth research on horizontal gene transfer and its ecological effects helps to make new evaluations of genetically engineered organisms, so that genetic engineering technology and transgenic organisms can play a greater role.
Methods for detecting Horizontal Gene Transfer (HGT) fall into two main categories: phylogenetic methods and parametric methods. Methods based on phylogenetic trees detect HGT with a higher degree of confidence. However, phylogenetic approaches rely heavily on the accuracy of the input gene and species trees, which can be challenging to construct. Even if there are no errors in the input trees, phylogenetic conflicts may be the result of evolutionary processes other than HGT (e.g., duplication and loss), leading these methods to infer HGT incorrectly.
Disclosure of Invention
The invention aims to provide a horizontal transfer gene identification method based on locality sensitive hashing.
The technical scheme adopted by the invention is as follows:
The method for identifying horizontally transferred genes based on locality sensitive hashing comprises the following steps:
step 1, cutting the whole genome data into N fragments with the same number of bp according to a set step length;
step 2, performing K-word processing on each fragment, namely sliding a window of length K along the fragment, the sequence inside the window being one K-word, each gene fragment yielding a total of $|\Sigma|^K$ K-words, where $|\Sigma|$ is the size of the character set of the gene and K is the window length of the K-word processing;
step 3, counting the word frequency of the K-words in each fragment to obtain the word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency of occurrence of the t-th K-word in the current fragment and n is the total number of K-words; and standardizing the word frequency of each K-word to obtain the standardized result $S_t$, forming the standardized word frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized word frequency of the t-th K-word;
wherein the standardized result $S_t$ of the word frequency of each K-word is calculated as follows:
$$S_t = \frac{X_t - \mu}{\delta}$$
where $\mu$ is the mean of the word frequencies of all K-words and $\delta$ is the standard deviation of the word frequencies of the K-words;
step 4, constructing m hash function families and calculating m hash values for each fragment;
step 5, dividing the m hash values of each gene fragment into bands (row bars), and comparing the hash values in the bands of the fragment with the hash values in the bands of the whole genome;
when all the hash values in one or more bands are equal, the fragment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;
step 6, calculating the Euclidean distance between the m hash values of each candidate fragment in the similar set and the m hash values of the whole genome; when the Euclidean distance is larger than a set threshold, horizontal gene transfer is considered to have occurred, the value of the threshold being
Figure BDA0001780244490000022
where K is the window length of the K-word processing, namely the length of a K-word;
the Euclidean distance formula is as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
where $x_i$ is the i-th hash value of the gene fragment, $y_i$ is the i-th hash value of the whole genome, and m is the number of hash functions, i.e., the number of hash values of the sequence.
Further, in step 1 the whole genome data is cut into N overlapping fragments of 5000 bp with a step size of 500.
Further, the hash function family in step 4 has the form:
$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$
where v is the vector being hashed, a is a random number sequence generated from a p-stable distribution, b is a random number in (0, r), r is the length of the segments into which the straight line is divided, and the functions in the family are indexed by their different values of a and b.
By adopting the above technical scheme, the invention provides a prediction method based on locality sensitive hashing that uses the alignment-free approach, the most common kind of parametric method. Locality sensitive hashing improves the efficiency of searching for horizontally transferred genes and increases the sensitivity to ameliorated genes, thereby improving prediction accuracy. This overcomes the insensitivity of parametric methods to ameliorated genes that were horizontally transferred in the distant past, reduces the computer resources used without reducing prediction accuracy, and improves the efficiency of the prediction calculation. Because the alignment-free approach detects transfers more efficiently, the invention also overcomes the drawbacks of phylogenetic methods, namely their high computational cost and their heavy dependence on a reliable phylogenetic tree.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
FIG. 1 is a schematic flow chart of the horizontal transfer gene identification method based on locality sensitive hashing according to the present invention.
Detailed Description
Horizontal gene transfer occurs frequently in nature and is an important factor in the evolution of many organisms. The genetic material it introduces plays an important role in promoting genome innovation, can provide a selective advantage for host organisms, and improves their adaptability to new environments. Since horizontal gene transfer can bring completely different genotypes from distant phylogenetic lineages, or new genes with new functions, into the genome, it is a major source of phenotypic innovation and a mechanism of niche adaptation, and it also enriches species diversity. The detection of HGT events helps us to better understand the historical background of existing genomes and to explain the emergence of new factors in the course of biological evolution; it is biologically significant and enriches the classical Darwinian theory of evolution.
The invention uses the p-stable LSH algorithm, which performs locality sensitive hashing directly in Euclidean space without mapping the original space to Hamming space.
Locality sensitivity means that points that are close in the original space have a high probability of remaining adjacent in the new data space, while points that are far apart have a low probability of being mapped to the same bucket. Locality-Sensitive Hashing (LSH) is a method that uses locality sensitivity to solve the approximate nearest neighbor search problem. The original LSH method maps the original space into Hamming space, that is, the representation of points in the original space is converted into a representation of points in Hamming space, and the distance measure is likewise converted into the distance measure of Hamming space. The search time of the LSH method is linear in the dimensionality and sublinear in the size of the data set, so the search time is greatly reduced.
p-stable LSH is applied in d-dimensional Euclidean space, with 0 < p ≤ 2. It is a further development of LSH that applies the concept of p-stable distributions. A p-stable distribution is not one specific distribution but a family of distributions that satisfy a certain condition. For p = 1 the representative is the standard Cauchy distribution; for p = 2 it is the Gaussian distribution. The p-stable LSH function family has the form:
$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$
where b is a random number in (0, r), r is the length of the segments into which the straight line is divided, and the functions in the hash function family are indexed by their different values of a and b. The dot product of the vectors a and v is used to generate the hash function family, and the hash function family is locality sensitive. The p-stable LSH method divides a straight line into equal-length segments of length r; points mapped onto the same segment are assigned the same hash value, and points mapped onto different segments are assigned different hash values.
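Why such a family is locality sensitive can be seen from the defining property of p-stable distributions. The short derivation below is a standard textbook result, not something stated in the patent text, and the symbols $v_1$, $v_2$ and $X$ are introduced here purely for illustration.

```latex
% Assumption: the entries of a are i.i.d. draws from a p-stable distribution,
% and X denotes one further draw from that same distribution.
\[
a \cdot v_1 - a \cdot v_2 \;=\; a \cdot (v_1 - v_2) \;\sim\; \lVert v_1 - v_2 \rVert_p \, X .
\]
% Two vectors share a hash value exactly when both projections, shifted by b,
% fall into the same segment of length r:
\[
h_{a,b}(v_1) = h_{a,b}(v_2)
\iff
\left\lfloor \frac{a \cdot v_1 + b}{r} \right\rfloor
=
\left\lfloor \frac{a \cdot v_2 + b}{r} \right\rfloor .
\]
```

The gap between the two projections therefore scales with the distance between the vectors, so the smaller $\lVert v_1 - v_2 \rVert_p$ is relative to r, the more likely the two points are to land in the same segment.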
As shown in FIG. 1, the invention discloses a method for identifying a horizontal transfer gene based on locality sensitive hashing, which comprises the following steps:
step 1, cutting the whole genome data into N fragments with the same number of bp according to a set step length;
step 2, performing K-word processing on each fragment, namely sliding a window of length K along the fragment, the sequence inside the window being one K-word, each gene fragment yielding a total of $|\Sigma|^K$ K-words, where $|\Sigma|$ is the size of the character set of the gene and K is the window length of the K-word processing;
step 3, counting the word frequency of the K-words in each fragment to obtain the word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency of occurrence of the t-th K-word in the current fragment and n is the total number of K-words; and standardizing the word frequency of each K-word to obtain the standardized result $S_t$, forming the standardized word frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized word frequency of the t-th K-word;
wherein the standardized result $S_t$ of the word frequency of each K-word is calculated as follows:
$$S_t = \frac{X_t - \mu}{\delta}$$
where $\mu$ is the mean of the word frequencies of all K-words and $\delta$ is the standard deviation of the word frequencies of the K-words;
step 4, constructing m hash function families and calculating m hash values for each fragment;
step 5, dividing the m hash values of each gene fragment into bands (row bars), and comparing the hash values in the bands of the fragment with the hash values in the bands of the whole genome;
when all the hash values in one or more bands are equal, the fragment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;
step 6, calculating the Euclidean distance between the m hash values of each candidate fragment in the similar set and the m hash values of the whole genome; when the Euclidean distance is larger than a set threshold, horizontal gene transfer is considered to have occurred, the value of the threshold being
Figure BDA0001780244490000042
where K is the window length of the K-word processing, namely the length of a K-word;
the Euclidean distance formula is as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
where $x_i$ is the i-th hash value of the gene fragment, $y_i$ is the i-th hash value of the whole genome, and m is the number of hash functions, i.e., the number of hash values of the sequence.
Further, in step 1 the whole genome data is cut into N overlapping fragments of 5000 bp with a step size of 500.
Further, the hash function family in step 4 has the form:
$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$
where v is the vector being hashed, a is a random number sequence generated from a p-stable distribution, b is a random number in (0, r), r is the length of the segments into which the straight line is divided, and the functions in the family are indexed by their different values of a and b.
In order to verify the prediction effect of the method, experimental result analysis can be carried out according to the result of the predicted HGT segment and the actual HGT segment, and the F-measure method is utilized to evaluate the horizontal transfer gene identification method model based on the locality sensitive hash.
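As an illustration of the F-measure evaluation, the following is a minimal sketch that assumes predictions and ground truth are compared as sets of fragment indices. The text does not state the unit of comparison or the weighting, so the function name `f_measure`, the index-set representation and the default `beta = 1` are assumptions made for this example.

```python
def f_measure(predicted, actual, beta=1.0):
    """F-measure of predicted HGT fragments against the actual HGT fragments.

    `predicted` and `actual` are iterables of fragment indices (an assumed
    representation). beta = 1 gives the balanced F1 score.
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical indices: 4 fragments predicted, 3 truly transferred, 2 in common.
score = f_measure(predicted=[1, 2, 3, 4], actual=[3, 4, 5])
print(round(score, 4))  # 0.5714 (precision 1/2, recall 2/3, F1 = 4/7)
```

Comparing at the level of base-pair positions instead of fragment indices would only change how the two sets are built, not the formula.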
The following is a detailed description of the process of the present invention:
To describe the processing of the simulated data more clearly, an excerpted Escherichia coli segment of 300 bp is taken as the recipient, gene segments of 50 bp are randomly excerpted from Haemophilus influenzae and Bacillus subtilis as the donor genomes, and two positions on the recipient genome are randomly selected for insertion to form the simulated HGT data. The horizontal transfer gene identification method based on locality sensitive hashing then proceeds as follows:
For example, take a sequence with a total length of 400 bp, part of which consists of HGT-transferred fragments:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATCTGTCAGAAAAATGACTAAATAGCGGCTCCCACAATGTTCAAATGTGGGGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCGCTGGCGTTTCTCTGCTCGACGGTCACCGGGATTTTATTTGGCTGGTTACACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGG
step 1, cutting the sequence into overlapping fragments of 100 bp, wherein the step length is 20, and 20 fragments can be obtained;
for example, the first fragment is:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
the second fragment is:
ACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATC
and so on. The following is presented by way of example of the first two segments:
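The fragmentation of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `cut_fragments` and the `circular` flag are assumptions. A purely linear cut of a 400 bp sequence with a 100 bp window and step 20 gives 16 fragments, while the count of 20 quoted above matches a cut that wraps around the end of the sequence (400 / 20 start positions). The text does not say which treatment is intended.

```python
def cut_fragments(sequence, length, step, circular=False):
    """Cut `sequence` into overlapping fragments of `length` bp, advancing by `step` bp."""
    n = len(sequence)
    if circular:
        # Assumption: wrap around the end, as for a circular chromosome,
        # so that every start position 0, step, 2*step, ... yields a full fragment.
        extended = sequence + sequence[:length - 1]
        starts = range(0, n, step)
    else:
        # Linear cut: only start positions that leave room for a full fragment.
        extended = sequence
        starts = range(0, n - length + 1, step)
    return [extended[s:s + length] for s in starts]


toy = "ACGT" * 100  # hypothetical 400 bp sequence, not the one from the example
linear = cut_fragments(toy, length=100, step=20)
wrapped = cut_fragments(toy, length=100, step=20, circular=True)
print(len(linear), len(wrapped))  # 16 20
```

With the parameters named elsewhere in the text, the same call would be `cut_fragments(genome, length=5000, step=500)`.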
step 2, performing K-word processing on each fragment, namely sliding a window of length K along the fragment, the sequence inside the window being one K-word, each gene fragment yielding a total of $|\Sigma|^K$ K-words;
for example, assuming that the sliding window length K is 2, the $4^2 = 16$ K-words obtained for each fragment are: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
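The K-word enumeration and counting can be sketched as below, assuming the DNA alphabet A, C, G, T and a window that advances one base at a time. The function name `kword_counts` is hypothetical, and skipping windows that contain other characters (such as N) is a choice made for this sketch rather than something the text specifies.

```python
from itertools import product


def kword_counts(fragment, k, alphabet="ACGT"):
    """Count every possible K-word of length k in `fragment`.

    Returns a dict with one entry per possible word (|alphabet|**k entries),
    in lexicographic order, so fragments can be compared position by position.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(fragment) - k + 1):
        word = fragment[i:i + k]
        if word in counts:  # windows with characters outside the alphabet are skipped
            counts[word] += 1
    return counts


counts = kword_counts("AGCTTTTCAT", k=2)  # first 10 bases of the example sequence
print(list(counts)[:4])  # ['AA', 'AC', 'AG', 'AT']
print(counts["TT"])      # 3
```

Keeping zero counts for absent words gives every fragment a vector of the same length, which the later hashing step relies on.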
Step 3, counting the word frequency $X_t$ of the K-words appearing in each gene fragment, where $X_t$ is the frequency with which the t-th K-word appears in the fragment, and standardizing the word frequencies, where $S_t$ is the standardized word frequency of the t-th K-word;
Figure BDA0001780244490000061
Table 1: Word frequency of each word in the two sequences
Figure BDA0001780244490000062
Table 2: Standardized word frequencies
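The standardization of step 3 can be sketched as below, assuming the formula is the usual z-score built from the mean and standard deviation that the text defines. Whether the source uses the population or the sample standard deviation is not stated; the population form is an assumption here, as is the function name `standardize`.

```python
import statistics


def standardize(freqs):
    """Return (x - mean) / std for each word frequency in `freqs`."""
    mu = statistics.fmean(freqs)
    delta = statistics.pstdev(freqs)  # assumption: population standard deviation
    if delta == 0:
        # All words equally frequent: no deviation to express.
        return [0.0] * len(freqs)
    return [(x - mu) / delta for x in freqs]


# Hypothetical word frequencies for a short fragment.
standardized = standardize([1, 2, 3])
print([round(s, 4) for s in standardized])  # [-1.2247, 0.0, 1.2247]
```

After this step every fragment's vector has mean 0 and unit spread, so fragments of different composition are hashed on a comparable scale.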
Step 4, constructing m hash function families and calculating m hash values for each fragment; for example, 6 hash function families are constructed.
a and b are 16-dimensional vectors that follow a Gaussian distribution.
a1[0.2728,-0.5261,-1.1690,-1.0743,1.0394,0.0461,0.1844,0.0007,0.0448,-1.1816,-0.2499,-1.8294,-0.3410,1.8652,0.2897,1.9430]
a2[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a3[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a4[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a5[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a6[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214,0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
b1[4.5692,1.2317,5.4587,0.7120,4.6902,3.0200,5.4109,2.5662,4.5222,2.4379,5.7483,1.7211,1.7362,4.0020,0.2040,5.9893]
b2[3.4901,1.3752,0.0630,3.3493,3.6519,3.3700,2.7356,3.2697,3.7174,4.9290,3.5241,5.7950,5.5712,0.2950,2.7426,0.5739]
b3[2.7009,0.8035,4.7329,1.3775,4.6591,0.6980,4.4261,0.7008,0.8813,0.8644,2.3402,5.8353,1.3543,3.7460,2.9168,5.3354]
b4[2.5758,2.2662,5.6072,1.1562,1.1456,3.6358,5.4605,4.5566,2.8090,1.1688,5.1190,0.4194,2.6726,3.0355,3.5281,4.6226]
b5[0.9153,2.3426,3.6952,4.1016,0.0308,1.0368,1.8780,3.7876,5.7924,2.4167,5.8599,0.6297,5.6887,4.0295,1.6678,1.9148]
b6[1.8040,3.3165,2.1672,3.7913,4.2346,3.4207,0.8061,3.8269,3.8916,3.2757,1.0729,3.6728,2.2185,2.7205,3.6779,0.1046]
The two segments are calculated according to a p-stable LSH formula, and each segment obtains 6 hash values;
the first fragment had 6 hash values [7, 6, 6, 6, 6, 6]
The second fragment had 6 hash values [7, 7, 6, 6, 6, 7]
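A hash computation of this kind can be sketched with the textbook p-stable LSH form floor((a·v + b) / r). In this sketch a is a Gaussian (2-stable) vector and b is a single scalar drawn uniformly from (0, r), following the description of the function family; the example above instead lists b as 16-dimensional vectors, and the sketch does not attempt to reproduce that. The names `make_hash_functions` and `lsh_signature`, the seed and the value of r are all assumptions.

```python
import math
import random


def make_hash_functions(m, dim, r, seed=0):
    """Draw m (a, b) pairs: a is a Gaussian vector of length dim, b is uniform in (0, r)."""
    rng = random.Random(seed)
    funcs = []
    for _ in range(m):
        a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # 2-stable projection vector
        b = rng.uniform(0.0, r)                        # random offset along the line
        funcs.append((a, b))
    return funcs


def lsh_signature(v, funcs, r):
    """Return the m hash values floor((a . v + b) / r) for the vector v."""
    signature = []
    for a, b in funcs:
        projection = sum(ai * vi for ai, vi in zip(a, v))
        signature.append(math.floor((projection + b) / r))
    return signature


# Hypothetical 16-dimensional standardized word-frequency vector (K = 2).
vector = [0.1 * i for i in range(16)]
functions = make_hash_functions(m=6, dim=16, r=4.0, seed=1)
print(lsh_signature(vector, functions, r=4.0))
```

The same list of (a, b) pairs must be reused for every fragment and for the whole genome, otherwise the resulting signatures are not comparable.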
Step 5, dividing the 6 hash values of each gene segment into lines, comparing the hash values in the lines of the segment with the hash values in the lines of the whole genome, if all the hash values in one or more lines are equal, determining that the segment is similar to the whole genome, otherwise, predicting that horizontal gene transfer occurs;
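The band comparison of step 5 can be sketched as follows. The number of hash values per band is not given in the text, so `rows_per_band` is an assumed parameter, and comparing each band with the band at the same position in the other signature is likewise an assumption. Because the whole-genome signature is not listed in the example, the second fragment's signature stands in for it here.

```python
def split_into_bands(signature, rows_per_band):
    """Split a list of hash values into consecutive bands of rows_per_band values."""
    return [tuple(signature[i:i + rows_per_band])
            for i in range(0, len(signature), rows_per_band)]


def is_similar(fragment_sig, genome_sig, rows_per_band):
    """True if at least one band has all of its hash values equal in both signatures."""
    fragment_bands = split_into_bands(fragment_sig, rows_per_band)
    genome_bands = split_into_bands(genome_sig, rows_per_band)
    return any(f == g for f, g in zip(fragment_bands, genome_bands))


first = [7, 6, 6, 6, 6, 6]     # hash values of the first fragment (from the text)
stand_in = [7, 7, 6, 6, 6, 7]  # second fragment, used as a stand-in reference
print(is_similar(first, stand_in, rows_per_band=2))  # True: the middle band (6, 6) matches
print(is_similar(first, stand_in, rows_per_band=3))  # False: no band of three matches
```

Wider bands make a match harder, so fewer fragments enter the similar set and more are flagged directly as putative transfers.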
step 6, calculating Euclidean distance between the candidate fragments in the similar set and the whole genome hash value, and if the Euclidean distance is larger than a threshold value, determining that horizontal gene transfer occurs;
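The distance check of step 6 can be sketched as below. The threshold formula appears only as a figure in the source and is not reproduced here, so the `threshold` argument is a hypothetical value supplied by the caller. As in the text, the distance is taken over the hash values rather than over the original frequency vectors.

```python
import math


def hash_distance(x, y):
    """Euclidean distance between two equally long lists of hash values."""
    if len(x) != len(y):
        raise ValueError("signatures must contain the same number of hash values")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def is_transferred(candidate_sig, genome_sig, threshold):
    """Flag a candidate fragment whose distance to the genome signature exceeds threshold."""
    return hash_distance(candidate_sig, genome_sig) > threshold


# The two signatures from the example, used only to show the arithmetic.
d = hash_distance([7, 6, 6, 6, 6, 6], [7, 7, 6, 6, 6, 7])
print(round(d, 4))  # 1.4142, i.e. sqrt(0 + 1 + 0 + 0 + 0 + 1)
print(is_transferred([7, 6, 6, 6, 6, 6], [7, 7, 6, 6, 6, 7], threshold=1.0))  # True
```

A fragment can therefore pass the band test of step 5 and still be flagged here if its remaining hash values differ enough from those of the whole genome.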
The invention uses an alignment-free method to predict horizontal gene transfer, on the assumption that horizontally transferred genes are highly similar to the genes of distantly related organisms. Alignment-free methods are generally more sensitive to recently transferred genes and less sensitive to ancient transfers that have since been ameliorated. They are generally fast and can quickly produce a candidate list of putative horizontally transferred regions. They avoid the problem of aligning large amounts of data that may be too divergent to yield a meaningful alignment. Because they do not use genes as the unit of analysis, they can detect any genomic region of possible lateral origin.
The method uses locality sensitive hashing to improve the efficiency of searching for horizontally transferred genes and to increase the sensitivity to ameliorated genes, thereby improving prediction accuracy. The invention aims to overcome the drawbacks of phylogenetic methods, namely their high computational cost and their heavy dependence on a reliable phylogenetic tree. It provides a prediction method based on locality sensitive hashing that uses the alignment-free approach, the most common kind of parametric method. The properties of the locality sensitive hash function increase the sensitivity of the method to ameliorated ancient transferred genes, which overcomes the insensitivity of parametric methods to such genes, reduces the computer resources used without reducing prediction accuracy, and improves the efficiency of the prediction calculation.

Claims (3)

1. A horizontal transfer gene identification method based on locality sensitive hashing, characterized by comprising the following steps:
step 1, cutting the whole genome data into N fragments with the same number of bp according to a set step length;
step 2, performing K-word processing on each fragment, each gene fragment yielding a total of $|\Sigma|^K$ K-words, where $|\Sigma|$ is the size of the character set of the gene and K is the window length of the K-word processing;
step 3, counting the word frequency of the K-words in each fragment to obtain the word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency of occurrence of the t-th K-word in the current fragment and n is the total number of K-words; and standardizing the word frequency of each K-word to obtain the standardized result $S_t$, forming the standardized word frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized word frequency of the t-th K-word;
wherein the standardized result $S_t$ of the word frequency of each K-word is calculated as follows:
$$S_t = \frac{X_t - \mu}{\delta}$$
where $\mu$ is the mean of the word frequencies of all K-words and $\delta$ is the standard deviation of the word frequencies of the K-words;
step 4, constructing m hash functions and calculating m hash values for each fragment;
step 5, dividing the m hash values of each gene fragment into bands (row bars), and comparing the hash values in the bands of the fragment with the hash values in the bands of the whole genome;
when all the hash values in one or more bands are equal, the fragment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;
step 6, calculating the Euclidean distance between the m hash values of each candidate fragment in the similar set and the m hash values of the whole genome; when the Euclidean distance is larger than a set threshold, horizontal gene transfer is considered to have occurred, the value of the threshold being
Figure FDA0003165126360000012
where K is the window length of the K-word processing, namely the length of a K-word;
the Euclidean distance formula is as follows:
$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$
where $x_i$ is the i-th hash value of the gene fragment, $y_i$ is the i-th hash value of the whole genome, and m is the number of hash functions, i.e., the number of hash values of the sequence.
2. The horizontal transfer gene identification method based on locality sensitive hashing according to claim 1, wherein: in step 1 the whole genome data is cut into N overlapping fragments of 5000 bp with a step size of 500.
3. The horizontal transfer gene identification method based on locality sensitive hashing according to claim 1, wherein: the hash function family in step 4 has the form:
$$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$$
where v is the vector being hashed, a is a random number sequence generated from a p-stable distribution, b is a random number in (0, r), r is the length of the segments into which the straight line is divided, and the functions in the family are indexed by their different values of a and b.
CN201810988512.1A 2018-08-28 2018-08-28 Horizontal transfer gene identification method based on locality sensitive hashing Expired - Fee Related CN109243529B (en)

Publications (2)

Publication Number Publication Date
CN109243529A 2019-01-18
CN109243529B 2021-09-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee (granted publication date: 20210907)