CN109243529A - Horizontal transfer gene identification method based on locality sensitive hashing - Google Patents

Info

Publication number
CN109243529A
Authority
CN
China
Prior art keywords
word
segment
hash
length
hash value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810988512.1A
Other languages
Chinese (zh)
Other versions
CN109243529B (en)
Inventor
江育娥
魏静
林劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Normal University filed Critical Fujian Normal University
Priority to CN201810988512.1A priority Critical patent/CN109243529B/en
Publication of CN109243529A publication Critical patent/CN109243529A/en
Application granted granted Critical
Publication of CN109243529B publication Critical patent/CN109243529B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a horizontal transfer gene identification method based on locality sensitive hashing, comprising the following steps: Step 1, cut the whole-genome data into N segments with the same number of bp according to a set step length; Step 2, apply K-word processing to each segment; Step 3, count the frequency of the K-words in each segment and standardize the frequencies; Step 4, construct m hash function families and compute m hash values for each segment; Step 5, divide the m hash values of each gene segment into bands and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome: when all hash values in at least one band are equal, the segment is considered similar to the whole genome; otherwise, horizontal gene transfer is predicted to have occurred; Step 6, compute the Euclidean distance between the m hash values of each candidate segment in the similar set and those of the whole genome; if the distance is greater than a set threshold, horizontal gene transfer is considered to have occurred. The invention helps reduce the use of computing resources and improves the computational efficiency of prediction.

Description

Horizontal transfer gene identification method based on locality sensitive hashing
Technical field
The present invention relates to the field of bioinformatics, and in particular to a horizontal transfer gene identification method based on locality sensitive hashing.
Background art
Horizontal gene transfer (HGT) refers to the lateral exchange of genetic material between unicellular and/or multicellular organisms, that is, the process by which an organism acquires genetic material from an individual other than its parent. It is an important process driving evolution. Horizontal gene transfer can occur between different species (interspecific horizontal gene transfer) or within the same species (intraspecific horizontal gene transfer). It breaks the boundaries of kinship and makes the possible paths of gene flow more complex.
As sequencing of the human genome and the genomes of other organisms has been completed one after another, large numbers of homologous genes have been found in the genomes of different species, even of distantly related organisms, which further demonstrates how widespread horizontal gene transfer is and how distant the lineages it connects can be. Predicting horizontally transferred genes is important for understanding biological evolution and for the qualitative and quantitative estimation of the genetic material exchanged between species. In recent years, large numbers of DNA molecules with transforming activity, and competent cells able to actively take up exogenous DNA, have been found in natural environments, giving a new understanding of the horizontal gene transfer that occurs in the environment. Further study of horizontal gene transfer and its ecological effects will help in re-evaluating genetically engineered organisms, so that genetic engineering technology and genetically modified organisms can be applied to greater effect.
Methods for detecting horizontal gene transfer (HGT) fall into two main classes: phylogenetic methods and parametric methods. Methods based on phylogenetic trees detect HGT with higher confidence. However, they depend heavily on the accuracy of the input gene trees and species trees, which can be difficult to build. Even if the input trees contain no errors, phylogenetic conflict may be the result of evolutionary processes other than HGT (such as duplication and loss), leading these methods to infer HGT erroneously.
Summary of the invention
The object of the present invention is to provide a horizontal transfer gene identification method based on locality sensitive hashing.
The technical solution adopted by the present invention is:
A horizontal transfer gene identification method based on locality sensitive hashing, comprising the following steps:
Step 1, cut the whole-genome data into N segments with the same number of bp according to a set step length;
Step 2, apply K-word processing to each segment obtained by the segmentation, i.e. pass each segment through a sliding window of length K; the sequence inside the window is one K-word, and each gene segment yields $|\Sigma|^K$ K-words in total, where $|\Sigma|$ is the size of the gene alphabet and K is the window length of the K-word processing;
Step 3, count the frequency of the K-words in each segment to obtain the K-word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency with which the t-th K-word occurs in the current segment and n is the total number of K-words; then standardize the frequency of each K-word to obtain the standardized result $S_t$, forming the standardized frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized frequency of the t-th K-word;
The standardized result $S_t$ of each K-word frequency is computed as:
$S_t = \frac{X_t - \mu}{\delta}$
where $\mu$ is the mean of all K-word frequencies and $\delta$ is the standard deviation of the K-word frequencies;
Step 4, construct m hash function families and compute m hash values for each segment;
Step 5, divide the m hash values of each gene segment into bands, and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome.
When all hash values in one or more bands are equal, the segment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;
Step 6, compute the Euclidean distance between the m hash values of each candidate segment in the similar set and those of the whole genome; when the Euclidean distance is greater than a set threshold, horizontal gene transfer is considered to have occurred. The threshold is set as a function of K, where K is the window length of the K-word processing, i.e. the length of a K-word;
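The Euclidean distance is computed as:
$d = \sqrt{\sum_{i=1}^{m} \left(h_i^{\mathrm{seg}} - h_i^{\mathrm{gen}}\right)^2}$
where $h_i^{\mathrm{seg}}$ and $h_i^{\mathrm{gen}}$ are the i-th hash values of the candidate segment and of the whole genome, and m is the number of hash functions, i.e. the number of hash values per sequence.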
Further, in step 1 the whole-genome data are cut, with a step length of 500, into N overlapping segments of 5000 bp each.
Further, the hash function family in step 4 has the form:
$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$
where $a \cdot v$ is the dot product of the random vector a and the input vector v, b is a random number in (0, r), and r is the length of the segments into which the line is divided; the functions in the family are indexed by the different choices of a and b.
By adopting the above technical solution, the invention builds on alignment-free methods, the most widely used class of parametric methods, and proposes a prediction method based on locality sensitive hashing. Locality sensitive hashing not only improves the efficiency of searching for horizontally transferred genes but also increases sensitivity to ameliorated genes, thereby improving prediction accuracy. This overcomes the insensitivity of parametric methods to anciently transferred genes that have since undergone amelioration and, without reducing prediction accuracy, lowers the use of computing resources and improves the computational efficiency of prediction. By using the more efficient alignment-free approach, the invention also avoids the drawbacks of phylogenetic methods, namely high computational cost and excessive dependence on a reliable phylogenetic tree.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments;
Fig. 1 is a flow diagram of the horizontal transfer gene identification method based on locality sensitive hashing of the present invention.
Specific embodiments
Horizontal gene transfer occurs frequently in nature and is an important factor in the evolution of many organisms. The genetic material it introduces plays a very important role in promoting genome innovation and may give the host organism a selective advantage by improving its adaptability to new environments. Because horizontal gene transfer can move genes from phylogenetically distant lineages with entirely different genotypes, or new genes with new functions, into a genome, it is a major source of phenotypic innovation and a mechanism of niche adaptation, and it also enriches species diversity. Detecting HGT events helps us better understand the history of genomes and explain the emergence of new traits during biological evolution; it is therefore biologically significant and enriches classical Darwinian evolutionary theory.
The present invention uses the p-stable LSH algorithm, which does not need to map the original space into Hamming space and can perform locality sensitive hashing directly in Euclidean space.
Locality sensitivity means that points that are close together in the original space remain adjacent in the new data space with high probability, while points that are far apart have only a small probability of being mapped to the same bucket. Locality-sensitive hashing (LSH) uses this property to solve the approximate nearest neighbor search problem. The original LSH method maps the original space into Hamming space: the representation of points in the original space is converted into a representation of points in Hamming space, and the distance metric is likewise converted into a distance metric in Hamming space. The search time of LSH is linear in the dimension and sublinear in the size of the data set, which greatly reduces search time.
p-stable LSH applies to d-dimensional Euclidean space, with 0 < p ≤ 2. It is an extension of LSH that uses the concept of p-stable distributions. A p-stable distribution is not one specific distribution but a family of distributions satisfying certain conditions: for p = 1 it is the standard Cauchy distribution, and for p = 2 it is the Gaussian distribution. The p-stable LSH function family has the form $h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$, where b is a random number in (0, r) and r is the length of the segments into which the line is divided; the functions in the hash function family are indexed by the different choices of a and b. The dot product of the vectors a and v is used to generate the hash function family, and the family is locality sensitive. The p-stable LSH method divides a line into equal segments of length r; points mapped onto the same segment are assigned the same hash value, and points mapped onto different segments are assigned different hash values.
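As a small worked instance of the bucketing described above, the following evaluates $h_{a,b}(v) = \lfloor (a \cdot v + b)/r \rfloor$ for three projected points. The values of r, b and the dot products are purely illustrative assumptions and do not come from the patent.

```latex
r = 4,\quad b = 1.2:
\qquad
\begin{aligned}
a \cdot v_1 = 9.3  &\;\Rightarrow\; h_{a,b}(v_1) = \left\lfloor \tfrac{9.3 + 1.2}{4} \right\rfloor = \lfloor 2.625 \rfloor = 2,\\
a \cdot v_2 = 10.1 &\;\Rightarrow\; h_{a,b}(v_2) = \left\lfloor \tfrac{10.1 + 1.2}{4} \right\rfloor = \lfloor 2.825 \rfloor = 2,\\
a \cdot v_3 = 15.0 &\;\Rightarrow\; h_{a,b}(v_3) = \left\lfloor \tfrac{15.0 + 1.2}{4} \right\rfloor = \lfloor 4.05 \rfloor = 4.
\end{aligned}
```

The two points whose projections are close (9.3 and 10.1) fall on the same segment of length r and share a hash value, while the more distant projection (15.0) lands on a different segment.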
As shown in Fig. 1, the invention discloses a horizontal transfer gene identification method based on locality sensitive hashing, which comprises the six steps set out in the Summary of the invention.
To verify the prediction performance of the invention, the predicted HGT segments can be compared with the actual HGT segments, and the horizontal transfer gene identification model based on locality sensitive hashing is evaluated using the F-measure.
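The F-measure evaluation mentioned above could be sketched as follows. This is a minimal illustration, assuming that predicted and actual HGT segments are identified by integer segment indices; the function name `f_measure` and the `beta` parameter are hypothetical and not taken from the patent.

```python
def f_measure(predicted: set[int], actual: set[int], beta: float = 1.0) -> float:
    """F-measure of predicted HGT segments against the actual HGT segments.

    Segments are identified by index (an assumption made for this sketch).
    """
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical example: segments 3 and 4 are correctly predicted,
# segments 1 and 2 are false positives, segment 5 is missed.
score = f_measure(predicted={1, 2, 3, 4}, actual={3, 4, 5})
```

With beta = 1 this is the usual harmonic mean of precision and recall.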
The processing performed by the invention is described in detail below:
To describe the data processing more clearly, simulated data are used: a 300 bp Escherichia coli segment is taken as the recipient; genome segments of 50 bp each are randomly extracted from Haemophilus influenzae and Bacillus subtilis as donor sequences and inserted at two randomly selected positions in the recipient genome, forming the simulated HGT data. The steps of the horizontal transfer gene identification method based on locality sensitive hashing are as follows:
For example, take a sequence with a total length of 400 bp, part of which consists of segments that have undergone HGT:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATCTGTCAGAAAAATGACTAAATAGCGGCTCCCACAATGTTCAAATGTGGGGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACCATTACCACCACCATCGCTGGCGTTTCTCTGCTCGACGGTCACCGGGATTTTATTTGGCTGGTTACACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAGCCCGCACCTGACAGTGCGG
Step 1, cut the sequence into overlapping segments of 100 bp with a step length of 20, giving 20 segments;
For example, the first segment is:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAAT
The second segment is:
ACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATC
And so on. The first two segments are used as examples below:
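The segmentation in step 1 can be sketched as a sliding window over the sequence. This is only an illustration: the function name `split_into_segments` is hypothetical, and dropping any trailing window shorter than the segment length is an assumption, since the text does not say how the tail of the sequence is handled.

```python
def split_into_segments(sequence: str, segment_length: int, step: int) -> list[str]:
    """Cut a sequence into overlapping windows of equal length.

    Assumption: windows that would run past the end of the sequence are dropped.
    """
    last_start = len(sequence) - segment_length
    return [sequence[i:i + segment_length] for i in range(0, last_start + 1, step)]


# First 120 bp of the example sequence, enough to produce the first two segments.
example = (
    "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG"
    "CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTATC"
)
segments = split_into_segments(example, segment_length=100, step=20)
```

The second window starts 20 bp after the first, so consecutive segments share 80 bp.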
Step 2, apply K-word processing to each segment obtained by the segmentation, i.e. pass each segment through a sliding window of length K; the sequence inside the window is one K-word, and each gene segment yields $|\Sigma|^K$ K-words in total;
For example, with the sliding window length set to K = 2, the K-words obtained for each segment are: AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
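The K-word processing in step 2, together with the counting used in step 3, can be sketched as below. The function name `kword_counts` is hypothetical, and skipping windows that contain characters outside the alphabet (for example N) is an assumption of this sketch.

```python
from itertools import product


def kword_counts(segment: str, k: int, alphabet: str = "ACGT") -> dict[str, int]:
    """Count every K-word over the alphabet by sliding a window of length k."""
    # Enumerate all |alphabet|**k possible words so that absent words get a count of 0.
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    for i in range(len(segment) - k + 1):
        word = segment[i:i + k]
        if word in counts:  # assumption: windows with other characters are skipped
            counts[word] += 1
    return counts


counts = kword_counts("AACGT", k=2)
```

For K = 2 the dictionary has 16 keys, listed in the same order as the K-words above (AA, AC, AG, AT, ...).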
Step 3, count the frequency $X_t$ of each K-word occurring in each gene segment, where $X_t$ is the frequency with which the t-th K-word occurs in the segment, and standardize the frequencies; the standardized result is $S_t$, the standardized frequency of the t-th K-word;
Table 1: frequency of each word in the two sequences
Table 2: results after frequency standardization
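The standardization in step 3 is a z-score over the K-word frequencies of one segment, $S_t = (X_t - \mu)/\delta$. A minimal sketch follows; using the population standard deviation and returning zeros when all frequencies are equal are assumptions, and the input values are hypothetical rather than taken from the tables.

```python
import statistics


def standardize(freqs: list[float]) -> list[float]:
    """Z-score standardization: S_t = (X_t - mu) / delta."""
    mu = statistics.fmean(freqs)
    delta = statistics.pstdev(freqs)  # assumption: population standard deviation
    if delta == 0:
        # Assumption: a constant frequency vector standardizes to all zeros.
        return [0.0] * len(freqs)
    return [(x - mu) / delta for x in freqs]


# Hypothetical frequencies, for illustration only.
standardized = standardize([1, 2, 3])
```

After standardization the values have mean 0 and standard deviation 1, so segments of different composition are put on a common scale before hashing.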
Step 4, construct m hash function families and compute m hash values for each segment; for example, construct 6 hash function families.
a and b are 16-dimensional vectors following a Gaussian distribution.
a1[0.2728,-0.5261,-1.1690,-1.0743,1.0394,0.0461,0.1844,0.0007, 0.0448,-1.1816,-0.2499,-1.8294,-0.3410,1.8652,0.2897,1.9430]
a2[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214, 0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a3[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214, 0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a4[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214, 0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a5[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214, 0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
a6[-0.4519,-1.2261,-1.1323,1.3255,-0.9256,-0.1244,-1.0290,-0.4214, 0.0356,-0.5530,2.6491,0.7705,-0.9915,-2.0205,-0.4313,-0.5070]
b1[4.5692,1.2317,5.4587,0.7120,4.6902,3.0200,5.4109,2.5662,4.5222, 2.4379,5.7483,1.7211,1.7362,4.0020,0.2040,5.9893]
b2[3.4901,1.3752,0.0630,3.3493,3.6519,3.3700,2.7356,3.2697,3.7174, 4.9290,3.5241,5.7950,5.5712,0.2950,2.7426,0.5739]
b3[2.7009,0.8035,4.7329,1.3775,4.6591,0.6980,4.4261,0.7008,0.8813, 0.8644,2.3402,5.8353,1.3543,3.7460,2.9168,5.3354]
b4[2.5758,2.2662,5.6072,1.1562,1.1456,3.6358,5.4605,4.5566,2.8090, 1.1688,5.1190,0.4194,2.6726,3.0355,3.5281,4.6226]
b5[0.9153,2.3426,3.6952,4.1016,0.0308,1.0368,1.8780,3.7876,5.7924, 2.4167,5.8599,0.6297,5.6887,4.0295,1.6678,1.9148]
b6[1.8040,3.3165,2.1672,3.7913,4.2346,3.4207,0.8061,3.8269,3.8916, 3.2757,1.0729,3.6728,2.2185,2.7205,3.6779,0.1046]
The two segments are evaluated with the p-stable LSH formula, and each segment yields 6 hash values;
The 6 hash values of the first segment are [7,6,6,6,6,6]
The 6 hash values of the second segment are [7,7,6,6,6,7]
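Computing m hash values per segment with a p-stable hash family could look like the sketch below. It follows the stated form $h_{a,b}(v) = \lfloor (a \cdot v + b)/r \rfloor$ with a Gaussian vector a and a scalar offset b drawn from (0, r); the function names, the value of r and the seed are hypothetical, and the sketch does not attempt to reproduce the hash values listed above.

```python
import math
import random


def make_hash_family(m: int, dim: int, r: float, seed: int = 0):
    """Draw m (a, b) pairs: a is a Gaussian vector of length dim, b is uniform in (0, r)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, r))
        for _ in range(m)
    ]


def lsh_signature(v: list[float], family, r: float) -> list[int]:
    """Apply h(v) = floor((a . v + b) / r) for every (a, b) in the family."""
    return [
        math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / r)
        for a, b in family
    ]


# Hypothetical setup: 6 hash functions over 16-dimensional standardized K-word vectors.
family = make_hash_family(m=6, dim=16, r=4.0)
signature = lsh_signature([0.1] * 16, family, r=4.0)
```

The same family must be applied to every segment and to the whole genome so that their signatures are comparable.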
Step 5, divide the 6 hash values of each gene segment into bands, and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome; if all hash values in one or more bands are equal, the segment is considered similar to the whole genome, otherwise horizontal gene transfer is predicted to have occurred;
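The band comparison in step 5 can be sketched as follows. The number of hash values per band is not stated in the text, so `rows_per_band` is a hypothetical parameter, and the function names are illustrative. The two signatures used are the hash values of the first and second segments given above, taken here purely as sample inputs because no whole-genome signature is listed.

```python
def bands(signature: list[int], rows_per_band: int) -> list[tuple[int, ...]]:
    """Split a signature into consecutive bands of rows_per_band hash values."""
    return [
        tuple(signature[i:i + rows_per_band])
        for i in range(0, len(signature), rows_per_band)
    ]


def is_candidate_similar(sig_a: list[int], sig_b: list[int], rows_per_band: int) -> bool:
    """True if at least one band has all of its hash values equal in both signatures."""
    return any(
        band_a == band_b
        for band_a, band_b in zip(bands(sig_a, rows_per_band), bands(sig_b, rows_per_band))
    )


first = [7, 6, 6, 6, 6, 6]
second = [7, 7, 6, 6, 6, 7]
match_with_2_rows = is_candidate_similar(first, second, rows_per_band=2)  # band (6, 6) matches
match_with_3_rows = is_candidate_similar(first, second, rows_per_band=3)  # no band matches
```

Smaller bands make a match more likely and therefore put more segments into the similar set; larger bands are stricter.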
Step 6, compute the Euclidean distance between the hash values of each candidate segment in the similar set and those of the whole genome; if the Euclidean distance is greater than the threshold, horizontal gene transfer is considered to have occurred;
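The distance check in step 6 can be sketched as below. The function names are hypothetical, and the threshold is passed in as a parameter because its expression in terms of K is not available in this text. The sample inputs are again the two segment signatures listed above.

```python
import math


def hash_distance(sig_a: list[int], sig_b: list[int]) -> float:
    """Euclidean distance between two signatures of m hash values each."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must contain the same number of hash values")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(sig_a, sig_b)))


def exceeds_threshold(sig_a: list[int], sig_b: list[int], threshold: float) -> bool:
    """Flag a candidate segment when its distance is greater than the given threshold."""
    return hash_distance(sig_a, sig_b) > threshold


distance = hash_distance([7, 6, 6, 6, 6, 6], [7, 7, 6, 6, 6, 7])  # sqrt(2)
```

Only segments that passed the band comparison reach this check, which keeps the number of distance computations small.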
The present invention predicts horizontal gene transfer with an alignment-free method, on the assumption that horizontally transferred genes are highly similar to genes of distantly related organisms. Alignment-free methods are generally more sensitive to recent horizontal transfers and insensitive to genes from ancient transfers that have since been ameliorated. They are usually very fast and can quickly produce a candidate list of putative horizontally transferred regions. They avoid the problem of aligning large amounts of data that may be too divergent to yield a meaningful alignment. Because they do not take the gene as the unit of analysis, they can detect genomic regions of any possible lateral origin.
By using locality sensitive hashing, the invention not only improves the efficiency of searching for horizontally transferred genes but also increases sensitivity to ameliorated genes, thereby improving prediction accuracy. An object of the invention is to overcome the drawbacks of phylogenetic methods, namely high computational cost and excessive dependence on a reliable phylogenetic tree. Building on alignment-free methods, the most widely used class of parametric methods, the invention proposes a prediction method based on locality sensitive hashing; the properties of locality-sensitive hash functions increase sensitivity to anciently transferred, ameliorated genes, which overcomes the insensitivity of parametric methods to such genes and, without reducing prediction accuracy, lowers the use of computing resources and improves the computational efficiency of prediction.

Claims (3)

1. A horizontal transfer gene identification method based on locality sensitive hashing, characterized in that it comprises the following steps:
Step 1, cut the whole-genome data into N segments with the same number of bp according to a set step length;
Step 2, apply K-word processing to each segment obtained by the segmentation; each gene segment yields $|\Sigma|^K$ K-words in total, where $|\Sigma|$ is the size of the gene alphabet and K is the window length of the K-word processing;
Step 3, count the frequency of the K-words in each segment to obtain the K-word frequency set $X = \{X_1, X_2, \ldots, X_t, \ldots, X_n\}$, where $X_t$ is the frequency with which the t-th K-word occurs in the current segment and n is the total number of K-words; then standardize the frequency of each K-word to obtain the standardized result $S_t$, forming the standardized frequency set $S = \{S_1, S_2, \ldots, S_t, \ldots, S_n\}$, where $S_t$ is the standardized frequency of the t-th K-word;
The standardized result $S_t$ of each K-word frequency is computed as:
$S_t = \frac{X_t - \mu}{\delta}$
where $\mu$ is the mean of all K-word frequencies and $\delta$ is the standard deviation of the K-word frequencies;
Step 4, construct m hash function families and compute m hash values for each segment;
Step 5, divide the m hash values of each gene segment into bands, and compare the hash values in each band of the segment with the hash values in the corresponding band of the whole genome.
When all hash values in one or more bands are equal, the segment is considered similar to the whole genome and is added to the similar set; otherwise, horizontal gene transfer is predicted to have occurred;
Step 6, compute the Euclidean distance between the m hash values of each candidate segment in the similar set and those of the whole genome; when the Euclidean distance is greater than a set threshold, horizontal gene transfer is considered to have occurred. The threshold is set as a function of K, where K is the window length of the K-word processing, i.e. the length of a K-word;
The Euclidean distance is computed as:
$d = \sqrt{\sum_{i=1}^{m} \left(h_i^{\mathrm{seg}} - h_i^{\mathrm{gen}}\right)^2}$
where $h_i^{\mathrm{seg}}$ and $h_i^{\mathrm{gen}}$ are the i-th hash values of the candidate segment and of the whole genome, and m is the number of hash functions, i.e. the number of hash values per sequence.
2. The horizontal transfer gene identification method based on locality sensitive hashing according to claim 1, characterized in that: in step 1 the whole-genome data are cut, with a step length of 500, into N overlapping segments of 5000 bp each.
3. The horizontal transfer gene identification method based on locality sensitive hashing according to claim 1, characterized in that the hash function family in step 4 has the form:
$h_{a,b}(v) = \left\lfloor \frac{a \cdot v + b}{r} \right\rfloor$
where $a \cdot v$ is the dot product of the random vector a and the input vector v, b is a random number in (0, r), and r is the length of the segments into which the line is divided; the functions in the family are indexed by the different choices of a and b.
CN201810988512.1A 2018-08-28 2018-08-28 Horizontal transfer gene identification method based on locality sensitive hashing Expired - Fee Related CN109243529B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810988512.1A CN109243529B (en) 2018-08-28 2018-08-28 Horizontal transfer gene identification method based on locality sensitive hashing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810988512.1A CN109243529B (en) 2018-08-28 2018-08-28 Horizontal transfer gene identification method based on locality sensitive hashing

Publications (2)

Publication Number Publication Date
CN109243529A true CN109243529A (en) 2019-01-18
CN109243529B CN109243529B (en) 2021-09-07

Family

ID=65068789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810988512.1A Expired - Fee Related CN109243529B (en) 2018-08-28 2018-08-28 Horizontal transfer gene identification method based on locality sensitive hashing

Country Status (1)

Country Link
CN (1) CN109243529B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110660452A (en) * 2019-09-20 2020-01-07 中国人民解放军军事科学院军事医学研究院 Method for detecting bacterial gene horizontal transfer and donor strain generating horizontal transfer DNA fragment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101533484A (en) * 2008-03-12 2009-09-16 中国科学院半导体研究所 Method for forecasting gene transferring horizontally in genome
CN103631928A (en) * 2013-12-05 2014-03-12 中国科学院信息工程研究所 LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN105183792A (en) * 2015-08-21 2015-12-23 东南大学 Distributed fast text classification method based on locality sensitive hashing
CN107103206A (en) * 2017-04-27 2017-08-29 福建师范大学 The DNA sequence dna cluster of local sensitivity Hash based on standard entropy

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIE LIN et al.: "SSAW: A new sequence similarity analysis method based on the stationary discrete wavelet transform", 《SPRING》 *
徐彭娜 et al.: "Locality-sensitive hashing clustering method based on positional information entropy", Computer Applications and Software *

Also Published As

Publication number Publication date
CN109243529B (en) 2021-09-07

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210907