CN107103206A - DNA sequence clustering using locality-sensitive hashing based on standard entropy - Google Patents

DNA sequence clustering using locality-sensitive hashing based on standard entropy

Info

Publication number
CN107103206A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710285598.7A
Other languages
Chinese (zh)
Other versions
CN107103206B (en)
Inventor
江育娥
徐彭娜
林劼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Normal University
Original Assignee
Fujian Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-04-27
Filing date
2017-04-27
Publication date
2017-08-29
Application filed by Fujian Normal University
Priority to CN201710285598.7A
Publication of CN107103206A
Application granted
Publication of CN107103206B
Legal status: Expired - Fee Related

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00 - ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a DNA sequence clustering method using locality-sensitive hashing based on standard entropy. The original DNA sequences are mapped according to an L-gram model; the Local Frequency (LF) entropies of the N sequences are computed to form a matrix, from which the standard entropy is derived; the standard entropy is hashed using Locality-Sensitive Hashing to obtain a candidate set of DNA sequence fragments; and within the candidate set, the DNA sequence fragments whose edit distance is less than d are identified to obtain the clustering result. The transformed feature space retains sufficient information from the original DNA, avoiding loss of DNA information. Each DNA sequence is converted into a new space and the candidate DNA sequence fragment set of each fragment is computed, which improves both computation speed and accuracy.

Description

DNA sequence clustering using locality-sensitive hashing based on standard entropy
Technical field
The present invention relates to the field of bioinformatics, and more particularly to DNA sequence clustering using locality-sensitive hashing based on standard entropy.
Background
With the arrival of the Internet era and the development of information technology, gene sequencing technology has matured. Together with the progress of various genome projects, the volume of biological data is growing explosively, and traditional methods cannot meet the demands of processing and analyzing such massive data. Bioinformatics combines biology with computer technology and mathematics to acquire, process, extract, analyze, and store biological information and to mine the positional information of genetic material. Data mining is a technique for extracting useful, potentially valuable information from large amounts of data. Clustering, a data mining technique, groups together sequences that share certain features so that the function or structure of the data can be analyzed more effectively; exploring useful information about unknown sequences from sequences of known function and structure is of great significance.
Existing sequence clustering methods have many shortcomings. Traditional clustering algorithms, such as the partition-based k-medoid algorithm and the hierarchy-based complete-link algorithm, require pairwise comparison of sequences and therefore have high time complexity; because the number of DNA sequences is now growing extremely fast, these algorithms cannot be applied to massive data. The k-means algorithm requires the number of clusters to be fixed in advance, the centroid of sequence data is not easy to compute, and randomly chosen initial cluster centers make the clustering result unstable, so it performs poorly on biological sequence data. Clustering algorithms based on BAG graphs give effective results, but splitting a class requires guidance from cluster units, and the number of sequences in a gene bank is so large that representing them with an undirected graph is difficult.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art and to provide a DNA sequence clustering method using locality-sensitive hashing based on standard entropy.
To achieve this object, the present invention adopts the following technical scheme:
A DNA sequence clustering method using locality-sensitive hashing based on standard entropy, comprising the following steps:
(1) Sequence the entire target sequence using second-generation sequencing technology to obtain a batch of short DNA reads; each short read is referred to as a DNA sequence fragment.
(2) The alphabet of a DNA sequence fragment is $\Sigma = \{A, C, G, T\}$, and $|\Sigma|$ denotes the number of letters in the alphabet. Initialize the word length $L$, then slide a fixed-length window over each DNA sequence fragment to obtain the set of words $Y$ to be processed; this set contains $|\Sigma|^L$ words. Compute the entropy $h$ of each word from its position information $X_t$.
The position information $X_t$ of a word is the reciprocal of the distance between two consecutive occurrences of that word in the DNA sequence fragment:
$$X_t = \frac{1}{LF_{t+1}^{Y} - LF_{t}^{Y}}, \quad t = 1, 2, \ldots, z-1;$$
$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \log_2 P[t], \quad \lambda = 1, 2, \ldots, |\Sigma|^L;$$
where $Y$ denotes a word to be processed, $t$ indexes its occurrences in order, $LF_t^{Y}$ is the position of the $t$-th occurrence of word $Y$ in the DNA sequence fragment, $Y_\lambda$ is the $\lambda$-th word, $\lambda$ is the index of the word, and $z$ is the number of times the word occurs. $P[t]$ is the $t$-th value of the discrete probability distribution $P$, defined as the ratio of the partial sum $S_t$ to the total $Z$, that is, $P[t] = S_t / Z$.
The partial sum $S_t$ is the sum of the position information up to $t$: $S_t = X_1 + X_2 + \cdots + X_t$.
The total is $Z = S_1 + S_2 + \cdots + S_n$.
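The entropy computation in step (2) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names `word_positions` and `lf_entropy` are hypothetical, positions are taken as 1-based, and a word that occurs fewer than twice is assumed to have entropy 0, a case the text does not specify.

```python
import math

def word_positions(fragment, word_len):
    """Map each word found by the sliding window to its 1-based start positions."""
    positions = {}
    for i in range(len(fragment) - word_len + 1):
        positions.setdefault(fragment[i:i + word_len], []).append(i + 1)
    return positions

def lf_entropy(pos):
    """Entropy h of one word from its occurrence positions LF_1 .. LF_z."""
    # X_t = 1 / (LF_{t+1} - LF_t), t = 1 .. z-1
    x = [1.0 / (b - a) for a, b in zip(pos, pos[1:])]
    if not x:
        # Assumption: fewer than two occurrences gives no distances, so h = 0.
        return 0.0
    # Partial sums S_t = X_1 + ... + X_t and total Z = S_1 + ... + S_n
    s, running = [], 0.0
    for value in x:
        running += value
        s.append(running)
    total = sum(s)
    p = [s_t / total for s_t in s]          # P[t] = S_t / Z
    return -sum(p_t * math.log2(p_t) for p_t in p)
```

For a word that occurs at positions 1, 3, 5 and 7, the three distances are all 2, so X = (0.5, 0.5, 0.5), S = (0.5, 1.0, 1.5), Z = 3 and P = (1/6, 1/3, 1/2).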
(3) Compute the feature vector: normalize the entropy to obtain the standard entropy $H_{LF}$, which serves as the feature variable for the hash functions. The standard entropy $H_{LF}$ is computed as:
$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z}\log_2\frac{1}{z}}, \quad \lambda = 1, 2, \ldots, |\Sigma|^L$$
where $h(Y_\lambda)$ is the entropy of word $Y_\lambda$ and $z$ is the number of times the word occurs.
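The normalization in step (3) amounts to one division per word. The sketch below assumes a hypothetical helper name `standard_entropy`; returning 0 when z is less than 2 is also an assumption, made because the denominator vanishes for z = 1 and the text does not say how that case is handled.

```python
import math

def standard_entropy(h, z):
    """Normalize a word entropy h by -(1/z) * log2(1/z), where z is the occurrence count."""
    if z < 2:
        # Assumption: log2(1/1) = 0 would make the denominator zero, so return 0.0.
        return 0.0
    denominator = -(1.0 / z) * math.log2(1.0 / z)
    return h / denominator
```

Applying this to every one of the $|\Sigma|^L$ words of a fragment gives that fragment's feature vector.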
(4) Compute the hash matrix HM: apply the LSH method to the standard entropy $H_{LF}$ of the N DNA sequence fragments, using num_f hash functions $f_{a,m}(v)$ to obtain a num_f × N hash matrix HM.
Here $v$ is the feature vector of a DNA sequence fragment, $a$ is a random vector with the same dimension as $v$ and entries between 0 and 1, $m$ is an arbitrary integer from 0 to $w$, and $w$ is an arbitrary positive integer. Each hash function $f_{a,m}(v)$ thus maps a d-dimensional vector $v$ to an integer.
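The hash function formula itself does not survive in this text. The sketch below therefore assumes the widely used projection form floor((a · v + m) / w), which matches the roles described for a, m and w but is not confirmed by the source. The names `make_hash` and `hash_matrix` and the `seed` parameter are hypothetical.

```python
import math
import random

def make_hash(dim, w, rng):
    """Return one hash function f_{a,m}; the floor((a.v + m) / w) form is an assumption."""
    a = [rng.random() for _ in range(dim)]   # random vector, entries in [0, 1)
    m = rng.randint(0, w)                    # integer from 0 to w, inclusive
    def f(v):
        dot = sum(a_k * v_k for a_k, v_k in zip(a, v))
        return math.floor((dot + m) / w)
    return f

def hash_matrix(vectors, num_f, w, seed=0):
    """Build the num_f x N matrix HM: row i is hash function i, column j is fragment j."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    funcs = [make_hash(dim, w, rng) for _ in range(num_f)]
    return [[f(v) for v in vectors] for f in funcs]
```

Fragments with identical feature vectors always receive identical columns in HM, and fragments with nearby feature vectors tend to agree in many rows, which is the property the later steps rely on.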
(5) Compute the spliced hash matrix PHM: using a parameter b, divide the hash matrix HM into b buckets of r rows each, where r = num_f / b. In HM, row i (i ∈ [1, num_f]) corresponds to the i-th hash function and column j (j ∈ [1, N]) to the j-th DNA sequence fragment, so $HM_{ij}$ is the integer obtained by applying the i-th hash function to the standard entropy of the j-th DNA sequence fragment. Keep only the first three digits of each $HM_{ij}$, padding with zeros when there are fewer than three digits. Finally, within each bucket, concatenate the r values in column j into a single spliced hash value, yielding the b × N spliced hash matrix PHM.
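Step (5) can be sketched as below. The helper name `splice_hash_matrix` is hypothetical, and padding on the left with zeros is an assumption: the text says only that values shorter than three digits are padded with 0.

```python
def splice_hash_matrix(hm, b):
    """Turn a num_f x N hash matrix into a b x N matrix of spliced hash strings."""
    num_f, n = len(hm), len(hm[0])
    if num_f % b != 0:
        raise ValueError("num_f must be divisible by b")
    r = num_f // b                            # rows per bucket

    def three_digits(value):
        # Assumption: keep the first three characters, then left-pad with zeros.
        return str(value)[:3].zfill(3)

    phm = []
    for bucket in range(b):
        rows = hm[bucket * r:(bucket + 1) * r]
        # Concatenate the r values of column j within this bucket.
        phm.append(["".join(three_digits(row[j]) for row in rows) for j in range(n)])
    return phm
```

With num_f = 4 and b = 2, the matrix [[1, 22], [333, 4444], [5, 5], [6, 7]] becomes [["001333", "022444"], ["005006", "005007"]].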
(6) Compute the candidate DNA sequence fragment sets: for a DNA sequence fragment $S_i$ (i ∈ [1, N]), if the spliced hash matrix PHM contains a DNA sequence fragment $S_j$ (j ∈ [1, N], i ≠ j) whose spliced hash value equals that of $S_i$ in the same row, then $S_j$ is a candidate DNA sequence fragment of $S_i$. All candidate DNA sequence fragments of $S_i$ together form its candidate DNA sequence fragment set Candidate.
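Step (6) is a grouping operation on each row of PHM. In the sketch below the function name `candidate_sets` is hypothetical and fragments are identified by their 0-based column index.

```python
from collections import defaultdict

def candidate_sets(phm):
    """For each fragment j, collect the fragments sharing its spliced value in any PHM row."""
    n = len(phm[0])
    candidates = {j: set() for j in range(n)}
    for row in phm:
        groups = defaultdict(list)
        for j, value in enumerate(row):
            groups[value].append(j)           # fragments with equal values in this row
        for members in groups.values():
            for j in members:
                candidates[j].update(k for k in members if k != j)
    return candidates
```

Because a single matching row is enough, the candidate relation built this way is symmetric: if $S_j$ is a candidate of $S_i$, then $S_i$ is a candidate of $S_j$.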
(7) Cluster: randomly select an unclustered DNA sequence fragment as a cluster center. From the candidate DNA sequence fragment set Candidate of that center, select the candidate sequences whose edit distance to the center is less than the specified threshold d; these form one cluster. Sequences that have already been clustered are not used again as cluster centers and do not appear in other clusters. Store the clustered DNA sequence fragments in the set clustered, and repeat the clustering step until all DNA sequence fragments have been clustered (that is, until clustered contains every sequence).
The edit distance reflects the dissimilarity between two sequences: the larger the edit distance, the more dissimilar they are. Clustering aims to group similar sequences together, so a threshold d is specified; two sequences whose edit distance is less than d are considered to meet the defined similarity requirement.
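The clustering loop of step (7) can be sketched as follows. This is an illustration under stated assumptions: the edit distance is taken to be the Levenshtein distance, the cluster center is included in its own cluster, and the names `edit_distance`, `cluster` and the `seed` parameter are hypothetical.

```python
import random

def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]
        for j, ct in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution or match
        prev = cur
    return prev[-1]

def cluster(fragments, candidates, d, seed=0):
    """Greedy clustering: a random unclustered center plus its close, unclustered candidates."""
    rng = random.Random(seed)
    clustered, clusters = set(), []
    remaining = list(range(len(fragments)))
    while remaining:
        center = rng.choice(remaining)
        members = [center] + [
            j for j in sorted(candidates[center])
            if j not in clustered
            and edit_distance(fragments[center], fragments[j]) < d
        ]
        clusters.append(members)
        clustered.update(members)
        remaining = [j for j in remaining if j not in clustered]
    return clusters
```

Only the candidates of each center are compared by edit distance, so the number of expensive comparisons depends on the candidate set sizes rather than on all N(N-1)/2 pairs.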
Further, in step (4), the value of w is 4.
With the above technical scheme, the original DNA sequences are mapped according to the L-gram model: since a DNA sequence is composed of the four letters {A, T, C, G}, a word length of L yields $|\Sigma|^L$ words to be processed, and each original DNA sequence is mapped to a new numeric sequence. The Local Frequency (LF) entropies of the N sequences form an N × $|\Sigma|^L$ matrix, from which the standard entropy is derived and used as the feature vector. Locality-Sensitive Hashing (LSH) is applied to the standard entropy to obtain a candidate set for each DNA sequence fragment, and within the candidate set the fragments whose edit distance is less than the specified threshold d are taken as similar sequences. The feature space obtained by the Local Frequency transformation retains sufficient information from the original DNA, avoiding loss of DNA sequence information. The entropy computed from Local Frequency reflects the structural information of the DNA sequence at a finer level. Mapping the feature vectors with Locality-Sensitive Hashing to obtain candidate DNA sequence fragment sets greatly reduces the number of pairwise sequence comparisons and shortens the sequence alignment time. DNA sequence similarity is a basic measure in bioinformatics with many applications, including predicting the role and function of an unknown sequence, constructing phylogenetic trees of organisms or species, and analyzing the homology of species. For clustering large numbers of DNA sequences, the method computes the standard entropy of each DNA sequence, uses it as the feature vector for locality-sensitive hashing, and computes the candidate DNA sequence fragment set of each fragment, which significantly reduces the number of sequences to be aligned and improves computation speed while maintaining accuracy.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawing and specific embodiments:
Fig. 1 is a flowchart of the DNA sequence clustering method of the present invention, using locality-sensitive hashing based on standard entropy.
Detailed description of the embodiments
As shown in Fig. 1, the DNA sequence clustering method of the present invention, using locality-sensitive hashing based on standard entropy, comprises the following steps:
The processing procedure of the present invention is described in detail below.
To describe the DNA sequence processing more clearly, 12 DNA coding sequences are randomly selected as the objects of analysis, and the implementation is described in detail using these DNA sequences as samples. The steps of DNA sequence clustering using locality-sensitive hashing based on standard entropy are as follows:
(1) Sequence the entire target sequence using second-generation sequencing technology to obtain a batch of short DNA reads; each short read is referred to as a DNA sequence fragment.
Table 1. The 12 DNA fragments
(2) The alphabet of a DNA sequence fragment is $\Sigma = \{A, C, G, T\}$, and $|\Sigma|$ denotes the number of letters in the alphabet. Initialize the word length $L$, then slide a fixed-length window over each DNA sequence fragment to obtain the set of words $Y$ to be processed; this set contains $|\Sigma|^L$ words. Compute the entropy $h$ of each word from its position information $X_t$.
The position information $X_t$ of a word is the reciprocal of the distance between two consecutive occurrences of that word in the DNA sequence fragment:
$$X_t = \frac{1}{LF_{t+1}^{Y} - LF_{t}^{Y}}, \quad t = 1, 2, \ldots, z-1;$$
$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \log_2 P[t], \quad \lambda = 1, 2, \ldots, |\Sigma|^L;$$
where $Y$ denotes a word to be processed, $t$ indexes its occurrences in order, $LF_t^{Y}$ is the position of the $t$-th occurrence of word $Y$ in the DNA sequence fragment, $Y_\lambda$ is the $\lambda$-th word, $\lambda$ is the index of the word, and $z$ is the number of times the word occurs. $P[t]$ is the $t$-th value of the discrete probability distribution $P$, defined as the ratio of the partial sum $S_t$ to the total $Z$, that is, $P[t] = S_t / Z$.
The partial sum $S_t$ is the sum of the position information up to $t$: $S_t = X_1 + X_2 + \cdots + X_t$.
The total is $Z = S_1 + S_2 + \cdots + S_n$.
(3) Compute the feature vector: normalize the entropy to obtain the standard entropy $H_{LF}$, which serves as the feature variable for the hash functions. The standard entropy $H_{LF}$ is computed as:
$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z}\log_2\frac{1}{z}}, \quad \lambda = 1, 2, \ldots, |\Sigma|^L$$
where $h(Y_\lambda)$ is the entropy of word $Y_\lambda$ and $z$ is the number of times the word occurs.
The feature vectors computed from the formula are listed in the following table:
Table 2. Feature vectors
No.      AA      AC      AG      AT      CA      CC      CG      CT      GA      GC      GG      GT      TA      TC      TG      TT
1     5.096   1.067  -0.001  -0.001   0.258  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
2     5.166  -0.001  -0.001   0.957  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001
3     4.990   1.561  -0.001  -0.001   1.559  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
4     5.134   0.708  -0.001  -0.001   0.000   0.000  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
5     5.201  -0.001   0.000  -0.001  -0.001  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
6     5.190   0.000  -0.001  -0.001   0.000  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
7     5.059   0.482  -0.001   0.000   0.475  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001
8     5.065   0.408  -0.001   0.000   0.402  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001
9     4.950   1.592  -0.001   0.000   1.094  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001
10    5.131  -0.001   0.000   0.000  -0.001  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001   0.000  -0.001  -0.001  -0.001
11    5.125  -0.001  -0.001   0.994  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001   0.992  -0.001  -0.001  -0.001
12    5.032   1.498  -0.001  -0.001   1.162  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001  -0.001
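The 16 column headers of Table 2 are consistent with a word length of L = 2 over the alphabet {A, C, G, T}, since 4^2 = 16; this value of L is inferred from the table and is not stated explicitly for the embodiment. A short sketch, using the hypothetical helper `all_words`, enumerates the words in the same order as the table columns:

```python
from itertools import product

ALPHABET = "ACGT"

def all_words(word_len):
    """Enumerate all len(ALPHABET) ** word_len words in lexicographic order."""
    return ["".join(letters) for letters in product(ALPHABET, repeat=word_len)]

words = all_words(2)
print(len(words))   # 16
print(words[:5])    # ['AA', 'AC', 'AG', 'AT', 'CA']
```

Fixing the word order in this way ensures that position k of every feature vector refers to the same word across all N fragments.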
(4) Compute the hash matrix HM: apply the LSH method to the standard entropy $H_{LF}$ of the N DNA sequence fragments, using num_f hash functions $f_{a,m}(v)$ to obtain a num_f × N hash matrix HM.
Here $v$ is the feature vector of a DNA sequence fragment, $a$ is a random vector with the same dimension as $v$ and entries between 0 and 1, $m$ is an arbitrary integer from 0 to $w$, and $w$ is an arbitrary positive integer. Each hash function $f_{a,m}(v)$ thus maps a d-dimensional vector $v$ to an integer. In this embodiment, num_f = 8 and w = 4.
Table 3. Hash matrix
(5) Compute the spliced hash matrix PHM: using a parameter b, divide the hash matrix HM into b buckets of r rows each, where r = num_f / b. In HM, row i (i ∈ [1, num_f]) corresponds to the i-th hash function and column j (j ∈ [1, N]) to the j-th DNA sequence fragment, so $HM_{ij}$ is the integer obtained by applying the i-th hash function to the standard entropy of the j-th DNA sequence fragment. Keep only the first three digits of each $HM_{ij}$, padding with zeros when there are fewer than three digits. Finally, within each bucket, concatenate the r values in column j into a single spliced hash value, yielding the b × N spliced hash matrix PHM.
(6) Compute the candidate DNA sequence fragment sets: for a DNA sequence fragment $S_i$ (i ∈ [1, N]), if the spliced hash matrix PHM contains a DNA sequence fragment $S_j$ (j ∈ [1, N], i ≠ j) whose spliced hash value equals that of $S_i$ in the same row, then $S_j$ is a candidate DNA sequence fragment of $S_i$. All candidate DNA sequence fragments of $S_i$ together form its candidate DNA sequence fragment set Candidate.
For ease of understanding, the candidate DNA sequence fragment sets of the 12 sequences in this embodiment are listed in the following table, where each row gives a sequence followed by the numbers of its candidate sequences:
Table 4. Candidate sequence sets
s[1] 3 9 12 4 5 6 7 8 10 2
s[2] 5 6 10 11 4 7 8 1
s[3] 1 9 12
s[4] 1 5 6 7 8 9 10 12 2 11
s[5] 2 6 10 11 1 4 7 8 9 12
s[6] 2 5 10 11 1 4 7 8 9 12
s[7] 8 1 4 5 6 9 10 12 2 11
s[8] 7 1 4 5 6 9 10 12 2 11
s[9] 1 3 12 4 5 6 7 8 10
s[10] 2 5 6 11 1 4 7 8 9 12
s[11] 2 5 6 10 4 7 8
s[12] 1 3 9 4 5 6 7 8 10
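Since a candidate pair is defined by two fragments sharing a spliced hash value, the relation in Table 4 should be symmetric. The sketch below transcribes Table 4 into a dictionary and checks that property; the names `TABLE_4` and `is_symmetric` are hypothetical and the check is an illustration, not part of the patented method.

```python
# Candidate sets transcribed from Table 4 (sequence number -> candidate numbers).
TABLE_4 = {
    1: [3, 9, 12, 4, 5, 6, 7, 8, 10, 2],
    2: [5, 6, 10, 11, 4, 7, 8, 1],
    3: [1, 9, 12],
    4: [1, 5, 6, 7, 8, 9, 10, 12, 2, 11],
    5: [2, 6, 10, 11, 1, 4, 7, 8, 9, 12],
    6: [2, 5, 10, 11, 1, 4, 7, 8, 9, 12],
    7: [8, 1, 4, 5, 6, 9, 10, 12, 2, 11],
    8: [7, 1, 4, 5, 6, 9, 10, 12, 2, 11],
    9: [1, 3, 12, 4, 5, 6, 7, 8, 10],
    10: [2, 5, 6, 11, 1, 4, 7, 8, 9, 12],
    11: [2, 5, 6, 10, 4, 7, 8],
    12: [1, 3, 9, 4, 5, 6, 7, 8, 10],
}

def is_symmetric(table):
    """True when j listed as a candidate of i always implies i is listed for j."""
    return all(i in table[j] for i, cands in table.items() for j in cands)

print(is_symmetric(TABLE_4))   # True
```

Sequence 3 has only three candidates (1, 9 and 12), so at most three edit distance computations are needed when it is chosen as a cluster center, compared with eleven for an exhaustive pairwise comparison.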
(7) Cluster: randomly select an unclustered DNA sequence fragment as a cluster center. From the candidate DNA sequence fragment set Candidate of that center, select the candidate sequences whose edit distance to the center is less than the specified threshold d; these form one cluster. Sequences that have already been clustered are not used again as cluster centers and do not appear in other clusters. Store the clustered DNA sequence fragments in the set clustered, and repeat the clustering step until all DNA sequence fragments have been clustered.

Claims (2)

1. A DNA sequence clustering method using locality-sensitive hashing based on standard entropy, characterized in that it comprises the following steps:
(1) Sequence the entire target sequence using second-generation sequencing technology to obtain a batch of short DNA reads; each short read is referred to as a DNA sequence fragment.
(2) The alphabet of a DNA sequence fragment is $\Sigma = \{A, C, G, T\}$, and $|\Sigma|$ denotes the number of letters in the alphabet. Initialize the word length $L$, then slide a fixed-length window over each DNA sequence fragment to obtain the set of words $Y$ to be processed; this set contains $|\Sigma|^L$ words. Compute the entropy $h$ of each word from its position information $X_t$.
The position information $X_t$ of a word is the reciprocal of the distance between two consecutive occurrences of that word in the DNA sequence fragment:
$$X_t = \frac{1}{LF_{t+1}^{Y} - LF_{t}^{Y}}, \quad t = 1, 2, \ldots, z-1;$$
$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \log_2 P[t], \quad \lambda = 1, 2, \ldots, |\Sigma|^L;$$
where $Y$ denotes a word to be processed, $t$ indexes its occurrences in order, $LF_t^{Y}$ is the position of the $t$-th occurrence of word $Y$ in the DNA sequence fragment, $Y_\lambda$ is the $\lambda$-th word, $\lambda$ is the index of the word, and $z$ is the number of times the word occurs. $P[t]$ is the $t$-th value of the discrete probability distribution $P$, defined as the ratio of the partial sum $S_t$ to the total $Z$, that is, $P[t] = S_t / Z$.
The partial sum $S_t$ is the sum of the position information up to $t$: $S_t = X_1 + X_2 + \cdots + X_t$.
The total is $Z = S_1 + S_2 + \cdots + S_n$.
(3) Compute the feature vector: normalize the entropy to obtain the standard entropy $H_{LF}$, which serves as the feature variable for the hash functions. The standard entropy $H_{LF}$ is computed as:
$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z}\log_2\frac{1}{z}}, \quad \lambda = 1, 2, \ldots, |\Sigma|^L$$
where $h(Y_\lambda)$ is the entropy of word $Y_\lambda$ and $z$ is the number of times the word occurs.
(4) Hash matrix H M is calculated:By the corresponding standard entropy H of N bar sequence dna fragmentsLFCalculated, made using LSH methods The Hash matrix H M for obtaining num_f*N is calculated with num_f hash function, the formula of hash function is as follows:
Wherein v is the characteristic vector of sequence dna fragment, and a is the random vector between characteristic vector v numbers identical 0 to 1, m For 0 any integer for arriving w, w is any positive integer, such hash function fa,m(v) a d dimension space vector v is mapped as one Integer;
(5) splicing Hash matrix PHM is calculated:Using variable b, Hash matrix H M is divided into b bucket, each bucket has r rows, wherein r =num_f/b, for each barrel of Hash matrix H M, i-th (i ∈ [1, num_f]) row represents i-th of hash function, jth (j ∈ [1, N]) row represent j-th strip sequence dna fragment, then HMijRepresent the standard entropy of j-th strip sequence dna fragment using i-th of Kazakhstan Uncommon function carries out the integer value after Hash mapping;Then to HMijOnly retain front three, then complementary 0 less than three;Finally will HMjThe often capable splicing Hash matrix PHM for being spliced as Hash splicing value, obtaining b*N;
(6) candidate's sequence dna fragment set is calculated:For sequence dna fragment Si(i ∈ [1, N]), when in splicing Hash matrix PHM In there is sequence dna fragment Sj(j ∈ [1, N], i ≠ j) and SiIt is identical in the Hash splicing value with a line, then SjIt is SiCandidate Sequence dna fragment, SiAll candidate's sequence dna fragments constitute candidate's sequence dna fragment set Candidate;
(7) Performing the clustering: a DNA sequence fragment that has not yet been clustered is selected at random as the cluster center; from the candidate DNA sequence fragment set Candidate corresponding to that cluster center, the candidate sequences whose edit distance to the cluster center is less than the specified threshold d are selected as one cluster result; the DNA sequence fragments that have been clustered are stored in clustered; and the above clustering step is repeated until all DNA sequence fragments have been clustered.
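Steps (5) to (7) can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names are hypothetical, HM is assumed to hold non-negative integers, "padding with 0" is assumed to mean left-padding, and fragments already assigned to a cluster are assumed to be skipped when a later center is processed.

```python
import random
from collections import defaultdict


def splice_hash_matrix(hm, b):
    """Step (5): split the num_f x N matrix HM into b buckets of r rows and
    concatenate, per column, the r three-digit values of each bucket."""
    num_f, n = len(hm), len(hm[0])
    if num_f % b != 0:
        raise ValueError("num_f must be divisible by b")
    r = num_f // b

    def three_digits(value):
        # Assumption: non-negative integers; keep the first three digits
        # and left-pad shorter values with zeros.
        return str(value)[:3].zfill(3)

    return [
        ["".join(three_digits(hm[k * r + i][j]) for i in range(r)) for j in range(n)]
        for k in range(b)
    ]


def candidate_sets(phm):
    """Step (6): fragments sharing a splice value in the same row of PHM
    are candidates of each other."""
    n = len(phm[0])
    candidates = [set() for _ in range(n)]
    for row in phm:
        groups = defaultdict(list)
        for j, key in enumerate(row):
            groups[key].append(j)
        for members in groups.values():
            for j in members:
                candidates[j].update(x for x in members if x != j)
    return candidates


def edit_distance(s, t):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def cluster(fragments, candidates, d, seed=0):
    """Step (7): pick unclustered centers at random and attach the candidates
    whose edit distance to the center is below the threshold d."""
    rng = random.Random(seed)
    clustered = set()
    clusters = []
    while len(clustered) < len(fragments):
        center = rng.choice([i for i in range(len(fragments)) if i not in clustered])
        members = [center] + [
            j for j in sorted(candidates[center])
            if j not in clustered  # assumption: skip fragments already clustered
            and edit_distance(fragments[center], fragments[j]) < d
        ]
        clustered.update(members)
        clusters.append(members)
    return clusters
```

Because only fragments that collide in at least one bucket are compared, the edit distance is evaluated for a small candidate set rather than for all N*(N-1)/2 pairs.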
2. The DNA sequence clustering with locality-sensitive hashing based on standard entropy according to claim 1, characterized in that: in step (4), the value of w is 4.
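The normalization of step (3) and the hashing of step (4) with w = 4 can be sketched as follows. The hash formula itself is not reproduced in the text above, so this sketch assumes the common p-stable LSH form floor((a . v + m) / w); that form, the function names, and the seeded random generator are assumptions, and h(Y_λ) is taken as an input computed elsewhere.

```python
import math
import random


def standard_entropy(h_word, z):
    """H_LF = h(Y) / (-(1/z) * log2(1/z)); h(Y) is supplied by the caller."""
    if z <= 0:
        raise ValueError("z must be positive")
    denom = -(1.0 / z) * math.log2(1.0 / z)
    if denom == 0:
        raise ValueError("normalizer is zero (z == 1)")
    return h_word / denom


def lsh_hash(v, a, m, w=4):
    """Assumed hash form: floor((a . v + m) / w), mapping a vector to an integer."""
    dot = sum(ai * vi for ai, vi in zip(a, v))
    return math.floor((dot + m) / w)


def hash_matrix(features, num_f, w=4, seed=0):
    """Build the num_f x N matrix HM; row i applies the i-th (a, m) pair
    to the feature vector of every fragment."""
    rng = random.Random(seed)
    dim = len(features[0])
    # a: random vector with entries in [0, 1); m: random integer in [0, w].
    funcs = [
        ([rng.random() for _ in range(dim)], rng.randint(0, w))
        for _ in range(num_f)
    ]
    return [[lsh_hash(v, a, m, w) for v in features] for a, m in funcs]
```

Fragments with identical feature vectors always receive identical columns in HM, which is the property the later bucketing step relies on.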
CN201710285598.7A 2017-04-27 2017-04-27 DNA sequence clustering with locality-sensitive hashing based on standard entropy Expired - Fee Related CN107103206B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710285598.7A CN107103206B (en) 2017-04-27 2017-04-27 The DNA sequence dna of local sensitivity Hash based on standard entropy clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710285598.7A CN107103206B (en) 2017-04-27 2017-04-27 The DNA sequence dna of local sensitivity Hash based on standard entropy clusters

Publications (2)

Publication Number Publication Date
CN107103206A (en) 2017-08-29
CN107103206B (en) 2019-10-18

Family

ID=59657448

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710285598.7A Expired - Fee Related CN107103206B (en) 2017-04-27 2017-04-27 DNA sequence clustering with locality-sensitive hashing based on standard entropy

Country Status (1)

Country Link
CN (1) CN107103206B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243529A (en) * 2018-08-28 2019-01-18 福建师范大学 Horizontal transfer gene identification method based on locality sensitive hashing
CN112397148A (en) * 2019-08-23 2021-02-23 武汉未来组生物科技有限公司 Sequence comparison method, sequence correction method and device thereof
CN113420141A (en) * 2021-06-24 2021-09-21 中国人民解放军陆军工程大学 Sensitive data searching method based on Hash clustering and context information

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631928A (en) * 2013-12-05 2014-03-12 中国科学院信息工程研究所 LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system
CN104531848A (en) * 2014-12-11 2015-04-22 杭州和壹基因科技有限公司 Method and system for assembling genome sequence
CN106557668A (en) * 2016-11-04 2017-04-05 福建师范大学 DNA sequence similarity detection method based on LF entropy

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Yipu: "A new clustering refinement algorithm for DNA motif discovery", Journal of Xidian University (Natural Science Edition) *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109243529A (en) * 2018-08-28 2019-01-18 福建师范大学 Horizontal transfer gene identification method based on locality sensitive hashing
CN109243529B (en) * 2018-08-28 2021-09-07 福建师范大学 Horizontal transfer gene identification method based on locality sensitive hashing
CN112397148A (en) * 2019-08-23 2021-02-23 武汉未来组生物科技有限公司 Sequence comparison method, sequence correction method and device thereof
CN112397148B (en) * 2019-08-23 2023-10-03 武汉希望组生物科技有限公司 Sequence comparison method, sequence correction method and device thereof
CN113420141A (en) * 2021-06-24 2021-09-21 中国人民解放军陆军工程大学 Sensitive data searching method based on Hash clustering and context information
CN113420141B (en) * 2021-06-24 2022-10-04 中国人民解放军陆军工程大学 Sensitive data searching method based on Hash clustering and context information

Also Published As

Publication number Publication date
CN107103206B (en) 2019-10-18

Similar Documents

Publication Publication Date Title
Erisoglu et al. A new algorithm for initial cluster centers in k-means algorithm
Du et al. Spatial and spectral unmixing using the beta compositional model
Zheng et al. Gene differential coexpression analysis based on biweight correlation and maximum clique
CN112750502B (en) Single cell transcriptome sequencing data clustering recommendation method based on two-dimensional distribution structure judgment
CN107292341A (en) Adaptive multi views clustering method based on paired collaboration regularization and NMF
US20210104298A1 (en) Secure communication of nucleic acid sequence information through a network
CN111710364B (en) Method, device, terminal and storage medium for acquiring flora marker
Karamichalis et al. Additive methods for genomic signatures
CN107103206B (en) DNA sequence clustering with locality-sensitive hashing based on standard entropy
Bogner et al. Characterising flow patterns in soils by feature extraction and multiple consensus clustering
McClelland et al. EMDUniFrac: exact linear time computation of the UniFrac metric and identification of differentially abundant organisms
Chen et al. Estimating large covariance matrix with network topology for high-dimensional biomedical data
Jeong et al. PRIME: a probabilistic imputation method to reduce dropout effects in single-cell RNA sequencing
EP3435264B1 (en) Method and system for identification and classification of operational taxonomic units in a metagenomic sample
CN107480471A (en) Method for sequence similarity analysis based on wavelet-transform characterization
CN106557668B (en) DNA sequence similarity detection method based on LF entropy
Anyaso-Samuel et al. Metagenomic geolocation prediction using an adaptive ensemble classifier
CN110442674B (en) Label propagation clustering method, terminal equipment, storage medium and device
CN110819704A (en) Methods and systems for improving microbial community taxonomy resolution based on amplicon sequencing
Mohammadi et al. Estimating missing value in microarray data using fuzzy clustering and gene ontology
CN111383716A (en) Method and device for screening gene pairs, computer equipment and storage medium
Mallick et al. A scalable unsupervised learning of scRNAseq data detects rare cells through integration of structure-preserving embedding, clustering and outlier detection
Gudodagi et al. Investigations and Compression of Genomic Data
Pourhashem et al. Missing value estimation in microarray data using fuzzy clustering and semantic similarity
Tapinos et al. Alignment by numbers: sequence assembly using compressed numerical representations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Jiang Binghua, Jiang Yue, Xu Pengna, Lin Jie
Inventor before: Jiang Yue, Xu Pengna, Lin Jie
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee
Granted publication date: 20191018