CN107103206A - DNA sequence clustering using locality-sensitive hashing based on standard entropy - Google Patents
DNA sequence clustering using locality-sensitive hashing based on standard entropy
- Publication number
- CN107103206A (application CN201710285598.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Biophysics (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Biotechnology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The present invention discloses a DNA sequence clustering method using locality-sensitive hashing based on standard entropy. The original DNA sequences are mapped according to the L-gram model; the local frequency (LF) entropies of the N sequences are computed and arranged into a matrix, from which the standard entropy is derived; the standard entropy is hash-mapped using locality-sensitive hashing (LSH) to obtain a candidate set for each DNA sequence fragment; and within each candidate set, the DNA sequence fragments whose edit distance is less than d are identified to give the clustering result. The transformed feature space retains enough of the original DNA information to avoid information loss. Each DNA sequence is mapped into a new space and a candidate set is computed for every DNA sequence fragment, which can improve both computation speed and accuracy.
Description
Technical field
The present invention relates to the field of bioinformatics, and more particularly to DNA sequence clustering using locality-sensitive hashing based on standard entropy.
Background
With the arrival of the Internet era and the development of information technology, gene sequencing technology has matured, and as various genome projects have progressed, the volume of biological data has grown explosively. Traditional methods can no longer meet the demands of processing and analyzing such massive data. Bioinformatics combines biology with computer technology and draws on mathematics to acquire, process, extract, analyze and store biological information and to mine the positional information of genetic material. Data mining is a technology for extracting useful and potentially valuable information from large amounts of data. Clustering, a data mining technique, groups together sequences that share certain features so that the function or structure of the data can be analyzed more effectively; it is of great significance for inferring useful information about unknown sequences from sequences of known function and structure.
Existing sequence clustering methods have many shortcomings. Traditional clustering algorithms such as the partition-based K-medoid algorithm and the hierarchy-based complete-link algorithm require pairwise comparison of sequences and therefore have high time complexity; because the number of DNA sequences is now growing extremely fast, these algorithms cannot be applied to massive data. The K-means algorithm requires the number of clusters to be fixed in advance, the centroid of sequence data is not easy to compute, and randomly chosen initial cluster centers make the clustering result unstable, so it performs poorly on biological sequence data. Clustering algorithms based on BAG graphs give effective results, but the splitting of classes must be guided by clustering units, and because gene databases contain so many sequences, it is difficult to represent them with an undirected graph.
Summary of the invention
The object of the present invention is to overcome the deficiencies of the prior art by providing a DNA sequence clustering method using locality-sensitive hashing based on standard entropy.
To achieve this object, the present invention adopts the following technical solution:
The DNA sequence clustering method using locality-sensitive hashing based on standard entropy comprises the following steps:
(1) The whole sequence under test is sequenced using second-generation sequencing technology to obtain a batch of short DNA segments, each of which is referred to as a DNA sequence fragment;
(2) The alphabet of a DNA sequence fragment is ∑ = {A, C, G, T}, and |∑| denotes the number of letters in the alphabet. The word length L of the words to be processed is initialized, and a fixed-length sliding window is applied to the DNA sequence fragment to obtain the set of words Y to be processed; this set contains |∑|^L words. The entropy h of each word is computed from its position information X_t;
The position information X_t of a word is the reciprocal of the distance between the positions of two successive occurrences of the word in the DNA sequence fragment:

$$X_t = \frac{1}{LF^{Y}_{t+1} - LF^{Y}_{t}}, \quad (t = 1, 2, \ldots, z-1);$$

$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \cdot \log_2 P[t], \quad (\lambda = 1, 2, \ldots, |\Sigma|^{L});$$

where Y denotes a word to be processed; t indexes the occurrences of the word; LF_t^Y is the position of the t-th occurrence of word Y in the DNA sequence fragment; Y_λ is the λ-th word to be processed, λ being the index of the word; z is the number of times the word occurs; and P[t] is the t-th element of the discrete probability distribution P, namely the ratio of the partial sum S_t to the total Z, P[t] = S_t / Z;
The partial sum S_t is the sum of the position information values, S_t = X_1 + X_2 + ... + X_t;
The total Z = S_1 + S_2 + ... + S_n;
(3) Compute the feature vector: the entropy is normalized by the following formula to obtain the standard entropy H_LF, which serves as the feature variable of the hash function:

$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z} \cdot \log_2 \frac{1}{z}}, \quad (\lambda = 1, 2, \ldots, |\Sigma|^{L})$$

where h(Y_λ) is the entropy of word Y_λ and z is the number of times the word occurs;
(4) Compute the hash matrix HM: the standard entropies H_LF of the N DNA sequence fragments are processed with the LSH method, and num_f hash functions are applied to obtain a num_f × N hash matrix HM. In the hash function f_{a,m}(v), v is the feature vector of a DNA sequence fragment, a is a random vector with the same number of components as v, each between 0 and 1, m is an arbitrary integer from 0 to w, and w is an arbitrary positive integer. Such a hash function f_{a,m}(v) maps a d-dimensional vector v to an integer;
(5) Compute the spliced hash matrix PHM: using a variable b, the hash matrix HM is divided into b buckets of r rows each, where r = num_f / b. In the hash matrix HM, row i (i ∈ [1, num_f]) corresponds to the i-th hash function and column j (j ∈ [1, N]) corresponds to the j-th DNA sequence fragment, so HM_ij is the integer obtained by hashing the standard entropy of the j-th DNA sequence fragment with the i-th hash function. Only the first three digits of each HM_ij are kept, and values with fewer than three digits are padded with 0. Finally, within each bucket, the entries of each column HM_j are concatenated to form a spliced hash value, giving the b × N spliced hash matrix PHM;
(6) Compute the candidate DNA sequence fragment sets: for a DNA sequence fragment S_i (i ∈ [1, N]), if the spliced hash matrix PHM contains a DNA sequence fragment S_j (j ∈ [1, N], i ≠ j) whose spliced hash value equals that of S_i in the same row, then S_j is a candidate DNA sequence fragment of S_i. All candidate DNA sequence fragments of S_i form its candidate set Candidate;
(7) Perform the clustering: a DNA sequence fragment that has not yet been clustered is selected at random as a cluster center. From the candidate set Candidate of that cluster center, the candidate sequences whose edit distance to the cluster center is less than the specified threshold d are selected as one cluster result. Because a sequence that has already been clustered can neither serve as a cluster center again nor appear in another cluster result, the clustered DNA sequence fragments are stored in clustered. The above clustering step is repeated until all DNA sequence fragments have been clustered (that is, clustered contains all sequences).
The edit distance reflects the dissimilarity between two sequences: the larger the edit distance, the less similar the sequences are. Since clustering aims to group similar sequences together, a threshold d is specified, and two sequences whose edit distance is less than d are considered to satisfy the similarity requirement.
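As an illustration of the similarity test described above, the following is a minimal sketch of the standard Levenshtein edit distance and the threshold check. The function names `edit_distance` and `is_similar` are hypothetical and not taken from the patent, and the patent does not state which edit-distance variant it uses; unit-cost insertion, deletion and substitution are assumed here.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (unit costs assumed)."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance between s[:i] and the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def is_similar(s, t, d):
    """Two sequences count as similar when their edit distance is below d."""
    return edit_distance(s, t) < d
```

The row-by-row formulation keeps memory proportional to the length of one sequence, which matters when many candidate pairs are compared.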
Further, in step (4), w takes the value 4.
With the above technical solution, the present invention maps the original DNA sequences according to the L-gram model: because a DNA sequence is composed of the four letters {A, T, C, G}, a word length of L yields |∑|^L words to be processed, and the mapping turns each original DNA sequence into a new numerical sequence. The local frequency (LF) entropies of the N sequences are computed to form an N × |∑|^L matrix, from which the standard entropy is derived and used as the feature vector. The standard entropy is hash-mapped using locality-sensitive hashing (LSH) to obtain the candidate set of each DNA sequence fragment, and within each candidate set the DNA sequence fragments whose edit distance is less than the specified threshold d are identified as similar sequences. The feature space obtained from the local frequency transformation retains enough of the original DNA information to avoid loss of DNA sequence information; the entropy computed from local frequency reflects the structural information of a DNA sequence more finely; and mapping the feature vectors with locality-sensitive hashing to obtain candidate DNA sequence fragment sets greatly reduces the number of pairwise sequence comparisons and shortens the sequence alignment time. DNA sequence similarity is a basic measure in bioinformatics with applications in many settings, including predicting the role and function of an unknown sequence, building phylogenetic trees of organisms or species, and analyzing the homology of species. For clustering large numbers of DNA sequences, the method of the present invention computes the standard entropy of each DNA sequence, uses it as the feature vector for locality-sensitive hashing, and computes the candidate set of each DNA sequence fragment, which substantially reduces the number of sequences to be aligned and improves computation speed while maintaining accuracy.
Brief description of the drawings
The present invention is described in further detail below with reference to the drawing and a specific embodiment:
Fig. 1 is a flowchart of the DNA sequence clustering method of the present invention using locality-sensitive hashing based on standard entropy.
Detailed description
As shown in Fig. 1, the DNA sequence clustering method of the present invention, using locality-sensitive hashing based on standard entropy, comprises the following steps:
The processing procedure of the present invention is described in detail below:
To describe the DNA sequence processing procedure of the present invention more clearly, 12 DNA coding sequences are selected at random as the objects of analysis, and the implementation of the invention is described in detail using these DNA sequences as samples. The steps of DNA sequence clustering using locality-sensitive hashing based on standard entropy are as follows:
(1) The whole sequence under test is sequenced using second-generation sequencing technology to obtain a batch of short DNA segments, each of which is referred to as a DNA sequence fragment;
Table 1: the 12 DNA fragments
(2) The alphabet of a DNA sequence fragment is ∑ = {A, C, G, T}, and |∑| denotes the number of letters in the alphabet. The word length L of the words to be processed is initialized, and a fixed-length sliding window is applied to the DNA sequence fragment to obtain the set of words Y to be processed; this set contains |∑|^L words. The entropy h of each word is computed from its position information X_t;
The position information X_t of a word is the reciprocal of the distance between the positions of two successive occurrences of the word in the DNA sequence fragment:

$$X_t = \frac{1}{LF^{Y}_{t+1} - LF^{Y}_{t}}, \quad (t = 1, 2, \ldots, z-1);$$

$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \cdot \log_2 P[t], \quad (\lambda = 1, 2, \ldots, |\Sigma|^{L});$$

where Y denotes a word to be processed; t indexes the occurrences of the word; LF_t^Y is the position of the t-th occurrence of word Y in the DNA sequence fragment; Y_λ is the λ-th word to be processed, λ being the index of the word; z is the number of times the word occurs; and P[t] is the t-th element of the discrete probability distribution P, namely the ratio of the partial sum S_t to the total Z, P[t] = S_t / Z;
The partial sum S_t is the sum of the position information values, S_t = X_1 + X_2 + ... + X_t;
The total Z = S_1 + S_2 + ... + S_n;
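Step (2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `word_positions` and `lf_entropy` are hypothetical, occurrences are allowed to overlap, positions are 0-based, n is taken to be z − 1, and a word occurring fewer than twice is given entropy 0.0 because the text does not say how that case is handled.

```python
import math


def word_positions(seq, word):
    """0-based start positions of every (possibly overlapping) occurrence."""
    L = len(word)
    return [i for i in range(len(seq) - L + 1) if seq[i:i + L] == word]


def lf_entropy(seq, word):
    """Local frequency entropy h of one word in one DNA sequence fragment."""
    pos = word_positions(seq, word)
    z = len(pos)
    if z < 2:
        return 0.0  # assumption: no gap between occurrences, so no entropy
    # X_t: reciprocal of the distance between successive occurrences.
    x = [1.0 / (pos[t + 1] - pos[t]) for t in range(z - 1)]
    # S_t: partial sums of X; Z: sum of all partial sums.
    s, running = [], 0.0
    for xt in x:
        running += xt
        s.append(running)
    total = sum(s)
    # P[t] = S_t / Z, then Shannon entropy in bits.
    p = [st / total for st in s]
    return -sum(pt * math.log2(pt) for pt in p)
```

For example, in the fragment `AAAA` the word `AA` occurs at positions 0, 1 and 2, so X = [1, 1], S = [1, 2], Z = 3 and P = [1/3, 2/3].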
(3) Compute the feature vector: the entropy is normalized by the following formula to obtain the standard entropy H_LF, which serves as the feature variable of the hash function:

$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z} \cdot \log_2 \frac{1}{z}}, \quad (\lambda = 1, 2, \ldots, |\Sigma|^{L})$$

where h(Y_λ) is the entropy of word Y_λ and z is the number of times the word occurs;
The feature vectors computed with this formula are listed below:
Table 2: feature vectors
No. | AA | AC | AG | AT | CA | CC | CG | CT | GA | GC | GG | GT | TA | TC | TG | TT |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 5.096 | 1.067 | -0.001 | -0.001 | 0.258 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 |
2 | 5.166 | -0.001 | -0.001 | 0.957 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 |
3 | 4.990 | 1.561 | -0.001 | -0.001 | 1.559 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 |
4 | 5.134 | 0.708 | -0.001 | -0.001 | 0.000 | 0.000 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 |
5 | 5.201 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 |
6 | 5.190 | 0.000 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 |
7 | 5.059 | 0.482 | -0.001 | 0.000 | 0.475 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 |
8 | 5.065 | 0.408 | -0.001 | 0.000 | 0.402 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 |
9 | 4.950 | 1.592 | -0.001 | 0.000 | 1.094 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 |
10 | 5.131 | -0.001 | 0.000 | 0.000 | -0.001 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 | 0.000 | -0.001 | -0.001 | -0.001 |
11 | 5.125 | -0.001 | -0.001 | 0.994 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | 0.992 | -0.001 | -0.001 | -0.001 |
12 | 5.032 | 1.498 | -0.001 | -0.001 | 1.162 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 |
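The normalization in step (3) and the assembly of one feature vector per fragment can be sketched as follows. The function names `standard_entropy` and `feature_vector` are hypothetical. The sketch makes no attempt to reproduce the values in Table 2: the text does not state how words that occur fewer than twice are encoded (the table shows both -0.001 and 0.000), so returning 0.0 in that case is purely an assumption, as is the default word length L = 2.

```python
import math
from itertools import product


def standard_entropy(seq, word):
    """Standard entropy H_LF of one word: h divided by -(1/z) * log2(1/z)."""
    L = len(word)
    pos = [i for i in range(len(seq) - L + 1) if seq[i:i + L] == word]
    z = len(pos)
    if z < 2:
        return 0.0  # assumption: the normalizer is zero or undefined here
    x = [1.0 / (b - a) for a, b in zip(pos, pos[1:])]
    s, running = [], 0.0
    for xt in x:
        running += xt
        s.append(running)
    total = sum(s)
    h = -sum((st / total) * math.log2(st / total) for st in s)
    norm = -(1.0 / z) * math.log2(1.0 / z)
    return h / norm


def feature_vector(seq, L=2, alphabet="ACGT"):
    """One standard entropy per L-gram, in lexicographic order (AA, AC, ...)."""
    words = ["".join(p) for p in product(alphabet, repeat=L)]
    return [standard_entropy(seq, w) for w in words]
```

With L = 2 the vector has 4^2 = 16 components, in the same column order as Table 2.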
(4) Compute the hash matrix HM: the standard entropies H_LF of the N DNA sequence fragments are processed with the LSH method, and num_f hash functions are applied to obtain a num_f × N hash matrix HM. In the hash function f_{a,m}(v), v is the feature vector of a DNA sequence fragment, a is a random vector with the same number of components as v, each between 0 and 1, m is an arbitrary integer from 0 to w, and w is an arbitrary positive integer. Such a hash function f_{a,m}(v) maps a d-dimensional vector v to an integer. In this embodiment, num_f = 8 and w = 4;
Table 3: hash matrix
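The hash function formula itself is not reproduced in this text. The sketch below therefore assumes the form commonly used in LSH for Euclidean space, f_{a,m}(v) = floor((a · v + m) / w), which fits the description of a, m and w in step (4) but is not confirmed by the source. The names `make_hash_functions`, `hash_value` and `hash_matrix` are hypothetical.

```python
import math
import random


def make_hash_functions(dim, num_f=8, w=4, seed=0):
    """Draw num_f pairs (a, m): a has dim entries in [0, 1), m is in 0..w."""
    rng = random.Random(seed)
    return [([rng.random() for _ in range(dim)], rng.randint(0, w))
            for _ in range(num_f)]


def hash_value(v, a, m, w=4):
    """Assumed form floor((a . v + m) / w); maps a vector to an integer."""
    dot = sum(ai * vi for ai, vi in zip(a, v))
    return math.floor((dot + m) / w)


def hash_matrix(features, funcs, w=4):
    """num_f x N matrix: row i is hash function i, column j is fragment j."""
    return [[hash_value(v, a, m, w) for v in features] for (a, m) in funcs]
```

Fragments with identical feature vectors always receive identical columns, and nearby vectors are likely to agree in many rows, which is the property the later steps rely on.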
(5) Compute the spliced hash matrix PHM: using a variable b, the hash matrix HM is divided into b buckets of r rows each, where r = num_f / b. In the hash matrix HM, row i (i ∈ [1, num_f]) corresponds to the i-th hash function and column j (j ∈ [1, N]) corresponds to the j-th DNA sequence fragment, so HM_ij is the integer obtained by hashing the standard entropy of the j-th DNA sequence fragment with the i-th hash function. Only the first three digits of each HM_ij are kept, and values with fewer than three digits are padded with 0. Finally, within each bucket, the entries of each column HM_j are concatenated to form a spliced hash value, giving the b × N spliced hash matrix PHM;
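The splicing in step (5) can be sketched as follows. The function name `splice_hash_matrix` is hypothetical. The text does not say on which side short values are padded, so left padding with zeros is an assumption, and the sketch is written with non-negative hash values in mind.

```python
def splice_hash_matrix(hm, b):
    """Split the num_f x N matrix hm into b buckets and splice each column."""
    num_f = len(hm)
    if num_f % b != 0:
        raise ValueError("num_f must be divisible by b")
    r = num_f // b  # rows per bucket
    n = len(hm[0])

    def three_digits(value):
        # Keep the first three digits; left-pad with zeros (assumption).
        return str(value)[:3].zfill(3)

    phm = []
    for bucket in range(b):
        rows = hm[bucket * r:(bucket + 1) * r]
        # One spliced string per fragment: the r entries of its column, joined.
        phm.append(["".join(three_digits(row[j]) for row in rows)
                    for j in range(n)])
    return phm
```

With num_f = 8 as in this embodiment, b = 4 would give r = 2 rows per bucket and a 4 × N matrix of six-character spliced values; the value of b is not stated in the text, so 4 is only an example.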
(6) Compute the candidate DNA sequence fragment sets: for a DNA sequence fragment S_i (i ∈ [1, N]), if the spliced hash matrix PHM contains a DNA sequence fragment S_j (j ∈ [1, N], i ≠ j) whose spliced hash value equals that of S_i in the same row, then S_j is a candidate DNA sequence fragment of S_i. All candidate DNA sequence fragments of S_i form its candidate set Candidate;
For ease of understanding, the candidate DNA sequence fragment sets of the 12 sequences in this embodiment are summarized in the following table:
Table 4: candidate sequence sets
Sequence | Candidate sequences |
---|---|
s[1] | 3 9 12 4 5 6 7 8 10 2 |
s[2] | 5 6 10 11 4 7 8 1 |
s[3] | 1 9 12 |
s[4] | 1 5 6 7 8 9 10 12 2 11 |
s[5] | 2 6 10 11 1 4 7 8 9 12 |
s[6] | 2 5 10 11 1 4 7 8 9 12 |
s[7] | 8 1 4 5 6 9 10 12 2 11 |
s[8] | 7 1 4 5 6 9 10 12 2 11 |
s[9] | 1 3 12 4 5 6 7 8 10 |
s[10] | 2 5 6 11 1 4 7 8 9 12 |
s[11] | 2 5 6 10 4 7 8 |
s[12] | 1 3 9 4 5 6 7 8 10 |
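The construction of the candidate sets in step (6) can be sketched as follows. The function name `candidate_sets` is hypothetical, and fragments are indexed from 0 here, whereas Table 4 numbers them from 1.

```python
from collections import defaultdict


def candidate_sets(phm):
    """For each fragment, collect every other fragment that shares a spliced
    hash value with it in at least one row of the b x N matrix phm."""
    n = len(phm[0])
    candidates = [set() for _ in range(n)]
    for row in phm:
        # Group fragment indices by their spliced value in this row.
        groups = defaultdict(list)
        for j, key in enumerate(row):
            groups[key].append(j)
        for members in groups.values():
            for j in members:
                candidates[j].update(k for k in members if k != j)
    return candidates
```

Grouping by value within each row avoids comparing every pair of fragments directly; only fragments that collide in some row are ever paired.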
(7) Perform the clustering: a DNA sequence fragment that has not yet been clustered is selected at random as a cluster center. From the candidate set Candidate of that cluster center, the candidate sequences whose edit distance to the cluster center is less than the specified threshold d are selected as one cluster result. Because a sequence that has already been clustered can neither serve as a cluster center again nor appear in another cluster result, the clustered DNA sequence fragments are stored in clustered. The above clustering step is repeated until all DNA sequence fragments have been clustered.
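The clustering loop of step (7) can be sketched as follows. The function name `cluster` and its parameters are hypothetical. The distance function is passed in as an argument so that any edit-distance implementation can be supplied, the cluster center is counted as a member of its own cluster (an assumption), and a seeded random generator is used only to make runs repeatable.

```python
import random


def cluster(sequences, candidates, d, distance, seed=0):
    """Greedy clustering: pick a random unclustered center, then add its
    unclustered candidates whose distance to the center is below d."""
    rng = random.Random(seed)
    clustered = set()   # indices already assigned to some cluster
    clusters = []
    unclustered = list(range(len(sequences)))
    while unclustered:
        center = rng.choice(unclustered)
        members = [center]
        for j in sorted(candidates[center]):
            if j == center or j in clustered:
                continue  # clustered fragments never join a second cluster
            if distance(sequences[center], sequences[j]) < d:
                members.append(j)
        clustered.update(members)
        clusters.append(members)
        unclustered = [i for i in unclustered if i not in clustered]
    return clusters
```

Each pass removes at least the chosen center from the unclustered list, so the loop terminates after at most N passes, and a fragment with no close candidate ends up in a cluster of its own.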
Claims (2)
1. A DNA sequence clustering method using locality-sensitive hashing based on standard entropy, characterized in that it comprises the following steps:
(1) sequencing the whole sequence under test using second-generation sequencing technology to obtain a batch of short DNA segments, each of which is referred to as a DNA sequence fragment;
(2) the alphabet of a DNA sequence fragment is ∑ = {A, C, G, T}, and |∑| denotes the number of letters in the alphabet; the word length L of the words to be processed is initialized, and a fixed-length sliding window is applied to the DNA sequence fragment to obtain the set of words Y to be processed, which contains |∑|^L words; the entropy h of each word is computed from its position information X_t;
the position information X_t of a word is the reciprocal of the distance between the positions of two successive occurrences of the word in the DNA sequence fragment:
$$X_t = \frac{1}{LF^{Y}_{t+1} - LF^{Y}_{t}}, \quad (t = 1, 2, \ldots, z-1);$$

$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \cdot \log_2 P[t], \quad (\lambda = 1, 2, \ldots, |\Sigma|^{L});$$
where Y denotes a word to be processed; t indexes the occurrences of the word; LF_t^Y is the position of the t-th occurrence of word Y in the DNA sequence fragment; Y_λ is the λ-th word to be processed, λ being the index of the word; z is the number of times the word occurs; and P[t] is the t-th element of the discrete probability distribution P, namely the ratio of the partial sum S_t to the total Z;
the partial sum S_t is the sum of the position information values, S_t = X_1 + X_2 + ... + X_t;
the total Z = S_1 + S_2 + ... + S_n;
(3) computing the feature vector: the entropy is normalized by the following formula to obtain the standard entropy H_LF, which serves as the feature variable of the hash function:
$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z} \cdot \log_2 \frac{1}{z}}, \quad (\lambda = 1, 2, \ldots, |\Sigma|^{L})$$
h(Yλ) it is word YλEntropy, z represents the frequency that pending word occurs;
(4) Hash matrix H M is calculated:By the corresponding standard entropy H of N bar sequence dna fragmentsLFCalculated, made using LSH methods
The Hash matrix H M for obtaining num_f*N is calculated with num_f hash function, the formula of hash function is as follows:
Wherein v is the characteristic vector of sequence dna fragment, and a is the random vector between characteristic vector v numbers identical 0 to 1, m
For 0 any integer for arriving w, w is any positive integer, such hash function fa,m(v) a d dimension space vector v is mapped as one
Integer;
(5) splicing Hash matrix PHM is calculated:Using variable b, Hash matrix H M is divided into b bucket, each bucket has r rows, wherein r
=num_f/b, for each barrel of Hash matrix H M, i-th (i ∈ [1, num_f]) row represents i-th of hash function, jth (j ∈
[1, N]) row represent j-th strip sequence dna fragment, then HMijRepresent the standard entropy of j-th strip sequence dna fragment using i-th of Kazakhstan
Uncommon function carries out the integer value after Hash mapping;Then to HMijOnly retain front three, then complementary 0 less than three;Finally will
HMjThe often capable splicing Hash matrix PHM for being spliced as Hash splicing value, obtaining b*N;
(6) candidate's sequence dna fragment set is calculated:For sequence dna fragment Si(i ∈ [1, N]), when in splicing Hash matrix PHM
In there is sequence dna fragment Sj(j ∈ [1, N], i ≠ j) and SiIt is identical in the Hash splicing value with a line, then SjIt is SiCandidate
Sequence dna fragment, SiAll candidate's sequence dna fragments constitute candidate's sequence dna fragment set Candidate;
(7) cluster is realized:A sequence dna fragment not being clustered is randomly selected as cluster centre, the cluster centre is screened
The editing distance of corresponding candidate's sequence dna fragment set Candidate and the cluster centre is less than the threshold values d specified candidate
The sequence dna fragment being clustered is stored in clustered by sequence as a cluster result, circulates above-mentioned cluster
Step, until all sequence dna fragments are all clustered.
2. The DNA sequence clustering using locality-sensitive hashing based on standard entropy according to
claim 1, characterized in that: in step (4), the value of w is 4.
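The hash function of step (4), with w = 4 as in claim 2, can be sketched in Python as follows. The formula is an assumption: the sketch uses the common p-stable LSH form floor((a·v + m)/w), which is consistent with the description of a, m and w but may differ from the formula actually given in step (4). The names `make_hash` and `hash_matrix` are hypothetical.

```python
import math
import random


def make_hash(dim, w=4, rng=None):
    # Build one hash function f_{a,m}; the floor((a.v + m) / w) form is assumed.
    rng = rng or random.Random()
    a = [rng.random() for _ in range(dim)]  # random vector, entries in [0, 1)
    m = rng.randint(0, w)                   # random integer between 0 and w inclusive

    def f(v):
        dot = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((dot + m) / w)

    return f


def hash_matrix(vectors, num_f, w=4, seed=0):
    # Row i holds the values of the i-th hash function, column j the j-th fragment.
    rng = random.Random(seed)
    funcs = [make_hash(len(vectors[0]), w, rng) for _ in range(num_f)]
    return [[f(v) for v in vectors] for f in funcs]


# Example: three fragments described by two-dimensional feature vectors.
hm = hash_matrix([[1.0, 2.0], [1.0, 2.0], [50.0, 60.0]], num_f=4)
```

Identical feature vectors always receive identical hash values, and nearby vectors usually do, which is the property that the bucketing in step (5) relies on.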
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710285598.7A CN107103206B (en) | 2017-04-27 | 2017-04-27 | DNA sequence clustering using locality-sensitive hashing based on standard entropy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107103206A (en) | 2017-08-29 |
CN107103206B (en) | 2019-10-18 |
Family
ID=59657448
Family Applications (1)
Application Number | Status | Publication Number | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201710285598.7A | Expired - Fee Related | CN107103206B (en) | 2017-04-27 | 2017-04-27 | DNA sequence clustering using locality-sensitive hashing based on standard entropy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107103206B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103631928A (en) * | 2013-12-05 | 2014-03-12 | 中国科学院信息工程研究所 | LSH (Locality Sensitive Hashing)-based clustering and indexing method and LSH-based clustering and indexing system |
CN104531848A (en) * | 2014-12-11 | 2015-04-22 | 杭州和壹基因科技有限公司 | Method and system for assembling genome sequence |
CN106557668A (en) * | 2016-11-04 | 2017-04-05 | 福建师范大学 | DNA sequence similarity testing method based on LF entropy |
Non-Patent Citations (1)
Title |
---|
Zhang Yipu (张懿璞): "A new clustering refinement algorithm for DNA motif discovery", Journal of Xidian University (Natural Science Edition) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109243529A (en) * | 2018-08-28 | 2019-01-18 | 福建师范大学 | Gene transferring horizontally recognition methods based on local sensitivity Hash |
CN109243529B (en) * | 2018-08-28 | 2021-09-07 | 福建师范大学 | Horizontal transfer gene identification method based on locality sensitive hashing |
CN112397148A (en) * | 2019-08-23 | 2021-02-23 | 武汉未来组生物科技有限公司 | Sequence comparison method, sequence correction method and device thereof |
CN112397148B (en) * | 2019-08-23 | 2023-10-03 | 武汉希望组生物科技有限公司 | Sequence comparison method, sequence correction method and device thereof |
CN113420141A (en) * | 2021-06-24 | 2021-09-21 | 中国人民解放军陆军工程大学 | Sensitive data searching method based on Hash clustering and context information |
CN113420141B (en) * | 2021-06-24 | 2022-10-04 | 中国人民解放军陆军工程大学 | Sensitive data searching method based on Hash clustering and context information |
Also Published As
Publication number | Publication date |
---|---|
CN107103206B (en) | 2019-10-18 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Erisoglu et al. | A new algorithm for initial cluster centers in k-means algorithm | |
Du et al. | Spatial and spectral unmixing using the beta compositional model | |
Zheng et al. | Gene differential coexpression analysis based on biweight correlation and maximum clique | |
CN112750502B (en) | Single cell transcriptome sequencing data clustering recommendation method based on two-dimensional distribution structure judgment | |
CN107292341A (en) | Adaptive multi views clustering method based on paired collaboration regularization and NMF | |
US20210104298A1 (en) | Secure communication of nucleic acid sequence information through a network | |
CN111710364B (en) | Method, device, terminal and storage medium for acquiring flora marker | |
Karamichalis et al. | Additive methods for genomic signatures | |
CN107103206B (en) | DNA sequence clustering using locality-sensitive hashing based on standard entropy | |
Bogner et al. | Characterising flow patterns in soils by feature extraction and multiple consensus clustering | |
McClelland et al. | EMDUniFrac: exact linear time computation of the UniFrac metric and identification of differentially abundant organisms | |
Chen et al. | Estimating large covariance matrix with network topology for high-dimensional biomedical data | |
Jeong et al. | PRIME: a probabilistic imputation method to reduce dropout effects in single-cell RNA sequencing | |
EP3435264B1 (en) | Method and system for identification and classification of operational taxonomic units in a metagenomic sample | |
CN107480471A (en) | Method for sequence similarity analysis based on wavelet transform features | |
CN106557668B (en) | DNA sequence similarity testing method based on LF entropy | |
Anyaso-Samuel et al. | Metagenomic geolocation prediction using an adaptive ensemble classifier | |
CN110442674B (en) | Label propagation clustering method, terminal equipment, storage medium and device | |
CN110819704A (en) | Methods and systems for improving microbial community taxonomy resolution based on amplicon sequencing | |
Mohammadi et al. | Estimating missing value in microarray data using fuzzy clustering and gene ontology | |
CN111383716A (en) | Method and device for screening gene pairs, computer equipment and storage medium | |
Mallick et al. | A scalable unsupervised learning of scRNAseq data detects rare cells through integration of structure-preserving embedding, clustering and outlier detection | |
Gudodagi et al. | Investigations and Compression of Genomic Data | |
Pourhashem et al. | Missing value estimation in microarray data using fuzzy clustering and semantic similarity | |
Tapinos et al. | Alignment by numbers: sequence assembly using compressed numerical representations |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB03 | Change of inventor or designer information | Inventor after: Jiang Binghua, Jiang Yue, Xu Pengna, Lin Jie. Inventor before: Jiang Yue, Xu Pengna, Lin Jie |
GR01 | Patent grant | |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20191018 |