CN107103206A

CN107103206A - DNA sequence clustering using locality-sensitive hashing based on standard entropy

Info

Publication number: CN107103206A
Application number: CN201710285598.7A
Authority: CN
Inventors: 江育娥; 徐彭娜; 林劼
Original assignee: Fujian Normal University
Current assignee: Fujian Normal University
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2017-08-29
Anticipated expiration: 2037-04-27
Also published as: CN107103206B

Abstract

The present invention discloses DNA sequence clustering using locality-sensitive hashing based on standard entropy. The original DNA sequences are mapped according to the L-Gram model, the Local Frequency (LF) entropies of the N sequences are calculated to form a matrix, and the standard entropy is derived from it. The standard entropy is hash-mapped using Locality-Sensitive Hashing to obtain a candidate set of DNA sequence fragments, and the DNA sequence fragments in the candidate set whose edit distance is less than d are computed to obtain the clustering result. The feature space obtained after the transformation contains sufficient original DNA information and avoids the loss of DNA information. Each DNA sequence is converted into a new space and a candidate DNA sequence fragment set is calculated for each DNA sequence fragment, which can improve computation speed and accuracy.

Description

DNA sequence clustering using locality-sensitive hashing based on standard entropy

Technical field

The present invention relates to the field of bioinformatics, and more particularly to DNA sequence clustering using locality-sensitive hashing based on standard entropy.

Background

With the arrival of the Internet era and the development of information technology, gene sequencing technology has matured. Together with the progress of various genome projects, this has caused the quantity of biological data to grow explosively, and traditional methods cannot meet the demands of processing and analysing data at this scale. Bioinformatics combines biology with computer technology and interacts with mathematics to acquire, process, extract, analyse and store biological information and to mine the positional information of genetic material. Data mining is a technology that extracts useful, potentially valid information from large amounts of data. Clustering in data mining groups together sequences that share certain features, which allows the function or structure of the data to be analysed more effectively; exploring the useful information of unknown sequences from sequences of known function and structure is of great significance.

Existing sequence clustering methods have many defects. Traditional clustering algorithms such as the partition-based K-medoid algorithm and the hierarchy-based complete-link algorithm need to compare sequences pairwise and have high time complexity; because the number of DNA sequences is now growing extremely fast, these algorithms cannot be applied to massive data. The K-means algorithm requires the number of clusters to be determined in advance, the centroid of sequence data is not easy to compute, and randomly chosen initial cluster centres make the clustering result unstable, so it performs poorly on biological sequence data. Clustering algorithms based on BAG graphs give effective results, but the splitting of a class must be guided by cluster units, and the number of sequences in a gene bank is so large that representing them with an undirected graph becomes difficult.

Summary of the invention

The object of the present invention is to overcome the deficiencies of the prior art and to provide DNA sequence clustering using locality-sensitive hashing based on standard entropy.

To achieve this object, the present invention adopts the following technical solution:

The DNA sequence clustering using locality-sensitive hashing based on standard entropy comprises the following steps:

(1) The whole sequence under test is sequenced using second-generation sequencing technology to obtain a batch of short DNA fragments, each of which is called a DNA sequence fragment;

(2) The alphabet of a DNA sequence fragment is {A, C, G, T}, and |∑| denotes the number of letters in the alphabet. The word length L of the words to be processed is initialised, and a fixed-length sliding window is applied to the DNA sequence fragment to obtain the set of words Y to be processed; the number of words in this set is |∑|^L. The entropy h of each word is calculated from its positional information X_t;

The positional information X_t of a word to be processed is the reciprocal of the distance between the two corresponding positions when the word occurs twice in the DNA sequence fragment;

$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \cdot \log_2 P[t], \quad (\lambda = 1, 2, \ldots, |\Sigma|^L);$$

where Y denotes a word to be processed; t indexes the occurrences of the word in order; LF_t^Y denotes the position of the t-th occurrence of word Y in the DNA sequence fragment; Y_λ denotes the λ-th word to be processed, λ being the index of the word; z denotes the number of times the word occurs; and P[t] is the t-th value of the discrete probability distribution P, namely the ratio of the partial sum S_t to the total Z;

The partial sum S_t is the sum of the positional information X_t: S_t = X₁ + X₂ + ... + X_t;

The total Z = S₁ + S₂ + ... + S_n;

(3) Calculating the feature vector: the entropy is normalised by the following formula to obtain the standard entropy H_LF, which serves as the feature variable of the hash function:

$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z} \cdot \log_2 \frac{1}{z}}, \quad (\lambda = 1, 2, \ldots, |\Sigma|^L)$$

where h(Y_λ) is the entropy of word Y_λ and z is the number of times the word occurs;

(4) Calculating the hash matrix HM: the standard entropies H_LF of the N DNA sequence fragments are processed using the LSH method; num_f hash functions are used to compute a num_f*N hash matrix HM. The formula of the hash function is as follows:

where v is the feature vector of a DNA sequence fragment, a is a random vector with the same number of components as v and entries between 0 and 1, m is any integer from 0 to w, and w is any positive integer; the hash function f_a,m(v) thus maps a d-dimensional vector v to an integer;

(5) Calculating the spliced hash matrix PHM: using a variable b, the hash matrix HM is divided into b buckets of r rows each, where r = num_f/b. In each bucket of HM, the i-th row (i ∈ [1, num_f]) corresponds to the i-th hash function and the j-th column (j ∈ [1, N]) corresponds to the j-th DNA sequence fragment, so HM_ij is the integer obtained by hash-mapping the standard entropy of the j-th DNA sequence fragment with the i-th hash function. Only the first three digits of HM_ij are retained, and values with fewer than three digits are padded with 0. Finally, the rows of each bucket in column HM_j are concatenated to form the hash splice value, giving the b*N spliced hash matrix PHM;

(6) Calculating the candidate DNA sequence fragment set: for a DNA sequence fragment S_i (i ∈ [1, N]), if the spliced hash matrix PHM contains a DNA sequence fragment S_j (j ∈ [1, N], i ≠ j) whose hash splice value is identical to that of S_i in the same row, then S_j is a candidate DNA sequence fragment of S_i; all candidate DNA sequence fragments of S_i form its candidate DNA sequence fragment set Candidate;

(7) Clustering: a DNA sequence fragment that has not yet been clustered is selected at random as the cluster centre. From the candidate DNA sequence fragment set Candidate of that cluster centre, the candidate sequences whose edit distance to the cluster centre is less than the specified threshold d are selected as one cluster result (a sequence that has already been clustered is no longer used as a cluster centre and does not appear in other cluster results). The clustered DNA sequence fragments are stored in clustered, and the above clustering step is repeated until all DNA sequence fragments have been clustered (that is, clustered contains all sequences).

The edit distance reflects the degree of dissimilarity between two sequences: the larger the edit distance, the more dissimilar they are. In the clustering process similar sequences are to be grouped together, so a threshold d can be specified; an edit distance less than d indicates that the two sequences meet the defined similarity requirement.
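
Step (7) can be illustrated with a minimal Python sketch. It assumes that the edit distance is the Levenshtein distance (the text does not name a specific variant), uses 0-based fragment indices, and takes the candidate sets of step (6) as given. The names `edit_distance` and `cluster` and the `seed` parameter are illustrative and do not come from the patent.

```python
import random


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cluster(fragments: list[str], candidates: list[set[int]], d: int,
            seed: int = 0) -> list[list[int]]:
    """Repeatedly pick a random unclustered fragment as centre and group it
    with its unclustered candidates whose edit distance is less than d."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    clustered: set[int] = set()
    clusters: list[list[int]] = []
    while len(clustered) < len(fragments):
        remaining = [i for i in range(len(fragments)) if i not in clustered]
        centre = rng.choice(remaining)
        members = [centre] + [
            j for j in sorted(candidates[centre])
            if j not in clustered
            and edit_distance(fragments[centre], fragments[j]) < d
        ]
        clustered.update(members)
        clusters.append(members)
    return clusters
```

Only fragments in the candidate set of the centre are compared, which is where the reduction in pairwise comparisons described in the text comes from.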

Further, in step (4), the value of w is 4.
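
The hash function formula of step (4) is not reproduced in this text. The following Python sketch therefore assumes the widely used p-stable LSH form floor((a·v + m) / w), which is consistent with the stated roles of a, m and w but may differ from the formula of the patent. The function names and the `seed` parameter are illustrative.

```python
import math
import random


def make_hash_functions(dim: int, num_f: int, w: int = 4, seed: int = 0):
    """Draw num_f pairs (a, m): a has dim entries in [0, 1), m is an integer in [0, w]."""
    rng = random.Random(seed)
    return [([rng.random() for _ in range(dim)], rng.randint(0, w))
            for _ in range(num_f)]


def lsh_hash(v: list[float], a: list[float], m: int, w: int = 4) -> int:
    """Assumed form: project v onto a, shift by m, divide by w and round down."""
    dot = sum(a_k * v_k for a_k, v_k in zip(a, v))
    return math.floor((dot + m) / w)


def hash_matrix(vectors: list[list[float]], functions, w: int = 4) -> list[list[int]]:
    """num_f x N matrix: one row per hash function, one column per fragment."""
    return [[lsh_hash(v, a, m, w) for v in vectors] for a, m in functions]
```

With w = 4 as in the preferred setting, feature vectors whose projections fall into the same interval of width 4 receive the same integer.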

By adopting the above technical solution, the present invention maps the original DNA sequences according to the L-Gram model: since a DNA sequence is composed of the four letters {A, T, C, G} and the word length is L, |∑|^L words to be processed are obtained, and the original DNA sequence is mapped to a new numerical sequence. The Local Frequency (LF) entropies of the N sequences are calculated to form an N*|∑|^L matrix, from which the standard entropy is derived as the feature vector. The standard entropy is hash-mapped using Locality-Sensitive Hashing (LSH) to obtain a candidate set of DNA sequence fragments, and the DNA sequence fragments in the candidate set whose edit distance is less than the specified threshold d are computed to obtain similar sequences.

The feature space obtained after the Local Frequency transformation contains sufficient original DNA information, which avoids the loss of DNA sequence information. The entropy calculated on the basis of Local Frequency reflects the structural information of the DNA sequence more finely. Mapping the feature vectors with Locality-Sensitive Hashing to obtain candidate DNA sequence fragment sets greatly reduces the number of pairwise sequence comparisons and shortens the sequence alignment time.

DNA sequence similarity, as a basic measure in bioinformatics, is applied in many settings, including predicting the role and function of an unknown sequence, building phylogenetic trees of organisms or species, and analysing the homology of species. For clustering large numbers of DNA sequences, the method of the present invention calculates the standard entropy of each DNA sequence as the feature vector for locality-sensitive hashing and calculates the candidate DNA sequence fragment set of each DNA sequence fragment, which significantly reduces the number of sequences to be aligned and increases computation speed while maintaining accuracy.
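
The L-Gram mapping and the Local Frequency entropy described above can be sketched in Python as follows. The sketch assumes that X_t is taken between consecutive occurrences of a word, that positions are 1-based window start positions, and that a word with fewer than two occurrences has entropy 0; none of these details is stated explicitly in the text, and the function names are illustrative.

```python
import math
from collections import defaultdict
from itertools import accumulate


def word_positions(fragment: str, L: int) -> dict[str, list[int]]:
    """Slide a window of length L over the fragment and record the
    1-based start positions of every word that occurs."""
    positions = defaultdict(list)
    for start in range(len(fragment) - L + 1):
        positions[fragment[start:start + L]].append(start + 1)
    return dict(positions)


def lf_entropy(positions: list[int]) -> float:
    """Entropy h of one word from its occurrence positions.

    X_t = 1 / (gap between occurrence t and t + 1)
    S_t = X_1 + ... + X_t,  Z = S_1 + ... + S_n,  P[t] = S_t / Z
    h   = -sum over t of P[t] * log2(P[t])
    """
    if len(positions) < 2:
        return 0.0  # assumption: a single occurrence yields no gaps
    x = [1.0 / (later - earlier)
         for earlier, later in zip(positions, positions[1:])]
    s = list(accumulate(x))  # partial sums S_1 .. S_n
    total = sum(s)           # Z
    return -sum((s_t / total) * math.log2(s_t / total) for s_t in s)
```

For example, the fragment "AAAA" with L = 2 contains the word "AA" at positions 1, 2 and 3, giving X = (1, 1), S = (1, 2), Z = 3 and P = (1/3, 2/3).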

Brief description of the drawings

The present invention is described in further detail below with reference to the drawing and specific embodiments:

Fig. 1 is a flow chart of the DNA sequence clustering using locality-sensitive hashing based on standard entropy according to the present invention.

Embodiment

As shown in Fig. 1, the DNA sequence clustering using locality-sensitive hashing based on standard entropy according to the present invention comprises the following steps:

The partial sum S_t is the sum of the positional information X_t: S_t = X₁ + X₂ + ... + X_t;

The total Z = S₁ + S₂ + ... + S_n;

(7) Clustering: a DNA sequence fragment that has not yet been clustered is selected at random as the cluster centre. From the candidate DNA sequence fragment set Candidate of that cluster centre, the candidate sequences whose edit distance to the cluster centre is less than the specified threshold d are selected as one cluster result (a sequence that has already been clustered is no longer used as a cluster centre and does not appear in other cluster results). The clustered DNA sequence fragments are stored in clustered, and the above clustering step is repeated until all DNA sequence fragments have been clustered.

The processing procedure of the present invention is described in detail below:

To describe the DNA sequence processing procedure of the invention more clearly, 12 DNA coding sequences are selected at random as the objects of analysis, and the implementation of the invention is described in detail using these DNA sequences as samples. The steps of the DNA sequence clustering using locality-sensitive hashing based on standard entropy are as follows:

Table 1: 12 DNA fragments

The feature vector table calculated according to the formulas is as follows:

Table 2: Feature vectors

No.	AA	AC	AG	AT	CA	CC	CG	CT	GA	GC	GG	GT	TA	TC	TG	TT
1	5.096	1.067	-0.001	-0.001	0.258	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001
2	5.166	-0.001	-0.001	0.957	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001
3	4.990	1.561	-0.001	-0.001	1.559	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001
4	5.134	0.708	-0.001	-0.001	0.000	0.000	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001
5	5.201	-0.001	0.000	-0.001	-0.001	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001
6	5.190	0.000	-0.001	-0.001	0.000	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001
7	5.059	0.482	-0.001	0.000	0.475	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001
8	5.065	0.408	-0.001	0.000	0.402	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001
9	4.950	1.592	-0.001	0.000	1.094	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001
10	5.131	-0.001	0.000	0.000	-0.001	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001	0.000	-0.001	-0.001	-0.001
11	5.125	-0.001	-0.001	0.994	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	0.992	-0.001	-0.001	-0.001
12	5.032	1.498	-0.001	-0.001	1.162	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001	-0.001
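
The normalisation of step (3), which yields feature values of the kind listed in Table 2, can be sketched in Python as follows. The denominator follows the standard entropy formula as written in the claims, -(1/z) * log2(1/z). How words with fewer than two occurrences are treated is not stated; the sketch returns 0.0 for them as an assumption, so it is not expected to reproduce the exact entries of Table 2. The function name is illustrative.

```python
import math


def standard_entropy(h: float, z: int) -> float:
    """Normalise the entropy h of a word that occurs z times.

    H_LF = h / ( -(1/z) * log2(1/z) ), as written in the claims.
    """
    if z < 2:
        # Assumption: for z = 1 the denominator is zero, so the value is
        # defined as 0.0 here; the patent text does not cover this case.
        return 0.0
    denominator = -(1.0 / z) * math.log2(1.0 / z)
    return h / denominator
```

For a word that occurs z = 4 times, the denominator is -(1/4) * log2(1/4) = 0.5, so an entropy of h = 0.5 gives H_LF = 1.0.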

where v is the feature vector of a DNA sequence fragment, a is a random vector with the same number of components as v and entries between 0 and 1, m is any integer from 0 to w, and w is any positive integer; the hash function f_a,m(v) thus maps a d-dimensional vector v to an integer. In this embodiment, num_f = 8 and w = 4;

Table 3: Hash matrix
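
The splicing of step (5), which operates on a hash matrix such as that of Table 3, can be sketched in Python as follows. The text does not say on which side values with fewer than three digits are padded, nor how negative hash values are handled; the sketch pads on the left and uses the absolute value, both as assumptions. The function names are illustrative.

```python
def three_digit_key(value: int) -> str:
    """Keep the first three digits of a hash value; pad shorter values with 0.

    Assumptions: left padding, and the sign of the value is ignored.
    """
    return str(abs(value))[:3].zfill(3)


def spliced_hash_matrix(hm: list[list[int]], b: int) -> list[list[str]]:
    """Split the num_f x N matrix hm into b buckets of r = num_f / b rows and
    concatenate, per column, the three-digit keys of each bucket (b x N result)."""
    num_f = len(hm)
    if num_f % b != 0:
        raise ValueError("num_f must be divisible by b")
    r = num_f // b
    n = len(hm[0])
    phm = []
    for bucket in range(b):
        rows = hm[bucket * r:(bucket + 1) * r]
        phm.append(["".join(three_digit_key(row[j]) for row in rows)
                    for j in range(n)])
    return phm
```

With num_f = 8 as in this embodiment, choosing for example b = 4 would give buckets of r = 2 rows and splice values of six digits; the value of b used in the embodiment is not given in the text.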

The following table lists the candidate DNA sequence fragment sets. For ease of understanding, the candidate DNA sequence fragment sets of the 12 sequences in this embodiment are summarised as follows:

Table 4: Candidate sequence sets

Sequence	Candidate sequences
s[1]	3 9 12 4 5 6 7 8 10 2
s[2]	5 6 10 11 4 7 8 1
s[3]	1 9 12
s[4]	1 5 6 7 8 9 10 12 2 11
s[5]	2 6 10 11 1 4 7 8 9 12
s[6]	2 5 10 11 1 4 7 8 9 12
s[7]	8 1 4 5 6 9 10 12 2 11
s[8]	7 1 4 5 6 9 10 12 2 11
s[9]	1 3 12 4 5 6 7 8 10
s[10]	2 5 6 11 1 4 7 8 9 12
s[11]	2 5 6 10 4 7 8
s[12]	1 3 9 4 5 6 7 8 10
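
Candidate sets of the kind shown in Table 4 follow from step (6): two fragments are candidates of each other when their hash splice values agree in at least one row of the spliced hash matrix PHM. A minimal Python sketch, using 0-based fragment indices and an illustrative function name, is given below.

```python
from collections import defaultdict


def candidate_sets(phm: list[list[str]]) -> list[set[int]]:
    """For each fragment (column of phm), collect every other fragment that
    has an identical splice value in at least one row."""
    n = len(phm[0])
    candidates: list[set[int]] = [set() for _ in range(n)]
    for row in phm:
        groups = defaultdict(list)  # splice value -> fragment indices
        for j, key in enumerate(row):
            groups[key].append(j)
        for members in groups.values():
            for i in members:
                candidates[i].update(j for j in members if j != i)
    return candidates
```

Grouping identical splice values per row avoids comparing every pair of fragments directly.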

Claims

1. DNA sequence clustering using locality-sensitive hashing based on standard entropy, characterised in that it comprises the following steps:

(2) The alphabet of a DNA sequence fragment is {A, C, G, T}, and |∑| denotes the number of letters in the alphabet. The word length L of the words to be processed is initialised, and a fixed-length sliding window is applied to the DNA sequence fragment to obtain the set of words Y to be processed; the number of words in this set is |∑|^L. The entropy h of each word is calculated from its positional information X_t;

The positional information X_t of a word to be processed is the reciprocal of the distance between the two corresponding positions when the word occurs twice in the DNA sequence fragment;

$$h(Y_\lambda) = -\sum_{t=1}^{z-1} P[t] \cdot \log_2 P[t], \quad (\lambda = 1, 2, \ldots, |\Sigma|^L);$$

where Y denotes a word to be processed; t indexes the occurrences of the word in order; LF_t^Y denotes the position of the t-th occurrence of word Y in the DNA sequence fragment; Y_λ denotes the λ-th word to be processed, λ being the index of the word; z denotes the number of times the word occurs; and P[t] is the t-th value of the discrete probability distribution P, namely the ratio of the partial sum S_t to the total Z;

The partial sum S_t is the sum of the positional information X_t: S_t = X₁ + X₂ + ... + X_t;

The total Z = S₁ + S₂ + ... + S_n;

(3) Calculating the feature vector: the entropy is normalised by the following formula to obtain the standard entropy H_LF, which serves as the feature variable of the hash function:

$$H_{LF} = \frac{h(Y_\lambda)}{-\frac{1}{z} \cdot \log_2 \frac{1}{z}}, \quad (\lambda = 1, 2, \ldots, |\Sigma|^L)$$

(4) Calculating the hash matrix HM: the standard entropies H_LF of the N DNA sequence fragments are processed using the LSH method; num_f hash functions are used to compute a num_f*N hash matrix HM. The formula of the hash function is as follows:

where v is the feature vector of a DNA sequence fragment, a is a random vector with the same number of components as v and entries between 0 and 1, m is any integer from 0 to w, and w is any positive integer; the hash function f_a,m(v) thus maps a d-dimensional vector v to an integer;

(5) Calculating the spliced hash matrix PHM: using a variable b, the hash matrix HM is divided into b buckets of r rows each, where r = num_f/b. In each bucket of HM, the i-th row (i ∈ [1, num_f]) corresponds to the i-th hash function and the j-th column (j ∈ [1, N]) corresponds to the j-th DNA sequence fragment, so HM_ij is the integer obtained by hash-mapping the standard entropy of the j-th DNA sequence fragment with the i-th hash function. Only the first three digits of HM_ij are retained, and values with fewer than three digits are padded with 0. Finally, the rows of each bucket in column HM_j are concatenated to form the hash splice value, giving the b*N spliced hash matrix PHM;

(6) Calculating the candidate DNA sequence fragment set: for a DNA sequence fragment S_i (i ∈ [1, N]), if the spliced hash matrix PHM contains a DNA sequence fragment S_j (j ∈ [1, N], i ≠ j) whose hash splice value is identical to that of S_i in the same row, then S_j is a candidate DNA sequence fragment of S_i; all candidate DNA sequence fragments of S_i form its candidate DNA sequence fragment set Candidate;

(7) Clustering: a DNA sequence fragment that has not yet been clustered is selected at random as the cluster centre. From the candidate DNA sequence fragment set Candidate of that cluster centre, the candidate sequences whose edit distance to the cluster centre is less than the specified threshold d are selected as one cluster result. The clustered DNA sequence fragments are stored in clustered, and the above clustering step is repeated until all DNA sequence fragments have been clustered.

2. The DNA sequence clustering using locality-sensitive hashing based on standard entropy according to claim 1, characterised in that, in step (4), the value of w is 4.