CN107480469B - Method for rapidly searching given pattern in gene sequence - Google Patents
- Publication number
- CN107480469B (application CN201710642956.5A)
- Authority
- CN
- China
- Prior art keywords
- pwm
- matrix
- pwm matrix
- vector
- matching
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Abstract
The invention relates to a method for rapidly searching for given patterns in a gene sequence, comprising the following steps: counting the background distribution of the input sequence; converting the position frequency matrix of each given pattern into a position weight matrix (PWM matrix), and calculating a threshold value K of each PWM matrix; calculating an expected vector of each PWM matrix; calculating the pre-calculation window position of each PWM matrix; calculating the total arrangement scores of the pre-calculation window of each PWM matrix; calculating a matching sequence vector of each PWM matrix; calculating the maximum score vector of each PWM matrix; and scanning the input sequence with a sliding window and matching it against each PWM matrix to obtain a result vector of successful matches. Compared with the prior art, the method matches all patterns in a single sequential pass over the input sequence, instead of scanning and scoring the sequence once per pattern, and therefore achieves a high matching speed.
Description
Technical Field
The present invention relates to bioinformatics analysis techniques of gene sequences, and more particularly, to a method for rapidly searching for a given pattern in a gene sequence.
Background
Genomes evolve through the accumulation of random mutations and selection for diverse functional requirements. For species with short generation times and large populations (e.g., bacteria), this strong selective pressure produces a highly optimized genome with densely packed genes and negligible non-functional sequence, which makes prokaryotic gene prediction relatively simple. In contrast, the genomes of species with longer generation times and smaller populations (e.g., vertebrates) accumulate large amounts of genetic material, most of which arises without selection. More than half of the human genome consists of retrotransposon elements, DNA transposons and other kinds of repeats. Functional sequences and regulatory elements account for only a small portion of a vertebrate genome, which makes them difficult to identify. Most vertebrate genes are interrupted by introns, which are filled with material that is largely under weak selective constraint; long introns are particularly difficult to model. Identifying alternative splicing, like modeling non-coding transcripts, requires more complex algorithms. De novo vertebrate gene prediction is therefore a very challenging task for computational biology. In DNA and other biological sequences, patterns such as transcription factor binding sites can be represented by a statistical model, the position weight matrix (PWM matrix).
Current sequencing technology makes it possible to sequence a complete genome in a very short time. Next-generation sequencing technology is expected to sequence a human genome for about $1000 in less than a day. Many other genomes will also be sequenced and assembled with high quality, including those of drosophila, mouse, chicken, chimpanzee, dog, pig, cat, horse, cow and zebrafish. As databases of DNA and protein sequences grow exponentially, collections of weight matrices have been compiled (e.g., the TRANSFAC, PRINTS, BLOCKS and JASPAR databases). Even though existing hardware is very fast, it remains very difficult to search such huge genetic data sets for the patterns these statistical models describe, typically transcription factor binding sites. In this context, a fast technique for matching weight matrices (PWM matrices) against sequences is of exceptional importance.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art by providing a method for rapidly searching for a given pattern in a gene sequence.
The purpose of the invention can be realized by the following technical scheme:
A method for rapidly searching for a given pattern in a gene sequence, comprising the steps of:
S1, counting the background distribution of the input sequence;
S2, converting the position frequency matrix of each given pattern into a PWM matrix, and calculating a threshold value K of the PWM matrix according to a set P value;
S3, calculating an expected vector of each PWM matrix according to the background distribution obtained in step S1;
S4, calculating the pre-calculation window position of each PWM matrix according to the preset size of the pre-calculation window used for pre-matching and the expected vector obtained in step S3;
S5, calculating the total arrangement scores of the pre-calculation window of each PWM matrix according to the pre-calculation window position obtained in step S4 and the size of the pre-calculation window;
S6, calculating a matching sequence vector of each PWM matrix according to the pre-calculation window position obtained in step S4 and the expected vector obtained in step S3;
S7, calculating the maximum score vector of each PWM matrix according to the matching sequence vector obtained in step S6;
S8, scanning the input sequence with a sliding window, and matching the input sequence against each PWM matrix according to the threshold value K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector to obtain a result vector of successful matches.
The background distribution includes the ratio of each character in the input sequence, the number of rows of the PWM matrix is equal to the number of classes of characters, and the number of columns is equal to the number of columns of the position frequency matrix of the given pattern.
The step S2 of calculating the threshold K of the PWM matrix according to the set P value specifically includes:
determining, according to the P value, the number y of sequences that are allowed to match the PWM matrix; generating every sequence whose length equals the number of columns of the PWM matrix by permuting and combining all kinds of characters of the PWM matrix; calculating the score of each sequence in the PWM matrix and sorting the scores in descending order; and taking the score of the y-th sequence as the threshold value K.
The expected vector of each PWM matrix in step S3 consists of an expected value $L_j$ for each column of the PWM matrix:

$$L_j = \max_{a \in A} M(j,a) - \sum_{a \in A} q_a\, M(j,a)$$

wherein $L_j$ represents the expected value of a matching failure in the j-th column of the PWM matrix, a represents a character, $M(j,a)$ represents the value corresponding to the character a in the j-th column of the PWM matrix, $q_a$ represents the frequency with which the character a appears in the input sequence, and A represents the set of all kinds of characters in the PWM matrix.
The step S4 specifically includes:
setting m to represent the number of columns of the PWM matrix and n to represent the size of the pre-calculation window; if m is less than or equal to n, directly setting the pre-calculation window position of the PWM matrix to 1; if m is larger than n, calculating, for each starting column k, the sum of the expected values of the n columns starting at column k of the PWM matrix:

$$\sum_{j=k}^{k+n-1} L_j, \qquad k = 1, \ldots, m-n+1$$

wherein $L_j$ represents the expected value of a matching failure in the j-th column of the PWM matrix; among the $m-n+1$ expected sums obtained, the starting column corresponding to the maximum expected sum is selected as the pre-calculation window position of the PWM matrix.
The step S5 specifically includes:
S51, when the number of columns of the PWM matrix is larger than the size of the pre-calculation window, setting the pre-calculation window position of the PWM matrix as the r-th column, and calculating the sum T of the column maxima of the columns that remain after the n columns starting at the r-th column are removed from the PWM matrix, wherein n represents the size of the pre-calculation window;
S52, arranging and combining all kinds of characters in the PWM matrix into sequences of length n to obtain $x^n$ different total arrangements, wherein x represents the number of kinds of characters;
S53, calculating the score S of each total arrangement in each PWM matrix:

$$S = \sum_{l=1}^{n} M(l+r-1,\, a_l)$$

wherein $a_l$ represents the l-th character of a given total arrangement, and $M(l+r-1, a_l)$ represents the value of the character $a_l$ in column $l+r-1$ of the PWM matrix;
S54, if S + T is larger than or equal to K, recording the score S of the total arrangement, the mark of the corresponding PWM matrix and a Boolean value of True; otherwise, recording a score of 0, the mark of the corresponding PWM matrix and a Boolean value of False;
S55, when the number of columns of the PWM matrix is smaller than the given size of the pre-calculation window, setting n to the number of columns of the PWM matrix, and calculating the total arrangement scores of the pre-calculation window of the PWM matrix according to steps S51 to S54.
The step S6 specifically includes:
removing, from the expected vector of the PWM matrix, the part that starts at the pre-calculation window position and whose width equals the size of the pre-calculation window; pairing the expected value of each remaining column with its column position in the PWM matrix; and sorting the column positions in descending order of their expected values to obtain the matching sequence vector of the PWM matrix.
The step S7 specifically includes:
accumulating, from back to front, the maximum value of the PWM matrix column corresponding to each column position in the matching sequence vector; the accumulated sums obtained for all column positions of the matching sequence vector form the maximum score vector of the PWM matrix.
The step S8 specifically includes:
scanning the input sequence with a sliding window whose length equals the size of the pre-calculation window, with a sliding step of 1 position; matching the subsequence in each sliding window against all total arrangements of each PWM matrix in turn; for each PWM matrix whose total arrangement has a Boolean value of True, further matching the remaining part of the PWM matrix according to the threshold value K, the total arrangement score of the pre-calculation window, the matching sequence vector and the maximum score vector; and recording the successfully matched sequence information and PWM matrix information to obtain a result vector of successful matches.
The step of further matching the remaining part of a PWM matrix whose total arrangement has a Boolean value of True, according to the threshold value K, the total arrangement score of the pre-calculation window, the matching sequence vector and the maximum score vector, specifically comprises the following steps:
letting rr be the starting position, in the input sequence, of the part that matched a total arrangement of the PWM matrix with a Boolean value of True; taking out the pre-calculation window position wp, the matching sequence vector Vmatch and the maximum score vector Smatch of the PWM matrix; and calculating, for each position i starting from the first position of the matching sequence vector, the difference $R_i$ between the score and the threshold K:

$$R_i = S + \sum_{g=1}^{i} M\big(Vmatch(g),\, Seq(rr + Vmatch(g) - wp)\big) + Smatch(i) - K$$

wherein S represents the score of the total arrangement in the PWM matrix; Vmatch(g) represents the value of the matching sequence vector at position g; Smatch(i) represents the value of the maximum score vector at position i, where 1 ≤ i ≤ p and p represents the length of the maximum score vector; Seq(rr + Vmatch(g) - wp) represents the character at position rr + Vmatch(g) - wp of the input sequence; and M(Vmatch(g), Seq(rr + Vmatch(g) - wp)) represents the value of the character Seq(rr + Vmatch(g) - wp) in column Vmatch(g) of the PWM matrix. If $R_i$ is negative at some position, matching of the remaining positions of this PWM matrix stops, and another PWM matrix whose total arrangement has a Boolean value of True is matched; otherwise, matching continues with the next position of the matching sequence vector until all positions in the matching sequence vector have been matched; if $R_i$ is still non-negative at that point, the PWM matrix is matched successfully.
Compared with the prior art, the invention first matches the input sequence against the pre-calculation window of each PWM matrix, and then, for the PWM matrices whose pre-calculation window matched, matches the characters at the remaining positions in an order determined by the per-column expected values. Matching can thus be stopped as soon as the threshold value K can no longer be reached, which avoids unnecessary calculation, accelerates sequence matching, reduces computational cost and improves efficiency.
Drawings
FIG. 1 is a flow chart of a method of the present invention for rapidly searching for a given pattern in a gene sequence;
FIG. 2 is a schematic diagram of the matching process of the method for rapidly searching a given pattern in a gene sequence according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
As shown in FIG. 1 and FIG. 2, a method for rapidly searching for a given pattern in a gene sequence comprises the steps of:
S1, counting the background distribution of the input sequence;
S2, converting the position frequency matrix of each given pattern into a PWM matrix (FIG. 2 shows N given patterns, which are converted into N PWM matrices), and calculating the threshold value K of each PWM matrix according to the set P value;
S3, calculating an expected vector of each PWM matrix according to the background distribution obtained in step S1;
S4, calculating the pre-calculation window position of each PWM matrix according to the preset size of the pre-calculation window used for pre-matching and the expected vector obtained in step S3;
S5, calculating the total arrangement scores of the pre-calculation window of each PWM matrix according to the pre-calculation window position obtained in step S4 and the size of the pre-calculation window;
S6, calculating a matching sequence vector of each PWM matrix according to the pre-calculation window position obtained in step S4 and the expected vector obtained in step S3;
S7, calculating the maximum score vector of each PWM matrix according to the matching sequence vector obtained in step S6;
S8, scanning the input sequence with a sliding window, and matching the input sequence against each PWM matrix according to the threshold value K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector to obtain a result vector of successful matches.
The characters of the sequence are A, a, C, c, G, g, T and t, which represent the bases; the character set can also be user-defined. A given pattern represents a sequence to be matched; it is supplied in the form of a position frequency matrix and must be converted into a PWM matrix. The analysis technique consists of finding, in the input sequence, the subsequences that conform to a given pattern, that is, whose score in the PWM matrix is at or above the threshold value K.
The background distribution comprises the proportion of each character in the input sequence. To support non-default characters, the characters that appear in the sequence can be recorded while the background distribution is counted, and the corresponding character set configured; that is, the character set to be analyzed is set for the input sequence. The number of rows of the PWM matrix is equal to the number of kinds of characters, and the number of columns is equal to the number of columns of the position frequency matrix of the given pattern.
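Step S1 can be illustrated with a minimal Python sketch. It assumes the input is a plain string and that the character set is simply whatever occurs in it; the function name `background_distribution` is illustrative and does not come from the patent.

```python
from collections import Counter

def background_distribution(sequence):
    """Return {character: relative frequency} for the characters seen in `sequence`."""
    counts = Counter(sequence)
    total = sum(counts.values())
    return {ch: count / total for ch, count in counts.items()}

# Example: 4 x A, 2 x C, 3 x G, 1 x T in a sequence of length 10.
q = background_distribution("ACGTACGGAA")
```

Because the alphabet is taken from the data, non-default characters are picked up without extra configuration, at the cost of treating upper-case and lower-case letters as distinct.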
In step S2, the PWM matrix is calculated from the position frequency matrix Π of the given pattern and the background distribution of the input sequence:

$$M(j,a) = \log\frac{\Pi(j,a)}{q_a}$$

wherein a represents a character in the sequence, $M(j,a)$ represents the value corresponding to the character a in the j-th column of the PWM matrix, $\Pi(j,a)$ represents the value of the character a in the j-th column of the position frequency matrix, and $q_a$ represents the frequency with which the character a appears in the input sequence.
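The conversion in step S2 can be sketched as follows. The sketch assumes a log-odds score with the natural logarithm, normalises each column of the position frequency matrix, and adds a small pseudocount so that zero entries do not produce an infinite score; the logarithm base, the pseudocount and the name `pfm_to_pwm` are assumptions of this example rather than details given in the text.

```python
import math

def pfm_to_pwm(pfm, q, pseudocount=1e-3):
    """Convert a position frequency matrix into log-odds scores.

    pfm: list of columns, each a dict {character: count or frequency}.
    q:   background distribution {character: frequency}.
    The pseudocount is an assumption of this sketch; it keeps log() finite.
    """
    pwm = []
    for column in pfm:
        total = sum(column.get(a, 0.0) + pseudocount for a in q)
        pwm.append({
            a: math.log(((column.get(a, 0.0) + pseudocount) / total) / q[a])
            for a in q
        })
    return pwm
```

A column whose frequencies equal the background yields scores close to zero, while characters that are over-represented in the pattern receive positive scores.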
Step S2 calculates the threshold K of each PWM matrix from the set P value: the number y of sequences that are allowed to match the PWM matrix is determined according to the P value; every sequence whose length equals the number of columns of the PWM matrix is generated by permuting and combining all kinds of characters of the PWM matrix; the score of each sequence in the PWM matrix is calculated and the scores are sorted in descending order; and the score of the y-th sequence is taken as the threshold value K. The score S' of each sequence is calculated as:

$$S' = \sum_{h=1}^{w} M(h,\, a_h)$$

wherein $a_h$ represents the character at the h-th position of a permutation and combination sequence, $M(h, a_h)$ represents the value of the character $a_h$ in the h-th column of the PWM matrix, and w represents the length (number of columns) of the PWM matrix.
The P value is the proportion of sequences selected as matches out of the total number of sequences that the PWM matrix can represent. The larger the P value, the lower the matching precision and the more sequences are matched. The pre-calculation window selects a small part of the PWM matrix that is difficult to match successfully as the pre-matching part, and the size of the pre-calculation window is the width of that part. The width should be neither too large nor too small; the recommended value is 9. If all the PWM matrices are narrow, the size of the pre-calculation window should be reduced accordingly.
The following illustrates the calculation of the threshold K of a PWM matrix. Suppose a PWM matrix has a width of 9 columns and contains 4 kinds of characters; the total number of different sequences of length 9 that it can represent is $4^9 = 262144$. With the P value set to 1.0e-5, $4^9 \times 1.0 \times 10^{-5} \approx 2.62$, which rounds down to 2, so 2 sequences are allowed to match. The scores of all sequences in the PWM matrix are calculated and sorted in descending order, and the score of the 2nd sequence is the threshold value K of the PWM matrix.
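The threshold calculation described above can be sketched in Python by exhaustive enumeration. The PWM matrix is represented here as a list of per-column dictionaries, and `threshold_k` is an illustrative name; clamping y to at least 1 is an assumption of the sketch for the case where the product rounds down to 0.

```python
from itertools import product

def threshold_k(pwm, alphabet, p_value):
    """Score every sequence of length len(pwm), sort descending, return the y-th score."""
    scores = sorted(
        (sum(col[a] for col, a in zip(pwm, seq))
         for seq in product(alphabet, repeat=len(pwm))),
        reverse=True,
    )
    y = int(len(scores) * p_value)  # round down, as in the worked example
    y = max(y, 1)                   # assumption: always allow at least one sequence
    return scores[y - 1]
```

Enumeration grows as the alphabet size raised to the number of columns, so this form is only practical for short matrices such as the 9-column example.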
The expected vector of each PWM matrix in step S3 consists of an expected value $L_j$ for each column of the PWM matrix:

$$L_j = \max_{a \in A} M(j,a) - \sum_{a \in A} q_a\, M(j,a)$$

wherein $L_j$ represents the expected value of a matching failure in the j-th column of the PWM matrix, a represents a character, $M(j,a)$ represents the value corresponding to the character a in the j-th column of the PWM matrix, $q_a$ represents the frequency with which the character a appears in the input sequence, and A represents the set of all kinds of characters in the PWM matrix.
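A minimal sketch of step S3 follows. It assumes that the expected value of a matching failure in a column is the column maximum minus the background-weighted mean score of that column, which is the "expected loss" used in lookahead-filtration PWM search; that reading, and the name `expected_vector`, are assumptions of this example.

```python
def expected_vector(pwm, q):
    """Return [L_1, ..., L_m]: column maximum minus the background-weighted mean score."""
    return [
        max(col.values()) - sum(q[a] * col[a] for a in col)
        for col in pwm
    ]
```

Under this reading a column with a large value is one in which a random character is expected to fall far short of the best score, so it discriminates strongly between matches and background.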
Step S4 specifically includes:
setting m to represent the number of columns of the PWM matrix and n to represent the size of the pre-calculation window; if m is less than or equal to n, directly setting the pre-calculation window position of the PWM matrix to 1; if m is larger than n, calculating, for each starting column k, the sum of the expected values of the n columns starting at column k of the PWM matrix:

$$\sum_{j=k}^{k+n-1} L_j, \qquad k = 1, \ldots, m-n+1$$

wherein $L_j$ represents the expected value of a matching failure in the j-th column of the PWM matrix; among the $m-n+1$ expected sums obtained, the starting column corresponding to the maximum expected sum is selected as the pre-calculation window position of the PWM matrix.
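The window selection of step S4 can be sketched as below, taking the expected vector as a plain list. Column positions are 1-based as in the text; breaking ties in favour of the leftmost window is an assumption of this sketch, and `window_position` is an illustrative name.

```python
def window_position(L, n):
    """Return the 1-based start column of the pre-calculation window."""
    m = len(L)
    if m <= n:
        return 1
    sums = [sum(L[k:k + n]) for k in range(m - n + 1)]
    return sums.index(max(sums)) + 1  # first maximum wins on ties (assumption)
```

For example, with L = [0.1, 0.9, 0.8, 0.2] and n = 2 the three window sums are largest for the window starting at column 2.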
Step S5 specifically includes:
S51, when the number of columns of the PWM matrix is larger than the size of the pre-calculation window, setting the pre-calculation window position of the PWM matrix as the r-th column, and calculating the sum T of the column maxima of the columns that remain after the n columns starting at the r-th column are removed from the PWM matrix, wherein n represents the size of the pre-calculation window;
S52, arranging and combining all kinds of characters in the PWM matrix into sequences of length n to obtain $x^n$ different total arrangements, wherein x represents the number of kinds of characters;
S53, calculating the score S of each total arrangement in each PWM matrix:

$$S = \sum_{l=1}^{n} M(l+r-1,\, a_l)$$

wherein $a_l$ represents the l-th character of a given total arrangement, and $M(l+r-1, a_l)$ represents the value of the character $a_l$ in column $l+r-1$ of the PWM matrix;
S54, if S + T is larger than or equal to K, recording the score S of the total arrangement, the mark of the corresponding PWM matrix and a Boolean value of True at the position corresponding to the total arrangement in the total arrangement score vector; otherwise, recording a score of 0, the mark of the corresponding PWM matrix and a Boolean value of False. The mark of a PWM matrix is its position in a pre-established large matrix that contains all the PWM matrices, so that the PWM matrix can be found through its mark; the lookup can be implemented, for example, through a hash transformation;
S55, when the number of columns of the PWM matrix is smaller than the given size of the pre-calculation window, setting n to the number of columns of the PWM matrix, and calculating the total arrangement scores of the pre-calculation window of the PWM matrix according to steps S51 to S54.
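Steps S51 to S55 can be sketched for a single PWM matrix as follows. The table is a Python dictionary keyed by the window word, which stands in for the hash-based lookup mentioned above; the mark of the PWM matrix is omitted because the sketch builds one table per matrix. The name `window_table` and the dictionary layout are assumptions of this example.

```python
from itertools import product

def window_table(pwm, alphabet, r, n, K):
    """Map each length-n word to (score, keep) for the window starting at 1-based column r."""
    n = min(n, len(pwm))                        # S55: short matrices use their own width
    window = pwm[r - 1:r - 1 + n]
    rest = pwm[:r - 1] + pwm[r - 1 + n:]
    T = sum(max(col.values()) for col in rest)  # S51: best possible score outside the window
    table = {}
    for word in product(alphabet, repeat=n):    # S52: all x**n total arrangements
        S = sum(col[a] for col, a in zip(window, word))  # S53
        keep = S + T >= K                       # S54: can this word still reach K?
        table["".join(word)] = (S if keep else 0, keep)
    return table
```

A word is kept only if its window score plus the best possible score of the remaining columns reaches K, so every word marked False can be rejected during the scan with a single lookup.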
Step S6 specifically includes:
removing, from the expected vector L of the PWM matrix, the part that starts at the pre-calculation window position and whose width equals the size of the pre-calculation window, to obtain a new expected vector $(L_1, L_2, \ldots, L_{r-1}, L_{r+n}, L_{r+n+1}, \ldots, L_m)$; pairing the elements of the new vector with their indices to obtain $m-n$ pairs $\langle L_z, z \rangle$, where z is the column position, in the PWM matrix, of an expected value that remains after the pre-calculation window part is removed; then sorting the $m-n$ pairs $\langle L_z, z \rangle$ in descending order of $L_z$ and taking out the corresponding $m-n$ index values z to form a vector Vmatch, which is the matching sequence vector of the PWM matrix.
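The construction of Vmatch in step S6 can be sketched as below, with 1-based column positions as in the text. The name `matching_order` is illustrative; columns with equal expected values keep their left-to-right order, which is a property of Python's stable sort and not something the text specifies.

```python
def matching_order(L, r, n):
    """Return the 1-based column positions outside the window, sorted by descending L."""
    m = len(L)
    outside = [z for z in range(1, m + 1) if not (r <= z < r + n)]
    return sorted(outside, key=lambda z: L[z - 1], reverse=True)
```

For L = [0.5, 9, 9, 0.1, 0.7] with the window at columns 2 and 3, the remaining columns 1, 4 and 5 are visited in the order 5, 1, 4.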
Step S7 specifically includes:
accumulating, from back to front, the maximum value of the PWM matrix column corresponding to each column position in the matching sequence vector; the accumulated sums obtained for all column positions of the matching sequence vector form the maximum score vector Smatch of the PWM matrix.
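Step S7 can be sketched as a suffix sum over the column maxima. The indexing convention is an assumption of this sketch: entry i (0-based) holds the most that the columns `vmatch[i:]` can still contribute, and a trailing 0 is appended so that the bound after the last column is also available. The name `max_score_vector` is illustrative.

```python
def max_score_vector(pwm, vmatch):
    """smatch[i] = sum of column maxima for vmatch[i:], built from back to front."""
    smatch = [0.0] * (len(vmatch) + 1)
    for i in range(len(vmatch) - 1, -1, -1):
        smatch[i] = smatch[i + 1] + max(pwm[vmatch[i] - 1].values())
    return smatch
```

With this convention, the best total still achievable after scoring the first i columns of the matching order is the running score plus `smatch[i]`.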
Step S8 specifically includes:
scanning the input sequence with a sliding window whose width equals the size n of the pre-calculation window and whose sliding step is 1 position; the subsequence in each sliding window is matched against all total arrangements of each PWM matrix in turn, and each PWM matrix whose total arrangement has a Boolean value of True is further matched on its remaining part, specifically as follows:
letting rr be the starting position, in the input sequence, of the part that matched a total arrangement of the PWM matrix with a Boolean value of True; taking out the pre-calculation window position wp, the matching sequence vector Vmatch and the maximum score vector Smatch of the PWM matrix; and calculating, for each position i starting from the first position of the matching sequence vector, the difference $R_i$ between the score and the threshold K:

$$R_i = S + \sum_{g=1}^{i} M\big(Vmatch(g),\, Seq(rr + Vmatch(g) - wp)\big) + Smatch(i) - K$$

wherein S represents the score of the total arrangement in the PWM matrix; Vmatch(g) represents the value of the matching sequence vector at position g; Smatch(i) represents the value of the maximum score vector at position i, where 1 ≤ i ≤ p and p represents the length of the maximum score vector; Seq(rr + Vmatch(g) - wp) represents the character at position rr + Vmatch(g) - wp of the input sequence; and M(Vmatch(g), Seq(rr + Vmatch(g) - wp)) represents the value of the character Seq(rr + Vmatch(g) - wp) in column Vmatch(g) of the PWM matrix. If $R_i$ is negative at some position of the matching sequence vector, matching of this PWM matrix has failed; the matching process for this PWM matrix stops, and another, not yet matched PWM matrix whose total arrangement has a Boolean value of True is matched. Otherwise, matching continues with the next position of the matching sequence vector until all positions in the matching sequence vector have been matched; if $R_i$ is still non-negative at that point, the PWM matrix is matched successfully, and the successfully matched sequence information and PWM matrix information are recorded to obtain the matching result vector. The result vector includes: the mark of the successfully matched PWM matrix; the starting position of the successfully matched sequence in the input sequence; the length of the successfully matched sequence, that is, the number of columns of the corresponding PWM matrix; and the matching score $R_i + K$.
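The scan of step S8 can be sketched end to end for a single PWM matrix. Several points are assumptions of this example rather than details fixed by the text: the matrix is a list of per-column dictionaries, the window lies fully inside the matrix (n is at most the number of columns), input positions are 0-based, the bound used after scoring a column covers only the columns not yet scored, and characters missing from the matrix are treated as a mismatch. The name `scan` is illustrative; handling several matrices would repeat the lookup per matrix.

```python
from itertools import product

def scan(seq, pwm, alphabet, K, wp, n, vmatch):
    """Return [(start, score)] for every alignment of `pwm` in `seq` scoring >= K.

    wp: 1-based window column, n: window width, vmatch: 1-based matching order
    of the columns outside the window. `start` is the 0-based alignment start.
    """
    m = len(pwm)
    window = pwm[wp - 1:wp - 1 + n]
    # remaining[i]: the most that columns vmatch[i:] can still add to the score.
    remaining = [0.0] * (len(vmatch) + 1)
    for i in range(len(vmatch) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + max(pwm[vmatch[i] - 1].values())
    # Pre-calculated window scores; words that cannot reach K are left out.
    table = {}
    for word in product(alphabet, repeat=n):
        s = sum(col[a] for col, a in zip(window, word))
        if s + remaining[0] >= K:
            table["".join(word)] = s
    hits = []
    # rr is the 0-based start of the sliding window in the input sequence.
    for rr in range(wp - 1, len(seq) - m + wp):
        s = table.get(seq[rr:rr + n])
        if s is None:
            continue
        score = s
        for i, z in enumerate(vmatch):
            score += pwm[z - 1].get(seq[rr + z - wp], float("-inf"))
            if score + remaining[i + 1] - K < 0:  # bound below K: stop early
                break
        else:
            hits.append((rr - wp + 1, score))
    return hits
```

Most windows are rejected by the single table lookup, and the remaining candidates are abandoned at the first column where the bound falls below K, which is where the speed-up over scoring every column of every alignment comes from.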
Claims (10)
1. A method for rapidly searching for a given pattern in a gene sequence, comprising the steps of:
S1, counting the background distribution of the input sequence;
S2, converting the position frequency matrix of each given pattern into a PWM matrix, and calculating a threshold value K of the PWM matrix according to a set P value, wherein the P value represents the proportion of the selected range in the total number of sequences that the PWM matrix can represent;
S3, calculating an expected vector of each PWM matrix according to the background distribution obtained in step S1;
S4, calculating the pre-calculation window position of each PWM matrix according to the preset size of the pre-calculation window used for pre-matching and the expected vector obtained in step S3;
S5, calculating the total arrangement scores of the pre-calculation window of each PWM matrix according to the pre-calculation window position obtained in step S4 and the size of the pre-calculation window;
S6, calculating a matching sequence vector of each PWM matrix according to the pre-calculation window position obtained in step S4 and the expected vector obtained in step S3;
S7, calculating the maximum score vector of each PWM matrix according to the matching sequence vector obtained in step S6;
S8, scanning the input sequence with a sliding window, and matching the input sequence against each PWM matrix according to the threshold value K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector to obtain a result vector of successful matches.
2. The method of claim 1, wherein the background distribution comprises a ratio of characters in the input sequence, the number of rows of the PWM matrix is equal to the number of kinds of characters, and the number of columns is equal to the number of columns of the position frequency matrix of the given pattern.
3. The method according to claim 1, wherein the step S2 of calculating the threshold K of the PWM matrix according to the set P value specifically comprises:
determining, according to the P value, the number y of sequences that are allowed to match the PWM matrix; generating every sequence whose length equals the number of columns of the PWM matrix by permuting and combining all kinds of characters of the PWM matrix; calculating the score of each sequence in the PWM matrix and sorting the scores in descending order; and taking the score of the y-th sequence as the threshold value K.
4. The method of claim 1, wherein the expected vector of each PWM matrix in step S3 consists of an expected value $L_j$ for each column of the PWM matrix:

$$L_j = \max_{a \in A} M(j,a) - \sum_{a \in A} q_a\, M(j,a)$$

wherein $L_j$ represents the expected value of a matching failure in the j-th column of the PWM matrix, a represents a character, $M(j,a)$ represents the value corresponding to the character a in the j-th column of the PWM matrix, $q_a$ represents the frequency with which the character a appears in the input sequence, and A represents the set of all kinds of characters in the PWM matrix.
5. The method according to claim 4, wherein the step S4 specifically comprises:
setting m to represent the number of columns of the PWM matrix and n to represent the size of the pre-calculation window; if m is less than or equal to n, directly setting the pre-calculation window position of the PWM matrix to 1; if m is larger than n, calculating, for each starting column k, the sum of the expected values of the n columns starting at column k of the PWM matrix:

$$\sum_{j=k}^{k+n-1} L_j, \qquad k = 1, \ldots, m-n+1$$

wherein $L_j$ represents the expected value of a matching failure in the j-th column of the PWM matrix; among the $m-n+1$ expected sums obtained, the starting column corresponding to the maximum expected sum is selected as the pre-calculation window position of the PWM matrix.
6. The method according to claim 5, wherein the step S5 specifically comprises:
S51, when the number of columns of the PWM matrix is greater than the size of the pre-calculation window, letting the position of the pre-calculation window of the PWM matrix be the r-th column, and calculating the sum T of the column maxima of the columns that remain after the n columns starting at the r-th column are removed from the PWM matrix, wherein n denotes the size of the pre-calculation window;
S52, arranging all character types of the PWM matrix into sequences of length n to obtain $x^n$ different total arrangements, wherein x denotes the number of character types;
S53, calculating the score S of each total arrangement in each PWM matrix:
$$S = \sum_{l=1}^{n} M(l + r - 1, a_l)$$
wherein $a_l$ denotes the l-th character of a given total arrangement, and $M(l + r - 1, a_l)$ denotes the value for character $a_l$ in column l + r − 1 of the PWM matrix;
S54, if S + T is greater than or equal to K, recording the score S of the total arrangement, the identifier of the corresponding PWM matrix, and the Boolean value True; otherwise, recording a score of 0, the identifier of the corresponding PWM matrix, and the Boolean value False;
S55, when the number of columns of the PWM matrix is smaller than the given size of the pre-calculation window, taking n to be the number of columns of the PWM matrix and calculating the total arrangement scores of the pre-calculation window of the PWM matrix according to steps S51 to S54.
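Steps S51 to S55 can be sketched for a single PWM matrix in Python as below. The function name `precompute_window`, the list-of-dicts PWM layout, and the dictionary keyed by window string are illustrative assumptions. The PWM identifier recorded in S54 is omitted because only one matrix is handled here.

```python
from itertools import product


def precompute_window(pwm, alphabet, r, n, K):
    """Map every length-n word to (score, flag) for the window starting at column r.

    pwm[j][a] is the score of character a in column j (0-based);
    r is the 1-based window start column; K is the threshold.
    """
    m = len(pwm)
    n = min(n, m)  # S55: a short matrix shrinks the window to its own width
    start = r - 1
    # S51: T is the sum of column maxima outside the window.
    T = sum(
        max(col.values())
        for j, col in enumerate(pwm)
        if not (start <= j < start + n)
    )
    table = {}
    for word in product(alphabet, repeat=n):  # S52: all x**n arrangements
        S = sum(pwm[start + l][a] for l, a in enumerate(word))  # S53
        # S54: keep the score only if the threshold is still reachable.
        table["".join(word)] = (S, True) if S + T >= K else (0, False)
    return table
```

Keying the table by the window string lets the later scan look up each sliding-window subsequence in constant time.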
7. The method according to claim 4, wherein the step S6 specifically comprises:
in the expected vector of the PWM matrix, removing the part that starts at the position of the pre-calculation window and whose width equals the size of the pre-calculation window; pairing the expected value of each remaining column with that column's position in the PWM matrix; and sorting the column positions in descending order of their expected values to form the matching sequence vector of the PWM matrix.
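A short Python sketch of this ordering step follows. The function name `matching_order` is hypothetical, column positions are 1-based as in the claims, and keeping the original left-to-right order among columns with equal expected values is an assumption.

```python
def matching_order(expected, wp, n):
    """Return the 1-based columns outside the window, by descending expected value.

    expected: per-column expected values (expected[0] belongs to column 1).
    wp: 1-based start column of the pre-calculation window; n: window size.
    """
    m = len(expected)
    window = range(wp, wp + n)
    rest = [j for j in range(1, m + 1) if j not in window]
    # sorted() is stable, so equal expected values keep their column order.
    return sorted(rest, key=lambda j: expected[j - 1], reverse=True)
```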
8. The method according to claim 7, wherein the step S7 specifically comprises:
accumulating, in order from back to front, the maximum value of the PWM matrix column corresponding to each column position in the matching sequence vector, the accumulated sums obtained for all column positions of the matching sequence vector forming the maximum score vector of the PWM matrix.
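The back-to-front accumulation can be sketched in Python as a suffix sum over column maxima. The function name `max_score_vector` is hypothetical, and the sketch reads the claim literally, so entry i includes the maximum of the column at position i itself.

```python
def max_score_vector(pwm, order):
    """Suffix sums of column maxima along the matching order.

    pwm[j][a] is the score of character a in column j (0-based);
    order holds 1-based column positions (the matching sequence vector).
    """
    out = [0] * len(order)
    running = 0
    for i in range(len(order) - 1, -1, -1):  # back to front
        running += max(pwm[order[i] - 1].values())
        out[i] = running
    return out
```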
9. The method according to claim 6, wherein the step S8 specifically comprises:
scanning the input sequence with a sliding window whose length equals the size of the pre-calculation window and whose step is 1 position; matching each sliding-window subsequence in turn against all total arrangements of each PWM matrix; for each PWM matrix whose total arrangement has the Boolean value True, further matching the remaining part of that PWM matrix according to the threshold K, the total arrangement score of the pre-calculation window, the matching sequence vector and the maximum score vector; and recording the sequence information and the PWM matrix information of each successful match to obtain a result vector of successful matches.
10. The method of claim 9, wherein further matching the remaining part of a PWM matrix whose total arrangement has the Boolean value True, according to the threshold K, the total arrangement score of the pre-calculation window, the matching sequence vector and the maximum score vector, comprises:
letting the rr-th position be the starting position of the part of the input sequence that matches a total arrangement of the PWM matrix with the Boolean value True; retrieving the pre-calculation window position wp, the matching sequence vector Vmatch and the maximum score vector Smatch of the PWM matrix; and calculating, for each position i starting from the first position of the matching sequence vector, the difference $R_i$ between the score and the threshold K:
wherein S denotes the score of the total arrangement in the PWM matrix; Vmatch(g) denotes the value of the matching sequence vector at position g; Smatch(i) denotes the value of the maximum score vector at position i, with 1 ≤ i ≤ p, where p is the length of the maximum score vector; Seq(rr + Vmatch(g) − wp) denotes the character at position rr + Vmatch(g) − wp of the input sequence; and M(Vmatch(g), Seq(rr + Vmatch(g) − wp)) denotes the value, in column Vmatch(g) of the PWM matrix, for the character Seq(rr + Vmatch(g) − wp). If $R_i$ is negative at some position, matching of the remaining positions of this PWM matrix stops and another PWM matrix whose total arrangement has the Boolean value True is matched; otherwise, matching continues with the next position of the matching sequence vector until all positions in the matching sequence vector have been matched, and if $R_i$ is still non-negative at that point, the PWM matrix is matched successfully.
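The scan of claims 9 and 10 can be sketched for a single PWM matrix in Python as below. The formula for the difference between the running score and K is not reproduced in this text, so the sketch assumes it equals the window score, plus the scores of the columns matched so far, plus the best score still obtainable from the columns not yet matched, minus K. The function name `scan`, the list-of-dicts PWM layout, and the requirement that the sequence contains only alphabet characters are further assumptions.

```python
from itertools import product


def scan(seq, pwm, alphabet, K, wp, n, order):
    """Return 1-based start positions where the whole PWM scores at least K.

    wp: 1-based window start column; n: window size;
    order: 1-based columns outside the window (matching sequence vector).
    """
    m = len(pwm)
    col_max = [max(col.values()) for col in pwm]
    # remaining[i]: best score still obtainable after i columns of `order`.
    remaining = [0] * (len(order) + 1)
    for i in range(len(order) - 1, -1, -1):
        remaining[i] = remaining[i + 1] + col_max[order[i] - 1]
    # Window table: keep only words that can still reach K.
    table = {}
    for word in product(alphabet, repeat=n):
        S = sum(pwm[wp - 1 + l][a] for l, a in enumerate(word))
        if S + remaining[0] >= K:
            table["".join(word)] = S
    hits = []
    for rr in range(1, len(seq) - n + 2):  # 1-based window start in seq
        S = table.get(seq[rr - 1:rr - 1 + n])
        if S is None:
            continue
        start = rr - wp + 1  # where column 1 of the PWM aligns
        if start < 1 or start + m - 1 > len(seq):
            continue
        score, ok = S, True
        for i, col in enumerate(order, start=1):
            score += pwm[col - 1][seq[rr + col - wp - 1]]
            if score + remaining[i] - K < 0:  # assumed form of the difference
                ok = False
                break
        if ok:
            hits.append(start)
    return hits
```

With several PWM matrices, the same loop would run once per matrix whose window entry is flagged True, and each hit would also record the matrix identifier.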
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710642956.5A CN107480469B (en) | 2017-07-31 | 2017-07-31 | Method for rapidly searching given pattern in gene sequence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480469A (en) | 2017-12-15 |
CN107480469B (en) | 2020-07-07 |
Family
ID=60598627
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710642956.5A Active CN107480469B (en) | 2017-07-31 | 2017-07-31 | Method for rapidly searching given pattern in gene sequence |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480469B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116168765B (en) * | 2023-04-25 | 2023-08-18 | 山东大学 | Gene sequence generation method and system based on improved strobemer |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20050106816A (en) * | 2004-05-06 | 2005-11-11 | 재단법인서울대학교산학협력재단 | A hybrid approach and a computer program to predict core-promoter region on human dna |
CN101278282A (en) * | 2004-12-23 | 2008-10-01 | 剑桥显示技术公司 | Digital signal processing methods and apparatus |
CN102760209A (en) * | 2012-05-17 | 2012-10-31 | 南京理工大学常熟研究院有限公司 | Transmembrane helix predicting method for nonparametric membrane protein |
CN104992079A (en) * | 2015-06-29 | 2015-10-21 | 南京理工大学 | Sampling learning based protein-ligand binding site prediction method |
-
2017
- 2017-07-31 CN CN201710642956.5A patent/CN107480469B/en active Active
Non-Patent Citations (1)
Title |
---|
A sparse model based detection of copy number variations from exome sequencing data; Junbo Duan et al.; IEEE Transactions on Biomedical Engineering; 20150805; vol. 63, no. 3; pp. 1-2 * |
Also Published As
Publication number | Publication date |
---|---|
CN107480469A (en) | 2017-12-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20190139624A1 (en) | Identifying ancestral relationships using a continuous stream of input | |
Tavakoli | Modeling genome data using bidirectional LSTM | |
CN115618140B (en) | Data processing system for acquiring link entity | |
Huo et al. | Optimizing genetic algorithm for motif discovery | |
CN105760706A (en) | Compression method for next generation sequencing data | |
Lasfar et al. | A method of data mining using Hidden Markov Models (HMMs) for protein secondary structure prediction | |
CN107480469B (en) | Method for rapidly searching given pattern in gene sequence | |
JPH10228460A (en) | Genetic algorithm executing device, executing method and its program recording medium | |
US20200040329A1 (en) | Systems and methods for predicting repair outcomes in genetic engineering | |
CN110517656B (en) | Lyric rhythm generation method, device, storage medium and apparatus | |
CN115795051A (en) | Data processing system for obtaining link entity based on entity relationship | |
CN109918659B (en) | Method for optimizing word vector based on unreserved optimal individual genetic algorithm | |
CN110070908B (en) | Motif searching method, device, equipment and storage medium of binomial tree model | |
CN110533186B (en) | Method, device, equipment and readable storage medium for evaluating crowdsourcing pricing system | |
De Clercq et al. | Deep learning for classification of DNA functional sequences | |
Yoon et al. | RNA secondary structure prediction using context-sensitive hidden Markov models | |
Ahmed et al. | Application of an Efficient Genetic Algorithm for Solving n×m Flow Shop Scheduling Problem Comparing it with Branch and Bound Algorithm and Tabu Search Algorithm | |
Polushina et al. | Change-point detection in binary Markov DNA sequences by the Cross-Entropy method | |
Junyan et al. | Sequence pattern mining based on markov chain | |
CN112825267B (en) | Method for determining a collection of small nucleic acid sequences and use thereof | |
CN109360602B (en) | DNA coding sequence design method and device based on fuzzy priority | |
Sonnenburg | Machine Learning for Genomic Sequence Analysis | |
Rafiuddin | Estimation of Phylogenetic Tree using Gene Sequencing Data | |
Jiang et al. | Scaling DEPP phylogenetic placement to ultra-large reference trees: a tree-aware ensemble approach | |
Zoghlami et al. | A structure based multiple instance learning approach for bacterial ionizing radiation resistance prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||