CN107480469B - Method for rapidly searching given pattern in gene sequence - Google Patents

Info

Publication number
CN107480469B
Authority
CN
China
Prior art keywords
pwm
matrix
pwm matrix
vector
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710642956.5A
Other languages
Chinese (zh)
Other versions
CN107480469A (en)
Inventor
黄德双
高良心
朱麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University
Priority to CN201710642956.5A
Publication of CN107480469A
Application granted
Publication of CN107480469B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Molecular Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for rapidly searching for given patterns in a gene sequence, comprising the following steps: counting the background distribution of the input sequence; converting the position frequency matrix of each given pattern into a PWM matrix and calculating a threshold K for each PWM matrix; calculating the expected vector of each PWM matrix; calculating the pre-calculation window position of each PWM matrix; calculating the total arrangement scores of the pre-calculation window of each PWM matrix; calculating the matching sequence vector of each PWM matrix; calculating the maximum score vector of each PWM matrix; and scanning the input sequence with a sliding window and matching it against each PWM matrix to obtain a result vector of successful matches. Compared with the prior art, when there are multiple patterns the method matches all of them in a single sequential pass over the input, without scanning and scoring the sequence separately for each pattern, so matching is fast.

Description

Method for rapidly searching given pattern in gene sequence
Technical Field
The present invention relates to bioinformatics analysis techniques of gene sequences, and more particularly, to a method for rapidly searching for a given pattern in a gene sequence.
Background
Genomes evolve through the accumulation of random mutations and selection for diverse functional requirements. In species with short generation times and large populations (e.g., bacteria), strong selection produces a highly optimized genome with densely packed genes and negligible non-functional sequence, which makes prokaryotic gene prediction relatively simple. In contrast, the genomes of species with longer generation times and smaller populations (e.g., vertebrates) accumulate large amounts of genetic material, most of which arises without selective pressure. More than half of the human genome consists of retrotransposon elements, DNA transposons and other kinds of repeats. Functional sequences and regulatory elements account for only a small portion of a vertebrate genome, which makes them difficult to identify. Most vertebrate genes are interrupted by introns, which are filled with material that is largely under weak selective constraint; long introns are particularly difficult to model. Identifying alternative splicing, like modeling non-coding transcripts, requires more complex algorithms. De novo vertebrate gene prediction is therefore a very challenging task in computational biology. In DNA and other biological sequences, functional sites such as transcription factor binding sites can be represented by a statistical model, the PWM matrix.
Current sequencing technology makes it possible to sequence a complete genome in a very short time. Next-generation sequencing technology is expected to sequence a human genome in less than a day for only $1000. Many other genomes will also be sequenced and assembled with high quality, including those of Drosophila, mouse, chicken, chimpanzee, dog, pig, cat, horse, cow and zebrafish. As databases of DNA and protein sequences grow exponentially, more and more weight matrices are being compiled (e.g., in the TRANSFAC, PRINTS, BLOCKS and JASPAR databases). Even though existing hardware is very fast, it remains very difficult to search such huge gene data sets for the sites described by these statistical models, typically transcription factor binding sites. In this context, a fast technique for matching weight matrices (PWMs) against sequences is of particular importance.
Disclosure of Invention
It is an object of the present invention to overcome the above-mentioned drawbacks of the prior art by providing a method for rapidly searching for a given pattern in a gene sequence.
The purpose of the invention can be realized by the following technical scheme:
a method for rapidly searching for a given pattern in a gene sequence, comprising the steps of:
S1, counting the background distribution of the input sequence;
S2, converting the position frequency matrix of each given pattern into a PWM matrix, and calculating a threshold K of the PWM matrix according to a set P value;
S3, calculating an expected vector of each PWM matrix according to the background distribution obtained in step S1;
S4, calculating the pre-calculation window position of each PWM matrix according to the preset size of the pre-calculation window used for pre-matching and the expected vector obtained in step S3;
S5, calculating the total arrangement scores of the pre-calculation window of each PWM matrix according to the pre-calculation window position obtained in step S4 and the size of the pre-calculation window;
S6, calculating a matching sequence vector of each PWM matrix according to the pre-calculation window position obtained in step S4 and the expected vector obtained in step S3;
S7, calculating the maximum score vector of each PWM matrix according to the matching sequence vector obtained in step S6;
and S8, scanning the input sequence with a sliding window, and matching the input sequence against each PWM matrix according to the threshold K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector, to obtain a result vector of successful matches.
The background distribution comprises the proportion of each character in the input sequence. The number of rows of the PWM matrix equals the number of character classes, and its number of columns equals the number of columns of the position frequency matrix of the given pattern.
In step S2, calculating the threshold K of the PWM matrix according to the set P value specifically comprises:
determining, from the P value, the number y of sequences that are allowed to match the PWM matrix; generating, from all character classes of the PWM matrix, every sequence whose length equals the length of the PWM matrix; calculating the score of each sequence under the PWM matrix and sorting the scores in descending order; and taking the score of the y-th sequence as the threshold K.
The expected vector of each PWM matrix in step S3 comprises an expected value L_j for each column of the PWM matrix:
[Formula image BDA0001366274290000031, defining L_j, is not reproduced in this text.]
where L_j is the expected value of a match failure in the j-th column of the PWM matrix, a denotes a character, M(j, a) denotes the value for character a in the j-th column of the PWM matrix, q_a denotes the frequency of character a in the input sequence, and A denotes the set of all character classes in the PWM matrix.
The step S4 specifically includes:
letting m denote the number of columns of the PWM matrix and n the size of the pre-calculation window: if m is less than or equal to n, the pre-calculation window position of the PWM matrix is set directly to 1; if m is greater than n, the sum of the expected values of the n columns starting at column k of the PWM matrix is calculated for each k from 1 to m-n+1:
$$\sum_{j=k}^{k+n-1} L_j$$
where L_j is the expected value of a match failure in the j-th column of the PWM matrix; among the m-n+1 expectation sums obtained, the starting column of the largest sum is selected as the pre-calculation window position of the PWM matrix.
The step S5 specifically includes:
S51, when the number of columns of the PWM matrix is greater than the size of the pre-calculation window, letting the pre-calculation window position of the PWM matrix be the r-th column, and calculating the sum T of the column maxima of the columns that remain after the n columns starting at the r-th column are removed from the PWM matrix, where n denotes the size of the pre-calculation window;
S52, arranging all character classes of the PWM matrix into sequences of length n to obtain x^n different total arrangements, where x denotes the number of character classes;
S53, calculating the score S of each total arrangement under each PWM matrix:
$$S = \sum_{l=1}^{n} M(l+r-1, a_l)$$
where a_l denotes the l-th character of a total arrangement and M(l+r-1, a_l) denotes the value for character a_l in column l+r-1 of the PWM matrix;
S54, if S + T is greater than or equal to K, recording the score S of the total arrangement, the identifier of the corresponding PWM matrix and the Boolean value True; otherwise recording the score 0, the identifier of the corresponding PWM matrix and the Boolean value False;
S55, when the number of columns of the PWM matrix is smaller than the given size of the pre-calculation window, taking n equal to the number of columns of the PWM matrix and calculating the total arrangement scores of the pre-calculation window of the PWM matrix according to steps S51 to S54.
The step S6 specifically includes:
in the expected vector of the PWM matrix, the part of width equal to the pre-calculation window size that starts at the pre-calculation window position is removed; the expected values of the remaining columns are paired one-to-one with their column positions in the PWM matrix; and the column positions, sorted in descending order of their expected values, form the matching sequence vector of the PWM matrix.
The step S7 specifically includes:
the column maxima of the PWM matrix columns given by the column positions in the matching sequence vector are accumulated in order from back to front, and the cumulative sums obtained at all column positions of the matching sequence vector form the maximum score vector of the PWM matrix.
The step S8 specifically includes:
scanning the input sequence with a sliding window whose length equals the size of the pre-calculation window and whose step is 1 position; matching the subsequence in each sliding window in turn against all total arrangements of each PWM matrix; for each PWM matrix whose total arrangement has the Boolean value True, further matching the remaining part of the PWM matrix according to the threshold K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector; and recording the information of each successfully matched sequence and PWM matrix to obtain the result vector of successful matches.
Further matching the remaining part of a PWM matrix whose total arrangement has the Boolean value True, according to the threshold K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector, specifically comprises:
letting the rr-th position be the start, in the input sequence, of the part that matches a total arrangement of the PWM matrix with the Boolean value True; retrieving the pre-calculation window position wp, the matching sequence vector Vmatch and the maximum score vector Smatch of the PWM matrix; and calculating, for each position i starting from the first position of the matching sequence vector, the difference R_i between the score and the threshold K:
[Formula image BDA0001366274290000041, defining R_i, is not reproduced in this text.]
where S denotes the score of the total arrangement under the PWM matrix; Vmatch(g) denotes the value of the matching sequence vector at position g; Smatch(i) denotes the value of the maximum score vector at position i, with 1 ≤ i ≤ p, where p is the length of the maximum score vector; Seq(rr + Vmatch(g) - wp) denotes the character at position rr + Vmatch(g) - wp of the input sequence; and M(Vmatch(g), Seq(rr + Vmatch(g) - wp)) denotes the value for character Seq(rr + Vmatch(g) - wp) in column Vmatch(g) of the PWM matrix. If R_i is negative at some position, matching of this PWM matrix stops and the next PWM matrix whose total arrangement has the Boolean value True is matched; otherwise matching continues with the next position of the matching sequence vector until all positions in it have been matched, and if R_i is still non-negative at that point, the PWM matrix has matched successfully.
Compared with the prior art, the invention first matches the input sequence against the pre-calculation window of each PWM matrix and then, for each PWM matrix whose pre-calculation window matched, matches the characters at the remaining positions in an order determined by the per-position expected values. Matching can therefore stop as soon as the threshold K can no longer be reached, which avoids unnecessary calculation, speeds up sequence matching, reduces computational cost and improves efficiency.
Drawings
FIG. 1 is a flow chart of a method of the present invention for rapidly searching for a given pattern in a gene sequence;
FIG. 2 is a schematic diagram of the matching process of the method for rapidly searching a given pattern in a gene sequence according to the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments. The present embodiment is implemented on the premise of the technical solution of the present invention, and a detailed implementation manner and a specific operation process are given, but the scope of the present invention is not limited to the following embodiments.
As shown in FIG. 1 and FIG. 2, a method for rapidly searching for a given pattern in a gene sequence comprises the following steps:
S1, counting the background distribution of the input sequence;
S2, converting the position frequency matrix of each given pattern into a PWM matrix (FIG. 2 shows N given patterns, which are converted into N PWM matrices), and calculating the threshold K of each PWM matrix according to the set P value;
S3, calculating an expected vector of each PWM matrix according to the background distribution obtained in step S1;
S4, calculating the pre-calculation window position of each PWM matrix according to the preset size of the pre-calculation window used for pre-matching and the expected vector obtained in step S3;
S5, calculating the total arrangement scores of the pre-calculation window of each PWM matrix according to the pre-calculation window position obtained in step S4 and the size of the pre-calculation window;
S6, calculating a matching sequence vector of each PWM matrix according to the pre-calculation window position obtained in step S4 and the expected vector obtained in step S3;
S7, calculating the maximum score vector of each PWM matrix according to the matching sequence vector obtained in step S6;
and S8, scanning the input sequence with a sliding window, and matching the input sequence against each PWM matrix according to the threshold K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector, to obtain a result vector of successful matches.
By default, the sequence consists of the characters A, a, C, c, G, g, T and t, which represent the bases; the character set may also be user-defined. A given pattern represents a sequence motif to be matched; it is given in the form of a position frequency matrix and needs to be converted into a PWM matrix. The analysis consists of finding, in the input sequence, the subsequences that conform to the given pattern, that is, whose score under the PWM matrix reaches the threshold K.
The background distribution comprises the proportion of each character in the input sequence. For sequences that contain non-default characters, the characters that appear in the sequence can be recorded while the background distribution is counted, and the corresponding character set configured; that is, the character set to be analyzed is set for the input sequence so that non-default characters are supported. The number of rows of the PWM matrix equals the number of character classes, and its number of columns equals the number of columns of the position frequency matrix of the given pattern.
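As a minimal illustration of counting the background distribution, the following Python sketch tallies character proportions. The function name `background_distribution` and the optional `alphabet` argument are hypothetical and do not come from the patent; counting every character that occurs is only one possible way to support non-default characters.

```python
from collections import Counter


def background_distribution(sequence, alphabet=None):
    """Return {character: proportion} for the input sequence.

    If `alphabet` is None, every character that occurs in the sequence
    is counted (an assumed way of handling non-default characters).
    """
    counts = Counter(sequence)
    if alphabet is None:
        alphabet = sorted(counts)
    total = sum(counts[a] for a in alphabet)
    if total == 0:
        raise ValueError("sequence contains no characters from the alphabet")
    return {a: counts[a] / total for a in alphabet}


q = background_distribution("ACGTACGGAA")
```

Here `q` maps each of the four characters found in the example string to its share of the ten positions.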
In step S2, the PWM matrix is calculated from the position frequency matrix Π of the given pattern and the background distribution of the input sequence:
[Formula image BDA0001366274290000061, defining M(j, a), is not reproduced in this text.]
where a denotes a character in the sequence, M(j, a) denotes the value for character a in the j-th column of the PWM matrix, Π(j, a) denotes the value for character a in the j-th column of the position frequency matrix, and q_a denotes the frequency of character a in the input sequence.
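The conversion formula appears only as an image and is not reproduced in the text. The sketch below therefore assumes the common log-odds form, M(j, a) = ln(f(j, a) / q_a), where f(j, a) is the column-normalized frequency with a background-weighted pseudocount. The pseudocount, the natural logarithm and the name `pfm_to_pwm` are assumptions for illustration and may differ from the patent's formula.

```python
import math


def pfm_to_pwm(pfm, background, pseudocount=1.0):
    """Convert a position frequency matrix to log-odds scores (assumed form).

    pfm:        {character: [count in column 1, count in column 2, ...]}
    background: {character: q_a}
    Returns     {character: [M(1, a), M(2, a), ...]}
    """
    alphabet = list(background)
    width = len(pfm[alphabet[0]])
    pwm = {a: [] for a in alphabet}
    for j in range(width):
        col_total = sum(pfm[a][j] for a in alphabet) + pseudocount
        for a in alphabet:
            # The pseudocount is split according to the background so that
            # a zero count never leads to log(0).
            freq = (pfm[a][j] + pseudocount * background[a]) / col_total
            pwm[a].append(math.log(freq / background[a]))
    return pwm
```

Under this assumed form, a column whose frequencies equal the background scores 0 for every character, and over-represented characters score above 0.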
Step S2 calculates the threshold K of the PWM matrix from the set P value: the number y of sequences that are allowed to match the PWM matrix is determined from the P value; every sequence whose length equals the length of the PWM matrix is generated from all character classes of the PWM matrix; the score of each sequence under the PWM matrix is calculated and the scores are sorted in descending order; and the score of the y-th sequence is taken as the threshold K. The score S' of each sequence is calculated as:
$$S' = \sum_{h=1}^{w} M(h, a_h)$$
where a_h denotes the character at the h-th position of a generated sequence, M(h, a_h) denotes the value for character a_h in the h-th column of the PWM matrix, and w denotes the length of the PWM matrix.
The P value is the proportion of sequences selected as matches out of all sequences that the PWM matrix can represent. A large P value means low matching precision and many matched sequences. The pre-calculation window selects a small part of the PWM matrix that is difficult to match successfully as the pre-matching part. The size of the pre-calculation window is the width of this pre-matching part; it should be neither too large nor too small, and the recommended value is 9. If all PWM matrices are narrow, the size of the pre-calculation window should be reduced accordingly.
The calculation of the threshold K of a PWM matrix is illustrated as follows. Suppose a PWM matrix is 9 columns wide and contains 4 characters; it can then represent 4^9 = 262144 different sequences of length 9. With the P value set to 1.0e-5, 4^9 × 1.0e-5 ≈ 2.62, which rounds down to 2, so 2 sequences are allowed to match. The scores of all sequences under the PWM matrix are calculated and sorted in descending order, and the score of the 2nd sequence is the threshold K of the PWM matrix.
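The threshold calculation can be sketched as a brute-force enumeration in Python. The function name `threshold_k`, the dictionary layout of the PWM, and the choice to use at least one sequence when the rounded count would be 0 are assumptions for illustration.

```python
import itertools
import math


def threshold_k(pwm, alphabet, p_value):
    """Score every sequence of PWM length and return the y-th best score.

    pwm: {character: [score in column 1, score in column 2, ...]}
    """
    width = len(pwm[alphabet[0]])
    y = math.floor(len(alphabet) ** width * p_value)
    y = max(y, 1)  # assumption: keep at least the best-scoring sequence
    scores = sorted(
        (
            sum(pwm[a][h] for h, a in enumerate(seq))
            for seq in itertools.product(alphabet, repeat=width)
        ),
        reverse=True,
    )
    return scores[y - 1]


# The count from the worked example: 4 characters, 9 columns, P = 1.0e-5.
y_example = math.floor(4 ** 9 * 1.0e-5)
```

Enumerating all x^w sequences is practical only for short matrices; the sketch is meant to mirror the description, not to be an efficient implementation.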
The expected vector of each PWM matrix in step S3 comprises an expected value L_j for each column of the PWM matrix:
[Formula image BDA0001366274290000071, defining L_j, is not reproduced in this text.]
where L_j is the expected value of a match failure in the j-th column of the PWM matrix, a denotes a character, M(j, a) denotes the value for character a in the j-th column of the PWM matrix, q_a denotes the frequency of character a in the input sequence, and A denotes the set of all character classes in the PWM matrix.
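Because the formula image for L_j is not reproduced in the text, the sketch below assumes that L_j is an expected loss: the column maximum minus the background-weighted mean score of the column. This reading uses only the symbols defined above (M(j, a), q_a and A) and fits the stated aim of finding columns that are hard to match, but it is an assumption, as is the name `expected_vector`.

```python
def expected_vector(pwm, background):
    """Return [L_1, ..., L_m] under the assumed expected-loss definition.

    Assumed: L_j = max_a M(j, a) - sum over a in A of q_a * M(j, a)
    pwm:        {character: [M(1, a), M(2, a), ...]}
    background: {character: q_a}
    """
    alphabet = list(background)
    width = len(pwm[alphabet[0]])
    return [
        max(pwm[a][j] for a in alphabet)
        - sum(background[a] * pwm[a][j] for a in alphabet)
        for j in range(width)
    ]
```

With this definition a column in which every character scores the same has L_j = 0, while a column dominated by one character has a large L_j.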
Step S4 specifically includes:
letting m denote the number of columns of the PWM matrix and n the size of the pre-calculation window: if m is less than or equal to n, the pre-calculation window position of the PWM matrix is set directly to 1; if m is greater than n, the sum of the expected values of the n columns starting at column k of the PWM matrix is calculated for each k from 1 to m-n+1:
$$\sum_{j=k}^{k+n-1} L_j$$
where L_j is the expected value of a match failure in the j-th column of the PWM matrix; among the m-n+1 expectation sums obtained, the starting column of the largest sum is selected as the pre-calculation window position of the PWM matrix.
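Step S4 can be sketched in a few lines of Python. The name `window_position` is hypothetical, and picking the first window when several sums tie is an assumption, since the text does not say how ties are resolved.

```python
def window_position(expected, n):
    """Return the 1-based start column of the pre-calculation window.

    expected: [L_1, ..., L_m]
    n:        size of the pre-calculation window
    """
    m = len(expected)
    if m <= n:
        return 1
    # One sum per candidate start column k = 1 .. m-n+1.
    sums = [sum(expected[k:k + n]) for k in range(m - n + 1)]
    return sums.index(max(sums)) + 1  # first maximum wins on ties (assumed)
```

For example, `window_position([0.1, 0.9, 0.8, 0.2], 2)` compares the three candidate windows and selects the one starting at column 2.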
Step S5 specifically includes:
S51, when the number of columns of the PWM matrix is greater than the size of the pre-calculation window, letting the pre-calculation window position of the PWM matrix be the r-th column, and calculating the sum T of the column maxima of the columns that remain after the n columns starting at the r-th column are removed from the PWM matrix, where n denotes the size of the pre-calculation window;
S52, arranging all character classes of the PWM matrix into sequences of length n to obtain x^n different total arrangements, where x denotes the number of character classes;
S53, calculating the score S of each total arrangement under each PWM matrix:
$$S = \sum_{l=1}^{n} M(l+r-1, a_l)$$
where a_l denotes the l-th character of a total arrangement and M(l+r-1, a_l) denotes the value for character a_l in column l+r-1 of the PWM matrix;
S54, if S + T is greater than or equal to K, recording the score S of the total arrangement, the identifier of the corresponding PWM matrix and the Boolean value True at the position corresponding to this total arrangement in the total arrangement score vector; otherwise recording the score 0, the identifier of the corresponding PWM matrix and the Boolean value False. The identifier of a PWM matrix is its position in a pre-established large matrix that contains all PWM matrices, so that the PWM matrix can be found through its identifier; the lookup can be implemented, for example, by hashing;
S55, when the number of columns of the PWM matrix is smaller than the given size of the pre-calculation window, taking n equal to the number of columns of the PWM matrix and calculating the total arrangement scores of the pre-calculation window of the PWM matrix according to steps S51 to S54.
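Steps S51 to S55 can be sketched for a single PWM matrix as follows. The name `window_table` and the use of a dictionary keyed by the window string are assumptions; the patent stores the entries, together with a matrix identifier, in a total arrangement score vector shared by all PWM matrices.

```python
import itertools


def window_table(pwm, alphabet, r, n, K):
    """Return {window string: (score, passes)} for one PWM matrix.

    pwm: {character: [M(1, a), ..., M(m, a)]}
    r:   1-based pre-calculation window position
    n:   pre-calculation window size
    K:   threshold of the PWM matrix
    """
    m = len(pwm[alphabet[0]])
    n = min(n, m)  # S55: a short matrix is covered entirely by the window
    start = r - 1
    outside = [j for j in range(m) if not start <= j < start + n]
    # S51: best score still obtainable from the columns outside the window.
    T = sum(max(pwm[a][j] for a in alphabet) for j in outside)
    table = {}
    for arrangement in itertools.product(alphabet, repeat=n):  # S52
        S = sum(pwm[a][start + l] for l, a in enumerate(arrangement))  # S53
        passes = S + T >= K  # S54
        table["".join(arrangement)] = (S if passes else 0, passes)
    return table
```

A window string marked False can never reach K even if every remaining column scores its maximum, so the scan can discard it with a single lookup.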
Step S6 specifically includes:
the part of width equal to the pre-calculation window size that starts at the pre-calculation window position is removed from the expected vector L of the PWM matrix to obtain a new expected vector (L_1, L_2, ..., L_{r-1}, L_{r+n}, L_{r+n+1}, ..., L_m). The elements of the new vector are paired with their subscripts to obtain m-n pairs <L_z, z>, where z is the column position of an expected value that remains after the pre-calculation window part is removed from the PWM matrix. The m-n pairs <L_z, z> are then sorted in descending order of L_z, and the m-n corresponding subscripts z are taken out to form the vector Vmatch, which is the matching sequence vector of the PWM matrix.
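The construction of Vmatch can be sketched as follows; the name `match_order` is hypothetical, and columns with equal expected values keep their left-to-right order because Python's sort is stable (an implementation detail the text does not specify).

```python
def match_order(expected, r, n):
    """Return Vmatch: 1-based columns outside the window, hardest first.

    expected: [L_1, ..., L_m]
    r:        1-based pre-calculation window position
    n:        pre-calculation window size
    """
    m = len(expected)
    remaining = [z for z in range(1, m + 1) if not r <= z < r + n]
    # Sort the m-n remaining columns by descending expected value L_z.
    return sorted(remaining, key=lambda z: expected[z - 1], reverse=True)
```

With `expected = [0.5, 0.9, 0.8, 0.1, 0.7]`, `r = 2` and `n = 2`, columns 2 and 3 belong to the window and the remaining columns 1, 4 and 5 are visited in the order 5, 1, 4.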
Step S7 specifically includes:
the column maxima of the PWM matrix columns given by the column positions in the matching sequence vector are accumulated in order from back to front, and the cumulative sums obtained at all column positions of the matching sequence vector form the maximum score vector Smatch of the PWM matrix.
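A minimal sketch of step S7 is given below. It assumes that each entry of Smatch includes the column maximum at its own position (an inclusive suffix sum); the text does not state this explicitly, and the name `max_score_vector` is hypothetical.

```python
def max_score_vector(pwm, alphabet, order):
    """Return Smatch: suffix sums of column maxima along the match order.

    pwm:   {character: [M(1, a), ..., M(m, a)]}
    order: Vmatch, a list of 1-based column positions
    """
    col_max = [max(pwm[a][z - 1] for a in alphabet) for z in order]
    smatch = [0.0] * len(order)
    running = 0.0
    # Accumulate from the last position of the match order to the first.
    for i in range(len(order) - 1, -1, -1):
        running += col_max[i]
        smatch[i] = running
    return smatch
```

Under this assumption, Smatch at a given position is the best score still obtainable from that position of the match order to its end.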
Step S8 specifically includes:
the input sequence is scanned with a sliding window whose width equals the size n of the pre-calculation window and whose step is 1 position; the subsequence in each sliding window is matched in turn against all total arrangements of each PWM matrix, and each PWM matrix whose total arrangement has the Boolean value True is further matched on its remaining part, specifically as follows:
letting the rr-th position be the start, in the input sequence, of the part that matches a total arrangement of the PWM matrix with the Boolean value True; retrieving the pre-calculation window position wp, the matching sequence vector Vmatch and the maximum score vector Smatch of the PWM matrix; and calculating, for each position i starting from the first position of the matching sequence vector, the difference R_i between the score and the threshold K:
[Formula image BDA0001366274290000081, defining R_i, is not reproduced in this text.]
where S denotes the score of the total arrangement under the PWM matrix; Vmatch(g) denotes the value of the matching sequence vector at position g; Smatch(i) denotes the value of the maximum score vector at position i, with 1 ≤ i ≤ p, where p is the length of the maximum score vector; Seq(rr + Vmatch(g) - wp) denotes the character at position rr + Vmatch(g) - wp of the input sequence; and M(Vmatch(g), Seq(rr + Vmatch(g) - wp)) denotes the value for character Seq(rr + Vmatch(g) - wp) in column Vmatch(g) of the PWM matrix. If R_i is negative at some position of the matching sequence vector, matching of this PWM matrix has failed; its matching process stops and another, not yet matched PWM matrix whose total arrangement has the Boolean value True is matched. Otherwise matching continues with the next position of the matching sequence vector until all positions in it have been matched; if R_i is still non-negative at that point, the PWM matrix has matched successfully, and the information of the matched sequence and PWM matrix is recorded to obtain the matching result vector. The result vector comprises: the identifier of the successfully matched PWM matrix, the start position of the matched sequence in the input sequence, the length of the matched sequence (that is, the number of columns of the corresponding PWM matrix), and the matching score R_i + K.
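The scan with early termination can be sketched for a single PWM matrix as follows. Since the formula image for R_i is not reproduced, the sketch assumes R_i = (window score S) + (scores of the first i columns of the match order) + (maximum still obtainable from the later columns) - K. The name `scan`, the 0-based start positions in the output, and the assumption that the matrix has at least n columns are all illustrative choices.

```python
import itertools


def scan(sequence, pwm, alphabet, K, r, n, order, smatch):
    """Return [(start, score)] for all sites of `sequence` scoring >= K.

    r:      1-based pre-calculation window position (wp)
    order:  Vmatch, 1-based columns outside the window
    smatch: inclusive suffix sums of column maxima along `order`
    """
    m = len(pwm[alphabet[0]])
    T = smatch[0] if smatch else 0
    # Pre-computed window scores; only windows that can still reach K are kept.
    table = {}
    for arr in itertools.product(alphabet, repeat=n):
        S = sum(pwm[a][r - 1 + l] for l, a in enumerate(arr))
        if S + T >= K:
            table["".join(arr)] = S
    hits = []
    for rr in range(len(sequence) - n + 1):  # 0-based window start
        S = table.get(sequence[rr:rr + n])
        if S is None:
            continue
        start = rr - (r - 1)  # sequence index aligned with column 1
        if start < 0 or start + m > len(sequence):
            continue
        score = S
        for i, z in enumerate(order):
            column = pwm.get(sequence[start + z - 1])
            if column is None:  # character outside the alphabet
                break
            score += column[z - 1]
            rest = smatch[i + 1] if i + 1 < len(smatch) else 0
            if score + rest - K < 0:  # assumed R_i is negative: give up early
                break
        else:
            hits.append((start, score))
    return hits
```

For several PWM matrices the same loop would run once per matrix whose table entry for the current window is True, which is where the shared total arrangement score vector saves work.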

Claims (10)

1. A method for rapidly searching for a given pattern in a gene sequence, comprising the steps of:
S1, counting the background distribution of the input sequence;
S2, converting the position frequency matrix of each given pattern into a PWM matrix, and calculating a threshold K of the PWM matrix according to a set P value, wherein the P value is the proportion of sequences selected as matches out of all sequences that the PWM matrix can represent;
S3, calculating an expected vector of each PWM matrix according to the background distribution obtained in step S1;
S4, calculating the pre-calculation window position of each PWM matrix according to the preset size of the pre-calculation window used for pre-matching and the expected vector obtained in step S3;
S5, calculating the total arrangement scores of the pre-calculation window of each PWM matrix according to the pre-calculation window position obtained in step S4 and the size of the pre-calculation window;
S6, calculating a matching sequence vector of each PWM matrix according to the pre-calculation window position obtained in step S4 and the expected vector obtained in step S3;
S7, calculating the maximum score vector of each PWM matrix according to the matching sequence vector obtained in step S6;
and S8, scanning the input sequence with a sliding window, and matching the input sequence against each PWM matrix according to the threshold K, the total arrangement scores of the pre-calculation window, the matching sequence vector and the maximum score vector, to obtain a result vector of successful matches.
2. The method of claim 1, wherein the background distribution comprises a ratio of characters in the input sequence, the number of rows of the PWM matrix is equal to the number of kinds of characters, and the number of columns is equal to the number of columns of the position frequency matrix of the given pattern.
3. The method according to claim 1, wherein the step S2 of calculating the threshold K of the PWM matrix according to the set P value specifically comprises:
determining, from the P value, the number y of sequences that are allowed to match the PWM matrix; generating, from all character classes of the PWM matrix, every sequence whose length equals the length of the PWM matrix; calculating the score of each sequence under the PWM matrix and sorting the scores in descending order; and taking the score of the y-th sequence as the threshold K.
4. The method of claim 1, wherein the expected vector of each PWM matrix in step S3 comprises an expected value L_j of each column of the PWM matrix:
Figure FDA0002415962850000021
wherein L_j denotes the expected value of a matching failure at the j-th column of the PWM matrix, a denotes a character, M(j, a) denotes the value corresponding to character a in the j-th column of the PWM matrix, q_a denotes the frequency with which character a appears in the input sequence, and A denotes the set of all kinds of characters in the PWM matrix.
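The formula for L_j is contained in the figure and is not reproduced in this text. The sketch below therefore rests on an assumption: it takes L_j to be the column maximum minus the background-weighted mean score of the column, a definition commonly used in lookahead-style PWM scanners. The function name and data layout are hypothetical.

```python
def expected_loss_vector(pwm, background):
    """Return one value per PWM column (0-based).

    Assumption: the value for column j is its maximum score minus the
    background-weighted mean score, i.e. max_a M(j, a) - sum_a q_a * M(j, a).
    The patent's own formula may differ.
    """
    return [
        max(col.values()) - sum(background[a] * col[a] for a in col)
        for col in pwm
    ]
```

Under this assumption a large value marks a column where a random character is likely to lose a lot of score, which is why later steps examine such columns first.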
5. The method according to claim 4, wherein the step S4 specifically comprises:
letting m denote the number of columns of the PWM matrix and n denote the size of the pre-calculation window; if m is less than or equal to n, directly setting the pre-calculation window position of the PWM matrix to 1; if m is greater than n, calculating, for each starting column k of the PWM matrix, the sum of the expected values of the n columns beginning at column k:
$$\sum_{j=k}^{k+n-1} L_j, \quad 1 \le k \le m-n+1$$
wherein L_j denotes the expected value of a matching failure at the j-th column of the PWM matrix; and, among the m-n+1 expected sums thus obtained, selecting the starting column position corresponding to the largest expected sum as the pre-calculation window position of the PWM matrix.
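The window selection of claim 5 can be sketched as below. The function name is hypothetical, and the tie-breaking rule (first maximum wins) is an assumption, as the claim does not say how equal sums are resolved.

```python
def window_position(expected, n):
    """Return the 1-based start column of the pre-calculation window.

    expected is the expected vector (one value per PWM column) and n is the
    pre-calculation window size.
    """
    m = len(expected)
    if m <= n:
        return 1
    # One sum per possible start column: m - n + 1 sums in total.
    sums = [sum(expected[k:k + n]) for k in range(m - n + 1)]
    # max() keeps the first index on ties (an assumption, see lead-in).
    return max(range(len(sums)), key=sums.__getitem__) + 1
```

For example, with expected values [0.5, 2.0, 1.5, 0.25] and n = 2, the sums are 2.5, 3.5 and 1.75, so the window starts at column 2.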
6. The method according to claim 5, wherein the step S5 specifically comprises:
S51, when the number of columns of the PWM matrix is greater than the size of the pre-calculation window, letting the pre-calculation window position of the PWM matrix be the r-th column, and calculating the sum T of the maximum values of the columns that remain after the n columns starting at the r-th column are removed from the PWM matrix, wherein n denotes the size of the pre-calculation window;
S52, arranging all kinds of characters in the PWM matrix into sequences of length n to obtain x^n different total arrangements, wherein x denotes the number of kinds of characters;
S53, calculating the score S of each total arrangement in each PWM matrix:
$$S = \sum_{l=1}^{n} M(l+r-1, a_l)$$
wherein a_l denotes the l-th character of a given total arrangement, and M(l+r-1, a_l) denotes the value of character a_l in column l+r-1 of the PWM matrix;
S54, if S + T is greater than or equal to K, recording the score S of the total arrangement, the identifier of the corresponding PWM matrix and the Boolean value True; otherwise, recording a score of 0, the identifier of the corresponding PWM matrix and the Boolean value False;
S55, when the number of columns of the PWM matrix is smaller than the given pre-calculation window size, setting n to the number of columns of the PWM matrix, and calculating the pre-calculation window total arrangement scores of the PWM matrix according to steps S51-S54.
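Steps S51 to S55 amount to building a lookup table over all length-n strings. The sketch below shows one way to do that for a single PWM; the function name and the dictionary layout are hypothetical, and the PWM identifier that the claim also records is left to the caller.

```python
from itertools import product

def precompute_window(pwm, alphabet, r, n, K):
    """Map each length-n string to (score, flag) for one PWM.

    r is the 1-based window position; pwm[j][a] uses 0-based column j.
    """
    m = len(pwm)
    n = min(n, m)  # S55: a PWM shorter than the window is scored whole
    start = r - 1
    outside = [j for j in range(m) if not start <= j < start + n]
    # S51: T is the best score obtainable from the columns outside the window.
    T = sum(max(pwm[j].values()) for j in outside)
    table = {}
    for arrangement in product(alphabet, repeat=n):  # S52: x**n arrangements
        # S53: score of this arrangement inside the window.
        S = sum(pwm[start + l][a] for l, a in enumerate(arrangement))
        # S54: keep the score only if the threshold is still reachable.
        table["".join(arrangement)] = (S, True) if S + T >= K else (0, False)
    return table
```

Keying the table by the window string lets the scanning step test a window with a single dictionary lookup.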
7. The method according to claim 4, wherein the step S6 specifically comprises:
in the expected vector of the PWM matrix, removing the segment that starts at the pre-calculation window position and whose width equals the pre-calculation window size; pairing the expected values of the remaining columns one-to-one with their column positions in the PWM matrix; and sorting the column positions in descending order of the corresponding expected values to form the matching sequence vector of the PWM matrix.
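The ordering described in claim 7 can be sketched as follows. The function name is hypothetical, and the handling of equal expected values (left-to-right order is kept) is an assumption.

```python
def matching_order(expected, r, n):
    """Return 1-based column positions outside the window, largest expected value first.

    expected is the expected vector, r the 1-based window position and n the
    window size.
    """
    m = len(expected)
    window = range(r, r + n)
    rest = [j for j in range(1, m + 1) if j not in window]
    # sorted() is stable, so columns with equal values keep their original order.
    return sorted(rest, key=lambda j: expected[j - 1], reverse=True)
```

With expected values [0.5, 2.0, 1.5, 0.25, 1.0], r = 2 and n = 2, columns 2 and 3 are removed and the remaining columns are visited in the order 5, 1, 4.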
8. The method according to claim 7, wherein the step S7 specifically comprises:
accumulating, from back to front, the column maxima of the PWM matrix corresponding to each column position in the matching sequence vector, the accumulated sums obtained for all column positions of the matching sequence vector forming the maximum score vector of the PWM matrix.
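Accumulating from back to front is a suffix sum. The sketch below assumes the sums are inclusive, so that entry i covers positions i through p of the matching sequence vector; the claim does not state whether position i itself is included. The function name is hypothetical.

```python
def max_score_vector(pwm, order):
    """Return suffix sums of column maxima along the matching sequence vector.

    order holds 1-based column positions; pwm[j][a] uses 0-based column j.
    Assumption: entry i includes the maximum of the i-th column in order.
    """
    maxima = [max(pwm[j - 1].values()) for j in order]
    suffix = []
    running = 0
    for value in reversed(maxima):  # accumulate from back to front
        running += value
        suffix.append(running)
    return suffix[::-1]
```

Each entry is an upper bound on the score still obtainable from the columns not yet examined, which is what makes early termination possible during matching.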
9. The method according to claim 6, wherein the step S8 specifically comprises:
scanning the input sequence with a sliding window whose length equals the size of the pre-calculation window, with a sliding step of 1 position; matching each sliding-window subsequence in turn against all total arrangements of each PWM matrix; for each PWM matrix whose total arrangement has the Boolean value True, further matching the remaining part of the PWM matrix according to the threshold K, the pre-calculation window total arrangement scores, the matching sequence vector and the maximum score vector; and recording the sequence information and the PWM matrix information of each successful match to obtain the result vector of successful matches.
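The first stage of the scan in claim 9 can be sketched as a table lookup per window. This is an illustration under assumptions: the function name, the `tables` mapping from a hypothetical PWM identifier to a precomputed {string: (score, flag)} dictionary, and the use of one common window size n for all matrices are not taken from the patent.

```python
def candidate_hits(sequence, tables, n):
    """Yield (window_start, pwm_id, window_score) for windows flagged True.

    tables maps a hypothetical pwm_id to that PWM's {string: (score, flag)}
    table; window_start is a 0-based index into sequence.
    """
    for start in range(len(sequence) - n + 1):  # sliding step of 1
        window = sequence[start:start + n]
        for pwm_id, table in tables.items():
            # Windows with unknown characters are treated as non-matching.
            score, flag = table.get(window, (0, False))
            if flag:
                yield start, pwm_id, score
```

Only the candidates yielded here need the more expensive column-by-column check against the rest of the matrix.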
10. The method of claim 9, wherein further matching the remaining part of a PWM matrix whose total arrangement has the Boolean value True, according to the threshold K, the pre-calculation window total arrangement scores, the matching sequence vector and the maximum score vector, specifically comprises:
letting the starting position of the part of the input sequence that matches a total arrangement of the PWM matrix with the Boolean value True be the rr-th position; retrieving the pre-calculation window position wp, the matching sequence vector Vmatch and the maximum score vector Smatch of the PWM matrix; and calculating, position by position starting from the first position of the matching sequence vector, the difference R_i between the score and the threshold K:
Figure FDA0002415962850000031
wherein S denotes the score of the total arrangement in the PWM matrix, Vmatch(g) denotes the value of the matching sequence vector at position g, Smatch(i) denotes the value of the maximum score vector at position i, with 1 ≤ i ≤ p, where p denotes the length of the maximum score vector, Seq(rr + Vmatch(g) - wp) denotes the character at position rr + Vmatch(g) - wp of the input sequence, and M(Vmatch(g), Seq(rr + Vmatch(g) - wp)) denotes the value of character Seq(rr + Vmatch(g) - wp) in column Vmatch(g) of the PWM matrix; if R_i is negative at some position, stopping the matching of the remaining positions of this PWM matrix and proceeding to match another PWM matrix whose total arrangement has the Boolean value True; otherwise, continuing to match the next position of the matching sequence vector of the PWM matrix until all positions in the matching sequence vector have been matched, and if R_i is still not negative at that point, the PWM matrix is successfully matched.
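The early-termination check of claim 10 can be sketched as below. The exact expression for R_i is in the figure and is not reproduced in this text, so the sketch assumes R_i is the window score S, plus the scores of the columns matched so far, plus the best possible score of the columns not yet matched, minus K. The function name and the boundary handling are hypothetical.

```python
def refine_match(sequence, pwm, rr, wp, S, order, K):
    """Return True if the candidate at window start rr survives all columns.

    rr is a 0-based index into sequence; wp and order use 1-based PWM columns.
    Assumption: R_i = S + matched scores + best remaining score - K.
    """
    maxima = [max(pwm[c - 1].values()) for c in order]
    remaining = sum(maxima)  # best score still reachable outside the window
    score = S                # score of the pre-calculation window
    for c, best in zip(order, maxima):
        pos = rr + c - wp    # input position aligned with PWM column c
        if not 0 <= pos < len(sequence):
            return False     # the PWM would overhang the sequence
        score += pwm[c - 1][sequence[pos]]
        remaining -= best
        if score + remaining - K < 0:  # R_i negative: stop early
            return False
    return score >= K
```

Because columns are visited in descending order of expected loss, a non-matching site tends to be rejected after only a few columns.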

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710642956.5A CN107480469B (en) 2017-07-31 2017-07-31 Method for rapidly searching given pattern in gene sequence

Publications (2)

Publication Number Publication Date
CN107480469A (en) 2017-12-15
CN107480469B (en) 2020-07-07

Family

ID=60598627

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710642956.5A Active CN107480469B (en) 2017-07-31 2017-07-31 Method for rapidly searching given pattern in gene sequence

Country Status (1)

Country Link
CN (1) CN107480469B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116168765B (en) * 2023-04-25 2023-08-18 山东大学 Gene sequence generation method and system based on improved stroboemer

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20050106816A (en) * 2004-05-06 2005-11-11 재단법인서울대학교산학협력재단 A hybrid approach and a computer program to predict core-promoter region on human dna
CN101278282A (en) * 2004-12-23 2008-10-01 剑桥显示技术公司 Digital signal processing methods and apparatus
CN102760209A (en) * 2012-05-17 2012-10-31 南京理工大学常熟研究院有限公司 Transmembrane helix predicting method for nonparametric membrane protein
CN104992079A (en) * 2015-06-29 2015-10-21 南京理工大学 Sampling learning based protein-ligand binding site prediction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A sparse model based detection of copy number variations from exome sequencing data; Junbo Duan et al.; IEEE Transactions on Biomedical Engineering; 2015-08-05; Vol. 63, No. 3; pp. 1-2 *

Also Published As

Publication number Publication date
CN107480469A (en) 2017-12-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant