CN110059228B - DNA data set implantation motif searching method and device and storage medium thereof - Google Patents

DNA data set implantation motif searching method and device and storage medium thereof

Info

Publication number
CN110059228B
Authority
CN
China
Prior art keywords
mer
mers
obtaining
dna sequence
big data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910181475.8A
Other languages
Chinese (zh)
Other versions
CN110059228A (en)
Inventor
于强
张晓�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201910181475.8A
Publication of CN110059228A
Application granted
Publication of CN110059228B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90348 Query processing by searching ordered data, e.g. alpha-numerically ordered data

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a DNA data set implantation motif searching method, a device and a storage medium thereof, wherein the method comprises the following steps: acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set; obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set; and determining the implantation motif from the second l-mer set according to a first scoring model. With the APMS method, the invention can not only find the implanted motif in the DNA sequence big data set, but also run orders of magnitude faster than other implanted motif search methods.

Description

DNA data set implantation motif searching method and device and storage medium thereof
Technical Field
The invention belongs to the field of DNA sequence big data processing, and particularly relates to a DNA data set implantation motif searching method, a device and a storage medium thereof.
Background
DNA is the carrier of genetic information, which is stored in sequences composed of the four characters of DNA; the growth and development of organisms is, in essence, the transmission and expression of this genetic information. Transcription, as the first step in the expression of genetic information, is central to the regulatory mechanism. Transcription factors bind to specific sites (about 5-20 base pairs in length) in a DNA sequence, initiating the transcription of genes and controlling its efficiency. These sites are called Transcription Factor Binding Sites (TFBS), and locating TFBS is important for studying the transcriptional regulation of genes.
Quorum planted motif search (qPMS), referred to here as quorum implanted motif search, is one of the well-known computational models for locating TFBS in DNA sequences. Common exact qPMS methods include sample-and-pattern-driven exact methods and suffix-tree-based exact methods. Sample-and-pattern-driven exact methods, such as PMSPrune, StemFinder, qPMS7, TraverStringRef, PMS8 and qPMS9, consist of a sample-driven stage and a pattern-driven stage: the sample-driven stage selects some reference DNA sequences as constraints so as to generate as few candidate motifs as possible, and the pattern-driven stage verifies the candidate motifs. Suffix-tree-based exact methods, such as Weeder, RISOTTO and FMotif, build a suffix tree index on the input sequences to accelerate the verification of candidate motifs. Approximate qPMS methods aim to find an optimal or near-optimal motif in a short time; the most typical ones use expectation maximization, Gibbs sampling, genetic algorithms and the like to refine an initial motif. Among these, MEME-ChIP, which is based on expectation maximization, is one of the best-known motif discovery methods. In order to process large data sets efficiently, some motif discovery methods based on new strategies have been proposed, such as the PairMotifChIP method, which mines and merges similar substring pairs from the input DNA sequences to obtain motifs.
However, the exact qPMS methods, the approximate qPMS methods and the PairMotifChIP method share a common problem: their high computational cost leads to long run times, which is a bottleneck when processing DNA sequence big data sets.
Disclosure of Invention
In order to solve the above problems in the prior art, the present invention provides a DNA data set implantation motif searching method, a device, and a storage medium.
The embodiment of the invention provides a DNA data set implantation motif searching method, which comprises the following steps:
acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set;
obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif searching parameter, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;
determining the implant motif from the second set of l-mers according to a first scoring model.
In one embodiment of the present invention, obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameter includes:
acquiring length k, and acquiring a plurality of k-mers from the DNA sequence big data set according to the length k;
and acquiring a first threshold, and acquiring the first k-mer set according to the first threshold and the k-mers.
In one embodiment of the present invention, obtaining the length k includes:
obtaining a first expected value according to the DNA sequence big data set;
obtaining a second expected value according to the DNA sequence big data set and the implantation motif searching parameters;
and obtaining the length k according to the first expected value and the second expected value.
In one embodiment of the present invention, obtaining the first threshold includes:
obtaining the number of DNA sequences from the DNA sequence big data set;
and obtaining the first threshold value according to the second expected value and the quantity of the DNA sequences.
In one embodiment of the present invention, obtaining the first set of l-mers from the first set of k-mers comprises:
obtaining k-mers from the first set of k-mers;
performing expansion processing on each k-mer in the DNA sequence big data set to obtain an expanded first k-mer set;
performing redundancy removal processing on the expanded first k-mer set according to a second score model to obtain an expanded second k-mer set;
intercepting the expanded second k-mer set to obtain a first l-mer;
and obtaining the first l-mer set according to the first l-mer.
In an embodiment of the present invention, intercepting the extended second k-mer set to obtain a first l-mer, includes:
obtaining an alignment sequence according to the expanded second k-mer set;
and intercepting the alignment sequence according to a preset rule to obtain the first l-mer.
In one embodiment of the present invention, deriving a second set of l-mers from the first set of l-mers comprises:
constructing a binomial tree for a first l-mer in the first l-mer set;
calculating scores of all nodes of the constructed binomial tree according to the first score model, and taking the node with the highest score as a second l-mer;
performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set;
and processing the first l-mer set according to the second k-mer set to obtain a second l-mer set.
In an embodiment of the present invention, performing redundancy removal on the first k-mer set according to the second l-mer to obtain a second k-mer set, includes:
obtaining a fourth l-mer from the DNA sequence big data set;
acquiring a third expected value between the k-mer of the second l-mer and the k-mer of the fourth l-mer;
judging, according to the third expected value, whether each k-mer in the first k-mer set is redundant: when the Hamming distance d between the k-mer in the first k-mer set and a k-mer of the second l-mer is less than or equal to the third expected value, the k-mer is redundant and is deleted from the first k-mer set; otherwise, the k-mer is kept in the first k-mer set; the second k-mer set is thereby obtained.
Another embodiment of the present invention provides a DNA dataset implantation motif search apparatus, including:
the data acquisition module is used for acquiring the DNA sequence big data set and acquiring the implantation motif search parameter of the DNA sequence big data set;
the data processing module is used for obtaining the first k-mer set according to the DNA sequence big data set and the implantation motif searching parameter, obtaining the first l-mer set according to the first k-mer set, and obtaining the second l-mer set according to the first l-mer set;
a data determination module to determine the implant motif from the second set of l-mers according to the first scoring model.
Yet another embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above.
Compared with the prior art, the invention has the beneficial effects that:
With the APMS method, the invention can not only find the implanted motif in the DNA sequence big data set, but also run orders of magnitude faster than other implanted motif search methods.
Drawings
FIG. 1 is a schematic flow chart of a DNA data set implantation motif searching method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of searching for an implanted motif using a conventional binomial tree according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a DNA data set implantation motif searching device according to an embodiment of the present invention;
FIG. 4 is a graph comparing the results of the APMS, PairMotifChIP and MEME-ChIP methods for different DNA sequences in the simulated data, according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the experimental results on real data of the method for efficiently searching for implantation motifs in a DNA sequence big data set according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.
Example one
Referring to FIG. 1, FIG. 1 is a schematic flow chart of a DNA data set implantation motif searching method according to an embodiment of the present invention. The embodiment of the invention provides a DNA data set implantation motif searching method, which comprises the following steps:
Step 1, obtaining a DNA sequence big data set and obtaining implantation motif search parameters of the DNA sequence big data set.
Step 1.1, obtaining a DNA sequence big data set.
Specifically, the DNA sequence big data set D obtained in this embodiment includes t DNA sequences and may be expressed as D = {s_1, s_2, …, s_t}, where s_i denotes the i-th DNA sequence and each DNA sequence comprises n characters. Each DNA sequence s_i is a character string over the alphabet Σ = {A, C, G, T}, i.e. a string of length n composed of the characters A, C, G and T. s_i[j] denotes the j-th character of the i-th DNA sequence, and s_i[j..j'] denotes the substring of the i-th DNA sequence that starts at position j and ends at position j'. Here i ranges from 0 to t-1 and j ranges from 0 to n-1.
Step 1.2, obtaining the implantation motif searching parameter of the DNA sequence big data set.
Specifically, in this embodiment, the search parameter of the implant motif (l, d) includes a length l of the implant motif (l, d), a hamming distance d of the implant motif (l, d), a search ratio q of the implant motif (l, d), and a conservative parameter g.
In this embodiment, for the implanted motif (l, d), the APMS method solves the following problem: given a DNA sequence big data set D = {s_1, s_2, …, s_t} of t DNA sequences of length n, and parameters satisfying 0 < l < n, 0 ≤ d < l and 0 < q ≤ 1, the aim is to find an l-mer (a string of length l) m such that at least qt (qt ≤ t) DNA sequences s_i each contain an l-mer m_i that differs from m in at most d positions (mutations). The number of differing positions (mutations) is defined as the Hamming distance: d_H(m, m_i) = |{j : 1 ≤ j ≤ l, m[j] ≠ m_i[j]}|. The l-mer m is called an implantation motif (l, d), each such l-mer m_i in the DNA sequence big data set is called a motif example, and sequences in the DNA sequence big data set that do not satisfy the above Hamming distance condition are called background sequences. The APMS method is the DNA data set implantation motif searching method of the invention.
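The problem statement above can be sketched in a few lines of Python. This is only an illustrative brute-force check, not the APMS method itself; the function names `hamming`, `contains_instance` and `is_implanted_motif` are hypothetical, and rounding qt up to an integer is an assumption.

```python
from math import ceil

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def contains_instance(seq: str, m: str, d: int) -> bool:
    """True if seq holds some l-mer within Hamming distance d of m."""
    l = len(m)
    return any(hamming(seq[j:j + l], m) <= d
               for j in range(len(seq) - l + 1))

def is_implanted_motif(D: list[str], m: str, d: int, q: float) -> bool:
    """True if at least q*t of the t sequences in D contain an instance of m.

    Assumption: q*t is rounded up to the next integer.
    """
    needed = ceil(q * len(D))
    hits = sum(contains_instance(s, m, d) for s in D)
    return hits >= needed

# Toy data set (made up for illustration): two of three sequences
# contain an l-mer within distance 1 of "ACGT".
D = ["TTACGTAA", "GGACCTGG", "CCCCCCCC"]
print(is_implanted_motif(D, "ACGT", d=1, q=0.6))
```

Checking every candidate l-mer this way is exactly the cost that the steps below try to avoid on large data sets.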
DNA sequence big data sets are advantageous for finding high-quality implanted motifs (l, d), but most existing qPMS methods are too time-consuming to find implanted motifs (l, d) in a reasonable time. Building on the qPMS formulation, the APMS method in this embodiment is applicable to DNA sequence big data sets: it can not only find the implanted motifs (l, d) but also run orders of magnitude faster than existing motif search methods.
Step 2, obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set.
Step 2.1, obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif (l, d) search parameters, wherein the first k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters.
Specifically, obtaining the first k-mer set according to the DNA sequence big data set and the implantation motif (l, d) search parameters comprises the following steps:
acquiring length k, and acquiring a plurality of k-mers from a DNA sequence big data set according to the length k;
obtaining a first threshold, and obtaining the first k-mer set according to the first threshold and the k-mers.
Further, obtaining a length k comprises:
obtaining a first expected value according to the DNA sequence big data set;
obtaining a second expected value according to the DNA sequence big data set and the implantation motif (l, d) search parameters;
and obtaining the length k according to the first expected value and the second expected value.
Specifically, this embodiment uses a probabilistic analysis to determine a suitable value of k, so that k-mers in the background sequence can be better distinguished from k-mers in the motif examples. Let f_r(k) be the first expected value, i.e. the expected frequency of occurrence in the DNA sequence big data set D of a k-mer from any background sequence; let f_m(k) be the second expected value, i.e. the expected frequency of occurrence in the DNA sequence big data set D of a k-mer from any motif example. The larger the ratio of the second expected value f_m(k) to the first expected value f_r(k), the more easily k-mers in the background sequence and k-mers in the motif examples can be distinguished by their frequency of occurrence. Thus, this embodiment uses the following formula to determine the value of k:
[Formula (1): see Figure GDA0002997852320000081 in the original publication]
where k_min denotes the minimum value of k, and ε is a factor used to handle the case in which the first expected value f_r(k) is less than 1. k_min is preferably 5, because when k is too small it is difficult to distinguish k-mers in the background sequence from k-mers in the motif examples. ε is empirically set to 1.
In this example, the first expected value f_r(k) in formula (1) is obtained from the DNA sequence big data set and is designed as follows:
[Formula (2): see Figure GDA0002997852320000091 in the original publication]
Assume that the implanted motif (l, d) being searched for is m and that motif examples m_1 and m_2 of m exist in the DNA sequence big data set D. For a k-mer x_1 at any starting position in an arbitrary motif example m_1 and the k-mer x_2 at the same starting position in another arbitrary motif example m_2, let p_k denote the probability that x_1 and x_2 are equal. The second expected value f_m(k) in formula (1) is then designed as follows:
[Formula (3): see Figure GDA0002997852320000092 in the original publication]
For p_k in formula (3), the probability that k-mer x_1 and k-mer x_2 are equal, the law of total probability gives the following design:
[Formula (4): see Figure GDA0002997852320000093 in the original publication]
where Pr_i denotes the probability that the Hamming distance d_H(m, m_1) between the implanted motif (l, d) m and motif example m_1 is i (0 ≤ i ≤ d), and Pr_j denotes the probability that the Hamming distance d_H(m, m_2) between m and motif example m_2 is j (0 ≤ j ≤ d). Pr_i is designed as follows:
[Formula (5): see Figure GDA0002997852320000094 in the original publication]
where g denotes the conservative parameter, with 0 ≤ g ≤ 1.
In the same way, Pr_j is designed as follows:
[Formula (6): see Figure GDA0002997852320000101 in the original publication]
p_ij denotes the probability that k-mer x_1 and k-mer x_2 are equal given d_H(m, m_1) = i and d_H(m, m_2) = j, and is designed as follows:
[Formula (7): see Figure GDA0002997852320000102 in the original publication]
As can be seen from formula (7), p_ij is the sum, over a from 0 to min(i, j), of the product of three factors. The first factor is the probability that an arbitrary k-mer x_1 in motif example m_1 contains a mutations; the second factor is the probability that the mutation positions of k-mer x_2 are identical to those of k-mer x_1; the third factor is the probability that, when the mutation positions of x_2 and x_1 are identical, the mutated bases are also identical.
The first expected value f_r(k) and the second expected value f_m(k) are calculated by formulas (2) and (3) for k in the range of 0 to l, and then, according to formula (1), the value of k for which the ratio of the second expected value f_m(k) to the first expected value f_r(k) is largest is taken as the length k of each k-mer in the first k-mer set in this example.
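The selection of k can be illustrated with a short Python sketch. Because formulas (1) to (3) are not reproduced in this text, the sketch takes f_m and f_r as caller-supplied functions and assumes that ε acts as a floor on f_r(k), i.e. that the quantity maximized is f_m(k) / max(f_r(k), ε). That form, the function name `select_k`, and the toy f_m and f_r below are all assumptions made for illustration, not the patent's formulas.

```python
from typing import Callable

def select_k(f_m: Callable[[int], float],
             f_r: Callable[[int], float],
             l: int, k_min: int = 5, eps: float = 1.0) -> int:
    """Return the k in [k_min, l] maximizing f_m(k) / max(f_r(k), eps).

    Assumption: eps floors f_r(k) so the ratio does not blow up
    once the background expectation drops below 1.
    """
    best_k, best_ratio = k_min, float("-inf")
    for k in range(k_min, l + 1):
        ratio = f_m(k) / max(f_r(k), eps)
        if ratio > best_ratio:  # strict: the smallest maximizing k wins ties
            best_k, best_ratio = k, ratio
    return best_k

# Toy expectations (hypothetical): the motif expectation decays slowly,
# the background expectation shrinks by a factor of 4 per extra character.
toy_f_m = lambda k: 100 - k
toy_f_r = lambda k: 1000 / 4 ** (k - 5)
print(select_k(toy_f_m, toy_f_r, l=15))
```

With these toy functions the ratio grows while the background expectation is above ε and falls afterwards, which is the trade-off the paragraph above describes.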
From the DNA sequence big data set D, several k-mers of length k are obtained.
Further, obtaining the first threshold comprises the following steps:
obtaining the number of DNA sequences from the DNA sequence big data set;
obtaining the first threshold according to the second expected value and the number of DNA sequences.
Specifically, in this embodiment, not all k-mers of length k are obtained from the DNA sequence big data set D; instead, only the k-mers whose frequency of occurrence in D is greater than or equal to the first threshold are taken as high-frequency k-mers to generate the first k-mer set. As described above, f_m(k) denotes the expected frequency of occurrence in D of an arbitrary k-mer in an arbitrary motif example. If the first threshold were set directly to f_m(k), multiple high-frequency k-mers corresponding to the same motif might be acquired. Thus, the first threshold is designed by adding to f_m(k) a variable proportional to the number t of DNA sequences, so as to avoid obtaining too many redundant high-frequency k-mers. The first threshold of this embodiment is designed as follows:
[Formula (8): see Figure GDA0002997852320000115 in the original publication]
Further, according to the first threshold obtained from formula (8), the k-mers whose frequency is greater than or equal to the first threshold are acquired from each DNA sequence as high-frequency k-mers to generate the first k-mer set.
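The high-frequency filtering step can be sketched as follows. Since formula (8) is not reproduced in this text, the threshold is passed in as a plain number; `count_kmers` and `high_frequency_kmers` are hypothetical names, and counting every occurrence across all sequences is an assumption about what "frequency of occurrence" means here.

```python
from collections import Counter

def count_kmers(D: list[str], k: int) -> Counter:
    """Count every occurrence of every k-mer across all sequences in D."""
    counts: Counter = Counter()
    for s in D:
        for j in range(len(s) - k + 1):
            counts[s[j:j + k]] += 1
    return counts

def high_frequency_kmers(D: list[str], k: int, threshold: float) -> set[str]:
    """Keep the k-mers whose total count reaches the given threshold."""
    return {x for x, c in count_kmers(D, k).items() if c >= threshold}

# Toy data (made up): "ACG" occurs three times and "TAC" twice.
D = ["ACGTACG", "TACGA"]
print(sorted(high_frequency_kmers(D, k=3, threshold=2)))
```

A single pass with a hash-based counter keeps this step linear in the total sequence length, which matters for data sets with many sequences.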
Step 2.2, obtaining a first l-mer set according to the first k-mer set, wherein the first l-mer set comprises a plurality of first l-mers, and each first l-mer comprises l characters.
Specifically, obtaining the first l-mer set according to the first k-mer set comprises:
acquiring k-mers from the first k-mer set;
expanding each k-mer in the DNA sequence big data set to obtain an expanded first k-mer set, wherein the length of each expanded k-mer in the expanded first k-mer set is 2l-k;
performing redundancy removal processing on the expanded first k-mer set according to the second score model to obtain an expanded second k-mer set;
intercepting the expanded second k-mer set to obtain a first l-mer;
and obtaining the first l-mer set according to the first l-mer.
Further, the k-mers are expanded in the DNA sequence big data set to obtain the expanded first k-mer set, wherein the length of each expanded k-mer in the expanded first k-mer set is 2l-k.
Specifically, to search for the implantation motif (l, d) using the first k-mer set, a k-mer x is first obtained from the first k-mer set. Because the starting position of k-mer x within the implantation motif (l, d) is unknown, after k-mer x is found in the DNA sequence big data set D, this embodiment expands it by l-k characters to the left and to the right in D, so that the expanded k-mer x becomes a character string of length 2l-k. In this way, the expanded k-mer x in the DNA sequence big data set D can cover the implantation motif (l, d).
For example, suppose s_i[j..j+k-1] is an exact occurrence of k-mer x in the DNA sequence big data set D; the expanded k-mer x thus obtained in D is s_i[j-l+k..j+l-1].
Further, each k-mer x in the DNA sequence big data set D is expanded in this way to obtain the expanded first k-mer set.
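The expansion of an exact occurrence s_i[j..j+k-1] to s_i[j-l+k..j+l-1] can be sketched as below. The function name `extended_occurrences` is hypothetical, and skipping occurrences whose window would run past either end of the sequence is an assumption; the text does not say how sequence boundaries are handled.

```python
def extended_occurrences(D: list[str], x: str, l: int) -> list[str]:
    """Expand each exact occurrence of k-mer x by l-k characters per side.

    Returns windows of length 2l-k. Occurrences too close to a sequence
    end are skipped (an assumption made for this sketch).
    """
    k = len(x)
    flank = l - k
    windows = []
    for s in D:
        j = s.find(x)
        while j != -1:
            start, end = j - flank, j + k + flank  # end is exclusive, i.e. j + l
            if start >= 0 and end <= len(s):
                windows.append(s[start:end])
            j = s.find(x, j + 1)  # continue after this hit, allowing overlaps
    return windows

# Toy data (made up): with l = 4 and k = 3 each window has 2*4-3 = 5 characters.
print(extended_occurrences(["TTACGTT", "ACGAAAA"], "ACG", l=4))
```

The second toy sequence contributes nothing because its occurrence starts at position 0 and has no left flank.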
Further, redundancy removal is performed on the expanded first k-mer set according to a second scoring model to obtain an expanded second k-mer set, wherein the length of each expanded k-mer in the expanded second k-mer set is 2l-k.
In particular, if an expanded k-mer x in the expanded first k-mer set does not contain a motif example in the DNA sequence big data set D, i.e. it consists entirely of background sequence, it will degrade the quality of the first l-mer set. Thus, before generating the first l-mer set, this embodiment evaluates each expanded k-mer x with the designed second scoring model score_i(y) to assess whether it consists of background sequence. As noted above, the first expected value f_r(k) denotes the expected frequency in D of a k-mer from any background sequence, so in this example the second scoring model is designed as follows:
[Formula (9): see Figure GDA0002997852320000131 in the original publication]
As can be seen from formula (9), the smaller the score given by the second scoring model score_i(y), the more likely the expanded k-mer x is to consist of background sequence; the expanded k-mer x with the smallest score is therefore filtered out of the expanded first k-mer set, yielding the expanded second k-mer set.
By designing the second scoring model, this embodiment filters out of the expanded first k-mer set the expanded k-mers x that are likely to be background sequence, which reduces the computation of the subsequent implantation motif (l, d) search and thus the running time of the APMS method.
Further, intercepting the expanded second k-mer set to obtain a first l-mer, including:
obtaining an alignment sequence according to the expanded second k-mer set;
and intercepting the alignment sequence according to a preset rule to obtain a first l-mer.
Specifically, in this embodiment, after redundancy removal is performed on the expanded first k-mer set in the DNA sequence big data set D, the remaining expanded k-mers form the expanded second k-mer set. The expanded k-mers in the expanded second k-mer set form an alignment sequence align of length 2l-k, and r(align[i]) denotes the information content of the i-th column of align; the alignment is then intercepted according to a preset rule to obtain the first l-mer. The information content is given by a Position Weight Matrix (PWM): each column of the position weight matrix holds the ratios of the four characters A, C, G and T at that position of the expanded k-mers.
After the expanded k-mers in the expanded second k-mer set are right-aligned to form the alignment sequence align, a consensus sequence of length 2l-k is first obtained from the information content r(align[i]) of each column. The columns at the left and right ends of the consensus sequence are then compared repeatedly, and the end column with the smaller information content is removed each time, until a consensus sequence of length l remains; this consensus sequence of length l is the first l-mer.
For example, in this embodiment, suppose the length l of the implantation motif (l, d) is 6 and the length k of the k-mer is 3, and the DNA sequence big data set includes 6 expanded k-mers: {AGATTGCAG}, {CGATTGCAG}, {CGATTGCAC}, {CGCTTGCAC}, {CGCTTGCAG} and {CTATTGTAG}. The 6 expanded k-mers are first right-aligned to form the alignment sequence align:
{AGATTGCAG,
CGATTGCAG,
CGATTGCAC,
CGCTTGCAC,
CGCTTGCAG,
CTATTGTAG}
The information content r(align[i]) of each column of align is:
{A: 0.17, 0.00, 0.67, 0.17, 0.00, 0.17, 0.00, 1.00, 0.00
C: 0.83, 0.00, 0.33, 0.00, 0.00, 0.00, 0.83, 0.00, 0.33
G: 0.00, 0.83, 0.00, 0.00, 0.00, 0.66, 0.00, 0.00, 0.67
T: 0.00, 0.17, 0.00, 0.83, 1.00, 0.17, 0.17, 0.00, 0.00}
The consensus sequence obtained from these columns is {CGATTGCAG}. The consensus sequence is then trimmed from both ends by comparing the ratios of the characters in the end columns. In the leftmost column the character C has the largest ratio, and in the rightmost column the character G has the largest ratio; since the ratio of C in the leftmost column is greater than the ratio of G in the rightmost column, the leftmost column is retained and the rightmost column is deleted. Next, the retained leftmost column (character C) is compared with the new rightmost column, in which the character A has the largest ratio; since the ratio of C is smaller than the ratio of A, the rightmost column is retained and the leftmost column is deleted. This continues until the consensus sequence is truncated to length l, giving {ATTGCA}, which is the first l-mer.
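The worked example can be reproduced with a small Python sketch. It computes the per-column character ratios directly from the six aligned strings, takes the most frequent character of each column as the consensus, and trims the weaker end column until l columns remain. Using the largest ratio in a column as its information content follows the worked example rather than a stated formula, and dropping the left column on a tie is an assumption; the names `column_profile` and `first_l_mer` are hypothetical.

```python
from collections import Counter

def column_profile(aligned: list[str]) -> list[dict[str, float]]:
    """Per-column character ratios of equal-length aligned strings."""
    n = len(aligned)
    return [{c: cnt / n for c, cnt in Counter(col).items()}
            for col in zip(*aligned)]

def first_l_mer(aligned: list[str], l: int) -> str:
    """Consensus of the alignment, trimmed from both ends to length l."""
    profile = column_profile(aligned)
    consensus = [max(col, key=col.get) for col in profile]
    weight = [max(col.values()) for col in profile]  # largest ratio per column
    left, right = 0, len(consensus) - 1
    while right - left + 1 > l:
        if weight[left] > weight[right]:
            right -= 1  # right end column is weaker: drop it
        else:
            left += 1   # left end column is weaker (or tied): drop it
    return "".join(consensus[left:right + 1])

aligned = ["AGATTGCAG", "CGATTGCAG", "CGATTGCAC",
           "CGCTTGCAC", "CGCTTGCAG", "CTATTGTAG"]
print(first_l_mer(aligned, l=6))
```

On these six strings the trimming removes the rightmost column first and then the two leftmost columns, matching the order described in the example.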
Further, the k-mers in the first k-mer set are traversed, and the first l-mer of each k-mer is found in the DNA sequence big data set to form the first l-mer set.
Step 2.3, obtaining a second l-mer set according to the first l-mer set, wherein the second l-mer set comprises a plurality of second l-mers, and each second l-mer comprises l characters.
Specifically, obtaining a second l-mer set according to the first l-mer set comprises:
constructing a binomial tree for a first l-mer in the first l-mer set;
calculating scores of all nodes of the binomial tree according to the first score model, and taking the node with the highest score as a second l-mer;
performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set;
and processing the first l-mer set according to the second k-mer set to obtain a second l-mer set.
Further, constructing a binomial tree for a first l-mer in the first l-mer set, comprising:
selecting the first l-mer as a root node of the binomial tree;
sequentially generating the (i+1)-th layer of the binomial tree from the i-th layer, and judging whether the number of nodes in the (i+1)-th layer is greater than a second threshold: if it is greater than the second threshold, the nodes of the (i+1)-th layer are selected according to the first scoring model so that the number of nodes in the final (i+1)-th layer equals the second threshold; if it is less than or equal to the second threshold, all nodes of the (i+1)-th layer are kept, where 0 < i < d;
judging whether each node of the i-th layer of the binomial tree is an implanted motif (l, d): if so, the node is stored in a first array M, and if not, the node is not stored in the first array M, where 0 < i < d;
and according to the node scores in the first array M, taking the node with the highest score as a second l-mer.
Specifically, referring to fig. 2, fig. 2 is a schematic diagram illustrating searching for an implanted motif of a conventional binomial tree according to an embodiment of the present invention. As can be seen from fig. 2, in the conventional method for constructing the binomial tree, the root node of the binomial tree is the first l-mer in the first l-mer set, the internal node or leaf node of the ith layer of the binomial tree is a node whose hamming distance from the first l-mer of the root node is i, the value range of i is 0< i < d, and the depth of the binomial tree is d. Each layer of the binomial tree corresponds to a plurality of expansion nodes, the expansion nodes are d neighbors of the first l-mer of the root node, and the expansion nodes and the first l-mer have difference in marked positions on a path from the root node to the internal node or the leaf node. Thus, each node in the binomial tree represents a d neighbor that is a Hamming distance i (0 ≦ i ≦ d) from the first l-mer. Wherein the expansion nodes are all l-mers with the length of l.
In this embodiment, a binomial tree is constructed, a root node is a first l-mer in a first l-mer set, then an i +1 th layer of the binomial tree is sequentially generated according to the i th layer of the binomial tree, whether the number of nodes of the i +1 th layer of the binomial tree is greater than a second threshold is judged, if the number of nodes of the layer is greater than the second threshold, nodes of the i +1 th layer of the binomial tree are obtained according to a first score model, the number of nodes of the layer is equal to the second threshold, and if the number of nodes of the layer is less than or equal to the second threshold, the nodes of the i +1 th layer of the binomial tree are maintained, and the value of i is 0< i < d.
Specifically, let the second threshold be N_mm(i), where N_mm(i) represents the number of nodes at the i-th (0<i<d) layer of the binomial tree. To avoid missing expansion nodes at each layer of the binomial tree, when N_mm(i) is calculated, the number of nodes at the i-th layer is multiplied by a safety factor α (α ≥ 1). In the implementation of the APMS method, α is empirically set to a preferred value of 2, and N_mm(i) is designed as follows:
Figure GDA0002997852320000171
For example, when the binomial tree is constructed in this embodiment, the length l of the implanted motif (l, d) is known to be 5 and the Hamming distance d is 3. The root node of the binomial tree is the first l-mer. The nodes of the first layer are the l-mers at Hamming distance 1 from the first l-mer of the root node, and there are 15 of them: because the implanted motif (l, d) is an l-mer of length 5 and each position has 3 possible mutations, the first layer in this embodiment takes all mutation cases of the first l-mer of the root node, i.e. 5 × 3 = 15 nodes. The nodes of the second layer are obtained by expanding the first-layer nodes toward the implanted motif (l, d); the Hamming distance between a node and its expansion node is 1, and the number of second-layer nodes determined by formula (11) is C(3,2) × 2 = 6. Similarly, the nodes of the third layer are obtained by expanding the 6 second-layer nodes toward the implanted motif (l, d); the Hamming distance between a node and its expansion node is 1, and the number of third-layer nodes determined by formula (11) is C(3,3) × 2 = 2. The binomial tree finally constructed is a tree structure with the first l-mer as the root node, 15 nodes in the first layer, 6 nodes in the second layer, and 2 nodes in the third layer.
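The node counts in this example can be checked with a short sketch. The threshold formula itself is only available as an image in the source, so the form α × C(d, i) used below is an assumption inferred from the worked numbers (6 and 2 for d = 3, α = 2); the function names are hypothetical.

```python
from math import comb


def one_neighbor_count(l, alphabet_size=4):
    # Each of the l positions can mutate to (alphabet_size - 1) other characters.
    return l * (alphabet_size - 1)


def layer_threshold(d, i, alpha=2):
    # Assumed form alpha * C(d, i); inferred from the example, not quoted from the formula image.
    return alpha * comb(d, i)


print(one_neighbor_count(5))   # 15 first-layer nodes for l = 5
print(layer_threshold(3, 2))   # 6 second-layer nodes
print(layer_threshold(3, 3))   # 2 third-layer nodes
```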
And further, obtaining nodes of the (i + 1) th layer of the final binomial tree according to the first score model, wherein the number of the nodes of the layer is equal to a second threshold value.
Specifically, in the present embodiment, under the qPMS model, a first score model is designed to evaluate the score of each node y of the constructed binomial tree. Wherein D' (y) is a set containing qt DNA sequences selected from the DNA sequence big data set D and used for calculating the score of each node y of the binomial tree, and s is the l-mer with the minimum Hamming distance between a certain DNA sequence and the node y. Generally, the higher the score of node y of the binomial tree, the closer the node y is to the implant motif (l, d). The first scoring model of this embodiment is designed as follows:
Figure GDA0002997852320000181
As shown in formula (11), in a binomial tree constructed from any first l-mer, the conventional method evaluates the score of each node y as follows. The score of node y on each of the t DNA sequences in the DNA sequence big data set is calculated first, where the score of a DNA sequence is the score of the l-mer in that sequence with the minimum Hamming distance to node y. The qt DNA sequences with the highest scores are then taken, and their scores are added to give the score of node y. Each node y thus has a corresponding score score_n(y), and the node y with the highest score is selected as the second l-mer.
However, the conventional method has the drawback that the DNA sequence big data set must be scanned again every time the score of a node y is calculated, so the computational cost is high. To solve this problem, in this embodiment all l-mers in each DNA sequence are sorted in ascending order of their Hamming distance to the first l-mer to obtain a queuing sequence; the earlier an l-mer appears in the queuing sequence, the more likely it is to be the second l-mer finally obtained. Calculating the second l-mer through the queuing sequence reduces the computational cost, because the best-scoring l-mer in a DNA sequence can generally be found by scanning only the first few l-mers in the queuing sequence. D'(y) in formula (11) is the set of qt DNA sequences selected from the DNA sequence big data set D for calculating the score of node y; in this embodiment, D'(y) is expressed as:
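The conventional scoring procedure can be sketched as follows. Formula (11) is only available as an image, so the per-sequence score used here, l minus the minimum Hamming distance, is a stand-in assumption, and the names `min_distance` and `node_score` are hypothetical. The sketch assumes every sequence is at least l characters long.

```python
def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    return sum(x != y for x, y in zip(a, b))


def min_distance(seq, y):
    # Minimum Hamming distance between y and any l-mer of seq (full rescan of the sequence).
    l = len(y)
    return min(hamming(seq[p:p + l], y) for p in range(len(seq) - l + 1))


def node_score(y, seqs, q):
    # Stand-in per-sequence score (assumption): l - minimum distance.
    per_seq = sorted((len(y) - min_distance(s, y) for s in seqs), reverse=True)
    qt = max(1, int(q * len(seqs)))
    return sum(per_seq[:qt])  # add up the qt best-scoring sequences


seqs = ["TTACGTATT", "GGACGTAGG", "CCACGTACC", "TTTTTTTTT"]
print(node_score("ACGTA", seqs, 0.75))  # 15
print(node_score("ACGAA", seqs, 0.75))  # 12
```

Every call to `node_score` rescans all sequences, which is the cost that the queuing sequence is intended to avoid.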
Figure GDA0002997852320000191
After the queuing sequences are obtained, the l-mers of each DNA sequence are arranged in ascending order of their Hamming distance to the first l-mer, so the l-mer with the minimum Hamming distance is at the front of each queuing sequence. The queuing sequences are then rearranged in ascending order of the Hamming distance of the front l-mer obtained from each DNA sequence, giving a new queuing sequence, a row of which is called C_i. In this embodiment, for a first l-mer m' and a d-neighbor y of the first l-mer m', and for a row C_i and a position j (1 ≤ j ≤ |C_i|) in C_i, if d_H(C_i[j], m') − d_H(y, m') > 0, then d_H(C_i[j], m') − d_H(y, m') is the smallest possible value of d_H(y, C_i[j]). Thus, when scores are scanned and calculated on the new queuing sequence, the scan of a row C_i can be terminated as soon as d_H(C_i[j], m') − d_H(y, m') ≥ dis(y, C_i[j]) is satisfied; the minimum Hamming distance of the current row C_i is then dis(y, C_i[j]). Substituting dis(y, C_i[j]) into formula (11) gives the score score_n(y) of node y on row C_i, and the scan of the next row C_{i+1} is started. This continues until all rows of the new queuing sequence have been scanned, and the highest score among the row scores of the new queuing sequence is taken as the score score_n(y) of node y.
The score score_n(y) is calculated as above for every node of the (i+1)-th layer of the binomial tree, the scores obtained are sorted, and the nodes with the highest scores, equal in number to the second threshold, are selected as the nodes of the (i+1)-th layer of the final binomial tree, so that the number of nodes of the layer is equal to the second threshold.
In this embodiment, when the binomial tree is constructed from a first l-mer in the first l-mer set, high-scoring nodes are calculated and selected according to the first score model to generate expansion nodes. Because a high-scoring node is more likely to be the implanted motif (l, d), this embodiment generates expansion nodes in the direction of the implanted motif (l, d), which reduces the computation of the subsequent implanted motif (l, d) search and the running time of the APMS method.
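The early termination of a row scan relies on the triangle inequality for the Hamming distance: d_H(y, x) ≥ d_H(x, m') − d_H(y, m'). The following minimal sketch, assuming a row is stored as a list of (distance to m', l-mer) pairs sorted in ascending order, shows one way the scan could be stopped; the names `build_queue` and `min_distance_pruned` are hypothetical, and `best` plays the role of dis(y, C_i[j]).

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def build_queue(seq, root):
    # One row: all l-mers of seq, sorted by ascending Hamming distance to the root (first l-mer).
    l = len(root)
    lmers = [seq[p:p + l] for p in range(len(seq) - l + 1)]
    return sorted((hamming(x, root), x) for x in lmers)


def min_distance_pruned(queue, y, root):
    # Returns (minimum distance from y to the row, number of l-mers actually compared).
    d_y = hamming(y, root)
    best, scanned = None, 0
    for d_x, x in queue:
        # Lower bound on d_H(y, x); it only grows along the sorted row, so the scan can stop.
        if best is not None and d_x - d_y >= best:
            break
        scanned += 1
        dist = hamming(y, x)
        if best is None or dist < best:
            best = dist
    return best, scanned


row = build_queue("TTACGTATTGGCCA", "ACGAA")
print(min_distance_pruned(row, "ACGTA", "ACGAA"))  # (0, 1)
```

In this small example only one of the ten l-mers in the row is compared with y before the bound stops the scan.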
Further, whether a node of the ith layer of the binomial tree is an implanted motif (l, d) is judged, if the node is the implanted motif (l, d), the node is stored in the first array M, and if the node is not the implanted motif (l, d), the node does not need to be stored in the first array M;
Specifically, in this embodiment, not all d-neighbor nodes of the binomial tree are used to search for the implanted motif (l, d) as in the conventional method; only nodes similar to the implanted motif (l, d) are used. To judge whether a node of the i-th layer of the binomial tree is an implanted motif (l, d), the node is checked against the DNA sequence big data set: it is judged whether at least qt DNA sequences each contain an l-mer whose Hamming distance to the node is less than or equal to d. If so, the node is an implanted motif (l, d) and is stored in the first array M; if not, the node is not an implanted motif (l, d) and is not stored in the first array M. The Hamming distance between a node of the i-th layer and its expansion node in the (i+1)-th layer is 1, wherein the value of i is 0< i < d.
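The membership test described above can be written as a short sketch. The function names are hypothetical; the condition itself (at least qt sequences containing an l-mer within Hamming distance d of the node) follows the text.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def occurs_within(seq, y, d):
    # True if seq contains an l-mer whose Hamming distance to y is at most d.
    l = len(y)
    return any(hamming(seq[p:p + l], y) <= d for p in range(len(seq) - l + 1))


def is_implanted_motif(y, seqs, d, q):
    # y qualifies when at least q * t of the t sequences contain such an l-mer.
    hits = sum(occurs_within(s, y, d) for s in seqs)
    return hits >= q * len(seqs)


seqs = ["TTACGTATT", "GGACGTAGG", "CCACGTACC", "TTTTTTTTT"]
print(is_implanted_motif("ACGAA", seqs, 1, 0.75))  # True
print(is_implanted_motif("TTTTT", seqs, 0, 0.75))  # False
```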
And further, according to the node scores in the first array M, taking the node with the highest score as a second l-mer.
Specifically, the nodes in the first array M are the set of nodes selected for the first l-mer that are close to the implant motif (l, d), the node with the highest score is selected from the first array M as the implant motif that is most likely to be the search, and the node with the highest score is taken as the second l-mer.
Further, traversing each first l-mer in the first l-mer set, constructing the binomial tree model to obtain a second l-mer, obtaining a second l-mer set according to the second l-mer, and obtaining a final implantation motif (l, d) through the second l-mer set.
Specifically, a binomial tree model is constructed for each first l-mer in the first l-mer set, the first array M of each binomial tree model rooted at a first l-mer is calculated according to the first score model, and the node with the highest score in the first array M is selected as the second l-mer of that first l-mer. The second l-mers obtained from all first l-mers in the first l-mer set form the second l-mer set. The second l-mers in the second l-mer set are then scored according to the first score model and sorted from high to low, and the sorted node set is output as the final implantation motifs (l, d).
In summary, the search for implantation motifs (l, d) based on the binomial tree proceeds layer by layer, starting from the first l-mer at the root node. For the root node, it is first judged whether the first l-mer is an implantation motif (l, d), and all nodes at Hamming distance 1 from the first l-mer are taken as the expansion nodes of layer 1. For the i-th (0<i<d) layer, the N_mm(i) highest-scoring nodes are selected from the expansion nodes of the layer as the final nodes of the layer, and the expansion nodes at Hamming distance 1 from each of these N_mm(i) selected nodes are taken as the nodes of the (i+1)-th layer. For the d-th layer, it is directly judged whether the nodes of the layer are implantation motifs (l, d). Each expansion node of each layer is judged: if it is an implantation motif (l, d), it is stored in the first array M; otherwise it is not stored in the first array M. During the search, if the first array M of the binomial tree constructed from a first l-mer contains several implantation motifs (l, d), the node with the highest score in M is selected as the second l-mer. A second l-mer is obtained from each first l-mer in the first l-mer set, the second l-mers form the second l-mer set, the second l-mers in the set are sorted from high to low by score, and the sorted node set is output as the final implantation motifs (l, d).
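The layer-by-layer search summarized above can be sketched as a beam search. This is a simplified illustration under stated assumptions, not the embodiment's implementation: the score is a stand-in (l minus the minimum Hamming distance, summed over the best qt sequences), a single constant `beam_width` replaces the per-layer threshold N_mm(i), and all function names are hypothetical.

```python
ALPHABET = "ACGT"


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def min_distance(seq, y):
    l = len(y)
    return min(hamming(seq[p:p + l], y) for p in range(len(seq) - l + 1))


def score(y, seqs, q):
    # Stand-in score (assumption): sum of l - min distance over the best qt sequences.
    qt = max(1, int(q * len(seqs)))
    per_seq = sorted((len(y) - min_distance(s, y) for s in seqs), reverse=True)
    return sum(per_seq[:qt])


def is_motif(y, seqs, d, q):
    return sum(min_distance(s, y) <= d for s in seqs) >= q * len(seqs)


def expand(node, root):
    # Mutate one position that still equals the root, so the child is one step farther from it.
    for p, c in enumerate(node):
        if c == root[p]:
            for a in ALPHABET:
                if a != c:
                    yield node[:p] + a + node[p + 1:]


def second_lmer(root, seqs, d, q, beam_width):
    found = [root] if is_motif(root, seqs, d, q) else []   # plays the role of the first array M
    layer = [root]
    for _ in range(d):
        children = sorted({c for y in layer for c in expand(y, root)})
        found += [c for c in children if is_motif(c, seqs, d, q)]
        # Keep only the best-scoring nodes as parents of the next layer.
        layer = sorted(children, key=lambda y: score(y, seqs, q), reverse=True)[:beam_width]
    return max(found, key=lambda y: score(y, seqs, q)) if found else None


seqs = ["TTACGTATT", "GGACGTAGG", "CCACGTACC"]
print(second_lmer("ACGAA", seqs, 1, 1.0, 6))  # ACGTA
```

Here the root differs from the shared l-mer ACGTA at one position, and the search recovers ACGTA as the highest-scoring node stored during the traversal.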
Further, performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set, including:
acquiring a fourth l-mer from the DNA sequence big data set;
acquiring a third expected value between the k-mer of the second l-mer and the k-mer of the fourth l-mer;
and judging, according to the third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer in the first k-mer set is redundant and is deleted from the first k-mer set to obtain a second k-mer set; otherwise, the k-mer is kept in the first k-mer set to obtain the second k-mer set.
In particular, the first k-mer set may contain redundant k-mers, that is, k-mers that are substrings of a second l-mer at the same starting position, or k-mers that overlap a second l-mer over a length k' (k_min ≤ k' < k). Based on this, in this embodiment, each time a first l-mer is to be obtained from a k-mer in the first k-mer set, the second l-mer generated last time is first used to determine whether the k-mer is redundant. If the k-mer is redundant, it is deleted from the first k-mer set to obtain a second k-mer set; if the k-mer is not redundant, it is kept in the first k-mer set to obtain the second k-mer set, wherein the second k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters.
Let the third expected value e(k) represent the expected Hamming distance between a k-mer at an arbitrary starting position in an arbitrary motif instance and the k-mer at the same starting position in the implanted motif (l, d). This embodiment obtains a fourth l-mer from the DNA sequence big data set D, where the fourth l-mer comprises l characters; in calculating the third expected value e(k), the fourth l-mer is taken as a motif instance and the second l-mer is taken as the implanted motif (l, d). e(l) is calculated based on the total probability formula. Given a mutated position between the fourth l-mer and the second l-mer, and assuming that the mutation occurs at random at one of the l positions, the third expected value e(k) is equal to e(l) multiplied by k/l. The third expected value e(k) of this embodiment is designed as follows:
Figure GDA0002997852320000231
In this embodiment, a k-mer x in the first k-mer set is defined to be redundant if there is a k-mer z in the second l-mer such that d_H(z, x) ≤ e(k), i.e. the Hamming distance between the k-mer x in the first k-mer set and the k-mer z in the second l-mer is less than or equal to the third expected value e(k). A redundant k-mer is deleted from the first k-mer set, and the implanted motif (l, d) search process does not need to be carried out for it; otherwise, the k-mer is kept in the first k-mer set and the implanted motif (l, d) search process is carried out as above. A k-mer x in the first k-mer set can further be defined to be redundant as follows: let pf(x, k') and sf(x, k') denote the prefix of length k' and the suffix of length k' of the string x, respectively; x is redundant if there exists k_min ≤ k' < k such that d_H(pf(z, k'), sf(x, k')) ≤ e(k') or d_H(sf(z, k'), pf(x, k')) ≤ e(k').
In this embodiment, the redundancy removal processing is performed on the first k-mer set by designing the third expected value e(k), which reduces the computation of the subsequent implanted motif (l, d) search and the running time of the APMS method.
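The two redundancy conditions can be sketched as follows. The formula for e(l) is only available as an image, so e(l) is passed in by the caller as `e_l`; only the relation e(k) = e(l) × k/l is taken from the text. In the overlap test, z is read as the second l-mer itself, which is one interpretation of the definition; the function names are hypothetical.

```python
def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def expected_distance(e_l, k, l):
    # e(k) = e(l) * k / l, as stated in the text; e(l) is supplied by the caller.
    return e_l * k / l


def is_redundant(x, second_lmer, e_l, k_min):
    k, l = len(x), len(second_lmer)
    # Condition 1: some k-mer z of the second l-mer satisfies d_H(z, x) <= e(k).
    e_k = expected_distance(e_l, k, l)
    for p in range(l - k + 1):
        if hamming(second_lmer[p:p + k], x) <= e_k:
            return True
    # Condition 2 (interpretation): x overlaps one end of the second l-mer over k' characters.
    for kp in range(k_min, k):
        e_kp = expected_distance(e_l, kp, l)
        if hamming(second_lmer[:kp], x[-kp:]) <= e_kp:
            return True
        if hamming(second_lmer[-kp:], x[:kp]) <= e_kp:
            return True
    return False


m = "ACGTACGTAC"  # hypothetical second l-mer, l = 10
print(is_redundant("CGTAG", m, 2.0, 3))  # True: within distance 1 of the k-mer CGTAC
print(is_redundant("TTTTT", m, 2.0, 3))  # False
```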
And further, processing the first l-mer set according to the second k-mer set to obtain a second l-mer set.
Specifically, after the redundancy removal processing is performed on the first k-mer set, the second k-mer set is obtained, and the first k-mer set is updated with the second k-mer set. Once a redundant k-mer has been deleted from the first k-mer set, the APMS method of this embodiment no longer needs to take that k-mer from the first k-mer set and derive a first l-mer from it. Each iteration of the APMS method therefore takes a k-mer from the first k-mer set, obtains a first l-mer from the k-mer, constructs a binomial tree from the first l-mer, obtains a second l-mer from the binomial tree, removes the redundant k-mers from the first k-mer set according to the second l-mer to obtain the second k-mer set, and updates the first k-mer set with the second k-mer set; a k-mer is then taken from the updated first k-mer set and the process is repeated. For the first l-mer set, a binomial tree is constructed for each first l-mer, the score of each node in the binomial tree is calculated, and the node with the highest score is taken as the second l-mer corresponding to that first l-mer. Each first l-mer in the first l-mer set thus has one corresponding second l-mer, and these second l-mers form the second l-mer set.
Step 3: determining an implant motif (l, d) from the second l-mer set according to the first scoring model.
Specifically, the second l-mers in the second l-mer set are sorted from high to low according to the scores of the first scoring model, and the reordered second l-mer set is output, so that the implantation motifs (l, d) are obtained.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a DNA dataset implantation motif search device according to an embodiment of the present invention. Another embodiment of the present invention provides a DNA data set implantation motif searching apparatus, including:
the data acquisition module is used for acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set;
the data processing module is used for obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;
the data determination module is used for determining an implant motif from the second l-mer set according to the first scoring model.
The DNA data set implantation motif searching device provided by the embodiment of the invention can execute the above method embodiment; its implementation principle and technical effect are similar and are not repeated herein.
Yet another embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements the following steps:
acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set;
obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;
an implant motif is determined from the second set of l-mers according to the first scoring model.
The computer-readable storage medium provided by the embodiment of the present invention may implement the above method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.
To illustrate the advantages of the present invention, this example verifies the APMS method on simulated data and on real data. The simulated data are mainly used to test the efficiency of the APMS method by comparing its running time with that of existing methods, and to verify whether the APMS method can find the implantation motif (l, d). The real data are mainly used to verify the effectiveness of the APMS method, that is, whether it can efficiently find real motifs in real-world biological data.
In the simulation data, three sets of simulation data sets are generated in the embodiment for comprehensive testing, and compared with the existing method, the advantages of the method APMS are verified under the three sets of simulation data sets. Wherein, the existing methods for selecting and comparing comprise FMotif, PairMotifChIP and MEME-ChIP: FMotif is a precise PMS method with highest efficiency for dealing with large data sets of DNA sequences; PairMotifChIP is a newly proposed approximate PMS method capable of coping with a large DNA sequence data set; MEME-ChIP is one of the best known motif discovery methods.
This example uses the performance coefficient mPC to measure the similarity between the predicted motif (l, d) m_p and the implanted motif (l, d) m_k, where len_overlap(m_p, m_k) represents the number of overlapping characters between the predicted motif m_p and the implanted motif m_k. mPC is calculated as follows:
Figure GDA0002997852320000261
(1) The first set of simulation data sets is used for validation testing on data with different implantation motifs (l, d). In the DNA sequence big data set, the number of DNA sequences t is 3000 and the number of characters n of each DNA sequence is 200. In the tests on the first set of simulation data sets, the implantation motif (l, d) search occupancy q is 0.5, i.e. the number of DNA sequences required is 3000 × 0.5 = 1500, and the conservative parameter g is 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are compared at different values of l and d.
TABLE 1 comparison results on the first set of simulation datasets
Figure GDA0002997852320000271
In Table 1, time denotes the running time, s denotes seconds, m denotes minutes, h denotes hours, and N indicates that the running time exceeded 48 hours, so that no prediction could be made. As can be seen from Table 1, given t, n, q and g, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods at different values of l and d. When l and d are large, the running time of the FMotif method exceeds 48 hours and no prediction can be made. The running times of the PairMotifChIP and MEME-ChIP methods are relatively stable as l and d increase. Although the running time of the APMS method increases with l and d, it remains on the order of seconds and is shorter than that of the PairMotifChIP and MEME-ChIP methods.
(2) The second set of simulation data sets is used for validation testing on data with different motif signal strengths. In the DNA sequence big data set, the number of DNA sequences t is 3000, the number of characters n of each DNA sequence is 200, and the implanted motif (l, d) is (15,5). In the tests on the second set of simulation data sets, the APMS, FMotif, PairMotifChIP and MEME-ChIP methods are compared at different values of the implantation motif (l, d) search occupancy q and the conservative parameter g. The motif signal strength depends on q and g: when q is small and g is large, the motif signal strength is small; when q is large and g is small, the motif signal strength is large.
TABLE 2 comparison results on the second set of simulation datasets
Figure GDA0002997852320000281
In Table 2, time denotes the running time, s denotes seconds, m denotes minutes, h denotes hours, and N indicates that the running time exceeded 48 hours, so that no prediction could be made. As can be seen from Table 2, given t, n, l and d, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods at different values of q and g. When the motif signal strength is small, the running time of the FMotif method exceeds 48 hours and no prediction can be made. The running times of the APMS, PairMotifChIP and MEME-ChIP methods are relatively stable, and the APMS method is faster than both the PairMotifChIP method and the MEME-ChIP method.
(3) The third set of simulation data sets is used for validation testing on DNA sequence big data sets of different scales. The number of characters n of each DNA sequence is 200, the implanted motif (l, d) is (15,5), and in the tests on the third set of simulation data sets the implantation motif (l, d) search occupancy q is 0.5 and the conservative parameter g is 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods are then compared at different values of the number of DNA sequences t.
TABLE 3 comparison results on the third set of simulation datasets
Figure GDA0002997852320000291
In Table 3, time denotes the running time, s denotes seconds, m denotes minutes, h denotes hours, and N indicates that the running time exceeded 48 hours, so that no prediction could be made. As can be seen from Table 3, given n, q, g, l and d, the APMS method runs faster than the PairMotifChIP and MEME-ChIP methods at different values of t. When the DNA sequence big data set is large, the running time of the MEME-ChIP method exceeds 48 hours and no prediction can be made, and the running time of the PairMotifChIP method grows at a higher order than that of the APMS method. Because FMotif limits the number of DNA sequences it can process to at most 3000, FMotif does not take part in the comparison on the third set of data sets.
As can be seen from Tables 1, 2 and 3, the APMS method completes the prediction of the implant motif (l, d) in the shortest time in all cases, orders of magnitude faster than the FMotif, PairMotifChIP and MEME-ChIP methods. The value of the performance coefficient mPC is 1 for all the methods, which means that they all find the implanted motifs (l, d) accurately. This is mainly because the motif information contained in the three sets of simulation data sets is quite sufficient, so that the implanted motifs (l, d) can still be found accurately even when the motif signal strength is very small.
Referring to fig. 4, fig. 4 is a graph showing the comparison results of the APMS, PairMotifChIP and MEME-ChIP methods provided in the embodiments of the present invention for different numbers of DNA sequences in the simulation data. It can be seen that the running time of the APMS method increases approximately linearly with the number of DNA sequences, while the running time of PairMotifChIP increases approximately quadratically with the number of DNA sequences; when the number of DNA sequences is 12000, the running time of the MEME-ChIP method exceeds 48 hours and no prediction can be made.
In the present embodiment, ChIP-seq data of mouse embryonic stem cells (abbreviated as mESC) are used as the real data; these ChIP-seq data are the most widely used data for verifying the validity of motif search methods. The mESC data contain 12 data sets (c-Myc, CTCF, Esrrb, Klf4, Nanog, n-Myc, Oct4, Smad1, Sox2, STAT3, Tcfcp2l1, Zfx), each named after the ChIP-ed transcription factor. When motifs are searched by the APMS method, uniform implantation motif (l, d) search parameters are used for the 12 different data sets: the implantation motif (l, d) is (13,4), the implantation motif (l, d) search occupancy q is 0.3, and the conservative parameter g is 0.5. For each data set, the top 3000 DNA sequences are taken as the input to the APMS method.
Referring to fig. 5, fig. 5 is a schematic diagram of the experimental results on real data of the method for efficiently solving the implantation motif search on a DNA sequence big data set according to an embodiment of the present invention. For each data set, the DNA sequences contained, the running time, and the published and predicted motifs in the form of sequence logos are shown, wherein the top sequence logo is the published motif and the bottom one is the predicted motif. Comparing the predicted motif with the published motif for each data set shows that the APMS method finds a predicted motif similar to the published motif on all 12 data sets, and that the running time on every data set is within 6 minutes.
It can be seen that the APMS method can be used to efficiently and effectively process large datasets of true DNA sequences.
In summary, on both the simulation data sets and the real data sets, the APMS method can process DNA sequence big data sets efficiently and effectively: it successfully finds the implant motif (l, d) or the real motif, and it runs much faster than the existing implant motif (l, d) search methods. On the simulation data sets, the running time of the APMS method is seen to increase linearly with the scale of the DNA sequence data set.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (8)

1. A method for searching for DNA data set implantation motifs, comprising:
acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set, wherein the DNA sequence big data set comprises a plurality of DNA sequences, each DNA sequence comprises a plurality of characters, and the implantation motif search parameters comprise the length l of an implantation motif and the Hamming distance d of the implantation motif;
obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameter, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set, wherein the first k-mer set comprises a plurality of k-mers, each k-mer comprises k characters, the first l-mer set comprises a plurality of first l-mers, each first l-mer comprises l characters, the second l-mer set comprises a plurality of second l-mers, and each second l-mer comprises l characters;
determining the implantation motif from the second l-mer set according to a first score model;
wherein obtaining the first l-mer set according to the first k-mer set comprises:
obtaining k-mers from the first set of k-mers;
performing expansion processing on each k-mer in the DNA sequence big data set to obtain an expanded first k-mer set;
performing redundancy removal processing on the expanded first k-mer set according to a second score model to obtain an expanded second k-mer set, wherein the second score model is used for evaluating each expanded first k-mer in the expanded first k-mer set and evaluating whether the expanded first k-mer is composed of a background sequence;
intercepting the expanded second k-mer set to obtain a first l-mer;
obtaining the first l-mer set according to the first l-mer;
wherein obtaining the second l-mer set according to the first l-mer set comprises:
constructing a binomial tree for a first l-mer in the first l-mer set;
calculating scores of all nodes of the constructed binomial tree according to the first score model, and taking the node with the highest score as a second l-mer;
performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set, wherein the second k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters;
processing the first l-mer set according to the second k-mer set to obtain a second l-mer set;
wherein the first score model represents the score of each node in the binomial tree model and is obtained according to the DNA sequence big data set and the implantation motif search parameters by the following steps:
acquiring a plurality of l-mers from the DNA sequence big data set, wherein each l-mer comprises l characters;
obtaining a sorted queue according to the Hamming distance between the plurality of l-mers and the first l-mer;
and obtaining the first score model according to the sorted queue.
2. The method of claim 1, wherein obtaining a first set of k-mers from the DNA sequence big data set and the implantation motif search parameters comprises:
acquiring length k, and acquiring a plurality of k-mers from the DNA sequence big data set according to the length k;
and acquiring a first threshold, and acquiring the first k-mer set according to the first threshold and the k-mers.
3. The method of claim 2, wherein obtaining the length k comprises:
obtaining a first expected value according to the DNA sequence big data set;
obtaining a second expected value according to the DNA sequence big data set and the implantation motif searching parameters;
and obtaining the length k according to the first expected value and the second expected value.
4. The method of claim 3, wherein obtaining the first threshold value comprises:
obtaining the number of DNA sequences from the DNA sequence big data set;
and obtaining the first threshold value according to the second expected value and the quantity of the DNA sequences.
5. The method of claim 1, wherein intercepting the expanded second k-mer set to obtain a first l-mer comprises:
obtaining an alignment sequence according to the expanded second k-mer set;
and intercepting the alignment sequence according to a preset rule to obtain the first l-mer.
6. The method of claim 2, wherein performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set comprises:
obtaining a fourth l-mer from the DNA sequence big data set, wherein the fourth l-mer comprises l characters;
acquiring a third expected value between the k-mer of the second l-mer and the k-mer of the fourth l-mer;
determining, according to the third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer of the second l-mer is less than or equal to the third expected value, the k-mer in the first k-mer set is redundant and is deleted from the first k-mer set; otherwise, the k-mer is kept in the first k-mer set; the second k-mer set is thereby obtained.
7. An apparatus for searching for an implanted motif in a DNA data set, the apparatus comprising:
a data acquisition module, configured to acquire a DNA sequence big data set and acquire implantation motif search parameters of the DNA sequence big data set, wherein the DNA sequence big data set comprises a plurality of DNA sequences, each DNA sequence comprises a plurality of characters, and the implantation motif search parameters comprise the length l of an implantation motif and the Hamming distance d of the implantation motif;
a data processing module, configured to obtain a first k-mer set according to the DNA sequence big data set and the implantation motif search parameter, obtain a first l-mer set according to the first k-mer set, and obtain a second l-mer set according to the first l-mer set, where the first k-mer set includes a plurality of k-mers, each k-mer includes k characters, the first l-mer set includes a plurality of first l-mers, each first l-mer includes l characters, the second l-mer set includes a plurality of second l-mers, and each second l-mer includes l characters;
a data determination module, configured to determine the implantation motif from the second l-mer set according to a first score model;
the data processing module is specifically configured to:
obtaining k-mers from the first set of k-mers;
performing expansion processing on each k-mer in the DNA sequence big data set to obtain an expanded first k-mer set;
performing redundancy removal processing on the expanded first k-mer set according to a second score model to obtain an expanded second k-mer set, wherein the second score model is used for evaluating each expanded first k-mer in the expanded first k-mer set and evaluating whether the expanded first k-mer is composed of a background sequence;
intercepting the expanded second k-mer set to obtain a first l-mer;
obtaining the first l-mer set according to the first l-mer;
the data processing module is further specifically configured to:
constructing a binomial tree for a first l-mer in the first l-mer set;
calculating scores of all nodes of the constructed binomial tree according to the first score model, and taking the node with the highest score as a second l-mer;
performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set, wherein the second k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters;
processing the first l-mer set according to the second k-mer set to obtain a second l-mer set;
wherein the first score model represents a score of each node in the binomial tree model, and the data determination module is specifically configured to:
acquiring a plurality of l-mers from the DNA sequence big data set, wherein each l-mer comprises l characters;
obtaining a sorted queue according to the Hamming distance between the plurality of l-mers and the first l-mer;
and obtaining the first score model according to the sorted queue.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 6.
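
Two of the simpler steps recited in the claims can be sketched as follows: collecting the k-mers that meet a threshold (claim 2) and ordering l-mers by their Hamming distance to a reference l-mer (the sorted queue of claim 1). This is a hedged illustration only. The claims derive k and the first threshold from expected values that are not specified in this section, so both are passed in as plain parameters here. Counting each k-mer once per sequence is likewise an assumption, and the names `frequent_kmers` and `rank_by_distance` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def frequent_kmers(sequences: Iterable[str], k: int, threshold: int) -> Set[str]:
    """Return k-mers that occur in at least `threshold` sequences.

    Assumption: a k-mer is counted at most once per sequence.
    """
    counts: Counter = Counter()
    for seq in sequences:
        # Distinct k-mers of this sequence (empty if the sequence is shorter than k).
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(seen)
    return {kmer for kmer, n in counts.items() if n >= threshold}


def rank_by_distance(lmers: Iterable[str], reference: str) -> List[str]:
    """Sort l-mers by ascending Hamming distance to `reference`.

    Assumes every l-mer has the same length as `reference`; ties keep input order.
    """
    return sorted(lmers, key=lambda m: sum(a != b for a, b in zip(m, reference)))
```

For example, `frequent_kmers(["ACGTAC", "TACGTT", "GGACGA"], 3, 3)` keeps only `"ACG"`, the single 3-mer present in all three sequences.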
CN201910181475.8A 2019-03-11 2019-03-11 DNA data set implantation motif searching method and device and storage medium thereof Active CN110059228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910181475.8A CN110059228B (en) 2019-03-11 2019-03-11 DNA data set implantation motif searching method and device and storage medium thereof

Publications (2)

Publication Number Publication Date
CN110059228A (en) 2019-07-26
CN110059228B (en) 2021-11-30

Family

ID=67316070

Country Status (1)

Country Link
CN (1) CN110059228B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111933215B (en) * 2020-06-08 2024-04-05 Xidian University Transcription factor binding site searching method, system, storage medium and terminal

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102651030A (en) * 2012-04-09 2012-08-29 华中科技大学 Social network association searching method based on graphics processing unit (GPU) multiple sequence alignment algorithm
CN103995988A (en) * 2014-05-30 2014-08-20 周家锐 High-throughput DNA sequencing mass fraction lossless compression system and method
CN107729762A (en) * 2017-08-31 2018-02-23 徐州医科大学 A kind of DNA based on difference secret protection model closes frequent motif discovery method
CN108664807A (en) * 2018-04-03 2018-10-16 徐州医科大学 Method based on the difference privacy DNA motif discoveries that stochastical sampling and die body are compressed

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425900A (en) * 2012-05-21 2013-12-04 上海聚类生物科技有限公司 Statistical-significance-based system capable of quickly identifying genome transcription factor binding sites
CN103514381B (en) * 2013-07-22 2016-05-18 湖南大学 Integrate the protein bio-networks motif discovery method of topological attribute and function
EP3198476A1 (en) * 2014-09-26 2017-08-02 British Telecommunications Public Limited Company Efficient pattern matching
US10726110B2 (en) * 2017-03-01 2020-07-28 Seven Bridges Genomics, Inc. Watermarking for data security in bioinformatic sequence analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A new clustering refinement algorithm for DNA motif discovery; Zhang Yipu; Journal of Xidian University; 2014-04-04; pp. 95-99 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant