CN110059228B

CN110059228B - DNA data set implantation motif searching method and device and storage medium thereof

Info

Publication number: CN110059228B
Application number: CN201910181475.8A
Authority: CN
Inventors: 于强; 张晓�
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2019-03-11
Filing date: 2019-03-11
Publication date: 2021-11-30
Anticipated expiration: 2039-03-11
Also published as: CN110059228A

Abstract

The invention relates to a DNA data set implantation motif searching method, a device and a storage medium thereof, wherein the method comprises the following steps: acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set; obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set; and determining the implantation motif from the second l-mer set according to a first scoring model. With the APMS method, the invention not only finds the implanted motif in a DNA sequence big data set, but also runs orders of magnitude faster than other implanted motif search methods.

Description

DNA data set implantation motif searching method and device and storage medium thereof

Technical Field

The invention belongs to the field of DNA sequence big data processing, and particularly relates to a DNA data set implantation motif searching method, a device and a storage medium thereof.

Background

DNA is the carrier of genetic information. Genetic information is stored in sequences over the four DNA characters, and the growth and development of organisms is in essence the transmission and expression of genetic information. As the first step in the expression of genetic information, transcription is central to regulatory mechanisms. Transcription factors bind to specific sites (about 5-20 base pairs long) in a DNA sequence, initiating the transcription of genes and controlling its efficiency. These sites are called Transcription Factor Binding Sites (TFBS), and locating TFBS is important for studying the transcriptional regulation of genes.

Quorum implanted motif search (qPMS) is one of the well-known computational models for locating TFBS in DNA sequences. Common exact qPMS methods include sample-pattern-driven exact methods and suffix-tree-based exact methods. Sample-pattern-driven exact methods, such as PMSprune, StemFinder, qPMS7, TravStrR, PMS8 and qPMS9, comprise a sample-driven stage and a pattern-driven stage: the sample-driven stage selects some reference DNA sequences as constraints so as to generate as few candidate motifs as possible, and the pattern-driven stage verifies the candidate motifs. Suffix-tree-based exact methods, such as Weeder, RISOTTO and FMotif, build a suffix tree index over the input sequences to accelerate verification of candidate motifs. Approximate qPMS methods aim to find an optimal or near-optimal motif in a short time; the most typical ones, including expectation maximization, Gibbs sampling and genetic algorithms, refine an initial motif. Among these, MEME-ChIP, which is based on expectation maximization, is one of the best-known motif discovery methods. To process large data sets efficiently, motif discovery methods based on new strategies have also been proposed, such as PairMotifChIP, which mines and merges similar substring pairs from the input DNA sequences to obtain motifs.

However, the exact qPMS methods, the approximate qPMS methods and the PairMotifChIP method share a common problem: their computational cost leads to long running times, which is a bottleneck when processing large DNA sequence data sets.

Disclosure of Invention

In order to solve the above problems in the prior art, the present invention provides a DNA data set implantation motif searching method and device, and a storage medium.

The embodiment of the invention provides a DNA data set implantation motif searching method, which comprises the following steps:

acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set;

obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif searching parameter, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;

determining the implant motif from the second set of l-mers according to a first scoring model.

In one embodiment of the present invention, obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameter includes:

acquiring length k, and acquiring a plurality of k-mers from the DNA sequence big data set according to the length k;

and acquiring a first threshold, and acquiring the first k-mer set according to the first threshold and the k-mers.

In one embodiment of the present invention, obtaining the length k includes:

obtaining a first expected value according to the DNA sequence big data set;

obtaining a second expected value according to the DNA sequence big data set and the implantation motif searching parameters;

and obtaining the length k according to the first expected value and the second expected value.

In one embodiment of the present invention, obtaining the first threshold includes:

obtaining the number of DNA sequences from the DNA sequence big data set;

and obtaining the first threshold value according to the second expected value and the quantity of the DNA sequences.

In one embodiment of the present invention, obtaining the first set of l-mers from the first set of k-mers comprises:

obtaining k-mers from the first set of k-mers;

performing expansion processing on each k-mer in the DNA sequence big data set to obtain an expanded first k-mer set;

performing redundancy removal processing on the expanded first k-mer set according to a second score model to obtain an expanded second k-mer set;

intercepting the expanded second k-mer set to obtain a first l-mer;

and obtaining the first l-mer set according to the first l-mer.

In an embodiment of the present invention, intercepting the extended second k-mer set to obtain a first l-mer, includes:

obtaining an alignment sequence according to the expanded second k-mer set;

and intercepting the alignment sequence according to a preset rule to obtain the first l-mer.

In one embodiment of the present invention, deriving a second set of l-mers from the first set of l-mers comprises:

constructing a binomial tree for a first l-mer in the first l-mer set;

calculating scores of all nodes of the constructed binomial tree according to the first score model, and taking the node with the highest score as a second l-mer;

performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set;

and processing the first l-mer set according to the second k-mer set to obtain a second l-mer set.

In an embodiment of the present invention, performing redundancy removal on the first k-mer set according to the second l-mer to obtain a second k-mer set, includes:

obtaining a fourth l-mer from the DNA sequence big data set;

acquiring a third expected value between the k-mer of the second l-mer and the k-mer of the fourth l-mer;

judging whether each k-mer in the first k-mer set is redundant according to the third expected value: when the Hamming distance d between a k-mer in the first k-mer set and a k-mer of the second l-mer is less than or equal to the third expected value, the k-mer is redundant and is deleted from the first k-mer set; otherwise, the k-mer is kept in the first k-mer set, thereby obtaining the second k-mer set.
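The redundancy test just described can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names are hypothetical, and the third expected value, whose formula is not given in this section, is simply passed in as `max_dist`.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def kmers_of(s: str, k: int) -> list[str]:
    """All substrings of length k of s, in order of starting position."""
    return [s[i:i + k] for i in range(len(s) - k + 1)]


def remove_redundant_kmers(first_kmers, second_lmer, k, max_dist):
    """Keep only the k-mers that are farther than max_dist from every
    k-mer of second_lmer; the others are treated as redundant.

    max_dist stands in for the third expected value (an assumption:
    its formula is not reproduced here, so the caller supplies it).
    """
    reference = kmers_of(second_lmer, k)
    return [x for x in first_kmers
            if all(hamming(x, r) > max_dist for r in reference)]
```

For instance, with the second l-mer `"ATTGCA"`, k = 3 and `max_dist = 1`, the k-mers `"ATT"`, `"TTG"` and `"ATG"` are dropped as redundant, while `"CCC"` is kept.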

Another embodiment of the present invention provides a DNA dataset implantation motif search apparatus, including:

the data acquisition module is used for acquiring the DNA sequence big data set and acquiring the implantation motif search parameter of the DNA sequence big data set;

the data processing module is used for obtaining the first k-mer set according to the DNA sequence big data set and the implantation motif searching parameter, obtaining the first l-mer set according to the first k-mer set, and obtaining the second l-mer set according to the first l-mer set;

a data determination module to determine the implant motif from the second set of l-mers according to the first scoring model.

Yet another embodiment of the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of any of the above.

Compared with the prior art, the invention has the beneficial effects that:

With the APMS method, the invention not only finds the implanted motif in a DNA sequence big data set, but also runs orders of magnitude faster than other implanted motif search methods.

Drawings

FIG. 1 is a schematic flow chart of a DNA data set implantation motif searching method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram illustrating searching for an implanted motif with a conventional binomial tree according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a DNA data set implantation motif searching apparatus according to an embodiment of the present invention;

FIG. 4 is a graph showing the comparison results of the APMS, PairMotifChIP and MEME-ChIP methods provided by the present invention under different DNA sequences of the simulation data;

FIG. 5 is a schematic diagram of experimental results, on real data, of the method for efficiently solving implantation motif search on a DNA sequence big data set according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to specific examples, but the embodiments of the present invention are not limited thereto.

Example one

Referring to FIG. 1, FIG. 1 is a schematic flow chart of a DNA data set implantation motif searching method according to an embodiment of the present invention. The embodiment of the invention provides a DNA data set implantation motif searching method, which comprises the following steps:

Step 1, obtaining a DNA sequence big data set and obtaining implantation motif search parameters of the DNA sequence big data set.

Step 1.1, obtaining a DNA sequence big data set.

Specifically, the DNA sequence big data set D obtained in the present embodiment includes t DNA sequences and may be expressed as $D = \{s_1, s_2, \ldots, s_t\}$, where $s_i$ represents the i-th DNA sequence and each DNA sequence comprises n characters. Each DNA sequence $s_i$ is a character string over the alphabet $\Sigma = \{A, C, G, T\}$, i.e. each DNA sequence is a string of length n composed of the characters A, C, G and T. $s_i[j]$ represents the j-th character of the i-th DNA sequence, and $s_i[j..j']$ represents the substring starting at position j and ending at position j' in the i-th DNA sequence. Here, i ranges from 0 to t-1 and j ranges from 0 to n-1.

Step 1.2, obtaining the implantation motif searching parameter of the DNA sequence big data set.

Specifically, in this embodiment, the search parameters of the implanted motif (l, d) include the length l of the implanted motif, the Hamming distance d of the implanted motif, the search ratio q of the implanted motif, and a conservative parameter g.

In this embodiment, for the implanted motif (l, d), the APMS method solves the following problem: given a DNA sequence big data set $D = \{s_1, s_2, \ldots, s_t\}$ of t DNA sequences of length n, and parameters satisfying $0 < l < n$, $0 \le d < l$ and $0 < q \le 1$, the aim is to find an l-mer (a string of length l) m such that at least qt DNA sequences $s_i$ each contain an l-mer $m_i$ that differs from m in at most d positions (mutations). The number of differing positions is defined as the Hamming distance: $d_H(m, m_i) = |\{j : 1 \le j \le l,\ m[j] \ne m_i[j]\}|$. The l-mer m is called the implanted motif (l, d), each such l-mer $m_i$ in the DNA sequence big data set is called a motif example, and the sequences in the DNA sequence big data set that do not meet the above Hamming distance condition are called background sequences. The APMS method is the DNA data set implantation motif searching method of the invention.
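The problem statement above can be made concrete with a short Python sketch that checks whether a candidate l-mer satisfies the quorum condition. This is only a brute-force illustration of the definition, not the APMS method itself; the function names are hypothetical, and every sequence is assumed to be at least l characters long.

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance d_H between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_distance(m: str, s: str) -> int:
    """Smallest Hamming distance between m and any l-mer of sequence s."""
    l = len(m)
    return min(hamming(m, s[j:j + l]) for j in range(len(s) - l + 1))


def is_quorum_motif(m: str, sequences: list[str], d: int, q: float) -> bool:
    """True if at least q*t sequences contain an l-mer within distance d of m."""
    hits = sum(1 for s in sequences if min_distance(m, s) <= d)
    return hits >= q * len(sequences)
```

With the four toy sequences `["ACGTACGT", "TTACGATT", "GGGGGGGG", "CCACTTCC"]`, the l-mer `"ACGT"` with d = 1 has a motif example in three of them, so it satisfies the condition for q = 0.75 but not for q = 1.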

Large DNA sequence data sets are advantageous for finding high-quality implanted motifs (l, d), but most existing qPMS methods are too time-consuming to complete the qPMS computation in a reasonable time. Building on the qPMS model, the APMS method of this embodiment is applicable to DNA sequence big data sets: it not only finds the implanted motif (l, d), but also runs orders of magnitude faster than existing motif search methods.

Step 2, obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set.

Step 2.1, obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif (l, d) search parameters, wherein the first k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters.

Specifically, a first k-mer set is obtained according to a DNA sequence big data set and implantation motif (l, d) searching parameters, and the method comprises the following steps:

acquiring length k, and acquiring a plurality of k-mers from a DNA sequence big data set according to the length k;

obtaining a first threshold, and obtaining the first k-mer set according to the first threshold and the k-mers.

Further, obtaining a length k comprises:

obtaining a first expected value according to the DNA sequence big data set;

obtaining a second expected value according to the DNA sequence big data set and the implantation motif (l, d) search parameters;

and obtaining the length k according to the first expected value and the second expected value.

Specifically, the present embodiment employs probability analysis to determine a suitable value of k, so that k-mers in the background sequences and in the motif examples can be better distinguished. Let $f_r(k)$ be the first expected value, representing the expected frequency of occurrence, in the DNA sequence big data set D, of a k-mer from any background sequence; let $f_m(k)$ be the second expected value, representing the expected frequency of occurrence, in D, of a k-mer from any motif example. The larger the ratio of $f_m(k)$ to $f_r(k)$, the more easily k-mers in the background sequences and in the motif examples can be distinguished by their frequency of occurrence. Thus, the present embodiment employs the following formula to determine the value of k:

where $k_{\min}$ represents the minimum value of k, and ε is a factor used to constrain the first expected value $f_r(k)$ to be less than 1. $k_{\min}$ is preferably 5, since when k is too small it is difficult to distinguish k-mers in the background sequences from those in the motif examples. ε is empirically set to 1.

In this example, the first expected value $f_r(k)$ in formula (1) is obtained from the DNA sequence big data set and is designed as follows:

Assume that the searched implanted motif (l, d) is m and that motif examples $m_1$ and $m_2$ exist in the DNA sequence big data set D. For a k-mer $x_1$ at any starting position in an arbitrary motif example $m_1$, and the k-mer $x_2$ at the same starting position in another arbitrary motif example $m_2$, let $p_k$ denote the probability that $x_1$ and $x_2$ are equal; then the second expected value $f_m(k)$ in formula (1) is designed as follows:

For $p_k$ in formula (3), which denotes the probability that k-mers $x_1$ and $x_2$ are equal, by the law of total probability $p_k$ is designed as follows:

where $Pr_i$ denotes the probability that the Hamming distance $d_H(m, m_1)$ between the implanted motif (l, d) m and the motif example $m_1$ is i ($0 \le i \le d$), and $Pr_j$ denotes the probability that the Hamming distance $d_H(m, m_2)$ between the implanted motif (l, d) m and the motif example $m_2$ is j ($0 \le j \le d$). $Pr_i$ is designed as follows:

where g represents the conservative parameter, with $0 \le g \le 1$.

In the same way, $Pr_j$ is designed as follows:

and $p_{ij}$ denotes the probability that k-mers $x_1$ and $x_2$ are equal given $d_H(m, m_1) = i$ and $d_H(m, m_2) = j$; $p_{ij}$ is designed as follows:

As can be seen from formula (7), $p_{ij}$ accumulates the product of three factors over the range 0 to $\min(i, j)$. The first factor represents the probability that a mutations fall in an arbitrary k-mer $x_1$ of the motif example $m_1$; the second factor represents the probability that the mutation positions of k-mer $x_2$ are identical to those of k-mer $x_1$; the third factor represents the probability that, when the mutation positions of $x_2$ and $x_1$ are the same, the mutated bases are also completely the same.
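One way to read the three factors is sketched below in Python. This is a reconstruction under stated assumptions, not a transcription of formulas (4) and (7): mutation positions are assumed to be uniformly distributed over the l positions of a motif example, each mutated base is assumed to be one of the three other bases with equal probability, and the distance distribution `pr` (the values $Pr_0, \ldots, Pr_d$) is supplied by the caller. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def p_ij(l: int, k: int, i: int, j: int) -> Fraction:
    """Probability that x1 == x2 given d_H(m, m1) = i and d_H(m, m2) = j,
    under the uniform-mutation assumptions stated in the lead-in."""
    total = Fraction(0)
    for a in range(min(i, j) + 1):
        # factor 1: exactly a of the i mutations of m1 fall inside the k-mer window
        f1 = Fraction(comb(k, a) * comb(l - k, i - a), comb(l, i))
        # factor 2: m2 is mutated at exactly the same a window positions
        f2 = Fraction(comb(l - k, j - a), comb(l, j))
        # factor 3: the a mutated bases coincide (3 alternatives per position)
        f3 = Fraction(1, 3) ** a
        total += f1 * f2 * f3
    return total


def p_k(l: int, k: int, pr: list) -> Fraction:
    """Law of total probability over the distances i and j, where
    pr[i] is the probability that a motif example is at distance i."""
    return sum(pr[i] * pr[j] * p_ij(l, k, i, j)
               for i in range(len(pr)) for j in range(len(pr)))
```

As a sanity check, for l = 2 and k = 1 with one mutation in each motif example, the two k-mers agree either when neither is mutated (probability 1/4) or when both are mutated to the same base (probability 1/12), and `p_ij(2, 1, 1, 1)` indeed evaluates to 1/3.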

The first expected value $f_r(k)$ and the second expected value $f_m(k)$ are calculated by formulas (2) and (3) above for k in the range 0 to l; then, according to formula (1), the k giving the largest ratio of $f_m(k)$ to $f_r(k)$ is taken as the length k of each k-mer in the first k-mer set in this example.
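The selection of k by the largest ratio can be sketched as follows. The sketch makes several assumptions: `f_m` and `f_r` are passed in as callables because formulas (2) and (3) are not reproduced here, the search starts at `k_min` rather than 0, and the role of ε in formula (1) is not modeled. The helper `background_expectation` is only a textbook estimate for uniformly random sequences and is not claimed to be the patent's formula (2).

```python
def background_expectation(t: int, n: int, k: int) -> float:
    """Assumed stand-in for f_r(k): expected number of occurrences of one
    fixed k-mer in t uniformly random sequences of length n over {A,C,G,T}."""
    return t * (n - k + 1) / 4 ** k


def choose_k(l: int, f_m, f_r, k_min: int = 5) -> int:
    """Return the k in [k_min, l] that maximizes f_m(k) / f_r(k).

    Ties are resolved in favor of the smallest k (an arbitrary choice).
    """
    return max(range(k_min, l + 1), key=lambda k: f_m(k) / f_r(k))
```

For example, with hypothetical values f_m = {5: 10, 6: 9, 7: 2} and f_r = {5: 5, 6: 3, 7: 1}, the ratios are 2, 3 and 2, so `choose_k(7, f_m.get, f_r.get)` returns 6.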

From the DNA sequence big data set D, several k-mers of length k are obtained.

Further, obtaining the first threshold comprises:

obtaining the number of DNA sequences from the DNA sequence big data set;

obtaining a first threshold value according to the second expected value and the number of DNA sequences

Specifically, in this embodiment, not all k-mers of length k are taken from the DNA sequence big data set D; instead, only the k-mers whose frequency of occurrence in D is greater than or equal to the first threshold are taken as high-frequency k-mers to generate the first k-mer set. As described above, $f_m(k)$ represents the expected frequency of occurrence in D of an arbitrary k-mer from an arbitrary motif example; if the first threshold were set directly to $f_m(k)$, multiple high-frequency k-mers corresponding to the same motif might be obtained. Thus, the first threshold is designed as $f_m(k)$ plus a variable proportional to the number t of DNA sequences, so as to avoid obtaining excessive redundant high-frequency k-mers. The first threshold of the present embodiment is designed as follows:

Further, according to the first threshold obtained from formula (8), the k-mers whose frequency is greater than or equal to the first threshold are obtained from the DNA sequences as high-frequency k-mers to generate the first k-mer set.
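The high-frequency k-mer selection can be illustrated with a small Python sketch. It assumes that "frequency of occurrence" means the total number of occurrences across all sequences, and it takes the first threshold as a plain argument because formula (8) is not reproduced here; the function names are hypothetical.

```python
from collections import Counter


def count_kmers(sequences: list[str], k: int) -> Counter:
    """Count every occurrence of every k-mer across all sequences."""
    counts = Counter()
    for s in sequences:
        for j in range(len(s) - k + 1):
            counts[s[j:j + k]] += 1
    return counts


def high_frequency_kmers(sequences: list[str], k: int, threshold: float) -> list[str]:
    """Return, in sorted order, the k-mers whose count is >= threshold."""
    counts = count_kmers(sequences, k)
    return sorted(x for x, c in counts.items() if c >= threshold)
```

On the toy input `["ACGACG", "TACGTT"]` with k = 3, the k-mer `"ACG"` occurs three times and every other k-mer once, so a threshold of 2 yields only `["ACG"]`.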

Step 2.2, obtaining a first l-mer set according to the first k-mer set, wherein the first l-mer set comprises a plurality of first l-mers, and each first l-mer comprises l characters.

Specifically, obtaining a first l-mer set according to the first k-mer set comprises:

acquiring k-mers from the first k-mer set;

expanding each k-mer in the DNA sequence big data set to obtain an expanded first k-mer set, wherein the length of each expanded k-mer in the expanded first k-mer set is 2l-k;

performing redundancy removal processing on the expanded first k-mer set according to the second score model to obtain an expanded second k-mer set;

intercepting the expanded second k-mer set to obtain a first l-mer;

and obtaining the first l-mer set according to the first l-mer.

Further, each k-mer in the DNA sequence big data set is expanded to obtain an expanded first k-mer set, wherein the length of each expanded k-mer in the expanded first k-mer set is 2l-k.

Specifically, to search for the implanted motif (l, d) through the first k-mer set, a k-mer x is first obtained from the first k-mer set. Because the starting position of k-mer x within the implanted motif (l, d) is unknown, after an occurrence of k-mer x is found in the DNA sequence big data set D, this embodiment expands it by l-k characters to both the left and the right in D, so that the expanded k-mer x becomes a character string of length 2l-k. In this way, the expanded k-mer x in the DNA sequence big data set D can cover the motif example of the implanted motif (l, d).

For example, suppose $s_i[j..j+k-1]$ is an exact occurrence of k-mer x in the DNA sequence big data set D; the corresponding expanded k-mer x in D is then $s_i[j-l+k..j+l-1]$.
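The expansion step can be sketched in Python as follows. One assumption is made that the text does not address: occurrences lying too close to either end of a sequence to be expanded by l-k characters on both sides are skipped. The function name and return format are hypothetical.

```python
def extend_occurrences(sequences: list[str], x: str, l: int) -> list[tuple[int, int, str]]:
    """For every exact occurrence s_i[j..j+k-1] of k-mer x, return
    (i, j, s_i[j-l+k .. j+l-1]), a window of length 2l-k around it."""
    k = len(x)
    pad = l - k  # characters added on each side
    out = []
    for i, s in enumerate(sequences):
        j = s.find(x)
        while j != -1:
            lo, hi = j - pad, j + k + pad  # half-open slice bounds
            if lo >= 0 and hi <= len(s):   # assumption: skip windows that overrun
                out.append((i, j, s[lo:hi]))
            j = s.find(x, j + 1)
    return out
```

With l = 6 and x = `"TTG"`, the occurrence at position 4 of `"AAGATTGCAGTT"` is expanded to `"AGATTGCAG"`, a string of length 2l-k = 9.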

Further, each k-mer x in the DNA sequence big data set D is subjected to expansion processing to obtain an expanded first k-mer set.

Further, redundancy removal is performed on the expanded first k-mer set according to a second score model to obtain an expanded second k-mer set, wherein the length of each expanded k-mer in the expanded second k-mer set is 2l-k.

In particular, if an expanded k-mer x in the expanded first k-mer set does not contain a motif example in the DNA sequence big data set D, i.e. it consists entirely of background sequence, such an expanded k-mer x will degrade the quality of the first l-mer set. Thus, before generating the first l-mer set, the present embodiment evaluates each expanded k-mer x with the designed second scoring model $score_i(y)$ to assess whether it consists of background sequence. As described above, the first expected value $f_r(k)$ represents the expected frequency of a k-mer from an arbitrary background sequence in the DNA sequence big data set D, so in this example the second scoring model is designed as follows:

As can be seen from formula (9), the smaller the score $score_i(y)$ given by the second scoring model, the more likely the expanded k-mer x is composed of background sequence; the expanded k-mers x with the smallest scores are therefore filtered out of the expanded first k-mer set, resulting in the expanded second k-mer set.

In this embodiment, by designing the second scoring model, the expanded k-mers x that are likely to be background sequence are filtered out of the expanded first k-mer set, which reduces the amount of computation in the subsequent implanted motif (l, d) search and thus the running time of the APMS method.

Further, intercepting the expanded second k-mer set to obtain a first l-mer, including:

obtaining an alignment sequence according to the expanded second k-mer set;

and intercepting the alignment sequence according to a preset rule to obtain a first l-mer.

Specifically, in this embodiment, after redundancy removal is performed on the expanded first k-mer set, the remaining expanded k-mers form the expanded second k-mer set. The expanded k-mers in the expanded second k-mer set form an alignment sequence align of length 2l-k, where r(align[i]) represents the information amount of the i-th column of align; the alignment sequence is then intercepted according to a preset rule to obtain the first l-mer. The information amount is given by a Position Weight Matrix (PWM), each column of which holds the proportions of the four characters A, C, G and T at that position of the expanded k-mers.

After the expanded k-mers in the expanded second k-mer set are right-aligned to form the alignment sequence align, a consensus sequence of length 2l-k is first obtained according to the information amount r(align[i]) of each column. The columns at the left and right ends of the consensus sequence are then compared repeatedly, and the one with the smaller information amount is removed each time, until a consensus sequence of length l remains. This consensus sequence of length l is the first l-mer.

For example, in this embodiment, suppose the length l of the implanted motif (l, d) is 6 and the length k of the k-mer is 3, and the DNA sequence big data set includes 6 expanded k-mers: AGATTGCAG, CGATTGCAG, CGATTGCAC, CGCTTGCAC, CGCTTGCAG and CTATTGTAG. The 6 expanded k-mers are first right-aligned:

AGATTGCAG
CGATTGCAG
CGATTGCAC
CGCTTGCAC
CGCTTGCAG
CTATTGTAG

forming the alignment sequence align. The information amount r(align[i]) of each column of align is:

A: 0.17, 0.00, 0.67, 0.17, 0.00, 0.17, 0.00, 1.00, 0.00
C: 0.83, 0.00, 0.33, 0.00, 0.00, 0.00, 0.83, 0.00, 0.33
G: 0.00, 0.83, 0.00, 0.00, 0.00, 0.66, 0.00, 0.00, 0.67
T: 0.00, 0.17, 0.00, 0.83, 1.00, 0.17, 0.17, 0.00, 0.00

A consensus sequence, CGATTGCAG, is then obtained according to the information amount r(align[i]) of each column. Interception proceeds from both ends of the consensus sequence. In the leftmost column the character with the largest proportion is C (0.83), and in the rightmost column it is G (0.67); since the proportion of C in the leftmost column is greater than that of G in the rightmost column, the leftmost column is retained and the rightmost column is deleted. Next, the leftmost column still has C (0.83), while the new rightmost column has A (1.00); since the proportion of C is smaller than that of A, the rightmost column is retained and the leftmost column is deleted. This continues until the consensus sequence is truncated to an l-mer of length l, namely ATTGCA, which is the first l-mer.
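The consensus-and-trim procedure of this example can be sketched in Python. The sketch takes the information amount of a column to be the proportion of its most frequent character, as in the comparison above, and it assumes that a tie between the two end columns is resolved by deleting the rightmost column, which the text does not specify. The function names are hypothetical.

```python
def column_profile(aligned: list[str]) -> list[dict[str, float]]:
    """Per-column proportions of A, C, G and T (a simple position weight matrix)."""
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("aligned strings must have equal length")
    n = len(aligned)
    return [{b: sum(s[i] == b for s in aligned) / n for b in "ACGT"}
            for i in range(width)]


def trim_consensus(aligned: list[str], l: int) -> str:
    """Build the consensus, then drop the weaker end column until length l remains."""
    profile = column_profile(aligned)
    consensus = [max("ACGT", key=col.get) for col in profile]
    weight = [col[b] for col, b in zip(profile, consensus)]
    left, right = 0, len(consensus) - 1
    while right - left + 1 > l:
        if weight[left] < weight[right]:
            left += 1    # left end is weaker: delete it
        else:
            right -= 1   # right end is weaker, or a tie (assumed rule)
    return "".join(consensus[left:right + 1])
```

Applied to the six aligned strings above with l = 6, the sketch yields the consensus CGATTGCAG and trims it to ATTGCA, the same first l-mer as in the example.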

Further, the k-mers in the first k-mer set are traversed, and the first l-mer of each k-mer in the DNA sequence big data set is found, forming the first l-mer set.

Step 2.3, obtaining a second l-mer set according to the first l-mer set, wherein the second l-mer set comprises a plurality of second l-mers, and each second l-mer comprises l characters.

Specifically, obtaining a second l-mer set according to the first l-mer set comprises:

constructing a binomial tree for a first l-mer in the first l-mer set;

calculating scores of all nodes of the binomial tree according to the first score model, and taking the node with the highest score as a second l-mer;

Further, constructing a binomial tree for a first l-mer in the first l-mer set, comprising:

selecting the first l-mer as a root node of the binomial tree;

sequentially generating the (i+1)-th layer of the binomial tree from the i-th layer, and judging whether the number of nodes in the (i+1)-th layer is greater than a second threshold: if so, selecting the nodes of the (i+1)-th layer of the final binomial tree according to the first score model, so that the number of nodes in that layer equals the second threshold; otherwise, keeping all nodes of the (i+1)-th layer, where 0 < i < d;

judging whether a node of the i-th layer of the binomial tree is an implanted motif (l, d): if so, storing the node in a first array M; if not, not storing the node in the first array M, where 0 < i < d;

and according to the node scores in the first array M, taking the node with the highest score as a second l-mer.

Specifically, referring to FIG. 2, FIG. 2 is a schematic diagram illustrating searching for an implanted motif with a conventional binomial tree according to an embodiment of the present invention. As can be seen from FIG. 2, in the conventional construction, the root node of the binomial tree is a first l-mer in the first l-mer set, and each internal or leaf node in the i-th layer is a node whose Hamming distance from the root's first l-mer is i, where 0 < i < d; the depth of the binomial tree is d. Each layer of the binomial tree contains several expansion nodes. The expansion nodes are d-neighbors of the root's first l-mer, and each differs from the first l-mer at the positions marked on the path from the root node to that internal or leaf node. Thus, each node in the binomial tree represents a d-neighbor whose Hamming distance from the first l-mer is i ($0 \le i \le d$). All expansion nodes are l-mers of length l.

In this embodiment, a binomial tree is constructed with a first l-mer in the first l-mer set as its root node. The (i+1)-th layer of the binomial tree is then generated from the i-th layer, and it is judged whether the number of nodes in the (i+1)-th layer is greater than a second threshold: if so, the nodes of the (i+1)-th layer are selected according to the first score model so that the number of nodes in that layer equals the second threshold; otherwise, all nodes of the (i+1)-th layer are kept, where 0 < i < d.
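The layer-by-layer construction with a per-layer cap can be sketched in Python as follows. The sketch rests on assumptions: each child is generated by mutating one position to the right of its parent's last mutated position (one common way to enumerate every d-neighbor exactly once), `cap(i)` stands in for the second threshold of layer i, and `score` stands in for the first score model; neither formula is reproduced here, and all names are hypothetical.

```python
def expand_layer(layer):
    """Children of each (l-mer, last_mutated_index) node: change one later position."""
    children = []
    for lmer, last in layer:
        for pos in range(last + 1, len(lmer)):
            for base in "ACGT":
                if base != lmer[pos]:
                    children.append((lmer[:pos] + base + lmer[pos + 1:], pos))
    return children


def build_layers(root: str, d: int, cap, score):
    """Return layers 0..d of the tree rooted at `root`.

    A layer larger than cap(i) is cut down to its cap(i) best-scoring nodes.
    """
    layers = [[(root, -1)]]
    for i in range(1, d + 1):
        nxt = expand_layer(layers[-1])
        limit = cap(i)
        if len(nxt) > limit:
            nxt = sorted(nxt, key=lambda node: score(node[0]), reverse=True)[:limit]
        layers.append(nxt)
    return layers
```

Without a cap, a root of length 5 produces 15, 90 and 270 nodes in layers 1 to 3; with per-layer caps of 15, 6 and 2, only that many nodes survive in each layer, which is the pruning effect the second threshold is meant to achieve.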

Specifically, let the second threshold be $N_{mm}(i)$, which represents the number of nodes in the i-th (0 < i < d) layer of the binomial tree. To avoid missing expansion nodes in any layer of the binomial tree, when $N_{mm}(i)$ is calculated the number of nodes in the i-th layer is multiplied by a safety factor α (α ≥ 1). In the implementation of the APMS method, α is empirically set to a preferred value of 2. $N_{mm}(i)$ is designed as follows:

For example, when the binomial tree is constructed in this embodiment, suppose the length l of the implanted motif (l, d) is 5 and the Hamming distance d is 3. The root node of the binomial tree is the first l-mer. The nodes of the first layer are the l-mers whose Hamming distance from the root's first l-mer is 1; since the implanted motif (l, d) is an l-mer of length 5 and each position has 3 possible mutations, the first layer in this embodiment takes all mutations of the root's first l-mer, i.e. the first layer has 15 nodes. The nodes of the second layer are expanded from the first-layer nodes, each expanded node being at Hamming distance 1 from its parent, and the number of second-layer nodes is determined by formula (10) as $C_3^2 \cdot 2 = 6$. Similarly, the nodes of the third layer are expanded from the second-layer nodes (6 in number), each expanded node being at Hamming distance 1 from its parent, and the number of third-layer nodes is determined by formula (10) as $C_3^3 \cdot 2 = 2$. The binomial tree finally constructed is thus a tree with the first l-mer as its root node, 15 nodes in the first layer, 6 nodes in the second layer and 2 nodes in the third layer.

And further, obtaining nodes of the (i + 1) th layer of the final binomial tree according to the first score model, wherein the number of the nodes of the layer is equal to a second threshold value.

Specifically, in this embodiment, a first score model is designed under the qPMS model to evaluate the score of each node y of the constructed binomial tree. Here D'(y) is a set of qt DNA sequences selected from the DNA sequence big data set D and used to calculate the score of node y, and s is the l-mer in a given DNA sequence that has the minimum Hamming distance to node y. In general, the higher the score of a node y, the closer y is to the implanted motif (l, d). The first score model of this embodiment is designed as follows:

As shown in formula (11), in the binomial tree constructed from any first l-mer, the conventional method evaluates the score of each node y as follows. The score of node y is first calculated in each of the t DNA sequences of the DNA sequence big data set: within each DNA sequence, the l-mer with the minimum Hamming distance to node y is found, and its score is taken as the score of that DNA sequence. The qt DNA sequences with the highest scores are then selected, and their scores are summed to give the score of node y. Each node y thus has a corresponding score score_n(y), and the node y with the highest score is selected as the second l-mer.
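The conventional scoring procedure described here can be sketched as follows. Formula (11) is not reproduced in this text, so the per-sequence score used below (l minus the minimum Hamming distance) is an assumption chosen only so that closer matches score higher; the function names and the toy sequences are likewise hypothetical.

```python
import math

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def min_distance(y, seq):
    """Minimum Hamming distance between y and any l-mer of seq."""
    l = len(y)
    return min(hamming(y, seq[i:i + l]) for i in range(len(seq) - l + 1))

def node_score(y, sequences, q):
    """Sum the per-sequence scores of the qt best-scoring sequences."""
    l = len(y)
    # ASSUMPTION: per-sequence score = l - minimum distance (stand-in for formula (11)).
    per_sequence = [l - min_distance(y, seq) for seq in sequences]
    keep = math.ceil(q * len(sequences))  # qt sequences
    return sum(sorted(per_sequence, reverse=True)[:keep])

sequences = ["ACGTACGT", "TTTTACGA", "GGGGGGGG", "CCCCCCCC"]
print(node_score("ACGT", sequences, 0.5))
```

Every call rescans all t sequences, which is the cost that the queue-sequence approach of this embodiment is meant to avoid.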

However, the conventional method has the drawback that the DNA sequence big data set must be rescanned each time the score of a node y is calculated, which is computationally expensive. To solve this problem, in this embodiment all l-mers in each DNA sequence are sorted in ascending order of their Hamming distance to the first l-mer to obtain a queue sequence. In such a queue sequence, the l-mers nearer the front are the most likely to yield the second l-mer finally obtained. Calculating the second l-mer through the queue sequence reduces the calculation cost, because the best-scoring l-mer in a DNA sequence can generally be found by scanning only the first few l-mers of the queue sequence. D'(y) in formula (11) is the set of qt DNA sequences selected from the DNA sequence big data set D for calculating the score of node y; in this embodiment, D'(y) is expressed as:

After the queue sequences are obtained, that is, after all l-mers in each DNA sequence have been sorted in ascending order of their Hamming distance to the first l-mer, the l-mers with the smallest Hamming distance stand at the front of each queue sequence. The l-mers obtained from each DNA sequence, arranged in ascending order of Hamming distance, form a new queue sequence, and a row of the new queue sequence is called C_i. In this embodiment, for a first l-mer m', a d-neighbor y of m', a row C_i, and a position j in C_i (1 ≤ j ≤ |C_i|), if d_H(C_i[j], m') - d_H(y, m') > 0, then d_H(C_i[j], m') - d_H(y, m') is the smallest possible value of d_H(y, C_i[j]). Thus, when scores are scanned and calculated based on the new queue sequence, the scan of a row C_i can be ended as soon as d_H(C_i[j], m') - d_H(y, m') ≥ dis(y, C_i[j]) is satisfied, where dis(y, C_i[j]) is the minimum Hamming distance found in the current row C_i. dis(y, C_i[j]) is substituted into formula (11) to obtain the score score_n(y) of node y on row C_i, and scanning of the next row C_{i+1} begins. Once all rows of the new queue sequence have been scanned, the highest of the row scores is taken as the score score_n(y) of node y.
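The early-termination rule follows from the triangle inequality, d_H(y, C_i[j]) ≥ d_H(C_i[j], m') - d_H(y, m'), and can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the tuple layout of a row, and the sample strings are hypothetical, and the sketch returns only the minimum distance for one row rather than the score of formula (11).

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def build_row(seq, m_prime):
    """Row C_i: (distance to m', l-mer) pairs of one sequence, ascending by distance."""
    l = len(m_prime)
    lmers = (seq[i:i + l] for i in range(len(seq) - l + 1))
    return sorted((hamming(s, m_prime), s) for s in lmers)

def min_distance_pruned(y, m_prime, row):
    """Minimum distance from y to any l-mer in row, stopping once the bound is met."""
    d_y_m = hamming(y, m_prime)
    best = len(y) + 1  # larger than any possible Hamming distance
    scanned = 0
    for d_c_m, candidate in row:
        # Lower bound on d_H(y, candidate); later entries have an even larger bound.
        if d_c_m - d_y_m >= best:
            break
        scanned += 1
        best = min(best, hamming(y, candidate))
    return best, scanned

row = build_row("ACGTTGCAACGGTACC", "ACGTT")
best, scanned = min_distance_pruned("ACGTA", "ACGTT", row)
print(best, scanned, len(row))
```

Because each row is sorted by distance to m', the scan can stop without losing the true minimum, so the result agrees with a full scan of the row.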

The score score_n(y) is calculated as above for every node of the (i + 1)-th layer of the binomial tree. The resulting scores are sorted, and the highest-scoring nodes, up to the number given by the second threshold, are selected as the final nodes of the (i + 1)-th layer, so that the number of nodes in this layer equals the second threshold.

In this embodiment, when the binomial tree is constructed from a first l-mer in the first l-mer set, high-scoring nodes are selected according to the first score model to generate expansion nodes. Because a high-scoring node is more likely to be the implanted motif (l, d), expansion nodes are generated in the direction of the implanted motif (l, d), which reduces the amount of computation in the subsequent search for the implanted motif (l, d) and thus the running time of the APMS method.

Further, it is judged whether a node of the i-th layer of the binomial tree is an implanted motif (l, d); if so, the node is stored in the first array M, and if not, the node is not stored in the first array M.

Specifically, this embodiment does not search all d-neighbor nodes of the binomial tree for the implanted motif (l, d) as the conventional method does, but searches only nodes that are similar to the implanted motif (l, d). To judge whether a node of the i-th layer of the binomial tree is an implanted motif (l, d), the node is compared against the DNA sequence big data set to determine whether at least qt DNA sequences each contain an l-mer whose Hamming distance to the node is less than or equal to d. If so, the node is an implanted motif (l, d) and is stored in the first array M; otherwise the node is not an implanted motif (l, d) and is not stored in the first array M. The Hamming distance between a node of the i-th layer and each of its expansion nodes in the (i + 1)-th layer is 1, where 0 < i < d.
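The membership test for the first array M can be sketched as follows. The function names and toy sequences are hypothetical, and qt is taken as the ceiling of q × t, which is an assumption about rounding that the text does not state.

```python
import math

def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def has_occurrence(node, seq, d):
    """True if seq contains an l-mer within Hamming distance d of node."""
    l = len(node)
    return any(hamming(node, seq[i:i + l]) <= d for i in range(len(seq) - l + 1))

def is_implanted_motif(node, sequences, d, q):
    """True if at least qt sequences contain an l-mer within distance d of node."""
    needed = math.ceil(q * len(sequences))  # ASSUMPTION: qt rounded up
    hits = 0
    for seq in sequences:
        if has_occurrence(node, seq, d):
            hits += 1
            if hits >= needed:
                return True  # stop early once qt sequences are confirmed
    return hits >= needed

sequences = ["ACGTACGT", "TTTTACGA", "GGGGGGGG", "CCCCCCCC"]
print(is_implanted_motif("ACGT", sequences, 1, 0.5))
print(is_implanted_motif("ACGT", sequences, 0, 0.5))
```

With d = 1 the first two toy sequences each contain a qualifying l-mer, so the node passes for q = 0.5; with d = 0 only the first sequence qualifies and the node is rejected.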

And further, according to the node scores in the first array M, taking the node with the highest score as a second l-mer.

Specifically, the nodes in the first array M are the nodes selected for the first l-mer that are close to the implanted motif (l, d). The node with the highest score in the first array M is the one most likely to be the implanted motif being searched for, and it is taken as the second l-mer.

Further, traversing each first l-mer in the first l-mer set, constructing the binomial tree model to obtain a second l-mer, obtaining a second l-mer set according to the second l-mer, and obtaining a final implantation motif (l, d) through the second l-mer set.

Specifically, a binomial tree model is constructed for each first l-mer in the first l-mer set, and the first array M of each binomial tree model, rooted at that first l-mer, is calculated according to the first score model. The node with the highest score in the first array M is selected as the second l-mer of that first l-mer. The second l-mers obtained from all first l-mers in the first l-mer set form the second l-mer set. The second l-mers in the second l-mer set are scored according to the first score model and sorted from high to low, and the sorted node set is output as the final implanted motif (l, d).

In summary, the search for implanted motifs (l, d) based on the binomial tree proceeds layer by layer, starting from the first l-mer at the root node. For the root node, it is first judged whether the first l-mer is an implanted motif (l, d), and all nodes at Hamming distance 1 from the first l-mer are taken as the expansion nodes of layer 1. For the i-th layer (0 < i < d), the N_mm(i) highest-scoring nodes are selected from the expansion nodes of that layer as its final nodes, and the expansion nodes at Hamming distance 1 from each of these N_mm(i) selected nodes are taken as the nodes of the (i + 1)-th layer. For the d-th layer, it is judged directly whether each node of the layer is an implanted motif (l, d). Every expansion node of every layer is judged in this way: if it is an implanted motif (l, d), it is stored in the first array M; if not, it is not stored in the first array M. If, during the search, the first array M of the binomial tree constructed from a first l-mer contains several implanted motifs (l, d), the node with the highest score in M is selected as the second l-mer. A second l-mer is obtained from each first l-mer in the first l-mer set, the second l-mers form the second l-mer set, the second l-mers in the set are sorted from high to low by score, and the sorted node set is output as the final implanted motif (l, d).
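The layer-by-layer search summarized above can be sketched as a beam search. Several parts are assumptions: the score is again l minus the minimum distance summed over the qt best sequences (formula (11) is not reproduced here), `beam_width(i)` is a caller-supplied stand-in for N_mm(i), and the expansion step simply takes every unseen 1-neighbor, so it does not reproduce the exact node counts of the binomial tree.

```python
import heapq
import math

ALPHABET = "ACGT"

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def best_distances(y, sequences, q):
    """The qt smallest per-sequence minimum distances from y."""
    l = len(y)
    dists = [min(hamming(y, s[i:i + l]) for i in range(len(s) - l + 1))
             for s in sequences]
    return sorted(dists)[:math.ceil(q * len(sequences))]

def score(y, sequences, q):
    # ASSUMPTION: stand-in for formula (11).
    return sum(len(y) - dist for dist in best_distances(y, sequences, q))

def is_motif(y, sequences, d, q):
    return all(dist <= d for dist in best_distances(y, sequences, q))

def one_neighbors(y):
    return [y[:p] + b + y[p + 1:] for p in range(len(y)) for b in ALPHABET if b != y[p]]

def search_second_lmer(root, sequences, d, q, beam_width):
    """Return the best-scoring motif found within d layers of root, or None."""
    found = []              # plays the role of the first array M
    layer, seen = [root], {root}
    for depth in range(d + 1):
        # Every node of the current layer is judged before any pruning.
        found.extend(y for y in layer if is_motif(y, sequences, d, q))
        if depth == d:
            break
        if depth > 0:       # layers 0 < i < d keep only the top-scoring nodes
            layer = heapq.nlargest(beam_width(depth), layer,
                                   key=lambda y: score(y, sequences, q))
        next_layer = []
        for y in layer:
            for z in one_neighbors(y):
                if z not in seen:
                    seen.add(z)
                    next_layer.append(z)
        layer = next_layer
    return max(found, key=lambda y: score(y, sequences, q)) if found else None
```

A call such as `search_second_lmer("ACGTTT", sequences, 2, 1.0, lambda i: 18)` starts from a hypothetical first l-mer and keeps up to 18 nodes per intermediate layer; smaller beam widths trade completeness for speed, which is the role the safety factor α plays in N_mm(i).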

Further, performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set, including:

acquiring a fourth l-mer from the DNA sequence big data set;

and judging, according to the third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer is redundant and is deleted from the first k-mer set to obtain a second k-mer set; otherwise, the k-mer is kept in the first k-mer set to obtain the second k-mer set.

Specifically, the first k-mer set may contain redundant k-mers. A redundant k-mer either corresponds to a substring at the same starting position in a second l-mer, or overlaps a second l-mer over a length k' (k_min ≤ k' < k). On this basis, in this embodiment, each time a first l-mer is to be obtained from a k-mer in the first k-mer set, the most recently generated second l-mer is first used to judge whether the k-mer is redundant. If the k-mer is redundant, it is deleted from the first k-mer set to obtain a second k-mer set; if it is not redundant, it is kept in the first k-mer set to obtain the second k-mer set. The second k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters.

Let the third expected value e(k) denote the expected Hamming distance between the k-mer at an arbitrary starting position in an arbitrary motif instance and the k-mer at the same starting position in the implanted motif (l, d). This embodiment obtains a fourth l-mer, comprising l characters, from the DNA sequence big data set D; in calculating e(k), the fourth l-mer serves as the motif instance and the second l-mer serves as the implanted motif (l, d). e(l) is calculated from the total probability formula. Assuming that each mutated position between the fourth l-mer and the second l-mer falls randomly on one of the l positions, the third expected value e(k) equals e(l) multiplied by k/l. The third expected value e(k) of this embodiment is designed as follows:

In this embodiment, a k-mer x in the first k-mer set is defined as redundant if there is a k-mer z in the second l-mer such that d_H(z, x) ≤ e(k), that is, the Hamming distance between x and z is less than or equal to the third expected value e(k). A redundant k-mer is deleted from the first k-mer set and no implanted motif (l, d) search is performed for it; otherwise the k-mer is kept in the first k-mer set and the implanted motif (l, d) search is performed as above. A redundant k-mer x can be further defined as follows: let pf(x, k') and sf(x, k') denote the prefix and the suffix of length k' of the string x, respectively; x is redundant if there exists k' with k_min ≤ k' < k such that d_H(pf(z, k'), sf(x, k')) ≤ e(k') or d_H(sf(z, k'), pf(x, k')) ≤ e(k').
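Both redundancy conditions can be sketched as below. The formula for e(l) is not reproduced in this text, so `e_l` is a caller-supplied value and only the stated relation e(k) = e(l) · k / l is used. In the overlap test, z is read here as the second l-mer itself, which is an interpretation of the text; all function names and sample strings are hypothetical.

```python
def hamming(a, b):
    """Hamming distance between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def make_expected(e_l, l):
    """e(k) = e(l) * k / l, with e(l) supplied by the caller (assumed known)."""
    return lambda k: e_l * k / l

def is_redundant(x, second_lmer, expected, k_min):
    """Apply the two redundancy conditions to k-mer x against one second l-mer."""
    k, l = len(x), len(second_lmer)
    # Condition 1: some k-mer z inside the second l-mer is within e(k) of x.
    for start in range(l - k + 1):
        if hamming(second_lmer[start:start + k], x) <= expected(k):
            return True
    # Condition 2: x overlaps either end of the second l-mer over k' characters.
    for k_prime in range(k_min, k):
        prefix_hit = hamming(second_lmer[:k_prime], x[-k_prime:]) <= expected(k_prime)
        suffix_hit = hamming(second_lmer[-k_prime:], x[:k_prime]) <= expected(k_prime)
        if prefix_hit or suffix_hit:
            return True
    return False

expected = make_expected(2.0, 10)  # hypothetical e(l) = 2 for l = 10
print(is_redundant("GTACG", "ACGTACGTAC", expected, 3))
print(is_redundant("TTTTT", "ACGTACGTAC", expected, 3))
```

In the toy call, "GTACG" is an exact substring of the second l-mer and is flagged as redundant, while "TTTTT" is far from every substring and every overlap and is kept.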

In this embodiment, redundancy removal is performed on the first k-mer set by means of the third expected value e(k), which reduces the amount of computation in the subsequent search for the implanted motif (l, d) and thus the running time of the APMS method.

And further, processing the first l-mer set according to the second k-mer set to obtain a second l-mer set.

Specifically, after redundancy removal is performed on the first k-mer set, a second k-mer set is obtained, and the first k-mer set is updated with the second k-mer set. In the APMS method of this embodiment, once a redundant k-mer has been deleted from the first k-mer set, it is no longer fetched from the set and no first l-mer is derived from it. Each iteration of the APMS method therefore fetches a k-mer from the first k-mer set, obtains a first l-mer from the k-mer, constructs a binomial tree from the first l-mer, obtains a second l-mer from the binomial tree, removes redundant k-mers from the first k-mer set according to the second l-mer to obtain a second k-mer set, and updates the first k-mer set with the second k-mer set; the next k-mer is then fetched from the updated first k-mer set and the process repeats. For the first l-mer set, a binomial tree is constructed for each first l-mer, the score of each node in the binomial tree is calculated, and the node with the highest score is taken as the second l-mer corresponding to that first l-mer. Each first l-mer in the first l-mer set thus has one corresponding second l-mer, and these form the second l-mer set.
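The iteration described above can be sketched as a loop over a shrinking k-mer list. The three callables passed in (`build_first_lmer`, `find_second_lmer`, `is_redundant`) are hypothetical placeholders for the corresponding steps of the method; the sketch shows only the control flow, not the steps themselves.

```python
def collect_second_lmers(kmers, build_first_lmer, find_second_lmer, is_redundant):
    """Process k-mers one by one, dropping those made redundant by each result."""
    remaining = list(kmers)   # the first k-mer set, updated in place of a copy
    second_lmers = []         # the second l-mer set being built
    while remaining:
        kmer = remaining.pop(0)
        first_lmer = build_first_lmer(kmer)
        second_lmer = find_second_lmer(first_lmer)
        if second_lmer is None:
            continue          # no motif found under this first l-mer
        second_lmers.append(second_lmer)
        # Redundancy removal: the filtered list is the "second k-mer set",
        # which then replaces the first k-mer set for the next iteration.
        remaining = [x for x in remaining if not is_redundant(x, second_lmer)]
    return second_lmers

# Toy stand-ins, used only to show the control flow.
result = collect_second_lmers(
    ["ACG", "CGA", "TTT"],
    build_first_lmer=lambda kmer: kmer + "AA",
    find_second_lmer=lambda lmer: lmer,
    is_redundant=lambda x, lmer: x in lmer,
)
print(result)
```

In the toy run, "CGA" is a substring of the first result "ACGAA" and is skipped, so only two of the three k-mers lead to a tree search.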

Step 3, determining an implanted motif (l, d) from the second l-mer set according to the first scoring model.

Specifically, the second l-mers in the second l-mer set are sorted from high to low according to the scores of the first scoring model, and the reordered second l-mer set is output, so that the implantation motifs (l, d) are obtained.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a DNA dataset implantation motif search device according to an embodiment of the present invention. Another embodiment of the present invention provides a DNA data set implantation motif searching apparatus, including:

the data acquisition module is used for acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set;

the data processing module is used for obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;

a data determination module that determines an implant motif from the second set of l-mers based on the first scoring model.

The DNA data set implantation motif searching device provided by this embodiment of the invention can execute the above method embodiments; its implementation principle and technical effect are similar and are not repeated here.

Yet another embodiment of the present invention provides a computer-readable storage medium having a computer program stored thereon, where the computer program when executed by a processor implements the following steps:

obtaining a first k-mer set according to a DNA sequence big data set and implantation motif search parameters, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set;

an implant motif is determined from the second set of l-mers according to the first scoring model.

The computer-readable storage medium provided by the embodiment of the present invention may implement the above method embodiments, and the implementation principle and technical effect are similar, which are not described herein again.

To illustrate the advantages of the present invention, this example verifies the APMS method on simulated data and on real data. The simulated data are mainly used to test the efficiency of the APMS method by comparing its running time with that of existing methods, and to verify whether the APMS method can find the implanted motif (l, d). The real data are mainly used to verify the effectiveness of the APMS method, that is, whether it can efficiently find real motifs in real-world biological data.

For the simulated data, three sets of simulation data sets are generated in this embodiment for comprehensive testing, and the advantages of the APMS method are verified on them by comparison with existing methods. The existing methods selected for comparison are FMotif, PairMotifChIP and MEME-ChIP: FMotif is the most efficient exact PMS method for large DNA sequence data sets; PairMotifChIP is a recently proposed approximate PMS method capable of handling large DNA sequence data sets; MEME-ChIP is one of the best-known motif discovery methods.

This example uses the performance coefficient mPC to measure the similarity between the predicted motif (l, d) m_p and the implanted motif (l, d) m_k, where len_overlap(m_p, m_k) denotes the number of overlapping characters between the predicted motif m_p and the implanted motif m_k. mPC is calculated as follows:

(1) The first set of simulation data sets was used for validation testing on data with different motifs (l, d). In the DNA sequence big data set, the number of DNA sequences t is 3000 and the number of characters per DNA sequence n is 200. In the tests on the first set, the implanted motif (l, d) search ratio q is 0.5, that is, the number of DNA sequences required is 3000 × 0.5 = 1500, and the conservative parameter g is 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods were compared at different values of l and d.

TABLE 1 comparison results on the first set of simulation datasets

In Table 1, time denotes the running time, s seconds, m minutes and h hours; N indicates that the running time exceeded 48 hours, so that no prediction was obtained. As can be seen from Table 1, for given t, n, q and g, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods at all tested values of l and d. When l and d are large, the FMotif method exceeds 48 hours without producing a prediction. The running times of the PairMotifChIP and MEME-ChIP methods remain relatively stable as l and d increase. Although the running time of the APMS method grows with l and d, it stays on the order of seconds and is shorter than that of the PairMotifChIP and MEME-ChIP methods.

(2) The second set of simulation data sets was used for validation testing on data with different motif signal strengths. In the DNA sequence big data set, the number of DNA sequences t is 3000, the number of characters per DNA sequence n is 200, and the implanted motif (l, d) is (15, 5). The APMS, FMotif, PairMotifChIP and MEME-ChIP methods were compared at different values of the implanted motif (l, d) search ratio q and the conservative parameter g. The motif signal strength depends on q and g: when q is small and g is large, the motif signal is weak; when q is large and g is small, the motif signal is strong.

TABLE 2 comparison results on the second set of simulation datasets

In Table 2, time denotes the running time, s seconds, m minutes and h hours; N indicates that the running time exceeded 48 hours, so that no prediction was obtained. As can be seen from Table 2, for given t, n, l and d, the APMS method runs faster than the FMotif, PairMotifChIP and MEME-ChIP methods at all tested values of q and g. When the motif signal is weak, the FMotif method exceeds 48 hours without producing a prediction. The running times of the APMS, PairMotifChIP and MEME-ChIP methods are relatively stable, and the APMS method is faster than both the PairMotifChIP method and the MEME-ChIP method.

(3) The third set of simulation data sets was used for validation testing on DNA sequence big data sets of different scales. The number of characters per DNA sequence n is 200, the implanted motif (l, d) is (15, 5), the implanted motif (l, d) search ratio q is 0.5 and the conservative parameter g is 0.5. The APMS, FMotif, PairMotifChIP and MEME-ChIP methods were compared at different values of the number of DNA sequences t.

TABLE 3 comparison results on the third set of simulation datasets

In Table 3, time denotes the running time, s seconds, m minutes and h hours; N indicates that the running time exceeded 48 hours, so that no prediction was obtained. As can be seen from Table 3, for given n, q, g, l and d, the APMS method runs faster than the PairMotifChIP and MEME-ChIP methods at all tested values of t. When the DNA sequence big data set is large, the MEME-ChIP method exceeds 48 hours without producing a prediction, and the running time of the PairMotifChIP method grows at a higher order than that of the APMS method. Because FMotif limits the number of DNA sequences it can process to 3000, FMotif does not take part in the comparison on the third set of data sets.

As can be seen from Tables 1, 2 and 3, the APMS method completes the prediction of the implanted motif (l, d) in the shortest time in all cases, orders of magnitude faster than the FMotif, PairMotifChIP and MEME-ChIP methods. The performance coefficient mPC is 1 for all methods, meaning that each of them finds the implanted motif (l, d) exactly. This is mainly because the three sets of simulation data sets contain ample motif information, so that the implanted motif (l, d) can still be found exactly even when the motif signal is very weak.

Referring to fig. 4, fig. 4 shows the comparison of the APMS, PairMotifChIP and MEME-ChIP methods on simulated data with different numbers of DNA sequences, according to an embodiment of the invention. The running time of the APMS method grows approximately linearly with the number of DNA sequences, whereas the running time of PairMotifChIP grows approximately quadratically. For the MEME-ChIP method, the running time exceeds 48 hours when the number of DNA sequences is 12000, and no prediction is obtained.

In this embodiment, ChIP-seq data of mouse embryonic stem cells (mESC) are used as the real data; these are the most widely used data for verifying the validity of motif search methods. The mESC data comprise 12 data sets (c-Myc, CTCF, Esrrb, Klf4, Nanog, n-Myc, Oct4, Smad1, Sox2, STAT3, Tcfcp2l1, Zfx), each named after the ChIP-ed transcription factor. When motifs were searched by the APMS method, the same implanted motif (l, d) search parameters were used for all 12 data sets: the implanted motif (l, d) is (13, 4), the search ratio q is 0.3, and the conservative parameter g is 0.5. For each data set, the top 3000 DNA sequences were taken as input to the APMS method.

Referring to fig. 5, fig. 5 is a schematic diagram of the experimental results of the method on real data according to an embodiment of the invention. For each data set, the figure shows the number of DNA sequences contained, the running time, and the published and predicted motifs in sequence-logo form, with the published motif at the top and the predicted motif at the bottom. Comparing the predicted motif with the published motif for each data set shows that the APMS method finds a predicted motif similar to the published motif on all 12 data sets, and that the running time on every data set is within 6 minutes.

It can be seen that the APMS method can be used to efficiently and effectively process large datasets of true DNA sequences.

In summary, on both simulation data sets and real data sets, the APMS method processes DNA sequence big data sets efficiently and effectively. It successfully finds the implanted motif (l, d) or the real motif, and runs much faster than existing implanted motif (l, d) search methods. On the simulation data sets, the running time of the APMS method grows approximately linearly with the scale of the DNA sequence data set.

The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims

1. A method for searching for DNA data set implantation motifs, comprising:

acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set, wherein the DNA sequence big data set comprises a plurality of DNA sequences, each DNA sequence comprises a plurality of characters, and the implantation motif search parameters comprise the length l of an implantation motif and the Hamming distance d of the implantation motif;

obtaining a first k-mer set according to the DNA sequence big data set and the implantation motif search parameter, obtaining a first l-mer set according to the first k-mer set, and obtaining a second l-mer set according to the first l-mer set, wherein the first k-mer set comprises a plurality of k-mers, each k-mer comprises k characters, the first l-mer set comprises a plurality of first l-mers, each first l-mer comprises l characters, the second l-mer set comprises a plurality of second l-mers, and each second l-mer comprises l characters;

determining the implant motif from the second set of l-mers according to a first scoring model;

obtaining the first l-mer set according to the first k-mer set, wherein the obtaining the first l-mer set comprises:

obtaining k-mers from the first set of k-mers;

performing redundancy removal processing on the expanded first k-mer set according to a second score model to obtain an expanded second k-mer set, wherein the second score model is used for evaluating each expanded first k-mer in the expanded first k-mer set and evaluating whether the expanded first k-mer is composed of a background sequence;

intercepting the expanded second k-mer set to obtain a first l-mer;

obtaining the first l-mer set according to the first l-mer;

obtaining a second l-mer set according to the first l-mer set, wherein the obtaining of the second l-mer set comprises:

constructing a binomial tree for a first l-mer in the first l-mer set;

performing redundancy removal processing on the first k-mer set according to the second l-mer to obtain a second k-mer set, wherein the second k-mer set comprises a plurality of k-mers, and each k-mer comprises k characters;

processing the first l-mer set according to the second k-mer set to obtain a second l-mer set;

the first score model represents the score of each node in the binomial tree model, and is obtained according to the DNA sequence big data set and the implantation motif searching parameters, and the method comprises the following steps:

acquiring a plurality of l-mers from the DNA sequence big data set, wherein each l-mer comprises l characters;

obtaining a sequencing queue according to the Hamming distance between the plurality of l-mers and the first l-mer;

and obtaining the first score model according to the sorting queue.

2. The method of claim 1, wherein obtaining a first set of k-mers from the DNA sequence big data set and the implantation motif search parameters comprises:

3. The method of claim 2, wherein obtaining the length k comprises:

obtaining a first expected value according to the DNA sequence big data set;

4. The method of claim 3, wherein obtaining the first threshold value comprises:

obtaining the number of DNA sequences from the DNA sequence big data set;

5. The method of claim 1, wherein truncating the extended second k-mer set to obtain a first l-mer comprises:

obtaining an alignment sequence according to the expanded second k-mer set;

6. The method of claim 2, wherein performing de-redundancy processing on the first set of k-mers with respect to the second l-mer to obtain a second set of k-mers, comprises:

obtaining a fourth l-mer from the DNA sequence big data set, wherein the fourth l-mer comprises l characters;

judging, according to the third expected value, whether a k-mer in the first k-mer set is redundant: when the Hamming distance between the k-mer in the first k-mer set and a k-mer in the second l-mer is less than or equal to the third expected value, the k-mer is redundant and is deleted from the first k-mer set to obtain a second k-mer set; otherwise, the k-mer is kept in the first k-mer set to obtain the second k-mer set.

7. An apparatus for searching for an implanted motif in a DNA data set, the apparatus comprising:

the data acquisition module is used for acquiring a DNA sequence big data set and acquiring implantation motif search parameters of the DNA sequence big data set, wherein the DNA sequence big data set comprises a plurality of DNA sequences, each DNA sequence comprises a plurality of characters, and the implantation motif search parameters comprise the length l of an implantation motif and the Hamming distance d of the implantation motif;

a data processing module, configured to obtain a first k-mer set according to the DNA sequence big data set and the implantation motif search parameter, obtain a first l-mer set according to the first k-mer set, and obtain a second l-mer set according to the first l-mer set, where the first k-mer set includes a plurality of k-mers, each k-mer includes k characters, the first l-mer set includes a plurality of first l-mers, each first l-mer includes l characters, the second l-mer set includes a plurality of second l-mers, and each second l-mer includes l characters;

a data determination module to determine the implant motif from the second set of l-mers according to a first scoring model;

the data processing module is specifically configured to:

obtaining k-mers from the first set of k-mers;

intercepting the expanded second k-mer set to obtain a first l-mer;

obtaining the first l-mer set according to the first l-mer;

the data processing module is further specifically configured to:

constructing a binomial tree for a first l-mer in the first l-mer set;

wherein the first score model represents a score of each node in the binomial tree model, and the data determination module is specifically configured to:

and obtaining the first score model according to the sorting queue.

8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 6.