CN114582419B - Sliding window based gene sequence poly A tail extraction method


Info

Publication number
CN114582419B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210110546.7A
Other languages
Chinese (zh)
Other versions
CN114582419A (en)
Inventor
吴小惠
刘梦飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Abstract

The invention discloses a sliding-window-based method for extracting poly A tails from gene sequences. The method comprises: inputting a gene sequence file and searching for fragments of N consecutive A bases and N consecutive T bases as the starting positions of potential poly A tails; initializing the sliding window parameters and moving the sliding window until it reaches the rightmost or leftmost end of the sequence, or until the mismatch penalty value reaches the threshold of 5, to obtain the poly A tail sequence and the tail length value; if a reference genome sequence of the species is provided, aligning the sequence to be aligned against the reference genome sequence with a sequence alignment tool, or filtering the poly A tails with filtering conditions, to obtain poly A tails of higher accuracy, and determining the sequence type from the number of poly A tails; and finally determining the type of each poly A tail. The method has the advantages of high precision, high running speed and user-friendliness.

Description

Sliding window based gene sequence poly A tail extraction method
Technical Field
The invention relates to a method for extracting polyadenylic acid tail, in particular to a method for extracting gene sequence polyadenylic acid tail based on a sliding window.
Background
With the continuous development of sequencing technology, third-generation sequencing, namely single-molecule real-time sequencing, has gradually matured. The technology can read billions of sequence templates simultaneously at about ten bases per second, with a sequencing accuracy as high as 99.9%. This high-throughput, fast and accurate full-length DNA sequencing rapidly generates large amounts of gene sequence data. These data contain rich information and can be used to study alternative polyadenylation (APA), and in particular the length of the poly(A) tail. The poly(A) tail is a stretch of sequence consisting mainly of adenosine (A) at the end of an mRNA. In sequences produced by high-throughput sequencing, however, the poly(A) tail may lie in the middle of the sequence and may contain non-A bases because of sequencing errors, so an algorithm is needed to extract it. Many studies have shown that the poly(A) tail plays an important role in many biological processes and that its length is closely related to translation efficiency and mRNA stability. Nevertheless, efficient, accurate, flexible and easy-to-use algorithmic tools for identifying and extracting poly(A) tails from sequences are still lacking.
Existing methods for identifying and extracting poly(A) tails from sequences include:
(1) PAIso-seq (Liu, Y., et al. Poly(A) inclusive RNA isoform sequencing (PAIso-seq) reveals wide-spread non-adenosine residues within RNA poly(A) tails. Nature Communications 2019;10(1):5292) has a pipeline with many installation dependencies, is suitable only for third-generation full-length sequencing data generated by PAIso-seq, and is not suitable for data generated by short-read sequencing strategies. PAIso-seq applies empirical conditions to the sequences clipped off after alignment in order to obtain the base composition and length of the tail. Specifically, the raw sequencing data are aligned to the reference genome, and the parts that do not align to the reference genome are treated as potential tails and filtered further by the following conditions: 1) the sequence length is not less than 15 nt; 2) the sequence contains at least 5 consecutive A bases; 3) the number of non-A bases in the sequence is less than 20; 4) the proportion of non-A bases in the sequence is less than 50%. A sequence satisfying all four conditions is recognized by PAIso-seq as a poly(A) tail. The method can extract only tails that meet these fixed filtering conditions, is suitable only for full-length sequencing data, can obtain a tail only through sequence alignment, is not applicable to species without a reference genome, and can find only one tail per sequence.
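The four empirical conditions listed above can be sketched as a single predicate. This is only an illustration of the conditions as stated in this document, not the PAIso-seq code; the function name passes_paiso_filter is hypothetical.

```python
def passes_paiso_filter(clipped: str) -> bool:
    """Check the four empirical conditions on a clipped (unaligned) sequence.

    Hypothetical helper: it mirrors the conditions as listed in the text,
    not the actual PAIso-seq implementation.
    """
    seq = clipped.upper()
    non_a = sum(1 for base in seq if base != "A")
    return (
        len(seq) >= 15                 # 1) length not less than 15 nt
        and "AAAAA" in seq             # 2) at least 5 consecutive A bases
        and non_a < 20                 # 3) fewer than 20 non-A bases
        and non_a / len(seq) < 0.5     # 4) non-A proportion below 50%
    )
```

Because the length check comes first and the conditions short-circuit, the division in condition 4 is never reached for an empty sequence.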
(2) FLAM-seq (Legnini, I., et al. FLAM-seq: full-length mRNA sequencing reveals principles of poly(A) tail length control. Nat. Methods 2019;16(9):879-886) developed the tail-finding tool FLAMAnalysis, which determines the tail length by voting. First, the start of the tail is located using an artificial marker sequence (a known sequence contained in FLAM-seq reads). A fixed-length stretch of sequence is then taken downstream of the start, and the proportion of non-A bases in it is checked; if the proportion meets the requirement, the search continues downstream until the condition is no longer met. The method uses multiple parameter sets when determining the tail: for example, with the length scanned each time set to L and the proportion of A bases set to N, results from several L-N combinations are put to a vote, and the tail with the most votes is taken as the final result. Because the method uses multiple rounds of voting and the voting algorithm involves many non-intuitive parameters, it is slow and its parameter values are hard to choose. In addition, the method applies only to FLAM-seq reads with a fixed structure and not to data generated by other sequencing methods.
(3) The tail quantification tool tailfindr (Krause, M., et al. tailfindr: alignment-free poly(A) length measurement for Oxford Nanopore RNA and DNA sequencing. RNA 2019;25(10):1229-1241), developed specifically for Nanopore, can only approximate the tail length from raw Nanopore data (i.e., current-level signals) and cannot accurately determine the base composition of the tail. Nanopore sequencing exploits the different conductivities of the bases: the four bases A, C, T and G act as resistors that produce different currents under the same voltage, so different bases passing through the pore generate electrical signals of different intensity. The signal characteristics of each base can readily be learned by supervised learning (the signal intensity indicates which base is present, and the signal duration indicates the number of bases). A continuously varying electrical signal can therefore represent a gene sequence. tailfindr measures tail length on this principle: once an RNA sequence is represented as a series of electrical signals, the tail length can be estimated from the duration of the signal intensity corresponding to A bases. The method is applicable only to Nanopore sequencing. In addition, because level signals are generated at a high rate, it is difficult to estimate the true length of the sequence accurately when many identical bases occur in a row (as in a poly(A) tail), and the signals of the few other bases mixed into the tail cannot be identified at all.
(4) FLEP-seq (Long, Y., et al. FLEP-seq: simultaneous detection of RNA polymerase II position, splicing status, polyadenylation site and poly(A) tail length at genome-wide scale by single-molecule nascent RNA sequencing. Nature Protocols 2021;16(9):4355-4381) developed two different procedures to quantify poly(A) tails for two full-length sequencing platforms, PacBio and Nanopore. For PacBio data, FLEP-seq finds the tail with a scoring scheme: scanning from the start of the sequence, the score is increased by 1 for an A base and decreased by 1.5 for any other base. When the score becomes positive at some base, that base is taken as the tail start and scanning continues; if the score turns negative and later becomes positive again, the base at which it becomes positive is taken as the new tail start. When the scan is finished, the position with the maximum score is recorded as the tail end, and the tail is cut out. The method yields only the tail length and cannot provide the specific sequence or position of the tail.
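One possible reading of the scoring scheme described above is a maximum-scoring-segment scan. The sketch below is not the FLEP-seq code: the function name best_a_segment is hypothetical, and restarting the candidate whenever the running score is no longer positive is an assumption made to turn the description into runnable code.

```python
def best_a_segment(seq: str, match: float = 1.0, mismatch: float = -1.5):
    """Return (start, end) of the highest-scoring A-rich segment, end exclusive.

    Assumption: the candidate tail restarts whenever the running score is
    no longer positive; (0, 0) is returned if the sequence contains no A.
    """
    best_score, best_start, best_end = 0.0, 0, 0
    score, start = 0.0, 0
    for i, base in enumerate(seq.upper()):
        if score <= 0:                 # begin a new candidate tail here
            score, start = 0.0, i
        score += match if base == "A" else mismatch
        if score > best_score:         # remember the position of the maximum score
            best_score, best_start, best_end = score, start, i + 1
    return best_start, best_end
```

For example, best_a_segment("CCAAAAAGAAAACC") selects the span covering "AAAAAGAAAA": the single G costs 1.5 points, which the following A bases recover.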
(5) Neither the program named PA-finder, mentioned in Poly(A)-seq (Yu, F., et al. Poly(A)-seq: A method for direct sequencing and analysis of the transcriptomic poly(A)-tails. PLOS ONE 2020;15(6):e0234696), nor PAIso-seq2 provides any usable code. Poly(A)-seq locates the tail in a complex way: first, the artificial adapters added during sequencing are removed; the start of the tail is then matched with an A×9 pattern (nine consecutive A bases), allowing at most one mismatch, and the sequence is cut from that start; the sequences found in this way are then filtered with two empirical conditions: 1) the length is at least 10 nt; 2) the non-A base content is no more than 4 nt. The method can find only tails containing at least 9 consecutive A bases, can extract only tails that meet the fixed filtering conditions, and can find only one tail per sequence. Furthermore, the paper describes the workflow only briefly, and the authors provide neither code nor a toolkit.
Disclosure of Invention
The invention aims to solve the technical problem of providing a sliding window-based gene sequence poly A tail extraction method which is high in precision, high in operation speed and user-friendly.
The technical scheme adopted by the invention for solving the technical problems is as follows: a sliding window based gene sequence poly A tail extraction method comprises the following steps:
(1) Searching a gene sequence file input in FASTQ format for fragments of N consecutive A bases and fragments of N consecutive T bases, which serve as the starting positions of potential poly A tails;
(2) Initializing and setting the parameters of a sliding window;
(3) If the potential poly A tail is an A base fragment, taking the end of the fragment as the initial position and moving the sliding window until it reaches the rightmost end of the sequence, or stopping when the mismatch penalty value reaches a threshold, to obtain a poly A tail sequence and a tail length value;
(4) If the potential poly A tail is a T base fragment, taking the first base of the fragment as the initial position and moving the sliding window until it reaches the leftmost end of the sequence, or stopping when the mismatch penalty value reaches a threshold, to obtain a poly A tail sequence and a tail length value;
(5) Filtering the poly A tails obtained in step (3) and step (4) to obtain poly A tails of higher accuracy, and determining the type of the gene sequence input in step (1) according to the number of poly A tails extracted from the sequence;
(6) If a linker sequence is provided, searching for the linker sequence on the 3' side of the end position of each poly A tail obtained in step (5); if the starting position of the linker sequence found is within 10 bases of the 3' end of the tail, marking the poly A tail type as "structural", otherwise marking it as "nonstructural";
(7) If no linker sequence is provided, calculating the number of bases from the end position of each poly A tail obtained in step (5) to the 3' end of the sequence; if it is more than 25 bases, marking the poly A tail type as "nonstructural", otherwise marking it as "structural".
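The labelling rules of steps (6) and (7) can be sketched as follows for an A-base tail, where the 3' side is to the right. The function name classify_tail and its parameter names are hypothetical, and treating "within 10 bases" as an offset of at most 10 is an assumption; a T-base tail would need the mirrored logic on the left side of the tail.

```python
def classify_tail(seq, tail_end, adapter=None, max_adapter_offset=10, max_trailing=25):
    """Label an A-base tail as "structural" or "nonstructural".

    tail_end is the 0-based index one past the last base of the tail.
    Hypothetical sketch for A-base tails only.
    """
    if adapter:
        # Step (6): look for the linker to the right of the tail end.
        pos = seq.find(adapter, tail_end)
        if pos != -1 and pos - tail_end <= max_adapter_offset:
            return "structural"
        return "nonstructural"
    # Step (7): count the bases between the tail end and the 3' end of the read.
    trailing = len(seq) - tail_end
    return "nonstructural" if trailing > max_trailing else "structural"
```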
Further, step (1) is specifically as follows: a gene sequence file in FASTQ format is input, and the input gene sequence is searched for fragments of N consecutive A bases and fragments of N consecutive T bases. The A base fragments and T base fragments found are taken as the starting positions of potential poly A tails; if neither is found, the gene sequence is considered to contain no poly A tail. N is the initial tail length, with a default value of 8.
Further, step (2) is specifically as follows: a sliding window is set for each potential poly A tail obtained in step (1). The initial size of the sliding window is the length of the potential poly A tail, its initial position is the starting position of the potential poly A tail, and the sliding distance is fixed at 1. The window slides from the 5' end to the 3' end of the sequence: if the tail is an A base fragment the window slides from left to right, and if the tail is a T base fragment it slides from right to left.
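The anchor search can be sketched with a regular expression. The function name find_anchors is hypothetical, and reporting each maximal run of at least N identical bases (rather than exactly N bases) is an assumption: extending an N-base anchor over further matching bases gives the same span.

```python
import re

def find_anchors(seq, anchor_len=8):
    """Return (start, end, base) for every run of >= anchor_len A or T bases.

    Hypothetical sketch; end is exclusive, base is "A" or "T".
    An empty list means the sequence has no potential poly A tail.
    """
    pattern = re.compile(r"A{%d,}|T{%d,}" % (anchor_len, anchor_len))
    return [(m.start(), m.end(), m.group()[0]) for m in pattern.finditer(seq.upper())]
```

For instance, a read consisting of "CC", nine A bases, "G", eight T bases and "CC" yields one A anchor and one T anchor, while a run of only seven A bases yields none with the default N of 8.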
Further, step (3) is specifically as follows: if the potential poly A tail is an A base fragment, the last base of the fragment is taken as the initial position and the initialized sliding window is moved 1 base to the right. If the base is A and the mismatch penalty value is 0, the tail length count is increased by 1 and the mismatch penalty value is unchanged. If the base is A and the mismatch penalty value is not 0, the tail length count is increased by 1 and the mismatch penalty value is decreased by 1, and is reset to 0 if it would become negative. If the base is not A, both the tail length count and the mismatch penalty value are increased by 1. This process is repeated until the sliding window reaches the rightmost end of the sequence or the mismatch penalty value reaches the threshold, giving the poly A tail sequence and the tail length value. The poly A tail sequence runs from the leftmost base of the tail to the base immediately to the left of the position at which the mismatch penalty value was last increased by 1.
The initial value of the poly A tail length count is the initial tail length N, which defaults to 8. The mismatch penalty value represents the number of non-A bases in the tail of the A base fragment and has an initial value of 0. The mismatch penalty threshold represents the longest run of consecutive non-A bases allowed in the tail of the A base fragment and defaults to 5.
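The extension of an A-base anchor can be sketched as below. This is not the code of any released tool: the name extend_a_tail is hypothetical, and the trimming rule is read literally and applied only when the scan stops at the threshold (the tail is cut just left of the last penalty increment); when the scan reaches the end of the read, everything scanned is kept. Both choices are assumptions where the text is open to interpretation.

```python
def extend_a_tail(seq, anchor_start, anchor_end, dropoff=5):
    """Extend the A-base anchor seq[anchor_start:anchor_end] to the right.

    Returns (tail_sequence, tail_length). Hypothetical sketch.
    """
    seq = seq.upper()
    penalty = 0
    last_mismatch = None          # index of the most recent penalty increment
    pos = anchor_end              # first base to the right of the anchor
    while pos < len(seq) and penalty < dropoff:
        if seq[pos] == "A":
            # Matching base: penalty decreases by 1 but never drops below 0.
            penalty = max(penalty - 1, 0)
        else:
            # Non-A base: penalty increases by 1.
            penalty += 1
            last_mismatch = pos
        pos += 1
    # Assumption: on a threshold stop, cut just left of the last increment.
    end = last_mismatch if penalty >= dropoff else pos
    tail = seq[anchor_start:end]
    return tail, len(tail)
```

Under this literal reading, non-A bases that precede the final increment remain at the end of the tail; a stricter variant could strip them, for example with tail.rstrip("CGTN").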
Further, step (4) is specifically as follows: if the potential poly A tail is a T base fragment, the first base of the fragment is taken as the initial position and the initialized sliding window is moved 1 base to the left. If the base is T and the mismatch penalty value is 0, the tail length count is increased by 1 and the mismatch penalty value is unchanged. If the base is T and the mismatch penalty value is not 0, the tail length count is increased by 1 and the mismatch penalty value is decreased by 1, and is reset to 0 if it would become negative. If the base is not T, both the tail length count and the mismatch penalty value are increased by 1. This process is repeated until the sliding window reaches the leftmost end of the sequence or the mismatch penalty value reaches the threshold, giving the poly A tail sequence and the tail length value. The poly A tail sequence runs from the rightmost base of the tail to the base immediately to the right of the position at which the mismatch penalty value was last increased by 1.
The initial value of the poly A tail length count is the initial tail length N, which defaults to 8. The mismatch penalty value represents the number of non-T bases in the tail of the T base fragment and has an initial value of 0. The mismatch penalty threshold represents the longest run of consecutive non-T bases allowed in the tail of the T base fragment and defaults to 5.
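The leftward extension of a T-base anchor can be sketched in the same spirit. The name extend_t_tail is hypothetical, and cutting just right of the last penalty increment only when the scan stops at the threshold is an assumption about how to read the trimming rule.

```python
def extend_t_tail(seq, anchor_start, anchor_end, dropoff=5):
    """Extend the T-base anchor seq[anchor_start:anchor_end] to the left.

    Returns (tail_sequence, tail_length). Hypothetical sketch.
    """
    seq = seq.upper()
    penalty = 0
    last_mismatch = None          # index of the most recent penalty increment
    pos = anchor_start - 1        # first base to the left of the anchor
    while pos >= 0 and penalty < dropoff:
        if seq[pos] == "T":
            # Matching base: penalty decreases by 1 but never drops below 0.
            penalty = max(penalty - 1, 0)
        else:
            # Non-T base: penalty increases by 1.
            penalty += 1
            last_mismatch = pos
        pos -= 1
    # Assumption: on a threshold stop, cut just right of the last increment.
    start = last_mismatch + 1 if penalty >= dropoff else pos + 1
    tail = seq[start:anchor_end]
    return tail, len(tail)
```

An alternative design would reverse-complement the read and reuse a single rightward A-base scan; the mirrored loop is shown here only because it follows the description step by step.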
Further, if a reference genome sequence of the species is provided in step (5), each poly A tail sequence obtained in step (3) and step (4), together with the 200 bases beyond its 5' end, is taken as a sequence to be aligned; if fewer than 200 bases lie beyond the 5' end of the poly A tail sequence, the poly A tail sequence and all of the sequence beyond its 5' end are taken as the sequence to be aligned. The sequences to be aligned are aligned against the reference genome sequence with a sequence alignment tool, and the fragments of the poly A tail sequences that cannot be aligned to the reference genome are extracted and filtered to obtain poly A tails of higher accuracy; the type of the gene sequence input in step (1) is then determined according to the number of poly A tails extracted from the sequence. If a poly A tail sequence matches the reference genome sequence with no base differences, the tail sequence actually originates from the reference genome and is not a true poly A tail.
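Building the sequence to be aligned amounts to a bounded slice. The sketch below covers only an A-base tail, whose 5' side is to the left; the function name sequence_to_align is hypothetical, and the alignment itself (with minimap2, bowtie or STAR) is outside its scope.

```python
def sequence_to_align(seq, tail_start, tail_end, flank=200):
    """Return the A-base tail seq[tail_start:tail_end] plus up to `flank`
    bases on its 5' (left) side. Hypothetical sketch."""
    # If fewer than `flank` bases precede the tail, take all of them.
    left = max(tail_start - flank, 0)
    return seq[left:tail_end]
```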
Further, the filtration in the step (5) is specifically as follows:
A. a poly A tail shorter than 12 bases is discarded;
B. if multiple poly A tails are identified in one full-length gene sequence, tails shorter than two-thirds of the average length of all the poly A tails are discarded;
C. an A-base poly A tail whose non-A base content is higher than 90%, or a T-base poly A tail whose non-T base content is higher than 90%, is discarded.
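Filters A to C can be sketched as one function applied to the tails of a single read. The name filter_tails and its parameters are hypothetical; applying rules A and C first and computing the average for rule B over the surviving tails is an assumption, since the text does not fix the order.

```python
def filter_tails(tails, min_len=12, max_mismatch_ratio=0.9):
    """Filter the tails found in one read.

    tails is a list of (tail_sequence, tail_base) with tail_base "A" or "T".
    Hypothetical sketch of rules A, B and C.
    """
    kept = []
    for tail_seq, base in tails:
        if len(tail_seq) < min_len:                       # rule A
            continue
        mismatch_ratio = sum(b != base for b in tail_seq) / len(tail_seq)
        if mismatch_ratio > max_mismatch_ratio:           # rule C
            continue
        kept.append((tail_seq, base))
    if len(kept) > 1:                                     # rule B
        mean_len = sum(len(t) for t, _ in kept) / len(kept)
        kept = [(t, b) for t, b in kept if len(t) >= 2 * mean_len / 3]
    return kept
```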
The sequence alignment tool is minimap2, bowtie or STAR.
Compared with the prior art, the invention provides a sliding-window-based method for extracting poly A tails from gene sequences, with the following advantages: 1) because poly A or poly T identification is combined with a sliding window, the method runs quickly and with high precision; 2) the method is suitable for both short-read and full-length sequencing data, and also for species without a reference genome.
In summary, the invention locates and identifies the poly-A tail by poly-A or poly-T identification and sliding window, and obtains the precise tail position and length. The method can directly and accurately extract the tail from the original sequence without sequence comparison, can further correct the tail under the condition of providing a reference genome sequence, and is suitable for high-throughput short-read sequencing data or full-length sequencing data.
Detailed Description
The present invention is described in further detail below with reference to examples.
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the claimed embodiments. Accordingly, the description is to be regarded as illustrative in nature and not as restrictive.
1. Detailed description of the preferred embodiments
Because gene sequences are double-stranded and the strand orientation of a given sequence is unknown in practical data analysis, both the + direction and the - direction must be considered during calculation. For tail extraction this means that both A-base tails and T-base tails are considered.
Initializing parameters:
Linker sequence (adapterSeq): an adapter is a known short nucleotide sequence added during sequencing. In some sequencing data the adapter sequences have already been removed (for example, when the sequencing provider pre-processes the data and returns it to the customer with the adapters removed), while in others they remain. For example, FLAM-seq inserts a specific sequence (9 G bases) at the end of the tail to mark the tail end. The adapter sequence can be specified through the parameter adapterSeq according to the user's data; if there is none, the parameter need not be set. The adapterSeq default is "" A "".
Initial tail length (anchorLen): used to determine the starting position of the tail in the sequence; anchorLen consecutive A or T bases are first located in the sequence as the initial tail. The default value is 8.
Upper limit of the number of non-A bases (dropoff): the longest run of consecutive non-A bases (for an A-base tail) or non-T bases (for a T-base tail) allowed in the tail. The default value is 5.
Reference genome sequence of the species: the reference genome sequence of the species to which the sequencing data to be scanned for tails belong, as a sequence file in FASTA format. The default is "", meaning that no reference genome is provided.
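The initialization parameters above could be grouped into one configuration object, as in the sketch below. The class name TailParams and the Python field names are hypothetical; using None for "no adapter supplied" and an empty string for "no reference genome" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TailParams:
    """Hypothetical container for the initialization parameters."""
    adapter_seq: Optional[str] = None   # adapterSeq; None is assumed to mean "not supplied"
    anchor_len: int = 8                 # anchorLen: initial tail length
    dropoff: int = 5                    # dropoff: longest run of mismatching bases
    reference_fasta: str = ""           # path to a FASTA file; "" means not provided

# Example: a FLAM-seq-like setting with a run of 9 G bases marking the tail end.
flam_like = TailParams(adapter_seq="G" * 9)
```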
A sliding window based gene sequence poly A tail extraction method comprises the following steps:
1. Searching a gene sequence file input in FASTQ format for fragments of N consecutive A bases and fragments of N consecutive T bases as the starting positions of potential poly A tails, specifically as follows: a gene sequence file in FASTQ format is input, and the input gene sequence is searched for fragments of N consecutive A bases and fragments of N consecutive T bases. The A base fragments and T base fragments found are taken as the starting positions of potential poly A tails; if neither is found, the gene sequence is considered to contain no poly A tail. N is the initial tail length, with a default value of 8.
2. Initializing and setting the parameters of the sliding window, specifically as follows: a sliding window is set for each potential poly A tail obtained in step 1. The initial size of the sliding window is the length of the potential poly A tail, its initial position is the starting position of the potential poly A tail, and the sliding distance is fixed at 1. The window slides from the 5' end to the 3' end of the sequence: if the tail is an A base fragment (the sequence reads from the 5' end to the 3' end) the window slides from left to right, and if the tail is a T base fragment (the sequence reads from the 3' end to the 5' end) it slides from right to left.
3. If the potential poly A tail is an A base fragment, the last base of the fragment is taken as the initial position and the sliding window is moved until it reaches the rightmost end of the sequence, or until the mismatch penalty value reaches the threshold, to obtain the poly A tail sequence and the tail length value. Specifically: the initialized sliding window is moved 1 base to the right. If the base is A and the mismatch penalty value is 0, the tail length count is increased by 1 and the mismatch penalty value is unchanged. If the base is A and the mismatch penalty value is not 0, the tail length count is increased by 1 and the mismatch penalty value is decreased by 1, and is reset to 0 if it would become negative. If the base is not A, both the tail length count and the mismatch penalty value are increased by 1. This process is repeated until the sliding window reaches the rightmost end of the sequence or the mismatch penalty value reaches the threshold. The poly A tail sequence runs from the leftmost base of the tail to the base immediately to the left of the position at which the mismatch penalty value was last increased by 1. The initial value of the tail length count is the initial tail length N, which defaults to 8. The mismatch penalty value represents the number of non-A bases in the tail of the A base fragment and has an initial value of 0. The mismatch penalty threshold represents the longest run of consecutive non-A bases allowed in the tail of the A base fragment and defaults to 5.
4. If the potential poly A tail is a T base fragment, the first base of the fragment is taken as the initial position and the sliding window is moved until it reaches the leftmost end of the sequence, or until the mismatch penalty value reaches the threshold, to obtain the poly A tail sequence and the tail length value. Specifically: the initialized sliding window is moved 1 base to the left. If the base is T and the mismatch penalty value is 0, the tail length count is increased by 1 and the mismatch penalty value is unchanged. If the base is T and the mismatch penalty value is not 0, the tail length count is increased by 1 and the mismatch penalty value is decreased by 1, and is reset to 0 if it would become negative. If the base is not T, both the tail length count and the mismatch penalty value are increased by 1. This process is repeated until the sliding window reaches the leftmost end of the sequence or the mismatch penalty value reaches the threshold. The poly A tail sequence runs from the rightmost base of the tail to the base immediately to the right of the position at which the mismatch penalty value was last increased by 1. The initial value of the tail length count is the initial tail length N, which defaults to 8. The mismatch penalty value represents the number of non-T bases in the tail of the T base fragment and has an initial value of 0. The mismatch penalty threshold represents the longest run of consecutive non-T bases allowed in the tail of the T base fragment and defaults to 5.
5. Filtering the poly A tails obtained in step 3 and step 4 to obtain poly A tails of higher accuracy, and determining the type of the gene sequence input in step 1 according to the number of poly A tails extracted from the sequence. The filtering is specifically as follows:
A. a poly A tail shorter than 12 bases is discarded;
B. if multiple poly A tails are identified in one full-length gene sequence, tails shorter than two-thirds of the average length of all the poly A tails are discarded;
C. an A-base poly A tail whose non-A base content is higher than 90%, or a T-base poly A tail whose non-T base content is higher than 90%, is discarded.
If a reference genome sequence of the species is provided, each poly A tail sequence obtained in steps 3 and 4, together with the 200 bases beyond its 5' end, is taken as a sequence to be aligned; if fewer than 200 bases lie beyond the 5' end of the poly A tail sequence, the poly A tail sequence and all of the sequence beyond its 5' end are taken as the sequence to be aligned. The sequences to be aligned are aligned against the reference genome sequence with a sequence alignment tool, and the fragments of the poly A tail sequences that cannot be aligned to the reference genome are extracted and filtered to obtain poly A tails of higher accuracy; the type of the gene sequence input in step 1 is then determined according to the number of poly A tails extracted from the sequence. If a poly A tail sequence matches the reference genome sequence with no base differences, the tail sequence actually originates from the reference genome and is not a true poly A tail. Sequence alignment is a routine step in most biological data analysis, and many well-accepted alignment tools exist, such as minimap2, bowtie and STAR; users can download and install them themselves and run the alignment with default parameters.
6. If a linker sequence is provided, the linker sequence is searched for on the 3' side of the end position of each poly A tail obtained in step 5 (the right end of the tail for an A-base tail; the left end of the tail for a T-base tail). If the starting position of the linker sequence found is within 10 bases of the 3' end of the poly A tail, the poly A tail type is labeled "structural"; otherwise it is labeled "nonstructural".
7. If no linker sequence is provided, the number of bases from the end position of each poly A tail obtained in step 5 to the 3' end of the sequence is calculated; if it is more than 25 bases, the poly A tail type is labeled "nonstructural", otherwise it is labeled "structural". The tail is generally located at the 3' end of the sequence, so the "structural" type indicates that the tail is consistent with the sequencing structure and is more likely to be a genuine tail.
Finally, a list of tails is output, including: sequence ID; sequence orientation (if a reference genome is provided); alignment position of the sequence (if a reference genome is provided); complete tail sequence; tail length; number of non-A/T bases; non-A/T base ratio (total number of non-A/T bases divided by total tail length); number of C bases; C base ratio; number of G bases; G base ratio; number of T/A bases; T/A base ratio; tail type (structural or nonstructural); and sequence type, which is "one-tail" (the sequence has only one tail), "two-tail" (the sequence has two tails) or "multi-tail" (the sequence has more than 2 tails).
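The composition fields of the output list can be derived from a tail sequence as sketched below. The function names tail_stats and sequence_type and the dictionary keys are hypothetical, and returning None for a sequence with no tail is an assumption, since the text does not name that case.

```python
from collections import Counter

def tail_stats(tail_seq, tail_base):
    """Composition summary of one tail; tail_base is "A" or "T". Hypothetical sketch."""
    counts = Counter(tail_seq.upper())
    length = len(tail_seq)
    non_tail = length - counts[tail_base]       # number of non-A (or non-T) bases
    other = "T" if tail_base == "A" else "A"    # the "T/A bases" column
    return {
        "length": length,
        "non_tail_count": non_tail,
        "non_tail_ratio": non_tail / length if length else 0.0,
        "C_count": counts["C"],
        "G_count": counts["G"],
        "other_count": counts[other],
    }

def sequence_type(n_tails):
    """Map the number of tails found in a sequence to its sequence type."""
    if n_tails <= 0:
        return None            # assumption: no label when no tail was found
    if n_tails == 1:
        return "one-tail"
    if n_tails == 2:
        return "two-tail"
    return "multi-tail"
```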
2. Analysis of results
Table 1. Comparison of the running speed of PolyAtailor, the software package designed according to the extraction method of the present invention, with that of other tools
[Table 1 is provided as an image in the original publication: Figure BDA0003494956460000091]
Note: "alignment-free" means that the poly(A) tail is extracted directly from the sequencing reads, without sequence alignment, when no reference genome of the species is provided. "alignment-based" means that the poly(A) tail is extracted from the sequencing reads with the aid of sequence alignment when a reference genome of the species is provided. "-" in the table indicates that the corresponding tool does not provide that function.
The above description is not intended to limit the present invention, and the present invention is not limited to the above examples. Those skilled in the art should also realize that changes, modifications, additions and substitutions can be made without departing from the true spirit and scope of the invention.

Claims (7)

1. A gene sequence poly A tail extraction method based on a sliding window is characterized by comprising the following steps:
(1) Searching a gene sequence file input in FastQ format for fragments of N consecutive A bases and fragments of N consecutive T bases, which serve as the starting positions of potential poly A tails;
(2) Initializing and setting parameters of a sliding window;
(3) If the potential poly A tail is an A base fragment, taking the last base of the fragment as the starting position and moving the sliding window until it reaches the rightmost end of the sequence or the mismatch penalty value reaches the threshold, whereupon sliding stops and the poly A tail sequence and the tail length value are obtained, specifically: taking the last base of the fragment as the starting position, moving the initialized sliding window 1 base to the right; if the base is A and the mismatch penalty value is 0, adding 1 to the tail length count value and leaving the mismatch penalty value unchanged; if the base is A and the mismatch penalty value is not 0, adding 1 to the tail length count value and subtracting 1 from the mismatch penalty value, and resetting the mismatch penalty value to 0 if it has become negative; if the base is not A, adding 1 to both the tail length count value and the mismatch penalty value; repeating this process until the sliding window reaches the rightmost end of the sequence or the mismatch penalty value reaches the threshold, whereupon sliding stops and the poly A tail sequence and the tail length value are obtained, wherein the poly A tail sequence is the sequence from the leftmost end of the tail to the base immediately to the left of the position at which the mismatch penalty value was last increased by 1;
(4) If the potential poly A tail is a T base fragment, taking the first base of the fragment as the starting position and moving the sliding window until it reaches the leftmost end of the sequence or the mismatch penalty value reaches the threshold, whereupon sliding stops and the poly A tail sequence and the tail length value are obtained, specifically: taking the first base of the fragment as the starting position, moving the initialized sliding window 1 base to the left; if the base is T and the mismatch penalty value is 0, adding 1 to the tail length count value and leaving the mismatch penalty value unchanged; if the base is T and the mismatch penalty value is not 0, adding 1 to the tail length count value and subtracting 1 from the mismatch penalty value, and resetting the mismatch penalty value to 0 if it has become negative; if the base is not T, adding 1 to both the tail length count value and the mismatch penalty value; repeating this process until the sliding window reaches the leftmost end of the sequence or the mismatch penalty value reaches the threshold, whereupon sliding stops and the poly A tail sequence and the tail length value are obtained, wherein the poly A tail sequence is the sequence from the rightmost end of the tail to the base immediately to the right of the position at which the mismatch penalty value was last increased by 1;
(5) Filtering the poly A tails obtained in step (3) and step (4) to obtain poly A tails of higher accuracy, and determining the type of the gene sequence input in step (1) according to the number of poly A tails extracted from the sequence, wherein the filtering is specifically as follows:
A. discarding a poly A tail if its length is less than 12 bases;
B. if multiple poly A tails are identified in one full-length gene sequence, discarding tails whose length is less than two-thirds of the average length of all the poly A tails;
C. discarding an A base tail whose non-A base content is higher than 90% and a T base tail whose non-T base content is higher than 90%;
(6) If a linker sequence is provided, searching for the linker sequence on the 3' side of the end position of the poly A tail obtained in step (5); if the start position of the linker sequence found lies within 10 bases of the 3' end of the poly A tail, labeling the poly A tail type as "structural", otherwise labeling it as "nonstructural";
(7) If no linker sequence is provided, calculating the number of bases from the terminal position of the poly A tail to the terminal point of the 3' end of the sequence in step (5), if more than 25 bases, marking the type of poly A tail as "nonstructural", otherwise marking the type of poly A tail as "structural".
2. The sliding window based gene sequence poly A tail extraction method according to claim 1, wherein step (1) is specifically as follows: inputting a gene sequence file in FastQ format, searching the input gene sequence for fragments of N consecutive A bases and fragments of N consecutive T bases, and taking the A base fragments and T base fragments found as the starting positions of potential poly A tails; if no A base fragment or T base fragment is found, determining that the gene sequence has no poly A tail, wherein N is the initial tail length, with a default value of 8.
3. The method for extracting the poly A tail of the gene sequence based on the sliding window as claimed in claim 2, wherein the step (2) is specifically as follows: setting a sliding window for each potential poly A tail obtained in the step (1), wherein the initial size of the sliding window is the length of the potential poly A tail, the initial position is the starting position of the potential poly A tail, the sliding distance is fixed to 1, and the sliding direction of the sliding window is from the 5 'end to the 3' end of the sequence, namely, if the tail is an A base fragment, the sliding direction is from left to right, and if the tail is a T base fragment, the sliding direction is from right to left.
4. The sliding window based gene sequence poly A tail extraction method according to claim 1, wherein: the initial value of the poly A tail length count value is the initial tail length N, where N is 8; the mismatch penalty value represents the number of non-A bases in the tail of the A base fragment, with an initial value of 0; and the mismatch penalty threshold represents the length of the longest run of consecutive non-A bases present in the tail of the A base fragment, with a default value of 5.
5. The sliding window based gene sequence poly A tail extraction method according to claim 1, wherein: the initial value of the poly A tail length count value is the initial tail length N, where N is 8; the mismatch penalty value represents the number of non-T bases in the tail of the T base fragment, with an initial value of 0; and the mismatch penalty threshold represents the length of the longest run of consecutive non-T bases present in the tail of the T base fragment, with a default value of 5.
6. The sliding window based gene sequence poly A tail extraction method according to claim 1, wherein: if a reference genome sequence of the species is provided in step (5), each poly A tail sequence obtained in step (3) and step (4) is taken together with the 200 bases immediately outside its 5' end as a sequence to be aligned; if fewer than 200 bases lie outside the 5' end of the poly A tail sequence, the poly A tail sequence and all of the sequence outside its 5' end are taken as the sequence to be aligned; the sequences to be aligned are aligned to the reference genome sequence with a sequence alignment tool, and the fragments of the poly A tail sequences that cannot be aligned to the reference genome sequence are extracted in this filtering step, giving poly A tails of higher accuracy; the type of the gene sequence input in step (1) is determined according to the number of poly A tails extracted from the sequence; and if a poly A tail sequence shows no base differences from the reference genome sequence, the tail sequence actually derives from the reference genome and is not a true poly A tail.
7. The sliding window based gene sequence poly A tail extraction method according to claim 6, wherein the sequence alignment tool is minimap2, bowtie, or STAR.
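For orientation, the counter updates of step (3) and the filters of step (5) in claim 1 might be sketched in Python as below. This is a rough sketch under stated assumptions, not the claimed implementation: the function names are hypothetical, only A base tails on upper-case sequences are handled, and the tail boundary is simplified to "strip trailing non-A bases" instead of the end rule worded in the claim. The filters are applied in the order A, B, C, with the average in rule B taken over the tails that survive rule A; the claim does not fix that order.

```python
from statistics import mean


def extend_a_tail(seq, seed_start, n=8, threshold=5):
    """Extend the A base seed seq[seed_start:seed_start + n] one base at a time.

    Returns (tail, length_count, penalty). length_count follows the claim's
    counter (seed length plus every base scanned). The returned tail drops
    trailing non-A bases, which is a simplification made for this sketch.
    """
    pos = seed_start + n - 1               # last base of the seed
    length_count = n
    penalty = 0
    while pos + 1 < len(seq) and penalty < threshold:
        pos += 1
        length_count += 1                  # every scanned base is counted
        if seq[pos] == "A":
            penalty = max(penalty - 1, 0)  # a match pays back one point, never below 0
        else:
            penalty += 1                   # a mismatch adds one point
    end = pos + 1
    while end > seed_start and seq[end - 1] != "A":
        end -= 1                           # assumed trimming of trailing mismatches
    return seq[seed_start:end], length_count, penalty


def filter_tails(tails, min_len=12, min_frac_of_mean=2 / 3, max_non_a=0.9):
    """Apply rules A, B and C to the A base tails found in one sequence."""
    kept = [t for t in tails if len(t) >= min_len]                   # rule A
    if len(kept) > 1:                                                # rule B
        cutoff = min_frac_of_mean * mean(len(t) for t in kept)
        kept = [t for t in kept if len(t) >= cutoff]
    return [t for t in kept
            if (len(t) - t.count("A")) / len(t) <= max_non_a]        # rule C
```

With the default threshold of 5, a seed followed by `GAAA` and then six C bases is scanned until the fifth C, where the penalty reaches 5 and the scan stops; the single G inside the tail is tolerated because the following A bases bring the penalty back to 0.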
CN202210110546.7A 2022-01-29 2022-01-29 Sliding window based gene sequence poly A tail extraction method Active CN114582419B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210110546.7A CN114582419B (en) 2022-01-29 2022-01-29 Sliding window based gene sequence poly A tail extraction method

Publications (2)

Publication Number Publication Date
CN114582419A CN114582419A (en) 2022-06-03
CN114582419B CN114582419B (en) 2023-02-10

Family

ID=81769566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210110546.7A Active CN114582419B (en) 2022-01-29 2022-01-29 Sliding window based gene sequence poly A tail extraction method

Country Status (1)

Country Link
CN (1) CN114582419B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102732629A (en) * 2012-08-01 2012-10-17 复旦大学 Method for concurrently determining gene expression level and polyadenylic acid tailing by using high-throughput sequencing
CN104711340A (en) * 2013-12-17 2015-06-17 北京大学 Transcriptome sequencing method

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110059453A1 (en) * 2009-08-23 2011-03-10 Affymetrix, Inc. Poly(A) Tail Length Measurement by PCR
TR201810530T4 (en) * 2010-10-22 2018-08-27 Cold Spring Harbor Laboratory Count the variety of nucleic acids to obtain genomic copy number information.
US20180265912A1 (en) * 2011-08-23 2018-09-20 Rutgers, The State University Of New Jersey Modified 3' region extraction and deep sequencing of polydenylation sites and poly(a) tail length analysis
US10829804B2 (en) * 2015-03-23 2020-11-10 The University Of North Carolina At Chapel Hill Method for identification and enumeration of nucleic acid sequences, expression, splice variant, translocation, copy, or DNA methylation changes using combined nuclease, ligase, polymerase, terminal transferase, and sequencing reactions
CN105734053B (en) * 2016-04-20 2018-12-14 武汉生命之美科技有限公司 A kind of construction method in high-flux sequence analysis Poly (A) tail length degree library
EP3724355A1 (en) * 2017-12-15 2020-10-21 Novartis AG Polya tail length analysis of rna by mass spectrometry
JP2021532794A (en) * 2018-08-03 2021-12-02 ビーム セラピューティクス インク. Multi-effector nucleobase editor and methods for modifying nucleic acid target sequences using it
US20220002797A1 (en) * 2018-10-02 2022-01-06 Max-Delbrück-Centrum Für Molekulare Medizin In Der Helmholtz-Gemeinschaft Full-length rna sequencing
CN113574181A (en) * 2019-03-01 2021-10-29 武汉华大医学检验所有限公司 Nucleic acid sequence for direct RNA library construction, method for direct construction of sequencing library based on RNA sample and application
CN110499356B (en) * 2019-09-05 2021-06-08 中国科学院遗传与发育生物学研究所 Construction method of sequencing library of RNA (ribonucleic acid) with poly (A) tail in sample to be detected
CN112481363A (en) * 2020-03-09 2021-03-12 南京大学 Application of mutant Aerolysin monomer in detection of RNA base sequence and RNA modification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant