CN116110493A - Data set construction method for G-quadruplex prediction model and prediction method thereof

Publication number
CN116110493A
CN116110493A CN202310267142.3A CN202310267142A CN116110493A CN 116110493 A CN116110493 A CN 116110493A CN 202310267142 A CN202310267142 A CN 202310267142A CN 116110493 A CN116110493 A CN 116110493A
Authority
CN
China
Prior art keywords
data
dna sequence
quadruplex
data set
antisense strand
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310267142.3A
Other languages
Chinese (zh)
Other versions
CN116110493B (en)
Inventor
刘利
崔益智
邹权
丁漪杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yangtze River Delta Research Institute of UESTC Huzhou
Original Assignee
Yangtze River Delta Research Institute of UESTC Huzhou
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yangtze River Delta Research Institute of UESTC Huzhou
Priority to CN202310267142.3A
Publication of CN116110493A
Application granted
Publication of CN116110493B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B5/00 ICT specially adapted for modelling or simulations in systems biology, e.g. gene-regulatory networks, protein interaction networks or metabolic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Physiology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Train Traffic Observation, Control, And Security (AREA)

Abstract

The scheme discloses a data set construction method for a G-quadruplex prediction model and a prediction method based on it. The data set construction method yields a data set with higher resolution and a higher signal-to-noise ratio, so that the model is trained on cleaner and more realistic data; it offers a new approach to constructing G4 training data sets and improves the prediction performance of the model. The method also takes the strand specificity of DNA into account and distinguishes G-quadruplex structures on the different strands of the DNA, which makes the data set more realistic and reliable and effectively improves the prediction performance of the model.

Description

Data set construction method for G-quadruplex prediction model and prediction method thereof
Technical Field
The scheme belongs to the technical field of computational biology, and provides a data set construction method for a G-quadruplex prediction model and a prediction method thereof.
Background
A G-quadruplex forms from four runs of consecutive guanines (G) on a single-stranded nucleic acid. Guanines from different runs associate with one another through Hoogsteen base pairing to form a square planar arrangement, and several such planes stack on top of one another to form the G-quadruplex structure.
A G-quadruplex on genomic DNA is called G4-DNA. A growing body of research indicates that G4-DNA has a non-negligible effect on gene transcription and plays an important role in biological processes such as regulating gene expression and maintaining genomic stability. The ability to predict G4-DNA in vivo would greatly facilitate the study of tumor genes and of the relationship between genome structure and major diseases.
Considerable research has been devoted to predicting G4-DNA in vivo, and machine learning prediction methods based on support vector machines and random forests have been proposed. Every machine learning prediction method, however, depends on its data set. In the prior art, the data sets constructed for predicting in vivo G4-DNA by machine learning are coarse, and negative samples are chosen by simple rules, for example by selecting sequences from the human genome that have the same length, GC content and repetitiveness as the positive samples, or by directly synthesizing sequences with the same length, GC content and repetitiveness. With regard to data set construction, a Chinese patent discloses a prediction method for cell-specific genomic G-quadruplexes (application number: 2021100305029) that makes some improvements on the original methods: instead of using DNA sequences directly as input, it uses chromatin openness and methylation level as the feature vectors of sequences and as the input for training the model, so that it can predict cell-type-specific G4-DNA and reveal the influence of epigenetic modification on the formation of G4 structures. However, because the input features are not the DNA sequence itself, the sequence dependence of the G4 structure cannot be recognized, and G4-DNA cannot be predicted de novo. Furthermore, the prior art neither considers the strand specificity of the G4 structure when constructing the data set nor distinguishes between the forward and reverse strands. Although the above scheme refers to predicting cell-specific G4-DNA, this means predicting whether G4-DNA will be present in a specific cell; it is a different concept from the strand specificity of the G4 structure that the present scheme considers when constructing the data set.
Therefore, the data sets that the prior art constructs for G-quadruplex prediction models suffer from low resolution, a low signal-to-noise ratio, and data that are neither clean nor realistic, so that model training is unsatisfactory and the final prediction performance is impaired.
Disclosure of Invention
The purpose of the scheme is to provide a data set construction method for a G-quadruplex prediction model and a prediction method thereof. It provides a brand-new data processing flow that supplements the original selection-based and generation-based data set construction methods, offers an approach to constructing G4 training data sets, and provides strong methodological support for predicting active G4 regions across the whole genome.
A data set construction method for a G-quadruplex predictive model, the method comprising:
S1, collecting G4 CUT&Tag raw sequencing data, and preprocessing the G4 CUT&Tag raw sequencing data to obtain preprocessed G4 sequencing data;
collecting a human genome sequencing file;
S2, predicting each chromosome in the human genome sequencing file by using a sequence-score-based G-quadruplex prediction tool to obtain a G4 candidate pool consisting of X sequences;
S3, dividing the G4 candidate pool into a G4 sense strand region and a G4 antisense strand region based on a scoring rule;
S4, performing an overlapping-peak extraction operation between the preprocessed G4 sequencing data and each of the G4 sense strand region and the G4 antisense strand region;
S5, directly extracting the DNA sequences of the peaks where the G4 sense strand region overlaps the G4 sequencing data;
extracting the DNA sequences of the peaks where the G4 antisense strand region overlaps the G4 sequencing data, and performing a negative-strand extraction operation on them;
taking the two groups of extracted data as positive samples of the data set;
S6, deleting from the G4 candidate pool the data that overlap peaks of the G4 sequencing data to obtain a pruned G4 candidate pool, and randomly selecting from the pruned G4 candidate pool the same number of DNA sequences as there are positive samples;
directly extracting the selected DNA sequences that lie in the G4 sense strand region;
performing the negative-strand extraction operation on the selected DNA sequences that lie in the G4 antisense strand region;
taking the two groups of data as negative samples of the data set.
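The overlap operation in step S4 amounts to an interval intersection between peak windows and candidate regions. The following is a minimal sketch only, assuming 0-based half-open (BED-style) coordinates; the helper names `overlaps` and `overlapping` and the toy coordinates are hypothetical, and in practice a dedicated tool such as bedtools would normally do this work.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def overlaps(a: Interval, b: Interval) -> bool:
    """True when a and b lie on the same chromosome and share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def overlapping(candidates: List[Interval], peaks: List[Interval]) -> List[Interval]:
    """Return the candidate regions that overlap at least one peak window."""
    return [c for c in candidates if any(overlaps(c, p) for p in peaks)]


# Toy coordinates for illustration only.
peaks = [("chr1", 100, 220)]
candidates = [("chr1", 200, 225), ("chr1", 220, 245), ("chr2", 100, 125)]
print(overlapping(candidates, peaks))
```

Candidates returned by `overlapping` would feed the positive set of step S5; the rest remain available for the negative sampling of step S6.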
In the above data set construction method for a G-quadruplex prediction model, step S1 specifically includes:
S11, collecting G4 CUT&Tag raw sequencing data of N different cell lines;
S12, mapping the G4 CUT&Tag raw sequencing data to the hg38 human reference genome by using a sequence alignment tool;
S13, identifying the peak summits of the data by using a peak-calling tool;
S14, obtaining DNA sequence data with a length of 120 bp around each peak summit as the G4 sequencing data;
in step S6, each obtained DNA sequence is extended leftward and rightward from its center to a total length of 120 bp, and the extended sequence is used as a negative sample of the data set.
In the above data set construction method for the G-quadruplex prediction model, in step S11, G4 CUT&Tag raw sequencing data of five different cell lines, HEK293T, HeLa, K562, LM2 and SW1271, are collected;
wherein the sequencing data of one or more cell lines are used as the training set, and the sequencing data of one or more other cell lines are used as the test set;
alternatively, part of the sequencing data of one or more cell lines is used as the training set and the remaining sequencing data are used as the test set. For example, training may use the sequencing data of one cell line, and testing may use both held-out data of the same cell line and data of different cell lines.
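The cross-cell-line split described above can be illustrated with a short sketch. The function name `split_by_cell_line` and the placeholder sequence labels are hypothetical and serve only to show the idea of holding out whole cell lines for testing.

```python
from typing import Dict, List, Tuple


def split_by_cell_line(
    data: Dict[str, List[str]], train_lines: List[str]
) -> Tuple[List[str], List[str]]:
    """Samples from `train_lines` form the training set; all other cell lines form the test set."""
    train = [s for name in train_lines for s in data[name]]
    test = [s for name, seqs in data.items() if name not in train_lines for s in seqs]
    return train, test


# Placeholder labels, not real sequences.
data = {"HEK293T": ["seq1", "seq2"], "K562": ["seq3"], "LM2": ["seq4"]}
train, test = split_by_cell_line(data, ["HEK293T"])
print(train, test)
```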
In the above data set construction method for a G-quadruplex prediction model, in step S6, the method further includes:
detecting whether the overlap results contain DNA sequences at duplicated positions;
if so, performing a de-duplication operation on the duplicated positions.
In the data set construction method for the G-quadruplex prediction model, duplicate detection and de-duplication are performed on the duplicated DNA sequences within the sense strand and within the antisense strand of the overlap results; between the sense strand and the antisense strand, duplicate detection and/or de-duplication is not performed. That is, duplicates between the sense strand and the antisense strand may be detected but not removed, or duplicate detection between the two strands may be skipped altogether, even if duplicates exist. A duplicated position may correspond to sequences from both the sense strand and the antisense strand of the DNA, and because this scheme considers strand specificity, duplicates between these two independent parts of the data are not deleted.
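One way to picture the strand-aware de-duplication is to make the strand part of the record's identity, so that identical coordinates on opposite strands are never collapsed. This is a sketch under that assumption; the record layout and the name `dedupe_within_strand` are hypothetical.

```python
from typing import List, Tuple

Region = Tuple[str, int, int, str]  # (chromosome, start, end, strand)


def dedupe_within_strand(regions: List[Region]) -> List[Region]:
    """Drop repeated positions on the same strand, keeping the first occurrence.

    The strand is part of the key, so the same coordinates on '+' and '-'
    are both retained.
    """
    seen = set()
    unique = []
    for region in regions:
        if region not in seen:
            seen.add(region)
            unique.append(region)
    return unique


regions = [
    ("chr1", 100, 220, "+"),
    ("chr1", 100, 220, "+"),  # duplicate on the same strand: removed
    ("chr1", 100, 220, "-"),  # same position, other strand: kept
]
print(dedupe_within_strand(regions))
```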
In the above data set construction method for the G-quadruplex prediction model, in step S5, the antisense strand sequences are subjected to center-symmetric processing so that the data of both the sense strand and the antisense strand are in order from the 5' end to the 3' end.
In the above data set construction method for the G-quadruplex prediction model, in step S6, center-symmetric processing is performed on the sequences obtained by the negative-strand extraction.
A method of predicting G-quadruplex structure, the method comprising:
SA1, constructing a data set comprising a positive sample and a negative sample by the data set construction method for the G-quadruplex predictive model, and preprocessing the data set;
SA2, dividing the preprocessed data set into a training set and a testing set;
SA3, respectively training and testing the G-quadruplex predictive model by using a training set and a testing set to obtain a trained G-quadruplex predictive model;
SA4, dividing the DNA sequence to be predicted into a sense strand region and an antisense strand region based on the same scoring rule, taking the DNA sequence of the sense strand region and the negative-strand DNA sequence of the antisense strand region as input to the trained G-quadruplex prediction model, and obtaining as the model output the activity of the G4 region of the DNA sequence to be predicted.
In the above-mentioned G-quadruplex structure prediction method, in step SA1, the following preprocessing is performed on the data set:
encoding the sequences in the data set with a one-hot encoding layer, which encodes a 120 bp DNA sequence into a 120 × 4 matrix in which each column corresponds to one DNA base;
taking the one-hot-encoded DNA sequences as input to the G-quadruplex prediction model in step SA3;
in step SA4, after the DNA sequence of the sense strand region and the DNA sequence of the antisense strand region are obtained from the DNA sequence to be predicted, they are one-hot encoded, the one-hot encoding converting a 120 bp DNA sequence into a 120 × 4 matrix in which each column corresponds to one DNA base (A, C, G or T), and the one-hot-encoded DNA sequences are used as input to the trained G-quadruplex prediction model.
In the above method for predicting G-quadruplex structure, in step SA4, the DNA sequence obtained by negative-strand extraction from the antisense strand region is subjected to center-symmetric processing;
in the step SA1, the G-quadruplex prediction model adopts XGBoost;
for one input, the model output is a score between 0 and 1 representing the activity of the G4 region.
The advantages of the scheme are:
(1) The scheme provides a brand-new data set construction method that yields a data set with higher resolution and a higher signal-to-noise ratio, so that the model is trained on cleaner and more realistic data; it offers a new approach to constructing G4 training data sets and improves the prediction performance of the model.
(2) The method takes cell line specificity into account: data sets are prepared from 5 different cell lines, and testing is performed on a cell line not used in training, so that the cross-cell-line performance of the model can be evaluated effectively. Experiments show that the model predicts well across cell lines.
(3) The method takes the strand specificity of DNA into account and distinguishes G-quadruplex structures on the different strands of the DNA, which makes the data set more realistic and reliable and improves the prediction performance of the model.
(4) A brand-new data processing flow is provided that supplements the original score-based selection and generation methods for constructing data sets; it offers an approach to constructing G4 training data sets and provides strong methodological support for predicting active G4 regions across the whole genome.
Drawings
FIG. 1 is a diagram of a method of data set construction for a G-quadruplex predictive model in accordance with the present invention;
FIG. 2 is a diagram of a G-quadruplex predictive model training and prediction method implemented based on a dataset constructed by the proposed method of the present invention;
FIGS. 3 to 5 are graphs showing the results of three experiments, respectively.
Detailed Description
The present invention is described in further detail below with reference to the drawings and detailed description.
The scheme provides a data set construction method for a G-quadruplex prediction model and a prediction method implemented with the data set constructed by that method. Data set construction is the basis for obtaining the G-quadruplex prediction model: the data set is first constructed by the new method, the G-quadruplex prediction model is then trained on that data set, and finally the trained model is applied to a sequence to be predicted, which realizes the method for predicting G-quadruplex structure. The specific flow is shown in FIG. 1:
In the first step, the Gene Expression Omnibus (GEO) accession number GSE178668 is used to download from GEO the G4 CUT&Tag raw sequencing data of 5 different cell lines (HEK293T, HeLa, K562, LM2, SW1271). The human genome sequencing file and its annotation data are then downloaded from the UCSC Genome Browser (https://genome.ucsc.edu/), with the sequence of each chromosome of the human genome downloaded separately in FASTA format.
In the second step, the G4 CUT&Tag raw sequencing data of each cell line are processed separately using the following steps:
(1) The G4 CUT&Tag raw sequencing data in gz format are downloaded from GEO using the SRA Toolkit (https://github.com/ncbi/sra-tools/wiki) and decompressed to FASTQ format.
Note that GEO, mentioned in the first step, is where the sequencing data are hosted, whereas the SRA Toolkit used in this step is the tool that downloads them.
(2) The sequence alignment tool Bowtie2 (https://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) is used to map the sequencing data onto the hg38 human reference genome. The hg38 human reference genome is downloaded from the UCSC Genome Browser (https://genome.ucsc.edu/). The purpose of the mapping step is to align the reads obtained from sequencing to the human genome and thereby obtain the positions of G4 on the human genome.
(3) Peak calling is performed using MACS2 (https://pypi.org/project/MACS2/) with default parameters to identify the peaks in the sequencing data. Peak calling is a computational method for identifying regions of the genome that are enriched for aligned sequencing reads.
(4) Bedtools2 (https://bedtools.readthedocs.io/en/latest/index.html) is used to obtain DNA sequence data with a length of 120 bp around each peak summit as the G4 sequencing data.
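The coordinates of the 120 bp window in step (4) can be sketched as below. This assumes the window is centered on the summit and uses 0-based half-open coordinates; the source only says "around" the peak summit, so the centering, the clamping at the chromosome start and the name `summit_window` are all assumptions of this sketch.

```python
from typing import Tuple


def summit_window(summit: int, length: int = 120) -> Tuple[int, int]:
    """Return a 0-based half-open (start, end) window of `length` bp around `summit`.

    The window is centered on the summit unless that would run past the
    chromosome start, in which case it is shifted to begin at position 0.
    """
    half = length // 2
    start = max(0, summit - half)
    return start, start + length


print(summit_window(1000))  # window spanning 60 bp on each side of the summit
print(summit_window(10))    # shifted window near the chromosome start
```

The resulting intervals could then be written in BED format and passed to a sequence-extraction tool to obtain the DNA strings.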
In the third step, which is the key step of the scheme, the data set is constructed and divided using the proposed data set construction method; the specific steps are shown in FIG. 2. First, each chromosome of the human genome is scanned with the score-based G4Hunter method under the parameters window = 25 and score = 1.2, and the per-chromosome predictions are merged. Under this setting G4Hunter captures approximately 800,000 sequences in the human genome, and these captured sequences are used as the G4 candidate pool.
The scoring method G4Hunter is used directly here as the G-quadruplex prediction tool. Other prediction tools and other scoring methods may also be used in practice.
The candidate pool is divided into two regions according to the scoring rule of G4Hunter: one region contains the sequences with scores greater than 1.2 and is defined as the "G4 sense strand region", and the other contains the sequences with scores less than -1.2 and is defined as the "G4 antisense strand region". The scoring rule is as follows: a single G nucleotide scores 1; in a GG run, each G scores 2; in a GGG run, each G scores 3; in a run of 4 or more G, each G scores 4. C nucleotides are scored by the same rule as G, but with the sign of the score reversed.
The threshold 1.2 is a parameter of G4Hunter, the score-based G4 structure prediction tool used here, and is the value at which G4Hunter achieves its best prediction performance. The sign of the score indicates the strand on which G4Hunter predicts the G4 structure to occur, and can be understood simply as distinguishing the positive and negative strands of DNA. In specific embodiments, the threshold may take values other than 1.2, depending on the parameter settings of the tool.
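The scoring rule above can be made concrete with a short sketch. This is a simplified illustration and not the G4Hunter implementation: the function names are hypothetical, bases other than G and C are assumed to score 0, and averaging the per-base scores over a sliding window is an assumption about how the window parameter is applied.

```python
from typing import List


def base_scores(seq: str) -> List[int]:
    """Score each base: a G in a run of length r gets +min(r, 4), a C gets -min(r, 4), others 0."""
    seq = seq.upper()
    scores = [0] * len(seq)
    i = 0
    while i < len(seq):
        j = i
        while j < len(seq) and seq[j] == seq[i]:
            j += 1  # extend j to the end of the current run of identical bases
        if seq[i] in "GC":
            value = min(j - i, 4) * (1 if seq[i] == "G" else -1)
            scores[i:j] = [value] * (j - i)
        i = j
    return scores


def window_scores(seq: str, window: int = 25) -> List[float]:
    """Mean per-base score of every full-length sliding window."""
    s = base_scores(seq)
    return [sum(s[k:k + window]) / window for k in range(len(s) - window + 1)]


def strand_region(score: float, threshold: float = 1.2) -> str:
    """Assign a window score to the sense region, the antisense region, or neither."""
    if score > threshold:
        return "sense"
    if score < -threshold:
        return "antisense"
    return "none"


g_rich = "GGGTTA" * 4 + "G"  # 25-base toy sequence
print(strand_region(window_scores(g_rich)[0]))
```

A C-rich sequence yields a negative score of the same magnitude, which is how the sign separates the two strand regions.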
Then the processed G4 CUT&Tag sequencing data, referred to here as the G4 sequencing data, are intersected separately with the "G4 sense strand region" and the "G4 antisense strand region" of the G4 candidate pool captured by G4Hunter. The DNA sequences in the "G4 sense strand region" that overlap peaks of the G4 sequencing data are extracted directly. The DNA sequences in the "G4 antisense strand region" that overlap peaks of the G4 sequencing data undergo the negative-strand extraction operation, that is, the sequence on the negative strand at the same position is extracted according to the position information of the DNA sequence. Each antisense strand sequence obtained in this way is then subjected to center-symmetric processing so that the data of both the sense strand and the antisense strand are in 5'-to-3' order; the 5' end and the 3' end are standard terms describing the direction of DNA and are basic biological concepts. The 5'-to-3' order ensures that every DNA sequence, whether from the sense strand or the antisense strand, is kept in the same orientation, which eliminates any effect on the model of the sense and antisense strands being read in different orders.
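Read together, the negative-strand extraction and the center-symmetric processing appear to correspond to what is commonly called the reverse complement. The sketch below rests on that interpretation, which is an assumption; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def minus_strand(plus_seq: str) -> str:
    """Bases on the negative strand at the same positions, still in plus-strand coordinate order."""
    return "".join(COMPLEMENT[base] for base in plus_seq.upper())


def to_five_prime(seq: str) -> str:
    """Reverse the sequence so that it reads from the 5' end to the 3' end."""
    return seq[::-1]


plus = "CCCTAACCCTAA"  # C-rich on the plus strand, so the G-rich motif is on the minus strand
antisense = to_five_prime(minus_strand(plus))
print(antisense)  # TTAGGGTTAGGG
```

After this step the G runs of an antisense-strand G4 appear as G runs in the stored sequence, just as they do for sense-strand samples.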
The sequences that G4Hunter produces under default parameters and the processed G4 sequencing data may contain some DNA sequences at duplicated positions. Whether a sequence is duplicated can be checked manually, or automatically by a program written by engineering personnel; the detection process is not described further here. Finally, a de-duplication operation is performed on the data set. During de-duplication, duplicated positions are found not only within each of the two independent parts but also between them; a position duplicated between the two parts may correspond to sequences from the DNA sense strand and the DNA antisense strand, and because the scheme considers strand specificity, duplicates between the two independent parts are not deleted. After the two sets of overlapping DNA sequences have each been de-duplicated, the two sets are combined and taken as the positive samples of the experimental data set.
The negative sample (control) sequences are obtained as follows. First, the data that overlap the G4 CUT&Tag sequencing data are deleted from the candidate pool captured by G4Hunter. From the candidate pool remaining after deletion (comprising the "G4 sense strand region" and the "G4 antisense strand region"), the same number of DNA sequences as there are positive samples is selected at random, which ensures a balanced data set in which the ratio of positive to negative samples is 1:1. Some of the randomly selected DNA sequences lie in the sense strand region and some in the antisense strand region. Those in the G4 sense strand region are extracted directly; those in the G4 antisense strand region undergo the negative-strand extraction operation and the center-symmetric processing, as in the construction of the positive samples, so that the data of both the sense and antisense strands are in 5'-to-3' order.
Finally, each obtained DNA sequence is extended leftward and rightward from its center to a total length of 120 bp, and the extended sequences are taken as the negative samples of the data set constructed in the experiment, that is, negative sample 1 in FIG. 2.
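The balanced negative sampling can be sketched as follows. The identifiers are placeholders, the name `sample_negatives` is hypothetical, and the fixed seed is an assumption added only to make the illustration reproducible.

```python
import random
from typing import List, Sequence, Set


def sample_negatives(
    candidates: Sequence[str], overlapping: Set[str], n_positive: int, seed: int = 0
) -> List[str]:
    """Remove candidates that overlap G4 sequencing peaks, then draw n_positive of the rest at random."""
    remaining = [c for c in candidates if c not in overlapping]
    if len(remaining) < n_positive:
        raise ValueError("not enough non-overlapping candidates for a 1:1 data set")
    return random.Random(seed).sample(remaining, n_positive)


# Placeholder region identifiers.
candidates = ["r1", "r2", "r3", "r4", "r5", "r6"]
positives = {"r1", "r2"}
negatives = sample_negatives(candidates, positives, n_positive=len(positives))
print(len(negatives))  # 2
```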
After the data sets are respectively constructed, the positive samples are combined with two different negative samples, so that two data sets used by the scheme are obtained.
In the fourth step, the sequences in the data set are first one-hot encoded: the one-hot encoding layer encodes a 120 bp DNA sequence into a 120 × 4 matrix, in which each column corresponds to one DNA base (A, C, G or T).
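A minimal sketch of the encoding, written with plain Python lists so that it has no dependencies. The column order A, C, G, T and the all-zero row for any other symbol are assumptions of this sketch, as is the function name `one_hot`.

```python
from typing import List

BASES = "ACGT"  # assumed column order


def one_hot(seq: str, length: int = 120) -> List[List[int]]:
    """Encode a DNA sequence as a length x 4 matrix with one row per position."""
    if len(seq) != length:
        raise ValueError(f"expected {length} bases, got {len(seq)}")
    # Symbols other than A/C/G/T (for example N) become an all-zero row.
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]


matrix = one_hot("ACGT" * 30)
print(len(matrix), len(matrix[0]))  # 120 4
```

A tree-based model would typically receive each matrix flattened into a single 480-element feature vector; that flattening is an assumption, since the source does not describe it.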
In the fifth step, the one-hot-encoded DNA sequences are fed as input to the prediction model for training. The prediction model used in this embodiment is XGBoost, a model based on the boosting approach in machine learning. XGBoost builds on decision trees: it generates multiple weak classifiers and combines them into one strong classifier. For a given data set of n examples and m features,

$$\mathcal{D} = \{(x_i, y_i)\} \quad (|\mathcal{D}| = n,\ x_i \in \mathbb{R}^m,\ y_i \in \mathbb{R}),$$

a tree ensemble model predicts the output using K additive functions:

$$\hat{y}_i = \phi(x_i) = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F}, \tag{1}$$

where $\mathcal{F} = \{ f(x) = w_{q(x)} \}$ ($q: \mathbb{R}^m \to T$, $w \in \mathbb{R}^T$) is the space of regression trees (also known as CART). Here $q$ represents the structure of each tree, which maps an example to the corresponding leaf index, and $T$ is the number of leaves in the tree. Unlike decision trees, each regression tree contains a continuous score on each leaf, and $w_i$ denotes the score on the $i$-th leaf. For a given example, the decision rules in the trees (given by $q$) are used to classify it into leaves, and the final prediction is calculated by summing the scores in the corresponding leaves (given by $w$). To learn the set of functions used in the model, the following regularized objective is minimized:

$$\mathcal{L}(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert w \rVert^2. \tag{2}$$

Here $l$ is a differentiable convex loss function that measures the difference between the predicted value $\hat{y}_i$ and the target value $y_i$. The second term $\Omega$ penalizes the complexity of the model (that is, of the regression tree functions). The additional regularization term helps smooth the final learned weights and avoids overfitting. Intuitively, the regularized objective tends to select a model that uses simple and predictive functions. Similar regularization techniques have been used in the regularized greedy forest (RGF) model; the objective and the corresponding learning algorithm here are simpler than those of RGF and easier to parallelize. When the regularization parameters are set to zero, the objective reduces to conventional gradient tree boosting.
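A small worked example may help fix the notation. It assumes the standard XGBoost complexity term Ω(f) = γT + ½λ‖w‖², where T is the number of leaves and w the vector of leaf scores; the function names and the numbers are illustrative only and do not come from the experiments.

```python
from typing import List


def ensemble_prediction(leaf_scores: List[float]) -> float:
    """Equation (1): sum the score of the leaf that each of the K trees assigns to the example."""
    return sum(leaf_scores)


def complexity(leaf_weights: List[float], gamma: float, lam: float) -> float:
    """Omega(f) = gamma * T + 0.5 * lambda * ||w||^2, with T the number of leaves of one tree."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


# Three trees place the example in leaves scoring 0.4, -0.1 and 0.3.
print(ensemble_prediction([0.4, -0.1, 0.3]))

# One tree with three leaves: 1.0 * 3 + 0.5 * 2.0 * (0.25 + 0.25 + 1.0)
print(complexity([0.5, -0.5, 1.0], gamma=1.0, lam=2.0))
```

Larger γ discourages trees with many leaves, and larger λ shrinks the leaf scores, which is the smoothing effect the regularization term is meant to provide.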
In the sixth step, the trained XGBoost model is used to predict the G4 region activity (a score between 0 and 1) of a new DNA sequence; the higher the score, the more likely the DNA sequence under test is to form a G-quadruplex structure.
In the seventh step, sequences from different cell lines (GEO GSE178668), such as HEK293T and HeLa, are divided into training and test sets; the training set is used to obtain the model with the best hyperparameters, and the test set is used to evaluate the accuracy of the model.
To verify the advantages of the scheme, evaluation indices including accuracy, F1-score, the ROC curve (receiver operating characteristic curve) and AUROC (area under the ROC curve) are used to evaluate model performance. The experimental results are shown in FIG. 3, FIG. 4 and FIG. 5. All the data of each cell line are used in turn as the training set to train the model, and the data of the remaining cell lines are each used as a test set to evaluate it; this procedure demonstrates that the resulting model can predict across cell lines.
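For readers unfamiliar with AUROC, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison, which is simple but quadratic in the number of samples; the function name and the toy scores are illustrative and unrelated to the reported results.

```python
from typing import List


def auroc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Three of the four positive-negative pairs are ranked correctly.
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```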
Fig. 3 is a graph showing the results of the model prediction of the active G4 region of HEK293T cells on an independent test set, auroc=0.99 demonstrating that the model achieved good prediction results.
As shown in fig. 2, the present solution further constructs a negative sample 2 for the negative sample, where the differences between the negative sample 2 and the negative sample 1 are mainly represented by the fact that the antisense strand sequence does not take the negative strand, and two different data sets Date1 and Date2 are used to train the model respectively, and the 10-fold cross-validation result is shown in fig. 4, and data1: auro= 0.9632 and data2: auroc=0.9606, data1 is the negative sample 1+ positive sample in fig. 2, and data2 is the negative sample 2+ positive sample in fig. 2. From the experimental results, it can be seen that the training set without the negative-strand taking operation and the data set with the negative-strand taking operation have the results obtained under the same parameters and 10-fold cross validation without great change. The foregoing results are known to occur by analysis of the positive and negative set sequence information content because the information in the G4 negative sample is much more chaotic than the sequences in the G4 positive sample and the sequences are more non-conservative, so the information content is lower around the G4 structural region in the negative sample. The G4 sequence in the positive sample sequence is more conservative, and the information confusion degree is lower, so that the G4 structure and both ends of the G4 sequence have higher information content, and the conservative information of the sequences can be captured for a prediction model so as to be successfully predicted. 
The comparison experiment proves that the negative strand specificity hardly affects the precision of the model, so that the negative strand taking operation of the antisense strand region of the DNA to be detected can take the strand specificity of the G4 structure into consideration on the premise of not affecting the prediction precision of the model, and powerful method support is provided for the prediction of the whole genome activity G4 region.
Fig. 5 shows the results of using the XGBoost prediction model trained on HEK293T cell line data to predict the test sets of other cell lines: Hela, AUROC = 0.968; K562, AUROC = 0.974; LM2, AUROC = 0.973; SW1271, AUROC = 0.975. The test results on Hela, K562, LM2 and SW1271 are all relatively accurate, which indicates that the model has relatively good robustness.
In addition, the present solution uses eight evaluation indicators, Sn (sensitivity), Sp (specificity), Pre (precision), Acc (accuracy), MCC (Matthews correlation coefficient), F1 (F1-score), AUPRC (area under the precision-recall curve) and AUROC (area under the receiver operating characteristic curve), to evaluate the prediction model, with the test set and the training set drawn from different cell lines. The model performs well on all eight indicators, which shows that the classification model proposed by the inventors distinguishes G4 structures from non-G4 structures well and reaches high prediction accuracy on both the sense and antisense strands. It therefore has the potential to be applied to whole-genome prediction without losing the G4 structures on the antisense strand.
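The eight indicators named above can be computed as in the following sketch. It uses only the standard definitions; the 0.5 decision threshold, the function names, and the use of average precision as the AUPRC estimate are assumptions for illustration and are not stated in the patent.

```python
import math

def _div(num, den):
    # Return 0.0 instead of raising when a denominator is zero.
    return num / den if den else 0.0

def threshold_metrics(y_true, scores, threshold=0.5):
    """Sn, Sp, Pre, Acc, MCC and F1 at a fixed score threshold."""
    pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    sn = _div(tp, tp + fn)
    sp = _div(tn, tn + fp)
    pre = _div(tp, tp + fp)
    acc = _div(tp + tn, len(pairs))
    f1 = _div(2 * pre * sn, pre + sn)
    mcc = _div(tp * tn - fp * fn,
               math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Sn": sn, "Sp": sp, "Pre": pre, "Acc": acc, "MCC": mcc, "F1": f1}

def auroc(y_true, scores):
    """Probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return _div(wins, len(pos) * len(neg))

def auprc(y_true, scores):
    """Average precision: mean precision at the rank of each positive."""
    ranked = sorted(zip(scores, y_true), key=lambda x: -x[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return _div(total, hits)
```

The pairwise AUROC is quadratic in the number of samples, which is fine for a sketch but would be replaced by a rank-based computation on genome-scale test sets.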
The present solution provides a new data processing method. Overlapping sequences are obtained from new in vivo G4 sequencing data, G4CUT & Tag, and the candidate G4 regions predicted across the whole genome by the score-based G4 prediction method G4Hunter. After the resulting DNA sequences are processed, a real data set with higher resolution, a higher signal-to-noise ratio and cleaner data is obtained for training the model. Theoretical and experimental results show that training on a data set processed by this method greatly improves the prediction performance of the model.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the present solution. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions in a similar manner without departing from the spirit of the invention or exceeding the scope of the invention as defined in the accompanying claims.

Claims (10)

1. A data set construction method for a G-quadruplex predictive model, the method comprising:
S1, collecting G4CUT & Tag raw sequencing data, and preprocessing the G4CUT & Tag raw sequencing data to obtain preprocessed G4 sequencing data;
collecting a human genome sequencing file;
S2, running a G-quadruplex prediction tool on each chromosome in the human genome sequencing file to obtain a G4 candidate pool consisting of X sequences;
S3, dividing the G4 candidate pool into a G4 sense strand region and a G4 antisense strand region based on a scoring rule;
S4, performing an overlapping-peak extraction operation between the preprocessed G4 sequencing data and each of the G4 sense strand region and the G4 antisense strand region;
S5, directly extracting the DNA sequence data obtained from the overlapping peaks between the G4 sense strand region and the G4 sequencing data;
extracting the DNA sequence data obtained from the overlapping peaks between the G4 antisense strand region and the G4 sequencing data, and performing a negative-strand taking operation on that DNA sequence data;
taking the two groups of extracted data as the positive samples of the data set;
S6, deleting from the G4 candidate pool the data that overlap peaks of the G4 sequencing data to obtain a pruned G4 candidate pool, and randomly selecting from the pruned G4 candidate pool the same number of DNA sequences as there are positive samples;
directly extracting the randomly selected DNA sequences in the G4 sense strand region;
performing the negative-strand taking operation on the randomly selected DNA sequences in the G4 antisense strand region;
taking the two groups of data above as the negative samples of the data set.
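Steps S3 to S6 can be sketched on toy intervals as follows. This is an illustration under stated assumptions, not the patented implementation: intervals are taken to be 0-based and half-open, the scoring rule is assumed to assign positive scores to the sense strand and negative scores to the antisense strand, and the function names and tuple layout are hypothetical. A production pipeline would use a dedicated interval-overlap tool rather than the quadratic scan shown here.

```python
import random

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end) overlap when each starts before the other ends.
    return a_start < b_end and b_start < a_end

def build_samples(candidates, peaks, seed=0):
    """Split a candidate pool into positive and negative samples.

    candidates: list of (chrom, start, end, score) from the G4 predictor.
    peaks:      list of (chrom, start, end) from the sequencing data.
    Returns two lists of (chrom, start, end, strand).
    """
    def hit(c):
        return any(c[0] == p[0] and overlaps(c[1], c[2], p[1], p[2])
                   for p in peaks)

    def with_strand(c):
        # Assumed scoring rule: score > 0 is sense ("+"), otherwise antisense ("-").
        return (c[0], c[1], c[2], "+" if c[3] > 0 else "-")

    positives = [with_strand(c) for c in candidates if hit(c)]
    remaining = [c for c in candidates if not hit(c)]
    # Draw as many negatives as positives (fewer only if the pool is too small).
    rng = random.Random(seed)
    k = min(len(positives), len(remaining))
    negatives = [with_strand(c) for c in rng.sample(remaining, k)]
    return positives, negatives
```

Records labelled "-" would then go through the negative-strand taking operation before being used as samples.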
2. The method for constructing a dataset for a G-quadruplex predictive model according to claim 1, wherein step S1 specifically comprises:
S11, collecting G4CUT & Tag raw sequencing data of N different cell lines;
S12, mapping the G4CUT & Tag raw sequencing data to the hg38 human reference genome using a sequence alignment tool;
S13, identifying the peak summits of the data using a peak summit identification tool;
S14, obtaining DNA sequence data 120 bp in length around each peak summit as the G4 sequencing data;
in step S6, the obtained DNA sequence is extended from its center to the left and right to 120 bp, and the extended sequence is used as a negative sample of the data set.
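The fixed-length window of step S14 and step S6 can be sketched as below. The sketch assumes a total width of 120 bp (60 bp on each side of the center), which is consistent with the 120 × 4 encoding in claim 9 but is an interpretation of the claim wording. Coordinates are assumed to be 0-based and half-open, and the function name and the boundary clamping are hypothetical.

```python
def summit_window(center, width=120, chrom_length=None):
    """Return (start, end) of a fixed-width window around a center position.

    The window is shifted, not shortened, when it would run past either
    chromosome end, so that every extracted sequence has the same length.
    """
    start = center - width // 2
    end = start + width
    if start < 0:
        start, end = 0, width
    if chrom_length is not None and end > chrom_length:
        end = chrom_length
        start = max(0, end - width)
    return start, end
```

For example, `summit_window(1000)` gives `(940, 1060)`, a 120 bp interval centered on position 1000.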
3. The method for constructing a dataset for a G-quadruplex predictive model according to claim 2, wherein in step S11, G4CUT & Tag raw sequencing data of five different cell lines, HEK293T, Hela, K562, LM2 and SW1271, are collected;
wherein the sequencing data of one or more cell lines is used as a training set, and the sequencing data of the other one or more cell lines is used as a test set;
alternatively, part of the sequencing data of one or more cell lines is used as a training set and the remaining sequencing data is used as a test set.
4. The method for constructing a dataset for a G-quadruplex predictive model according to claim 2, further comprising, in step S6:
detecting whether the overlapping result contains positions of repeated DNA sequences;
if so, performing a de-duplication operation on the repeated positions.
5. The method for constructing a data set for a G-quadruplex predictive model according to claim 4, wherein, in the overlapping result, duplicate detection and de-duplication of repeated DNA sequences are performed within the sense strand and within the antisense strand separately; no duplicate detection and/or de-duplication is performed between the sense strand and the antisense strand.
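The strand-aware de-duplication of claims 4 and 5 can be sketched as follows. The record layout `(chrom, start, end, strand)` and the function name are hypothetical. Because the strand is part of the record, identical coordinates on opposite strands are both kept, while repeats within one strand are dropped.

```python
def dedupe_within_strand(records):
    """Drop repeated (chrom, start, end, strand) records, keeping the first.

    Duplicates are only detected among records of the same strand, because
    the strand is part of the key used for comparison.
    """
    seen = set()
    unique = []
    for rec in records:
        if rec in seen:
            continue
        seen.add(rec)
        unique.append(rec)
    return unique
```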
6. The method according to claim 1, wherein in step S5, the antisense strand sequence is subjected to centrosymmetric processing so that the data of both the sense strand and the antisense strand are in order from the 5' end to the 3' end.
7. The method according to claim 1, wherein in step S6, the sequence obtained by taking the negative strand is subjected to centrosymmetric processing so that the data of both the sense strand and the antisense strand are in order from the 5' end to the 3' end.
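One plausible reading of claims 6 and 7 is that negative-strand taking corresponds to base complementation and centrosymmetric processing corresponds to reversing the sequence, so that the two together give the reverse complement read 5' to 3'. That reading is an assumption, as are the function names in this sketch.

```python
# Complement table for DNA bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def take_negative_strand(seq):
    """Assumed meaning of negative-strand taking: complement each base."""
    return seq.translate(_COMPLEMENT)

def centrosymmetric(seq):
    """Assumed meaning of centrosymmetric processing: reverse the sequence."""
    return seq[::-1]

def antisense_to_5_3(seq):
    """Reverse complement, so the antisense record reads 5' to 3'."""
    return centrosymmetric(take_negative_strand(seq))
```

Under this reading, a C-rich antisense record such as `CCCTAACCC` becomes the G-rich `GGGTTAGGG`, so sense and antisense samples present the G-run pattern to the model in the same orientation.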
8. A method for predicting G-quadruplex structure, the method comprising:
SA1, constructing a data set comprising positive samples and negative samples by the method of any one of claims 1-7, and preprocessing the data set;
SA2, dividing the preprocessed data set into a training set and a testing set;
SA3, respectively training and testing the G-quadruplex predictive model by using a training set and a testing set to obtain a trained G-quadruplex predictive model;
SA4, dividing the DNA sequence to be predicted into a sense strand region and an antisense strand region, inputting the DNA sequence of the sense strand region and the DNA sequence of the antisense strand region after negative-strand taking into the trained G-quadruplex predictive model, and outputting, by the model, the G4 region activity of the DNA sequence to be predicted.
9. The method according to claim 8, wherein in step SA1, the data set is preprocessed as follows:
encoding the sequences in the dataset with a one-hot encoding layer, whereby a DNA sequence 120 bp long is encoded by one-hot encoding into a 120 × 4 matrix in which each column corresponds to a DNA base;
taking the one-hot encoded DNA sequence as input to the G-quadruplex predictive model in step SA3;
in step SA4, the DNA sequence to be predicted is one-hot encoded and then used as input.
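The 120 × 4 one-hot encoding described in claim 9 can be sketched as below. The column order A, C, G, T, the all-zero row for ambiguous bases such as N, and the flattening into a 480-element feature vector for a tree-based model are assumptions for illustration; the function names are hypothetical.

```python
# Assumed column order; the patent only states that each column is one base.
_BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq, length=120):
    """Encode a DNA sequence as a length x 4 matrix of 0/1 values."""
    if len(seq) != length:
        raise ValueError(f"expected {length} bases, got {len(seq)}")
    matrix = [[0, 0, 0, 0] for _ in range(length)]
    for i, base in enumerate(seq.upper()):
        j = _BASE_INDEX.get(base)
        if j is not None:  # ambiguous bases (e.g. N) stay all-zero
            matrix[i][j] = 1
    return matrix

def flatten(matrix):
    """Row-major flattening into a single feature vector."""
    return [v for row in matrix for v in row]
```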
10. The method according to claim 8, wherein in step SA1 and in step SA4, the DNA sequence obtained by taking the negative strand of the antisense strand region is subjected to centrosymmetric processing;
the G-quadruplex prediction model adopts XGBoost;
for one input, the model output is a score between 0 and 1 representing the activity of the G4 region.
CN202310267142.3A 2023-03-20 2023-03-20 Data set construction method for G-quadruplex prediction model and prediction method thereof Active CN116110493B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310267142.3A CN116110493B (en) 2023-03-20 2023-03-20 Data set construction method for G-quadruplex prediction model and prediction method thereof


Publications (2)

Publication Number Publication Date
CN116110493A 2023-05-12
CN116110493B 2023-06-20

Family

ID=86267490



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112860977A (en) * 2021-03-18 2021-05-28 杭州师范大学 Link prediction method based on convolutional neural network
CN113160877A (en) * 2021-01-11 2021-07-23 东南大学 Prediction method of cell-specific genome G-quadruplex
US20210257050A1 (en) * 2018-08-13 2021-08-19 Roche Sequencing Solutions, Inc. Systems and methods for using neural networks for germline and somatic variant calling
CN113344272A (en) * 2021-06-08 2021-09-03 汕头大学 Prediction method of interaction relation between circRNA, miRNA and RBP based on machine learning
CN113939600A (en) * 2019-05-29 2022-01-14 X基因组公司 System and method for sequencing


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NADIA NORHAKIM等: "Stability of guanine-quadruplexes as sensing elements investigated using solvation free energy", 《2021 IEEE INTERNATIONAL CONFERENCE ON SENSORS AND NANOTECHNOLOGY (SENNANO)》 *
丁漪杰: "基于氨基酸序列的蛋白质交互作用预测方法研究", 《中国博士学位论文全文数据库 基础科学辑》 *
董红斌;石丽;李涛;: "一种改进的microRNA预测模型集成方法", 计算机科学, no. 02 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant