CN116386720A - Single cell transcription factor prediction method based on deep learning and attention mechanism - Google Patents
Single cell transcription factor prediction method based on deep learning and attention mechanism

- Publication number: CN116386720A (application CN202310383948.9A)
- Authority: CN (China)
- Prior art keywords: cell, data, feature vector, transcription factor, sequencing data
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Classifications
- G16B20/30 — Detection of binding sites or motifs
- G06N3/045 — Combinations of networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/048 — Activation functions
- G06N3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
- G16B40/00 — ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The invention discloses a single-cell transcription factor prediction method based on deep learning and an attention mechanism. Single-cell chromatin accessibility sequencing data are acquired, preprocessed and then subjected to a data enhancement operation to obtain enhanced sequencing data. Regression peaks in the enhanced sequencing data are extracted as a feature vector S, the forward and reverse enhanced sequencing data are concatenated as a feature vector A, and DNA sequence data from the whole genome are converted into a feature vector U. The feature vectors S, A and U are concatenated and input into a deep network model, comprising a convolution module and a channel attention module, to predict the probability of each transcription factor in a single cell.
Description
Technical Field
The invention relates to transcription factor detection technology, and in particular to a single-cell transcription factor prediction method based on deep learning and an attention mechanism.
Background
Transcriptional regulation is an important mechanism by which organisms control gene expression: interactions between transcription factors and their associated receptors ensure the normal course of gene expression in cells and allow rapid responses to subtle changes in the biological environment. Genome-wide detection of transcription factor binding sites is therefore an important step in exploring the transcriptional regulatory mechanisms of genes.
In recent years, single-cell sequencing has entered a stage of rapid development, thanks to advances in high-throughput sequencing technology. From the single-cell perspective, important fields such as cancer heterogeneity, stem cell development and differentiation, the human cell atlas, and gene transcription regulation can be understood more deeply and accurately. How to predict transcription factor binding sites accurately and efficiently from existing single-cell data is therefore an important and urgent problem.
Currently, owing to advances in sequencing technology and substantial improvements in computational power, deep learning has been widely applied to the prediction of transcription factor binding sites by virtue of its advantages in data processing. Deep-learning-based prediction models mainly comprise convolutional neural networks, recurrent neural networks, hybrid neural networks and the like, and methods based on convolutional neural networks have performed particularly well. Zhou et al. proposed the DeepSEA model, which predicts transcription factor binding by learning from large-scale chromatin data and predicts the effects of single-nucleotide variants on DNase I sensitivity and chromatin state. Alipanahi et al. proposed the DeepBind model, which uses convolutional neural networks to predict the sequence specificity of DNA- and RNA-binding proteins.
Although deep learning algorithms overcome the shortcomings of traditional machine learning methods to some extent, the following deficiencies remain. First, the training data currently used are relatively homogeneous, even though other omics data could supply complementary feature information. Second, as convolutional networks grow deeper, and because of the black-box nature of neural networks, the interpretability of the overall model decreases significantly.
Disclosure of Invention
To address the above deficiencies of the prior art, the single-cell transcription factor prediction method based on deep learning and an attention mechanism provided herein solves the problem that deep learning models extract data features inefficiently.
To achieve the object of the invention, the following technical solution is adopted:
A single-cell transcription factor prediction method based on deep learning and an attention mechanism is provided, comprising the following steps:
acquiring single-cell chromatin accessibility sequencing data, preprocessing the data, and then performing a data enhancement operation to obtain enhanced sequencing data;
extracting regression peaks in the enhanced sequencing data as a feature vector S, concatenating the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U; and
concatenating the feature vectors S, A and U and inputting the result into a deep network model to predict the probability of each transcription factor in a single cell, wherein the deep network model comprises a convolution module and a channel attention module.
The beneficial effects of the invention are as follows: preprocessing the single-cell chromatin accessibility sequencing data reduces noise in the data, and the data are then fed into a deep learning model with an embedded channel attention module for recognition, improving both the speed and the accuracy with which the model extracts the required data features.
Further, inputting the concatenated vector into the deep network model to predict transcription factor probabilities further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility sequencing data belong is known; if so, selecting the deep network model of the corresponding cell line for prediction; otherwise, proceeding to step S32;
S32, inputting the concatenated data into the deep network models corresponding to the cell lines GM12878, K562 and H1ESC, respectively, to predict transcription factor probabilities;
S33, taking the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
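The fallback in steps S32–S33 amounts to an element-wise maximum over the three models' outputs. A minimal sketch, assuming each model's output is already available as a NumPy vector of per-factor probabilities (the toy values below are illustrative, not from the patent):

```python
import numpy as np

def predict_single_cell(probs_by_model: dict) -> np.ndarray:
    """Combine per-cell-line model outputs (step S33): for each
    transcription factor, keep the maximum predicted probability
    across the GM12878, K562 and H1ESC models."""
    stacked = np.stack(list(probs_by_model.values()))  # shape (3, M)
    return stacked.max(axis=0)

# Toy outputs for M = 4 transcription factors (illustrative values only).
outputs = {
    "GM12878": np.array([0.10, 0.80, 0.30, 0.55]),
    "K562":    np.array([0.20, 0.60, 0.90, 0.40]),
    "H1ESC":   np.array([0.15, 0.70, 0.50, 0.60]),
}
combined = predict_single_cell(outputs)
print(combined)  # [0.2 0.8 0.9 0.6]
```

When the cell line is known (step S31), the corresponding single entry would be used directly instead of the maximum.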
The beneficial effect of this technical solution is that, by building a corresponding deep network model for each common cell line, the prediction probabilities for a single cell located in a specific cell line can be obtained quickly and accurately even when the acquired sequencing data information is limited.
Further, the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully-connected layer connected in sequence; the channel attention module comprises a max-pooling/average-pooling layer, a shared multi-layer perceptron, a fully-connected layer and a flattening layer connected in sequence. The structure of the deep network model is as follows:

F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))

F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)

M_c(F) = σ(W_3(W_0(F_avg^c)) + W_3(W_0(F_max^c)))

where S_F is the concatenated vector; ReLU(·) is the activation function; max_pooling(·) is the max-pooling function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps of the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully-connected layers; F_avg^c and F_max^c are the features computed by global average pooling and global max pooling in the channel attention module; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention module; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; and n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
The beneficial effect of this technical solution is that combining a convolution model of modest depth with the attention module improves the feature-extraction efficiency of the model while lending it interpretability.
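The convolutional part of the forward pass (F_1 through F_3) can be sketched in plain NumPy; the filter shapes, pool size and random inputs below are illustrative assumptions, not the patent's actual hyperparameters:

```python
import numpy as np

def conv1d(x, w):
    """Valid 1-D cross-correlation of a (C, L) input with (C_out, C, K) filters."""
    c_out, c_in, k = w.shape
    length = x.shape[1] - k + 1
    out = np.empty((c_out, length))
    for i in range(length):
        out[:, i] = np.tensordot(w, x[:, i:i + k], axes=([1, 2], [0, 1]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling along the length axis."""
    length = (x.shape[1] // size) * size
    return x[:, :length].reshape(x.shape[0], -1, size).max(axis=2)

rng = np.random.default_rng(0)
S_F = rng.standard_normal((8, 1000))        # concatenated 8x1000 input
W = rng.standard_normal((16, 8, 9)) * 0.1   # assumed conv_1 filters
F1 = max_pool(relu(conv1d(S_F, W)))         # F_1 = max_pooling(ReLU(conv_1(S_F)))
print(F1.shape)  # (16, 496)
```

F_2 and F_3 would repeat the same conv→ReLU→max-pool pattern on the previous feature map.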
Further, the training method of the deep network model comprises the following steps:
Preprocessing of DNA sequence data: DNA sequence data of the whole genome are selected and cut into 200 bp fragments, with a sliding step of 50 bp between fragments; the reverse strand of each fragment is obtained, each fragment is extended to 1000 bp centred on its 200 bp core, and the result is then converted into Mappability Data.
Preprocessing of bulk tissue chromatin accessibility sequencing data: sequencing data are obtained from bulk tissue chromatin accessibility assays of the cell lines GM12878, K562 and H1ESC and trimmed, after which the trimmed data are mapped to the human genome hg19 using the Bowtie2 tool and further processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then merged into a data matrix using the bigWigMerge tool.
The DNA sequence data described above are one-hot encoded into a 4×1000 feature vector S_1; the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility sequencing data of each cell line are concatenated into 2×1000 feature vectors A_1, A_2 and A_3; the Mappability Data are converted into a 2×1000 feature vector U_1.
The feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1 are concatenated respectively, and the three concatenated vectors are input into three deep neural networks for training, yielding the deep network models corresponding to the cell lines GM12878, K562 and H1ESC.
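The one-hot encoding of a DNA fragment into the 4×L matrix described above can be sketched as follows (treating ambiguous bases such as N as all-zero columns is an assumption, since the patent does not specify their handling):

```python
import numpy as np

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> np.ndarray:
    """One-hot encode a DNA sequence into a 4 x len(seq) matrix
    (rows A, C, G, T); unknown bases become all-zero columns."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for j, base in enumerate(seq.upper()):
        i = BASES.get(base)
        if i is not None:
            mat[i, j] = 1.0
    return mat

S1 = one_hot("ACGTN")
print(S1.shape)  # (4, 5)
```

Applied to a 1000 bp fragment, the same function yields the 4×1000 feature vector S_1.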
The beneficial effects of the technical scheme are as follows: the scheme uses batch tissue chromatin accessibility analysis sequencing data, DNA sequence data and Mapability Feature data for pretraining, and can fully use the data characteristics of each data, thereby ensuring the accuracy of transcription factor prediction in single cell chromatin accessibility analysis sequencing data.
Further, the method for preprocessing the single-cell chromatin accessibility sequencing data comprises:
screening the peak read information of the single-cell chromatin accessibility sequencing data using the ENCODE single-cell chromatin accessibility data processing pipeline; and
converting the bam file of peak read information into a bigwig file using deepTools2, after which all bigwig files are merged into a data matrix using the bigWigMerge tool.
The beneficial effect of this technical solution is that preprocessing the single-cell chromatin sequencing data with the ENCODE pipeline preserves the original signal information to the greatest extent.
Further, the data enhancement operation comprises:
computing latent features of each single cell using the cisTopic method, and computing similarity scores between cells using the cosine similarity metric; and
selecting the 100 most similar neighbouring cells as data for augmenting the single-cell chromatin accessibility sequencing data, and pooling each single cell with its selected neighbouring cells to obtain the enhanced sequencing data.
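A minimal NumPy sketch of this enhancement step, assuming the cisTopic latent features are already available as a cells × topics matrix (the toy dimensions and the simple summation used for pooling are assumptions):

```python
import numpy as np

def enhance_cell(topic_matrix: np.ndarray, counts: np.ndarray,
                 cell: int, n_neighbors: int = 100) -> np.ndarray:
    """Rank cells by cosine similarity of their latent (e.g. cisTopic)
    features and pool the target cell's peak counts with those of its
    most similar neighbours."""
    norms = np.linalg.norm(topic_matrix, axis=1)
    sims = topic_matrix @ topic_matrix[cell] / (norms * norms[cell])
    sims[cell] = -np.inf                       # exclude the cell itself
    neighbors = np.argsort(sims)[::-1][:n_neighbors]
    return counts[cell] + counts[neighbors].sum(axis=0)

rng = np.random.default_rng(1)
topics = rng.random((300, 20))                 # 300 cells x 20 latent topics
counts = rng.integers(0, 2, (300, 50))         # binary cell-by-peak counts
enhanced = enhance_cell(topics, counts, cell=0)
print(enhanced.shape)  # (50,)
```

The pooled vector stands in for the "enhanced sequencing data" of the target cell; in practice the pooling would operate on the full fragment-level signal rather than a toy count matrix.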
The beneficial effects of the technical scheme are as follows: the scheme can increase the coverage of chromatin accessibility through the mode, meanwhile, the cell specificity is kept, and the batch processing effect can be effectively lightened through the operation.
Further, the single-cell transcription factor prediction method further comprises calculating an activity factor score for each transcription factor from its predicted probabilities:
where Z_{i,n,k} is the probability, output by the final network model, that transcription factor k coincides with the n-th cell peak in cell i, with the probabilities for each peak sorted from high to low; C_{i,n,k} is assigned according to whether Z_{i,n,k} is among the two highest output coincidence probabilities; C_{i,k} is the result of normalizing all the C_{i,n,k} values; PC_{i,k} is the computed activity factor score; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; and n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
The beneficial effect of this technical solution is that the transcription factor with the strongest regulatory intensity, or the one involved in the most regulatory relationships, can be identified more intuitively from the resulting activity scores.
Drawings
FIG. 1 is a flow chart of a method of single cell transcription factor prediction based on deep learning and attention mechanisms.
FIG. 2 is a framework diagram of a single cell transcription factor prediction method based on deep learning and attention mechanisms.
Detailed Description
The following description of the embodiments of the invention is provided to facilitate understanding by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, all inventions that make use of the inventive concept fall within the spirit and scope of the invention as defined by the appended claims.
Referring to FIG. 1, which shows a flowchart of the single-cell transcription factor prediction method based on deep learning and an attention mechanism, the method comprises steps S1 to S3.
In step S1, single-cell chromatin accessibility sequencing data are acquired, preprocessed and then subjected to a data enhancement operation to obtain enhanced sequencing data.
In practice, the preferred method for preprocessing the single-cell chromatin accessibility sequencing data comprises:
screening the peak read information of the single-cell chromatin accessibility sequencing data using the ENCODE single-cell chromatin accessibility data processing pipeline; and
converting the bam file of peak read information into a bigwig file using deepTools2, after which all bigwig files are merged into a data matrix using the bigWigMerge tool.
The data enhancement operation comprises: computing latent features of each single cell using the cisTopic method and computing inter-cell similarity scores using the cosine similarity metric; and selecting the 100 most similar neighbouring cells as augmentation data, pooling each single cell with its selected neighbours to obtain the enhanced sequencing data.
In step S2, extracting regression peaks in the enhanced sequencing data to generate a feature vector S of 4×1000, concatenating the forward and reverse enhanced sequencing data to a feature vector a of 2×1000, and converting DNA sequence data taken from the whole genome to a feature vector U of 2×1000;
In step S3, the feature vectors S, A and U are concatenated into an 8×1000 feature vector and input into the deep network model to predict the probability of each transcription factor in the single cell, wherein the deep network model comprises a convolution module and a channel attention module.
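The concatenation in step S3 is a simple stacking along the channel axis; a sketch with placeholder (zero-filled) feature vectors:

```python
import numpy as np

S = np.zeros((4, 1000))   # one-hot DNA regression-peak features
A = np.zeros((2, 1000))   # forward/reverse accessibility signal
U = np.zeros((2, 1000))   # mappability features
S_F = np.concatenate([S, A, U], axis=0)  # stack along the channel axis
print(S_F.shape)  # (8, 1000)
```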
The convolution layers in the convolution module mainly extract features from the input DNA sequence and shape data, mine correlations between the different data sources, and remove noise and unstable components; the processed, relatively stable pattern information is then passed as a whole to the attention module for TFBS (transcription factor binding site) prediction.
The deep network model uses only the channel attention part of CBAM to capture the feature and shape data of the biological sequence. The channel attention module spatially compresses the input feature map; this compression not only uses features extracted by average pooling but also introduces max pooling as a supplement. Global average pooling provides feedback for every position on the feature map, whereas global max pooling provides gradient feedback only at the most responsive positions.
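A minimal NumPy sketch of CBAM-style channel attention as described above, with both global average and global max pooling feeding a shared two-layer MLP (the channel count, reduction ratio and the ReLU inside the MLP are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W3):
    """CBAM-style channel attention: squeeze the (C, L) feature map with
    global average AND global max pooling, pass both through a shared
    two-layer MLP (W0, W3), sum, and apply a per-channel sigmoid gate."""
    f_avg = F.mean(axis=1)                     # (C,) global average pool
    f_max = F.max(axis=1)                      # (C,) global max pool
    mlp = lambda v: W3 @ np.maximum(W0 @ v, 0.0)
    M_c = sigmoid(mlp(f_avg) + mlp(f_max))     # (C,) channel weights in (0, 1)
    return F * M_c[:, None]                    # re-weight each channel

rng = np.random.default_rng(2)
C, L, r = 16, 124, 4                           # channels, length, reduction
F = rng.standard_normal((C, L))
W0 = rng.standard_normal((C // r, C)) * 0.1    # reduction layer
W3 = rng.standard_normal((C, C // r)) * 0.1    # expansion layer
out = channel_attention(F, W0, W3)
print(out.shape)  # (16, 124)
```

Because the gate M_c lies in (0, 1), each output channel is a damped copy of its input channel, which is what makes the attention weights directly inspectable for interpretability.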
Preprocessing the single-cell chromatin accessibility sequencing data in this way reduces noise in the data, and feeding the data into a deep learning model with an embedded channel attention module for recognition improves both the speed and the accuracy with which the model extracts the required data features.
In one embodiment of the invention, inputting the concatenated vector into the deep network model to predict transcription factor probabilities further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility sequencing data belong is known; if so, selecting the deep network model of the corresponding cell line for prediction; otherwise, proceeding to step S32;
S32, inputting the concatenated data into the deep network models corresponding to the cell lines GM12878, K562 and H1ESC, respectively, to predict transcription factor probabilities;
S33, taking the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
By building a deep network model for each common cell line, this scheme obtains the prediction probabilities for a single cell located in a specific cell line quickly and accurately even when the acquired sequencing data information is limited.
As shown in FIG. 2, the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully-connected layer connected in sequence; the channel attention module comprises a max-pooling/average-pooling layer, a shared multi-layer perceptron, a fully-connected layer and a flattening layer connected in sequence. The structure of the deep network model is as follows:

F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))

F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)

M_c(F) = σ(W_3(W_0(F_avg^c)) + W_3(W_0(F_max^c)))

where S_F is the concatenated vector; ReLU(·) is the activation function; max_pooling(·) is the max-pooling function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps of the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully-connected layers; F_avg^c and F_max^c are the features computed by global average pooling and global max pooling in the channel attention module; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention module; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; and n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
In implementation, the training method of the deep network model preferably comprises:
Preprocessing of DNA sequence data: DNA sequence data of the whole genome are selected and cut into 200 bp fragments, with a sliding step of 50 bp between fragments; the reverse strand of each fragment is obtained, each fragment is extended to 1000 bp centred on its 200 bp core, and the result is then converted into Mappability Data.
Preprocessing of bulk tissue chromatin accessibility sequencing data: sequencing data are obtained from bulk tissue chromatin accessibility assays of the cell lines GM12878, K562 and H1ESC and trimmed, after which the trimmed data are mapped to the human genome hg19 using the Bowtie2 tool and further processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then merged into a data matrix using the bigWigMerge tool.
The DNA sequence data described above are one-hot encoded into a 4×1000 feature vector S_1; the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility sequencing data of each cell line are concatenated into 2×1000 feature vectors A_1, A_2 and A_3; the Mappability Data are converted into a 2×1000 feature vector U_1.
The feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1 are concatenated respectively, and the three concatenated vectors are input into three deep neural networks for training (specifically: the input data are propagated forward, the error is computed and back-propagated, the parameters are updated, and training concludes), yielding the deep network models corresponding to the cell lines GM12878, K562 and H1ESC.
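The 200 bp / 50 bp sliding-window slicing and 1000 bp extension described in the DNA preprocessing step can be sketched as follows (clipping the extended window at chromosome ends is an assumption the patent does not spell out):

```python
def fragment_genome(seq: str, frag_len: int = 200, step: int = 50,
                    ext_len: int = 1000) -> list:
    """Slice a chromosome into 200 bp fragments with a 50 bp sliding
    step, then extend each fragment to 1000 bp centred on its 200 bp
    core, clipped at the chromosome boundaries. Returns (start, end)
    coordinates of the extended windows."""
    windows = []
    for start in range(0, len(seq) - frag_len + 1, step):
        centre = start + frag_len // 2
        ext_start = max(0, centre - ext_len // 2)
        ext_end = min(len(seq), centre + ext_len // 2)
        windows.append((ext_start, ext_end))
    return windows

chrom = "ACGT" * 1000                      # 4 kb toy chromosome
wins = fragment_genome(chrom)
print(len(wins), wins[0], wins[10])  # 77 (0, 600) (100, 1100)
```

Each extended window would then be one-hot encoded (4×1000) and paired with its mappability track (2×1000) for training.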
The single-cell transcription factor prediction method of this scheme further comprises calculating an activity factor score for each transcription factor from its predicted probabilities:
where Z_{i,n,k} is the probability, output by the final network model, that transcription factor k coincides with the n-th cell peak in cell i, with the probabilities for each peak sorted from high to low; C_{i,n,k} is assigned according to whether Z_{i,n,k} is among the two highest output coincidence probabilities; C_{i,k} is the result of normalizing all the C_{i,n,k} values; PC_{i,k} is the computed activity factor score; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; and n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
In summary, combining the convolution module with the channel attention mechanism effectively mitigates the low data-feature extraction efficiency caused by the black-box character of the neural network.
Claims (7)
1. A single-cell transcription factor prediction method based on deep learning and an attention mechanism, characterized by comprising the following steps:
acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the single-cell chromatin accessibility analysis sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U;
and splicing the feature vector S, the feature vector A and the feature vector U, and inputting the spliced vector into a deep network model to predict the probability of each transcription factor in the single cell, wherein the deep network model comprises a convolution module and a channel attention model.
2. The method of claim 1, wherein predicting the probabilities of transcription factors with the deep network model further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belong is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise proceeding to step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to the cell lines GM12878, K562 and H1ESC to predict the probabilities of transcription factors;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
3. The single-cell transcription factor prediction method according to claim 2, wherein the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F_avg)) + W_3(W_0(F_max)))
wherein S_F is the spliced vector; ReLU(·) is the activation function; max_pooling(·) is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are respectively the first, second and third convolution functions; F_1, F_2 and F_3 are respectively the feature maps of the convolution layers of the convolution module; W_1 and W_2 are respectively the weight matrices of the two fully connected layers; F_avg and F_max are respectively the features computed by the global average pooling and the global max pooling in the channel attention model; σ is the sigmoid activation function; W_0 and W_3 are respectively the parameters of the two layers in the multi-layer perceptron model; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the coincidence probability of transcription factor k with the n-th cell peak within each cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks readable in each cell.
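The channel attention in claim 3 resembles CBAM-style channel attention; a numpy sketch under that assumption (the shapes, random weights and hidden size are illustrative, not the patent's values):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W0, W3):
    """F has shape (channels, length); W0 and W3 are the two shared-MLP
    weight matrices.  Global average- and max-pooled channel descriptors
    pass through the shared MLP and are summed before the sigmoid,
    giving one attention weight M_c per channel that re-weights F."""
    f_avg = F.mean(axis=1)                 # global average pool per channel
    f_max = F.max(axis=1)                  # global max pool per channel
    mc = sigmoid(W3 @ relu(W0 @ f_avg) + W3 @ relu(W0 @ f_max))
    return mc[:, None] * F                 # channel-wise re-weighting

C, L, H = 8, 100, 4                        # channels, length, hidden size
rng = np.random.default_rng(0)
F = rng.normal(size=(C, L))
W0, W3 = rng.normal(size=(H, C)), rng.normal(size=(C, H))
out = channel_attention(F, W0, W3)
```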
4. The method of claim 1, wherein the training method of the deep network model comprises:
preprocessing of DNA sequence data: selecting whole-genome DNA sequence data and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; acquiring the reverse strand of each fragment, expanding each reverse strand to a 1000 bp fragment centered on the 200 bp, and then converting into map data;
preprocessing of bulk-tissue chromatin accessibility analysis sequencing data: acquiring bulk-tissue chromatin accessibility analysis sequencing data of the cell lines GM12878, K562 and H1ESC and performing a trimming operation; mapping the trimmed data to the human genome hg19 using the Bowtie2 tool and filtering them with samtools and Picard; converting the resulting bam files into bigwig files using deepTools2, and then splicing all bigwig files into a data matrix using the bigWigMerge tool;
encoding the DNA sequence data before preprocessing with one-hot into a 4×1000 feature vector S_1; splicing the forward and reverse preprocessed bulk-tissue chromatin accessibility analysis sequencing data corresponding to each cell line into 2×1000 feature vectors A_1, A_2 and A_3; converting the map data into a 2×1000 feature vector U_1;
splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1 respectively, and inputting the three spliced vectors into three deep neural networks for training to obtain the deep network models corresponding to the cell lines GM12878, K562 and H1ESC.
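The bulk-data preprocessing in claim 4 can be sketched as a command pipeline (this Python only builds the command lists without running them; the tool names follow the text, but every flag and file name is an illustrative assumption, not the patent's parameters):

```python
def build_pipeline(sample, genome_index="hg19"):
    """Return the hypothetical shell commands for one bulk ATAC sample:
    Bowtie2 alignment to hg19, samtools sorting, Picard duplicate
    marking, deepTools2 bamCoverage bam->bigwig conversion, and
    bigWigMerge splicing of the bigwig files."""
    return [
        ["bowtie2", "-x", genome_index, "-U", f"{sample}.trimmed.fastq",
         "-S", f"{sample}.sam"],                       # map trimmed reads
        ["samtools", "sort", "-o", f"{sample}.bam", f"{sample}.sam"],
        ["picard", "MarkDuplicates", f"I={sample}.bam",
         f"O={sample}.dedup.bam", f"M={sample}.metrics.txt"],
        ["bamCoverage", "-b", f"{sample}.dedup.bam",
         "-o", f"{sample}.bw"],                        # deepTools2 step
        ["bigWigMerge", f"{sample}.bw", "merged.bedGraph"],
    ]

cmds = build_pipeline("GM12878_rep1")
```

Each list could be handed to `subprocess.run` in a real pipeline; here they only document the order of operations.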
5. The method of claim 1, wherein the preprocessing of the single-cell chromatin accessibility analysis sequencing data comprises:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
converting the bam file of peak read information into a bigwig file using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
6. The single cell transcription factor prediction method of claim 1, wherein the data enhancement operation comprises:
calculating latent features of single cells using the cisTopic method, and calculating similarity scores between cells using cosine similarity as the data index;
selecting the 100 neighbouring cells most similar to the target cell as amplification data for the single-cell chromatin accessibility analysis sequencing data, and pooling the single cell with the selected neighbouring cells as the enhanced sequencing data.
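The neighbour-selection step of claim 6 can be sketched as follows (cosine similarity over per-cell latent features; here the cisTopic features are replaced by an arbitrary matrix for illustration, and the function name is hypothetical):

```python
import numpy as np

def enhance_cell(topics, i, k=100):
    """`topics` is a (cells x topics) matrix of per-cell latent features
    (from cisTopic in the text; any matrix works here).  Cosine
    similarity ranks neighbours of cell i, the k most similar cells are
    selected, and the indices of cell i plus its neighbours are returned
    so their raw reads can be pooled into the enhanced data."""
    X = topics / np.linalg.norm(topics, axis=1, keepdims=True)
    sims = X @ X[i]                          # cosine similarity to cell i
    sims[i] = -np.inf                        # exclude the cell itself
    k = min(k, len(sims) - 1)
    neighbours = np.argsort(sims)[::-1][:k]  # k most similar cells
    return np.concatenate([[i], neighbours])
```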
7. The single-cell transcription factor prediction method of claim 1, further comprising calculating an activity factor score of each transcription factor from its probability:
wherein Z_{i,n,k} is the coincidence probability, output by the final network model, between transcription factor k and the n-th cell peak in cell i, the probabilities being sorted from high to low; C_{i,n,k} is assigned after comparing the two highest coincidence probabilities output according to Z_{i,n,k}; C_{i,k} is the result of normalizing all the C_{i,n,k} values; PC_{i,k} is the calculated activity factor score.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310383948.9A CN116386720A (en) | 2023-04-11 | 2023-04-11 | Single cell transcription factor prediction method based on deep learning and attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116386720A true CN116386720A (en) | 2023-07-04 |
Family
ID=86965393
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116825204A (en) * | 2023-08-30 | 2023-09-29 | Ludong University | Single-cell RNA sequence gene regulation inference method based on deep learning |
CN116825204B (en) * | 2023-08-30 | 2023-11-07 | Ludong University | Single-cell RNA sequence gene regulation inference method based on deep learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111312329B (en) | Transcription factor binding site prediction method based on deep convolution automatic encoder | |
CN111243674B (en) | Base sequence identification method, device and storage medium | |
CN110245685B (en) | Method, system and storage medium for predicting pathogenicity of genome single-site variation | |
CN111341386A (en) | Attention-introducing multi-scale CNN-BilSTM non-coding RNA interaction relation prediction method | |
CN111261223B (en) | CRISPR off-target effect prediction method based on deep learning | |
CN111564179B (en) | Species biology classification method and system based on triple neural network | |
CN111507155A (en) | U-Net + + and UDA combined microseism effective signal first-arrival pickup method and device | |
CN116386720A (en) | Single cell transcription factor prediction method based on deep learning and attention mechanism | |
CN108710784A (en) | A kind of genetic transcription variation probability and the algorithm in the direction that makes a variation | |
CN114239585A (en) | Biomedical nested named entity recognition method | |
CN116312748A (en) | Enhancer-promoter interaction prediction model construction method based on multi-head attention mechanism | |
CN110599502A (en) | Skin lesion segmentation method based on deep learning | |
CN116680594A (en) | Method for improving classification accuracy of thyroid cancer of multiple groups of chemical data by using depth feature selection algorithm | |
CN116805514B (en) | DNA sequence function prediction method based on deep learning | |
CN112908421A (en) | Tumor neogenesis antigen prediction method, device, equipment and medium | |
CN113160885A (en) | RNA and protein binding preference prediction method and system based on capsule network | |
CN117393042A (en) | Analysis method for predicting pathogenicity of missense mutation | |
CN114566216B (en) | Attention mechanism-based splice site prediction and interpretation method | |
CN114168782B (en) | Deep hash image retrieval method based on triplet network | |
CN115810398A (en) | TF-DNA binding identification method based on multi-feature fusion | |
CN112365924A (en) | Bidirectional trinucleotide position specificity preference and point combined mutual information DNA/RNA sequence coding method | |
CN113764031A (en) | Prediction method of N6 methyladenosine locus in trans-tissue/species RNA | |
CN115019876A (en) | Gene expression prediction method and device | |
CN114694746A (en) | Plant pri-miRNA coding peptide prediction method based on improved MRMD algorithm and DF model | |
CN114566215A (en) | Double-end paired splice site prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||