CN116386720A - Single cell transcription factor prediction method based on deep learning and attention mechanism - Google Patents

Single cell transcription factor prediction method based on deep learning and attention mechanism

Info

Publication number
CN116386720A
CN116386720A (application CN202310383948.9A)
Authority
CN
China
Prior art keywords
cell
data
feature vector
transcription factor
sequencing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310383948.9A
Other languages
Chinese (zh)
Inventor
张永清
邹权
何宇辰
牛颢
丁春利
吴锡
王紫轩
刘宇航
王茂丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202310383948.9A
Publication of CN116386720A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B 20/30: Detection of binding sites or motifs
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a single-cell transcription factor prediction method based on deep learning and an attention mechanism. The method comprises: acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing it, and then performing a data enhancement operation to obtain enhanced sequencing data; extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U; and splicing the feature vector S, the feature vector A and the feature vector U and inputting the result into a deep network model to predict the probability of each transcription factor in single cells, wherein the deep network model comprises a convolution module and a channel attention model.

Description

Single cell transcription factor prediction method based on deep learning and attention mechanism
Technical Field
The invention relates to a transcription factor detection technology, in particular to a single-cell transcription factor prediction method based on deep learning and attention mechanisms.
Background
Gene transcription regulation is an important mechanism by which organisms control gene expression: interactions between transcription factors and their associated receptors keep the gene expression program of a cell running normally and allow rapid responses to subtle changes in the biological environment. Detecting transcription factor binding sites across the genome is therefore an important step in exploring the transcriptional regulatory mechanisms of genes.
In recent years, thanks to the rapid development of high-throughput sequencing, single-cell sequencing technology has also entered a stage of rapid development. From the single-cell perspective, important fields such as cancer heterogeneity, stem cell development and differentiation, the human cell atlas, and gene transcription regulation can be understood more deeply and more accurately. How to accurately and efficiently predict transcription factor binding sites from the single-cell data already available is therefore an important and urgent problem.
Currently, owing to advances in sequencing technology and substantial improvements in computational power, deep learning techniques have been widely used for transcription factor binding site prediction thanks to their advantages in data processing. Deep-learning-based prediction models mainly include convolutional neural networks, recurrent neural networks, and hybrid neural networks, and methods based on convolutional neural networks have performed particularly well. Zhou et al. proposed the DeepSEA model, which learns from large-scale chromatin data to predict transcription factor binding and the effects of single-nucleotide changes on chromatin features such as DNase I sensitivity. Alipanahi et al. proposed the DeepBind model, which uses convolutional neural networks to predict the sequence specificity of DNA- and RNA-binding proteins.
Although deep learning algorithms overcome the shortcomings of traditional machine learning methods to some extent, the following deficiencies remain. First, the training data currently used is relatively homogeneous, even though other omics data could likewise provide relevant feature information. Second, as the convolutional layers grow deeper and because of the black-box nature of neural networks, the interpretability of the whole computational model is also significantly reduced.
Disclosure of Invention
Aiming at the above defects in the prior art, the present invention provides a single-cell transcription factor prediction method based on deep learning and an attention mechanism, which solves the problem that deep learning models extract data features inefficiently.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
provided is a single cell transcription factor prediction method based on deep learning and attention mechanisms, comprising the steps of:
acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the single-cell chromatin accessibility analysis sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U;
and splicing the feature vector S, the feature vector A and the feature vector U, and inputting the spliced vector into a deep network model to predict the probability of each transcription factor in single cells, wherein the deep network model comprises a convolution module and a channel attention model.
The beneficial effects of the invention are as follows: in this scheme, preprocessing the single-cell chromatin accessibility analysis sequencing data reduces noise in the data; the data are then fed into a deep learning model with an embedded channel attention module for recognition, which improves both the speed and the precision with which the model extracts the required data features.
Further, inputting the spliced vector into the deep network model to predict transcription factor probabilities further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belongs is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise entering step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to cell lines GM12878, K562 and H1ESC to predict transcription factor probabilities;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
The beneficial effect of this technical scheme is that a corresponding deep network model is built for each common cell line, so that, even with limited information about the acquired sequencing data, the prediction probabilities for a single cell belonging to a particular cell line can be obtained quickly and accurately.
Further, the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F^c_avg)) + W_3(W_0(F^c_max)))
Z_{i,n,k} = sigmoid(W_2 · M_c(F))
wherein S_F is the spliced vector; ReLU() is the activation function; max_pooling() is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps output by the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully connected layers; F^c_avg and F^c_max are the features obtained by global average pooling and global max pooling in the channel attention model, respectively; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
The beneficial effect of this technical scheme is that a convolution module with relatively few layers is combined with an attention mechanism module, which improves the feature extraction efficiency of the model while also providing interpretability.
Further, the training method of the deep network model comprises:
Preprocessing of DNA sequence data: selecting DNA sequence data of the whole genome and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; obtaining the reverse strand of each fragment and expanding each reverse strand into a 1000 bp fragment centred on the 200 bp window; and then converting the result into Mappability Data;
Preprocessing of bulk tissue chromatin accessibility analysis sequencing data: obtaining bulk tissue chromatin accessibility analysis sequencing data for the cell lines GM12878, K562 and H1ESC and subjecting it to a trimming operation, after which the trimmed data are mapped to human genome hg19 using the Bowtie2 tool and processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then spliced into a data matrix using the bigWigMerge tool;
Encoding the DNA sequence data from before the preprocessing into a 4×1000 feature vector S_1 using one-hot encoding; splicing the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility analysis sequencing data of each cell line into feature vectors A_1, A_2 and A_3; converting the Mappability Data into a 2×1000 feature vector U_1;
Respectively splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1; and inputting the three spliced vectors into three deep neural networks for training to obtain the deep network models corresponding to cell lines GM12878, K562 and H1ESC.
The beneficial effects of this technical scheme are as follows: bulk tissue chromatin accessibility analysis sequencing data, DNA sequence data and Mappability feature data are used for pretraining, so that the characteristics of each data source are fully exploited, which ensures the accuracy of transcription factor prediction on single-cell chromatin accessibility analysis sequencing data.
Further, the method for preprocessing single cell chromatin accessibility analysis sequencing data comprises the following steps:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
the bam files of the peak read information are converted into bigwig files using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
The beneficial effect of this technical scheme is that preprocessing the single-cell chromatin sequencing data with the ENCODE pipeline preserves the original signal information to the greatest extent.
Further, the data enhancement operation includes:
calculating the latent features of each single cell using the cisTopic method, and computing similarity scores between cells using cosine similarity as the metric;
selecting the 100 neighbouring cells most similar to the target cell as data for augmenting the single-cell chromatin accessibility analysis sequencing data, and pooling the single cell with the selected neighbouring cells as the enhanced sequencing data.
The beneficial effects of this technical scheme are as follows: this approach increases the coverage of chromatin accessibility while preserving cell specificity, and the operation effectively mitigates batch effects.
Further, the single-cell transcription factor prediction method further comprises calculating an activity score for each transcription factor from the predicted transcription factor probabilities:
(The three equations defining C_{i,n,k}, C_{i,k} and PC_{i,k} appear as images in the original publication.)
wherein the coincidence probabilities between transcription factor k output by the final network model and the n-th cell peak in cell i are sorted from high to low, giving the first value, the second value, and so on up to M values; C_{i,n,k} is assigned by comparing Z_{i,n,k} with the two highest output coincidence probabilities; C_{i,k} is the result of normalizing all C_{i,n,k} values; PC_{i,k} is the calculated activity score; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
The beneficial effect of this technical scheme is that the obtained transcription factor activity scores make it more intuitive to identify the transcription factors with higher regulatory intensity or with the most regulatory relationships.
Drawings
FIG. 1 is a flow chart of a method of single cell transcription factor prediction based on deep learning and attention mechanisms.
FIG. 2 is a framework diagram of a single cell transcription factor prediction method based on deep learning and attention mechanisms.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, any invention that makes use of the inventive concept falls within the scope of protection defined by the appended claims.
Referring to fig. 1, which shows a flowchart of the single-cell transcription factor prediction method based on deep learning and attention mechanisms, the method includes steps S1 to S3.
In step S1, acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
in practice, the present protocol preferably provides a method for preprocessing single cell chromatin accessibility analysis sequencing data comprising:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
the bam files of the peak read information are converted into bigwig files using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
Wherein the data enhancement operation comprises: calculating potential characteristics of single cells by adopting a cisTopic method, and calculating similarity scores among cells by using data indexes of cosine similarity; the 100 adjacent cells most similar to the above were selected as data for single cell chromatin accessibility analysis sequencing data amplification, and single cells and selected adjacent cells were pooled as enhanced sequencing data.
In step S2, extracting regression peaks in the enhanced sequencing data to generate a feature vector S of 4×1000, concatenating the forward and reverse enhanced sequencing data to a feature vector a of 2×1000, and converting DNA sequence data taken from the whole genome to a feature vector U of 2×1000;
In step S3, the feature vector S, the feature vector A and the feature vector U are spliced into an 8×1000 feature vector, which is input into the deep network model to predict the probability of each transcription factor in the single cell; the deep network model comprises a convolution module and a channel attention model.
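For illustration only, the sketch below shows one way such an 8×1000 input could be assembled, following the layout used later for the training data (a 4×1000 one-hot DNA matrix plus a 2×1000 forward/reverse accessibility track and a 2×1000 mappability track); all function and array names are hypothetical and the exact encoding used by the invention may differ.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_dna(seq: str) -> np.ndarray:
    """Encode a 1000 bp DNA sequence as a 4 x 1000 one-hot matrix (N stays all-zero)."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:
            mat[idx, pos] = 1.0
    return mat

def build_input(seq: str, signal_fwd: np.ndarray, signal_rev: np.ndarray,
                map_fwd: np.ndarray, map_rev: np.ndarray) -> np.ndarray:
    """Stack S (4x1000), A (2x1000) and U (2x1000) into one 8 x 1000 input matrix."""
    S = one_hot_dna(seq)                              # DNA sequence, one-hot
    A = np.stack([signal_fwd, signal_rev])            # accessibility signal, both strands
    U = np.stack([map_fwd, map_rev])                  # mappability track, both strands
    return np.concatenate([S, A, U], axis=0)          # shape (8, 1000)
```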
The convolution layers in the convolution module mainly extract features from the input DNA sequence and shape data, mine the correlations between the different data sources, and remove noise and unstable components from the data; the relatively stable processed patterns are then passed as a whole to the attention mechanism module for TFBS (transcription factor binding site) prediction.
The deep network model uses only the channel attention part of CBAM to capture the feature and shape data of the biological sequence. The channel attention module spatially compresses the input feature map; the compression not only uses features extracted by average pooling but also introduces max pooling as a supplement. Global average pooling provides feedback for every position of the feature map, whereas global max pooling provides gradient feedback only at the most responsive positions of the feature map.
In this scheme, preprocessing the single-cell chromatin accessibility analysis sequencing data reduces noise in the data; the data are then fed into a deep learning model with an embedded channel attention module for recognition, which improves both the speed and the precision with which the model extracts the required data features.
In one embodiment of the present invention, inputting the spliced vector into the deep network model to predict transcription factor probabilities further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belongs is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise entering step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to cell lines GM12878, K562 and H1ESC to predict transcription factor probabilities;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
This scheme builds a deep network model for each common cell line, so that, even when the information available about the acquired sequencing data is limited, the prediction probabilities for a single cell belonging to a particular cell line can be obtained quickly and accurately.
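A minimal sketch of the selection logic of steps S31 to S33, assuming the three trained cell-line models are available as callables that return per-transcription-factor probabilities; the dictionary-based interface below is an illustrative assumption, not part of the patent.

```python
import numpy as np

def predict_tf_probabilities(x, models_by_cell_line, cell_line=None):
    """Use the matching cell-line model when known; otherwise take the
    element-wise maximum over the GM12878, K562 and H1ESC model outputs."""
    if cell_line is not None:
        return models_by_cell_line[cell_line](x)                        # S31: known cell line
    predictions = [model(x) for model in models_by_cell_line.values()]  # S32: run all three
    return np.maximum.reduce(predictions)                               # S33: per-TF maximum
```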
As shown in fig. 2, the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F^c_avg)) + W_3(W_0(F^c_max)))
Z_{i,n,k} = sigmoid(W_2 · M_c(F))
wherein S_F is the spliced vector; ReLU() is the activation function; max_pooling() is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps output by the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully connected layers; F^c_avg and F^c_max are the features obtained by global average pooling and global max pooling in the channel attention model, respectively; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
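The following PyTorch sketch illustrates one possible realisation of the structure implied by the formulas above: three Conv1d + ReLU + max-pooling blocks, a layer for W_1, a CBAM-style channel attention branch with a shared two-layer perceptron (W_0, W_3), and a sigmoid output layer for W_2. Channel counts, kernel sizes and the reduction ratio are illustrative assumptions, and W_1 is approximated here by a 1×1 convolution so that the feature map keeps its channel layout for the attention step; none of these values are taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: a shared two-layer MLP over global
    average-pooled and max-pooled channel descriptors, combined by a sigmoid."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared W_0 / W_3
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (batch, C, L)
        avg = self.mlp(f.mean(dim=2))                    # F_avg^c branch
        mx = self.mlp(f.max(dim=2).values)               # F_max^c branch
        return torch.sigmoid(avg + mx)                   # M_c(F): (batch, C)

class SingleCellTFNet(nn.Module):
    """Sketch of the conv + channel-attention network; sizes are illustrative."""
    def __init__(self, n_tfs: int, in_channels: int = 8):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=8, padding=4),
                                 nn.ReLU(), nn.MaxPool1d(4))
        self.conv1, self.conv2, self.conv3 = block(in_channels, 64), block(64, 128), block(128, 256)
        self.fc1 = nn.Conv1d(256, 256, kernel_size=1)    # stands in for W_1, applied per position
        self.attention = ChannelAttention(256)
        self.fc2 = nn.Linear(256, n_tfs)                 # W_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 8, 1000)
        f3 = self.conv3(self.conv2(self.conv1(x)))       # F_1, F_2, F_3
        f = torch.relu(self.fc1(f3))                     # F = ReLU(W_1 · F_3)
        attn = self.attention(f)                         # M_c(F)
        return torch.sigmoid(self.fc2(attn))             # Z_{i,n,k}
```

In this sketch the attention vector M_c(F) is fed directly to the output layer, matching the formula Z_{i,n,k} = sigmoid(W_2 · M_c(F)); standard CBAM would instead rescale F by M_c(F) before any further layers.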
In implementation, the training method of the deep network model preferably comprises the following steps:
Preprocessing of DNA sequence data: selecting DNA sequence data of the whole genome and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; obtaining the reverse strand of each fragment and expanding each reverse strand into a 1000 bp fragment centred on the 200 bp window; and then converting the result into Mappability Data;
Preprocessing of bulk tissue chromatin accessibility analysis sequencing data: obtaining bulk tissue chromatin accessibility analysis sequencing data for the cell lines GM12878, K562 and H1ESC and subjecting it to a trimming operation, after which the trimmed data are mapped to human genome hg19 using the Bowtie2 tool and processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then spliced into a data matrix using the bigWigMerge tool;
Encoding the DNA sequence data from before the preprocessing into a 4×1000 feature vector S_1 using one-hot encoding; splicing the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility analysis sequencing data of each cell line into 2×1000 feature vectors A_1, A_2 and A_3; converting the Mappability Data into a 2×1000 feature vector U_1;
Respectively splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1; and inputting the three spliced vectors into three deep neural networks for training (specifically: forward-propagate the input data, compute the error, back-propagate, update the parameters, and repeat until training is finished) to obtain the deep network models corresponding to cell lines GM12878, K562 and H1ESC.
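As a sketch of the genome windowing described above (200 bp windows with a 50 bp stride, each expanded to 1000 bp around its centre, plus the reverse-complement strand), assuming the chromosome is available as a plain string and skipping windows too close to the chromosome ends:

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def genome_windows(chrom_seq: str, window: int = 200, stride: int = 50,
                   expanded: int = 1000):
    """Yield (forward, reverse) 1000 bp fragments centred on each 200 bp window."""
    flank = (expanded - window) // 2                 # 400 bp of context on each side
    for start in range(0, len(chrom_seq) - window + 1, stride):
        lo, hi = start - flank, start + window + flank
        if lo < 0 or hi > len(chrom_seq):            # skip windows too close to the ends
            continue
        fwd = chrom_seq[lo:hi]
        yield fwd, reverse_complement(fwd)
```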
The single-cell transcription factor prediction method of this scheme further comprises calculating an activity score for each transcription factor from the predicted transcription factor probabilities:
(The three equations defining C_{i,n,k}, C_{i,k} and PC_{i,k} appear as images in the original publication.)
wherein the coincidence probabilities between transcription factor k output by the final network model and the n-th cell peak in cell i are sorted from high to low, giving the first value, the second value, and so on up to M values; C_{i,n,k} is assigned by comparing Z_{i,n,k} with the two highest output coincidence probabilities; C_{i,k} is the result of normalizing all C_{i,n,k} values; PC_{i,k} is the calculated activity score; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
In summary, the problem of low extraction efficiency of the data features due to the black box characteristic of the neural network can be effectively solved by combining the convolution module with the channel attention mechanism.

Claims (7)

1. The single cell transcription factor prediction method based on deep learning and attention mechanism is characterized by comprising the following steps:
acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the single-cell chromatin accessibility analysis sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U;
and splicing the feature vector S, the feature vector A and the feature vector U, and inputting the spliced vector into a deep network model to predict the probability of each transcription factor in single cells, wherein the deep network model comprises a convolution module and a channel attention model.
2. The method of claim 1, wherein inputting probabilities of transcription factors predicted by the deep network model further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belongs is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise entering step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to cell lines GM12878, K562 and H1ESC to predict transcription factor probabilities;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
3. The single cell transcription factor prediction method according to claim 2, wherein the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F^c_avg)) + W_3(W_0(F^c_max)))
Z_{i,n,k} = sigmoid(W_2 · M_c(F))
wherein S_F is the spliced vector; ReLU() is the activation function; max_pooling() is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps output by the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully connected layers; F^c_avg and F^c_max are the features obtained by global average pooling and global max pooling in the channel attention model, respectively; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
4. The method of claim 1, wherein the training method of the deep network model comprises:
preprocessing of DNA sequence data: selecting DNA sequence data of the whole genome and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; obtaining the reverse strand of each fragment and expanding each reverse strand into a 1000 bp fragment centred on the 200 bp window; and then converting the result into Mappability Data;
preprocessing of bulk tissue chromatin accessibility analysis sequencing data: obtaining bulk tissue chromatin accessibility analysis sequencing data for the cell lines GM12878, K562 and H1ESC and subjecting it to a trimming operation, after which the trimmed data are mapped to human genome hg19 using the Bowtie2 tool and processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then spliced into a data matrix using the bigWigMerge tool;
encoding the DNA sequence data from before the preprocessing into a 4×1000 feature vector S_1 using one-hot encoding; splicing the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility analysis sequencing data of each cell line into feature vectors A_1, A_2 and A_3; converting the Mappability Data into a 2×1000 feature vector U_1;
respectively splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1; and inputting the three spliced vectors into three deep neural networks for training to obtain the deep network models corresponding to cell lines GM12878, K562 and H1ESC.
5. The method of claim 1, wherein preprocessing the single cell chromatin accessibility analysis sequencing data comprises:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
the bam files of the peak read information are converted into bigwig files using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
6. The single cell transcription factor prediction method of claim 1, wherein the data enhancement operation comprises:
calculating the latent features of each single cell using the cisTopic method, and computing similarity scores between cells using cosine similarity as the metric;
selecting the 100 neighbouring cells most similar to the target cell as data for augmenting the single-cell chromatin accessibility analysis sequencing data, and pooling the single cell with the selected neighbouring cells as the enhanced sequencing data.
7. The single cell transcription factor prediction method of claim 1, further comprising calculating an activity score for each transcription factor from the predicted transcription factor probabilities:
(The three equations defining C_{i,n,k}, C_{i,k} and PC_{i,k} appear as images in the original publication.)
wherein the coincidence probabilities between transcription factor k output by the final network model and the n-th cell peak in cell i are sorted from high to low, giving the first value, the second value, and so on up to M values; C_{i,k} is the result of normalizing all C_{i,n,k} values; PC_{i,k} is the calculated activity score; and C_{i,n,k} is assigned by comparing Z_{i,n,k} with the two highest output coincidence probabilities.
CN202310383948.9A 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism Pending CN116386720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310383948.9A CN116386720A (en) 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310383948.9A CN116386720A (en) 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism

Publications (1)

Publication Number Publication Date
CN116386720A true CN116386720A (en) 2023-07-04

Family

ID=86965393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310383948.9A Pending CN116386720A (en) 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN116386720A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825204A (en) * 2023-08-30 2023-09-29 鲁东大学 Single-cell RNA sequence gene regulation inference method based on deep learning
CN116825204B (en) * 2023-08-30 2023-11-07 鲁东大学 Single-cell RNA sequence gene regulation inference method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination