CN116386720A - Single cell transcription factor prediction method based on deep learning and attention mechanism - Google Patents

Single cell transcription factor prediction method based on deep learning and attention mechanism

Info

Publication number
CN116386720A
CN116386720A (application CN202310383948.9A)
Authority
CN
China
Prior art keywords
cell
data
feature vector
transcription factor
sequencing data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310383948.9A
Other languages
Chinese (zh)
Inventor
张永清
邹权
何宇辰
牛颢
丁春利
吴锡
王紫轩
刘宇航
王茂丞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202310383948.9A
Publication of CN116386720A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B 20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B 20/30: Detection of binding sites or motifs
    • G16B 40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The invention discloses a single-cell transcription factor prediction method based on deep learning and an attention mechanism. The method comprises: acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing it, and then performing a data enhancement operation to obtain enhanced sequencing data; extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U; and splicing the feature vector S, the feature vector A and the feature vector U and inputting the result into a deep network model to predict the probability of each transcription factor in single cells, wherein the deep network model comprises a convolution module and a channel attention model.

Description

Single cell transcription factor prediction method based on deep learning and attention mechanism
Technical Field
The invention relates to a transcription factor detection technology, in particular to a single-cell transcription factor prediction method based on deep learning and attention mechanisms.
Background
Gene transcription regulation is an important mechanism by which organisms control gene expression: interactions between transcription factors and their associated receptors keep the gene expression program of a cell running normally and allow rapid responses to subtle changes in the biological environment. Detecting transcription factor binding sites across the genome is therefore an important step in exploring the transcriptional regulatory mechanisms of genes.
In recent years, thanks to the rapid development of high-throughput sequencing, single-cell sequencing technology has also entered a stage of rapid development. From the single-cell perspective, important fields such as cancer heterogeneity, stem cell development and differentiation, the human cell atlas, and gene transcription regulation can be understood more deeply and more accurately. How to accurately and efficiently predict transcription factor binding sites from the single-cell data already available is therefore an important and urgent problem.
Currently, owing to advances in sequencing technology and substantial improvements in computational power, deep learning techniques have been widely used for transcription factor binding site prediction thanks to their advantages in data processing. Deep-learning-based prediction models mainly include convolutional neural networks, recurrent neural networks, and hybrid neural networks, and methods based on convolutional neural networks have performed particularly well. Zhou et al. proposed the DeepSEA model, which learns from large-scale chromatin data to predict transcription factor binding and the effects of single-nucleotide changes on chromatin features such as DNase I sensitivity. Alipanahi et al. proposed the DeepBind model, which uses convolutional neural networks to predict the sequence specificity of DNA- and RNA-binding proteins.
Although deep learning algorithms overcome the shortcomings of traditional machine learning methods to some extent, the following deficiencies remain. First, the training data currently used is relatively homogeneous, even though other omics data could likewise provide relevant feature information. Second, as the convolutional layers grow deeper and because of the black-box nature of neural networks, the interpretability of the whole computational model is also significantly reduced.
Disclosure of Invention
Aiming at the above defects in the prior art, the present invention provides a single-cell transcription factor prediction method based on deep learning and an attention mechanism, which solves the problem that deep learning models extract data features inefficiently.
In order to achieve the aim of the invention, the invention adopts the following technical scheme:
provided is a single cell transcription factor prediction method based on deep learning and attention mechanisms, comprising the steps of:
acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the single-cell chromatin accessibility analysis sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U;
and splicing the feature vector S, the feature vector A and the feature vector U, and inputting the spliced vector into a deep network model to predict the probability of each transcription factor in single cells, wherein the deep network model comprises a convolution module and a channel attention model.
The beneficial effects of the invention are as follows: in this scheme, preprocessing the single-cell chromatin accessibility analysis sequencing data reduces noise in the data; the data are then fed into a deep learning model with an embedded channel attention module for recognition, which improves both the speed and the precision with which the model extracts the required data features.
Further, inputting the spliced vector into the deep network model to predict transcription factor probabilities further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belongs is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise entering step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to cell lines GM12878, K562 and H1ESC to predict transcription factor probabilities;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
The beneficial effect of this technical scheme is that a corresponding deep network model is built for each common cell line, so that, even with limited information about the acquired sequencing data, the prediction probabilities for a single cell belonging to a particular cell line can be obtained quickly and accurately.
Further, the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F^c_avg)) + W_3(W_0(F^c_max)))
Z_{i,n,k} = sigmoid(W_2 · M_c(F))
wherein S_F is the spliced vector; ReLU() is the activation function; max_pooling() is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps output by the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully connected layers; F^c_avg and F^c_max are the features obtained by global average pooling and global max pooling in the channel attention model, respectively; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
The beneficial effect of this technical scheme is that a convolution module with relatively few layers is combined with an attention mechanism module, which improves the feature extraction efficiency of the model while also providing interpretability.
Further, the training method of the deep network model comprises:
Preprocessing of DNA sequence data: selecting DNA sequence data of the whole genome and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; obtaining the reverse strand of each fragment and expanding each reverse strand into a 1000 bp fragment centred on the 200 bp window; and then converting the result into Mappability Data;
Preprocessing of bulk tissue chromatin accessibility analysis sequencing data: obtaining bulk tissue chromatin accessibility analysis sequencing data for the cell lines GM12878, K562 and H1ESC and subjecting it to a trimming operation, after which the trimmed data are mapped to human genome hg19 using the Bowtie2 tool and processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then spliced into a data matrix using the bigWigMerge tool;
Encoding the DNA sequence data from before the preprocessing into a 4×1000 feature vector S_1 using one-hot encoding; splicing the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility analysis sequencing data of each cell line into feature vectors A_1, A_2 and A_3; converting the Mappability Data into a 2×1000 feature vector U_1;
Respectively splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1; and inputting the three spliced vectors into three deep neural networks for training to obtain the deep network models corresponding to cell lines GM12878, K562 and H1ESC.
The beneficial effects of this technical scheme are as follows: bulk tissue chromatin accessibility analysis sequencing data, DNA sequence data and Mappability feature data are used for pretraining, so that the characteristics of each data source are fully exploited, which ensures the accuracy of transcription factor prediction on single-cell chromatin accessibility analysis sequencing data.
Further, the method for preprocessing single cell chromatin accessibility analysis sequencing data comprises the following steps:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
the bam files of the peak read information are converted into bigwig files using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
The beneficial effect of this technical scheme is that preprocessing the single-cell chromatin sequencing data with the ENCODE pipeline preserves the original signal information to the greatest extent.
Further, the data enhancement operation includes:
calculating the latent features of each single cell using the cisTopic method, and computing similarity scores between cells using cosine similarity as the metric;
selecting the 100 neighbouring cells most similar to the target cell as data for augmenting the single-cell chromatin accessibility analysis sequencing data, and pooling the single cell with the selected neighbouring cells as the enhanced sequencing data.
The beneficial effects of this technical scheme are as follows: this approach increases the coverage of chromatin accessibility while preserving cell specificity, and the operation effectively mitigates batch effects.
Further, the single-cell transcription factor prediction method further comprises calculating an activity score for each transcription factor from the predicted transcription factor probabilities:
(The three equations defining C_{i,n,k}, C_{i,k} and PC_{i,k} appear as images in the original publication.)
wherein the coincidence probabilities between transcription factor k output by the final network model and the n-th cell peak in cell i are sorted from high to low, giving the first value, the second value, and so on up to M values; C_{i,n,k} is assigned by comparing Z_{i,n,k} with the two highest output coincidence probabilities; C_{i,k} is the result of normalizing all C_{i,n,k} values; PC_{i,k} is the calculated activity score; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
The beneficial effect of this technical scheme is that the obtained transcription factor activity scores make it more intuitive to identify the transcription factors with higher regulatory intensity or with the most regulatory relationships.
Drawings
FIG. 1 is a flow chart of a method of single cell transcription factor prediction based on deep learning and attention mechanisms.
FIG. 2 is a framework diagram of a single cell transcription factor prediction method based on deep learning and attention mechanisms.
Detailed Description
The following description of the embodiments of the present invention is provided to facilitate understanding of the present invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; for those of ordinary skill in the art, any invention that makes use of the inventive concept falls within the scope of protection defined by the appended claims.
Referring to fig. 1, which shows a flowchart of the single-cell transcription factor prediction method based on deep learning and attention mechanisms, the method includes steps S1 to S3.
In step S1, acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
in practice, the present protocol preferably provides a method for preprocessing single cell chromatin accessibility analysis sequencing data comprising:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
the bam files of the peak read information are converted into bigwig files using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
Wherein the data enhancement operation comprises: calculating potential characteristics of single cells by adopting a cisTopic method, and calculating similarity scores among cells by using data indexes of cosine similarity; the 100 adjacent cells most similar to the above were selected as data for single cell chromatin accessibility analysis sequencing data amplification, and single cells and selected adjacent cells were pooled as enhanced sequencing data.
In step S2, extracting regression peaks in the enhanced sequencing data to generate a feature vector S of 4×1000, concatenating the forward and reverse enhanced sequencing data to a feature vector a of 2×1000, and converting DNA sequence data taken from the whole genome to a feature vector U of 2×1000;
In step S3, the feature vector S, the feature vector A and the feature vector U are spliced into an 8×1000 feature vector, which is input into the deep network model to predict the probability of each transcription factor in the single cell; the deep network model comprises a convolution module and a channel attention model.
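For illustration only, the sketch below shows one way such an 8×1000 input could be assembled, following the layout used later for the training data (a 4×1000 one-hot DNA matrix plus a 2×1000 forward/reverse accessibility track and a 2×1000 mappability track); all function and array names are hypothetical and the exact encoding used by the invention may differ.

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_dna(seq: str) -> np.ndarray:
    """Encode a 1000 bp DNA sequence as a 4 x 1000 one-hot matrix (N stays all-zero)."""
    mat = np.zeros((4, len(seq)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        idx = BASE_INDEX.get(base)
        if idx is not None:
            mat[idx, pos] = 1.0
    return mat

def build_input(seq: str, signal_fwd: np.ndarray, signal_rev: np.ndarray,
                map_fwd: np.ndarray, map_rev: np.ndarray) -> np.ndarray:
    """Stack S (4x1000), A (2x1000) and U (2x1000) into one 8 x 1000 input matrix."""
    S = one_hot_dna(seq)                              # DNA sequence, one-hot
    A = np.stack([signal_fwd, signal_rev])            # accessibility signal, both strands
    U = np.stack([map_fwd, map_rev])                  # mappability track, both strands
    return np.concatenate([S, A, U], axis=0)          # shape (8, 1000)
```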
The convolution layers in the convolution module mainly extract features from the input DNA sequence and shape data, mine the correlations between the different data sources, and remove noise and unstable components from the data; the relatively stable processed patterns are then passed as a whole to the attention mechanism module for TFBS (transcription factor binding site) prediction.
The deep network model uses only the channel attention part of CBAM to capture the feature and shape data of the biological sequence. The channel attention module spatially compresses the input feature map; the compression not only uses features extracted by average pooling but also introduces max pooling as a supplement. Global average pooling provides feedback for every position of the feature map, whereas global max pooling provides gradient feedback only at the most responsive positions of the feature map.
In this scheme, preprocessing the single-cell chromatin accessibility analysis sequencing data reduces noise in the data; the data are then fed into a deep learning model with an embedded channel attention module for recognition, which improves both the speed and the precision with which the model extracts the required data features.
In one embodiment of the present invention, inputting the spliced vector into the deep network model to predict transcription factor probabilities further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belongs is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise entering step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to cell lines GM12878, K562 and H1ESC to predict transcription factor probabilities;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
This scheme builds a deep network model for each common cell line, so that, even when the information available about the acquired sequencing data is limited, the prediction probabilities for a single cell belonging to a particular cell line can be obtained quickly and accurately.
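A minimal sketch of the selection logic of steps S31 to S33, assuming the three trained cell-line models are available as callables that return per-transcription-factor probabilities; the dictionary-based interface below is an illustrative assumption, not part of the patent.

```python
import numpy as np

def predict_tf_probabilities(x, models_by_cell_line, cell_line=None):
    """Use the matching cell-line model when known; otherwise take the
    element-wise maximum over the GM12878, K562 and H1ESC model outputs."""
    if cell_line is not None:
        return models_by_cell_line[cell_line](x)                        # S31: known cell line
    predictions = [model(x) for model in models_by_cell_line.values()]  # S32: run all three
    return np.maximum.reduce(predictions)                               # S33: per-TF maximum
```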
As shown in fig. 2, the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F^c_avg)) + W_3(W_0(F^c_max)))
Z_{i,n,k} = sigmoid(W_2 · M_c(F))
wherein S_F is the spliced vector; ReLU() is the activation function; max_pooling() is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps output by the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully connected layers; F^c_avg and F^c_max are the features obtained by global average pooling and global max pooling in the channel attention model, respectively; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
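The following PyTorch sketch illustrates one possible realisation of the structure implied by the formulas above: three Conv1d + ReLU + max-pooling blocks, a layer for W_1, a CBAM-style channel attention branch with a shared two-layer perceptron (W_0, W_3), and a sigmoid output layer for W_2. Channel counts, kernel sizes and the reduction ratio are illustrative assumptions, and W_1 is approximated here by a 1×1 convolution so that the feature map keeps its channel layout for the attention step; none of these values are taken from the patent.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CBAM-style channel attention: a shared two-layer MLP over global
    average-pooled and max-pooled channel descriptors, combined by a sigmoid."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared W_0 / W_3
            nn.Linear(channels, channels // reduction),
            nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:  # f: (batch, C, L)
        avg = self.mlp(f.mean(dim=2))                    # F_avg^c branch
        mx = self.mlp(f.max(dim=2).values)               # F_max^c branch
        return torch.sigmoid(avg + mx)                   # M_c(F): (batch, C)

class SingleCellTFNet(nn.Module):
    """Sketch of the conv + channel-attention network; sizes are illustrative."""
    def __init__(self, n_tfs: int, in_channels: int = 8):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv1d(c_in, c_out, kernel_size=8, padding=4),
                                 nn.ReLU(), nn.MaxPool1d(4))
        self.conv1, self.conv2, self.conv3 = block(in_channels, 64), block(64, 128), block(128, 256)
        self.fc1 = nn.Conv1d(256, 256, kernel_size=1)    # stands in for W_1, applied per position
        self.attention = ChannelAttention(256)
        self.fc2 = nn.Linear(256, n_tfs)                 # W_2

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 8, 1000)
        f3 = self.conv3(self.conv2(self.conv1(x)))       # F_1, F_2, F_3
        f = torch.relu(self.fc1(f3))                     # F = ReLU(W_1 · F_3)
        attn = self.attention(f)                         # M_c(F)
        return torch.sigmoid(self.fc2(attn))             # Z_{i,n,k}
```

In this sketch the attention vector M_c(F) is fed directly to the output layer, matching the formula Z_{i,n,k} = sigmoid(W_2 · M_c(F)); standard CBAM would instead rescale F by M_c(F) before any further layers.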
In implementation, the training method of the deep network model preferably comprises the following steps:
Preprocessing of DNA sequence data: selecting DNA sequence data of the whole genome and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; obtaining the reverse strand of each fragment and expanding each reverse strand into a 1000 bp fragment centred on the 200 bp window; and then converting the result into Mappability Data;
Preprocessing of bulk tissue chromatin accessibility analysis sequencing data: obtaining bulk tissue chromatin accessibility analysis sequencing data for the cell lines GM12878, K562 and H1ESC and subjecting it to a trimming operation, after which the trimmed data are mapped to human genome hg19 using the Bowtie2 tool and processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then spliced into a data matrix using the bigWigMerge tool;
Encoding the DNA sequence data from before the preprocessing into a 4×1000 feature vector S_1 using one-hot encoding; splicing the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility analysis sequencing data of each cell line into 2×1000 feature vectors A_1, A_2 and A_3; converting the Mappability Data into a 2×1000 feature vector U_1;
Respectively splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1; and inputting the three spliced vectors into three deep neural networks for training (specifically: forward-propagate the input data, compute the error, back-propagate, update the parameters, and repeat until training is finished) to obtain the deep network models corresponding to cell lines GM12878, K562 and H1ESC.
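As a sketch of the genome windowing described above (200 bp windows with a 50 bp stride, each expanded to 1000 bp around its centre, plus the reverse-complement strand), assuming the chromosome is available as a plain string and skipping windows too close to the chromosome ends:

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def genome_windows(chrom_seq: str, window: int = 200, stride: int = 50,
                   expanded: int = 1000):
    """Yield (forward, reverse) 1000 bp fragments centred on each 200 bp window."""
    flank = (expanded - window) // 2                 # 400 bp of context on each side
    for start in range(0, len(chrom_seq) - window + 1, stride):
        lo, hi = start - flank, start + window + flank
        if lo < 0 or hi > len(chrom_seq):            # skip windows too close to the ends
            continue
        fwd = chrom_seq[lo:hi]
        yield fwd, reverse_complement(fwd)
```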
The single-cell transcription factor prediction method of this scheme further comprises calculating an activity score for each transcription factor from the predicted transcription factor probabilities:
(The three equations defining C_{i,n,k}, C_{i,k} and PC_{i,k} appear as images in the original publication.)
wherein the coincidence probabilities between transcription factor k output by the final network model and the n-th cell peak in cell i are sorted from high to low, giving the first value, the second value, and so on up to M values; C_{i,n,k} is assigned by comparing Z_{i,n,k} with the two highest output coincidence probabilities; C_{i,k} is the result of normalizing all C_{i,n,k} values; PC_{i,k} is the calculated activity score; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
In summary, the problem of low extraction efficiency of the data features due to the black box characteristic of the neural network can be effectively solved by combining the convolution module with the channel attention mechanism.

Claims (7)

1. The single cell transcription factor prediction method based on deep learning and attention mechanism is characterized by comprising the following steps:
acquiring single-cell chromatin accessibility analysis sequencing data, preprocessing the single-cell chromatin accessibility analysis sequencing data, and then performing data enhancement operation to obtain enhanced sequencing data;
extracting regression peaks in the enhanced sequencing data as a feature vector S, splicing the forward and reverse enhanced sequencing data as a feature vector A, and converting DNA sequence data from the whole genome into a feature vector U;
and splicing the feature vector S, the feature vector A and the feature vector U, and inputting the spliced vector into a deep network model to predict the probability of each transcription factor in single cells, wherein the deep network model comprises a convolution module and a channel attention model.
2. The method of claim 1, wherein inputting probabilities of transcription factors predicted by the deep network model further comprises:
S31, judging whether the cell line to which the single-cell chromatin accessibility analysis sequencing data belongs is known; if so, selecting the deep network model of the corresponding cell line for prediction, otherwise entering step S32;
S32, respectively inputting the spliced data into the deep network models corresponding to cell lines GM12878, K562 and H1ESC to predict transcription factor probabilities;
S33, selecting the maximum of the transcription factor prediction probabilities output by the three models as the transcription factor prediction probability of the single cell.
3. The single cell transcription factor prediction method according to claim 2, wherein the convolution module comprises a convolution layer, an activation layer, a pooling layer and a fully connected layer which are sequentially connected; the channel attention model comprises a max pooling layer/average pooling layer, a shared multi-layer perceptron module, a fully connected layer and a flattening layer which are sequentially connected; the model structure of the deep network model is as follows:
F_1 = max_pooling(ReLU(conv_1(S_F))), F_2 = max_pooling(ReLU(conv_2(F_1)))
F_3 = max_pooling(ReLU(conv_3(F_2))), F = ReLU(W_1 · F_3)
M_c(F) = σ(W_3(W_0(F^c_avg)) + W_3(W_0(F^c_max)))
Z_{i,n,k} = sigmoid(W_2 · M_c(F))
wherein S_F is the spliced vector; ReLU() is the activation function; max_pooling() is the max pooling layer function; conv_1(·), conv_2(·) and conv_3(·) are the first, second and third convolution functions, respectively; F_1, F_2 and F_3 are the feature maps output by the respective convolution layers of the convolution module; W_1 and W_2 are the weight matrices of the two fully connected layers; F^c_avg and F^c_max are the features obtained by global average pooling and global max pooling in the channel attention model, respectively; σ is the sigmoid activation function; W_0 and W_3 are the parameters of the two layers of the multi-layer perceptron; M_c(F) is the output of the channel attention model; Z_{i,n,k} is the probability that transcription factor k coincides with the n-th cell peak in cell i; k ∈ 1…M, where M is the total number of transcription factors per cell in the deep network model; n ∈ 1…N, where N is the number of cell peaks that can be read per cell.
4. The method of claim 1, wherein the training method of the deep network model comprises:
preprocessing of DNA sequence data: selecting DNA sequence data of the whole genome and cutting it into 200 bp fragments, with a 50 bp sliding interval between fragments; obtaining the reverse strand of each fragment and expanding each reverse strand into a 1000 bp fragment centred on the 200 bp window; and then converting the result into Mappability Data;
preprocessing of bulk tissue chromatin accessibility analysis sequencing data: obtaining bulk tissue chromatin accessibility analysis sequencing data for the cell lines GM12878, K562 and H1ESC and subjecting it to a trimming operation, after which the trimmed data are mapped to human genome hg19 using the Bowtie2 tool and processed using samtools and Picard; the resulting bam files are converted into bigwig files using deepTools2, and all bigwig files are then spliced into a data matrix using the bigWigMerge tool;
encoding the DNA sequence data from before the preprocessing into a 4×1000 feature vector S_1 using one-hot encoding; splicing the corresponding forward and reverse preprocessed bulk tissue chromatin accessibility analysis sequencing data of each cell line into feature vectors A_1, A_2 and A_3; converting the Mappability Data into a 2×1000 feature vector U_1;
respectively splicing feature vectors S_1, A_1 and U_1; S_1, A_2 and U_1; and S_1, A_3 and U_1; and inputting the three spliced vectors into three deep neural networks for training to obtain the deep network models corresponding to cell lines GM12878, K562 and H1ESC.
5. The method of claim 1, wherein preprocessing the single cell chromatin accessibility analysis sequencing data comprises:
screening peak reading information of single-cell chromatin accessibility analysis sequencing data by adopting an ENCODE single-cell chromatin accessibility analysis sequencing data processing method;
the bam files of the peak read information are converted into bigwig files using deepTools2, after which all bigwig files are spliced into a data matrix using the bigWigMerge tool.
6. The single cell transcription factor prediction method of claim 1, wherein the data enhancement operation comprises:
calculating the latent features of each single cell using the cisTopic method, and computing similarity scores between cells using cosine similarity as the metric;
selecting the 100 neighbouring cells most similar to the target cell as data for augmenting the single-cell chromatin accessibility analysis sequencing data, and pooling the single cell with the selected neighbouring cells as the enhanced sequencing data.
7. The single cell transcription factor prediction method of claim 1, further comprising calculating an activity score for each transcription factor from the predicted transcription factor probabilities:
(The three equations defining C_{i,n,k}, C_{i,k} and PC_{i,k} appear as images in the original publication.)
wherein the coincidence probabilities between transcription factor k output by the final network model and the n-th cell peak in cell i are sorted from high to low, giving the first value, the second value, and so on up to M values; C_{i,k} is the result of normalizing all C_{i,n,k} values; PC_{i,k} is the calculated activity score; and C_{i,n,k} is assigned by comparing Z_{i,n,k} with the two highest output coincidence probabilities.
CN202310383948.9A 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism Pending CN116386720A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310383948.9A CN116386720A (en) 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310383948.9A CN116386720A (en) 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism

Publications (1)

Publication Number Publication Date
CN116386720A true CN116386720A (en) 2023-07-04

Family

ID=86965393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310383948.9A Pending CN116386720A (en) 2023-04-11 2023-04-11 Single cell transcription factor prediction method based on deep learning and attention mechanism

Country Status (1)

Country Link
CN (1) CN116386720A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825204A (en) * 2023-08-30 2023-09-29 鲁东大学 Single-cell RNA sequence gene regulation inference method based on deep learning
CN116825204B (en) * 2023-08-30 2023-11-07 鲁东大学 Single-cell RNA sequence gene regulation inference method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination