CN115713970A - Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network - Google Patents


Info

Publication number: CN115713970A
Application number: CN202211459776.0A
Authority: CN (China)
Prior art keywords: protein, transcription factor, encoder, protein sequence, convolution
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Liu Juan (刘娟), Yang Zhihui (杨志辉)
Current assignee: Wuhan University (WHU)
Original assignee: Wuhan University (WHU)
Application filed by Wuhan University (WHU)
Priority: CN202211459776.0A
Publication: CN115713970A


Abstract

The invention discloses a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network. The method first extracts global features of a protein sequence with a Transformer-Encoder, then extracts multi-scale local features from those global features with a multi-scale convolutional neural network, and finally fuses the extracted features and outputs the probability that the protein sequence is a transcription factor. By combining a multi-layer, multi-head attention mechanism (the Transformer-Encoder) with a multi-scale convolutional neural network, the method can determine with high accuracy whether an unknown protein sequence is a transcription factor, requires only the protein sequence to make this judgment quickly, and greatly improves protein annotation efficiency.

Description

Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
Technical Field
The invention relates to the field of protein function annotation, in particular to a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network.
Background
A transcription factor is a protein molecule with a special structure whose function is to regulate gene expression. Transcription factors regulate the expression of target genes by binding specifically to DNA sequences, promoting or inhibiting the transcription of particular DNA into RNA.
Traditional methods that identify and distinguish transcription factors through biochemical experiments are time-consuming, costly and unsuitable for large-scale use. Homology search with BLAST cannot determine whether a protein with no homolog in the database is a transcription factor. Prediction methods based on traditional machine learning can identify transcription factors from protein structure or sequence information, but they require manually designed, transcription-factor-related features, demand strong domain knowledge and achieve low prediction accuracy. Deep learning has the advantage of learning features directly from protein sequences, but most existing methods build their prediction models on convolutional neural networks. Limited by the convolution kernel, such methods can automatically learn feature representations, yet they capture only local features between nearby amino acids and miss global features between distant amino acids, which hurts the model's prediction accuracy.
Disclosure of Invention
To solve the above technical problems, the invention provides a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network, which can extract global and local information from a protein sequence simultaneously and automatically obtain a comprehensive feature representation of transcription factors, thereby further improving prediction accuracy.
The technical scheme provided by the invention is as follows:
a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network comprises the following steps:
step 1: constructing a training set: collecting protein sequences from a protein database, and marking each protein sequence as a transcription factor or a non-transcription factor according to corresponding protein annotation information; preprocessing all sequences to obtain a training data set;
step 2: building a network structure: a transcription factor prediction model is built as a network structure combining a Transformer-Encoder and a multi-scale convolutional neural network, where the Transformer-Encoder is used to obtain the global feature X_i^G of the i-th protein sequence X_i, and the multi-scale convolutional neural network performs transcription factor prediction based on X_i^G;
step 3: training the prediction model: training the network built in step 2 with the training set obtained in step 1 to obtain a trained transcription factor prediction model;
step 4: transcription factor prediction: using the prediction model obtained in step 3 to predict whether an unknown protein sequence is a transcription factor, and outputting the prediction result.
Further, the step 1 comprises the following substeps:
1.1 selecting protein sequences that contain no non-standard amino acids (i.e., B, O, U and Z) from a protein database to form the data set S1;
1.2 removing sequences longer than 1000 from S1, retaining only sequences of length ≤ 1000; zero-padding protein sequences shorter than 1000 to length 1000; finally obtaining the protein sequence data set S2;
1.3 according to the GO annotation information of each protein in the protein database, assigning each protein sequence in S2 the label transcription factor "1" or non-transcription factor "0", finally obtaining the training data set S = {(X_i, c_i)}, i = 1, …, N, where X_i is the i-th protein sequence in the data set, c_i ∈ {0,1} is the label of X_i, and N is the size of S.
Further, in step 1.3, if the GO annotation of a protein includes a "transcription factor" GO term, or includes both a "transcription regulation" and a "DNA binding" GO term, the protein sequence is a transcription factor and is assigned "1"; otherwise, the protein sequence is a non-transcription factor and is assigned "0".
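As a concrete illustration of steps 1.1-1.3, the following is a minimal Python sketch; the record format, the helper names (preprocess, label) and the truncated GO-ID sets are illustrative assumptions, not part of the patent.

```python
# A minimal sketch of steps 1.1-1.3, assuming `records` is a list of
# (sequence, go_terms) pairs already parsed from the protein database.
# The GO-ID sets are small illustrative subsets of the lists given in
# the embodiment (S1.3), not the complete lists.
NONSTANDARD = set("BOUZ")
MAX_LEN = 1000

TF_TERMS  = {"GO:0000976", "GO:0000977", "GO:0000978"}   # "transcription factor"
REG_TERMS = {"GO:0006351", "GO:0006355"}                 # "transcription regulation"
DNA_TERMS = {"GO:0003677", "GO:0043565"}                 # "DNA binding"

def label(go_terms):
    """Rule of step 1.3: a TF term, or a regulation term AND a DNA-binding term."""
    go = set(go_terms)
    if go & TF_TERMS or ((go & REG_TERMS) and (go & DNA_TERMS)):
        return 1
    return 0

def preprocess(records):
    dataset = []
    for seq, go_terms in records:
        if set(seq) & NONSTANDARD:        # step 1.1: drop non-standard amino acids
            continue
        if len(seq) > MAX_LEN:            # step 1.2: keep only length <= 1000
            continue
        padded = seq.ljust(MAX_LEN, "0")  # zero-pad (a "0" placeholder token)
        dataset.append((padded, label(go_terms)))
    return dataset
```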
Further, the network structure in step 2 consists of a Transformer-Encoder structure and a multi-scale convolutional neural network structure connected in series;
the Transformer-Encoder structure retains only the Encoder part of the Transformer and is stacked from 6 Encoder blocks, each containing 12 attention heads; the Transformer-Encoder extracts global features from the input protein sequence;
the multi-scale convolutional neural network consists of four parallel convolution sub-networks with different one-dimensional convolution kernels, two fully connected layers and an output layer; the convolution layer comprises several one-dimensional convolution operations with kernels of different sizes, producing convolution features at several scales; the pooling layer pools each convolution feature to reduce its dimensionality; after pooling, the features are concatenated and fed into the fully connected layers; the output layer outputs the prediction result computed by the fully connected layers.
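A minimal PyTorch sketch of this serial structure follows. d_model=96 is an illustrative choice made so the 12 attention heads divide the embedding size evenly; the 64 branch channels, dropout rate and hidden width are likewise assumptions, and the position encoding of step 2.1.2 (a sketch of which appears after step 2.2 below) is omitted here for brevity.

```python
import torch
import torch.nn as nn

class TFPredictor(nn.Module):
    """Serial Transformer-Encoder + multi-scale CNN, per the structure above."""
    def __init__(self, vocab_size=21, d_model=96, n_blocks=6, n_heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_blocks)  # 6 Encoder blocks
        # four parallel 1-D convolution sub-networks, kernel widths 4, 8, 12, 16
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(d_model, 64, kernel_size=k),   # convolution layer
                nn.BatchNorm1d(64),                      # normalization layer
                nn.ReLU(),
                nn.Dropout(0.3),                         # Dropout layer
                nn.AdaptiveMaxPool1d(1),                 # Max-Pooling layer
            )
            for k in (4, 8, 12, 16)
        ])
        self.fc = nn.Sequential(                         # two fully connected layers
            nn.Linear(4 * 64, 64), nn.ReLU(), nn.Linear(64, 2)
        )

    def forward(self, x):                  # x: (batch, 1000) amino-acid indices
        h = self.encoder(self.embed(x))    # global features: (batch, 1000, d_model)
        h = h.transpose(1, 2)              # Conv1d expects (batch, channels, length)
        local = torch.cat([b(h).squeeze(-1) for b in self.branches], dim=1)
        return self.fc(local)              # output layer: logits for {non-TF, TF}
```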
Further, in step 2, let a protein sequence be X_i = (x_i1, x_i2, …, x_ij, …, x_i1000), where x_ij denotes the amino acid at position j of protein sequence X_i; the Transformer-Encoder converts X_i into the global feature X_i^G through the following steps:
2.1 obtaining the embedding vector of X_i through an embedding operation, specifically:
2.1.1 first randomly initializing a vector for each amino acid type, then embedding each amino acid x_ij of X_i into the vector corresponding to its type;
2.1.2 extracting the position information of each amino acid in the protein sequence using position codes, which identify amino acids at different positions of the protein through sine and cosine functions; the position code of the j-th amino acid is given by:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position of the amino acid in the protein sequence, d is the dimension of the embedding vector, and k is a natural number;
2.1.3 adding the embedding of each amino acid x_ij to its position code to obtain the embedding vector of the protein X_i sequence;
2.2 taking the embedding vector of protein sequence X_i as the input of the Transformer-Encoder, using its attention mechanism to mine the attention score between every pair of amino acids, and multiplying the attention scores with the embedding vectors to obtain the global feature X_i^G of the whole protein sequence X_i.
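The following is a minimal sketch of the sinusoidal position code of step 2.1.2 and the addition of step 2.1.3, assuming the standard Transformer form of the formula and an even embedding dimension d.

```python
import torch

def position_encoding(max_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    two_k = torch.arange(0, d, 2, dtype=torch.float)              # even dimensions: 2k
    angle = pos / torch.pow(10000.0, two_k / d)                   # pos / 10000^(2k/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angle)    # PE(pos, 2k)
    pe[:, 1::2] = torch.cos(angle)    # PE(pos, 2k+1)
    return pe

# Step 2.1.3: the Transformer-Encoder input is the amino-acid embedding plus
# its position code, e.g.  enc_in = emb + position_encoding(1000, d)
```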
Furthermore, in step 2, each convolution sub-network consists, in order, of a convolution layer, a normalization layer, a Dropout layer and a Max-Pooling layer;
the output of the i-th sub-network is computed as:

F_i(x) = MaxPooling(ReLU(Norm(Conv(x))))

where x is the input of the convolution layer;
the concatenated output of the four sub-networks is:

output = Concat(F_i(x)), i = 1, 2, 3, 4.
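Continuing the TFPredictor sketch above, a quick shape check illustrates how the four F_i outputs concatenate: each branch maps the (batch, d_model, 1000) feature map to a (batch, 64) vector, and Concat yields the (batch, 256) input to the fully connected layers. The tensor shapes here follow from the assumed hyper-parameters, not from values fixed by the patent.

```python
model = TFPredictor()
x = torch.randint(1, 21, (2, 1000))   # two dummy protein sequences as index tensors
logits = model(x)
print(logits.shape)                   # torch.Size([2, 2]): one logit per class
```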
Further, in step 2, the multi-scale convolutional neural network uses one-dimensional convolution kernels of different sizes to extract local features over protein sub-sequences of different lengths.
Furthermore, in step 2, the length of each convolution kernel equals the embedding dimension of the protein, the width is set between 4 and 20, and the kernel sizes are 4, 8, 12 and 16 in sequence.
Further, in step 2, transcription factor prediction based on X_i^G proceeds as follows:
starting from the global feature X_i^G of protein sequence X_i, convolution operations at different scales yield the local features X_ij^L, where j indexes the local features corresponding to the different convolution kernels; all local features are concatenated and finally fed into the fully connected layers. The loss function is the cross entropy, given by:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where ŷ is the probability with which the model predicts that the sample is a transcription factor, and y is the sample label, equal to 1 if the sample is a positive example and 0 otherwise;
for all protein sequences, the final output probabilities obtained through the SoftMax function are [p_1j, p_2j, …, p_nj], and the predicted class y_i of protein i is computed as:

y_i = argmax_j (p_ij)

If y_i = 1, protein sequence i is a transcription factor; otherwise it is a non-transcription factor.
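In code, the loss and the decision rule above reduce to the following sketch, where `logits` are the two-class outputs of the network and `labels` the 0/1 tags from step 1; for two classes, PyTorch's cross-entropy on the logits equals the formula given.

```python
import torch
import torch.nn.functional as F

def loss_and_prediction(logits: torch.Tensor, labels: torch.Tensor):
    probs = F.softmax(logits, dim=1)         # per-protein class probabilities p_ij
    loss = F.cross_entropy(logits, labels)   # = -[y log ŷ + (1-y) log(1-ŷ)] for 2 classes
    y_pred = probs.argmax(dim=1)             # y_i = argmax_j p_ij; 1 => transcription factor
    return loss, y_pred
```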
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a new identification and classification method that avoids the high cost and complex procedures of traditional biochemical experiments; it can quickly judge whether a protein is a transcription factor from the protein sequence alone, greatly improving protein annotation efficiency.
2. In feature extraction, compared with methods proposed by previous researchers, the invention broadens the scope of the extracted sequence features: instead of extracting features only from amino acid segments of length 3-16, it extracts features from the entire protein sequence, thereby improving transcription factor prediction accuracy.
3. The invention introduces a Transformer-Encoder module into the model; this module computes the attention score between each amino acid and every other amino acid, thereby discovering strongly associated amino acid pairs in the protein sequence and better mining the interrelationships among the amino acids in the protein sequence.
Drawings
FIG. 1 is a model flow diagram of the present invention;
FIG. 2 is a protein data pre-processing flow;
FIG. 3 is a diagram of a Transformer-Encoder model architecture used in the present invention;
FIG. 4 is a flow chart of the attention score calculation in the Transformer-Encoder of the present invention;
FIG. 5 is a schematic diagram of the multi-scale convolutional neural network model in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below in detail with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
Referring to FIGS. 1-5, a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network includes the following steps:
Step 1: all protein sequences were obtained from the UniProtKB 2021_04 Swiss-Prot database, and the data were preprocessed to construct the transcription factor training set S.
The data preprocessing operation described in step 1, as shown in fig. 2, includes the following sub-steps:
s1.1 all Protein sequences were first downloaded in the UniProtKB 2021_04Swiss-Prot dataset, 565,928 strips total, with attributes in the dataset including "Sequence", "Length", "Gene connectivity (GO)", "Entry name", "Protein names". Abnormal protein data containing non-standard amino acids (i.e., B, O, U, Z) were then deleted from the dataset.
S1.2 sequences with Length ≤ 1000 were selected according to the Length attribute and zero-padded to length 1000.
S1.3 GO terms covering the three functions "transcription factor", "transcription regulation" and "DNA binding" were collected from the protein annotation database. Since GO terms are the function categories of protein gene annotation, they are well suited for building the transcription factor label index. The collected "transcription factor" GO terms include "GO:0000976; GO:0000977; GO:0000978; GO:0000979; GO:0000981; GO:0000984; GO:0000985; GO:0000986; GO:0000987; GO:0000992", etc.; the "transcription regulation" GO terms include "GO:0001228; GO:0006351; GO:0006355; GO:0043433", etc.; the "DNA binding" GO terms include "GO:0003677; GO:0008301; GO:0043565; GO:0050692".
S1.4 the GO terms corresponding to each protein sequence obtained in step S1.2 were screened; if they contain one of the "transcription factor" terms from step S1.3, or contain both a "transcription regulation" term and a "DNA binding" term, the sequence was labeled a transcription factor; otherwise it was labeled a non-transcription factor. This produced a data set containing 124,316 cleaned and preprocessed protein sequences together with their transcription factor labels.
Step 2: build the network model M: a Transformer-Encoder with a 2-layer, 8-head attention mechanism (shown in FIGS. 3-4) and a multi-scale convolutional neural network containing one-dimensional convolution kernels of sizes 4, 8, 12 and 16 (shown in FIG. 5) are built and connected in series to form the model M.
The network structure in step 2 is formed by connecting a Transformer-Encoder structure and a multi-scale convolutional neural network structure in series. The Transformer-Encoder structure comes from the Transformer architecture in the field of natural language processing. The original Transformer adopts an Encoder-Decoder architecture; step 2 retains only its Encoder part, stacked from 6 Encoder blocks, each containing 12 attention heads. The Transformer-Encoder is used to extract global features from the input protein sequence.
The multi-scale convolutional neural network consists of four parallel convolution sub-networks with different one-dimensional convolution kernels, two fully connected layers and an output layer; the convolution layer comprises several one-dimensional convolution operations with kernels of different sizes, producing convolution features at several scales; the pooling layer pools each convolution feature to reduce its dimensionality; after pooling, the features are concatenated and fed into the fully connected layers; the output layer outputs the prediction result computed by the fully connected layers.
Let a protein sequence be X_i = (x_i1, x_i2, …, x_ij, …, x_i1000), where x_ij denotes the j-th amino acid in protein sequence X_i. In step 2 the Transformer-Encoder is used to obtain the global feature X_i^G of X_i, through the following steps:
2.1 obtain the embedding vector of X_i through an embedding operation, specifically:
2.1.1 first randomly initialize a vector for each amino acid type, then embed each amino acid x_ij of X_i into the vector corresponding to its type.
2.1.2 extract the position information of each amino acid in the protein sequence using position codes, which identify amino acids at different positions of the protein through sine and cosine functions; the position code of the j-th amino acid is given by:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position of the amino acid in the protein sequence, d is the dimension of the embedding vector, and k is a natural number.
2.1.3 add the embedding of each amino acid x_ij to its position code to obtain the embedding vector of the protein X_i sequence.
2.2 take the embedding vector of protein sequence X_i as the input of the Transformer-Encoder, use its attention mechanism to mine the attention score between every pair of amino acids, and multiply the attention scores with the embedding vectors to obtain the global feature X_i^G of the whole protein sequence X_i.
A multi-scale convolutional neural network with one-dimensional convolution kernels of different sizes is used to extract local features over protein sub-sequences of different lengths. The length of each convolution kernel equals the embedding dimension of the protein, the width is set between 4 and 20, and the kernel sizes are 4, 8, 12 and 16 in sequence.
Each convolution sub-network consists, in order, of a convolution layer, a normalization layer, a Dropout layer and a Max-Pooling layer;
the output of the i-th sub-network is computed as:

F_i(x) = MaxPooling(ReLU(Norm(Conv(x))))

where x is the input of the convolution layer;
the concatenated output of the four sub-networks is:
output = Concat(F_i(x)), i = 1, 2, 3, 4.

Transcription factor prediction based on X_i^G proceeds as follows: starting from the global feature X_i^G of protein sequence X_i, convolution operations at different scales yield the local features X_ij^L, where j indexes the local features corresponding to the different convolution kernels; all local features are concatenated and finally fed into the fully connected layers. The loss function is the cross entropy, given by:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where ŷ is the probability with which the model predicts that the sample is a transcription factor, and y is the sample label, equal to 1 if the sample is a positive example and 0 otherwise.
For all protein sequences, the final output probabilities obtained through the SoftMax function are [p_1j, p_2j, …, p_nj], and the predicted class y_i of protein i is computed as:

y_i = argmax_j (p_ij)

If y_i = 1, protein sequence i is a transcription factor; otherwise it is a non-transcription factor.
Step 3: train the model M with the data set S from step 1 to obtain the trained transcription factor prediction model T. Model training in step 3 uses stochastic gradient descent with the Adam optimizer and an initial learning rate of 0.0001; during training, the batch size is 20.
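A minimal training-loop sketch for step 3 follows, using the hyper-parameters stated above (Adam optimizer, initial learning rate 0.0001, batch size 20); the tensors `X` (encoded sequences) and `y` (labels) and the epoch count are assumed inputs, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train(model, X, y, epochs=10):
    loader = DataLoader(TensorDataset(X, y), batch_size=20, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(xb), yb)   # cross-entropy loss from step 2
            loss.backward()
            optimizer.step()
    return model
```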
Step 4: input the protein sequence to be identified into the model T (as shown in FIG. 1) to obtain whether the sequence is a transcription factor.
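Step 4 in code, as a hypothetical helper: `encode` (mapping a raw amino-acid string to a zero-padded index tensor of length 1000) is assumed to exist and is not defined by the patent.

```python
import torch
import torch.nn.functional as F

def is_transcription_factor(model, sequence: str) -> bool:
    x = encode(sequence).unsqueeze(0)                  # (1, 1000); `encode` is hypothetical
    model.eval()
    with torch.no_grad():
        prob_tf = F.softmax(model(x), dim=1)[0, 1].item()
    return prob_tf >= 0.5                              # True => predicted transcription factor
```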
Addressing the limitations of existing transcription factor prediction methods noted in the Background, the invention improves on the original deep learning approach by combining the global feature extraction of a Transformer-Encoder with the local feature extraction of a multi-scale convolutional neural network, effectively improving prediction accuracy. Table 1 compares the model built from the Transformer-Encoder and the multi-scale convolutional neural network with other models; the results show that the proposed method achieves the best overall identification performance.
TABLE 1 Comparison of recognition results of different methods

Method        Acc    Prec   Recall   F1
Example 1     0.95   0.92   0.87     0.89
DeepTfactor   0.94   0.93   0.84     0.88
GRU           0.80   0.67   0.45     0.52
Finally, it should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments, or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (9)

1. A transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network is characterized by comprising the following steps:
step 1: constructing a training set: collecting protein sequences from a protein database, and marking each protein sequence as a transcription factor or a non-transcription factor according to corresponding protein annotation information; preprocessing all sequences to obtain a training data set;
step 2: building a network structure: a transcription factor prediction model is built as a network structure combining a Transformer-Encoder and a multi-scale convolutional neural network, where the Transformer-Encoder is used to obtain the global feature X_i^G of the i-th protein sequence X_i, and the multi-scale convolutional neural network performs transcription factor prediction based on X_i^G;
step 3: training the prediction model: training the network built in step 2 with the training set obtained in step 1 to obtain a trained transcription factor prediction model;
step 4: transcription factor prediction: using the prediction model obtained in step 3 to predict whether an unknown protein sequence is a transcription factor, and outputting the prediction result.
2. The method of claim 1, wherein: the step 1 comprises the following substeps:
1.1 selecting protein sequences that contain no non-standard amino acids (i.e., B, O, U and Z) from a protein database to form the data set S1;
1.2 removing sequences longer than 1000 from S1, retaining only sequences of length ≤ 1000; zero-padding protein sequences shorter than 1000 to length 1000; finally obtaining the protein sequence data set S2;
1.3 according to the GO annotation information of each protein in the protein database, assigning each protein sequence in S2 the label transcription factor "1" or non-transcription factor "0", finally obtaining the training data set S = {(X_i, c_i)}, i = 1, …, N, where X_i is the i-th protein sequence in the data set, c_i ∈ {0,1} is the label of X_i, and N is the size of S.
3. The method of claim 1, wherein: in step 1.3, if the GO annotation of the protein contains a "transcription factor" GO term, or contains both a "transcription regulation" and a "DNA binding" GO term, the protein sequence is a transcription factor and is assigned "1"; otherwise, the protein sequence is a non-transcription factor and is assigned "0".
4. The method of claim 1, wherein: the network structure in step 2 consists of a Transformer-Encoder structure and a multi-scale convolutional neural network structure connected in series;
the Transformer-Encoder structure retains only the Encoder part of the Transformer and is stacked from 6 Encoder blocks, each containing 12 attention heads; the Transformer-Encoder extracts global features from the input protein sequence;
the multi-scale convolutional neural network consists of four parallel convolution sub-networks with different one-dimensional convolution kernels, two fully connected layers and an output layer; the convolution layer comprises several one-dimensional convolution operations with kernels of different sizes, producing convolution features at several scales; the pooling layer pools each convolution feature to reduce its dimensionality; after pooling, the features are concatenated and fed into the fully connected layers; the output layer outputs the prediction result computed by the fully connected layers.
5. The method of claim 1, wherein: in step 2, a protein sequence is set as X_i = (x_i1, x_i2, …, x_ij, …, x_i1000), where x_ij denotes the amino acid at position j of protein sequence X_i, and the Transformer-Encoder converts X_i into the global feature X_i^G through the following steps:
2.1 obtaining the embedding vector of X_i through an embedding operation, specifically:
2.1.1 first randomly initializing a vector for each amino acid type, then embedding each amino acid x_ij of X_i into the vector corresponding to its type;
2.1.2 extracting the position information of each amino acid in the protein sequence using position codes, which identify amino acids at different positions of the protein through sine and cosine functions; the position code of the j-th amino acid is given by:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position of the amino acid in the protein sequence, d is the dimension of the embedding vector, and k is a natural number;
2.1.3 adding the embedding of each amino acid x_ij to its position code to obtain the embedding vector of the protein X_i sequence;
2.2 taking the embedding vector of protein sequence X_i as the input of the Transformer-Encoder, using its attention mechanism to mine the attention score between every pair of amino acids, and multiplying the attention scores with the embedding vectors to obtain the global feature X_i^G of the whole protein sequence X_i.
6. The method of claim 4, wherein: in step 2, each convolution sub-network consists, in order, of a convolution layer, a normalization layer, a Dropout layer and a Max-Pooling layer;
the output of the i-th sub-network is computed as:

F_i(x) = MaxPooling(ReLU(Norm(Conv(x))))

where x is the input of the convolution layer;
the concatenated output of the four sub-networks is:

output = Concat(F_i(x)), i = 1, 2, 3, 4.
7. The method of claim 1, wherein: in step 2, the multi-scale convolutional neural network uses one-dimensional convolution kernels of different sizes to extract local features over protein sub-sequences of different lengths.
8. The method of claim 7, wherein: in step 2, the length of each convolution kernel equals the embedding dimension of the protein, the width is set between 4 and 20, and the kernel sizes are 4, 8, 12 and 16 in sequence.
9. The method of claim 1, wherein: in step 2, transcription factor prediction based on X_i^G proceeds as follows:
starting from the global feature X_i^G of protein sequence X_i, convolution operations at different scales yield the local features X_ij^L, where j indexes the local features corresponding to the different convolution kernels; all local features are concatenated and finally fed into the fully connected layers; the loss function is the cross entropy, given by:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where ŷ is the probability with which the model predicts that the sample is a transcription factor, and y is the sample label, equal to 1 if the sample is a positive example and 0 otherwise;
for all protein sequences, the final output probabilities obtained through the SoftMax function are [p_1j, p_2j, …, p_nj], and the predicted class y_i of protein i is computed as:

y_i = argmax_j (p_ij)

If y_i = 1, protein sequence i is a transcription factor; otherwise it is a non-transcription factor.
CN202211459776.0A 2022-11-16 2022-11-16 Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network Pending CN115713970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211459776.0A CN115713970A (en) 2022-11-16 2022-11-16 Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network


Publications (1)

Publication Number Publication Date
CN115713970A true CN115713970A (en) 2023-02-24

Family

ID=85234458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211459776.0A Pending CN115713970A (en) 2022-11-16 2022-11-16 Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network

Country Status (1)

Country Link
CN (1) CN115713970A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304889A (en) * 2023-05-22 2023-06-23 鲁东大学 Receptor classification method based on convolution and transducer



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination