CN115713970A - Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network - Google Patents


Info

Publication number: CN115713970A
Application number: CN202211459776.0A
Authority: CN (China)
Prior art keywords: protein, transcription factor, encoder, protein sequence, convolution
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Liu Juan (刘娟), Yang Zhihui (杨志辉)
Current assignee: Wuhan University (WHU)
Original assignee: Wuhan University (WHU)
Application filed by Wuhan University (WHU)
Priority: CN202211459776.0A
Publication: CN115713970A


Abstract

The invention discloses a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network. The method first extracts global features of a protein sequence with a Transformer-Encoder, then extracts multi-scale local features from those global features with a multi-scale convolutional neural network, and finally fuses the extracted features and outputs the probability that the protein sequence is a transcription factor. By combining a multi-layer, multi-head attention mechanism (the Transformer-Encoder) with a multi-scale convolutional neural network, the method can determine with high accuracy whether an unknown protein sequence is a transcription factor, requires only the protein sequence to make this judgment quickly, and greatly improves protein annotation efficiency.

Description

Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network
Technical Field
The invention relates to the field of protein function annotation, in particular to a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network.
Background
A transcription factor is a protein molecule with a special structure whose function is to regulate gene expression. Transcription factors regulate the expression of target genes by binding specifically to DNA sequences, promoting or inhibiting the transcription of particular DNA into RNA.
Traditional methods that identify and distinguish transcription factors through biochemical experiments are time-consuming, costly and unsuitable for large-scale use. Homology search with BLAST cannot determine whether a protein with no homolog in the database is a transcription factor. Prediction methods based on traditional machine learning can identify transcription factors from protein structure or sequence information, but they require manually designed, transcription-factor-related features, demand strong domain knowledge and achieve low prediction accuracy. Deep learning has the advantage of learning features directly from protein sequences, but most existing methods build their prediction models on convolutional neural networks. Limited by the convolution kernel, such methods can automatically learn feature representations, yet they capture only local features between nearby amino acids and miss global features between distant amino acids, which hurts the model's prediction accuracy.
Disclosure of Invention
To solve the above technical problems, the invention provides a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network, which can extract global and local information from a protein sequence simultaneously and automatically obtain a comprehensive feature representation of transcription factors, thereby further improving prediction accuracy.
The technical scheme provided by the invention is as follows:
a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network comprises the following steps:
step 1: constructing a training set: collecting protein sequences from a protein database, and marking each protein sequence as a transcription factor or a non-transcription factor according to corresponding protein annotation information; preprocessing all sequences to obtain a training data set;
step 2: building a network structure: a transcription factor prediction model is built as a network structure combining a Transformer-Encoder and a multi-scale convolutional neural network, where the Transformer-Encoder is used to obtain the global feature X_i^G of the i-th protein sequence X_i, and the multi-scale convolutional neural network performs transcription factor prediction based on X_i^G;
step 3: training the prediction model: training the network built in step 2 with the training set obtained in step 1 to obtain a trained transcription factor prediction model;
step 4: transcription factor prediction: using the prediction model obtained in step 3 to predict whether an unknown protein sequence is a transcription factor, and outputting the prediction result.
Further, the step 1 comprises the following substeps:
1.1 selecting protein sequences that contain no non-standard amino acids (i.e., B, O, U and Z) from a protein database to form the data set S1;
1.2 removing sequences longer than 1000 from S1, retaining only sequences of length ≤ 1000; zero-padding protein sequences shorter than 1000 to length 1000; finally obtaining the protein sequence data set S2;
1.3 according to the GO annotation information of each protein in the protein database, assigning each protein sequence in S2 the label transcription factor "1" or non-transcription factor "0", finally obtaining the training data set S = {(X_i, c_i)}, i = 1, …, N, where X_i is the i-th protein sequence in the data set, c_i ∈ {0,1} is the label of X_i, and N is the size of S.
Further, in step 1.3, if the GO annotation of a protein includes a "transcription factor" GO term, or includes both a "transcription regulation" and a "DNA binding" GO term, the protein sequence is a transcription factor and is assigned "1"; otherwise, the protein sequence is a non-transcription factor and is assigned "0".
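As a concrete illustration of steps 1.1-1.3, the following is a minimal Python sketch; the record format, the helper names (preprocess, label) and the truncated GO-ID sets are illustrative assumptions, not part of the patent.

```python
# A minimal sketch of steps 1.1-1.3, assuming `records` is a list of
# (sequence, go_terms) pairs already parsed from the protein database.
# The GO-ID sets are small illustrative subsets of the lists given in
# the embodiment (S1.3), not the complete lists.
NONSTANDARD = set("BOUZ")
MAX_LEN = 1000

TF_TERMS  = {"GO:0000976", "GO:0000977", "GO:0000978"}   # "transcription factor"
REG_TERMS = {"GO:0006351", "GO:0006355"}                 # "transcription regulation"
DNA_TERMS = {"GO:0003677", "GO:0043565"}                 # "DNA binding"

def label(go_terms):
    """Rule of step 1.3: a TF term, or a regulation term AND a DNA-binding term."""
    go = set(go_terms)
    if go & TF_TERMS or ((go & REG_TERMS) and (go & DNA_TERMS)):
        return 1
    return 0

def preprocess(records):
    dataset = []
    for seq, go_terms in records:
        if set(seq) & NONSTANDARD:        # step 1.1: drop non-standard amino acids
            continue
        if len(seq) > MAX_LEN:            # step 1.2: keep only length <= 1000
            continue
        padded = seq.ljust(MAX_LEN, "0")  # zero-pad (a "0" placeholder token)
        dataset.append((padded, label(go_terms)))
    return dataset
```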
Further, the network structure in step 2 consists of a Transformer-Encoder structure and a multi-scale convolutional neural network structure connected in series;
the Transformer-Encoder structure retains only the Encoder part of the Transformer and is stacked from 6 Encoder blocks, each containing 12 attention heads; the Transformer-Encoder extracts global features from the input protein sequence;
the multi-scale convolutional neural network consists of four parallel convolution sub-networks with different one-dimensional convolution kernels, two fully connected layers and an output layer; the convolution layer comprises several one-dimensional convolution operations with kernels of different sizes, producing convolution features at several scales; the pooling layer pools each convolution feature to reduce its dimensionality; after pooling, the features are concatenated and fed into the fully connected layers; the output layer outputs the prediction result computed by the fully connected layers.
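A minimal PyTorch sketch of this serial structure follows. d_model=96 is an illustrative choice made so the 12 attention heads divide the embedding size evenly; the 64 branch channels, dropout rate and hidden width are likewise assumptions, and the position encoding of step 2.1.2 (a sketch of which appears after step 2.2 below) is omitted here for brevity.

```python
import torch
import torch.nn as nn

class TFPredictor(nn.Module):
    """Serial Transformer-Encoder + multi-scale CNN, per the structure above."""
    def __init__(self, vocab_size=21, d_model=96, n_blocks=6, n_heads=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_blocks)  # 6 Encoder blocks
        # four parallel 1-D convolution sub-networks, kernel widths 4, 8, 12, 16
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(d_model, 64, kernel_size=k),   # convolution layer
                nn.BatchNorm1d(64),                      # normalization layer
                nn.ReLU(),
                nn.Dropout(0.3),                         # Dropout layer
                nn.AdaptiveMaxPool1d(1),                 # Max-Pooling layer
            )
            for k in (4, 8, 12, 16)
        ])
        self.fc = nn.Sequential(                         # two fully connected layers
            nn.Linear(4 * 64, 64), nn.ReLU(), nn.Linear(64, 2)
        )

    def forward(self, x):                  # x: (batch, 1000) amino-acid indices
        h = self.encoder(self.embed(x))    # global features: (batch, 1000, d_model)
        h = h.transpose(1, 2)              # Conv1d expects (batch, channels, length)
        local = torch.cat([b(h).squeeze(-1) for b in self.branches], dim=1)
        return self.fc(local)              # output layer: logits for {non-TF, TF}
```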
Further, in step 2, let a protein sequence be X_i = (x_i1, x_i2, …, x_ij, …, x_i1000), where x_ij denotes the amino acid at position j of protein sequence X_i; the Transformer-Encoder converts X_i into the global feature X_i^G through the following steps:
2.1 obtaining the embedding vector of X_i through an embedding operation, specifically:
2.1.1 first randomly initializing a vector for each amino acid type, then embedding each amino acid x_ij of X_i into the vector corresponding to its type;
2.1.2 extracting the position information of each amino acid in the protein sequence using position codes, which identify amino acids at different positions of the protein through sine and cosine functions; the position code of the j-th amino acid is given by:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position of the amino acid in the protein sequence, d is the dimension of the embedding vector, and k is a natural number;
2.1.3 adding the embedding of each amino acid x_ij to its position code to obtain the embedding vector of the protein X_i sequence;
2.2 taking the embedding vector of protein sequence X_i as the input of the Transformer-Encoder, using its attention mechanism to mine the attention score between every pair of amino acids, and multiplying the attention scores with the embedding vectors to obtain the global feature X_i^G of the whole protein sequence X_i.
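The following is a minimal sketch of the sinusoidal position code of step 2.1.2 and the addition of step 2.1.3, assuming the standard Transformer form of the formula and an even embedding dimension d.

```python
import torch

def position_encoding(max_len: int, d: int) -> torch.Tensor:
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)   # (max_len, 1)
    two_k = torch.arange(0, d, 2, dtype=torch.float)              # even dimensions: 2k
    angle = pos / torch.pow(10000.0, two_k / d)                   # pos / 10000^(2k/d)
    pe = torch.zeros(max_len, d)
    pe[:, 0::2] = torch.sin(angle)    # PE(pos, 2k)
    pe[:, 1::2] = torch.cos(angle)    # PE(pos, 2k+1)
    return pe

# Step 2.1.3: the Transformer-Encoder input is the amino-acid embedding plus
# its position code, e.g.  enc_in = emb + position_encoding(1000, d)
```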
Furthermore, in step 2, each convolution sub-network consists, in order, of a convolution layer, a normalization layer, a Dropout layer and a Max-Pooling layer;
the output of the i-th sub-network is computed as:

F_i(x) = MaxPooling(ReLU(Norm(Conv(x))))

where x is the input of the convolution layer;
the concatenated output of the four sub-networks is:

output = Concat(F_i(x)), i = 1, 2, 3, 4.
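Continuing the TFPredictor sketch above, a quick shape check illustrates how the four F_i outputs concatenate: each branch maps the (batch, d_model, 1000) feature map to a (batch, 64) vector, and Concat yields the (batch, 256) input to the fully connected layers. The tensor shapes here follow from the assumed hyper-parameters, not from values fixed by the patent.

```python
model = TFPredictor()
x = torch.randint(1, 21, (2, 1000))   # two dummy protein sequences as index tensors
logits = model(x)
print(logits.shape)                   # torch.Size([2, 2]): one logit per class
```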
Further, in step 2, the multi-scale convolutional neural network uses one-dimensional convolution kernels of different sizes to extract local features over protein sub-sequences of different lengths.
Furthermore, in step 2, the length of each convolution kernel equals the embedding dimension of the protein, the width is set between 4 and 20, and the kernel sizes are 4, 8, 12 and 16 in sequence.
Further, in step 2, transcription factor prediction based on X_i^G proceeds as follows:
starting from the global feature X_i^G of protein sequence X_i, convolution operations at different scales yield the local features X_ij^L, where j indexes the local features corresponding to the different convolution kernels; all local features are concatenated and finally fed into the fully connected layers. The loss function is the cross entropy, given by:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where ŷ is the probability with which the model predicts that the sample is a transcription factor, and y is the sample label, equal to 1 if the sample is a positive example and 0 otherwise;
for all protein sequences, the final output probabilities obtained through the SoftMax function are [p_1j, p_2j, …, p_nj], and the predicted class y_i of protein i is computed as:

y_i = argmax_j (p_ij)

If y_i = 1, protein sequence i is a transcription factor; otherwise it is a non-transcription factor.
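In code, the loss and the decision rule above reduce to the following sketch, where `logits` are the two-class outputs of the network and `labels` the 0/1 tags from step 1; for two classes, PyTorch's cross-entropy on the logits equals the formula given.

```python
import torch
import torch.nn.functional as F

def loss_and_prediction(logits: torch.Tensor, labels: torch.Tensor):
    probs = F.softmax(logits, dim=1)         # per-protein class probabilities p_ij
    loss = F.cross_entropy(logits, labels)   # = -[y log ŷ + (1-y) log(1-ŷ)] for 2 classes
    y_pred = probs.argmax(dim=1)             # y_i = argmax_j p_ij; 1 => transcription factor
    return loss, y_pred
```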
Compared with the prior art, the invention has the following beneficial effects:
1. The invention uses a new identification and classification method that avoids the high cost and complex procedures of traditional biochemical experiments; it can quickly judge whether a protein is a transcription factor from the protein sequence alone, greatly improving protein annotation efficiency.
2. In feature extraction, compared with methods proposed by previous researchers, the invention broadens the scope of the extracted sequence features: instead of extracting features only from amino acid segments of length 3-16, it extracts features from the entire protein sequence, thereby improving transcription factor prediction accuracy.
3. The invention introduces a Transformer-Encoder module into the model; this module computes the attention score between each amino acid and every other amino acid, thereby discovering strongly associated amino acid pairs in the protein sequence and better mining the interrelationships among the amino acids in the protein sequence.
Drawings
FIG. 1 is a model flow diagram of the present invention;
FIG. 2 is a protein data pre-processing flow;
FIG. 3 is a diagram of a Transformer-Encoder model architecture used in the present invention;
FIG. 4 is a flow chart of the attention score calculation in the Transformer-Encoder of the present invention;
FIG. 5 is a schematic diagram of the multi-scale convolutional neural network model in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below in detail with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1
Referring to FIGS. 1-5, a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network includes the following steps:
Step 1: all protein sequences were obtained from the UniProtKB 2021_04 Swiss-Prot database, and the data were preprocessed to construct the transcription factor training set S.
The data preprocessing operation described in step 1, as shown in fig. 2, includes the following sub-steps:
s1.1 all Protein sequences were first downloaded in the UniProtKB 2021_04Swiss-Prot dataset, 565,928 strips total, with attributes in the dataset including "Sequence", "Length", "Gene connectivity (GO)", "Entry name", "Protein names". Abnormal protein data containing non-standard amino acids (i.e., B, O, U, Z) were then deleted from the dataset.
S1.2 sequences with Length ≤ 1000 were selected according to the Length attribute and zero-padded to length 1000.
S1.3 GO terms covering the three functions "transcription factor", "transcription regulation" and "DNA binding" were collected from the protein annotation database. Since GO terms are the function categories of protein gene annotation, they are well suited for building the transcription factor label index. The collected "transcription factor" GO terms include "GO:0000976; GO:0000977; GO:0000978; GO:0000979; GO:0000981; GO:0000984; GO:0000985; GO:0000986; GO:0000987; GO:0000992", etc.; the "transcription regulation" GO terms include "GO:0001228; GO:0006351; GO:0006355; GO:0043433", etc.; the "DNA binding" GO terms include "GO:0003677; GO:0008301; GO:0043565; GO:0050692".
S1.4 the GO terms corresponding to each protein sequence obtained in step S1.2 were screened; if they contain one of the "transcription factor" terms from step S1.3, or contain both a "transcription regulation" term and a "DNA binding" term, the sequence was labeled a transcription factor; otherwise it was labeled a non-transcription factor. This produced a data set containing 124,316 cleaned and preprocessed protein sequences together with their transcription factor labels.
Step 2: build the network model M: a Transformer-Encoder with a 2-layer, 8-head attention mechanism (shown in FIGS. 3-4) and a multi-scale convolutional neural network containing one-dimensional convolution kernels of sizes 4, 8, 12 and 16 (shown in FIG. 5) are built and connected in series to form the model M.
The network structure in step 2 is formed by connecting a Transformer-Encoder structure and a multi-scale convolutional neural network structure in series. The Transformer-Encoder structure comes from the Transformer architecture in the field of natural language processing. The original Transformer adopts an Encoder-Decoder architecture; step 2 retains only its Encoder part, stacked from 6 Encoder blocks, each containing 12 attention heads. The Transformer-Encoder is used to extract global features from the input protein sequence.
The multi-scale convolutional neural network consists of four parallel convolution sub-networks with different one-dimensional convolution kernels, two fully connected layers and an output layer; the convolution layer comprises several one-dimensional convolution operations with kernels of different sizes, producing convolution features at several scales; the pooling layer pools each convolution feature to reduce its dimensionality; after pooling, the features are concatenated and fed into the fully connected layers; the output layer outputs the prediction result computed by the fully connected layers.
Let a protein sequence be X_i = (x_i1, x_i2, …, x_ij, …, x_i1000), where x_ij denotes the j-th amino acid in protein sequence X_i. In step 2 the Transformer-Encoder is used to obtain the global feature X_i^G of X_i, through the following steps:
2.1 obtain the embedding vector of X_i through an embedding operation, specifically:
2.1.1 first randomly initialize a vector for each amino acid type, then embed each amino acid x_ij of X_i into the vector corresponding to its type.
2.1.2 extract the position information of each amino acid in the protein sequence using position codes, which identify amino acids at different positions of the protein through sine and cosine functions; the position code of the j-th amino acid is given by:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position of the amino acid in the protein sequence, d is the dimension of the embedding vector, and k is a natural number.
2.1.3 add the embedding of each amino acid x_ij to its position code to obtain the embedding vector of the protein X_i sequence.
2.2 take the embedding vector of protein sequence X_i as the input of the Transformer-Encoder, use its attention mechanism to mine the attention score between every pair of amino acids, and multiply the attention scores with the embedding vectors to obtain the global feature X_i^G of the whole protein sequence X_i.
A multi-scale convolutional neural network with one-dimensional convolution kernels of different sizes is used to extract local features over protein sub-sequences of different lengths. The length of each convolution kernel equals the embedding dimension of the protein, the width is set between 4 and 20, and the kernel sizes are 4, 8, 12 and 16 in sequence.
Each convolution sub-network consists, in order, of a convolution layer, a normalization layer, a Dropout layer and a Max-Pooling layer;
the output of the i-th sub-network is computed as:

F_i(x) = MaxPooling(ReLU(Norm(Conv(x))))

where x is the input of the convolution layer;
the concatenated output of the four sub-networks is:
output = Concat(F_i(x)), i = 1, 2, 3, 4.

Transcription factor prediction based on X_i^G proceeds as follows: starting from the global feature X_i^G of protein sequence X_i, convolution operations at different scales yield the local features X_ij^L, where j indexes the local features corresponding to the different convolution kernels; all local features are concatenated and finally fed into the fully connected layers. The loss function is the cross entropy, given by:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where ŷ is the probability with which the model predicts that the sample is a transcription factor, and y is the sample label, equal to 1 if the sample is a positive example and 0 otherwise.
For all protein sequences, the final output probabilities obtained through the SoftMax function are [p_1j, p_2j, …, p_nj], and the predicted class y_i of protein i is computed as:

y_i = argmax_j (p_ij)

If y_i = 1, protein sequence i is a transcription factor; otherwise it is a non-transcription factor.
Step 3: train the model M with the data set S from step 1 to obtain the trained transcription factor prediction model T. Model training in step 3 uses stochastic gradient descent with the Adam optimizer and an initial learning rate of 0.0001; during training, the batch size is 20.
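A minimal training-loop sketch for step 3 follows, using the hyper-parameters stated above (Adam optimizer, initial learning rate 0.0001, batch size 20); the tensors `X` (encoded sequences) and `y` (labels) and the epoch count are assumed inputs, not values fixed by the patent.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def train(model, X, y, epochs=10):
    loader = DataLoader(TensorDataset(X, y), batch_size=20, shuffle=True)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = F.cross_entropy(model(xb), yb)   # cross-entropy loss from step 2
            loss.backward()
            optimizer.step()
    return model
```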
Step 4: input the protein sequence to be identified into the model T (as shown in FIG. 1) to obtain whether the sequence is a transcription factor.
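Step 4 in code, as a hypothetical helper: `encode` (mapping a raw amino-acid string to a zero-padded index tensor of length 1000) is assumed to exist and is not defined by the patent.

```python
import torch
import torch.nn.functional as F

def is_transcription_factor(model, sequence: str) -> bool:
    x = encode(sequence).unsqueeze(0)                  # (1, 1000); `encode` is hypothetical
    model.eval()
    with torch.no_grad():
        prob_tf = F.softmax(model(x), dim=1)[0, 1].item()
    return prob_tf >= 0.5                              # True => predicted transcription factor
```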
Addressing the limitations of existing transcription factor prediction methods noted in the Background, the invention improves on the original deep learning approach by combining the global feature extraction of a Transformer-Encoder with the local feature extraction of a multi-scale convolutional neural network, effectively improving prediction accuracy. Table 1 compares the model built from the Transformer-Encoder and the multi-scale convolutional neural network with other models; the results show that the proposed method achieves the best overall identification performance.
TABLE 1 Comparison of recognition results of different methods

Method        Acc    Prec   Recall   F1
Example 1     0.95   0.92   0.87     0.89
DeepTfactor   0.94   0.93   0.84     0.88
GRU           0.80   0.67   0.45     0.52
Finally, it should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments, or replace some of their technical features with equivalents. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (9)

1. A transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network is characterized by comprising the following steps:
step 1: constructing a training set: collecting protein sequences from a protein database, and marking each protein sequence as a transcription factor or a non-transcription factor according to corresponding protein annotation information; preprocessing all sequences to obtain a training data set;
step 2: building a network structure: a transcription factor prediction model is built as a network structure combining a Transformer-Encoder and a multi-scale convolutional neural network, where the Transformer-Encoder is used to obtain the global feature X_i^G of the i-th protein sequence X_i, and the multi-scale convolutional neural network performs transcription factor prediction based on X_i^G;
step 3: training the prediction model: training the network built in step 2 with the training set obtained in step 1 to obtain a trained transcription factor prediction model;
step 4: transcription factor prediction: using the prediction model obtained in step 3 to predict whether an unknown protein sequence is a transcription factor, and outputting the prediction result.
2. The method of claim 1, wherein: the step 1 comprises the following substeps:
1.1 selecting protein sequences that contain no non-standard amino acids (i.e., B, O, U and Z) from a protein database to form the data set S1;
1.2 removing sequences longer than 1000 from S1, retaining only sequences of length ≤ 1000; zero-padding protein sequences shorter than 1000 to length 1000; finally obtaining the protein sequence data set S2;
1.3 according to the GO annotation information of each protein in the protein database, assigning each protein sequence in S2 the label transcription factor "1" or non-transcription factor "0", finally obtaining the training data set S = {(X_i, c_i)}, i = 1, …, N, where X_i is the i-th protein sequence in the data set, c_i ∈ {0,1} is the label of X_i, and N is the size of S.
3. The method of claim 1, wherein: in step 1.3, if the GO annotation of the protein contains a "transcription factor" GO term, or contains both a "transcription regulation" and a "DNA binding" GO term, the protein sequence is a transcription factor and is assigned "1"; otherwise, the protein sequence is a non-transcription factor and is assigned "0".
4. The method of claim 1, wherein: the network structure in step 2 consists of a Transformer-Encoder structure and a multi-scale convolutional neural network structure connected in series;
the Transformer-Encoder structure retains only the Encoder part of the Transformer and is stacked from 6 Encoder blocks, each containing 12 attention heads; the Transformer-Encoder extracts global features from the input protein sequence;
the multi-scale convolutional neural network consists of four parallel convolution sub-networks with different one-dimensional convolution kernels, two fully connected layers and an output layer; the convolution layer comprises several one-dimensional convolution operations with kernels of different sizes, producing convolution features at several scales; the pooling layer pools each convolution feature to reduce its dimensionality; after pooling, the features are concatenated and fed into the fully connected layers; the output layer outputs the prediction result computed by the fully connected layers.
5. The method of claim 1, wherein: in step 2, a protein sequence is set as X_i = (x_i1, x_i2, …, x_ij, …, x_i1000), where x_ij denotes the amino acid at position j of protein sequence X_i, and the Transformer-Encoder converts X_i into the global feature X_i^G through the following steps:
2.1 obtaining the embedding vector of X_i through an embedding operation, specifically:
2.1.1 first randomly initializing a vector for each amino acid type, then embedding each amino acid x_ij of X_i into the vector corresponding to its type;
2.1.2 extracting the position information of each amino acid in the protein sequence using position codes, which identify amino acids at different positions of the protein through sine and cosine functions; the position code of the j-th amino acid is given by:

PE(pos, 2k) = sin(pos / 10000^(2k/d))
PE(pos, 2k+1) = cos(pos / 10000^(2k/d))

where pos is the position of the amino acid in the protein sequence, d is the dimension of the embedding vector, and k is a natural number;
2.1.3 adding the embedding of each amino acid x_ij to its position code to obtain the embedding vector of the protein X_i sequence;
2.2 taking the embedding vector of protein sequence X_i as the input of the Transformer-Encoder, using its attention mechanism to mine the attention score between every pair of amino acids, and multiplying the attention scores with the embedding vectors to obtain the global feature X_i^G of the whole protein sequence X_i.
6. The method of claim 4, wherein: in step 2, each convolution sub-network consists, in order, of a convolution layer, a normalization layer, a Dropout layer and a Max-Pooling layer;
the output of the i-th sub-network is computed as:

F_i(x) = MaxPooling(ReLU(Norm(Conv(x))))

where x is the input of the convolution layer;
the concatenated output of the four sub-networks is:

output = Concat(F_i(x)), i = 1, 2, 3, 4.
7. The method of claim 1, wherein: in step 2, the multi-scale convolutional neural network uses one-dimensional convolution kernels of different sizes to extract local features over protein sub-sequences of different lengths.
8. The method of claim 7, wherein: in step 2, the length of each convolution kernel equals the embedding dimension of the protein, the width is set between 4 and 20, and the kernel sizes are 4, 8, 12 and 16 in sequence.
9. The method of claim 1, wherein: in step 2, transcription factor prediction based on X_i^G proceeds as follows:
starting from the global feature X_i^G of protein sequence X_i, convolution operations at different scales yield the local features X_ij^L, where j indexes the local features corresponding to the different convolution kernels; all local features are concatenated and finally fed into the fully connected layers; the loss function is the cross entropy, given by:

L = -[y·log(ŷ) + (1-y)·log(1-ŷ)]

where ŷ is the probability with which the model predicts that the sample is a transcription factor, and y is the sample label, equal to 1 if the sample is a positive example and 0 otherwise;
for all protein sequences, the final output probabilities obtained through the SoftMax function are [p_1j, p_2j, …, p_nj], and the predicted class y_i of protein i is computed as:

y_i = argmax_j (p_ij)

If y_i = 1, protein sequence i is a transcription factor; otherwise it is a non-transcription factor.
CN202211459776.0A 2022-11-16 2022-11-16 Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network Pending CN115713970A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211459776.0A CN115713970A (en) 2022-11-16 2022-11-16 Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network


Publications (1)

Publication Number Publication Date
CN115713970A true CN115713970A (en) 2023-02-24

Family

ID=85234458

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211459776.0A Pending CN115713970A (en) 2022-11-16 2022-11-16 Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network

Country Status (1)

Country Link
CN (1) CN115713970A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304889A (en) * 2023-05-22 2023-06-23 鲁东大学 Receptor classification method based on convolution and transducer



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination